# PRODCOM data as PRObs Observations

This repository converts data from the [PRODCOM](https://ec.europa.eu/eurostat/web/prodcom) database into a structure defined by the [Physical Resources Observatory (PRObs)](https://github.com/probs-lab/probs-ontology) ontology.

See [DEVELOPING.md](DEVELOPING.md) for more information about using this repository.

## Dataset structure

- Repository is a datalad dataset.
- Input data files needing preprocessing are located in `raw_data/`.
- Preprocessed data files ready for conversion are located in `data/`.
- All custom code is located in `scripts/`.
- Converted data is saved to `outputs/`.

## Installation

### Getting the code

To clone the datalad dataset, in a shell/command window (e.g. git-bash) type:

```shell
datalad clone https://github.com/probs-lab/prodcom-data.git
```

### Setting up the virtual environment and installing dependencies

To create a virtual environment using conda/miniconda:

```shell
cd prodcom-data
conda env create
```

## Running the code

After installation:

- Open a terminal / git-bash window
- Navigate to `prodcom-data` folder, e.g. `cd prodcom-data`
- Activate environment using `conda activate prodcom-data`

To download the example RDF output files (`outputs/sold_production` and `outputs/total_production`) from the server use:

```shell
datalad get outputs
```

To download the input csv files used to generate the output files use:

```shell
datalad get bulk_data
```

These files have been generated from bulk csv files downloaded from the [Eurostat](https://ec.europa.eu/eurostat/web/prodcom/database) website, which have been split into files for each country and for each year (see [DEVELOPING.md](DEVELOPING.md)).

The `dodo.py` script can be used to preprocess the files in `raw_data` and convert the files in the `data` and `bulk_data` folders:

To preprocess input data files run the script:

```shell
doit run preprocess
```

To convert the preprocessed data in the `data` folder run:

```shell
doit run convert_data
```

To convert all files in the `bulk_data` folder run:

```shell
doit convert_bulk
```

To run all necessary tasks (i.e. preprocessing and conversion) simply run:

```shell
doit
```

Individual files can be converted by running the `convert_data.py` script with appropriate parameters specifying the file type and the input and output filenames:

```shell
scripts/convert_data.py prodcom data/PRODCOM2016DATA.csv outputs/PRODCOM2016DATA.nt.gz
```

For conversion of the example PRODCOM data files in folder `raw_data` the type `prodcom` should be specified. Types `prodcom_list` and `prodcom_correspondence` are also defined, along with `prodcom_bulk_sold` and `prodcom_bulk_total` (for processing bulk files in folder `bulk_data`).

## Converting new data

For conversion of new data files (possibly in a different format from the examples) see the [DEVELOPING.md](DEVELOPING.md) file.

## Testing the code

To test the code, after installing the software and running the `doit` script:

```shell
cd tests
pytest
```
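As a side note on the single-file `convert_data.py` command shown under "Running the code": if only a handful of files need converting without going through `doit`, that command could be wrapped in a small loop. The following is a minimal sketch, not part of this repository. The helper names `output_for` and `convert_all` are hypothetical, and it assumes that inputs in `data/` end in `.csv`, that outputs should follow the `outputs/NAME.nt.gz` pattern from the example, and that every file uses the `prodcom` type.

```shell
# Sketch only: output_for and convert_all are hypothetical helper names,
# not part of this repository.

# Map an input CSV path to an output path, following the naming used in
# the single-file example (data/NAME.csv -> outputs/NAME.nt.gz).
output_for() {
    printf 'outputs/%s.nt.gz\n' "$(basename "$1" .csv)"
}

# Convert every CSV file in data/ with the `prodcom` type.
# Assumes the repository root is the current directory and the
# conda environment is active.
convert_all() {
    for input in data/*.csv; do
        # Skip the unexpanded pattern when data/ contains no CSV files.
        [ -e "$input" ] || continue
        scripts/convert_data.py prodcom "$input" "$(output_for "$input")"
    done
}
```

Calling `convert_all` from the repository root would then run one conversion per file. For converting the whole `data` folder, `doit run convert_data` remains the documented route.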