PRODCOM data as PRObs Observations
This repository converts data from the PRODCOM database into a structure defined by the Physical Resources Observatory (PRObs) ontology.
See DEVELOPING.md for more information about using this repository.
Dataset structure
- The repository is a datalad dataset.
- Input data files needing preprocessing are located in raw_data/.
- Preprocessed data files ready for conversion are located in data/.
- All custom code is located in scripts/.
- Converted data is saved to outputs/.
Installation
Getting the code
To clone the datalad dataset, in a shell/command window (e.g. git-bash) type:
datalad clone https://github.com/probs-lab/prodcom-data.git
Setting up the virtual environment and installing dependencies:
To create a virtual environment using conda/miniconda:
cd prodcom-data
conda env create
Running the code
After installation:
- Open a terminal / git-bash window
- Navigate to the prodcom-data folder, e.g. cd prodcom-data
- Activate the environment using conda activate prodcom-data
To download the example RDF output files (outputs/sold_production and outputs/total_production) from the server use:
datalad get outputs
To download the input CSV files used to generate the output files use:
datalad get bulk_data
These files have been generated from bulk CSV files downloaded from the Eurostat website, which have been split into files for each country and for each year (see DEVELOPING.md).
The dodo.py script can be used to preprocess the files in raw_data and convert the files in the data and bulk_data folders:
To preprocess the input data files run:
doit run preprocess
To convert the preprocessed data in the data folder run:
doit run convert_data
To convert all files in the bulk_data folder run:
doit convert_bulk
To run all necessary tasks (i.e. preprocessing and conversion) simply run:
doit
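For readers unfamiliar with doit, the sketch below shows the general shape of a dodo.py that chains a preprocessing task and a conversion task. It is an illustration only and not the repository's actual dodo.py: the file names and the scripts/preprocess.py script are hypothetical, and only the task names preprocess and convert_data are taken from the commands above.

```python
# Illustrative dodo.py sketch (NOT the repository's actual file).
# doit discovers functions named task_<name> and exposes them as tasks,
# so task_preprocess() can be run with `doit run preprocess`.

# Hypothetical file names, used only for illustration.
RAW_FILE = "raw_data/example.csv"
PREPROCESSED_FILE = "data/example.csv"
OUTPUT_FILE = "outputs/example.nt.gz"


def task_preprocess():
    """Turn a raw input file into a preprocessed one."""
    return {
        # scripts/preprocess.py is a hypothetical script name.
        "actions": [f"python scripts/preprocess.py {RAW_FILE} {PREPROCESSED_FILE}"],
        "file_dep": [RAW_FILE],          # re-run when the raw file changes
        "targets": [PREPROCESSED_FILE],  # file produced by this task
    }


def task_convert_data():
    """Convert the preprocessed file to a compressed RDF file."""
    return {
        "actions": [
            f"python scripts/convert_data.py prodcom {PREPROCESSED_FILE} {OUTPUT_FILE}"
        ],
        # Depending on the target of task_preprocess lets doit order the tasks.
        "file_dep": [PREPROCESSED_FILE],
        "targets": [OUTPUT_FILE],
    }
```

With a layout like this, a bare doit runs every task it finds, and a task whose file dependencies and targets are unchanged is reported as up to date rather than run again.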
Individual files can be converted by running the convert_data.py script with appropriate parameters specifying the file type and the input and output filenames:
scripts/convert_data.py prodcom data/PRODCOM2016DATA.csv outputs/PRODCOM2016DATA.nt.gz
For conversion of the example PRODCOM data files in folder raw_data the type prodcom should be specified. Types prodcom_list and prodcom_correspondence are also defined, along with prodcom_bulk_sold and prodcom_bulk_total (for processing bulk files in folder bulk_data).
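When several individual files need converting, the command line above can be assembled programmatically. The following is a minimal sketch, assuming the argument order shown above (type, input file, output file) and an output name formed by replacing the input's .csv suffix with .nt.gz. The helper names build_command and convert_all are hypothetical and not part of this repository, and the script is invoked through the current Python interpreter rather than executed directly.

```python
import subprocess
import sys
from pathlib import Path

# File types named in this README.
KNOWN_TYPES = {
    "prodcom",
    "prodcom_list",
    "prodcom_correspondence",
    "prodcom_bulk_sold",
    "prodcom_bulk_total",
}


def build_command(file_type, input_path, output_dir="outputs"):
    """Return the argument list for one convert_data.py call (hypothetical helper)."""
    if file_type not in KNOWN_TYPES:
        raise ValueError(f"Unknown file type: {file_type!r}")
    input_path = Path(input_path)
    # Assumed naming: data/NAME.csv -> outputs/NAME.nt.gz
    output_path = Path(output_dir) / (input_path.stem + ".nt.gz")
    return [
        sys.executable,
        "scripts/convert_data.py",
        file_type,
        input_path.as_posix(),
        output_path.as_posix(),
    ]


def convert_all(file_type, input_dir):
    """Convert every CSV file in input_dir, stopping at the first failure."""
    for csv_file in sorted(Path(input_dir).glob("*.csv")):
        subprocess.run(build_command(file_type, csv_file), check=True)
```

For example, convert_all("prodcom", "data") would call the script once per CSV file in the data folder. It must be run from the repository root so that the relative path scripts/convert_data.py resolves.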
Converting new data
For conversion of new data files (possibly in a different format from the examples) see the DEVELOPING.md file.
Testing the code
To test the code, after installing the software and running the doit script:
cd tests
pytest