PRODCOM data converted with the PRObs ontology

PRODCOM data as PRObs Observations

This repository converts data from the PRODCOM database into a structure defined by the Physical Resources Observatory (PRObs) ontology.

See DEVELOPING.md for more information about using this repository.

Dataset structure

This repository is a DataLad dataset.
Input data files needing preprocessing are located in raw_data/.
Preprocessed data files ready for conversion are located in data/.
All custom code is located in scripts/.
Converted data is saved to outputs/.

Installation

Getting the code

To clone the DataLad dataset, type the following in a shell or command window (e.g. git-bash):

datalad clone https://github.com/probs-lab/prodcom-data.git

Setting up the virtual environment and installing dependencies

To create the virtual environment from environment.yml using conda/miniconda:

cd prodcom-data
conda env create

Running the code

After installation:

Open a terminal or git-bash window.
Navigate to the prodcom-data folder, e.g. cd prodcom-data
Activate the environment using conda activate prodcom-data

To download the example RDF output files (outputs/sold_production and outputs/total_production) from the server, use:

datalad get outputs

To download the input CSV files used to generate the output files, use:

datalad get bulk_data

These files were generated from bulk CSV files downloaded from the Eurostat website, which were split into one file per country and per year (see DEVELOPING.md).
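The per-country, per-year split can be sketched in Python as follows. This is only an illustration and not the script used in this repository (see DEVELOPING.md for that). The column names `country` and `year`, the sample rows, and the output filename pattern are all assumptions made for the example; the real bulk files may use different headers.

```python
import csv
import io
from collections import defaultdict
from pathlib import Path


def split_rows(rows, country_col="country", year_col="year"):
    """Group CSV rows (dicts) by their (country, year) pair.

    The column names are assumptions for this sketch.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row[country_col], row[year_col])].append(row)
    return dict(groups)


def write_groups(groups, fieldnames, out_dir):
    """Write one CSV file per (country, year) group and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for (country, year), rows in sorted(groups.items()):
        # Hypothetical naming pattern, e.g. DE_2016.csv
        path = out_dir / f"{country}_{year}.csv"
        with path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        paths.append(path)
    return paths


# Made-up rows standing in for a bulk file.
SAMPLE = io.StringIO(
    "country,year,code,value\n"
    "DE,2016,X1,5\n"
    "DE,2017,X1,7\n"
    "FR,2016,X1,3\n"
)
reader = csv.DictReader(SAMPLE)
groups = split_rows(reader)
```

Calling `write_groups(groups, reader.fieldnames, "some_output_folder")` would then write one file per group into a folder of your choice.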

The dodo.py script can be used to preprocess the files in raw_data and convert the files in the data and bulk_data folders:

To preprocess the input data files, run:

doit run preprocess

To convert the preprocessed data in the data folder, run:

doit run convert_data

To convert all files in the bulk_data folder, run:

doit convert_bulk

To run all necessary tasks (i.e. preprocessing and conversion), simply run:

doit
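For readers unfamiliar with doit: it looks in dodo.py for functions named `task_<name>` and runs the actions each one returns. The sketch below shows the general shape of such a file. It is not the contents of this repository's dodo.py; the file names, the `preprocess_example.py` script, and the choice to invoke the scripts through `python` are hypothetical.

```python
# Hypothetical dodo.py sketch; not this repository's actual task definitions.

# Tasks that run when `doit` is called with no arguments.
DOIT_CONFIG = {"default_tasks": ["preprocess", "convert_data"]}

RAW_FILE = "raw_data/example.csv"        # hypothetical raw input
DATA_FILE = "data/example.csv"           # hypothetical preprocessed file
OUTPUT_FILE = "outputs/example.nt.gz"    # hypothetical converted output


def task_preprocess():
    """Turn a raw input file into a preprocessed file in data/."""
    return {
        # A list-of-strings action is run as a command without a shell.
        "actions": [
            ["python", "scripts/preprocess_example.py", RAW_FILE, DATA_FILE]
        ],
        "file_dep": [RAW_FILE],
        "targets": [DATA_FILE],
    }


def task_convert_data():
    """Convert the preprocessed file using the prodcom type."""
    return {
        "actions": [
            ["python", "scripts/convert_data.py", "prodcom",
             DATA_FILE, OUTPUT_FILE]
        ],
        "file_dep": [DATA_FILE],
        "targets": [OUTPUT_FILE],
    }
```

Because each task declares `file_dep` and `targets`, doit can skip a task whose inputs have not changed since its last successful run.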

Individual files can be converted by running the convert_data.py script with appropriate parameters specifying the file type and the input and output filenames:

scripts/convert_data.py prodcom data/PRODCOM2016DATA.csv outputs/PRODCOM2016DATA.nt.gz

To convert the example PRODCOM data files in the raw_data folder, specify the type prodcom. The types prodcom_list and prodcom_correspondence are also defined, along with prodcom_bulk_sold and prodcom_bulk_total for processing the bulk files in the bulk_data folder.
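A quick way to check a converted file is to count the triples it contains. The sketch below assumes, as the `.nt.gz` extension suggests, that the output is gzip-compressed N-Triples with one triple per line; the helper `count_triples` and the demo triples are made up for this example and are not part of the repository.

```python
import gzip
import os
import tempfile


def count_triples(path):
    """Count non-empty, non-comment lines in a gzip-compressed N-Triples file."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            # N-Triples comments start with '#'; skip them and blank lines.
            if stripped and not stripped.startswith("#"):
                count += 1
    return count


# Self-contained demo on a temporary file with made-up triples.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.nt.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
        handle.write("# a comment line\n")
        handle.write('<http://example.org/s> <http://example.org/p> "1" .\n')
        handle.write('<http://example.org/s> <http://example.org/p> "2" .\n')
    demo_count = count_triples(demo_path)
```

After running a conversion, the same function could be pointed at the real output, for example `count_triples("outputs/PRODCOM2016DATA.nt.gz")`.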

Converting new data

To convert new data files (possibly in a different format from the examples), see DEVELOPING.md.

Testing the code

To test the code, first install the software and run doit as described above, then run:

cd tests
pytest
