PRODCOM data converted with the PRObs ontology

PRODCOM data as PRObs Observations

This repository converts data from the PRODCOM database into a structure defined by the Physical Resources Observatory (PRObs) ontology.

See DEVELOPING.md for more information about using this repository.

Dataset structure

  • The repository is a DataLad dataset.
  • Input data files needing preprocessing are located in raw_data/.
  • Preprocessed data files ready for conversion are located in data/.
  • All custom code is located in scripts/.
  • Converted data is saved to outputs/.

Installation

Getting the code

To clone the DataLad dataset, run the following in a shell or command window (e.g. git-bash):

datalad clone https://github.com/probs-lab/prodcom-data.git

Setting up the virtual environment and installing dependencies

To create a virtual environment using conda/miniconda:

cd prodcom-data
conda env create

Running the code

After installation:

  • Open a terminal / git-bash window.
  • Navigate to the prodcom-data folder, e.g. cd prodcom-data
  • Activate the environment using conda activate prodcom-data

To download the example RDF output files (outputs/sold_production and outputs/total_production) from the server use:

datalad get outputs
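As a quick sanity check on a downloaded file, the sketch below counts the triples it contains. It assumes the output files are gzip-compressed N-Triples (one triple per line), as the .nt.gz extension used elsewhere in this README suggests; the function name count_triples is illustrative and not part of this repository.

```python
import gzip


def count_triples(path):
    """Count statement lines in a gzip-compressed N-Triples file."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            # N-Triples holds one triple per line; skip blanks and comments.
            if stripped and not stripped.startswith("#"):
                count += 1
    return count
```

Call it with the path to any one of the downloaded files, for example count_triples("outputs/PRODCOM2016DATA.nt.gz") once that file exists locally.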

To download the input CSV files used to generate the output files use:

datalad get bulk_data

These files have been generated from bulk CSV files downloaded from the Eurostat website, which have been split into separate files for each country and each year (see DEVELOPING.md).

The dodo.py script defines doit tasks that preprocess the files in raw_data and convert the files in the data and bulk_data folders.

To preprocess the input data files, run:

doit run preprocess

To convert the preprocessed data in the data folder run:

doit run convert_data

To convert all files in the bulk_data folder run:

doit run convert_bulk

To run all necessary tasks (i.e. preprocessing and conversion) simply run:

doit
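For readers unfamiliar with doit, the following is a minimal sketch of how a conversion task could be declared in a dodo.py file. It is not the repository's actual dodo.py: the task body, the one-sub-task-per-file layout, and the output naming are assumptions made for illustration. Only the folder names, the convert_data.py command line, and the prodcom type come from this README.

```python
# Hypothetical sketch of a doit task; the real dodo.py may be organised differently.
from pathlib import Path

DATA_DIR = Path("data")        # preprocessed CSV inputs (per this README)
OUTPUT_DIR = Path("outputs")   # converted RDF outputs (per this README)


def task_convert_data():
    """Convert each preprocessed CSV in data/ to a gzipped N-Triples file."""
    for csv_path in sorted(DATA_DIR.glob("*.csv")):
        target = OUTPUT_DIR / (csv_path.stem + ".nt.gz")
        yield {
            "name": csv_path.name,          # one sub-task per input file
            "file_dep": [str(csv_path)],    # rerun only when the input changes
            "targets": [str(target)],
            # A list of strings is run by doit as a command without a shell.
            "actions": [
                ["python", "scripts/convert_data.py", "prodcom",
                 str(csv_path), str(target)]
            ],
        }
```

Declaring file_dep and targets is what lets doit skip a conversion whose input has not changed since the last run.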

Individual files can be converted by running the convert_data.py script with appropriate parameters specifying the file type and the input and output filenames:

scripts/convert_data.py prodcom data/PRODCOM2016DATA.csv outputs/PRODCOM2016DATA.nt.gz

When converting the example PRODCOM data files in the raw_data folder, specify the type prodcom. The types prodcom_list and prodcom_correspondence are also defined, along with prodcom_bulk_sold and prodcom_bulk_total for processing the bulk files in the bulk_data folder.
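To convert files from Python rather than the shell, the command line above can be wrapped as in the sketch below. The helper names build_convert_command and convert are hypothetical and not part of this repository; the sketch assumes only the positional arguments shown in this README (type, input file, output file) and the five type names listed above.

```python
import subprocess
import sys

# File types named in this README.
FILE_TYPES = {
    "prodcom",
    "prodcom_list",
    "prodcom_correspondence",
    "prodcom_bulk_sold",
    "prodcom_bulk_total",
}


def build_convert_command(file_type, input_path, output_path):
    """Return the argument list for one convert_data.py invocation."""
    if file_type not in FILE_TYPES:
        raise ValueError(f"unknown file type: {file_type!r}")
    return [sys.executable, "scripts/convert_data.py",
            file_type, str(input_path), str(output_path)]


def convert(file_type, input_path, output_path):
    """Run the conversion; raises CalledProcessError if the script fails."""
    command = build_convert_command(file_type, input_path, output_path)
    subprocess.run(command, check=True)


# Example (run from the repository root, inside the conda environment):
# convert("prodcom", "data/PRODCOM2016DATA.csv", "outputs/PRODCOM2016DATA.nt.gz")
```

Validating the type before launching the script gives an immediate error for a mistyped type name instead of a failure partway through a batch.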

Converting new data

To convert new data files (possibly in a different format from the examples), see DEVELOPING.md.

Testing the code

To run the tests, after installing the software and running doit:

cd tests
pytest