PRODCOM data converted with the PRObs ontology

PRODCOM data as PRObs Observations

This repository converts data from the PRODCOM database into a structure defined by the Physical Resources Observatory (PRObs) ontology.

See DEVELOPING.md for more information about using this repository.

Dataset structure

  • The repository is a DataLad dataset.
  • Input data files needing preprocessing are located in raw_data/.
  • Preprocessed data files ready for conversion are located in data/.
  • All custom code is located in scripts/.
  • Converted data is saved to outputs/.

Installation

Getting the code

To clone the DataLad dataset, type the following in a shell or command window (e.g. git-bash):

datalad clone https://github.com/probs-lab/prodcom-data.git

Setting up the virtual environment and installing dependencies

To create the virtual environment with conda/miniconda (conda env create reads the environment.yml file in the current folder):

cd prodcom-data
conda env create

Running the code

After installation:

  • Open a terminal or git-bash window.
  • Navigate to the prodcom-data folder, e.g. cd prodcom-data
  • Activate the environment using conda activate prodcom-data

To download the example RDF output files (outputs/sold_production and outputs/total_production) from the server, use:

datalad get outputs
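
Once downloaded, an output file can be checked with a few lines of Python. The sketch below assumes the outputs are gzip-compressed N-Triples files (the .nt.gz extension used in the conversion example in this README). The function name count_triples is hypothetical and not part of this repository. It relies only on the fact that N-Triples stores one statement per line.

```python
import gzip


def count_triples(path):
    """Count statements in a gzip-compressed N-Triples file.

    N-Triples holds one triple per line; blank lines and comment
    lines (starting with '#') are skipped.
    """
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                count += 1
    return count
```

Calling count_triples on any .nt.gz file under outputs/ gives a quick check that the download is complete and readable.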

To download the input CSV files used to generate the output files, use:

datalad get bulk_data

These files have been generated from bulk CSV files downloaded from the Eurostat website, which have been split into one file per country and per year (see DEVELOPING.md).

The dodo.py script can be used to preprocess the files in raw_data and to convert the files in the data and bulk_data folders.

To preprocess the input data files, run:

doit run preprocess

To convert the preprocessed data in the data folder, run:

doit run convert_data

To convert all files in the bulk_data folder, run:

doit convert_bulk

To run all necessary tasks (i.e. preprocessing and conversion), simply run:

doit
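
For readers new to doit, the sketch below shows how task names such as preprocess and convert_data can map onto functions in a dodo.py file. It is an illustration only and does not reproduce this repository's dodo.py: the preprocessing script name, the helper conversion_task, the output naming rule, and the choice of default tasks are all assumptions.

```python
"""Hypothetical dodo.py layout; the repository's real file may differ."""
from pathlib import Path

# Tasks run by a bare `doit` call (assumed selection).
DOIT_CONFIG = {"default_tasks": ["preprocess", "convert_data"]}

DATA_DIR = Path("data")
OUTPUT_DIR = Path("outputs")


def task_preprocess():
    """Turn files in raw_data/ into conversion-ready files in data/."""
    return {
        # Assumed script name, for illustration only.
        "actions": ["python scripts/preprocess.py raw_data data"],
    }


def conversion_task(csv_file):
    """Describe the conversion of one CSV file as a doit sub-task."""
    # Assumed naming rule: NAME.csv becomes NAME.nt.gz in outputs/.
    target = OUTPUT_DIR / (csv_file.stem + ".nt.gz")
    return {
        "name": csv_file.name,
        "actions": [
            f"python scripts/convert_data.py prodcom {csv_file} {target}"
        ],
        "file_dep": [str(csv_file)],  # rerun only when the input changes
        "targets": [str(target)],
    }


def task_convert_data():
    """Yield one conversion sub-task per CSV file found in data/."""
    for csv_file in sorted(DATA_DIR.glob("*.csv")):
        yield conversion_task(csv_file)
```

Declaring file_dep and targets lets doit skip conversions whose inputs have not changed since the last run.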

Individual files can be converted by running the convert_data.py script with appropriate parameters specifying the file type and the input and output filenames:

scripts/convert_data.py prodcom data/PRODCOM2016DATA.csv outputs/PRODCOM2016DATA.nt.gz

To convert the example PRODCOM data files in the raw_data folder, specify the type prodcom. The types prodcom_list and prodcom_correspondence are also defined, along with prodcom_bulk_sold and prodcom_bulk_total (for processing the bulk files in the bulk_data folder).
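
When several files need converting outside of doit, the command line above can be assembled programmatically. The following is a minimal sketch, assuming the positional argument order shown in the example command (type, input file, output file). The helpers build_convert_command and convert are hypothetical, and the rule that NAME.csv maps to NAME.nt.gz is an assumption taken from that single example.

```python
import subprocess
import sys
from pathlib import Path

# File types named in this README.
KNOWN_TYPES = {
    "prodcom",
    "prodcom_list",
    "prodcom_correspondence",
    "prodcom_bulk_sold",
    "prodcom_bulk_total",
}


def build_convert_command(file_type, input_path, output_dir="outputs"):
    """Return the argument list for one convert_data.py call."""
    if file_type not in KNOWN_TYPES:
        raise ValueError(f"unknown file type: {file_type!r}")
    input_path = Path(input_path)
    # Assumed naming rule: NAME.csv becomes NAME.nt.gz in the output folder.
    output_path = Path(output_dir) / (input_path.stem + ".nt.gz")
    return [
        sys.executable,
        "scripts/convert_data.py",
        file_type,
        str(input_path),
        str(output_path),
    ]


def convert(file_type, input_path, output_dir="outputs"):
    """Run one conversion; raises CalledProcessError if the script fails."""
    command = build_convert_command(file_type, input_path, output_dir)
    subprocess.run(command, check=True)
```

For example, convert("prodcom", "data/PRODCOM2016DATA.csv") would run the same conversion as the command shown above, provided it is called from the repository root with the conda environment active.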

Converting new data

For conversion of new data files (possibly in a different format from the examples) see the DEVELOPING.md file.

Testing the code

To test the code, first install the software and run doit as described above, then run:

cd tests
pytest