This repository converts data from the PRODCOM database into a structure defined by the Physical Resources Observatory (PRObs) ontology.
See DEVELOPING.md for more information about using this repository.
raw_data/
data/
scripts/
outputs/
To clone the datalad dataset, in a shell/command window (e.g. git-bash) type:
datalad clone https://github.com/probs-lab/prodcom-data.git
To create a virtual environment using conda/miniconda:
cd prodcom-data
conda env create
After installation, activate the environment. You need to be in the prodcom-data folder (e.g. cd prodcom-data):
conda activate prodcom-data
To download the example output data files from the server use:
datalad get outputs
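Once the outputs have been downloaded, it can be useful to check what a converted file looks like. The sketch below assumes the outputs are gzip-compressed N-Triples files (as the .nt.gz extension used elsewhere in this README suggests) and prints the first few lines of one. The helper name head_ntriples and the example path are illustrative assumptions, not part of the repository.

```python
import gzip
from itertools import islice
from pathlib import Path


def head_ntriples(path, n=5):
    """Return the first n non-empty lines of a gzip-compressed N-Triples file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        # N-Triples is line based, so each non-empty line is one statement.
        lines = (line.rstrip("\n") for line in handle if line.strip())
        return list(islice(lines, n))


if __name__ == "__main__":
    # Assumed path: borrowed from the conversion example in this README;
    # adjust it to a file that actually exists in your outputs folder.
    example = Path("outputs/PRODCOM2016DATA.nt.gz")
    if example.exists():
        for triple in head_ntriples(example):
            print(triple)
    else:
        print(f"{example} not found; run 'datalad get outputs' first")
```

Reading the file lazily keeps memory use low even when an output file is large.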
To preprocess input data files run the script:
doit run preprocess
To convert the preprocessed data in the data folder, run:
doit run convert_data
To run all necessary tasks (i.e. preprocessing and conversion) simply run:
doit
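For readers unfamiliar with doit, the commands above map onto task functions in a dodo.py file. The following is a hypothetical sketch of how two such tasks could be declared. It is not the repository's actual dodo.py: the preprocess script name, the file paths, and the single-file scope are all assumptions made for illustration.

```python
# dodo.py (hypothetical sketch, not the repository's actual task file)
from pathlib import Path

# Assumed locations, based on the folder names used in this README.
DATA_DIR = Path("data")
OUTPUT_DIR = Path("outputs")


def task_preprocess():
    """Defines the task run by `doit run preprocess`."""
    return {
        # Hypothetical script name; string actions are run as shell commands.
        "actions": ["python scripts/preprocess.py"],
        "targets": [str(DATA_DIR / "PRODCOM2016DATA.csv")],
    }


def task_convert_data():
    """Defines the task run by `doit run convert_data`."""
    source = DATA_DIR / "PRODCOM2016DATA.csv"
    target = OUTPUT_DIR / "PRODCOM2016DATA.nt.gz"
    return {
        "actions": [f"python scripts/convert_data.py prodcom {source} {target}"],
        # Listing the preprocessed file as a dependency lets doit skip the
        # conversion when the input has not changed since the last run.
        "file_dep": [str(source)],
        "targets": [str(target)],
    }
```

In a layout like this, running doit with no arguments executes every task defined in the file, which matches the behaviour described above.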
Individual files can be converted by running the convert_data.py script with parameters specifying the file type and the input and output filenames:
scripts/convert_data.py prodcom data/PRODCOM2016DATA.csv outputs/PRODCOM2016DATA.nt.gz
For conversion of the example PRODCOM data files, the type prodcom should be specified. The types prodcom_list and prodcom_correspondence are also defined.
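When several files need converting one by one, the command line shown above can be assembled programmatically. The sketch below assumes the script accepts exactly the three positional arguments from the example (type, input file, output file) and that outputs are named after the input with an .nt.gz suffix. The helpers build_command and convert_all, and the *.csv pattern, are hypothetical and not part of the repository.

```python
import subprocess
import sys
from pathlib import Path

# The three file types named in this README.
KNOWN_TYPES = {"prodcom", "prodcom_list", "prodcom_correspondence"}


def build_command(file_type, input_path, output_dir="outputs"):
    """Build the argument list for one convert_data.py invocation."""
    if file_type not in KNOWN_TYPES:
        raise ValueError(f"unknown file type: {file_type!r}")
    input_path = Path(input_path)
    # Assumed naming rule: data/NAME.csv -> outputs/NAME.nt.gz
    output_path = Path(output_dir) / (input_path.stem + ".nt.gz")
    return [
        sys.executable,
        "scripts/convert_data.py",
        file_type,
        str(input_path),
        str(output_path),
    ]


def convert_all(file_type, input_dir="data", pattern="*.csv", dry_run=True):
    """Convert every matching file; with dry_run=True only print the commands."""
    commands = [
        build_command(file_type, path)
        for path in sorted(Path(input_dir).glob(pattern))
    ]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if the script fails.
            subprocess.run(command, check=True)
    return commands
```

Calling convert_all("prodcom") from the repository root prints the commands without running them; pass dry_run=False to execute them.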
For conversion of new data files (possibly in a different format from the examples) see the DEVELOPING.md file.
To test the code, after installing the software and running the doit script, run:
cd tests
pytest