
# Downloading Prodcom Data

Prodcom data can be downloaded from the Eurostat website.

To download bulk data, select the "Prodcom data" option under Downloads. File ds-056120 contains sold production, exports and imports data. File ds-056121 contains total production data. Both can be downloaded in CSV format by selecting the corresponding SDMX-CSV option under "Download".
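
Before converting a downloaded file, it can help to check its header and size. The following is a minimal sketch using only the Python standard library; the function name `summarise_csv` is hypothetical and not part of this repository. It streams the file row by row, since the bulk files are large.

```python
import csv
from pathlib import Path


def summarise_csv(path, sample_size=3):
    """Return (header, sample_rows, row_count) for a CSV file.

    The file is read as a stream, so it is never held in memory as a whole.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no rows
        sample = []
        row_count = 0
        for row in reader:
            if row_count < sample_size:
                sample.append(row)
            row_count += 1
    return header, sample, row_count
```

For example, `summarise_csv("data/ds-056120.csv")` would report the column names and number of data rows, assuming the file was saved under that (illustrative) path.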

Custom datasets for Sold Production and Total Production can be configured for download by specifying categories and associated values to include in table rows and columns. The example files,


# Converting new data

To add a new file `{filename}.csv` for conversion, copy the file to the `data` folder, create a file `{filename}_defs.dlog` containing appropriate metadata (e.g. to specify the prefixes to use in the converted RDF data), and modify `dodo.py` as appropriate.
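
A quick way to spot a forgotten definitions file is sketched below. It assumes the `{filename}_defs.dlog` file sits next to the CSV file in the `data` folder, which this document does not state explicitly; adjust the path if the repository keeps them elsewhere. The helper name `missing_defs` is hypothetical.

```python
from pathlib import Path


def missing_defs(data_dir="data"):
    """List CSV files in data_dir that lack a matching {filename}_defs.dlog.

    Assumption: the definitions file lives in the same folder as the CSV.
    """
    data_path = Path(data_dir)
    return sorted(
        csv_file.name
        for csv_file in data_path.glob("*.csv")
        if not (data_path / f"{csv_file.stem}_defs.dlog").exists()
    )
```

An empty list from `missing_defs()` means every CSV file in `data` has a companion definitions file.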
 
If the data format differs from that of the examples, create a new file `load_data_{type}.rdfox` to specify the mapping between the columns of the input CSV file and the columns of the RDFox tuple table used to store input data. Also create a file `map_{type}.dlog` to specify the conversion rules. Copy both files to the `scripts` folder.


To convert the data run `doit run convert_data`.

The results will be in the `outputs/` folder.

To check that the expected values are present, run `pytest`.

# Configuring Datalad to use GIN as a Datasource

Because the CSV input files and RDF output files are large, the prodcom-data git repository uses the GIN (G-Node Infrastructure) data management system to host data. The repository is configured for GIN using the following commands:

    datalad siblings add \
        -d . \
        --name gin \
        --pushurl git@gin.g-node.org:/probs-lab/prodcom-data.git \
        --url https://gin.g-node.org/probs-lab/prodcom-data

 
To prevent git-annex from ignoring the new dataset, use:
 

    git config --unset-all remote.gin.annex-ignore

 
Push the repository (state and contents) to GIN using:
 

    datalad push --to gin

Configure the GIN sibling as a "common data source" using:
 

    datalad siblings configure \
        --name gin \
        --as-common-datasrc gin-src

Push the repository to GIN again and then to GitHub:

    datalad push --to gin
    datalad push --to origin
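
The whole sequence in this section could also be scripted. The sketch below is not part of this repository: the function names are hypothetical, and it simply collects the commands listed above in order. It defaults to a dry run that only prints each command, so nothing is executed unless `dry_run=False` is passed.

```python
import shlex
import subprocess

# URLs copied from the commands in this section.
GIN_SSH = "git@gin.g-node.org:/probs-lab/prodcom-data.git"
GIN_HTTPS = "https://gin.g-node.org/probs-lab/prodcom-data"


def gin_setup_commands(name="gin", source_name="gin-src"):
    """Return the setup commands from this section, in order, as argument lists."""
    return [
        ["datalad", "siblings", "add", "-d", ".", "--name", name,
         "--pushurl", GIN_SSH, "--url", GIN_HTTPS],
        ["git", "config", "--unset-all", f"remote.{name}.annex-ignore"],
        ["datalad", "push", "--to", name],
        ["datalad", "siblings", "configure", "--name", name,
         "--as-common-datasrc", source_name],
        ["datalad", "push", "--to", name],
        ["datalad", "push", "--to", "origin"],
    ]


def run_gin_setup(dry_run=True):
    """Print each command; also execute it when dry_run is False."""
    for command in gin_setup_commands():
        print(shlex.join(command))
        if not dry_run:
            # check=True raises CalledProcessError at the first failing step.
            subprocess.run(command, check=True)
```

Calling `run_gin_setup()` prints the six commands for review. With `dry_run=False` the run stops at the first command that exits with a non-zero status, so a partially configured sibling may need to be fixed by hand before retrying.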