Update DEVELOPING.md

stephenjboyle 6 months ago
parent
commit
6a99e6ad59
1 changed files with 51 additions and 2 deletions
DEVELOPING.md

@@ -1,11 +1,60 @@
+# Downloading Prodcom Data
+
+Prodcom data can be downloaded from the [Eurostat](https://ec.europa.eu/eurostat/web/prodcom/database) website.
+
+For bulk data, select the "Prodcom data" option from [Downloads](https://ec.europa.eu/eurostat/databrowser/bulk).
+File ds-056120 contains sold production, exports and imports data. File ds-056121 contains total production data. Both can be downloaded in CSV format by selecting the corresponding sdmx-csv option under "Download".
+
+Custom datasets for [Sold Production](https://ec.europa.eu/eurostat/databrowser/product/view/ds-056120) and [Total Production](https://ec.europa.eu/eurostat/databrowser/product/view/ds-056121) can be configured for download by specifying the categories and associated values to include in table rows and columns. The example files,
+```raw_data/PRODCOM2016DATA.csv```, ```raw_data/PRODCOM2017DATA.csv``` and ```raw_data/PRODCOM2018DATA.csv```, were created using all values of PRCCODE for rows and INDICATORS with values EXPQNT, EXPVAL, IMPQNT, IMPVAL, PRODQNT, PRODVAL and QNTUNIT for columns. TIME_PERIOD (set to 2016, 2017 or 2018) and DECL (set to United Kingdom - 006) are used as page specifications. In the Download menu, the "Spreadsheet" option under "Data on this page only" downloads the table data in .xlsx format.
+
 # Converting new data
 
-To add a new file {filename}.csv for conversion, copy the file to the `data` folder, create a file {filename}\_defs.dlog containing appropriate meta-data (specifying the prefixes to use in the converted RDF data) and modify `dodo.py` as appropriate.
+To add a new file {filename}.csv for conversion, copy the file to the `data` folder, create a file {filename}\_defs.dlog containing appropriate meta-data (e.g. to specify the prefixes to use in the converted RDF data) and modify `dodo.py` as appropriate.
  
 If the data format is different to that of the examples, a new file load\_data\_{type}.rdfox will need to be created to specify the mapping between the columns of the input csv file and the columns of the RDFOX tuple table used to store input data. A file map_{type}.dlog will also need to be created to specify conversion rules. These files should be copied to the ```scripts``` folder.
 
+
 To convert the data run `doit run convert_data`.
 
 The results will be in the `outputs/` folder.
 
-To test the expected values are present, run `pytest`.
+To test the expected values are present, run `pytest`.
+
+# Configuring DataLad to use GIN as a Datasource
+
+Due to the large size of the CSV input files and RDF output files, the prodcom-data git repository uses the GIN (G-Node Infrastructure) data management system to host data. The repository is configured for GIN using the following commands:
+
+```
+datalad siblings add \
+ -d . \
+ --name gin \
+ --pushurl git@gin.g-node.org:/probs-lab/prodcom-data.git \
+ --url https://gin.g-node.org/probs-lab/prodcom-data
+```
+ 
+To stop git-annex from ignoring the new `gin` remote, use:
+ 
+ ```
+ git config --unset-all remote.gin.annex-ignore 
+ ```
+ 
+Push the repository (state and contents) to GIN using:
+ 
+ ```
+ datalad push --to gin
+ ```
+Configure the GIN sibling as a "common data source" using:
+ 
+ ```
+ datalad siblings configure \
+   --name gin \
+   --as-common-datasrc gin-src
+   ```
+Push the repository to GIN again and then to GitHub:
+
+```
+datalad push --to gin
+datalad push --to origin
+ ```
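The GIN configuration steps above could also be collected into a small script so they are run in the same order each time. The following is only a sketch: the helper names `gin_setup_commands` and `run_all` are hypothetical, and it uses nothing beyond the `datalad` and `git` commands already listed in this section. By default it only prints the commands.

```python
import subprocess


def gin_setup_commands(repo="probs-lab/prodcom-data", name="gin"):
    """Return the commands from this section, in order, as argv lists.

    `repo` and `name` default to the values used in the section above.
    """
    return [
        ["datalad", "siblings", "add", "-d", ".", "--name", name,
         "--pushurl", f"git@gin.g-node.org:/{repo}.git",
         "--url", f"https://gin.g-node.org/{repo}"],
        ["git", "config", "--unset-all", f"remote.{name}.annex-ignore"],
        ["datalad", "push", "--to", name],
        ["datalad", "siblings", "configure", "--name", name,
         "--as-common-datasrc", f"{name}-src"],
        ["datalad", "push", "--to", name],
        ["datalad", "push", "--to", "origin"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if dry_run:
            continue
        # `git config --unset-all` exits non-zero when the key is not set,
        # so that one step is allowed to fail without stopping the script.
        is_unset = argv[:3] == ["git", "config", "--unset-all"]
        subprocess.run(argv, check=not is_unset)
```

Calling `run_all(gin_setup_commands())` prints the six commands for review; passing `dry_run=False` executes them from the current directory, which is assumed to be the root of the dataset.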
+
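For the bulk files described under "Downloading Prodcom Data", a similar selection to the custom datasets (one year, one declarant) can be approximated by filtering the downloaded CSV. This is a minimal sketch using only the standard library. The column names `DECL`, `PRCCODE`, `INDICATORS`, `TIME_PERIOD` and `OBS_VALUE` are assumptions about the sdmx-csv header and should be checked against the actual file (the case may differ), and the sample rows are made up for illustration.

```python
import csv
import io


def filter_rows(lines, period, decl,
                period_col="TIME_PERIOD", decl_col="DECL"):
    """Yield rows (as dicts) matching the given period and declarant code.

    `lines` is any iterable of CSV lines, e.g. an open file.
    The column names are assumptions; adjust them to the real header.
    """
    for row in csv.DictReader(lines):
        if row[period_col] == period and row[decl_col] == decl:
            yield row


# Made-up sample standing in for a bulk download.
SAMPLE = """DECL,PRCCODE,INDICATORS,TIME_PERIOD,OBS_VALUE
006,10000000,PRODVAL,2016,1
006,10000000,PRODVAL,2017,2
001,10000000,PRODVAL,2016,3
"""

matches = list(filter_rows(io.StringIO(SAMPLE), "2016", "006"))
for row in matches:
    print(row["PRCCODE"], row["INDICATORS"], row["OBS_VALUE"])
```

With a real file, `filter_rows(open(path, newline=""), "2016", "006")` streams the rows without loading the whole download into memory, which matters given the size of the bulk files.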