# PRODCOM data as PRObs Observations

This repository converts data from the [PRODCOM](https://ec.europa.eu/eurostat/web/prodcom) database into a structure defined by the [Physical Resources Observatory (PRObs)](https://github.com/probs-lab/probs-ontology) ontology.

See [DEVELOPING.md](DEVELOPING.md) for more information about using this repository.

## Dataset structure

- The repository is a datalad dataset.
- Input data files needing preprocessing are located in `raw_data/`.
- Preprocessed data files ready for conversion are located in `data/`.
- All custom code is located in `scripts/`.
- Converted data is saved to `outputs/`.

## Installation

### Getting the code

To clone the datalad dataset, in a shell/command window (e.g. git-bash) type:

```shell
datalad clone https://github.com/probs-lab/prodcom-data.git
```

### Setting up the virtual environment and installing dependencies

To create a virtual environment using conda/miniconda:

```shell
cd prodcom-data
conda env create
```

## Running the code

After installation:

- Open a terminal / git-bash window.
- Navigate to the `prodcom-data` folder, e.g. `cd prodcom-data`.
- Activate the environment using `conda activate prodcom-data`.

To download the example RDF output files (`outputs/sold_production` and `outputs/total_production`) from the server, use:

```shell
datalad get outputs
```

To download the input CSV files used to generate the output files, use:

```shell
datalad get bulk_data
```

These files were generated from bulk CSV files downloaded from the [Eurostat](https://ec.europa.eu/eurostat/web/prodcom/database) website, which have been split into files for each country and each year (see [DEVELOPING.md](DEVELOPING.md)).

The `dodo.py` file defines `doit` tasks that preprocess the files in `raw_data` and convert the files in the `data` and `bulk_data` folders.

To preprocess the input data files, run:

```shell
doit run preprocess
```

To convert the preprocessed data in the `data` folder run:

```shell
doit run convert_data
```

To convert all files in the `bulk_data` folder run:

```shell
doit run convert_bulk
```

To run all necessary tasks (i.e. preprocessing and conversion) simply run:

```shell
doit
```
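
For readers unfamiliar with `doit`: each function named `task_<name>` in a `dodo.py` file becomes a task called `<name>`, which is why `doit run preprocess` and `doit run convert_data` work as commands. The following is only an illustrative sketch of that mechanism, not the contents of this repository's `dodo.py`; the placeholder `echo` action and the single hard-coded file pair are assumptions made for the example.

```python
# Illustrative sketch only: NOT the repository's actual dodo.py.
# doit treats every function named task_<name> as a task called <name>
# and expects it to return a dict of task metadata.


def task_preprocess():
    """Sketch of a 'preprocess' task."""
    return {
        # Placeholder action (assumption): the real preprocessing command
        # is not described in this README.
        "actions": ["echo 'preprocess raw_data/ into data/'"],
    }


def task_convert_data():
    """Sketch of a 'convert_data' task for a single preprocessed file."""
    source = "data/PRODCOM2016DATA.csv"
    target = "outputs/PRODCOM2016DATA.nt.gz"
    return {
        # A string action is run by doit as a shell command.
        "actions": [f"scripts/convert_data.py prodcom {source} {target}"],
        # doit re-runs the task only when a file dependency changes
        # or a target is missing.
        "file_dep": [source],
        "targets": [target],
        # Run the (hypothetical) preprocess task first.
        "task_dep": ["preprocess"],
    }
```

Declaring `file_dep` and `targets` is what lets `doit` skip work that is already up to date, so a plain `doit` run only redoes conversions whose inputs have changed.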

Individual files can be converted by running the `convert_data.py` script with appropriate parameters specifying the file type and the input and output filenames:

```shell
scripts/convert_data.py prodcom data/PRODCOM2016DATA.csv outputs/PRODCOM2016DATA.nt.gz
```
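
To sanity-check a converted file, it can help to look at its first few lines. The sketch below assumes that the `.nt.gz` output is gzip-compressed, line-oriented text (as the N-Triples extension suggests); the helper name `head_ntriples` is made up for this example and is not part of the repository.

```python
import gzip
from itertools import islice


def head_ntriples(path, n=5):
    """Return the first n non-empty lines of a gzip-compressed text file."""
    # "rt" opens the gzip stream in text mode so lines are decoded to str.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        lines = (line.rstrip("\n") for line in handle if line.strip())
        # islice stops after n lines, so large files are not read in full.
        return list(islice(lines, n))
```

For example, `head_ntriples("outputs/PRODCOM2016DATA.nt.gz")` would return up to five lines from the file produced by the command above.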

To convert the example PRODCOM data files in the `raw_data` folder, specify the type `prodcom`. The types `prodcom_list` and `prodcom_correspondence` are also defined, along with `prodcom_bulk_sold` and `prodcom_bulk_total` (for processing the bulk files in the `bulk_data` folder).
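
When converting several files by hand, the command line can be assembled programmatically. The following is a minimal sketch, assuming the argument order shown in the example above (type, input file, output file) and assuming outputs are named after the input file with an `.nt.gz` suffix; the helper `conversion_command` and the `FILE_TYPES` set are hypothetical names introduced here.

```python
from pathlib import Path

# File types named in this README.
FILE_TYPES = {
    "prodcom",
    "prodcom_list",
    "prodcom_correspondence",
    "prodcom_bulk_sold",
    "prodcom_bulk_total",
}


def conversion_command(file_type, input_path, output_dir="outputs"):
    """Build the argument list for one scripts/convert_data.py call."""
    if file_type not in FILE_TYPES:
        raise ValueError(f"unknown file type: {file_type!r}")
    source = Path(input_path)
    # Assumption: the output keeps the input's base name, e.g.
    # data/PRODCOM2016DATA.csv -> outputs/PRODCOM2016DATA.nt.gz
    target = Path(output_dir) / (source.stem + ".nt.gz")
    return [
        "scripts/convert_data.py",
        file_type,
        source.as_posix(),
        target.as_posix(),
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root with the conda environment active.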


## Converting new data

To convert new data files (possibly in a different format from the examples), see the [DEVELOPING.md](DEVELOPING.md) file.

## Testing the code

To test the code after installing the software and running `doit`:

```shell
cd tests
pytest
```