# Compression

If we look at the size of the "radar_trap.nix" file in its last version, it is greater than **80 MB** (depending a bit on the number of images stored)!

The reason is that each image has a shape of 1024 * 768 * 4 and is stored as float32 values (4 bytes each), which sums to about 12.5 MB per picture.
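The arithmetic behind that number can be checked in a few lines. This is only a sketch of the calculation; the variable names are illustrative, and reading the third dimension as color channels is an assumption.

```python
# Size of one image: width x height x channels x bytes per float32 value.
# "channels" is an assumed interpretation of the third dimension.
width, height, channels = 1024, 768, 4
bytes_per_value = 4  # float32

image_bytes = width * height * channels * bytes_per_value
print(image_bytes)          # 12582912
print(image_bytes / 1e6)    # 12.582912 (MB)
print(image_bytes / 2**20)  # 12.0 (MiB)
```

A handful of such images is enough to push the file size past 80 MB.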

An easy way to work around this is to enable dataset compression in the **HDF5** backend. Simply pass the ``DeflateNormal`` flag when creating the file.

``nixfile = nixio.File.open("radar_trap.nix", nixio.FileMode.Overwrite, compression=nixio.Compression.DeflateNormal)``

If a file is created with this flag, all DataArrays will be losslessly compressed using the embedded gzip algorithm.
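What "lossless" means here can be illustrated without NIX at all. The sketch below uses Python's built-in `zlib` module, which implements the DEFLATE algorithm that gzip is based on. It is not how nixio or HDF5 compress data internally, and the test data are made up for the illustration.

```python
import array
import math
import zlib

# Illustrative float32 data: a stepwise gradient with long runs of equal values.
values = array.array("f", (math.floor(i / 64) / 1024 for i in range(256 * 1024)))
raw = values.tobytes()

compressed = zlib.compress(raw, 6)  # level 6 is zlib's default trade-off
restored = zlib.decompress(compressed)

print(len(raw), len(compressed))  # raw size vs. compressed size in bytes
print(restored == raw)            # True: every byte is recovered exactly
```

How much is gained depends entirely on the data; smooth or repetitive data compress well, noisy data much less so.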





## Compression comes at a price

The file size should be tremendously reduced. On the other hand, compression does not come for free: **it reduces read and write performance**.

For some data it might be preferable not to compress and to have a higher read/write performance instead. In such cases you can switch compression **on** (``nixio.Compression.DeflateNormal``) or **off** (``nixio.Compression.No``) when creating a **DataArray**.

``block.create_data_array("name", "type", data=data, compression=nixio.Compression.No)``

**Note:** Once a DataArray has been created, the compression cannot be changed.
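The price of compression can be made visible with a small timing experiment. The sketch below again uses the standard `zlib` module as a stand-in for the backend's deflate filter; `timed_compress` is a hypothetical helper, the payload is artificial, and the measured times will differ from machine to machine.

```python
import time
import zlib


def timed_compress(payload, level):
    """Return (seconds, compressed_size) for one zlib.compress call."""
    t0 = time.perf_counter()
    compressed = zlib.compress(payload, level)
    return time.perf_counter() - t0, len(compressed)


payload = bytes(range(256)) * 40000  # about 10 MB of highly repetitive bytes

for level in (0, 6):  # 0 = store without compressing, 6 = zlib default
    seconds, size = timed_compress(payload, level)
    print(f"level {level}: {size} bytes in {seconds:.4f} s")
```

Comparing the two lines of output shows the trade-off for this particular payload: the space saved on one side, the time spent on the other.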

# Saving space by not storing doubles

One way of saving space is to enable compression; another way is to store integers instead of double values.

For example, when we digitize measurements we use a Data Acquisition (DAQ) system that converts measured voltages into digital representations. These systems work with a limited resolution (e.g. 16 bit) on a limited range of input voltages (e.g. ±10 V).

![data acquisition](resources/data_acquisition.png)

Thus, the measured voltage is converted into 16 bit integer values (-32768 to +32767). The conversion is a simple linear equation:

$y = x \cdot 2^{16} / 20$

Strictly speaking it makes little sense (except for convenience) to store the voltages as doubles with the full resolution of the doubles if the recording system has only 16 bit resolution!

*int16* values take 2 bytes while the full *double* takes 8 bytes of memory.
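The conversion and the storage cost can be sketched in plain Python. The helper names `digitize` and `to_voltage` are made up for this illustration, and the clipping step is an assumption about how a converter handles the upper end of the range.

```python
import array

ADC_BITS = 16
INPUT_RANGE_V = 20.0  # -10 V ... +10 V


def digitize(voltage):
    """Convert a voltage to a 16 bit integer code: y = x * 2^16 / 20."""
    code = round(voltage * 2**ADC_BITS / INPUT_RANGE_V)
    # Clip to the representable int16 range (+10 V would map to 32768).
    return max(-32768, min(32767, code))


def to_voltage(code):
    """Invert the linear conversion: x = y * 20 / 2^16."""
    return code * INPUT_RANGE_V / 2**ADC_BITS


print(digitize(5.0))      # 16384
print(digitize(-10.0))    # -32768
print(to_voltage(16384))  # 5.0

# Storage cost per sample: C short (int16) vs. C double.
print(array.array("h").itemsize, array.array("d").itemsize)  # 2 8
```

Storing the integer codes therefore needs a quarter of the space of the same samples stored as doubles.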

The **DataArray** offers a convenient way to store the data as they come from the DAQ board and to store the DAQ's calibration as polynomial coefficients. The nixio library will then transparently apply them to return the converted data.


```python
import nixio
import numpy as np


def get_calibration(order=2):
    # Fit a polynomial that maps digitized values back to voltages.
    x = np.arange(-10, 10, 0.1)
    y = 2**16 * x / 20
    p = list(np.polynomial.polynomial.polyfit(y, x, order))
    return p


# Simulate a 2 Hz sine wave sampled at 1 kHz and digitize it.
dt = 0.001
time = np.arange(0, 100, dt)
data = np.sin(time * 2 * np.pi * 2)
digitized_values = np.round(2**16 * data / 20)
digitized_values = np.asarray(digitized_values, dtype=int)

nixfile = nixio.File.open("data_acquisition.nix", nixio.FileMode.Overwrite)
block = nixfile.create_block("session 1", "nix.session")
# Store the raw 16 bit integers as they come from the DAQ board.
data_array = block.create_data_array("measurement", "nix.sampled",
                                     dtype=nixio.DataType.Int16,
                                     data=digitized_values,
                                     label="measured voltage",
                                     unit="V")
data_array.append_sampled_dimension(dt, label="time", unit="s")

# Store the calibration; it is applied when the data are read.
data_array.polynom_coefficients = get_calibration(1)
data_array.expansion_origin = 0.0

nixfile.close()
```
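What applying such a calibration amounts to can be sketched in plain Python. The function `apply_calibration` is a hypothetical helper, not part of nixio. It assumes that the coefficients are given in ascending order (as returned by `np.polynomial.polynomial.polyfit`) and that the polynomial is evaluated around the expansion origin.

```python
def apply_calibration(raw_value, coefficients, origin=0.0):
    """Evaluate sum(c_i * (raw_value - origin)**i) for coefficients c_0, c_1, ..."""
    shifted = raw_value - origin
    return sum(c * shifted**i for i, c in enumerate(coefficients))


# For an ideal 16 bit converter on a 20 V range, the linear calibration
# has offset 0 and slope 20 / 2^16.
coefficients = [0.0, 20 / 2**16]

print(apply_calibration(16384, coefficients))   # 5.0
print(apply_calibration(-32768, coefficients))  # -10.0
```

Only the two coefficients and the origin need to be stored next to the 16 bit data to recover the voltages.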


# Chunking

When the backend reads or writes data from/to file, it does so in *chunks* (for experts: **DataArrays** are resizable, therefore chunking is always enabled).

![chunk large](resources/chunks_big.png)

The data may be accessed as a whole big chunk, or in smaller pieces. 

![chunk small](resources/chunks_small.png)

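In reality, there is no such thing as a 2D memory space; memory addresses are always linear. That means a multi-dimensional dataset has to be mapped onto a one-dimensional sequence of addresses, and each chunk is stored as one contiguous block.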

The *chunk* size affects a few things:

1. The read and write speed (large datasets can be read faster with larger chunks).
2. The resize performance and overhead.
3. The efficiency of the compression.



## Read/write performance

One might think that large datasets can generally be written and read faster with large chunks. This is true unless the data are usually accessed in small pieces. In that case the backend needs to read the full chunk into memory (and probably decompress it) before it can return the small piece of data the user requested.
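The effect can be estimated with simple integer arithmetic. The sketch below assumes a one-dimensional dataset and a backend that can only load whole chunks; `chunks_touched` and `samples_loaded` are illustrative helpers, not nixio functions.

```python
def chunks_touched(start, stop, chunk_size):
    """Number of chunks overlapping the half-open sample range [start, stop)."""
    if stop <= start:
        return 0
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    return last - first + 1


def samples_loaded(start, stop, chunk_size):
    """Samples that must be loaded when only whole chunks can be read."""
    return chunks_touched(start, stop, chunk_size) * chunk_size


# Reading 10 samples from a dataset chunked in two different ways:
print(samples_loaded(5000, 5010, chunk_size=100000))  # 100000
print(samples_loaded(5000, 5010, chunk_size=1000))    # 1000
# Reading one million samples in one go:
print(chunks_touched(0, 1000000, chunk_size=100000))  # 10
print(chunks_touched(0, 1000000, chunk_size=1000))    # 1000
```

Small reads favor small chunks, while large sequential reads need far fewer chunk accesses with large chunks.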

## Resize performance

Let's assume that we have already filled the full 9 by 9 chunk with data. Now we want to extend the dataset by another 3 by 3 block of data. With the large chunks we would ask the backend to reserve another full 9 by 9 matrix and write just 9 data points into it. Reserving large amounts of space takes more time and, if it is not filled with meaningful data, creates larger files than strictly necessary.
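The overhead in this example can be counted directly. The sketch assumes that the new block starts at a chunk boundary and that space is always reserved in whole chunks; `allocated_elements` is an illustrative helper, and `math.prod` requires Python 3.8 or newer.

```python
import math


def allocated_elements(block_shape, chunk_shape):
    """Elements reserved when a block is stored in whole chunks."""
    chunks_per_axis = [math.ceil(b / c) for b, c in zip(block_shape, chunk_shape)]
    n_chunks = math.prod(chunks_per_axis)
    return n_chunks * math.prod(chunk_shape)


new_block = (3, 3)  # 9 new data points
for chunk_shape in [(9, 9), (3, 3)]:
    reserved = allocated_elements(new_block, chunk_shape)
    used = math.prod(new_block)
    print(chunk_shape, reserved, reserved - used)
# (9, 9) 81 72
# (3, 3) 9 0
```

With 9 by 9 chunks, 72 of the 81 reserved elements stay unused until more data arrive.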

## Compression performance

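Compression is applied to each chunk individually, so it is more or less efficient depending on the chunk size. Very small chunks give the compression algorithm little redundancy to exploit.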
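The dependence on chunk size can be explored by compressing the same data in independent pieces of different sizes. The sketch uses the standard `zlib` module as a stand-in for the backend's deflate filter; the signal and the helper `compressed_size` are made up for this illustration.

```python
import array
import math
import zlib

# Illustrative int16 signal: a 2 Hz sine sampled at 1 kHz (period = 500 samples).
samples = array.array("h", (round(3000 * math.sin(2 * math.pi * 2 * i * 0.001))
                            for i in range(200000)))
raw = samples.tobytes()


def compressed_size(payload, chunk_bytes):
    """Total size in bytes when every chunk is deflated on its own."""
    return sum(len(zlib.compress(payload[i:i + chunk_bytes]))
               for i in range(0, len(payload), chunk_bytes))


for chunk_bytes in (64, 1024, 65536):
    print(chunk_bytes, compressed_size(raw, chunk_bytes))
```

For this periodic signal the chunks need to span at least one full period before the repetition can be exploited, so the smallest chunk size gives the largest total.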


The chunk size is automatically defined upon creation of the **DataArray**.

``block.create_data_array("name", "type", data=data)``

The **HDF5** backend will try to figure out the optimal chunk size depending on the shape of the data. If one wants to influence the chunking and has a good idea about the usual read and write access patterns (e.g. I know that I will always read one second of data at a time), one can create the **DataArray** with a defined shape and write the data later.

```python
data_array = block.create_data_array("name", "id", dtype=nixio.DataType.Double,
                                     shape=(chunk_samples, number_of_channels),
                                     label="voltage", unit="mV")
data_array.append_sampled_dimension(0.001, label="time", unit="s")
data_array.append_set_dimension(labels=["channel %i" % i for i in range(number_of_channels)])

data_array.write_direct(data)
```

**Note:** If we do not provide the data at the time of **DataArray** creation, we need to provide the data type *dtype*.
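How the shape for such a **DataArray** could be derived from the expected access pattern is sketched below. `chunk_shape_for` is a hypothetical helper, and the numbers (one second per read, 1 kHz sampling, 10 channels) are example values only.

```python
def chunk_shape_for(read_duration_s, dt, number_of_channels):
    """Shape covering one typical read: (samples per read, channels)."""
    chunk_samples = int(round(read_duration_s / dt))
    return (chunk_samples, number_of_channels)


shape = chunk_shape_for(read_duration_s=1.0, dt=0.001, number_of_channels=10)
print(shape)                    # (1000, 10)
print(shape[0] * shape[1] * 8)  # 80000 bytes per chunk for 8 byte doubles
```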


```python
import nixio
import time
import numpy as np


def record_data(samples, channels, dt):
    # Simulate a multichannel recording: phase-shifted sine waves plus noise.
    data = np.zeros((samples, channels))
    t = np.arange(samples) * dt
    for i in range(channels):
        phase = i * 2 * np.pi / channels
        data[:, i] = np.sin(2 * np.pi * t + phase) + (np.random.randn(samples) * 0.1)

    return data


def write_nixfile(filename, chunk_samples=1000, number_of_channels=10, dt=0.001,
                  chunk_count=100, compression=nixio.Compression.No):
    nixfile = nixio.File.open(filename, nixio.FileMode.Overwrite, compression=compression)
    block = nixfile.create_block("Session 1", "nix.recording_session")
    # Create the DataArray with the shape of a single chunk; no data yet.
    data_array = block.create_data_array("multichannel_data", "nix.sampled.multichannel",
                                         dtype=nixio.DataType.Double,
                                         shape=(chunk_samples, number_of_channels),
                                         label="voltage", unit="mV")
    data_array.append_sampled_dimension(dt, label="time", unit="s")
    data_array.append_set_dimension(labels=["channel %i" % i for i in range(number_of_channels)])

    total_samples = chunk_count * chunk_samples
    data = record_data(total_samples, number_of_channels, dt)
    chunks_recorded = 0
    t0 = time.time()
    # Write the first chunk directly, then append the remaining ones.
    while chunks_recorded < chunk_count:
        start_index = chunk_samples * chunks_recorded
        if chunks_recorded == 0:
            data_array.write_direct(data[start_index:start_index + chunk_samples, :])
        else:
            data_array.append(data[start_index:start_index + chunk_samples, :], axis=0)
        chunks_recorded += 1
    total_time = time.time() - t0

    nixfile.close()
    return total_time


time_needed = write_nixfile("chunking_test.nix", chunk_samples=100000, chunk_count=10)
print(time_needed)
```

```
0.16677212715148926
```


# Bugs, issues and contributions

NIX is an open source project; the source code can be found on [GitHub](https://github.com/G-Node/nixpy). As with any software, there are bugs, inconsistencies and usability quirks.

If you find any, or have any comments, please check the project's [issue tracker](https://github.com/G-Node/nixpy/issues) to see whether the issue is already known, and feel invited to open a new one if it has not been addressed yet.

You are also welcome to contribute to the source code e.g. if there is a feature you would find helpful but which has not been implemented yet.
