{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Compression\n", "\n", "If we take a look at the file size of the \"radar_trap.nix\" file in its last version it is grater than **80MB** (Depends a bit on the number of images stored)!\n", "\n", "The reason is the image data of the individual images have a shape of 1024 * 768 * 4 * 4 byte (float32 values) which sums to about 12.5 MB per picture.\n", "\n", "An easy way to work around this is to enable dataset compression in the **HDF5** backend. Simply open a file with the ``DeflateNormal`` flag when creating it.\n", "\n", "``nixfile = nixio.File.open(\"radar_trap.nix\", nixio.FileMode.Overwrite, compression=nixio.Compression.DeflateNormal)``\n", "\n", "If a file is created with this flag, all DataArrays will be losslessly compressed using the embedded gzip algorithm.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Compression comes at a price\n", "\n", "The file size should be tremendously reduced. On the other hand, compression is not for free, **it reduces read and write performance**. \n", "\n", "For some data it might be preferrable to not compress but have a higher read/write performance. In such cases you can switch compression **on** (``nixio.Compression.DeflateNormal``) or **off** (``nixio.Compression.No``) when creating a **DataArray**.\n", "\n", "``block.create_data_array(\"name\", \"type\", data=data, compression=nixio.Compression.Deflate.No)``\n", "\n", "**Note:** Once a DataArray has been created the compression can not be changed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Saving space by not storing doubles\n", "\n", "One way of saving space is to enable compressions, another way is to not save double values but integers.\n", "\n", "For example when we digitize measurements we use a Data Acquisition (DAQ) system that converts measured voltages into digital representations. These systems work with a limited resolution, e.g. 16 bit on a limited range of input voltages e.g. +- 10V.\n", "\n", "![data acquisition](resources/data_acquisition.png)\n", "\n", "Thus, the measured voltage is converted into 16 bit integer values (-32768 -> +32767). The conversion is a simple linear equation:\n", "\n", "y = x ยท 2^16 / 20\n", "\n", "Strictly speaking it makes little sense (except for convenience) to store the voltages as doubles with the full resolution of the doubles if the recording system has only 16 bit resolution!\n", "\n", "*int16* values take 2 bytes while the full *double* takes 8 bytes of memory.\n", "\n", "The **DataArray** offers a convenient way store the data as they come from the DAQ-board and store the DAQ's calibration as polynomials. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Saving space by not storing doubles\n", "\n", "One way of saving space is to enable compression; another is to store integers instead of double values.\n", "\n", "For example, when we digitize measurements we use a Data Acquisition (DAQ) system that converts measured voltages into digital representations. These systems work with a limited resolution, e.g. 16 bit, on a limited range of input voltages, e.g. ±10 V.\n", "\n", "![data acquisition](resources/data_acquisition.png)\n", "\n", "Thus, the measured voltage is converted into 16 bit integer values (-32768 -> +32767). The conversion is a simple linear equation (20 V being the full input range from -10 V to +10 V):\n", "\n", "y = x · 2^16 / 20\n", "\n", "Strictly speaking, it makes little sense (except for convenience) to store the voltages as doubles with the full resolution of the double type if the recording system has only 16 bit resolution!\n", "\n", "*int16* values take 2 bytes while a full *double* takes 8 bytes of memory.\n", "\n", "The **DataArray** offers a convenient way to store the data as they come from the DAQ board and to store the DAQ's calibration as polynomial coefficients. The nixio library will then transparently apply them to return the converted data.\n" ] },
{ "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [], "source": [ "import nixio\n", "import numpy as np\n", "\n", "def get_calibration(order=2):\n", "    # fit a polynomial that maps the raw DAQ integers back to voltages\n", "    x = np.arange(-10, 10, 0.1)\n", "    y = 2**16 * x / 20\n", "    p = list(np.polynomial.polynomial.polyfit(y, x, order))\n", "    return p\n", "\n", "\n", "dt = 0.001\n", "time = np.arange(0, 100, dt)\n", "data = np.sin(time * 2 * np.pi * 2)\n", "# simulate the DAQ: convert the voltages to 16 bit integers\n", "digitized_values = np.round(2**16 * data / 20)\n", "digitized_values = np.asarray(digitized_values, dtype=int)\n", "\n", "nixfile = nixio.File.open(\"data_acquisition.nix\", nixio.FileMode.Overwrite)\n", "block = nixfile.create_block(\"session 1\", \"nix.session\")\n", "data_array = block.create_data_array(\"measurement\", \"nix.sampled\",\n", "                                     dtype=nixio.DataType.Int16,\n", "                                     data=digitized_values,\n", "                                     label=\"measured voltage\",\n", "                                     unit=\"V\")\n", "data_array.append_sampled_dimension(dt, label=\"time\", unit=\"s\")\n", "\n", "# store the calibration; it will be applied when the data is read\n", "data_array.polynom_coefficients = get_calibration(1)\n", "data_array.expansion_origin = 0.0\n", "\n", "nixfile.close()\n" ] },
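{ "cell_type": "markdown", "metadata": {}, "source": [ "Reading the data back shows the transparent conversion: although the file stores 16-bit integers, slicing the **DataArray** returns the calibrated voltages. A minimal read-back sketch, assuming the file and the ``data`` variable from the previous cell:\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import nixio\n", "\n", "nixfile = nixio.File.open(\"data_acquisition.nix\", nixio.FileMode.ReadOnly)\n", "data_array = nixfile.blocks[0].data_arrays[\"measurement\"]\n", "\n", "# slicing applies the stored polynomial, i.e. it returns voltages, not raw integers\n", "voltages = data_array[:10]\n", "print(voltages)\n", "# compare with the original values (requires the previous cell to have run)\n", "print(\"max deviation: %g V\" % np.max(np.abs(voltages - data[:10])))\n", "nixfile.close()\n" ] },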
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Chunking\n", "\n", "When the backend reads or writes data from/to file, it does it in *chunks* (for experts: **DataArrays** are resizable, therefore chunking is always enabled).\n", "\n", "![chunk large](resources/chunks_big.png)\n", "\n", "The data may be accessed as one big chunk, or in smaller pieces.\n", "\n", "![chunk small](resources/chunks_small.png)\n", "\n", "In reality, there is nothing like a 2d memory space; memory addresses are always contiguous. That means the backend has to map the n-dimensional data onto a linear address space, and it does so chunk by chunk: each chunk is stored as one contiguous block in the file.\n", "\n", "The *chunk* size affects a few things:\n", "\n", "1. The read and write speed (large datasets can be read faster with larger chunks).\n", "2. The resize performance and overhead.\n", "3. The efficiency of the compression.\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Read/write performance\n", "\n", "Generally, one might expect that large datasets can be written and read faster with large chunks. This is not wrong, unless the typical access is in small pieces: then the backend needs to read the full chunk into memory (and probably decompress it) only to return the small piece of data the user requested.\n", "\n", "## Resize performance\n", "\n", "Let's assume that we have already filled the full 9 by 9 chunk with data. Now we want to extend the dataset by another 3 by 3 block of data. With the large chunks we would ask the backend to reserve a full additional 9 by 9 chunk and write just 9 data points into it. Reserving large amounts of memory takes more time, and a chunk that is not filled up with meaningful data creates larger files than strictly necessary.\n", "\n", "## Compression performance\n", "\n", "Compression is more or less efficient depending on the chunk size, since each chunk is compressed independently; very small chunks leave the algorithm little redundancy to exploit.\n", "\n", "The chunk size is automatically defined upon creation of the **DataArray**.\n", "\n", "``block.create_data_array(\"name\", \"type\", data=data)``\n", "\n", "The **HDF5** backend will try to figure out the optimal chunk size depending on the shape of the data. If one wants to affect the chunking and has a good idea about the usual read and write access patterns (e.g. \"I know that I will always read one second of data at a time\"), one can create the **DataArray** with a defined shape and write the data later.\n", "\n", "```python\n", "data_array = block.create_data_array(\"name\", \"id\", dtype=nixio.DataType.Double,\n", "                                     shape=(chunk_samples, number_of_channels), label=\"voltage\", unit=\"mV\")\n", "data_array.append_sampled_dimension(0.001, label=\"time\", unit=\"s\")\n", "data_array.append_set_dimension(labels=[\"channel %i\" % i for i in range(number_of_channels)])\n", "\n", "data_array.write_direct(data)\n", "```\n", "\n", "**Note:** If we do not provide the data at the time of **DataArray** creation, we need to provide the data type *dtype*.\n" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.16677212715148926\n" ] } ], "source": [ "import nixio\n", "import time\n", "import numpy as np\n", "\n", "\n", "def record_data(samples, channels, dt):\n", "    # simulate a multichannel recording: phase-shifted sine waves plus noise\n", "    data = np.zeros((samples, channels))\n", "    t = np.arange(samples) * dt\n", "    for i in range(channels):\n", "        phase = i * 2 * np.pi / channels\n", "        data[:, i] = np.sin(2 * np.pi * t + phase) + (np.random.randn(samples) * 0.1)\n", "\n", "    return data\n", "\n", "\n", "def write_nixfile(filename, chunk_samples=1000, number_of_channels=10, dt=0.001, chunk_count=100, compression=nixio.Compression.No):\n", "    nixfile = nixio.File.open(filename, nixio.FileMode.Overwrite, compression=compression)\n", "    block = nixfile.create_block(\"Session 1\", \"nix.recording_session\")\n", "    data_array = block.create_data_array(\"multichannel_data\", \"nix.sampled.multichannel\", dtype=nixio.DataType.Double,\n", "                                         shape=(chunk_samples, number_of_channels), label=\"voltage\", unit=\"mV\")\n", "    data_array.append_sampled_dimension(dt, label=\"time\", unit=\"s\")\n", "    data_array.append_set_dimension(labels=[\"channel %i\" % i for i in range(number_of_channels)])\n", "\n", "    total_samples = chunk_count * chunk_samples\n", "    data = record_data(total_samples, number_of_channels, dt)\n", "    chunks_recorded = 0\n", "    t0 = time.time()\n", "    while chunks_recorded < chunk_count:\n", "        start_index = chunk_samples * chunks_recorded\n", "        if chunks_recorded == 0:\n", "            # the first chunk fills the initially reserved shape\n", "            data_array.write_direct(data[start_index:start_index + chunk_samples, :])\n", "        else:\n", "            data_array.append(data[start_index:start_index + chunk_samples, :], axis=0)\n", "        chunks_recorded += 1\n", "    total_time = time.time() - t0\n", "\n", "    nixfile.close()\n", "    return total_time\n", "\n", "\n", "time_needed = write_nixfile(\"chunking_test.nix\", chunk_samples=100000, chunk_count=10)\n", "print(time_needed)\n" ] },
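{ "cell_type": "markdown", "metadata": {}, "source": [ "With the ``write_nixfile`` helper from above one can also get a rough impression of the performance cost of compression by simply passing a different ``compression`` flag. The file names are made up and the timings will vary between machines; this is a quick sketch, not a rigorous benchmark.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# same settings as above, once without and once with compression\n", "t_plain = write_nixfile(\"chunking_plain.nix\", chunk_samples=100000, chunk_count=10,\n", "                        compression=nixio.Compression.No)\n", "t_deflate = write_nixfile(\"chunking_deflate.nix\", chunk_samples=100000, chunk_count=10,\n", "                          compression=nixio.Compression.DeflateNormal)\n", "print(\"uncompressed: %.3f s, compressed: %.3f s\" % (t_plain, t_deflate))\n" ] },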
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Bugs, issues and contributions" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "NIX is an open source project; the source code can be found on [github](https://github.com/G-Node/nixpy). As with any software there are bugs, inconsistencies, or usability quirks.\n", "\n", "If you find any or have any comments, please check the project's [issue tracker](https://github.com/G-Node/nixpy/issues) to see whether the problem is already known, and feel invited to open a new issue if it has not been addressed yet.\n", "\n", "You are also welcome to contribute to the source code, e.g. if there is a feature you would find helpful but which has not been implemented yet.\n" ] } ], "metadata": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 2 }