
.. index:: ! Usecase; Encrypted data storage and transport

.. _usecase_encrypted_annex:

Encrypted data storage and transport
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Some data are not meant for everybody's eyes - you can share a picture of a mid-flight view from a plane window on your social media account without a problem, but you `shouldn't post a photo of your plane ticket next to it <https://mango.pdf.zone/finding-former-australian-prime-minister-tony-abbotts-passport-number-on-instagram>`_.
But there are also data so sensitive that not only should you not share them anywhere, you also need to make sure that they are inaccessible even if someone sneaks into your storage system or intercepts a file transfer - things such as passwords, private messages, or medical data.
One technical solution to this problem is `encryption <https://en.wikipedia.org/wiki/Encryption>`_.
During encryption, the sensitive data are obfuscated in a structured but secret manner, and only authorized agents know how to decrypt the data back into a usable state.
Because encryption is relevant, or at least attractive, to many applications involving data in :term:`DataLad dataset`\s, this use case demonstrates how to utilize `git-annex's encryption <https://git-annex.branchable.com/encryption>`_ to keep data safely encrypted when it is not being used.
This example workflow mimics a sensitive data deposition use case, in which data need to be securely deposited on one machine, and downloaded to another machine for processing and storage.
To make this work, our workflow combines several independent pieces:

#. Using git-annex encryption
#. Using RIA stores
#. Working in temporary (ephemeral) clones that combine information from
   several remotes.

Note that these components were selected to fit a particular setup; neither RIA stores nor temporary clones are required components for encryption.
The challenge
=============

Bob does data collection through a web form. Incoming form entries
are saved as individual files by the web server. Bob periodically
imports the data to his local machine, and uses them to produce an
updated summary. For security reasons, he does not want the data to lie
around unencrypted.

.. figure:: ../artwork/src/encryption_sketch.svg
The DataLad approach
====================

Bob sets up his remote machine (web server) so that incoming data are
saved in a DataLad dataset. On that machine, he also creates a RIA
sibling for the dataset (accessible from the outside through SSH, and
only to a few authorized users), and enables encryption for annexed
contents. He chooses a scheme in which the server only has the
encryption key. Each new data file will be saved in the dataset,
pushed to that RIA store, and dropped from the dataset, leaving only
the encrypted copy on the remote server. On his local machine, he also
sets up RIA stores, enables encryption, and scripts his analysis to
check out the datasets locally, fetch updates from the remote machine,
and push data back to the local stores. In this way, the data are never
available anywhere in decrypted form, except while they are being
processed.
Step by step
============

Before we start: GnuPG
----------------------

DataLad relies on :term:`git-annex` to manage large file content, and git-annex relies in part on `GnuPG <https://gnupg.org>`__ to manage encryption via *public-key cryptography*.
`Public-key cryptography <https://en.wikipedia.org/wiki/Public-key_cryptography>`_ relies on key pairs for encryption and decryption.
To proceed with the next steps, you will need at least one pair of GPG
keys, one *private* (used for decryption) and one *public* (used for
encryption). The relevant keys need to be known to the GPG program on
the machine you are using.
We won't go into detail about GPG, but the most useful commands are:

- ``gpg --full-generate-key`` to generate a key pair,
- ``gpg --list-keys`` and ``gpg --list-secret-keys`` to display known public and private keys,
- ``gpg --export -a [keyID] > my_public_key.asc`` to export a public key, and
- ``gpg --import my_public_key.asc`` to import a previously exported key on another machine.
Setup overview
--------------

In this workflow, we describe encrypted data storage and transport between two locations.
We will call them *remote server* and *local machine*.
In this example, the *remote server* is where the data originates (is created or deposited), and the *local machine* is where the data is downloaded to, processed, and saved for future access.
Choices made in the following examples (encryption modes, sibling types, data flow) were dictated by the particular setup of the use case leading to this chapter.
Specifically, our data was entered through a web form; the script responsible for serving the website wrote incoming data into JSON files and saved them into a DataLad dataset.
Although the remote server used for data deposition provided authenticated access only, it was hosted outside of what we considered our trusted infrastructure.
Because of that, for further processing we fetched the encrypted data onto our local machine, which was the only place where we could store decryption keys and access credentials.
Ultimately, the examples should provide a good overview of encrypted data workflows with DataLad that is easily adapted to different setups.
Remote server: encrypted deposition
-----------------------------------

On the remote server, we start by creating a DataLad dataset to track the deposition of raw data.
Since encryption is applied by git-annex :term:`special remote`\s (and thus only applies to annexed files), we stay with the default dataset configuration, which annexes all files.

.. code-block:: bash

   $ datalad create incoming_data
   $ cd incoming_data

Then, we create a local :term:`Remote Indexed Archive (RIA) store` as a :term:`sibling` for the dataset. We choose a local RIA store because we don't want to move the data off the server yet.
.. note::

   Using a RIA store is a choice for this use case, but *not* a requirement for data encryption. Encryption can be enabled in the same way for any kind of git-annex :term:`special remote`.
   In fact, the primary use case for encryption in git-annex is sending file content to remotes hosted by an untrusted party.

The sibling created in the example below is called ``entrystore``, but a RIA sibling actually consists of two parts, with ``entrystore`` being only one of them.
The other, which by default uses the sibling name with a ``-storage`` suffix ("``entrystore-storage``"), is an automatically created :term:`special remote` that stores the annexed files.
.. code-block:: bash

   $ datalad create-sibling-ria \
       --new-store-ok --name entrystore \
       --alias incoming-data \
       ria+file:///data/project/store

Now we tell git-annex to encrypt annexed content placed in the store.
We choose regular public-key encryption with shared filename encryption (``sharedpubkey``).
In this method, access to *public* keys is required to store files in the remote, but a *private* key is required for retrieval.
So if we only store our public key on the machine, an intruder will have no means to decrypt the data even if they gain access to the server.
.. code-block:: bash

   $ git annex enableremote \
       entrystore-storage \
       encryption=sharedpubkey \
       keyid=9AB670707D8EA564119785922EF857223E033AF1
   enableremote entrystore-storage (encryption setup) (to gpg keys: 2EF857223E033AF1) ok
   (recording state in git...)

If we want to add another encryption key, the step above can be repeated
with ``keyid+=...``.
With this setup, whenever a new data file is uploaded into the dataset on the server, the file needs to be saved, pushed to the encrypted storage, and finally dropped in its unencrypted form:

.. code-block:: bash

   $ datalad save -m "Adding new file" entry-file-name.dat
   $ datalad push --to entrystore entry-file-name.dat
   $ datalad drop entry-file-name.dat
An important technical detail about git-annex is that the ``sharedpubkey`` mode encrypts file *content* using GPG, but file *names* using `HMAC <https://en.wikipedia.org/wiki/HMAC>`_.
However, the "HMAC cipher" (the secret used for this encryption) is stored unencrypted in the git repository.
This makes it possible to add new files without access to the private GPG keys - but it also means that access to the git repository will reveal file names.
Since a RIA store combines a bare git repository with annex storage in the same location, we should take care not to include sensitive information in file names.
See `git-annex's documentation <https://git-annex.branchable.com/encryption>`__ and the section :ref:`privacy` for more details.
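To see why this matters, here is a minimal Python sketch of the situation. This is a toy illustration only: the secret value and file name are made up, and git-annex's actual key-naming scheme differs.

```python
# Toy sketch: the HMAC secret lives unencrypted in the git repository,
# so anyone with repository access can recompute the obfuscated file
# names (and test guesses against them) without any GPG private key.
import hashlib
import hmac

# Assumption: a made-up secret; git-annex stores its real one in the repo.
hmac_secret = b"cipher-stored-in-plain-text-in-the-repo"

def obfuscated_name(filename: str) -> str:
    """Map a file name to its HMAC-based storage name (illustrative)."""
    return hmac.new(hmac_secret, filename.encode(), hashlib.sha1).hexdigest()

stored = obfuscated_name("patient-042-record.json")

# An intruder holding the repository can confirm a guessed name:
assert obfuscated_name("patient-042-record.json") == stored
```

This is exactly why sensitive information should be kept out of file names in this mode.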
Local machine: decryption
-------------------------

In order to retrieve the encrypted data securely from the remote server and perform processing on unencrypted data, we start once again by creating a DataLad dataset:

.. code-block:: bash

   $ datalad create derived_data
   $ cd derived_data
We then install the dataset from the RIA store on the remote server as a subdataset with input data, using :dlcmd:`clone` with an :term:`SSH` URL to the dataset in the RIA store.

.. code-block:: bash

   $ datalad clone -d . ria+ssh://... inputs

Next, we can retrieve all data:

.. code-block:: bash

   $ datalad get inputs
As long as we have the required private key, GPG will be used to quietly
decrypt all files during the ``get`` operation, so our dataset clone
will contain already decrypted data.
At this stage we may add our data processing code (likely putting it
under a ``code/`` directory, and using ``.gitattributes`` to decide whether
code files should be tracked by :term:`Git`), and use ``datalad run`` to produce
derived data.
Since we intend all our data to be encrypted at rest also on this
machine, we will create another set of RIA siblings and tell git-annex to use encryption.
Because here we have access to our private key, we will use the default, more flexible scheme with ``hybrid`` encryption keys.
Note that in ``hybrid`` mode, a private key is needed for *both* retrieval
and deposition of annexed contents, but it is easy to add new keys
without having to re-encrypt the data.
File content and names are encrypted with a symmetric cipher, which is itself encrypted using GPG and stored encrypted in the git repository.
See `git-annex's documentation <https://git-annex.branchable.com/encryption>`__ for more details.
.. code-block:: bash

   $ datalad create-sibling-ria \
       --new-store-ok --name localstore \
       --alias derived \
       ria+file:///data/project/store
   $ git annex enableremote \
       localstore-storage \
       encryption=hybrid \
       keyid=2EF857223E033AF1
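The key layout that makes adding keys cheap in ``hybrid`` mode can be sketched as follows. This is a toy illustration only: the XOR-based "encryption" stands in for the real GPG and symmetric ciphers, and all values are made up.

```python
# Toy sketch of the *hybrid* key layout (NOT real cryptography):
# file content is encrypted once with a random symmetric cipher, and
# only that small cipher is wrapped separately for each authorized key.
import hashlib
import os

def keystream(key: bytes, length: int) -> bytes:
    """Derive a deterministic pseudo-random keystream (toy, not secure)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor(data: bytes, key: bytes) -> bytes:
    """XOR data with a key-derived keystream; applying it twice decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

# One symmetric cipher encrypts the actual file content ...
symmetric_key = os.urandom(32)
content = b"sensitive survey entry"
ciphertext = xor(content, symmetric_key)

# ... and the repository stores one *wrapped* copy of that small
# symmetric key per authorized user key.
user_keys = {"alice": os.urandom(32)}
wrapped = {u: xor(symmetric_key, k) for u, k in user_keys.items()}

# Adding a new key only wraps the symmetric key again --
# the encrypted file content itself is never touched.
user_keys["bob"] = os.urandom(32)
wrapped["bob"] = xor(symmetric_key, user_keys["bob"])

# Any authorized user can unwrap the cipher and decrypt the content.
recovered = xor(wrapped["bob"], user_keys["bob"])
assert xor(ciphertext, recovered) == content
```

In git-annex, the "wrapping" is done with GPG against each ``keyid``, which is why ``keyid+=`` can grant access to existing data without re-encrypting it.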
We repeat the same for the input subdataset, so that we can maintain a local copy of the raw data.

.. code-block:: bash

   $ cd inputs
   $ datalad create-sibling-ria \
       --name localstore --alias raw \
       ria+file:///data/project/store
   $ git annex enableremote \
       localstore-storage \
       encryption=hybrid \
       keyid=2EF857223E033AF1
   $ cd ..
Depending on what is more convenient for us, we could either keep the current dataset clones and drop only the annexed file content after pushing, or treat the clones as temporary and remove them altogether.
Here, we will use the second option.
For this reason, we need to declare the current clones "dead" to git-annex before pushing, so that subsequent clones from the RIA store won't consider this location for obtaining files.
Since we gave the super- and subdataset's siblings the same name, "``localstore``", we can use ``push --recursive``.

.. code-block:: bash

   $ datalad foreach-dataset git annex dead here
   $ datalad push --recursive --to localstore
And in the end we can clean up by removing the temporary clone:

.. code-block:: bash

   $ cd ..
   $ datalad drop --recursive --what all --dataset derived_data

.. note::

   Although locations declared to be "dead" are not considered for obtaining files, they still leave a record in the git-annex branch.
   An even better solution would be to create the repository (and subsequent temporary clones) using git-annex's private mode; however, it is not yet fully supported by DataLad.
   See `git-annex's documentation <https://git-annex.branchable.com/tips/cloning_a_repository_privately>`__
   for private mode and `this DataLad issue <https://github.com/datalad/datalad/issues/6456>`__
   tracking DataLad's support for the configuration.
Performing updates with temporary (ephemeral) clones
----------------------------------------------------

The remaining part of the workflow focuses on working with temporary
clones and using them to transfer updates between different data stores.
The process is not affected by whether or not encryption was used, as
encryption and decryption happen quietly during ``get`` and ``push``.
Any time we want to include new data from ``entrystore`` in our local
copy / derived dataset, we would start by cloning the derived dataset
from the local RIA store, and getting the input subdataset (without
getting file contents yet):
.. code-block:: bash

   $ datalad clone \
       ria+file:///data/project/store#~derived \
       derived_data
   $ cd derived_data
   $ datalad get --no-data inputs
Our next step would be to obtain the files from the remote server that we
don't yet have locally. At this moment it is a good idea to stop and
consider what the input dataset "knows" about other locations:

.. code-block:: bash

   $ datalad siblings -d inputs
   .: here(+) [git]
   .: origin(-) [/data/project/store/8e4/65aa4-af88-4abd-aaa0-d248339780be (git)]
   .: localstore-storage(+) [ora]
   .: entrystore-storage(+) [ora]
Since we cloned the superdataset from the local RIA store, the subdataset also has ``origin`` (a :term:`Git` :term:`remote`) pointing to that store.
It also has the ``localstore-storage`` and ``entrystore-storage`` :term:`sibling`\s; these are the
git-annex :term:`special remote`\s for the local and remote RIA stores, respectively.
But to learn about new files that were added on the remote server since we last cloned from there, we need a Git remote pointing there.

.. note::

   In the future, adding the Git remote manually may become unnecessary.
   See `this issue <https://github.com/datalad/datalad-next/issues/170>`__ tracking related work in the DataLad-next extension.
Let's add it then (note that when working with the ``datalad siblings``
or ``git remote`` commands, we cannot use the ``ria+ssh://...#~alias``
URL, and need to use the actual SSH URL and file system path).

.. code-block:: bash

   $ cd inputs
   $ git remote add entrystore \
       ssh://example.com/data/project/store/alias/incoming-data
Now we can obtain updates from the entrystore sibling (pair). We may
choose to fetch only, to see what is new before merging:

.. code-block:: bash

   $ datalad update --sibling entrystore --how fetch
   $ datalad diff --from main --to entrystore/main

If no updates were reported, we could decide to finish our work
right there. Since there are new files, we will integrate the changes
(since we didn't change the input dataset locally, there is no practical
difference between using ``ff-only`` and ``merge``).

.. code-block:: bash

   $ datalad update --sibling entrystore --how merge
.. find-out-more:: A note for users of the Python API

   The results of the ``diff`` command include files that were not changed, so to look for changes we need to filter the results by state;
   e.g. if we only expect additions, we can do this:

   .. code-block:: python

      # ``subds`` is the Dataset instance of the input subdataset
      added_files = subds.diff(
          fr='main',
          to='entrystore/main',
          result_filter=lambda x: x['state'] == 'added',
      )
Now that we have the latest version of the subdataset, we can repeat the update procedure (note that this time we push to ``origin``):

.. code-block:: bash

   $ datalad save -m "Updated subdataset"
   $ datalad run ...
   $ datalad foreach-dataset git annex dead here
   $ datalad push --recursive --to origin
   $ cd ..
   $ datalad drop --recursive --what all --dataset derived_data
Note that in this case our input dataset has two RIA siblings, one local (``ria+file://``) and one remote (``ria+ssh://``).
Due to this difference, they should be configured with a different "cost" of retrieving data (inspect the output of ``git annex info entrystore-storage``).
The section :ref:`cloneprio` shows how this can be done.
With such a configuration, when DataLad gets files as part of ``datalad run``, the local storage will be prioritized, and only the recently added files will be downloaded from the remote storage.
A subsequent push will bring the local storage up to
date, and the process can be repeated.
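For illustration, such a preference could look like this in the input dataset's ``.git/config``, using git-annex's ``annex-cost`` setting (the values below are made up; only their relative order matters, and lower-cost remotes are tried first):

.. code-block:: ini

   [remote "localstore-storage"]
           annex-cost = 150
   [remote "entrystore-storage"]
           annex-cost = 250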