.. _gobigsummary:

Summary
-------

If you want to go big, DataLad is a suitable tool and, if used correctly, can
overcome shortcomings of Git and git-annex. Scaling up requires some thought,
though, and in some instances compromise.

- The general mechanism that allows scaling up is nesting datasets. This process
  can be done by hand or programmatically. Recursive operations ease working
  across a hierarchy of datasets and create a monorepo-like experience.
- Beware of accidentally placing too many (even small) files into Git's version
  control in a single dataset!
  ``.gitignore`` files can keep irrelevant files out of version control, the
  ``--explicit`` option of :dlcmd:`run` may be helpful, and
  custom largefile rules in ``.gitattributes`` may be necessary to override
  dataset configurations such as ``text2git``.
- Don't consider only the limits of version control software, but also the
  limits of your file system. Too many files in a single directory can become
  problematic even without version control.
- If things go wrong, it's not all lost. There are ways to clean up your dataset
  if it ever gets clogged, although they are the software equivalent of a
  blowtorch and should be handled with care.

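
To get a feeling for where a directory tree might run into the file-count
problems mentioned above, a small helper script can be useful. The following
is only a minimal sketch in plain Python and not part of DataLad: the function
names and the default threshold of 10,000 files are assumptions made for
illustration, and a sensible threshold depends on your file system and
workflow. It counts the files directly inside each directory (skipping
``.git``) and lists the directories that exceed the threshold, which could be
candidates for restructuring or for becoming subdatasets.

```python
import os
from collections import Counter


def count_files_per_directory(root):
    """Map each directory below ``root`` to the number of files directly in it."""
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into version control internals such as .git.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        # Symlinks to files (e.g., annexed files) are counted as files here.
        counts[dirpath] = len(filenames)
    return counts


def crowded_directories(root, threshold=10_000):
    """Return (directory, file count) pairs above ``threshold``, largest first.

    The default threshold is an arbitrary illustration value, not a
    recommendation from Git, git-annex, or DataLad.
    """
    counts = count_files_per_directory(root)
    return [(d, n) for d, n in counts.most_common() if n > threshold]
```

Calling ``crowded_directories(".")`` in the root of a dataset returns a list
that is empty if no directory exceeds the threshold. The script only reports
numbers; deciding how to split a crowded directory remains a manual judgment
call.
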
  22. Now what can I do with it?
  23. ^^^^^^^^^^^^^^^^^^^^^^^^^^
Go big, if you want to. :ref:`Distribute 80TB of files <usecase_HCP_dataset>`
or `more <https://github.com/datalad/datalad-ukbiobank>`_, or version control
large analyses while keeping the performance loss of your version control
tools to a minimum.