#!/bin/bash
##### DataLad Handbook 3.2 ######
##### DataLad-centric analysis with job scheduling and parallel computing
# http://handbook.datalad.org/en/latest/beyond_basics/101-170-dataladrun.html
# top-level analysis dataset with subdatasets
# $ datalad create parallel_analysis
# $ cd parallel_analysis
# a) pipeline dataset (with a configured software container)
# $ datalad clone -d . https://github.com/ReproNim/containers.git
# b) input dataset
# $ datalad clone -d . /path/to/my/rawdata
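# resulting hierarchy (sketch; subdataset names assume the defaults derived
# from the clone sources above):
#   parallel_analysis/
#   ├── containers/   <- pipeline subdataset
#   └── rawdata/      <- input subdataset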
# data analysis with a software container that performs a set of analyses;
# results will be aggregated into the top-level dataset.
# Individual jobs are computed in throw-away dataset clones (& branches)
# to avoid unwanted interactions between parallel jobs.
# Results are pushed back (as branches) into the target dataset.
# A manual merge aggregates all results into the master branch of the dataset.
# The following analysis processes rawdata with
# - a pipeline from the ReproNim/containers subdataset
# - and collects outcomes in the top-level parallel_analysis dataset
# You could also add and configure the container using datalad containers-add
# in the top-most dataset. This solution makes the container less usable, though.
# If you have more than one application for a container, keeping it as a
# standalone dataset can guarantee easier reuse.
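# for reference, a minimal sketch of that alternative (container name, image
# URL, and tag are hypothetical; datalad containers-add is provided by the
# datalad-container extension):
# $ datalad containers-add mypipeline \
#     --url shub://ReproNim/containers:bids-fmriprep--20.2.0 \
#     --call-fmt 'singularity exec {img} {cmd}'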
# What you will submit as a job with a job scheduler is a shell script that
# contains all relevant data analysis steps, not a single datalad
# containers-run call, because datalad run does not support concurrent
# execution in the same dataset clone.
# Solution: create one throw-away dataset clone for each job.
# We treat cluster compute nodes like contributors to the analyses:
# They clone the analysis dataset hierarchy into a temporary location,
# run the computation, push the results, and remove their temporary dataset again.
# The compute job clones the dataset to a unique place, so that it can run a
# containers-run command inside it without interfering with any other job.
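# in short, each job performs the following sequence (detailed below):
#   1. clone the analysis dataset to a job-specific temporary location
#   2. mark the clone as throw-away (git annex dead here)
#   3. check out a job-specific branch
#   4. run the analysis via datalad (containers-)run
#   5. push the result branch back to the original dataset, discard the clone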
# fail whenever something is fishy, use -x to get verbose logfiles
set -e -u -x
# we pass arbitrary arguments via job scheduler and can use them as variables
indir=$1
# The first part of the script is therefore to navigate to a unique location,
# and clone the analysis dataset to it.
# go into unique location
cd /tmp
# clone the analysis dataset. flock makes sure that this does not interfere
# with another job finishing and pushing results back at the same time
flock --verbose $DSLOCKFILE datalad clone /data/group/psyinf/studyforrest-srm-movies chaeusler-concat
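# note: $DSLOCKFILE must name one shared, pre-existing lock file that all jobs
# use; e.g. (hypothetical path) created once before any job is submitted with:
# $ touch /data/group/psyinf/.dslock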
cd chaeusler-concat
# This dataset clone is temporary: It will exist over the course of one analysis/job only,
# but before it is purged, all of the results it computed will be pushed
# to the original dataset. This requires a safeguard: If the original dataset
# receives the results from the dataset clone, it knows about the clone and its
# state. In order to protect the results from someone accidentally synchronizing
# (updating) the dataset from its linked dataset after it has been deleted,
# the clone should be created as a "throw-away clone" right from the start. By
# running git annex dead here, git-annex disregards the clone, preventing the
# deletion of data in the clone from affecting the original dataset.
# announce the clone to be temporary
git annex dead here
# The datalad push to the original clone location of a dataset needs to be prepared
# carefully. The job computes one result (out of many results) and saves it,
# thus creating new data and a new entry with the run-record in the dataset
# history. But each job is unaware of the results and commits produced by other
# jobs. Should all jobs push back their results to the original place (the
# master branch of the original dataset), the individual jobs would conflict with
# each other or, worse, overwrite each other (if you don't have the default
# push configuration of Git).
# The general procedure and standard Git workflow for collaboration, therefore,
# is to create a change on a different, unique branch, push this branch,
# and integrate the changes into the original master branch via a merge
# in the original dataset.
# In order to do this, prior to executing the analysis, the script will checkout
# a unique new branch in the analysis dataset. The most convenient name for the
# branch is the job ID, an identifier under which the job scheduler runs an
# individual job. This makes it easy to associate a result (via its branch)
# with the log, error, or output files that the job scheduler produces, and
# the real-life example demonstrates these advantages more concretely.
# git checkout -b <name> creates a new branch and checks it out
# checkout a unique branch
git checkout -b "job-$JOBID"
# $JOBID isn't hardcoded into the script; it can be given to the script as
# an environment or input variable at the time of job submission.
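# as a sketch of how both variables could be supplied (the real-life example
# uses HTCondor; the lock file path here is hypothetical), a submit file may
# contain a line such as:
#   environment = "JOBID=$(Cluster).$(Process) DSLOCKFILE=$ENV(PWD)/.condor_datalad_lock"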
# Next, it's time for the datalad run call. The invocation will depend on
# the container and dataset configuration (both of which are demonstrated in the
# real-life example in the next section); below, we pretend that the
# invocation only needs an input and an output location. The input is
# specified via a bash variable ($indir) that is defined in the script and
# provided at the time of job submission as a command line argument from the
# job scheduler, and the output location is based on the input.
# After the datalad run execution in the script, the results can be pushed back
# to the dataset sibling origin:
# run the job
datalad run \
    -m "Concatenating and z-scoring runs of ${indir}" \
    --explicit \
    --input "$indir" \
    --output "$indir" \
    ./code/data_mask_concat_runs.py \
    -sub "{inputs}" -outdir "{outputs}"
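# for illustration: with indir=sub-01 (hypothetical value), {inputs} and
# {outputs} expand to the --input/--output values above, so the executed
# command becomes: ./code/data_mask_concat_runs.py -sub sub-01 -outdir sub-01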
# push, with file locking as a safe-guard
flock --verbose $DSLOCKFILE datalad push --to origin
# Done - job handler should clean up workspace
# afterwards, in the original dataset, manually merge all job branches:
# git merge -m "Merge results from job cluster XY" $(git branch -l | grep 'job-' | tr -d ' ')
# and delete the merged branches:
# git branch | grep 'job-' | xargs git branch -D