#!/bin/bash
# data_mask_concat_jobs.sh
##### DataLad Handbook 3.2 ######
##### DataLad-centric analysis with job scheduling and parallel computing
# http://handbook.datalad.org/en/latest/beyond_basics/101-170-dataladrun.html
# top-level analysis dataset with subdatasets
# $ datalad create parallel_analysis
# $ cd parallel_analysis
# a) pipeline dataset (with a configured software container)
# $ datalad clone -d . https://github.com/ReproNim/containers.git
# b) input dataset
# $ datalad clone -d . /path/to/my/rawdata
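# A hedged sketch (an assumption, not spelled out above): the analysis code this
# job script calls later (./code/data_mask_concat_runs.py) would also need to be
# tracked in the top-level dataset, e.g.
# $ mkdir code
# $ cp /path/to/data_mask_concat_runs.py code/
# $ datalad save -m "add analysis script" code/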
# data analysis with a software container that performs a set of analyses
# results will be aggregated into a top-level dataset
# Individual jobs are computed in throw-away dataset clones (& branches)
# to avoid unwanted interactions between parallel jobs.
# results are pushed back (as branches) into the target dataset.
# A manual merge aggregates all results into the master branch of the dataset.
# The following analysis processes rawdata with a pipeline from the
# ReproNim/containers subdataset and collects the outcomes in the top-level
# parallel_analysis dataset.
# You could also add and configure the container in the top-most dataset using
# datalad containers-add. That solution makes the container less usable,
# though: if you have more than one application for a container, keeping it
# as a standalone dataset makes reuse easier.
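# A hedged sketch (hypothetical container name and URL; requires the
# datalad-container extension) of what registering a container directly in the
# top-most dataset could look like:
# $ datalad containers-add fmri-pipeline --url shub://some-org/some-container:latest \
#       --call-fmt 'singularity exec {img} {cmd}'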
# What you will submit as a job to the job scheduler is a shell script that
# contains all relevant data analysis steps, not a bare datalad containers-run
# call, because datalad run does not support concurrent execution in the same
# dataset clone.
# Solution: create one throw-away dataset clone for each job.
# We treat cluster compute nodes like contributors to the analyses:
# They clone the analysis dataset hierarchy into a temporary location,
# run the computation, push the results, and remove their temporary dataset again.
# The compute job clones the dataset to a unique place, so that it can run a
# containers-run command inside it without interfering with any other job.
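# A hedged sketch of a submission (assumption: SLURM and hypothetical paths;
# the handbook chapter demonstrates the same pattern with HTCondor), starting
# one job per input directory and handing JOBID and DSLOCKFILE to this script
# via the environment:
#
#   export DSLOCKFILE="$PWD/.datalad_lock"
#   for sub in rawdata/sub-*; do
#       sbatch --export=ALL,JOBID="$(basename "$sub")" \
#           code/data_mask_concat_jobs.sh "$sub"
#   done
#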
# fail whenever something is fishy, use -x to get verbose logfiles
set -e -u -x
# we pass arbitrary arguments via the job scheduler and can use them as variables
indir=$1
# The first part of the script is therefore to navigate to a unique location
# and clone the analysis dataset to it.
# go into a job-specific location under /tmp so that concurrent jobs on the
# same node cannot collide
cd /tmp
# clone the analysis dataset. flock makes sure that this does not interfere
# with another job finishing and pushing results back at the same time
flock --verbose "$DSLOCKFILE" datalad clone /data/group/psyinf/studyforrest-srm-movies "chaeusler-concat-$JOBID"
cd "chaeusler-concat-$JOBID"
# This dataset clone is temporary: It will exist over the course of one analysis/job only,
# but before it is purged, all of the results it computed will be pushed
# to the original dataset. This requires a safe-guard: If the original dataset
# receives the results from the dataset clone, it knows about the clone and its
# state. In order to protect the results from someone accidentally synchronizing
# (updating) the dataset from its linked dataset after it has been deleted,
# the clone should be created as a “throw-away clone” right from the start. By
# running git annex dead here, git-annex disregards the clone, preventing the
# deletion of data in the clone from affecting the original dataset.
# announce the clone to be temporary
git annex dead here
# The datalad push to the original clone location of a dataset needs to be prepared
# carefully. The job computes one result (out of many results) and saves it,
# thus creating new data and a new entry with the run-record in the dataset
# history. But each job is unaware of the results and commits produced by other
# jobs. Should all jobs push their results back to the original place (the
# master branch of the original dataset), the individual jobs would conflict with
# each other or, worse, overwrite each other (if you don’t have the default
# push configuration of Git).
# The general procedure and standard Git workflow for collaboration, therefore,
# is to create the change on a different, unique branch, push this branch,
# and integrate the changes into the original master branch via a merge
# in the original dataset.
# In order to do this, prior to executing the analysis, the script will checkout
# a unique new branch in the analysis dataset. The most convenient name for the
# branch is the Job-ID, an identifier under which the job scheduler runs an
# individual job. This makes it easy to associate a result (via its branch)
# with the log, error, or output files that the job scheduler produces, and
# the real-life example will demonstrate these advantages more concretely.
# git checkout -b <name> creates a new branch and checks it out
# checkout a unique branch
git checkout -b "job-$JOBID"
# $JOBID isn’t hardcoded into the script; it can be given to the script as
# an environment or input variable at the time of job submission.
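# A hedged alternative (assumption: SLURM): if no JOBID is passed explicitly,
# a scheduler-provided identifier could serve as a fallback, e.g.
# JOBID=${JOBID:-${SLURM_JOB_ID:-unknown}}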
# Next, it’s time for the datalad run command. The invocation will depend on
# the container and dataset configuration (both of which are demonstrated in the
# real-life example in the next section). Below, the invocation only needs an
# input directory and an output directory. The input is specified via a bash
# variable ($indir) that is defined in the script and provided at the time of
# job submission via a command-line argument from the job scheduler; here the
# output location is the same directory.
# After the run execution in the script, the results can be pushed back to the
# dataset sibling origin:
# run the job
datalad run \
    -m "Concatenating and z-scoring runs of ${indir}" \
    --explicit \
    --input "$indir" \
    --output "$indir" \
    ./code/data_mask_concat_runs.py \
    -sub "{inputs}" -outdir "{outputs}"
# push, with file locking as a safe-guard
flock --verbose "$DSLOCKFILE" datalad push --to origin
# Done - the job handler should clean up the workspace
# Afterwards, in the original dataset, manually merge all job branches, e.g.:
# git merge -m "Merge results from job cluster XY" $(git branch -l | grep 'job-' | tr -d ' ')
# delete the merged job branches:
# git branch | grep 'job-' | xargs git branch -D
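# A hedged sketch (not part of the original workflow): before merging, check
# that every job delivered a result branch, and inspect the history afterwards:
# $ git branch -l | grep -c 'job-'
# $ git log --oneline --graph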