Running on many samples

Running the pipeline on a few samples is relatively straightforward, but typing out every command becomes unwieldy when there are hundreds of samples. In that case it is a good idea to write a wrapper script to handle the runs.

As an example let's assume that you have the following files:

.
└── fastq
    ├── ERR1664623_1.fastq.gz
    ├── ERR1664623_2.fastq.gz
    ├── ERR1664624_1.fastq.gz
    ├── ERR1664624_2.fastq.gz
    ├── ERR1664635_1.fastq.gz
    ├── ERR1664635_2.fastq.gz
    ├── ERR1664636_1.fastq.gz
    ├── ERR1664636_2.fastq.gz
    ├── ERR1664664_1.fastq.gz
    ├── ERR1664664_2.fastq.gz
    ├── ERR1664665_1.fastq.gz
    └── ERR1664665_2.fastq.gz

First, create a list of the sample prefixes that you would like to run. To do this, run the following command:

ls fastq/ | grep '_1\.fastq\.gz$' | sed 's/_1\.fastq\.gz$//' > samples.txt
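Because the prefix list is built from the _1 files only, a sample whose _2 file is missing would still end up in samples.txt. A minimal sketch of a check for this is shown here; the function name check_pairs is hypothetical and not part of tb-profiler, and it assumes the _1.fastq.gz / _2.fastq.gz naming used in this example.

```shell
# check_pairs LIST DIR: print a line for every paired-end FASTQ file that is
# expected for a prefix in LIST (one prefix per line) but is missing from DIR.
# Hypothetical helper for illustration only.
check_pairs() {
    list=$1
    dir=$2
    while IFS= read -r sample; do
        [ -n "$sample" ] || continue   # skip blank lines
        for mate in 1 2; do
            file="$dir/${sample}_${mate}.fastq.gz"
            [ -f "$file" ] || echo "missing: $file"
        done
    done < "$list"
}
```

Running check_pairs samples.txt fastq prints nothing when every sample has both files, and one "missing: ..." line for each file that is absent.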

This will put all the file prefixes into a file called samples.txt, one per line. We can then use parallel to run the tb-profiler command for each sample in the file. Before doing so, we should create the folders where tb-profiler will store the bam, vcf and result files. Otherwise the multiple instances of tb-profiler started by parallel will all try to create the same folders at the same time, and you will run into an error.

mkdir -p bam vcf results

Now we are ready to run tb-profiler in parallel.

cat samples.txt | parallel --bar -j 2 tb-profiler profile -1 fastq/{}_1.fastq.gz -2 fastq/{}_2.fastq.gz -p {}

You can adjust the -j parameter to allow more jobs to run in parallel. It is set to 2 here, but if you have access to an HPC cluster or a powerful computer you can increase it.
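Before launching hundreds of jobs, it can help to preview exactly which commands will be run. The sketch below writes out one tb-profiler command per sample, using the same options as the command above. The function name print_profile_commands is hypothetical, and the sketch assumes that sample prefixes contain no spaces or shell metacharacters.

```shell
# print_profile_commands LIST DIR: print one tb-profiler command for each
# prefix in LIST (one prefix per line), reading FASTQ files from DIR.
# Hypothetical helper for illustration only; it runs nothing itself.
print_profile_commands() {
    list=$1
    dir=$2
    while IFS= read -r sample; do
        [ -n "$sample" ] || continue   # skip blank lines
        echo "tb-profiler profile -1 $dir/${sample}_1.fastq.gz -2 $dir/${sample}_2.fastq.gz -p $sample"
    done < "$list"
}
```

For example, print_profile_commands samples.txt fastq > commands.txt saves the commands to a file for inspection. When parallel is given no command of its own it runs each input line as a command, so parallel --bar -j 2 < commands.txt should then behave like the one-line version above.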
