Parallel
Using parallel to automate jobs
Parallel is a super useful tool which you can use to automate the running of commands. For example, let's say you have some FASTQ data for 100 samples and you need to run the same command for all of them. Parallel will help by 1) allowing you to write a general template for the command, which it will run for all the samples, and 2) handling the job queue and allowing you to run the commands in parallel (hence the name!).

Simple example

$x + x = ?$
That is a pretty simple equation. Let's say I want to run this for a few different values of x. For a single value, say x = 1, you can do that in bash by running:
expr 1 + 1
For a few numbers this is pretty easy, but it gets quite cumbersome to do it for a few hundred or a few thousand iterations. You can use parallel to automate this task.
First off, let's create a file with a sequence of numbers. These will serve as our values of x:
seq 0 10 > numbers.txt
Now let's use parallel to execute a simple command. Run the following command, and then we'll break it down.
cat numbers.txt | parallel -j 1 "expr {} + {}"
We supply our values of x to parallel on stdin, i.e. by printing the values (cat numbers.txt) and passing them to parallel with the | symbol. Then we need to tell parallel what to do with the values it receives. You can see we have replaced the x values with the symbols {}. This is a placeholder which parallel replaces, for each input line, with the content of that line. Finally, the -j 1 flag tells parallel to run 1 job at a time. Here is a schematic of what parallel is doing:
For each line, the {} symbols are replaced with the value that is passed to parallel
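To make the substitution concrete, here is a minimal sketch of roughly what happens for each input line, written as a plain shell loop. It only illustrates the idea and is not how parallel is implemented. The function name run_sequential is hypothetical, and the loop uses shell arithmetic in place of expr.

```shell
# Hypothetical helper: read values from stdin one line at a time and
# put each value where {} would appear in "expr {} + {}".
# Lines are processed strictly one after another, comparable to -j 1.
run_sequential() {
    while IFS= read -r x; do
        echo "$((x + x))"
    done
}

# Same input as numbers.txt: the integers 0 to 10, one per line.
seq 0 10 | run_sequential
```

The loop prints the doubled values 0, 2, 4 and so on up to 20, one per line. What parallel adds on top of this is the job queue, so that several lines can be processed at the same time.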

A practical example

Let's have a look at a slightly more complicated example. We have some FASTQ files which we would like to run the fastq2vcf.py script on. Our directory structure might look like this (check out the pipeline tutorial if you want to actually download the data and test this example out):
.
├── ERR1664619_1.fastq.gz
├── ERR1664619_2.fastq.gz
├── ERR1664620_1.fastq.gz
├── ERR1664620_2.fastq.gz
├── ERR1664621_1.fastq.gz
├── ERR1664621_2.fastq.gz
└── H37Rv.fa
An example of the command to run could be the following:
fastq2vcf.py all --read1 ERR1664619_1.fastq.gz --read2 ERR1664619_2.fastq.gz --ref H37Rv.fa --prefix ERR1664619
We can turn this command into a parallel template by replacing the sample name with {}:
fastq2vcf.py all --read1 {}_1.fastq.gz --read2 {}_2.fastq.gz --ref H37Rv.fa --prefix {}
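As a sketch of how this template expands, the loop below fills in the sample name for each of the three samples in the directory listing and prints the resulting commands without running them. The function name render_commands is hypothetical, and the sample names are typed in by hand purely for illustration.

```shell
# Hypothetical helper: print the command that the template would produce
# for each sample name read from stdin. Nothing is executed.
render_commands() {
    while IFS= read -r sample; do
        echo "fastq2vcf.py all --read1 ${sample}_1.fastq.gz --read2 ${sample}_2.fastq.gz --ref H37Rv.fa --prefix ${sample}"
    done
}

# Sample names taken from the directory listing in this tutorial.
printf '%s\n' ERR1664619 ERR1664620 ERR1664621 | render_commands
```

Printing the commands first is a cheap way to spot a typo in the template before starting a long run. GNU parallel offers a similar preview through its --dry-run option, which prints each command instead of executing it.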
Next we need the list of samples to provide to parallel. Run the following command to get all the sample names and store them in a file:
ls *_1.fastq.gz | sed 's/_1.fastq.gz//' > samples.txt
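To see what the sed step does without needing the data, here is a small sketch that feeds two file names in by hand and strips the read suffix. The function name strip_read_suffix is hypothetical. As a slight tightening of the pattern used in this tutorial, it escapes the dots and anchors the match to the end of the name, which is an assumption about what is wanted rather than a required change.

```shell
# Hypothetical helper: turn read-1 file names into sample names by
# removing the trailing "_1.fastq.gz". The dots are escaped so they match
# literal dots, and "$" anchors the match to the end of the line.
strip_read_suffix() {
    sed 's/_1\.fastq\.gz$//'
}

# File names typed in by hand, standing in for the output of ls.
printf '%s\n' ERR1664619_1.fastq.gz ERR1664620_1.fastq.gz | strip_read_suffix
# prints:
# ERR1664619
# ERR1664620
```

Each line of the output is one sample name, which is the form parallel expects when it fills in {}.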
Now we are ready to run the command for all samples:
cat samples.txt | parallel -j 2 "fastq2vcf.py all --read1 {}_1.fastq.gz --read2 {}_2.fastq.gz --ref H37Rv.fa --prefix {}"
I have set -j to 2 here, but if you are running on a machine with a high number of threads you can run more commands at once by increasing the value. Just remember that your job might already be using multiple threads. For example, if your job uses 4 threads by default and you are running two jobs in parallel with -j 2, then you could be using up to 8 threads at a time.
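The arithmetic above can be turned around to pick a value for -j: divide the threads you are willing to use by the threads each job uses. The sketch below does exactly that. The function name jobs_for_budget is hypothetical, and it assumes you already know both numbers and that each job uses at least one thread.

```shell
# Hypothetical helper: how many jobs fit into a thread budget?
# $1 = threads available on the machine, $2 = threads used by each job (>= 1).
jobs_for_budget() {
    available=$1
    per_job=$2
    n_jobs=$((available / per_job))
    # Always allow at least one job, even if a single job exceeds the budget.
    if [ "$n_jobs" -lt 1 ]; then
        n_jobs=1
    fi
    echo "$n_jobs"
}

jobs_for_budget 8 4    # prints 2: two 4-thread jobs use up to 8 threads
jobs_for_budget 16 4   # prints 4
```

The result is only an upper bound on a sensible -j value. Memory and disk I/O can limit how many jobs run well together before the thread count does.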
That is all there is to it. Now you can run hundreds of jobs with just a single command. Hopefully this will save you a lot of time. Use it to do something fun! :)