Parallel

Using parallel to automate jobs

Parallel is a super useful tool which you can use to automate running commands. For example, let's say you have some fastq data for 100 samples and you need to run the same command for all of them. Parallel helps by 1) allowing you to write a general template for the command, which it will then run for all the samples, and 2) handling the job queue and allowing you to run the commands in parallel (hence the name!)

Simple example

x + x = ?

That is a pretty simple equation. Let's say I want to run this for a few different values of x. You can use bash to do that by running:

expr 1 + 1

For a few numbers this is pretty easy, but it can get quite cumbersome to do it for a few hundred or thousand iterations. You can use parallel to automate this task.
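For comparison, here is what doing this by hand in plain bash looks like: a loop works, but it runs strictly one command at a time and you have to manage it yourself (a minimal sketch using three example values):

```shell
# Plain bash alternative: loop over each value of x and run expr for it
for x in 1 2 3; do
  expr "$x" + "$x"
done
```

This prints 2, 4 and 6, one per line. Parallel replaces this boilerplate with a single templated command.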

First off, let's create a file with a sequence of numbers. These will serve as our values of x:

seq 0 10 > numbers.txt

Now let's use parallel to execute a simple command. Run the following command, and then we'll break it down.

cat numbers.txt | parallel -j 1 "expr {} + {}"

We supply our values of x to parallel using stdin, i.e. by printing the values (cat numbers.txt) and passing them to parallel using the | symbol. Then we need to tell parallel what to do with the values it receives. You can see we have replaced the x values with the symbols {}. This is a placeholder which parallel will replace, for each line of input, with the value of that line. Finally, the -j 1 flag tells parallel to run 1 job at a time. Here is a schematic of what parallel is doing

A practical example

Let's have a look at a slightly more complicated example. We have some fastq files which we would like to run the fastq2vcf.py script on. Our directory structure might look like this (check out the pipeline tutorial if you want to actually download the data and test this example out):

.
├── ERR1664619_1.fastq.gz
├── ERR1664619_2.fastq.gz
├── ERR1664620_1.fastq.gz
├── ERR1664620_2.fastq.gz
├── ERR1664621_1.fastq.gz
├── ERR1664621_2.fastq.gz
└── H37Rv.fa

An example of the command to run could be the following:

fastq2vcf.py all --read1 ERR1664619_1.fastq.gz --read2 ERR1664619_2.fastq.gz --ref H37Rv.fa --prefix ERR1664619

We can turn this command into a parallel template by replacing the sample names with {}:

fastq2vcf.py all --read1 {}_1.fastq.gz --read2 {}_2.fastq.gz --ref H37Rv.fa --prefix {}

Next we can get the list of samples to provide to parallel. Run the following command to get all the sample names and store them in a file:

ls *_1.fastq.gz | sed 's/_1.fastq.gz//' > samples.txt
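The sed command above simply strips the _1.fastq.gz suffix from each forward-read filename, leaving just the sample name. You can see the substitution on its own with a couple of example filenames (a minimal sketch):

```shell
# Strip the _1.fastq.gz suffix to recover the sample names
printf '%s\n' ERR1664619_1.fastq.gz ERR1664620_1.fastq.gz | sed 's/_1.fastq.gz//'
```

This prints ERR1664619 and ERR1664620, exactly the values that will replace {} in the template.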

Now we are ready to run the command for all samples:

cat samples.txt | parallel -j 2 --bar "fastq2vcf.py all --read1 {}_1.fastq.gz --read2 {}_2.fastq.gz --ref H37Rv.fa --prefix {}"

I have set -j to 2 here, but if you are running on a machine with a high number of threads you can run more commands together by increasing the value. Just remember that your job might already be using multiple threads. For example, if your job uses 4 threads by default and you are running two jobs in parallel with -j 2, then you could be using up to 8 threads at a time. I've also added --bar to give us a progress bar.
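One way to pick a sensible value for -j is to divide the number of cores on your machine by the number of threads each job uses. Here is a sketch of that arithmetic for a hypothetical 8-core machine (on Linux you could substitute $(nproc) for the real core count):

```shell
# Hypothetical example: 8 cores, each job uses 4 threads -> run 2 jobs at once
cores=8            # on Linux you could use: cores=$(nproc)
threads_per_job=4
jobs=$(( cores / threads_per_job ))
echo "$jobs"
```

Passing this value as parallel -j "$jobs" keeps the total thread count at or below the number of cores.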

Multiple variables

What if your fastq files are all in different locations, or you want to use a different ID than is present in the fastq file name? This is where having multiple variables in one line comes in handy. To do this you should pass a tab-separated file to parallel.

The easiest way to do this is by opening up Excel and creating your file. To modify the example above, we will create a three-column file with the columns being 1) the sample ID, 2) the forward read and 3) the reverse read, as shown below. Once you have entered all the IDs and fastq files, select your data (taking care to only select cells with data in them, as shown below) and copy it.

Note that I have entered the full path to the reads. This is important if you are going to run the pipeline in a different directory to where the reads are stored.

Once you have copied your data, open up the terminal and navigate to the directory you want to run the pipeline in. Type in nano fastq_files.txt and hit enter. This will open up a text editor you can use to paste in the data you have copied.

Then hit control + x, followed by y, and finally enter to save the file. To check that everything looks OK, run cat fastq_files.txt to see the contents of the file. It should look like the screenshot below.
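If you prefer to stay in the terminal, you can also build the tab-separated file directly with printf instead of Excel and nano. The sample IDs and paths below are made-up placeholders for illustration:

```shell
# Build fastq_files.txt directly in the shell (hypothetical sample IDs and paths)
printf 'sampleA\t/data/reads/sampleA_1.fastq.gz\t/data/reads/sampleA_2.fastq.gz\n' > fastq_files.txt
printf 'sampleB\t/data/reads/sampleB_1.fastq.gz\t/data/reads/sampleB_2.fastq.gz\n' >> fastq_files.txt
cat fastq_files.txt
```

Each printf writes one sample ID, forward read and reverse read, separated by literal tabs (\t).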

Now we have our three column file we can run parallel as follows:

cat fastq_files.txt | parallel -j 2 --colsep "\t" --bar "fastq2vcf.py all --read1 {2} --read2 {3} --ref H37Rv.fa --prefix {1}"

There are a few important changes from the previous parallel command. Firstly, the addition of the --colsep "\t" argument, which tells parallel that each line contains multiple columns separated by a tab. Secondly, the numbers between the {} symbols, which indicate which column to use. For example, --prefix {1} indicates that the first column should be used for the prefix.

That is all there is to it, now you can run hundreds of jobs with just a single command. Hopefully this should save you a lot of time!
