FastQ to VCF
This page describes the pipeline to go all the way from fastq data to a merged multi-sample VCF.
To perform population-level or phylogenetic analyses we need to have all our samples in one file. Some programs require a fasta file, others a text-based genotype matrix, and so on.
The fastq2matrix pipeline can automate some of these steps for you. The general workflow of the pipeline is the following:
For each sample the raw fastq data is processed into a genomic vcf (gvcf). This is a slightly different format which encodes information on the non-variant sites as well as the variant sites. This extra information allows us to ascertain whether absence of a variant in a sample is due to the reference allele being present or lack of data. If you are interested you can read more about gvcf format here.
The gvcfs are imported into a database
Joint genotyping is performed on all samples together
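To make the point about non-variant sites concrete, here is a minimal sketch of how a gvcf lets you tell "matches the reference" apart from "no data". The four records are made-up toy lines in the style of GATK output (reference blocks carry an END= tag and a depth in DP); real gvcfs contain more fields, and the helper names are hypothetical.

```python
# Toy gvcf-style records (made up for illustration, tab-separated like a real VCF).
GVCF_LINES = [
    "Chromosome\t1\t.\tA\t<NON_REF>\t.\t.\tEND=99\tGT:DP:GQ\t0/0:35:99",
    "Chromosome\t100\t.\tC\tT,<NON_REF>\t500\t.\t.\tGT:DP:GQ\t1/1:40:99",
    "Chromosome\t101\t.\tG\t<NON_REF>\t.\t.\tEND=150\tGT:DP:GQ\t0/0:33:99",
    "Chromosome\t151\t.\tT\t<NON_REF>\t.\t.\tEND=200\tGT:DP:GQ\t0/0:0:0",
]


def parse_record(line):
    """Return (start, end, call) where call maps FORMAT keys to sample values."""
    fields = line.split("\t")
    start = int(fields[1])
    end = start  # a record without END= covers a single position
    for item in fields[7].split(";"):
        if item.startswith("END="):
            end = int(item[4:])
    call = dict(zip(fields[8].split(":"), fields[9].split(":")))
    return start, end, call


def status_at(pos, lines):
    """Classify one position as 'reference', 'variant' or 'no data'."""
    for line in lines:
        start, end, call = parse_record(line)
        if start <= pos <= end:
            if int(call["DP"]) == 0:
                return "no data"  # block exists but no reads support it
            alleles = set(call["GT"].replace("|", "/").split("/"))
            return "reference" if alleles == {"0"} else "variant"
    return "no data"  # position not covered by any record


for position in (50, 100, 175):
    print(position, status_at(position, GVCF_LINES))
```

A plain VCF would only contain the record at position 100, so positions 50 and 175 would look identical even though only one of them was actually sequenced.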
Installing prerequisites
The pipeline makes use of a lot of open-source bioinformatics tools such as bwa, samtools and gatk. Let's make an environment with everything installed.
Now we can activate the environment with
Remember, each time you open a new terminal window and would like to use the pipeline, you have to activate the environment.
Next we will download the latest version of the pipeline and install it:
Now you should be ready to go!
Usage
Mapping and calling variants
As an example I am going to use 3 M. tuberculosis samples, but feel free to substitute in your own data. First use cd to navigate to the directory where you would like to work. Then we are ready to download the fastq data and the reference genome:
After running those commands you should have 6 fastq files and the reference genome in a directory structure like this:
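A quick way to confirm that nothing is missing is sketched below. It assumes the paired files follow the usual ENA naming pattern (accession followed by _1.fastq.gz and _2.fastq.gz) and uses a hypothetical reference file name, so adjust both to match what you actually downloaded.

```python
from pathlib import Path

SAMPLES = ["ERR1664619", "ERR1664620", "ERR1664621"]
REFERENCE = "reference.fasta"  # hypothetical name; replace with your reference file


def missing_inputs(workdir, samples, reference):
    """Return the expected input files that are not present in workdir."""
    expected = [reference]
    for sample in samples:
        # Assumed naming pattern: <accession>_1.fastq.gz / <accession>_2.fastq.gz
        expected.append(f"{sample}_1.fastq.gz")
        expected.append(f"{sample}_2.fastq.gz")
    return [name for name in expected if not (Path(workdir) / name).exists()]


if __name__ == "__main__":
    missing = missing_inputs(".", SAMPLES, REFERENCE)
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All 6 fastq files and the reference are present.")
```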
The first part of the pipeline processes the fastq data into gvcf data in 5 main steps:
Trimming: FastQs are trimmed using trimmomatic
Mapping: Trimmed data is aligned to the reference genome
Duplicate reads marking: With samtools
Base quality recalibration: Performed with gatk
Variant calling: Genomic VCFs are called using gatk HaplotypeCaller
All of these steps can be performed with the command fastq2vcf.py all. Here is the command for sample ERR1664619:
You need to give the reads with --read1 and --read2, the reference genome with --ref and the prefix for the output files with --prefix. There are other optional arguments, but these are the required ones. For example, if you are working on a server with multiple cores you can increase the number of threads using --threads. More threads will generally make the job run faster; just make sure you don't request more threads than are available on your system.
Try running it for samples ERR1664620 and ERR1664621. If you get stuck, have a look at the command above and just replace the sample names.
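If you would rather not retype the command for every sample, the lines can be generated instead. The sketch below only prints the commands and uses just the arguments described above. The fastq naming pattern, the reference file name and the choice of the accession as the prefix are assumptions, and the helper names are hypothetical.

```python
import os
import shlex

SAMPLES = ["ERR1664619", "ERR1664620", "ERR1664621"]
REFERENCE = "reference.fasta"  # hypothetical name; replace with your reference file


def usable_threads(requested, available=None):
    """Cap the requested thread count at the number of cores on this machine."""
    if available is None:
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


def build_command(sample, reference, threads):
    """Return the fastq2vcf.py call for one sample as an argument list."""
    return [
        "fastq2vcf.py", "all",
        "--read1", f"{sample}_1.fastq.gz",  # assumed naming pattern
        "--read2", f"{sample}_2.fastq.gz",  # assumed naming pattern
        "--ref", reference,
        "--prefix", sample,  # assumption: use the accession as the output prefix
        "--threads", str(threads),
    ]


if __name__ == "__main__":
    threads = usable_threads(4)
    for sample in SAMPLES:
        # Print only; run the lines yourself once they look right.
        print(shlex.join(build_command(sample, REFERENCE, threads)))
```

Capping the thread count with os.cpu_count() is one simple way to follow the advice about not requesting more threads than the system has.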
After running this command you should have some new files for each of the samples. You should see a bam file and a vcf file (ending in .g.vcf.gz). All the files are important for the next steps, so don't move or delete any of them yet.
Your directory should contain the following files:
Merging VCF files
Now it is time to import the vcf files into a genomics database. This is a database format that is developed by gatk, but we don't actually have to know any more about it.
To run the next step we first need to create a file with the sample names in it. Create a text file called samples.txt and put the following contents into it.
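Any text editor will do for this. As a minimal sketch, assuming the file simply lists one sample name per line (here the three accessions used in this example), it could also be written from Python:

```python
from pathlib import Path

SAMPLES = ["ERR1664619", "ERR1664620", "ERR1664621"]


def write_sample_list(path, samples):
    """Write one sample name per line (assumed layout of samples.txt)."""
    Path(path).write_text("\n".join(samples) + "\n")


write_sample_list("samples.txt", SAMPLES)
print(Path("samples.txt").read_text(), end="")
```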
Now let's run the import step:
The arguments are pretty self-explanatory. If for some reason your VCF files are in a different directory you can change --vcf-dir, but ours are in the current directory so we use --vcf-dir . (the dot means the current directory).
After this has finished you should be able to see that there are several new directories, starting with mtb_Chromosome. Why are there multiple directories for one database? To speed things up we have partitioned the genome into 20 pieces and run them in parallel.
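To illustrate the idea of partitioning, the sketch below splits a single chromosome into 20 near-equal, non-overlapping intervals. This is not necessarily how the pipeline chooses its boundaries. The contig name "Chromosome" and the length of 4,411,532 bp are assumptions that correspond to the H37Rv reference, so substitute the values for your own genome.

```python
def partition(length, pieces):
    """Split positions 1..length into `pieces` contiguous 1-based intervals."""
    base, extra = divmod(length, pieces)
    intervals = []
    start = 1
    for i in range(pieces):
        # Spread the remainder over the first `extra` intervals.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        intervals.append((start, end))
        start = end + 1
    return intervals


CONTIG = "Chromosome"      # assumed contig name
GENOME_LENGTH = 4_411_532  # assumed length (H37Rv reference)

intervals = partition(GENOME_LENGTH, 20)
for start, end in intervals[:3]:
    print(f"{CONTIG}:{start}-{end}")
print("...")
print(f"{CONTIG}:{intervals[-1][0]}-{intervals[-1][1]}")
```

Because the intervals do not overlap, each one can be processed independently and the results combined afterwards.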
Now we are ready for the final step. We run the joint genotyping with:
After this has finished you should have a vcf named mtb.2020_06_24.genotyped.vcf.gz. This contains the genotype calls for all variant positions across all your samples. This file can be converted into your desired format using bcftools query. For example, let's make a very simple genotype matrix:
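To show what such a matrix looks like, here is a small pure-Python sketch that builds one from VCF text. The three records are made up for illustration and the function name is hypothetical. On the real file, a bcftools query format string such as '%CHROM\t%POS[\t%GT]\n' should give the same kind of table, with one row per variant position and one genotype column per sample.

```python
# Made-up toy VCF with the three example samples (not real calls).
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
    "\tERR1664619\tERR1664620\tERR1664621\n"
    "Chromosome\t1000\t.\tC\tA\t.\t.\t.\tGT:DP\t1/1:40\t0/0:38\t./.:0\n"
    "Chromosome\t2500\t.\tG\tT\t.\t.\t.\tGT:DP\t0/0:35\t1/1:41\t1/1:29\n"
    "Chromosome\t7300\t.\tA\tG\t.\t.\t.\tGT:DP\t1/1:50\t1/1:44\t0/0:31\n"
)


def genotype_matrix(vcf_text):
    """Return (sample_names, rows); each row is (chrom, pos, [genotypes])."""
    samples, rows = [], []
    for line in vcf_text.splitlines():
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information headers
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start after FORMAT
            continue
        gt_index = fields[8].split(":").index("GT")
        genotypes = [call.split(":")[gt_index] for call in fields[9:]]
        rows.append((fields[0], int(fields[1]), genotypes))
    return samples, rows


samples, rows = genotype_matrix(VCF_TEXT)
print("\t".join(["chrom", "pos"] + samples))
for chrom, pos, genotypes in rows:
    print("\t".join([chrom, str(pos)] + genotypes))
```

Note the ./. entry in the first row: after joint genotyping, a missing call stays distinguishable from a 0/0 reference call, which is the benefit of starting from gvcfs.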
The vcf is a good starting point for your population based questions, but remember it is not perfect and it is important to perform filtering on the file to remove false positive and false negative calls. But that is a topic for another day. For now, have a look at the commands you ran and give it a go with your own data!