FastQ to VCF

This page describes the pipeline to go all the way from fastq data to a merged multi-sample VCF.

To perform population-level or phylogenetic analyses we need to have all our samples in one file. Some programs require a fasta file, others a text-based genotype matrix, and so on.

The fastq2matrix pipeline can automate some of these steps for you. The general workflow of the pipeline is the following:

  1. For each sample the raw fastq data is processed into a genomic vcf (gvcf). This is a slightly different format which encodes information on the non-variant sites as well as the variant sites. The extra information allows us to ascertain whether the absence of a variant in a sample is due to the reference allele being present or to a lack of data. If you are interested you can read more about the gvcf format here; a sketch of what gvcf records look like is shown after this list.

  2. The gvcfs are imported into a database

  3. Joint genotyping is performed on all samples together
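
To give a feel for the gvcf format mentioned in step 1, here is a simplified sketch of two records (the coordinates, qualities and exact set of annotations are illustrative only). The first is a non-variant block covering positions 100-249, encoded with the symbolic <NON_REF> allele and an END tag; the second is a regular variant call:

#CHROM      POS  ID  REF  ALT          QUAL    FILTER  INFO     FORMAT              ERR1664619
Chromosome  100  .   A    <NON_REF>    .       .       END=249  GT:DP:GQ:MIN_DP:PL  0/0:35:99:30:0,90,1350
Chromosome  250  .   G    T,<NON_REF>  823.77  .       DP=28    GT:AD:DP:GQ:PL      1/1:0,28,0:28:84:852,84,0,852,84,852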

Installing prerequisites

The pipeline makes use of a lot of open-source bioinformatics tools such as bwa, samtools and gatk. Let's make an environment with everything installed.

conda create -n fastq2matrix -c conda-forge -c bioconda python=3.7 bwa samtools bcftools parallel gatk4=4.1.4.1 delly tqdm trimmomatic minimap2 biopython bedtools r-ggplot2

Now we can activate the environment with

conda activate fastq2matrix

Remember, each time you open a new terminal window and would like to use the pipeline, you have to activate the environment.

Next we will download the latest version of the pipeline and install it:

cd ~
git clone https://github.com/pathogenseq/fastq2matrix.git
cd fastq2matrix
python setup.py install

Now you should be ready to go!
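
If you want to check that the install worked, you can verify that the pipeline's scripts are now on your PATH:

which fastq2vcf.py
which merge_vcfs.py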

Usage

Mapping and calling variants

As an example I am going to use 3 M. tuberculosis samples, but feel free to substitute in your own data. First use cd to navigate to the directory you would like to work in. Then we are ready to download the fastq data and the reference genome:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/009/ERR1664619/ERR1664619_*.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/000/ERR1664620/ERR1664620_*.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/001/ERR1664621/ERR1664621_*.fastq.gz
wget ftp://ftp.ensemblgenomes.org/pub/release-47/bacteria//fasta/bacteria_0_collection/mycobacterium_tuberculosis_h37rv/dna/Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa.gz
gunzip Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa.gz
mv Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa H37Rv.fa

After running those commands you should have 6 fastq files and the reference genome in a directory structure like this:

.
├── ERR1664619_1.fastq.gz
├── ERR1664619_2.fastq.gz
├── ERR1664620_1.fastq.gz
├── ERR1664620_2.fastq.gz
├── ERR1664621_1.fastq.gz
├── ERR1664621_2.fastq.gz
└── H37Rv.fa

The first part of the pipeline processes the fastq data into gvcf data in 5 main steps:

  1. Trimming: FastQs are trimmed using trimmomatic

  2. Mapping: Trimmed data is aligned to the reference genome

  3. Duplicate reads marking: With samtools

  4. Base quality recalibration: Performed with gatk

  5. Variant calling: Genomic VCFs are called using gatk HaplotypeCaller
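
For reference, here is a rough sketch of what these steps look like as individual commands for one sample. This is illustrative only: the exact parameters, intermediate file names and extra options used by fastq2vcf.py may differ, and the base quality recalibration step is omitted here because it requires a known-sites VCF:

# 1. Trim the raw reads
trimmomatic PE ERR1664619_1.fastq.gz ERR1664619_2.fastq.gz \
    ERR1664619_1.paired.fq.gz ERR1664619_1.unpaired.fq.gz \
    ERR1664619_2.paired.fq.gz ERR1664619_2.unpaired.fq.gz \
    SLIDINGWINDOW:4:20 MINLEN:36

# 2. Index the reference (bwa index for mapping; faidx and dict for gatk),
#    then map, fix mate information and sort
bwa index H37Rv.fa
samtools faidx H37Rv.fa
gatk CreateSequenceDictionary -R H37Rv.fa
bwa mem -R '@RG\tID:ERR1664619\tSM:ERR1664619\tPL:ILLUMINA' H37Rv.fa \
    ERR1664619_1.paired.fq.gz ERR1664619_2.paired.fq.gz | \
    samtools sort -n - | samtools fixmate -m - - | \
    samtools sort -o ERR1664619.sorted.bam -

# 3. Mark duplicate reads
samtools markdup ERR1664619.sorted.bam ERR1664619.mkdup.bam
samtools index ERR1664619.mkdup.bam

# 4. (base quality recalibration omitted - it needs a known-sites VCF)

# 5. Call a genomic VCF with HaplotypeCaller in GVCF mode
gatk HaplotypeCaller -R H37Rv.fa -I ERR1664619.mkdup.bam \
    -O ERR1664619.g.vcf.gz -ERC GVCF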

All these steps can be performed with the command fastq2vcf.py all. Here is the command for sample ERR1664619:

fastq2vcf.py all --read1 ERR1664619_1.fastq.gz --read2 ERR1664619_2.fastq.gz --ref H37Rv.fa --prefix ERR1664619

You need to give the reads with --read1 and --read2, the reference genome with --ref and the prefix for the output files with --prefix. There are some other optional arguments, but these are the required ones. For example, if you are working on a server with multiple cores you can increase the number of threads using --threads. The higher the number, the faster the job will run; just make sure you don't set it to more threads than are available on your system.
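
For example, to run the same sample with 4 threads:

fastq2vcf.py all --read1 ERR1664619_1.fastq.gz --read2 ERR1664619_2.fastq.gz --ref H37Rv.fa --prefix ERR1664619 --threads 4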

Try running it for samples ERR1664620 and ERR1664621. If you are stuck, have a look at the command above and just replace the sample names.
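
If you would rather run the remaining samples in one go, a simple shell loop does the trick:

for sample in ERR1664620 ERR1664621; do
    fastq2vcf.py all --read1 ${sample}_1.fastq.gz --read2 ${sample}_2.fastq.gz --ref H37Rv.fa --prefix ${sample}
done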

After running this command you should have some new files for each of the samples. You should see a bam file and a gvcf file (ending in .g.vcf.gz). All of these files are important for the next steps, so don't move or delete any of them yet.

Your directory should contain the following files:

.
├── ERR1664619_1.fastq.gz
├── ERR1664619_2.fastq.gz
├── ERR1664619.g.vcf.gz
├── ERR1664619.g.vcf.gz.tbi
├── ERR1664619.g.vcf.gz.validated
├── ERR1664619.mkdup.bam
├── ERR1664619.mkdup.bam.bai
├── ERR1664619.mkdup.bamstats
├── ERR1664620_1.fastq.gz
├── ERR1664620_2.fastq.gz
├── ERR1664620.g.vcf.gz
├── ERR1664620.g.vcf.gz.tbi
├── ERR1664620.g.vcf.gz.validated
├── ERR1664620.mkdup.bam
├── ERR1664620.mkdup.bam.bai
├── ERR1664620.mkdup.bamstats
├── ERR1664621_1.fastq.gz
├── ERR1664621_2.fastq.gz
├── ERR1664621.g.vcf.gz
├── ERR1664621.g.vcf.gz.tbi
├── ERR1664621.g.vcf.gz.validated
├── ERR1664621.mkdup.bam
├── ERR1664621.mkdup.bam.bai
├── ERR1664621.mkdup.bamstats
├── H37Rv.dict
├── H37Rv.fa
├── H37Rv.fa.amb
├── H37Rv.fa.ann
├── H37Rv.fa.bwt
├── H37Rv.fa.fai
├── H37Rv.fa.pac
└── H37Rv.fa.sa
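
Before moving on, you can sanity-check the mapping for any sample with samtools flagstat, which summarises how many reads aligned:

samtools flagstat ERR1664619.mkdup.bam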

Merging VCF files

Now it is time to import the gvcf files into a genomics database (GenomicsDB). This is a database format developed by gatk, but we don't actually have to know any more about it.

To run the next step we first need to create a file with the sample names in it. Create a text file called samples.txt and put the following contents into it:

ERR1664619
ERR1664620
ERR1664621
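
With only three samples it is easy to type the names by hand, but for larger datasets you can generate the file from the fastq file names instead:

ls *_1.fastq.gz | sed 's/_1\.fastq\.gz//' > samples.txt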

Now let's run the import step:

merge_vcfs.py import --sample-file samples.txt --ref H37Rv.fa --prefix mtb --vcf-dir .

The arguments are pretty self-explanatory. If for some reason your VCF files are in a different directory you can change --vcf-dir; ours are in the current directory, so we pass --vcf-dir . (a single dot).

After this has finished you should be able to see several new directories, all starting with mtb_Chromosome. Why are there multiple directories for one database? To speed things up the pipeline partitions the genome into 20 pieces and processes them in parallel.
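
You can confirm this by listing the partition directories:

ls -d mtb_Chromosome*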

Now we are ready for the final step. We run the joint genotyping with:

merge_vcfs.py genotype --ref H37Rv.fa --prefix mtb

After this has finished you should have a vcf named mtb.2020_06_24.genotyped.vcf.gz (the date in the filename will reflect the day you ran the command). This contains the genotype calls at all variant positions across all your samples. The file can be converted into your desired format using bcftools query. For example, let's make a very simple genotype matrix:

bcftools query -f '%POS\t%REF\t%ALT[\t%GT]\n' mtb.2020_06_24.genotyped.vcf.gz | head
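
This prints one row per variant position: the position, reference allele and alternate allele, followed by one genotype column per sample. The output will look something like this (the values here are illustrative only):

1977	A	G	1/1	0/0	1/1
4013	T	C	0/0	0/0	1/1
7362	G	C	1/1	1/1	0/0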

The vcf is a good starting point for your population-based questions, but remember it is not perfect: it is important to filter the file to remove false positive and false negative calls. But that is a topic for another day. For now, have a look at the commands you ran and give them a go on your own data!
