This page describes a basic procedure to generate consensus sequences
Setting up
First we'll need to get some data. As an example I've used some M. tuberculosis data, but just substitute in your own data if you have. Run the following code to get some raw NGS data for 6 isolates and the reference genome
You'll also need to have the latest versions of bwa, samtools, bcftools, gatk and snp-sites installed, if you have conda installed you can use the following commands to install/update:
I've used -t 4 with bwa and -@ 4 with samtools. These parameters set the number of threads used to run the programs. More threads = faster runtime (usually).
We now have alignments in BAM format. You can inspect these with tablet if you want to see what the data looks like. Tools like tablet and variant callers usually need the BAM files to be indexed. You can do this with samtools:
samtools index ERR1664623.bam
samtools index ERR1664624.bam
samtools index ERR1664635.bam
samtools index ERR1664636.bam
samtools index ERR1664664.bam
samtools index ERR1664665.bam
Step 2: Variant Calling
We will now perform variant calling. There are many programs that you can use but for this tutorial we will go with gatk. Before we start variant calling, we need to index the reference using samtools and gatk.
This will create a genomic VCF file containing all the variant positions and their alternate alleles.
Step 3: Consensus building
We will now create a consensus sequence for all isolates by substituting in the alternate alleles into the reference at their respective positions. This can be done using bcftools. First we will create a bed file containing the locations of low depth regions. We would like to mask these in the consensus sequence as we cannot do not know if they should be reference or alternate allele.
The important parameter is there specified with 'FMT/DP<3'. This tells bcftools to select sites with less than 3 reads aligning to the position. You can increase this if you want to be more stringent.
Now we must create a VCF file containing only variant positions for all isolates:
Again, downstream tools require indexing. We have to index the VCF files with bcftools:
bcftools index ERR1664623.snps.vcf.gz
bcftools index ERR1664624.snps.vcf.gz
bcftools index ERR1664635.snps.vcf.gz
bcftools index ERR1664636.snps.vcf.gz
bcftools index ERR1664664.snps.vcf.gz
bcftools index ERR1664665.snps.vcf.gz
Finally we can now use the filtered VCF and the mask bed file to create our consensus sequences:
Because we have simply substituted in the alternate alleles into the reference sequence, all isolate consensus sequences should be the same length and we can just concatenate them together in one file without aligning. Then we can select only variant sites using snp-sites.