Consensus building

This page describes a basic procedure to generate consensus sequences

Setting up

First we'll need to get some data. As an example I've used some M. tuberculosis data, but just substitute in your own data if you have. Run the following code to get some raw NGS data for 6 isolates and the reference genome

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/003/ERR1664623/ERR1664623_*.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/004/ERR1664624/ERR1664624_*.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/005/ERR1664635/ERR1664635_*.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/006/ERR1664636/ERR1664636_*.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/004/ERR1664664/ERR1664664_*.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/005/ERR1664665/ERR1664665_*.fastq.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/955/GCF_000195955.2_ASM19595v2/GCF_000195955.2_ASM19595v2_genomic.fna.gz
gunzip GCF_000195955.2_ASM19595v2_genomic.fna.gz
mv GCF_000195955.2_ASM19595v2_genomic.fna H37Rv.fa

You'll also need to have the latest versions of bwa, samtools, bcftools, gatk and snp-sites installed, if you have conda installed you can use the following commands to install/update:

conda install -c bioconda bwa samtools bcftools snp-sites gatk4

Step 1: Aligning

First we need to index the reference file with BWA.

bwa index H37Rv.fa

Next we can start aligning the raw data:

bwa mem -t 4 -R "@RG\tID:ERR1664623\tSM:ERR1664623\tPL:Illumina" H37Rv.fa ERR1664623_1.fastq.gz ERR1664623_2.fastq.gz | samtools view -@ 4 -b - | samtools sort -@ 4 -o ERR1664623.bam -
bwa mem -t 4 -R "@RG\tID:ERR1664624\tSM:ERR1664624\tPL:Illumina" H37Rv.fa ERR1664624_1.fastq.gz ERR1664624_2.fastq.gz | samtools view -@ 4 -b - | samtools sort -@ 4 -o ERR1664624.bam -
bwa mem -t 4 -R "@RG\tID:ERR1664635\tSM:ERR1664635\tPL:Illumina" H37Rv.fa ERR1664635_1.fastq.gz ERR1664635_2.fastq.gz | samtools view -@ 4 -b - | samtools sort -@ 4 -o ERR1664635.bam -
bwa mem -t 4 -R "@RG\tID:ERR1664636\tSM:ERR1664636\tPL:Illumina" H37Rv.fa ERR1664636_1.fastq.gz ERR1664636_2.fastq.gz | samtools view -@ 4 -b - | samtools sort -@ 4 -o ERR1664636.bam -
bwa mem -t 4 -R "@RG\tID:ERR1664664\tSM:ERR1664664\tPL:Illumina" H37Rv.fa ERR1664664_1.fastq.gz ERR1664664_2.fastq.gz | samtools view -@ 4 -b - | samtools sort -@ 4 -o ERR1664664.bam -
bwa mem -t 4 -R "@RG\tID:ERR1664665\tSM:ERR1664665\tPL:Illumina" H37Rv.fa ERR1664665_1.fastq.gz ERR1664665_2.fastq.gz | samtools view -@ 4 -b - | samtools sort -@ 4 -o ERR1664665.bam -

I've used -t 4 with bwa and -@ 4 with samtools. These parameters set the number of threads used to run the programs. More threads = faster runtime (usually).

We now have alignments in BAM format. You can inspect these with tablet if you want to see what the data looks like. Tools like tablet and variant callers usually need the BAM files to be indexed. You can do this with samtools:

samtools index ERR1664623.bam
samtools index ERR1664624.bam
samtools index ERR1664635.bam
samtools index ERR1664636.bam
samtools index ERR1664664.bam
samtools index ERR1664665.bam

Step 2: Variant Calling

We will now perform variant calling. There are many programs that you can use but for this tutorial we will go with gatk. Before we start variant calling, we need to index the reference using samtools and gatk.

gatk CreateSequenceDictionary -R H37Rv.fa
samtools faidx H37Rv.fa
gatk HaplotypeCaller -I ERR1664623.bam -R H37Rv.fa -O ERR1664623.vcf.gz -ERC GVCF
gatk HaplotypeCaller -I ERR1664624.bam -R H37Rv.fa -O ERR1664624.vcf.gz -ERC GVCF
gatk HaplotypeCaller -I ERR1664635.bam -R H37Rv.fa -O ERR1664635.vcf.gz -ERC GVCF
gatk HaplotypeCaller -I ERR1664636.bam -R H37Rv.fa -O ERR1664636.vcf.gz -ERC GVCF
gatk HaplotypeCaller -I ERR1664664.bam -R H37Rv.fa -O ERR1664664.vcf.gz -ERC GVCF
gatk HaplotypeCaller -I ERR1664665.bam -R H37Rv.fa -O ERR1664665.vcf.gz -ERC GVCF

This will create a genomic VCF file containing all the variant positions and their alternate alleles.

Step 3: Consensus building

We will now create a consensus sequence for all isolates by substituting in the alternate alleles into the reference at their respective positions. This can be done using bcftools. First we will create a bed file containing the locations of low depth regions. We would like to mask these in the consensus sequence as we cannot do not know if they should be reference or alternate allele.

bcftools view -i 'FMT/DP<3' ERR1664623.vcf.gz | bcftools convert --gvcf2vcf -f H37Rv.fa| bcftools query -f '%CHROM\t%POS\t%POS\n' > ERR1664623.mask.bed
bcftools view -i 'FMT/DP<3' ERR1664624.vcf.gz | bcftools convert --gvcf2vcf -f H37Rv.fa| bcftools query -f '%CHROM\t%POS\t%POS\n' > ERR1664624.mask.bed
bcftools view -i 'FMT/DP<3' ERR1664635.vcf.gz | bcftools convert --gvcf2vcf -f H37Rv.fa| bcftools query -f '%CHROM\t%POS\t%POS\n' > ERR1664635.mask.bed
bcftools view -i 'FMT/DP<3' ERR1664636.vcf.gz | bcftools convert --gvcf2vcf -f H37Rv.fa| bcftools query -f '%CHROM\t%POS\t%POS\n' > ERR1664636.mask.bed
bcftools view -i 'FMT/DP<3' ERR1664664.vcf.gz | bcftools convert --gvcf2vcf -f H37Rv.fa| bcftools query -f '%CHROM\t%POS\t%POS\n' > ERR1664664.mask.bed
bcftools view -i 'FMT/DP<3' ERR1664665.vcf.gz | bcftools convert --gvcf2vcf -f H37Rv.fa| bcftools query -f '%CHROM\t%POS\t%POS\n' > ERR1664665.mask.bed

The important parameter is there specified with 'FMT/DP<3'. This tells bcftools to select sites with less than 3 reads aligning to the position. You can increase this if you want to be more stringent.

Now we must create a VCF file containing only variant positions for all isolates:

bcftools view -V indels -c 1 -a ERR1664623.vcf.gz -Oz -o ERR1664623.snps.vcf.gz
bcftools view -V indels -c 1 -a ERR1664624.vcf.gz -Oz -o ERR1664624.snps.vcf.gz
bcftools view -V indels -c 1 -a ERR1664635.vcf.gz -Oz -o ERR1664635.snps.vcf.gz
bcftools view -V indels -c 1 -a ERR1664636.vcf.gz -Oz -o ERR1664636.snps.vcf.gz
bcftools view -V indels -c 1 -a ERR1664664.vcf.gz -Oz -o ERR1664664.snps.vcf.gz
bcftools view -V indels -c 1 -a ERR1664665.vcf.gz -Oz -o ERR1664665.snps.vcf.gz

Again, downstream tools require indexing. We have to index the VCF files with bcftools:

bcftools index ERR1664623.snps.vcf.gz
bcftools index ERR1664624.snps.vcf.gz
bcftools index ERR1664635.snps.vcf.gz
bcftools index ERR1664636.snps.vcf.gz
bcftools index ERR1664664.snps.vcf.gz
bcftools index ERR1664665.snps.vcf.gz

Finally we can now use the filtered VCF and the mask bed file to create our consensus sequences:

bcftools consensus -f H37Rv.fa -m ERR1664623.mask.bed ERR1664623.snps.vcf.gz | tr '*' 'N' | sed 's/>.*/>ERR1664623/' > ERR1664623.consensus.fasta
bcftools consensus -f H37Rv.fa -m ERR1664624.mask.bed ERR1664624.snps.vcf.gz | tr '*' 'N' | sed 's/>.*/>ERR1664624/' > ERR1664624.consensus.fasta
bcftools consensus -f H37Rv.fa -m ERR1664635.mask.bed ERR1664635.snps.vcf.gz | tr '*' 'N' | sed 's/>.*/>ERR1664635/' > ERR1664635.consensus.fasta
bcftools consensus -f H37Rv.fa -m ERR1664636.mask.bed ERR1664636.snps.vcf.gz | tr '*' 'N' | sed 's/>.*/>ERR1664636/' > ERR1664636.consensus.fasta
bcftools consensus -f H37Rv.fa -m ERR1664664.mask.bed ERR1664664.snps.vcf.gz | tr '*' 'N' | sed 's/>.*/>ERR1664664/' > ERR1664664.consensus.fasta
bcftools consensus -f H37Rv.fa -m ERR1664665.mask.bed ERR1664665.snps.vcf.gz | tr '*' 'N' | sed 's/>.*/>ERR1664665/' > ERR1664665.consensus.fasta

Because we have simply substituted in the alternate alleles into the reference sequence, all isolate consensus sequences should be the same length and we can just concatenate them together in one file without aligning. Then we can select only variant sites using snp-sites.

cat *.consensus.fasta > alignment.genome.fasta
snp-sites alignment.genome.fasta > alignment.snps.fasta

The alignment.snps.fasta file is ready to use for phylogenetic analysis, so thats it for now!

Last updated