Consensus building

This page describes a basic procedure to generate consensus sequences

Setting up

First we'll need to get some data. As an example I've used some M. tuberculosis data, but just substitute in your own data if you have. Run the following code to get some raw NGS data for 6 isolates and the reference genome

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/003/ERR1664623/ERR1664623_*.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/004/ERR1664624/ERR1664624_*.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/005/ERR1664635/ERR1664635_*.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/006/ERR1664636/ERR1664636_*.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/004/ERR1664664/ERR1664664_*.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/005/ERR1664665/ERR1664665_*.fastq.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/195/955/GCF_000195955.2_ASM19595v2/GCF_000195955.2_ASM19595v2_genomic.fna.gz
gunzip GCF_000195955.2_ASM19595v2_genomic.fna.gz
mv GCF_000195955.2_ASM19595v2_genomic.fna H37Rv.fa

You'll also need to have the latest versions of bwa, samtools, bcftools, gatk and snp-sites installed, if you have conda installed you can use the following commands to install/update:

conda install -c bioconda bwa samtools bcftools snp-sites gatk4

Step 1: Aligning

First we need to index the reference file with BWA.

bwa index H37Rv.fa

Next we can start aligning the raw data:

I've used -t 4 with bwa and -@ 4 with samtools. These parameters set the number of threads used to run the programs. More threads = faster runtime (usually).

We now have alignments in BAM format. You can inspect these with tablet if you want to see what the data looks like. Tools like tablet and variant callers usually need the BAM files to be indexed. You can do this with samtools:

Step 2: Variant Calling

We will now perform variant calling. There are many programs that you can use but for this tutorial we will go with gatk. Before we start variant calling, we need to index the reference using samtools and gatk.

This will create a genomic VCF file containing all the variant positions and their alternate alleles.

Step 3: Consensus building

We will now create a consensus sequence for all isolates by substituting in the alternate alleles into the reference at their respective positions. This can be done using bcftools. First we will create a bed file containing the locations of low depth regions. We would like to mask these in the consensus sequence as we cannot do not know if they should be reference or alternate allele.

The important parameter is there specified with 'FMT/DP<3'. This tells bcftools to select sites with less than 3 reads aligning to the position. You can increase this if you want to be more stringent.

Now we must create a VCF file containing only variant positions for all isolates:

Again, downstream tools require indexing. We have to index the VCF files with bcftools:

Finally we can now use the filtered VCF and the mask bed file to create our consensus sequences:

Because we have simply substituted in the alternate alleles into the reference sequence, all isolate consensus sequences should be the same length and we can just concatenate them together in one file without aligning. Then we can select only variant sites using snp-sites.

The alignment.snps.fasta file is ready to use for phylogenetic analysis, so thats it for now!

Last updated

Was this helpful?