Sample QC

When analysing NGS data for a sample, it is a good idea to check the quality of your data at every step. The general pipeline goes FASTQ -> BAM -> VCF, and each stage has a few things to look out for.

Raw data (fastq)

| Metric | Explanation | Comments |
| --- | --- | --- |
| Base quality | The sequencing machine outputs an estimated quality score for each base it sequences. | Ideally, you want as many bases as possible to be of high quality (>Q30, i.e. an estimated error rate below 1 in 1,000). Low-quality bases may be incorrect and lead to wrong conclusions. |
| Contamination | Tools like kraken2 and centrifuge can assign each read to a species or other taxonomic classification, which helps to detect contamination. | The more reads assigned to your organism of interest, the better. Drop samples with high levels of contamination if you can afford to. |
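As a rough illustration of the two metrics above, the sketch below computes the fraction of high-quality bases in a FASTQ file and the fraction of reads assigned to a taxon of interest. It rests on several assumptions that are not part of the original text: the FASTQ uses Phred+33 encoding with exactly four lines per record, a base at exactly Q30 is counted as high quality, and the classification results are in the standard six-column, tab-separated kraken2 report layout (percentage, clade reads, direct reads, rank code, taxid, indented name). The function names and the toy data are hypothetical.

```python
from io import StringIO


def fraction_q30(fastq_handle, threshold=30, offset=33):
    """Fraction of bases with Phred quality >= threshold (Phred+33 assumed)."""
    total = 0
    high = 0
    for i, line in enumerate(fastq_handle):
        if i % 4 == 3:  # the 4th line of each record is the quality string
            quals = line.rstrip("\n")
            total += len(quals)
            high += sum(1 for ch in quals if ord(ch) - offset >= threshold)
    return high / total if total else 0.0


def fraction_assigned(report_handle, taxon_name):
    """Fraction of reads in the clade called taxon_name in a kraken2-style report."""
    for line in report_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip anything that is not a six-column report row
        if fields[5].strip() == taxon_name:  # names are indented by rank
            return float(fields[0]) / 100  # column 1 is a percentage
    return 0.0


# Toy inputs, made up for illustration only.
toy_fastq = "@read1\nACGT\n+\nII5I\n@read2\nAC\n+\n?#\n"
toy_report = (
    " 10.00\t100\t100\tU\t0\tunclassified\n"
    " 90.00\t900\t5\tR\t1\troot\n"
    " 85.00\t850\t800\tS\t1773\t        Mycobacterium tuberculosis\n"
)

print(f"Q30 fraction: {fraction_q30(StringIO(toy_fastq)):.2f}")
print(f"On-target fraction: {fraction_assigned(StringIO(toy_report), 'Mycobacterium tuberculosis'):.2f}")
```

For real data you would pass open file handles instead of `StringIO` objects. Dedicated QC tools report far more detail than this; the point here is only to show what the two numbers mean.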

Alignment (bam)

| Metric | Explanation | Comments |
| --- | --- | --- |
| % reads mapping | The proportion of the total reads that align to the reference. This tells you whether what you sequenced is actually what you wanted to sequence. | In a best-case scenario, all your reads align to the reference. Lower numbers may indicate contamination or a reference that is divergent from your strain. |
| Reference coverage | Tools like samtools and bedtools can estimate how much of the reference is covered by your data. | This also uses the bam file, but looks at it from the perspective of the reference. We want as much of the reference as possible covered by at least 5 reads so that we can confidently call variants and look for differences. Depending on your question, you might want all of the genome covered or only a selection of genes. If the value is low, it does not mean that your data is of bad quality, only that you don't have enough of it. |
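The two alignment metrics above can be sketched in a few lines. This is a minimal example under stated assumptions, not a prescribed workflow: it assumes you have already saved the text output of `samtools flagstat` and the three-column, tab-separated output of `samtools depth -a` (reference name, position, depth; `-a` includes positions with zero depth). The function names and the toy inputs are hypothetical. Note that the flagstat total also counts secondary and supplementary alignment records, so the resulting fraction is an approximation of the proportion of reads mapped.

```python
def mapped_fraction(flagstat_text):
    """Fraction of records reported as mapped in `samtools flagstat` text output."""
    total = mapped = None
    for line in flagstat_text.splitlines():
        # Lines look like: "<QC-passed> + <QC-failed> <description>"
        parts = line.split(" ", 3)
        if len(parts) < 4 or not parts[0].isdigit():
            continue
        count, description = int(parts[0]), parts[3]
        if description.startswith("in total"):
            total = count
        elif description.startswith("mapped ("):  # skips "primary mapped"
            mapped = count
    if not total or mapped is None:
        raise ValueError("could not find total and mapped counts")
    return mapped / total


def breadth_of_coverage(depth_lines, min_depth=5):
    """Fraction of reference positions covered by at least min_depth reads."""
    covered = positions = 0
    for line in depth_lines:
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        positions += 1
        if int(fields[2]) >= min_depth:
            covered += 1
    return covered / positions if positions else 0.0


# Toy inputs, made up for illustration only.
toy_flagstat = (
    "1000 + 0 in total (QC-passed reads + QC-failed reads)\n"
    "1000 + 0 primary\n"
    "950 + 0 mapped (95.00% : N/A)\n"
    "950 + 0 primary mapped (95.00% : N/A)\n"
)
toy_depth = ["chr1\t1\t0", "chr1\t2\t4", "chr1\t3\t5", "chr1\t4\t12"]

print(f"Mapped fraction: {mapped_fraction(toy_flagstat):.2f}")
print(f"Breadth at >=5 reads: {breadth_of_coverage(toy_depth):.2f}")
```

The `min_depth=5` default mirrors the threshold of 5 reads mentioned above. To restrict the calculation to a selection of genes, you would filter the depth lines to those regions before passing them in.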
