Sample QC
When analysing NGS data for a sample, it is a good idea to check the quality of your data. The general pipeline goes FastQ (raw reads) -> BAM (aligned reads) -> VCF (variant calls). At each stage there are a few things to look out for.
| Metric | Explanation | Comments |
| --- | --- | --- |
| Base quality | The sequencing machine outputs an estimated quality score for every base it sequences. | Ideally, you want most of the bases to be of high quality (>Q30). Low-quality bases may be incorrect and lead to wrong conclusions. |
| Contamination | Taxonomic classification tools can assign each read to a species or other taxonomic rank. This helps to detect contamination. | The more reads assigned to your organism of interest, the better. Drop samples with high levels of contamination if you can afford to. |
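
The base quality check above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a dedicated QC tool. It assumes four-line FastQ records and Phred+33 quality encoding, and it counts Q30 itself as high quality. The function name `q30_fraction` is hypothetical.

```python
from typing import Iterable

# Assumption: qualities are Phred+33 encoded (ASCII code minus 33).
PHRED_OFFSET = 33


def q30_fraction(fastq_lines: Iterable[str], threshold: int = 30) -> float:
    """Return the fraction of bases whose quality is at or above `threshold`."""
    total = 0
    high = 0
    for i, line in enumerate(fastq_lines):
        # Assumption: each record is exactly four lines; the fourth holds the qualities.
        if i % 4 != 3:
            continue
        for ch in line.rstrip("\r\n"):
            total += 1
            if ord(ch) - PHRED_OFFSET >= threshold:
                high += 1
    return high / total if total else 0.0


# Two toy reads: qualities are Q40, Q40, Q30, Q2 and Q20, Q20.
example = ["@r1", "ACGT", "+", "II?#", "@r2", "AC", "+", "55"]
fraction = q30_fraction(example)
```

For a real file, the same function can be fed an open file handle, for example `q30_fraction(open("sample.fastq"))`, where the file name is a placeholder.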

| Metric | Explanation | Comments |
| --- | --- | --- |
| % reads mapping | We can check the proportion of the total reads which align to the reference. This will tell you if what you sequenced is actually what you wanted to sequence. | In a best-case scenario, all your reads should align to the reference. Lower numbers may indicate contamination or a reference which is divergent from your strain. |
| Reference coverage | Tools like samtools and bedtools can estimate how much of the reference is covered by our data. | This also uses the BAM file, but looks at it from the perspective of the reference. We want as much of the reference as possible covered by at least 5 reads so that we can confidently call variants and look for differences. Depending on your question you might want all of the genome covered, or a selection of genes. If the value is low, this does not mean that your data is bad quality, but that you don't have enough of it. |
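
As a rough sketch of the two alignment metrics above, the Python below computes the percentage of mapped reads from two counts, and the fraction of reference positions covered by at least 5 reads from per-position depth lines. It assumes the depth input has three whitespace-separated columns (reference name, position, depth), the layout written by `samtools depth -a`. The function names are hypothetical.

```python
from typing import Iterable


def percent_mapped(mapped_reads: int, total_reads: int) -> float:
    """Percentage of all reads that aligned to the reference."""
    return 100.0 * mapped_reads / total_reads if total_reads else 0.0


def breadth_of_coverage(depth_lines: Iterable[str], min_depth: int = 5) -> float:
    """Fraction of listed reference positions with depth >= min_depth.

    Assumption: every reference position appears in the input, including
    positions with zero depth; otherwise the fraction is overestimated.
    """
    covered = 0
    total = 0
    for line in depth_lines:
        fields = line.split()
        if len(fields) < 3:  # skip blank or malformed lines
            continue
        total += 1
        if int(fields[2]) >= min_depth:
            covered += 1
    return covered / total if total else 0.0


# Toy input: four positions with depths 0, 5, 12 and 4.
depths = ["ref\t1\t0", "ref\t2\t5", "ref\t3\t12", "ref\t4\t4"]
breadth = breadth_of_coverage(depths)
mapped_pct = percent_mapped(950, 1000)
```

To restrict the check to a selection of genes, one option is to filter the depth lines to the positions of interest before passing them in.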