Sample QC
When analysing NGS data for a sample, it is a good idea to check the quality of your data. The general pipeline goes FastQ -> BAM -> VCF, and at each stage there are a few things to look out for.
Raw data (fastq)
Metric | Explanation | Comments |
Base quality | The sequencing machine outputs an estimated quality score for each base it sequences (see here for more info). | Ideally, you want as many bases as possible to be of high quality (>Q30). Low-quality bases may be incorrect and lead to wrong conclusions. |
Contamination | Tools like kraken2 and centrifuge can assign each read to a species or other taxonomic classification, which helps you look for contamination. | The more reads assigned to your organism of interest, the better. Drop samples with high levels of contamination if you can afford to. |
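To make the base quality metric concrete, here is a minimal sketch that estimates the fraction of high-quality bases in FastQ records. It assumes the common Phred+33 quality encoding, a strict four-lines-per-record layout, and a cut-off where a score of 30 or more counts as high quality. The function names are hypothetical and not part of any existing tool.

```python
from typing import Iterable, Iterator


def phred_scores(quality_line: str, offset: int = 33) -> list[int]:
    """Decode one FastQ quality string into integer Phred scores.

    Assumes Phred+33 encoding (offset 33), which is an assumption here.
    """
    return [ord(ch) - offset for ch in quality_line]


def fastq_quality_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield the quality line (4th line) of every 4-line FastQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 3:
            yield line.rstrip("\n")


def fraction_at_least(lines: Iterable[str], threshold: int = 30) -> float:
    """Return the fraction of bases with a Phred score >= threshold."""
    total = 0
    passing = 0
    for qual in fastq_quality_lines(lines):
        scores = phred_scores(qual)
        total += len(scores)
        passing += sum(score >= threshold for score in scores)
    return passing / total if total else 0.0


# Two toy reads: 'I' decodes to Q40 and '#' decodes to Q2.
example = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGT", "+", "II##",
]
print(f"{fraction_at_least(example):.2f}")  # 0.75
```

For a real file, the same function can be given an open file handle, for example `fraction_at_least(open("sample.fastq"))`, where the file name is a placeholder.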
Alignment (bam)
Metric | Explanation | Comments |
% reads mapping | The proportion of the total reads that align to the reference. This tells you whether what you sequenced is actually what you wanted to sequence. | In the best case, all your reads align to the reference. Lower numbers may indicate contamination or a reference that is divergent from your strain. |
Reference coverage | Tools like samtools and bedtools can estimate how much of the reference is covered by your data. | This is also computed from the BAM file, but from the perspective of the reference. You want as much of the reference as possible covered by at least 5 reads so that you can confidently call variants and look for differences. Depending on your question, you might want the whole genome covered or only a selection of genes. A low value does not mean that your data is of bad quality, only that you do not have enough of it. |
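The reference coverage check can be sketched as follows. This example assumes per-position depth in the three-column, tab-separated layout (reference name, position, depth) that `samtools depth -a` writes, with every reference position present in the input. The function name and the toy data are hypothetical.

```python
from typing import Iterable


def breadth_of_coverage(depth_lines: Iterable[str], min_depth: int = 5) -> float:
    """Return the fraction of listed positions with depth >= min_depth.

    Each line is expected to be 'reference<TAB>position<TAB>depth'.
    Positions missing from the input are not counted, so the input
    should include zero-depth positions for the result to be meaningful.
    """
    positions = 0
    covered = 0
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        positions += 1
        if int(fields[2]) >= min_depth:
            covered += 1
    return covered / positions if positions else 0.0


# Four toy positions with depths 0, 4, 5 and 12.
example = [
    "chr1\t1\t0",
    "chr1\t2\t4",
    "chr1\t3\t5",
    "chr1\t4\t12",
]
print(breadth_of_coverage(example))  # 0.5
```

To restrict the check to a selection of genes, one option is to filter the depth lines to the positions of interest before passing them in.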