Kraken2

Kraken2 is a tool which allows you to classify sequences from a fastq file against a database of organisms. This is useful when looking for a species of interest or contamination. The database consists of a list of kmers and the mapping of those onto taxonomic classifications. Kraken2 breaks up your sequence into a kmers and compares to the database to find the most likely taxonomic assignment.

Example

Prerequisites

We will have to install some scripts from this repo which we are going to use later.

git clone https://github.com/pathogenseq/pathogenseq-scripts.git
cd pathogenseq-scripts
python setup.py install

Additionally, you will need the fastq2matrix package installed and seqtk tool. If you don't have them you can install with.

git clone https://github.com/pathogenseq/fastq2matrix.git
cd pathogenseq-scripts
python setup.py install

conda install seqtk -y

We will run through an example using a reads from a library classified as Mycobacterium tuberculosis from the SRA.

fasterq-dump -p ERR2513180

Running Kraken2

We should have the two read files for the isolate ERR2513180. We can now run kraken2. You will need to specify the database with --db, the output with --output, the report with --report and the read files. We also need to tell kraken2 that the files are paired. For readers who are using the s3 server the databases are located at /opt/storage2/db/kraken2/. We will be using the standard database, which contains sequences from viruses, bacteria and human.

kraken2 --threads 10 --db /opt/storage2/db/kraken2/standard --output ERR2513180.output.txt --report ERR2513180.report.txt --paired ERR2513180_1.fastq.gz ERR2513180_2.fastq.gz

The report file contains a hierarchical output file contains the taxonomic classification for each read. Let's have a look at the report. You can open it up with less on the terminal or any other text editor/viewer. The format of the report is the following:

  1. Percentage of fragments covered by the clade rooted at this taxon

  2. Number of fragments covered by the clade rooted at this taxon

  3. Number of fragments assigned directly to this taxon

  4. A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank. E.g., "G2" is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.

  5. NCBI taxonomic ID number

  6. Indented scientific name

Importantly we should be able to see 99.19% of reads belonging to the Mycobacterium genus. In the next level (G1) we can see the reads divided between Mycobacterium avium complex (MAC) (15.69%) and Mycobacterium tuberculosis complex (15.07%). You might be wondering where the other 68.43% went. Much of the sequence is conserved within the Mycobacterium genus and so cannot be assigned to any further level than the Genus level (G).

You might be interested in extracting a particular species from the data. In my this case, we would like to keep the Mycobacterium tuberculosis data. To do this we must extract all reads which classify as Mycobacterium tuberculosis complex but also reads which are classified as Mycobacterium genus. We can therefore remove all reads belonging to Mycobacterium avium complex (MAC) and all nested taxa (tax-tree). From the kraken2 report we can find the taxid we will need for the next step (120793).

Performing the filtering

To do this step we will need to run the extract_classified_reads.py script which we installed earlier. We can either tell the script to extract or exclude reads from a tax-tree. We will also need to pass a file to the script which contains the taxonomic IDs from the NCBI. If you are reading this and have access to the s3 node then it is located at /opt/storage2/db/kraken2/nodes.dmp.

extract_classified_reads.py  --R1 ERR2513180_1.fastq --R2 ERR2513180_2.fastq --kraken2-output ERR2513180.output.txt --tax-dump /opt/storage2/db/kraken2/nodes.dmp --exclude 120793

After running this command you should be able to see two files named ERR2513180_1.kraken2_filt.fastq.gz and ERR2513180_2.kraken2_filt.fastq.gz which contain the filtered reads.

That's all there is to it!

Last updated