Writing a custom collate script
Result files for each individual run are by default stored in json files (ending in .results.json). These can be collated into a merged file to display the drug resistance mutations using tb-profiler collate
. However, you may want to customise what information is extracted from the result files and displayed in the collated output. This page will provide an outline how to write your own parser.
Script skeleton
Open up your favourite editor, paste the following code and save the file as tbprofiler_custom_script.py
This will form the basic structure of the parser. Let's see what each section does.
Lines 3-11: Here we import libraries that are useful in the script.
Lines 13: This line starts the definition of the main
function. It takes one argument in the form of an object of arguments which is constructed using the argument parser.
Line 14: The location of the bed file is stored in the variable bed_file
. This is constructed using the prefix of the database supplied in the command line arguments (tbdb by default).
Line 15: We use the tbprofiler.get_lt2drugs
function to extract the locus_tag to associated drug mapping in a dictionary with the following format:
Line 17-20: If the --samples
argument was supplied, the file is read in and stored as a list. Otherwise the list is derived from looking the the result directory (specified with --dir
).
Lines 22-23: A loop is constructed to loop over the samples. For each iteration the json result file is loaded into a dictionary structure. The key-value structure is explained below.
Lines 26-34: These lines parse the arguments supplied to the script (for example --samples
). You can add in new arguments and they will appear as properties of the args
object in the main
function.
Result data structure
The individual result files are read into a dictionary on line 23. There is a lot of information in the result. Below is a snippet of the dictionary.
The mutations are split into two lists "dr_variants" and "other_variants". The dr_variants are all associated with drug resistance. The other_variants are mutations in drug resistant genes but have not (yet!) been associated with resistance. Some of the other mutations may be phylogenetic mutations which have nothing to do with resistance while others may be but might not have enough evidence yet.
At the moment if you run the script and point it to a directory with some result files in using --dir
, it will read in all the data but not actually do anything with it.
Working example script
As an example I will extend its functionality to output a simple CSV file with the name of the sample and the drug-resistance type.
First I will create a CSV writer object. Add the following code before the for loop.
Close the file handle after the loop finishes using
We have used args.out
property when creating the filehandle. For this to work we have to add an extra argument to the argument parser. Add the following for code together with the other arguments:
Now lets add in the code that determines the drug resistance class. By default tb-profiler only classes samples into sensitive, MDR, XDR and drug-resistant. Let's extend this to find pre-MDR and pre-XDR.
First we can create a set which we will then populate with a list of the drugs which the sample is resistant to. We can do this by looping over the variants. Add in the following code inside of the for loop:
Drug class definitions
Samples can be classed into different types using the following definitions
Coding this up
We can see straight away that the definitions are just a few boolean operations. We can store the boolean values by looking up the drugs in the resistant_drugs
set we created before. First we will create a list of the fluoroquinolones (FLQ) and second line injectables (SLI). Add this code anywhere in the main function but before the loop:
Then create the boolean values. Put these inside the loop after you have created the data
variable:
To test for rifampicin and isoniazid we simple check if they exists in resistant_drugs
. To check for FLQ and SLI resistance we find the intersections between FLQ_set
and resistant_drugs
and count. If the value is greater than 0 then we have resistance in that category.
Now we are ready to perform the construction of the classes using our boolean variables. We can run through the table above using a series of if-else statements.
We have stored the drug resistance class in the variable drtype
. Now all that is left to do is to write this to the CSV file. We can do this using:
Your script should now look like this:
Try running using it and check if the output looks like it should:
If it is working as expected, try adding in other pieces of info to the output. For example, you can extract the sub-lineage by accessing data["sublin"]
. Have a look at the result file structure ( you can open it with Firefox to render it in a more readable format).
Hopefully you can use this template as a boilerplate for your own uses. There are a few examples available here if you are interested. If you have any questions or if you have a cool script that you would like to make available to the community don't hesitate to contact me by email or through GitHub. Thanks for reading!
Last updated