Fst with delly

The Fst statistic allows you to find variants that are present in specific populations. This is useful when trying to find lineage specific mutation for example. We usually think of these mutations of being SNPs or if you are feeling adventurous you can look at small indels, but wecan also do it for large structural variants such as deletions. In this tutorial I will show you how to find population specific large deletions using delly and vcftools.

Indented learning outcomes

Understand the difference between bcftools and delly for calling variants
Be confident in using bcftools to merge single bcfs into multi-sample bcfs
Understand the filtering parameters
Use vcftools to calculate Fst

Create a multi-sample vcf

I will assume that you already have the filtered bcf files for all the samples you wish to analyse. First up, we should merge them using delly.

delly merge -o deletions_to_genotype.bcf *.filtered.delly.bcf

This command will store all deletion positions in a new bcf file. Assuming you have stored all you sample names in a file called all_samples.txt you can run the following code to re-genotype the bam files to look for the specific deletions in the merged deletions_to_genotype.bcf file.

cat all_samples.txt | parallel -j 5 --bar "delly call -g ~/refgenome/MTB-h37rv_asm19595v2-eg18.fa -v deletions_to_genotype.bcf -o {}.genotyped.bcf  /path/to/bams/{}.bam"

You should change the directory where it says /path/to/bams/ to wherever you have stored your alignment files. Note that delly can work from bam or cram format, so no need to convert. This command should product a new bcf file for each sample (ending in .genotyped.bcf). We will now merge the genotyped bcfs. Since the files all contains calls for the same deletions (with specific IDs assigned during the previous delly merge step), this time the merge command will be slightly different. The resulting file will not only contain the positions, but also the calls for each sample.

bcftools merge -m id -O b -o merged.bcf *genotyped.bcf

We can see that we added the -m id parameter to invoke the previously mentioned difference.

Filtering

Sample filtering

First we must find samples with low number of non-missing calls. To check the total number of variants we have genotyped we can run the following.

bcftools view to_genotype.bcf -H | wc -l

In this example it is 294. Now we can check how many calls were made for each sample using bcftools and extract only samples with a low missing count. It is a good idea to remove samples with >20% missing data. We will use 294*0.2 rounded up to the nearest ten = 60.

bcftools +smpl-stats merged.bcf | grep ^FLT0 | awk '$12<60'  | cut -f2 > good_samples.txt

Now we can extract the good samples from the bcf.

bcftools view -S good_samples.txt merged.bcf -Oz -o merged.sample_filt.vcf.gz

Site filtering

Next we will filter this file again to mask calls that are heterozygous (i.e. have mixed ref/alt calls) and to remove deletions with >20% missing data.

bcftools view merged.sample_filt.vcf.gz | bcftools filter -e 'GT="het"' -S .  | bcftools view -i 'F_PASS(GT!="mis")>0.8' -Oz -o merged.sample_filt.site_filt.vcf.gz

Calculating the Fst

Now we are ready to calculate the Fst for each deletion using vcftools. You will need to have sample names of the two populations you are comparing in different files. I will call them pop1.txt and pop2.txt. After doing this you can run the folling command.

vcftools --gzvcf merged.sample_filt.site_filt.vcf.gz --weir-fst-pop pop1 --weir-fst-pop pop

Vcftools will create a file called out.weir.fst which contains the start position of each deletion and the corresponding fst.

Have a look at some of these in the bam/cram files to confirm them. IGV is a good tool that you can use to view multiple bam files. Here is an example of two samples, one with and one without the variant.

If you are interested in a particular variant you can the calls for all samples using the following command (using position 3779833 as an example).

bcftools view merged.filt.vcf.gz Chromosome:3779833 | bcftools query -f '[%SAMPLE\t%GT\t%DR\t%DV\t%RR\t%RV\n]'

That's all there is to it, have fun finding your deletions!

PreviousMapping pipeline NextKraken2

Last updated 5 years ago

Was this helpful?