# FastQ to VCF

To perform population level or phylogenetic analyses we need to have all our samples in one file. Some programs will require a fasta file others a text based genotype matrix etc...&#x20;

The **fastq2matix** pipeline can automate some of these steps for you. The general workflow of the pipeline is the following:

1. For each sample the raw fastq data is processed into a genomic vcf (gvcf). This is a slightly different format which encodes information on the non-variant sites as well as the variant sites. This extra information allows us to ascertain whether absence of a variant in a sample is due to the reference allele being present or lack of data. If you are interested you can read more about gvcf format [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035531812-GVCF-Genomic-Variant-Call-Format).&#x20;
2. The gvcfs are imported into a database
3. Joint genotyping is performed on all samples together

## Installing prerequisites

The pipeline makes use of a lot of open-source bioinformatics tools such as bwa, samtools and gatk. Let's make an environment with everything installed.

```bash
conda create -n fastq2matrix python=3.7 bwa samtools bcftools parallel gatk4=4.1.4.1 delly tqdm trimmomatic minimap2 biopython bedtools r-ggplot2
```

Now we can activate the environment with

```bash
conda activate fastq2matrix
```

{% hint style="info" %}
Remember,each time you open a new terminal window and would like to use the pipeline, you have to **activate** the environment.
{% endhint %}

Next we will download the latest version of the pipeline and install

```bash
cd ~
git clone https://github.com/pathogenseq/fastq2matrix.git
cd fastq2matrix
python setup.py install
```

Now you should be ready to go!

## Usage

### Mapping and calling variants

As an example I am going to use 3 *M. tuberculosis* samples, but feel free to substitute in your own data. First use `cd` to navigate to a directory where you would like to work in. Then we are ready to download the fastq data and the reference genome:

```bash
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/009/ERR1664619/ERR1664619_*.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/000/ERR1664620/ERR1664620_*.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR166/001/ERR1664621/ERR1664621_*.fastq.gz
wget ftp://ftp.ensemblgenomes.org/pub/release-47/bacteria//fasta/bacteria_0_collection/mycobacterium_tuberculosis_h37rv/dna/Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa.gz
gunzip Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa.gz
mv Mycobacterium_tuberculosis_h37rv.ASM19595v2.dna.toplevel.fa H37Rv.fa
```

After running those commands you should have 6 fastq files and the reference genome in a directory structure like this:

```bash
.
├── ERR1664619_1.fastq.gz
├── ERR1664619_2.fastq.gz
├── ERR1664620_1.fastq.gz
├── ERR1664620_2.fastq.gz
├── ERR1664621_1.fastq.gz
├── ERR1664621_2.fastq.gz
└── H37Rv.fa

```

The first part of the pipeline performs the prcessing of the fastq data into gvcf data in **5** main steps:

1. **Trimming:** FastQs are trimmed using trimmomatic
2. **Mapping:** Trimmed data is aligned to the reference genome
3. **Duplicate reads marking:** With samtools
4. **Base quality recalibration:** Performed with gatk
5. **Variant calling:** Genomics VCFs are called using gatk HaplotypeCaller

All these can be performed with the command `fastq2vcf.py all`. Here is the command for sample **ERR1664619**:

```
fastq2vcf.py all --read1 ERR1664619_1.fastq.gz --read2 ERR1664619_2.fastq.gz --ref H37Rv.fa --prefix ERR1664619
```

You need to give the reads with `--read1` and `--read2`, the reference genome with `--ref` and the prefix for the output file with `--prefix`. There are some other optional commands but these are the required ones. For example if you are working on a server with multiple cores you can increase the number of threads using `--threads`. The higher the number, the faster the job will run. Just make sure you don't increase it to more threads then there are available on your system

Try run it for samples **ERR1664620** and **ERR1664621**. If you are stuck have a look at the command above and just replace the sample names.

After running this command you should have some new files for each of the samples. You should see a **bam** file and a **vcf** file (ending in .g.vcf.gz). All the files are important for the next steps, so don't move or delete any of them yet.

Your directory should contain the following files:

```bash
.
├── ERR1664619_1.fastq.gz
├── ERR1664619_2.fastq.gz
├── ERR1664619.g.vcf.gz
├── ERR1664619.g.vcf.gz.tbi
├── ERR1664619.g.vcf.gz.validated
├── ERR1664619.mkdup.bam
├── ERR1664619.mkdup.bam.bai
├── ERR1664619.mkdup.bamstats
├── ERR1664620_1.fastq.gz
├── ERR1664620_2.fastq.gz
├── ERR1664620.g.vcf.gz
├── ERR1664620.g.vcf.gz.tbi
├── ERR1664620.g.vcf.gz.validated
├── ERR1664620.mkdup.bam
├── ERR1664620.mkdup.bam.bai
├── ERR1664620.mkdup.bamstats
├── ERR1664621_1.fastq.gz
├── ERR1664621_2.fastq.gz
├── ERR1664621.g.vcf.gz
├── ERR1664621.g.vcf.gz.tbi
├── ERR1664621.g.vcf.gz.validated
├── ERR1664621.mkdup.bam
├── ERR1664621.mkdup.bam.bai
├── ERR1664621.mkdup.bamstats
├── H37Rv.dict
├── H37Rv.fa
├── H37Rv.fa.amb
├── H37Rv.fa.ann
├── H37Rv.fa.bwt
├── H37Rv.fa.fai
├── H37Rv.fa.pac
└── H37Rv.fa.sa

```

### Merging VCF files

Now it is time to import the vcf files into a genomics database. This is a database format that is developed by gatk, but we don't actually have to know any more about it.&#x20;

To run to the next step we first need to create a file with the sample names in it. Create a text file called `samples.txt` and put the following contents into it.

```
ERR1664619
ERR1664620
ERR1664621
```

Now let's run the import step:

```
merge_vcfs.py import --sample-file samples.txt --ref H37Rv.fa --prefix mtb --vcf-dir .
```

The arguments are pretty self explanitory. If fo some reason your VCF files are in a different directory then you can change `--vcf-dir`, but ours are in the current directory so we use `--vcf-dir .`.

After this has finished you should be able to see that there are several new directories, starting with **mtb\_Chromosome**. Why are there multiple directories for one database? To speed things up we have partitioned the genome into 20 pieces and run them in parallel.

Now we are ready for the final step. We run the join genotyping with:

```
merge_vcfs.py genotype --ref H37Rv.fa --prefix mtb
```

After this has finished you should have a vcf named **mtb.2020\_06\_24.genotyped.vcf.gz**. This contains the genotype calls for all variant positions across all your samples. This file can be converted into your desired format using bcftools query. For example lets make a very simple genotype matrix:

```
bcftools query -f '%POS\t%REF\t%ALT[\t%GT]\n' mtb.2020_06_24.genotyped.vcf.gz | head
```

The vcf is a good starting point for your population based questions, but remember it is not perfect and it is important to perform filtering on the file to remove false positive and false negative calls. But that is a topic for another day. For now, have a look at the commands you ran and give it a go with your own data!
