# Running on many samples

Running the pipeline on a few samples is relatively straightforward, but when we need to run it on hundreds of samples it becomes unwieldy to type out every command. For this purpose it is a good idea to write a wrapper script to handle the runs.

As an example let's assume that you have the following files:

```
.
└── fastq
    ├── ERR1664623_1.fastq.gz
    ├── ERR1664623_2.fastq.gz
    ├── ERR1664624_1.fastq.gz
    ├── ERR1664624_2.fastq.gz
    ├── ERR1664635_1.fastq.gz
    ├── ERR1664635_2.fastq.gz
    ├── ERR1664636_1.fastq.gz
    ├── ERR1664636_2.fastq.gz
    ├── ERR1664664_1.fastq.gz
    ├── ERR1664664_2.fastq.gz
    ├── ERR1664665_1.fastq.gz
    └── ERR1664665_2.fastq.gz
```

First we will need to create a list of the sample prefixes that you would like to run. To do this we can run the following command:

```
ls fastq/ | grep '_1\.fastq\.gz$' | sed 's/_1\.fastq\.gz$//' > samples.txt
```

This will put all the file prefixes into a file called `samples.txt`, one per line. We can then use GNU `parallel` to run our `tb-profiler` command for each sample in the file. Before we run tb-profiler, we should make the folders where it will store the BAM, VCF and result files. We have to do this because otherwise the multiple instances of tb-profiler started by parallel will all try to create the same folders at the same time, and you will run into an error.

```
mkdir bam vcf results
```
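Before starting a long run, it may be worth confirming that every prefix in `samples.txt` has both its `_1` and `_2` files. The following is a minimal sketch of such a check; `check_pairs` is a hypothetical helper name used only for illustration and is not part of tb-profiler. It assumes the `fastq/` layout and file naming shown above.

```shell
# check_pairs: hypothetical helper, not part of tb-profiler.
# Usage: check_pairs <sample_list> <fastq_dir>
# Prints every missing mate file to stderr and returns non-zero if any is missing.
check_pairs() {
    local list="$1" dir="$2" missing=0 sample mate
    while IFS= read -r sample; do
        # Skip blank lines in the sample list
        [ -z "$sample" ] && continue
        for mate in 1 2; do
            if [ ! -f "${dir}/${sample}_${mate}.fastq.gz" ]; then
                echo "missing: ${dir}/${sample}_${mate}.fastq.gz" >&2
                missing=1
            fi
        done
    done < "$list"
    return "$missing"
}
```

It could be called as `check_pairs samples.txt fastq && echo "all pairs present"`.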

Now we are ready to run tb-profiler in parallel.

```
cat samples.txt | parallel --bar -j 2 tb-profiler profile -1 fastq/{}_1.fastq.gz -2 fastq/{}_2.fastq.gz -p {}
```

You can adjust the `-j` parameter to allow more jobs to run in parallel. I have set this to **2**, but if you have access to an HPC or a powerful computer you can increase it.
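To make explicit what the `parallel` command does with each line of `samples.txt`, here is a rough serial equivalent written as a plain shell loop. It may also serve as a fallback on machines where GNU parallel is not installed. The function name `run_serial` and the `TB_PROFILER` variable are hypothetical names introduced for this sketch; the tb-profiler arguments are the same ones used in the command above.

```shell
# TB_PROFILER is a hypothetical variable for this sketch; it defaults to the
# real command and can be overridden (e.g. with `echo`) to preview the commands.
TB_PROFILER="${TB_PROFILER:-tb-profiler}"

# run_serial: hypothetical helper that runs one profile job per sample, in order.
# Usage: run_serial <sample_list>
run_serial() {
    local list="$1" sample
    while IFS= read -r sample; do
        # Skip blank lines in the sample list
        [ -z "$sample" ] && continue
        # </dev/null stops the command from consuming the rest of the sample list
        "$TB_PROFILER" profile \
            -1 "fastq/${sample}_1.fastq.gz" \
            -2 "fastq/${sample}_2.fastq.gz" \
            -p "$sample" < /dev/null
    done < "$list"
}
```

Calling `run_serial samples.txt` processes the samples one at a time, which is roughly what `-j 1` would give you. Setting `TB_PROFILER=echo` first prints the arguments for each sample instead of running anything.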
