Index of /examples/bioinformatics/gatk

Name	Last modified	Size

tmp/	2021-05-07 13:29	-
out/	2021-05-07 13:01	-
data/	2021-05-07 13:01	-
qlog/	2021-05-07 13:00	-
ref/	2021-05-07 11:41	-
qsub/	2021-05-07 11:40	-

RCS GATK Example

Directory Structure

data - dedicated folder for the sequence data files.
data/SRR062634_X200/ - data folder that stores the first 200 spots of the SRR062634 fastq file
data/SRR062634_X400/ - data folder that stores the first 400 spots of the SRR062634 fastq file
ref - links to the centralized reference repository on SCC, taken from the GATK resource bundle.

The reference files used in this example include:
Homo_sapiens_assembly38.fasta and its bwa index files - human genome reference (hg38)
Homo_sapiens_assembly38.dict - human genome reference dictionary
Homo_sapiens_assembly38.dbsnp138.vcf - SNP calling reference
Homo_sapiens_assembly38.known_indels.vcf.gz - INDEL calling reference

out - dedicated folder for organizing output files.
out/gatk_SRR062634_200_xxxx - subfolder that stores the pipeline results for the read dataset SRR062634_X200, where xxxx stands for the job ID
out/gatk_SRR062634_400_xxxx - subfolder that stores the pipeline results for the read dataset SRR062634_X400, where xxxx stands for the job ID
qsub - dedicated folder for all qsub scripts.
qsub/gatk_job_200.qsub - GATK variant calling pipeline example batch script, using the first 200 spots of SRR062634
qsub/gatk_job_400.qsub - GATK variant calling pipeline example batch script for a slightly bigger dataset, using the first 400 spots of SRR062634
qlog - dedicated folder for all qsub output logs.

Notes

To view all available gatk versions on SCC, execute:
[scc1 ] module avail gatk

Load the gatk module first:
[scc1 ] module load gatk/4.2.0.0 # specify the version explicitly

To list all available GATK tools and see the general help, run:
[scc1 ] gatk -h

To get help for an individual GATK tool:
[scc1 ] gatk toolname -h

For example:
[scc1 ] gatk HaplotypeCaller -h

The above command shows the help page for the GATK variant calling tool, HaplotypeCaller. The page lists all of the tool's options and their meanings; many of them are specific to this tool.

Note: the command format above is specific to SCC, where a wrapper script is provided for the GATK jar. The standard invocation is:

[scc1 ] java -jar $GATK_LOCAL_JAR -h 

[scc1 ] java -jar $GATK_LOCAL_JAR HaplotypeCaller -h

The wrapper presets the JVM maximum memory to 2 GB (-Xmx2g). To use more memory or to pass other JVM runtime parameters, use the standard form.
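
As a minimal sketch of the standard form with a larger heap, the helper below only prints the java command line, so it is safe to run anywhere. The function name gatk_java_cmd and the fallback jar path are hypothetical; GATK_LOCAL_JAR is assumed to be set by the gatk module, as the commands above imply.

```shell
# Sketch: print the standard-form GATK command with a custom JVM heap size.
# gatk_java_cmd is a hypothetical helper; the fallback jar path is a placeholder.
gatk_java_cmd() {
    heap="$1"   # JVM maximum heap, e.g. 8g
    shift
    echo java "-Xmx${heap}" -jar "${GATK_LOCAL_JAR:-/path/to/gatk-local.jar}" "$@"
}

# Show the command that would use 8 GB instead of the wrapper's 2 GB.
gatk_java_cmd 8g HaplotypeCaller -h

# To actually run it on SCC after loading the module:
#   java -Xmx8g -jar "$GATK_LOCAL_JAR" HaplotypeCaller -h
```

The heap size should stay within the memory requested for the job; 8g here is only an illustrative value.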

The GATK command-line tools cover a wide range of tasks. Here we provide two complete GATK variant calling pipeline examples, gatk_job_200.qsub and gatk_job_400.qsub, to demonstrate how to use GATK on SCC and how to adjust the resource request parameters accordingly.
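
To illustrate what adjusting resource requests can look like, here is a sketch of a batch script header. It is not copied from the example scripts: the job name, the parallel environment name (omp), the core count, and the time limit are all assumptions chosen for illustration, and the actual scripts in qsub/ may differ.

```shell
#!/bin/bash -l
# Sketch of a batch script header; every value below is an illustrative assumption.
# Job name (the log files in qlog/ start with this name):
#$ -N gatk_SRR062634_200
# Write the job's output log under qlog/ and merge stderr into it:
#$ -o qlog/
#$ -j y
# Request 4 cores in a shared-memory parallel environment (PE name assumed):
#$ -pe omp 4
# Wall-clock time limit:
#$ -l h_rt=12:00:00

# NSLOTS and JOB_ID are set by the scheduler; fall back to safe values outside a job.
threads="${NSLOTS:-1}"
outdir="out/gatk_SRR062634_200_${JOB_ID:-local}"
echo "threads=${threads} outdir=${outdir}"
```

For the larger dataset, one would typically raise the core count or time limit in the header and leave the rest of the script unchanged.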

To run the job, first change into the top directory of the GATK example. The script uses this folder as its anchor to reach the subdirectories qsub/, qlog/, data/, and out/. You may also use absolute paths if you prefer.

[scc1 ] cd gatk_example_dir # get into the example's top directory

[scc1 ] qsub qsub/gatk_job_200.qsub # call the qsub job script from the top directory
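
Because the job script relies on being submitted from the example's top directory, a small pre-submission check can catch a wrong working directory early. The sketch below is not part of the example; check_layout is a hypothetical helper that only tests for the four subdirectories named above.

```shell
# Sketch: verify that the current directory looks like the example's top directory.
# check_layout is a hypothetical helper, not part of the RCS example.
check_layout() {
    for d in qsub qlog data out; do
        if [ ! -d "$d" ]; then
            echo "missing ${d}/ : run this from the example's top directory" >&2
            return 1
        fi
    done
    return 0
}

# Usage (submit only when the layout check passes):
#   check_layout && qsub qsub/gatk_job_200.qsub
```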

The pipeline downloads the data from NCBI automatically and stores it in data/SRR062634_X200/ as:

data/SRR062634_X200/SRR062634_1.fastq
data/SRR062634_X200/SRR062634_2.fastq
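
The text does not say how the script fetches the reads. One plausible way to produce the two files above, assuming the SRA Toolkit's fastq-dump is available, is sketched below. The helper fetch_spots is hypothetical, and by default it only prints the command instead of contacting NCBI.

```shell
# Sketch: fetch the first N spots of an SRA run as paired FASTQ files.
# Assumes fastq-dump (SRA Toolkit) is on PATH; the example script may do this differently.
# With DRY_RUN=1 (the default) the command is printed rather than executed.
fetch_spots() {
    acc="$1"    # SRA run accession, e.g. SRR062634
    n="$2"      # number of spots to keep
    dest="data/${acc}_X${n}"
    cmd="fastq-dump -X ${n} --split-files -O ${dest} ${acc}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        mkdir -p "$dest" && $cmd
    fi
}

fetch_spots SRR062634 200
```

Here -X caps the number of spots, --split-files writes the two mates to _1.fastq and _2.fastq, and -O sets the output directory.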

After the job completes, there will be two log files in qlog/, named like:

qlog/gatk_SRR062634_200.oxxxxxxx
qlog/gatk_SRR062634_200.poxxxxxxx # 'xxxxxxx' stands for job id

The pipeline also writes a number of output files to out/gatk_SRR062634_200_xxxx/, where xxxx is the job ID. The final VCF result is:

out/gatk_SRR062634_200_xxxx/SRR062634-GATK-HC.vcf
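
A quick sanity check on the final result is to count the variant records in the VCF. The sketch below assumes an uncompressed VCF, where header lines start with '#'; count_variants is a hypothetical helper name.

```shell
# Sketch: count variant records (non-header lines) in an uncompressed VCF file.
# count_variants is a hypothetical helper, not part of the RCS example.
count_variants() {
    grep -vc '^#' "$1"
}

# Usage (replace xxxx with the actual job id):
#   count_variants out/gatk_SRR062634_200_xxxx/SRR062634-GATK-HC.vcf
```

With only a few hundred spots as input, a very small count, possibly zero, would not be surprising.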

You may also try running the larger job:

[scc1 ] cd gatk_example_dir # get into the example's top directory

[scc1 ] qsub qsub/gatk_job_400.qsub # call qsub job script from the top directory

After the job completes, there will be two log files in qlog/, named like:

qlog/gatk_SRR062634_400.oxxxxxxx
qlog/gatk_SRR062634_400.poxxxxxxx # 'xxxxxxx' stands for job id

The pipeline also writes a number of output files to out/gatk_SRR062634_400_xxxx/, where xxxx is the job ID. The final VCF result is:

out/gatk_SRR062634_400_xxxx/SRR062634-GATK-HC.vcf

GATK links

Contact Information

Research Computing Services: help@scc.bu.edu

Note: RCS example programs are provided "as is" without any warranty of any kind. The user assumes the entire risk of quality, performance, and repair of any defect. You are welcome to copy and modify any of the given examples for your own use.