Index of /examples/bioinformatics/gatk

Name     Last modified
tmp/     2021-05-07 13:29
out/     2021-05-07 13:01
data/    2021-05-07 13:01
qlog/    2021-05-07 13:00
ref/     2021-05-07 11:41
qsub/    2021-05-07 11:40

RCS GATK Example

Directory Structure


Notes

To view all available gatk versions on SCC, execute:
[scc1 ] module avail gatk

Load the gatk module first:
[scc1 ] module load gatk/4.2.0.0 # please specify version

To list all available GATK tools and get general help, use:
[scc1 ] gatk -h

To get help for an individual GATK tool:
[scc1 ] gatk toolname -h

For example:
[scc1 ] gatk HaplotypeCaller -h

The above command shows the help page for HaplotypeCaller, the GATK variant calling tool. The page lists all of the tool's options and their meanings; many of them are specific to this tool.

Note: the command format above is specific to SCC, because we have created a wrapper script for the GATK jar. The standard calling format is:
[scc1 ] java -jar $GATK_LOCAL_JAR -h
[scc1 ] java -jar $GATK_LOCAL_JAR HaplotypeCaller -h


The wrapper presets the JVM maximum memory to 2 GB (-Xmx2g), so users who need more memory or want to add other JVM runtime parameters should use the standard form.
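
As a small illustration of the standard form with a larger heap, the sketch below only assembles and prints the command so it can be checked before it is run. The helper name gatk_java_cmd and the 8g value are hypothetical and not part of the SCC wrapper; GATK_LOCAL_JAR is assumed to be set after loading the gatk module, as in the commands above.

```shell
# Hypothetical helper: print the standard-form GATK command with a custom JVM heap size.
# Usage: gatk_java_cmd <heap> <tool> [tool args...]
gatk_java_cmd() {
    local heap="$1"
    shift
    # JVM options such as -Xmx must come before -jar.
    # GATK_LOCAL_JAR is assumed to be defined by "module load gatk".
    echo "java -Xmx${heap} -jar ${GATK_LOCAL_JAR} $*"
}

# Example: request an 8 GB heap instead of the wrapper's 2 GB preset (8g is an assumed value).
gatk_java_cmd 8g HaplotypeCaller -h
```

If the printed command looks right, it can be pasted at the prompt or run directly by removing the echo.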

The GATK command-line tools can do a great deal. Here we provide two comprehensive GATK variant calling pipeline examples, gatk_job_200.qsub and gatk_job_400.qsub, to demonstrate how to use GATK on SCC and how to adjust the resource request parameters accordingly.
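
To give a rough idea of what resource request parameters look like in a batch script, the sketch below writes a minimal job script to a file. It is not the content of gatk_job_200.qsub or gatk_job_400.qsub, which are not reproduced on this page. The file name demo_job.qsub, the job name, the core count, and the run time are all hypothetical, and the "omp" parallel environment name is an assumption about the site configuration.

```shell
# Hypothetical sketch: write a minimal batch job script (NOT the real gatk_job_200.qsub).
# The quoted 'EOF' keeps the shell from expanding anything inside the script body.
cat > demo_job.qsub <<'EOF'
#!/bin/bash -l

# Job name (assumed)
#$ -N gatk_demo

# Write the job's stdout and stderr files under qlog/
#$ -o qlog/
#$ -e qlog/

# Request 4 cores (assumed value; "omp" is an assumed parallel environment name)
#$ -pe omp 4

# Request 12 hours of run time (assumed value)
#$ -l h_rt=12:00:00

module load gatk/4.2.0.0
gatk HaplotypeCaller -h
EOF
```

A larger input would typically call for raising the core count or the run time in lines like these before submitting.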

To run the job:
[scc1 ] cd gatk_example_dir # change to the top directory of the GATK example; the script uses this directory as the anchor for the subdirectories qsub/, qlog/, data/, and out/. You may also use absolute paths if you prefer
[scc1 ] qsub qsub/gatk_job_200.qsub # call the qsub job script from the top directory


The pipeline automatically downloads the data from NCBI and stores it in data/SRR062634_X200/, which looks like:


After the above job completes, there will be two files in qlog/ that look like:


The pipeline run also generates a number of output files in out/gatk_SRR062634_200_xxxx/, where xxxx is the job ID. The final VCF result is:


You may also try to run the bigger job:
[scc1 ] cd gatk_example_dir # get into example's top directory
[scc1 ] qsub qsub/gatk_job_400.qsub # call qsub job script from the top directory


After the above job completes, there will be two files in qlog/ that look like:


The pipeline run also generates a number of output files in out/gatk_SRR062634_400_xxxx/, where xxxx is the job ID. The final VCF result is:
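
Once a job has finished, one quick sanity check on its final VCF is to count the variant records, that is, the lines that do not start with "#". The sketch below assumes an uncompressed VCF; the helper name count_variants and the file demo.vcf are hypothetical, and the demo records are made up rather than taken from the pipeline output.

```shell
# Hypothetical helper: count variant records in an uncompressed VCF file.
# Header lines start with '#'; every other line is one variant record.
count_variants() {
    grep -vc '^#' "$1"
}

# Tiny made-up VCF for demonstration only (not pipeline output).
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr1\t100\t.\tA\tG\t50\tPASS\t.\nchr1\t200\t.\tC\tT\t60\tPASS\t.\n' > demo.vcf

count_variants demo.vcf   # prints 2
```

To check a real result, pass the path of the final VCF under out/ in place of demo.vcf.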

GATK links

Contact Information

Research Computing Services: help@scc.bu.edu

Note: RCS example programs are provided "as is" without any warranty of any kind. The user assumes the entire risk of quality, performance, and repair of any defect. You are welcome to copy and modify any of the given examples for your own use.