RCS SRAToolkit Example

Directory Structure

data - dedicated folder to store downloaded sequence data.
in - dedicated folder to store all input files.
in/PRJNA525241_SRR_Acc_List.txt - SRR accession list for BioProject PRJNA525241.
out - dedicated folder to organize output files.
qsub - dedicated folder to store all qsub scripts.
qsub/sra_job.qsub - batch script example.
qlog - dedicated folder to store all qsub output logs.
dbGaP - dedicated folder with an example of how to download dbGaP data.

Notes

1. To view all available SRAToolkit versions on SCC, execute:


    [scc1 ] module avail sratoolkit

2. Before running any SRA utilities, load the sratoolkit module to make them available:


    [scc1 ] module load sratoolkit/2.11.1
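A quick way to confirm that the module load worked is to check that the tools used in this example are on the PATH. The helper below is only a sketch; `require_tools` is a hypothetical name and not part of SRAToolkit.

```shell
# Hypothetical helper: report any tool that is not on the PATH.
# Returns 0 if every tool is found, 1 otherwise.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool (did you load the sratoolkit module?)" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call, after 'module load sratoolkit/2.11.1':
#   require_tools vdb-dump fasterq-dump sam-dump
```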

Commandline Examples:

Ex1: To check the space requirement for a download target:


    [scc1 ] vdb-dump --info SRR030257  # shows the result on the screen
    [scc1 ] vdb-dump --info SRR030257 --output-file out/vdb-dump_info.out  # stores the output in a file

* As a rule of thumb, the fasterq-dump guide suggests getting the size of the accession with 'vdb-dump', then allowing 4x that size for the output and another 4x for the temporary files. The download above therefore needs about 169M*4=676M for output, and another 676M for temporary files.
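The arithmetic behind this rule of thumb can be sketched as a small shell function. `estimate_space_mb` is a hypothetical helper; it assumes you pass the accession size in MB as read by hand from the 'vdb-dump --info' output.

```shell
# Rule-of-thumb estimate: 4x the accession size for output,
# plus another 4x for temporary files.
# Argument: accession size in MB.
estimate_space_mb() {
    size_mb=$1
    output_mb=$((size_mb * 4))
    temp_mb=$((size_mb * 4))
    echo "output: ${output_mb}M  temp: ${temp_mb}M  total: $((output_mb + temp_mb))M"
}

estimate_space_mb 169   # prints: output: 676M  temp: 676M  total: 1352M
```

The result can then be compared against the free space reported by `df -h` for the output and scratch locations.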

Ex2: To download data from the NCBI SRA repository, run the following commands. It's a good habit to put the temporary folder in scratch and to store the downloaded sequence data in a dedicated data folder:


    [scc1 ] rm -rf data # make sure data/ doesn't exist
    [scc1 ] fasterq-dump -p -t /scratch/$USER --outdir data SRR030257 2>&1 | tee out/fasterq-dump.out

The above command prints its progress to the screen and records the same output in out/fasterq-dump.out.
The downloaded sequence files can be found in data/:

    [scc1 ]$ ls -l data
    total 1487440
    -rw-r--r-- 1 yshen16 scv 761565376 Oct 14 14:48 SRR030257_1.fastq
    -rw-r--r-- 1 yshen16 scv 761565376 Oct 14 14:48 SRR030257_2.fastq
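As a sanity check on a paired-end download like this one, both files should contain the same number of reads. The sketch below uses a hypothetical helper, `count_fastq_reads`, and assumes each FASTQ record occupies exactly four lines.

```shell
# Hypothetical helper: count reads in a FASTQ file,
# assuming each record occupies exactly four lines.
count_fastq_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Example call, comparing the two mate files:
#   r1=$(count_fastq_reads data/SRR030257_1.fastq)
#   r2=$(count_fastq_reads data/SRR030257_2.fastq)
#   [ "$r1" -eq "$r2" ] && echo "read counts match: $r1"
```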

The toolkit includes many more functions and utilities to assist with sequence retrieval and processing. Please refer to the NCBI SRA Toolkit documentation website for more information.

Batch Job Example:

You can write a qsub script to download data for multiple accessions. An example qsub script is in qsub/sra_job.qsub:


    [scc1 ] cd qsub # get into the qsub script location
    [scc1 ] qsub sra_job.qsub

After the job completes, there will be two files in qlog/ that look like:

qlog/sra_example.oxxxxxxx
qlog/sra_example.poxxxxxxx # 'xxxxxxx' stands for job number

Two folders are also generated in out/:

out/sra - stores the sra files for PRJNA525241
out/PRJNA525241 - stores the fastq files for PRJNA525241
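The contents of qsub/sra_job.qsub are not reproduced here. As a rough sketch of what a script with this behavior could look like, the example below loops over an accession list, fetches each run with `prefetch`, and converts it with `fasterq-dump`. The scheduler directives, the relative paths, and the `run` and `download_accessions` helpers are all assumptions inferred from the file names above, not the actual script.

```shell
#!/bin/bash -l
# Sketch only: the directives and paths below are assumptions,
# not the contents of the real sra_job.qsub.
# Run from the submission directory and name the job sra_example:
#$ -cwd
#$ -N sra_example
# Hypothetical log location, merged stdout/stderr, 4 slots:
#$ -o ../qlog
#$ -j y
#$ -pe omp 4

# Print the command when DRY_RUN is set, otherwise run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Download and convert every accession listed in the file given as $1.
download_accessions() {
    list=$1
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -z "$acc" ] && continue    # skip blank lines
        run prefetch --output-directory ../out/sra "$acc"
        run fasterq-dump --threads "${NSLOTS:-1}" -t "/scratch/$USER" \
            --outdir ../out/PRJNA525241 "../out/sra/$acc"
    done < "$list"
}

# In the job itself, after 'module load sratoolkit/2.11.1':
#   download_accessions ../in/PRJNA525241_SRR_Acc_List.txt
```

Setting `DRY_RUN=1` before calling `download_accessions` prints the commands instead of running them, which is a cheap way to check the accession list before submitting the job.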

dbGaP Data Download Example:

SRA contains a lot of protected data, and accessing it requires extra steps.

Step 1:

First, follow the link https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=dbgap_use to get a repository key, called an 'NGC' key file.

Step 2:

Second, once you've obtained the key file, you can continue to Step 3. For demonstration purposes, we simply download the test key from NCBI here:


    [scc1 ] cd dbGaP # cd to dbGaP folder
    [scc1 ] wget ftp://ftp.ncbi.nlm.nih.gov/sra/examples/decrypt_examples/prj_phs710EA_test.ngc

The above command downloads 'prj_phs710EA_test.ngc' into the dbGaP/ folder.

Step 3:

Supply the key file when working with dbGaP data. Here are some examples:


    [scc1 ] vdb-dump --info SRR1219902 --ngc prj_phs710EA_test.ngc  2>&1 | tee vdb-dump-dbGaP.out
    [scc1 ] more vdb-dump-dbGaP.out # as you can see, this download would be huge: 15Gx4=60G each for the output and the temporary files.
    [scc1 ] # we are not going to download the whole dataset, but rather use the sam-dump tool to download a portion of it
    [scc1 ] sam-dump --output-file SRR1219902.region.15_28196787-28197287.sam --ngc prj_phs710EA_test.ngc --aligned-region 15:28196787-28197287 SRR1219902

After the above command, there will be a SAM file in dbGaP/ named SRR1219902.region.15_28196787-28197287.sam.
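To check that the region extraction returned any data, one option is to count the alignment records in the resulting file. The helper below is a sketch with a hypothetical name, `count_sam_alignments`; it relies on the SAM convention that header lines start with '@'.

```shell
# Hypothetical helper: count alignment records in a SAM file.
# Header lines start with '@'; every other line is one alignment.
count_sam_alignments() {
    grep -vc '^@' "$1"
}

# Example call:
#   count_sam_alignments SRR1219902.region.15_28196787-28197287.sam
```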

Step 4: Extra instructions on how to change default download folders:

Because an SRA run usually contains a large amount of data, users often want to control where the data is downloaded. Here are brief instructions for doing so.


    [scc1 ] # set the sratoolkit user default path; the following changes the default path to the current working directory:
    [scc1 ] vdb-config --set /repository/user/default-path=$(pwd)
    [scc1 ] # set the sratoolkit download data root path:
    [scc1 ] vdb-config --set /repository/user/default-path=$(pwd)/SRR1219902
    [scc1 ] # double check that the change has been recorded
    [scc1 ] cat ~/.ncbi/user-settings.mkfg
    [scc1 ] # redo the sam-dump and, this time, check that refseq/ and sra/ were put in SRR1219902/ instead
    [scc1 ] sam-dump --output-file SRR1219902.region.15_28196787-28197287.2.sam --ngc prj_phs710EA_test.ngc --aligned-region 15:28196787-28197287 SRR1219902
    [scc1 ] find SRR1219902
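The settings file can contain entries unrelated to download paths. As a convenience, the sketch below filters it down to the /repository/user entries changed above. `show_repository_paths` is a hypothetical helper, and it assumes each setting is stored on its own line beginning with its key.

```shell
# Hypothetical helper: list the /repository/user entries in a settings file.
# Defaults to the file inspected with 'cat' above.
show_repository_paths() {
    settings=${1:-$HOME/.ncbi/user-settings.mkfg}
    grep '^/repository/user/' "$settings"
}

# Example call:
#   show_repository_paths
```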

SRA Toolkit links

Contact Information

Research Computing Services: help@scc.bu.edu

Note: RCS example programs are provided "as is" without any warranty of any kind. The user assumes the entire risk of quality, performance, and repair of any defect. You are welcome to copy and modify any of the given examples for your own use.