data - dedicated folder to store downloaded sequence data.
in - dedicated folder to store all input files.
in/PRJNA525241_SRR_Acc_List.txt - SRR accession list for BioProject PRJNA525241.
out - dedicated folder to organize output files.
qsub - dedicated folder to store all qsub scripts.
qsub/sra_job.qsub - batch script example
qlog - dedicated folder to store all qsub output logs.
dbGaP - dedicated folder with an example of how to download dbGaP data.
Notes
1. To view all available SRAToolkit versions on SCC, execute:
[scc1 ] module avail sratoolkit
2. Before running any SRA utilities, first load sratoolkit to make them available:
[scc1 ] module load sratoolkit/2.11.1
Commandline Examples:
Ex1: To check the space requirement for a download target:
[scc1 ] vdb-dump --info SRR030257 # this will show result on the screen
[scc1 ] vdb-dump --info SRR030257 --output-file out/vdb-dump_info.out # this will store output to file
* As a rule of thumb, the fasterq-dump guide suggests getting the size of the accession using 'vdb-dump', then estimating 4x for the output and 4x for the temp files. So the above sequence download will need 169M*4=676M for output, and another 676M for temporary files.
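The rule of thumb above can be sketched as a small shell helper. This is only an illustration: the function name estimate_space is hypothetical, and the accession size in MB is assumed to be read by hand from the vdb-dump output.

```shell
# Estimate disk space for fasterq-dump from the accession size in MB.
# The two 4x factors follow the rule of thumb quoted above.
estimate_space() {
    size_mb=$1
    output_mb=$((size_mb * 4))   # space for the final FASTQ output
    temp_mb=$((size_mb * 4))     # space for temporary files
    echo "output=${output_mb}M temp=${temp_mb}M total=$((output_mb + temp_mb))M"
}

estimate_space 169
# prints: output=676M temp=676M total=1352M
```

Comparing the total against the free space reported by 'df -h' for the scratch and data locations is one way to decide whether the download will fit.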
Ex2: To download data from the NCBI SRA repository, run the following command. It's a good habit to put the temporary folder in scratch and to store the downloaded sequence data in a dedicated data folder:
[scc1 ] rm -rf data # make sure data/ doesn't exist
[scc1 ] fasterq-dump -p -t /scratch/$USER --outdir data SRR030257 2>&1 | tee out/fasterq-dump.out
The above command prints its progress on screen and also records it in out/fasterq-dump.out.
And the downloaded sequence files can be found in data/:
[scc1 ] ls -l data
total 1487440
-rw-r--r-- 1 yshen16 scv 761565376 Oct 14 14:48 SRR030257_1.fastq
-rw-r--r-- 1 yshen16 scv 761565376 Oct 14 14:48 SRR030257_2.fastq
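As an optional sanity check on a paired download like the one listed above, the two files can be compared by read count. The sketch below assumes the standard four-lines-per-record FASTQ layout; the helper names count_reads and check_pairs are hypothetical and not part of the SRA Toolkit.

```shell
# Count reads in a FASTQ file, assuming 4 lines per record.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Compare the read counts of two mate files; return non-zero on mismatch.
check_pairs() {
    r1=$(count_reads "$1")
    r2=$(count_reads "$2")
    if [ "$r1" -eq "$r2" ]; then
        echo "OK: $r1 read pairs"
    else
        echo "MISMATCH: $r1 vs $r2 reads" >&2
        return 1
    fi
}

# Usage, with the paths from the listing above:
# check_pairs data/SRR030257_1.fastq data/SRR030257_2.fastq
```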
There are many more functions and utilities included in this toolkit to assist with sequence retrieval and processing. Please refer to the NCBI SRA Toolkit documentation website for more information.
Batch Job Example:
You can write a qsub script to download data for multiple accessions. An example qsub script is in qsub/sra_job.qsub:
[scc1 ] cd qsub # get into the qsub script location
[scc1 ] qsub sra_job.qsub
After the above job completes, there will be two files in qlog/ that look like:
qlog/sra_example.oxxxxxxx
qlog/sra_example.poxxxxxxx # 'xxxxxxx' stands for job number
And two folders will be generated in out/:
out/sra - stores the sra files for PRJNA525241
out/PRJNA525241 - stores the fastq files for PRJNA525241
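The contents of qsub/sra_job.qsub are not reproduced in this README. As a rough sketch of how a per-accession loop producing the two folders above might look, the following reads an accession list and calls prefetch and then fasterq-dump for each entry. The helper names run and download_list and the DRY_RUN switch are hypothetical; by default the commands are only printed, so the loop can be inspected without the toolkit loaded.

```shell
# Sketch of a per-accession download loop (not the actual sra_job.qsub).
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

download_list() {
    acc_list=$1     # file with one SRR accession per line
    sra_dir=$2      # where prefetch stores the downloaded sra files
    fastq_dir=$3    # where fasterq-dump writes the fastq files
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue    # skip blank lines
        run prefetch --output-directory "$sra_dir" "$acc"
        run fasterq-dump -t "/scratch/$USER" --outdir "$fastq_dir" "$sra_dir/$acc"
    done < "$acc_list"
}

# Usage, from the top-level example folder:
# download_list in/PRJNA525241_SRR_Acc_List.txt out/sra out/PRJNA525241
```

Fetching the sra file first and converting it afterwards keeps the network transfer separate from the conversion, so a failed conversion can be retried without downloading again.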
dbGaP Data Download Example:
SRA contains a lot of protected data that requires extra steps to access.
Second, once you have obtained the key file, you may continue with Step 3.
Here, for demo purposes, we just download the test key from NCBI:
[scc1 ] cd dbGaP # cd to dbGaP folder
[scc1 ] wget ftp://ftp.ncbi.nlm.nih.gov/sra/examples/decrypt_examples/prj_phs710EA_test.ngc
The above command will download 'prj_phs710EA_test.ngc' into the dbGaP/ folder.
Step 3:
Supply the key file when working with dbGaP data. Here are some examples:
[scc1 ] vdb-dump --info SRR1219902 --ngc prj_phs710EA_test.ngc 2>&1 | tee vdb-dump-dbGaP.out
[scc1 ] more vdb-dump-dbGaP.out # as you can see, the size of this download will be huge: 15Gx4=60G each for output and temporary files.
[scc1 ] # we are not going to download the whole dataset, but rather use the sam-dump tool to download a portion
[scc1 ] sam-dump --output-file SRR1219902.region.15_28196787-28197287.sam --ngc prj_phs710EA_test.ngc --aligned-region 15:28196787-28197287 SRR1219902
After the above command, there will be a SAM file in dbGaP/ named SRR1219902.region.15_28196787-28197287.sam.
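For a quick look at such a region extract, the sketch below shows two small helpers: one rebuilds the output file name used above from an accession and a region, and one counts alignment records by skipping SAM header lines, which start with '@'. Both function names are hypothetical and only illustrate the idea.

```shell
# Build the output file name from an accession and a region string;
# the ':' in the region is replaced by '_', as in the example above.
region_sam_name() {
    echo "$1.region.$(echo "$2" | tr ':' '_').sam"
}

# Count alignment records in a SAM file (header lines start with '@').
count_alignments() {
    awk '!/^@/ { n++ } END { print n + 0 }' "$1"
}

region_sam_name SRR1219902 15:28196787-28197287
# prints: SRR1219902.region.15_28196787-28197287.sam
```

For example, 'count_alignments SRR1219902.region.15_28196787-28197287.sam' reports how many alignments were written for the requested region.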
Step 4: Extra instructions on how to change default download folders:
Users often want to control where data is downloaded, because an SRA run usually contains a large amount of data. Here are brief instructions on how to do this.
[scc1 ] # set the sratoolkit user default path; the following changes the default path to the current working directory:
[scc1 ] vdb-config --set /repository/user/default-path=$(pwd)
[scc1 ] # set the sratoolkit download data root path:
[scc1 ] vdb-config --set /repository/user/default-path=$(pwd)/SRR1219902
[scc1 ] # double check that the changes have been recorded
[scc1 ] cat ~/.ncbi/user-settings.mkfg
[scc1 ] # redo the sam-dump, and this time check that refseq/ and sra/ were put in SRR1219902 instead
[scc1 ] sam-dump --output-file SRR1219902.region.15_28196787-28197287.2.sam --ngc prj_phs710EA_test.ngc --aligned-region 15:28196787-28197287 SRR1219902
[scc1 ] find SRR1219902
Note: RCS example programs are provided "as is" without any warranty of any kind. The user assumes the entire risk of quality, performance, and repair of any defect. You are welcome to copy and modify any of the given examples for your own use.