This is the first in a series of posts on NGS data analysis. I intend to cover read mapping to reference genomes, variant detection, genome assembly and, perhaps further down the road, RNA expression analysis.
Before all that, however, we’ll need some data to analyse. Where do you get it from? Well, it depends. If you’re working in a lab, you probably got the data from one of your own experiments: you sequenced a new organism, investigated RNA changes under certain conditions, or simply re-sequenced a known genome to look for new SNPs. Those of us who don’t have access to that kind of data on a “daily” basis must find a different solution. With the price of genome sequencing going down, it’s becoming easier and more affordable to sequence your own genome (more on that another time, though). Human genomes are pretty large, however, and for that reason not very convenient while learning about NGS analysis pipelines: many analysis steps would require a lot of time and computational resources. We’d prefer something smaller and more manageable. But where do we find such data?
The answer is NCBI. I’m sure most of you are familiar with it – in their own words:
The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information.
In short, it’s a gold mine of biological information and data. For our purposes, it lets us download NGS reads from thousands of experiments and organisms, uploaded by other researchers. More specifically, we’ll use the Sequence Read Archive, or SRA. Let’s dive right into it then!
Below, we’ll install NCBI’s SRA Toolkit (a set of tools for downloading and accessing data from the SRA), configure it, and download some data that we’ll analyse in one of the next tutorials.
Installation & configuration
First, download the SRA Toolkit archive for your OS:
macOS:
~$ curl -Lo sratoolkit.tar.gz http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-mac64.tar.gz
Ubuntu:
~$ curl -Lo sratoolkit.tar.gz http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz
Unpack it:
~$ tar -vxzf sratoolkit.tar.gz
Add the location to PATH, for example:
~$ export PATH=$PATH:~/sratoolkit.2.10.5-ubuntu64/bin
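Note that export only affects the current shell session. To make the change permanent, you can append the same line to your shell’s startup file – a minimal sketch, assuming bash and the same toolkit version and unpack location as above:

```shell
# Persist the PATH change across sessions (bash assumed; adjust the
# toolkit directory to match the version you actually unpacked)
echo 'export PATH=$PATH:~/sratoolkit.2.10.5-ubuntu64/bin' >> ~/.bashrc
source ~/.bashrc
```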
Start the configuration tool by running:
~$ vdb-config -i
- on the CACHE tab, set the location of the user repository – it should be an empty directory where your downloads will be stored
- on the AWS tab, check “report cloud instance identity”
- save and exit
Finally, test that everything works correctly:
~$ fastq-dump --stdout -X 2 SRR390728
As a result, you should see the reads shown below – your toolkit is now configured and ready.
Read 2 spots for SRR390728
Written 2 spots for SRR390728
@SRR390728.1 1 length=72
CATTCTTCACGTAGTTCTCGAGCCTTGGTTTTCAGCGATGGAGAATGACTTTGACAAGCTGAGAGAAGNTNC
+SRR390728.1 1 length=72
;;;;;;;;;;;;;;;;;;;;;;;;;;;9;;665142;;;;;;;;;;;;;;;;;;;;;;;;;;;;;96&&&&(
@SRR390728.2 2 length=72
AAGTAGGTCTCGTCTGTGTTTTCTACGAGCTTGTGTTCCAGCTGACCCACTCCCTGGGTGGGGGGACTGGGT
+SRR390728.2 2 length=72
;;;;;;;;;;;;;;;;;4;;;;3;393.1+4&&5&&;;;;;;;;;;;;;;;;;;;;;<9;<;;;;;464262
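Each read in the output above is a four-line FASTQ record: an @ header, the sequence, a + separator, and per-base quality scores. That means you can count the records in any uncompressed FASTQ file by dividing its line count by four – a minimal sketch, using the two reads shown above (sample.fastq is just a scratch file name):

```shell
# A FASTQ record is 4 lines: @header, sequence, + separator, qualities.
# Write the two reads shown above into a scratch file:
cat > sample.fastq <<'EOF'
@SRR390728.1 1 length=72
CATTCTTCACGTAGTTCTCGAGCCTTGGTTTTCAGCGATGGAGAATGACTTTGACAAGCTGAGAGAAGNTNC
+SRR390728.1 1 length=72
;;;;;;;;;;;;;;;;;;;;;;;;;;;9;;665142;;;;;;;;;;;;;;;;;;;;;;;;;;;;;96&&&&(
@SRR390728.2 2 length=72
AAGTAGGTCTCGTCTGTGTTTTCTACGAGCTTGTGTTCCAGCTGACCCACTCCCTGGGTGGGGGGACTGGGT
+SRR390728.2 2 length=72
;;;;;;;;;;;;;;;;;4;;;;3;393.1+4&&5&&;;;;;;;;;;;;;;;;;;;;;<9;<;;;;;464262
EOF
# Record count = line count / 4
echo "$(( $(wc -l < sample.fastq) / 4 )) records"
```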
Downloading sample data
After configuring the SRA Toolkit, we can proceed to downloading some data. In the next tutorials we’ll be working with the genome of Mycobacterium tuberculosis H37Ra so we’ll try to fetch some sequencing reads for that genome here.
First, we need to identify the accession number of the reads we want to download. To do that, go to SRA and select the Search tab.

Type Mycobacterium tuberculosis H37Ra in the search field and hit Search. In the search result table you should see how many datasets/studies were found in different categories.

Select the 9 public access SRA Experiments. You should now see a list of all those experiments:

Select the first result – “WGS of Mycobacterium tuberculosis H37Ra” (WGS stands for Whole Genome Sequencing). You should now be able to see details of that dataset: what instrument/technology was used (Illumina HiSeq 2500; I’ll probably devote a separate post to an overview of currently available technologies), read type (paired-end vs single-end), genome size (378.2M base pairs), run id (SRR6407486) and others. Go ahead and click on the run id in the small table at the bottom – you’ll be redirected to a page with additional information. There, on the Data access tab you can see various ways of accessing that dataset. As we’ll be using the SRA Toolkit, we don’t need to bother with that for now.
Copy the run accession number/id (SRR6407486). In your terminal, go to the location where you want to store/work with the data. Next, use the fasterq-dump tool to fetch the reads from SRA for the ID we found above:
~$ mkdir mtb_data && cd mtb_data
~/mtb_data$ fasterq-dump --split-files SRR6407486
The --split-files flag indicates that we want the tool to generate one file per read set. In our case, since we have paired-end reads, we should end up with two files. First, you should see the following output:
spots read : 1,890,901
reads read : 3,781,802
reads written : 3,781,802
If you now run the ls command in the same directory, you should see two files present:
SRR6407486_1.fastq
SRR6407486_2.fastq
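A quick sanity check for paired-end data: both files should contain exactly the same number of reads, with mates stored in matching order. A minimal sketch of that check, run here on tiny stand-in files (demo_1.fastq / demo_2.fastq are hypothetical; substitute SRR6407486_1.fastq and SRR6407486_2.fastq for the real data):

```shell
# Paired FASTQ files must have equal read counts, mates in matching order.
# Create two tiny stand-in files (2 read pairs) for demonstration:
printf '@r1/1\nACGTACGT\n+\nIIIIIIII\n@r2/1\nTTTTAAAA\n+\nIIIIIIII\n' > demo_1.fastq
printf '@r1/2\nTGCATGCA\n+\nIIIIIIII\n@r2/2\nGGGGCCCC\n+\nIIIIIIII\n' > demo_2.fastq
# 4 lines per read, so read count = line count / 4
n1=$(( $(wc -l < demo_1.fastq) / 4 ))
n2=$(( $(wc -l < demo_2.fastq) / 4 ))
[ "$n1" -eq "$n2" ] && echo "OK: files are consistent ($n1 read pairs)"
```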
Presto! Here are our paired-end reads for the genome of M. tuberculosis H37Ra. In the next post, we’ll look at assessing the quality of those reads – an important step before we continue to any analysis.
Resources
- Sequence Read Archive
- SRA Toolkit documentation
- More on SRA Toolkit installation & configuration
- fasterq-dump how-to

Hey! I’m a protein biochemist by training with more recent experience in software development. Here, I’ll be trying to combine a bit of both – let’s see where we end up with that 🙂