Where do I get the NGS data from?

This is the first in a series of posts on NGS data analysis. I intend to cover read mapping to reference genomes, variant detection, genome assembly and, perhaps in the more distant future, RNA expression analysis.

Before all that, however, we’ll need some data to analyse. Where do you get it from? Well, it depends. If you’re working in a lab, you probably got the data from one of your own experiments: you sequenced a new organism, you investigated RNA changes under certain conditions, or you simply re-sequenced a genome that is already known in order to find new SNPs. Those of us who do not have access to that kind of data on a “daily” basis must find a different solution. With the price of genome sequencing going down, it is becoming easier and more affordable to actually sequence your own genome (more on that another time though). Human genomes are pretty large, however, which makes them inconvenient for learning about NGS analysis pipelines: many analysis steps would need a lot of time and computational resources. We would prefer something smaller and more manageable. But where do we find such data?

The answer is NCBI. I’m sure most of you are familiar with it – in their own words:

The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information.

In short, it’s a gold mine of biological information and data. For our purposes, it lets us download NGS reads for thousands of experiments and organisms that were uploaded by other researchers. More specifically, we’ll use the Sequence Read Archive, or SRA. Let’s dive right into it then!

Below, we’ll install NCBI’s SRA Toolkit (a set of tools for downloading and accessing data from the SRA), configure it and download some data (we’ll use that data for analysis in one of the next tutorials).


Installation & configuration

First, download the SRA Toolkit installation file for your OS. Run only the line that matches your system: the first is for macOS, the second for Ubuntu Linux.

~$ curl -Lo sratoolkit.tar.gz http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-mac64.tar.gz
~$ curl -Lo sratoolkit.tar.gz http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz
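
If you would rather not pick the line by hand, the choice can be scripted. The sketch below is only an illustration: sratoolkit_url is a hypothetical helper name (it is not part of the SRA Toolkit), and it assumes that the ubuntu64 build is the right one for whatever Linux system you are on.

```shell
# Hypothetical helper: map the kernel name reported by `uname -s`
# to one of the two download URLs used in this post.
sratoolkit_url() {
    base="http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current"
    case "$1" in
        Darwin) echo "$base/sratoolkit.current-mac64.tar.gz" ;;
        # Assumption: the ubuntu64 build suits your Linux distribution.
        Linux)  echo "$base/sratoolkit.current-ubuntu64.tar.gz" ;;
        *)      echo "unsupported system: $1" >&2; return 1 ;;
    esac
}
```

With the function defined, the download becomes: curl -Lo sratoolkit.tar.gz "$(sratoolkit_url "$(uname -s)")"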

Unpack it:

~$ tar -vxzf sratoolkit.tar.gz

Add the location of the unpacked bin directory to PATH (adjust the directory name to the version you actually unpacked), for example:

~$ export PATH=$PATH:~/sratoolkit.2.10.5-ubuntu64/bin
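
Keep in mind that an export like this only lasts for the current terminal session. If you put it in your shell start-up file, or re-run it often, a small guard keeps PATH from collecting duplicates. This is a minimal sketch: add_to_path is a hypothetical helper name, not something the toolkit provides.

```shell
# Hypothetical helper: append a directory to PATH unless it is already there.
add_to_path() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    case ":$PATH:" in
        *":$dir:"*) ;;                        # already present, nothing to do
        *) PATH="$PATH:$dir"; export PATH ;;  # append and export
    esac
}
```

For the example above you would call add_to_path ~/sratoolkit.2.10.5-ubuntu64/bin and then confirm that the tools are found with command -v fastq-dump.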

Start the configuration tool by running:

~$ vdb-config -i
  • on the CACHE tab, set the location of the user repository – it should be an empty directory where your downloads will be stored
  • on the AWS tab, check “report cloud instance identity”
  • save and exit

Finally, test that everything works correctly:

~$ fastq-dump --stdout -X 2 SRR390728

As a result, you should see the reads shown below – your toolkit is now configured and ready.

Read 2 spots for SRR390728
Written 2 spots for SRR390728
@SRR390728.1 1 length=72
CATTCTTCACGTAGTTCTCGAGCCTTGGTTTTCAGCGATGGAGAATGACTTTGACAAGCTGAGAGAAGNTNC
+SRR390728.1 1 length=72
;;;;;;;;;;;;;;;;;;;;;;;;;;;9;;665142;;;;;;;;;;;;;;;;;;;;;;;;;;;;;96&&&&(
@SRR390728.2 2 length=72
AAGTAGGTCTCGTCTGTGTTTTCTACGAGCTTGTGTTCCAGCTGACCCACTCCCTGGGTGGGGGGACTGGGT
+SRR390728.2 2 length=72
;;;;;;;;;;;;;;;;;4;;;;3;393.1+4&&5&&;;;;;;;;;;;;;;;;;;;;;<9;<;;;;;464262
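
Each read in this output takes up four lines: a header starting with @, the sequence, a separator starting with +, and a quality string with one character per base. If you ever want to check that a FASTQ file follows this layout, something along these lines could work. It is a rough sketch: check_fastq is a hypothetical helper name, and it assumes plain four-line records with no wrapped sequence lines.

```shell
# Hypothetical helper: read FASTQ on stdin and fail if any record is malformed.
# Assumes each record is exactly four lines (no wrapped sequences).
check_fastq() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad = 1 }  # header line
        NR % 4 == 2 { seqlen = length($0) }                  # sequence line
        NR % 4 == 3 && substr($0, 1, 1) != "+" { bad = 1 }  # separator line
        NR % 4 == 0 && length($0) != seqlen { bad = 1 }      # quality line
        END {
            if (NR % 4 != 0) bad = 1                          # truncated record
            printf "%d records\n", NR / 4
            exit bad + 0
        }
    '
}
```

You would feed it a file on standard input, for example check_fastq < reads.fastq (reads.fastq being a placeholder name), and look at the exit status.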

Downloading sample data

After configuring the SRA Toolkit, we can proceed to downloading some data. In the next tutorials we’ll be working with the genome of Mycobacterium tuberculosis H37Ra so we’ll try to fetch some sequencing reads for that genome here.

First, we need to identify the accession number of the reads we want to download. To do that, go to SRA and select the Search tab.

Type Mycobacterium tuberculosis H37Ra in the search field and hit Search. In the search result table you should see how many datasets/studies were found in different categories.

Select the public access SRA Experiments (9 at the time of writing). You should now see a list of all those experiments.

Select the first result, “WGS of Mycobacterium tuberculosis H37Ra” (WGS stands for Whole Genome Sequencing). You should now be able to see details of that dataset: what instrument/technology was used (Illumina HiSeq 2500; I’ll probably devote a separate post to an overview of currently available technologies), read type (paired-end vs single-end), the total number of bases sequenced in the run (378.2M; this is the amount of sequence data, not the size of the genome), run id (SRR6407486) and others. Go ahead and click on the run id in the small table at the bottom: you’ll be redirected to a page with additional information. There, on the Data access tab, you can see various ways of accessing that dataset. As we’ll be using the SRA Toolkit, we don’t need to bother with that for now.
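
A side note on run ids: SRA run accessions consist of the prefix SRR, ERR or DRR (depending on which archive the run was submitted to) followed by digits. If you end up scripting your downloads, a quick format check could catch copy-paste slips before any tool is called. This is just a sketch, and is_run_accession is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only for strings shaped like an SRA run
# accession, i.e. SRR, ERR or DRR followed by one or more digits.
is_run_accession() {
    printf '%s\n' "$1" | grep -Eq '^[SED]RR[0-9]+$'
}
```

For example, is_run_accession SRR6407486 succeeds, while an experiment accession such as one starting with SRX is rejected. The check only looks at the format; it does not tell you whether the run actually exists.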

Copy the run accession (SRR6407486). In your terminal, go to the location where you want to store and work with the data. Next, use the fasterq-dump tool to fetch the reads from SRA for the accession we found above:

~$ mkdir mtb_data && cd mtb_data
~/mtb_data$ fasterq-dump --split-files SRR6407486 

The --split-files flag tells the tool to write the reads of each spot to separate files. Since we have paired-end reads, we should end up with two files. Once the tool finishes, you should see the following output:

spots read      : 1,890,901
reads read      : 3,781,802
reads written   : 3,781,802

If you now run the ls command in the same directory, you should see two files present:

SRR6407486_1.fastq 
SRR6407486_2.fastq
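
A quick sanity check at this point is to confirm that both files hold the same number of reads. Since every FASTQ record takes four lines, the read count is simply the line count divided by four. The snippet below is a minimal sketch under that assumption; count_reads is a hypothetical helper name.

```shell
# Hypothetical helper: number of reads in a FASTQ file,
# assuming exactly four lines per read.
count_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}
```

Running count_reads SRR6407486_1.fastq and count_reads SRR6407486_2.fastq should give the same number for both files. Given the 1,890,901 spots and 3,781,802 reads reported above, each file should contain 1,890,901 reads.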

Presto! Here are our paired-end reads for the genome of M. tuberculosis H37Ra. In the next post, we’ll look at assessing the quality of those reads – an important step before we continue to any analysis.
