Short read sequencing

The most important thing to understand about modern sequencing technology is that it produces short reads, typically between 50 to 150 base pairs. This is in contrast to Sanger sequencing, which could produce much longer reads (up to 1kb). And time goes on, technologies to read very long sequences are being developed (e.g. Pacific Biosciences) but they are not yet mainstream. So we'll focus on short reads.

The general idea behind short-read sequencing is that you take your long stretch of nucleic acids and chop it up:


These short fragments are easy to sequence in large batches. Peter will talk about some more details of the chemistry of short read sequencing, but it's important for now that you realize that what you get off the machine are "randomly" chopped up fragments of DNA.

What to do with short reads

The challenge with short read data is figuring out how it goes together. One option is de novo assembly. In this case, you chop up a bunch of copies of the same sequence and hope that the different fragments from different copies have some overlap, so you can figure out how to put them back together. We won't talk about this method today, although there are many approaches to de novo assembly. Many of these are based on a mathematical structure called a De Bruijn graph.

We will focus on aligning reads to a known reference genome. This means that there is an important requirement:


Now that we have that out of the way...

Note that the reference genome does not necessarily have to be a member of the same species, but it's helpful if the reference genome that you are mapping to is not too distantly related to our favorite organism.

Getting acquainted with the data

The data you get back from an Illumina sequencer comes in a format called fastq. Note the similarity to fasta: it's essentially a fasta file with a bit of extra information (specifically, the base quality). Each entry in a fastq file consists of four lines. Why four lines when it really makes sense to parse a line at a time? Who knows. An entry in a fastq file looks like this:
@HS2:202:D0YYKACXX:6:1106:16588:78868 1:N:0:GGCTAC
The first line (with the "@" symbol) gives some information about the sequencer and a unique identifier for the read. The second line gives the actual sequence read. The third line can be a little variable: sometimes, after the "+" you'll have the same thing that followed the "@" in the first line, but in other cases (like the example here), you might not have anything. The fourth line are the so-called quality scores for each of the bases, and they bear a bit more discussion.

Sequencing machines don't read the sequence of DNA the same way we read the words on a page. Instead, they actually look at hundreds of pictures, each with millions of little colored dots, and try to guess the nucleotide at that position in the read based on the color. But it's only a guess! Thus, they try to provide some measure of the certainty of the base call. They are given by a simple formula:

where p is the probability that a base is called incorrectly (this number is somewhat mysterious). Thus the quality score divided by 10 gives the order of magnitude of the error probability: Q = 20 means that there is a 10^-2 = .01 probability of the base being called incorrectly, Q = 30 means that there is a 10^-3 = .001 probability that the base is called incorrectly, etc.

But wait! The quality score line has letters, not numbers! The reason for this is simple: 30, for example, takes two characters, but if we want a 1-to-1 correspondence between quality and a nucleotide, this won't work. So they apply one final transformation: first they add 33 to the quality value and then translate that into an ASCII character. Thus, to translate "C" (the quality of the first nucleotide) into a quality score, look up on the table the "Dec" (or DECimal) corresponding to C, and subtract 33. Thus, since C corresponds to 67, and 67-33=24, and the first nucleotide is relatively untrustworthy.

A quick note on how to read fastq files in python. The normal "for line in file:" type of loop won't work here, because each entry corresponds to four lines. So perhaps a better structure is an while loop where you break at the end of the file:
while True:
     id1 = fastq_file.readline().strip()
     if id1=="":
     seq = fastq_file.readline().strip()
     id2 = fastq_file.readline().strip()
     qual = fastq_file.readline().strip()

Mapping by string matching

There are two things you need to start mapping to a reference genome:
  1. A bunch of short reads
  2. A reference genome
But what do you do with them? Think about what you would do if I gave you this reference sequence (in fasta format):
and this read:
and asked you to align the read to the reference by hand. The first thing I would do is simply start at the beginning and see if it fits there:
doesn't work, nor does
and so on, until I find that it fits somewhere:
We can implement this in python using the .find method for strings and a try-except loop. For example, if the whole genome is a single chromosome,
my_genome = read_genome_fasta("genome.fa")
     print my_read, my_genome.find(my_read)
except ValueError:
     print my_read, "unmapped"
However, this is a very labor intensive process! What if the sequence of the read had come from the very end of the chromosome? Clearly there must be a better way to do this.

Mapping by hashing

Remember how dictionaries can be used to look up the value associated with a particular key very quickly? We can use that to our advantage. Suppose that instead of checking through the entire genome every time, we instead put every k-mer (that is, sequence of k nucleotides) into a dictionary and for our read asked what position it had in that dictionary. But first we have to chose a good k to use. The first thought might be to make k equal to the length of your read; however, this is often unnecessary and dangerous, because later parts of the read might be more prone to errors. A good rule of thumb is to chose k so that 4^k is about equal to the genome size: this will make sure that most k-mers appear at most once in the genome.For example, if we have made a function called make_genome_dict that takes as arguments a genome and a k and returns a dictionary of all k-mers and their positions,
my_genome_dict = make_genome_dict(genome, 10)
if my_read in my_genome_dict:
     print my_read[0:10], my_genome_dict[my_read]
     print my_read, "unmapped"
which should print
so it found that our read maps to position 19 (remember: 0 offset!) in the reference genome.One problem here is that there could be some issues with polymorphism: what if the second position in our read at a C instead of a T?

Mapping by indexing (a.k.a. what you'll do in real life)

The way that most mainstream read-mapping programs work nowadays is to do something even fancier than mapping by hashing, but instead build an index of the genome in a special way. One of the most common ways is called a "Burrows-Wheeler index", and is used by BWA (Burrows-Wheeler Aligner) as well as BoWTie (Burrows-Wheeler Transform, capitalization mine). Instead of thinking about how to build an aligner that works by indexing, we'll instead focus on using one of these. For this course, we'll use Bowtie.

One of the most useful things is the Bowtie getting started page, which walks you through some basic usage. In short, there are 2 essential steps to mapping reads using Bowtie:
  1. Build an index of the reference genome
  2. Map the reads, using the index
To build the index, you need to use the program bowtie-build. This program takes very simple inputs:

$ bowtie-build path/to/reference.fasta path/to/index

The results of a bowtie-build command will be a series of files. If I am in a folder with the E. coli genome in a file called E_coli_genome.fasta, then I would build the index like so:

$ ls
$ head E_coli_genome.fasta
>F dna:plasmid plasmid:EB1_e_coli_k12:F:1:99159:1
$ bowtie-build E_coli_genome.fasta E_coli_index
<tons of output redacted>
$ ls
E_coli_genome.fasta E_coli_index.1.ebwt E_coli_index.2.ebwt E_coli_index.3.ebwt E_coli_index.4.ebwt E_coli_index.rev.1.ebwt E_coli_index.rev.2.ebwt

All of those files that end in .ebwt are the indices that bowtie built. These files are not human-readable.

Finally, you need to map the reads using the index. A standard usage of bowtie to map the reads is

$ bowtie -S --un path/for/unmapped_reads.fastq path/to/index path/to/reads.fastq > path/for/mapped_reads.sam

Breaking that down, there are a few things:
  • -S tells Bowtie to output the reads in SAM format, which is a standard format for mapped reads. We'll talk about it later
  • --un path/for/unmapped_reads.fastq tells Bowtie where to put reads that it fails to map. It will simply puke their fastq entries to the path that is given after --un
  • path/to/index tells Bowtie where to find the index that you built using bowtie-build
  • reads.fastq is the file containing all your reads
  • > path/for/mapped_reads.sam tells Bowtie where to put all the reads that it maps.
There are a few extra options: for example, if you are working on a server or a computer with multiple cores you can put "-p" followed by a number to tell Bowtie how many cores to use. If you just type "bowtie" on the command line you will get a list of options.

For example, in a directory with a bunch of E. coli reads (that happens to be one directory deeper than the folder in which I built my index),

$ ls
$ head E_coli_reads.fastq
@HS2:202:D0YYKACXX:6:1101:1158:2193 1:N:0:TTAGGC
@HS2:202:D0YYKACXX:6:1101:1264:2152 1:N:0:TTAGGC
@HS2:202:D0YYKACXX:6:1101:1464:2207 1:N:0:TTAGGC
$ bowtie -p 5 -S --un unmapped.fastq ../E_coli_index E_coli_reads.fastq > E_coli_mapped_reads.sam
# reads processed: 25000
# reads with at least one reported alignment: 24705 (98.82%)
# reads that failed to align: 295 (1.18%)
Reported 24705 alignments to 1 output stream(s)

Bowtie outputs some very important statistics: the fraction of aligned reads and the fraction of unaligned reads. Here we can see that we probably mapped the right reads to the right organism: almost 99% of reads aligned. Why did 1% of reads fail to align? It could be a number of reasons, including (but not limited to) extremely error-filled reads, contamination, or an excess of polymorphism compared to the reference.

Let's take a quick look inside the SAM file:

@HD VN:1.0 SO:unsorted
@SQ SN:F LN:99159
@SQ SN:Chromosome LN:4639675
@PG ID:Bowtie VN:0.12.7 CL:"bowtie -p 5 -S --un unmapped.fastq ../E_coli_index E_coli_reads.fastq"

The lines tagged "@SQ" give the names of the chromosomes and their lengths. In this case, we have two chromosomes: the F plasmid and the main E. coli chromosome. On lines 3 (0 offset!) and higher, we have the information we're really interested in: where the reads mapped.

The first field is the read name, from the "@" line of the fastq file. The second field is an arcane bitwise flag field, which I will explain shortly. The third field is the name of the chromosome to which the read mapped: in E. coli this is a kind of useless but in organisms with more than one chromosome this is a very important thing to know! Then comes the position on the chromosome. It's important to note that this is always the position relative to the first position in the reference fasta file. The next few fields aren't terribly important, until you get to the sequence of the read and then the quality scores, followed by another set of relatively unimportant fields. If you are interested, you can check out the full specification of the SAM file format.

A few words on the bitwise flag. Basically, it is a compact way to express a bunch of things that are either true or false about the read; for example, whether the read aligned only once or not, whether the read is on the + strand or not, etc. Since something that is true or false can be expressed as either 1 (if true) or 0 (if false) you can represent the status of a bunch of these kinds of things as a binary number.

Because this isn't a course on binary numbers, we'll focus on one of the important things, which is whether the read mapped to the plus strand or the minus strand. Basically, and simply, for a read that maps to the minus strand, then when you take the bit-wise AND of the flag for that read and the number 16, it should be greater than 0. Don't worry about what that means, but in python it is easy to implement. For example, if the flag is "16" for a given read, then
>>> 16&16
while if the flag is, say, 4
>>> 16&4
Because 0 is always False and a number greater than 0 is always True, you can use a simple if statement if you need to do different things to the plus and minus strand.
if flag & 16:
     #do stuff that you would do if the read mapped to the minus strand
     #do stuff that you would do if the read mapped to the plus strand.

Feel free to ask about what is actually going on here, but for this course you should only need to use it to this complexity.

And that is the basics of mapping reads!


1. Properties of reads
Before you even map any reads, it's important to understand some interesting properties about Illumina data. Instead of working on all the data (which could take a long time), download the data from the above link and extract it. Use that data for all of the following
  1. Compute the average quality score for each position in a read. Hint: you will need to use the function ord() to turn ASCII characters into decimal numbers
  2. Compute the average frequency of each of the 4 nucleotides, as well as the missing character "N" at each position in the read
Does what you see surprise you?

2. Mapping by string searching
Build a basic mapping tool that will search for each of the reads in the E. coli genome and output the position where they match the E. coli genome. You will need to
  1. Read in the genome in fasta format (you should be able to do this with a module from last week)
  2. Search each chromosome in the genome for the read

Running this program will take a very long time. To debug it, you may want to use an even smaller subset of the data than I provided, by using the head command, for example. When you are running it on the total dataset, I suggest running it in the background while you do the next problem.

When you run the program, make sure to use the time command on the terminal to see how long it takes, e.g.

$ time python

WARNING: make sure to search both the + and - strands!

3. Mapping by hashing
Build a basic mapping tool that will hash the E. coli genome into a dictionary (of arbitrary size k-mers) and then use that dictionary to map the reads to the E. coli genome. Using k = 11, time the program. How long did it take? How does it come to mapping by string searching? You will need to
  1. Read in the genome in fasta format
  2. Make a dictionary with every k-mer in the genome
  3. Search the dictionary for the first k nucleotides of every read

When you run this, time it. How fast is it compared to mapping by string searching?

WARNING: make sure to hash both the + and - strands!

4. Mapping by indexing
First, install bowtie. Do this by following a similar set of commands to that you did to install blast.

Using bowtie-build and bowtie, map the reads to the E. coli genome. How long did it take? Count the number of reads that mapped to the plus strand vs. the minus strand. Is it what you expect?

CHALLENGE: Download the linked set of reads (supposedly) coming from E. coli. What fraction of reads mapped? Do you think something went wrong? What could have gone wrong (hint: RNAseq experiments are prepared by humans...)? Figure out a way to test your hypothesis and test it!

Solutions (for the programming problems):

import sys
import collections
#open the fastq file
reads = open(sys.argv[1],"r")
#loop through the fastq file
initialized = 0
while True:
    #read the details of the reads
    id = reads.readline().strip()
    if id == "":
    seq = reads.readline().strip()
    id2 = reads.readline().strip()
    qual = reads.readline().strip()
    #learn the length of the reads
    if initialized == 0:
        quality = []
        nucs = []
        for i in range(len(seq)):
            initialized = 1
    #get the nucleotides and the quality scores
    for i, nuc in enumerate(seq):
        qual_score = ord(qual[i])-33
        quality[i] += qual_score
print "A\tC\tT\tG\tN\tqual"
for i in range(len(nucs)):
    total_count = sum(nucs[i].values())
    for nuc in ["A","C","G","T","N"]:
        print str(float(nucs[i][nuc])/total_count) + "\t",
    print float(quality[i])/total_count
import sys
complement = {"A": "T", "C": "G", "G":"C", "T":"A"}
def reverse_complement(seq):
    return(''.join(map(lambda x: complement[x], seq))[::-1])
def read_fasta(file):
    fasta_file = open(file,"r")
    sequences = {}
    for line in fasta_file:
        if line[0] == ">":
            cur_name = line.strip()[1:]
    for key in sequences:
        sequences[key] = ''.join(sequences[key])
    return sequences
fasta_file_path = sys.argv[1]
reads_path = sys.argv[2]
my_genome = read_fasta(fasta_file_path)
my_reverse_genome = {}
for chrom in my_genome:
    my_reverse_genome[chrom] = reverse_complement(my_genome[chrom])
 #   print my_reverse_genome[chrom][:10]
 #   print my_genome[chrom][-10:]
sys.stderr.write("Done reading genome!\n")
reads = open(reads_path,"r")
read_num = 0
while True:
    read_num += 1
    if read_num % 10000 == 0: sys.stderr.write('%i\n'%read_num)
    id1 = reads.readline().strip()
    if id1 == "":
    seq = reads.readline().strip()
    id2 = reads.readline().strip()
    qual = reads.readline().strip()
    #go over every chromosome and find if it's in there
    mapped = 0
    for chrom in my_genome:
        forward = my_genome[chrom].find(seq)
        if forward != -1:
            print seq, "+", chrom, forward
            mapped = 1
        reverse = my_reverse_genome[chrom].find(seq)
        if reverse != -1:
            print seq, "-", chrom, reverse
            mapped = 1
    if mapped == 0: print seq, ".", ".", "."


import sys
complement = {"A":"T","C":"G","G":"C","T":"A"}
def reverse_complement(seq):
    return(''.join(map(lambda x: complement[x], seq))[::-1])
def read_fasta(file):
    fasta_file = open(file,"r")
    sequences = {}
    for line in fasta_file:
        if line[0] == ">":
            cur_name = line.strip()[1:]
    for key in sequences:
        sequences[key] = ''.join(sequences[key])
    return sequences
def hash_genome(genome,k):
    genome_hash = {}
    #loop over chromosomes
    for chrom in genome:
        chrom_len = len(genome[chrom])
        genome_hash[chrom] = {}
        for i in range(chrom_len-k):
            if genome[chrom][i:(i+k)] in genome_hash:
    return genome_hash
fasta_file_path = sys.argv[1]
reads_path = sys.argv[2]
k = int(sys.argv[3])
my_genome = read_fasta(fasta_file_path)
my_reverse_genome = {}
for chrom in my_genome:
    my_reverse_genome[chrom] = reverse_complement(my_genome[chrom])
sys.stderr.write("Done reading genome!\n")
my_hashed_genome = hash_genome(my_genome,k)
my_hashed_reverse_genome = hash_genome(my_reverse_genome,k)
#for chrom in my_hashed_genome:
    #print "Yo"
    #print chrom, my_hashed_genome[chrom].keys()[:10]
    #print chrom, my_hashed_reverse_genome[chrom].keys()[:10]
sys.stderr.write("Done hashing genome!\n")
reads = open(reads_path,"r")
while True:
    id1 = reads.readline().strip()
    if id1 == "":
    seq = reads.readline().strip()
    id2 = reads.readline().strip()
    qual = reads.readline().strip()
    #go over every chromosome and find if it's in there...
    mapped = 0
    for chrom in my_hashed_genome:
        if seq[0:k] in my_hashed_genome[chrom]:
            print seq, "+", chrom, my_hashed_genome[chrom][seq[0:k]]
            mapped = 1
        elif seq[0:k] in my_hashed_reverse_genome[chrom]:
            print seq, "-", chrom, my_hashed_reverse_genome[chrom][seq[0:k]]
            mapped = 1
    if mapped == 0:  print seq, ".", ".", "."