RNA-Seq and the Tuxedo Suite


Let's say we're interested in the level of transcription of one or more genes. There are several technologies in common use today, so let's take a moment to talk about them, and which ones you might choose, and when:

Exquisitely sensitive
One gene at a time. Must know gene sequence. Reasonable to scale up to ~dozens of genes x ~ dozens of conditions
in situ hybridization
Preserves spatial localization
One gene at a time. Must know gene sequence. Small dynamic range, prone to hybridization artifacts
Expression Microarrays
Thousands of probes. Cheap: ~ $100/chip
Non-linear output. Must know sequence. Limited resolution. Relatively large input requirements
Can look at microRNA content, levels
Different bench approach from looking at genes. I don't know what software to point you to.
Sensitive, linear output at high resolution. No a priori genome necessary. Lots of data. Very tiny input requirements.
Relatively expensive (~$800-1500/lane + ~$60/sample). Lots of data!

I put down "Lots of data" as both a pro and a con of RNA-seq, since it's often hard to make sense of the piles and piles of data you get, but you know what makes that easier? Programming! Good thing we've spent a week and a half learning that!


Now, one of the major differences for our project is that prokaryotes don't reliably polyadenylate their transcripts, so instead we used the RiboZero kit to remove ribosomal RNA from the sample. It's not clear to me how it works (if you happen to know, I'd be curious!), so it may introduce some biases that we're not aware of. In addition to those, there are some commonly known biases to mRNA-seq data. For instance, GC rich fragments are going to be, on average, harder to amplify; there is some software that can make corrections to the final expression values you get based on the overall "mappability" of the sequence. In addition, the 5' ends of transcripts tend to be better represented; some of this is due to efficiency of the RT step, and is partially helped by the fragmentation, but it definitely seems like there's more to it than that. At any rate, most of the RNA-seq studies I've seen apply relatively minimal correction for these biases; they're something to be aware of, but it seems like correcting for them is something the Statistics departments of the world wishes the biologists would do more often than we do. On the other hand, if those bias correctors are easy to use, it probably makes sense to use them.

It's worth pointing out that there are two major (non-mutually exclusive) approaches for analyzing RNA-seq data. The first is for generating a transcript inventory: determining which parts of a genome can be expressed as RNA in exons. If you're working with a common laboratory species, then chances are there's already a pretty well annotated inventory, or will be soon (which you might be able to get access to). The second, and probably more common application is determining differential expression: which parts of a genome are expressed at different levels in different samples. That last sentence was intentionally vague about which parts: you can look at genes, transcript isoforms, exons, or even (conceivably) individual base pairs. Quantitating eukaryotic transcript isoforms is the hardest, since you have to first determine which isoform is represented when you have multiple isoforms that share one or more exons.

We've already talked about bowtie, which uses the Burroughs-Wheeler transform to do mapping really quickly. Now we're going to use a couple extra tools for doing RNA-seq expression analysis named tophat and cufflinks. Because of the somewhat whimsical names (which is not uncommon in software), the whole collection of tools is sometimes called the Tuxedo suite. Bowtie is slightly faster than bwa, and is aware of the quality scores from the bases, but doesn't do gapped alignment. If you're working with a strain fairly close to the reference strain for your species, you will likely prefer bowtie, but there is no hard and fast rule.

Before we get into the swing of things, let's take a few moments here to install the programs we'll need for the day:

$ cd ~/PythonCourse/ProgramFiles/
$ unzip bowtie-0.12.8-macos-10.5-x86_64.zip
$ mv bowtie-0.12.8 ~/bin

The default bowtie download comes with documentation, tutorials, etc, but cufflinks and tophat just have the executables and a few READMEs, so it's probably simplest just to put those in with the bowtie files:

$ tar -xvzf cufflinks-2.0.1.OSX_x86_64.tar.gz
$ mv cufflinks-2.0.1.OSX_x86_64/* ~/bin
$ tar -xvzf tophat-2.0.4.OSX_x86_64.tar.gz
$ mv tophat-2.0.4.OSX_x86_64/* ~/bin

Note that this does overwrite the README, LICENSE, and AUTHORS files, but it's likely more convenient just to use the relevant website (bowtie, tophat, and cufflinks). We'll also want to add in a line in our .bash_profile file, similar to when we installed :

export PATH=$PATH:~/bin

We'll also need BedTools, which I forgot to include in your installer packages:

Download from here

$ tar -xvzf BEDTools.v2.16.2.tar.gz
$ cd BEDTools-Version-2.16.2/
$ make
$ cp bin/* ~/bin/

And one more tool: bedGraphToBigWig, which you can get here http://hgdownload.cse.ucsc.edu/admin/exe/ (following the appropriate links for your OS).

$ cp ~/path/to/bedGraphToBigWig ~/bin/
$ chmod u+x ~/bin/bedGraphToBigWig

Where ~/path/to/bedGraphToBigWig may be something like ~/Downloads/bedGraphToBigWig. I don't know where your stuff gets downloaded to.

Finally, choose one of the index files from the same directory as the installer packages. What we'll do is have everyone work through one of the data sets, then combine the results. We can even make this study blinded... I know what all the samples are, but hopefully once we actually do the analysis, it should be pretty clear which one is which.

And now we're off to the races!


Let's say you're mapping things to the genome, and you find that suddenly, in the middle of a read, you jump several kb in the genome. If your read was of genomic DNA, this would probably tell you that the genome is pretty messed up. On the other hand, if you're working with RNA-seq data, probably all it means is that you've got a eukaryote and you hit a splice junction. Tophat is a read mapper that is aware of splice junctions, and is aware of lots of other subtleties that we might see in RNA-seq data. It actually uses bowtie to do the mapping, but it wraps the output nicely so that we can just plug it into the next step in the pipeline, cufflinks.

To find out how tophat takes its options, we can run it without any arguments:

$ tophat

which will give us a long list of data on how to use it. If we scroll up to the top, however, we see this message:

TopHat maps short sequences from spliced transcripts to whole genomes.
    tophat [options] <bowtie_index> <reads1[,reads2,...]> [reads1[,reads2,...]] \
                                    [quals1,[quals2,...]] [quals1[,quals2,...]]

So, like bowtie, we can call it with a relatively few number of required things (in the greater than and less than symbols), and one of about a bazillion options, which there's some detail on below the main usage.

Sometimes using scientific software feels like this...

A typical tophat run looks something like this:

$ tophat -o reads12 dmel-4-chromosome-r5.38 reads_1_1,reads_2_1 reads_1_2,reads_2_2

Let's break this down. We want tophat to:

  • put output in the folder called reads12, which it will make if it doesn't already exist. This is optional, and will default to tophat_out
  • We're using the index file dmel-r5_38
  • We've got four data files: reads1_1, reads_2_1, reads_1_2, and reads_2_2. In this example, we're using paired-end reads, although for our project, we only sequenced one end of the read, so there's no equivalent to reads_1_2 and reads_2_2.

For those of you clever enough to be working with well-studied model organisms, there's another option you probably want to keep in mind:

$ tophat --no-novel-juncs -G ~/PythonCourse/data/dmel-4-transcripts-r5.38.gtf -o reads12 dmel-r5_38 reads_1_1,reads_2_1 reads_1_2,reads_2_2

This tells tophat that you only want to use already annotated gene models. In principle, this should make it run ever-so-slightly faster, and you'll have better confidence in your data.

Tophat produces a few different files of output. For our purposes, the most important of these is accepted_hits.bam, which is a list of read alignments. It's in the binary version of the SAM format, which is not really readable by hand. You can use samtools to convert these to human readable format if you want, although that does take up quite a bit more space.

For our RNA-seq data, we have something like 7-10 different files for each index that we want to run tophat on. Now, it's doable to take each one and manually put a comma between them, but exercise 1 today will be to generate the command line string that actually runs the tophat.


It's all great to have the reads aligned to the genome, but there's still a few things we need to be aware of. First off, there's thousands of genes in the genome, and we would like to have a single number for each one. That's still a lot of data points, but it becomes tractable to deal with that number. The community has settled on RPKM (Reads per Kilobase per Million Reads) and FPKM (the paired-end equivalent, where one pair of reads is one fragment). This corrects for a few of the most obvious biases one might encounter in RNAseq. By dividing by the length of the gene, you correct for a longer transcript producing more sequencing reads than a shorter one; and by dividing by the number of reads, you correct for different sequencing runs having different numbers of reads, but potentially even being derived from the same library.

Let's furthermore say that you are interested in whether differential splicing is happening in your sample. This is an even harder problem, since you can't just look at the RPKM for each fragment and compare those: what if you had alternate exons?

Add in the fact that there are a number of biases present in the library preparation process, and you almost start to wonder why you thought RNAseq was a good idea in the first place. At least, you do until someone points out cufflinks, the last major tool in the Tuxedo suite. This builds on the last decade of bioinformatic research to produce a fast, easy-to-use tool that pretty much just works to estimate FPKM values for data.

If we run cufflinks without any arguments it gives us a long list of options:

$ cufflinks

but the most important line is near the beginning:

Usage: cufflinks [options] <hits.sam>

For our purposes, a couple of the options we'll care most about are:

-o/--output-dir write all output files to this directory [ default: ./ ]
-p/--num-threads number of threads used during analysis [ default: 1 ]
-G/--GTF quantitate against reference transcript annotations

Cufflinks will then generate three files:
  • transcripts.gtf
  • isoforms.fpkm_tracking
  • genes.fpkm_tracking

Each of these is human readable, and easily parsable. The Cufflinks manual has more details on what each of the columns are.

Auxiliary Cufflinks Programs

In addition to quantifying the expression levels of each gene and transcript (also called an isoform), cufflinks comes with some software to easily tell you what's different between samples. The program cuffcompare generates statistics on sensitivity (the fraction of true positives detected) and specificity (the fraction of true negatives detected), as well as merging the GTF files that cufflinks generates. It's especially important if you don't already have a good reference annotation.

$ cuffcompare

Perhaps the most important tool you'll want to know about is cuffdiff, which does some sophisticated statistical tests to tell you about significant changes in expression, splicing, and promoter usage. It will give you a list of yes/no "is this difference statistically significant" calls, one for each gene, transcript, and isoform.

Again, we can look at the run-time options by simply running it without any arguments:

$ cuffdiff

BedTools and BigWigs

One of the things that we'll want in the next couple days is a listing of the coverage for each base pair in the genome. This is pretty straightforward to generate, as long as we have the right tools installed. For us, the right tool is in the BedTools package, which will take the BAM file that tophat gives us, and makes a BedGraph.

A BedGraph file is a 4-column text file that has the chromosome, the start and stop positions, and then the coverage at that position. To make that file, we use:

$ genomeCoverageBed -bga -ibam accepted_hits.bam > cov.bed

from within the folder that tophat wrote the accepted hits into.

One more change we're going to make to this is to turn it into a BigWig file. BigWig is a non-human readable format, but it has the advantage that it's really fast for computers to look into, so if you have the right software, you can ask for the coverage at an arbitrary position, and don't have to stupidly read every line in the file in order to find the one line you're looking for.

To do that, we need a file that has the size of the chromosomes, which I've called chroms.sizes:

$ cat chroms.sizes
Chromosome    4639675
F    99159

Note that those are tabs between the name and the size.

$ bedGraphToBigWig cov.bed chroms.sizes cov.bw


For the RNA-seq lectures later on this week, we'll actually work through some analyses on Bicyclomycin treated E. coli reads, which we generated for the course. What we'll do is have everyone work on one of the indices, then have everyone combine their data once it's fairly well processed. Count off by six to determine which index to download:

  1. index 3
  2. index 4
  3. index 5
  4. index 6
  5. index 8
  6. index 11

Exercise 1

As promised in the lecture, we have about 10 different files whose names are all very nearly the same, that we need to stitch together into a command string. Using either the function listdir in the os module or the glob function in the glob module, print a command string that you can then feed into the subprocessing module (See lecture 6.2) the to run tophat.

You should make sure that it includes all fastq files with a given index, and only those with a given index. You can either take this index in from the command line, or hard code it into a variable.

Your command string should at least include the correct bowtie index, but also can include things like an output directory, and the --no-novel-juncs flag. If there's another flag you think might be helpful, feel free to ask and I can (hopefully) give some guidance.

Exercise 2

Run the command string you generated in exercise 1! This may take a while, so you may want to make sure it works before you go home, then leave your computer running overnight. As a general rule, if tophat makes it to the "mapping left_kept_reads" stage, it should be able to make it all the way through (possibly even "preparing reads" is fine). More often, it will simply fail right away.

Exercise 3

Here's another thing we'll need for the analysis: Using the operons file we generated from the Arkin lab data, let's figure out where all of the 3' UTRs are. This sounds simple, but if one operon comes right after another one, there's not much in the way of UTR available.

For this exercise, make another GFF file that has entries for operons, then entries for up to 300 bases of the 3' UTRs for each operon. The 3' UTRs should not overlap any operon, and any UTR that's less than 50 bp should not have an entry. The UTR should, of course, have the same strand as the operon it's downstream of.


Exercise 1

import os
import sys
import subprocess as sp
allfiles = os.listdir('../sequence/')
procs = []
#for index in ['index3', 'index4', 'index5', 'index6', 'index8', 'index11']:
for index in sys.argv[1:]:
    seqfiles = []
    for filename in allfiles:
        if index in filename and filename.endswith('.fastq'):
            seqfiles.append(os.path.join('..','sequence', filename))
    tophat_command = ['tophat',
                      '--output-dir', index,
                      'ensembl_ecoli', # The bowtie index name
                      ','.join(seqfiles) # Not sure who thought comma separated
                                         # list of files was a good idea...
for proc in procs:

Exercise 3

gff_fname = 'operons2.gff'
out_fname = 'features.gff'
def annotate_genome(gff_fname):
    gff_file = open(gff_fname)
    empty = ''
    annotation = []
    for line in gff_file:
        data = line.strip().split('\t')
        start = int(data[3])
        stop = int(data[4])
        desc = data[-1].split('"')
        operon_name = desc[desc.index('operonName ') + 1]
        while len(annotation) < start:
        for pos in range(start, stop):
            if len(annotation) + 1 == pos:
                print "BE A LERT! Your country needs more lerts", annotation[pos]
    return annotation
genome = annotate_genome(gff_fname)
GFF_base = '\t'.join(['Chromosome', 'MicrobesOnline', 'downstream_utr',
                      '%d', # low end position
                      '%d', # high end position
                      '.', '%s', # strand
                      '.', '%s utr_len %d;\n'])
out_file = open(out_fname, 'w')
for line in open(gff_fname):
    data = line.strip().split('\t')
    start = int(data[3])
    stop = int(data[4])
    strand = data[6]
    desc = data[-1]
    for i in range(300):
        if strand == '+':
            lo_pos = stop
            hi_pos = stop + i
            pos = hi_pos
        elif strand == '-':
            lo_pos = start - i - 1
            hi_pos = start
            pos = lo_pos
            assert False
        if lo_pos < 0 or hi_pos >= len(genome):
        if genome[pos]:
        genome[pos] = 'UTR'
    if i > 50:
        GFF_line = GFF_base % (lo_pos, hi_pos, strand, desc, i+1)