In Depth RNA-seq analysis

On Tuesday, we left off having all the tools we'd need to do the RNA-seq, and asked you to map the reads for your respective samples. Now we're going to combine our data, and actually try to draw some conclusions about what actually happens when we inhibit Rho.

First, though, let's check the results of tophat. Whenever I have lab meeting with new sequencing data, I always have a standard table format, where I have
  • # of reads from each sample
  • # of mapped reads from each sample (and % of reads that mapped)
  • Some estimate of the library complexity (historically, # of unique reads, as called by samtools)

In a standard RNA-seq experiment, we'd want to get the FPKM values for each of the genes, and we can certainly go ahead and do that. We need to use a GTF file, that's very specific about how things are laid out. For E coli, we could use the Ensembl annotation. That run would look something like:

$ cufflinks E_coli -G E_coli_k12.ASM584v1.14.gtf index3/accepted_hits.bam


$ cuffdiff -G E_coli_k12.ASM584v1.14.gtf index3/accepted_hits.bam index4/accepted_hits.bam index5/accepted_hits.bam ...

We care about where transcripts are mapping at a finer level than genes

Often-times in looking at RNA-seq data, we're interested in which genes are being up- and down-regulated in various conditions. In this case, though, we also have very specific questions that we'll need to know the coverage at the highest level available: the single base pair. So what we'll do is take the aligned reads and generate a BigWig file that, for every base pair in the genome, will tell us how many reads fell across that base pair.

$ genomeCoverageBed -bg -ibam accepted_hits.bam > cov.bed
$ bedGraphToBigWig cov.bed chroms.sizes

If you have a genome viewer (like IGB, but not covered in this course), we could load that up and look at the actual transcription levels as a function of location. Unless you have a particular place you're looking, though, this is going to just be a lot of information and no meaning (hence why I'm not covering a genome browser).

We can also look at the BedGraph file itself, and it's pretty straightforward.

Rho terminates transcription at the 3' ends of genes

Maybe the first question that we want to ask ourselves is, "Do we see the effect we'd expect to see? Does inhibiting Rho lead to read-through on the 3' ends of genes?" Our strategy, then, is the following. For each gene:

  • Calculate the number of reads that lie upstream of the gene (but not within another gene)
  • Calculate the number of reads that fall across the gene
  • Calculate the number of reads that fall downstream of the gene (again, not within another gene).

As a very first approach, we can average all of the genes from a given sample. We don't anticipate that the total amount of gene transcription per cell will be radically different in the different conditions, so if we take the read mass across the gene as constant, then we can compare different samples to each other.

Now the really nice thing is that we can use our pre-calculated BigWig file to make the lookups for each region a lot faster than if we had to look through the BAM file read by read. Even better, there's a python module already designed for reading BigWig files: bx-python. Installing it is really Easy. From your command line, just type:

$ easy_install bx-python

easy_install is a built-in tool for looking through known depositories of Python modules and installing everything they need. As long as we're installing things, let's also grab the progressbar module, which makes really fancy progress bars stupid simple:

$ easy_install progressbar

Now, what we want to do is go through every entry in our GFF file and pull out the relevant regions of the chromosome. Now, the GFF file we wrote in the exercises a couple days ago only had the 3' (downstream) UTRs, and didn't necessarily do a good job about dealing with genes that were really close to each other. I've made a modified GFF file that has both the upstream and downstream UTRs for each operon (again, assuming that each must be greater than 50 bp, can be a maximum of 300 bp, and with the new assumption that when genes are close together, the two UTRs each go to halfway between them).

import progressbar
from argparse import ArgumentParser, FileType
import bx.bbi.bigwig_file as bigwig
import numpy as np
def parse_args():
    parser = ArgumentParser()
    parser.add_argument('-d', '--downstream', type=int, default=300)
    parser.add_argument('-u', '--upstream', type=int, default=300)
    parser.add_argument('-r', '--resolution', type=int, default=200)
    parser.add_argument('-g', '--gff-filename',
    parser.add_argument('-t', '--use-non-terminated', dest='use_terminated',
    parser.add_argument('-T', '--use-terminated', dest='use_terminated',
    parser.add_argument('-G', '--terminated-genes', default=False)
    parser.add_argument('-S', '--genome-size', type=int, default=4639675)
    parser.add_argument('bigwig', type=FileType('rb'))
    global args
    args = parser.parse_args()
    if args.terminated_genes:
        terminated_genes = set()
        for line in open(args.terminated_genes):
        args.terminated_genes = terminated_genes
    return args
if __name__ == "__main__":
    args = parse_args()
    reads = bigwig.BigWigFile(args.bigwig)
    # number of reads (calculated with pileup)
    upstream = np.zeros(args.upstream)
    # number of genes used at each position
    upstream_n = np.zeros(args.upstream) + 1e-10
    downstream = np.zeros(args.downstream)
    downstream_n = np.zeros(args.downstream) + 1e-10
    gene_cov = np.zeros(args.resolution)
    gene_cov_n = 0
    gff_data = open(args.gff_filename).readlines()
    pbar = progressbar.ProgressBar(widgets=['Pileups: ',
                                           progressbar.Percentage(), ' ',
                                           progressbar.Bar(), ' ',
    for entry in pbar(gff_data):
        data = entry.strip().split('\t')
        chrom = data[0] # should always be 'Chromosome'
        kind = data[2]
        start = int(data[3])
        stop = int(data[4])
        strand = data[6]
        if start == 4275492:
        # soxR-1 has sraL with an overlapping region with sraL.  I'm a little
        # surprised this seems to be the only problem...
        curr_cov = reads.get_as_array(chrom, start, stop)
        size = len(curr_cov)
        if strand == '-':
            curr_cov = curr_cov[::-1]  # Reverse if on the minus strand
        if kind == 'operon':
            gene_cov_n += 1
            last_pct = 0
            for pos, n in enumerate(curr_cov):
                hi = int(np.floor(args.resolution * pos / size))
                if last_pct == hi:
                    gene_cov[hi] += n
                    gene_cov[last_pct:hi] += n/(hi - last_pct)
                last_pct = hi
        elif kind == 'dUTR':
            downstream[:size] += curr_cov
            downstream_n[:size] += 1
        elif kind == 'uUTR':
            upstream[-size:] += curr_cov
            upstream_n[-size:] += 1
            raise ValueError("Unrecognized feature type '%s'" % kind)
    normval = np.mean(gene_cov / gene_cov_n)
    print np.mean(upstream/upstream_n) / normval
    print np.mean(downstream/downstream_n) / normval

A couple notes on funny looking things I'm doing here:

  • I reverse everything on the minus strand so that I can just treat everything as normal. Because we're not looking at any sequence, we don't need to worry about the complement
  • I add 1x10^-10 to each position in the upstream and downstream regions to prevent divide-by-zero errors.
  • While adding in the coverage to the up- and downstream-UTRs works pretty straightforwardly, the operon area is much more difficult. We have to start at the beginning, and then work our way through so that each operon is rescaled to an "average" size.

Rho terminates transcription, but only at some genes

The next thing we'd expect is that some genes are Rho-dependent, and will have lots more transcription in their 3' UTRs in the medium and high BCM cases, whereas others will be independent, and won't change much at all.

One thing we can check is that in genes that have a terminator hairpin sequence, the response to Rho is much less pronounced than in those without. Of course, it's probably too much to expect that the every gene with a terminator is rho independent, and every gene without one is dependent, but by and large, that trend should hold true.

The easiest way I can think to do this is to go through each gene, and if depending on whether it's got a terminator, include or exclude it from the analysis in that particular run. Then, run again and switch whether we're keeping or tossing the genes. Fortunately, on the very first day of this course, we took a messy data file that we'd gotten from the program GeSTer (Genome Scanner for Terminators) and turned it into a simple list of genes. It's really simple to read that in as either a list, or as a set, which is like a list, but is guaranteed to have each entry only once, no matter how many times we add it.

In fact, I already set up the parse_args() function to do that, if we give terminated genes as an option. So now, it's as simple as checking whether or not any of the genes in the operon are terminated (ideally, we could check to see if only the last gene is terminated, but that's a bit of fiddliness that's not worth writing in for this class, and in practice it seems to be uncommon that there's a terminator in the middle of an operon).

We now add in this block of code to the loop:

        desc = data[8]
        operon_name = desc.split('"')[1]
        gene_names = operon_name.split('-')
        any = False
        for gene in gene_names:
            if gene in args.terminated_genes:
                any = True
        if args.terminated_genes and args.use_terminated and (not any):
        elif args.terminated_genes and (not args.use_terminated) and any:

Another thing we might check is to see how much blocking Rho increases (or, potentially decreases) the number of transcripts in the 3' UTR.

By now, we've compared the numbers and we know that indices * and * are the no-drug conditions. If you have one of these indices for your data, download the other one from the S3 website. Otherwise, download your choice. What we'll do now is figure the ratios of the expression in (your sample) to the no-drug sample. Later on, we can make a histogram of the ratios for different samples, but for now we can just explore manually.

import bx.bbi.bigwig_file as bigwig
import progressbar
import argparse
from Bio import SeqIO
import numpy as np
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--gff-fname', '-G')
    parser.add_argument('--no-drug-fname', '-0')
    parser.add_argument('--drug-fname', '-d')
    parser.add_argument('--genome', '-g', type=open, default=None)
    parser.add_argument('--outfh', '-o', type=argparse.FileType('w'))
    parser.add_argument('--ratio-high', '-R', type=float, default=1e100)
    parser.add_argument('--ratio_low', '-r', type=float, default=10.0)
    parser.add_argument('--keep-neg', dest='keep_pos', default=1,
                        action='store_const', const=-1)
    args = parser.parse_args()
    if args.genome:
        args.Genome = { r
                      for r in SeqIO.parse(args.genome, 'fasta')}
    return args
if __name__ == "__main__":
    args = parse_args()
    no_drug_reads = bigwig.BigWigFile(open(args.no_drug_fname))
    drug_reads = bigwig.BigWigFile(open(args.drug_fname))
    gff_data = open(args.gff_fname).readlines()
    all_ratios = {}
    outputted_data = []
    pbar = progressbar.ProgressBar(widgets=['Pileups: ',
                                           progressbar.Percentage(), ' ',
                                           progressbar.Bar(), ' ',
    for line in pbar(gff_data):
        data = line.strip().split('\t')
        chrom = data[0]
        kind = data[2]
        start = int(data[3])
        stop = int(data[4])
        strand = data[6]
        operon_name = data[-1].split('"')[1]
        if kind != 'dUTR':
        drug_pileup = drug_reads.get_as_array(chrom, start, stop)
        no_drug_pileup = no_drug_reads.get_as_array(chrom, start, stop)
        drug_counts = sum(drug_pileup)
        no_drug_counts = sum(no_drug_pileup)
        if no_drug_counts:
            all_ratios[operon_name] = drug_counts / no_drug_counts
            if (args.ratio_low < all_ratios[operon_name] < args.ratio_high
                and args.outfh):
                seq = args.Genome[chrom][start:stop]
                if strand == '-':
                    seq = seq.reverse_complement()
       = operon_name
                if operon_name not in outputted_data:
                    seq.description = '%d-%d' % (start, stop)
                    SeqIO.write(seq, args.outfh, 'fasta')
print "Mean UTR ratio (drug/nodrug):", np.mean([r for r in all_ratios.values()])
print "+/-", np.std([r for r in all_ratios.values()])
if args.outfh:

Inhibiting Rho has downstream effects on the expression levels of genes

In addition to the immediate effects of inhibiting termination, it's likely that the cell tries to respond to the presence of the drug in lots of complicated ways. The 15 minute time point is almost a full cell cycle later, and the 60 minute time point is 3 cell cycles later...

The standard, first-pass approach for looking at how transcriptomes are changing is to do hierarchical clustering.

In general, what you'll want to do is look to see if there's anything you can say about genes whose expression tends to change in the same direction under similar conditions. There was even a time when simply saying "these things change together" was good enough for a paper, but now we'd really like to connect it to some kind of underlying biology. Sometimes, using a GO term finder will help you pull out associations that you wouldn't have guessed from first principles, which you can then follow up on.

--break for brief clustering demo--

The software I've used for this is:

Cluster 3.0
Java TreeView


Exercise 1: Sequence histograms

Using the spacing-finder program from yesterday, see if you can determine any differences in the pyrimidine spacing of 3' UTRs of Rho-dependent and Rho-independent genes. How would you try to test whether these are significant?

Exercise 2: Venn diagrams

Download the rest of the sample bigwig files from the server. Using the UTR_compare program I showed in class, generate a list of operons in each condition that have higher transcription. Make a Venn diagram (or diagrams) for these sets. How many of the operons in the high drug cases are represented in the medium-drug, and vice-versa? Does this give you more or less confidence in the results?

As a hint, the set() datatype will likely be helpful here, since you can calculate the intersections and differences of two lists at a time...

Exercise 3: Anything else?

If you've got an idea for something that might be interesting to try, come talk to us, and see if we can offer suggestions on how to implement them...


Exercise 1:

I can't seem to find any differences. My run history looks like this:

run --gff-fname features.gff -0 -d -o 0_drug_ratios_hi.fa --genome ../S7.2/ecoli.fasta
run --gff-fname features.gff -0 -d -o 20_drug_ratios_hi.fa --genome ../S7.2/ecoli.fasta
run --gff-fname features.gff -0 -d -o 100_drug_ratios_hi.fa --genome ../S7.2/ecoli.fasta
run --gff-fname features.gff -0 -d -o 100_drug_ratios_lo.fa --genome ../S7.2/ecoli.fasta -r 0 -R 10
run 100_drug_ratios_lo.fa
run 100_drug_ratios_hi.fa

If you found any, how would you verify that it's real? The standard way to do it is to take some sort of "shuffled" sequence, where you take your sequences, shuffle them up a bunch of times, and see how many of those shuffled sequence give results as big. This is a procedure known as bootstrapping, because it's almost as hard as pulling yourself up by your own bootstraps...

Exercise 2

run --gff-fname features.gff -0 -d
hidrug = {key for key in all_ratios if all_ratios[key] > 10}
run --gff-fname features.gff -0 -d
lodrug = {key for key in all_ratios if all_ratios[key] > 10}
run --gff-fname features.gff -0 -d
nodrug = {key for key in all_ratios if all_ratios[key] > 10}
print "Hi drug", len(hidrug)
print "Hi ^ lo", len(hidrug.intersection(lodrug))
print "Hi ^ no", len(hidrug.intersection(nodrug))
print "lo drug", len(lodrug)
print "lo ^ no", len(lodrug.intersection(nodrug))
print "no drug", len(nodrug)