No model, no inference

Why do we need to talk about statistics in a class on sequencing? We've already seen at least one place where statistical modeling might be useful: finding sites where a sample differs from the reference sequence. So we know that statistical analysis can improve some of the inference we make from a sequencing experiment. In another sense, statistics are important because when you are doing a statistical test, or fitting a distribution, you are forced to make clear, explicit assumptions about the data and how they were generated. This is the heart of the phrase "No model, no inference", which basically says that in order to know what your data are saying, you have to have some kind of idea of how they behave. And statistical inference in the key.

Modern sequencing experiments also produce an extremely large amount of data. It would be quite challenging to go through all of that data by hand to pick out the interesting stuff. So it's becoming more and more common to use complicated statistical methods to pick out the good stuff. However, it's very important to understand what these methods are doing, otherwise you might be misled.

There will be a bit of math in this lecture. Don't worry about following all the details, and I won't present anything too complicated.

Even though you will be analyzing RNA seq data, it's instructive to start by thinking about DNA sequencing. Imagine that you sequence N reads of length L from a genome of size G, where L is much much smaller than G, as is the case with short read sequencing. Assume that there are no sequencing biases: every single nucleotide is equally likely to be the place where a read starts. Now think about a fixed nucleotide, and imagine you want to know the probability of a read covering that nucleotide. Clearly, if the read has a 5' end that starts anywhere L bases 5' of the site, that site will be covered. This can clearly be seen in the figure below, where reads colored red do not include the site, while reads colored blue do include the site.


Because the whole genome is G long, that means that the probability of a given site being hit by a single read must be

Now, the coverage of that site should be the number of reads that end up including that site. So you want to know what fraction of the N reads you sequenced covered that site. The probability that n out of N reads hit that site is given by the binomial distribution, so we have that

Okay, here is where things get a bit weird. It turns out that if N is big enough and L/G is small enough, we can approximate the binomial distribution by a Poisson distribution. A Poisson distribution is a fundamental probability distribution and is characterized by one number: it's average. In the case of the above binomial distribution, the average is

so we can say that

Notice that C is exactly what we calculate when we say the "coverage" of a genome: the number of nucleotides sequenced, which is the number of reads times the average read length, divided by the number of nucleotides in the whole genome. The above equation gives some intuition into why you need relatively high coverage to make sure you hit every site. Moreover, it's very important to realize: just because the AVERAGE coverage is, say, 10x, that doesn't mean that every single nucleotide shows up in 10 reads. In fact, until you have relatively high coverage, there is a good chance that some nucleotides are just missing from your sequencing data! Below, I've plotted the probability that a site is completely missed against the coverage, and you can see that it remains above 1% until you get to about 4.5x coverage.


However, it's worth pointing out that this is only an approximation: the real process is messy! Below is a figure showing the coverage histogram you would expect compared to one from real data (~27x resequencing of a human). Clearly the real data has a bit more variability than you would expect if reads were coming uniformly at random from the genome.


RNA sequencing statistics

Going off of that background for DNA, we can get a picture of what we expect to happen with RNA seq. To simplify things, let's just think about a transcriptome with two transcripts in it. It wouldn't be hard to work out the theory for a larger number of transcripts, but this will keep things transparent. Now we have two transcripts, one of length L_1 and the other of length L_2 (I appologize for the awful subscripts, but there is no efficient way to make subscripts as far as I can tell...). We need a bit more information, which is the number of copies of the transcripts floating around in the cell when you do the sequencing experiment. Let's say that those are E_1 and E_2.

Our interest here is a bit different than with DNA sequencing. Instead of wanting to know the probability that a given site is covered by a read, we want to estimate the number of reads that start on a given transcript. Let's think about a simple way to model this. First imagine the probability that a read starts on any given site in transcript 1. Certainly that probability is proportional to E_1. But there are L_1 sites in transcript 1. So the total probability that a read starts in transcript 1 should be proportional to E_1*L_1. Hence,

and by the same argument of a binomial distribution converging to a Poisson distribution, we see that the probability that a site in transcript 1 is covered by n reads is approximately Poisson distributed with average value

This is the origin of the Poisson model of variation between RNA seq replicates, that you can find in, say, Marioni et al (2008).

Let's take a look at that expression for C_1 and we'll already see some important things that come up in the analysis of RNA-seq data. First of all, what we would like to have is the absolute expression level, E_1, but it's all tied up with a bunch of other things. Let's take a look at the bottom of the fraction: In some sense, we can consider E_1*L_1 + E_2*L_2 the effective total expression in the cell: it's the total amount of expressed nucleotides. Similarly, E_1*L_1 is the effective expression level of transcript 1. So what we're really measuring in some sense is effective expression level, which is completely confounded by the length of a gene. Moreover, as should be obvious, the number of reads makes a difference: if you do more sequencing, counts of every gene should be higher!

This fact has motivated a common description of the expression level in RNA seq data: FPKM normalization, which stands for Fragments Per Kilobase (of transcript) per Million mapped reads. So, say that you have counted the number of reads, R, mapping to a particular some gene with length L and you mapped N reads total, you calculate the FPKM as

Thus, the FPKM of a gene should be proportional to expression level, but the proportionality constant is related to the effective total expression in the cell.

Biological variation

Everything we considered before can be considered "technical" variation. We assumed that, at the outset, the number of mRNAs corresponding to any transcript was fixed. This would make sense if we extracted RNA once and then just ran that same sample on multiple machines. However, in many cases, the variation between replicates you get from this kind of experiment will be less than the variation between replicates if you had extracted RNA multiple times and ran each of those samples separately. This should be obvious: say that you extract RNA today and then you extract RNA again tomorrow. Do you expect every single transcript to have the exact same number of mRNA molecules floating around in the cell on both days? Probably not!

The difference here is between technical replication and biological replication. This distinction is VERY important.
  • Technical replication is when the same RNA extract is used to run the experiment multiple times.
  • Biological replication is when different RNA extracts are used to run the experiment multiple times.

The problem with the Poisson model that we developed before is that it is characterized by a single number: the average number of reads mapping to a transcript. This means that the variation between experiments is only related to that number. In fact, the standard deviation, or width, of a Poisson distribution is equal to the square root of its average! To overcome this difficulty, a somewhat unmotivated but ultimately effective solution is to stop using a Poisson model, and instead use the negative binomial distribution. The negative binomial distribution can sort of look like a Poisson distribution, but can be "fatter". This "fatness" is measured by the dispersion parameter. When this parameter is equal to 0, there is no extra variation, and the negative binomial is equivalent to a Poisson. As the dispersion parameter gets large, the distribution becomes increasingly "fat". Below, I plotted an example of this for a negative binomial with mean 10 as the dispersion gets smaller. The solid line is a Poisson distribution with the same mean, while the points correspond to a negative binomial distribution with the given dispersion.


Thus, the negative binomial distribution provides a very robust way of accounting for biological variation. Numerous packages, including edgeR and DEseq, implement this model to control for biological variation between replicates.

Finding Differential expression

One of the key goals of many RNA-seq experiments is to identify genes that are differentially expressed between conditions. In fact, all this background we've developed is key to understanding how to test for differential expression. Let's say that you have two experimental conditions, and you have performed RNA seq in both of them and gotten counts of N_1 reads mapping to the gene in condition 1 and N_2 reads mapping to the gene in condition 2. Now you want to test every gene and ask if it is differentially expressed between condition 1 and condition 2. How can you do that?

For now, let's consider the Poisson model and ignore the negative binomial, and let's think way way back to the first statistics class you ever took. First, let's assume that that the total number of mapped reads and the total expression levels were the same between the replicates so that we can avoid issues of normalization, which we'll talk about later. Then the null hypothesis is that the expression level is the same between the two conditions; in other words, that the average of the Poisson distribution is the same between the two conditions. The alternative hypothesis is that the expression level is different between the two conditions; in other words, that the average of the Poisson distribution is different between the two conditions. To summarize, if C_1 is the average of the Poisson distribution in condition 1 and C_2 is the average of the Poisson distribution in condition 2,

  • Null hypothesis: C_1 = C_2,
  • Alternative hypothesis: C_1 =/= C_2.

How do we test this? We make use of a very powerful statistical test called a likelihood ratio test. The likelihood of the data is simply the probability of the observed data. In general, the best way to estimate the parameters of a model (e.g. the average value of a Poisson distribution) is to choose the parameters that make the observed data have the highest probability. It turns out that with a Poisson distribution, setting the average of the Poisson distribution equal to the sample mean gives the data the highest probability. Now what you want to do is ask if the data from your two experiments has a higher probability if there is ONE mean between both experiments (i.e. they both have the same expression level) vs. if there are separate means for each experiment. It turns out that the "right" way to do this is to take twice the log ratio of the probability of the data under the alternative and the null hypothesis,

In our case, we use the fact that the best guess the average of a Poisson is the sample average, (N_1 + N_2)/2, to get

while in the alternative case the same averages are simply the observed counts,

Once we have computed the likelihood ratio, we use a strange fact that is not at all obvious: it will be distributed as a chi-square with 1 degree of freedom. You should remember the chi square distribution from testing for Hardy-Weinberg equilibrium in introductory genetics. This means, for example, that if your likelihood ratio is greater than 3.84, your likelihood ratio is significant at the .05 level.


1. Simulate a simple sequencing experiment with 1000 100 base pair reads.


Instead of using the whole E. coli genome, download the random genome that is linked above. You will have to write a script to simulate the reads, I'll outline the pseudocode here:
  1. Read the genome in, like normal
  2. Generate (in one line of code!) the 5' ends of each of the reads (Hint: use numpy.random.random_integers)
  3. For i in 1 through the number of reads:
    1. Get the sequence starting from that integer and going 100 base pairs in the 3' direction
    2. Print each sequence in fastq format (give everything a quality score of F)
After you simulate the reads, map them to the fake genome using bowtie, and create a coverage file. Make a print out of how many sites have 1x coverage, 2x coverage, etc.

2. Likelihood ratio test for differential expression.
Download the 2 attached files, which contain the counts of genes from simulated RNAseq experiments. In each file, there are two columns, corresponding to the counts in an "experimental" and a "control" group. In one of the files, the experimental and control groups have no differential expression, while in the other one, some of the genes are differentially expressed.

Read the data into python using numpy.loadtxt (look it up!).

Using the Poisson likelihood ratio test, determine if which genes are differentially expressed at the 5% level. For each file, how many are there? How many would you expect, even if there is no differential expression? What fraction of the hits in the file with real differential expression do you think are actually differentially expressed?

NOTE: Do all these calculations using numpy arrays. You should not use a single for loop!
NOTE 2: You will not be able to compute the likelihood functions and then take the logs; you will have to take the logs first. So take the log by hand and input that. To get python to compute log(n!), use "from scipy import special" at the top of your script and compute "special.gammaln(n+1)"

No DE:

3. Wrong model, wrong inference.
Download the attached file, which has the counts of genes from 2 simulated RNA-seq experiments. None of these genes are differentially expressed, but they are all simulated using a negative binomial model, instead of a Poisson. Using the Poisson likelihood ratio test, determine which genes are differentially expressed. How many do you get? How many do you expect?

NOTE: Do all these calculations using numpy arrays. You should not use a single for loop!



This script will simulate data
import sys
import numpy as np
def read_fasta(file):
    fasta_file = open(file,"r")
    sequences = {}
    for line in fasta_file:
        if line[0] == ">":
            cur_name = line.strip().split(" ")[0][1:]
    for key in sequences:
        sequences[key] = ''.join(sequences[key])
    return sequences
genome_file = sys.argv[1]
read_len = int(sys.argv[2])
num_reads = int(sys.argv[3])
genome_dict = read_fasta(genome_file)
genome = ''.join(genome_dict.values())
genome_len = len(genome)
starts = np.random.random_integers(0,genome_len-read_len,num_reads)
for start in starts:
    seq_of_read = genome[start:(start+read_len)]
    qual = ["P"]*read_len
    print "@Whatever len=%i\n%s\n+\n%s"%(read_len,seq_of_read,''.join(qual))

This script will run the lrt with arbitrary cutoff and input file
import sys
import numpy as np
import scipy as sp
from scipy import special
counts_file = sys.argv[1]
cutoff = float(sys.argv[2])
counts = np.loadtxt(counts_file)
C_null = (counts[:,0]+counts[:,1])/2
C_null = np.transpose(np.array([C_null,C_null]))
P_null = counts*np.log(C_null)-special.gammaln(counts+1)-C_null
P_null = P_null[:,0]+P_null[:,1]
P_alt = counts*np.log(counts)-special.gammaln(counts+1)-counts
P_alt = P_alt[:,0]+P_alt[:,1]
LRT = 2*(P_alt-P_null)
print LRT
print np.where(LRT>cutoff)[0]
print "%i genes differentially expressed"%sum(LRT>cutoff)