SNP calling: finding polymorphism


Although in our current project, we are not interested per se in finding where the sample sequence differs from the reference sequence. However, in many kinds of experiments and analyses, identifying variable sites is a key step in the analysis pipeline. For example, if you might be conducting an experiment to find the genetic basis of a mutant phenotype in which some samples have a mutant phenotype (let's call these guys cases) and other samples don't (let's call these guys controls). One option in this case is to do an association study, which looks for all the variable sites where one allele is enriched in the cases and the other allele is enriched in the controls. Another, increasing common application of sequencing data is to look at sequence variation in natural populations. This kind of data can be extremely informative about the evolutionary and demographic history of a species; we'll explore that briefly at the end of this session.

Errors or SNPs?


If you were guaranteed that every read were 100% accurate, there would be no SNP calling problem. However, because some of the calls that map to a particular site may be errors, we need some way of figuring out whether or not the site is truly a SNP.

Let's first consider a haploid organism. This is both most relevant to this class, since the data we have is from E. coli, but also an easy place to start. Let's take a look at a little bit of a reference sequence and then the reads that map to it. I'm going to write the data in a similar form to what's called a pileup file; you'll see a proper one later.
A AAAAACAAA
G CCCCCGCCC
C CCCAGCAGG
The first column is the reference allele while the second column summarizes the nucleotides from the reads. For example, in the first row, the reference allele is an A and you obtained 8 reads that had an A at that position and 1 read with a C at that position.

The first row is relatively easy: the reference allele is an A, and 8 out of 9 reads are A. Probably that C is an error, and we would think that this site does not have a SNP in this individual.

The second row is the opposite: the reference allele is G, but only 1 out of the 9 reads is a G. Instead, because every other read has a C, we might think that this site is a SNP and this sample has a C, instead of a G.

The last row is a hard case: the reference is a C, but only 4 out of the 9 reads is a C. However, unlike before, the other reads are roughly evenly split between 3 Gs and 2 As. What do you do in this case?

One approach is to adopt a statistical framework that attempts to estimate error rates and makes use of all the data. This is relatively complicated and we won't talk about it here. Another approach makes implicit use of prior knowledge you may have. First of all, in a haploid you expect only 1 allele at any site; so if there aren't enough reads with the same nucleotide, it might be wise to ignore that site. Second, assuming that your reference and your sample are not too diverged, you should expect SNPs to be relatively rare. Even between humans and chimp you expect only about 1 site in 100 to be different! So you probably want to put more weight on a site not being a SNP. Then, you can simply choose an arbitrary cutoff and declare a site a SNP or not.

A diploid organism can be much harder, because you have to deal with heterozygous sites. In a perfect world, your reads would come equally from both copies of the allele, but that isn't always the case. Consider the following sites:
G GCGCCGGCCC
T GGGTGGGTGG
C CCATGCGATG
The first site has 4 Gs and 6 Cs. This is reasonably close to 50% G and 50% C, exactly what you would expect from a heterozygous site. But what about the second site? It has only two Ts but eight Gs. Is that possibly a heterozygous site, or a site homozygous for the non-reference allele? The final site is once again a mess.

Again, a statistical framework can estimate error rates, sequencing bias, and model the process that assigns alleles to reads. However, another option is to again make arbitrary, but informed, cutoffs.

Pre-packaged SNP calling


There are a number of software packages that will perform SNP calling for you. One of the simplest to use is samtools. Samtools uses a relatively simple command line that takes as an input a indexed fasta file and a sorted bam file (we'll talk about that in a moment). The general framework can be found here.

The first thing you need to do to use samtools is index your fasta file using samtools faidx:

$ samtools faidx E_coli_genome.fasta
$ head E_coli_genome.fasta.fai
F 99159 50 60 61
Chromosome 4639675 100938 60 61

As you can see, running samtools faidx results in a file with the same name as the input fasta but with ".fai" tagged on the end of it. The faidx file simply shows the name and size of the sequence, the place in the file where the sequence begins and the number of characters per line. Using this information it is quite easy to get an arbitrary sub-sequence if you know the start and end positions of that subsequence.

The other task is to generate a sorted bam file with your reads. A bam file is simply a binary version of a sam file and samtools plays more nicely with them. To generate a bam file from a sam file of mapped reads, we have to first convert our sam file file into a bam file and then sort it. To conver the sam file into a bam file, use samtools view:

$ samtools view -Sb E_coli_mapped_reads.sam > E_coli_mapped_reads.bam
[samopen] SAM header is present: 2 sequences.

These bam files are NOT human-readable. the -Sb option tells it that the input is in Sam format and that it should output in bam format. To sort the bam file, use samtools sort:

$ samtools sort E_coli_mapped_reads.bam E_coli_mapped_reads.sorted

which results in a file E_coli_mapped_reads.sorted.bam. The first argument is the input bam file and the second argument is the prefix for the output bam file (in other words, the output will be called prefix.bam).

Now we can finally call SNPs! We do this using a series of programs. The first one, samtools mpileup, will be used to generate bcf file. However, first it's instructive to take a brief look at a pileup file:

$ samtools mpileup -f ../E_coli_genome.fasta E_coli_mapped_reads.sorted.bam > E_coli_mapped_reads.pileup
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000

the -f option simply means that the reference genome is given as a fasta file. Let's take a brief look inside E_coli_mapped_reads.pileup:

Chromosome 21150 G 17 c,,,,,,,,,,,,,,,^~, +JIIJJGGJJH@JCJHC
Chromosome 21151 T 17 ,,,,,,,,,,,,,,,,, AIJGIJBFJJE;J>JGF
Chromosome 21152 C 17 ,,,,,,,,,,,,,,,,, GGJJJJGHJJJDJEJHH
Chromosome 21153 A 17 ,,,,,,,,,,,,,,,,, EJJJJJIEJJJIJHJCH
Chromosome 21154 A 17 ,,,,,,,,,,,,,,,,, >IIIJJIJJIJEJDJFE
Chromosome 21155 T 17 ,,,,,,,,,,,,,,,,, FJIIJJJIJIJEJ@IGF

The first column is the chromosome, followed by the position, the reference allele and the coverage. The next two columns tell you about the reads at that position: , signifies matching reference on forward strand while . is matching reference on reverse strand; similarly upper and lower case letters indicate a read from a non-reference allele on either the forward or reverse strand, respectively. There are a few other characters, for example indicating the beginning of a read, but this was just meant to be a short detour rather than a full exploration of the pileup format.

However for actual SNP calling we need this to be in bcf format. This is done using an almost identical command line, the only difference being that the option should be "-Suf" instead of simply "-f". The "S" will tell samtools to compute per-site strand bias values, which we'll come back to later.

$ samtools mpileup -Suf ../E_coli_genome.fasta E_coli_mapped_reads.sorted.bam > E_coli_mapped_reads.bcf
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000

Now, we use bcftools view to do all the genotype calling. bcftools uses a statistical method to call SNPs.

$ bcftools view -vg E_coli_mapped_reads.bcf > E_coli.vcf
[afs] 0:279.924 1:4.392 2:3.684

The options -vg tell it to output only variable sites and to call genotypes. Calling genotypes is a bit weird for E. coli, since it's haploid, but this can be a good practice: sites that are called heterozygous might be indicative of low quality sites or that you are not sequencing a clonal culture. The output is stored in a vcf. "vcf" stands for variant call format, and is a compressed way to store variant information. Taking a look inside, first there is a long header:

-->
##fileformat=VCFv4.1
##samtoolsVersion=0.1.17 (r973:277)
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=-1,Type=Integer,Description="List of Phred-scaled genotype likelihoods, number of values is (#ALT+1)*(#ALT+2)/2">


and then some lines indicating the variants
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT E_coli_mapped_reads.sorted.bam
Chromosome 2816123 . T C 6.02 . DP=2;AF1=1;AC1=2;DP4=0,0,0,2;MQ=20;FQ=-33 GT:PL:GQ 1/1:36,6,0:6
This looks complicated but it's rather simple. First let's look at the line describing the variant. The first column is the chromosome, then the position. The next column is an ID for that SNP, for example if you had a list of known SNPs. Finally you get the good stuff: the reference allele and the alternative allele. In this case, samtools thinks that while the reference is a T at position 2816123 in the E. coli chromosome, our sample actually has a C. The rest of the fields give you some information about the SNP. For example, the field that starts with "DP=2" is called the info field. If you want to know what all the things there mean, take a look up at the header. For example, this line
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
means that "DP=2" indicates that 2 reads mapped to this position in the genome. The last column tells you the actual genotype, as well as the genotype likelihood scores (essentially -log(probabilty of a given genotype)). So here it thinks that the E. coli is heterozygous for the alternative allele.

Population genomics

Analyzing genomic variation within and between populations is one of the most powerful uses of high-throughput sequencing (and one that's close to my own heart). We are able to sequence more individuals to higher quality than ever before and so we can get a genome-wide understanding of variation within populations.

Imagine that you have found all the variable sites in a set of individuals. Although you have all the sequence for each sample, you really only care about those places where at least one sample differs from one other sample. Below is a cartoon of a small dataset with 3 chromosomes. Variant sites are marked with a red x, the rest of the chromosome is assumed to be the same between all individuals. If two chromosomes have a variant that (roughly) line up vertically, they have the same variant at that site.

pop_sequence.png
There are three summaries of population genomic variation that can act as a first pass at exploratory data analysis. Two of these summaries are just single numbers, and one of them is a bit more complicated. The two simplest summaries are the number of segregating sites, S, and the average number of pairwise differences, pi. Let's walk through calculating those on the example data above.

S is exactly what it sounds like: simply count up the number of sites have at least one chromosome with a variant,

variants.png
so that S is equal to 6 in this case.

Calculating pi is a little bit harder. First, compare every pair of sequences and ask how many differences there are between them. Then, take the average of this number over all the comparisons you did,

pairwise_dif.png
so that in this case,


The slightly more complicated summary, called the site frequency spectrum, is just a histogram of the allele frequencies in the sample. There are two kinds of site frequency spectrum, depending on whether you know the ancestral state or not. If you do know the ancestral state, then you can tell the difference between a variant that's at frequency 1 out of n from a variant that's at frequency n-1 out of n. But if you don't know the ancestral state, you can't tell the difference---in that case, you fold the site frequency spectrum and collapse allele frequencies that you can't tell apart.

For the example, I have supposed that we can tell the ancestral state, so there are two allele frequencies: 1/3 and 2/3. It's easy to count them up, and the site frequency spectrum looks like this:
sfs.png
This site frequency spectrum is pretty boring: there are the same number of alleles with frequency 1/3 as there are with frequency 2/3.

What do these have to do with population history? It turns out that each summary is affected in different ways by the demographic and evolutionary history of a population. If everything is boring (no natural selection, no population size changes, no population subdivision, etc.) then it's easy to prove that the average values of the different statistics are

where N_e is some mysterious thing called an "effective population size" and and mu is the mutation rate for the whole genomic region per generation in the species. The last line refers to the average value of the number of alleles segregating with frequency i.

Exercises



VCF:
https://www.dropbox.com/s/33bgaxx6fcsg6d9/CG_chr_21.good.vcf.gz
Individual list: https://www.dropbox.com/s/1lw5pwok1zpx6j8/CGS_Diversity%2BPed_all54_indvs.txt
1. Download the above VCF file containing variants from human chromosome 21 in various human populations. Individuals are labeled with anonymous identifiers, so you will also need to download a list that tells which population each individual is from. Then,
a. Compute pi and S within each population SEPARATELY. Note that because these are diploids, you can just select a random allele from each individual for calculating pi (you may need to look at the module "random" for this purpose, specifically the function random.randint).
b. Compute the average distance between a random Yoruban (YRI) and European individual (CEU). How does it compare to within population diversity?

2. Using samtools, call SNPs in the reads from yesterday. How many heterozygous sites are called? Why might there be "heterozygous sites" in a haploid genome? Make a vcf file with just the variable sites in it.

Solutions

1.
import sys
import random as rn
 
vcf = sys.argv[1]
inds = sys.argv[2]
 
#parse the individuals and create some useful dictionaries
ind_pops = {}
pop_indices = {}
pop_S = {}
pop_pi = {}
pop_goodsites = {}
for line in open(inds):
    splitLine = line.strip().split("\t")
    ind_pops[splitLine[1]] = splitLine[0]
    if splitLine[0] not in pop_indices:
        pop_indices[splitLine[0]] = []
        pop_S[splitLine[0]] = 0.0
        pop_pi[splitLine[0]] = 0.0
        pop_goodsites[splitLine[0]] = 0.0
 
#parse the VCF
cur_line = -1
for line in open(vcf):
    if cur_line%10000 == 0:
        sys.stderr.write("%i\n"%cur_line)
    #if cur_line > 10000:
    #    break
    cur_line += 1
    if line[0:2] == "##":
        #skip comments
        continue
    if line[0:2] == "#C":
        #figure out which indices correspond to which individuals
        splitLine = line.strip().split("\t")
        inds = splitLine[9:]
        for i,ind in enumerate(splitLine[9:]):
            ind_name = ind.split("-")[0]
            pop_indices[ind_pops[ind_name]].append(i+9)
        continue
    #compute pi and s by adding SNP by SNP
    splitLine = line.strip().split("\t")
    #go over every population
    for pop in pop_indices:
        is_segregating = 0
        #go over every pair of individuals
        for ind1 in pop_indices[pop]:
            for ind2 in pop_indices[pop]:
                #don't count if the two individuals are the same!
                if ind1 == ind2:
                    continue
                #get each nucleotide
                ind1genotype = splitLine[ind1].split("/")
                ind2genotype = splitLine[ind2].split("/")
                #print ind1genotype
                #print ind2genotype
                #determine if either has missing data
                #if any missing data in this population for this site, ignore!
                if "." in ind1genotype or "." in ind2genotype:
                    break
                #choose a random allele from each individual
                ind1allele = ind1genotype[rn.randint(0,1)]
                ind2allele = ind2genotype[rn.randint(0,1)]
                #determine if they are different
                if ind1allele != ind2allele:
                    #print splitLine[1], pop, ind1genotype, ind2genotype
                    #raw_input()
                    #count the site as segregating if haven't already
                    if is_segregating == 0:
                        is_segregating = 1
                        pop_S[pop] += 1.0
                    #count as a pairwise difference
                    pop_pi[pop] += 1.0
                pop_goodsites[pop] += 1.0
 
#print the results
print "pop","pi","S","goodsites"
for pop in pop_pi:
    num_ind = float(len(pop_indices))
    print pop,pop_pi[pop]/(num_ind*(num_ind-1)/2),pop_S[pop],pop_goodsites[pop]

2.
import sys
import random as rn
 
vcf = sys.argv[1]
inds = sys.argv[2]
 
#parse the individuals and create useful dictionaries
ind_pops = {}
pop_indices = {}
for line in open(inds):
    splitLine = line.strip().split("\t")
    ind_pops[splitLine[1]] = splitLine[0]
    if splitLine[0] not in pop_indices:
        pop_indices[splitLine[0]] = []
 
#create the counters
pi_between = 0.0
good_sites = 0.0
 
#parse the VCF
cur_line = -1
for line in open(vcf):
    if cur_line%10000 == 0:
        sys.stderr.write("%i\n"%cur_line)
    cur_line += 1
    if line[0:2] == "##":
        #skip comments
        continue
    if line[0:2] == "#C":
        #figure out which indices correspond to which individuals
        splitLine = line.strip().split("\t")
        inds = splitLine[9:]
        for i,ind in enumerate(splitLine[9:]):
            ind_name = ind.split("-")[0]
            pop_indices[ind_pops[ind_name]].append(i+9)
        continue
    #compute pi between by adding line by line
    splitLine = line.strip().split("\t")
    for ind1 in pop_indices["YRI"]:
        for ind2 in pop_indices["CEU"]:
            ind1genotype = splitLine[ind1].split("/")
            ind2genotype = splitLine[ind2].split("/")
            #check if any missing data
            if "." in ind1genotype or "." in ind2genotype:
                break
            #choose a random allele
            ind1allele = ind1genotype[rn.randint(0,1)]
            ind2allele = ind2genotype[rn.randint(0,1)]
            #determine if they are different
            if ind1allele != ind2allele:
                pi_between += 1.0
            good_sites += 1.0
print "pi", "goodsites"
print pi_between/(9.0*9.0), good_sites