Functions as objects, non-linear fits, and putting it all together


BLAST searches against the PDB and Homology Models


We will begin by tying up a loose end from Session 8.2: how to search for structural data relevant to any protein or nucleotide sequence. Let's assume that you have identified a SNP that maps to a missense mutation in a protein-coding region, and you would like to see whether a structure of that protein exists to give the SNP some biological context (there are currently ~83,000 structures in the PDB, so it is always worth looking). This task is easily accomplished by downloading the PDB amino acid or nucleotide BLAST database from NCBI:

BLAST databases available for download

The databases we'd want are either pdbaa.tar.gz (protein) or pdbnt.tar.gz (nucleotide), and we can run the BLAST search locally as described in Session 6.1. New structures are added to the PDB every week, so it is wise to always run your searches against the most recent BLAST database.
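For example, once the BLAST+ command-line tools are installed and pdbaa.tar.gz has been downloaded and extracted, a local search can be launched from Python roughly like this (the query and output filenames are placeholders):

import subprocess

query_fasta = 'my_protein.fasta'        # placeholder: your query sequence in FASTA format
blast_command = ['blastp',              # protein query vs. the pdbaa protein database
                 '-query', query_fasta,
                 '-db', 'pdbaa',        # path to the extracted pdbaa database files
                 '-outfmt', '6',        # tabular output, one hit per line
                 '-out', 'pdb_hits.txt']
subprocess.call(blast_command)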
If your protein of interest is not in the PDB, but a homologous protein (at least 20-30% amino acid sequence identity; higher is better) is, then you can always try the Phyre server to generate a homology model:

Phyre homology model server

Advanced Data Structures in Python


Today we will learn more about functions in Python. Treating functions as objects may seem unwieldy at first, but it can greatly simplify code that involves complicated data structures.
Functions (defined with the "def" keyword in Python) are objects that can be passed as variables into other functions at will. To help exemplify this, let's look at a simple function:
def arbitraryFunction(a, b):
    # do something here
    return something involving a and b   # (pseudocode)

We're completely comfortable with a and b being objects of type string, int, float, list, dictionary, etc. The truth is, they can be objects of any type. Functions themselves are also objects, and when we type "def functionNameHere(variables):", we are initializing an object of type function. Because of this, we can also pass functions as variables:

def numberTimesTwo(x):
    return x*2

def arbitraryFunction(a, b):
    return a(b)

print arbitraryFunction(numberTimesTwo, 4)
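Because functions are ordinary objects, they can also be stored inside data structures such as lists or dictionaries and looked up by name. A small illustration (the dictionary and its key are made up for this example):

# store a function in a dictionary, keyed by a descriptive name
operations = {'double': numberTimesTwo}
print operations['double'](21)   # prints 42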
 

This may seem like just a cute trick at first, but it can greatly simplify complicated code. Have you ever copied and pasted a function you've written just to replace the name of one function called inside it with another? You can bypass that entirely by passing the function you want to change as a variable. Let's look at an example of some code that we can clean up by passing functions as variables:

def squareEveryObjectOfAList(tempList):
    output = []
    for each in tempList:
        toAdd = each**2
        output.append(toAdd)
    return output


def cubeEveryObjectOfAList(tempList):
    output = []
    for each in tempList:
        toAdd = each**3
        output.append(toAdd)
    return output


def stringOfEveryObjectOfAList(tempList):
    output = []
    for each in tempList:
        toAdd = str(each)
        output.append(toAdd)
    return output

Cleaning up the first two functions could be easily accomplished by passing the exponent as a variable. Incorporating the third function into the same syntax, though, is a bit more challenging. Let's pass a function as an argument to clean up this example:

def doSomethingToEveryObjectOfAList(whatToDo, tempList):
    output = []
    for each in tempList:
        toAdd = whatToDo(each)
        output.append(toAdd)
    return output

def squareANumber(number):
    x = int(number)
    return x**2

def cubeANumber(number):
    x = int(number)
    return x**3

def stringANumber(number):
    output = str(number)
    return output
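A quick usage example (the expected output of each call is shown in the comment next to it):

numbers = [1, 2, 3]
print doSomethingToEveryObjectOfAList(squareANumber, numbers)   # [1, 4, 9]
print doSomethingToEveryObjectOfAList(cubeANumber, numbers)     # [1, 8, 27]
print doSomethingToEveryObjectOfAList(stringANumber, numbers)   # ['1', '2', '3']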

Passing functions around like this is rarely worth the effort for a small set of simple functions, but it can be invaluable when you are dealing with complicated analyses.

Let's Practice Doing Stuff with Real Data!


Let's say that you wanted to see how stable your favorite protein was. (And, for the sake of this example, let's say your favorite protein is lambda repressor. You love the classics.) You did a denaturant melt using a circular dichroism (CD) spectrometer, which measures the amount of secondary structure. So, for each sample you now have two pieces of data: the denaturant (urea) concentration and the CD signal. To get the stability (∆G) of the protein, you want to look at the CD signal as a function of the denaturant concentration and fit it to the equation

CD(m0) = (m1 + m5*m0) * f_F + (m2 + m6*m0) * f_U,
where f_U = exp(-(m3 - m4*m0)/RT) / (1 + exp(-(m3 - m4*m0)/RT)) and f_F = 1 - f_U.
Here m0 is the urea concentration, m1 and m2 are the folded and unfolded baselines, m5 and m6 are their slopes, m3 = ∆G, m4 is the m-value, and RT ≈ 0.592 kcal/mol at 25 °C.

And we now know enough to be able to do all this (and more) in Python! Let's start with our data as scipy arrays (so we can analyze it) and our equation as a function.

#!/usr/bin/env python
 
import numpy
import scipy
 
wt_CD = scipy.array([-46.396, -46.43 , -46.082, -46.159, -46.169, -45.949, -45.896,
       -45.78 , -45.7  , -45.434, -45.19 , -45.084, -44.374, -43.963,
       -43.265, -42.12 , -40.694, -38.897, -36.468, -33.651, -30.485,
       -26.564, -23.369, -21.652, -20.149, -18.564, -17.223, -15.661,
       -14.473, -13.155, -12.688, -11.335, -11.297, -10.525, -10.013,
        -9.199,  -8.816,  -8.388,  -8.499,  -7.707,  -7.329,  -7.355,
        -6.688,  -6.782,  -6.789,  -6.37 ,  -5.944,  -5.817,  -5.719,
        -5.545,  -5.651,  -5.692,  -4.971,  -5.184])
wt_m0 = scipy.array([ 0. ,  0.3,  0.6,  0.9,  1.2,  1.5,  1.8,  2. ,  2.2,  2.4,  2.6,
        2.8,  3. ,  3.2,  3.4,  3.6,  3.8,  4. ,  4.2,  4.4,  4.6,  4.8,
        5. ,  5.1,  5.2,  5.3,  5.4,  5.5,  5.6,  5.7,  5.8,  5.9,  6. ,
        6.1,  6.2,  6.3,  6.4,  6.5,  6.6,  6.7,  6.8,  6.9,  7. ,  7.1,
        7.2,  7.3,  7.4,  7.5,  7.6,  7.7,  7.8,  7.9,  8. ,  8.1])
 
# m0 = denaturant (urea) concentration
# m1 = folded baseline
# m2 = unfolded baseline
# m3 = ∆G
# m4 = m-value
# m5 = folded baseline slope
# m6 = unfolded baseline slope
 
def equation(m0, m1, m2, m3, m4, m5, m6):
    return (m1+m5*m0)*(1/(1+(numpy.exp(-(m3-m4*m0)/0.592126))))+(m2+m6*m0)*(numpy.exp(-(m3-m4*m0)/0.592126)/(1+(numpy.exp(-(m3-m4*m0)/0.592126))))
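As a quick sanity check, you can evaluate the model over the measured urea concentrations using the starting-guess values introduced just below and compare the result to the data by eye:

predicted = equation(wt_m0, -50, -25, 6.5, 4.5, .5, 2.5)
print predicted   # rough model curve to compare against wt_CD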

In order to fit an equation to your data, you begin by making starting guesses for the parameters you are fitting. Then you compute the error and iteratively adjust the parameter values to minimize it. The error is just the difference between the CD value predicted at each [urea] with your current parameters and the CD value you actually measured. In Python, that looks like this:
#starting guesses
m1_0 = -50
m2_0 = -25
m3_0 = 6.5
m4_0 = 4.5
m5_0 = .5
m6_0 = 2.5
 
def resid(p, CDsignal, m0):
    m1, m2, m3, m4, m5, m6 = p
    return CDsignal - equation(m0, m1, m2, m3, m4, m5, m6)

And now we can use the scipy function that exists for exactly this purpose, optimize.leastsq(), and practice passing functions as variables, all in one fell swoop.

from scipy import optimize
 
results, flag = optimize.leastsq(resid, [m1_0, m2_0, m3_0, m4_0, m5_0, m6_0], args = (wt_CD, wt_m0))
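The fitted parameters come back in results in the same order as the list of starting guesses, so, for example, the fitted ∆G is the third element:

print 'Fitted delta G: %.3f kcal/mol' % results[2]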

So now, we can put it all together into a function that takes our data (CD signal and urea concentration for every sample) as input and gives us values for each of the 6 parameters we're fitting as the output:

#!/usr/bin/env python
 
import numpy
import scipy
from scipy import optimize
 
wt_CD = scipy.array([-46.396, -46.43 , -46.082, -46.159, -46.169, -45.949, -45.896,
       -45.78 , -45.7  , -45.434, -45.19 , -45.084, -44.374, -43.963,
       -43.265, -42.12 , -40.694, -38.897, -36.468, -33.651, -30.485,
       -26.564, -23.369, -21.652, -20.149, -18.564, -17.223, -15.661,
       -14.473, -13.155, -12.688, -11.335, -11.297, -10.525, -10.013,
        -9.199,  -8.816,  -8.388,  -8.499,  -7.707,  -7.329,  -7.355,
        -6.688,  -6.782,  -6.789,  -6.37 ,  -5.944,  -5.817,  -5.719,
        -5.545,  -5.651,  -5.692,  -4.971,  -5.184])
wt_m0 = scipy.array([ 0. ,  0.3,  0.6,  0.9,  1.2,  1.5,  1.8,  2. ,  2.2,  2.4,  2.6,
        2.8,  3. ,  3.2,  3.4,  3.6,  3.8,  4. ,  4.2,  4.4,  4.6,  4.8,
        5. ,  5.1,  5.2,  5.3,  5.4,  5.5,  5.6,  5.7,  5.8,  5.9,  6. ,
        6.1,  6.2,  6.3,  6.4,  6.5,  6.6,  6.7,  6.8,  6.9,  7. ,  7.1,
        7.2,  7.3,  7.4,  7.5,  7.6,  7.7,  7.8,  7.9,  8. ,  8.1])
 
def denmeltfit(CDsignal, m0):
    import numpy
    import scipy
    from scipy import optimize
 
    # define equation for denaturant melt fit
    # m0 = denaturant concentration/x-value
    def equation(m0, m1, m2, m3, m4, m5, m6):
        return (m1+m5*m0)*(1/(1+(numpy.exp(-(m3-m4*m0)/0.592126))))+(m2+m6*m0)*(numpy.exp(-(m3-m4*m0)/0.592126)/(1+(numpy.exp(-(m3-m4*m0)/0.592126))))
 
    # define equation for residuals
    # CD = actual y-values/CD signal measured
    def resid(p, CDsignal, m0):
        m1, m2, m3, m4, m5, m6 = p
        return CDsignal - equation(m0, m1, m2, m3, m4, m5, m6)
 
    #starting guesses
    m1_0 = -50 # folded baseline
    m2_0 = -25 # unfolded baseline
    m3_0 = 6.5 # delta G
    m4_0 = 4.5 # m-value
    m5_0 = .5 # folded slope
    m6_0 = 2.5 # unfolded slope
 
    results, flag = optimize.leastsq(resid, [m1_0, m2_0, m3_0, m4_0, m5_0, m6_0], args = (CDsignal, m0))
 
    fbaseline = results[0]
    ubaseline = results[1]
    dG = results[2]
    mvalue = results[3]
    fslope = results[4]
    uslope = results[5]
    Cm = dG/mvalue  # midpoint of the unfolding transition: the [urea] at which half the protein is unfolded
    return fbaseline, ubaseline, dG, mvalue, fslope, uslope, Cm
 
results = denmeltfit(wt_CD, wt_m0)
dG = float(results[2])
 
print 'The stability is %.3f kcal/mol.' %dG

Before we end, a quick note about denaturant melts that you will need for the exercises. There are two ways to make samples. One is to use a titrator: you prepare a no-denaturant solution and a high-denaturant solution, and the instrument adds one to the other, keeping track of the CD signal and urea concentration at every step. Your data is then all contained in a single file (the titrator data file used in Exercise 1a).
The other option is to make each sample by hand and measure the urea concentration and CD signal of each one individually. In that case, you have a folder with the data for each sample in a separate file (the manual melt data used in Exercise 1b).

Exercises


1. Get me out of here!
a) One of the most irritating things to do by hand (and thus one of the most useful things for the computer to do for you) is actually getting your data out of the data files and into a list the computer can analyze. Modify the code we assembled in the second half of lecture so that all the user has to do is input the name of the titrator data file, and the code outputs the stability of the protein.
b) Now modify the code to do the same thing with the folder of data from the manual melt. Look up os.walk() to learn one way to loop over a directory (a minimal sketch follows the exercises). Note that: 1. For each urea concentration, the CD signal is measured for one minute. The single file from the titration experiment (part a) gives you the average CD signal over that minute; for the manual melt, you will need to compute the average yourself. 2. The urea concentrations are in a separate file within the folder.

2. Pretty pretty picture
Make a figure of any (or all!) of the three melts that you have fit. Show the data as points and the fit as a smooth line (hint: look up numpy.linspace()). Include the stability you've calculated as a text inset in your figure.
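As a reference for Exercise 1b, here is a minimal sketch of how os.walk() traverses a directory tree (the starting path is just a placeholder):

import os

# os.walk() yields (directory, subdirectories, files) for every directory under the starting path
for pathstring, dirs, files in os.walk('/path/to/data'):
    for afile in files:
        print os.path.join(pathstring, afile)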

Solutions


1. Get me out of here!

Part A:

import numpy
import scipy
from scipy import optimize
 
def denmeltfit(CDsignal, m0):
    import numpy
    import scipy
    from scipy import optimize
 
    # define equation for denaturant melt fit
    # m0 = denaturant concentration/x-value
    def equation(m0, m1, m2, m3, m4, m5, m6):
        return (m1+m5*m0)*(1/(1+(numpy.exp(-(m3-m4*m0)/0.592126))))+(m2+m6*m0)*(numpy.exp(-(m3-m4*m0)/0.592126)/(1+(numpy.exp(-(m3-m4*m0)/0.592126))))
 
    # define equation for residuals
    # CD = actual y-values/CD signal measured
    def resid(p, CDsignal, m0):
        m1, m2, m3, m4, m5, m6 = p
        return CDsignal - equation(m0, m1, m2, m3, m4, m5, m6)
 
    #starting guesses
    m1_0 = -50 # folded baseline
    m2_0 = -25 # unfolded baseline
    m3_0 = 6.5 # delta G
    m4_0 = 4.5 # m-value
    m5_0 = .5 # folded slope
    m6_0 = 2.5 # unfolded slope
 
    results, flag = optimize.leastsq(resid, [m1_0, m2_0, m3_0, m4_0, m5_0, m6_0], args = (CDsignal, m0))
 
    fbaseline = results[0]
    ubaseline = results[1]
    dG = results[2]
    mvalue = results[3]
    fslope = results[4]
    uslope = results[5]
    Cm = dG/mvalue
    return fbaseline, ubaseline, dG, mvalue, fslope, uslope, Cm
 
 
###Modified code starts here###
 
# New function to get the CD value and [urea] for each sample
def titrator_getdata(datafile):
    titr_fh = open(datafile, 'r')
    lines = titr_fh.readlines()
    titr_fh.close()
    CD = []
    urea = []
    for item in lines:
        if item[0].isdigit():   # data lines start with a number; skip header lines
            item = item.split()
            if float(item[3]) <= 500:   # keep only points where the fourth column (presumably the dynode voltage) is at most 500
                CD.append(float(item[1]))
                urea.append(float(item[0]))
 
    return CD, urea
 
#Use your new function
filename = raw_input('Input path to data:\n')
wt_CD, wt_m0 = titrator_getdata(filename)
wt_CD = scipy.array(wt_CD)
wt_m0 = scipy.array(wt_m0)
 
results = denmeltfit(wt_CD, wt_m0)
dG = float(results[2])
 
print 'The stability is %.3f kcal/mol for the titrator melt.' %dG

Part B:

def manmelt_getdata(rootdir):
    import os
    from collections import defaultdict
    import numpy
 
    # Make a dictionary "data" that has each file name as the key and all the data as values
    data = defaultdict(list)
 
    for pathstring, dirs, files in os.walk(rootdir):
        for afile in files:
            if afile.endswith('.dat'):
                fh = open(os.path.join(pathstring, afile), 'r')
                lines = fh.readlines()
                fh.close()
                for line in lines:
                    if line[0].isdigit():
                        data[afile].append(line)
 
    # Sort data points in numerical order (modified from the first answer at
    # http://stackoverflow.com/questions/2669059/how-to-sort-alpha-numeric-set-in-python).
    # The file names are assumed to look like 'name#12.dat', so we sort on the number after the '#'.
    datapoints = data.keys()
    datapoints.sort(key=lambda item: (int(item.partition('#')[2].partition('.')[0])))
 
 
    # Get average dynode and CD values, plus the standard deviation of the CD signal.
    CD = []
    stddev = []
    dynode = []
 
    def parsed_data(datastrings):
        CDtemp = []
        dynodetemp = []
        for y in datastrings:
            splitdatastring = y.split()
            CDtemp.append(float(splitdatastring[1]))
            dynodetemp.append(float(splitdatastring[2]))
        CDtemp = numpy.array(CDtemp)
        dynodetemp = numpy.array(dynodetemp)
        return CDtemp, dynodetemp
 
    for x in datapoints:
        datastring = data[x]
        CDtemp_array, dynodetemp_array = parsed_data(datastring)
        CD.append(numpy.average(CDtemp_array))
        stddev.append(numpy.std(CDtemp_array))
        dynode.append(numpy.average(dynodetemp_array))
 
    return CD, stddev, dynode
 
#Use the function to extract the data, then fit it
 
CD, stddev, dynode = manmelt_getdata('/Users/dkoulechova/PythonCourse/2013/wt/')
CD = numpy.array(CD)
 
ureafile = raw_input('Input path to file with urea concentrations:\n')
fh = open(ureafile)
m0 = fh.readlines()
m0 = [float(x) for x in m0]
m0 = numpy.array(m0)
fh.close()
 
results = denmeltfit(CD, m0)
dG = float(results[2])
print 'The stability is %.3f kcal/mol for the manual melt.' %dG
 

2. Pretty pretty picture

# This block reuses m0, CD, results, and dG from the Part B script above.
import numpy
import pylab
 
pylab.plot(m0, CD, 'bo') #plot the data
 
m0_smooth = numpy.linspace(0.1, 9.6, 100) #x-values for plotting fit
 
def denmelt(m0, fitresults): #function defining y-values for plotting fit (CD values predicted at the urea concentrations of m0_smooth with the calculated parameters)
    (m1, m2, m3, m4, m5, m6, Cm) = fitresults
    return (m1+m5*m0)*(1/(1+(numpy.exp(-(m3-m4*m0)/0.592126))))+(m2+m6*m0)*(numpy.exp(-(m3-m4*m0)/0.592126)/(1+(numpy.exp(-(m3-m4*m0)/0.592126))))
 
pylab.plot(m0_smooth, denmelt(m0_smooth, results), 'b') #plot the fit
 
pylab.title('Denaturant melt for manual data', weight = 'heavy')
pylab.xlabel('[Urea], M')
pylab.ylabel('CD signal')
pylab.xlim(0, 9)
pylab.text(6, -30, 'dG = %.2f kcal/mol' % (dG), color = 'g')
 
 
pylab.show()
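If you also want the figure saved to an image file, pylab.savefig() will write it out; call it before pylab.show() so the figure is still open (the filename here is just a placeholder):

pylab.savefig('manual_melt_fit.png')   # placeholder filename; call before pylab.show()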