Numerical Python (NumPy)

Topics

  • Array data types
  • Conversion between standard Python data types and NumPy data types
  • Introduction to the NumPy and SciPy packages
  • Basic statistical analysis with NumPy tools
  • Reading and writing tabular data with pandas
  • Comparing the performance of vector math and for loops

Introduction

As Python matured into the multifunctional language it is today, more and more features were added. We've seen most of the original functionality of the language, which was intended for scripting UNIX and web processes and for writing user interfaces. But as Python became popular as a scripting language, it found its way into the arms of scientists, who, generally speaking, care more about efficient data manipulation than web programming. The big piece missing from Python was a numerical library: a collection of tools for dealing with large amounts of numbers at once. After a few efforts, an integrated one was developed. Numerical Python, or NumPy, provides a large collection of mathematical functions similar to those found in languages like MATLAB or R, and, like those languages, it provides data types that can be manipulated as vectors and matrices. We'll introduce these data types, then their associated functions.


NumPy Basics



Numerical Python is a powerful library of functions, methods, and data types we can use to analyze our data. Unfortunately for those of us whose heads continue to spin in a crash course of syntax, it also uses a different set of rules. I hope you'll understand why when you see the power and speed NumPy's data types afford us. Let's start off by creating some empty arrays, which look sort of like lists, and are in fact vectors.

They differ in a few fundamental ways from lists:

  1. Arrays cannot be of mixed types. They can be all integers, floats, strings, logical (boolean) values, or other immutable values, but they cannot be some characters, some numbers, or any other olio of data types (see the quick demonstration just after this list). They also cannot contain mutable types such as lists. So we can have a list of lists, but not an array of lists. We can, however, have an array of arrays (sort of). Which brings us to:
  2. Arrays can be multidimensional, but they must be rectangular. You can have a list of lists where the first interior list is 3 elements long, the second 5, and the third 12, but a multidimensional array must be expressible as "an m by n (by j by k by...) array". I have never encountered a situation where Python says there are too many dimensions (but I've never needed more than, maybe, 4 dimensions).
  3. We can perform vector operations on them, which can be algebraic functions (like a dot product), or simple replacements of values in a slice of the array.
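
Here's that demonstration of the first point: hand np.array() a mixed list and NumPy silently converts everything to a common type rather than complaining (a minimal sketch; the exact dtype NumPy picks can vary by version):

>>> import numpy as np
>>> np.array([1, 2, 3.5])   # ints mixed with a float: everything becomes a float
array([ 1. ,  2. ,  3.5])
>>> np.array([1, 'two'])    # an int mixed with a string: everything becomes a string
array(['1', 'two'],
      dtype='|S21')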

For some of the lecture today, I'll be using the format
>>> code goes here

to indicate things that I'm doing in IPython, as opposed to in a code file. It's the same format the regular Python interactive interpreter uses (rather than IPython's In[] and Out[]), but it lets us use the %cpaste magic, which strips out the >>> when interpreting pasted code.

Arrays

Here's one way to make an array: start with a list and convert it:

>>> import numpy as np
 
>>> a = [0] * 40
>>> a = np.array(a)

Or this can be shortened:

>>> a = np.array([0] * 40)

But there's a better way to get a vector of zeros:
>>> a = np.zeros(40)

Note that the default type when declaring an array is float64:
>>> type(a[0])
<type 'numpy.float64'>

And here's the simplest way to change that:
>>> a = np.zeros(40, int)
>>> type(a[0])
<type 'numpy.int64'>

And here's how to declare something that's not all zeros:
>>> a = np.arange(40)
>>> a
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39])
>>> type(a[0])
<type 'numpy.int64'>
Notice the int type.

What if we want a float? There are a couple of ways to do it:
>>> a = np.arange(40, dtype=float)  # Explicitly tell it to make floats
>>> type(a[0])
<type 'numpy.float64'>
>>> a = np.arange(40.0)  # Implicitly give a float, and it will still work
>>> type(a[0])
<type 'numpy.float64'>

Like with range(), you can also give arange() more parameters:
>>> np.arange(40, 50)  # Start and Stop
array([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
>>> np.arange(40, 50, 1) # Start, Stop, and Step
array([40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
# If any of the parameters are floats, the output will be a float
>>> np.arange(40, 50, .25)
array([ 40.  ,  40.25,  40.5 ,  40.75,  41.  ,  41.25,  41.5 ,  41.75,
        42.  ,  42.25,  42.5 ,  42.75,  43.  ,  43.25,  43.5 ,  43.75,
        44.  ,  44.25,  44.5 ,  44.75,  45.  ,  45.25,  45.5 ,  45.75,
        46.  ,  46.25,  46.5 ,  46.75,  47.  ,  47.25,  47.5 ,  47.75,
        48.  ,  48.25,  48.5 ,  48.75,  49.  ,  49.25,  49.5 ,  49.75])

As I said above, you can have arrays with more than one dimension:

>>> a = np.zeros( (10, 10) )  # Note the inner set of parentheses
>>> a
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

And you can even modify a particular element with the same syntax as our list-of-lists, or a subtly different one:

>>> a[5][5] = 3  # row, then column
>>> a[6,6] = 42  # still row, column
>>> a
array([[  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   3.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,  42.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.]])

So far, the coolest thing I've shown you isn't really that exciting: a range function that can handle floats. The real power of arrays is the ability to have one statement affect a large chunk of an array:

>>> a[1,:] = 1  # set every element of row 1
>>> a
array([[  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   3.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,  42.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.]])
>>> a[:,0] = 7  # set every element of column 0
>>> a
array([[  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  7.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.],
       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   3.,   0.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   0.,  42.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  7.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.]])


Let's pause for a moment and think about how we would do this with for loops over lists:

LoL = [[0]*10 for i in range(10)]

# set every element of row 1
for i, elem in enumerate(LoL[1]):
    LoL[1][i] = 1

# set every element of column 0
for L in LoL:
    L[0] = 7

We can also take slices of arrays, just as if they were lists:
>>> a = np.arange(10)
>>> a[2:5]
array([2, 3, 4])
>>> a[-1]
9
>>> a[::-1]
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

Maybe you can see the advantage of the array syntax. But wait, there's more! Act now, and we'll throw in math operations for free!

Vector Math with Arrays

We can do math on many values at once with arrays, no for loop required.

>>> a = np.arange(0,100,2)
>>> b = np.arange(50)
# Simple math across the whole array:
>>> a
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32,
       34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66,
       68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98])
>>> a + 10
array([ 10,  12,  14,  16,  18,  20,  22,  24,  26,  28,  30,  32,  34,
        36,  38,  40,  42,  44,  46,  48,  50,  52,  54,  56,  58,  60,
        62,  64,  66,  68,  70,  72,  74,  76,  78,  80,  82,  84,  86,
        88,  90,  92,  94,  96,  98, 100, 102, 104, 106, 108])
>>> b
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])
>>> b/2.0
array([  0. ,   0.5,   1. ,   1.5,   2. ,   2.5,   3. ,   3.5,   4. ,
         4.5,   5. ,   5.5,   6. ,   6.5,   7. ,   7.5,   8. ,   8.5,
         9. ,   9.5,  10. ,  10.5,  11. ,  11.5,  12. ,  12.5,  13. ,
        13.5,  14. ,  14.5,  15. ,  15.5,  16. ,  16.5,  17. ,  17.5,
        18. ,  18.5,  19. ,  19.5,  20. ,  20.5,  21. ,  21.5,  22. ,
        22.5,  23. ,  23.5,  24. ,  24.5])
 
# pairwise multiplication, consider the for loop alternative
>>> a * b
array([   0,    2,    8,   18,   32,   50,   72,   98,  128,  162,  200,
        242,  288,  338,  392,  450,  512,  578,  648,  722,  800,  882,
        968, 1058, 1152, 1250, 1352, 1458, 1568, 1682, 1800, 1922, 2048,
       2178, 2312, 2450, 2592, 2738, 2888, 3042, 3200, 3362, 3528, 3698,
       3872, 4050, 4232, 4418, 4608, 4802])
>>> sum(a * b)  # alternatively, the function np.dot(a,b)
80850

In general, you'll want your arrays to be the same shape and size when doing math with them, although NumPy does have a set of rules, called broadcasting, for deciding what to do when they aren't.
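
The friendliest broadcasting case stretches a smaller array across each row of a larger one when their shapes line up; when they don't, you get an error. A minimal sketch of both outcomes:

>>> a = np.ones((3, 3))
>>> a + np.arange(3)  # the 3-element array is applied to every row
array([[ 1.,  2.,  3.],
       [ 1.,  2.,  3.],
       [ 1.,  2.,  3.]])
>>> a + np.arange(4)  # shapes that don't line up give an error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (3,3) (4,)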

Boolean (logical) Values with Arrays


>>> a = np.zeros(10, dtype=bool)
>>> a
array([False, False, False, False, False, False, False,
       False, False, False], dtype=bool)
 
# Slicing and mass-assignment still works
>>> a[2:5] = True
>>> a
array([False, False,  True,  True,  True, False, False,
       False, False, False], dtype=bool)
 
# Comparison and assignment
>>> b = (a == False)
>>> b
array([ True,  True, False, False, False,  True,  True,  True,
        True,  True], dtype=bool)
>>> b = ~a
>>> b
array([ True,  True, False, False, False,  True,  True,  True,
        True,  True], dtype=bool)
>>> b = not a # It would be nice if this worked, but no such luck
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: The truth value of an array with more than one
            element is ambiguous. Use a.any() or a.all()
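
The real payoff of boolean arrays is using them as masks: indexing one array with a boolean array pulls out just the elements where the mask is True. A quick sketch, reusing the a from above:

>>> c = np.arange(10)
>>> c[a]       # keep only the elements of c where a is True
array([2, 3, 4])
>>> np.sum(a)  # True counts as 1, so this counts the True values
3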

Basic Statistics with Numpy


NumPy is freaking huge, with around 1200 pages of reference documentation, but all of you will, at some point, use some basic statistics to get a feel for your data. So let's make sure we hit some of those functions:

Summary Statistics
>>> a = np.arange(99)
>>> np.mean(a)
49.0
# Standard Deviation
>>> np.std(a)
28.577380332470412
# Variance
>>> np.var(a)
816.66666666666674
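
All of these also take an axis argument when your array has more than one dimension, so you can get per-column or per-row statistics instead of one number for the whole array (a small sketch):

>>> a = np.arange(12).reshape((3, 4))  # reshape 0..11 into a 3x4 array
>>> np.mean(a)          # mean of all 12 elements
5.5
>>> np.mean(a, axis=0)  # mean of each column
array([ 4.,  5.,  6.,  7.])
>>> np.mean(a, axis=1)  # mean of each row
array([ 1.5,  5.5,  9.5])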

Random distributions
>>> a = np.random.uniform(0, 100, 10) # Low, High, Size of Output
>>> a
array([ 59.29058916,   6.92921008,  91.93188435,  45.38511999,
        34.46457059,  41.29478046,  33.51351871,  42.5493656 ,
        50.54698303,  19.79408663])
>>> a = np.random.uniform(0, 100, (3,3))
>>> a
array([[ 12.06197457,  90.90232851,  51.27616458],
       [ 61.12517983,  88.14112687,  29.36606864],
       [ 77.35499074,  53.7975086 ,  72.63088545]])
 
>>> a = np.random.uniform(0,100,1000000)
>>> np.mean(a)
50.022576176522485
 
# exponential sample with mean = e
>>> np.mean(np.random.exponential(np.exp(1),1000000))
2.7188135956330033

Pylab


In the early exploratory phases of your data analysis, sometimes every second counts when you're trying to convert some idea in your head into code you can try out. IPython has a mode called Pylab that has lots of things already imported into the main namespace, so you can start playing around with your data without having to import things, and without having to type numpy. (or even np.) in front of each and every NumPy function you want to use. To get it, simply start up IPython with a flag:

$ ipython --pylab

This starts it up with all the numpy functions imported, as well as lots of SciPy (which I'll tell you about next) and matplotlib (more on this on Friday) set up, so that you can do pretty heavy math without a lot of mental overhead remembering which package everything is in. The way I often like to go about things is to figure out what I want to do, try it in IPython with Pylab on, then use the %history or %save magic words in IPython to get what I did, put it into a file, trim out all the failed approaches, and turn the key parts into functions. In that file, I still usually do an import numpy as np, and then go through all the numpy functions I used and change them from, for example, x = array(...) to x = np.array(...). This is because there are a few functions (like min and max) where numpy overrides the built-in versions, and it's good to be explicit about which one you want.
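
For example, after a Pylab session, something like this gets your history into a file (a sketch; the range 1-30 stands in for whatever input line numbers IPython shows for the lines you want to keep):

>>> %history 1-30           # print input lines 1-30 of this session
>>> %save analysis.py 1-30  # write those same lines out to analysis.py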

SciPy and Fitting


SciPy (pronounced "Sigh Pie") is a collection of libraries that builds on NumPy, and has lots of convenient, fast functions for working with large amounts of scientific data. It's slightly smaller than NumPy, with only 900-odd pages of documentation. That includes sections on integrating C or Fortran code into Python, which is way outside the scope of this course, but if you ever do get to the point where you need a super-efficient implementation of something, you're covered. Especially given the one-off nature of academic science, you're often better served spending less time writing code that takes longer to run than spending lots and lots of time writing code that runs slightly faster.

The stats module of SciPy has functions for even more statistical distributions, statistical tests, and other assorted functions that a good statistician might need. As an example, let's see how we might use the linregress function, which does a linear regression on some data. Linear regression is the process of finding the line that minimizes the sum of the squares of the vertical distances from each point to the line.

First, we'll set up some noisy data:

import numpy as np
 
slope = .5
intercept = -10
 
x = np.arange(0,100)
y = slope*x + intercept
noise = 5 * np.random.randn(len(x))  # Gaussian noise with standard deviation 5
 
y = y + noise
[Figure: scatter plot of the noisy data points (7.1-fitting-points.png)]

I'll show you how to generate plots like this in the afternoon, but if you know any Matlab, it's eerily similar.

This isn't a math class, so we'll take the equations for the slope, intercept, and correlation coefficient of the best-fit line as given:

m = (n Σxy - Σx Σy) / (n Σx² - (Σx)²)
b = (Σy - m Σx) / n
r = (n Σxy - Σx Σy) / sqrt( (n Σx² - (Σx)²) (n Σy² - (Σy)²) )
Now we just translate the math to Python code:
n = len(x)
 
m = (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x**2) - (sum(x))**2)
b = (sum(y) - m * sum(x))/n
r = (n * sum(x * y) - sum(x) * sum(y)) / np.sqrt((n*sum(x**2) - sum(x)**2)
                                                * (n * sum(y**2) - sum(y)**2))
 
print m, b, r
 

0.486677343735 -9.05040165994 0.928979337505

This gives us pretty much the right result, but it was kind of a pain to type in. If only the libraries had some sort of function that could do linear regression for us...

from scipy import stats
 
r_slope, r_int, r_rval, r_pval, r_stderr = stats.linregress(x, y)
 
print "Regression Slope: ", r_slope
print "Regression Intercept: ", r_int
print "Regression correlation: ", r_rval
print "R^2:, ", r_rval**2
print "p(slope is 0): ", r_pval
Regression Slope: 0.486677343735
Regression Intercept: -9.05040165994
Regression correlation: 0.928979337505
R^2:  0.86300260951
p(slope is 0): 4.31945319634e-44

[Figure: the noisy data with the best-fit regression line overlaid (7.1-fitting.png)]

One other function that you might find useful is the corrcoef function, which gives you a correlation matrix between two data sets. For two 1-D sets like we have (x and y), this will give a 2x2 matrix.

>>> np.corrcoef(x, y)
array([[ 1.        ,  0.92897934],
       [ 0.92897934,  1.        ]])

Note that the [0,1] and [1,0] entries of this matrix (the correlation of x with y, and of y with x, respectively) are the same as the correlation coefficient from before. This two-dimensional output is kind of a pain for two 1-D datasets, but it makes a lot of sense for datasets with more variables.

There's a lot more to the stats module alone than we can cover in just one morning, so I'll just point you to the SciPy documentation:
http://docs.scipy.org/doc/scipy/reference/index.html You might also look into the cluster, interpolate, and optimize modules, depending on what exactly you want to do (there's a tiny taste of optimize below). The others could be helpful as well, but those are among the most likely for biologists.
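
As that tiny taste of the optimize module, here's a minimal sketch that refits our noisy line from above using scipy.optimize.curve_fit, which fits any Python function you hand it (the model function name fit_line is made up for this sketch; x and y are the arrays from the linregress example):

from scipy.optimize import curve_fit

def fit_line(x, m, b):
    # The model: curve_fit will search for the m and b
    # that make this best match the (x, y) data
    return m * x + b

popt, pcov = curve_fit(fit_line, x, y)
print popt  # the best-fit [m, b]; should be close to [0.5, -10]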

Performance Enhancing Arrays


Remember how I told you that it's better to write something that's kind of good enough, rather than spending your time optimizing the code, as long as the "kind of good enough" version is a lot faster to write? Well, it turns out that with arrays we get to have our cake and eat it too! Not only is the code a lot faster to write, since you don't need explicit for loops, it also runs a lot faster. Let's investigate this by generating two lists (or arrays) of 10 million random numbers each, then adding them together element by element:

no_numpy_randsum.py

import random
 
list1 = [random.randrange(0, 100) for i in range(int(1e7))]
list2 = [random.randrange(0, 100) for i in range(int(1e7))]
list3 = [a + b for a, b in zip(list1, list2)]
 

Note that this is the fast "pure Python" version, since we're using list comprehensions. If you had to append onto a list, it would be even slower!

with_numpy_randsum.py

import numpy as np
 
list1 = np.random.randint(0, 100, int(1e7))  # high is exclusive, like randrange; size must be an int
list2 = np.random.randint(0, 100, int(1e7))
list3 = list1 + list2
 


Because numpy knows that everything in the array is going to be an integer, it can do a lot of optimizations that it couldn't do if each element could conceivably be a different type. Furthermore, in the pure Python version a lot of the time is spent doing checks on the input to the randrange function; because we're using arrays, those checks can be done twice (once per call to randint), rather than 20 million times.
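
If you want to see the difference for yourself, here's a minimal sketch that times both versions with the standard time module (the exact speedup will depend on your machine and library versions):

import random
import time

import numpy as np

n = int(1e7)

start = time.time()
list1 = [random.randrange(0, 100) for i in range(n)]
list2 = [random.randrange(0, 100) for i in range(n)]
list3 = [a + b for a, b in zip(list1, list2)]
print "pure Python:", time.time() - start, "seconds"

start = time.time()
arr1 = np.random.randint(0, 100, n)
arr2 = np.random.randint(0, 100, n)
arr3 = arr1 + arr2
print "numpy:      ", time.time() - start, "seconds"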


Tabular data


One of the things that gets boring really fast is reading and writing tables of data, especially in a way that preserves the information we care about. Recently, a really nice package has caught on for doing exactly that, called pandas (short for "panel data"). Pandas uses numpy internally, so it's really fast, but it also has lots of nice features for dealing with tables that may contain several types of data.

Let's look at an example (which I'll do in IPython, in line with the exploratory nature of poking at someone else's data):

import pandas as pd
 
calls = pd.read_table('operonCalls.tsv')
 
genomeInfo = pd.read_table('genomeInfo.tsv', index_col='sysName')
# You can specify a column to use as the index;
# this works best if that column has a different value for every row
 
calls
# calls is a pandas data type called a DataFrame
# which is modeled after an equivalent data type
# in the language R
 
calls['SysName1']  # One way to access a column
calls.SysName1 # Another way to look at a column
 
calls.ix[0]  # Using .ix is the only good way to look at a row
genomeInfo.ix['b0344'] # And it even works with the index we specified
 
 
def do_something(row):
    pass # This function does nothing
 
for index in calls.index:
    do_something(calls.ix[index])
 
# or
 
for row_num, row in calls.iterrows():
    do_something(row)
 
row.index  # a row's entries are labeled by the column names
 
row['bOp']
# It knows that this column has TRUE and FALSE, and
# converts those to boolean values
 
row['pOp']
# This column is numbers, so that's what it's stored as
 
row['Name1']
# This one is strings, so that easily pops out too!
 
genomeInfo[genomeInfo['name'] == 'lacZ']
# This is the simplest way to do a search
# (possibly the only way???)
 
calls.pOp.mean()
# You can even do math on the columns
# (or, if it's a homogeneous table, the rows)
 
np.array(calls.pOp)
# You can convert the pandas types into plain old numpy arrays
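
# And since the topics list promised writing tables as well as reading
# them: a DataFrame can write itself back out with to_csv (a minimal
# sketch; the output filename here is made up)
calls.to_csv('operonCalls_out.tsv', sep='\t', index=False)
# sep='\t' keeps it tab-separated; index=False skips the row numbers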
 
 




Exercises


1) Writing Mathematical Functions
a) Write a function that accepts an array of floats. Return an array where every value of the input array is divided by 1.5.
b) Use the random functions to generate an array of floats. Write a function that accepts this array, and returns a list of values that are more than one standard deviation from the mean of the array.
c) Write a function that evaluates the function y = e^x for values from 0 to 10 and stores the values in an array. The function should also generate a random exponential sample of the same length. Sort the random sample, then return the sum of the distances at each point from the random sample to the exponential model.

2) Strings to arrays
So we had this idea that we might be able to find a periodicity in the spacing of pyrimidine residues downstream of the termination site in Rho-dependent genes (by and large, we don't). Nevertheless:
a) Write a function that finds all locations of a given substring (like 'C', or 'CT', or whatever) in a string and returns them as an array.
b) Find the differences between each pair of adjacent locations.
c) Use the Biopython module to read in a FASTA file that has lots of 3' UTRs (we'll calculate these later) and use the previous functions to generate a full list of the spacings between the pyrimidines.
d) Compute the histogram of these spacings (we'll show you how to plot them later).


3) Updating old code

Let's go back and look at the code that writes the operons out to a GFF file. How would you modify this code to use pandas to read in the tab-delimited files?


Solutions



# a)
def two_thirds(an_array):
    return an_array/1.5
 
# b)
from numpy import mean, std
from pylab import normal

def find_outliers(an_array):
    middle = mean(an_array)
    spread = std(an_array)
    greater = an_array > (middle + spread)
    less = an_array < (middle - spread)
    less_or_greater = less + greater  # elementwise "or" of the two boolean arrays
    return an_array[less_or_greater]  # boolean mask: keep only the outliers

# OR, the same thing with an explicit loop:
def find_outliers_loop(an_array):
    middle = mean(an_array)
    spread = std(an_array)
    ret = []
    for el in an_array:
        if not (-spread < el - middle < spread):
            ret.append(el)
    return ret

print find_outliers(normal(0, 1, 1000))
 
# c)
 
from pylab import arange, exp, uniform

x = arange(0, 10, .01)
y = exp(x)  # the model: y = e^x

z = uniform(0, 10, len(y))  # random sample of x values...
z = exp(z)                  # ...pushed through the exponential
z.sort()  # note that z.sort() sorts in place and returns None

print sum(y - z) / len(y)
 

2) Histogram of spacings

import sys

import numpy as np
from Bio import SeqIO

def find_all(string, substring):
    # Return an array of every position where substring occurs in string
    loc = string.find(substring)
    locs = []
    while loc != -1:
        locs.append(loc)
        loc = string.find(substring, loc + 1)
    return np.array(locs)


def find_all_from_list(string, substrings):
    # Like find_all, but collects hits for several substrings at once
    all_locs = []
    for substring in substrings:
        all_locs.extend(find_all(string, substring))
    return np.array(sorted(all_locs))


all_diffs = []

for rec in SeqIO.parse(sys.argv[1], 'fasta'):
    hits = find_all(rec.seq, 'C')
    diffs = np.diff(hits)  # spacings between adjacent hits
    all_diffs.extend(diffs)


print np.histogram(all_diffs, range(0, 30))

Exercise 3

from collections import defaultdict
import pandas as pd
 
 
def get_operon_pairs(operon_filename):
    # operons = open(operon_filename)
    # operons.readline()
    # # This skips the header line
 
    operons = pd.read_table(operon_filename)
 
    calls = defaultdict(bool)
    # # If no call, it is not an operon
    # calls[('rrmJ', 'ftsH')] = True
    # calls[('ftsH', 'rrmJ')] = True
 
    for row_num, row in operons.iterrows():
        #line = line.strip()
        #elements = line.split('\t')
        #gene1 = elements[4]
        #gene2 = elements[5]
        #is_operon = (elements[6] == "TRUE")
        gene1 = row['SysName1']
        gene2 = row['SysName2']
        is_operon = row['bOp']
        calls[(gene1, gene2)] = is_operon
        calls[(gene2, gene1)] = is_operon
 
    return calls
 
 
operons = get_operon_pairs("operonCalls.txt")
# gene_info = open('geneInfo.txt')
# gene_info.readline()
# # Skip the header line
gene_info = pd.read_table('geneInfo.txt')
 
current_name = ''
current_start = -1
current_stop = -1
current_strand = '.'
current_operon_names = []
 
GFF_base = '\t'.join(['Chromosome', 'MicrobesOnline', 'operon',
                      '%d', # low end position
                      '%d', # high end position
                      '.', '%s', # strand
                      '.', 'operonName "%s";\n'])
outfh = open('operons2.gff', 'w')
 
genes_thus_far = set()
 
#for line in gene_info:
for row_num, row in gene_info.iterrows():
    #next_line = line.strip()
 
    if 'CRISPR' in row['name'] or 'insE' in row['name'] or 'insC' in row['name']:
        continue
 
    # next_elements = line.split('\t')
    # next_start = int(next_elements[4])
    # next_stop = int(next_elements[5])
    # next_strand = next_elements[6]
    # next_name = next_elements[8]
    next_start = row['start']
    next_stop = row['stop']
    next_strand = row['strand']
    next_name = row['sysName']
    next_hname = row['name']
 
    if not operons[(current_name, next_name)]:
        if current_name:
        # current_name starts out empty, so on the first valid line, this
        #  won't write out to the file
            GFF_line = GFF_base % (min(current_start, current_stop),
                                   max(current_start, current_stop),
                                   current_strand,
                                   '-'.join(current_operon_names))
            outfh.write(GFF_line)
        # current_operon_names = [next_name]
        current_operon_names = [next_hname]
        current_start = next_start
        current_stop = next_stop
        current_strand = next_strand
 
    else: # Genes are in the same operon
        if current_strand != next_strand:
            print row
            # This is just a consistency check.
        if current_strand == "+":
            current_start = min(current_start, next_start)
            current_stop = max(current_stop, next_stop)
            # current_operon_names.append(next_name)
            current_operon_names.append(next_hname)
        elif current_strand == "-":
            current_start = max(current_start, next_start)
            current_stop = min(current_stop, next_stop)
            # current_operon_names.insert(0, next_name)
            current_operon_names.insert(0, next_hname)
 
    current_strand = next_strand
    current_name = next_name
 
 
# Clear out the last one, which has no "next"
GFF_line = GFF_base % (min(current_start, current_stop),
                       max(current_start, current_stop),
                       current_strand,
                       '-'.join(current_operon_names))
outfh.write(GFF_line)
outfh.close()