Interactive Python: Running code as we go along


So that's it. You're done. We've taught you all the core programming concepts that you'll need for the next week and a half, as well as for most of your life as a person who programs. Every language worth using will have some version of integers, floats, strings, lists, if/else, loops of some kind, and functions.

external image 41XbV6PjcHL._SY300_.jpg

... Okay, so while that's true, we're not going to let you loose on the world just yet. Just like in every superhero movie, the hero first learns they have powers, then they have the training montage, where they refine their skills. You guys have your powers now, but we're going to spend some time learning how to use your powers. We'll also show you a whole bunch of modules and functions that make the jobs you need to get done faster and easier.

By now, you've been writing code for 3 days, and it should be pretty apparent that it's easy to make python crash. Whether through syntax errors (which make Python crash before it even runs your code), or index or key errors, or a host of other errors that you'll run into, there's just a lot that can go wrong. Oddly enough, having your program crash is actually the good kind of error! Far more insidious is when your code happily runs and does something not entirely what you meant.
mops.jpg
A program (fill the cauldron) gone wrong!



Tomorrow morning, Diana's going to go through a couple "Strategic Initiatives" to help us get bugs out of our code, and keep it bug free, but it will bear repeating that one of those is "Test Early, Test Often".


One of the ways that we can fight against this is to have tests for our code. (There's even a style of programming, called Test Driven Development, where you write the test first, then write the code to pass the test.) Writing tests for your code is hard, but it's also a worthwhile skill to have.

Just like in bench science, it's always a good idea to have both positive and negative controls.

One of the ways to make your tests as compact and easy to understand as possible is to have your functions that you're testing also be small and logically compact. In a highly contrived example, let's imagine a function that checks whether a list has the proper number of entries for a GFF file (9), then if it does, finds something that looks like a FlyBase gene identifier, and prints it.

def get_flybase_id(gff_list):
    if len(gff_list) == 9:
        fbgn_start = gff_list[8].find('FBgn')
        fbgn_end = gff_list[8].find(';', fbgn_start)
        return gff_list[8][fbgn_start:fbgn_end]
 

Now logically, this function has two separate ideas, each of which relies on a bit of domain specific knowledge. For instance, both the 8 and the 9 are what programmers call "magic numbers". While these aren't necessarily bad, someone else who's looking at your code might wonder why 9, and not 10, or 8. And the 8 in the next couple lines might be because it's 1 less than 9 (which is the maximum index), or it could just be a separate number that happens to be close to 9. Basically, for a 5 line function, it will be pretty confusing for the next person who looks at it, and that next person might be you in a couple months, which is more than long enough to forget why you wrote something the way you did.

What if we compare it to this set of functions:

def get_flybase_id(gff_list):
    if is_valid_gff(gff_list):
        return find_fbgn(gff_list)
 
def is_valid_gff(gff_list):
    return len(gff_list) == 9
    # There are 9 entries in the GFF specification:
    # http://www.sanger.ac.uk/resources/software/gff/spec.html
 
def find_fbgn(gff_list):
    annotation = gff_list[-1]
    fbgn_start = annotation.find('FBgn')
    fbgn_end = annotation.find(';', fbgn_start)
    return annotation[fbgn_start: fbgn_end]
 

This has a couple distinct advantages: by breaking it up, we can re-use each of the functions later. Second, we can test each of the functions separately, and if it turns out that there's a bug in one of them, then we can fix it once, and everywhere it gets reusued, it'll be fixed there. Finally, as a wise man once said, there's two ways to write software: one is to write code so simple that there are obviously no bugs, and the second is to write code so complex that there are no obvious bugs. By breaking things up into sub-functions, it's easier to do the former than the latter. It's also easier to try different inputs to the function, and see what results it gives.

Now, we can do this by making a separate file that tests the output to make sure it gives what we want, but another quick and dirty way to do it is to use an interactive interpreter.

The basic interpreter


In the last 3 days, has anyone tried typing just
$ python

on the command line? If you haven't, go ahead and try it now. Rather than complaining about the fact that there's no file for it to run, it prints some stuff out, and gives us a totally different command line:

Enthought Python Distribution -- www.enthought.com
Version: 7.3-1 (32-bit)

Python 2.7.3 |EPD 7.3-1 (32-bit)| (default, Apr 12 2012, 11:28:34)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "credits", "demo" or "enthought" for more information.
>>>

The >>> is a prompt to enter commands, like in the shell, but now we can enter any Python code that we want. I'll be using it from now on to indicate anything that will go into an interactive interpreter, as opposed to into a file. Let's start by making a variable

>>> people = ['Aisha', 'Diana', 'Peter']

Nothing happened... or did it?
>>> print people
['Aisha', 'Diana', 'Peter']
>>> people
['Aisha', 'Diana', 'Peter']
As it turns out, we so often want to see the value of something from the interactive interpreter that we can just type the name of the variable (or many other expressions without an assignment in them), and the interpreter will print them out for us.

>>> people*3
['Aisha', 'Diana', 'Peter', 'Aisha', 'Diana', 'Peter', 'Aisha', 'Diana', 'Peter']
>>> people  # is left unchanged
['Aisha', 'Diana', 'Peter']

One key thing you might like to do from here is to get help about what else you can do

>>> help()
 
Welcome to Python 2.7!  This is the online help utility.
 
If this is your first time using Python, you should definitely check out
the tutorial on the Internet at http://docs.python.org/tutorial/.
 
Enter the name of any keyword, or topic to get help on writing Python
programs and using Python.  To quit this help utility and return to the
interpreter, just type "quit".
 
To get a list of available modules, keywords, or topics, type "keywords",
or "topics".

As the help function has just informed us, you can use help() to get help on specific modules, keywords, or topics. Let's use it to get the low-down on the other very useful interpreter function, dir().

Help on built-in function dir in module __builtin__:
 
dir(...)
dir([object]) -> list of strings
 
If called without an argument, return the names in the current scope.
Else, return an alphabetized list of names comprising (some of) the attributes
of the given object, and of attributes reachable from it.
If the object supplies a method named __dir__, it will be used; otherwise
the default dir() logic is used and returns:
for a module object: the module's attributes.
for a class object:  its attributes, and recursively the attributes
of its bases.
for any other object: its attributes, its class's attributes, and
recursively the attributes of its class's base classes.
(END)

This shows us that dir() can tell us about objects in our namespace, which in non-jargon means that dir() can tell us about things like variables and modules and functions, and which ones we have access to at whatever point we're at in our program. To get out of the help mode and back into the main interpreter, we just keep hitting "q" for quit until we get back to our familiar >>>

help> q
 
You are now leaving help and returning to the Python interpreter.
If you want to ask for help on a particular object directly from the
interpreter, you can type "help(object)".  Executing "help('string')"
has the same effect as typing a particular string at the help> prompt.

This gives us some useful information about how better to use help in the future.

Now, what does dir() do by itself?

>>> dir()
['__builtins__', '__doc__', '__name__', '__package__', 'people']

So, we should recognize our list variable, "people", from before. So, what does dir() have to say about that?

>>> dir(people)
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__',
'__delslice__', '__doc__', '__eq__','__ge__', '__getattribute__',
'__getitem__', '__getslice__', '__gt__', '__hash__', '__iadd__',
'__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__',
'__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__',
'__setslice__', '__str__', 'append', 'count', 'extend', 'index',
'insert', 'pop', 'remove', 'reverse', 'sort']

This command shows us ALL the things that belong to our list variable, including the methods that we can call to manipulate it, such as .append(), etc. It also shows us a whole bunch of internal objects that belong to our list variable (which don't belong to us, and we don't get to use them -- at least not in this class).

So, hopefully you can see how to use this power for good: dir() can tell you all the methods and functions available to an object. For those of us that prefer to work in the interpreter, we find ourselves only rarely consulting exterior documentation. We can use dir() to find out what is available to an object, and then we can use help() to figure out what things are:

>>> help(people.append)
 
Help on built-in function append:
 
append(...)
L.append(object) -- append object to end
(END)

Which very succinctly tells us what we've already learned: people.append() will add whatever object we put in the parentheses to the end of the people list.



Using IPython, an enhanced Python interpreter



The default Python interpreter is a very basic interface. There are things that you'd like to be able to do, such as enter more than one line of code at a time, or perhaps look back through your history of commands. The enhanced IPython interpreter provides these perks and more.

You can start off by launching IPython from the command line:

$ ipython

Enthought Python Distribution -- www.enthought.com
 
Python 2.7.3 |EPD 7.3-1 (32-bit)| (default, Apr 12 2012, 11:28:34)
Type "copyright", "credits" or "license" for more information.
 
IPython 0.12.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
 
In [1]:

IPython launches and tells us what version we're running, and gives us a couple of tips. Just typing a question mark and pressing enter will get us a comprehensive introduction to IPython, which I'll leave to you to peruse in your leisure time. You'll notice that the prompt for IPython includes a line number for each line of your code, signified on the first line by ln [1]:. The %quickref guide gives us a succinct description of the fanciness of IPython. For starters, you have instant access to the system prompt for commands like ls and cd.

In [2]: ls
data/                pythons_of_the_world.txt
 


A number of magic functions, all of which start with a percent sign %, provide functionality that the regular interpretter cannot provide. One for regular use is %history, which will show you the lines of code you have previously entered in this session:

In [6]: %hist -nt
1: get_ipython().magic(u'quickref')
2: get_ipython().system(u'ls -F ')
3: get_ipython().magic(u'history ')
4: get_ipython().magic(u'hist ')
5: get_ipython().magic(u'pinfo history')
6: get_ipython().magic(u'hist -nt')

The -n flag means "show line numbers", which is actually usually more of a hindrance, so the default is to do it without. The -t flag means to translate what you wrote into what IPython is feeding directly into Python. This generally applies only to the "magic" commands that you type in.

In [7]: mystr = 'a;dlska;'
 
In [8]: for i in mystr:
...:     print i
...:
a
;
d
l
s
k
a
;
 
In [9]: %hist
%quickref
%ls
%history
%hist
%history?
%hist -nt
mystr = 'a;dlska;'
for i in mystr:
print i
%hist

See how the regular Python code got left alone? Now one of the primary use cases for IPython is to try out code that you're not sure how it's going to work, then add it into a .py file that you'll run over and over, once you're mostly confident that what you have is pretty good. If this is the first time you're saving into a given file, IPython makes it even easier!

In [11]: %save histfile.py 7-8
The following commands were written to file `histfile.py`:
mystr = 'a;dlska;'
for i in mystr:
print i

And now any time you want to run that script, you can do it from IPython:

In [12]: run histfile   # The .py is optional
a
;
d
l
s
k
a
;

One of the nice things about this approach is that all the variables from the global namespace of your script also get added to the interpreter's namespace, so you can then use a script to generate or load in some data, play around with it in IPython, then go back and update your script. It's also possible to edit pre-existing files from within IPython!

In [10]: savefile histfile.py 7-8
Editing...
 
 
 
 
 
1 # coding: utf-8
2 mystr = 'a;dlska;'
3 another_str = "This one doesn't get printed"
4 for i in mystr:
5     print i
6
 
----
Editing... done. Executing edited code...
a
;
d
l
s
k
a
;
 
In [14]: print another_str
This one doesn't get printed

Note that edit assumes you want to run the code immediately afterwards. If this isn't the case, you can call edit -x.

Okay, so there are lots of nice little magic tricks in IPython, and you can use the %quickref guide to explore them more on your own. Meanwhile, lemme show you a couple of other handy tricks that don't involve magic functions.

Remember in the basic Python interpretter, we could use dir() to find out what objects belonged to a given object? Well, in IPython, all we have to do to capture the same information is type the object and period, then press the tab.

In [16]: mystr.
mystr.capitalize  mystr.isalnum     mystr.lstrip      mystr.splitlines
mystr.center      mystr.isalpha     mystr.partition   mystr.startswith
mystr.count       mystr.isdigit     mystr.replace     mystr.strip
mystr.decode      mystr.islower     mystr.rfind       mystr.swapcase
mystr.encode      mystr.isspace     mystr.rindex      mystr.title
mystr.endswith    mystr.istitle     mystr.rjust       mystr.translate
mystr.expandtabs  mystr.isupper     mystr.rpartition  mystr.upper
mystr.find        mystr.join        mystr.rsplit      mystr.zfill
mystr.format      mystr.ljust       mystr.rstrip
mystr.index       mystr.lower       mystr.split


This shows us all the things we can use or modify that belong to our variable mystr, from our for loop. If you know what letter(s) a particular function you wants ends with, you can type them and it narrows the list:

In [16]: mystr.l
mystr.ljust   mystr.lower   mystr.lstrip
 
In [16]: mystr.l

If you give enough of the name that it's unique, you can just type that and hit tab, which will save you time.

And what if you don't know what one of those functons or methods is? Try this:
In [16]: mystr.endswith?
Type:       builtin_function_or_method
Base Class: <type 'builtin_function_or_method'>
String Form:<built-in method endswith of str object at 0x10d3780>
Namespace:  Interactive
Docstring:
S.endswith(suffix[, start[, end]]) -> bool
 
Return True if S ends with the specified suffix, False otherwise.
With optional start, test S beginning at that position.
With optional end, stop comparing S at that position.
suffix can also be a tuple of strings to try.
 
In [17]:
 


One question mark at the end of the line brings up a help dialog for the object, including the docstring for a function or method.
This is nearly the same as doing help(mystr.endswith), but 5 fewer keystrokes, or more if you need to scroll around to put the "help(" at the beginning after typing mystr.endswith.

If you call it with two question marks, and the source is available, it will display that as well! A lot of Python is written in Python, so one way to learn about things is to see how other people have written their modules.

What if there's a large block of code you want to paste in? You may have noticed, as we were typing in the for loop, it automatically gave us the standard 4 spaces to indicate the block. To get around this, use the %cpaste magic command, which lets you paste things in blocks of code from elsewhere, without trying to add indentation. To stop pasting, have a line with only two hyphens on it ('--').

In [25]: cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:def get_flybase_id(gff_list):
:    if is_valid_gff(gff_list):
:        return find_fbgn(gff_list)
:
:def is_valid_gff(gff_list):
:    return len(gff_list) == 9
:    # There are 9 entries in the GFF specification:
:    # http://www.sanger.ac.uk/resources/software/gff/spec.html
:
:def find_fbgn(gff_list):
:    annotation = gff_list[-1]
:    fbgn_start = annotation.find('FBgn')
:    fbgn_end = annotation.find(';', fbgn_start)
:    return annotation[fbgn_start: fbgn_end]
:--
 
In [26]: is_valid_gff([1,2,3])
Out[26]: False
 
In [27]: is_valid_gff(range(9))
Out[27]: True

Now, let's see how well that code we wrote before stands up:

In [36]: gffline = ('2L\tmE1_TFBS_cad\tTF_binding_site\t5126\t6402\t.\t.\t.\t'
                    'ID=FBsf0000240997;Name=TFBS_cad_000012;bound_moiety=cad:FBgn0000251')
 
In [37]: is_valid_gff(gffline.split())
Out[37]: True
 
In [38]: find_fbgn(gffline.split()[-1])
Out[38]: ''
 

We can look at the string that I took from a FlyBase gff file and see that there's a FBgn in it, but before, I assumed there would be a semicolon after that, and there isn't. We can certainly go through and correct the code, although as an example, I think you all see why you need to test code, because even things that are relatively simple are still complex enough to contain no obvious error... (The really great part is that when I was writing this lecture, I was expecting this positive control to come back with the name, and then demonstrate a more pathological case where things went wrong with the code)


Profiling


One thing that's pretty valuable, that IPython lets us do easily is "profiling" our code, which is a way to see where all the time is being spent. Now, I've said earlier that it's better to do something slow and simply, than (potentially) super-fast and complicated, but if you find that you're doing something a lot, it may be worthwhile to spend some time figuring out how to make it faster (especially if that comes at a low complexity cost).


As an example, let's look at our fasta parser:

def fastaParser(filename):
   current_gene = ""
   genes = {}
   fh = open(filename, 'r')
 
   for line in fh:
       line = line.strip()
       if line.startswith('>'):
           current_gene = line[1:]
           genes[current_gene] = ''
       else:
            join(genes, current_gene, line)
 
   return genes
 
 
def join(gene_dict, gene, line):
    gene_dict[gene] += line
 
 
yeast_genome = fastaParser('../fasta_files/cerevisiae_genome.fasta')
 
print "There are", len(yeast_genome), "chromosomes in yeast"
 

Now, this is probably okay for short little genes, but what happens if we try to parse the yeast genome, which has lots of long sequences?


When we run this, it takes a really long time (something like 30 seconds on my computer). Now, the yeast genome isn't that big (not to knock those mighty single-celled fermenters, but 12megabases is nothing to fly's 130MB, or a human's 3 GB, or Paris japonica, a canopy plant with a whopping 150 GB genome), so it would be nice if we could find a faster way to run it. To do that, though, we'd need to know where to focus our efforts. To do that, from within IPython, we run the code with the '-p' flag:

In [1]: run -p yeast_genome.py
 
 
         607964 function calls in 33.858 seconds
 
   Ordered by: internal time
 
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   202628   33.176    0.000   33.176    0.000 exercises.py:17(join)
        1    0.484    0.484   33.856   33.856 exercises.py:1(fastaParser)
   202645    0.112    0.000    0.112    0.000 {method 'startswith' of 'str' objects}
   202645    0.084    0.000    0.084    0.000 {method 'strip' of 'str' objects}
        1    0.001    0.001   33.857   33.857 yeast_genome.py:1(<module>)
        1    0.000    0.000   33.857   33.857 {execfile}
        1    0.000    0.000    0.000    0.000 {posix.getcwdu}
        1    0.000    0.000   33.858   33.858 interactiveshell.py:2257(safe_execfile)
...

So I broke up the fastaParser function in a bit odd of a way, but that's because I happened to know where it would be spending almost all of its time: the join function. (By the way, if you're profiling your own code and you think there may be a basic operation that you can't profile (something like +), feel free to break it out into it's own function when necessary).

What's happening here is that since a string is an immutable type, whenever you + two strings together, you have to make a new one. This is generally okay for short strings, but if you're doing it a lot of times, with increasingly longer strings, it gets really slow (in fact, it slows down with the square of the number of things you're adding together).

A different way to do this would be:


def fastaParser(filename):
   current_gene = ""
   genes = {}
   fh = open(filename, 'r')
 
   for line in fh:
       line = line.strip()
       if line.startswith('>'):
           current_gene = line[1:]
           genes[current_gene] = []
       else:
            join(genes, current_gene, line)
 
   for gene in genes:
       genes[gene] = ''.join(genes[gene])
 
   return genes
 
 
def join(gene_dict, gene, line):
    gene_dict[gene].append(line)

So what have I done here? Instead of building up the string as we go along, I add them to a list. Lists have been optimized such that you can add to them, and they don't slow down as they get longer. Then, once everything is in a list, we go through for all of the keys and use the string "join" method (which is different from the join function we wrote), which will combine all the items in the list, putting an empty string between them.

         810608 function calls in 0.605 seconds
 
   Ordered by: internal time
 
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.310    0.310    0.605    0.605 exercises.py:1(fastaParser)
   202628    0.113    0.000    0.137    0.000 exercises.py:20(join)
   202645    0.070    0.000    0.070    0.000 {method 'strip' of 'str' objects}
   202645    0.063    0.000    0.063    0.000 {method 'startswith' of 'str' objects}
       17    0.025    0.001    0.025    0.001 {method 'join' of 'str' objects}
   202633    0.024    0.000    0.024    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.605    0.605 {execfile}
        1    0.000    0.000    0.605    0.605 yeast_genome.py:1(<module>)
        1    0.000    0.000    0.605    0.605 interactiveshell.py:2257(safe_execfile)
        1    0.000    0.000    0.000    0.000 {posix.getcwdu}
        2    0.000    0.000    0.000    0.000 {open}
        1    0.000    0.000    0.000    0.000 syspathcontext.py:55(__enter__)
        1    0.000    0.000    0.000    0.000 posixpath.py:312(normpath)
        1    0.000    0.000    0.000    0.000 posixpath.py:341(abspath)
        1    0.000    0.000    0.605    0.605 py3compat.py:170(execfile)

So now we see that our own join function goes about 300 times faster, and the we spend almost no time at all in the extra string joining. Plus, the code is not really that much more complicated.

A general piece of advice is not to try to optimize your code until *after* you've profiled it. Otherwise, you might spend a lot of time making something largely irrelevant twice as fast, but only improve your total speed by a very tiny amount.

Exercises

Exercise 1 should be considered "required". Work on it tonight if you're unable to finish it during the time provided. Nevertheless, it may be helpful to do some of the other exercises first if you want to get a better handle on some of the things we've talked about.

1. One of the things we'll need next week during the Project portion of the class is a GFF file with all of the E. coli operons, or as EcoCyc might call them "transcription units". They call it something different, since in their mind an operon must have more than one gene in it. Instead, we're interested in everything that's likely to appear on a single transcript, whether that's one gene or several.

In the middle of the last decade, the Arkin lab put together a set of predictions about which genes in the E. coli genome (as well as lots of other bacterial genomes), were likely to be on the same transcript, and thus fit a rough definition of "operon". Their predictions were based on :
  • The distance between them in nucleotides
  • Whether the genes are conserved near each other in other genomes, based on MicrobesOnline Ortholog Groups
  • The correlation of their expression patterns, if gene expression data is available
  • Whether they both belong to a narrow GO category
  • Whether they share a COG functional category

While their E. coli data is online, it's in a format that is not standard and not terribly convenient. What we'd like to do here is take the tab-delimited version of the table, as well as the tab-delimited list of all genes that's also provided, and make a GFF file that has one operon per line.

A GFF is also tab delimited, and has the following fields:
  1. <seqname> This is the name of the reference sequence. If we had multiple chromosomes, it could change, but for E. coli, it will just be "Chromosome"
  2. <source> This is some information on where this annotation came from. If you're combining the results of multiple programs, it might change, but for us, it can just be "MicrobesOnline"
  3. <feature> This is the type of feature. A GFF file can have multiple kinds of features, like genes, exons, introns, tRNAs, and more all in the same file.
  4. <start> An integer from 1 to the size of the reference, inclusive. Note that, unlike Python, sequences start counting from 1.
  5. <end> An integer from <start> to the size of the reference. By convention, even if it's on the - strand, this will be a greater or equal coordinate than the <start>. Often times, you'll have to switch these by hand when your feature is on the minus strand. I'm not terribly pleased by this compromise, but it does seem to be standard, and you'll likely break other software if you go against it.
  6. <score> Often times, this is just a "." (without the quotes), meaning no score was assigned. Extra credit if you want to take the pOp column from the table and combine them in a sensible way.
  7. <strand> This will be either "+" or "-" (without the quotes). Some feature types won't have a direction, and will be ".", but really very few.
  8. <frame> This will be "." for our purposes. The most common case you'll see a number here is for exons. The first one will start at 0, but if other exons get spliced in the middle of a codon, it might be 1 or 2.
  9. [attributes] This is a semicolon separated list of whatever else you want to put in. The format is something like: Name1 "Value1"; Name2 "Value2". For our purposes, let's have one attribute, named "operonName", with the value a hyphen separated list of genes in that operon. For instance, we should have an operon named: "lacZ-lacY-lacA". (This is not the usual way of naming things, but easier, and clear enough. For bonus points, figure out how to get the canonical "lacZYA")

As for the tab separated operon calls file, here are the rules I can glean by spending some time looking at it:
  1. If two adjacent genes are in the same direction, determine whether they are part of the same operon. If they are, print TRUE in the bop column, otherwise, print FALSE. For reasons I won't speculate on, the cutoff seems to be if the probability is 0.570 or greater.
  2. If two adjacent genes are in opposite directions, then they are clearly not on the same operon, so no line is necessary.
  3. Genes on the minus strand look like they can appear in either the transcriptional order (like the trp operon) or in order along the '+' strand (like the lac operon)

For reasons that are, like the cutoff of 0.570, incomprehensible to me, the gene rlmE is referred to by its synonym rrmJ in the genome information file, and as rlmE in the operon prediction file. Feel free to edit one of them in the data file, or to have a special case at some point in your program.

It's worth bearing in mind that depending on whether you try to collate the names into the operon calls, or the operon calls into the names, this can be a lot harder or a lot easier.

The first few lines of my operons.gff file looks like this:

Chromosome    MicrobesOnline    operon    190    255    .    +    .    operonName "thrL";
Chromosome    MicrobesOnline    operon    337    5020    .    +    .    operonName "thrA-thrB-thrC";
Chromosome    MicrobesOnline    operon    5234    5530    .    +    .    operonName "yaaX";
Chromosome    MicrobesOnline    operon    5683    6459    .    -    .    operonName "yaaA";
Chromosome    MicrobesOnline    operon    6529    7959    .    -    .    operonName "yaaJ";
Chromosome    MicrobesOnline    operon    8238    9191    .    +    .    operonName "talB";
Chromosome    MicrobesOnline    operon    9306    9893    .    +    .    operonName "mog";
Chromosome    MicrobesOnline    operon    9928    10494    .    -    .    operonName "yaaH";
Chromosome    MicrobesOnline    operon    11382    11786    .    -    .    operonName "yaaI";
Chromosome    MicrobesOnline    operon    12163    15298    .    +    .    operonName "dnaK-dnaJ";


2) From ipython, type %magic to read about more magic functions that we did not discuss. Usually from inside of iPython, if you type ls you will get the directory listing of the current working directory. Change this behavior so that it displays the long directory listing (i.e. like using ls -l in the terminal).

3) One of the things we're teaching you in this course is how to learn more on your own. Give a man a fish and all that... In this exercise, we'll give you some guidance on exploring the BioPython module. One of the more useful sub-modules is SeqIO, which does input/output for sequences (and, I'm sorry to say, probably does it better than the FASTA parsers we've written thus far). In IPython, go to the directory with the yeast chromosomes, then simply type (or cpaste):

from Bio import SeqIO
chromosomes = [rec for rec in SeqIO.parse('yeast_genome.fa', 'fasta')]
               # or whatever your genome is named
rec0 = chromosomes[0]

a) Now, use one of the methods we taught today to figure out what attributes this chromosome has. Some of these are methods, and some are just plain variables.
b) Play around with the reverse_complement method. How does it handle ambiguous bases? Capitalization? You may find it helpful to create your own Seq objects, like this:

from Bio import Seq
my_seq = Seq.Seq('CAGGTAG')

c) Get the help information on SeqIO.write, and figure out how to write out the reverse-complement of all of the yeast genome to a new file.

d) Write a function that will convert a one-letter amino acid sequence (e.g. A = alanine, K = lysine, etc) into a three-letter amino acid sequence (alanine is Ala, lysine is Lys). You may find the following dictionary helpful:

threecode = {'A':'Ala', 'B':'Asx', 'C':'Cys', 'D':'Asp',
                 'E':'Glu', 'F':'Phe', 'G':'Gly', 'H':'His',
                 'I':'Ile', 'K':'Lys', 'L':'Leu', 'M':'Met',
                 'N':'Asn', 'P':'Pro', 'Q':'Gln', 'R':'Arg',
                 'S':'Ser', 'T':'Thr', 'V':'Val', 'W':'Trp',
                 'Y':'Tyr', 'Z':'Glx', 'X':'Xaa', '*':'Ter',
                 'U':'Sel', 'O':'Pyl', 'J':'Xle',
                 }

e) Now, import Bio.SeqUtils, and look at SeqUtils.seq3. How similar is this to what you wrote in part d?


Solutions


Exercise 1


Okay, this one was kind of hard. Here's how I thought about it, although there were certainly other (possibly more correct) valid solutions.

It looks to me like the genome listing was in order, so I made the assumption that if two genes are not contiguous in the gene listing, then they won't be in the same operon. (The converse, that if they *are* contiguous they will be in the same operon is clearly false).

Therefore, I made a dictionary of all of the pairs of genes that are called in the same operon, and then looped through the genome listing. If adjacent genes are in the same operon, I keep track of them in a "current operon" list.

from collections import defaultdict
 
 
 
def get_operon_pairs(operon_filename):
    operons = open(operon_filename)
    operons.readline()
    # This skips the header line
 
    calls = defaultdict(bool)
    # If no call, it is not an operon
    calls[('rrmJ', 'ftsH')] = True
    calls[('ftsH', 'rrmJ')] = True
 
    for line in operons:
        line = line.strip()
        elements = line.split('\t')
        gene1 = elements[4]
        gene2 = elements[5]
        is_operon = (elements[6] == "TRUE")
        calls[(gene1, gene2)] = is_operon
        calls[(gene2, gene1)] = is_operon
 
    return calls
 
 
operons = get_operon_pairs("operonCalls.txt")
gene_info = open('geneInfo.txt')
gene_info.readline()
# Skip the header line
 
current_name = ''
current_start = -1
current_stop = -1
current_strand = '.'
current_operon_names = []
 
GFF_base = '\t'.join(['Chromosome', 'MicrobesOnline', 'operon',
                      '%d', # low end position
                      '%d', # high end position
                      '.', '%s', # strand
                      '.', 'operonName "%s";\n'])
outfh = open('operons2.gff', 'w')
 
genes_thus_far = set()
 
for line in gene_info:
    next_line = line.strip()
 
    if 'CRISPR' in line or 'insE' in line or 'insC' in line:
        continue
 
    next_elements = line.split('\t')
    next_start = int(next_elements[4])
    next_stop = int(next_elements[5])
    next_strand = next_elements[6]
    next_name = next_elements[8]
 
    if not operons[(current_name, next_name)]:
        if current_name:
        # current_name starts out empty, so on the first valid line, this
        #  won't write out to the file
            GFF_line = GFF_base % (min(current_start, current_stop),
                                   max(current_start, current_stop),
                                   current_strand,
                                   '-'.join(current_operon_names))
            outfh.write(GFF_line)
        current_operon_names = [next_name]
        current_start = next_start
        current_stop = next_stop
        current_strand = next_strand
 
    else: # Genes are in the same operon
        if current_strand != next_strand:
            print line
            # This is just a consistency check.
        if current_strand == "+":
            current_start = min(current_start, next_start)
            current_stop = max(current_stop, next_stop)
            current_operon_names.append(next_name)
        elif current_strand == "-":
            current_start = max(current_start, next_start)
            current_stop = min(current_stop, next_stop)
            current_operon_names.insert(0, next_name)
 
    current_strand = next_strand
    current_name = next_name
 
 
# Clear out the last one, which has no "next"
GFF_line = GFF_base % (min(current_start, current_stop),
                       max(current_start, current_stop),
                       current_strand,
                       '-'.join(current_operon_names))
outfh.write(GFF_line)
outfh.close()
 



Exercise 2


In [13]: alias ls ls -l
 

Exercise 3



c) Get the help information on SeqIO.write, and figure out how to write out the reverse-complement of all of the yeast genome to a new file.

from Bio import Seq, SeqIO
my_seq = Seq.Seq('CAGGTAG')
SeqIO.write(my_seq.reverse_complement(), 'new_file.fasta')
# or
my_rc = my_seq.reverse_complement()
SeqIO.write(my_rc, open('new_file.fasta', 'a') )

d) Write a function that will convert a one-letter amino acid sequence (e.g. A = alanine, K = lysine, etc) into a three-letter amino acid sequence (alanine is Ala, lysine is Lys).

threecode = {'A':'Ala', 'B':'Asx', 'C':'Cys', 'D':'Asp',
                 'E':'Glu', 'F':'Phe', 'G':'Gly', 'H':'His',
                 'I':'Ile', 'K':'Lys', 'L':'Leu', 'M':'Met',
                 'N':'Asn', 'P':'Pro', 'Q':'Gln', 'R':'Arg',
                 'S':'Ser', 'T':'Thr', 'V':'Val', 'W':'Trp',
                 'Y':'Tyr', 'Z':'Glx', 'X':'Xaa', '*':'Ter',
                 'U':'Sel', 'O':'Pyl', 'J':'Xle',
                 }
 
def seq3(aa_seq):
    aa_list = []
    for aa in aa_seq:
        aa_list.append(threecode[aa])
    return "".join(aa_list)
 
 
# or
 
 
def seq3(aa_seq):
    aa_list = [threecode[aa] for aa in aa_list]
    return "".join(aa_list)