Session 6.1



System Calls and Scripting External Programs


Topics:

  • Manipulating the file system and UNIX environment from Python
  • Executing other external user space programs with os and sys modules
  • Executing external processes with the subprocess module
  • Running BLAST from a Python script

Introduction


This morning we will learn to use the os and sys modules to interface with the UNIX environment by running external programs to manipulate the filesystem and memory space. Then we will move on to the more sophisticated subprocess modules.

Useful Tools from the os Module


The UNIX environment that you have access to when you type in the terminal is also available to Python. That means that you don't have to open a separate terminal window just to change directories, create a new file, or see what operating system you program is running on. This also means that you have the capacity to script what would otherwise be exceedingly mundane tasks, such as renaming lots of files or creating various subdirectories for your output from different data sets.

You've already seen one key component of the os module, the function used to open filehandles: open(). Let's import the module and look at some of the others.

os.py

import os
# Let's make sure we know which working directory we are in
cwd = os.getcwd()
print cwd, 'is the current working directory.'
 
# And which files are present in the cwd? Note that the '.' represents the current
# working directory
cwdfiles = os.listdir('.')
 
# The output of listdir is stored as a list, so let's print it item by item.
for cwdfile in cwdfiles:
    print cwdfile
 
# And we can look in other directories too. Note that you'll have to change
# the dir below to one that exists on your filesystem.
homefiles = os.listdir('/Users/¬°yournamehere!')
for file in homefiles:
    print file
 
# We could also first change our cwd to a different directory, and then get the file list.
os.chdir('/')
print os.getcwd(), 'is the now the current working directory'
cwdfiles = os.listdir('.')
for file in cwdfiles:
    print file

We can also use the os module to manipulate the filesystem.

#!/usr/bin/env python
 
import os
# Let's make some new directories and files
numbers = [1, 2, 3]
letters = ['a', 'b', 'c']
for num in numbers:
    for let in letters:
        os.mkdir(str(num) + let)
 
# Let's get all our new directories in a list
diritems = os.listdir('.')
 
# And let's make some files in them
teachers = ['Peter', 'Debbie', 'Diana','Aisha']
 
for diritem in diritems:
# using what we've learned about exceptions and testing
    try:
        os.chdir(diritem)
        for teach in teachers:
            for num in numbers:
                try:
                    open(teach + '_' + str(num), 'w')
                except:
                    print "Caught a general exception attempting to open the new file",teach + '_' + str(num)
    except OSError:
        print "caught an error from module OS, possible that", diritem,"isn't a directory after all."
    except:
        print "caught a general exception trying to change to directory", diritem
    os.chdir('../')
 

Whew, what a mess. Let's clean up all those unimportant files, leaving only the important ones--mine!

import os
 
diritems = os.listdir('.')
print diritems
for diritem in diritems:
    print "diritem is",str(diritem)
    try:
        os.chdir(diritem)
        subdiritems = os.listdir('.')
        for subitem in subdiritems:
            if (subitem.startswith('Aisha')):
                continue
            else:
                os.remove(subitem)
    except OSError:
        print "caught an error from module OS, possible that", diritem, "isn't a directory after all."
    except:
        print "caught a general exception trying to change to directory", diritem
    os.chdir('../')
 

Okay, enough with files for now. Let's look at how we can execute UNIX commands from python. We'll start with using os.system().

import os
os.system('ls -l')

$ python os.py

Okay, so we can see the output of the UNIX command ls -l from the script, and compare it to the regular terminal. They should be the same!

A subtle difference of trailing slashes is the only difference in my case. But this is not the same as using os.listdir('.'). The two differences are that os.listdir('.') does not have access to the detailed information that regular command line ls -l does, and that os.listdir('.') returns a list of the directory items, not merely a string of the command line output. But this was a truly trivial example, so let's move on to, well, another truly trivial example.

import os
#remember our script from the first of the week?
os.system('../S1.2/hello.py')

$ os.py
hello world!

Yep, it's still there. And we can use another program to run it!

Just like opening files, it's a good idea to look for exceptions in the process of executing system calls. This is implemented very simply with os.system(), as the function will return a non-zero value if the call fails to execute.

import os
 
dirs = ['.', 'asdjfaklsdfjlksadjf']
print dirs
 
for d in dirs:
    if ( os.system('ls '+d) != 0 ):
        print os.system('ls '+d)
        print "Whoa, that didn't work out, which ls was happy to tell us about."
    else:
        print os.system('ls '+d)
        print "Well, okay, great. Thanks ls."
 
 

But this really is the simple way. Often when we are using python to run other programs, the programs we are executing may finish at different times. When that's the case, we may want to control the order in which they are executed, or merely the number of subprocesses (or pieces of each process) running at once. A process (or program) is the execution of instructions contained in a given piece of code, and a subprocess is the execution of pieces of instructions within that larger set of instructions.

These days, even laptops have multiple CPUs in them, but we still might not want to dump hundreds of subprocesses on our computer at once. The subprocess module allows us to be a bit more intelligent about how we execute system calls by using a method called subprocess.Popen(). This is somewhat different, however, in that the subprocess we want to execute must be created as an object, which we can then monitor with the methods of this object.

We're going to use a little UNIX tool called time, which will tell you how long it took to run your script.

$ time python os.py

And we see that script outputs the same as os.system() did, but with the added "time" info on the last line.


import subprocess as sp
 
for each in range(0,100):
    a = sp.Popen( ['ls', '-l'] )
    a.wait()
 

The wait() method of the subprocess object tells the system to wait until this subprocess has completed before doing anything else in the program. When we look at the time output for this code, we see that it takes much more time on the clock, but the same amount of user and system time. This is because while the program waits for the process, the second CPU core cannot be crunching away on another process. While this may seem inefficient, it is also courteous to the other processes on your computer. In other words, if you're using your laptop for anything other than running your program, you may want to not grind everything to a halt for the duration of your program's runtime (which for me is often hours and hours). And if you're using a server to run your program on, your coworkers may appreciate it if you do not make use of every CPU on the server while they're trying to get their work done. In truth, the operating system will make its own attempt at balancing the needs of processes, but especially in the case where you're trying to use your laptop or PC as a workstation, this can save you some frustration.

$ time python os.py

And just to show that it really is the CPU sharing that's driving the effect:

import subprocess as sp
 
# cut the number of subprocesses in half, then run two sets of them
for each in range(0,50):
    a = sp.Popen( ['ls', '-l'] )
    b = sp.Popen( ['ls', '-l'] )
    a.wait()
 

And we see that our runtime is almost exactly what it was in the first case.

Okay, now that we know how to run the subprocess, and we know that we have to create a subprocess object if we want to do things like wait(), let's look at other advantages of using this type of object.

import subprocess as sp
 
# let's say we want to store our output to a file instead of seeing it on the screen
 
fh = open('outfile', 'a')
proc = sp.Popen(['ls', '-l'], stdout=fh)
 

Now let's run the script and look at the contents of the file:

$ python os.py
$ cat outfile

We will see all our output, which can be much more convenient than having it scroll down the screen.


However, what can be even *more* convenient is having it stored in a data structure in memory.


import subprocess as sp
 
proc = sp.Popen(['ls', '-l'], stdout=sp.PIPE)
# The sp.PIPE is a mechanism that allows the output to be redirected to
# wherever you point it. In this case, to our subprocess object named proc.
# The next step is to get the data out of the object and into a data structure
# that we can organize however we see fit.
 
# we use the method below to get a tuple of two things: the stdout and the stderr
proc_output = proc.communicate()
print "------------------------"
print "We got a tuple\nWith the first part being:\n", proc_output[0], \
"\nand the second being:\n", proc_output[1]
print "------------------------"
 
# Once we use the .communicate() method, the stdout and stderr are gone from the proc object.
print "------------------------"
print "and if we don't format anything, we see the tuple like this:\n", proc_output
print "------------------------"
 
# But we could easily convert this into a friendlier structure
stdoutlist = proc_output[0].split('\n')
print "------------------------"
print "And now we see the first half of the tuple as a list of lines" \
+ "of output from our subprocess:\n", stdoutlist
print "------------------------"
 


I've intentionally kept this ls -l example set very simple, in terms of what command we are actually executing, because it's important to see how this object type works. However, typically this tool would be used to run things that are going to generate very large sets of data. For you, that might be a systematic search of a database like the PDB, or large-scale BLAST searches, multiple sequence alignments, searches for transcription factor binding sites in genomes, or a host of other common but computationally expensive bioinformatic problems. Or, for the foreseeable future, it may just be the most reliable way to organize small sets of output data into a data structure or log file.


BLAST


Good ol'e Basic Local Alignment Search Tool, now that's bioinformatics even Grandma can get down with. In its most basic form, BLAST takes a short sequence of either amino acids or nucleotides, searches a database of reference sequences, and returns the most likely matches for the query sequence. It's brilliant, and the only trouble is that we're often interested in BLASTing lotsa query sequences, and much less interested in the mundanity of parsing the output that comes back to us. Fortunately, we can easily script the execution of BLAST such that we can run queries many times with different sequences and settings, and also parse the output back into dictionaries to manipulate and even store longterm on the disk.

The first part of this section will show us how to install a new program in our UNIX environment, and the second part will let us practice using Popen() to script the execution of BLAST.

Installing BLAST


I suspect most everyone here is accustomed to using the NCBI BLAST web interface, but is just another program, whether it's running on an NCBI web server, or on our laptops. Before we get back to python, let's take a minute to install BLAST the old fashioned way, such that we have ready access to it from anywhere on our computers.


In the PythonCourse/ProgramFiles directory, you should find a tar.gz file containing the BLAST executable for your platform.

The .tar.gz file extension tells us that the file has been archived (the .tar part) and compressed (the .gz part). The UNIX command to uncompress and expand the file is:

$ tar zxvf [filename]

For example:

$ tar zxvf ncbi-blast-2.2.26+-universal-macosx.tar.gz
ncbi-blast-2.2.26+/
ncbi-blast-2.2.26+/bin/
ncbi-blast-2.2.26+/bin/blast_formatter
ncbi-blast-2.2.26+/bin/blastdb_aliastool
ncbi-blast-2.2.26+/bin/blastdbcheck
ncbi-blast-2.2.26+/bin/blastdbcmd
ncbi-blast-2.2.26+/bin/blastn
ncbi-blast-2.2.26+/bin/blastp
ncbi-blast-2.2.26+/bin/blastx
ncbi-blast-2.2.26+/bin/convert2blastmask
ncbi-blast-2.2.26+/bin/dustmasker
ncbi-blast-2.2.26+/bin/legacy_blast.pl
ncbi-blast-2.2.26+/bin/makeblastdb
ncbi-blast-2.2.26+/bin/makembindex
ncbi-blast-2.2.26+/bin/psiblast
ncbi-blast-2.2.26+/bin/rpsblast
ncbi-blast-2.2.26+/bin/rpstblastn
ncbi-blast-2.2.26+/bin/segmasker
ncbi-blast-2.2.26+/bin/tblastn
ncbi-blast-2.2.26+/bin/tblastx
ncbi-blast-2.2.26+/bin/update_blastdb.pl
ncbi-blast-2.2.26+/bin/windowmasker
ncbi-blast-2.2.26+/ncbi_package_info
ncbi-blast-2.2.26+/ChangeLog
ncbi-blast-2.2.26+/README
ncbi-blast-2.2.26+/LICENSE
ncbi-blast-2.2.26+/doc/
ncbi-blast-2.2.26+/doc/user_manual.pdf


We now have a new directory with all those uncompressed files in it, so let's change to it:

$ cd ncbi-blast-2.2.26+
$ ls
ChangeLog LICENSE README bin/ doc/ ncbi_package_info

We've got a ChangeLog file that will tell us what version of BLAST we've got, a doc directory with documentation, a data directory with the various matrices to use for esimating substituion rates, and then the bin directory that contains the actual binary executable for the program. If we change to the bin directory and take a look, we'll see several files:

$ cd bin
$ ls
blast_formatter* blastdbcmd* blastx* legacy_blast.pl* psiblast* segmasker* update_blastdb.pl*blastdb_aliastool* blastn* convert2blastmask* makeblastdb* rpsblast* tblastn* windowmasker*blastdbcheck* blastp* dustmasker* makembindex* rpstblastn* tblastx*

These are the executable programs that BLAST uses to do its dirty work. You might recognize some of these names from NCBI web site options (e.g. blastn vs. tblastx, etc). We want to setup a local directory to install these program files into, which we'll do in a new subdirectory of our home directory.

First, let's change back to our home directory:

$ cd
$ pwd
/Users/aishaellahi85/

Now, we can simply move the un-tarred BLAST directory into our new directory:

$ mv PythonCourse/programFiles/ncbi-blast-2.2.26+/ .
$ ls ncbi-blast-2.2.26+
ChangeLog LICENSE README bin doc ncbi_package_info

There we go, now the files are present directory beneath our home directory.

Now, in order to be able to use these program from anywhere in our directory structure, we need to add them to our executable PATH. We did something very similar to this during our installation of Aquamacs and Chimera, so this may look familiar. Copy these lines one at a time to the terminal (make sure to exclude the $, it's just there to symbolize your prompt).

$ echo 'PATH=$PATH:~/ncbi-blast-2.2.26+/bin' >> ~/.bash_profile
$ echo 'export PATH' >> ~/.bash_profile

Two more steps until we're ready to roll. First, check your bash profile:

$ less ~/.bash_profile
Your bash profile should look something like this:
alias aquamacs='/Applications/Aquamacs.app/Contents/MacOS/Aquamacs'
alias chimera='/Applications/Chimera.app/Contents/MacOS/chimera'
alias blast_formatter='/ncbi-blast-2.2.25+/bin/blast_formatter'
export EDITOR=aquamacs
PATH=$PATH:/usr/local/git/bin
export PATH
PATH=$PATH:~/ncbi-blast-2.2.26+/bin
export PATH

Finally, close the terminal window you were just working in, open a new one, and enter the following:
$ source ~/.bash_profile

Now, if all went well, we should be able to execute one of out BLAST programs and get some feedback:
$ blastn
BLAST query/options error: Either a BLAST database or subject sequence(s) must be specified

If you see this message, then you've properly installed BLAST. Good job.

Okay, now that we have these program available, let's start using them.

Since BLAST is going to compare a query sequence to a database of sequences, collected from say, a genome, or a genre of genomes, we first need to create the reference database to BLAST against. Many of these are available for download, but we're going to create our own from the S. cerevisiae genome. Go ahead and navigate to the FTP server for BLAST at ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/, which you can connect to through your favorite web browser, login as a guest, and copy yeast.aa.gz and yeast.nt.gz into your working directory from the FASTA folder.
Since these files came compressed with gzip, we're going to uncompress them into our working directory, which we'll return to (i.e. our S6.1 directory).

$ cd ~/PythonCourse/S6.1

And now to uncompress the files into our working directory.

$ gunzip ~/PythonCourse/data/yeastFASTA/yeast.nt.gz
$ mv ~/PythonCourse/data/yeastFASTA/yeast.nt .
$ gunzip ~/PythonCourse/data/yeastFASTA/yeast.aa.gz
$ mv ~/PythonCourse/data/yeastFASTA/yeast.aa .

And one more step before we can use our new BLAST installation: we have to create database of sequences we just moved to used by BLAST using the makeblastdb command. The -dbtype argument specifies the type of sequence we're formatting (the .nt extension stands for nucleotide), and the 'nucl' tells the program to format the database a nucleic acid sequence. The -in argument specifies the name of the input file, in this case yeast.nt.

$ makeblastdb -dbtype='nucl' -in ~/PythonCourse/data/yeastFASTA/yeast.nt

Now we have a fully operational BLAST installation and a yeast nucleotide database the search against. We can see that BLAST has created some additional files for our yeast database in our working directory:

$ ls
1a 1c 2b 3a 3c outfile yeast.nt yeast.nt.nin
1b 2a 2c 3b os.py yeast.aa yeast.nt.nhr yeast.nt.nsq

Now let's make a file to query the database with by opening a new text file and pasting in these lines:

>test seq
ATGGTTCATTTAGGTCCAAAGAAACCACAGGCTAGAAAGGGTTCCATGGCTGATGTGCCC
AAGGAATTGATGGATGAAATTCATCAGTTGGAAGATATGTTTACAGTTGACAGCGAGACC
TTGAGAAAGGTTGTTAAGCACTTTATCGACGAATTGAATAAAGGTTTGACAAAGAAGGGA
GGTAACATTCCAATGATTCCCGGTTGGGTCATGGAATTCCCAACAGGTAAAGAATCTGGT
AACTATTTGGCCATTGATTTGGGTGGTACTAACTTAAGAGTCGTGTTGGTCAAGTTGAGC
GGTAACCATACCTTTGACACCACTCAATCCAAGTATAAACTACCACATGACATGAGA

And then save the file as query.fasta.

And.... here we go a scriptin'.

import subprocess as sp
 
program = 'blastn'
queryseq = 'query.fasta'
database = '/Users/aishaellahi85/PythonCourse/data/yeastFASTA/yeast.nt'
 
proc = sp.Popen([program, '-query', queryseq, '-db', database], stdout=sp.PIPE)
 
output = proc.communicate()
outlist = output[0].split('\n')
for line in outlist: print line

You can imagine how you would set a list of query sequences, or a series of databases to query against, or even vary some of the BLAST settings (for example, the minimum E-value to report), and run this again. You also might care which species your hit came from (if you were searching against a larger database), or which chromosome hits were on (to determine synteny, e.g.)

In this case, we've used the PIPE to capture the output, which might be helpful when we want to systematically sort through the results in each BLAST hit, only storing the results in certain circumstances. However, we can also write the results to an output file instead of the PIPE. Let's try that:

import subprocess as sp
 
# some arguments for running BLAST
program = 'blastn'
queryseq = 'query.fasta'
database = '/Users/aishaellahi85/PythonCourse/data/yeastFASTA/yeast.nt'
 
# open a filehandle for our output file, we'll use the append flag in case
# we want to collect multiple queries worth of output
 
try:
    outfh = open('blastn_output', 'a')
except:
    print "Error opening output file: blastn_output"
 
proc = sp.Popen([program, '-query', queryseq, '-db', database], stdout=outfh)
 

Now you can cat the blastn_output file and see the results.



Exercises


1) Modify your Popen() call to set a minimum E-value of 1e-08. Parse the output of your BLAST hits into a dictionary keyed by chromosome, containing the length of the hit, the E-value, and the percent identity of the hit.

HINT: check out the BLAST options by issuing the command blastn -help. There are output options that will greatly facilitate your parsing efforts.

2) Write a script using the genetic code (provided here in parse-able text format) to translate your nucleotide sequence into proteins. Then use blastp instead of blastn to search against the amino acid sequence file we downloaded earlier. Remember to make the blast database first! (protein flag = "prot").

Ala/A GCU, GCC, GCA, GCG
Leu/L UUA, UUG, CUU, CUC, CUA, CUG
Arg/R CGU, CGC, CGA, CGG, AGA, AGG
Lys/K AAA, AAG
Asn/N AAU, AAC
Met/M AUG
Asp/D GAU, GAC
Phe/F UUU, UUC
Cys/C UGU, UGC
Pro/P CCU, CCC, CCA, CCG
Gln/Q CAA, CAG
Ser/S UCU, UCC, UCA, UCG, AGU, AGC
Glu/E GAA, GAG
Thr/T ACU, ACC, ACA, ACG
Gly/G GGU, GGC, GGA, GGG
Trp/W UGG
His/H CAU, CAC
Tyr/Y UAU, UAC
Ile/I AUU, AUC, AUA
Val/V GUU, GUC, GUA, GUG
STOP/* UAG, UGA, UAA
 


Solutions


1) Modify your Popen() call to set a minimum E-value of 1e-08. Parse the output of your BLAST hits into a dictionary keyed by chromosome, containing the length of the hit, the E-value, and the percent identity of the hit.

HINT: check out the BLAST options by issuing the command blastn -help. There are output options that will greatly facilitate your parsing efforts.

import subprocess as sp
# some arguments for running BLAST
program = 'blastn'
queryseq = 'query.fasta'
database = '/Users/aishaellahi85/PythonCourse/data/yeastFASTA/yeast.nt'
 
 
# Note the addition of arguments to the blastn call
proc = sp.Popen([program, '-query', queryseq, '-db', database,'-evalue', '1e-08', '-outfmt', '6'],stdout=sp.PIPE)
# Remember, this part returns a tuple, where the [0] element = stdout, [1]=stderr.
# You want the [0] element
 
output = proc.communicate()
 
outlist = output[0].split('\n')
 
# Type in blastn -help and scroll down to '***Formatting options' to get a list of possible outfile formats.
# When I google the tabular format [option 6] for a blastn output, I find out that the tabular format contains the following fields:
# 0 queryId, 1 subjectId, 2 percIdentity, 3 alnLength, 4 mismatchCount, 5 gapOpenCount, 6 queryStart, 7 queryEnd, 8 subjectStart, 9 subjectEnd, 10 eVal, 11 bitScore
# So we want: [1],[3],[10],[2]
 
 
chrs={}
for l in outlist:
    fields=l.split()
    if fields:  # make sure only non-empty lines are split
        chrs[fields[1]]=[fields[3],fields[0],fields[2]]
    else:
        pass
 
print chrs
 



2) Write a script using the genetic code (provided here in parse-able text format) to translate your nucleotide sequence into proteins. Then use blastp instead of blastn to search against the amino acid sequence file we downloaded earlier. Remember to make the blast database first! (protein flag = "prot").

# First, create a protein database. for help, type 'makeblastdb -help' into shell.
# $ makeblastdb -in /Users/aishaellahi85/PythonCourse/data/yeastFASTA/yeast.aa -dbtype 'prot'
 
import fastaParser as fp
 
query_dict=fp.fastaParser(queryseq)
# print query_dict
 
#Parse codon table
 
def parse_nuctable(filename):
    fh = open(filename)
    codon_table = {}
    for line in fh:
        line = line.strip()
        if line:# skips blank lines
            data = line.split()
            aa=data[0]
            oneletter=aa.split("/")[1]
            codons=data[1:]
            for codon in codons:
                codon=codon.strip().replace(',','')
                codon_table[codon]=oneletter
    return codon_table
 
codon_table=parse_nuctable('codon_table.txt')
 
def seq_to_codons(seq):
    codons=[]
    length=len(seq)
    i=0
    while i < length:
        codons.append(seq[i:i+3])
        i+=3
    return codons
 
qseq=query_dict['TEST SEQ']
 
 
def translate_seq(nt_seq):
    aa_seq=""
    nt_seq=nt_seq.replace('T','U')
    query_codons=seq_to_codons(nt_seq)
    for codon in query_codons:
        aa=codon_table[codon]
        aa_seq+=aa
    return aa_seq
 
 
proteinseq=translate_seq(qseq)
 
# Writing protein sequence to another fasta file.
fh2=open('qprotein.fasta','w')
fh2.write('>'+'PROTEIN TEST SEQ\n'+proteinseq+'\n')
fh2.close()
 
proteindb='/Users/aishaellahi85/PythonCourse/data/yeastFASTA/yeast.aa'
proc2=sp.Popen(['blastp','-query','qprotein.fasta','-db',proteindb],stdout=sp.PIPE)
output=proc2.communicate()
print output[0]