How to make fancy figures


Okay, there's a lot more to this than we can realistically cover in one lecture, but there are lots of things you can do to make really cool figures in Python. We're going to be using matplotlib, which is a plotting library that took a lot of the plotting functionality from the popular MATLAB software, re-wrote it in Python, and (in my opinion) made it about 10 times saner and easier to use.

Nevertheless, we'll try to cover the basics, and give you a good enough understanding of how everything is set up so you can look at the extensive gallery of examples and figure out how to make similar plots with your own data.

Basic plotting


We're going to do a lot of this plotting in the "pylab" mode of IPython. This is meant to be set up pretty close to the MATLAB environment, in terms of plotting and doing other math heavy stuff. Again, the way to start this is:

$ ipython --pylab

Let's say we have some data that's approximately a line, but there's some noise in it:

# Note that arange and rand are already in the namespace.
# ipython --pylab has, among other things, an implicit
# from pylab import *
# in its setup
 
x = arange(0,100)
y = 0.5 * x + 5 + 10*rand(len(x))
plot(x,y)

That was pretty easy! Now, let's say we don't want to have lines connecting each data point, but instead just a marker. We can look at the documentation for plot like we would anything else in IPython:

In [6]: plot?
 
Type: function
Base Class: <type 'function'>
String Form:<function plot at 0x6312770>
Namespace: Interactive
File: /Library/Frameworks/.../site-packages/matplotlib/pyplot.py
Definition: plot(*args, **kwargs)
Docstring:
Plot lines and/or markers to the
:class:`~matplotlib.axes.Axes`. *args* is a variable length
argument, allowing for multiple *x*, *y* pairs with an
optional format string. For example, each of the following is
legal::
 plot(x, y) # plot x and y using default line style and color
 plot(x, y, 'bo') # plot x and y using blue circle markers
 plot(y) # plot y using x as index array 0..N-1
 plot(y, 'r+') # ditto, but with red plusses
 
If *x* and/or *y* is 2-dimensional, then the corresponding columns
will be plotted.
 
An arbitrary number of *x*, *y*, *fmt* groups can be
specified, as in::
 a.plot(x1, y1, 'g^', x2, y2, 'g-')
Return value is a list of lines that were added.
The following format string characters are accepted to control
the line style or marker:
================ ===============================
character description
================ ===============================
``'-'`` solid line style
``'--'`` dashed line style
``'-.'`` dash-dot line style
``':'`` dotted line style
``'.'`` point marker
``','`` pixel marker
``'o'`` circle marker
``'v'`` triangle_down marker
``'^'`` triangle_up marker
...

You can see that there's a lot of different things you can do for something as simple as plotting... Markers, colors, lines. If you keep reading, you can even incorporate labels for the lines. Let's try this code, now, and see what it looks like:

# Note that arange and rand are already in the namespace.
# ipython --pylab has, among other things, an implicit
# from pylab import *
# in its setup
 
x = arange(0,100)
y = 0.5 * x + 5 + 10*rand(len(x))
plot(x,y, 'bo', label="Blue circles")
# Alternatively:
# scatter(x,y,label="Circles")
 
from scipy import stats
 
r_slope, r_int, r_rval, r_pval, r_stderr = stats.linregress(x, y)
plot(x, x * r_slope + r_int, 'g-.', label='Dash-dotted Line')
 


Hey, what about the labels?

legend()

Or, if we decide that we don't like the labels that we gave it before:

legend(["Noisy data", "Linear regression"])


Or, with some other tweaks in the call to the legend() function

legend(["Noisy data", "Linear regression"],
       loc='lower right',
       numpoints=1)


Making our own plotting functions


You know how in papers, they will sometimes have a kind of fancy figure, and then they'll have things in the same style, but for a bunch of different ways of slicing and dicing their data? It's really pretty effective scientific story-telling. It allows them to connect all those figures together conceptually, and readers only have to look for the relevant differences.

The thing is, if you're going to actually make those figures, it can be annoying to tweak the plots in the same way every time. Fortunately, we've spent almost two weeks now taking boring things that a person could do and making the computer automate them.

Plotting the average gene coverage


Now yesterday, we just looked at the average level of reads upstream, downstream, and within the genes. But there's a lot more data there than just number of reads: there's also the positioning of those reads. If the transcriptional machinery is falling off of the DNA, it would be nice to look at the plot and estimate what the rate of falling off is. So what we'll do is take the upstream, gene_cov, and downstream variables from CalcFalloff yesterday and plot them all together, giving us a sort of "average" gene.

x_upstream = arange(-len(upstream), 0)
plot(x_upstream, upstream)
x_gene = arange(0, len(gene_cov))
plot(x_gene, gene_cov)
x_downstream = arange(len(gene_cov), len(gene_cov) + len(downstream))
plot(x_downstream, downstream)

Now, that will give us what we want, but it's a bit tedious to type that out, or even to scroll up in IPython every time we want to do that, so let's make ourselves a nice function that does it. We'll be putting it in a PlotUtils.py module, so let's go ahead and start that as well.

import numpy as np
from matplotlib import pyplot as mpl
 
def plot_averaged_genes(upstream, gene_cov, downstream)
    x_upstream = np.arange(-len(upstream), 0)
    mpl.plot(x_upstream, upstream)
    x_gene = np.arange(0, len(gene_cov))
    mpl.plot(x_gene, gene_cov)
    x_downstream = np.arange(len(gene_cov), len(gene_cov) + len(downstream))
    mplplot(x_downstream, downstream)



Informative Interlude: Differences between IPython with and without --pylab


If you start IPython without the --pylab flag, and you discover that you want to plot things, you have a couple options. By far the best of these is to use the %pylab magic word, which will load all the pylab related things that you need. Next best is to use from pylab import *, which will load all the plotting functions, but they won't work quite properly for interactive plotting. Instead, every time you want to see what you've plotted, you'd need to use the show() function, which will block anything else you do until you close the window. If you want to have a program that does its plotting, then saves the figures out, then you can do something like the above, where you have

import numpy as np
import matplotlib.pyplot as mpl
 
# Plotting code here
# ...
# ...
 
mpl.savefig('myfile.jpg')

I think that in a perfect world, someone ought to be able to run a script and have almost all their figures for a paper just pop out, with only relatively minor tweaking. Then, if that code were available too, it should be possible for a reasonably savvy reviewer to see that all the data is on the up-and-up.


Now one thing that's kind of funny here is that the different plots are all different colors, despite the fact that they're from the same sample. Now, we could take the color as an argument to the function, but that means we have to specify the color every time, which is also kind of a hassle. So instead, we'll plot the first one using whatever color matplotlib thinks is best (internally, it keeps a track of the last few colors it plotted, and rotates through a list of about 10), and then plot the rest using that color.

def plot_averaged_genes(upstream, gene_cov, downstream)
    x_upstream = np.arange(-len(upstream), 0)
    x_gene = np.arange(0, len(gene_cov), color=color)
    x_downstream = np.arange(len(gene_cov), len(gene_cov) + len(downstream), color=color)
    result = mpl.plot(x_upstream, upstream)
    color = result[0].get_color()
    mpl.plot(x_gene, gene_cov)
    mpl.plot(x_downstream, downstream)

By the way, notice how we actually stored the result of plot() in a variable? If we look at the documentation on plot() again, we see that its "Return value is a list of lines that were added". What if, after we're done with this function, we still want to tweak the results. It's polite to return a list of everything that we've plotted, just the way that plot() does. Therefore, we'll keep extending our list:

def plot_averaged_genes(upstream, gene, downstream):
    retval = []
 
    x_upstream = np.arange(-len(upstream), 0)
    x_gene = np.arange(0, len(gene), color=color)
    x_downstream = np.arange(len(gene), len(gene) + len(downstream), color=color)
 
    retval.extend(mpl.plot(x_upstream, upstream))
    color = retval[0].get_color()
    retval.extend(mpl.plot(x_gene, gene, color=color))
    retval.extend(mpl.plot(x_downstream, downstream, color=color))
 
    # Place a separator between the averaged upstream region and the gene,
    # and the gene and the downstream region
    ymin, ymax = mpl.gca().get_ybound()
    retval.extend(mpl.vlines([0, len(gene)], ymin, ymax, linestyles='dashed')
 
    return retval


In the code here, I've done something more: I put in some vertical dashed lines to visually separate out the gene region from the UTRs. It also illustrates an important point: we can ask the stuff we've plotted things about how big it is, etc. Because we never explicitly said how tall to make the y axis, it could be anything, so it's useful to be able to go in and programmatically pull that out. If we switch back over to the interactive interpreter, we can see a lot of get_ methods on the current axis.


Informative Interlude: How Matplotlib is laid out


As long as we're grabbing information from the axes, it's worth spending a few moments talking about how Matplotlib is organized. We're going to use the code from the Log plots example to make a pretty looking set of pixels:

FigureAxes.png
The window that this is being plotted in corresponds to a Figure. This is everything inside the window (but not the tools on the bottom), and when you want to save an image to the disk (so you can include it in your manuscript), this is what actually gets saved. Figures control things like the size of the image if you print it out, and the resolution (for on screen, something like 72 dpi is fine, but if you're printing, you want it to be more like 300).

A Figure can contain zero or more sets of Axes, which are the subplots. In this case, we have four. A set of Axes is usually what you'll want to be trying to modify. Axes have properties like x and y limits, a set of major (and sometimes minor) ticks for the x and y axes, an optional legend, and lots more data. Furthermore, each axis has its own plotting commands, which are very similar to the top-level commands, but lets you be certain that they're going within the same Axes. This is particularly important if you have several subplots going on, each of which could conceivably receive the plotting data.

This is one of those places where we can't spend all our time talking about every single feature available, but the inline documentation is pretty good, and the gallery of examples is also really helpful if you kinda know what you want to do.



So now that we've got everything plotted together nicely, it would be nice to have the x-axis not be labelled wrong. The upstream region is fine, but it's not clear that the gene and the downstream are actually measured in two different units (percent of the gene, for the gene region, and bp for the downstream). So let's make a helper function:

def relabel(tick, gene_length):
    if tick < 0:
        return "%d bp" % tick
    elif tick == 0:
        return "-0 bp/0%"
    elif tick < gene_length:
        return "%d%%" % (float(tick) / gene_length * 100)
    elif tick == gene_length:
        return '100%/+0 bp'
    else:
        return "+%d bp" % (tick - gene_length)
 

For this function, you pass in an x-value and the length of the gene, and it will determine whether that x-value is in the upstream, gene, or downstream region, and give you the appropriate string label in either basepairs or % of the gene.



Now that we have our helper function, we're just going to tack a bit of code on to the end of plot_averaged_genes:

def plot_averaged_genes(upstream, gene, downstream):
    ax = mpl.gca()  # Get current Axes
    ...
    ticks = list(ax.get_xticks())
    # Ensure we have at least the full gene labelled
    ticks.extend([0, len(gene) / 2.0, len(gene)])
    ticks = np.unique(ticks)
    ticklabels = [relabel(tick, len(gene)) for tick in ticks]
    ax.set_xticks(ticks)
    ax.set_xticklabels(ticklabels)
    mpl.draw_if_interactive()

The final mpl.draw_if_interactive() is because when we use the ax.set_ functions, nothing actually gets updated on the screen. That tells matplotlib that it should be redrawn. However, if it's not in interactive mode, then it's okay that it doesn't get updated on the screen. Whenever someone calls mpl.show() or mpl.savefig(), it will actually calculate the pixels that need to be drawn. Before that, it's just represented internally as a collection of things that need to be drawn somehow.


Making a histogram of the 3' UTR coverages


So the other figure we said we'd make was a comparison of the different cases to see what the up-regulation of 3' UTRs looks like on a gene-by-gene basis. Earlier, we calculated the ratio of the number of reads in each of the drug conditions to the no-drug condition (and we used the no drug, 60 minute sample as a negative control). Making a histogram of these is really very easy:

>>> hist(log10(ratios_0to0))

As we can see, that made a histogram with 10 bars. Of course, those 10 bars are all the same width, and the boundaries are just arbitrarily placed based on the highest and lowest data points. More often, what we want to do is have bars that come at regularly spaced, sane intervals:

>>> clf()
>>> hist(log10(ratios_0to0), arange(-3, 4, .1))

From here, if we want to add more histograms on top of the current one, we can just call hist() again with the new data.

It is possible to go back through the returned values from each of those histogram objects and make sure that at each position, the highest one is in the back and the shortest is in the front, but it's a pain in the butt. Having tried it before, I'm also unconvinced that it leads to clear, intelligible plots. You're free to check out my personal version over at GitHub.

Another approach would be to calculate the histogram values, then plot() them as a sort of density function. You can use the return values of either histogram (which only calculates the histogram) or hist (which calculates and draws the histogram) to get out what you need.

Finally, you can draw partially transparent histogram bars. If you pass in the keyword argument "alpha" to hist(), then the bars are drawn partially transparent, depending on what that argument is.

hist(log10(ratios100to0), arange(-3, 4, .1), alpha=.4)

Alpha can be any number between 0 and 1, where 0 is fully transparent (in which case why are you even drawing it?), and 1 is fully opaque (this is the default). Most of the matplotlib functions know how to deal with this alpha property, so that can sometimes be useful when your plots start getting visually busy.


Exercises



Exercise 1
Look up documentation on the pcolor function (hint: try pcolor?) Explain to your neighbor how you might use it. Try running it with

pcolor(rand(4,4))
What options can you use? When would you use those options? Try using some of the different colormaps (see, for example, http://matplotlib.org/examples/color/colormaps_reference.html)



Exercise 2
Look up imread and imshow. Explain to your neighbor how an image is stored, once read.



Exercise 3
I have a paper coming out on Monday (pre-print available here: http://arxiv.org/abs/1302.4693, raw data here: http://allele.qb3.berkeley.edu/~pacombs/SliceSeq/60um-nomito.tsv). One of the supplemental figures compares the expression between all samples that are replicates (e.g. CaS1A vs CaS2A, CaS1A vs Cas3A, CaS2A vs CaS3A, CaS1B vs CaS2B, CaS1B vs CaS3B, etc). Can you make this figure?
combsS2.png

Functions you might find useful:
  • loglog
  • xlim and ylim
  • xticks and yticks
  • subplot

Looking at this figure, which embryo (row) seems less repeatable? Any particular samples (columns) that seem especially bad?


Solutions


import pandas as pd
from scipy import stats
from matplotlib import pyplot as mpl
 
all_data = pd.read_table('60um-nomito.tsv', index_col = 'gene_short_name')
 
mpl.figure(figsize=(20,6))
CaS1_columns = all_data.select(lambda x: x.startswith('CaS1'), axis=1).columns
CaS2_columns = all_data.select(lambda x: x.startswith('CaS2'), axis=1).columns
CaS3_columns = all_data.select(lambda x: x.startswith('CaS3'), axis=1).columns
 
for colnum, (cas1, cas2, cas3) in enumerate(zip(CaS1_columns,
                                                CaS2_columns,
                                                CaS3_columns)):
    mpl.subplot(3,6,colnum + 1)
    mpl.loglog(all_data[cas1], all_data[cas2], 'k.')
    mpl.xlim(1,10000)
    mpl.ylim(1,10000)
    mpl.title('Slice %d' % (colnum + 1))
    mpl.xticks([1,100,10000])
    mpl.yticks([1,100,10000])
    mpl.xlabel('r = {:.3}'.format(stats.spearmanr(all_data[cas1],
                                             all_data[cas2])[0]))
 
 
    mpl.subplot(3,6,colnum + 7)
    mpl.loglog(all_data[cas1], all_data[cas3], 'k.')
    mpl.xlim(1,10000)
    mpl.ylim(1,10000)
    mpl.xticks([1,100,10000])
    mpl.yticks([1,100,10000])
    mpl.xlabel('r = {:.3}'.format(stats.spearmanr(all_data[cas1],
                                             all_data[cas3])[0]))
 
 
    mpl.subplot(3,6,colnum + 13)
    mpl.loglog(all_data[cas2], all_data[cas3], 'k.')
    mpl.xlim(1,10000)
    mpl.ylim(1,10000)
    mpl.xticks([1,100,10000])
    mpl.yticks([1,100,10000])
    mpl.xlabel('r = {:.3}'.format(stats.spearmanr(all_data[cas3],
                                             all_data[cas2])[0]))
    mpl.show()
 
mpl.subplot(3,6,1)
mpl.ylabel('Rep 2 vs Rep 1')
mpl.subplot(3,6,7)
mpl.ylabel('Rep 3 vs Rep 1')
mpl.subplot(3,6,13)
mpl.ylabel('Rep 3 vs Rep 2')
mpl.tight_layout()
mpl.savefig('Supp2.png', dpi=150)