Fancy Data Structures


  • Tuples
  • Sets
  • Lists of lists
  • Dictionaries of dictionaries
  • Lists of dictionaries, dictionaries of lists
  • Nested structures


Let's say that you have a dictionary. In fact, let's say that you are a collector of English dictionaries, and you have all of them, from Cawdrey's 1604 A Table Alphabeticall all the way to a hard copy of the ever-contemporary Wiktionary project of Wikipedia. Finally, let us also suppose that you have some interest in the evolution of words over time. After all, why did you go out and buy all those dictionaries in the first place?

Given the number of dictionaries and the number of words, it seems only natural that you would need some knowledge of programming to tackle this project. However, it is here that we run into a problem: how should we best store this immense and complex amount of data? We have many dictionaries, each with many words; could we use what we've learned of lists and strings to store it? We certainly could have a list of strings, but then it would be difficult to associate them with their definitions. This is where the Pythonean dictionary can come in handy!

# define dictionaries
wiktionary = {}
aTableAlphabeticall = {}
# insert words, starting with, naturally, 'ventricle'
wiktionary['ventricle'] = '''1. (anatomy) One of two lower chambers of the heart.\
2. (anatomy) One of four cavities in the brain.'''
aTableAlphabeticall['ventricle'] = 'the stomacke which receiues the meate'

The trouble is, you've got thousands of dictionaries, and you're trying to make sense of them in relation to each other. This way, you'd have thousands upon thousands of lines of code just devoted to defining dictionaries, and afterwards, you've still got to refer to each of them individually in order to do any kind of analysis. This doesn't sound right. Wouldn't it be great to put all of those dictionaries into some kind of big list, perhaps chronologically arranged, so that you could tackle the data entry and analysis systematically? That's right, it would be great.

Similarly, we very often run into non-trivial data in biology. For instance, you might want a list of binding site positions for each of ten different transcription factors, or we might want to describe the single nucleotide polymorphisms in each of a thousand people. We often want to store several kinds of information about each position in a protein sequence. In order to accomplish these tasks and many others, we need fancier data structures: lists of lists, dictionaries of dictionaries, lists of dictionaries, even lists of dictionaries of dictionaries of lists of dictionaries of lists. Although we try to stay away from the latter.

This morning will be devoted to teaching you how to create fancy-pants data structures and how to combine them with tools that we're already learned, such as loops.

A Bit more on tuples

As we saw yesterday, a tuple is like a frozen list: once you've made a tuple, it is immutable, and you can't change it (although you can make a new one with similar elements. Since the keys of a dictionary need to be immutable, this is useful if you want to key a dictionary based on two things at once. The syntax for a tuple is much like a list, except instead of [square brackets], it's conventional to use (parentheses).

mytuple = (1,2)
# Actually, the parentheses are optional (but helpful for clarity)
mytuple = 1,2
mylist = [1,2]
mydict = {}
mydict[mytuple] = 'hello'
print mydict
mydict[mylist] = 'world'
print mydict

Also, if your tuple only has one item, you need to use a comma to make it clear that the tuple is a tuple and not just a value in parentheses:

tuple_A = ("Is this a tuple?")
tuple_B = ("What about this?", )
print tuple_A, type(tuple_A)
print tuple_B, type(tuple_B)

Tuples actually underlie the syntax for the in-place swap:

a = 1
b = 2
a,b = b,a
print a, b
#Is equivalent to saying:
a = 1
b = 2
mytuple = (b,a)
a,b = mytuple

We won't be using tuples extremely heavily, but be aware that if you ever have a function with multiple simultaneous return values, under the hood, it's actually combining them into a tuple.


A set is another nifty data structure. Like a set in mathematics, it has a bunch of elements with no repeats. To build a set, you pass in a list, and it will automatically remove duplicates

myset = set([1,1,2,3,5,8])
print myset
# You can also use the new syntax:
myset = {1,1,2,3,5,8}
print myset
# or you can build it up as you go along:
myset = set()
print myset

Sets make doing Venn diagrams of your data really easy:

UCs = set(["UC Berkeley", "UCLA", "UCSF", "UCSB", "UC Riverside", "UC Davis",
           "UCSD", "UC Santa Cruz", "UC Irvine", "UC Merced"])
bay_area_schools = set(["UC Berkeley", "Stanford", "UCSF", "San Jose State",
           "Santa Clara University", "University of San Francisco" ])
bay_area_UCs = UCs.intersection(bay_area_schools)
print "Bay Area UCs", bay_area_UCs
non_bay_area_UCs = UCs.difference(bay_area_schools)
print "UCs in less awesome places", non_bay_area_UCs
bay_area_non_UCs = bay_area_schools.difference(UCs)
print "Bay area schools that aren't UCs", bay_area_non_UCs
california_schools = UCs.union(bay_area_schools)

There are also ways to remove things from a set. Let's say that due to rampant grade inflation and falling academic standards, that school south of here loses it's accreditation:

# or

The difference between remove and discard is that discard doesn't care whether or not the item is already in the set or not, whereas if the item isn't in the set, remove will give an error.

There are other set functions, but this covers most of what I use them for, and should put you in good stead as well.

Lists of lists

Let's create a few lists. Being self-centered, let's make the list of our lives:

#!/usr/bin/env python
# make a couple lists
coffee = ['yalis','strada','brewedawakening']
lab = ['stanley','barker','LSA','Koshland','VLSB']

As it turns out, you can pretty much put anything into a list that you like, including other lists. Accessing and changing individual elements has a fairly straightforward syntax:

# make a list of lists!
importantPlaces = [coffee,lab]
# another way to write it
importantPlaces = [['yalis','strada','brewedawakening'],
                   ['stanley','barker','LSA', 'Koshland','VLSB']]
# Note that you're allowed to split long lines if it's inside of a parenthesis,
# bracket, or curly brace
aList = importantPlaces[0]
print aList
aPlace = aList[0]
print aPlace
# or
aPlace = importantPlaces[0][0]  # aPlace is 'yalis'
importantPlaces[0][0] = 'peets'
anotherPlace = importantPlaces[0][0]

All of the list operations from this morning still apply:

# make a list of lists!
home = ['rockridge','the mission','lake merritt','gourmet ghetto']
importantPlaces.append(home)  # important places is now [coffee,lab,home]
importantPlaces[2].append('bushrod park')  # home now includes 'bushrod park'
print importantPlaces[2]
print home

Take note of that very last step: although you made a change while naming importantPlaces, you also changed home. That's because they're one and the same: importantPlaces[2] is actually a location in the warehouse, and Python knows when it sees a location to pretend like that thing is inside the list. How would you make importantPlaces carry a copy of home instead of home itself?

# make a list of lists!
importantPlaces.append([])  # important places is now [coffee,lab,[]]
# or, using a shorthand:
importantPlaces[2][3] = 'alcatel'
# replace 'gourmet ghetto' with name of landmark liquor store
print importantPlaces
print home

This may seem like a subtlety now, and it some ways it is, but keep it in mind when you start coding and start seeing errors. Often a copy is what you want.

Dictionaries of dictionaries

Dictionaries of dictionaries work the same way. Let's return to the example used in the introduction:

#!/usr/bin/env python
# we keep our definitions in a dictionary
definitions = {}
# and, as above, we have several dictionaries in this dictionary
definitions['wiktionary'] = {}
definitions['aTableAlphabeticall'] = {}
print definitions
# and each dictionary is full of definitions
definitions['aTableAlphabeticall']['statue'] = 'an image of wood, or any other matter'
definitions['aTableAlphabeticall']['vapor'] = 'moisture, ayre, hote breath, or reaking'
print definitions

Lists of dictionaries, dictionaries of lists

Can we mix data structures? Of course! The main thing to keep in mind (besides what your data structure actually looks like) is that if you're building something up from scratch, you need to match the square brackets and curly braces:

a = [{},{}, ]  # a is a list of empty dictionaries
b = {'l1': [], 'l2': []} # b is a dictionary of empty lists
c = [{]}  # c is a Syntax Error!
d = {[}] # so is D!

Finally, everything that we've learned so far can be elaborated to even more complex data structures. After, all, there are often multiple definitions for a word...

# let's make the ventricle entry a little more sensible
listOfDictionaries = [{},{}]
listOfDictionaries[1]['ventricle'] = []
listOfDictionaries[1]['ventricle'].append('One of two lower chambers of the heart')
listOfDictionaries[1]['ventricle'].append('One of four cavities in the brain')
print listOfDictionaries

Also, you can put different types of things into the same level of a list or dictionary, but I don't recommend it:

list_of_stuff = [[], 1, "Hello", {'name': 'Peter'}]
#This is totally legal, but just try writing a for loop that handles each one correctly!
for thing in list_of_stuff: print thing
# This is about the only thing I can think of that would work on
# all of the things in that list...

Fancy loops for fancy data structures

You are also allowed to loop over a dictionary. Be aware that it loops over the keys of the dictionary (i.e. the ones you put in brackets to get the values), and that you shouldn't count on any particular ordering to those keys:

knights = {'gallahad': 'the pure', 'robin': 'the brave'}
for x in knights:
    print x
    print knights[x]

the pure
the brave

It turns out that you can put just about any valid Python code you want inside of a for loop. That happens to include other for loops! This is really useful for our fancy nested structures:

liLi = [[1,2,3],[7,8,9]]
for x in liLi:
    print x
    for y in x:
        print y

[1, 2, 3]
[7, 8, 9]


1. A dictionary of dictionaries.

a) You want to store sequence information from human, mouse, and rat. Each species has a set of gene names corresponding to a set of sequences-- make a dictionary that has keys 'human', 'mouse', and 'rat', with each key having as a value a dictionary with gene names as keys and sequences as values.

The data are:
Human genes:
'TallnessGene' has sequence 'AATAGCAG'
'SmartnessGene' has sequence 'TGACGGA'

Mouse genes:
'FuzzynessGene' has sequence 'CCCCCCA'
'BeadyLittleEyesGene' has sequence 'ATAGCGC'

Rat genes:
'FuzzynessGene' has sequence 'CCCTCCA'
'BiggerThanMouseGene' has sequence 'GGACAATT'

b) Using the 'keys' method, print the names of the human genes.
c) Print out the sequence of FuzzynessGene in rat and mouse.
d) Print the third nucleotide of the human tallness gene.

2. A list of lists.

You want to store the results of three different time series experiments, each with four data points. You should do this by creating a list with three elements. Each of these elements is a four element list. There are a few ways to do this, which are outlined in parts (a), (b), and (c) below.

run1: 2,3,5,5
run2: 2,2,4,5
run3: 3,3,4,6

a) Create an empty list. Use the append method three times to make this a list of three empty lists. For each of these empty lists, use the append method four times to populate the list of lists with the data above.

b) Create a list of three empty lists without using either 'append' or '*' (you will run into a complication of using '*' in this context in the last exercise today). For each of these empty lists, use the append method four times to populate the list of lists with the data above-- feel free to copy and paste this part from your solution to part (a).

c) Create your list of lists, complete with data, all in one line.

3. A dictionary of lists.

Here, you have some structural data for several different species-- you want to store the B-factor of each position in several orthologous proteins. I picture this as a dictionary of lists: the dictionary is keyed with the species name, which has as a value a list that relates the position to the B-factor.

Human: 5,4,6,7
Mouse: 8,12,11,14
Rat: 10,11,13,15

a) Create an empty dictionary. Create three key-value pairs with the names as keys and empty lists as values. Use the append method to populate this dictionary of lists with data.

b) Create the populated dictionary all in one line, as in problem 2c.

4. Run Lola Run

Return to the time point data from problem two. Make three lists called 'run1', 'run2', and 'run3', and from them make a list of lists:

run1 = [2,3,5,5]
run2 = [2,2,4,5]
run3 = [3,3,4,6]

listOfRuns = [run1,run2,run3]

a) Alter the first element of run1. Has anything happened to listOfRuns?
b) Specify listOfRuns similarly, but using copies of run1, run2, and run3 instead of the lists themselves. Now, alter the first element of run1 and see what has happened to listOfRuns.

5. Multiplying Lists.

Use '*' (multiplication) to create a list of three empty lists. Use the append method to add the element "i belong first!" to the first list of the list of lists. Print out the list of lists. What has happened? This is hard-- don't worry about getting a TA.


1. Dictionary of Dictionaries.

#!/usr/bin/env python
# set out top-level dictionary
species = {}
species['human'] = {}
species['mouse'] = {}
species['rat'] = {}
# add human data
species['human']['Tallness'] = 'AATAGCAG'
species['human']['Smartness'] = 'TGACGGA'
# add mouse data
species['mouse']['Fuzzyness'] = 'CCCCCCA'
species['mouse']['BeadyLittleEyes'] = 'ATAGCGC'
# add rat data
species['rat']['Fuzzyness'] = 'CCCTCCA'
species['rat']['BiggerThanMouse'] = 'GGACAATT'
print species['human'].keys()
print species['mouse']['Fuzzyness']
print species['rat']['Fuzzyness']
print species['human']['tallness'][2]

2. Lists of Lists.

#!/usr/bin/env python
List = []
print List
# part b
#!/usr/bin/env python
List = [  [], [], [] ]
print List
# part c
#!/usr/bin/env python
List = [[2,3,5,5], [2,2,4,5], [3,3,4,6]]
print List

3. Dictionary of Lists.

#!/usr/bin/env python
# part A
dict = {}
dict['human'] = []
dict['mouse'] = []
dict['rat'] = []
# part B
dict = {'human':[5,4,6,7],'mouse':[8,12,11,14],'rat':[10,11,13,15]}

4. Run Lola Run

#!/usr/bin/env python
# part A
run1 = [2,3,5,5]
run2 = [2,2,4,5]
run3 = [3,3,4,6]
listOfRuns = [run1,run2,run3]
run1[0] = 500
print listOfRuns
# the first element of the first element of listOfRuns is now 500 ins
tead of two
# part B
run1 = [2,3,5,5]
run2 = [2,2,4,5]
run3 = [3,3,4,6]
listOfRuns = [run1[:],run2[:],run3[:]]
run1[0] = 500
print listOfRuns
# list of runs remains the same

5. Multiplying Lists.

#!/usr/bin/env python
listOfLists = [[]] * 3
print listOfLists
listOfLists[0].append('i belong first!')
print listOfLists
# the multiplication operation has taken a single empty list and
# duplicated it-- it's now three copies of the same empty list that are all
# equivalent to each other.  By changing one, we change all of them.