For this in-class project, you will apply your knowledge of information theory to a biological problem, the relationship between information and function of bases in an RNA sequence.
RNA molecules start their lives as a simple one-dimensional string of RNA bases, then fold into a three-dimensional structure before becoming functional. At the most basic level, we can thus represent RNA molecules by the sequence of their bases, which can be adenine (`A'), uracil (`U'), guanine (`G'), or cytosine (`C'). The table below shows the RNA sequences collected from a number of E. coli specimens. The researchers who obtained this data already aligned the genomes, and inserted dashes (`-') as placeholders for bases that do not exist in a particular sequence.
| ecoli1 | -GGGGCUAUAGCUCAGCU-GGG--AGAGCGCCUGCUUUGCACGCAGGAGGUCUGCGGUUCGAUCCCGCAUAGCUCCACCA
|
| ecoli2 | -GGGGCUAUAGCUCAGCU-GGG--AGAGCGCUUGCAUGGCAUGCAAGAGGUCAGCGGUUCGAUCCCGCUUAGCUCCACCA
|
| ecoli3 | -GGCGCGUUAACAAAGC--GGU--UAUGUAGCGGAUUGCAAAUCCGUCUA-GUCCGGUUCGACUCCGGAACGCGCCUCCA
|
| ecoli4 | -GGAGCGGUAGUUCAGUC-GGUU-AGAAUACCUGCCUGUCACGCAGGGGGUCGCGGGUUCGAGUCCCGUCCGUUCCGCCA
|
| ecoli5 | -GUCCCCUUCGUCUAGA--GGCCCAGGACACCGCCCUUUCACGGCGGUAA-CAGGGGUUCGAAUCCCCUAGGGGACGCCA
|
| ecoli6 | -GCCCGGAUAGCUCAGUC-GGU--AGAGCAGGGGAUUGAAAAUCCCCGUGUCCUUGGUUCGAUUCCGAGUCCGGGCACCA
|
| ecoli7 | -GCGGGCAUCGUAUAAU--GGCU-AUUACCUCAGCCUUCCAAGCUGAUGA-UGCGGGUUCGAUUCCCGCUGCCCGCUCCA
|
| ecoli8 | -GCGGGAAUAGCUCAGUU-GGU--AGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA
|
| ecoli9 | -GCGGGCGUAGUUCAAU--GGU--AGAACGAGAGCUUCCCAAGCUCUAUA-CGAGGGUUCGAUUCCCUUCGCCCGCUCCA
|
| ecoli10 | GGUGGCUAUAGCUCAGUU-GGU--AGAGCCCUGGAUUGUGAUUCCAGUUGUCGUGGGUUCGAAUCCCAUUAGCCACCCCA
|
| ecoli11 | -AGGCUUGUAGCUCAGGU-GGUU-AGAGCGCACCCCUGAUAAGGGUGAGGUCGGUGGUUCAAGUCCACUCAGGCCUACCA
|
| ecoli12 | -GGCCCCUUAGCUCAGU--GGUU-AGAGCAGGCGACUCAUAAUCGCUUGGUCGCUGGUUCAAGUCCAGCAGGGGCCACCA
|
| ecoli13 | -GGGUCGUUAGCUCAGUU-GGU--AGAGCAGUUGACUUUUAAUCAAUUGGUCGCAGGUUCGAAUCCUGCACGACCCACCA
|
| ecoli14 | -GGCUACGUAGCUCAGUU-GGUU-AGAGCACAUCACUCAUAAUGAUGGGGUCACAGGUUCGAAUCCCGUCGUAGCCACCA
|
| ecoli15 | -UCCUCUGUAGUUCAGUC-GGU--AGAACGGCGGACUGUUAAUCCGUAUGUCACUGGUUCGAGUCCAGUCAGAGGAGCCA
|
| ecoli16 | -CGGCGAGUAGCGCAGCUUGGU--AGCGCAACUGGUUUGGGACCAGUGGGUCGGAGGUUCGAAUCCUCUCUCGCCGACCA
|
| ecoli17 | -CGGUGAUUGGCGCAGCCUGGU--AGCGCACUUCGUUCGGGACGAAGGGGUCGGAGGUUCGAAUCCUCUAUCACCGACCA
|
| ecoli18 | -CGGCACGUAGCGCAGCCUGGU--AGCGCACCGUCAUGGGGUGUCGGGGGUCGGAGGUUCAAAUCCUCUCGUGCCGACCA
|
| ecoli19 | -UGGGGUAUCGCCAAGC--GGU--AAGGCACCGGUUUUUGAUACCGGCAUUCCCUGGUUCGAAUCCAGGUACCCCAGCCA
|
| ecoli20 | -UGGGGUAUCGCCAAGC--GGU--AAGGCACCGGAUUCUGAUUCCGGCAUUCCGAGGUUCGAAUCCUCGUACCCCAGCCA
|
| ecoli21 | -GCGCCCGUAGCUCAGCU-GGAU-AGAGCGCUGCCCUCCGGAGGCAGAGGUCUCAGGUUCGAAUCCUGUCGGGCGCGCCA
|
| ecoli22 | -GCGCCCUUAGCUCAGUU-GGAU-AGAGCAACGACCUUCUAAGUCGUGGGCCGCAGGUUCGAAUCCUGCAGGGCGCGCCA
|
| ecoli23 | -GCGCCCUUAGCUCAGUU-GGAU-AGAGCAACGACCUUCUAAGUCGUGGGCCGCAGGUUCGAAUCCUGCAGGGCGCGCCA
|
| ecoli24 | -GCAUCCGUAGCUCAGCU-GGAU-AGAGUACUCGGCUACGAACCGAGCGGUCGGAGGUUCGAAUCCUCCCGGAUGCACCA
|
| ecoli25 | -GUCCUCUUAGUUAAAU--GGAU-AUAACGAGCCCCUCCUAAGGGCUAAU-UGCAGGUUCGAUUCCUGCAGGGGACACCA
|
| ecoli26 | -GCUGAUAUGGCUCAGUU-GGU--AGAGCGCACCCUUGGUAGGGGUGGGGUCCCCAGUUCGACUCUGGGUAUCAGCACCA
|
| ecoli27 | -GCUGAUAUAGCUCAGUU-GGU--AGAGCGCACCCUUGGUAAGGGUGAGGUCGGCAGUUCGAAUCUGCCUAUCAGCACCA
|
| ecoli28 | -GCCGACUUAGCUCAGUA-GGU--AGAGCAACUGACUUGUAAUCAGUAGGUCACCAGUUCGAUUCCGGUA-UCGGCACCA
|
| ecoli29 | -GCCGAUAUAGCUCAGUU-GGU--AGAGCAGCGCAUUCGUAAUGCGAAGGUCGUAGGUUCGACUCCUAUUAUCGGCACCA
|
| ecoli30 | -GCUGAUAUGGCUCAGUU-GGU--AGAGCGCACCCUUGGUAAGGGUGAGGUCCCCAGUUCGACUCUGGGUAUCAGCACCA
|
| ecoli31 | -AGGGGCGUAGUUCAAUU-GGU--AGAGCACCGGUCUCCAAAACCGGGUGUUGGGAGUUCGAGUCUCUCCGCCCCUGCCA
|
| ecoli32 | -CGCGGGGUGGAGCAGCCUGGU--AGCUCGUCGGGCUCAUAACCCGAAGAUCGUCGGUUCAAAUCCGGCCCCCGCAACCA
|
| ecoli33 | -CGCGGGGUGGAGCAGCCUGGU--AGCUCGUCGGGCUCAUAACCCGAAGGUCGUCGGUUCAAAUCCGGCCCCCGCAACCA
|
These sequences can also be found on the CSE machines in the file /user/ofria/Public/sequences or for a comma-separated, purely numerical version, see /user/ofria/Public/sequences.numeric
Your job is to make an educated guess of the result of folding these sequences based on the information content of the parts of the sequence.
It would be rather tedious to do this problem entirely by hand, so you're encouraged to use a computer algebra package such as Matlab for the repetitious calculations. Alternatively, you may write your own code in any programming language, such as perl, C, C++, Java, etc... In fact, none of the computations are beyond the capabilities of Excel. It's up to you. Whichever program you use, you must write down (in English with equations) exactly what you do. Handing in program sources is not sufficient, and in fact discouraged. When including graphs, please give them meaningful titles, and label the axes informatively.
Step 1: For each site in the sequence, calculate the entropy. You may treat the dashes as a fifth letter in the alphabet rather than treat them as missing values. Plot the per-site entropy in a bar graph.
In matlab, you can use my zl script to load in sequences.numeric and then perform the operations you need. You can run zl by:
seqs = zl('sequences.numeric');
Loop through all 80 columns and find the entropy of each. You can single out a column by using ':' for the row. For example you can grab column five with seqs(:,5). The "for" command allows you to loop through all values in a set. For example, if you wanted to sum each column, you would type:
totals = zeros(80,1);
for i = 1:80
totals(i) = sum(seqs(:,i);
end;
The zeros command creates an 80x1 matrix for you to fill in, all initialized to zero. Then we loop through each column sum it up, and fill in the value.
Now, instead of finding to sum of these values, you actually need to find the entropy. This means you need to get a count of how many zeros, how many ones, etc there are and determine the fraction of each. To do this, you should use the find command. For example, if you do:
You'll see it will output a handful of numbers. These are the positions where column 5 is set to 2. Now if you use the size command, you'll find out how many entries are listed here.
size(find(seqs(:,5) == 2))
This will print out "10 1". You only care about the first of these numbers, so you would type:
size(find(seqs(:,5) == 2), 1)
Now, to get all of the numbers you need, just loop through all 80 columns and all five symbols.
From there the mathe should be easy. Divide each count by 33 (the number of rows) and you have the frequencies. Now plug them into the entropy formula. Be careful when using a log; if p = 0, -0*log(0) will evaluate in matlab to NaN, but you want it to only count as zero toward entropy, so either use an "if" or add a very tiny value (0.000001) to all of your counts so that the probabilities are non-zero.
Finally, use the "bar" command to plot the entropies.
Step 2: The "physical information" for a site is amount the entropy for that site is below maximum entropy. Calculate the physical information for each site. Sites with high physical information typically encode something about the world, in this case the chemical environment, and in particular the shapes of molecules with which this RNA must interact. Which molecules are they? (Do not try to identify them with particular sub-sequences of the RNA yet.)
When RNA folds, it first forms a structure consisting of ladders (or stems), loops, and bulges. Ladders are regions where a series of bases in one part of the sequence pairs up with a matching series elsewhere on the molecule. Adenine (A) can form a pair with uracil (U), and guanine (G) can form a pair with cytosine (C). Loops are unpaired regions. Some examples of RNA molecules are shown below. After this secondary structure has formed, the molecule further folds into its tertiary structure, but we shall focus our attention on the secondary structure only.
Step 3: For each pair of sites, calculate the mutual information between them, and display the results in an appropriate plot. (One useful representation is site numbers on X- and Y-axis, and information represented by little colored squares.) How does this plot indicate which sites might pair up? Compile a list of potential paired sites.
The calculate the mutual information for a specific pair of sites, create a 5 by 5 matrix (since there are five possible symbols at each site) where you keep count of how many times each combination appeared. To figure out H(X|Y=y), limit yourself to the column of the matrix where Y=y and then you'll be able to just calculate the entropy with the reduced number of entries in that column. Remember that if there are no entries the entropy should be zero.
To build the whole matrix, you can use two for-loops in matlab, each going from 1 to 80. Make sure to build a matrix to fill in using zeros(80,80). Each position represents the mutual information of the two associated site being compared.
Step 4: A ladder forms when two or more consecutive sites pair with a consecutive region elsewhere on the sequence. Identify the ladders in the RNA of E. coli. Make a sketch of the secondary structure of the molecule, labeling a handful of key points with site numbers, as in the example.
The method of obtaining RNA structure information you have explored in this problem is complementary to the free-energy-based methods more commonly employed in RNA and protein folding studies.
Comments (0)
You don't have permission to comment on this page.