
Do-It-Yourself Bio-Hacking: A Tutorial

Today I want to show you how you can do a slick bio-hacking experiment, and graph the results nicely, all in your browser, in well under 10 minutes. The following experiment will run just fine in Chrome or Firefox. In Firefox, it helps to have the Firebug extension. (If you're using Firefox, press F12. If a console window pops up, you already have Firebug.) I tested against Chrome v28.0.1500.72 and Firefox 15.0.1 with Firebug 1.9.2. Other combinations may work; those are just the ones I tested.

We're going to do a comparative genomics/proteomics experiment designed to explore amino-acid usage in a particular protein (DnaJ) across a couple dozen bacterial species. Even if you're not a bio-geek, I hope you'll follow along. At the very least, you'll learn how to make pretty graphs from any kind of data using the server at ZunZun.

What is the DnaJ protein, you ask? It's one of a class of proteins known as heat shock proteins, which are produced in response to elevated temperatures. (Your body produces heat shock proteins in response to fever, for example.) As you probably know (or can guess), proteins, in general, are rather sensitive to heat. Even a small amount of heat can cause a protein to start to unravel (or denature). DnaJ and its partners have the job of helping proteins re-fold into their correct original 3D shape(s) after exposure to heat. They're like little repair jigs. A partially damaged protein goes in; it re-folds and comes back out good as new.

Heat shock proteins occur widely, across all domains of life, and their amino-acid sequences are highly conserved; but they do differ, as we'll see right now.

Step 1
Go to http://www.uniprot.org/ and enter "DnaJ" (case doesn't matter) in the search field at the top of the page, then hit Enter. A list of organisms with DnaJ will appear, each with a checkbox on the left. Check all the checkboxes on the page (gang-check them with Shift-click).

Step 2
You'll notice at the bottom of the window there's a green bar with buttons "Retrieve," "Align," "Blast," and "Clear." Click the Retrieve button.

Steps 1 and 2.


Step 3
In the page that comes up, look for FASTA on the left. Under it are two links, Download and Open. Click Open. (See screen shot below.) You'll see a bunch of protein sequences (with one-letter abbreviations for amino acids), each preceded by a line that begins with > (greater-than sign). These are our DnaJ proteins.

Step 3.

Step 4
Press F12 to toggle open the console window. Be sure the Console tab is showing. In Firebug, you may also have to click the Console menu and choose Command Editor from the dropdown list.

Enter and execute (with Enter, in Chrome, or with Control-Enter in Firebug) the following lines of code:

all = document.getElementsByTagName("pre")[0].innerHTML.split(/&gt;/);
all.shift(); // discard the empty string that precedes the first record

It's important that the part between slashes be ampersand-g-t-semicolon (&gt;), not a literal greater-than symbol. The browser is showing you greater-than signs, but in the HTML markup they're really &gt; character entities, not angle brackets. We actually do want to split on &gt;, not on >.

Note that to execute a line of code in Firebug you have to type Control-Enter. In Chrome, you just type Enter. But in Chrome's JavaScript console, you have to use Shift-Enter to type on more than one line.

The variable all now contains an array of protein sequences. If you want to verify it, type all.length (then Enter, or in Firebug Control-Enter), and you should see the length of the array, 25.

Step 5
Enter the following code in the console and execute it with Enter (or Control-Enter in Firebug). It will appear to do nothing, which is fine; it just defines a function.


function analyze( item ) {

    // split off the FASTA header (everything through the "SV=digit" line) and keep just the sequence
    var sequence = item.split(/SV=\d\n(?=\w)/)[1];

    // count lysines (K) and arginines (R), then convert the raw counts to frequencies
    var lysineCount = sequence.match(/K/g).length;
    var arginineCount = sequence.match(/R/g).length;
    lysineCount /= sequence.length;
    arginineCount /= sequence.length;

    // print "lysine-frequency arginine-frequency" to the console
    console.log( lysineCount + " " + arginineCount );

}
 
This is the callback code we'll use to process every member of the all array. Each item in the array consists of a FASTA header followed by a protein sequence. We just want the sequence, not the header, which is why we have a first line that splits off the part we need. The remaining lines obtain the number of lysines (K) and the number of arginines (R) in the protein sequence, then we divide those numbers by the sequence length to get a frequency-of-occurrence. The final line prints the results to the console window.

This function, by itself, doesn't do anything until we run it against each amino-acid sequence in the all array. That's the next step.

Step 6
Enter the following line of code into the console and run it with Enter (or Control-Enter, in Firebug):

all.forEach( analyze );
 
The console should immediately fill with numbers (25 rows of two numbers each). That's our data. We need to graph it to see what it looks like. Ready?

Step 7
Go now to http://zunzun.com and notice four pulldown menus at the top of the page. Use the far-left dropdown to select Polynomial.

Select Polynomial from the ZunZun function list.

A new window appears with ugly (or beautiful, depending on your mindset) formulas. Click the link to First-Order (Linear) 2D. Why? Because in the absence of any foreknowledge, we're going to blindly assume that our data is best fit by a straight line. If it's not straight-line data, we can come back and change our selection later.

When you click the First-Order (Linear) 2D link, you'll quickly be in a stark-looking window with a single pulldown menu at the top. Click it and select Data Labels for Graphs. Replace "X data" with "Lysine" and "Y data" with "Arginine."

Select Data Labels for Graph.

Step 8
Now use the single pulldown menu to select Text Data Editor.

Quickly go back to that console window and Copy all of your data (all 25 rows of numbers), then Paste the data into the Text Data Editor box.



Click the Submit button near the top of the page. Be patient, as it may take up to 20 seconds or so for your graph to be ready.

You'll know your graph is ready when the window changes to one that shows four pulldown menus at the top. The far-right menu is Data Graphs. Click into it and select Arginine vs. Lysine with model. NOTE: The exact names of the menu items will depend on how you labeled your axes at the end of Step 7 above.

You should see the window change to a view of a graph that looks like this:

Graph created on demand by the ZunZun server.

Pretty easy, right? It gets better. The line that ZunZun drew through the data points is a regression curve that minimizes the sum of squared error. To see the formula for the line, including coefficients, use the far-left menu, called Coefficients and Text Reports, to select Coefficients. Don't worry, your graph will still be there when you're done. To get back to the graph at any time, just use the far-right menu and any of the commands under it (which re-display the graph in various ways).

The graph seems to be saying that Arginine levels go down as Lysine goes up. But how good a correlation is this, really? Use the far-left pulldown menu again. This time select Coefficient and Fit Statistics. You'll notice a ton of stats (chi-squared and so on). Among them, r-squared is given as 0.637834788057. That means the correlation coefficient, r, is about 0.80 in magnitude (negative in sign, since the trend is downward), which is pretty solid.
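If you want to double-check that conversion, you can do it right in the console you already have open (the r-squared value shown here is just the one from my run; yours may differ):

r = Math.sqrt( 0.637834788057 ); // the r-squared ZunZun reported
console.log( r ); // about 0.7986; take it as negative, since the fitted line slopes downward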

I'll save the interpretation of our experiment's results for another time. For now, notice that underneath your ZunZun graph are links for saving the graph as PNG or SVG. I strongly recommend you save both. You can open SVG in both Photoshop and Illustrator (and most browsers too). You will definitely want to keep an SVG version around to edit by hand in your favorite text editor (SVG is just a variety of XML). I'll be showing you how to do lots of sexy things with SVG graphs in upcoming posts.


jQuery for Bioinformatics

I've been using JavaScript for almost two decades now, but somehow I've managed to avoid learning jQuery until just recently, mostly out of laziness but also because of a lingering yet torrid love-hate relationship with "syntax sugar" programming patterns. The best thing I can say about jQuery is that it has a seductively compact and powerful syntax. The worst thing I can say about jQuery is this.repeat(previousStatement).

For better or worse, I've had to begin dabbling in jQuery recently to save myself from the horror of old-school bare-knuckle DOM parsing. You know what I'm talking about: Nested loops with lots of calls to getElementsByTagName( ) followed up with hand-parsing of innerHTML. Who wants to do all that when you can use the oh-so-cute $(selector).each( ) construction?

The trouble with cute/compact syntax (as any recovering Perl user will gladly tell you in return for a bottle of cheap sherry) is that it's write-only. When you go back to look at something a week later and see 15 lines' worth of JS functionality rolled up into a shockingly crisp (yet thoroughly opaque) jQuery one-liner, you often wish you'd gone ahead and written those 15 homely lines of JavaScript in the first place, instead of giving in to that one irresistibly sexy, powerful line of jQuery that's oh yeah BTW also self-obfuscating.

Nonetheless, if you do a lot of page-scraping (as I do when visiting bioinformatics sites), the time savings of being able to parse a page with jQuery can be formidable. Who can resist grabbing all rows of a table with $("tr")? Who can resist iterating over them with .each()?
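To make the contrast concrete, here's a rough sketch of the two styles side by side. (This isn't from any particular page; the markup it assumes is hypothetical, and it presumes jQuery has already been loaded.)

// Old-school, bare-knuckle DOM parsing:
var rows = document.getElementsByTagName("tr");
for (var i = 0; i < rows.length; i++) {
    var cells = rows[i].getElementsByTagName("td");
    for (var j = 0; j < cells.length; j++)
        console.log( cells[j].innerHTML );
}

// The jQuery version of the same thing:
$("tr").each( function( ) {
    $("td", this).each( function( ) {
        console.log( $(this).html() );
    });
});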

I tend to use the online apps at genomevolution.org quite heavily. The great folks who maintain that site have a nice way of serving up prodigious amounts of data in easy-to-use interactive forms, but sometimes you just want to harvest the data from a table and be done with it. Take the page I created at http://genomevolution.org/r/9726, which is based on a list of 100 unique bacterial species in the group known as Alphaproteobacteria. If you go to that page and scroll over to the far right, you'll see a column header labeled "Codon Usage." Underneath that label is a "Get All Codon Tables" link. Click that link and be prepared to wait about two minutes as the codon data loads for each organism. It's worth the wait, because when you're done, you're looking at color-coded codon usage frequencies for all 64 codons, for all 100 organisms.

Suppose you just want the codon data in text form, to analyze later? Scraping the raw data out of the HTML page is a royal bitch, because whether you know it or not, that page has tables embedded in tables embedded in tables. Parsing the DOM by hand is (shudder, wince) well nigh unthinkable.

Go to http://genomevolution.org/r/9726 and click "Get All Codon Tables" under the "Codon Usage" column heading. Allow a minute or two for codon data to load. Meanwhile, Control-Shift-J opens the Chrome console. (Select the Console tab at the top of the window if it's not already selected.) Paste the following code into the console. Hit Enter. Savor the power.


codonData = [];

function process( ) {

    var CODONS_COLUMN = 15;

    // grab every 'td' cell in the current row, then keep just the codon-usage text
    // (the lookahead split strips off the "Codon Usage:" leader line)
    var rowdata = jQuery( 'td', this );
    var codonUsage = rowdata[ CODONS_COLUMN ].textContent.split(/(?=CCA)/)[1];
    codonData.push( codonUsage );
}

$('tr[id^=gl]').each( process ); // oh jQuery, must you tease me so?

console.log( codonData.join("\n") );

All of this was originally a single statement, with an inline callback function (in typical jQuery fashion). I decided to unroll it into more verbose, easier to understand form, lest my head explode two weeks from now trying to re-read and re-figure-out the code.

This bit of code does some pretty typical jQuery things, such as grab all rows of a table with $('tr'), except that in this case I most certainly do not want all rows of all tables in the HTML page (which would be hundreds of rows of extraneous stuff). The rows I need happen to have an "id" attribute with a value that begins with "gl." The construction $('tr[id^=gl]') is jQuery's syntax for selecting table rows that have an id-attribute that begins with "gl."  (The ^= here means "begins with." You could signify "ends with" using $= instead of ^=.)

The process() callback fetches all table columns for the current row using the jQuery( 'td', this ) construction, which gives me a jQuery object representing all "td" elements under the DOM node represented by this. In the callback context, this refers to the current DOM node, not the window object or Function object. If you choose (as I did not) to declare your callback with arguments, as in function myCallback( argA, argB ), then argA will be the index of the current item and argB will be the same node this points to.
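If you do prefer the two-argument form, the callback might look something like this (just a sketch; the argument names are mine):

function process( index, row ) {
    // inside .each(), "row" and "this" refer to the same DOM node
    var rowdata = jQuery( 'td', row );
    console.log( index + ": " + rowdata.length + " cells in this row" );
}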

If you're wondering about the regex /(?=CCA)/, I need this because ordinarily the codon data would look like this:

Codon Usage: The Bacterial and Plant Plastid Code (transl_table=11) CCA(P) 1.18%CCG(P) 1.58%CCT(P) 1.17%CCC(P) 1.37%CGA(R) 0.32%CGG(R) 1.32%CGT(R) 1.82%CGC(R) 2.54%CAA(Q) 1.07%CAG(Q) 2.84%CAT(H) 1.59%CAC(H) 0.89%CTA(L) 0.48%CTG(L) 4.58%CTT(L) 1.96%CTC(L) 0.84%GCA(A) 2.94%GCG(A) 2.14%GCT(A) 2.31%GCC(A) 3.90%GGA(G) 0.90%GGG(G) 1.74%GGT(G) 2.11%GGC(G) 3.23%GAA(E) 3.92%GAG(E) 1.36%GAT(D) 3.76%GAC(D) 1.49%GTA(V) 1.08%GTG(V) 3.01%GTT(V) 2.19%GTC(V) 0.81%ACA(T) 1.82%ACG(T) 1.49%ACT(T) 0.57%ACC(T) 1.83%AGA(R) 0.30%AGG(R) 0.31%AGT(S) 0.61%AGC(S) 1.33%AAA(K) 2.01%AAG(K) 1.60%AAT(N) 1.39%AAC(N) 1.64%ATA(I) 0.59%ATG(M) 2.56%ATT(I) 2.88%ATC(I) 1.59%TCA(S) 0.65%TCG(S) 0.47%TCT(S) 1.37%TCC(S) 1.34%TGA(*) 0.14%TGG(W) 1.47%TGT(C) 0.46%TGC(C) 0.70%TAA(*) 0.14%TAG(*) 0.03%TAT(Y) 1.47%TAC(Y) 0.90%TTA(L) 0.61%TTG(L) 1.67%TTT(F) 2.41%TTC(F) 1.22%

Notice that first line ("Codon usage: The Bacterial [blah blah]"). I just want the codon data, not the leader line. But how to split off the codon data? Answer: Use a lookahead regular expression that doesn't consume the match. If you split on /CCA/ (the first codon) you will of course consume the CCA, never to be seen again. Instead, use (?=CCA), with parentheses (absolutely essential!) and the parser will look ahead to find an upcoming CCA, then stop and match the spot right before the CCA without consuming the CCA.
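You can see the difference with a throwaway test string in the console (a made-up example, not the real codon data):

s = "Codon Usage: blah blah CCA(P) 1.18%CCG(P) 1.58%";
s.split(/CCA/)[1];     // "(P) 1.18%CCG(P) 1.58%"    -- the CCA has been consumed
s.split(/(?=CCA)/)[1]; // "CCA(P) 1.18%CCG(P) 1.58%" -- the CCA survives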

I'm sure a true jQuery expert can rewrite the foregoing code in a much more elegant, compact manner. For me, elegant and compact aren't always optimal. I've learned to value readable and self-documenting over elegant and opaque. Cute/sexy isn't always best. I'll take homely and straightforward any day.

A Simple Method for Estimating the Rate of Transition vs. Transversion Mutations

Point mutations in DNA fall into two types: transition mutations, and transversion mutations. (See graphic below.)


In a transition mutation, a purine is swapped for a different purine (for example, adenine is swapped with guanine, or vice versa), or a pyrimidine is swapped with another pyrimidine (C for T or T for C); and usually, if a purine is swapped on one strand, the corresponding pyrimidine gets swapped on the other. Thus, a GC pair gets changed out for an AT pair, or vice versa.

A transversion, on the other hand, occurs when a purine is swapped for a pyrimidine. In a pairwise sense, this means a GC pair becomes a TA pair (for example) or an AT pair gets changed out for CG, or possibly AT for TA, or GC for CG.

Of the two types of mutation, transitions are more common. We also know that, in particular, GC-to-AT transitions are much more common than AT-to-GC transitions, for reasons that are well understood but that I won't discuss here. If you're curious to know what the experimental evidence is for the greater rate of GC-to-AT transitions, see Hall's 1991 Genetica paper (paywall protected, unfortunately) or the non-paywall-protected Y2K J. Bact. paper by Zhao. The latter paper is interesting because it shows that GC-to-AT transitions are more common in stationary-phase cells than exponentially-growing cells, and also, transitions in stationary E. coli are repaired by MutS and MutL gene products. (Overexpression of those two genes results in fewer transitions. Mutation of those two genes results in more transitions.)

An open question in molecular genetics is: What are the relative rates of transitions versus transversions, in natural populations? We know transitions are more common, but by what factor? Questions like this are tricky to answer, for a variety of reasons, and the answers obtained tend to vary quite a bit depending on the organism and methodology used. Van Bers et al. found a transition/transversion ratio (usually symbolized as κ) of 1.7 in Parus major (a bird species). Zhang and Gerstein looked at human DNA pseudogenes and found transitions outnumber transversions "by roughly a factor of two." Setti et al. looked at a variety of bacteria and found that the transition/transversion rate ratio for mutations affecting purines was 2.1 whereas the rate ratio for pyrimidines was 6.6. Tamura and Nei looked at nucleotide substitutions in the control region of mitochondrial DNA in chimps and humans (a region known to evolve rapidly) and found κ to be approximately 15. Yang and Yoder looked at mitochondrial cytochrome b in 28 primate species and found an average κ of 6.4. (In general, κ values tend to be considerably higher for mitochondrial DNA than other types of DNA.)

It's important to note that in all likelihood, no single value of κ will be universally applicable to all genes in all lineages, because evolutionary pressures vary from gene to gene and the rates of transition and transversion are different for different nucleotides (and so codon usage biases come into play). For an introduction to the various considerations involved in trying to estimate κ, I recommend Yang and Nielsen's 2000 paper as well as their 1998 and 1999 papers.

The reason I bring all this up is that I want to offer yet another possible way of estimating the transition/transversion rate ratio κ, using DNA composition statistics. Earlier, I presented data showing that the purine (A+G) content of coding regions of DNA correlates directly with genome A+T content. Analyzing the genomes of representatives of 260 bacterial genera, I came up with the following graph of purine mole-percent versus A+T mole-percent:


The correlation between genome A+T content and mRNA purine content is strong and positive (r=0.852). Szybalski's rule says that message regions tend to be purine-rich, but that's not exactly accurate: when genome A+T content is below approximately 35%, coding regions are richer in pyrimidines than purines; above 35%, purines predominate. The concentration of purines in the mRNA-synonymous strand of DNA rises steadily with genome A+T content, with a slope of 0.13013.

If you try to envision evolution taking an organism from one location on this graph to another, you can imagine that GC-to-AT transitions will move an organism to the right, whereas AT-to-GC transitions will move it to the left. To a first approximation (only!) we can say that horizontal movement on this graph essentially represents the net effect of transitions.

Vertical movement on this graph clearly involves transversions, because only a transversion (a purine-for-pyrimidine swap) can change a strand's relative A+G content. To a very good first approximation, vertical movement in the graph corresponds to transversions.

Therefore, a good approximation of the relative rate of transitions versus transversions is given by the inverse of the slope. The value comes to 1.0/0.13013, or κ = 7.6846.

In an earlier post, I presented a graph like the one above applicable to mitochondrial DNA (N=203 mitochondrial genomes), which had a slope of 0.06702. Taking the inverse of that slope, we get a value of κ =14.92, which is in excellent agreement with Tamura and Nei's estimate of 15 for mitochondrial κ.

When I made a purine plot using plant and animal virus genomes (N=536), the rise rate (slope) was 0.23707, suggesting a κ value of 4.218. This agrees well with the transition/transversion rate for hepatitis C virus (as measured by Machida et al.) of 1.5 to 7.0 depending on the gene.
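The arithmetic behind those κ values is nothing more than taking reciprocals of the slopes; here it is as a quick console snippet, using the three slopes reported above:

var slopes = { bacteria: 0.13013, mitochondria: 0.06702, viruses: 0.23707 };
for (var group in slopes)
    console.log( group + ": kappa = " + (1.0 / slopes[group]).toFixed(3) );
// bacteria: kappa = 7.685,  mitochondria: kappa = 14.921,  viruses: kappa = 4.218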

In short, we get very reasonable estimates of κ from calculations involving the slope of the A+G vs. A+T graph, across multiple domains.

The main methodological proviso that applies here has to do with the fact that technically, some horizontal movement on the graph can be accomplished with transversions (AT-to-CG, for example). We made a simplifying assumption that all horizontal movement was due to transitions. That assumption is not strictly true (although it is approximately true, since transitions do outnumber transversions; and some transversions, such as AT<-->TA and GC<-->CG, have no effect on genome A+T content). Bottom line, my method of estimating κ probably overestimates κ somewhat, by including a small proportion of AT<-->CG transversions in the numerator. Even so, the estimates agree well with other estimates, tending to validate the general approach.

I invite comments from knowledgeable specialists.


A Bioinformatics Bookmarklet

Sometimes you want to scrape some screen data and analyze it on the spot without copying it to another program. It turns out there's an easy way to do just that. Just highlight the information (by click-dragging the mouse to Select a section of screen data), then run a piece of JavaScript against the selection.

Example: I do a lot of peeking and poking at DNA sequences on the web. Often, I'm interested in knowing various summary statistics for the DNA I'm looking at. For example, I might see a long sequence that looks like AGTTAGAAAACCTCAGCTACTAG (etc.) and wonder what the G+C content of that stream is. So I'll select the text by click-dragging across it. Then I'll obtain the text in JavaScript by calling getSelection().toString(). Then I parse the text and display the results in an alert dialog.

Suppose I've selected a run of DNA on-screen and I want to know the base content (the amounts of G, C, T, and A).


text = getSelection().toString(); // get the data as a string
text = text.toUpperCase(); // optionally convert it to upper case

bases = new Object;  // create a place to store the base counts
bases.G = bases.C = bases.T = bases.A = 0; // initialize

// now loop over the string contents:
for (var i = 0; i < text.length; i++)
    bases[ text[i] ]++; // bump the count for that base (assumes the selection is pure A/C/G/T text)
 
// format the data for viewing
msg = "G: " + bases.G/text.length + "\n";
msg += "C: " + bases.C/text.length + "\n";
msg += "A: " + bases.A/text.length + "\n";
msg += "T: " + bases.T/text.length + "\n";
msg += "GC Content: " + (bases.G + bases.C)/text.length; 
 
// view it:
alert( msg ); 
 
If I run this script against a web page where I've highlighted some DNA text, I get:



The nice part is, you can put the above code in a bookmarklet, associate the bookmarklet with a button, and keep it in your bookmark bar so that whenever you want to run the code, you can just point and click. To do the packaging, reformat the above code (or your modified version of it) as a single line of code preceded by "javascript:" (don't forget the colon), then set that code as the URL of a bookmark. Instead of going to a regular URL, the browser will see "javascript:" as the URL scheme and execute the code directly.
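The skeleton of such a bookmarklet URL looks roughly like this (everything on one line; the alert here is just a stand-in for the base-counting code above):

javascript:(function(){ var text = getSelection().toString().toUpperCase(); alert( "Selected " + text.length + " characters" ); })();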

Bookmarklets of this sort have proven to be a major productivity boon for me in various situations as I cruise the web. When I see data I want to analyze, I don't have to copy and paste it to Excel (or whatever). With a bookmarklet, I can analyze it instantly, sur la vitre.




Chargaff's Second Parity Rule is Broadly Violated

Erwin Chargaff, working with sea-urchin sperm in the 1950s, observed that within double-stranded DNA, the amount of adenine equals the amount of thymine (A = T) and guanine equals cytosine (G = C), which we now know is the basis of "complementarity" in DNA. But Chargaff later went on to observe the same thing in studies of single-stranded DNA, causing him to postulate that A = T and G = C more generally (within as well as across strands of DNA). The more general postulation is known as Chargaff's second parity rule. It says that A = T and G = C within a single strand of DNA.

The second parity rule seemed to make sense, because there was and is no a priori reason to think that DNA or RNA, whether single-stranded or double-stranded, should contain more purines than pyrimidines (or vice versa). All other factors being equal, nature should not "favor" one class of nucleotide over another. Therefore, across evolutionary time frames, one would expect purine and pyrimidine prevalences in nucleic acids to equalize.

What we instead find, if we look at real-world DNA and RNA, is that individual strands seldom contain equal amounts of purines and pyrimidines. Szybalski was the first to note that viruses (which usually contain single-stranded nucleic acids) often contain more purines than pyrimidines. Others have since verified what Szybalski found, namely that in many organisms, DNA is purine-heavy on the "sense" strand of coding regions, such that messenger RNA ends up richer in purines than pyrimidines. This is called Szybalski's rule.

In a previous post, I presented evidence (from analysis of the sequenced genomes of 93 bacterial genera) that Szybalski's rule not only is more often true than Chargaff's second parity rule, but in fact purine-loading of coding region "message" strands occurs in direct proportion to the amount of A+T (or in inverse proportion to the amount of G+C) in the genome. At G+C contents below about 68%, DNA becomes heavier and heavier with purines on the message strand. At G+C contents above 68%, we find organisms in which the message strand is actually pyrimidine-heavy instead of purine-heavy.

I now present evidence that purine loading of message strands in proportion to A+T content is a universal phenomenon, applying to a wide variety of eukaryotic ("higher") life forms as well as bacteria.

According to Chargaff's second parity rule, all points on this graph should fall on a horizontal line at y = 1. Instead, we see that Chargaff's rule is violated for all but a statistically insignificant subset of organisms. Pink/orange points represent eukaryotic species. Dark green data points represent bacterial genera. See text for discussion. Permission to reproduce this graph (with attribution) is granted.

To create the accompanying graph, I did frequency analysis of codons for 58 eukaryotic life forms (pink data points) and 93 prokaryotes (dark green data points) in order to derive prevalences of the four bases (A, G, C, T) in coding regions of DNA. Eukaryotes that were studied included yeast, molds, protists, warm- and cold-blooded animals, flowering and non-flowering plants, algae, and insects and crustaceans. The complete list of organisms is shown in a table further below.
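The "Purine ratio" in the table below is the purine-to-pyrimidine ratio of coding-region bases, (A+G)/(C+T), which equals 1.0 if Chargaff's second parity rule holds. If you want to reproduce that number for an organism, the calculation is simple; here's a minimal sketch, assuming you've already tallied base counts for the coding regions (the counts shown are made up):

// hypothetical base counts tallied from the coding regions of some genome
var counts = { A: 250000, G: 270000, C: 240000, T: 230000 };
var purineRatio = (counts.A + counts.G) / (counts.C + counts.T);
var gcPercent = 100 * (counts.G + counts.C) /
                (counts.A + counts.G + counts.C + counts.T);
console.log( gcPercent.toFixed(2) + "% G+C, purine ratio " + purineRatio.toFixed(3) );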

It can now be stated definitively that Chargaff's second parity rule is, in general, violated across all major forms of life. Not only that, it is violated in a regular fashion, such that purine loading of mRNA increases with genome A+T content. Significantly, some organisms with very low A+T content (high G+C content) actually have pyrimidine-loaded mRNA, but they are in a small minority.

Purine loading is both common and extreme. For about 20% of organisms, the purine-pyrimidine ratio is above 1.2. For some organisms, the purine excess is more than 40%, which is striking indeed.

Why should purines migrate to one strand of DNA while pyrimidines line up on the other strand? One possibility is that it minimizes spontaneous self-annealing of separated strands into secondary structures. Unrestrained "kissing" of intrastrand regions during transcription might lead to deleterious excisions, inversions, or other events. Poly-purine runs would allow the formation of many loops but few stems; in general, secondary structures would be rare.

The significance of purine loading remains to be elucidated. But in the meantime, there can be no doubt that purine enrichment of message strands is indeed widespread and strongly correlates to genome A+T content. Chargaff's second parity rule is invalid, except in a trivial minority of cases.

The prokaryotic organisms used in this study were presented in a table previously. The eukaryotic organisms are shown in the following table:

Organism Comment G+C% Purine ratio
Chlorella variabilis strain NC64A endosymbiont of Paramecium 68.76 1.1055181128896376
Chlamydomonas reinhardtii strain CC-503 cw92 mt+ unicellular alga 67.96 1.0818749999999997
Micromonas pusilla strain CCMP1545 unicellular alga 67.41 1.1873268193087356
Ectocarpus siliculosus strain Ec 32 alga 62.74 1.2090728330510347
Sporisorium reilianum SRZ2 smut fungus 62.5 0.9776547360094916
Leishmania major strain Friedlin protozoan 62.47 1.0325
Oryza sativa Japonica Group rice 54.77 1.0668412348401317
Takifugu rubripes (torafugu) fish 54.08 1.0655094027691674
Aspergillus fumigatus strain A1163 fungus 53.89 1.013091641490433
Sus scrofa (pig) pig 53.77 1.0680595779892428
Drosophila melanogaster (fruit fly) 53.69 1.0986989367655287
Brachypodium distachyon line Bd21 grass 53.32 1.0764746703677999
Selaginella moellendorffii (Spikemoss) moss 52.83 1.1014492753623195
Equus caballus (horse) horse 52.29 1.0844453711426192
Pongo abelii (Sumatran orangutan) orangutan 52 1.0929015146227405
Homo sapiens human 51.97 1.0939049081896255
Mus musculus (house mouse) strain mixed mouse 51.91 1.0827720297201582
Tuber melanosporum (Perigord truffle) strain Mel28 truffle 51.4 1.0836820083682006
Phaeodactylum tricornutum strain CCAP 1055/1 diatom 51.06 1.0418452745458253
Arthroderma benhamiae strain CBS 112371 fungus 50.99 1.0360268674944024
Ornithorhynchus anatinus (platypus) platypus 50.97 1.1121909993661525
Taeniopygia guttata (Zebra finch) bird 50.81 1.1344717182497328
Trypanosoma brucei TREU927 sleeping sickness protozoan 50.78 1.106974784013486
Danio rerio (zebrafish) strain Tuebingen fish 49.68 1.1195053003533566
Gallus gallus chicken 49.54 1.1265418970650787
Monodelphis domestica (gray short-tailed opossum) opossum 49.07 1.0768110918544194
Sorghum bicolor (sorghum) sorghum 48.93 1.046422719825232
Thalassiosira pseudonana strain CCMP1335 diatom 47.91 1.1403183213189638
Hyaloperonospora arabidopsis mildew 47.75 1.053039546400631
Daphnia pulex (common water flea) water flea 47.57 1.058036633052068
Physcomitrella patens subsp. patens moss 47.33 1.1727134477514667
Anolis carolinensis (green anole) lizard 46.72 1.113765477057538
Brassica rapa flowering plant 46.29 1.1056659411640803
Fragaria vesca (woodland strawberry) strawberry 46.02 1.1052853232259425
Amborella trichopoda flowering shrub 45.88 1.0992441209406494
Citrullus lanatus var. lanatus (watermelon) watermelon 44.5 1.0855134984692458
Capsella rubella mustard-family plant 44.37 1.1041257367387034
Arabidopsis thaliana (thale cress) cress 44.15 1.109853013573388
Lotus Japonicus lotus 44.11 1.0773228019122847
Populus trichocarpa (Populus balsamifera subsp. trichocarpa) tree 43.7 1.1097672456226706
Cucumis sativus (cucumber) cucumber 43.56 1.0823847862298719
Caenorhabditis elegans strain Bristol N2 worm 42.96 1.106320224719101
Vitis vinifera (grape) grape 42.75 1.0859833393697935
Ciona intestinalis tunicate 42.68 1.158652461848546
Solanum lycopersicum (tomato) tomato 41.7 1.1177
Theobroma cacao (chocolate) chocolate 41.31 1.1297481860862142
Medicago truncatula (barrel medic) strain A17 flowering plant 40.78 1.093754366354618
Apis mellifera (honey bee) strain DH4 honey bee 39.76 1.216042543762464
Saccharomyces cerevisiae (bakers yeast) strain S288C yeast 39.63 1.1387641650630744
Acyrthosiphon pisum (pea aphid) strain LSR1 aphid 39.35 1.1651853457619772
Debaryomyces hansenii strain CBS767 yeast 37.32  1.1477345930856775
Pediculus humanus corporis (human body louse) strain USDA louse 36.57 1.2365791828213537
Schistosoma mansoni strain Puerto Rico trematode 35.94 1.0586902800658977
Candida albicans strain WO-1 yeast 35.03 1.1490291609944834
Tetrapisispora phaffii CBS 4417 strain type CBS 4417 yeast 34.69 1.17503805175038
Paramecium tetraurelia strain d4-2 protist 30.03 1.2494922903347117
nucleomorph Guillardia theta endosymbiont 23.87 1.1529462427330803
Plasmodium falciparum 3D7 malaria parasite 23.76 1.4471365638766511

DNA: Full of Surprises

DNA is full of surprises, one of them being the radically different ways in which it can be used to express information. We think of DNA as a four-letter language (A,T,G,C), but some organisms choose to "speak" mostly G and C. Others avoid G and C, preferring instead to "speak" A and T. The question is, if DNA is fundamentally a four-letter language, why would some organisms want to limit themselves to dialects that use mostly just two letters?

The DNA of Clostridium botulinum (the botulism bug; a common soil inhabitant) is extraordinarily deficient in G and C: over 70% of its DNA is A and T. The soil bacterium Anaeromyxobacter dehalogenans, on the other hand, has DNA that's 74% G and C. Think of the constraints this puts on a coding system. Imagine that you want to store data using a four-letter alphabet, but you are required to use two of the four letters 74% of the time! Suddenly a two-bit-per-symbol encoding scheme (a four-letter code) starts to look and feel a lot more like a one-bit-per-symbol (two-letter) scheme.
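To put a number on that intuition, you can compute the information content per symbol for any base composition. Here's a rough sketch for a genome that's 74% G+C (the frequencies are my own illustrative numbers):

// assume G and C each occur 37% of the time, A and T each 13% (74% G+C overall)
var freqs = [ 0.37, 0.37, 0.13, 0.13 ];
var bitsPerSymbol = 0;
for (var i = 0; i < freqs.length; i++)
    bitsPerSymbol += -freqs[i] * Math.log( freqs[i] ) / Math.LN2;
console.log( bitsPerSymbol ); // roughly 1.83 bits, versus the 2-bit theoretical maximum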

What kinds of information are actually stored in DNA? Several kinds, but bottom line, DNA is primarily a system for specifying sequences of amino acids. The information is stored as three-letter "words" (GCA, ATG, TCG, etc.) called codons. There are 64 possible length-3 words in a system that uses a 4-letter alphabet. Fortunately, there are only 20 amino acids. I say "fortunately," because imagine if there were 64 different amino acids (as there might be in extra-terrestrial life, say) and they had to occur in roughly equal amounts in all proteins. Every possible codon would have to be used (in roughly equal numbers) and there would be no possibility of an organism like C. botulinum developing a "preference" for A or T in its DNA. It is precisely because only 20 of the 64 possible codons need be used that organisms like C. botulinum (with a huge imbalance of AT vs. GC in its DNA) can exist.

As it happens, all organisms do tend to use all 64 possible codons, but they use them with vastly varying frequencies, giving rise to codon "dialects." (Note that the mapping of 64 codons onto 20 amino acids means some codons are necessarily synonymous. For example, there are four different codons for glycine and six for leucine.) You might expect that an organism like C. botulinum with mostly A and T in its DNA would "speak" in A- and T-rich codons. And you'd be right. Here's a chart showing which codons C. botulinum actually uses, and at what frequencies:


The green-highlighted codons are the ones C. botulinum uses preferentially (with the usage frequencies shown as percentages). As you can see, the most-often-used codons tend to contain a lot of A and/or T. Which is exactly what you'd expect, given that the organism's DNA is 72% A and T.

In theory, a 3-letter word in a 4-letter language can store six bits of information. But we know from information theory that the actual information content of a word depends on how often it's used. If I send you a 100-word e-mail that contains the question "Why?" repeated 100 times, you're not really receiving the same amount of information as would be in a 100-word e-mail that contains text in which no word appears twice.

The average information content of a C. botulinum codon is easily calculated using the usage-frequencies shown above. (All you do is calculate -F * log2(F) for each codon and add up the results.) If you do the math, you find that C. botulinum uses an average of 5.217 bits per codon, about 13% short of the theoretical six bits available.
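In code, the whole calculation is a short loop once you have the 64 usage frequencies in hand. Here's a minimal sketch (the frequency table is abbreviated and the values are placeholders; a real table would list all 64 codons, summing to 1.0):

var codonFrequencies = { TTT: 0.029, TTA: 0.051, AAA: 0.062 /* ... the other 61 codons ... */ };
var entropy = 0;
for (var codon in codonFrequencies) {
    var F = codonFrequencies[codon];
    if (F > 0)
        entropy += -F * Math.log( F ) / Math.LN2; // -F * log2(F)
}
console.log( entropy + " bits per codon" );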

One might imagine that the more GC/AT-imbalanced an organism's DNA is, the more biased its codon preferences will be. This is exactly what we find if we plot codon entropy against genome G+C content for a range of organisms having DNA of various G+C contents.

Average codon entropy versus genome G+C content for 90 microorganisms.
In the above graph, you can see that when an organism's DNA is composed of equal amounts of the bases (G+C = 50%, A+T = 50%), the organism tends to use all codons more or less equally, and entropy approaches the theoretical limit of six bits per codon. But when an organism develops a particular "dialect" (of GC-rich DNA, or AT-rich DNA), it starts using a smaller and smaller codon vocabulary more and more intensively. This is what causes the curve to fall off sharply on either side of the graph.

If you have an observant eye, you may have noticed that the two halves of the graph are not symmetrical, even though they look symmetrical at first glance. (Organisms on the high-GC side are using slightly less entropy per codon than low-GC organisms, for a given amount of genome GC/AT skew.) If you're a biologist, you might want to think about why this is so. I'll return to the subject in a future post.


Decrypting DNA

In a previous post ("Information Theory in Three Minutes"), I hinted at the power of information theory to gauge redundancy in a language. A fundamental finding of information theory is that when a language uses symbols in such a way that some symbols appear more often than others (for example when vowels turn up more often than consonants, in English), it's a tipoff to redundancy.

DNA is a language with many hidden redundancies. It's a four-letter language, with symbol choices of A, G, C, and T (adenine, guanine, cytosine, and thymine), which means any given symbol should be able to convey two bits' worth of information, since log2(4) is two. But it turns out, different organisms speak different "dialects" of this language. Some organisms use G and C twice as often as A and T, which (if you do the math) means each symbol is actually carrying a maximum of 1.837 bits (not 2 bits) of information.

Consider how an alien visitor to earth might be able to use information theory to figure out terrestrial molecular biology.

The first thing an alien visitor might notice is that there are four "symbols" in DNA (A, G, C, T).

By analyzing the frequencies of various naturally occurring combinations of these letters, the alien would quickly determine that the natural "word length" of DNA is three.

There are 64 possible 3-letter words that can be spelled with a 4-letter alphabet. So in theory, a 3-letter "word" in DNA should convey 6 bits worth of information (since 2 to the 6th power is 64). But an alien would look at many samples of earthly DNA, from many creatures, and do a summation of -F * log2(F) for every 3-letter "word" used by a given creature's DNA (where F is simply the frequency of usage of the 3-letter combo). From this sort of analysis, the alien would find that even though 64 different codons (3-letter words) are, in fact, being used in earthly DNA, in actuality the entropy per codon in some cases is as little as 4.524 bits. (Or at least, it approaches that value asymptotically.)

Since 2 to the 4.524 power is 23, and since proteins (the predominant macromolecule in earthly biology) are made of amino acids, a canny alien would surmise that there must be around 23 different amino acids; and earthly DNA is a language for mapping 3-letters words to those 23 amino acids.

As it turns out, the genetic code does use 3-letter "words" (codons) to specify amino acids, but there are 20 amino acids (not 23), with 3 "stop codons" reserved for telling the cell's protein-making machinery "this is the end of this protein; stop here."

E. coli codon usage.
The above chart shows the actual codon usage pattern for E. coli. Note that all organisms use the same 3-letter codes for the same amino acids, and most organisms use all 64 possible codons, but the codons are used with vastly unequal frequencies. If you look in the upper right corner of the above chart, for example, you'll see that E. coli uses CTG (one of the six codons for Leucine) far more often than CTA (another codon for Leucine). One of the open questions in biology is why organisms favor certain synonymous codons over others (a phenomenon called codon usage bias).

While DNA's 6-bit codon bandwidth permits 64 different codons, and while organisms do generally make use of all 64 codons, the uneven usage pattern means fewer than 6 bits of information are used per codon. To get the actual codon entropy, all you have to do is take each usage frequency and calculate -F * log2(F) for each codon, then sum. If you do that for E. coli, you get 5.679 bits per codon. As it happens, E. coli actually does make use of almost all the available bandwidth (of 6 bits) in its codons. This turns out not to be true for all organisms, however.