Pages

.

Showing posts with label comparative genomics. Show all posts
Showing posts with label comparative genomics. Show all posts

Converting an SVG Graph to Histograms

The graphs you get from ZunZun.com (the free graphing service) are pretty neat, but one shortcoming of ZunZun is that it won't generate histograms. (Google Charts will do histograms, but unlike ZunZun, Google won't give you SVG output.) The answer? Convert a ZunZun graph to histograms yourself. It's only SVG, after all. It's XML; it's text. You just need to edit it.

Of course, nobody wants to hand-edit a zillion <use> elements (to convert data points to histogram rects). It makes more sense to do the job programmatically, with a little JavaScript.

In my case, I had a graph of dinucleotide frequencies for Clostridium botulinum coding regions. What that means is, I tallied the frequency of occurrence (in every protein-coding gene) of 5'-CpG-3', CpC, CpA, CpT, ApG, ApA, ApC, and all other dinucleotide combinations (16 in all). Since I already knew the frequency of G (by itself), A, C, and T, it was an easy matter to calculate the expected frequency of occurrence of each dinucleotide pair. (For example, A occurs with frequency 0.403, whereas G occurs with frequency 0.183. Therefore the expected frequency of occurrence of the sequence AG is 0.403 times 0.183, or 0.0738.) Bottom line, I had 16 expected frequencies and 16 actual frequencies, for 16 dinucleotide combos. I wanted side-by-side histograms of the frequencies.

First, I went to ZunZun and entered my raw data in the ZunZun form. Just so you know, this is what the raw data looked like:

0 0.16222793723642806
1 0.11352236777965981
2 0.07364933857345456
3 0.08166221769088752
4 0.123186555838253
5 0.12107590293804558
6 0.043711462078314355
7 0.03558766171971166
8 0.07364933857345456
9 0.07262685957145093
10 0.033435825941632816
11 0.03459042802303202
12 0.055925067612781175
13 0.042792101322514244
14 0.019844425842971265
15 0.02730405457750352
16 0.123186555838253
17 0.12232085101526233
18 0.055925067612781175
19 0.05502001002972254
20 0.09354077847378013
21 0.07321410524577443
22 0.03319196776961071
23 0.028600012050969865
24 0.043711462078314355
25 0.043328337600588136
26 0.019844425842971265
27 0.0062116692282947845
28 0.03319196776961071
29 0.04195172151930211
30 0.011777822917388797
31 0.015269662767317132


I made ZunZun graph the data, and it gave me back a graph that looked like this:



Which is fine except it's not a histogram plot. And it has goofy numbers on the x-axis.

I clicked the SVG link under the graph and saved an SVG copy to my local drive, then opened the file in Wordpad.

The first thing I did was locate my data points. That's easy: ZunZun plots points as a series of <use> elements. The elements are nested under a <g> element that looks like this:

<g clip-path="url(#p0c8061f7fd)">

I hand-edited this element to have an id attribute with value "DATA":

<g id="DATA" clip-path="url(#p0c8061f7fd)">

Next, I scrolled up to the very top of the file and found the first <defs> tag. Under it, I placed the following empty code block:

<script type="text/ecmascript"><![CDATA[
// code goes here

]]></script>

Then I went to work writing code (to go inside the above block) that would find the <use> elements, get their x,y values, and create <rect> elements of a height that would extend to the x-axis line.

The code I came up with looks like this:



// What is the SVG y-value of the x-axis?
// Attempt to discover by introspecting clipPath

function findGraphVerticalExtent( ) {
   var cp = document.getElementsByTagName('clipPath')[0];
   var rect = cp.childNodes[1];
   var top = rect.getAttribute('y') * 1;
   var bottom = rect.getAttribute('height') * 1;
   return top + bottom;
}


// This is for use with SVG graphs produced by ZunZun,
// in which data points are described in a series of
// <use> elements. We need to get the list of <use>
// nodes, convert it to a JS array, sort data points by
// x-value, and replace <use> with <rect> elements.

function changeToHistograms( ) {

   var GRAPH_VERTICAL_EXTENT = findGraphVerticalExtent( );

   // The 'g' element that encloses the 'use' elements
   // needs to have an id of "DATA" for this to work!
   // Manually edit the <g> node's id first!
   var data = document.getElementById( "DATA" );

   // NOTE: The following line gets a NodeList object,
   // which is NOT the same as a JavaScript array!
   var nodes = data.getElementsByTagName( "use" );

   // utility routine (an inner method)
   function nodeListToJavaScriptArray( nodes ) {

       var results = [];

       for (var i = 0; i < nodes.length; i++)
          results.push( nodes[i] );

       return results;
   }

   // utility routine (another inner method)
   function compareX( a,b ) {
       return a.getAttribute("x") * 1 - b.getAttribute("x") * 1;
   }

   var use = nodeListToJavaScriptArray( nodes );

   // We want the nodes in x-sorted order
   use.sort( compareX ); // presto, done

   // Main loop
   for (var i = 0; i < use.length; i++) {

       var rect =
           document.createElementNS("http://www.w3.org/2000/svg", "rect");
       var item = use[i];
       var x = item.getAttribute( "x" ) * 1;
       var y = item.getAttribute( "y" ) * 1;
       var rectWidth = 8;
       var rectHeight = GRAPH_VERTICAL_EXTENT - y;
       rect.setAttribute( "width", rectWidth.toString() );
       rect.setAttribute( "height", rectHeight.toString() );
       rect.setAttribute( "x" , x.toString() );
       rect.setAttribute( "y" , y.toString() );

       // We will alternate colors, pink/purple
       rect.setAttribute( "style" ,
           (i%2==0)? "fill:ce8877;stroke:none" : "fill:8877dd;stroke:none" );

       data.appendChild( rect ); // add a new rect
       item.remove(); // delete the old <use> element
   }

   return use;
}

As so often happens, I ended up writing more code than I thought it would take. The above code works fine for converting data points to histogram bars (as long as you remember to give that <g> element the id attribute of "DATA" as mentioned earlier). But you need to trigger the code somehow. Answer: insert onload="changeToHistograms( )" in the <svg> element at the very top of the file.

But I wasn't done, because I also wanted to apply data labels to the histogram bars (labels like "CG," "AG," "CC," etc.) and get rid of the goofy numbers on the x-axis.

This is the function I came up with to apply the labels:


   function applyLabels( sortedNodes ) {

var labels = ["aa", "ag", "at", "ac",
"ga", "gg", "gt", "gc", "ta", "tg",
"tt", "tc", "ca", "cg", "ct", "cc"];

var data = document.getElementById( "DATA" );
var labelIndex = 0;

for (var i = 0; i < sortedNodes.length; i+=2) {
var text =
document.createElementNS("http://www.w3.org/2000/svg", "text");
var node = sortedNodes[i];
text.setAttribute( "x", String( node.getAttribute("x")*1 +2) );
text.setAttribute( "y", String( node.getAttribute("y")*1 - 13 ) );
text.setAttribute( "style", "font-size:9pt" );
text.textContent = labels[ labelIndex++ ].toUpperCase();
text.setAttribute( "id", "label_" + labelIndex );
data.appendChild( text );
}
}


And here's a utility function that can strip numbers off the x-axis:

   // Optional. Call this to remove ZunZun graph labels.
// pass [1,2,3,4,5,6,7,8,9] to remove x-axis labels
function removeZunZunLabels( indexes ) {

for (var i = 0;i < indexes.length;i++)
try {
document.getElementById("text_"+indexes[i]).remove();
}
catch(e) { console.log("Index " + i + " not found; skipped.");
}
  
BTW, if you're wondering why I multiply so many things by one, it's because the attribute values that comprise x and y values in SVG are String objects. If you add them, you're concatenating strings, which is not what you want. To convert a number in string form to an actual JavaScript number (so you can add numbers and not concatenate strings), you can either multiply by one or explicitly coerce a string to a number by doing Number( x ).

The final result of all this looks like:


Final graph after surgery. Expected (pink) and actual (purple) frequencies of occurrence of various dinucleotide sequences in C. botulinum coding-region DNA.

Which is approximately what I wanted to see. The labels could be positioned better, but you get the idea.

What does the graph show? Well first of all, you have to realize that the DNA of C. botulinum is extremely rich in adenine and thymine (A and T): Those two bases constitute 72% of the DNA. Therefore it's absolutely no surprise that the highest bars are those that contain A and/or T. What's perhaps interesting is that the most abundant base (A), which should form 'AA' sequences at a high rate, doesn't. (Compare the first bar on the left to the shorter purple bar beside it.) This is especially surprising when you consider that AAA, GAA, and AAT are by far the most-used codons in C. botulinum. In other words, 'AA' occurs a lot, in codons. But even so, it doesn't occur as much as one would expect.

It's also interesting to compare GC with CG. (Non-biologists, note that these two pairs are not equivalent, because DNA has a built-in reading direction. The notation GC, or equivalently, GpC, means there's a guanine sitting on the 5' side of cytosine. The notation CG means there's a guanine on the 3' side of cytosine. The 5' and 3' numbers refer to deoxyribose carbon positions.) The GC combo occurs more often than predicted by chance whereas the combination CG (or CpG, as it's also written) occurs much less frequently than predicted by chance. The reasons for this are fairly technical. Suffice it to say, it's a good prima facie indicator that C. botulinum DNA is heavily methylated. Which in fact it is.
reade more... Résuméabuiyad

Do-It-Yourself Bio-Hacking: A Tutorial

Today I want to show you how you can do a slick bio-hacking experiment, and graph the results nicely, all in your browser, in well under 10 minutes. The following experiment will run just fine in Chrome or Firefox. In Firefox, it helps to have the Firebug extension. (If you're using Firefox, click F12. If it pops a console window, you already have Firebug.) I tested against Chrome v28.0.1500.72 and Firefox 15.0.1 with Firebug 1.9.2. Other combinations may work; those are just the ones I tested.

We're going to do a comparative genomics/proteomics experiment designed to explore amino-acid usage in a particular protein (DnaJ) across a couple dozen bacterial species. Even if you're not a bio-geek, I hope you'll follow along. At the very least, you'll learn how to make pretty graphs from any kind of data using the server at ZunZun.

What is the DnaJ protein, you ask? It's one of a class of proteins known as heat shock proteins, which are produced in response to elevated temperatures. (Your body produces heat shock proteins in response to fever, for example.) As you probably know (or can guess), proteins, in general, are rather sensitive to heat. Even a small amount of heat can cause a protein to start to unravel (or denature). DnaJ and its partners have the job of helping proteins re-fold into their correct original 3D shape(s) after exposure to heat. They're like little repair jigs. A partially damaged protein goes in; it re-folds and comes back out good as new.

Heat shock proteins occur widely, across all domains of life, and their amino-acid sequences are highly conserved; but they do differ. As we'll see right now.

Step 1
Go to http://www.uniprot.org/ and enter "DnaJ" (case doesn't matter) in the search field at the top of the page, then hit Enter. A list of organisms with DnaJ will appear, each with a checkbox on the left. Check all the checkboxes on the page (gang-check them with Shift-click).

Step 2
You'll notice at the bottom of the window there's a green bar with buttons "Retrieve," "Align," "Blast," and "Clear." Click the Retrieve button.

Steps 1 and 2.


Step 3
In the page that comes up, look for FASTA on the left. Under it are two links, Download and Open. Click Open. (See screen shot below.) You'll see a bunch of protein sequences (with one-letter abbreviations for amino acids), each preceded by a line that begins with > (greater-than sign). These are our DnaJ proteins.

Step 3.

Step 4
Click F12 to toggle open the console window. Be sure the Console tab is showing. In Firebug, you may also have to click the Console menu and choose Command Editor from the dropdown list.

Enter and execute (with Enter, in Chrome, or with Control-Enter in Firebug) the following lines of code:

all = document.getElementsByTagName("pre")[0].innerHTML.split(/&gt;/);
all.shift(); 

It's important that the part between slashes be ampersand-g-t-semicolon, not a greater-than symbol. The browser is showing you greater-than signs but in the HTML markup it's really ampersand-g-t-semicolon, not angle brackets. We actually do want to split on &gt;, not on >.

Note that to execute a line of code in Firebug you have to type Control-Enter. In Chrome, you just type Enter. But in Chrome's JavaScript console, you have to use Shift-Enter to type on more than one line.

The variable all now contains an array of protein sequences. If you want to verify it, type all.length (then Enter, or in Firebug Control-Enter), and you should see the length of the array, 25.

Step 5
Enter the following code in the console (and execute it with Enter; it'll do nothing, which is fine).


function analyze( item ) {

var sequence = item.split(/SV=\d\n(?=\w)/)[1];
var lysineCount = sequence.match(/K/g).length;
var arginineCount = sequence.match(/R/g).length;
lysineCount /= sequence.length;
arginineCount /= sequence.length;
console.log( lysineCount + " " + arginineCount );

}
 
This is the callback code we'll use to process every member of the all array. Each item in the array consists of a FASTA header followed by a protein sequence. We just want the sequence, not the header, which is why we have a first line that splits off the part we need. The remaining lines obtain the number of lysines (K) and the number of arginines (R) in the protein sequence, then we divide those numbers by the sequence length to get a frequency-of-occurrence. The final line prints the results to the console window.

This function, by itself, doesn't do anything until we run it against each amino-acid sequence in the all array. That's the next step.

Step 6
Enter the following line of code into the console and run it with Enter (or Control-Enter, in Firebug):

all.forEach( analyze );
 
The console should immediately fill with numbers (25 rows of two numbers each). That's our data. We need to graph it to see what it looks like. Ready?

Step 7
Go now to http://zunzun.com and notice four pulldown menus at the top of the page. Use the far-left dropdown to select Polynomial.

Select Polynomial from the ZunZun function list.

A new window appears with ugly (or beautiful, depending on your mindset) formulas. Click the link to First-Order (Linear) 2D. Why? Because in the absence of any foreknowledge, we're going to blindly assume that our data is best fit by a straight line. If it's not straight-line data, we can come back and change our selection later.

When you click the First-Order (Linear) 2D link, you'll quickly be in a stark-looking window with a single pulldown menu at the top. Click it and select Data Labels for Graphs. Replace "X data" with "Lysine" and "Y data" with "Arginine."

Select Data Labels for Graph.

Step 8
Now use the single pulldown menu to select Text Data Editor.

Quickly go back to that console window and Copy all of your data (all 25 rows of numbers), then Paste the data into the Text Data Editor box.



Click the Submit button near the top of the page. Be patient, as it may take up to 20 seconds or so for your graph to be ready.

You'll know your graph is ready when the window changes to one that shows four pulldown menus at the top. The far-right menu is Data Graphs. Click into it and select Arginine vs. Lysine with model. NOTE: The exact names of the menu items will depend on how you labeled your axes at the end of Step 7 above.

You should see the window change to a view of a graph that looks like this:

Graph created on demand by the ZunZun server.

Pretty easy, right? It gets better. The line that ZunZun drew through the data points is a regression curve that minimizes the sum of squared error. To see the formula for the line, including coefficients, use the far-left menu, called Coefficients and Text Reports, to select Coefficients. Don't worry, your graph will still be there when you're done. To get back to the graph at any time, just use the far-right menu and any of the commands under it (which re-display the graph in various ways).

The graph seems to be saying that Arginine levels go down as Lysine goes up. But how good of a correlation is this, really? Use the far-left pulldown menu again. This time select Coefficient and Fit Statistics. You'll notice a ton of stats (chi-squared and so on). Among them, r-squared is given as 0.637834788057. That means the correlation coefficient, r, is 0.799, which is pretty solid.

I'll save the interpretation of our experiment's results for another time. For now, notice that underneath your ZunZun graph are links for saving the graph either as PNG or SVG. I strongly recommend you save it as both. You can open SVG in both Photoshop and Illustrator (and most browsers too). You will definitely want to keep an SVG version around to edit by hand in your favorite text editor (SVG is just a variety of XML). I'll be showing you how to do lots of sexy things with SVG graphs in upcoming posts.

reade more... Résuméabuiyad

Getting Started in Desktop Bioinformatics

I've spent about four months now exploring do-it-yourself desktop bioinformatics. Overall, I'm excited by what I've been able to do and I'm optimistic about the prospects for other do-it-yourself desktop-science geeks, because there are tons of great online tools for doing bio-sci and lots of important scientific questions yet to be fleshed out. So I thought I'd share some of what I've learned, and provide some pointers for anyone who might want to try his or her hand at this sort of "citizen science."

It helps to have the benefit of a science education (in particular, a bio education) before beginning, but one of the great things about do-it-yourself desktop science is that you can (and will!) learn as you go. For example, you might have only a bare-bones basic understanding of enzymology before you begin, but as you move deeper into a particular research quest, you'll find yourself wanting to learn more about this or that aspect of an enzyme. So you'll hit Google Scholar and bring yourself up-to-date on this or that detail of a particular subject. That's a Good Thing.

When I first plunged into DNA analysis, I have to admit my knowledge of mitochondria was weak. I knew they had their own DNA, for example, but it wasn't obvious to me (until I started digging) that mitochondrial DNA is pathetically small, whereas the mitochondrial proteome (the superset of all products that go into making up a functioning mitochondrion) is large. In other words, most "mitochondrial genes" are not in mtDNA. They're in nuclear DNA. There are a couple of online databases of nuclear mitochondrial genes (NUMTs, as they're known), but by and large this is an area in dire need of more research. Someone needs to put together a database or reference set of yeast NUMTs, for example. We also need a database for algal NUMTs, another for protozoan NUMTs, another for rice or corn or Arabidopsis NUMTs, etc. Maybe you'll be the one to move such a project along?

So. How can you get started in desktop bioinformatics? I recommend familiarizing yourself with the great tools at genomevolution.org, which is powered by iPlant, which (in turn) is funded by the National Science Foundation Plant Cyberinfrastructure Program here in the U.S. In particular, I recommend you set aside an evening to run through some of the tutorials at genomevolution.org. That'll give you an idea of what's possible with their tools.
Many organisms have genes for flagellum proteins,
but not all such organisms actually make a flagellum.
(The flagellum is the whiplike appendage that gives the
cell motility. Above: Bdellovibrio, a bacterium with a
powerful flagellum.)

If you go to this page and scroll down, you'll find some really interesting short videos showing how to use some of the genomevolution.org tools. They're fun to watch and should stimulate your imagination.

What kinds of problems need investigating by desktop biologists? The sky's the limit. One quest that lends itself to citizen science is looking for examples of horizontal gene transfer (HGT). This requires that you first teach yourself a little bit about BLAST searches. (BLAST searches are sequence-similarity searches that let you compare DNA against DNA or amino-acid sequence against amino-acid sequence.) The strategies involved here can range from simple and brute force to sophisticated; and the great thing is, you can invent your own heuristics. It's a wide-open area. I recently found good evidence (90%+ similarity of DNA sequences) for bacterial gene transfer into rice, which I'll write about in a later post. I'm confident there are thousands of examples of horizontal gene transfer (whether from bacteria to bacteria, bacteria to plant, bacteria to insect, or whatever) waiting to be discovered. You could easily be the next discoverer of one of these gene transfers.

Here are some other ideas for desktop-science explorations:
  • Find and characterize flagellar genes in organisms that lack motility. If you dig into the literature, you'll find that there are many examples of supposedly immotile organisms (like the intracellular parasite Buchnera, which lives inside aphids) that not only harbor flagellum genes but express some of them—yet have no external flagellum. Obviously, organisms that retain flagellum genes but actually don't make a flagellum (that little whip-like tail that makes single-celled organisms swim around) must be retaining those genes for a reason. The gene products must be doing something. But what? Also: Paramecium and diatoms and other eukaryotes make flagella and/or cilia. Most animals also make cilia. (Ever get a tickle deep in your throat or bronchia? It was probably something tangling with the cilia lining your bronchial system.) What's the relationship between cilia gene products in Paramecium, say, and cilia in animals? Do any plants conceal cilia genes? If so, how are they related phylogenetically to lower-organism cilia?
  • Migration of genes from parasites to host DNA. A general pattern that seems to happen in nature is: a bacterium or other invader takes up residency inside an animal or plant cell, becoming an endosymbiont; then some of its genes (the symbiont's genes) move to the nucleus of the host cell. Which genes? What do the genes do? That's up to you to try to find out. 
  • Bidirectionally ("bidi") transcribed genes: While rare, there are examples of genes in which each strand of DNA is transcribed into mRNA. (The genome for Rothia mucilaginosa contains many putative examples of this.) Find organisms that contain bidi genes. Try to determine if both strands are actually transcribed. Examine sister-species organisms to see if one strand is transcribed in one organism and the other gene (on the other strand) is transcribed in the other organism.
  • Phylogenetics of plasmid and viral genes. Try to determine the ancestry of a virus gene. There are good tree-making services online that do all the hard work for you, including protein-sequence alignments. All you have to do is cut and paste Fasta files.
  • Codon analysis. There are many plants (rice is one) and higher organisms in which DNA is more or less equally divided into high-GC-content genes and low-GC-content genes. Surely the codon usage patterns for each class of gene(s) varies. But how? What are the codon adaptation indexes (CAI values) for the various genes? Create a few histograms of CAI values. Use CAIs and other techniques to try to determine which genes are highly expressed. Are HEGs (highly expressed genes) mostly high-GC? Low-GC? Both? Run some histograms on the genes' purine (A+G) content, G+C content, G+C content by codon position.
  • Many organisms (and organelles) have extremely GC-poor genomes. Some have bizarre codon usage patterns (where, say, the codon AAA is used 12 times more than the average codon). Some use 56 or fewer codons (out of 64 possible). Find the organelle or organism that uses the fewest codons. See if there's an organism or organelle that uses fewer than 20 amino acids. Which amino acid(s) get(s) left out? 
  • Characterize the DNA repairosome of an aerobic and an anerobic archeon. Compare and contrast the two.
  • Find all the genes in a particular organism that have mitochondrial-targeting presequences in their DNA. 
  • Pick two closely related organisms. Try to figure out how many million years ago they diverged. Use mitochondrial DNA analysis as well as cytoplasmic protein analysis. 
  • Find the bacterium that has more secretion-protein, permease, and protein-translocation genes than any other. Compare it to its closest relative. 
  • Find an organism that is pathogenic (to humans, animals, or plants). Find its closest non-pathogenic relative. Compare genomes. Determine which genes are most likely to be involved in virulence. 
  • Some seemingly simple organisms (amoebae) have more DNA than a human. Why? What's all that DNA doing there? Does it contain horizontally transferred genes from plants, bacteria, archeons, animals? Are amoebas and other super-large-genome organisms "DNA hoarders"? Are they DNA curationists? Characterize the genes (enumerate by category, first) of these organisms. How many are expressed? How many are junk? What's the energy cost of maintaining that much junk DNA? Can it all be junk? Is an amoeba actually a devolved higher life form that forgot how to do morphogenesis and can no longer develop into a tadpole or whatever?
  • [ insert your own project here! ]

reade more... Résuméabuiyad

Why Mitochondrial DNA is Different

Most genomes that are high in A+T content (or low in G+C content) show a surprising DNA strand asymmetry: The message strand of genes tends to be rich in purines. This rule applies across all domains I've looked at except mitochondria, where message strands tend to be pyrimidine-rich rather than purine-rich. The following two graphs makes this clearer.


This is a graph of message-strand (or RNA-synonymous-strand) purine content plotted vertically, against A+T plotted horizontally, for 1,373 bacterial species. Each dot represents a genome. High-GC/low-AT organisms like Streptomyces and Bordetella are on left and low-GC/high-AT organisms like Clostridium botulinum are toward the right. The few dots on the far right are intracellular endosymbionts that have lost a good bit of DNA over the millennia. They tend to be extremely high in A+T.

Compare the above graph with the graph below, which is the same thing (message-strand A+G vs. A+T) for mitochondrial DNA (N=2543 genomes). There is still an upward slope to the data (and in fact it is steeper than it looks, because the range of y-values is different in the graph below than in the graph above). The slope of the regression line is very nearly the same (0.148 vs. 0.149) for both graphs. But you can see that in the graph below, nearly all the points are below y = 0.50. That means message-strands are high in pyrimidines rather than purines.



I speculated in a previous post that the reason mitochondrial DNA is pyrimidine-heavy on the message strand is that mtDNA encodes a very small number of proteins (13, in all), and they tend to be membrane-associated proteins, which use mostly non-polar amino acids. It turns out that codons for the non-polar amino acids are pyrimidine-rich.

To see if that's really what's going on, I obtained the DNA sequences for cytochrome-c oxidase and NADH dehydrogenase (the two must fundamental enzyme systems of mitochondria) from several hundred bacterial species. Actually, I was able to obtain DNA sequences for a total of 942 bacterial NADH dehydrogenase (subunit L) proteins. I also succeeded in obtaining DNA sequences for 647 bacterial cytochrome-c oxidase subunit 1 proteins. In mitochondria, these genes are known as ND5 and Cox1. In bacteria they're better known as nuoL and cyoB.

The graph below shows A+G for the two enzymes versus whole-chromosome A+T, for the relevant organisms.

Message strand purine content was derived from the DNA sequences of cyoB (pink) genes from 942 bacteria, and from nuoL (blue) genes from 647 bacterial species. The A+G values were plotted against host-organism whole-genome A+T content. All cyoB and nuoL sequences tended to be pyrimidine rich. But pyrimidine content was less for organisms with high A+T content. (Note the slightly positive slope of the regression line.)

The pink points are for cytochrome-c oxidase subunit 1 (cyoB) while the blue points are for NADH dehydrogenase subunit 5 (nuoL). Two things are worth noting. One is that the regression line is upward-sloping, meaning that as an organism's DNA gets richer in A+T content, the purine content on the message strand rises. This effect seems to be universal. The second thing to note is that almost all of the points in the graph lie below y = 0.5, as is the case for mitochondria. These two signature "mitochondrial" enzyme systems, critical to oxidative phosphorylation (in bacteria as well as higher organisms), do tend to use pyrimidine-rich codons—rendering the relevant genes pyrminidine-rich on the RNA-synonymous (message) strand of DNA. The hypothesis is upheld.

For you bio students, a bit of homework: You might want to think about why it is that membrane-associated proteins are rich in non-polar amino acids. (In human mitochondria, leucine and isoleucine are the most-used amino acids. Together they account for an amazing 30% of all amino acids used in mtDNA-encoded gene products.) Hint: Most membranes have a lipid bilayer, and lipids don't like water.
reade more... Résuméabuiyad

More about Mitochondrial DNA

To recap my desktop-science experiments of the last month or so, I've found strandwise DNA asymmetry across domains, which is to say in bacteria, Archaea, eukaryotes, viruses, and mitochondrial DNA. In every case except mitochondria, the message (or RNA-synonymous) strand of DNA in coding regions tends to be purine-rich. The opposite strand tends to be pyrimidine-rich. Moreover, in all domains, including mitochondria, message-strand purine content increases in proportion to genome A+T content. (A+T content is a phylogenetic signature. Some genomes are inherently high in A+T content—or low in G+C content—while others are not. Related organisms tend to have similar A+T or G+C contents.)

Mitochondrial genes tend to be pyrmidine-rich on the message strand, seemingly in violation of the finding that in all other domains, message strands are purine-rich. The mitochondrial anomaly is actually very easy to understand (although it took me weeks to realize the explanation). In a nutshell: Mitochondrial DNA is pyrimidine-rich on message strands because mtDNA encodes only a few proteins (13, usually), all of them membrane-associated. Membrane-associated proteins are unusual because they tend to incorporate mostly non-polar amino acids such as leucine, isoleucine, valine, proline, alanine, or phenylalanine—all of which are specified by pyrimidine-rich codons.
The mitochondrion.

It seems to me mitochondrial DNA shouldn't be thought of as a genome, because well over 90% of mitochondrial-associated gene products are encoded by genes in the host nucleus. (In humans, there may be as many as 1500 nuclear-encoded mitochondrial genes.) This point is worth repeating, so let me quote Patrick Chinnery, TRENDS in Genetics (2003) 19:2, 60:

The vast majority of mitochondrial proteins (estimated at >1000) are synthesized in the cytosol from nuclear gene transcripts.

The circular mitochondrial "chromosome" (if it can be called that) is the vestigial remnant of a much larger genome that long ago migrated to the host nucleus, no doubt to avoid oxidative attack. The mitochondrion simply is not a safe place to store DNA. (Would you set up a sperm bank in a rocket-fuel factory?) It's teeming with molecular oxygen, superoxides, peroxides, free protons, and other hazardous materials.

The human mitochondrial chromosome.

Human mitochondrial DNA (which is typical of a lot of mtDNA) encodes just a handful of multi-subnit transmembrane proteins, namely: cytochrome-c oxidase, NADH dehydrogenase, cytochrome-b, and an ATPase. That's it. There are no other protein genes in human mtDNA. All other "mitochondrial proteins" are encoded somewhere else. (That includes 37 out of 44 subunits of the NADH dehydrogenase complex; the DNA polymerase that replicates mitochondrial DNA; the mitochondrial RNA polymerase; about 50 ribosomal proteins; so-called "mitochondrial" catalase; and hundreds of other "mitochondrial" proteins. All are encoded in the nucleus.)

Bottom line: Mitochondrial DNA encodes a very small ensemble of highly specialized membrane-associated proteins. We shouldn't expect this small ensemble to be representative of other genes found in other genomes. (And it's not.) That, in a nutshell, is why mtDNA is not particularly purine-rich in message strands.

But we should test this hypothesis, if possible. (And it is, in fact, possible.) Most bacteria are aerobic, which means most bacterial species have genes for cytochrome-c oxidase, NADH dehydrogenase, etc. The DNA for those genes should be similar to mtDNA with respect to strand-asymmetric purine content. If we analyze bacterial DNA, we should find that genes for cytochrome-c oxidase, NADH dehydrogenase, etc. are pyrimidine-rich on the message strand, just as in mtDNA.

In tomorrow's post: the data.
reade more... Résuméabuiyad

DNA Strand Asymmetry: More Surprises

The surprises just keep coming. When you start doing comparative genomics on the desktop (which is so easy with all the great tools at genomevolution.org and elsewhere), it's amazing how quickly you run into things that make you slap yourself on the side of the head and go "Whaaaa????"

If you know anything about DNA (or even if you don't), this one will set you back.

I've written before about Chargaff's second parity rule, which (peculiarly) states that A = T and G = C not just for double-stranded DNA (that's the first parity rule) but for bases in a single strand of DNA. The first parity rule is basic: It's what allows one strand of DNA to be complementary to another. The second parity rule is not so intuitive. Why should the amount of adenine have to equal the amount of thymine (or guanine equal cytosine) in a single strand of DNA? The conventional argument is that nature doesn't play favorites with purines and pyrimidines. There's no reason (in theory) why a single strand of DNA should have an excess of purines over pyrimidines or vice versa, all things being equal.

But it turns out, strand asymmetry vis-a-vis purines and pyrimidines is not only not uncommon, it's the rule. (Some call it Szybalski's rule, in fact.) You can prove it to yourself very easily. If you obtain a codon usage chart for a particular organism, then add the frequencies of occurrence of each base in each codon, you can get the relative abundances of the four bases (A, G, T, C) for the coding regions on which the codon chart was based. Let's take a simple example that requires no calculation: Clostridium botulinum. Just by eyeballing the chart below, you can quickly see that (for C. botulinum) codons using purines A and G are way-more-often used than codons containing pyrimidines T and C. (Note the green-highlighted codons.)


If you do the math, you'll find that in C. botulinum, G and A (combined) outnumber T and C by a factor of 1.41. That's a pretty extreme purine:pyrimidine ratio. (Remember that we're dealing with a single strand of DNA here. Codon frequencies are derived from the so-called "message strand" of DNA in coding regions.)

I've done this calculation for 1,373 different bacterial species (don't worry, it's all automated), and the bottom line is, the greater the DNA's A+T content (or, equivalently, the less its G+C content), the greater the purine imbalance. (See this post for a nice graph.)

If you inspect enough codon charts you'll quickly realize that Chargaff's second parity rule never holds true (except now and then by chance). It's a bogus rule, at least in coding regions (DNA that actually gets transcribed in vivo). It may have applicability to pseudogenes or "junk DNA" (but then again, I haven't checked; it may well not apply there either).

If Chargaff's second rule were true, we would expect to find that G = C (and A = T), because that's what the rule says. I went through the codon frequency data for 1,373 different bacterial species and then plotted the ratio of G to C (which Chargaff says should equal 1.0) for each species against the A+T content (which is a kind of phylogenetic signature) for each species. I was shocked by what I found:

Using base abundances derived from codon frequency data, I calculated G/C for 1,373 bacterial species and plotted it against total A+T content. (Each dot represents a genome for a particular organism.) Chargaff's second parity rule predicts a horizontal line at y=1.0. Clearly, that rule doesn't hold. 

I wasn't so much shocked by the fact that Chargaff's rule doesn't hold; I already knew that. What's shocking is that the ratio of G to C goes up as A+T increases, which means G/C is going up even as G+C is going down. (By definition, G+C goes down as A+T goes up.)

Chargaff says G/C should always equal 1.0. In reality, it never does except by chance. What we find is, the less G (or C) the DNA has, the greater the ratio of G to C. To put it differently: At the high-AT end of the phylogenetic scale, cytosine is decreasing faster (much faster) than guanine, as overall G+C content goes down.

When I first plotted this graph, I used a linear regression to get a line that minimizes the sum of squared absolute error. That line turned out to be given by 0.638 + [A+T]. Then I saw that the data looked exponential, not linear. So I refitted the data with a power curve (the red curve shown above) given by

G/C  = 1.0 + 0.587*[A+T] + 1.618*[A+T]2

which fit the data even better (minimum summed error 0.1119 instead of 0.1197). What struck me as strange is that the Golden Ratio (1.618) shows up in the power-curve formula (above), but also, the linear form of the regression has G/C equaliing 1.638 when [A+T] goes to 1.0. Which is almost the Golden Ratio.

In a previous post, I mentioned finding that the ratio A/T tends to approximate the Golden Ratio as A+T approaches 1.0. If this were to hold true, it could mean that A/T and G/C both approach the Golden Ratio as A+T approaches 1.0, which would be weird indeed.

For now, I'm not going to make the claim that the Golden Ratio figures into any of this, because it reeks too much of numerology and Intelligent Design (and I'm a fan of neither). I do think it's mildly interesting that A/T and G/C both approach a similar number as A+T approaches unity.

Comments, as usual, are welcome.
reade more... Résuméabuiyad

Science on the Desktop

For decades, I've been hoping I'd live long enough to see a day when serious science could be done on the desktop by dedicated amateurs. Amateur astronomers know what I'm talking about. You can't do much particle physics on the desktop, and there are no affordable desktop electron microscopes (yet), but if comparative genomics is your thing? Get ready to rock and roll, my friend.

Over the weekend I discovered http://genomevolution.org and promptly went nuts. Let me take you on a tour of what's possible.

First I should explain that my background is in microbiology, and I've always had a soft spot in my heart (not literally) for organisms with ultra-tiny genomes: things like Chlamydia trachomatis, the sexually transmitted parasite. It's technically a bacterium, but you can't grow it in a dish. It requires a host cell in which to live.

It turns out there are many of these itty-bitty obligate endosymbionts (at least a dozen major families are known), and because of their small size and obligate intracellular lifestyle, they have a lot in common with mitochondria. Which is to say, like mitochondria, they're about a micron in size, they divide on their own, they have circular DNA, and they provide services to the host in exchange for living quarters.

When you look at one of these little creatures under the microscope (whether it's Chlamydia or Ehrlichia or Anaplasma or what have you), you see pretty much the same thing. (See photo.) Namely, a tiny bacterium living in cytoplasm, mimicking a mitochondrion.

When Lynn Margulis wrote her classic 1967 paper suggesting that mitochondria were once tiny bacterial endosymbionts, it seemed laughable at the time, and her ideas were widely criticized (in fact her paper was "rejected by about fifteen journals," she once recalled). Now it's taught in school, of course. But we have a long way to go before we understand how mitochondria work. And we really, really need to know how they work, because for one thing, mitochondria seem to be deeply involved in orchestrating apoptosis (programmed cell death) and various kinds of signal transduction, and until we understand how all that works, we're going to be hindered in understanding cancer.

When I discovered the tools at http://genomevolution.org, one of the first things I did, on a what-the-hell basis, was compare the genomes of two small endosymbionts, Wolbachia pipientis and Neorickettsia sennetsu. The former lives in insects; the latter, in flatworms that infect fish, bats, birds, horses, and probably lots else. Note that for a horse to get Potomac horse fever, first the Neorickettsia has to infect a tiny flatworm; then the flatworm has to be ingested by a dragonfly, caddisfly, or mayfly; then the horse has to eat (or maybe be bitten by, although only infection-by-ingestion has been demonstrated) the worm-infected fly. The parasite-of-a-parasite chain of events is not only fascinating in its own right, it suggests (to me) that parasites enable each other through shared strategies at the biochemical level, and I might as well spoil some suspense here by revealing that there's even yet another layer of parasitism (and biochemical enablement) going on in this picture, involving viruses. But we're getting ahead of ourselves.

I mentioned Wolbachia a second ago. Wolbachia is a fascinating little critter, because it's found in the reproductive tract of anywhere from 20% to 70% of all insects (plus an undetermined number of spiders, mites, crustaceans, and nematodes), but they don't cause disease, and in fact it appears many insects are unable to survive without them. Wolbachia are unusual in that the extracellular phase of their lifecycle (the part where they spread from one host to another) isn't known; no one has observed it. What's more (and this part is incredible), Wolbachia have adapted to a stem-cell niche: They live in the cells that give rise to insect egg cells. Thus, all newborn female progeny of an infected mother are infected, and all eggs pass on the Wolbachia. In this sense, the genetics of Wolbachia obey mitochondrial genetics (whereby the mother passes on the organelle and its genome).

I quickly found, via Sunday afternoon desktop genomics, that Wolbachia and Neorickettsia (and other endosymbionts: Anaplasma, Ehrlichia, etc.) have many genes in common—hundreds, in fact. And when I say "genes in common," I mean that the genes often show better-than-50% similarity in DNA base-pair matching.

It's important to put some context on this. These little organisms have DNA that encodes only 1,000 genes. (By comparison, E. coli has around 4,400 genes.) Endosymbionts lack genes for common metabolic pathways. They cannot biosynthesize amino acids, for example; instead they rely on the host to provide such nutrients ready-made. If 400 to 500 of an endosymbiont's 1,000 genes are shared across major endosymbiont families, that's a huge percentage. It suggests there's a set of core genes, numbering in the low hundreds, that encapsulate the basic "strategy" of endosymbiosis.

A little more context: Mitochondria have their own DNA and look a lot like endosymbionts. But here's the thing: Mitochondrial DNA is tiny (only about 15,000 base pairs, versus a million for an endosymbiont). It turns out, 97% of the "stuff" that makes up a mitochondrion is encoded in the nucleus of the host. If you include these nuclear genes, mitochondria actually rely on about 1,000 genes total, of which only 3% are in the organelle's DNA. Lynn Margulis would say that what happened is, the endosymbiont ancestor of today's mitochondrion originally had DNA of about a million base-pairs (1,000 genes), but some time after taking up residency in the host cell, the invader's DNA mostly migrated to the host nucleus.

Why did symbiont-to-host DNA migration stop at 97%? Why not 100%? If we look at that 3%, we find genes coding for tRNA and bacterial ribosomes (specialized protein-making machinery) plus genes for enormous, complex transmembrane enzyme systems: cytochrome c oxidase and NADH dehydrogenase. (The former is the endpoint of oxidative respiration; the latter the entry-point.) Obviously it must be advantageous for these genes to be proximal to the organelle.

But why even have an organelle (a physical compartment)? One might ask why it's necessary to have a mitochondrial parasite swimming around in the cytoplasm at all, when most of the genes are part of the host's DNA? The answer is, the stuff that goes on inside the confines of the mitochondrion needs to be contained, because it's violently toxic stuff involving superoxide radicals, redox reactions, "proton pumps," and Fenton chemistry (transition-metal peroxide reactions). A containment structure is definitely called for, to segregate this toxic chemistry from the rest of the cell.

We might ask how it is that the DNA of the protobacterial ancestor of today's mitochondria wound up in the host nucleus in the first place. Let's consider the possibilities. Protobacterial (symbiont) DNA may have transferred to the host all at once, or it might have migrated piecemeal, over time. Or both. Is it realistic that huge amounts of endosymbiont DNA could have migrated to the host nucleus all at once? Yes. It's been suggested that vacuolar phagocytosis drove invader DNA to the nucleus in a big gulp. Evidence? Wolbachia inhabits the vacuolar space.

But export of genes and gene products to the host might have occurred piecemeal as well. A little desktop exploration provides some clues. If you use GenomeView or any number of other online tools to explore the DNA of Wolbachia, several things pop out at you. First is that many Wolbachia genes are mitochondria-like: They encode for things like cytochrome c oxidase, cytochrome b, NADH dehydrogenase, succinyl-CoA synthetase, Fenton-chemistry enzymes, and a slew of oxidases and reductases (including a nitroreductase). Wolbachia is clearly engaged in providing what might be called redox-detox services for the host—the same value proposition that mitochondria offer. This makes sense, because if Wolbachia cells were a net drag on the respiratory potential of host-cell mitochondria (if they couldn't at least hold their own with respect to mitochondria), the host would die.

The second thing that jumps out at you when you look at the Wolbachia genome is the abundance of genes devoted to export processes: membrane proteins, permeases, type I, II, and IV secretion systems, ABC transporters, etc., plus at least 60 ankyrin-repeat-domain genes—all powerful evidence of specializations aimed at export of genes and gene products to the host. But the most stunning "smoking gun" of all is the presence, in Wolbachia DNA, of five reverse-transcriptase genes, plus genes for resolvases, recombinases, transposases, DNA polymerases, RNA polymerases, and phage integrases. In essence, there's a complete suite of retroviral machinery, designed for export of foreign DNA into host DNA.

An example of one of 113 phage-derived genes in Wolbachia (lower gene array). In this case, the gene matches a phage gene found in Candidatus hamiltonella (upper gene array). The two isoforms exhibit 59% DNA sequence similarity, despite widely differing GC ratios. See text for discussion.

But wait. There's more. The third thing that jumps straight in your face when you start looking at the Wolbachia genome is the presence of (are you ready?) no less than 113 genes for phage-related proteins, including major and minor capsid and HK97-style prohead proteins, plus tail proteins, baseplate, tail tube, tail tape-measure, and sheath proteins; late control gene D; phage DNA methylases; and so on. (For non-biologists: phage is the term for viruses that attack bacteria.)

In the above screenshot, I'm comparing Wolbachia DNA (lower strip) to DNA from another insect-infecting endosymbiont, Candidatus hamiltonella, which is known to contain an intact virus (phage) in its DNA. Many phage proteins in Wolbachia have corresponding matches in the Candidatus genome. In this case, we're looking at a gene (the gold-colored stretch pointed at by red arrows) that is 1440 nucleotides long, with a 59% sequence match across genomes. The match percentage is remarkably high given that the Candidatus version of this gene has a 51.7% GC content while the Wolbachia version has a 40.6% GC. Also, note that Wolbachia itself has an overall GC of 34.2%. The fact that Wolbachia's putative phage genes are significantly higher in GC content than Wolbachia's non-phage genes is good confirmation that the genes really are from phage.

It's 100% clear that viral DNA has made its way into the DNA of Wolbachia (either recently or long ago), and it's reasonable to hypothesize that Wolbachia has repurposed the retrovirus-like phage genes for packaging and exporting Wolbachia DNA to the host nucleus.

Okay, so maybe you have to be a biologist for any of this stuff to make your hairs stand on end. To me, it's a dream come true to be able to do this kind of detective work on a Sunday afternoon while sitting on the living-room couch, using nothing more than a decrepit five-year-old Dell laptop with a wireless connection. The notion that you can do comparative genomics and proteomics while watching an Ancient Aliens rerun on TV is (for me) totally cerebrum-blowing. It makes me wonder what's just around the corner.

reade more... Résuméabuiyad