unemployment depression: DNA

Showing posts with label DNA. Show all posts

Converting an SVG Graph to Histograms

The graphs you get from ZunZun.com (the free graphing service) are pretty neat, but one shortcoming of ZunZun is that it won't generate histograms. (Google Charts will do histograms, but unlike ZunZun, Google won't give you SVG output.) The answer? Convert a ZunZun graph to histograms yourself. It's only SVG, after all. It's XML; it's text. You just need to edit it.

Of course, nobody wants to hand-edit a zillion <use> elements (to convert data points to histogram rects). It makes more sense to do the job programmatically, with a little JavaScript.

In my case, I had a graph of dinucleotide frequencies for Clostridium botulinum coding regions. What that means is, I tallied the frequency of occurrence (in every protein-coding gene) of 5'-CpG-3', CpC, CpA, CpT, ApG, ApA, ApC, and all other dinucleotide combinations (16 in all). Since I already knew the frequency of G (by itself), A, C, and T, it was an easy matter to calculate the expected frequency of occurrence of each dinucleotide pair. (For example, A occurs with frequency 0.403, whereas G occurs with frequency 0.183. Therefore the expected frequency of occurrence of the sequence AG is 0.403 times 0.183, or 0.0738.) Bottom line, I had 16 expected frequencies and 16 actual frequencies, for 16 dinucleotide combos. I wanted side-by-side histograms of the frequencies.

First, I went to ZunZun and entered my raw data in the ZunZun form. Just so you know, this is what the raw data looked like:

0 0.16222793723642806
1 0.11352236777965981
2 0.07364933857345456
3 0.08166221769088752
4 0.123186555838253
5 0.12107590293804558
6 0.043711462078314355
7 0.03558766171971166
8 0.07364933857345456
9 0.07262685957145093
10 0.033435825941632816
11 0.03459042802303202
12 0.055925067612781175
13 0.042792101322514244
14 0.019844425842971265
15 0.02730405457750352
16 0.123186555838253
17 0.12232085101526233
18 0.055925067612781175
19 0.05502001002972254
20 0.09354077847378013
21 0.07321410524577443
22 0.03319196776961071
23 0.028600012050969865
24 0.043711462078314355
25 0.043328337600588136
26 0.019844425842971265
27 0.0062116692282947845
28 0.03319196776961071
29 0.04195172151930211
30 0.011777822917388797
31 0.015269662767317132

I made ZunZun graph the data, and it gave me back a graph that looked like this:

Which is fine except it's not a histogram plot. And it has goofy numbers on the x-axis.

I clicked the SVG link under the graph and saved an SVG copy to my local drive, then opened the file in Wordpad.

The first thing I did was locate my data points. That's easy: ZunZun plots points as a series of <use> elements. The elements are nested under a <g> element that looks like this:

<g clip-path="url(#p0c8061f7fd)">

I hand-edited this element to have an id attribute with value "DATA":

<g id="DATA" clip-path="url(#p0c8061f7fd)">

Next, I scrolled up to the very top of the file and found the first <defs> tag. Under it, I placed the following empty code block:

<script type="text/ecmascript"><![CDATA[
// code goes here

]]></script>

Then I went to work writing code (to go inside the above block) that would find the <use> elements, get their x,y values, and create <rect> elements of a height that would extend to the x-axis line.

The code I came up with looks like this:

// What is the SVG y-value of the x-axis?
// Attempt to discover by introspecting clipPath

function findGraphVerticalExtent( ) { 
   var cp = document.getElementsByTagName('clipPath')[0];
   var rect = cp.childNodes[1];
   var top = rect.getAttribute('y') * 1;
   var bottom = rect.getAttribute('height') * 1;
   return top + bottom; 
}


// This is for use with SVG graphs produced by ZunZun,
// in which data points are described in a series of
// <use> elements. We need to get the list of <use>
// nodes, convert it to a JS array, sort data points by
// x-value, and replace <use> with <rect> elements.

function changeToHistograms( ) {

   var GRAPH_VERTICAL_EXTENT = findGraphVerticalExtent( );

   // The 'g' element that encloses the 'use' elements
   // needs to have an id of "DATA" for this to work!
   // Manually edit the <g> node's id first! 
   var data = document.getElementById( "DATA" );

   // NOTE: The following line gets a NodeList object,
   // which is NOT the same as a JavaScript array!
   var nodes = data.getElementsByTagName( "use" );

   // utility routine (an inner method)
   function nodeListToJavaScriptArray( nodes ) {

       var results = [];

       for (var i = 0; i < nodes.length; i++)
          results.push( nodes[i] );

       return results;
   }

   // utility routine (another inner method)
   function compareX( a,b ) { 
       return a.getAttribute("x") * 1 - b.getAttribute("x") * 1; 
   }

   var use = nodeListToJavaScriptArray( nodes );

   // We want the nodes in x-sorted order
   use.sort( compareX );  // presto, done

   // Main loop
   for (var i = 0; i < use.length; i++) {

       var rect = 
           document.createElementNS("http://www.w3.org/2000/svg", "rect");
       var item = use[i];
       var x = item.getAttribute( "x" ) * 1;
       var y = item.getAttribute( "y" ) * 1;
       var rectWidth = 8;
       var rectHeight = GRAPH_VERTICAL_EXTENT - y;
       rect.setAttribute( "width", rectWidth.toString() );
       rect.setAttribute( "height", rectHeight.toString() );   
       rect.setAttribute( "x" , x.toString() );
       rect.setAttribute( "y" , y.toString() );

       // We will alternate colors, pink/purple
       rect.setAttribute( "style" , 
           (i%2==0)? "fill:ce8877;stroke:none" : "fill:8877dd;stroke:none" );

       data.appendChild( rect ); // add a new rect
       item.remove();  // delete the old <use> element
   }

   return use;
}

As so often happens, I ended up writing more code than I thought it would take. The above code works fine for converting data points to histogram bars (as long as you remember to give that <g> element the id attribute of "DATA" as mentioned earlier). But you need to trigger the code somehow. Answer: insert onload="changeToHistograms( )" in the <svg> element at the very top of the file.

But I wasn't done, because I also wanted to apply data labels to the histogram bars (labels like "CG," "AG," "CC," etc.) and get rid of the goofy numbers on the x-axis.

This is the function I came up with to apply the labels:

   function applyLabels( sortedNodes ) {
 
    var labels = ["aa", "ag", "at", "ac", 
      "ga", "gg", "gt", "gc", "ta", "tg", 
      "tt", "tc", "ca", "cg", "ct", "cc"];

      var data = document.getElementById( "DATA" );
 var labelIndex = 0;

 for (var i = 0; i < sortedNodes.length; i+=2) {
     var text = 
              document.createElementNS("http://www.w3.org/2000/svg", "text");
     var node = sortedNodes[i];
          text.setAttribute( "x", String( node.getAttribute("x")*1 +2) );
          text.setAttribute( "y", String( node.getAttribute("y")*1 - 13 ) );
          text.setAttribute( "style", "font-size:9pt" );
          text.textContent = labels[ labelIndex++ ].toUpperCase();
          text.setAttribute( "id", "label_" + labelIndex );
          data.appendChild( text );
      }
   }

And here's a utility function that can strip numbers off the x-axis:

   // Optional. Call this to remove ZunZun graph labels.
   // pass [1,2,3,4,5,6,7,8,9] to remove x-axis labels
   function removeZunZunLabels( indexes ) {
 
 for (var i = 0;i < indexes.length;i++) 
    try {
   document.getElementById("text_"+indexes[i]).remove();
   }
  catch(e) { console.log("Index " + i + " not found; skipped.");
   }
   }

BTW, if you're wondering why I multiply so many things by one, it's because the attribute values that comprise x and y values in SVG are String objects. If you add them, you're concatenating strings, which is not what you want. To convert a number in string form to an actual JavaScript number (so you can add numbers and not concatenate strings), you can either multiply by one or explicitly coerce a string to a number by doing Number( x ).

The final result of all this looks like:

Final graph after surgery. Expected (pink) and actual (purple) frequencies of occurrence of various dinucleotide sequences in C. botulinum coding-region DNA.

Which is approximately what I wanted to see. The labels could be positioned better, but you get the idea.

What does the graph show? Well first of all, you have to realize that the DNA of C. botulinum is extremely rich in adenine and thymine (A and T): Those two bases constitute 72% of the DNA. Therefore it's absolutely no surprise that the highest bars are those that contain A and/or T. What's perhaps interesting is that the most abundant base (A), which should form 'AA' sequences at a high rate, doesn't. (Compare the first bar on the left to the shorter purple bar beside it.) This is especially surprising when you consider that AAA, GAA, and AAT are by far the most-used codons in C. botulinum. In other words, 'AA' occurs a lot, in codons. But even so, it doesn't occur as much as one would expect.

It's also interesting to compare GC with CG. (Non-biologists, note that these two pairs are not equivalent, because DNA has a built-in reading direction. The notation GC, or equivalently, GpC, means there's a guanine sitting on the 5' side of cytosine. The notation CG means there's a guanine on the 3' side of cytosine. The 5' and 3' numbers refer to deoxyribose carbon positions.) The GC combo occurs more often than predicted by chance whereas the combination CG (or CpG, as it's also written) occurs much less frequently than predicted by chance. The reasons for this are fairly technical. Suffice it to say, it's a good prima facie indicator that C. botulinum DNA is heavily methylated. Which in fact it is.

reade more...

Getting Started in Desktop Bioinformatics

I've spent about four months now exploring do-it-yourself desktop bioinformatics. Overall, I'm excited by what I've been able to do and I'm optimistic about the prospects for other do-it-yourself desktop-science geeks, because there are tons of great online tools for doing bio-sci and lots of important scientific questions yet to be fleshed out. So I thought I'd share some of what I've learned, and provide some pointers for anyone who might want to try his or her hand at this sort of "citizen science."

It helps to have the benefit of a science education (in particular, a bio education) before beginning, but one of the great things about do-it-yourself desktop science is that you can (and will!) learn as you go. For example, you might have only a bare-bones basic understanding of enzymology before you begin, but as you move deeper into a particular research quest, you'll find yourself wanting to learn more about this or that aspect of an enzyme. So you'll hit Google Scholar and bring yourself up-to-date on this or that detail of a particular subject. That's a Good Thing.

When I first plunged into DNA analysis, I have to admit my knowledge of mitochondria was weak. I knew they had their own DNA, for example, but it wasn't obvious to me (until I started digging) that mitochondrial DNA is pathetically small, whereas the mitochondrial proteome (the superset of all products that go into making up a functioning mitochondrion) is large. In other words, most "mitochondrial genes" are not in mtDNA. They're in nuclear DNA. There are a couple of online databases of nuclear mitochondrial genes (NUMTs, as they're known), but by and large this is an area in dire need of more research. Someone needs to put together a database or reference set of yeast NUMTs, for example. We also need a database for algal NUMTs, another for protozoan NUMTs, another for rice or corn or Arabidopsis NUMTs, etc. Maybe you'll be the one to move such a project along?

So. How can you get started in desktop bioinformatics? I recommend familiarizing yourself with the great tools at genomevolution.org, which is powered by iPlant, which (in turn) is funded by the National Science Foundation Plant Cyberinfrastructure Program here in the U.S. In particular, I recommend you set aside an evening to run through some of the tutorials at genomevolution.org. That'll give you an idea of what's possible with their tools.

Many organisms have genes for flagellum proteins,
but not all such organisms actually make a flagellum.
(The flagellum is the whiplike appendage that gives the
cell motility. Above: Bdellovibrio, a bacterium with a
powerful flagellum.)

If you go to this page and scroll down, you'll find some really interesting short videos showing how to use some of the genomevolution.org tools. They're fun to watch and should stimulate your imagination.

What kinds of problems need investigating by desktop biologists? The sky's the limit. One quest that lends itself to citizen science is looking for examples of horizontal gene transfer (HGT). This requires that you first teach yourself a little bit about BLAST searches. (BLAST searches are sequence-similarity searches that let you compare DNA against DNA or amino-acid sequence against amino-acid sequence.) The strategies involved here can range from simple and brute force to sophisticated; and the great thing is, you can invent your own heuristics. It's a wide-open area. I recently found good evidence (90%+ similarity of DNA sequences) for bacterial gene transfer into rice, which I'll write about in a later post. I'm confident there are thousands of examples of horizontal gene transfer (whether from bacteria to bacteria, bacteria to plant, bacteria to insect, or whatever) waiting to be discovered. You could easily be the next discoverer of one of these gene transfers.

Here are some other ideas for desktop-science explorations:

Find and characterize flagellar genes in organisms that lack motility. If you dig into the literature, you'll find that there are many examples of supposedly immotile organisms (like the intracellular parasite Buchnera, which lives inside aphids) that not only harbor flagellum genes but express some of them—yet have no external flagellum. Obviously, organisms that retain flagellum genes but actually don't make a flagellum (that little whip-like tail that makes single-celled organisms swim around) must be retaining those genes for a reason. The gene products must be doing something. But what? Also: Paramecium and diatoms and other eukaryotes make flagella and/or cilia. Most animals also make cilia. (Ever get a tickle deep in your throat or bronchia? It was probably something tangling with the cilia lining your bronchial system.) What's the relationship between cilia gene products in Paramecium, say, and cilia in animals? Do any plants conceal cilia genes? If so, how are they related phylogenetically to lower-organism cilia?
Migration of genes from parasites to host DNA. A general pattern that seems to happen in nature is: a bacterium or other invader takes up residency inside an animal or plant cell, becoming an endosymbiont; then some of its genes (the symbiont's genes) move to the nucleus of the host cell. Which genes? What do the genes do? That's up to you to try to find out.
Bidirectionally ("bidi") transcribed genes: While rare, there are examples of genes in which each strand of DNA is transcribed into mRNA. (The genome for Rothia mucilaginosa contains many putative examples of this.) Find organisms that contain bidi genes. Try to determine if both strands are actually transcribed. Examine sister-species organisms to see if one strand is transcribed in one organism and the other gene (on the other strand) is transcribed in the other organism.
Phylogenetics of plasmid and viral genes. Try to determine the ancestry of a virus gene. There are good tree-making services online that do all the hard work for you, including protein-sequence alignments. All you have to do is cut and paste Fasta files.
Codon analysis. There are many plants (rice is one) and higher organisms in which DNA is more or less equally divided into high-GC-content genes and low-GC-content genes. Surely the codon usage patterns for each class of gene(s) varies. But how? What are the codon adaptation indexes (CAI values) for the various genes? Create a few histograms of CAI values. Use CAIs and other techniques to try to determine which genes are highly expressed. Are HEGs (highly expressed genes) mostly high-GC? Low-GC? Both? Run some histograms on the genes' purine (A+G) content, G+C content, G+C content by codon position.
Many organisms (and organelles) have extremely GC-poor genomes. Some have bizarre codon usage patterns (where, say, the codon AAA is used 12 times more than the average codon). Some use 56 or fewer codons (out of 64 possible). Find the organelle or organism that uses the fewest codons. See if there's an organism or organelle that uses fewer than 20 amino acids. Which amino acid(s) get(s) left out?
Characterize the DNA repairosome of an aerobic and an anerobic archeon. Compare and contrast the two.
Find all the genes in a particular organism that have mitochondrial-targeting presequences in their DNA.
Pick two closely related organisms. Try to figure out how many million years ago they diverged. Use mitochondrial DNA analysis as well as cytoplasmic protein analysis.
Find the bacterium that has more secretion-protein, permease, and protein-translocation genes than any other. Compare it to its closest relative.
Find an organism that is pathogenic (to humans, animals, or plants). Find its closest non-pathogenic relative. Compare genomes. Determine which genes are most likely to be involved in virulence.
Some seemingly simple organisms (amoebae) have more DNA than a human. Why? What's all that DNA doing there? Does it contain horizontally transferred genes from plants, bacteria, archeons, animals? Are amoebas and other super-large-genome organisms "DNA hoarders"? Are they DNA curationists? Characterize the genes (enumerate by category, first) of these organisms. How many are expressed? How many are junk? What's the energy cost of maintaining that much junk DNA? Can it all be junk? Is an amoeba actually a devolved higher life form that forgot how to do morphogenesis and can no longer develop into a tadpole or whatever?
[ insert your own project here! ]

reade more...

The Trouble with Darwin

As a biologist, I find Darwin's theory hugely disappointing. It's better than the alternative (which is to believe in magic, basically), but not by much, sadly.

Charles Darwin died before Mendel
proved the existence of genes.

As scientific theories go, the theory of evolution is easily the weakest of all major scientific theories. It's a commendable piece of work in its ability to stir discussion, but terrible in most other ways.

To be useful, a scientific theory has to do a minimum of two things: explain what can be observed, and provide testable predictions. Darwin's theory is weak on the first count and useless on the second.

Evolutionary theory explains practically nothing, because every explanation of the theory is rooted in "survival of the fittest," which is a circular notion, utterly content-free. "Fittest" means most able to survive. Survival of the fittest means survival of those who survive.

Ironically, Darwin's landmark work was called On the Origin of Species. Yet it doesn't actually explain speciation, except in the most vacuous and speculative of terms. Of course, we can't set too high an expectation for Darwin, since he didn't live to see the publication of Mendel's work (the word "genetics" wouldn't exist until more than 20 years after Darwin's death), but still. Speciation is portrayed by Darwin as the outcome of the accumulation of small, gradual changes. That's all the explanation he offers.

But the explanation is wrong. Or at least it doesn't accord well with the facts. It doesn't explain the Cambrian Explosion, for example, or the sudden appearance of intelligence in hominids, or the rapid recovery (and net expansion!) of the biosphere in the wake of at least five super-massive extinction events in the most recent 15% of Earth's existence.

One of the most frustrating aspects of evolutionary theory (this is no fault of the theory's, though) is that it is so hard to test in the laboratory. The fact is, no one has ever seen speciation happen in the laboratory, under repeatable conditions, and until that happens we're at a distinct disadvantage for understanding speciation. (Incidentally, I don't count plant hybridization or breeding anomalies in fruit flies whose sexuality is under the control of microbial endosymbionts as examples of speciation.)

When I was in school, we were taught that mutations in DNA were the driving force behind evolution, an idea that is now thoroughly discredited. The overwhelming majority of non-neutral mutations are deleterious (they reduce, not increase, survival). Most mutations lead to loss of function (this is easily demonstrated in the lab), not gain of function. Evolutionary theory is great at explaining things like the loss of eyesight by cave-dwelling creatures (e.g., bats). It's terrible at explaining gain of function.

Even if mutations were capable of driving evolution, they simply don't happen fast enough to account for observed rates of speciation. In bacteria, the measured rate of 16S rRNA divergence due to point mutations is only 1% per 50 million years. And yet, there were no flowering plants on earth as recently as 150 million years ago! Does it take a biologist to see the disconnect?

I bring all this up because I've spent some time recently doing genomics research aimed at exploring mechanisms for new-protein creation/differentiation (mechanisms not relying wholly nor even mainly on point mutations), and I wanted to set the stage for discussing that research here. Over the next week or so, I'll be presenting some new ideas and findings. Hopefully, we can put some much-needed flesh on Darwin by exploring testable notions of how new protein motifs can arise quickly (without reliance on magic).

reade more...

A New Biological Constant?

Earlier, I gave evidence for a surprising relationship between the amount of G+C (guanine plus cytosine) in DNA and the amount of "purine loading" on the message strand in coding regions. The fact that message strands are often purine-rich is not new, of course; it's called Szybalski's Rule. What's new and unexpected is that the amount of G+C in the genome lets you predict the amount of purine loading. Also, Szybalski's rule is not always right.

Genome A+T content versus message-strand purine content (A+G) for 260 bacterial genera. Chargaff's second parity rule predicts a horizontal line at Y = 0.50. (Szybalski's rule says that all points should lie at or above 0.50.) Surprisingly, as A+T approaches 1.0, A/T approaches the Golden Ratio.

When you look at coding regions from many different bacterial species, you find that if a species has DNA with a G+C content below about 68%, it tends to have more purines than pyrimidines on the message strand (thus purine-rich mRNA). On the other hand, if an organism has extremely GC-rich DNA (G+C > 68%), a gene's message strand tends to have more pyrimidines than purines. What it means is that Szybalski's Rule is correct only for organisms with genome G+C content less than 68%. And Chargaff's second parity rule (which says that A=T an G=C even within a single strand of DNA) is flat-out wrong all the time, except at the 68% G+C point, where Chargaff is right now and then by chance.

Since the last time I wrote on this subject, I've had the chance to look at more than 1,000 additional genomes. What I've found is that the relationship between purine loading and G+C content applies not only to bacteria (and archaea) and eukaryotes, but to mitochondrial DNA, chloroplast DNA, and virus genomes (plant, animal, phage), as well.

The accompanying graphs tell the story, but I should explain a change in the way these graphs are prepared versus the graphs in my earlier posts. Earlier, I plotted G+C along the X-axis and purine/pyrmidine ratio on the Y-axis. I now plot A+T on the X-axis instead of G+C, in order to convert an inverse relationship to a direct relationship. Also, I now plot A+G (purines, as a mole fraction) on the Y-axis. Thus, X- and Y-axes are now both expressed in mole fractions, hence both are normalized to the unit interval (i.e., all values range from 0..1).

The graph above shows the relationship between genome A+T content and purine content of message strands in genomes for 260 bacterial genera. The straight line is regression-fitted to minimize the sum of squared absolute error. (Software by http://zunzun.com.) The line conforms to:

y = a + bx

where:

a =  0.45544384965539358
b =  0.14454244707261443

The line predicts that if a genome were to consist entirely of G+C (guanine and cytosine), it would be 45.54% guanine, whereas if (in some mythical creature) the genome were to consist entirely of A+T (adenine and thymine), adenine would comprise 59.99% of the DNA. Interestingly, the 95% confidence interval permits a value of 0.61803 at X = 1.0, which would mean that as guanine and cytosine diminish to zero, A/T approaches the Golden Ratio.

Do the most primitive bacteria (Archaea) also obey this relationship? Yes, they do. In preparing the graph below, I analyzed codon usage in 122 Archaeal genera to obtain A, G, T, and C relative proportions in coding regions of genes. As you can see, the same basic relationship exists between purine content and A+T in Archaea as in Eubacteria. Regression analysis yielded a line with a slope of 0.16911 and a vertical offset 0.45865. So again, it's possible (or maybe it's just a very strange coincidence) that A/T approaches the Golden Ratio as A+T approaches unity.

Analysis of coding regions in 122 Archaea reveals that the same relationship exists between A+T content and purine mole-fraction (A+G) as exists in eubacteria.

For the graph below, I analyzed 114 eukaryotic genomes (everything from fungi and protists to insects, fish, worms, flowering and non-flowering plants, mosses, algae, and sundry warm- and cold-blooded animals). The slope of the generated regression line is 0.11567 and the vertical offset is 0.46116.

Eukaryotic organisms (N=114).

Mitochondria and chloroplasts (see the two graphs below) show a good bit more scatter in the data, but regression analysis still comes back with positive slopes (0.06702 and .13188, respectively) for the line of least squared absolute error.

Mitochondrial DNA (N=203).

Chloroplast DNA (N=227).

To see if this same fundamental relationship might hold even for viral genetic material, I looked at codon usage in 229 varieties of bacteriophage and 536 plant and animal viruses ranging in size from 3Kb to over 200 kilobases. Interestingly enough, the relationship between A+T and message-strand purine loading does indeed apply to viruses, despite the absence of dedicated protein-making machinery in a virion.

Plant and animal viruses (N=536).

Bacteriophage (N=229).

For the 536 plant and animal viruses (above left), the regression line has a slope of 0.23707 and meets the Y-axis at 0.62337 when X = 1.0. For bacteriophage (above right), the line's slope is 0.13733 and the vertical offset is 0.46395. (When inspecting the graphs, take note that the vertical-axis scaling is not the same for each graph. Hence the slopes are deceptive.) The Y-intercept at X = 1.0 is 0.60128. So again, it's possible A/T approaches the golden ratio as A+T approaches 100%.

The fact that viral nucleic acids follow the same purine trajectories as their hosts perhaps shouldn't come as a surprise, because viral genetic material is (in general) highly adapted to host machinery. Purine loading appropriate to the A+T milieu is just another adaptation.

It's striking that so many genomes, from so many diverse organisms (eubacteria, archaea, eukaryotes, viruses, bacteriophages, plus organelles), follow the same basic law of approximately

A+G = 0.46 + 0.14 * (A+T)

The above law is as universal a law of biology as I've ever seen. The only question is what to call the slope term. It's clearly a biological constant of considerable significance. Its physical interpretation is clear: It's the rate at which purines are accumulated in mRNA as genome A+T content increases. It says that a 1% increase in A+T content (or a 1% decrease in genome G+C content) is worth a 0.14% increase in purine content in message strands. Maybe it should be called the purine rise rate? The purine amelioration rate?

Biologists, please feel free to get in touch to discuss. I'm interested in hearing your ideas. Reach out to me on LinkedIn, or simply leave a comment below.

reade more...

A Bioinformatics Bookmarklet

Sometimes you want to scrape some screen data and analyze it on the spot without copying it to another program. It turns out there's an easy way to do just that. Just highlight the information (by click-dragging the mouse to Select a section of screen data), then run a piece of JavaScript against the selection.

Example: I do a lot of peeking and poking at DNA sequences on the web. Often, I'm interested in knowing various summary statistics for the DNA I'm looking at. For example, I might see a long sequence that looks like AGTTAGAAAACCTCAGCTACTAG (etc.) and wonder what the G+C content of that stream is. So I'll select the text by click-dragging across it. Then I'll obtain the text in JavaScript by calling getSelection().toString(). Then I parse the text and display the results in an alert dialog.

Suppose I've selected a run of DNA on-screen and I want to know the base content (the amounts of G, C, T, and A).

text = getSelection().toString(); // get the data as a string
text = text.toUpperCase(); // optionally convert it to upper case

bases = new Object;  // create a place to store the base counts
bases.G = bases.C = bases.T = bases.A = 0; // initialize

// now loop over the string contents:
for (var i = 0; i < text.length; i++)
   bases[ text[i] ]++;  // bump the count for that base

// format the data for viewing

msg = "G: " + bases.G/text.length + "\n";

msg += "C: " + bases.C/text.length + "\n";

msg += "A: " + bases.A/text.length + "\n";

msg += "T: " + bases.T/text.length + "\n";

msg += "GC Content: " + (bases.G + bases.C)/text.length;

// view it:

alert( msg );

If I run this script against a web page where I've highlighted some DNA text, I get:

The nice part is, you can put the above code in a bookmarklet, associate the bookmarklet with a button, and keep it in your bookmark bar so that whenever you want to run the code, you can just point and click. To do the packaging, reformat the above code (or your modified version of it) as a single line of code preceded by "javascript:" (don't forget the colon), then set that code as the URL of a bookmark. Instead of going to a regular URL, the browser will see "javascript:" as the URL scheme and execute the code directly.

Bookmarklets of this sort have proven to be a major productivity boon for me in various situations as I cruise the web. When I see data I want to analyze, I don't have to copy and paste it to Excel (or whatever). With a bookmarklet, I can analyze it instantly, sur la vitre.

reade more...

A Very Simple Test of Chargaff's Second Rule

We know that for double-stranded DNA, the number of purines (A, G) will always equal the number of pyrimidines (T, C), because complementarity depends on A:T and G:C pairings. But do purines have to equal pyrimidines in single-stranded DNA? Chargaff's second parity rule says yes. Simple observation says no.

Suppose you have a couple thousand single-stranded DNA samples. All you have to do to see if Chargaff's second rule is correct is create a graph of A versus T, where each point represents the A and T (adenine and thymine) amounts in a particular DNA sample. If A = T (as predicted by Chargaff), the graph should look like a straight line with a slope of 1:1.

For fun, I grabbed the sequenced DNA genome of Clostridium botulinum A strain ATCC 19397 (available from the FASTA link on this page; be ready for a several-megabyte text dump), which contains coding sequences for 3552 genes of average length 442 bases each, and for each gene, I plotted the A content versus the T content.

A plot of thymine (T) versus adenine (A) content for all 3552 genes in C. botulinum coding regions. The greyed area represents areas where T/A > 1. Most genes fall in the white area where A/T > 1.

As you can see, the resulting cloud of points not only doesn't form a straight line of slope 1:1, it doesn't even cluster on the 45-degree line at all. The center of the cluster is well below the 45-degree line, and (this is the amazing part) the major axis of the cluster is almost at 90 degrees to the 45-degree line, indicating that the quantity A+T tends to be conserved.

A similar plot of G versus C (below) shows a somewhat different scatter pattern, but again notice that the centroid of the cluster is well off the 45-degree centerline. This means Chargaff's second rule doesn't hold (except for the few genes that randomly fell on the centerline).

A plot of cytosine (C) versus guanine (G) for all genes in all coding regions of C. botulinum. Again, notice that the points cluster well away from the 45-degree line (where they would have been expected to cluster, according to Chargaff).

The numbers of bases of each type in the botulinum genome are:

Amazingly, there are 296,937 more adenines than thymines in the genome (here, I'm somewhat sloppily equating "genome" with combined coding regions). Likewise, excess guanines number 218,938. On average, each gene contains 73 excess purines (42 adenine and 31 guanine).

The above graphs are in no way unique to C. botulinum. If you do similar plots for other organisms, you'll see similar results, with excess purines being most numerous in organisms that have low G+C content. As explained in my earlier posts on this subject, the purine/pyrimidine ratio (for coding regions) tends to be high in low-GC organisms and low in high-GC organisms, a relationship that holds across all bacterial and eukaryotic domains.

reade more...

Chargaff's Second Parity Rule is Broadly Violated

Erwin Chargaff, working with sea-urchin sperm in the 1950s, observed that within double-stranded DNA, the amount of adenine equals the amount of thymine (A = T) and guanine equals cytosine (G = C), which we now know is the basis of "complementarity" in DNA. But Chargaff later went on to observe the same thing in studies of single-stranded DNA, causing him to postulate that A = T and G = C more generally (within as well as across strands of DNA). The more general postulation is known as Chargaff's second parity rule. It says that A = T and G = C within a single strand of DNA.

The second parity rule seemed to make sense, because there was and is no a priori reason to think that DNA or RNA, whether single-stranded or double-stranded, should contain more purines than pyrimidines (nor vice versa). All other factors being equal, nature should not "favor" one class of nucleotide over another. Therefore, across evolutionary times frames, one would expect purine and pyrimidine prevalences in nucleic acids to equalize.

What we instead find, if we look at real-world DNA and RNA, is that individual strands seldom contain equal amounts of purines and pyrimidines. Szybalski was the first to note that viruses (which usually contain single-stranded nucleic acids) often contain more purines than pyrimidines. Others have since verified what Szybalski found, namely that in many organisms, DNA is purine-heavy on the "sense" strand of coding regions, such that messenger RNA ends up richer in purines than pyrimidines. This is called Szybalski's rule.

In a previous post, I presented evidence (from analysis of the sequenced genomes of 93 bacterial genera) that Szybalski's rule not only is more often true than Chargaff's second parity rule, but in fact purine-loading of coding region "message" strands occurs in direct proportion to the amount of A+T (or in inverse propoertion to the amount of G+C) in the genome. At G+C contents below about 68%, DNA becomes heavier and heavier with purines on the message strand. At G+C contents above 68%, we find organisms in which the message strand is actually pyrimidine-heavy instead of purine-heavy.

I now present evidence that purine loading of message strands in proportion to A+T content is a universal phenomenon, applying to a wide variety of eukaryotic ("higher") life forms as well as bacteria.

According to Chargaff's second parity rule, all points on this graph should fall on a horizontal line at y = 1. Instead, we see that Chargaff's rule is violated for all but a statistically insignificant subset of organisms. Pink/orange points represent eukaryotic species. Dark green data points represent bacterial genera. See text for discussion. Permission to reproduce this graph (with attribution) is granted.

To create the accompanying graph, I did frequency analysis of codons for 58 eukaryotic life forms (pink data points) and 93 prokaryotes (dark green data points) in order to derive prevalences of the four bases (A, G, C, T) in coding regions of DNA. Eukaryotes that were studied included yeast, molds, protists, warm and cold-blooded animals, flowering and non-flowering plants, alga, and insects and crustaceans. The complete list of organisms is shown in a table further below.

It can now be stated definitively that Chargaff's second parity rule is, in general, violated across all major forms of life. Not only that, it is violated in a regular fashion, such that purine loading of mRNA increases with genome A+T content. Significantly, some organisms with very low A+T content (high G+C content) actually have pyrimidine-loaded mRNA, but they are in a small minority.

Purine loading is both common and extreme. For about 20% of organisms, the purine-pyrimidine ratio is above 1.2. For some organisms, the purine excess is more than 40%, which is striking indeed.

Why should purines migrate to one strand of DNA while pyrimidines line up on the other strand? One possibility is that it minimizes spontaneous self-annealing of separated strands into secondary structures. Unrestrained "kissing" of intrastrand regions during transcription might lead to deleterious excisions, inversions, or other events. Poly-purine runs would allow the formation of many loops but few stems; in general, secondary structures would be rare.

The significance of purine loading remains to be elucidated. But in the meantime, there can be no doubt that purine enrichment of message strands is indeed widespread and strongly correlates to genome A+T content. Chargaff's second parity rule is invalid, except in a trivial minority of cases.

The prokaryotic organisms used in this study were presented in a table previously. The eukaryotic organisms are shown in the following table:

Organism	Comment	G+C%	Purine ratio
Chlorella variabilis strain NC64A	endosymbiont of Paramecium	68.76	1.1055181128896376
Chlamydomonas reinhardtii strain CC-503 cw92 mt+	unicellular alga	67.96	1.0818749999999997
Micromonas pusilla strain CCMP1545	unicellular alga	67.41	1.1873268193087356
Ectocarpus siliculosus strain Ec 32	alga	62.74	1.2090728330510347
Sporisorium reilianum SRZ2	smut fungus	62.5	0.9776547360094916
Leishmania major strain Friedlin	protozoan	62.47	1.0325
Oryza sativa Japonica Group	rice	54.77	1.0668412348401317
Takifugu rubripes (torafugu)	fish	54.08	1.0655094027691674
Aspergillus fumigatus strain A1163	fungus	53.89	1.013091641490433
Sus scrofa (pig)	pig	53.77	1.0680595779892428
Drosophila melanogaster (fruit fly)		53.69	1.0986989367655287
Brachypodium distachyon line Bd21	grass	53.32	1.0764746703677999
Selaginella moellendorffii (Spikemoss)	moss	52.83	1.1014492753623195
Equus caballus (horse)	horse	52.29	1.0844453711426192
Pongo abelii (Sumatran orangutan)	orangutan	52	1.0929015146227405
Homo sapiens	human	51.97	1.0939049081896255
Mus musculus (house mouse) strain mixed	mouse	51.91	1.0827720297201582
Tuber melanosporum (Perigord truffle) strain Mel28	truffle	51.4	1.0836820083682006
Phaeodactylum tricornutum strain CCAP 1055/1	diatom	51.06	1.0418452745458253
Arthroderma benhamiae strain CBS 112371	fungus	50.99	1.0360268674944024
Ornithorhynchus anatinus (platypus)	platypus	50.97	1.1121909993661525
Taeniopygia guttata (Zebra finch)	bird	50.81	1.1344717182497328
Trypanosoma brucei TREU927	sleeping sickness protozoan	50.78	1.106974784013486
Danio rerio (zebrafish) strain Tuebingen	fish	49.68	1.1195053003533566
Gallus gallus	chicken	49.54	1.1265418970650787
Monodelphis domestica (gray short-tailed opossum)	opossum	49.07	1.0768110918544194
Sorghum bicolor (sorghum)	sorghum	48.93	1.046422719825232
Thalassiosira pseudonana strain CCMP1335	diatom	47.91	1.1403183213189638
Hyaloperonospora arabidopsis	mildew	47.75	1.053039546400631
Daphnia pulex (common water flea)	water flea	47.57	1.058036633052068
Physcomitrella patens subsp. patens	moss	47.33	1.1727134477514667
Anolis carolinensis (green anole)	lizard	46.72	1.113765477057538
Brassica rapa	flowering plant	46.29	1.1056659411640803
Fragaria vesca (woodland strawberry)	strawberry	46.02	1.1052853232259425
Amborella trichopoda	flowering shrub	45.88	1.0992441209406494
Citrullus lanatus var. lanatus (watermelon)	watermelon	44.5	1.0855134984692458
Capsella rubella	mustard-family plant	44.37	1.1041257367387034
Arabidopsis thaliana (thale cress)	cress	44.15	1.109853013573388
Lotus Japonicus	lotus	44.11	1.0773228019122847
Populus trichocarpa (Populus balsamifera subsp. trichocarpa)	tree	43.7	1.1097672456226706
Cucumis sativus (cucumber)	cucumber	43.56	1.0823847862298719
Caenorhabditis elegans strain Bristol N2	worm	42.96	1.106320224719101
Vitis vinifera (grape)	grape	42.75	1.0859833393697935
Ciona intestinalis	tunicate	42.68	1.158652461848546
Solanum lycopersicum (tomato)	tomato	41.7	1.1177
Theobroma cacao (chocolate)	chocolate	41.31	1.1297481860862142
Medicago truncatula (barrel medic) strain A17	flowering plant	40.78	1.093754366354618
Apis mellifera (honey bee) strain DH4	honey bee	39.76	1.216042543762464
Saccharomyces cerevisiae (bakers yeast) strain S288C	yeast	39.63	1.1387641650630744
Acyrthosiphon pisum (pea aphid) strain LSR1	aphid	39.35	1.1651853457619772
Debaryomyces hansenii strain CBS767	yeast	37.32	1.1477345930856775
Pediculus humanus corporis (human body louse) strain USDA	louse	36.57	1.2365791828213537
Schistosoma mansoni strain Puerto Rico	trematode	35.94	1.0586902800658977
Candida albicans strain WO-1	yeast	35.03	1.1490291609944834
Tetrapisispora phaffii CBS 4417 strain type CBS 4417	yeast	34.69	1.17503805175038
Paramecium tetraurelia strain d4-2	protist	30.03	1.2494922903347117
nucleomorph Guillardia theta	endosymbiont	23.87	1.1529462427330803
Plasmodium falciparum 3D7	malaria parasite	23.76	1.4471365638766511

reade more...

Pages

.