unemployment depression: genome

A New Biological Constant?

Earlier, I gave evidence for a surprising relationship between the amount of G+C (guanine plus cytosine) in DNA and the amount of "purine loading" on the message strand in coding regions. The fact that message strands are often purine-rich is not new, of course; it's called Szybalski's Rule. What's new and unexpected is that the amount of G+C in the genome lets you predict the amount of purine loading. Also, Szybalski's rule is not always right.

Genome A+T content versus message-strand purine content (A+G) for 260 bacterial genera. Chargaff's second parity rule predicts a horizontal line at Y = 0.50. (Szybalski's rule says that all points should lie at or above 0.50.) Surprisingly, as A+T approaches 1.0, A/T approaches the Golden Ratio.

When you look at coding regions from many different bacterial species, you find that if a species has DNA with a G+C content below about 68%, it tends to have more purines than pyrimidines on the message strand (thus purine-rich mRNA). On the other hand, if an organism has extremely GC-rich DNA (G+C > 68%), a gene's message strand tends to have more pyrimidines than purines. What it means is that Szybalski's Rule is correct only for organisms with genome G+C content less than 68%. And Chargaff's second parity rule (which says that A=T and G=C even within a single strand of DNA) is flat-out wrong all the time, except at the 68% G+C point, where Chargaff is right now and then by chance.

Since the last time I wrote on this subject, I've had the chance to look at more than 1,000 additional genomes. What I've found is that the relationship between purine loading and G+C content applies not only to bacteria (and archaea) and eukaryotes, but to mitochondrial DNA, chloroplast DNA, and virus genomes (plant, animal, phage), as well.

The accompanying graphs tell the story, but I should explain a change in the way these graphs are prepared versus the graphs in my earlier posts. Earlier, I plotted G+C along the X-axis and purine/pyrimidine ratio on the Y-axis. I now plot A+T on the X-axis instead of G+C, in order to convert an inverse relationship to a direct relationship. Also, I now plot A+G (purines, as a mole fraction) on the Y-axis. Thus, X- and Y-axes are now both expressed in mole fractions, hence both are normalized to the unit interval (i.e., all values range from 0 to 1).

The graph above shows the relationship between genome A+T content and purine content of message strands in genomes for 260 bacterial genera. The straight line is regression-fitted to minimize the sum of squared absolute error. (Software by http://zunzun.com.) The line conforms to:

y = a + bx

where:

a =  0.45544384965539358
b =  0.14454244707261443

The line predicts that if a genome were to consist entirely of G+C (guanine and cytosine), its message strands would be 45.54% guanine, whereas if (in some mythical creature) the genome were to consist entirely of A+T (adenine and thymine), adenine would comprise about 60.0% of the message strand. Interestingly, the 95% confidence interval permits a value of 0.61803 at X = 1.0, which would mean that as guanine and cytosine diminish to zero, A/T approaches the Golden Ratio (0.61803/0.38197 ≈ 1.618).
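The endpoint arithmetic can be checked in a few lines. This is only a sketch: the two coefficients are the ones quoted above, while the function and variable names are made up for illustration. It also solves for the A+T value at which the fitted line crosses 0.50, which lands near the roughly 68% G+C crossover mentioned earlier.

```python
import math

# Regression coefficients quoted in the text for 260 bacterial genera.
A_INTERCEPT = 0.45544384965539358
B_SLOPE = 0.14454244707261443


def purine_fraction(at_fraction):
    """Predicted message-strand purine mole fraction (A+G) for a genome A+T fraction."""
    return A_INTERCEPT + B_SLOPE * at_fraction


# With no A or T, all purines are guanine; with no G or C, all purines are adenine.
g_at_zero = purine_fraction(0.0)
a_at_one = purine_fraction(1.0)
ratio_at_one = a_at_one / (1.0 - a_at_one)  # A/T when A+T = 1

golden_ratio = (1 + math.sqrt(5)) / 2
# Adenine fraction that would make A/T exactly the Golden Ratio (equals 1/phi).
golden_fraction = golden_ratio / (1 + golden_ratio)

# Genome A+T at which the fitted line predicts purines = pyrimidines (y = 0.50).
crossover_at = (0.5 - A_INTERCEPT) / B_SLOPE

print(f"G fraction at A+T = 0: {g_at_zero:.4f}")
print(f"A fraction at A+T = 1: {a_at_one:.4f} (A/T = {ratio_at_one:.3f})")
print(f"A fraction needed for A/T = phi: {golden_fraction:.5f}")
print(f"Line crosses 0.50 at A+T = {crossover_at:.3f} (G+C = {1 - crossover_at:.3f})")
```

Note that the fitted line itself gives A/T close to 1.5 at X = 1.0; the Golden Ratio value is only reached at the upper end of the confidence interval.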

Do archaea also obey this relationship? Yes, they do. In preparing the graph below, I analyzed codon usage in 122 archaeal genera to obtain A, G, T, and C relative proportions in coding regions of genes. As you can see, the same basic relationship exists between purine content and A+T in Archaea as in Eubacteria. Regression analysis yielded a line with a slope of 0.16911 and a vertical offset of 0.45865, so the line reaches 0.62776 at X = 1.0. So again, it's possible (or maybe it's just a very strange coincidence) that A/T approaches the Golden Ratio as A+T approaches unity.

Analysis of coding regions in 122 Archaea reveals that the same relationship exists between A+T content and purine mole-fraction (A+G) as exists in eubacteria.

For the graph below, I analyzed 114 eukaryotic genomes (everything from fungi and protists to insects, fish, worms, flowering and non-flowering plants, mosses, algae, and sundry warm- and cold-blooded animals). The slope of the generated regression line is 0.11567 and the vertical offset is 0.46116.

Eukaryotic organisms (N=114).

Mitochondria and chloroplasts (see the two graphs below) show a good bit more scatter in the data, but regression analysis still comes back with positive slopes (0.06702 and 0.13188, respectively) for the line of least squared absolute error.

Mitochondrial DNA (N=203).

Chloroplast DNA (N=227).

To see if this same fundamental relationship might hold even for viral genetic material, I looked at codon usage in 229 varieties of bacteriophage and 536 plant and animal viruses ranging in size from 3 kilobases to over 200 kilobases. Interestingly enough, the relationship between A+T and message-strand purine loading does indeed apply to viruses, despite the absence of dedicated protein-making machinery in a virion.

Plant and animal viruses (N=536).

Bacteriophage (N=229).

For the 536 plant and animal viruses (above left), the regression line has a slope of 0.23707 and reaches Y = 0.62337 at X = 1.0. For bacteriophage (above right), the line's slope is 0.13733 and the vertical offset is 0.46395, so the line reaches Y = 0.60128 at X = 1.0. (When inspecting the graphs, take note that the vertical-axis scaling is not the same for each graph. Hence the slopes are deceptive.) So again, it's possible A/T approaches the Golden Ratio as A+T approaches 100%.
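To compare the fitted lines side by side, here is a small sketch that evaluates each one at X = 1.0. The slopes and offsets are the rounded values quoted in the text. The virus offset is not stated directly, so it is back-computed here from the quoted value at X = 1.0 (an assumption on my part), and the mitochondrial and chloroplast lines are left out because only their slopes are given.

```python
# (slope, vertical offset) for each group, as quoted in the text.
# The virus offset is back-computed from the quoted value at X = 1.0.
FITS = {
    "bacteria": (0.14454, 0.45544),
    "archaea": (0.16911, 0.45865),
    "eukaryotes": (0.11567, 0.46116),
    "plant/animal viruses": (0.23707, 0.62337 - 0.23707),
    "bacteriophage": (0.13733, 0.46395),
}

# Adenine fraction at which A/T would equal the Golden Ratio (1/phi).
GOLDEN_FRACTION = (5 ** 0.5 - 1) / 2


def value_at(slope, offset, x):
    """Evaluate the fitted line y = offset + slope * x."""
    return offset + slope * x


at_one = {name: value_at(s, o, 1.0) for name, (s, o) in FITS.items()}

for name, y in at_one.items():
    print(f"{name:22s} y(1.0) = {y:.5f}   y - 0.61803 = {y - GOLDEN_FRACTION:+.5f}")
```

The values at X = 1.0 scatter on both sides of 0.618, which is consistent with treating the Golden Ratio as a possibility rather than an established limit.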

The fact that viral nucleic acids follow the same purine trajectories as their hosts perhaps shouldn't come as a surprise, because viral genetic material is (in general) highly adapted to host machinery. Purine loading appropriate to the A+T milieu is just another adaptation.

It's striking that so many genomes, from so many diverse organisms (eubacteria, archaea, eukaryotes, viruses, bacteriophages, plus organelles), follow the same basic law of approximately

A+G = 0.46 + 0.14 * (A+T)

The above law is as universal a law of biology as I've ever seen. The only question is what to call the slope term. It's clearly a biological constant of considerable significance. Its physical interpretation is clear: It's the rate at which purines are accumulated in mRNA as genome A+T content increases. It says that a 1% increase in A+T content (or a 1% decrease in genome G+C content) is worth a 0.14% increase in purine content in message strands. Maybe it should be called the purine rise rate? The purine amelioration rate?

Biologists, please feel free to get in touch to discuss. I'm interested in hearing your ideas. Reach out to me on LinkedIn, or simply leave a comment below.

Decrypting DNA

In a previous post ("Information Theory in Three Minutes"), I hinted at the power of information theory to gauge redundancy in a language. A fundamental finding of information theory is that when a language uses symbols in such a way that some symbols appear more often than others (for example when vowels turn up more often than consonants, in English), it's a tipoff to redundancy.

DNA is a language with many hidden redundancies. It's a four-letter language, with symbol choices of A, G, C, and T (adenine, guanine, cytosine, and thymine), which means any given symbol should be able to convey two bits' worth of information, since log₂(4) is two. But it turns out, different organisms speak different "dialects" of this language. Some organisms use G and C twice as often as A and T, which (if you do the math) means each symbol is actually carrying a maximum of about 1.918 bits (not 2 bits) of information.
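For anyone who wants to do the math, here is a minimal sketch of the per-symbol entropy calculation. It assumes "twice as often" means G and C each have probability 1/3 while A and T each have probability 1/6; under that reading the result is roughly 1.92 bits per symbol. Other degrees of skew give other values (a 75% G+C composition comes out near 1.81 bits).

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Equal use of all four bases: the full 2 bits per symbol.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Assumption: G and C are each used twice as often as A and T.
gc_heavy = {"G": 1 / 3, "C": 1 / 3, "A": 1 / 6, "T": 1 / 6}

# A 75% G+C composition, for comparison.
gc_75 = {"G": 0.375, "C": 0.375, "A": 0.125, "T": 0.125}

h_uniform = shannon_entropy(uniform.values())
h_gc_heavy = shannon_entropy(gc_heavy.values())
h_gc_75 = shannon_entropy(gc_75.values())

print(f"uniform:  {h_uniform:.3f} bits per base")
print(f"2:1 skew: {h_gc_heavy:.3f} bits per base")
print(f"75% G+C:  {h_gc_75:.3f} bits per base")
```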

Consider how an alien visitor to earth might be able to use information theory to figure out terrestrial molecular biology.

The first thing an alien visitor might notice is that there are four "symbols" in DNA (A, G, C, T).

By analyzing the frequencies of various naturally occurring combinations of these letters, the alien would quickly determine that the natural "word length" of DNA is three.

There are 64 possible 3-letter words that can be spelled with a 4-letter alphabet. So in theory, a 3-letter "word" in DNA should convey 6 bits worth of information (since 2 to the 6th power is 64). But an alien would look at many samples of earthly DNA, from many creatures, and do a summation of -F * log₂(F) for every 3-letter "word" used by a given creature's DNA (where F is simply the frequency of usage of the 3-letter combo). From this sort of analysis, the alien would find that even though 64 different codons (3-letter words) are, in fact, being used in earthly DNA, in actuality the entropy per codon in some cases is as little as 4.524 bits. (Or at least, it approaches that value asymptotically.)

Since 2 to the 4.524 power is about 23, and since proteins (the predominant macromolecule in earthly biology) are made of amino acids, a canny alien would surmise that there must be around 23 different amino acids, and that earthly DNA is a language for mapping 3-letter words to those 23 amino acids.
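The alien's calculation can be sketched as follows. The function names and the toy sequences are invented for illustration; real input would be the concatenated, in-frame coding sequences of an organism. The entropy is the sum of -F * log₂(F) over observed codons, and raising 2 to that entropy gives the "effective" number of equally likely words.

```python
import math
from collections import Counter


def codon_entropy(coding_sequence):
    """Entropy in bits per codon of an in-frame coding sequence."""
    seq = coding_sequence.upper()
    usable = len(seq) - len(seq) % 3  # ignore any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def effective_word_count(entropy_bits):
    """Number of equally likely words that would carry the same entropy."""
    return 2 ** entropy_bits


# Toy case 1: all 64 codons used exactly once gives the full 6 bits.
bases = "ACGT"
all_codons = "".join(a + b + c for a in bases for b in bases for c in bases)
h_uniform = codon_entropy(all_codons)

# Toy case 2: two codons used 3:1 carry well under 1 bit per codon.
h_skewed = codon_entropy("CTG" * 6 + "CTA" * 2)

print(f"uniform usage: {h_uniform:.3f} bits per codon")
print(f"skewed usage:  {h_skewed:.3f} bits per codon")
print(f"2 ** 4.524 =   {effective_word_count(4.524):.2f}")
```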

As it turns out, the genetic code does use 3-letter "words" (codons) to specify amino acids, but there are 20 amino acids (not 23), with 3 "stop codons" reserved for telling the cell's protein-making machinery "this is the end of this protein; stop here."

E. coli codon usage.

The above chart shows the actual codon usage pattern for E. coli. Note that all organisms use the same 3-letter codes for the same amino acids, and most organisms use all 64 possible codons, but the codons are used with vastly unequal frequencies. If you look in the upper right corner of the above chart, for example, you'll see that E. coli uses CTG (one of the six codons for Leucine) far more often than CTA (another codon for Leucine). One of the open questions in biology is why organisms favor certain synonymous codons over others (a phenomenon called codon usage bias).

While DNA's 6-bit codon bandwidth permits 64 different codons, and while organisms do generally make use of all 64 codons, the uneven usage pattern means fewer than 6 bits of information are used per codon. To get the actual codon entropy, all you have to do is take each usage frequency and calculate -F * log₂(F) for each codon, then sum. If you do that for E. coli, you get 5.679 bits per codon. As it happens, E. coli actually does make use of almost all the available bandwidth (of 6 bits) in its codons. This turns out not to be true for all organisms, however.

Parsing the DNA Crazy Quilt

A measure of how little we know about the real-world workings of evolution is that science still can't explain why some organisms have huge imbalances in the chemical composition of their DNA. If you look at the genome of Clostridium botulinum (the botulism germ), 72% of the bases in its DNA are either 'A' or 'T': adenine or thymine. (The four possibilities are, of course, adenine, thymine, guanine, and cytosine.) Conversely, you can find many examples of organisms in which the DNA is mostly 'G' or 'C.' The question is why A, T, G, and C don't occur in roughly equal proportions (which is what you'd expect after millions of years of genetic averaging; you'd expect some sort of regression to the mean).

Just to give you an idea of what GC/AT imbalance really looks like, here's the gene for the enzyme adenine deaminase from Clostridium botulinum, with all the A and T values in red:

ATGTATAAAAATATACAAAGAGAAATCTATAAAAATACAAAAGGAGACGGGGATATGTTTAATAAATTTGATACAAAGCCTCTTTGGGAGGTAAGTAAAACTTTATCAAGTGTAGCACAGGGGCTTGAACCGGCTGATATGGTTATTATAAATTCAAGGCTTATAAATGTCTGTACAAGAGAAGTCATAGAAAACACAGATGTAGCAATTAGCTGTGGAAGAATTGCTTTAGTAGGTGATGCAAAACATTGCATAGGGGAAAACACAGAGGTAATTGATGCAAAAGGACAATATATTGCACCAGGTTTTTTAGATGGTCATATTCATGTTGAATCATCAATGTTAAGTGTAAGCGAATATGCTCGTTCAGTAGTTCCACATGGTACTGTCGGAATATATATGGATCCACATGAAATTTGTAATGTACTCGGATTAAATGGTGTACGTTATATGATTGAAGATGGCAAGGGTACTCCACTTAAAAATATGGTGACC ACACCATCCTGTGTACCAGCAGTTCCAGGTTTTGAAGATACAGGAGCGGCTGTAGGACCAGAAGATGTTAGAGAAACAATGAAGTGGGATGAAATAGTTGGATTAGGAGAAATGATGAACTTCCCAGGTATACTTTATTCTACAGATCATGCTCATGGAGTAGTAGGAGAAACTTTAAAAGCTAGTAAAACAGTAACAGGACATTATTCTTTACCTGAAACAGGAAAAGGATTAAATGGATATATTGCATCAGGTGTAAGATGTTGTCATGAATCCACAAGAGCGGAAGATGCTCTTGCTAAAATGCGCCTTGGAATGTATGCAATGTTTAGAGAAGGATCTGCATGGCATGACTTAAAGGAAGTAAGTAAAGCCATTACAGAAAATAAGGTAGATAGTAGATTTGCTGTTTTAATATCTGATGATACTCACCCACACACATTGCTTAAGGATGGACATTTAGATCATATTATAAAACGTGCTATAGAAGAAGGG ATAGAGCCATTAACTGCAATTCAAATGGTAACAATAAATTGTGCACAATGTTTCCAAATGGATCATGAATTAGGTTCTATAACTCCAGGAAAATGTGCAGATATTGTATTTATAGAAGATTTAAAAGATGTAAAAATAACAAAGGTTATTATAGATGGAAATTTAGTTGCAAAGGGTGGACTATTAACTACTTCAATAGCTAAATATGATTATCCTGAAGATGCTATGAATTCAATGCATATTAAGAATAAAATAACACCAGATTCCTTTAATATTATGGCTCCTAATAAAGAAAAAATAACTGCAAGGGTTATTGAAATTATACCTGAAAGAGTTGGTACATATGAGAGACATGTTGAACTTAATGTTAAAGATGATAAAGTTCAATGTGATCCAAGTAAAGATGTTTTAAAAGCAGTTGTATTTGAAAGACACCATGAAACAGGAACAGCAGGATATGGTTTTGTTAAAGGTTTTGGTATTAAGAGAGGAGCTATGGCTGCAACAGTTGCCCATGATGCTCACAACTTATTAGTTATAGGAACAAATGATGAAGATATGGCATTAGCTGCTAATACATTAATAGAATGTGGTGGAGGAATGGTAGCCGTACAAGATGGTAAAGTATTAGGCTTAGTTCCATTACCAATAGCAGGACTTATGAGTAATAAGCCTTTAGAAGAAATGGCAGAAATGGTAGAAAAACTAGATAGTGCATGGAAAGAAATAGGATGTGATATAGTTTCACCATTTATGACAATGGCACTTATTCCACTTGCCTGCCTACCAGAATTAAGACTAACTAATAGAGGGTTAGTTGATTGTAATAAGTTTGAATTTGTATCATTATTTGTAGAAGAATAA

View gene at FastaView.

The organism Actinomyces oris (which occurs in the film that builds up on teeth) has an adenine deaminase gene that looks like this:

ATGGCCGATCAACCGTCCGCAGACCTGCTTATCAAGGACGCGCGCATCGTCCCTTTCCGGTCCCGTACCGAACTGGGTGCGCTGCGCCGAGGTGACCCTCACCCCGGCGCCTTGGCCGCGCCGCCGCCCCCGGGTGAGCCCGTGGATGTGCGTATCAAGGCGGGCCGGGTCGTCGAGGTGGGACAGGGGCTGAGTGCTCCCGGGACACGGGTCCTTGAGGCCGAGGGCTCCTTCCTCATTCCCGGCCTGTGGGACGCTCACGCCCACCTGGACATGGAGGCGGCGCGCTCGGCACGC ATCGACACGCTGGCCACCCGCAGCGCGGAGGAGGCCCTGGAGCTGGTGGCACGGGCGCTGCGGGATCATCCGGCCGGTTCGCCTCCGGCCACGATCCAG GGCTTCGGGCACCGCCTGTCCAACTGGCCCCGGGTGCCCACGGTGGCCGAGCTCGACGCCGTCACCGGGGAGGTTCCCACGCTGCTCATCTCCGGGGAC GTGCACTCCGGGTGGCTGAACTCGGCGGCGCTGCGTGTCTTCGGCCTGCCGGGGGCCAGCGCCCAGGACCCGGGAGCACCGATGAAGGAGGACCCGTGG TTCGCCCTACTCGACCGCCTCGATGAGGTCCCGGGGACACGCGAGCTGCGGGAGTCCGGCTACCGACAGGTCCTGGCCGACATGCTGTCCCGGGGCGTC ACCGGCGTGGTGGACATGAGCTGGTCGGAGGATCCCGATGACTGGCCGCGGCGCCTGCGGGCCATGGCGGACGAGGGCGTACTCCCCCAGGTGCTGCCC CGCATCCGCATCGGGGTCTACCGCGACAAGCTGGAACGGTGGATCGCCCGGGGCCTGCGCACCGGGACCGCGCTGGCAGGCTCACCCCGCCTGCCCGAC GGTTCCCCGGTGCTGGTGCAGGGGCCGCTCAAGGTGATCGCAGACGGCTCGATGGGCTCGGGCAGCGCACACATGTGCGAGCCCTATCCCGCCGAGCTG GGCCTGGAGCACGCCTGCGGCGTGGTCAACATCGACCGGGCCGAGCTCACCGACCTCATGGCCCACGCCTCCCGGCAGGGTTATGAGATGGCCATCCAC GCCATCGGGGACGCGGCGGTCGACGACGTCGCCGCGGCCTTCGCGCACTCGGGTGCCGCCGGGCG

For whatever reason (and that's the point: we have no idea why), Actinomyces has chosen an AT-poor dialect for its DNA, even though it has to make many of the same types of genes as Clostridium.
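To put numbers on the contrast, here is a rough sketch that tallies base composition. For brevity it uses only the first 60 bases of each gene shown above, so the figures are illustrative and not genome-wide; the function and constant names are hypothetical.

```python
def base_fractions(sequence):
    """Return A+T, G+C, and purine (A+G) mole fractions of a DNA string."""
    # Keep only the four bases, so stray spaces or line breaks are ignored.
    seq = [b for b in sequence.upper() if b in "ACGT"]
    n = len(seq)
    at = sum(b in "AT" for b in seq) / n
    purine = sum(b in "AG" for b in seq) / n
    return {"A+T": at, "G+C": 1 - at, "A+G": purine}


# First 60 bases of each adenine deaminase gene as printed in the post.
CBOT_PREFIX = "ATGTATAAAAATATACAAAGAGAAATCTATAAAAATACAAAAGGAGACGGGGATATGTTT"
AORIS_PREFIX = "ATGGCCGATCAACCGTCCGCAGACCTGCTTATCAAGGACGCGCGCATCGTCCCTTTCCGG"

clostridium = base_fractions(CBOT_PREFIX)
actinomyces = base_fractions(AORIS_PREFIX)

for name, result in [("C. botulinum", clostridium), ("A. oris", actinomyces)]:
    summary = ", ".join(f"{key} = {value:.2f}" for key, value in result.items())
    print(f"{name}: {summary}")
```

Feeding in the full gene sequences (or whole genomes) instead of the 60-base prefixes would give the composition figures that matter biologically.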

Some people don't see this as a major puzzle: One organism evolved its DNA to a super-AT-rich state, another one didn't. So what? It's all random drift.

I disagree. It's not drift. We know of two strong forces that should keep organisms like Actinomyces from developing high G+C content. First is "AT pressure." It's known that mutations naturally tend to go in the GC-->AT direction. (One study found that in Salmonella typhimurium, GC-->AT mutations outnumbered AT-->GC mutations 50 to 1.) In the absence of corrective measures, natural mutations would very quickly lead all organisms in the direction of DNA with a very low G+C content.

A second important force is that of lateral gene transfer, which we know is common in microorganisms; common enough, certainly, to "even out" GC/AT ratios over evolutionary timescales. Random uptake of foreign genes by cells should tend to make A, G, C, and T levels equal, over time. For organisms like Clostridium and Actinomyces (and many others), this clearly hasn't happened.

In an earlier post I mentioned one possible reason organisms drift away from the 50-50 GC/AT centerline. DNA replication is more efficient when the template is biased toward one extreme (GC) or the other (AT), assuming endogenous nucleotide levels can be regulated in a similarly biased fashion (which they presumably are, in these organisms).

One might speculate that GC/AT extremism also simplifies DNA maintenance and repair. Imagine that your DNA is 70% G+C. A super-simple DNA repair tactic for deaminated purines would be to just replace every defective purine with a guanine. Seven out of ten times, blind replacement of defective purines with guanine would be the correct repair, if you're Actinomyces. And one out of three times, mistakes wouldn't matter anyway, because high-GC codons tend to be fourfold degenerate. (In a fourfold degenerate codon, you can replace the third base with any of A, G, C, or T without changing the codon's meaning.) Blind guanine substitution would therefore have a success rate of about 80% (0.7 + 0.3 × 1/3 = 0.8) in a high-GC organism that needed to replace defective purines.

It turns out there are other reasons to live "away from centerline," if you're a bacterium. I'll talk about those in another post.

DNA G+C Content and Survival Value

One of biology's big open questions is why organisms differ so much with regard to the relative amounts of GC and AT in their DNA. You'd think that if there are only two kinds of DNA base pairs (see diagram) they'd be more-or-less equally abundant. Not so. There are organisms with DNA that's mostly GC (and/or CG) pairs; there are organisms with very-AT-rich DNA; and within the chromosomes of higher organisms you find large GC-rich regions (isochores) in the midst of great swaths of AT-rich DNA.

DNA contains adenine and thymine in equal amounts, and
guanine and cytosine in equal amounts, but it does not
usually contain GC pairs and AT pairs in equal amounts. And
it doesn't seem as if there is an "optimum" GC:AT ratio. The
GC:AT ratio varies by species. Within a species, it's constant.

There are two really odd facts at work here:

1. The GC content of DNA varies by species, and it varies a lot.

2. Evolution doesn't seem to trend toward an "optimum GC:AT ratio" of any kind.

If there were such a thing as an optimum GC:AT ratio for DNA, surely microorganisms would've figured it out by now. Instead, we find huge diversity: There are bacteria on every point in the GC% spectrum, running from 16% GC for the DNA of Candidatus Carsonella ruddii (a symbiont of the jumping plant louse) to 75% for Anaeromyxobacter dehalogenans 2CP-C (a soil bacterium). At each end of the spectrum you find aerobes and anaerobes; extremophiles and blandophiles; pathogens and non-pathogens. About the only generalization you can make is that the smaller an organism's genome is, the more likely it is to be rich in A+T (low GC%).

Genome size correlates loosely with GC content. The very smallest
bacteria tend to have AT-rich (low GC%) DNA.

The huge diversity in GC:AT ratios among bacteria is impressive. But does it simply represent a random walk all over the possibility-space of DNA? Or do the various points on the spectrum constitute special niches with important advantages? What advantage could there be for having high-GC% DNA? Or high-AT% DNA?

Some subtle clues tell us that this is not just random deviation from the mean. First, suppose we agree for sake of argument that lateral gene transfer (LGT) is common in the microbial world (a point of view I happen to agree with). Over the course of millions of years, with pieces of DNA of all kinds (high GC%, low GC%) flying back and forth, LGT should force a regression to the mean: It should make genomes tend toward a 50-50 GC:AT ratio. That clearly hasn't happened.

And then there are ordinary mutational pressures. It's beginning to be fairly well accepted (see Hershberg and Petrov, "Evidence That Mutation is Universally Biased Toward AT in Bacteria," PLoS Genetics, 2010, 6:9, e1001115) that natural mutation is strongly biased in the direction of AT by virtue of the fact that deamination of cytosine and methylcytosine (which occurs spontaneously at high frequency) leads to replacement of 'C' with 'T', hence GC pairs becoming AT pairs. The strong natural mutational bias toward AT says that all DNA should creep in the direction of low GC% and end up well below 50% GC. But again, this is not what we see. We see that high-GC organisms like Anaeromyxobacter (and many others) maintain their DNA's unusually high (75%) GC content across millions of generations. Even middle-of-the-road organisms like E. coli (with 50% GC content) don't slowly slip in the direction of high-AT/low-GC.

Clearly, something funny is going on. For a super-high-GC organism like Anaeromyxobacter to maintain its DNA's super-high GC content against the constant tug of mutations in the AT direction, it must be putting significant energy into maintaining that high GC percentage. But why? Why pay extra to maintain a high GC%? And how does the cost get paid?

I think I've come up with a possible answer. It has to do with DNA replication cost, where "cost" is figured in terms of time needed to synthesize a new copy of the DNA (for cell division). Anything that favors low replication cost (high replication speed) should favor survival; that's my main assumption.

My other assumption is that DNA polymerases (the enzymes involved in replication) are not clairvoyant. They can't know, until the need arises, which of the four deoxyribonucleotide triphosphates (dATP, dTTP, dGTP, dCTP) will be needed at a given moment, to elongate the new strand of DNA. When the need arises for (let's say) an 'A', the 'A' (in the form of dATP) has to come from an existing endogenous pool of dNTPs containing all four bases (dATP, dTTP, dGTP, dCTP) in whatever concentrations they're in. The enzyme has to wait until a dATP (if that's what's needed) randomly happens to lock into the active site. Odds are only one in four (assuming equal concentrations of dNTPs) of a dATP coming along at exactly the right moment. Odds are 3 out of 4 that some incorrect dNTP (either dGTP, dTTP, or dCTP) will try, and fail, to fit the active site first, before dATP comes along.

But imagine that your DNA is 75% G+C. And suppose you've regulated your intracellular metabolism to maintain dGTP and dCTP in a 3:1 ratio over dATP and dTTP. The odds of a good random "first hit" go up.

To simulate the various possibilities, I wrote software (in JavaScript) that simulates DNA replication, where the template DNA molecule is 1000 base-pairs in length and the dNTP pool size is 10000 bases. The software allows you to set the organism's genome GC% to whatever you want, and also set the dNTP pool's relative GC percentage to whatever you want. The template DNA is just a random string of A, T, G, and C bases (1000 total), reflecting their relative abundances as set in the GC% parameter. The pool of dNTPs is set up to be a randomized array (again reflecting abundances set in a GC% parameter).

The way the software works is this. Read a base off the template. Fetch a base randomly from the base pool. If the base happens to be the one (out of four) that's called for, score '1' for the timing parameter, and continue to read another base off the template. If the base was not the one that's called for, put it back in the pool array in a random location, then randomly fetch another base from the pool; and increment the timing parameter. (For each fetch, the timing parameter goes up by 1.) Keep fetching (and throwing back bases) until the proper base comes up, incrementing the time parameter as appropriate. (The time parameter keeps track of the number of fetch attempts.) When the correct base turns up, the pool shrinks by one base. In other words, replication consumes the pool, but as I said earlier, the pool contains ten times as many bases (to start) as the DNA template. So the pool ends up 10% smaller at the end of replication.
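The original JavaScript isn't reproduced here. What follows is a rough Python sketch of the procedure just described, not the original code; names such as `replicate` and `mean_fetches` are chosen for illustration. Two small liberties are taken: the base fetched is the complement of the template base (which doesn't change the statistics when A=T and G=C in the pool), and a wrong base is simply left where it is, which is statistically the same as throwing it back at a random location.

```python
import random
from collections import Counter

# Base that must be fetched to pair with each template base.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def random_bases(n, gc_fraction, rng):
    """Return n random bases; G and C split gc_fraction evenly, A and T the rest."""
    gc, at = gc_fraction / 2, (1 - gc_fraction) / 2
    return rng.choices("GCAT", weights=[gc, gc, at, at], k=n)


def replicate(template_gc, pool_gc, template_len=1000, pool_size=10000, rng=None):
    """Simulate one replication run; return mean fetch attempts per template base."""
    rng = rng or random.Random()
    template = random_bases(template_len, template_gc, rng)
    pool = random_bases(pool_size, pool_gc, rng)
    remaining = Counter(pool)  # lets us fail fast if a needed base runs out
    fetches = 0
    for base in template:
        needed = COMPLEMENT[base]
        if remaining[needed] == 0:
            raise RuntimeError(f"pool ran out of {needed}")
        while True:
            i = rng.randrange(len(pool))  # fetch a random base from the pool
            fetches += 1
            if pool[i] == needed:
                # Correct base: consume it (overwrite with the last item, then pop).
                pool[i] = pool[-1]
                pool.pop()
                remaining[needed] -= 1
                break
            # Wrong base: it stays in the pool and we fetch again.
    return fetches / template_len


def mean_fetches(template_gc, pool_gc, runs=100, seed=0):
    """Average fetches per base over several runs (seeded for repeatability)."""
    rng = random.Random(seed)
    return sum(replicate(template_gc, pool_gc, rng=rng) for _ in range(runs)) / runs


if __name__ == "__main__":
    for template_gc, pool_gc in [(0.25, 0.35), (0.50, 0.50), (0.75, 0.65)]:
        print(template_gc, pool_gc, round(mean_fetches(template_gc, pool_gc), 3))
```

Sweeping `pool_gc` over a range of values for a fixed `template_gc` would trace out one curve of fetches per base against pool composition.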

Each point on this graph represents the average of 100 Monte Carlo runs, each run representing complete replication of a 1000-bp DNA template, drawing from a pool of 10,000 bases. The blue points are runs that used a DNA template containing 25% G+C content. The red points are runs that used DNA with 75% G+C. The X-axis represents different base-pool compositions. See text for details.

I ran Monte Carlo simulations for DNA templates having GC contents of 75%, 50%, and 25%, using base pools set up to have anywhere from 15% GC to 85% (in 2.5% increments). The results for the 75% GC and 25% GC templates (representing high- and low-GC organisms) are shown in the above graph. Each point on the graph represents the average of 100 complete replication runs. The Y-axis shows the average number of fetches per DNA base (so, a low value means fast replication; a high value means slower DNA replication). The X-axis shows the percentage of GC in the base-pool, in recognition of the fact that relative dNTP abundances in an organism may vary, in accordance with environmental constraints as well as with organism-specific homeostatic setpoints.

Maximal replication speed (the low point of each curve) happens at a base-pool GC percentage that is displaced in the direction of the DNA's own GC%. So, for the 25%-GC organism (blue data points), max replication efficiency comes when the base-pool is about 33% GC. For the 75% GC organism (red points) the sweet spot is at a base-pool GC concentration of 65%. (Why this is not exactly symmetrical with the other curve, I don't know; but bear in mind, these are Monte Carlo runs. Some variation is to be expected.)

The interesting thing to note is that max replication efficiency, for each organism, comes at 3.73 fetches per base-pair (Y-axis). Cache that thought. It'll be important in a minute.

The real jaw-dropper is what happens when you plot a curve for template DNA with 50% GC content. In the graph below, I've shown the 50%-GC runs as black points. (The red and blue points are exactly as before.)

This is the same graph as before, but with replication data for a 50%-GC genome (black points). Again, each data point represents the average of 100 Monte Carlo runs. Notice that the black curve bottoms out at a higher level (4.0) than the red or blue curves (3.73). This means replication is less efficient for the 50%-GC genome.

Notice that the best replication efficiency comes in the middle of the graph (no big surprise), but check the Y-value: 4.00. The very fastest DNA replication, when the DNA template is 50% GC, requires 4 fetches per base, compared to best-case base-fetching efficiency of 3.73 for the 25%-GC and 75%-GC DNAs. What does this mean? It means DNA replication, in a best-case scenario, needs about 6.75% fewer fetches for the skewed-GC organisms. (The difference between 3.73 and 4.00 is 0.27, which is 6.75% of 4.00.)
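The 3.73 and 4.00 figures can also be recovered without Monte Carlo, under a simplifying assumption that the pool is large enough for its composition to stay fixed during replication. If the template has G+C fraction p and the pool has G+C fraction q, each needed base is a geometric wait, so the expected cost is p·(2/q) + (1−p)·(2/(1−q)) fetches per base. Setting the derivative to zero gives a best pool composition of √p / (√p + √(1−p)) and a minimum of 2(√p + √(1−p))². The sketch below (function names are mine) evaluates these.

```python
import math


def expected_fetches(template_gc, pool_gc):
    """Expected fetches per base, assuming the pool composition stays constant."""
    p, q = template_gc, pool_gc
    # A G or C template base (probability p) needs a base present at fraction q/2;
    # an A or T template base needs a base present at fraction (1 - q)/2.
    return p * (2 / q) + (1 - p) * (2 / (1 - q))


def best_pool_gc(template_gc):
    """Pool G+C fraction that minimizes expected_fetches for this template."""
    root_gc = math.sqrt(template_gc)
    root_at = math.sqrt(1 - template_gc)
    return root_gc / (root_gc + root_at)


def minimum_fetches(template_gc):
    """Smallest achievable expected fetches per base for this template."""
    return 2 * (math.sqrt(template_gc) + math.sqrt(1 - template_gc)) ** 2


for p in (0.25, 0.50, 0.75):
    q = best_pool_gc(p)
    print(f"template GC {p:.2f}: best pool GC {q:.3f}, "
          f"min fetches {minimum_fetches(p):.3f}")

saving = 1 - minimum_fetches(0.75) / minimum_fetches(0.50)
print(f"relative saving for a skewed genome: {saving:.1%}")
```

In this idealized model the optimum pool compositions are about 36.6% and 63.4% G+C, in the same neighborhood as the simulated sweet spots, and the two skewed curves are exactly symmetrical.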

This goes a long way toward explaining why GC extremism is stable in organisms that pursue it. There is replication efficiency to be had in keeping your DNA biased toward high or low GC. (It doesn't seem to matter which.)

Consider the dynamics of an ATP drawdown. The energy economy of a cell revolves around ATP, which is both an energy molecule and a source for the adenine that goes into DNA and RNA. One would expect normal endogenous concentrations of ATP to be high relative to other NTPs. For a low-GC% organism, that's also a near-ideal situation for DNA replication, because high AT in the base pool puts you near the max-replication-speed part of the curve (see blue points). A sudden drawdown in ATP (when the cell is in crisis) shifts replication speed to the right-hand part of the blue curve, slowing replication significantly. This is what you want if you're an intracellular symbiont (or a mitochondrion, incidentally). You want to stop dividing when the host cell is unable to divide because of an energy crisis.

Consider the high-GC organism (red dots), on the other hand. If ATP levels are high during normal metabolism, replication is not as efficient as it could be, but so what? It just means you're willing to tolerate less-efficient replication in good times. But as ATP draws down (perhaps because nutrients are becoming scarce), DNA replication actually becomes more efficient. This is what you want if you're a free-living organism in the wild. You want to be able to continue replicating your DNA even as ATP becomes scarce. And indeed that's what happens (according to the red data points): As the base pool becomes more GC-rich, replication efficiency increases. The best efficiency comes when base-pool A+T is down around 35%.

I think these simulations are meaningful and I think they help explain the DNA-composition extremism seen among microorganisms. If you're a professional scientist and you find these results tantalizing, and you'd like to co-author a paper for PLoS Genetics (or another journal), please get in touch. (My Google mail is kas-dot-e-dot-thomas.) I'd like to coauthor with someone who is good with statistics, who can contribute more ideas to this line of investigation. I think these results are worth sharing with the scientific community at large.

Hydrogen Peroxide Powers Evolution

I'm about to offer a conjecture that is a bit preposterous-sounding but could well hold true. I actually think it does.

I propose that evolution, at the level of bacteria (though probably not at higher levels), is driven by hydrogen peroxide.

This theory rests on three assumptions. First, the creation of new bacterial species happens almost entirely via lateral gene transfer, not heritable point mutations. Second, bacteria (marine and terrestrial) are regularly exposed to challenges by hydrogen peroxide in the environment. Third, those challenges drive lateral gene transfer.

Evidence for the first assumption is embarrassingly abundant. If you're not up to speed on the subject, I suggest you read the excellent paper, "Lateral Gene Transfer," by Olga Zhaxybayeva and W. Ford Doolittle in Current Biology, April 2011, 21:7, pp. R242-246 (unlocked copy here). It's now common to find that any given bacterial species can trace a good percentage of its protein base to "ancestors" that are too far removed horizontally to be ancestors in the conventional sense.

Consider E. coli. There are hundreds of strains of E. coli, with genes ranging in number from 4,100 to about 5,300 per strain. The problem is, the various strains of E. coli have only about 900 genes in common (and that's far too few genes to render a fully functional E. coli). The E. coli pan-genome actually takes in more than 15,000 gene families, total. Certainly, you can draw a family tree of E. coli based on 16S ribosomal polymorphisms, but that doesn't explain where the 15,000 pan-genome genes came from. The "family tree" metaphor quickly breaks down if you start drawing trees based on proteins. You get many conflicting trees—all of them correct.

Trees like this are fiction where bacteria are concerned.
The tree of life is more like a net of life or web
of life than a directed acyclic graph.

Where are all of the genes coming from? Other species, of course. They arrive by way of mechanisms like transformation, transduction, and conjugation, all of which allow direct entry of foreign DNA into a bacterial cell. At one time it was thought that conjugation could only occur between bacteria of the same species, but it is now known that cross-species conjugation also occurs (as, for example, between E. coli and Streptomyces or Mycobacterium).

Transduction, which is where viruses package up an infected host's genes in virus capsules that are then taken up by another cell, occurs naturally in bacterial populations in response to environmental factors like ultraviolet light and hydrogen peroxide. Exposure of a virus-carrying (lysogenic) cell to UV light or peroxide can induce runaway production of virus, and in fact this mechanism is used by Streptococcus to kill competitive Staphylococcus cells, in a clever bit of chemical warfare. It's been known for years that hydrogen peroxide can cause many types of bacteria to shed DNA. Now we know why: Hydrogen peroxide is a signalling molecule. It signals (among other things) lysogenic bacteria to go into a lytic cycle. It also signals cells to mount what's known as the SOS response, which is a global response to oxidative challenge. Years ago, Bruce Ames and his colleagues showed that exposing Salmonella to very dilute (60 micromolar) hydrogen peroxide caused the cells to differentially express 30 "SOS" proteins, including heat-shock proteins and low-fidelity DNA-repair systems. We know that hydrogen peroxide as dilute as 0.1 micromolar can induce phage (virus) production in up to 11% of marine bacteria. This is significant, because rainwater contains hydrogen peroxide in concentrations of 2 to 40 micromolar, and ocean water has been known to reach millimolar levels of H2O2 after a rain storm.

If you're wondering why rain contains hydrogen peroxide, the peroxide gets there in two ways. One is UV-frequency photochemistry (where water is cleaved to H and OH, then reforms as H2 and H2O2); the other is via ionization reactions caused by lightning. (Lightning is energetic enough to bring airborne oxygen and water to a plasma state. The resulting ionization and rearrangement of free atoms yields a certain amount of hydrogen peroxide.) The presence of H2O2 in rainwater has been confirmed many times, and in fact there's a well-preserved "fossil record" of it in polar icepacks, going back centuries. (Polar snowpacks contain from 10 to 900 ppb of H2O2; it varies seasonally, the max coming in summer.)
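To put the snowpack figures on the same footing as the micromolar values quoted for rainwater, here is a rough unit conversion. It's only a sketch: it assumes "ppb" means micrograms of H2O2 per kilogram of meltwater and that meltwater has a density of about 1 kg/L, neither of which is stated in the sources I've seen.

```python
# Convert H2O2 concentrations from ppb (by mass) to micromolar.
# Assumptions (mine, for illustration): ppb = micrograms of H2O2 per
# kilogram of meltwater, and meltwater density is about 1 kg/L.
H2O2_MOLAR_MASS = 34.01  # g/mol


def ppb_to_micromolar(ppb, molar_mass=H2O2_MOLAR_MASS, density_kg_per_l=1.0):
    """Return micromoles per litre for a mass-based ppb value."""
    micrograms_per_litre = ppb * density_kg_per_l
    # ug/L divided by g/mol gives umol/L
    return micrograms_per_litre / molar_mass


for ppb in (10, 900):
    print(f"{ppb} ppb is about {ppb_to_micromolar(ppb):.2f} micromolar")
```

On those assumptions, 10 to 900 ppb works out to roughly 0.3 to 26 micromolar, which overlaps the 2 to 40 micromolar range given above for rainwater.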

Bottom line, every rain event (over land, over sea) constitutes a hydrogen peroxide challenge for microbes. That challenge induces viral transduction (and a release of whole-cell DNA through lysis, some of which will inevitably be used in transformation). It also induces low-fidelity DNA repair (which is guaranteed to help evolution along). Every rain event, in other words, is a chance for evolution to do its thing. For bacteria, that means gene-sharing within and across species lines.

Darwin's theory of a tree-like ancestor basis
for all living things is dead wrong, at
least for bacteria.

W. Ford Doolittle (who wrote a classic book chapter about lateral gene transfer called "If the Tree of Life Fell, Would We Recognize the Sound?") estimates that if a horizontal gene transfer occurs once every ten billion vertical replications, "it would be enough to ensure that no gene in any modern genome has an unbroken history of vertical descent back to some hypothetical last universal common ancestor." (See this article.)

It's obvious (to me, at least) that every rain event carries with it the potential to cause far more gene transfers than are necessary (according to Doolittle) to make vertical inheritance fade into insignificance as an evolutionary bringer of change. The hydrogen peroxide in rain has been driving lateral gene transfer in bacteria for eons. In fact, it is arguably the dominant driver of evolution in bacteria.

Sorry, Mr. Darwin. Point mutations handed down to sons and daughters just aren't cutting it.


More Science on the Desktop

Not to keep harping on the amazing power of desktop omics tools, but I thought I'd share a tip for those of you into genome-mining. The tip in a nutshell is that if you gang-load a bunch of FASTA sequences (DNA sequence data) into the FeatView form at http://genomevolution.org, then click the rather inconspicuous button labeled "Phylogeny.fr" at the bottom left of the FeatView page, you'll be taken automatically to http://www.phylogeny.fr, where you'll get a realtime-generated phylogenetic tree based on the sequence data you provided in FeatView, with no effort on your part (it's truly a one-click operation). Copy and paste DNA sequences into FeatView, click one button, and 30 seconds later a tree shows up on your screen, looking (perhaps) something like this:

The reason I made this tree is that I wasn't satisfied with my knowledge of the relatedness of certain weird microorganisms I've recently run into. Namely:

Ralstonia (which I mentioned yesterday), WEIRD BECAUSE: It turns hydrogen gas and CO2 into plastic.
Bordetella, a bronchial infection agent; WEIRD BECAUSE: It turns out to be very similar, genetically, to Ralstonia
Burkholderia, a soil organism (and human and animal pathogen), WEIRD BECAUSE: It has an unexpectedly large amount of genetic similarity to Ralstonia and Polynucleobacter
Polynucleobacter, a ditch-water bacterium, WEIRD BECAUSE: It can live as an intracellular parasite of freshwater ciliates or it can live independently in soil (making it potentially a great study organism for determining the genetic bases of intracellular symbiosis)
Thiomicrospira, a very tiny CO2- and sulfur-loving organism, WEIRD BECAUSE: It can only be found near deep-sea thermal vents (see my previous writeup)
Polaromonas, a relatively newly discovered and still poorly understood bacterium, WEIRD BECAUSE: It is abundant in glacier ice on multiple continents. Plus it has an amazing (and totally unexpected) amount of genetic overlap with our good friend Bordetella, the whooping-cough bug.
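If you're collecting gene sequences for organisms like these one at a time, the "gang-loading" step is easier if you first stitch them into a single multi-FASTA text block. Here's a minimal sketch of that, assuming the form you paste into accepts standard multi-FASTA (a ">" header line followed by sequence lines). The function name and the two records are hypothetical placeholders, not real GroEL data.

```python
def to_multi_fasta(records, line_width=70):
    """Join (name, sequence) pairs into one multi-FASTA string."""
    chunks = []
    for name, seq in records:
        # Drop spaces/newlines picked up during copy-paste; normalize case.
        seq = "".join(seq.split()).upper()
        lines = [seq[i:i + line_width] for i in range(0, len(seq), line_width)]
        chunks.append(">" + name + "\n" + "\n".join(lines))
    return "\n".join(chunks) + "\n"


# Hypothetical placeholder records (made-up short sequences for illustration).
records = [
    ("Ralstonia_groEL", "atggcagcta aagacgtc"),
    ("Bordetella_groEL", "ATGGCAGCAAAAGACGTA"),
]
print(to_multi_fasta(records))
```

The output can be pasted in one go rather than one sequence at a time. Keeping the header names short and free of spaces tends to make the tip labels on the resulting tree easier to read.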

If you're not familiar with how bacterial classification works, let's just say it's a mess. There's a long historical tradition of classifying microorganisms based on a hodgepodge of ad hoc methods involving everything from physical appearance under the microscope (especially after staining with crystal violet), to the habitat of the organism, to its ability to metabolize various substances, its ability to make spores, adaptation to oxygen or lack of oxygen, serological characteristics, etc. It's always been an error-prone system, resulting in many misclassifications and later corrections, owing to its inconsistency and basic irrationality, to put it bluntly. With the advent of molecular genetic techniques, it's now possible to create accurate phylogenies based on little more than DNA sequence differences, usually involving the 16S ribosomal RNA (more here).

Freshwater ciliates (like this Euplotes) are
home for Polynucleobacter endosymbionts.

As big an advance as ribosome-based phylogeny is, it's pretty far from ideal (IMHO), mainly because it ignores phenotypes. In fact it's pretty far removed from anything at all having to do with an organism's ecology, metabolism, mode of living, etc. What are we really measuring when we measure relatedness according to a 16S ribosomal yardstick? Just the rate of random mutation accumulation in a pretty uninteresting cell artifact. I'd rather have a yardstick that's tied to phenotypic reality than to a slow-to-change, "highly conserved" piece of cold dead scaffolding.

So to create my own "family tree" of two dozen or so microbes, I said to hell with 16S ribosomes and decided to use, as my yardstick, genetic variation in the GroEL gene, which codes for the 60-kilodalton heat-shock protein. I chose this protein (or rather, the gene for it) as my phylo-yardstick for a number of reasons. First, the DNA sequence is sizable, at about 1643 nucleotides (making it somewhat bigger than the 16S rDNA). It's important to have a large yardstick gene when looking for faint genetic signals. Secondly, this protein is essentially universal in prokaryotes. It's ubiquitous but not necessarily highly conserved, in the same sense that rRNA is highly conserved. ("Highly conserved" is not what you want. Think about it. Taken to the extreme, a "highly conserved" sequence is invariant. It never changes. And is therefore useless for phylogenetics.) Thirdly, the GroEL heat-shock protein has multiple intracellular touchpoints: It's known to interact with GroES, ALDH2, and dihydrofolate reductase, and it's involved in signal transduction (it's induced not just by heat but by hydrogen peroxide). Not to overlook the obvious, but it is also a touchpoint protein for any enzyme that can be repaired by the 60 kDa heat-shock protein. That's probably dozens if not hundreds of enzymes. Why is that important? Think about it: A protein that is sensitive to the 3D conformational requirements of other proteins has to evolve in response to the needs of all the proteins it services. A thermophile (Thiomicrospira) is going to need a different heat-shock repair system than a psychrophile (Polaromonas). A salt-lover needs a different one than a freshwater-lover. GroEL has to reflect, in its own structure, the many shifting requirements of the host proteome. These considerations make GroEL a highly appropriate basis gene for phylogenetic analysis.
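To make "genetic variation in a yardstick gene" concrete, here's a minimal sketch of the simplest possible measure: the p-distance, which is just the fraction of aligned positions at which two sequences differ. This is an illustration only, not a description of what phylogeny.fr does internally. It assumes the sequences have already been aligned to equal length (with "-" for gaps), and the three toy sequences are hypothetical, not real GroEL data.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip positions where either sequence has a gap
        compared += 1
        differing += a != b
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return differing / compared


# Hypothetical toy alignment (made-up sequences, for illustration only).
alignment = {
    "taxon_A": "ATGGCAGCTAAAGACG",
    "taxon_B": "ATGGCAGCAAAAGACG",
    "taxon_C": "ATGGCTGCAAAA-ACG",
}
for (name_a, seq_a), (name_b, seq_b) in combinations(alignment.items(), 2):
    print(name_a, name_b, round(p_distance(seq_a, seq_b), 3))
```

A matrix of pairwise distances like this is the kind of raw material distance-based tree-building methods start from. It also shows why an invariant sequence is useless as a yardstick: every pairwise distance would come out as zero.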

And frankly, I think the GroEL-based phylo-tree phylogeny.fr spit out for me (see illustration further above) speaks for itself. It's a remarkably informative (and accurate) tree. GroEL evolutionary differences not only accurately grouped endosymbionts together, soil organisms together, aquatic organisms, etc., it also correctly grouped the "enteric-alike" Erwinia with E. coli and Shigella, and it cannily put Polaromonas with soil organisms (rather than aquatics), which I think is correct, based on recent Polaromonas isolates being found in soil rather than snow. Likewise, it's good to see Bdellovibrio (a freshwater bug) clustered with Polynucleobacter (which is symbiotic with a ciliate protozoan), with Thiomicrospira (the saltwater hydro-vent organism) a very nearby out-node.

If you get an infection while in a hospital, pray
it's not Clostridium difficile, which is often deadly.

A harder call to make is Clostridium difficile, which is present in 1% to 5% of non-ill people's intestines. Is it an enteric (a la E. coli)? Definitely not. The Clostridia (botulism, tetanus, etc.) are spore-forming soil bacteria. Their placement in the tree not far from the soil-dwelling spore-former, Bacillus thuringiensis, is thus eminently correct. Bacillus is a proximal out-node relative to Clostridium, which is understandable in that Bacillus is aerobic whereas Clostridia are strict anaerobes.

Buchnera (an aphid symbiont) comes at an odd location, much further away from the insect-dwelling Wolbachia than I would have predicted, but then again Buchnera's host feeds on cold sap where Wolbachia's hosts typically feed on warm blood. All the organisms around Wolbachia in the tree are hemophiles.

Our good friend Bordetella (of pertussis fame) is placed firmly in the soil group. I think that's real and significant. When you start to look at Bordetella's high DNA sequence similarity with Ralstonia and Burkholderia, it would be surprising, actually, if it fell anywhere else in the tree.

Honestly, when I took Bacterial Ecology 201 in college, many years ago, it was under duress and I hated the experience. But now, decades later, I'm starting to like it. With tools like those available for free at http://genomevolution.org and http://www.phylogeny.fr, what's not to like?

