unemployment depression: genomics

More ZunZun Graphs and Biohacking

A few days ago I gave a detailed front-to-back tutorial on how to do a bit of bio-hacking. I showed how to quickly get the amino acid sequences for 25 versions of the Hsp40 (DnaJ) heat shock protein, as produced by 25 different organisms, then create a graph of arginine content versus lysine content for the 25 organisms (all bacteria and Archaea). The resulting graph looked something like this:

Lysine and arginine (mole-fraction) in the DnaJ protein of 50 microorganisms.

In this particular graph there are 50 points, because I went back to UniProt and added 25 more organisms to the mix, to see if the trend line would hold true. (The correlation actually got stronger: r=0.82.) The graph clearly shows that as lysine concentration goes up, arginine concentration goes down. Does this graph prove that lysine can take the place of arginine in DnaJ? That's not exactly what the graph says, although it's a worthwhile hypothesis. To check it out, you'd want to look at some AA sequence alignments to see if arginine and lysine are, in fact, replacing each other in the exact same spots in the protein, across organisms. Certainly it would be reasonable for lysine to replace arginine. Both are polar, positively charged amino acids.
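For anyone who wants to reproduce the numbers behind a plot like this, here is a minimal sketch of the two calculations involved: the mole fraction of a residue in a protein sequence, and the Pearson correlation between two lists of values. The function names and the toy sequences are made up for illustration; they are not real DnaJ data and this is not necessarily how ZunZun computes its fit.

```javascript
// Mole fraction of one amino acid (one-letter code) in a protein sequence.
function moleFraction(sequence, residue) {
    const seq = sequence.toUpperCase().replace(/[^A-Z]/g, "");
    if (seq.length === 0) return 0;
    let count = 0;
    for (const aa of seq) {
        if (aa === residue) count++;
    }
    return count / seq.length;
}

// Pearson correlation coefficient for two equal-length numeric arrays.
function pearson(xs, ys) {
    const n = xs.length;
    const mean = a => a.reduce((s, v) => s + v, 0) / n;
    const mx = mean(xs), my = mean(ys);
    let sxy = 0, sxx = 0, syy = 0;
    for (let i = 0; i < n; i++) {
        sxy += (xs[i] - mx) * (ys[i] - my);
        sxx += (xs[i] - mx) ** 2;
        syy += (ys[i] - my) ** 2;
    }
    return sxy / Math.sqrt(sxx * syy);
}

// Invented toy sequences, NOT real DnaJ proteins.
const toySequences = ["MKKRAKKLGA", "MKRRAKRLGA", "MRRRAKRLGR"];
const lys = toySequences.map(s => moleFraction(s, "K"));
const arg = toySequences.map(s => moleFraction(s, "R"));
console.log(lys, arg, pearson(lys, arg));
```

With real data you would feed in one sequence per organism; a negative coefficient corresponds to the downward-sloping trend in the graph.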

I should point out that the organisms in this graph vary greatly in genomic G+C content. The codon AAA represents lysine, and experience has taught me (maybe you've noticed this too, if you're a biogeek) that lysine usage is a pretty reliable proxy for low G+C content. If you check the codon usage tables for organisms with low-GC genomes, you'll see that they use lysine more than any other amino acid. In Clostridium botulinum, AAA accounts for 8% of codon usage. In Buchnera it's 9%. Low G+C means high lysine usage.

Likewise, organisms with low G+C quite often have a disproportionately low frequency of 5'-CpG-3' dinucleotides in their DNA (much lower than would occur by chance). The technical explanation for this is interesting, but I'll leave it for another day. Suffice it to say, organisms with low CpG tend, by definition, not to use very many codons that begin with CG, all of which code for (guess which amino acid?) arginine.
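One common way to put a number on "lower than would occur by chance" is the observed/expected CpG ratio, and the arginine connection can be checked by counting codons that start with CG. The sketch below does both on a single strand; the function names are my own, and the simple expected-count formula (frequency of C times frequency of G times the number of dinucleotide positions) is an assumption of this sketch. Note that arginine is also encoded by AGA and AGG, which this count deliberately ignores.

```javascript
// Observed/expected CpG ratio for a single DNA strand.
// expected = freq(C) * freq(G) * (number of dinucleotide positions)
function cpgObservedOverExpected(dna) {
    const seq = dna.toUpperCase().replace(/[^ACGT]/g, "");
    const n = seq.length;
    if (n < 2) return NaN;
    let c = 0, g = 0, cpg = 0;
    for (let i = 0; i < n; i++) {
        if (seq[i] === "C") c++;
        if (seq[i] === "G") g++;
        if (i < n - 1 && seq[i] === "C" && seq[i + 1] === "G") cpg++;
    }
    const expected = (c / n) * (g / n) * (n - 1);
    return expected === 0 ? NaN : cpg / expected;
}

// Fraction of codons in an in-frame coding sequence that begin with CG.
function cgCodonFraction(cds) {
    const seq = cds.toUpperCase().replace(/[^ACGT]/g, "");
    const codons = seq.match(/.{3}/g) || [];
    if (codons.length === 0) return 0;
    return codons.filter(codon => codon.startsWith("CG")).length / codons.length;
}

console.log(cpgObservedOverExpected("GGGGCCCC"));  // no CpG at all
console.log(cgCodonFraction("ATGCGTAAACGG"));      // ATG CGT AAA CGG
```

A ratio well below 1 means CpG is under-represented relative to the strand's own C and G content.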

To see if the arg-lys inverse relationship holds for higher organisms, I gathered up DnaJ sequences for 25 plants. The results:

Same idea as above, but this time using data for 50 plants (instead of bacteria).

Same negative relationship. However, note one thing: The scale of the graph's axes does not match the scale of the axes in the previous graph (further above). In this graph, we're seeing a very narrow range of frequencies for both Arg and Lys. Fortunately, ZunZun's interface makes it easy for us to re-plot the plant data using the same axis scaling as we had for our bacterial data. By constraining the x- and y-axis limits, we can re-visualize this data in an apples-to-apples context:

The plant data, re-plotted with the same axis scaling as used in the bacterial plot further above.

If you compare this graph carefully to the first graph, at the top of the page, you'll see that the points lie on pretty much the same line.

Just for fun, I went back to the UniProt site and searched for DnaJ in insects (Arthropoda, which technically also subsumes crustacea). Then I plotted the bug data using the same x- and y-axis scaling as for the bacteria:

Same graph, this time for 62 insect DnaJ sequences.

The insect points tend to cluster higher in the graph, and much further to the left than for plants, indicating that the arthropods seem to like Arg and don't much care to use Lys. The moral, I suppose, is that if your diet is lacking in arginine, you should eat fewer plants and more insects.

jQuery for Bioinformatics

I've been using JavaScript for almost two decades now, but somehow I've managed to avoid learning jQuery until just recently, mostly out of laziness but also because of a lingering yet torrid love-hate relationship with "syntax sugar" programming patterns. The best thing I can say about jQuery is that it has a seductively compact and powerful syntax. The worst thing I can say about jQuery is this.repeat(previousStatement).

For better or worse, I've had to begin dabbling in jQuery recently to save myself from the horror of old-school bare-knuckle DOM parsing. You know what I'm talking about: Nested loops with lots of calls to getElementsByTagName( ) followed up with hand-parsing of innerHTML. Who wants to do all that when you can use the oh-so-cute $(selector).each( ) construction?

The trouble with cute/compact syntax (as any recovering Perl user will gladly tell you in return for a bottle of cheap sherry) is that it's write-only. When you go back to look at something a week later and see 15 lines' worth of JS functionality rolled up into a shockingly crisp (yet thoroughly opaque) jQuery one-liner, you often wish you'd gone ahead and written those 15 homely lines of JavaScript in the first place, instead of giving in to that one irresistibly sexy, powerful line of jQuery that's oh yeah BTW also self-obfuscating.

Nonetheless, if you do a lot of page-scraping (as I do when visiting bioinformatics sites), the time savings of being able to parse a page with jQuery can be formidable. Who can resist grabbing all rows of a table with $("tr")? Who can resist iterating over them with .each()?

I tend to use the online apps at genomevolution.org quite heavily. The great folks who maintain that site have a nice way of serving up prodigious amounts of data in easy-to-use interactive forms, but sometimes you just want to harvest the data from a table and be done with it. Take the page I created at http://genomevolution.org/r/9726, which is based on a list of 100 unique bacterial species in the group known as Alphaproteobacteria. If you go to that page and scroll over to the far right, you'll see a column header labeled "Codon Usage." Underneath that label is a "Get All Codon Tables" link. Click that link and be prepared to wait about two minutes as the codon data loads for each organism. It's worth the wait, because when you're done, you're looking at color-coded codon usage frequencies for all 64 codons, for all 100 organisms.

Suppose you just want the codon data in text form, to analyze later? Scraping the raw data out of the HTML page is a royal bitch, because whether you know it or not, that page has tables embedded in tables embedded in tables. Parsing the DOM by hand is (shudder, wince) well nigh unthinkable.

Go to http://genomevolution.org/r/9726 and click "Get All Codon Tables" under the "Codon Usage" column heading. Allow a minute or two for codon data to load. Meanwhile, Control-Shift-J opens the Chrome console. (Select the Console tab at the top of the window if it's not already selected.) Paste the following code into the console. Hit Enter. Savor the power.

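var codonData = [];   // one string of codon-usage text per organism

// Callback for .each(): `this` is the current <tr> element.
function process( ) {

    var CODONS_COLUMN  = 15;   // zero-based index of the codon-usage cell

    var rowdata =  jQuery( 'td', this );
    var cell = rowdata[ CODONS_COLUMN ];
    if ( !cell ) return;   // skip rows that lack a codon-usage cell

    // Split just before the first codon (CCA) to drop the leader text.
    var codonUsage = cell.textContent.split(/(?=CCA)/)[1];
    if ( codonUsage ) codonData.push( codonUsage );
}

$('tr[id^=gl]').each( process ); // oh jQuery, must you tease me so?

console.log( codonData.join("\n") );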

All of this was originally a single statement, with an inline callback function (in typical jQuery fashion). I decided to unroll it into more verbose, easier to understand form, lest my head explode two weeks from now trying to re-read and re-figure-out the code.

This bit of code does some pretty typical jQuery things, such as grab all rows of a table with $('tr'), except that in this case I most certainly do not want all rows of all tables in the HTML page (which would be hundreds of rows of extraneous stuff). The rows I need happen to have an "id" attribute with a value that begins with "gl." The construction $('tr[id^=gl]') is jQuery's syntax for selecting table rows that have an id-attribute that begins with "gl." (The ^= here means "begins with." You could signify "ends with" using $= instead of ^=.)

The process() callback fetches all table cells for the current row using the jQuery( 'td', this ) construction, which gives me a jQuery object representing all "td" elements under the DOM node represented by this. In the callback context, this refers to the current DOM element (the table row being visited), not the window object or Function object. If you choose (as I did not) to declare your callback with arguments, as in function myCallback( argA, argB ), then argA will be the index of the current item and argB will be the same element as this.

If you're wondering about the regex /(?=CCA)/, I need this because ordinarily the codon data would look like this:

Codon Usage: The Bacterial and Plant Plastid Code (transl_table=11) CCA(P) 1.18%CCG(P) 1.58%CCT(P) 1.17%CCC(P) 1.37%CGA(R) 0.32%CGG(R) 1.32%CGT(R) 1.82%CGC(R) 2.54%CAA(Q) 1.07%CAG(Q) 2.84%CAT(H) 1.59%CAC(H) 0.89%CTA(L) 0.48%CTG(L) 4.58%CTT(L) 1.96%CTC(L) 0.84%GCA(A) 2.94%GCG(A) 2.14%GCT(A) 2.31%GCC(A) 3.90%GGA(G) 0.90%GGG(G) 1.74%GGT(G) 2.11%GGC(G) 3.23%GAA(E) 3.92%GAG(E) 1.36%GAT(D) 3.76%GAC(D) 1.49%GTA(V) 1.08%GTG(V) 3.01%GTT(V) 2.19%GTC(V) 0.81%ACA(T) 1.82%ACG(T) 1.49%ACT(T) 0.57%ACC(T) 1.83%AGA(R) 0.30%AGG(R) 0.31%AGT(S) 0.61%AGC(S) 1.33%AAA(K) 2.01%AAG(K) 1.60%AAT(N) 1.39%AAC(N) 1.64%ATA(I) 0.59%ATG(M) 2.56%ATT(I) 2.88%ATC(I) 1.59%TCA(S) 0.65%TCG(S) 0.47%TCT(S) 1.37%TCC(S) 1.34%TGA(*) 0.14%TGG(W) 1.47%TGT(C) 0.46%TGC(C) 0.70%TAA(*) 0.14%TAG(*) 0.03%TAT(Y) 1.47%TAC(Y) 0.90%TTA(L) 0.61%TTG(L) 1.67%TTT(F) 2.41%TTC(F) 1.22%

Notice that first line ("Codon usage: The Bacterial [blah blah]"). I just want the codon data, not the leader line. But how to split off the codon data? Answer: Use a lookahead regular expression that doesn't consume the match. If you split on /CCA/ (the first codon) you will of course consume the CCA, never to be seen again. Instead, use (?=CCA), with parentheses (absolutely essential!) and the parser will look ahead to find an upcoming CCA, then stop and match the spot right before the CCA without consuming the CCA.
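Here is a small standalone illustration of the difference, using a shortened version of the cell text so it runs anywhere (no page or jQuery needed). The parseCodonUsage helper is a hypothetical extra, not part of the scraping code above; it assumes every entry has the form CODON(letter) number%.

```javascript
// Shortened sample of the cell text (leader line plus three codons).
const cellText =
    "Codon Usage: The Bacterial and Plant Plastid Code (transl_table=11) " +
    "CCA(P) 1.18%CCG(P) 1.58%CCT(P) 1.17%";

const consumed = cellText.split(/CCA/);      // plain match: the CCA is eaten
const kept = cellText.split(/(?=CCA)/);      // lookahead: the CCA survives

console.log(consumed[1]); // "(P) 1.18%CCG(P) 1.58%CCT(P) 1.17%"
console.log(kept[1]);     // "CCA(P) 1.18%CCG(P) 1.58%CCT(P) 1.17%"

// Hypothetical helper: turn the codon text into { codon: { aminoAcid, percent } }.
function parseCodonUsage(text) {
    const usage = {};
    const pattern = /([ACGT]{3})\((.)\)\s*([\d.]+)%/g;
    let m;
    while ((m = pattern.exec(text)) !== null) {
        usage[m[1]] = { aminoAcid: m[2], percent: parseFloat(m[3]) };
    }
    return usage;
}

console.log(parseCodonUsage(kept[1]));
```

Once the text is in object form, pulling out a single codon's frequency (say, AAA for lysine) is a one-liner.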

I'm sure a true jQuery expert can rewrite the foregoing code in a much more elegant, compact manner. For me, elegant and compact aren't always optimal. I've learned to value readable and self-documenting over elegant and opaque. Cute/sexy isn't always best. I'll take homely and straightforward any day.

More about Mitochondrial DNA

To recap my desktop-science experiments of the last month or so, I've found strandwise DNA asymmetry across domains, which is to say in bacteria, Archaea, eukaryotes, viruses, and mitochondrial DNA. In every case except mitochondria, the message (or RNA-synonymous) strand of DNA in coding regions tends to be purine-rich. The opposite strand tends to be pyrimidine-rich. Moreover, in all domains, including mitochondria, message-strand purine content increases in proportion to genome A+T content. (A+T content is a phylogenetic signature. Some genomes are inherently high in A+T content—or low in G+C content—while others are not. Related organisms tend to have similar A+T or G+C contents.)

Mitochondrial genes tend to be pyrimidine-rich on the message strand, seemingly in violation of the finding that in all other domains, message strands are purine-rich. The mitochondrial anomaly is actually very easy to understand (although it took me weeks to realize the explanation). In a nutshell: Mitochondrial DNA is pyrimidine-rich on message strands because mtDNA encodes only a few proteins (13, usually), all of them membrane-associated. Membrane-associated proteins are unusual because they tend to incorporate mostly non-polar amino acids such as leucine, isoleucine, valine, proline, alanine, or phenylalanine, all of which are specified by pyrimidine-rich codons.
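The codon claim is easy to check by hand or in the console. The sketch below tallies the purine fraction (A+G) across the synonymous codons for each of the six amino acids named above. It makes two simplifying assumptions: it uses the standard/bacterial code (as in the transl_table=11 listing elsewhere on this page) rather than a mitochondrial code, in which a few assignments differ, and it weights every synonymous codon equally instead of using real codon-usage frequencies.

```javascript
// Synonymous codons (DNA, message strand) under the standard/bacterial code.
const NONPOLAR_CODONS = {
    Leu: ["CTA", "CTG", "CTT", "CTC", "TTA", "TTG"],
    Ile: ["ATA", "ATT", "ATC"],
    Val: ["GTA", "GTG", "GTT", "GTC"],
    Pro: ["CCA", "CCG", "CCT", "CCC"],
    Ala: ["GCA", "GCG", "GCT", "GCC"],
    Phe: ["TTT", "TTC"]
};

// Fraction of bases that are purines (A or G) across a list of codons.
function purineFractionOfCodons(codons) {
    const bases = codons.join("");
    const purines = (bases.match(/[AG]/g) || []).length;
    return purines / bases.length;
}

for (const [aminoAcid, codons] of Object.entries(NONPOLAR_CODONS)) {
    console.log(aminoAcid, purineFractionOfCodons(codons).toFixed(2));
}
```

Under these assumptions none of the six exceeds one half, and leucine, proline, and phenylalanine fall well below it, which is the direction the argument needs.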

The mitochondrion.

It seems to me mitochondrial DNA shouldn't be thought of as a genome, because well over 90% of mitochondrial-associated gene products are encoded by genes in the host nucleus. (In humans, there may be as many as 1500 nuclear-encoded mitochondrial genes.) This point is worth repeating, so let me quote Patrick Chinnery, TRENDS in Genetics (2003) 19:2, 60:

The vast majority of mitochondrial proteins (estimated at >1000) are synthesized in the cytosol from nuclear gene transcripts.

The circular mitochondrial "chromosome" (if it can be called that) is the vestigial remnant of a much larger genome that long ago migrated to the host nucleus, no doubt to avoid oxidative attack. The mitochondrion simply is not a safe place to store DNA. (Would you set up a sperm bank in a rocket-fuel factory?) It's teeming with molecular oxygen, superoxides, peroxides, free protons, and other hazardous materials.

The human mitochondrial chromosome.

Human mitochondrial DNA (which is typical of a lot of mtDNA) encodes just a handful of multi-subunit transmembrane proteins, namely: cytochrome-c oxidase, NADH dehydrogenase, cytochrome-b, and an ATPase. That's it. There are no other protein genes in human mtDNA. All other "mitochondrial proteins" are encoded somewhere else. (That includes 37 out of 44 subunits of the NADH dehydrogenase complex; the DNA polymerase that replicates mitochondrial DNA; the mitochondrial RNA polymerase; about 50 ribosomal proteins; so-called "mitochondrial" catalase; and hundreds of other "mitochondrial" proteins. All are encoded in the nucleus.)

Bottom line: Mitochondrial DNA encodes a very small ensemble of highly specialized membrane-associated proteins. We shouldn't expect this small ensemble to be representative of other genes found in other genomes. (And it's not.) That, in a nutshell, is why mtDNA is not particularly purine-rich in message strands.

But we should test this hypothesis, if possible. (And it is, in fact, possible.) Most bacteria are aerobic, which means most bacterial species have genes for cytochrome-c oxidase, NADH dehydrogenase, etc. The DNA for those genes should be similar to mtDNA with respect to strand-asymmetric purine content. If we analyze bacterial DNA, we should find that genes for cytochrome-c oxidase, NADH dehydrogenase, etc. are pyrimidine-rich on the message strand, just as in mtDNA.

In tomorrow's post: the data.

Highly Expressed Genes: Better-Repaired?

At any given time in any cell, some genes are highly expressed while others are moderately expressed, still others are barely expressed, and quite a few are not expressed at all. The fact that genes vary tremendously in their levels of expression is nothing new, of course, but we still have a lot to learn about how and why some genes have the "Transcribe Me!" knob cranked wide open and others remain dormant until called upon. (For a great paper on this subject, I recommend Samuel Karlin and Jan Mrázek, "Predicted Highly Expressed Genes of Diverse Prokaryotic Genomes," J. Bact. 2000 182:18, 5238-5250, free copy here.)

Reading up on this subject got me to thinking: If DNA undergoes damage and repair at transcription time (when genes are being expressed), shouldn't highly expressed genes differ in mutation rate from rarely expressed genes? (But, in which direction?) Also: Does one strand of highly expressed DNA (the strand that gets transcribed) mutate or repair at a different rate than the other strand?

We know that in most organisms, there is quite an elaborate repair apparatus dedicated to fixing DNA glitches at transcription time. (This is the so-called Transcription Coupled Repair System.) We also know that the TCRS has a preference for the template strand of DNA, just as RNA polymerase does. In fact, it's when RNA polymerase stalls at the site of a thymine dimer (or other major DNA defect) that TCRS kicks into action. Stalled RNAP is the trigger mechanism for TCRS.

But TCRS isn't the only repair option for DNA at transcription time. I've written before about the Archaeal Ogg1 enzyme (which detects and snips out oxidized guanine residues from DNA). The Ogg1 system is a much simpler Base Excision Repair (BER) system, fundamentally low-tech compared to the heavy-duty TCRS mechanism. The latter involves nucleotide-excision repair (NER), which means cutting sugars (deoxyribose) out of the DNA backbone and replacing a whole section of DNA (at great energy cost). BER just snips bases and leaves the underlying sugar(s) in place.

Being a fan of desktop science, I wanted to see if I couldn't devise an experiment of my own to shed light on the question: Does differential repair of DNA strands at transcription time lead to strand asymmetry in highly expressed genes?

Methanococcus maripaludis

Happily, there's a database of highly expressed genes at http://genomes.urv.cat/HEG-DB, which is the perfect starting point for this sort of investigation. For my experiment, I chose the microbe Methanococcus maripaludis strain C5. This tiny organism (isolated from a salt marsh in South Carolina) is a strict anaerobe that lives off hydrogen gas and carbon dioxide. It has a relatively small genome (just under 1.7 million base pairs, enough to code for around 1400 genes). The complete genome is available from here (but don't click unless you want to start a 2-meg download). More to the point, a list of 123 of the creature's most highly expressed genes (HEGs) is available from this page (safe to click; no downloads). The HEGs are putative HEGs inferred from Codon Adaptation Index analysis relative to a reference set of (known-good) high-expression genes. For more details on the HEG ranking process see this excellent paper.

The DNA sequence data for M. maripaludis was easy to match up against the list of HEGs obtained from http://genomes.urv.cat/HEG-DB. In fact, I was able to do all the data-crunching I needed to do with a few lines of JavaScript, in the Chrome console. In no time, I had the adenine (A), guanine (G), and thymine (T) content for all of M. maripaludis's genes, which allowed me to make the following graph:

Purine content (y-axis) plotted against adenine-plus-thymine content for all genes of Methanococcus maripaludis. Each dot represents a gene. The red dots represent the most highly expressed genes. Click to enlarge.

What we're looking at here is message-strand purine content (A+G) on the y-axis versus A+T content (which is a common phylogenetic metric, akin to G+C content) on the x-axis. As you know if you've been following this blog, I have used purine-vs.-AT plots quite successfully to uncover coding-region strand asymmetries. (See this post and/or this one for details.) The important thing to notice above is that while points tend to fall in a shotgun-blast centered roughly at x=0.66 and y=0.55, the Highly Expressed Genes (HEGs, in red) cover the upper left quadrant of the shotgun blast.
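For readers who want to build a plot like this themselves, each dot needs just two numbers per gene. The following is a minimal sketch of that calculation, not the exact console code I used; the function name and the toy sequence are made up for illustration, and the input is assumed to be the message-strand (coding) sequence of one gene.

```javascript
// Compute the two plot coordinates for one gene's message strand.
function baseComposition(messageStrand) {
    const seq = messageStrand.toUpperCase().replace(/[^ACGT]/g, "");
    const counts = { A: 0, C: 0, G: 0, T: 0 };
    for (const base of seq) counts[base]++;
    const n = seq.length;
    return {
        length: n,
        purine: n ? (counts.A + counts.G) / n : NaN,  // y-axis: A+G
        at: n ? (counts.A + counts.T) / n : NaN       // x-axis: A+T
    };
}

// Invented toy sequence, not a real M. maripaludis gene.
const point = baseComposition("ATGAAAGGTTTAGAAGCTTAA");
console.log(point.at.toFixed(3), point.purine.toFixed(3));
```

Run it over every gene in the genome, flag the ones on the HEG list, and you have the data for the scatter plot.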

What does it mean? Consider the following. Of the four bases in DNA, guanine (G) is the most vulnerable to oxidative damage. When such damage is left uncorrected, it eventually results in a G-to-T transversion mutation. A large number of such mutations will cause overall A+T to increase (shifting points on the above graph to the right). If G-to-T transversions accumulate preferentially on one strand, the strand in question will see a reduction in purine content (as G, a purine, is replaced by T, a pyrimidine) while the other strand will see a corresponding increase in purine content (via the addition of adenines to pair with the new T's). Bottom line, if G-to-T transversions happen on the message strand, points in the above graph will move to the right and down. If they happen on the template (or transcribed) strand, points will move left and up. What we see in this graph is that HEGs have gone left and up.

The fact that highly expressed genes appear in the upper left quadrant of the distribution means that yes, differential repair is indeed (apparently) happening at transcription time; highly expressed genes are more intensively repaired; and the beneficiary of said repair(s), at least in M. maripaludis, is the message strand (also called the RNA-synonymous or non-transcribed strand) of DNA, which is where our sequence data come from, ultimately. A relative excess of unrepaired 8-oxoguanine on the template strand (or transcribed strand) means guanines are being replaced by thymines on that strand, and new adenines are showing up opposite the thymines, on the message strand, boosting A+G.

I don't know too many other explanations that are consistent with the above graph.

I hasten to add that one graph is just one graph. A single graph isn't enough to prove any kind of universal phenomenon. What we see here applies to Methanococcus maripaludis, an Archaeal anaerobe that may or may not share similarities (vis-a-vis DNA repair) with other organisms.

Do-It-Yourself Phylogenetic Trees

I've been doing a lot of desktop science lately, and I'm happy to report that superb, easy-to-use online tools exist for creating your own phylogenetic trees based on gene similarities, something that's non-trivial to implement yourself.

The other day, I speculated that the fruit-fly Ogg1 gene, which encodes an enzyme designed to repair oxidatively damaged guanine residues in DNA, might derive from Archaea. The Archaea (in case you're not a microbiologist) comprise one of three super-kingdoms in the tree of life. Basically, all life on earth can be classified as either Archaeal, Eukaryotic, or Eubacterial. The Eubacteria are "true bacteria": they're what you and I think of when we think "bacteria." (So, think Staphylococcus and tetanus bacteria and E. coli and all the rest.) The Eukaryota are higher life forms, starting with yeast and fungi and algae and plankton, progressing up through grass and corn and pine trees, worms and rabbits and donkeys, all the way to the highest life form of all, Stephen Colbert. (A little joke there.) Eukaryotes have big, complex cells with a distinct nucleus, complex organelles (like mitochondria and chloroplasts), and a huge amount of DNA packaged into pairs of chromosomes.

Archaea look a lot like bacteria (they're tiny and lack a distinct nucleus, organelles, etc.), and were in fact long considered bacteria. But in 1977, Carl Woese and George E. Fox provided persuasive evidence that members of this group of organisms were so different in genetic profile (not to mention lifestyle) that they deserved their own taxonomic grouping, which Woese and colleagues formalized as a separate domain in 1990. Thus, we now recognize certain bacteria-like creatures as Archaea.

The technical considerations behind the distinction between bacteria and archaeons are rather deep and have to do with codon usage patterns, ribosomal RNA structure, cell-wall details, lipid metabolism, and other esoterica, but one distinguishing feature of archaeons that's easy to understand is their willingness to live under harsh conditions. Archaeal species tend to be what we call extremophiles: They usually (not always) take up residence in places that are incredibly salty, or incredibly hot, or incredibly alkaline or acidic.

While it's generally agreed that eukaryotes arose after Archaea and bacteria appeared, it's by no means clear whether Archaea and bacteria branched off independently from a common ancestor, or perhaps one arose from the other. (A popular theory right now is that Archaea arose from gram-positive bacteria and sought refuge in inhospitable habitats to escape the chemical-warfare tactics of the gram-positives.) A complication that makes studying this sort of thing harder is the fact that horizontal gene transfer has been known to happen (with surprising frequency, actually) across domains.

Is it possible to study phylogenetic relationships, yourself, on the desktop? Of course. One way to do it: Obtain the DNA sequences of a given gene as produced by a variety of organisms, then feed those gene sequences to a tool like the tree-making tool at http://www.phylogeny.fr. Voila! Instant phylogeny.

The Ogg1 gene is an interesting case, because although the DNA-repair enzyme encoded by this gene occurs in a wide variety of higher life forms, plus Archaea, it is not widespread among bacteria. Aside from a couple of Spirochaetes and one Bacteroides species, the only bacteria that have this particular gene are the members of class Clostridia (which are all strict anaerobes). Question: Did the Clostridia get this gene from anaerobic Archaea?

Using the excellent online CoGeBlast tool, I was able to build a list of organisms that have Ogg1 and obtain the relevant gene sequences, all with literally just a few mouse clicks. Once you run a search using CoGeBlast, you can check the checkboxes next to organisms in the results list, then select "Phylogenetics" from the dropdown menu at the bottom of the results list. (See screenshot.)

When you click the Go button, a new FastaView window will open up, containing the gene sequences of all the items whose checkboxes you checked in CoGeBlast. At the bottom of this FastaView window, there's a small box that looks like this:

Click Phylogeny.fr button (red arrow). Immediately, your sequences are sent to the French server where they'll be converted to a phylogenetic tree in a matter of one to two minutes (usually). The result is a tree that looks something like this:

I've color-coded this tree to make the results easier to interpret. Creating a tree of this kind is not without potential pitfalls, because for one thing, if your DNA sequences are of vastly unequal lengths, the groupings made by Phylogeny.fr are likely to reflect gene lengths more than true phylogeny. For this tree, I did various data checks to make sure we're comparing apples and apples. Even so, a sanity check is in order. Do the groupings make sense? They do, actually. At the very top of the diagram (color-coded in green) we find all the eukaryotes grouped together: fruit-fly (Drosophila), yeast (Saccharomyces), fungus (Aspergillus). At the bottom of the diagram, Clostridium species (purplish red) fall into a subtree of their own, next to a tiny subtree of Methanobrevibacter. This actually makes a good deal of sense, because the two Methanobrevibacter species shown are inhabitants of feces, as are the nearby Clostridium bartlettii and C. diff. The fact that all the salt-loving Archaea members group together (organisms with names starting with 'H') is also indicative of a sound grouping. Overall, the tree looks sound.

If you're wondering what all the numbers are, the scale bar at the bottom (0.4) shows the approximate percentage difference in DNA sequences associated with that particular length of tree depth. The red numbers on the tree branches are indicative of the probability that the immediately underlying nodes are related. Probably the most important thing to know is that the evolutionary distance between any two leaves in the tree is proportional to the sum of the branch lengths connecting them. (The branch lengths are not explicitly specified; you have to eyeball it.) At the top of the diagram, you can see that the branch lengths of the two Drosophila instances are very short. This means they're closely related. By contrast, the branch lengths for Saccharomyces and the ancestor to Drosophila are long, meaning that these organisms are distantly related.
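To make the "sum of branch lengths" idea concrete, here is a small sketch that computes the distance between two leaves of a tree stored as nested objects. The tree, its leaf names, and its branch lengths are invented for illustration; they are not read off the Ogg1 tree, and Phylogeny.fr does not hand you data in this shape.

```javascript
// Hypothetical toy tree; each node's `length` is the branch leading to it.
const toyTree = {
    name: "root", length: 0, children: [
        { name: "fly_clade", length: 0.3, children: [
            { name: "Drosophila_A", length: 0.02, children: [] },
            { name: "Drosophila_B", length: 0.03, children: [] }
        ] },
        { name: "Saccharomyces", length: 0.9, children: [] }
    ]
};

// Return the list of nodes from the root down to the named leaf, or null.
function pathToLeaf(node, leafName, path = []) {
    const here = path.concat([node]);
    if (node.name === leafName) return here;
    for (const child of node.children) {
        const found = pathToLeaf(child, leafName, here);
        if (found) return found;
    }
    return null;
}

// Distance = branch lengths on both paths below the last shared ancestor.
function leafDistance(tree, leafA, leafB) {
    const pa = pathToLeaf(tree, leafA);
    const pb = pathToLeaf(tree, leafB);
    if (!pa || !pb) throw new Error("leaf not found");
    let shared = 0;
    while (shared < pa.length && shared < pb.length && pa[shared] === pb[shared]) {
        shared++;
    }
    const sum = nodes => nodes.reduce((total, n) => total + n.length, 0);
    return sum(pa.slice(shared)) + sum(pb.slice(shared));
}

console.log(leafDistance(toyTree, "Drosophila_A", "Drosophila_B"));
console.log(leafDistance(toyTree, "Drosophila_A", "Saccharomyces"));
```

The two fly leaves sit a short distance apart, while the fly-to-yeast path picks up the long branches on both sides of the root, which is the pattern described above.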

Just to give you an idea of the relatedness, I checked the C. botulinum Ogg1 protein amino-acid sequence against C. tetani, and found 63% identity of amino acids. When I compared C. botulinum's enzyme against C. difficile's, there was 52% identity. With Drosophila there is only 32% identity, and even that applies only to a 46% coverage area (versus 90%+ for C. tetani and C. diff). Bottom line, the Blast-wise relatedness does appear to correspond, in sound fashion, to tree-wise relatedness.
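If you'd rather compute identity and coverage yourself from a pairwise alignment, the arithmetic is simple. The sketch below assumes you already have two aligned strings of equal length with "-" for gaps; it counts identical columns over the full alignment length, which is how I understand Blast's "Identities" figure, but treat that as an assumption. The function names and the toy alignment are invented.

```javascript
// Percent identity over all alignment columns (gap columns count as mismatches).
function percentIdentity(alignedA, alignedB) {
    if (alignedA.length !== alignedB.length) {
        throw new Error("aligned sequences must have equal length");
    }
    if (alignedA.length === 0) return 0;
    let identical = 0;
    for (let i = 0; i < alignedA.length; i++) {
        if (alignedA[i] !== "-" && alignedA[i] === alignedB[i]) identical++;
    }
    return 100 * identical / alignedA.length;
}

// Percent of the full-length query that takes part in the alignment.
function queryCoverage(alignedQuery, queryLength) {
    const residues = alignedQuery.replace(/-/g, "").length;
    return 100 * residues / queryLength;
}

// Invented toy alignment, not real Ogg1 sequences.
console.log(percentIdentity("MKT-LV", "MRTALV").toFixed(1));
console.log(queryCoverage("MKT-LV", 10));
```

Reporting both numbers matters: a modest identity over nearly the whole protein says something quite different from the same identity over less than half of it.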

Two things stand out. One is that not all of the Clostridium species group together. (There's a small cluster of Clostridia near the salt-lovers, then a main branch near the methane-producing Archaea. The out-group of Clostridia near the salt-lovers happen to all have chromosomal G+C content of 50% or more, which makes them quite different from the rest of the Clostridia, whose G+C is under 30%.) The other thing that stands out is that it does appear as if Clostridial Ogg1 could be Archaeal in origin, based on the relationship of Methanoplanus and Methanobrevibacter to the main group of Clostridia. (Also, the C. leptum group's Ogg1 may share an ancestor with the halophilic Archaea.) One thing we can say for sure is that Ogg1 is ancient.

It's tempting to speculate that the eukaryotes obtained Ogg1 from early mitochondria, and that early mitochondria were actually Archaeal endosymbionts. The first part could easily be true, because we know that early mitochondria quickly exported most of their DNA to the host nucleus. (Today's mitochondrial DNA is vestigial. Well over 90% of mitochondrial genes are actually in the host nucleus. Things like mitochondrial DNA polymerase have to be translated from nucleus-generated RNA.) Whether or not early mitochondria were Archaeal endosymbionts, no one knows.

Anyway, I hope this shows how easy it is to generate phylogenetic trees from the comfort of a living room sofa, using nothing more than a laptop with wireless internet connection. Try making your own phylo-trees using CoGeBlast and Phylogeny.fr—and let me know what you find out.

Shedding Light on DNA Strand Asymmetry

In 1950, Erwin Chargaff was the first to report that the amount of adenine (A) in DNA equals the amount of thymine (T), and the amount of guanine (G) equals the amount of cytosine (C). This result was instrumental in helping Watson and Crick (and Rosalind Franklin) determine the structure of DNA.

It's pretty easy to understand that every A on one strand of DNA pairs with a T on the other strand (and every G pairs with an opposite-strand C); this explains DNA complementarity and the associated replication model. But somewhere along the line, Chargaff was credited with the much less obvious rule that A = T and G = C even for individual strands of DNA that aren't paired with anything. This is the so-called second parity rule attributed to Chargaff, although I can't find any record of Chargaff himself having postulated such a rule. The Chargaff papers that are so often cited as supporting this rule (in particular the 3-paper series culminating in this report in PNAS) do not, in fact, offer such a rule, and if you read the papers carefully, what Chargaff and colleagues actually found was that one strand of DNA is heavier than the other (they label the strands 'H' and 'L', for Heavy and Light); not only that, but Chargaff et al. reported a consistent difference in purine content between strands (see Table 1 of this paper).
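A quick way to test the single-strand version of the rule on any sequence is to compute the AT skew, (A - T)/(A + T), and the GC skew, (G - C)/(G + C). If the second parity rule held exactly, both would be zero on each strand taken alone. The sketch below uses invented function names and a toy sequence; it also shows that the opposite strand necessarily has skews of equal size and opposite sign.

```javascript
// AT skew and GC skew of a single strand; 0 means perfect single-strand parity.
function strandSkews(strand) {
    const seq = strand.toUpperCase().replace(/[^ACGT]/g, "");
    const n = { A: 0, C: 0, G: 0, T: 0 };
    for (const base of seq) n[base]++;
    const skew = (x, y) => (x + y === 0 ? 0 : (x - y) / (x + y));
    return { atSkew: skew(n.A, n.T), gcSkew: skew(n.G, n.C) };
}

// The opposite strand, read 5' to 3'.
function reverseComplement(strand) {
    const pair = { A: "T", T: "A", G: "C", C: "G" };
    return strand.toUpperCase().replace(/[^ACGT]/g, "")
        .split("").reverse().map(base => pair[base]).join("");
}

// Invented purine-heavy toy strand.
const toyStrand = "AAAGGC";
console.log(strandSkews(toyStrand));
console.log(strandSkews(reverseComplement(toyStrand)));
```

Positive skews on one strand mean that strand is the purine-rich one, which is exactly the kind of between-strand difference discussed here.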

When I interviewed Linus Pauling in 1977, he cautioned me to always read the Results section of a paper carefully, because people will often conclude something entirely different than what the Results actually showed, or cite a paper as showing "ABC" when the data actually showed "XYZ."

How right he was.

At any rate, it turns out that the "message" strand of a gene hardly ever contains equal amounts of purines and pyrimidines. Codon analysis reveals that as genes become richer in A+T content (or as G+C content goes down), the excess of purines on the message strand becomes larger and larger. This is depicted in the following graph, which shows message-strand purine content (A+G) plotted against A+T content, for 1,373 distinct bacterial species. (No species is represented twice.)

Codon analysis reveals that as A+T content increases, message-strand purine content (A+G) increases. Each point on this graph represents a unique bacterial species (N=1373).

It's quite obvious that when A+T content is above approximately 33%, as it is for most bacterial species, the message strand tends to be comparatively purine-rich. Below A+T = 33%, the message strand becomes more pyrimidine-rich than purine-rich. (Note: In bacteria, where most of the DNA is in coding regions, codon-derived A+T content is very close to whole-genome A+T content. I checked the 1,373 species graphed here and found whole-chromosome A+T to differ from codon-derived A+T by an average of less than 7 parts in 10,000.)

The correlation between A+T and purine content is strong (r=0.85). Still, you can see that quite a few points have drifted far from the regression line, especially in the region of x = 0.5 to x = 0.7, where lots of points lie above y = 0.55. What's going on with those organisms? I decided to do some investigating.

First, some basics. Over time, transition mutations (AT↔GC) can change an organism's A+T content and thus move it along the x-axis of the graph, but transitions cannot move an organism higher or lower on the graph, because (by definition) transitions don't affect the strandwise purine balance.

Transversions, on the other hand, can affect strandwise purine balance (in theory, at least), but only if they occur more often on one strand of DNA than the other. (I should say: occur more often, or are fixed more often, on one strand versus the other.) For example, let's say G-to-T transversions are the most common kind of transversion (which is probably true, given that guanine is the most easily oxidized of the four bases and given the fact that failure to repair 8-oxoguanine lesions does lead to eventual replacement with thymine). And let's say G-to-T transversions are most likely to occur on the non-transcribed strand of DNA, at transcription time. (The non-transcribed strand is uncoiled and unprotected while transcription is taking place on the other strand.) Over time, the non-transcribed strand would lose guanines; they'd be replaced by thymines. The message strand, or RNA-synonymous strand (which is also the non-transcribed strand) would become pyrimidine-rich and the other strand would become purine-rich.

Unfortunately, while that's exactly what happens for organisms with A+T content below 33%, precisely the opposite happens (purines accumulate on the message strand) in organisms with A+T above 33%. And in fact, in some high-AT organisms, the purine content of message strands is rather extreme. How can we explain that?

One possibility is that some organisms have evolved extremely effective transversion repair systems for the message (non-transcribed) strand of genes—systems that are so effective, no G-to-T transversions go unrepaired on the message strand. The transcribed strand, on the other hand, doesn't get the benefit of this repair system, possibly because the repair enzymes can't access the strand: it's engulfed in transcription factors, topoisomerases, RNA polymerase, nearby ribosomal machinery, etc.

If the non-transcribed strand never mutates (because all mutations are swiftly repaired), then the transcribed strand will (in the absence of equally effective repairs) eventually accumulate G-to-T mutations, and the message strand will accumulate adenines (purines). Perhaps.

In the graph further above, you'll notice at x = 0.6 a tiny spur of points hangs down at around y = 0.5. These points belong to some Bartonella species, plus a Parachlamydia and another chlamydial organism. These are endosymbionts that have lost a good portion of their genomes over time. It seems likely they've lost some transversion-repair machinery. During transcription, their message strands are going unrepaired. G-to-T transversions happen on the message strand, rendering it light in purines. Such a scenario seems plausible, at least.

By this reasoning, maybe points far above the regression line represent organisms that have gained repair functionality, such that their message strands never undergo G-to-T transversions (although their transcribed strands do). Is this possible?

Examination of the highest points on the graph shows a predominance of Clostridia. (Not just members of the genus Clostridium, but the class Clostridia, which is a large, ancient, and diverse class of anaerobes.) One thing we know about the Clostridia is that unlike all other bacteria (unlike members of the Gammaproteobacteria, the Alpha- and Betaproteobacteria, the Actinomycetes, the Bacteroidetes, etc.), the Clostridia have Ogg1, otherwise known as 8-oxoguanine glycosylase (which specifically prevents G-to-T transversions). They share this capability with all members of the Archaea, and all higher life forms as well.

Note that while non-Ogg1 enzymes exist for correcting 8-oxoguanine lesions (e.g., MutM, MutY, mfd), there is evidence that Ogg1 is specifically involved in repair of 8oxoG lesions in non-transcribed strands of DNA, at transcription time. (The other 8oxoG repair systems may not be strand-specific.)

If Archaea benefit from Ogg1 the way Clostridia do, they too should fall well above the regression line on a graph of A+G versus A+T. And this is exactly what we find. In the graph below, the pink squares are members of Archaea that came up positive in a protein-Blast query against Drosophila Ogg1. (I'll explain why I used Drosophila in a minute.) The red-orange circles are bacterial species (mostly from class Clostridia) that turned up Ogg1-positive in a similar Blast search.

Ogg1-positive organisms are plotted here. The pink squares are Archaea species. Red-orange circles are bacterial species that came up Ogg1-positive in a protein Blast search using a Drosophila Ogg1 amino-acid sequence. In the background (greyed out) is the graph of all 1,373 bacterial species, for comparison. Note how the Ogg1-positive organisms have a higher purine (A+G) content than the vast majority of bacteria.

The points in this plot are significantly higher on the y-axis than points in the all-bacteria plot (and the regression line is steeper), consistent with a different DNA repair profile.

In identifying Ogg1-positive organisms, I wanted to avoid false positives (organisms with enzymes that share characteristics of Ogg1 but that aren't truly Ogg1), so for the Blast query I used Drosophila's Ogg1 as a reference enzyme, since it is well studied (unlike Archaeal or Clostridial Ogg1). I also set the E-value cutoff at 1e-10, to reduce spurious matches with DNA repair enzymes or nucleases that might have domain similarity with Ogg1 but aren't Ogg1. In addition, I did spot checks to be sure the putative Ogg1 matches that came up were not actually matches of Fpg (MutM), RecA, RadA, MutY, DNA-3-methyladenine glycosidase, or other DNA-binding enzymes.
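The E-value filtering step can be sketched in a few lines of Python. This is only an illustration, under the assumption that the hits have been exported in the default 12-column tabular layout used by BLAST+ (`-outfmt 6`), where the eleventh column is the E-value. The function name `significant_hits` and the sample rows are hypothetical.

```python
def significant_hits(tabular_text, max_evalue=1e-10):
    """Return (query, subject, evalue) tuples for hits at or below max_evalue.

    Assumes tab-separated rows in the default BLAST+ tabular column order:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore.
    """
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue <= max_evalue:
            hits.append((fields[0], fields[1], evalue))
    return hits

# Made-up rows for illustration; identifiers and numbers are not real results.
sample = (
    "dmel_ogg1\torg_A_protein_1\t38.5\t270\t150\t6\t20\t289\t5\t270\t3e-45\t160\n"
    "dmel_ogg1\torg_B_protein_7\t27.0\t90\t60\t3\t40\t129\t12\t98\t2e-04\t41.2\n"
)
for query, subject, evalue in significant_hits(sample):
    print(query, subject, evalue)
```

Only the first sample row passes the 1e-10 cutoff. Spot checks against other repair enzymes, as described above, would still have to be done by hand.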

Bottom line, organisms that have an Archaeal 8-oxoguanine glycosylase enzyme (mostly obligate anaerobes) occupy a unique part of the A+G vs. A+T graph. Which makes sense. It's only logical that anaerobes would have different DNA repair strategies (and a different "repairosome") than oxygen-tolerant bacteria, because oxidative stress is, in general, handled much differently in anaerobes. The fact that they bring different repair tactics to bear on DNA shouldn't come as a surprise.


More Science on the Desktop

If you took Bacteriology 101, you were probably subjected to (maybe even tested on) the standard mythology about anaerobes lacking the enzyme catalase. The standard mythology goes like this: Almost all life forms (from bacteria to dandelions to humans) have a special enzyme called catalase that detoxifies hydrogen peroxide by breaking it down to water and molecular oxygen. The only exception: strict anaerobes (bacteria that cannot live in the presence of oxygen). They seem to lack catalase.

I've written on this subject before, so I won't bore you with a proper debunking of all aspects of the catalase myth here. (For that, see this post.) Right now, I just want to emphasize one point, which is that, contrary to myth, quite a few strict anaerobes do have catalase. I've listed 87 examples by name below. (Scroll down.)

I have to admit, even I was shocked to find there are 87 species of catalase-positive strict anaerobes among the eubacteria. It's about quadruple the number I would have expected.

If you're curious how I came up with a list of 87 catalase-positive anaerobes, here's how. First, I assembled a sizable (N=1373) list of bacteria, unduplicated at the species level. (So in other words, E. coli is listed only once, Staphylococcus aureus is listed only once, etc. No species is listed twice.) I then used the free/online CoGeBlast tool to run two Blast searches: one designed to identify aerobes, and another to identify catalase-positive organisms. In the end, I had all 1,373 organisms tagged as to whether each was aerobic, anaerobic, catalase-positive, or catalase-negative.

It's not as easy as you'd think to identify strict anaerobes. There is no single enzymatic marker that can be used to identify anaerobes reliably (across 1,373 species), as far as I know. I took the opposite approach, tagging as aerobic any organism that produces cytochrome c oxidase and/or NADH dehydrogenase. (These are enzymes involved in classic oxidative phosphorylation of the kind no strict anaerobe participates in.) In particular, I used the following set of amino acid sequences as markers of aerobic respiration (non-biogeeks, scroll down):

>sp|Q6MIR4|NUOB_BDEBA NADH-quinone oxidoreductase subunit B OS=Bdellovibrio bacteriovorus (strain ATCC 15356 / DSM 50701 / NCIB 9529 / HD100) GN=nuoB PE=3 SV=1
MHNEQVQGLVSHDGMTGTQAVDDMSRGFAFTSKLDAIVAWGRKNSLWPMPYGTACCGIEF MSVMGPKYDLARFGAEVARFSPRQADLLVVAGTITEKMAPVIVRIYQQMLEPKYVLSMGA CASSGGFYRAYHVLQGVDKVIPVDVYIPGCPPTPEAVMDGIMALQRMIATNQPRPWKDNW KSPYEQA
>sp|P0ABJ3|CYOC_ECOLI Cytochrome o ubiquinol oxidase subunit 3 OS=Escherichia coli (strain K12) GN=cyoC PE=1 SV=1
MATDTLTHATAHAHEHGHHDAGGTKIFGFWIYLMSDCILFSILFATYAVLVNGTAGGPTG KDIFELPFVLVETFLLLFSSITYGMAAIAMYKNNKSQVISWLALTWLFGAGFIGMEIYEF HHLIVNGMGPDRSGFLSAFFALVGTHGLHVTSGLIWMAVLMVQIARRGLTSTNRTRIMCL SLFWHFLDVVWICVFTVVYLMGAM
>sp|Q9I425|CYOC_PSEAE Cytochrome o ubiquinol oxidase subunit 3 OS=Pseudomonas aeruginosa (strain ATCC 15692 / PAO1 / 1C / PRS 101 / LMG 12228) GN=cyoC PE=3 SV=1
MSTAVLNKHLADAHEVGHDHDHAHDSGGNTVFGFWLYLMTDCVLFASVFATYAVLVHHTA GGPSGKDIFELPYVLVETAILLVSSCTYGLAMLSAHKGAKGQAIAWLGVTFLLGAAFIGM EINEFHHLIAEGFGPSRSAFLSSFFTLVGMHGLHVSAGLLWMLVLMAQIWTRGLTAQNNT RMMCLSLFWHFLDIVWICVFTVVYLMGAL
>tr|Q7VDD9|Q7VDD9_PROMA Cytochrome c oxidase subunit III OS=Prochlorococcus marinus (strain SARG / CCMP1375 / SS120) GN=cyoC PE=3 SV=1
MTTISSVDKKAEELTSQTEEHPDLRLFGLVSFLVADGMTFAGFFAAYLTFKAVNPLLPDA IYELELPLPTLNTILLLVSSATFHRAGKALEAKESEKCQRWLLITAGLGIAFLVSQMFEY FTLPFGLTDNLYASTFYALTGFHGLHVTLGAIMILIVWWQARSPGGRITTENKFPLEAAE LYWHFVDGIWVILFIILYLL
>sp|Q8KS19|CCOP2_PSEST Cbb3-type cytochrome c oxidase subunit CcoP2 OS=Pseudomonas stutzeri GN=ccoP2 PE=1 SV=1
MTSFWSWYVTLLSLGTIAALVWLLLATRKGQRPDSTEETVGHSYDGIEEYDNPLPRWWFM LFVGTVIFALGYLVLYPGLGNWKGILPGYEGGWTQVKEWQREMDKANEQYGPLYAKYAAM PVEEVAKDPQALKMGGRLFASNCSVCHGSDAKGAYGFPNLTDDDWLWGGEPETIKTTILH GRQAVMPGWKDVIGEEGIRNVAGYVRSLSGRDTPEGISVDIEQGQKIFAANCVVCHGPEA KGVTAMGAPNLTDNVWLYGSSFAQIQQTLRYGRNGRMPAQEAILGNDKVHLLAAYVYSLS QQPEQ
>sp|P57542|CYOC_BUCAI Cytochrome o ubiquinol oxidase subunit 3 OS=Buchnera aphidicola subsp. Acyrthosiphon pisum (strain APS) GN=cyoC PE=3 SV=1
MIENKFNNTILNSNSSTHDKISETKKLFGLWIYLMSDCIMFAVLFAVYAIVSSNISINLI SNKIFNLSSILLETFLLLLSSLSCGFVVIAMNQKRIKMIYSFLTITFIFGLIFLLMEVHE FYELIIENFGPDKNAFFSIFFTLVATHGVHIFFGLILILSILYQIKKLGLTNSIRTRILC FSVFWHFLDIIWICVFTFVYLNGAI
>sp|O24958|CCOP_HELPY Cbb3-type cytochrome c oxidase subunit CcoP OS=Helicobacter pylori (strain ATCC 700392 / 26695) GN=ccoP PE=3 SV=1
MDFLNDHINVFGLIAALVILVLTIYESSSLIKEMRDSKSQGELVENGHLIDGIGEFANNV PVGWIASFMCTIVWAFWYFFFGYPLNSFSQIGQYNEEVKAHNQKFEAKWKHLGQKELVDM GQGIFLVHCSQCHGITAEGLHGSAQNLVRWGKEEGIMDTIKHGSKGMDYLAGEMPAMELD EKDAKAIASYVMAELSSVKKTKNPQLIDKGKELFESMGCTGCHGNDGKGLQENQVFAADL TAYGTENFLRNILTHGKKGNIGHMPSFKYKNFSDLQVKALLNLSNR
>sp|P0ABI8|CYOB_ECOLI Ubiquinol oxidase subunit 1 OS=Escherichia coli (strain K12) GN=cyoB PE=1 SV=1
MFGKLSLDAVPFHEPIVMVTIAGIILGGLALVGLITYFGKWTYLWKEWLTSVDHKRLGIM YIIVAIVMLLRGFADAIMMRSQQALASAGEAGFLPPHHYDQIFTAHGVIMIFFVAMPFVI GLMNLVVPLQIGARDVAFPFLNNLSFWFTVVGVILVNVSLGVGEFAQTGWLAYPPLSGIE YSPGVGVDYWIWSLQLSGIGTTLTGINFFVTILKMRAPGMTMFKMPVFTWASLCANVLII ASFPILTVTVALLTLDRYLGTHFFTNDMGGNMMMYINLIWAWGHPEVYILILPVFGVFSE IAATFSRKRLFGYTSLVWATVCITVLSFIVWLHHFFTMGAGANVNAFFGITTMIIAIPTG VKIFNWLFTMYQGRIVFHSAMLWTIGFIVTFSVGGMTGVLLAVPGADFVLHNSLFLIAHF HNVIIGGVVFGCFAGMTYWWPKAFGFKLNETWGKRAFWFWIIGFFVAFMPLYALGFMGMT RRLSQQIDPQFHTMLMIAASGAVLIALGILCLVIQMYVSIRDRDQNRDLTGDPWGGRTLE WATSSPPPFYNFAVVPHVHERDAFWEMKEKGEAYKKPDHYEEIHMPKNSGAGIVIAAFST IFGFAMIWHIWWLAIVGFAGMIITWIVKSFDEDVDYYVPVAEIEKLENQHFDEITKAGLK NGN
>sp|P0ABK2|CYDB_ECOLI Cytochrome d ubiquinol oxidase subunit 2 OS=Escherichia coli (strain K12) GN=cydB PE=1 SV=1
MIDYEVLRFIWWLLVGVLLIGFAVTDGFDMGVGMLTRFLGRNDTERRIMINSIAPHWDGN QVWLITAGGALFAAWPMVYAAAFSGFYVAMILVLASLFFRPVGFDYRSKIEETRWRNMWD WGIFIGSFVPPLVIGVAFGNLLQGVPFNVDEYLRLYYTGNFFQLLNPFGLLAGVVSVGMI ITQGATYLQMRTVGELHLRTRATAQVAALVTLVCFALAGVWVMYGIDGYVVKSTMDHYAA SNPLNKEVVREAGAWLVNFNNTPILWAIPALGVVLPLLTILTARMDKAAWAFVFSSLTLA CIILTAGIAMFPFVMPSSTMMNASLTMWDATSSQLTLNVMTWVAVVLVPIILLYTAWCYW KMFGRITKEDIERNTHSLY
>sp|Q6MIR4|NUOB_BDEBA NADH-quinone oxidoreductase subunit B OS=Bdellovibrio bacteriovorus (strain ATCC 15356 / DSM 50701 / NCIB 9529 / HD100) GN=nuoB PE=3 SV=1
MHNEQVQGLVSHDGMTGTQAVDDMSRGFAFTSKLDAIVAWGRKNSLWPMPYGTACCGIEF MSVMGPKYDLARFGAEVARFSPRQADLLVVAGTITEKMAPVIVRIYQQMLEPKYVLSMGA CASSGGFYRAYHVLQGVDKVIPVDVYIPGCPPTPEAVMDGIMALQRMIATNQPRPWKDNW KSPYEQA
>sp|Q89AU5|NUOB_BUCBP NADH-quinone oxidoreductase subunit B OS=Buchnera aphidicola subsp. Baizongia pistaciae (strain Bp) GN=nuoB PE=3 SV=1
MKYTLTRVNISDDDQNYPREKKIQVSDPTKKYIQKNVFMGTLSKVLHNLVNWGRKNSLWP YNFGLSCCYVEMVTSFTSVHDISRFGSEVLRASPRQADFMVIAGTPFIKMVPIIQRLYDQ MLEPKWVISMGSCANSGGMYDIYSVVQGVDKFLPVDVYIPGCPPRPEAYIHGLMLLQKSI SKERRPLSWIIGEQGIYKANFNSEKKNLRKMRNLVKYSQDKN
>sp|Q82DY0|NUOB1_STRAW NADH-quinone oxidoreductase subunit B 1 OS=Streptomyces avermitilis (strain ATCC 31267 / DSM 46492 / JCM 5070 / NCIMB 12804 / NRRL 8165 / MA-4680) GN=nuoB1 PE=3 SV=1
MGLEEKLPSGFLLTTVEQAAGWVRKASVFPATFGLACCAIEMMTTGAGRYDLARFGMEVF RGSPRQADLMIVAGRVSQKMAPVLRQVYDQMPNPKWVISMGVCASSGGMFNNYAIVQGVD HIVPVDIYLPGCPPRPEMLIDAILKLHQKIQSSKLGVNAEEAAREAEEAALKALPTIEMK GLLR

Astonishingly, certain bacteria that "everyone knows" are anaerobic turned up as aerobic when checked with the above Blast query. (For example: Bacteroides fragilis, Desulfovibrio gigas, Moorella thermoacetica, and others.) It seems quite a number of so-called anaerobes have non-copper (heme-only) cytochrome oxidases. (See this paper for further discussion.)

In any event, my Blast search turned up 1,089 positives (putative aerobes, some facultatively anaerobic) out of 1,373 bacterial species. I tagged the non-positives as anaerobes.
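The tagging logic boils down to set arithmetic, which can be sketched as follows. This is a simplified illustration, not the actual workflow: it assumes the two Blast searches have each been reduced to a collection of species names with at least one qualifying hit, and the function name `tag_organisms` and the placeholder species names are hypothetical.

```python
def tag_organisms(all_species, aerobe_hits, catalase_hits):
    """Split species into putative aerobes and anaerobes, then list the
    anaerobes that also scored positive for catalase."""
    all_species = set(all_species)
    aerobes = all_species & set(aerobe_hits)      # positive for a respiration marker
    anaerobes = all_species - aerobes             # everything else is tagged anaerobic
    catalase_positive_anaerobes = sorted(anaerobes & set(catalase_hits))
    return aerobes, anaerobes, catalase_positive_anaerobes

# Placeholder names only; these are not real search results.
species = ["species_A", "species_B", "species_C", "species_D"]
aerobes, anaerobes, cat_pos = tag_organisms(
    species,
    aerobe_hits=["species_A", "species_B"],
    catalase_hits=["species_B", "species_C"],
)
print(len(aerobes), len(anaerobes), cat_pos)
```

Note that an organism is tagged anaerobic purely by the absence of a marker hit, so any marker shared by a nominal anaerobe moves that organism into the aerobe bin.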

Of the 284 putative anaerobes, 87 scored positive in a Blast protein search (tblastn) for catalase. I used the following catalase sequences in my query:

>sp|B0C4G1|KATG_ACAM1 Catalase-peroxidase OS=Acaryochloris marina (strain MBIC 11017) GN=katG PE=3 SV=1
MSSASKCPFSGGALKFTAGSGTANRDWWPNQLNLQILRQHSPKSNPMDKAFNYAEAFKSL DLADVKQDIFDLMKSSQDWWPADYGHYGPLFIRMAWHSAGTYRIGDGRGGAGTGNQRFAP INSWPDNANLDKARMLLWPIKQKYGAKISWADLMILAGNCALESMGFKTFGFAGGREDIW EPEEDIYWGAETEWLGDQRYTGDRDLEATLGAVQMGLIYVNPEGPNGHPDPVASGRDIRE TFGRMAMNDEETVALTAGGHTFGKCHGAGDDAHVGPEPEGARIEDQCLGWKSSFGTGKGV HAITSGIEGAWTTNPTQWDNNYFENLFGYEWELTKSPAGANQWVPQGGAGANTVPDAHDP SRRHAPIMTTADMAMRMDPIYSPISRRFLDNPDQFADAFARAWFKLTHRDMGPRSRYLGP EVPEEELIWQDPVPAVNHELINEQDIATLKSQILATNLTVSQLVSTAWASAVTYRNSDKR GGANGARIRLAPQRDWEVNQPAQLATVLQTLEAVQTTFNHSQIGGKRVSLADLIVLGGCA GVEQAAKNAGWYDVKVPFKPGRTDATQAQTDVTSFAVLEPRADGFRNYLKGHYPVSAEEL LVDKAQLLTLTAPEMTVLVGGLRVLNANVGQAQHGVFTHRPESLTNDFFLNLLDMSVTWA ATSEAEEVFEGRDRKTGALKWTGTRVDLIFGSNSQLRALAEVYGCEDSQQRFVQDFVAAW DKVMNLDRFDLA
>tr|D9RGS2|D9RGS2_STAAJ Catalase OS=Staphylococcus aureus (strain JKD6159) GN=katE PE=3 SV=1
MSQQDKKLTGVFGHPVSDRENSMTAGPRGPLLMQDIYFLEQMSQFDREVIPERRMHAKGS GAFGTFTVTKDITKYTNAKIFSEIGKQTEMFARFSTVAGERGAADAERDIRGFALKFYTE EGNWDLVGNNTPVFFFRDPKLFVSLNRAVKRDPRTNMRDAQNNWDFWTGLPEALHQVTIL MSDRGIPKDLRHMHGFGSHTYSMYNDSGERVWVKFHFRTQQGIENLTDEEAAEIIASDRD SSQRDLFEAIEKGDYPKWTMYIQVMTEEQAKSHKDNPFDLTKVWYHDEYPLIEVGEFELN RNPDNYFMDVEQAAFAPTNIIPGLDFSPDKMLQGRLFSYGDAQRYRLGVNHWQIPVNQPK GVGIENICPFSRDGQMRVVDNNQGGGTHYYPNNHGKFDSQPEYKKPPFPTDGYGYEYNQR QDDDNYFEQPGKLFRLQSEDAKERIFTNTANAMEGVTDDVKRRHIRHCYKADPEYGKGVA KALGIDINSIDLETENDETYENFEK
>sp|P60355|MCAT_LACPN Manganese catalase OS=Lactobacillus plantarum PE=1 SV=1
MFKHTRKLQYNAKPDRSDPIMARRLQESLGGQWGETTGMMSYLSQGWASTGAEKYKDLLL DTGTEEMAHVEMISTMIGYLLEDAPFGPEDLKRDPSLATTMAGMDPEHSLVHGLNASLNN PNGAAWNAGYVTSSGNLVADMRFNVVRESEARLQVSRLYSMTEDEGVRDMLKFLLARETQ HQLQFMKAQEELEEKYGIIVPGDMKEIEHSEFSHVLMNFSDGDGSKAFEGQVAKDGEKFT YQENPEAMGGIPHIKPGDPRLHNHQG
>sp|P42321|CATA_PROMI Catalase OS=Proteus mirabilis GN=katA PE=1 SV=1
MEKKKLTTAAGAPVVDNNNVITAGPRGPMLLQDVWFLEKLAHFDREVIPERRMHAKGSGA FGTFTVTHDITKYTRAKIFSEVGKKTEMFARFSTVAGERGAADAERDIRGFALKFYTEEG NWDMVGNNTPVFYLRDPLKFPDLNHIVKRDPRTNMRNMAYKWDFFSHLPESLHQLTIDMS DRGLPLSYRFVHGFGSHTYSFINKDNERFWVKFHFRCQQGIKNLMDDEAEALVGKDRESS QRDLFEAIERGDYPRWKLQIQIMPEKEASTVPYNPFDLTKVWPHADYPLMDVGYFELNRN PDNYFSDVEQAAFSPANIVPGISFSPDKMLQGRLFSYGDAHRYRLGVNHHQIPVNAPKCP FHNYHRDGAMRVDGNSGNGITYEPNSGGVFQEQPDFKEPPLSIEGAADHWNHREDEDYFS QPRALYELLSDDEHQRMFARIAGELSQASKETQQRQIDLFTKVHPEYGAGVEKAIKVLEG KDAK
>sp|Q9Z598|CATA_STRCO Catalase OS=Streptomyces coelicolor (strain ATCC BAA-471 / A3(2) / M145) GN=katA PE=3 SV=1
MSQRVLTTESGAPVADNQNSASAGIGGPLLIQDQHLIEKLARFNRERIPERVVHARGSGA YGHFEVTDDVSGFTHADFLNTVGKRTEVFLRFSTVADSLGGADAVRDPRGFALKFYTEEG NYDLVGNNTPVFFIKDPIKFPDFIHSQKRDPFTGRQEPDNVFDFWAHSPEATHQITWLMG DRGIPASYRHMDGFGSHTYQWTNARGESFFVKYHFKTDQGIRCLTADEAAKLAGEDPTSH QTDLVQAIERGVYPSWTLHVQLMPVAEAANYRFNPFDVTKVWPHADYPLKRVGRLVLDRN PDNVFAEVEQAAFSPNNFVPGIGPSPDKMLQGRLFAYADAHRYRLGVNHTQLAVNAPKAV PGGAANYGRDGLMAANPQGRYAKNYEPNSYDGPAETGTPLAAPLAVSGHTGTHEAPLHTK DDHFVQAGALYRLMSEDEKQRLVANLAGGLSQVSRNDVVEKNLAHFHAADPEYGKRVEEA VRALRED
>Haloarcula marismortui strain ATCC 43049(v1, unmasked), Name: YP_136584.1, katG1, rrnAC2018, Type: CDS, Feature Location: (Chr: I, complement(1808213..1810405)) Genomic Location: 1808213-1810405
MLKTVLMPSPSKCSLMAKRDQDWSPNQLRLDILDQNARDADPRGTGFDYAEEFQELDLDAVKADLEELMTSSQDWWPADYGHYGPLFIRMAWHSAGTYRTTDGRGGASGGRQRFAPLNSWPDNANLDKARRLLWPIKKKYGRKLSWADLIVLAGNHAIESMGLKTFGWAGGREDAFEPDEAVDWGPEDEMEAHQSERRTDDGELKEPLGAAVMGLIYVDPEGPNGNPDPLASAENIRESFGRMAMNDEETAALIAGGHTFGKVHGADDPEENLGDVPEDAPIEQMGLGWENDYGSGKAGDTITSGIEGPWTQAPIEWDNGYIDNLLDYEWEPEKGPGGAWQWTPTDEALANTVPDAHDPSEKQTPMMLTTDIALKRDPDYREVMERFQENPMEFGINFARAWYKLIHRDMGPPERFLGPDAPDEEMIWQDPVPDVDHDLIGDEEVAELKTDILETDLTVSQLVKTAWASASTYRDSDKRGGANGARIRLEPQKNWEVNEPAQLETVLATLEEIQAEFNSARTDDTRVSLADLIVLGGNAAVEQAAADAGYDVTVPFEPGRTDATPEQTDVDSFEALKPRADGFRNYARDDVDVPAEELLVDRADLLDLTPEEMTVLVGGLRSLGATYQDSDLGVFTDEPGTLTNDFFEVVLGMDTEWEPVSESKDVFEGYDRETGEQTWAASRVDLIFGSHSRLRAIAEVYGADGAEAELVDDFVDAWHKVMRLDRFDLE
>sp|B2TJE9|KATG_CLOBB Catalase-peroxidase OS=Clostridium botulinum (strain Eklund 17B / Type B) GN=katG PE=3 SV=1
MTENKCPVTGKMGKATAGSGTTNKDWWPNQLNLNILHQNSQLSNPMSKDFNYAEEFKKLD FQALKVDLYMLMTDSQIWWPADYGNYGPLFIRMAWHSAGTYRVGDGRGGGSLGLQRFAPL NSWPDNINLDKARRLLWPIKKKYGNKISWADLLILTGNCALESMGLKTLGFGGGRVDVWE PQEDIYWGSEKEWLGDEREKGDKELENPLAAVQMGLIYVNPEGPNGNPDPLGSAHDVRET FARMAMNDEETVALIAGGHTFGKCHGAASPSYVGPAPEAAPIEEQGLGWKNTYGSGNGDD TIGSGLEGAWKANPTKWTMGYLKTLFKYDWELVKSPAGAYQWLAKNVDEEDMVIDAEDST KKHRPMMTTADLGLRYDPIYEPIARNYLKNPEKFAHDFASAWFKLTHRDMGPISRYLGPE VPKESFIWQDPIPLVKHKLITKKDITHIKKKILDSGLSISDLVATAWASASTFRGSDKRG GANGGRIRLEPQKNWEVNEPKKLNNVLNTLKQIKENFNSSHSKDKKVSLADIIILGGCVG IEQAAKRAGYNINVPFIPGRTDAIQEQTDVKSFAVLEPKEDGFRNYLKTKYVVKPEDMLI DRAQLLTLTAPEMTVLIGGMRVLNCNYNKSKDGVFTNRPECLTNDFFVNLLDMNTVWKPK SEDKDRFEGFDRETGELKWTATRVDLIFGSNSQLRAIAEVYACDDNKEKFIQDFIFAWNK IMNADRFEIK
>sp|Q59635|CATB_PSEAE Catalase OS=Pseudomonas aeruginosa (strain ATCC 15692 / PAO1 / 1C / PRS 101 / LMG 12228) GN=katB PE=3 SV=1
MNPSLNAFRPGRLLVAASLTASLLSLSVQAATLTRDNGAPVGDNQNSQTAGPNGSVLLQD VQLLQKLQRFDRERIPERVVHARGTGAHGEFVASADISDLSMAKVFRKGEKTPVFVRFSA VVHGNHSPETLRDPRGFATKFYTADGNWDLVGNNFPTFFIRDAIKFPDMVHAFKPDPRSN LDDDSRRFDFFSHVPEATRTLTLLYSNEGTPASYREMDGNSVHAYKLVNARGEVHYVKFH WKSLQGQKNLDPKQVAEVQGRDYSHMTNDLVSAIRKGDFPKWDLYIQVLKPEDLAKFDFD PLDATKIWPGIPERKIGQMVLNRNVDNFFQETEQVAMAPSNLVPGIEPSEDRLLQGRLFA YADTQMYRVGANGLGLPVNRPRSEVNTVNQDGALNAGHSTSGVNYQPSRLDPREEQASAR YVRTPLSGTTQQAKIQREQNFKQTGELFRSYGKKDQADLIASLGGALAITDDESKYIMLS YFYKADSDYGTGLAKVAGADLQRVRQLAAKLQD

The first of these is a cyanobacterial katG (large subunit) type of catalase, perhaps representative of primitive protobacterial catalase. The second sequence in the above list is classic Staphylococcus catalase (katE). The third is a manganese-containing catalase from Lactobacillus. (This brought the most hits, by the way.) The others are, in turn, katA catalase from Proteus and Streptomyces, two organisms that are far apart in genomic G+C content (and rather distant phylogenetically); an Archaeal catalase (even though none of the 1,373 species in my organism list was Archaeal in origin; but you never know whether a given bacterium may have obtained its catalase through horizontal gene transfer); then a known-valid anaerobic catalase from Clostridium botulinum, and finally a Pseudomonas katB catalase. The idea was to cover as much ground, phylogenetically and enzymatically, as possible, with big and small-subunit catalases, of the heme as well as the manganese variety, from aerobic and anaerobic bacteria of high and low genomic G+C content, as well as an archaeal catalase for good measure.

Here, then, finally, is the list of 87 catalase-positive strict anaerobes:

Acetohalobium arabaticum strain DSM 5501
Alkaliphilus metalliredigens strain QYMF
Alkaliphilus oremlandii strain OhILAs
Anaerococcus prevotii strain ACS-065-V-Col13
Anaerococcus vaginalis strain ATCC 51170
Anaerofustis stercorihominis strain DSM 17244
Anaerostipes caccae strain DSM 14662
Anaerostipes sp. strain 3_2_56FAA
Anaerotruncus colihominis strain DSM 17241
Bacteroides capillosus strain ATCC 29799
Bacteroides pectinophilus strain ATCC 43243
Brachyspira hyodysenteriae strain ATCC 49526; WA1
Brachyspira intermedia strain PWS/A
Brachyspira pilosicoli strain 95/1000
Candidatus Arthromitus sp. SFB-mouse-Japan
Carnobacterium sp. strain 17-4
Clostridium acetobutylicum strain ATCC 824
Clostridium asparagiforme strain DSM 15981
Clostridium bartlettii strain DSM 16795
Clostridium bolteae strain ATCC BAA-613
Clostridium botulinum A2 strain Kyoto
Clostridium butyricum strain 5521
Clostridium cellulovorans strain 743B
Clostridium cf. saccharolyticum strain K10
Clostridium citroniae strain WAL-17108
Clostridium clostridioforme strain 2_1_49FAA
Clostridium difficile QCD-37x79
Clostridium hathewayi strain WAL-18680
Clostridium hylemonae strain DSM 15053
Clostridium kluyveri strain DSM 555
Clostridium lentocellum strain DSM 5427
Clostridium leptum strain DSM 753
Clostridium ljungdahlii strain ATCC 49587
Clostridium novyi strain NT
Clostridium ramosum strain DSM 1402
Clostridium saccharolyticum strain WM1
Clostridium scindens strain ATCC 35704
Clostridium spiroforme strain DSM 1552
Clostridium sporogenes strain ATCC 15579
Clostridium tetani strain Massachusetts substrain E88
Coprobacillus sp. strain 3_3_56FAA
Coprococcus comes strain ATCC 27758
Coprococcus sp. strain ART55/1
Dethiobacter alkaliphilus strain AHT 1
Dorea formicigenerans strain 4_6_53AFAA
Dorea longicatena strain DSM 13814
Erysipelotrichaceae bacterium strain 21_3
Eubacterium dolichum strain DSM 3991
Eubacterium eligens strain ATCC 27750
Eubacterium siraeum strain 70/3
Eubacterium ventriosum strain ATCC 27560
Flavonifractor plautii strain ATCC 29863
Halothermothrix orenii strain DSM 9562; H 168
Holdemania filiformis strain DSM 12042
Lachnospiraceae bacterium strain 1_1_57FAA
Lactobacillus curvatus strain CRL 705
Lactobacillus sakei subsp. sakei strain 23K
Mahella australiensis strain 50-1 BON
Natranaerobius thermophilus strain JW/NM-WN-LF
Oscillibacter valericigenes strain Sjm18-20
Parabacteroides distasonis strain ATCC 8503
Parabacteroides johnsonii strain DSM 18315
Parabacteroides sp. strain D13
Pediococcus acidilactici strain DSM 20284
Pediococcus pentosaceus strain ATCC 25745
Pelotomaculum thermopropionicum strain SI
Pseudoflavonifractor capillosus strain ATCC 29799
Pseudoramibacter alactolyticus strain ATCC 23263
Roseburia hominis strain A2-183
Roseburia intestinalis strain M50/1
Ruminococcaceae bacterium strain D16
Ruminococcus bromii strain L2-63
Ruminococcus obeum strain A2-162
Ruminococcus sp. strain 18P13
Ruminococcus torques strain L2-14
Sphaerochaeta pleomorpha strain Grapes
Spirochaeta coccoides strain DSM 17374
Spirochaeta sp. strain Buddy
Subdoligranulum sp. strain 4_3_54A2FAA
Tepidanaerobacter sp. strain Re1
Thermoanaerobacter brockii subsp. finnii strain Ako-1
Thermoanaerobacter ethanolicus strain CCSD1
Thermoanaerobacter pseudethanolicus strain 39E; ATCC 33223
Thermoanaerobacter sp. strain X514
Thermosediminibacter oceani strain DSM 16646
Treponema brennaborense strain DSM 12168
Turicibacter sanguinis strain PC909

Note that these are all bacteria; no archaeons are included. (And yes, there are catalase-positive anaerobes among the Archaea.) The reason you don't see Bacteroides fragilis (which is catalase-positive) on the list is that, as explained before, B. fragilis ended up being classified an aerobe by my cytochrome-oxidase-based initial search. Even though "everybody knows" B. fragilis is anaerobic.

Incidentally, Blast searches were done with an E-value cutoff of 1e-5, to reduce the chance of false positives. (The E-value is the number of matches with an equal or better score that would be expected to turn up by chance in a database of the size searched. For small values it is nearly the same as a probability, so a threshold of 1e-5 means the only matches that will be accepted are those with roughly less than a 1-in-100,000 chance of arising by chance.)
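To make the relationship between E-value and probability concrete, here is a small Python sketch. It relies on the standard assumption that chance hits follow a Poisson distribution, so the probability of at least one chance hit is P = 1 - exp(-E). The function name is illustrative.

```python
import math

def evalue_to_probability(evalue):
    """Probability of seeing at least one chance hit, given expected count E.

    Models chance hits as Poisson-distributed: P = 1 - exp(-E).
    """
    if evalue < 0:
        raise ValueError("E-value cannot be negative")
    # expm1 keeps precision when E is very small.
    return -math.expm1(-evalue)

for e in (1e-5, 1e-10, 1.0):
    print(e, evalue_to_probability(e))
```

For tiny E-values the two numbers are practically identical; they only part ways as E approaches 1, where the probability of a chance hit is about 0.63.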

If you learn of any other catalase-positive anaerobes that should be on this list, do be sure to let me know!


A Simple Method for Estimating the Rate of Transition vs. Transversion Mutations

Point mutations in DNA fall into two types: transition mutations, and transversion mutations. (See graphic below.)

In a transition mutation, a purine is swapped for a different purine (for example, adenine is swapped with guanine, or vice versa), or a pyrimidine is swapped with another pyrimidine (C for T or T for C); and usually, if a purine is swapped on one strand, the corresponding pyrimidine gets swapped on the other. Thus, a GC pair gets changed out for an AT pair, or vice versa.

A transversion, on the other hand, occurs when a purine is swapped for a pyrimidine. In a pairwise sense, this means a GC pair becomes a TA pair (for example) or an AT pair gets changed out for CG, or possibly AT for TA, or GC for CG.
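The two definitions translate directly into code. Below is a minimal Python sketch that classifies a single-base substitution on one strand; the function and constant names are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases:
        raise ValueError("bases must be A, C, G, or T")
    if ref == alt:
        raise ValueError("not a substitution")
    # Same chemical class (purine to purine, or pyrimidine to pyrimidine)
    # is a transition; crossing classes is a transversion.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"

for ref, alt in [("A", "G"), ("C", "T"), ("G", "T"), ("A", "T")]:
    print(ref, "->", alt, classify_substitution(ref, alt))
```

Of the twelve possible single-base substitutions, four are transitions and eight are transversions, which is why a transition/transversion ratio above 0.5 already indicates a bias toward transitions.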

Of the two types of mutation, transitions are more common. We also know that, in particular, GC-to-AT transitions are much more common than AT-to-GC transitions, for reasons that are well understood but that I won't discuss here. If you're curious to know what the experimental evidence is for the greater rate of GC-to-AT transitions, see Hall's 1991 Genetica paper (paywall protected, unfortunately) or the non-paywall-protected Y2K J. Bact. paper by Zhao. The latter paper is interesting because it shows that GC-to-AT transitions are more common in stationary-phase cells than in exponentially growing cells, and that transitions in stationary E. coli are repaired by MutS and MutL gene products. (Overexpression of those two genes results in fewer transitions. Mutation of those two genes results in more transitions.)

An open question in molecular genetics is: What are the relative rates of transitions versus transversions, in natural populations? We know transitions are more common, but by what factor? Questions like this are tricky to answer, for a variety of reasons, and the answers obtained tend to vary quite a bit depending on the organism and methodology used. Van Bers et al. found a transition/transversion ratio (usually symbolized as κ) of 1.7 in Parus major (a bird species). Zhang and Gerstein looked at human DNA pseudogenes and found transitions outnumber transversions "by roughly a factor of two." Setti et al. looked at a variety of bacteria and found that the transition/transversion rate ratio for mutations affecting purines was 2.1 whereas the rate ratio for pyrimidines was 6.6. Tamura and Nei looked at nucleotide substitutions in the control region of mitochondrial DNA in chimps and humans (a region known to evolve rapidly) and found κ to be approximately 15. Yang and Yoder looked at mitochondrial cytochrome b in 28 primate species and found an average κ of 6.4. (In general, κ values tend to be considerably higher for mitochondrial DNA than other types of DNA.)

It's important to note that in all likelihood, no single value of κ will be universally applicable to all genes in all lineages, because evolutionary pressures vary from gene to gene and the rates of transition and transversion are different for different nucleotides (and so codon usage biases come into play). For an introduction to the various considerations involved in trying to estimate κ, I recommend Yang and Nielsen's 2000 paper as well as their 1998 and 1999 papers.

The reason I bring all this up is that I want to offer yet another possible way of estimating the transition/transversion rate ratio κ, using DNA composition statistics. Earlier, I presented data showing that the purine (A+G) content of coding regions of DNA correlates directly with genome A+T content. Analyzing the genomes of representatives of 260 bacterial genera, I came up with the following graph of purine mole-percent versus A+T mole-percent:

The correlation between genome A+T content and mRNA purine content is strong and positive (r=0.852). Szybalski's Rule says that message regions tend to be purine-rich, but that's not exactly accurate. When genome A+T content is below approximately 35%, coding regions are richer in pyrimidines than purines. Above 35%, purines predominate. The concentration of purines in the mRNA-synonymous strand of DNA rises steadily with genome A+T content. It rises with a slope of 0.13013.
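A slope and correlation coefficient of this kind come from an ordinary least-squares fit. Here is a self-contained Python sketch of that calculation, assuming one (A+T, A+G) pair per organism. The function name `fit_line` and the three toy points are made up for illustration and are not the data behind the graph.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y on x.

    Returns (slope, intercept, pearson_r).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Toy points (made up): x is the A+T fraction, y is the A+G fraction.
at_values = [0.30, 0.50, 0.70]
ag_values = [0.48, 0.50, 0.52]
slope, intercept, r = fit_line(at_values, ag_values)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}, r = {r:.3f}")
```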

If you try to envision evolution taking an organism from one location on this graph to another, you can imagine that GC-to-AT transitions will move an organism to the right, whereas AT-to-GC transitions will move it to the left. To a first approximation (only!) we can say that horizontal movement on this graph essentially represents the net effect of transitions.

Vertical movement on this graph clearly involves transversions, because a net change in relative A+G content implies nothing less. To a very good first approximation, vertical movement in the graph corresponds to transversions.

Therefore, a good approximation of the relative rate of transitions versus transversions is given by the inverse of the slope. The value comes to 1.0/0.13013, or κ = 7.6846.

In an earlier post, I presented a graph like the one above applicable to mitochondrial DNA (N=203 mitochondrial genomes), which had a slope of 0.06702. Taking the inverse of that slope, we get a value of κ = 14.92, which is in excellent agreement with Tamura and Nei's estimate of 15 for mitochondrial κ.

When I made a purine plot using plant and animal virus genomes (N=536), the rise rate (slope) was 0.23707, suggesting a κ value of 4.218. This agrees well with the transition/transversion rate for hepatitis C virus (as measured by Machida et al.) of 1.5 to 7.0 depending on the gene.

In short, we get very reasonable estimates of κ from calculations involving the slope of the A+G vs. A+T graph, across multiple domains.
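The arithmetic behind the three estimates is simple enough to check in a few lines of Python. The slopes are the ones quoted above; the function name `kappa_from_slope` is illustrative, and the calculation rests on the same first approximation (horizontal movement is due to transitions, vertical movement to transversions).

```python
def kappa_from_slope(slope):
    """Estimate the transition/transversion ratio as the inverse of the
    slope of the A+G versus A+T regression line."""
    if slope <= 0:
        raise ValueError("slope must be positive for this estimate")
    return 1.0 / slope

# Slopes as reported in the text.
reported_slopes = {
    "bacteria (260 genera)": 0.13013,
    "mitochondria (N=203)": 0.06702,
    "plant and animal viruses (N=536)": 0.23707,
}

for label, slope in reported_slopes.items():
    print(f"{label}: kappa = {kappa_from_slope(slope):.2f}")
```

Because κ is the reciprocal of a small number, it is sensitive to the fitted slope: a shallower regression line translates into a noticeably larger κ.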

The main methodological proviso that applies here has to do with the fact that technically, some horizontal movement on the graph can be accomplished with transversions (AT-to-CG, for example). We made a simplifying assumption that all horizontal movement was due to transitions. That assumption is not strictly true (although it is approximately true, since transitions do outnumber transversions; and some transversions, such as AT<-->TA and GC<-->CG, have no effect on genome A+T content). Bottom line, my method of estimating κ probably overestimates κ somewhat, by including a small proportion of AT<-->CG transversions in the numerator. Even so, the estimates agree well with other estimates, tending to validate the general approach.

I invite comments from knowledgeable specialists.

The Trouble with Darwin

As a biologist, I find Darwin's theory hugely disappointing. It's better than the alternative (which is to believe in magic, basically), but not by much, sadly.

Charles Darwin died before Mendel proved the existence of genes.

As scientific theories go, the theory of evolution is easily the weakest of all major scientific theories. It's a commendable piece of work in its ability to stir discussion, but terrible in most other ways.

To be useful, a scientific theory has to do a minimum of two things: explain what can be observed, and provide testable predictions. Darwin's theory is weak on the first count and useless on the second.

Evolutionary theory explains practically nothing, because every explanation of the theory is rooted in "survival of the fittest," which is a circular notion, utterly content-free. "Fittest" means most able to survive. Survival of the fittest means survival of those who survive.

Ironically, Darwin's landmark work was called On the Origin of Species. Yet it doesn't actually explain speciation, except in the most vacuous and speculative of terms. Of course, we can't set too high an expectation for Darwin, since he didn't live to see the publication of Mendel's work (the word "genetics" wouldn't exist until more than 20 years after Darwin's death), but still. Speciation is portrayed by Darwin as the outcome of the accumulation of small, gradual changes. That's all the explanation he offers.

But the explanation is wrong. Or at least it doesn't accord well with the facts. It doesn't explain the Cambrian Explosion, for example, or the sudden appearance of intelligence in hominids, or the rapid recovery (and net expansion!) of the biosphere in the wake of at least five super-massive extinction events in the most recent 15% of Earth's existence.

One of the most frustrating aspects of evolutionary theory (this is no fault of the theory's, though) is that it is so hard to test in the laboratory. The fact is, no one has ever seen speciation happen in the laboratory, under repeatable conditions, and until that happens we're at a distinct disadvantage for understanding speciation. (Incidentally, I don't count plant hybridization or breeding anomalies in fruit flies whose sexuality is under the control of microbial endosymbionts as examples of speciation.)

When I was in school, we were taught that mutations in DNA were the driving force behind evolution, an idea that is now thoroughly discredited. The overwhelming majority of non-neutral mutations are deleterious (they reduce, not increase, survival). Most mutations lead to loss of function (this is easily demonstrated in the lab), not gain of function. Evolutionary theory is great at explaining things like the loss of eyesight by cave-dwelling creatures (e.g., blind cave fish). It's terrible at explaining gain of function.

Even if mutations were capable of driving evolution, they simply don't happen fast enough to account for observed rates of speciation. In bacteria, the measured rate of 16S rRNA divergence due to point mutations is only 1% per 50 million years. And yet, there were no flowering plants on earth as recently as 150 million years ago! Does it take a biologist to see the disconnect?

I bring all this up because I've spent some time recently doing genomics research aimed at exploring mechanisms for new-protein creation/differentiation (mechanisms not relying wholly nor even mainly on point mutations), and I wanted to set the stage for discussing that research here. Over the next week or so, I'll be presenting some new ideas and findings. Hopefully, we can put some much-needed flesh on Darwin by exploring testable notions of how new protein motifs can arise quickly (without reliance on magic).

A New Biological Constant?

Earlier, I gave evidence for a surprising relationship between the amount of G+C (guanine plus cytosine) in DNA and the amount of "purine loading" on the message strand in coding regions. The fact that message strands are often purine-rich is not new, of course; it's called Szybalski's Rule. What's new and unexpected is that the amount of G+C in the genome lets you predict the amount of purine loading. Also, Szybalski's rule is not always right.

Genome A+T content versus message-strand purine content (A+G) for 260 bacterial genera. Chargaff's second parity rule predicts a horizontal line at Y = 0.50. (Szybalski's rule says that all points should lie at or above 0.50.) Surprisingly, as A+T approaches 1.0, A/T approaches the Golden Ratio.

When you look at coding regions from many different bacterial species, you find that if a species has DNA with a G+C content below about 68%, it tends to have more purines than pyrimidines on the message strand (thus purine-rich mRNA). On the other hand, if an organism has extremely GC-rich DNA (G+C > 68%), a gene's message strand tends to have more pyrimidines than purines. What it means is that Szybalski's Rule is correct only for organisms with genome G+C content less than 68%. And Chargaff's second parity rule (which says that A=T and G=C even within a single strand of DNA) is flat-out wrong all the time, except at the 68% G+C point, where Chargaff is right now and then by chance.

Since the last time I wrote on this subject, I've had the chance to look at more than 1,000 additional genomes. What I've found is that the relationship between purine loading and G+C content applies not only to bacteria (and archaea) and eukaryotes, but to mitochondrial DNA, chloroplast DNA, and virus genomes (plant, animal, phage), as well.

The accompanying graphs tell the story, but I should explain a change in the way these graphs are prepared versus the graphs in my earlier posts. Earlier, I plotted G+C along the X-axis and purine/pyrimidine ratio on the Y-axis. I now plot A+T on the X-axis instead of G+C, in order to convert an inverse relationship to a direct relationship. Also, I now plot A+G (purines, as a mole fraction) on the Y-axis. Thus, X- and Y-axes are now both expressed in mole fractions, hence both are normalized to the unit interval (i.e., all values range from 0..1).
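For readers who want to compute the two plotted quantities themselves, here is a minimal Python sketch. It assumes you already have the message-strand coding sequence as a plain string; the function name `base_fractions` and the toy sequence are hypothetical, and this direct counting is a stand-in for working from codon usage tables.

```python
from collections import Counter


def base_fractions(message_strand: str) -> tuple[float, float]:
    """Return (A+T, A+G) as mole fractions of the A, C, G, T bases in the strand."""
    counts = Counter(message_strand.upper())
    total = sum(counts[base] for base in "ACGT")  # ignore N and other symbols
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    at = (counts["A"] + counts["T"]) / total  # X-axis value
    ag = (counts["A"] + counts["G"]) / total  # Y-axis value (purines)
    return at, ag


# Toy sequence for illustration only; not taken from any real genome.
toy_cds = "ATGAAAGGTCGTTAA"
x, y = base_fractions(toy_cds)
print(f"A+T = {x:.4f}, A+G = {y:.4f}")
```

Both values are fractions of the same base total, so each point lands in the unit square, as described above.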

The graph above shows the relationship between genome A+T content and purine content of message strands in genomes for 260 bacterial genera. The straight line is regression-fitted to minimize the sum of squared absolute error. (Software by http://zunzun.com.) The line conforms to:

y = a + bx

where:

a =  0.45544384965539358
b =  0.14454244707261443

The line predicts that if a genome were to consist entirely of G+C (guanine and cytosine), it would be 45.54% guanine, whereas if (in some mythical creature) the genome were to consist entirely of A+T (adenine and thymine), adenine would comprise 59.99% of the DNA. Interestingly, the 95% confidence interval permits a value of 0.61803 at X = 1.0, which would mean that as guanine and cytosine diminish to zero, A/T approaches the Golden Ratio.
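The endpoint predictions are easy to reproduce from the two fitted coefficients. The sketch below (variable and function names are my own, chosen for illustration) evaluates the line at X = 0 and X = 1, computes the implied A/T ratio at X = 1, and also solves for the A+T value at which the line crosses Y = 0.50, the point where purines and pyrimidines balance.

```python
# Fitted coefficients for the 260 bacterial genera, as quoted in the text.
a = 0.45544384965539358  # vertical offset (A+G when A+T = 0)
b = 0.14454244707261443  # slope


def purine_fraction(at_fraction: float) -> float:
    """Predicted message-strand A+G mole fraction for a given genome A+T mole fraction."""
    return a + b * at_fraction


y_gc_only = purine_fraction(0.0)  # all G+C: A+G is just G
y_at_only = purine_fraction(1.0)  # all A+T: A+G is just A

# At X = 1.0 there is no G or C, so T = 1 - A and the ratio is A / (1 - A).
a_over_t = y_at_only / (1.0 - y_at_only)

# Where the line crosses Y = 0.50 (purines equal pyrimidines).
at_at_parity = (0.5 - a) / b
gc_at_parity = 1.0 - at_at_parity

print(f"A+G at A+T = 0: {y_gc_only:.4f}")
print(f"A+G at A+T = 1: {y_at_only:.4f}")
print(f"A/T at A+T = 1: {a_over_t:.4f}")
print(f"G+C where A+G = 0.50: {gc_at_parity:.4f}")
```

At the fitted central values the A/T ratio at X = 1 comes out near 1.5; the Golden Ratio (about 1.618) corresponds to Y = 0.61803, which sits above the central estimate. The crossover lands at a G+C of roughly 0.69, close to the "about 68%" break point.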

Do the most primitive bacteria (Archaea) also obey this relationship? Yes, they do. In preparing the graph below, I analyzed codon usage in 122 Archaeal genera to obtain A, G, T, and C relative proportions in coding regions of genes. As you can see, the same basic relationship exists between purine content and A+T in Archaea as in Eubacteria. Regression analysis yielded a line with a slope of 0.16911 and a vertical offset of 0.45865. So again, it's possible (or maybe it's just a very strange coincidence) that A/T approaches the Golden Ratio as A+T approaches unity.

Analysis of coding regions in 122 Archaea reveals that the same relationship exists between A+T content and purine mole-fraction (A+G) as exists in eubacteria.

For the graph below, I analyzed 114 eukaryotic genomes (everything from fungi and protists to insects, fish, worms, flowering and non-flowering plants, mosses, algae, and sundry warm- and cold-blooded animals). The slope of the generated regression line is 0.11567 and the vertical offset is 0.46116.

Eukaryotic organisms (N=114).

Mitochondria and chloroplasts (see the two graphs below) show a good bit more scatter in the data, but regression analysis still comes back with positive slopes (0.06702 and 0.13188, respectively) for the line of least squared absolute error.

Mitochondrial DNA (N=203).

Chloroplast DNA (N=227).

To see if this same fundamental relationship might hold even for viral genetic material, I looked at codon usage in 229 varieties of bacteriophage and 536 plant and animal viruses ranging in size from 3 kilobases to over 200 kilobases. Interestingly enough, the relationship between A+T and message-strand purine loading does indeed apply to viruses, despite the absence of dedicated protein-making machinery in a virion.

Plant and animal viruses (N=536).

Bacteriophage (N=229).

For the 536 plant and animal viruses (above left), the regression line has a slope of 0.23707 and reaches Y = 0.62337 at X = 1.0. For bacteriophage (above right), the line's slope is 0.13733 and the vertical offset is 0.46395, so the line reaches Y = 0.60128 at X = 1.0. (When inspecting the graphs, take note that the vertical-axis scaling is not the same for each graph. Hence the slopes are deceptive.) So again, it's possible A/T approaches the Golden Ratio as A+T approaches 100%.
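To compare the groups side by side, the following Python sketch tabulates the predicted A+G value at X = 1.0 for every group whose slope and offset are both quoted here, and measures the gap from 0.61803. For the plant and animal viruses only the slope and the value at X = 1.0 are quoted, so the offset is back-computed; that step, and all the names used, are assumptions of this sketch.

```python
# Target value at X = 1.0 for A/T to equal the Golden Ratio (about 0.61803).
GOLDEN_RATIO_CONJUGATE = (5 ** 0.5 - 1) / 2

# (vertical offset, slope) pairs as quoted in the text.
fits = {
    "bacteria": (0.45544384965539358, 0.14454244707261443),
    "archaea": (0.45865, 0.16911),
    "eukaryotes": (0.46116, 0.11567),
    "bacteriophage": (0.46395, 0.13733),
}

# Value of the regression line at X = 1.0 is simply offset + slope.
value_at_unity = {group: offset + slope for group, (offset, slope) in fits.items()}

# Plant and animal viruses: slope and value at X = 1.0 are quoted,
# so the offset is inferred here rather than taken from the text.
virus_slope = 0.23707
virus_value_at_unity = 0.62337
virus_offset = virus_value_at_unity - virus_slope
value_at_unity["plant and animal viruses"] = virus_value_at_unity

for group, y1 in value_at_unity.items():
    gap = y1 - GOLDEN_RATIO_CONJUGATE
    print(f"{group:26s} Y(1.0) = {y1:.5f}  gap from 0.61803 = {gap:+.5f}")

print(f"inferred virus offset = {virus_offset:.5f}")
```

The gaps are point estimates only; whether 0.61803 falls inside each line's confidence interval depends on the scatter in the underlying data, which this sketch does not model.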

The fact that viral nucleic acids follow the same purine trajectories as their hosts perhaps shouldn't come as a surprise, because viral genetic material is (in general) highly adapted to host machinery. Purine loading appropriate to the A+T milieu is just another adaptation.

It's striking that so many genomes, from so many diverse organisms (eubacteria, archaea, eukaryotes, viruses, bacteriophages, plus organelles), follow the same basic law of approximately

A+G = 0.46 + 0.14 * (A+T)

The above law is as universal a law of biology as I've ever seen. The only question is what to call the slope term. It's clearly a biological constant of considerable significance. Its physical interpretation is clear: It's the rate at which purines are accumulated in mRNA as genome A+T content increases. It says that a 1% increase in A+T content (or a 1% decrease in genome G+C content) is worth a 0.14% increase in purine content in message strands. Maybe it should be called the purine rise rate? The purine amelioration rate?

Biologists, please feel free to get in touch to discuss. I'm interested in hearing your ideas. Reach out to me on LinkedIn, or simply leave a comment below.
