unemployment depression: mutation


Strand Asymmetry in Mitochondrial DNA

Funny how the availability of so much free DNA data can go to your head. When I learned that DNA sequence data for more than 2,000 mitochondrial genomes could be accessed, free, at genomevolution.org, I couldn't resist: I wrote some scripts that checked the DNA composition of 2,543 mtDNA (mitochondrial DNA) sequences. What I found blew me away.

If you're a biologist, you're accustomed to thinking of genome G+C (guanine plus cytosine) content as a kind of phylogenetic signature. (Related organisms usually have G+C values that are fairly close to one another.) For purposes of the following discussion, I'm going to reference A+T content, which is, of course, just one-minus-GC. (A GC content of 0.25, or 25%, means the AT content is 0.75, or 75%).

What I learned is that mitochondrial DNA shows strand asymmetry in coding regions (regions that actually get transcribed to RNA, as opposed to non-coding "control" regions and junk DNA). In particular, it shows an excess of pyrimidines (T and C) on the "message strand." This is the exact opposite of the situation in Archaea and bacteria, where message strands tend to accumulate purines (G and A).

The interesting thing is, just like bacteria (and Archaea), mitochondrial genomes tend to show a steady, predictable rate of increase of purines on the message strand with increasing A+T, even though purines are outnumbered by pyrimidines on the message strand. A picture might make this clearer:

Purine (A+G) content versus A+T for the message strand of mitochondrial DNA coding regions (N=2543).

Every point in this graph represents a mitochondrial genome (2,543 in all). As you can see, the regression line (which minimizes the sum of squared error) is upward-sloping, with a rise of 0.149, meaning that for every 10% increase in genome A+T content, there's a corresponding 1.49% increase in message-strand purine (A+G) content. What's striking about this is that in a similar graph for 1,373 bacterial genomes (see this post), the regression-line slope turned out to be 0.148. Chargaff's second parity law predicts a straight horizontal line at y=0.5. Obviously that law is kaput.
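For anyone who wants to try this at home, here is a minimal Python sketch of the two measurements behind the graph, plus an ordinary least-squares slope. It is not the script I actually ran; the function names are made up for illustration, and it assumes you have already extracted the message-strand coding sequence of each genome as a plain string of A, T, G, and C.

```python
def at_content(seq):
    """Fraction of bases that are A or T (equivalently, 1 minus GC content)."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq)


def purine_content(seq):
    """Fraction of bases that are purines (A or G)."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("G")) / len(seq)


def least_squares_slope(xs, ys):
    """Slope of the line that minimizes the sum of squared errors in y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

Feed `least_squares_slope` one (A+T, purine) pair per genome. A slope of 0.149 means that moving 0.10 to the right on the A+T axis raises the fitted purine fraction by 0.0149.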

I've written before about my repeated finding (in bacteria, Archaea, eukaryotes, viruses, bacteriophage; basically every place I look) that message-strand purine content accumulates in proportion to genome A+T content. Strand asymmetry with respect to purines and pyrimidines seems to be universal. But why?

Strand-asymmetric buildup of purines or pyrimidines is very hard to explain without invoking either a theory of strand-asymmetric DNA repair or a theory of strand-asymmetric mutagenesis, or both. Is it reasonable to suppose that one strand of DNA is more vulnerable to mutagenesis than another? Yes, if you accept that in a growing cell, the strands spend a good portion of their time apart (during transcription and replication). Neither replication nor transcription is symmetric in implementation. I'll spare you the details for the replication side of the argument, but suffice it to say, replication-related asymmetries are not likely (in my opinion) to be behind the purine/pyrimidine strand asymmetries I've been documenting. What we're seeing, I think, is the result of asymmetric repair at transcription time.

During transcription, a gene's DNA strands are separated. One strand is used as a template by RNA polymerase to create messenger RNA and ribosomal RNA. The other strand is free and floppy and vulnerable to attack by mutagens. But it's also readily accessible to repair enzymes.

The above diagram oversimplifies things considerably, but I include it for the benefit of non-biogeeks who might want to follow this argument through. Note that DNA strands have directionality: the sugar bonds face one way in one strand and the other way in the other strand. This is denoted by the so-called 5'-to-3' orientation of strands. (RNAP = RNA polymerase.)

DNA repair is a complex subject. Be assured, every cell, of every kind, has dozens of different kinds of enzymes devoted to DNA repair. Without these enzymes, life as we know it would end, because DNA is constantly undergoing attack and requiring repair.

The Ogg family of DNA base-excision enzymes exhibit
a signature helix-hairpin-helix topology (HhH). See
Faucher et al., Int J Mol Sci 2012; 13(6): 6711–6729.

Some types of repair take place in double-stranded DNA (that is, DNA that is not undergoing replication or transcription). Other types of repair apply to single-stranded DNA. In bacteria as well as higher life forms, there's a transcription-coupled repair system (TCRS) that comes into play when RNA polymerase is stalled by thymine dimers or other DNA damage. This remarkably elaborate system changes out short sections of damaged DNA (at considerable energy cost). Because it involves replacing whole nucleotides (sugar and all), it's categorized as a Nucleotide Excision Repair system (NER). The alternative to NER is Base Excision Repair (BER), which is where a defective base (usually an oxidized guanine) gets snipped out without removing any sugars from the DNA backbone. The enzymes that perform this base-clipping are generically known as glycosylases.

For many years, it was thought that mitochondria did not have DNA repair systems. We now know that's not true. Mitochondrial DNA is subject to constant oxidative attack and it turns out the damage is quickly repaired, in double-stranded DNA. Evidence for repair of single-stranded mtDNA is scant. Those who have looked for a transcription-coupled repair system (or indeed any NER system) in mitochondria have not found one. Mitochondrial BER (via Ogg1) does exist, but it seems to operate when the DNA is double-stranded, not during transcription. This makes sense, because for BER to finish, the strand must be nicked by AP endonuclease after the bad base is popped out, then the repair proceeds by matching the opposing base (opposite the abasic site) using the other strand as template. In Clostridia and Archaea (which have an Ogg enzyme that other bacteria do not have; see this post and this paper), Ogg1 can pop out a bad base while the DNA is single-stranded; Ogg1 then binds to the abasic site and is only released by AP endonuclease when it arrives later on.

Bottom line, we know that mitochondrial DNA spends much of its time in the unwound state (because mtDNA products are very highly transcribed) and that the non-transcribed DNA strand is extremely vulnerable to oxidative attack. (The template strand is less vulnerable, because it is cloaked in enzymes: RNA polymerase, transcription factors, ribosomes, etc.) We also know that 8-oxoguanine is the most prevalent form of oxidative damage in mtDNA and that, uncorrected, such damage leads to G-to-T transversion. The finding of consistently high pyrimidine content in the message strand of mitochondrial DNA (see graph further above) is consistent with a slower rate of repair of the non-transcribed strand, and the differential occurrence of G-to-T transversions on that strand. Or at least, that's a possible explanation of the pyrimidine richness of the message strand of mtDNA.

But there are additional factors to consider, such as selection pressure. Mitochondrial DNA tends to encode membrane-associated proteins, and membrane proteins use nonpolar amino acids, which are (in turn) predominantly encoded by pyrimidine-rich codons. More about this in an upcoming post.


DNA Repair 101

You don't have to be a biologist to know that anything that can damage DNA is potentially harmful, because it can cause mutations (which are, in fact, mostly harmful; very few mutations are beneficial). Fortunately, cells contain dozens of different kinds of repair enzymes, and most DNA damage is repaired quickly. When damage isn't repaired quickly (or properly), you have a mutation.

It's not much of a stretch to say that DNA repair enzymes play a front-and-center role in evolution (or at least the portion of evolution that's driven by mutations). Which is why molecular geneticists tend to pay a lot of attention to DNA repair processes. Anything that can affect the composition of DNA can change the course of evolution.

DNA is remarkably stable, chemically. Nonetheless, it is vulnerable to oxidative attack (by hydroxyl radicals, superoxides, nitric oxide, and other reactive oxygen species generated in the course of cell metabolism, never mind exogenous poisons).

Of the four bases in DNA—guanine (G), cytosine (C), adenine (A), thymine (T)—guanine is the most susceptible to oxidative attack. When it's exposed to an oxidant, it can form 7,8-dihydro-8-oxoguanine, OG for short. What can happen then is, the OG residue in DNA pivots around its ribosyl bond until the amino group is facing the other way (see diagram), and when that happens, OG can pair up with adenine instead of guanine's usual partner, cytosine.

When guanine is oxidized to form 7,8-dihydro-8-oxoguanine,
it mispairs with adenine instead of its usual partner, cytosine.

Rest assured, there are proofreading enzymes that can and will detect such funny business in short order. But if OG isn't detected and replaced with a normal guanine before replication occurs, OG may get paired up with an adenine during replication (and then it'll eventually be swapped out with thymine, adenine's usual partner). That's bad, because what it means is that a G:C pair ended up getting changed to a T:A pair. (The place of the G got taken first by OG and then T. The place of G's opposite-strand partner, C, eventually got taken by A.) In so many words: that's a mutation.
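If it helps to see the bookkeeping, here is a toy Python sketch that follows one base pair through two rounds of replication. It assumes the worst case, in which the OG is never repaired and always mispairs with adenine; the `"O"` symbol and the function name are made up for this illustration.

```python
# "O" stands for 8-oxoguanine (OG). In this toy model an unrepaired OG
# always pairs with A; the four normal bases pair as usual.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G", "O": "A"}


def replicate_pair(pair):
    """Split one base pair and return the two daughter pairs.

    Each old base serves as template for a new partner.
    """
    top, bottom = pair
    return [(top, PAIRING[top]), (PAIRING[bottom], bottom)]


damaged = ("O", "C")  # a G:C pair whose G has been oxidized
daughters = replicate_pair(damaged)
granddaughters = [p for d in daughters for p in replicate_pair(d)]
```

After the first round, one daughter carries OG opposite A. After the second round, the strand holding that A has picked up a T, so one of the four granddaughter duplexes has T:A where G:C used to be.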

It turns out there's a special enzyme designed to prevent the G↔T funny business we've just been talking about. It's called oxoguanine glycosylase, or Ogg1 for short. You'll sometimes see it called 8-oxoguanine-DNA-glycosylase, and from a capabilities standpoint it's often (wrongly) compared to the Fpg enzyme (formamidopyrimidine-DNA glycosylase), which is not the same as Ogg1 at all.

Just about all higher life forms have an Ogg1 enzyme (which clips OG out of DNA and ensures it gets replaced with a brand-new guanine before any funny business can happen). Surprisingly few bacteria have this enzyme, instead preferring to let the more general-purpose Fpg (MutM) take its place. If you run a Blast search of a reference Ogg1 gene (the Drosophila version works well) against all bacterial genomes, you'll get only a few hundred matches (out of around 10,000 sequenced bacterial genomes), the vast majority belonging to members of the class Clostridia (a truly fearsome group of anaerobic spore-formers containing the botulism germ, the tetanus bacterium, the notorious C. difficile—also known as C. diff—and some other creatures you probably don't want to meet). If you run the same Blast search against Archaea (this is the other major "germ-like" microbial domain, along with true bacteria), you'll get hits against almost every member species of the Archaea. Personally, I think it's likely the Ogg1 enzyme originated with a common ancestor of today's Archaea and Eukaryota, and arrived in Clostridia by lateral gene transfer (not terribly recently, though).

One thing is certain: E. coli does not have Ogg1, nor does Staphylococcus, nor Streptococcus, nor any germ you've ever heard of (other than the aforementioned Clostridia members, plus Archaea). And yet, every yeast and fungus has it, every plant, every fruit fly, every fish, every human—every higher life form. Ironically, only five members of Archaea turned up positive for the Fpg enzyme when I did a check, whereas almost all Eubacteria ("true bacteria") have it, including Clostridia. Bottom line, Clostridia have the best of both worlds: Fpg, plus Ogg1. Belt and suspenders, both.

This is just a tiny intro to the subject of DNA repair, which is a vast subject indeed. For more, see this article, or just start rummaging around in Google Scholar.


DNA G+C Content and Survival Value

One of biology's big open questions is why organisms differ so much with regard to the relative amounts of GC and AT in their DNA. You'd think that if there are only two kinds of DNA base pairs (see diagram) they'd be more-or-less equally abundant. Not so. There are organisms with DNA that's mostly GC (and/or CG) pairs; there are organisms with very-AT-rich DNA; and within the chromosomes of higher organisms you find large GC-rich regions (isochores) in the midst of great swaths of AT-rich DNA.

DNA contains adenine and thymine in equal amounts, and
guanine and cytosine in equal amounts, but it does not
usually contain GC pairs and AT pairs in equal amounts. And
it doesn't seem as if there is an "optimum" GC:AT ratio. The
GC:AT ratio varies by species. Within a species, it's constant.

There are two really odd facts at work here:

1. The GC content of DNA varies by species, and it varies a lot.

2. Evolution doesn't seem to trend toward an "optimum GC:AT ratio" of any kind.

If there were such a thing as an optimum GC:AT ratio for DNA, surely microorganisms would've figured it out by now. Instead, we find huge diversity: There are bacteria on every point in the GC% spectrum, running from 16% GC for the DNA of Candidatus Carsonella ruddii (a symbiont of the jumping plant louse) to 75% for Anaeromyxobacter dehalogenans 2CP-C (a soil bacterium). At each end of the spectrum you find aerobes and anaerobes; extremophiles and blandophiles; pathogens and non-pathogens. About the only generalization you can make is that the smaller an organism's genome is, the more likely it is to be rich in A+T (low GC%).

Genome size correlates loosely with GC content. The very smallest
bacteria tend to have AT-rich (low GC%) DNA.

The huge diversity in GC:AT ratios among bacteria is impressive. But does it simply represent a random walk all over the possibility-space of DNA? Or do the various points on the spectrum constitute special niches with important advantages? What advantage could there be for having high-GC% DNA? Or high-AT% DNA?

Some subtle clues tell us that this is not just random deviation from the mean. First, suppose we agree for sake of argument that lateral gene transfer (LGT) is common in the microbial world (a point of view I happen to agree with). Over the course of millions of years, with pieces of DNA of all kinds (high GC%, low GC%) flying back and forth, LGT should force a regression to the mean: It should make genomes tend toward a 50-50 GC:AT ratio. That clearly hasn't happened.

And then there are ordinary mutational pressures. It's beginning to be fairly well accepted (see Hershberg and Petrov, "Evidence That Mutation is Universally Biased Toward AT in Bacteria," PLoS Genetics, 2010, 6:9, e1001115, full version here) that natural mutation is strongly biased in the direction of AT by virtue of the fact that deamination of cytosine and methylcytosine (which occurs spontaneously at high frequency) leads to replacement of 'C' with 'T', hence GC pairs becoming AT pairs. The strong natural mutational bias toward AT says that all DNA should creep in the direction of low GC% and end up well below 50% GC. But again, this is not what we see. We see that high-GC organisms like Anaeromyxobacter (and many others) maintain their DNA's unusually high (75%) GC content across millions of generations. Even middle-of-the-road organisms like E. coli (with 50% GC content) don't slowly slip in the direction of high-AT/low-GC.

Clearly, something funny is going on. For a super-high-GC organism like Anaeromyxobacter to maintain its DNA's super-high GC content against the constant tug of mutations in the AT direction, it must be putting significant energy into maintaining that high GC percentage. But why? Why pay extra to maintain a high GC%? And how does the cost get paid?

I think I've come up with a possible answer. It has to do with DNA replication cost, where "cost" is figured in terms of time needed to synthesize a new copy of the DNA (for cell division). Anything that favors low replication cost (high replication speed) should favor survival; that's my main assumption.

My other assumption is that DNA polymerases (the enzymes involved in replication) are not clairvoyant. They can't know, until the need arises, which of the four deoxyribonucleotide triphosphates (dATP, dTTP, dGTP, dCTP) will be needed at a given moment, to elongate the new strand of DNA. When the need arises for (let's say) an 'A', the 'A' (in the form of dATP) has to come from an existing endogenous pool of dNTPs containing all four bases (dATP, dTTP, dGTP, dCTP) in whatever concentrations they're in. The enzyme has to wait until a dATP (if that's what's needed) randomly happens to lock into the active site. Odds are only one in four (assuming equal concentrations of dNTPs) of a dATP coming along at exactly the right moment. Odds are 3 out of 4 that some incorrect dNTP (either dGTP, dTTP, or dCTP) will try, and fail, to fit the active site first, before dATP comes along.

But imagine that your DNA is 75% G+C. And suppose you've regulated your intracellular metabolism to maintain dGTP and dCTP in a 3:1 ratio over dATP and dTTP. The odds of a good random "first hit" go up.
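Here is a back-of-the-envelope version of that claim in Python. It is only a sketch: it assumes G and C are equally common within the template and within the pool (likewise A and T), and that the pool is big enough for its makeup to stay fixed. The function name is made up for illustration.

```python
def first_hit_probability(template_gc, pool_gc):
    """Chance that a single random draw from the pool is the needed base.

    Assumes G = C and A = T in both template and pool, and a pool whose
    composition does not change (assumptions of this sketch).
    """
    # The needed base is a G or a C with probability template_gc, and
    # either one makes up pool_gc / 2 of the pool.
    gc_term = template_gc * (pool_gc / 2)
    # Otherwise the needed base is an A or a T, each (1 - pool_gc) / 2 of the pool.
    at_term = (1 - template_gc) * ((1 - pool_gc) / 2)
    return gc_term + at_term
```

With a 75%-GC template, an evenly mixed pool gives `first_hit_probability(0.75, 0.50)` = 0.25 (one in four), while a 75%-GC pool gives `first_hit_probability(0.75, 0.75)` = 0.3125 (five in sixteen). Note that a better first-hit chance is not the same thing as a shorter average wait, which is what the simulation measures.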

To simulate the various possibilities, I wrote software (in JavaScript) that simulates DNA replication, where the template DNA molecule is 1000 base-pairs in length and the dNTP pool size is 10000 bases. The software allows you to set the organism's genome GC% to whatever you want, and also set the dNTP pool's relative GC percentage to whatever you want. The template DNA is just a random string of A, T, G, and C bases (1000 total), reflecting their relative abundances as set in the GC% parameter. The pool of dNTPs is set up to be a randomized array (again reflecting abundances set in a GC% parameter).

The way the software works is this. Read a base off the template. Fetch a base randomly from the base pool. If the base happens to be the one (out of four) that's called for, score '1' for the timing parameter, and continue to read another base off the template. If the base was not the one that's called for, put it back in the pool array in a random location, then randomly fetch another base from the pool; and increment the timing parameter. (For each fetch, the timing parameter goes up by 1.) Keep fetching (and throwing back bases) until the proper base comes up, incrementing the time parameter as appropriate. (The time parameter keeps track of the number of fetch attempts.) When the correct base turns up, the pool shrinks by one base. In other words, replication consumes the pool, but as I said earlier, the pool contains ten times as many bases (to start) as the DNA template. So the pool ends up 10% smaller at the end of replication.
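The procedure just described can be sketched in Python as follows. This is a re-creation from the description above, not the original JavaScript, and every name in it is made up for illustration. Two details are my assumptions: the base "called for" is taken to be the complement of the template base, and a wrong base is simply left where it is, since throwing it back at a random location does not change the odds of the next random fetch.

```python
import random
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def make_bases(n, gc_fraction, rng):
    """Return a shuffled list of n bases with the requested G+C fraction."""
    n_gc = round(n * gc_fraction)
    n_at = n - n_gc
    bases = (["G"] * (n_gc // 2) + ["C"] * (n_gc - n_gc // 2)
             + ["A"] * (n_at // 2) + ["T"] * (n_at - n_at // 2))
    rng.shuffle(bases)
    return bases


def replicate(template, pool, rng):
    """Copy the template from the pool; return the number of fetch attempts.

    The pool list is modified in place: it loses one base per template base.
    """
    remaining = Counter(pool)
    fetches = 0
    for base in template:
        wanted = COMPLEMENT[base]
        if remaining[wanted] == 0:
            raise ValueError(f"pool ran out of {wanted}")
        while True:
            i = rng.randrange(len(pool))  # fetch a random base
            fetches += 1                  # every fetch costs one time unit
            if pool[i] == wanted:
                # Correct base: consume it, so the pool shrinks by one.
                pool[i] = pool[-1]
                pool.pop()
                remaining[wanted] -= 1
                break
            # Wrong base: it stays in the pool (thrown back).
    return fetches


def mean_fetches_per_base(template_gc, pool_gc, runs=100,
                          template_len=1000, pool_size=10000, seed=0):
    """Average fetches per template base over several Monte Carlo runs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        template = make_bases(template_len, template_gc, rng)
        pool = make_bases(pool_size, pool_gc, rng)
        total += replicate(template, pool, rng)
    return total / (runs * template_len)
```

Calling `mean_fetches_per_base(0.75, 0.65)` runs 100 replications of a 75%-GC template against a 65%-GC pool and returns the average number of fetches per base.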

Each point on this graph represents the average of 100 Monte Carlo runs, each run representing complete replication of a 1000-bp DNA template, drawing from a pool of 10,000 bases. The blue points are runs that used a DNA template containing 25% G+C content. The red points are runs that used DNA with 75% G+C. The X-axis represents different base-pool compositions. See text for details.

I ran Monte Carlo simulations for DNA templates having GC contents of 75%, 50%, and 25%, using base pools set up to have anywhere from 15% GC to 85% (in 2.5% increments). The results for the 75% GC and 25% GC templates (representing high- and low-GC organisms) are shown in the above graph. Each point on the graph represents the average of 100 complete replication runs. The Y-axis shows the average number of fetches per DNA base (so, a low value means fast replication; a high value means slower DNA replication). The X-axis shows the percentage of GC in the base-pool, in recognition of the fact that relative dNTP abundances in an organism may vary, in accordance with environmental constraints as well as with organism-specific homeostatic setpoints.

Maximal replication speed (the low point of each curve) happens at a base-pool GC percentage that is displaced in the direction of the DNA's own GC%. So, for the 25%-GC organism (blue data points), max replication efficiency comes when the base-pool is about 33% GC. For the 75% GC organism (red points) the sweet spot is at a base-pool GC concentration of 65%. (Why this is not exactly symmetrical with the other curve, I don't know; but bear in mind, these are Monte Carlo runs. Some variation is to be expected.)

The interesting thing to note is that max replication efficiency, for each organism, comes at 3.73 fetches per base-pair (Y-axis). Cache that thought. It'll be important in a minute.

The real jaw-dropper is what happens when you plot a curve for template DNA with 50% GC content. In the graph below, I've shown the 50%-GC runs as black points. (The red and blue points are exactly as before.)

This is the same graph as before, but with replication data for a 50%-GC genome (black points). Again, each data point represents the average of 100 Monte Carlo runs. Notice that the black curve bottoms out at a higher level (4.0) than the red or blue curves (3.73). This means replication is less efficient for the 50%-GC genome.

Notice that the best replication efficiency comes in the middle of the graph (no big surprise), but check the Y-value: 4.00. The very fastest DNA replication, when the DNA template is 50% GC, requires 4 fetches per base, compared to best-case base-fetching efficiency of 3.73 for the 25%-GC and 75%-GC DNAs. What does this mean? It means DNA replication, in a best-case scenario, takes about 6.75% fewer fetches for the skewed-GC organisms. (The difference between 3.73 and 4.00 is 0.27, which is 6.75% of 4.00.)
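The 3.73 and 4.00 figures can be cross-checked on paper, under simplifying assumptions that are not part of the simulation itself: treat the pool as so large that its composition never changes, and assume G = C and A = T in both template and pool. The wait for any one base is then geometric, with an average of one over that base's share of the pool. The symbols g, q, and F are introduced here just for this sketch.

```latex
% g = G+C fraction of the template, q = G+C fraction of the dNTP pool.
% A needed G or C is found with probability q/2 per fetch, a needed
% A or T with probability (1-q)/2, so the expected fetches per base are
F(g, q) = \frac{2g}{q} + \frac{2(1 - g)}{1 - q}

% Setting dF/dq = 0 gives the best pool composition and the minimum:
q^{*} = \frac{\sqrt{g}}{\sqrt{g} + \sqrt{1 - g}}, \qquad
F_{\min} = 2\left(\sqrt{g} + \sqrt{1 - g}\right)^{2} = 2 + 4\sqrt{g(1 - g)}

% g = 0.75:  q* \approx 0.634,  F_min = 2 + \sqrt{3} \approx 3.73
% g = 0.25:  q* \approx 0.366,  F_min \approx 3.73
% g = 0.50:  q* = 0.5,          F_min = 4
```

Under these assumptions the optimum sits at a pool GC of about 63% for the 75%-GC template and about 37% for the 25%-GC template, symmetric about 50% and close to the sweet spots seen in the Monte Carlo runs. The minimum is highest (exactly 4) at g = 0.5 and drops off on either side.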

This goes a long way toward explaining why GC extremism is stable in organisms that pursue it. There is replication efficiency to be had in keeping your DNA biased toward high or low GC. (It doesn't seem to matter which.)

Consider the dynamics of an ATP drawdown. The energy economy of a cell revolves around ATP, which is both an energy molecule and a source for the adenine that goes into DNA and RNA. One would expect normal endogenous concentrations of ATP to be high relative to other NTPs. For a low-GC% organism, that's also a near-ideal situation for DNA replication, because high AT in the base pool puts you near the max-replication-speed part of the curve (see blue points). A sudden drawdown in ATP (when the cell is in crisis) shifts replication speed to the right-hand part of the blue curve, slowing replication significantly. This is what you want if you're an intracellular symbiont (or a mitochondrion, incidentally). You want to stop dividing when the host cell is unable to divide because of an energy crisis.

Consider the high-GC organism (red dots), on the other hand. If ATP levels are high during normal metabolism, replication is not as efficient as it could be, but so what? It just means you're willing to tolerate less-efficient replication in good times. But as ATP draws down (perhaps because nutrients are becoming scarce), DNA replication actually becomes more efficient. This is what you want if you're a free-living organism in the wild. You want to be able to continue replicating your DNA even as ATP becomes scarce. And indeed that's what happens (according to the red data points): As the base pool becomes more GC-rich, replication efficiency increases. The best efficiency comes when base-pool A+T is down around 35%.

I think these simulations are meaningful and I think they help explain the DNA-composition extremism seen among microorganisms. If you're a professional scientist and you find these results tantalizing, and you'd like to co-author a paper for PLoS Genetics (or another journal), please get in touch. (My Google mail is kas-dot-e-dot-thomas.) I'd like to coauthor with someone who is good with statistics, who can contribute more ideas to this line of investigation. I think these results are worth sharing with the scientific community at large.
