Showing posts with label Chargaff's second rule. Show all posts

Strand Asymmetry in Mitochondrial DNA

Funny how the availability of so much free DNA data can go to your head. When I learned that DNA sequence data for more than 2,000 mitochondrial genomes could be accessed, free, at genomevolution.org, I couldn't resist: I wrote some scripts that checked the DNA composition of 2,543 mtDNA (mitochondrial DNA) sequences. What I found blew me away.

If you're a biologist, you're accustomed to thinking of genome G+C (guanine plus cytosine) content as a kind of phylogenetic signature. (Related organisms usually have G+C values that are fairly close to one another.) For purposes of the following discussion, I'm going to reference A+T content, which is, of course, just one minus the G+C content. (A G+C content of 0.25, or 25%, means the A+T content is 0.75, or 75%.)
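For anyone who wants to follow along in code, here is a minimal sketch of that bookkeeping. The function names are my own for illustration, and I'm assuming ambiguous bases (N and the like) are simply left out of the total.

```python
def at_content(seq: str) -> float:
    """Fraction of A and T among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    total = at + gc
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return at / total


def gc_content(seq: str) -> float:
    """G+C content, computed as one minus the A+T content."""
    return 1.0 - at_content(seq)


# Toy sequence (not real data): 6 of 8 bases are A or T.
print(at_content("AAATTTGC"), gc_content("AAATTTGC"))
```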

What I learned is that mitochondrial DNA shows strand asymmetry in coding regions (regions that actually get transcribed to RNA, as opposed to non-coding "control" regions and junk DNA). In particular, it shows an excess of pyrimidines (T and C) on the "message strand." This is the exact opposite of the situation in Archaea and bacteria, where message strands tend to accumulate purines (G and A).

The interesting thing is, just like bacteria (and Archaea), mitochondrial genomes tend to show a steady, predictable rate of increase of purines on the message strand with increasing A+T, even though purines are outnumbered by pyrimidines on the message strand. A picture might make this clearer:

Purine (A+G) content versus A+T for the message strand of mitochondrial DNA coding regions (N=2543).

Every point in this graph represents a mitochondrial genome (2,543 in all). As you can see, the regression line (which minimizes the sum of squared error) is upward-sloping, with a rise of 0.149, meaning that for every 10% increase in genome A+T content, there's a corresponding 1.49% increase in message-strand purine (A+G) content. What's striking about this is that in a similar graph for 1,373 bacterial genomes (see this post), the regression-line slope turned out to be 0.148.  Chargaff's second parity law predicts a straight horizontal line at y=0.5. Obviously that law is kaput.
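To make the slope arithmetic concrete, here is a small sketch of an ordinary least-squares fit in plain Python. This is not the script behind the graph; the five points are made up to lie exactly on a line of slope 0.149, and the intercept of 0.35 is an arbitrary assumption chosen only for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared error."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical genomes: A+T fraction on x, message-strand purine fraction on y.
at_values = [0.55, 0.60, 0.65, 0.70, 0.75]
purine_values = [0.149 * x + 0.35 for x in at_values]

slope, intercept = least_squares_line(at_values, purine_values)
# A slope of 0.149 means a 0.10 (10%) rise in A+T goes with
# a 0.0149 (1.49%) rise in purine content.
print(round(slope, 3), round(slope * 0.10, 4))
```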

I've written before about my repeated finding (in bacteria, Archaea, eukaryotes, viruses, bacteriophage; basically every place I look) that message-strand purine content accumulates in proportion to genome A+T content. Strand asymmetry with respect to purines and pyrimidines seems to be universal. But why?

Strand-asymmetric buildup of purines or pyrimidines is very hard to explain without invoking either a theory of strand-asymmetric DNA repair or a theory of strand-asymmetric mutagenesis, or both. Is it reasonable to suppose that one strand of DNA is more vulnerable to mutagenesis than another? Yes, if you accept that in a growing cell, the strands spend a good portion of their time apart (during transcription and replication). Neither replication nor transcription is symmetric in implementation. I'll spare you the details for the replication side of the argument, but suffice it to say, replication-related asymmetries are not likely (in my opinion) to be behind the purine/pyrimidine strand asymmetries I've been documenting. What we're seeing, I think, is the result of asymmetric repair at transcription time.

During transcription, a gene's DNA strands are separated. One strand is used as a template by RNA polymerase to create messenger RNA and ribosomal RNA. The other strand is free and floppy and vulnerable to attack by mutagens. But it's also readily accessible to repair enzymes.


The above diagram oversimplifies things considerably, but I include it for the benefit of non-biogeeks who might want to follow this argument through. Note that DNA strands have directionality: the sugar bonds face one way in one strand and the other way in the other strand. This is denoted by the so-called 5'-to-3' orientation of strands. (RNAP = RNA polymerase.)

DNA repair is a complex subject. Be assured, every cell, of every kind, has dozens of different kinds of enzymes devoted to DNA repair. Without these enzymes, life as we know it would end, because DNA is constantly undergoing attack and requiring repair.

The Ogg family of DNA base-excision enzymes exhibits a signature helix-hairpin-helix (HhH) topology. See Faucher et al., Int J Mol Sci 2012; 13(6): 6711–6729.

Some types of repair take place in double-stranded DNA (that is, DNA that is not undergoing replication or transcription). Other types of repair apply to single-stranded DNA. In bacteria as well as higher life forms, there's a transcription-coupled repair system (TCRS) that comes into play when RNA polymerase is stalled by thymine dimers or other DNA damage. This remarkably elaborate system changes out short sections of damaged DNA (at considerable energy cost). Because it involves replacing whole nucleotides (sugar and all), it's categorized as a Nucleotide Excision Repair (NER) system. The alternative to NER is Base Excision Repair (BER), in which a defective base (usually an oxidized guanine) gets snipped out without removing any sugars from the DNA backbone. The enzymes that perform this base-clipping are generically known as glycosylases.

For many years, it was thought that mitochondria did not have DNA repair systems. We now know that's not true. Mitochondrial DNA is subject to constant oxidative attack and it turns out the damage is quickly repaired, in double-stranded DNA. Evidence for repair of single-stranded mtDNA is scant. Those who have looked for a transcription-coupled repair system (or indeed any NER system) in mitochondria have not found one. Mitochondrial BER (via Ogg1) does exist, but it seems to operate when the DNA is double-stranded, not during transcription. This makes sense, because for BER to finish, the strand must be nicked by AP endonuclease after the bad base is popped out; the repair then proceeds by filling in the abasic site, using the opposing base on the other strand as template. In Clostridia and Archaea (which have an Ogg enzyme that other bacteria do not have; see this post and this paper), Ogg1 can pop out a bad base while the DNA is single-stranded; Ogg1 then binds to the abasic site and is only released by AP endonuclease when it arrives later on.

Bottom line, we know that mitochondrial DNA spends much of its time in the unwound state (because mtDNA products are very highly transcribed) and that the non-transcribed DNA strand is extremely vulnerable to oxidative attack. (The template strand is less vulnerable, because it is cloaked in enzymes: RNA polymerase, transcription factors, ribosomes, etc.) We also know that 8-oxoguanine is the most prevalent form of oxidative damage in mtDNA and that, uncorrected, such damage leads to G-to-T transversion. The finding of consistently high pyrimidine content in the message strand of mitochondrial DNA (see graph further above) is consistent with a slower rate of repair of the non-transcribed strand, and the differential occurrence of G-to-T transversions on that strand. Or at least, that's a possible explanation of the pyrimidine richness of the message strand of mtDNA.
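The bookkeeping behind that last point is simple: every G-to-T change swaps a purine for a pyrimidine on the strand where it happens. Here is a toy sketch of just that arithmetic. The 12-base strand and the two positions are made up, and this is in no way a model of real mutation or repair rates.

```python
def purine_fraction(strand: str) -> float:
    """Fraction of bases in the strand that are purines (A or G)."""
    strand = strand.upper()
    return (strand.count("A") + strand.count("G")) / len(strand)


def apply_g_to_t(strand: str, positions) -> str:
    """Return a copy of strand with G replaced by T at the given 0-based positions."""
    bases = list(strand.upper())
    for i in positions:
        if bases[i] != "G":
            raise ValueError(f"position {i} is {bases[i]}, not G")
        bases[i] = "T"
    return "".join(bases)


# Hypothetical message strand: 7 purines out of 12 bases.
before = "ATGGCGGATCCG"
# Two unrepaired 8-oxoguanine lesions ending up as G-to-T transversions.
after = apply_g_to_t(before, [2, 5])

print(before, purine_fraction(before))
print(after, purine_fraction(after))
```

Each transversion moves the strand one base further toward pyrimidine richness, which is the direction of the asymmetry seen in the mtDNA graph.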

But there are additional factors to consider, such as selection pressure. Mitochondrial DNA tends to encode membrane-associated proteins, and membrane proteins use nonpolar amino acids, which are (in turn) predominantly encoded by pyrimidine-rich codons. More about this in an upcoming post.

A Very Simple Test of Chargaff's Second Rule

We know that for double-stranded DNA, the number of purines (A, G) will always equal the number of pyrimidines (T, C), because complementarity depends on A:T and G:C pairings. But do purines have to equal pyrimidines in single-stranded DNA? Chargaff's second parity rule says yes. Simple observation says no.

Suppose you have a couple thousand single-stranded DNA samples. All you have to do to see if Chargaff's second rule is correct is create a graph of A versus T, where each point represents the A and T (adenine and thymine) amounts in a particular DNA sample. If A = T (as predicted by Chargaff), the graph should look like a straight line with a slope of 1:1.
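If you want to try this yourself, the per-sample tally can be sketched in a few lines. The three "genes" below are invented toy strings, and the function names are mine; a real run would read coding sequences from a FASTA file instead.

```python
def a_t_points(genes):
    """Return one (A count, T count) pair per gene sequence, ready for plotting."""
    points = []
    for gene in genes:
        gene = gene.upper()
        points.append((gene.count("A"), gene.count("T")))
    return points


def fraction_with_excess_a(points):
    """Fraction of points falling on the A > T side of the 45-degree line."""
    if not points:
        raise ValueError("no points given")
    return sum(1 for a, t in points if a > t) / len(points)


# Toy coding sequences (hypothetical, far shorter than real genes).
toy_genes = ["ATGAAAGCA", "ATGTTTTGA", "ATGGAAAAATAA"]
points = a_t_points(toy_genes)
print(points)
print(fraction_with_excess_a(points))
```

If A = T held within each strand, every pair would have equal counts and the fraction would be zero.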

For fun, I grabbed the sequenced DNA genome of Clostridium botulinum A strain ATCC 19397 (available from the FASTA link on this page; be ready for a several-megabyte text dump), which contains coding sequences for 3552 genes of average length 442 bases each, and for each gene, I plotted the A content versus the T content.

A plot of thymine (T) versus adenine (A) content for all 3552 genes in C. botulinum coding regions. The greyed area represents areas where T/A > 1. Most genes fall in the white area where A/T > 1.

As you can see, the resulting cloud of points not only doesn't form a straight line of slope 1:1, it doesn't even cluster on the 45-degree line at all. The center of the cluster is well below the 45-degree line, and (this is the amazing part) the major axis of the cluster is almost at 90 degrees to the 45-degree line, indicating that the quantity A+T tends to be conserved.

A similar plot of G versus C (below) shows a somewhat different scatter pattern, but again notice that the centroid of the cluster is well off the 45-degree centerline. This means Chargaff's second rule doesn't hold (except for the few genes that randomly fell on the centerline).

A plot of cytosine (C) versus guanine (G) for all genes in all coding regions of C. botulinum. Again, notice that the points cluster well away from the 45-degree line (where they would have been expected to cluster, according to Chargaff).

The numbers of bases of each type in the botulinum genome are:
G: 577108
C: 358170
T: 977095
A: 1274032

Amazingly, there are 296,937 more adenines than thymines in the genome (here, I'm somewhat sloppily equating "genome" with combined coding regions). Likewise, excess guanines number 218,938. Spread over 3,552 genes, those excesses work out to roughly 84 excess adenines and 62 excess guanines per gene on average.
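The excess counts are easy to check from the four base totals listed above. A minimal sketch of the arithmetic (variable names are mine):

```python
# Base totals for the C. botulinum coding regions, as listed above.
counts = {"G": 577108, "C": 358170, "T": 977095, "A": 1274032}

excess_a = counts["A"] - counts["T"]  # adenines not balanced by thymines
excess_g = counts["G"] - counts["C"]  # guanines not balanced by cytosines

purines = counts["A"] + counts["G"]
pyrimidines = counts["C"] + counts["T"]
ratio = purines / pyrimidines  # would be 1.0 under Chargaff's second rule

print(excess_a, excess_g)
print(round(ratio, 3))
```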

The above graphs are in no way unique to C. botulinum. If you do similar plots for other organisms, you'll see similar results, with excess purines being most numerous in organisms that have low G+C content. As explained in my earlier posts on this subject, the purine/pyrimidine ratio (for coding regions) tends to be high in low-GC organisms and low in high-GC organisms, a relationship that holds across all bacterial and eukaryotic domains.

Chargaff's Second Parity Rule is Violated in Proportion to Genome A+T Content

Erwin Chargaff was the first to notice, in the early 1950s, before Watson and Crick deduced the structure of DNA, that the quantity of purines in DNA equals the quantity of pyrimidines (specifically, the amount of adenine equals the amount of thymine; and the amount of guanine equals the amount of cytosine). This observation was key to establishing the structure of DNA, and it is often cited as Chargaff's first parity rule. But Chargaff also made another observation (the second parity rule), namely that even within a single strand of DNA, the amount of adenine tends to equal the amount of thymine and the amount of guanine tends to equal the amount of cytosine.

It's easy to understand why the first parity rule holds true, because complementarity of DNA strands depends on A pairing with T and G pairing with C; these pairings give rise to the "rungs" of the DNA ladder and allow strands to be copied with high fidelity during cell division. But there doesn't seem to be any a priori reason why the second parity rule should hold true. And in fact, it often doesn't hold true, as Wacław Szybalski noted in 1966 when he reported finding imbalances of purines and pyrimidines in bacteriophage and other DNA samples. Szybalski observed that in most cases, protein-coding regions of DNA tend to have slightly more purines than pyrimidines on one strand and slightly more pyrimidines than purines on the other strand, such that messenger RNA ends up purine-heavy.

If you're having trouble visualizing the situation, imagine a very short (12-base) "chromosome" containing 50% G+C content. One possibility is that one strand looks like GGGGGGTTTTTT and the other strand is CCCCCCAAAAAA. In this case half the purines (all the G's) are on one strand and half (A's) are on the other. But you could just as easily have strands be GGGGGGAAAAAA and CCCCCCTTTTTT. In this case, one strand is all-purines, the other all-pyrimidines. Both examples violate Chargaff's second rule, which requires that G = C and A = T within each strand (e.g., GGGCCCTTTAAA + CCCGGGAAATTT would obey the rule).
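The three toy chromosomes above can be checked mechanically. This sketch pairs each strand with its base-by-base complement (written in the same left-to-right order as in the text, not reversed) and tests whether A = T and G = C within the strand. The function names are my own.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Base-by-base complement, kept in the same left-to-right order."""
    return strand.upper().translate(COMPLEMENT)


def obeys_second_rule(strand: str) -> bool:
    """True if A == T and G == C within this single strand."""
    strand = strand.upper()
    return (strand.count("A") == strand.count("T")
            and strand.count("G") == strand.count("C"))


for strand in ["GGGGGGTTTTTT", "GGGGGGAAAAAA", "GGGCCCTTTAAA"]:
    print(strand, complement(strand), obeys_second_rule(strand))
```

Only the third strand passes, and its complement passes too, since swapping A with T and G with C preserves the equalities.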

To my knowledge, no one has yet reported the fact (which I'll now report) that the degree to which Chargaff's second parity rule is violated depends on the G+C content of the source genome (at least for bacteria). Simply put, organisms with a G+C content of around 68% obey Chargaff's rules. Organisms with more than 68% G+C content violate Chargaff's second rule in the direction of pyrimidine loading of mRNA. Organisms with less than 68% G+C content (which of course includes the overwhelming majority of organisms) have purine-heavy DNA, to a degree that depends on the amount of A+T in the DNA.

Purine/pyrimidine ratio (in coding regions) as a function of genome G+C content based on codon analysis of 93 organisms. As genomes become more A+T rich, mRNA becomes more heavily purine-loaded.

The above graph shows how this relationship works. To create the graph, I did a statistical analysis of codon usage in 93 bacterial species. Organisms were chosen so as to obtain representatives across the AT/GC spectrum. No genus is represented more than once. In order to get as broad a sampling as possible, I included 14 intracellular symbionts with ultra-low G+C content (plus one such creature—Candidatus Hodgkinia cicadicola—with a 58% G+C content); many extremophiles; heterotrophs and autotrophs; pathogens and non-pathogens; and organisms with large and small genomes. The complete organism list is presented in a table further below.

Codon usage statistics for each organism were obtained using tools at http://genomevolution.org. Relative prevalences of A, T, G, and C in the genomes' coding regions were determined by codon frequency analysis. The purine:pyrimidine ratio was simply calculated as (A+G)/(C+T) based on the codon-wise frequency of usage of each base.
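For readers who want to reproduce the calculation, here is a rough sketch of going from a codon usage table to a purine:pyrimidine ratio. The three-codon table is invented for illustration, the function names are mine, and I'm assuming the table maps each codon to a count (codons written with U are treated as T).

```python
from collections import Counter


def base_counts_from_codons(codon_counts):
    """Tally A, C, G, T across all codons, weighting each codon by its usage count."""
    totals = Counter()
    for codon, n in codon_counts.items():
        for base in codon.upper().replace("U", "T"):
            totals[base] += n
    return totals


def purine_pyrimidine_ratio(codon_counts):
    """(A+G)/(C+T) based on the codon-wise frequency of usage of each base."""
    t = base_counts_from_codons(codon_counts)
    pyrimidines = t["C"] + t["T"]
    if pyrimidines == 0:
        raise ValueError("no pyrimidines in codon table")
    return (t["A"] + t["G"]) / pyrimidines


# Hypothetical codon usage table (a real one has up to 64 entries).
toy_usage = {"AAA": 10, "GCT": 5, "TTC": 5}
print(dict(base_counts_from_codons(toy_usage)))
print(purine_pyrimidine_ratio(toy_usage))
```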

What we see is that while there is a good deal of noise in the data, it's nevertheless quite clear that purine/pyrimidine ratios increase sharply as genome G+C decreases. Organisms for which Chargaff's second rule holds true (points falling at y = 1.0) are in a small minority. Most organisms have purine-rich coding regions, resulting in purine-rich mRNA.

Purine enrichment occurs for both adenine and guanine. For example, in Clostridium botulinum (genome G+C = 28.21%), codon analysis reveals G/C/A/T relative abundances (on the coding strand) of 18.3/10.8/40.3/30.6.

Intra-codon base position analysis reveals that purine enrichment is far more concentrated in position one of the codon than other positions. The graphs below show the purine balance on a position-by-position basis, for each base in a codon.

Most of the variation in purine/pyrimidine ratio happens in position 1 of the codon (the 'A' in ATG, for example). Notice that the purine/pyrimidine ratio in this position is well above 1.0 for all organisms.

Variation in purine loading at the second position of the codon is more carefully controlled (notice that there is less "scatter" in this graph). The y-axis scale is different here than in the previous graph, hence the slope is quite a bit less pronounced than it looks. Also, notice that most of the points in this plot are below parity (i.e., below 1.0 on the y-axis), indicating that this codon position is relatively pyrimidine-rich.

The third (so-called "wobble") position of the codon shows considerable variation in values, but the slope of the curve is less than in the previous two graphs, and this position is pyrimidine-rich for about two-thirds of the organisms.
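The position-by-position breakdown shown in these three graphs can be sketched the same way, by keeping a separate tally for each codon position. As before, the codon table is a made-up toy, the function name is mine, and this is an illustration of the idea rather than the code that produced the graphs.

```python
def positional_purine_ratios(codon_counts):
    """Return [ratio1, ratio2, ratio3]: (A+G)/(C+T) at each codon position."""
    purines = [0, 0, 0]
    pyrimidines = [0, 0, 0]
    for codon, n in codon_counts.items():
        codon = codon.upper().replace("U", "T")
        if len(codon) != 3:
            raise ValueError(f"not a codon: {codon!r}")
        for pos, base in enumerate(codon):
            if base in "AG":
                purines[pos] += n
            elif base in "CT":
                pyrimidines[pos] += n
    # A position with no pyrimidines at all is reported as infinity.
    return [pu / py if py else float("inf")
            for pu, py in zip(purines, pyrimidines)]


# Hypothetical codon usage table.
toy_usage = {"AAA": 10, "GCT": 5, "TTC": 5}
print(positional_purine_ratios(toy_usage))
```

In this toy table the first position is purine-heavy while positions 2 and 3 sit at parity.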

It's well known that G+C bias tends to be exaggerated in position 3 of the codon. For example, if the overall genome G+C is 70%, the position-wise G+C for the wobble base may be 90%. Surprisingly, we find that purine loading is most exaggerated in position 1 of the codon, not position 3. Not only is the slope of the purine-ratio curve shallower in position 3 than for the other two base positions, only position 1 is actually purine-heavy: positions 2 and 3 tend to be net pyrimidine-rich. This fact (that purine loading is primarily localized to codon position 1, whereas G+C bias is exaggerated in position 3) might indicate that the forces responsible for purine loading are entirely different from the forces responsible for G+C bias.

What might those forces be? What kinds of selection pressure might cause organisms to purine-load one strand of their DNA? One possibility is that purine loading of the coding strand is a strategy for protecting the "weaker" or more vulnerable strand from damage or mutations. Cytosine is thought to be particularly vulnerable to deamination (and later substitution with thymine, during repair). It's possible that the transcription process (which is asymmetric, in that RNA polymerase operates against just one strand of DNA, leaving the other strand free) is protective of the antisense strand of DNA. That is, in transcription, RNA polymerase cloaks the antisense strand and in so doing renders that strand less vulnerable to deamination events, rogue methylations, etc., while transcription is taking place.

An entirely different possibility is envisioned by an RNA World hypothesis. In this hypothesis, the genetic material of early ancestor organisms was single-stranded RNA. Since single-stranded RNA is not "complementary" to anything, there is no need for it to obey Chargaff symmetries. Thus, purine loading could have occurred prior to the advent of double-stranded DNA, and early organisms could have been uniformly AT-rich. In this model of the world, GC-rich genomes are a late development, and the processes responsible for creating GC-rich DNA led to genetic material with full Chargaff base parity.

We may not know for a long time (if ever) what the mechanisms of purine enrichment are. But we know for sure that purine accumulation is a widespread phenomenon in the bacterial world (operating across diverse clades) and happens in a way that encourages purine-rich mRNA in organisms with low G+C content in their genomes.


Organisms used in this study:

Organism  GC%  Genome size (bp)
Anaeromyxobacter dehalogenans 2CP-1 74.67 5009007
Cellulomonas flavigena strain DSM 20109 74.29 4123179
Xylanimonas cellulosilytica strain DSM 15894 72.47 3831380
Streptomyces bingchenggensis strain BCW-1 70.75 11936683
Myxococcus fulvus strain HW-1 70.63 9003593
Rubrobacter xylanophilus strain DSM 9941 70.48 3225748
Rhodospirillum centenum ATCC 51521 70.46 4355543
Actinomyces sp. oral taxon 175 strain F0384 68.73 3133330
Rhodococcus equi strain ATCC 33707 68.72 5259057
Acidovorax avenae subsp. citrulli strain AAC00-1 68.53 5352772
Bordetella bronchiseptica strain RB50 68.08 5339179
Alicycliphilus denitrificans strain K601 67.81 5070751
Stenotrophomonas maltophilia strain JV3 66.89 4544477
Rhodobacter capsulatus strain SB 1003 66.56 3871920
Pseudomonas aeruginosa strain PA7 66.45 6588339
Ralstonia eutropha strain H16 66.29 7416678
Xanthomonas campestris pv. raphani strain 756C 65.29 4941214
Thioalkalivibrio sp. strain HL-EbGR7 65.06 3470516
Rhodopseudomonas palustris strain BisB18 64.96 5513844
Brevundimonas diminuta strain ATCC 11568 64.51 3369316
Rhodothermus marinus strain DSM 4252 64.09 3386737
Bradyrhizobium japonicum strain USDA 110 64.06 9105828
Mycobacterium tuberculosis strain C 63.82 4379118
Thermanaerovibrio acidaminovorans strain DSM 6589 63.79 1848474
Halomonas elongata DSM 2581 strain type DSM 2581 63.61 4061296
Novosphingobium nitrogenifigens strain DSM 19370 63.43 4182647
Polaromonas sp. strain JS666 62.24 5898676
Desulfovibrio africanus strain Walvis Bay 61.42 4200534
Candidatus Desulforudis audaxviator strain MP104C 60.85 2349476
Burkholderia rhizoxinica strain HKI 454 60.68 3750138
Slackia heliotrinireducens strain DSM 20476 60.21 3165038
Candidatus Nitrospira defluvii 59.03 4317083
Halogeometricum borinquense DSM 11551 58.43 3944467
Candidatus Hodgkinia cicadicola strain Dsem 58.39 143795
Sideroxydans lithotrophicus strain ES-1 57.54 3003656
Cenarchaeum symbiosum A 57.37 2045086
Serratia sp. strain AS12 55.96 5443009
Acidaminococcus fermentans strain DSM 20731 55.84 2329769
Hyperthermus butylicus strain DSM 5456 53.74 1667163
Methanosaeta thermophila (Methanothrix thermophila PT) strain PT 53.55 1879471
Neisseria gonorrhoeae strain NCCP11945 53.37 2236178
Treponema paraluiscuniculi strain Cuniculi A 52.74 1133390
Pseudovibrio sp. strain FO-BEG1 52.38 5916782
Nitrosococcus halophilus strain Nc4 51.60 4145260
Herpetosiphon aurantiacus DSM 785 50.84 6785430
Escherichia coli B strain REL606 50.77 4629812
Bdellovibrio bacteriovorus strain ATCC15356 50.65 3782950
Pectobacterium wasabiae strain WPP163 50.48 5063892
Anaplasma centrale (Anaplasma marginale subsp. centrale str. Israel) strain Israel 49.98 1206806
Actinomyces coleocanis strain DSM 15436 49.47 1723843
Desulfotalea psychrophila strain LSv54 46.72 3659634
Polynucleobacter necessarius strain STIR1 45.56 1560469
Nitrosomonas sp. strain Is79A3 45.44 3783444
Coprothermobacter proteolyticus strain DSM 5265 44.77 1424912
Vibrio sp. Ex25 strain EX25 44.57 5160431
Geobacillus thermoglucosidans strain TNO-09.020 43.82 3740238
Waddlia chondrophila strain 2032/99 43.59 2139757
Bacteroides fragilis strain 638R 43.42 5373121
Thiomicrospira crunogena strain XCL-2 43.13 2427734
Coxiella burnetii strain CbuG_Q212 42.63 2008870
Chlamydia muridarum Nigg strain MoPn 40.27 1080451
Psychromonas ingrahamii strain 37 40.09 4559598
Nitratiruptor sp. strain SB155-2 39.69 1877931
Lactobacillus reuteri strain DSM 20016 38.87 1999618
Thermotoga lettingae strain TM 38.70 2135342
Streptococcus pyogenes strain Alab49 38.63 1841271
Bartonella bacilliformis strain ATCC 35685; KC583 38.24 1445021
Halothermothrix orenii strain DSM 9562; H 168 37.78 2463968
Staphylothermus marinus strain F1 35.73 1570485
Calditerrivibrio nitroreducens strain DSM 19672 35.69 2216552
Bacillus thuringiensis serovar andalousiensis strain BGSC 4AW1 34.96 5488844
Desulfurobacterium thermolithotrophum 34.95 1541968
Wolbachia pipientis strain wPip 34.19 1482455
Nitrosopumilus maritimus strain SCM1 34.17 1645259
Staphylococcus aureus strain 04-02981 32.90 2821452
Methanobrevibacter ruminantium strain M1 32.64 2937203
Rickettsia japonica strain YH 32.35 1283087
Methanocaldococcus fervens strain AG86 (v1) 32.21 1507251
Mycoplasma genitalium G37 strain G-37 31.69 580076
Nanoarchaeum equitans strain Kin4-M 31.56 490885
Orientia tsutsugamushi strain Boryong 30.53 2127051
Methanococcus aeolicus strain Nankai-3 30.04 1569500
Candidatus Pelagibacter ubique strain HTCC1062 29.68 1308759
Ehrlichia canis strain Jake 28.96 1315030
Arcobacter nitrofigilis strain DSM 7299 28.36 3192235
Clostridium botulinum A strain ATCC 19397 28.21 3863450
Parvimonas sp. oral taxon 393 strain F0440 28.17 1483165
Candidatus Arthromitus sp. strain SFB-mouse-NYU 27.94 1569870
Candidatus Blochmannia floridanus 27.38 705557
Buchnera aphidicola (Acyrthosiphon pisum) strain 5A 25.69 653223
Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis 22.48 703004
Candidatus Sulcia muelleri strain CARI (v1) 21.13 276511
Candidatus Carsonella ruddii strain PV (v1) 16.56 159662
