unemployment depression: new gene creation

An Example of Antisense Proteogenesis?

The question of how organisms develop entirely new genes is one of the most important open questions in biology. One possibility is that new genes often develop through accidental translation of antisense strands of DNA.

An example of this can be seen with the S1 protein of the 30S bacterial ribosome. If you take the amino-acid sequence for an S1 gene and use it as the query sequence in a blast-p (protein blast), you'll mostly get back hits on other S1 proteins, but you'll also get minor (low-fidelity) hits on polynucleotide phosphorylase. Why? When you do a blast search, the search engine, by default, looks at both DNA strands of target genes (sense and antisense strands) to see if there's a potential sequence match with the query. If there's a match on the antisense strand, it will be reported along with "sense" matches. In the case of the S1 protein, blast-p searches often report weak antisense hits on polynucleotide phosphorylase in addition to strong sense hits on ribosomal S1.

Ribosomal proteins are, of course, among the most highly conserved proteins in nature. It turns out that polynucleotide phosphorylase (PNPase) is very highly conserved as well. It's an enzyme that occurs in every life form (bacteria, fungi, plants, animals), absent only in a scant handful of microbial endosymbionts that have lost the majority of their genes through deletions. While the chemical function of PNPase is well understood (it catalyzes the interconversion of nucleoside diphosphates and RNA), its physiologic purpose is not well understood, although recent research shows that PNPase-knockout mutants of E. coli exhibit lower mutation rates. (Hence, PNPase may actually be involved in generating mutations.)

The bacterium Rothia mucilaginosa, strain DY18, has a (putative) PNPase gene at a genome offset of 1277514. When this gene is used as the query for a blast-p search, the hits that come back include many strong matches for the S1 ribosomal proteins of various organisms. By "strong match," I mean better than 80% sequence identity coupled with an E-value (expectation value) of zero. (Recall that the E-value is the number of matches of equal or better score that would be expected by chance alone, so a value near zero means the match is very unlikely to be a coincidence.)

If we use the Genome Viewer at genomevolution.org to look at the PNPase gene of Rothia mucilaginosa, we see something extraordinarily peculiar (look carefully at the graphic below). To see this genome view for yourself, go to this link.

Notice the presence of overlapping sense and antisense open reading frames on a portion of DNA from Rothia mucilaginosa. The top reading frame contains the gene for polynucleotide phosphorylase. The lower (-1 strand) reading frame contains ribosomal S1. To see this in your own browser, go to this link.

Notice that there are overlapping genes. On the top strand is the gene for PNPase; on the bottom strand, in the same location, is a gene for ribosomal S1. These are bidirectionally overlapping open reading frames, something occasionally encountered in virus nucleic acids but rarely seen in bacterial or other genomes.

How do we explain this anomaly? It could be just that: an anomaly, two open reading frames that happen to overlap (but that aren't necessarily translated in vivo). Or it could be that at some point, many millions of years ago, the ribosomal S1 gene of a Rothia ancestor was erroneously translated via the antisense strand, producing a protein with PNPase characteristics. We don't know why PNPase confers survival value (its physiologic purpose is not fully understood), but we do know, with a fair degree of certainty, that PNPase does, in fact, confer survival value—because every organism, at every level of the tree of life, has at least one copy of PNPase. Once Rothia's ancestor, through whatever process, opened up a reading frame on the antisense strand of ribosomal S1, the reading frame stayed open, because it conferred survival value. In this way, the first Rothia PNPase was born. (Arguably.)

At some point in its history, Rothia duplicated its PNPase gene and placed a new copy at genome offset 1650959. Over time, this second copy diverged from the original copy, becoming more like E. coli PNPase (which is also to say, less S1-like). Rothia's second PNPase shows a blast-p similarity of 45% (in terms of AA identities) to E. coli PNPase, with E-value 4.0e-147. It shows a blast-p similarity of 26% (AA identities) with E. coli ribosomal S1 (E-value: 4.0e-17). Neither E. coli PNPase nor Rothia PNPase-2 overlaps an S1 gene. However, both are colocated with the ribosomal S15 protein gene. And you'll find (if you look at lots of bacterial genomes) that PNPase is almost always located immediately next to an S15 ribosomal gene.

Rothia PNPase is an example of an enzyme that may very well have started out as an antisense copy of another protein (the S1 ribosomal protein). Of course, the mere presence of bidirectionally overlapping open reading frames doesn't prove that both frames are actually transcribed and translated in vivo. But the fact that blast-p searches using PNPase as the query almost always turn up faint S1 echoes (in a wide variety of organisms) is highly suggestive of an ancestral relationship between the two proteins.

Evolution and Antisense Translation of DNA

Yesterday I offered a theory for new gene creation which might be called the Erroneous Translation Theory. Basically, I proposed that new proteins arise through frameshifted and/or reversed translation of nucleic acids (translation of antisense strands of DNA).

Erroneous translation of DNA offers interesting possibilities for gain of function. (Recall that most point mutations result in loss of function, and one of the major criticisms of Darwinian theory is that evolution based on accumulation of point mutations cannot account for gain-of-function events.) Wholesale mistranslation via frameshift errors and/or wrong-strand transcription allows for the sudden emergence of entirely new classes of proteins. The unit of change is no longer the single base-pair polymorphism but the functional domain or motif.

An important aspect of antisense-strand translation has to do with stop codons. In DNA, the sequences TCA, TTA, and CTA specify amino acids serine, leucine, and leucine, respectively. But when these three codons are complemented, then read in 5'-to-3' direction—in other words, when they're antisense-translated—they form the stop codons TGA, TAA, and TAG, which tell the cell's protein-making machinery to terminate the production of the current polypeptide. Thus, if a typical gene containing codons TCA, TTA, and CTA is translated "backwards," translation will end prematurely: It will end as soon as a stop codon is encountered.
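The codon arithmetic in the previous paragraph can be checked with a few lines of Python. This is only a minimal sketch: the helper name `reverse_complement` is made up for illustration, and the three codons are the ones named in the text.

```python
# Complement each base, then reverse, so the result reads 5'-to-3'
# on the opposite (antisense) strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna: str) -> str:
    """Return the opposite strand of `dna`, written 5'-to-3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


for codon in ("TCA", "TTA", "CTA"):
    antisense = reverse_complement(codon)
    label = "stop" if antisense in STOP_CODONS else "not a stop"
    print(f"{codon} -> {antisense} ({label})")
```

Each of the three sense codons comes out as one of the three stop codons when read from the other strand.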

How important a consideration is this in the real world? Consider the following DNA sequence, which represents the gene for the cytidine deaminase enzyme of Clostridium botulinum:

>Clostridium botulinum A strain ATCC 19397(v1, unmasked), Name: ABS32549.1, CLB_0040, Type: CDS, Feature Location: (Chr: 1, 37028..37465) Genomic Location: 37028-37465
ATGAATGATTATATAGAATATGCAATAATTGAAGCAAAAAAAGCATTAGCAATAGGAGAAGTACCTGTTGGAGCTATTATAGTTAAAGAAAATAAAATTATAGCAAAAAGTCATAATTTAAAAGAGTCATTGAAGGATCCAACAGCTCATGCAGAGATATTAGCTATAAAAGAAGCTTGCAATACAATACATAATTGGAGATTAAAAGGATGTAAGATGTATGTAACATTAGAACCATGTGCTATGTGTGCTAGTGCAATAATTCAATCTAGAATAAGTGAATTGCATATAGGAACCTTTGATCCAGTGGGAGGGGCTTGTGGATCAGTAGTAAATATAACAAATAATAGTTATTTAAAAAATAATTTAAATATTAAATGGTTATATGATGATGAATGTAGTAGAATAATAACAAATTTTTTTAAAAATATTAGATAA

The above sequence is the "sense" strand of the DNA, in 5'-to-3' direction. The sequence below is the corresponding 3'-to-5' complementary sequence (in other words, what's on the antisense strand of DNA):

TACTTACTAATATATCTTATACGTTATTAACTTCGTTTTTTTCGTAATCGTTATCCTCTTCATGGACAACCTCGATAATATCAATTTCTTTTATTTTAATATCGTTTTTCAGTATTAAATTTTCTCAGTAACTTCCTAGGTTGTCGAGTACGTCTCTATAATCGATATTTTCTTCGAACGTTATGTTATGTATTAACCTCTAATTTTCCTACATTCTACATACATTGTAATCTTGGTACACGATACACACGATCACGTTATTAAGTTAGATCTTATTCACTTAACGTATATCCTTGGAAACTAGGTCACCCTCCCCGAACACCTAGTCATCATTTATATTGTTTATTATCAATAAATTTTTTATTAAATTTATAATTTACCAATATACTACTACTTACATCATCTTATTATTGTTTAAAAAAATTTTTATAATCTATT

When the antisense sequence is translated in the normal 5'-to-3' direction, the following amino acid sequence results:

LSNIFKKICYYSTTFIII*PFNI*IIF*ITIICYIYY*STSPSHWIKGSYMQFTYSRLNYCTSTHSTWF*CYIHLTSF*SPIMYCIASFFYS*YLCMSCWILQ*LF*IMTFCYNFIFFNYNSSNRYFSYC*CFFCFNYCIFYIIIH

This translation spans 146 codons (shown here using standard one-letter amino-acid abbreviations), 10 of which are stop codons (depicted as asterisks). Any attempt to translate the antisense strand of the C. botulinum cytidine deaminase gene will result in (at best) a series of short oligopeptides.
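For anyone who wants to repeat the exercise, here is a rough Python sketch of antisense translation. It assumes the standard genetic code and uses hypothetical helper names (`translate`, `antisense_translation`); it is not the tool that produced the translation shown earlier. The test fragment is the last 15 bases of the C. botulinum gene, which should give the first five residues of the antisense translation.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, written 5'-to-3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate codon by codon; '*' marks a stop. Trailing bases are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def antisense_translation(sense: str) -> str:
    """Translate the strand opposite to `sense`, read 5'-to-3'."""
    return translate(reverse_complement(sense))


tail = "AAAAATATTAGATAA"  # last 15 bases of the C. botulinum gene
protein = antisense_translation(tail)
print(protein, "stops:", protein.count("*"))
```

Feeding in the full gene and calling `.count("*")` on the result is one way to tally the stop codons for any sequence of interest.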

It's tempting to conclude that this is nature's ingenious way of preventing the occurrence of nonsense proteins. Translate the wrong strand of DNA by mistake, and translation quickly terminates. (In the above example, a stop codon occurs every 14 amino acids, on average.) But before you jump to that conclusion, consider the cytidine deaminase gene of Anaeromyxobacter dehalogenans strain 2CP-C:

GTGGACGAGCGCGAGGCGATGCAGGAGGCGCTGGGGCTGGCGCGCGAGGCGGCGGCCCGCGGCGAGGTGCCGGTCGGCGCGGTGGCGCTGTTCGAGGGCCGCGTGGTCGGCCGCGGCGCGAACGCCCGCGAGGCGGCGCGCGATCCCACCGCGCACGCGGAGCTCCTCGCGATCCAGGAGGCGGCGCGCACCCTCGGGCGCTGGCGCCTCACCGGCGTCACGCTGGTGGTGACGCTCGAGCCCTGCGCCATGTGCGCCGGCGCCATGGTGCTCGCCCGCATCGACCGGCTCGTCTACGGGGCGAGCGATCCCAAGGCCGGCTGCACCGGCTCCCTCCAGGACCTGTCGGCGGACCCCCGGCTGAACCACCGGTTCCCGGTGGAGCGCGGCCTGCTGGCCGAGGAGTCCGGCGAGCTCCTCCGGGCCTTCTTCCGGGCCCGCCGGGGCGCCGGGAACGGAAACGGCAACGGCGGCGAGGGTTAG

The translation of the antisense version of this gene is:

LTLAAVAVSVPGAPAGPEEGPEELAGLLGQQAALHREPVVQPGVRRQVLEGAGAAGLGIARPVDEPVDAGEHHGAGAHGAGLERHHQRDAGEAPAPEGARRLLDREELRVRGGIARRLAGVRAAADHAALEQRHRADRHLAAGRRLARQPQRLLHRLALVH

Which contains no stop codons! Why does one version of the gene give ten stop codons when anti-translated, whereas the other version gives zero stop codons? Clostridium botulinum has a genome G+C content of 28% whereas the DNA of Anaeromyxobacter dehalogenans has a G+C content of 74%. The two organisms favor entirely different codons. Anaeromyxobacter uses codons TCA, TTA, and CTA only 0.03%, 0%, and 0.02% of the time, respectively. Clostridium uses the same codons 1.72%, 5.62%, and 4.67% of the time. Taken together, that is 12.01% versus 0.05%, or over 200 times more often than Anaeromyxobacter.

Bottom line: Almost any gene in Anaeromyxobacter (or any high-GC organism, it turns out) can be antisense-translated without generating stop codons. As a rule of thumb, the higher the G+C content of a gene, the fewer stop codons turn up in its antisense translation.
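To get a feel for why G+C content matters so much, here is a back-of-envelope sketch. It rests on a strong simplifying assumption that is not in the original analysis: bases are drawn independently, with G and C equally likely and A and T equally likely. Real genes are not random, so treat the function `stop_probability` (a hypothetical name) as an illustration of the trend, not as a model of either organism.

```python
def stop_probability(gc: float) -> float:
    """Chance that a random codon is TAA, TAG or TGA.

    Assumes independent bases with P(G) = P(C) = gc / 2 and
    P(A) = P(T) = (1 - gc) / 2 (a simplification, not measured data).
    """
    at_base = (1.0 - gc) / 2.0  # probability of A, and likewise of T
    gc_base = gc / 2.0          # probability of G, and likewise of C
    p_taa = at_base * at_base * at_base
    p_tag = at_base * at_base * gc_base
    p_tga = at_base * gc_base * at_base
    return p_taa + p_tag + p_tga


for name, gc in (("C. botulinum", 0.28), ("A. dehalogenans", 0.74)):
    p = stop_probability(gc)
    print(f"{name}: roughly one stop per {1 / p:.0f} random codons")
```

Under this crude model the low-GC genome hits a stop several times more often than the high-GC one. Codon preference pushes the real numbers further apart, which would explain why the Anaeromyxobacter gene shows no antisense stops at all.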

If it's true that antisense-strand translation is (or has been) an important source of new proteins in nature, the foregoing observation is tremendously relevant, because it means successful reverse translation has likely occurred far more often in high-GC organisms than in low-GC organisms. It suggests that bacteria with high G+C content in their genomes may, in fact, have been the incubators of early proteins. It implies a "GC Eden" scenario in which early life forms had predominantly high-GC genomes. Low-GC organisms then arose through continuous "AT pressure," from large numbers of accumulated GC-to-AT transition mutations. (We know that GC-to-AT transition mutations occur at a much higher rate than AT-to-GC transitions; this fact is not in dispute.)

Even so, we have to ask: What is the evidence for reverse (antisense-strand) translation having occurred in nature? Is there any such evidence?

More on this subject tomorrow.

Thoughts on New Gene Origination

The other day, I wrote a damning critique of Darwin's theory and offered nothing in the way of a positive alternative to the traditional view of accumulated-point-mutations as a driving force for evolution. It's easy to take potshots at someone else's theory and walk away. As a rule, I don't like naysayers who criticize something, then offer nothing in return. So I'd like to take a moment to try to offer a different perspective on evolution. In particular, I'd like to offer my own theory as to how new genes arise.

The question of where new genes come from is, of course, one of the foremost open problems in biology. Current theory revolves mostly around gene duplication followed by modification of the duplicated gene (via mutations and deletions) under survival pressure [4]. Gene fusion and fission have also been proposed as mechanisms for gene origination [3]. In addition, genes derived from noncoding DNA have recently been described in Drosophila [1]. Likewise, transposons (genes that jump from one location to another) have been implicated in gene biogenesis [2].

The problem with these theories is that various enzymes are required in order for duplication, transposition, fusion, fission, etc., to occur (to say nothing of transcription, translation initiation, translation elongation, and so on), and existing theories don't explain how these participating enzymes appeared, themselves, in the first place. A fully general theory has to start from the assumption that in pre-cellular, pre-chromosomal, pre-organismic times, genes (if they existed) may have occurred singly, with multiple copies arising through non-enzymatic replication. Likewise, we should assume that early protein-making machinery was probably non-enzymatic, which is to say entirely RNA-based (i.e., ribozymal). If the idea of catalytic RNA is new to you or sounds unreasonably farfetched, please review the 1989 Nobel Prize research by Altman and Cech.

The fundamental mechanisms of de novo gene creation available in pre-enzymatic times might well have been nothing more than ribozymal duplication of nucleic acid sequences followed by erroneous translation. "Erroneous translation" can be of two fundamental types: frameshifted translation, and reverse translation. (Reverse translation here means transcription of the antisense strand of DNA and subsequent translation to a polypeptide.)

DNA is parsed 3 bases at a time (the 3-base combinations are called codons; each codon corresponds to an amino acid). If a single base is spuriously added to, or deleted from, a gene, the reading frame is disrupted and a hugely different amino-acid sequence results. This is called a frameshift error or frameshift mutation.

Spurious addition or deletion of a single base to a free-floating piece of single-stranded genetic material (RNA or DNA) is all that's needed in order to cause frameshifted translation. The protein that results from a frameshift error is, of course, in general, vastly different from the original protein.
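A toy example makes the effect of a frameshift concrete. The 15-base sequence below is invented for illustration, and the sketch assumes the standard genetic code; `translate` is a hypothetical helper, not part of any particular library.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate codon by codon; '*' marks a stop. Trailing bases are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


original = "ATGGCGAAAGGGCAC"              # made-up sequence
shifted = original[:3] + original[4:]     # delete one base after the start codon

print(translate(original))
print(translate(shifted))
```

Only the first residue survives the deletion in register; everything downstream is read in a different frame.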

If pre-organismic nucleic acids were single-stranded, then reverse translation would require 3'-to-5' reading of the nucleic acid as well as 5'-to-3' reading. If, on the other hand, early nucleic acids were double-stranded, then 5'-to-3' (normal direction) translation of each strand would suffice to give one normal and one reverse translation product. (Note for non-biologists: In all known current organisms, reading of DNA and RNA takes place in the 5'-to-3' direction only.)

Nucleic acids (RNA and DNA) have directionality, defined by the orientation of sugar backbone molecules in terms of their 5' and 3' carbons.

It's interesting to speculate on the role of reverse translation in production of novel proteins, especially as it applies to early biological systems. We don't know if early systems relied on triplet codons (or even if all four bases—guanine, cytosine, adenine, thymine—existed from the beginning). We also don't know if there were 20 amino acids in the beginning. There may have been fewer (or more).

A novel possibility is that early triplet codons were palindromic (giving identical semantics when read in either direction). There are 16 palindromic codons in the codon lexicon (AGA, GAG, CAC, ACA, ATA, TAT, AAA, and so on) which today encode 15 amino acids out of the 20 commonly used. In a palindromic-codon world, the distinction between "sense" and "antisense" nucleic acid sequences vanishes, because a single-stranded gene made up of palindromic codons could be translated in either direction to give a polypeptide with the same sequence, the only chirality arising from N- to C-terminal polarity. For example, the sequence GGG-CAC-GCG-AAA would give a polypeptide of glycine-histidine-alanine-lysine whether translated forward or backward, the only difference being that the forward version would have glycine at the N-terminus whereas the reverse version would have glycine at the C-terminus. The secondary and tertiary structures of the two versions would be the same. As long as catalytic function didn't directly depend on an amino or carboxy terminus of an end-acid, the two proteins would also be functionally indistinguishable.
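The counts quoted above are easy to verify. The following sketch enumerates the palindromic codons under the standard genetic code and then translates the four-codon example in both directions; the helper names are hypothetical.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate codon by codon using one-letter amino-acid codes."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


# A codon is palindromic when it reads the same in both directions.
palindromic = sorted(c for c in CODON_TABLE if c == c[::-1])
amino_acids = {CODON_TABLE[c] for c in palindromic}
print(len(palindromic), "palindromic codons encoding", len(amino_acids), "amino acids")

gene = "GGGCACGCGAAA"           # GGG-CAC-GCG-AAA from the text
forward = translate(gene)
backward = translate(gene[::-1])  # same strand, read in the opposite direction
print(forward, backward)
```

The backward product is the forward product with its residues in reverse order, which is the glycine-at-either-end situation described in the paragraph.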

Codon palindromicity is potentially important in any system in which single-stranded genes are bidirectionally translated, because in the case where a gene does happen to rely heavily on palindromic codons, the reverse-translated product will (for the reasons just explained) have the potential to be functionally paralogous to the forward-translated product (to an extent matching the extent of palindromic-codon usage). But this assumes that in early organisms (or pre-organismic soups), single-stranded genes could be translated in the 5'-to-3' direction or the 3'-to-5' direction.

It turns out modern organisms differ markedly in the degree to which they use palindromic codons, and there are (remarkably) some prokaryotes whose genes use an average of ~40% palindromic codons. The complementary strand of DNA would, of course, contain palindromic complements: AGA opposite TCT, CCC opposite GGG, etc.
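Palindromic-codon usage for any coding sequence can be estimated along the following lines. This is a sketch with made-up helper names (`palindromic_fraction`, `complement`) and toy input; it is not the procedure behind the ~40% figure.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna: str) -> str:
    """Base-by-base complement, without reversing."""
    return dna.upper().translate(COMPLEMENT)


def palindromic_fraction(cds: str) -> float:
    """Fraction of in-frame codons that read the same in both directions."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        return 0.0
    return sum(codon == codon[::-1] for codon in codons) / len(codons)


print(palindromic_fraction("GGGCACGCGAAA"))  # every codon is palindromic
print(palindromic_fraction("ATGGGGTTTTAA"))  # only GGG and TTT are
print(complement("AGA"), complement("CCC"))  # complements stay palindromic
```

Running `palindromic_fraction` over every coding sequence in a genome and averaging would give a per-organism figure of the kind mentioned above.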

All of this makes for interesting conjecture, but does any of it really apply to the natural world? For example: Do organisms actually employ strategies of "erroneous translation" in creating new proteins? Did today's microbial meta-proteome arise through mechanisms involving frameshifted and/or reverse translation? Is there any evidence of such processes, one way or the other? Tomorrow I want to continue on this theme, presenting a little data to back up some of these strange ideas. Please join me; and bring a biologist-friend with you!

References
1. Begun, D., et al. Evidence for de novo evolution of testis-expressed genes in the Drosophila yakuba/Drosophila erecta clade. Genetics 176, 1131–1137 (2007).
2. Feschotte, C., & Pritham, E. DNA transposons and the evolution of eukaryotic genomes. Annual Review of Genetics 41, 331–368 (2007).
3. Jones, C. D., & Begun, D. J. Parallel evolution of chimeric fusion genes. Proceedings of the National Academy of Sciences 102, 11373–11378 (2005).
4. Ohno, S. Evolution by Gene Duplication (Springer-Verlag, Berlin, 1970).
