Showing posts with label desktop science. Show all posts

Bacterial Genes in Rice: A Cautionary Tale

Something very strange happened the other day.

I was fooling around looking for flagellum genes in various organisms, hoping to find homology between bacterial flagellum proteins and eukaryotic cilia proteins. All of a sudden, a search came back positive for a bacterial gene in rice, of all things.

On a lark, I decided to check further. ("If one gene transferred, maybe there are more," I reasoned.) It was late at night. Before going to bed, I downloaded the DNA sequence data for all 3,725 genes of Enterobacter cloacae subsp. cloacae strain NCTC 9394 and set up a brute-force BLAST search of the 3,725 bacterial genes against all 49,710 genes of Oryza sativa L. ssp. indica. I set the E-value threshold to the most stringent value allowed by the CoGeBlast interface, namely 1e-30, meaning: reject any hit unless fewer than 10^-30 matches that good would be expected to turn up by chance in a search of this size. I went to bed expecting the search to turn up nothing more than the one flagellum protein-match I'd found earlier.
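If you run BLAST on your own machine rather than through CoGeBlast, the same cutoff can be applied after the fact. The sketch below assumes NCBI BLAST+ tabular output (-outfmt 6) with its default twelve columns, where percent identity is column 3 and the E-value is column 11. The sample rows, gene names, and function names are made up for illustration and are not taken from the search described above.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str      # query gene ID
    subject: str    # subject sequence (contig or gene) ID
    pident: float   # percent identity over the aligned region
    evalue: float   # expected number of chance hits this good


def parse_outfmt6(text):
    """Parse BLAST+ tabular output (-outfmt 6, default 12 columns)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits.append(Hit(fields[0], fields[1], float(fields[2]), float(fields[10])))
    return hits


def filter_hits(hits, max_evalue=1e-30):
    """Keep only hits whose E-value is at or below the threshold."""
    return [h for h in hits if h.evalue <= max_evalue]


# Hypothetical rows: qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore
SAMPLE = "\n".join([
    "geneA\tcontig1\t96.8\t642\t20\t0\t1\t642\t100\t741\t0.0\t1050",
    "geneB\tchr3\t78.2\t120\t26\t0\t1\t120\t5\t124\t2e-12\t80.1",
    "geneC\tcontig7\t93.4\t900\t59\t0\t1\t900\t1\t900\t1e-150\t1300",
])

if __name__ == "__main__":
    for hit in filter_hits(parse_outfmt6(SAMPLE)):
        print(hit.query, hit.subject, hit.pident, hit.evalue)
```

With a cutoff of 1e-30, the weak geneB match is dropped and the two strong matches survive. A stringent cutoff only tells you the match is real, though; it says nothing about how the matching sequence got into the database.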

When I woke up the next morning, I was stupefied to find that my brute-force blastn (DNA sequence) search had brought back more than 150 high-quality hits in the rice genome.

I later found 400 more bacterial genes, from Acidovorax, a common rice pathogen. (Enterobacter is not a known pathogen of rice, although it has been isolated from rice.)

But before you get the impression that this is some kind of major scientific find, let me cut the suspense right now by telling you the bottom line, which is that after many days of checking and rechecking my data, I no longer think there are really hundreds of horizontally transferred bacterial genes lurking in the rice genome. Oh sure, the genes are there, in the data (you can check for yourself), but this is actually just a sad case of garbage in, rubbish out. The Oryza sativa indica genome, I'm now convinced, suffers from sample contamination. That is to say: Bacterial cells were present in the rice sample prior to sequencing. Some of the bacterial genes were amplified and got into the contigs, and the assembly software dutifully spliced the bacterial data in with the rice data.

My first tipoff to the possibility of contamination (aside from finding several hundred bacterial genes where there shouldn't be any bacterial genes) came when I re-ran my BLAST searches using the most up-to-date copy of the indica genome. Suddenly, many of the hits I'd been seeing vanished. The most recent genome consists of 12 chromosome-sized contigs. The earlier genome I had been using had the 12 chromosomes plus scores of tiny orphan contigs. When the orphan contigs went away, so did most of my hits.
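One quick way to spot this pattern is to tally where the hits land. The following is a minimal sketch, assuming you already have a list of subject sequence names from your BLAST hits and know the names of the assembled chromosomes. The names used here ("Chr1", "scaffold_0412", and so on) are hypothetical placeholders, not identifiers from the real indica assembly.

```python
from collections import Counter


def hits_by_location(hit_subjects, chromosome_names):
    """Count hits on assembled chromosomes versus everything else."""
    counts = Counter()
    for subject in hit_subjects:
        if subject in chromosome_names:
            counts["chromosome"] += 1
        else:
            counts["orphan contig"] += 1
    return counts


def orphan_fraction(hit_subjects, chromosome_names):
    """Fraction of hits that fall outside the assembled chromosomes."""
    if not hit_subjects:
        return 0.0
    counts = hits_by_location(hit_subjects, chromosome_names)
    return counts["orphan contig"] / len(hit_subjects)


# Hypothetical example data.
CHROMOSOMES = {f"Chr{i}" for i in range(1, 13)}
SUBJECTS = ["Chr1", "scaffold_0412", "scaffold_0412", "scaffold_0977", "Chr4"]

if __name__ == "__main__":
    print(hits_by_location(SUBJECTS, CHROMOSOMES))
    print(f"{orphan_fraction(SUBJECTS, CHROMOSOMES):.0%} of hits are on orphan contigs")
```

If most of your "foreign" genes sit on small unplaced contigs rather than on chromosomes, contamination is a more economical explanation than gene transfer.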

When I looked at NCBI's master record for the Oryza sativa Indica Group, I noticed a footnote near the bottom of the page: "Contig AAAA02029393 was suppressed in Feb. 2011 because it may be a contaminant." (In actuality, a great many other contigs have been removed as well.)

When I ran my tests against the other sequenced rice genome, the Oryza sativa Japonica Group genome, I found no bacterial genes.

Contamination continues to plague the Indica Group genome. The 12 "official" chromosomes of Oryza sativa indica have Acidovorax genes all over the place, to this day. I suppose technically, it is possible those genes represent instances of horizontal gene transfer. But if that's what it is, then it's easily the biggest such transfer across species lines ever recorded. And it happened only in the indica variety of rice, not japonica. (The two varieties diverged 60,000 to 220,000 years ago.)

The following table shows some of the Acidovorax genes that can be found in the Oryza sativa Indica Group genome. This is by no means a complete list. Note that the Identities number in the far-right column pertains to DNA-sequence similarity, not amino-acid-sequence similarity.

Acidovorax Genes Occurring in the Published Oryza sativa indica Genome
| Query gene | Function | Rice gene | Query coverage | E-value | Identities |
|---|---|---|---|---|---|
| Aave_0021 | phospho-2-dehydro-3-deoxyheptonate aldolase | OsI_15236 | 100.0% | 0.0 | 93.6% |
| Aave_0289 | orotate phosphoribosyltransferase | OsI_36535 | 100.0% | 0.0 | 96.8% |
| Aave_0363 | lipoate-protein ligase B | OsI_15083 | 100.0% | 0.0 | 94.6% |
| Aave_0368 | F0F1 ATP synthase subunit B | OsI_15082 | 100.0% | 0.0 | 98.9% |
| Aave_0372 | F0F1 ATP synthase subunit beta | None | 100.1% | 0.0 | 98.2% |
| Aave_0373 | F0F1 ATP synthase subunit epsilon | OsI_15081 | 100.0% | 0.0 | 97.8% |
| Aave_0637 | twitching motility protein | OsI_37113 | 100.1% | 0.0 | 95.5% |
| Aave_0916 | general secretory pathway protein E | OsI_17332 | 86.9% | 0.0 | 96.6% |
| Aave_1272 | NADH-ubiquinone/plastoquinone oxidoreductase, chain 6 | OsI_28652 | 100.0% | 0.0 | 97.3% |
| Aave_1273 | NADH-ubiquinone oxidoreductase, chain 4L | OsI_28651 | 100.0% | 3e-174 | 100% |
| Aave_1301 | DedA protein (DSG-1 protein) | OsI_21534 | 97.3% | 0.0 | 96.8% |
| Aave_1312 | hypothetical protein | OsI_15703 | 99.8% | 0.0 | 93.4% |
| Aave_1948 | histidine kinase internal region | OsI_23297 | 100.0% | 0.0 | 96.3% |
| Aave_1950 | hypothetical protein | OsI_23296 | 100.0% | 0.0 | 96.6% |
| Aave_1957 | penicillin-binding protein 1C | OsI_15534 | 100.1% | 0.0 | 92.8% |
| Aave_1958 | hypothetical protein | OsI_15533 | 99.2% | 0.0 | 92.2% |
| Aave_2274 | major facilitator superfamily transporter | OsI_33140 | 95.1% | 0.0 | 92.5% |
| Aave_2484 | 2,3,4,5-tetrahydropyridine-2-carboxylate N-succinyltransferase | OsI_19753 | 100.0% | 0.0 | 97.3% |
| Aave_3000 | ferrochelatase | OsI_33935 | 100.0% | 0.0 | 96.2% |

So let this be a lesson to DIY genome-hackers everywhere. If you find what you think are dozens of putative horizontally transferred genes in a large genome, stop and consider: Which is more likely to occur, a massive horizontal gene transfer event involving several dozen genes crossing over into another life form, or contamination of a lab sample with bacteria? I think we all know the answer.

Many thanks to Professor Jonathan Eisen at U.C. Davis for providing valuable consultation.

Getting Started in Desktop Bioinformatics

I've spent about four months now exploring do-it-yourself desktop bioinformatics. Overall, I'm excited by what I've been able to do and I'm optimistic about the prospects for other do-it-yourself desktop-science geeks, because there are tons of great online tools for doing bio-sci and lots of important scientific questions yet to be fleshed out. So I thought I'd share some of what I've learned, and provide some pointers for anyone who might want to try his or her hand at this sort of "citizen science."

It helps to have the benefit of a science education (in particular, a bio education) before beginning, but one of the great things about do-it-yourself desktop science is that you can (and will!) learn as you go. For example, you might have only a bare-bones basic understanding of enzymology before you begin, but as you move deeper into a particular research quest, you'll find yourself wanting to learn more about this or that aspect of an enzyme. So you'll hit Google Scholar and bring yourself up-to-date on this or that detail of a particular subject. That's a Good Thing.

When I first plunged into DNA analysis, I have to admit my knowledge of mitochondria was weak. I knew they had their own DNA, for example, but it wasn't obvious to me (until I started digging) that mitochondrial DNA is pathetically small, whereas the mitochondrial proteome (the superset of all products that go into making up a functioning mitochondrion) is large. In other words, most "mitochondrial genes" are not in mtDNA. They're in nuclear DNA. There are a couple of online databases of nuclear mitochondrial genes (NUMTs, as they're known), but by and large this is an area in dire need of more research. Someone needs to put together a database or reference set of yeast NUMTs, for example. We also need a database for algal NUMTs, another for protozoan NUMTs, another for rice or corn or Arabidopsis NUMTs, etc. Maybe you'll be the one to move such a project along?

So. How can you get started in desktop bioinformatics? I recommend familiarizing yourself with the great tools at genomevolution.org, which is powered by iPlant, which (in turn) is funded by the National Science Foundation Plant Cyberinfrastructure Program here in the U.S. In particular, I recommend you set aside an evening to run through some of the tutorials at genomevolution.org. That'll give you an idea of what's possible with their tools.
[Image caption: Many organisms have genes for flagellum proteins, but not all such organisms actually make a flagellum. (The flagellum is the whiplike appendage that gives the cell motility. Pictured: Bdellovibrio, a bacterium with a powerful flagellum.)]

If you go to this page and scroll down, you'll find some really interesting short videos showing how to use some of the genomevolution.org tools. They're fun to watch and should stimulate your imagination.

What kinds of problems need investigating by desktop biologists? The sky's the limit. One quest that lends itself to citizen science is looking for examples of horizontal gene transfer (HGT). This requires that you first teach yourself a little bit about BLAST searches. (BLAST searches are sequence-similarity searches that let you compare DNA against DNA or amino-acid sequence against amino-acid sequence.) The strategies involved here can range from simple and brute force to sophisticated; and the great thing is, you can invent your own heuristics. It's a wide-open area. I recently found good evidence (90%+ similarity of DNA sequences) for bacterial gene transfer into rice, which I'll write about in a later post. I'm confident there are thousands of examples of horizontal gene transfer (whether from bacteria to bacteria, bacteria to plant, bacteria to insect, or whatever) waiting to be discovered. You could easily be the next discoverer of one of these gene transfers.

Here are some other ideas for desktop-science explorations:
  • Find and characterize flagellar genes in organisms that lack motility. If you dig into the literature, you'll find that there are many examples of supposedly immotile organisms (like the intracellular symbiont Buchnera, which lives inside aphids) that not only harbor flagellum genes but express some of them, yet have no external flagellum. Obviously, organisms that retain flagellum genes but don't actually make a flagellum (that little whip-like tail that makes single-celled organisms swim around) must be retaining those genes for a reason. The gene products must be doing something. But what? Also: Paramecium and diatoms and other eukaryotes make flagella and/or cilia. Most animals also make cilia. (Ever get a tickle deep in your throat or bronchi? It was probably something tangling with the cilia lining your bronchial system.) What's the relationship between cilia gene products in Paramecium, say, and cilia in animals? Do any plants conceal cilia genes? If so, how are they related phylogenetically to lower-organism cilia?
  • Migration of genes from parasites to host DNA. A general pattern that seems to happen in nature is: a bacterium or other invader takes up residency inside an animal or plant cell, becoming an endosymbiont; then some of its genes (the symbiont's genes) move to the nucleus of the host cell. Which genes? What do the genes do? That's up to you to try to find out. 
  • Bidirectionally ("bidi") transcribed genes: While rare, there are examples of genes in which each strand of DNA is transcribed into mRNA. (The genome for Rothia mucilaginosa contains many putative examples of this.) Find organisms that contain bidi genes. Try to determine if both strands are actually transcribed. Examine sister-species organisms to see if one strand is transcribed in one organism and the other gene (on the other strand) is transcribed in the other organism.
  • Phylogenetics of plasmid and viral genes. Try to determine the ancestry of a virus gene. There are good tree-making services online that do all the hard work for you, including protein-sequence alignments. All you have to do is cut and paste Fasta files.
  • Codon analysis. There are many plants (rice is one) and higher organisms in which DNA is more or less equally divided into high-GC-content genes and low-GC-content genes. Surely the codon usage patterns for each class of genes vary. But how? What are the codon adaptation indexes (CAI values) for the various genes? Create a few histograms of CAI values. Use CAIs and other techniques to try to determine which genes are highly expressed. Are HEGs (highly expressed genes) mostly high-GC? Low-GC? Both? Run some histograms on the genes' purine (A+G) content, G+C content, and G+C content by codon position.
  • Many organisms (and organelles) have extremely GC-poor genomes. Some have bizarre codon usage patterns (where, say, the codon AAA is used 12 times more than the average codon). Some use 56 or fewer codons (out of 64 possible). Find the organelle or organism that uses the fewest codons. See if there's an organism or organelle that uses fewer than 20 amino acids. Which amino acid(s) get(s) left out? 
  • Characterize the DNA repairosome of an aerobic and an anaerobic archaeon. Compare and contrast the two.
  • Find all the genes in a particular organism that have mitochondrial-targeting presequences in their DNA. 
  • Pick two closely related organisms. Try to figure out how many million years ago they diverged. Use mitochondrial DNA analysis as well as cytoplasmic protein analysis. 
  • Find the bacterium that has more secretion-protein, permease, and protein-translocation genes than any other. Compare it to its closest relative. 
  • Find an organism that is pathogenic (to humans, animals, or plants). Find its closest non-pathogenic relative. Compare genomes. Determine which genes are most likely to be involved in virulence. 
  • Some seemingly simple organisms (amoebae) have more DNA than a human. Why? What's all that DNA doing there? Does it contain horizontally transferred genes from plants, bacteria, archeons, animals? Are amoebas and other super-large-genome organisms "DNA hoarders"? Are they DNA curationists? Characterize the genes (enumerate by category, first) of these organisms. How many are expressed? How many are junk? What's the energy cost of maintaining that much junk DNA? Can it all be junk? Is an amoeba actually a devolved higher life form that forgot how to do morphogenesis and can no longer develop into a tadpole or whatever?
  • [ insert your own project here! ]
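Several of the codon-related projects above start from the same few measurements. Here is a minimal sketch, assuming you have a coding sequence as a plain DNA string that starts in frame. It computes overall G+C content, G+C content at each codon position, and a tally of which codons are used. The function names and the toy sequence are my own inventions for illustration. Computing a real codon adaptation index takes a reference set of highly expressed genes and is not attempted here.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def trim_to_codons(cds):
    """Upper-case the sequence and drop any trailing partial codon."""
    cds = cds.upper()
    return cds[: len(cds) - len(cds) % 3]


def gc_by_codon_position(cds):
    """Return (GC1, GC2, GC3): G+C fraction at codon positions 1, 2, and 3."""
    cds = trim_to_codons(cds)
    return tuple(gc_content(cds[i::3]) for i in range(3))


def codon_counts(cds):
    """Count how many times each codon occurs in an in-frame sequence."""
    cds = trim_to_codons(cds)
    return Counter(cds[i:i + 3] for i in range(0, len(cds), 3))


if __name__ == "__main__":
    toy_cds = "ATGGCCAAAGGCTAA"  # made-up five-codon sequence
    print("GC:", round(gc_content(toy_cds), 3))
    print("GC by position:", gc_by_codon_position(toy_cds))
    counts = codon_counts(toy_cds)
    print("distinct codons used:", len(counts))
    print("most common:", counts.most_common(3))
```

Run the same functions over every gene in a genome, collect the numbers in a list, and you have the raw material for the histograms suggested above. Counting distinct codons across all genes of an organelle is also a first step toward finding the one that uses the fewest.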


Do-It-Yourself Phylogenetic Trees

I've been doing a lot of desktop science lately, and I'm happy to report that superb, easy-to-use online tools exist for creating your own phylogenetic trees based on gene similarities, something that's non-trivial to implement yourself.

The other day, I speculated that the fruit-fly Ogg1 gene, which encodes an enzyme designed to repair oxidatively damaged guanine residues in DNA, might derive from Archaea. The Archaea (in case you're not a microbiologist) comprise one of three super-kingdoms in the tree of life. Basically, all life on earth can be classified as either Archaeal, Eukaryotic, or Eubacterial. The Eubacteria are "true bacteria": they're what you and I think of when we think "bacteria." (So, think Staphylococcus and tetanus bacteria and E. coli and all the rest.) The Eukaryota are higher life forms, starting with yeast and fungi and algae and plankton, progressing up through grass and corn and pine trees, worms and rabbits and donkeys, all the way to the highest life form of all, Stephen Colbert. (A little joke there.) Eukaryotes have big, complex cells with a distinct nucleus, complex organelles (like mitochondria and chloroplasts), and a huge amount of DNA packaged into pairs of chromosomes.

Archaea look a lot like bacteria (they're tiny and lack a distinct nucleus, organelles, etc.), and were in fact long considered bacteria. But in 1977, Carl Woese and George E. Fox provided persuasive evidence that members of this group of organisms were so different in genetic profile (not to mention lifestyle) that they deserved their own taxonomic domain. Thus, we now recognize certain bacteria-like creatures as Archaea.

The technical considerations behind the distinction between bacteria and archaea are rather deep and have to do with codon usage patterns, ribosomal RNA structure, cell-wall details, lipid metabolism, and other esoterica, but one distinguishing feature of archaea that's easy to understand is their willingness to live under harsh conditions. Archaeal species tend to be what we call extremophiles: They usually (not always) take up residence in places that are incredibly salty, or incredibly hot, or incredibly alkaline or acidic.

While it's generally agreed that eukaryotes arose after Archaea and bacteria appeared, it's by no means clear whether Archaea and bacteria branched off independently from a common ancestor, or perhaps one arose from the other. (A popular theory right now is that Archaea arose from gram-positive bacteria and sought refuge in inhospitable habitats to escape the chemical-warfare tactics of the gram-positives.) A complication that makes studying this sort of thing harder is the fact that horizontal gene transfer has been known to happen (with surprising frequency, actually) across domains.

Is it possible to study phylogenetic relationships, yourself, on the desktop? Of course. One way to do it: Obtain the DNA sequences of a given gene as produced by a variety of organisms, then feed those gene sequences to a tool like the tree-making tool at http://www.phylogeny.fr. Voila! Instant phylogeny.

The Ogg1 gene is an interesting case, because although the DNA-repair enzyme encoded by this gene occurs in a wide variety of higher life forms, plus Archaea, it is not widespread among bacteria. Aside from a couple of Spirochaetes and one Bacteroides species, the only bacteria that have this particular gene are the members of class Clostridia (which are all strict anaerobes). Question: Did the Clostridia get this gene from anaerobic Archaea?

Using the excellent online CoGeBlast tool, I was able to build a list of organisms that have Ogg1 and obtain the relevant gene sequences, all with literally just a few mouse clicks. Once you run a search using CoGeBlast, you can check the checkboxes next to organisms in the results list, then select "Phylogenetics" from the dropdown menu at the bottom of the results list. (See screenshot.)


When you click the Go button, a new FastaView window will open up, containing the gene sequences of all the items whose checkboxes you checked in CoGeBlast. At the bottom of this FastaView window, there's a small box that looks like this:


Click the Phylogeny.fr button (red arrow). Immediately, your sequences are sent to the French server, where they'll be converted to a phylogenetic tree in a matter of one to two minutes (usually). The result is a tree that looks something like this:


I've color-coded this tree to make the results easier to interpret. Creating a tree of this kind is not without potential pitfalls, because for one thing, if your DNA sequences are of vastly unequal lengths, the groupings made by Phylogeny.fr are likely to reflect gene lengths more than true phylogeny. For this tree, I did various data checks to make sure we're comparing apples and apples. Even so, a sanity check is in order. Do the groupings make sense? They do, actually. At the very top of the diagram (color-coded in green) we find all the eukaryotes grouped together: fruit-fly (Drosophila), yeast (Saccharomyces), fungus (Aspergillus). At the bottom of the diagram, Clostridium species (purplish red) fall into a subtree of their own, next to a tiny subtree of Methanobrevibacter. This actually makes a good deal of sense, because the two Methanobrevibacter species shown are inhabitants of feces, as are the nearby Clostridium bartlettii and C. diff. The fact that all the salt-loving Archaea members group together (organisms with names starting with 'H') is also indicative of a sound grouping. Overall, the tree looks sound.
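One of the simplest data checks is to compare sequence lengths before you submit anything. The sketch below is one way to do it, assuming your sequences are in ordinary FASTA text. It flags any sequence whose length differs from the median by more than a chosen fraction. The 25% tolerance is an arbitrary assumption, not a rule from Phylogeny.fr or CoGe, and the sequence names are placeholders.

```python
import statistics


def read_fasta(text):
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word of the header line
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def length_outliers(seqs, tolerance=0.25):
    """Names of sequences whose length is more than tolerance * median away from the median."""
    if not seqs:
        return []
    median = statistics.median(len(s) for s in seqs.values())
    return [n for n, s in seqs.items() if abs(len(s) - median) > tolerance * median]


# Toy FASTA: seqA is 40 nt (wrapped over two lines), seqB is 38 nt, seqC is 12 nt.
FASTA = (
    ">seqA demo\n" + "ACGT" * 5 + "\n" + "ACGT" * 5 + "\n"
    ">seqB\n" + "ACGT" * 9 + "AC\n"
    ">seqC\n" + "ACGT" * 3 + "\n"
)

if __name__ == "__main__":
    seqs = read_fasta(FASTA)
    for name, seq in seqs.items():
        print(name, len(seq))
    print("check these before building a tree:", length_outliers(seqs))
```

Anything flagged deserves a second look: it may be a gene fragment, a fusion, or a mis-annotated open reading frame, and any of those can distort the groupings.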

If you're wondering what all the numbers are, the scale bar at the bottom (0.4) shows how much horizontal branch length corresponds to roughly 0.4 substitutions per sequence position. The red numbers on the tree branches are branch-support values: the higher the number, the more confident the software is that the leaves under that branch really belong together. Probably the most important thing to know is that the evolutionary distance between any two leaves in the tree is proportional to the sum of the branch lengths connecting them. (The branch lengths are not explicitly specified; you have to eyeball it.) At the top of the diagram, you can see that the branch lengths of the two Drosophila instances are very short. This means they're closely related. By contrast, the branch lengths for Saccharomyces and the ancestor to Drosophila are long, meaning that these organisms are distantly related.
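To make the "sum of branch lengths" idea concrete, here is a small sketch that computes the leaf-to-leaf distance on a toy tree. The tree is stored as a table mapping each node to its parent and the length of the branch joining them. The node names and branch lengths are invented for illustration and are not read from the Ogg1 tree.

```python
def distances_to_ancestors(node, parent):
    """Map each ancestor of node (and node itself) to its cumulative distance from node."""
    dist = {node: 0.0}
    total = 0.0
    while node in parent:
        up, length = parent[node]
        total += length
        dist[up] = total
        node = up
    return dist


def leaf_distance(a, b, parent):
    """Sum of branch lengths on the path between leaves a and b."""
    from_a = distances_to_ancestors(a, parent)
    node, total = b, 0.0
    # Climb from b until we reach a node that is also an ancestor of a.
    while node not in from_a:
        up, length = parent[node]
        total += length
        node = up
    return from_a[node] + total


# Toy tree (hypothetical numbers): child -> (parent, branch length)
PARENT = {
    "fly_node": ("root", 0.50),
    "yeast": ("root", 0.90),
    "flyA": ("fly_node", 0.02),
    "flyB": ("fly_node", 0.03),
}

if __name__ == "__main__":
    print("flyA to flyB: ", round(leaf_distance("flyA", "flyB", PARENT), 3))
    print("flyA to yeast:", round(leaf_distance("flyA", "yeast", PARENT), 3))
```

The two fly leaves are joined by two very short branches, so their distance is small. Getting from a fly leaf to yeast means travelling up a long internal branch and back down another, which is why distantly related organisms end up far apart.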

Just to give you an idea of the relatedness, I checked the C. botulinum Ogg1 protein amino-acid sequence against C. tetani, and found 63% identity of amino acids. When I compared C. botulinum's enzyme against C. difficile's, there was 52% identity. With Drosophila there is only 32% identity, and even that applies only to a 46% coverage area (versus 90%+ for C. tetani and C. diff). Bottom line, the Blast-wise relatedness does appear to correspond, in sound fashion, to tree-wise relatedness.
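If you'd like to see what an identity percentage actually measures, the sketch below computes it from two already-aligned sequences, using "-" for gaps. It assumes the common BLAST-style convention of dividing the number of identical columns by the full alignment length, so gap columns count against identity. The two short peptide strings are made up and are not real Ogg1 sequences.

```python
def percent_identity(aligned_a, aligned_b):
    """Identical columns divided by alignment length, as a percentage."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be the same length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


def percent_coverage(aligned_query, query_length):
    """Share of the full query sequence that takes part in the alignment."""
    residues = sum(1 for c in aligned_query if c != "-")
    return 100.0 * residues / query_length


if __name__ == "__main__":
    a = "MKT-LLVA"  # made-up aligned fragments
    b = "MKTALIVA"
    print(f"identity: {percent_identity(a, b):.1f}%")
    print(f"coverage of a 20-residue query: {percent_coverage(a, 20):.1f}%")
```

Identity and coverage have to be read together. A high identity over a short stretch of the protein is much weaker evidence of relatedness than a moderate identity over nearly the whole length.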

Two things stand out. One is that not all of the Clostridium species group together. (There's a small cluster of Clostridia near the salt-lovers, then a main branch near the methane-producing Archaea. The out-group of Clostridia near the salt-lovers happen to all have chromosomal G+C content of 50% or more, which makes them quite different from the rest of the Clostridia, whose G+C is under 30%.) The other thing that stands out is that it does appear as if Clostridial Ogg1 could be Archaeal in origin, based on the relationship of Methanoplanus and Methanobrevibacter to the main group of Clostridia. (Also, the C. leptum group's Ogg1 may share an ancestor with the halophilic Archaea.) One thing we can say for sure is that Ogg1 is ancient.

It's tempting to speculate that the eukaryotes obtained Ogg1 from early mitochondria, and that early mitochondria were actually Archaeal endosymbionts. The first part is plausible, because we know that early mitochondria quickly exported most of their DNA to the host nucleus. (Today's mitochondrial DNA is vestigial. Well over 90% of mitochondrial genes are actually in the host nucleus. Things like mitochondrial DNA polymerase have to be translated from mRNA that is transcribed in the nucleus.) Whether or not early mitochondria were Archaeal endosymbionts, no one knows.

Anyway, I hope this shows how easy it is to generate phylogenetic trees from the comfort of a living room sofa, using nothing more than a laptop with a wireless internet connection. Try making your own phylo-trees using CoGeBlast and Phylogeny.fr, and let me know what you find out.