unemployment depression: Ralstonia

More Science on the Desktop

Not to keep harping on the amazing power of desktop omics tools, but I thought I'd share a tip for those of you into genome-mining. The tip in a nutshell: gang-load a bunch of FASTA sequences (DNA sequence data) into the FeatView form at http://genomevolution.org, then click the rather inconspicuous button labeled "Phylogeny.fr" at the bottom left of the FeatView page. You'll be taken automatically to http://www.phylogeny.fr, where you'll get a phylogenetic tree generated in real time from the sequence data you provided in FeatView, with no effort on your part (it's truly a one-click operation). Copy and paste DNA sequences into FeatView, click one button, and 30 seconds later a tree shows up on your screen, looking (perhaps) something like this:

The reason I made this tree is that I wasn't satisfied with my knowledge of the relatedness of certain weird microorganisms I've recently run into. Namely:

Ralstonia (which I mentioned yesterday); WEIRD BECAUSE: It turns hydrogen gas and CO2 into plastic.
Bordetella, a bronchial infection agent; WEIRD BECAUSE: It turns out to be very similar, genetically, to Ralstonia.
Burkholderia, a soil organism (and human and animal pathogen); WEIRD BECAUSE: It has an unexpectedly large amount of genetic similarity to Ralstonia and Polynucleobacter.
Polynucleobacter, a ditch-water bacterium; WEIRD BECAUSE: It can live as an intracellular parasite of freshwater ciliates or it can live independently in soil (making it potentially a great study organism for determining the genetic bases of intracellular symbiosis).
Thiomicrospira, a very tiny CO2- and sulfur-loving organism; WEIRD BECAUSE: It can only be found near deep-sea thermal vents (see my previous writeup).
Polaromonas, a relatively newly discovered and still poorly understood bacterium; WEIRD BECAUSE: It is abundant in glacier ice on multiple continents. Plus it has an amazing (and totally unexpected) amount of genetic overlap with our good friend Bordetella, the whooping-cough bug.

If you're not familiar with how bacterial classification works, let's just say it's a mess. There's a long historical tradition of classifying microorganisms based on a hodgepodge of ad hoc methods involving everything from physical appearance under the microscope (especially after staining with crystal violet), to the habitat of the organism, to its ability to metabolize various substances, its ability to make spores, adaptation to oxygen or lack of oxygen, serological characteristics, etc. It's always been an error-prone system, resulting in many misclassifications and later corrections, owing to its inconsistency and basic irrationality, to put it bluntly. With the advent of molecular genetic techniques, it's now possible to create accurate phylogenies based on little more than DNA sequence differences, usually involving the 16S ribosomal RNA (more here).

Freshwater ciliates (like this Euplotes) are
home for Polynucleobacter endosymbionts.

As big an advance as ribosome-based phylogeny is, it's pretty far from ideal (IMHO), mainly because it ignores phenotypes. In fact it's pretty far removed from anything at all having to do with an organism's ecology, metabolism, mode of living, etc. What are we really measuring when we measure relatedness according to a 16S ribosomal yardstick? Just the rate of random mutation accumulation in a pretty uninteresting cell artifact. I'd rather have a yardstick that's tied to phenotypic reality than to a slow-to-change, "highly conserved" piece of cold dead scaffolding.

So to create my own "family tree" of two dozen or so microbes, I said to hell with 16S ribosomes and decided to use, as my yardstick, genetic variation in the GroEL gene, which codes for the 60-kilodalton heat-shock protein. I chose this protein (or rather, the gene for it) as my phylo-yardstick for a number of reasons.

First, the DNA sequence is sizable, at about 1643 nucleotides (making it somewhat bigger than the 16S rDNA). It's important to have a large yardstick gene when looking for faint genetic signals.

Secondly, this protein is essentially universal in prokaryotes. It's ubiquitous but not necessarily highly conserved, in the same sense that rRNA is highly conserved. ("Highly conserved" is not what you want. Think about it. Taken to the extreme, a "highly conserved" sequence is invariant. It never changes. And is therefore useless for phylogenetics.)

Thirdly, the GroEL heat-shock protein has multiple intracellular touchpoints: It's known to interact with GroES, ALDH2, and dihydrofolate reductase, and it's involved in signal transduction (it's induced not just by heat but by hydrogen peroxide). Not to overlook the obvious, but it is also a touchpoint protein for any enzyme that can be repaired by the 60 kDa heat-shock protein. That's probably dozens if not hundreds of enzymes. Why is that important? Think about it: A protein that is sensitive to the 3D conformational requirements of other proteins has to evolve in response to the needs of all the proteins it services. A heat-lover (such as the thermal-vent organism Thiomicrospira) is going to need a different heat-shock repair system than a psychrophile (Polaromonas). A salt-lover needs a different one than a freshwater-lover. GroEL has to reflect, in its own structure, the many shifting requirements of the host proteome. These considerations make GroEL a highly appropriate basis gene for phylogenetic analysis.
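To make the "highly conserved is useless" point concrete, here is a minimal Python sketch. It is not what phylogeny.fr runs (the real pipeline aligns the sequences and builds the tree with dedicated tools); it just computes the simplest possible distance, the fraction of aligned positions that differ. The function names and the toy sequences are made up for illustration and are not real GroEL data.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of positions at which two pre-aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def distance_matrix(seqs: dict) -> dict:
    """Pairwise p-distances for every unordered pair of named sequences."""
    names = sorted(seqs)
    return {
        (x, y): p_distance(seqs[x], seqs[y])
        for x in names
        for y in names
        if x < y
    }


# Toy aligned fragments (hypothetical, for illustration only).
conserved = {"bug1": "ATGGCAGCTA", "bug2": "ATGGCAGCTA", "bug3": "ATGGCAGCTA"}
variable = {"bug1": "ATGGCAGCTA", "bug2": "ATGGCTGCTA", "bug3": "ATGACTGCAA"}

# An invariant gene gives all-zero distances, so there is nothing to build a tree from.
print(distance_matrix(conserved))
# A gene that varies gives distinct distances that a tree-builder can work with.
print(distance_matrix(variable))
```

With the invariant set every pair comes out at distance zero, so no tree-building method can tell the three bugs apart. The variable set gives three different distances, which is the raw material a phylogeny needs.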

And frankly, I think the GroEL-based phylo-tree phylogeny.fr spit out for me (see illustration further above) speaks for itself. It's a remarkably informative (and accurate) tree. GroEL evolutionary differences not only accurately grouped endosymbionts together, soil organisms together, aquatic organisms, etc., it also correctly grouped the "enteric-alike" Erwinia with E. coli and Shigella, and it cannily put Polaromonas with soil organisms (rather than aquatics), which I think is correct, based on recent Polaromonas isolates being found in soil rather than snow. Likewise, it's good to see Bdellovibrio (a freshwater bug) clustered with Polynucleobacter (which is symbiotic with a ciliate protozoan), with Thiomicrospira (the saltwater hydro-vent organism) a very nearby out-node.

If you get an infection while in a hospital, pray
it's not Clostridium difficile, which is often deadly.

A harder call to make is Clostridium difficile, which is present in 1% to 5% of non-ill people's intestines. Is it an enteric (a la E. coli)? Definitely not. The Clostridia (botulism, tetanus, etc.) are spore-forming soil bacteria. Their placement in the tree not far from the soil-dwelling spore-former, Bacillus thuringiensis, is thus eminently correct. Bacillus is a proximal out-node relative to Clostridium, which is understandable in that Bacillus is aerobic whereas Clostridia are strict anaerobes.

Buchnera (an aphid symbiont) comes at an odd location, much further away from the insect-dwelling Wolbachia than I would have predicted, but then again Buchnera's host feeds on cold sap whereas Wolbachia's hosts typically feed on warm blood. All the organisms around Wolbachia in the tree are hemophiles.

Our good friend Bordetella (of pertussis fame) is placed firmly in the soil group. I think that's real and significant. When you start to look at Bordetella's high DNA sequence similarity with Ralstonia and Burkholderia, it would be surprising, actually, if it fell anywhere else in the tree.

Honestly, when I took Bacterial Ecology 201 in college, many years ago, it was under duress and I hated the experience. But now, decades later, I'm starting to like it. With tools like those available for free at http://genomevolution.org and http://www.phylogeny.fr, what's not to like?

A Tale of Two Microbes

One area where Big Data has started to pay big dividends is in genome research, and you can begin to taste the payoff yourself, right now, if you want to come along as I show you how to mine genetic data from public databases in the service of a little desktop microbial genetics. You'll be amazed at what you can do.

No one knows why, but when Ralstonia eutropha
eats too much, it produces plastic granules
instead of, say, starch or fat. Go figure.

For today's experiment, we're going to compare the genomes of two bacteria, one of which you know very well, the other of which you don't, unless you've got way too much time on your hands. The germ you already know is Bordetella, the whooping cough bug. The bug you haven't heard of is Ralstonia eutropha, a soil organism that has the amazing ability to subsist only on hydrogen gas, nitrate, and carbon dioxide. In return, it produces wicked-crazy quantities of plastic (yes, plastic—it stores carbon as polyhydroxybutyrate), and because it's potentially useful to industry, Ralstonia's DNA, like Bordetella's, has been fully sequenced.

If you go right now to http://genomevolution.org/r/8o1x, you'll see that I've set up a little experiment for you. You shouldn't have to press the pink "Generate SynMap" button on that page. It should run automatically (but if you don't see an image like the one below, hit the button).

Every dot in this dot-plot represents a match between
a gene in Bordetella bronchiseptica and a gene in
Ralstonia eutropha. See text for discussion.

What has happened is that the SynMap server has been instructed to go find the complete DNA sequence of Ralstonia eutropha Strain H16 as well as the complete DNA sequence for Bordetella bronchiseptica Strain RB50, and run a comparison of one against the other. It so happens Bordetella has a single chromosome with 5,339,179 base pairs, whereas our hydrogen-loving, plastic-storing friend Ralstonia has 3 separate DNA molecules totalling 7,416,678 base pairs. (It has two chromosomes, plus one much smaller auxiliary molecule called a megaplasmid.)

Every point on the above graph represents a match between a gene in Bordetella and a gene in Ralstonia. The X-axis represents locations on the Bordetella genome (starting from one end and going to the other). The Y-axis plots locations on the Ralstonia genome. All we're doing is mapping one genome to another and tallying the significant matches.

This is a massive number of matches (well over 10,000), just to let you know. Usually, when you compare organisms, you don't see this many dots. I chose Bordetella and Ralstonia because I knew there'd be a lot of hits, based on my own prior experiments. And by the way, I don't think most microbiologists are aware (yet) that Bordetella and Ralstonia are extremely closely related. This is new information I'm sharing with you.

It's one thing to get a bunch of points on a dot-plot, but how do we really know these two organisms are related? This is where synteny comes in. Synteny is the degree to which two chromosomes share blocks of order. The key intuition is that merely sharing genes isn't enough; what counts is whether matching genes are in the same arrangements. If genome A has genes X, Y, and Z, in that order, and genome B also has genes X, Y, and Z (in the same order), we say that A and B share a syntenous triplet. The genomes have a degree of synteny.
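The X-Y-Z idea can be sketched in a few lines of Python. This is a toy, not how SynMap works: it assumes each genome is just a list of gene names, that each name appears at most once per genome, and that a "syntenous run" means genes sitting directly next to each other in the same order in both lists. The function name `collinear_runs` and the example gene names are hypothetical.

```python
def collinear_runs(genome_a, genome_b, min_len=3):
    """Return runs of genes that are adjacent, and in the same order, in both genomes.

    Assumes every gene name is unique within each genome.
    """
    pos_b = {gene: i for i, gene in enumerate(genome_b)}
    runs, current = [], []
    for gene in genome_a:
        # Extend the run only if this gene directly follows the previous one in genome B.
        if current and gene in pos_b and pos_b[gene] == pos_b[current[-1]] + 1:
            current.append(gene)
            continue
        # Otherwise close the current run and, if the gene is shared, start a new one.
        if len(current) >= min_len:
            runs.append(current)
        current = [gene] if gene in pos_b else []
    if len(current) >= min_len:
        runs.append(current)
    return runs


genome_a = ["X", "Y", "Z", "P", "Q"]
genome_b = ["Q", "P", "X", "Y", "Z"]

# Both genomes share all five genes, but only X, Y, Z keep their order.
print(collinear_runs(genome_a, genome_b))
```

Both toy genomes contain the same five genes, yet only X, Y, Z count as a shared block, because P and Q have swapped places. That is the difference between merely sharing genes and sharing gene order.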

The SynMap tool is very powerful because it lets you find syntenous regions in DNA, and it's tunable. If you go to the Analysis Options tab on the SynMap page, you'll see that you can set two parameters called Maximum Distance Between Two Matches, and Minimum Number of Aligned Pairs. The URL that I sent you to (for our experiment) has values of 50 and 2, respectively, already dialed in. That means the graph is plotting every occurrence of 2 gene-pair matches that occurred between genes no more than 50 genes apart. That's a pretty liberal setting. If two organisms are related, you can expect to see a lot of matches.

But what I propose you try (if you want) is setting "Maximum Distance Between Two Matches" to 500 and "Minimum Number of Aligned Pairs" to 250. (Then click the Generate SynMap button to refresh the graph.) This is a much more stringent requirement: It tells SynMap to try to find 250 matched genes within any given 500-gene region, do it for all regions of both genomes, and plot the results, if any. A 250-gene chunk is a pretty large syntenous region for a creature that has only 10,000-or-so genes to begin with.
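If you want a feel for what the two knobs do, here is a crude Python stand-in. It is emphatically not SynMap's actual algorithm, just a greedy sketch under my own assumptions: each match is a pair of gene positions (one in each genome), matches get linked into a chain when the next match lies ahead of the last one by no more than `max_dist` genes on both axes, and chains with fewer than `min_pairs` matches are thrown away. It only follows chains running in the forward direction. All names and the toy data are hypothetical.

```python
def chain_matches(matches, max_dist, min_pairs):
    """Greedily link (pos_a, pos_b) matches into forward chains, then filter by size."""
    chains = []
    for a, b in sorted(matches):
        for chain in chains:
            last_a, last_b = chain[-1]
            # Join a chain only if this match is ahead of its last point,
            # and within max_dist genes of it, in both genomes.
            if 0 < a - last_a <= max_dist and 0 < b - last_b <= max_dist:
                chain.append((a, b))
                break
        else:
            chains.append([(a, b)])
    return [chain for chain in chains if len(chain) >= min_pairs]


# Toy data: a tidy six-gene diagonal block plus a few scattered hits.
block = [(10 + i, 110 + i) for i in range(6)]
scatter = [(40, 7), (60, 30), (200, 50)]
matches = block + scatter

loose = chain_matches(matches, max_dist=50, min_pairs=2)
strict = chain_matches(matches, max_dist=50, min_pairs=5)
print([len(chain) for chain in loose])
print([len(chain) for chain in strict])
```

The loose setting keeps the six-gene block and also a two-hit chain that is probably just coincidence. Raising the minimum number of pairs leaves only the block, which is the same effect you see when the dot-plot thins out at stricter settings.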

The result of our hunt for super-large 250-gene syntenous regions is shown in the first graph below. The red dots represent the regions. They run from the top of the Y-axis to the lower right corner. Remember that the axes map directly to positions on the genome. What the diagonal line says is that there's a near-linear mapping of syntenous regions from one genome to the other.

The second graph below shows what happens when we re-tune our DNA-matching parameters to find blocks of 200 ordered genes within each 500-gene domain. We're looking for shorter runs of genes (200 instead of 250), which should be more plentiful. And they are. This time our graph looks like an 'X'. Why? Bacterial chromosomes do a lot of rearranging, and one of the most common events is a symmetric inversion around the origin of replication (and/or the terminus of replication). If you get enough of these inversions of various sizes, you end up with pieces of DNA that used to be near the start of the chromosome ending up near the end, and vice versa. (Repeat for all intermediate locations as well.) If you want to know more about how and why this ends up making an X-pattern on a dot-plot, be sure to read the classic paper by Eisen et al. called "Evidence for symmetric chromosomal inversions around the replication origin in bacteria," Genome Biology 2000, 1(6):research0011.1–0011.9 (unlocked PDF here).
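You can watch the X appear in a toy simulation. The Python sketch below rests on some simplifying assumptions of mine: the chromosome is a plain list of gene numbers, the replication origin sits at the middle of the list, and every inversion flips a stretch centered exactly on that origin. Real inversions are only roughly symmetric, so real dot-plots are fuzzier. The function names are hypothetical.

```python
import random


def symmetric_inversion(order, half_width):
    """Reverse a stretch of genes centered on the origin (the middle of the list)."""
    mid = len(order) // 2
    lo, hi = mid - half_width, mid + half_width
    return order[:lo] + order[lo:hi][::-1] + order[hi:]


def simulate(n_genes=1000, n_inversions=25, seed=1):
    """Apply a series of symmetric inversions of random size to an ordered genome."""
    rng = random.Random(seed)
    order = list(range(n_genes))
    for _ in range(n_inversions):
        order = symmetric_inversion(order, rng.randint(1, n_genes // 2))
    return order


def classify(order):
    """Count genes on the diagonal (unmoved) and on the anti-diagonal (mirrored)."""
    n = len(order)
    on_diagonal = sum(1 for pos, gene in enumerate(order) if gene == pos)
    on_anti_diagonal = sum(1 for pos, gene in enumerate(order) if gene == n - 1 - pos)
    return on_diagonal, on_anti_diagonal


small = symmetric_inversion(list(range(10)), 2)
print(small)            # genes 3..6 are flipped around the middle
print(classify(small))  # counts of unmoved vs mirrored genes

order = simulate()
on_diagonal, on_anti_diagonal = classify(order)
print(on_diagonal + on_anti_diagonal == len(order))
```

In this idealized model each gene ends up either where it started or at its mirror-image position across the origin, depending on whether it was flipped an even or odd number of times. Plot new position against old position and every point falls on one of two crossing lines, which is the X.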

Genomes compared with synteny-block size 250.

Synteny block size 200.

Block size 175.

Block size 120, max domain size 180 genes.

Block size 90, max domain 130.

Block size 2, max domain size 50.

The third and fourth graphs in this series show what happens when we tune our match for smaller block sizes. In the third graph, we've set "Maximum Distance Between Two Matches" to 500 and "Minimum Number of Aligned Pairs" to 175, which produces what looks like two really poorly drawn X's superimposed on each other. As we get more permissive with our synteny matches, we start to see the results of more inversion events. It makes sense that shorter synteny blocks will be swept up in more successful inversions, because an inversion that cuts across a large synteny block is probably fatal in many cases. (Some large groups of genes need to be kept together, for proper gene regulation. If an inversion event cuts through a critical regulon at the wrong spot, the cell might not go on to reproduce.)

As we keep tuning the "Minimum Number of Aligned Pairs" downward, the graphs become more cluttered as we see the results of many thousands of inversion events in the history of the chromosomes.

The fourth graph uses values of 180 and 120 for Max Distance and Minimum Number of Aligned Pairs, then in graph five we have values of 130 and 90. And finally, in the last graph, we have 50 and 2. The final graph is mostly noise. But buried in the noise are many faint signals that can be seen by twiddling the knobs on the synteny settings.

I hope this little experiment has convinced you that desktop genomics has reached an exciting stage. (I've only scratched the surface, here, of what the tools at http://genomevolution.org can do.) I also hope I've convinced any microbial geneticists who might be reading this that Bordetella and Ralstonia are very closely related indeed. (Which should come as news. I don't think it's been reported.) You wouldn't think a hydrogen-loving soil organism would have much in common with a throat-dwelling pathogen, but as I like to say: DNA doesn't lie!
