Pages

.

Showing posts with label proteomics. Show all posts
Showing posts with label proteomics. Show all posts

More ZunZun Graphs and Biohacking

A few days ago I gave a detailed front-to-back tutorial on how to do a bit of bio-hacking. I showed how to quickly get the amino acid sequences for 25 versions of the Hsp40 (DnaJ) heat shock protein, as produced by 25 different organisms, then create a graph of arginine content versus lysine content for the 25 organisms (all bacterial and Archaea). The resulting graph looked something like this:

Lysine and arginine (mole-fraction) in the DnaJ protein of 50 microorganisms.

In this particular graph there are 50 points, because I went back to UniProt and added 25 more organisms to the mix, to see if the trend line would hold true. (The correlation actually got stronger: r=0.82.) The graph clearly shows that as lysine concentration goes up, arginine concentration goes down. Does this graph prove that lysine can take the place of arginine in DnaJ? That's not exactly what the graph says, although it's a worthwhile hypothesis. To check it out, you'd want to look at some AA sequence alignments to see if arginine and lysine are, in fact, replacing each other in the exact same spots in the protein, across organisms. Certainly it would be reasonable for lysine to replace arginine. Both are polar, positively charged amino acids.

I should point out that the organisms in this graph vary greatly in genomic G+C content. The codon AAA represents lysine, and experience has taught me (maybe you've noticed this too, if you're a biogeek) that lysine usage is a pretty reliable proxy for low G+C content. If you check the codon usage tables for organisms with low-GC genomes, you'll see that they use lysine more than any other amino acid. In Clostridium botulinum, AAA accounts for 8% of codon usage. In Buchnera it's 9%. Low G+C means high lysine usage.

Likewise, organisms with low G+C quite often have a disproportionately low frequency of 5'-CpG-3' dinucleotides in their DNA (much lower than would occur by chance). The technical explanation for this is interesting, but I'll leave it for another day. Suffice it to say, organisms with low CpG tend, by definition, not to use very many codons that begin with CG, all of which code for (guess which amino acid?) arginine.

To see if the arg-lys inverse relationship holds for higher organisms, I gathered up DnaJ sequences for 25 plants. The results:

Same idea as above, but this time using data for 50 plants (instead of bacteria).

Same negative relationship. However, note one thing: The scale of the graph's axes do not match the scale of the axes in the previous graph (further above). In this graph, we're seeing a very narrow range of frequencies for both Arg and Lys. Fortunately, ZunZun's interface makes it easy for us to re-plot the plant data using the same axis scaling as we had for our bacterial data. By constraining the x- and y-axis limits, we can re-visualize this data in an apples-to-apples context:

The plant data, re-plotted with the same axis scaling as used in the bacterial plot further above.

If you compare this graph carefully to the first graph, at the top of the page, you'll see that the points lie on pretty much the same line.

Just for fun, I went back to the UnitProt site and searched for DnaJ in insects (Arthropoda, which technically also subsumes crustacea). Then I plotted the bug data using the same x- and y-axis scaling as for the bacteria:


Same graph, this time for 62 insect DnaJ sequences.
The insect points tend to cluster higher in the graph, and much further to the left than for plants, indicating that the arthropods seem to like Arg and don't much care to use Lys. The moral, I suppose, is that if your diet is lacking in arginine, you should eat fewer plants and more insects.


reade more... Résuméabuiyad

Do-It-Yourself Bio-Hacking: A Tutorial

Today I want to show you how you can do a slick bio-hacking experiment, and graph the results nicely, all in your browser, in well under 10 minutes. The following experiment will run just fine in Chrome or Firefox. In Firefox, it helps to have the Firebug extension. (If you're using Firefox, click F12. If it pops a console window, you already have Firebug.) I tested against Chrome v28.0.1500.72 and Firefox 15.0.1 with Firebug 1.9.2. Other combinations may work; those are just the ones I tested.

We're going to do a comparative genomics/proteomics experiment designed to explore amino-acid usage in a particular protein (DnaJ) across a couple dozen bacterial species. Even if you're not a bio-geek, I hope you'll follow along. At the very least, you'll learn how to make pretty graphs from any kind of data using the server at ZunZun.

What is the DnaJ protein, you ask? It's one of a class of proteins known as heat shock proteins, which are produced in response to elevated temperatures. (Your body produces heat shock proteins in response to fever, for example.) As you probably know (or can guess), proteins, in general, are rather sensitive to heat. Even a small amount of heat can cause a protein to start to unravel (or denature). DnaJ and its partners have the job of helping proteins re-fold into their correct original 3D shape(s) after exposure to heat. They're like little repair jigs. A partially damaged protein goes in; it re-folds and comes back out good as new.

Heat shock proteins occur widely, across all domains of life, and their amino-acid sequences are highly conserved; but they do differ. As we'll see right now.

Step 1
Go to http://www.uniprot.org/ and enter "DnaJ" (case doesn't matter) in the search field at the top of the page, then hit Enter. A list of organisms with DnaJ will appear, each with a checkbox on the left. Check all the checkboxes on the page (gang-check them with Shift-click).

Step 2
You'll notice at the bottom of the window there's a green bar with buttons "Retrieve," "Align," "Blast," and "Clear." Click the Retrieve button.

Steps 1 and 2.


Step 3
In the page that comes up, look for FASTA on the left. Under it are two links, Download and Open. Click Open. (See screen shot below.) You'll see a bunch of protein sequences (with one-letter abbreviations for amino acids), each preceded by a line that begins with > (greater-than sign). These are our DnaJ proteins.

Step 3.

Step 4
Click F12 to toggle open the console window. Be sure the Console tab is showing. In Firebug, you may also have to click the Console menu and choose Command Editor from the dropdown list.

Enter and execute (with Enter, in Chrome, or with Control-Enter in Firebug) the following lines of code:

all = document.getElementsByTagName("pre")[0].innerHTML.split(/>/);
all.shift(); 

It's important that the part between slashes be ampersand-g-t-semicolon, not a greater-than symbol. The browser is showing you greater-than signs but in the HTML markup it's really ampersand-g-t-semicolon, not angle brackets. We actually do want to split on >, not on >.

Note that to execute a line of code in Firebug you have to type Control-Enter. In Chrome, you just type Enter. But in Chrome's JavaScript console, you have to use Shift-Enter to type on more than one line.

The variable all now contains an array of protein sequences. If you want to verify it, type all.length (then Enter, or in Firebug Control-Enter), and you should see the length of the array, 25.

Step 5
Enter the following code in the console (and execute it with Enter; it'll do nothing, which is fine).


function analyze( item ) {

var sequence = item.split(/SV=\d\n(?=\w)/)[1];
var lysineCount = sequence.match(/K/g).length;
var arginineCount = sequence.match(/R/g).length;
lysineCount /= sequence.length;
arginineCount /= sequence.length;
console.log( lysineCount + " " + arginineCount );

}
 
This is the callback code we'll use to process every member of the all array. Each item in the array consists of a FASTA header followed by a protein sequence. We just want the sequence, not the header, which is why we have a first line that splits off the part we need. The remaining lines obtain the number of lysines (K) and the number of arginines (R) in the protein sequence, then we divide those numbers by the sequence length to get a frequency-of-occurrence. The final line prints the results to the console window.

This function, by itself, doesn't do anything until we run it against each amino-acid sequence in the all array. That's the next step.

Step 6
Enter the following line of code into the console and run it with Enter (or Control-Enter, in Firebug):

all.forEach( analyze );
 
The console should immediately fill with numbers (25 rows of two numbers each). That's our data. We need to graph it to see what it looks like. Ready?

Step 7
Go now to http://zunzun.com and notice four pulldown menus at the top of the page. Use the far-left dropdown to select Polynomial.

Select Polynomial from the ZunZun function list.

A new window appears with ugly (or beautiful, depending on your mindset) formulas. Click the link to First-Order (Linear) 2D. Why? Because in the absence of any foreknowledge, we're going to blindly assume that our data is best fit by a straight line. If it's not straight-line data, we can come back and change our selection later.

When you click the First-Order (Linear) 2D link, you'll quickly be in a stark-looking window with a single pulldown menu at the top. Click it and select Data Labels for Graphs. Replace "X data" with "Lysine" and "Y data" with "Arginine."

Select Data Labels for Graph.

Step 8
Now use the single pulldown menu to select Text Data Editor.

Quickly go back to that console window and Copy all of your data (all 25 rows of numbers), then Paste the data into the Text Data Editor box.



Click the Submit button near the top of the page. Be patient, as it may take up to 20 seconds or so for your graph to be ready.

You'll know your graph is ready when the window changes to one that shows four pulldown menus at the top. The far-right menu is Data Graphs. Click into it and select Arginine vs. Lysine with model. NOTE: The exact names of the menu items will depend on how you labeled your axes at the end of Step 7 above.

You should see the window change to a view of a graph that looks like this:

Graph created on demand by the ZunZun server.

Pretty easy, right? It gets better. The line that ZunZun drew through the data points is a regression curve that minimizes the sum of squared error. To see the formula for the line, including coefficients, use the far-left menu, called Coefficients and Text Reports, to select Coefficients. Don't worry, your graph will still be there when you're done. To get back to the graph at any time, just use the far-right menu and any of the commands under it (which re-display the graph in various ways).

The graph seems to be saying that Arginine levels go down as Lysine goes up. But how good of a correlation is this, really? Use the far-left pulldown menu again. This time select Coefficient and Fit Statistics. You'll notice a ton of stats (chi-squared and so on). Among them, r-squared is given as 0.637834788057. That means the correlation coefficient, r, is 0.799, which is pretty solid.

I'll save the interpretation of our experiment's results for another time. For now, notice that underneath your ZunZun graph are links for saving the graph either as PNG or SVG. I strongly recommend you save it as both. You can open SVG in both Photoshop and Illustrator (and most browsers too). You will definitely want to keep an SVG version around to edit by hand in your favorite text editor (SVG is just a variety of XML). I'll be showing you how to do lots of sexy things with SVG graphs in upcoming posts.

reade more... Résuméabuiyad

Strict Anaerobes that Produce Catalase

One thing every new bacteriology student learns on Day One is that some microbes are strict anaerobes (completely unable to use oxygen), and a universal characteristic of strict anaerobes is that they lack an important enzyme called catalase that breaks down hydrogen peroxide to oxygen and water. The idea is that anaerobes don't need to have catalase, because they don't live in the kind of highly oxidized environments where hydrogen peroxide forms. Lack of catalase is supposedly why many anaerobes are killed upon exposure to air. According to legend, once oxygen gets into the cells, hydrogen peroxide starts to build up, and with no catalase to break it down, anaerobes choke on toxic peroxides.

I'll let you in on a little secret, though. This nice-sounding story (about peroxide buildup killing anaerobes upon exposure to air) turns out to be mostly conjecture, not well supported by science. Even the bit about anaerobes lacking catalase isn't completely true. Many anaerobes do make catalase.

For today's post, I did a protein-sequence BLAST search against several families of obligate anaerobes using the katA gene of Proteus mirabilis as a reference, and I was quickly able to identify two dozen strict anaerobes that do, in fact, have a catalase gene (see table below).

Table 1: Strict Anaerobes that Produce Catalase
(tblastn query: Proteus mirabilis katA gene)

Organism
Length (AA)
E-value
Percent identities
Alkaliphilus metalliredigens strain QYMF
475
4.0E-97
40.0
Anaerococcus prevotii strain DSM 20548
473
2.0E-162
59.6
Anaerococcus vaginalis strain ATCC 51170
482
3.0E-171
61.4
Bacteroides coprocola strain DSM 17136
479
0
68.6
Bacteroides coprophilus strain DSM 18228
477
0
68.3
Bacteroides eggerthii strain 1_2_48FAA
478
0
69.6
Bacteroides intestinalis strain DSM 17393
478
0
70.0
Bacteroides ovatus strain 3_8_47FAA
478
0
69.0
Bacteroides plebeius strain DSM 17135
479
0
68.2
Bacteroides thetaiotaomicron strain VPI-5482
480
0
68.7
Clostridium botulinum A3 strain Loch Maree
341
4.0E-67
38.1
Clostridium botulinum B1 strain Okra
463
1.0E-67
33.9
Clostridium hathewayi strain WAL-18680
474
7.0E-167
58.6
Clostridium lentocellum strain DSM 5427
476
2.0E-168
59.4
Clostridium phytofermentans strain ISDg
472
3.0E-107
43.6
Desulfitobacterium dichloroeliminans strain LMG P-21439
477
0
72.3
Desulfitobacterium hafniense DCB-2
493
1.0E-100
39.9
Desulfosporosinus youngiae strain DSM 17734
491
7.0E-103
41.1
Desulfotomaculum ruminis strain DSM 2154
477
2.0E-142
52.2
Dethiobacter alkaliphilus strain AHT 1
468
5.0E-102
40.3
Lachnospiraceae bacterium strain 3_1_57FAA_CT1
470
1.0E-165
59.5
Propionibacterium acnes strain 266
444
3.0E-114
47.2
Syntrophobotulus glycolicus strain DSM 8271
484
2.0E-102
40.7
Veillonella sp. strain 3_1_44
474
0
66.0

Each entry in this table represents a protein-sequence (not DNA sequence) match between a gene in the organism listed and the catalase gene of Proteus mirabilis. (Proteus is a facultative anaerobe related to E. coli and Salmonella.) The length of each organism's catalase enzyme, in amino acids, is shown under Length. (By way of reference, the Proteus catalase is 484 amino acids long.) E-value is the so-called expectation value, a measure of how likely the sequence match would be by chance. All of the values shown are extraordinarily low. "Percent identities" is the percentage of amino-acid matches between the Proteus enzyme and the target organism's enzyme. Values in the 30% to 40% range are not unusual for functionally related enzymes in otherwise distantly related organisms. Values above 60% tend to suggest a phylogenetic relationship, whereas in two organisms that are known to be unrelated, a value above 70% would (in many cases) be considered evidence of possible horizontal gene transfer. 

Here's the protein-blast query sequence I used, in case you want to verify these results (or go looking for more catalase-producing anaerobes):

>Proteus mirabilis strain HI4320(v1, unmasked), Name: PMI1740, YP_002151471.1, katA, Type: CDS, Feature Location: (Chr: 1, 1861974..1863428) Genomic Location: 1861974-1863428
MEKKKLTTAAGAPVVDNNNVITAGPRGPMLLQDVWFLEKLAHFDREVIPERRMHAKGSGAFGTFTVTHDITKYTRAKIFSEVGKKTEMFARFSTVAGER
GAADAERDIRGFALKFYTEEGNWDMVGNNTPVFYLRDPLKFPDLNHIVKRDPRTNMRNMAYKWDFFSHLPESLHQLTIDMSDRGLPLSYRFVHGFGSHT
YSFINKDNERFWVKFHFRCQQGIKNLMDDEAEALVGKDRESSQRDLFEAIERGDYPRWKLQIQIMPEKEASTVPYNPFDLTKVWPHADYPLMDVGYFEL
NRNPDNYFSDVEQAAFSPANIVPGISFSPDKMLQGRLFSYGDAHRYRLGVNHHQIPVNAPKCPFHNYHRDGAMRVDGNSGNGITYEPNSGGVFQEQPDF
KEPPLSIEGAADHWNHREDEDYFSQPRALYELLSDDEHQRMFARIAGELSQASKETQQRQIDLFTKVHPEYGAGVEKAIKVLEGKDAK


reade more... Résuméabuiyad

A New Biological Constant?

Earlier, I gave evidence for a surprising relationship between the amount of G+C (guanine plus cytosine) in DNA and the amount of "purine loading" on the message strand in coding regions. The fact that message strands are often purine-rich is not new, of course; it's called Szybalski's Rule. What's new and unexpected is that the amount of G+C in the genome lets you predict the amount of purine loading. Also, Szybalski's rule is not always right.

Genome A+T content versus message-strand purine content (A+G) for 260 bacterial genera. Chargaff's second parity rule predicts a horizontal line at Y = 0.50. (Szybalski's rule says that all points should lie at or above 0.50.) Surprisingly, as A+T approaches 1.0, A/T approaches the Golden Ratio.
When you look at coding regions from many different bacterial species, you find that if a species has DNA with a G+C content below about 68%, it tends to have more purines than pyrimidines on the message strand (thus purine-rich mRNA). On the other hand, if an organism has extremely GC-rich DNA (G+C > 68%), a gene's message strand tends to have more pyrimidines than purines. What it means is that Szybalski's Rule is correct only for organisms with genome G+C content less than 68%. And Chargaff's second parity rule (which says that A=T an G=C even within a single strand of DNA) is flat-out wrong all the time, except at the 68% G+C point, where Chargaff is right now and then by chance.

Since the last time I wrote on this subject, I've had the chance to look at more than 1,000 additional genomes. What I've found is that the relationship between purine loading and G+C content applies not only to bacteria (and archaea) and eukaryotes, but to mitochondrial DNA, chloroplast DNA, and virus genomes (plant, animal, phage), as well.

The accompanying graphs tell the story, but I should explain a change in the way these graphs are prepared versus the graphs in my earlier posts. Earlier, I plotted G+C along the X-axis and purine/pyrmidine ratio on the Y-axis. I now plot A+T on the X-axis instead of G+C, in order to convert an inverse relationship to a direct relationship. Also, I now plot A+G (purines, as a mole fraction) on the Y-axis. Thus, X- and Y-axes are now both expressed in mole fractions, hence both are normalized to the unit interval (i.e., all values range from 0..1).

The graph above shows the relationship between genome A+T content and purine content of message strands in genomes for 260 bacterial genera. The straight line is regression-fitted to minimize the sum of squared absolute error. (Software by http://zunzun.com.) The line conforms to:

y = a + bx
 
where:
a =  0.45544384965539358
b = 0.14454244707261443


The line predicts that if a genome were to consist entirely of G+C (guanine and cytosine), it would be 45.54% guanine, whereas if (in some mythical creature) the genome were to consist entirely of A+T (adenine and thymine), adenine would comprise 59.99% of the DNA. Interestingly, the 95% confidence interval permits a value of 0.61803 at X = 1.0, which would mean that as guanine and cytosine diminish to zero, A/T approaches the Golden Ratio.

Do the most primitive bacteria (Archaea) also obey this relationship? Yes, they do. In preparing the graph below, I analyzed codon usage in 122 Archaeal genera to obtain A, G, T,  and C relative proportions in coding regions of genes. As you can see, the same basic relationship exists between purine content and A+T in Archaea as in Eubacteria. Regression analysis yielded a line with a slope of 0.16911 and a vertical offset 0.45865. So again, it's possible (or maybe it's just a very strange coincidence) that A/T approaches the Golden Ratio as A+T approaches unity.

Analysis of coding regions in 122 Archaea reveals that the same relationship exists between A+T content and purine mole-fraction (A+G) as exists in eubacteria.
For the graph below, I analyzed 114 eukaryotic genomes (everything from fungi and protists to insects, fish, worms, flowering and non-flowering plants, mosses, algae, and sundry warm- and cold-blooded animals). The slope of the generated regression line is 0.11567 and the vertical offset is 0.46116.

Eukaryotic organisms (N=114).

Mitochondria and chloroplasts (see the two graphs below) show a good bit more scatter in the data, but regression analysis still comes back with positive slopes (0.06702 and .13188, respectively) for the line of least squared absolute error.

Mitochondrial DNA (N=203).
Chloroplast DNA (N=227).
To see if this same fundamental relationship might hold even for viral genetic material, I looked at codon usage in 229 varieties of bacteriophage and 536 plant and animal viruses ranging in size from 3Kb to over 200 kilobases. Interestingly enough, the relationship between A+T and message-strand purine loading does indeed apply to viruses, despite the absence of dedicated protein-making machinery in a virion.

Plant and animal viruses (N=536).
Bacteriophage (N=229).
For the 536 plant and animal viruses (above left), the regression line has a slope of 0.23707 and meets the Y-axis at 0.62337 when X = 1.0. For bacteriophage (above right), the line's slope is 0.13733 and the vertical offset is 0.46395. (When inspecting the graphs, take note that the vertical-axis scaling is not the same for each graph. Hence the slopes are deceptive.) The Y-intercept at X = 1.0 is 0.60128. So again, it's possible A/T approaches the golden ratio as A+T approaches 100%.

The fact that viral nucleic acids follow the same purine trajectories as their hosts perhaps shouldn't come as a surprise, because viral genetic material is (in general) highly adapted to host machinery. Purine loading appropriate to the A+T milieu is just another adaptation.

It's striking that so many genomes, from so many diverse organisms (eubacteria, archaea, eukaryotes, viruses, bacteriophages, plus organelles), follow the same basic law of approximately

A+G = 0.46 + 0.14 * (A+T)

The above law is as universal a law of biology as I've ever seen. The only question is what to call the slope term. It's clearly a biological constant of considerable significance. Its physical interpretation is clear: It's the rate at which purines are accumulated in mRNA as genome A+T content increases. It says that a 1% increase in A+T content (or a 1% decrease in genome  G+C content) is worth a 0.14% increase in purine content in message strands. Maybe it should be called the purine rise rate? The purine amelioration rate?

Biologists, please feel free to get in touch to discuss. I'm interested in hearing your ideas. Reach out to me on LinkedIn, or simply leave a comment below.





reade more... Résuméabuiyad

Decrypting DNA

In a previous post ("Information Theory in Three Minutes"), I hinted at the power of information theory to gage redundancy in a language. A fundamental finding of information theory is that when a language uses symbols in such a way that some symbols appear more often than others (for example when vowels turn up more often than consonants, in English), it's a tipoff to redundancy.

DNA is a language with many hidden redundancies. It's a four-letter language, with symbol choices of A, G, C, and T (adenine, guanine, cytosine, and thymine), which means any given symbol should be able to convey two bits' worth of information, since log2(4) is two. But it turns out, different organisms speak different "dialects" of this language. Some organisms use G and C twice as often as A and T, which (if you do the math) means each symbol is actually carrying a maximum of 1.837 bits (not 2 bits) of information.

Consider how an alien visitor to earth might be able to use information theory to figure out terrestrial molecular biology.

The first thing an alien visitor might notice is that there are four "symbols" in DNA (A, G, C, T).

By analyzing the frequencies of various naturally occurring combinations of these letters, the alien would quickly determine that the natural "word length" of DNA is three.

There are 64 possible 3-letter words that can be spelled with a 4-letter alphabet. So in theory, a 3-letter "word" in DNA should convey 6 bits worth of information (since 2 to the 6th power is 64). But an alien would look at many samples of earthly DNA, from many creatures, and do a summation of -F * log2(F) for every 3-letter "word" used by a given creature's DNA (where F is simply the frequency of usage of the 3-letter combo). From this sort of analysis, the alien would find that even though 64 different codons (3-letter words) are, in fact, being used in earthly DNA, in actuality the entropy per codon in some cases is as little as 4.524 bits. (Or at least, it approaches that value asymptotically.)

Since 2 to the 4.524 power is 23, and since proteins (the predominant macromolecule in earthly biology) are made of amino acids, a canny alien would surmise that there must be around 23 different amino acids; and earthly DNA is a language for mapping 3-letters words to those 23 amino acids.

As it turns out, the genetic code does use 3-letter "words" (codons) to specify amino acids, but there are 20 amino acids (not 23), with 3 "stop codons" reserved for telling the cell's protein-making machinery "this is the end of this protein; stop here."

E. coli codon usage.
The above chart shows the actual codon usage pattern for E. coli. Note that all organisms use the same 3-letter codes for the same amino acids, and most organisms use all 64 possible codons, but the codons are used with vastly unequal frequencies. If you look in the upper right corner of the above chart, for example, you'll see that E. coli uses CTG (one of the six codons for Leucine) far more often than CTA (another codon for Leucine). One of the open questions in biology is why organisms favor certain synonymous codons over others (a phenomenon called codon usage bias).

While DNA's 6-bit codon bandwidth permits 64 different codons, and while organisms do generally make use of all 64 codons, the uneven usage pattern means fewer than 6 bits of information are used per codon. To get the actual codon entropy, all you have to do is take each usage frequency and calculate -F * log2(F) for each codon, then sum. If you do that for E. coli, you get 5.679 bits per codon. As it happens, E. coli actually does make use of almost all the available bandwidth (of 6 bits) in its codons. This turns out not to be true for all organisms, however.
reade more... Résuméabuiyad