Introns, Exons, and So-ons
(Part II)

Discussion (continued from Part I):


From Keith Robison: Okay Jeff, I think you are close to understanding the argument, but it is still eluding you.

Senapathy is claiming:

  1. It is improbable to find ORFs in random sequence. "The only way a gene longer than 600 nts could originate was to select some short reading frames and splice them together by editing out the intervening regions containing many stop codons."

  2. It is probable to find ORFs in spliced random sequence: "Such a splicing resulted in a long reading frame which could then code for a long protein."

Do you understand? He is claiming that the splicing process enables the formation of long ORF-bearing sequences. The long ORFs are in the spliced mRNA, not in the DNA. But because the initial sequence is random, the splicing signals will be randomly distributed. And because the splicing signals are much bigger than translational stop signals, and unrelated to them, the output spliced mRNA sequence will look statistically like the input random DNA sequence. So there must be another source of information in order for this to work.

In brief, your chance of finding a long ORF in the spliced mRNA transcribed from random DNA sequence is identical to the chance of finding a long ORF in the unspliced random DNA.
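The size of the effect Keith describes can be put in numbers with a minimal sketch. It assumes all four bases are equiprobable and independent, so that 3 of the 64 codons are stops; the function name and the 200-codon example (Senapathy's 600 nts) are illustrative choices, not taken from the book. If spliced random sequence is statistically the same as unspliced random sequence, the same formula applies to both.

```python
# Sketch: chance that a stretch of random codons contains no stop codon.
# Assumption: bases are uniform and independent, so each codon is a stop
# (TAA, TAG, TGA) with probability 3/64.

STOP_CODONS = 3
TOTAL_CODONS = 64


def prob_open_frame(n_codons: int) -> float:
    """Probability that n_codons consecutive random codons hold no stop."""
    return ((TOTAL_CODONS - STOP_CODONS) / TOTAL_CODONS) ** n_codons


# 600 nucleotides = 200 codons in a single reading frame.
p_600nt = prob_open_frame(200)
print(f"P(open frame of 200 codons) = {p_600nt:.2e}")
```

Under these assumptions the probability falls off geometrically with length, which is why long ORFs are rare in random sequence whether or not it has been spliced at random positions.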


From Wesley R. Elsberry: I'm intrigued. Why do you think the timing of the cut & paste job makes a difference, such that Robison's point no longer applies?

JM: It is important to our discussion because the random DNA looks like eukaryote DNA. If the splicing was done in the pond, before the seed cell was formed, then the genes would not have had exons and introns. Dr. Senapathy's theory and introns-early are very closely related.


From Keith Robison: (reprise) But because the initial sequence is random, the splicing signals will be randomly distributed. And because the splicing signals are much bigger than translational stop signals, and unrelated to them, the output spliced mRNA sequence will look statistically like the input random DNA sequence. So there must be another source of information in order for this to work.

JM: So, if the stop codons that would end a gene are not related to the splicing signals that start and end an exon or intron, then, after splicing, there would still be stop codons all over the place in the exons (and so it really wouldn't be a gene). I believe that summarizes your point, so you have no doubt now that I understand your point.

Keith: Bingo!

(reprise) In brief, your chance of finding a long ORF in the spliced mRNA transcribed from random DNA sequence is identical to the chance of finding a long ORF in the unspliced random DNA.

JM: All this assumes that the stop codons and the splicing signals are not related to each other. However, starting on page 244, Senapathy explains that the splicing signals are related to the stop codons and that the splicing mechanism must have come about through a selection process so as to achieve this relationship. He writes: "This system [of distinguishing between exons and introns] must have been primarily able to distinguish between what is a reading frame and what is a stop codon." Continuing on page 245 he shows that stop codons are correlated with the splice sites and that "the mechanism that identified genes consecutively selected its successive exons by looking for stop codons while reading a random sequence from 5' to 3'. ... the splice junction sequences which contain these stop codons must have originated due to these reasons, and serve as molecular signals for the exon-splicing process."

This may sound like he's imparting an intelligence to the splicing process, but that is not so, just as there is no intelligence behind the putative workings of natural selection. Senapathy is saying: (1) we see long reading frames in life, (2) it is apparently necessary to have long reading frames for life (at least life as we know it), (3) the splicing mechanism that works must be one that results in long reading frames, and (4) this is confirmed by finding a correlation between the locations of stop codons and the "resulting" splice signals. If this particular mechanism (or some other viable one) had not come about, we wouldn't be here to ponder it.

Keith: Senapathy is just plain wrong. For a careful analysis of splicing signals, see:

Looking at the logos tells us several things:

  1. The donor ("start-splice") signal has the consensus

    exon intron
    G GT[AG]AG

    Yes, you can find both TAA and TGA stops here, but of course only about 50% of the time. Furthermore, the stop would lie in phase 1 (between the first and second bases of a codon), and there is a slight excess of phase 0 introns. So, for the majority of the data (phase 0 + phase 2 introns > 50% of all introns), this is a poor explanation.

  2. The donor site contains about 5bp of conservation (the paper has the exact amount) -- 2 more than a stop codon. Therefore, at best 1 out of every 16 stops would be a splicing signal (actually, 1 in 16 TAA or TGA codons, but not TAG).

  3. The acceptor site has the consensus


    with C almost equiprobable -- but C predominating. Again, the resemblance to a stop codon is tenuous.
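The "1 out of every 16" figure in point 2 can be checked with a short sketch. It assumes each conserved position is fully specified and that bases are equiprobable and independent, which overstates the real conservation; the names below are illustrative.

```python
# Sketch: how much rarer is a ~5 bp donor signal than a 3 bp stop codon?
# Assumption: every conserved position must match exactly, and each base
# occurs with probability 1/4 independently of its neighbours.


def match_prob(n_bases: int) -> float:
    """Probability that a random site matches n_bases specified bases."""
    return 0.25 ** n_bases


stop_len = 3    # a stop codon
donor_len = 5   # approximate conserved length of the donor site

ratio = match_prob(stop_len) / match_prob(donor_len)
print(f"A stop-like triplet is {ratio:.0f} times more frequent than a donor-like site")
```

Each additional specified base divides the frequency by four, so two extra bases give the factor of 4^2 = 16.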

In any case, this is only dealing with the probability of finding an ORF. Senapathy's fantastic statistics claim you are likely to find isocoding sequence for known proteins. But, since the spliced random sequence has the same information content as the unspliced random sequence, the splicing process has gained nothing. We can look at known proteins and calculate their information content, which can thereby be converted to the probability of finding them in random sequence. This has been done quite nicely by Hubert Yockey, and the probability is vanishingly small.

        AUTHOR: Yockey, Hubert P.
         TITLE: Information theory and molecular biology.
     PUB. INFO: Cambridge ; New York : Cambridge University Press, 1992.
   DESCRIPTION: xix, 408 p. : ill. ; 24 cm.
      SUBJECTS: *S1 Molecular biology.
                *S2 Information theory in biology.

(Side note: Yockey's book is doubtful of all origin of life scenarios on similar grounds.)


From Don Cates: What does Dr. S say about the fact that differences in redundant bases in codons mimic quite well the morphological relationships across many species? E.g., take the code for some almost universally used enzyme. There are many different base sequences that can code for the enzyme. It happens that the closer two species (or sub-species or even individuals) are evolutionarily, the more the sequences are alike.

JM: Dr. Senapathy spends a lot of time talking about codon degeneracy, mostly in terms of that redundancy making the probability of finding genes for particular proteins more likely. However, as to using these redundancies when looking at similar genomes he writes (at page 434):

"Evolutionary geneticists deal with an inherent problem when they analyze protein similarities looking for assumed evolutionary relationships. They start with a prior, strongly-rooted notion of evolution. Therefore, according to them, those proteins with functional similarities have evolved from one another. Consequently, they expect the proteins to have structural similarities and sequence similarity. So if they find sequence similarity between two functionally-similar proteins or genes, they believe that it is a direct proof for Darwin's theory of evolution."

"Because evolutionists expect two proteins which are functionally similar to be evolutionarily related, they look for sequence similarity even before one knows whether these proteins have sequence similarity. When a sequence similarity is found -- which is expected simply because of the functional similarity even without evolutionary connection -- they confidently provide it as evidence for evolution having occurred. On the other hand, if there is little or less significant sequence similarity, they try to bend the methods of aligning or searching for similarity of sequences in order to "improve" the similarity."

And at page 438: "In analyzing the coding sequence of a given gene found in many organisms, there exists a phenomenon concerning the variations of codons. If we take one gene and analyze its coding sequence in many different organisms, we naturally find sequence variations. [snip] Usually there are three or four codons, with the same first two bases but different third bases, that code for the same amino acid. As a result, if we analyze the frequency of the nucleotide differences at the three possible codon positions in the sequence of a gene from many different organisms, they vary most at the third codon position, less at the second and the first. ...this phenomenon can arise when organisms were independently born -- by mutational changes of the same gene in each organism without altering the basic function of the protein ... or if two gene sequences coding for functionally the same protein arise independently of each other. But evolutionists believe that this phenomenon is due to the evolution of organisms from one another."

Don: This is the stuff I was looking for. Please note that, as far as I can tell, Dr. S's theory would predict that the distribution of differences in the third base of these codons would be random across the different "independently born" organisms. However, this is not what is observed. Organisms that are considered to be close evolutionarily are more likely to have a higher proportion of same "third bases".

...if two gene sequences coding for functionally the same protein arise independently of each other. But evolutionists believe that this phenomenon is due to the evolution of organisms from one another."

Don: Again, the point is not that these similarities exist. They would also exist in "special creation" if the creator used the same basic blueprints for all its creations. What is important is the pattern of the differences across different organisms. This pattern is completely consistent with evolution but requires some sort of special pleading for both Dr. S and creationists.

Do you see why I think that this information poses a problem for Dr. S (and creationists)?

JM: I think the special pleading is on your side. Although Senapathy did not use the word "pattern" (because he would say there is no pattern), is it not true that the pattern you mentioned has been the primary foundation for the evolutionary tree? If so, then you cannot argue that the evidence supports evolution when it was that evidence that was used to create the tree!

Don: Arlin Stoltzfus countered this argument quite succinctly a long while back. The important "pattern" we see is that when we superimpose trees generated from different data (e.g. different genes, different morphological features), they are nearly always congruent. This is what Senapathy's "independent birth" theory cannot explain, except by the "special pleading" of genome reuse.


From Keith Robison: ... immutable you say, I don't think that you mean immutable, since genomes are clearly shown to be "plastic" in many ways.

JM: Here, the term "immutable" means that no new genes could come about.

Keith: And again, Senapathy bats .000 here. We know of examples of new genes arising (e.g. jingwei), and know of many more mechanisms which could form genes. Note that Senapathy's scenario must invoke a lot of some of these mechanisms in order to explain away some unpleasant facts.

For example, bacterial genomes are mostly intron-free and some eukaryotic microbial genomes are either intron-free or intron-poor (as are organellar genomes). Senapathy must invoke large amounts of intron-loss through exon-fusion. But there's no particular reason two exons of the same transcription unit must be fused -- fusions could just as easily occur between unrelated genes. Each such fusion is potentially a new gene, with new properties.

And so the trend continues. Senapathy's assertions are almost universally either contrary to known data or require implausible assumptions.

JM: I cannot find any discussion in The Book about how the introns were deleted to form prokaryotes. Did I miss that? If not, why do you assume there is only one method to remove introns? I guess my problem here is: are you assuming a particular mechanism to remove the introns, and why must that be the method Dr. Senapathy would have to use when he does not even discuss this?

Keith: I'll have to dig through it, but I believe it's there. In any case, it's mostly not a question of mechanism. If Senapathy is right, then somehow all those introns must have been lost, and that alone represents an enormous degree of evolution.

There are basically two ways of losing an intron. One, recombination between the genome and a reverse-transcribed mRNA, can potentially "cleanly" excise introns. The other possibility is a genomic deletion excising the intron.

Note that both mechanisms, within the known properties of genomes, are likely to lead to some degree of novel gene formation. While recombination with a reverse-transcribed mRNA would generally tend to cleanly erase introns, the presence of repetitive sequences in the mRNA (not unheard of) could cause recombination elsewhere in the genome, leading to new, chimaeric genes. Similarly, deletions are likely to cause fusions between adjacent genes. Either way, at some frequency new genes will be acquired by the genome and made available for evolution.


From Ralph M Bernstein: There aren't too many theories on how introns were lost, that's why. Can you and Senapathy propose another method? The best one that I know of is WF Doolittle's "genome streamlining" -- very simply: because of faster replication times and less need for the regulatory aspects of introns, they were "streamlined" out.


From Keith Robison: It is interesting to ask, even if Senapathy could get the ORF-probability calculations right, what is the probability of finding a particular gene in a Senapathian pond? Is it anywhere in the ballpark of Senapathy's calculations?

In his book Information Theory and Molecular Biology, Hubert Yockey calculates the information content of the protein cytochrome c. That is, based on an alignment of many cytochrome c's, we can estimate the degree of plasticity allowed -- how much change can the protein tolerate and still function as cytochrome c. The information content is directly convertible to the probability of finding a cytochrome c sequence at random from an ORF of similar length.

iso-1-cytochrome c has an information content of 373.6 bits. Therefore, the odds against finding a cytochrome c at random are about 1 in

    2^373.4 = 2.54 * 10^112
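The conversion from bits to odds can be reproduced with a short sketch. It uses the exponent 373.4 that appears in the calculation above; the text also quotes 373.6 bits, which would change the leading digits but not the order of magnitude. The function name is illustrative and is not taken from Yockey.

```python
# Sketch: turn an information content in bits into "1 in N" odds.
# Assumption: the probability of hitting a functional sequence at random
# is 2**(-bits), so N = 2**bits.
import math


def bits_to_log10_odds(bits: float) -> float:
    """Return log10 of 2**bits, the exponent of the '1 in N' odds."""
    return bits * math.log10(2)


log10_n = bits_to_log10_odds(373.4)
exponent = math.floor(log10_n)
mantissa = 10 ** (log10_n - exponent)
print(f"1 in {mantissa:.2f} * 10^{exponent}")
```

Working in logarithms keeps the arithmetic readable and makes it easy to swap in a different bit count.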

Real data is not kind to Dr. S.

JM: By "ORF," do you mean a long, open reading frame of a gene (w/o introns) or just the reading frame of an exon? If you mean of a gene, then your calculation has nothing to do with the probability of finding a part of that gene, an exon, which is what Dr. S is computing. If you mean an exon, then the probability calculation (chance of finding a given exon in a run of random DNA) is straightforward and I don't see how the esoteric use of information content is helpful -- how does it apply?

Keith: This calculation is estimating the probability of finding a cytochrome c once you have generated a translatable mRNA.

Jeff, I'm surprised at you. You are always calling for rigorous estimates. That is exactly what the information theory approach tries to be -- a rigorous estimate of the probability of finding a functional cytochrome c sequence in a mountain of random peptide sequence. Senapathy can splice and dice all he wants -- but unless you believe the splicing process can generate >10^100 possible messages we ain't going to see a cytochrome c (which would be a pretty good trick with 10^30 nucleotides!).
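A rough sketch of the comparison Keith is making follows. It assumes, generously, that every one of the 10^30 nucleotides starts one independent candidate message and that each candidate is a cytochrome c with probability 2^-373.4; both assumptions are simplifications introduced here for illustration.

```python
# Sketch: expected number of cytochrome c sequences in a 10^30 nt pond.
# Assumptions: one independent trial per nucleotide (generous), and a
# per-trial success probability of 2**(-373.4).
import math

log10_trials = 30.0                    # 10^30 nucleotides
log10_p = -373.4 * math.log10(2)       # log10 of the per-trial probability

log10_expected = log10_trials + log10_p
print(f"Expected hits ~ 10^{log10_expected:.1f}")
```

The expected count comes out dozens of orders of magnitude below one, so rearranging the same pool by any fixed procedure would have to multiply the number of distinct messages enormously before a single hit became likely.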


From FORSDYKE@QUCDN.QueensU.CA: A new introns-early theory is presented in the September and November issues of Molecular Biology and Evolution (volume 12, 949-958 for the September issue, entitled "A stem-loop 'kissing' model for the origin of introns and recombination").

About a year ago Nature rejected the following letter on the topic which might be of interest to readers of this newsgroup.


SIR - In his News & Views item entitled "The uncertain origin of introns"(1), Laurence Hurst presents some of the arguments for "introns early" (the Gilbert school(2)) and "introns late" (the Stoltzfus school(3)). Both schools seem not to have noticed that introns interrupt both coding and non-coding parts of genes(4). It has long been known that genes for rRNAs and tRNAs contain interruptions, but these may be special cases. Recently, however, "mRNAs" have been discovered which have no protein product. The corresponding genes look like most protein-encoding genes, and possess multiple introns(5). Thus, introns interrupt genetic information, not just protein-encoding information. It is not too surprising then, that it is difficult to associate exons with domains of protein structure or function(2,3). It does not follow that this disposes of the introns early viewpoint. There may be other exon theories of genes, as well as "the" exon theory of genes (i.e. "the" introns early theory).

One alternative exon (introns early) theory can be derived from the growing evidence for involvement of stem-loop structures in recombination(6-12), a process which should have arisen early in evolution. In the early "RNA world"(13) it is likely that exchange of segments between prototypic replicators would have been advantageous(14). Thus, if it were possible for recombination to have arisen early, it would have done so. Mutations which favour recombination would have affected either the enzymes (ribozymes) involved in recombination, or their substrate, RNA itself (hence stem-loops). Eventually the RNA world gave way to the DNA world, but stem-loop potential remained. Consistent with this, stem-loop potential is abundant and widely dispersed in modern genomes(12).

The basic postulate of the proposed alternative exon theory of genes is that stem-loop potential was widespread in genomes from an early stage. Information for new functions as they arose had to compete with the information for the stem-loop-forming function (i.e. complementary bases in the stems). In the case of protein-encoding functions the conflict was managed in three ways. First, synonymous codons were used so that a sequence could at the same time both optimize its folding propensity and encode a protein. If this failed, then conservative amino acid exchanges were accepted to widen the range of codon choice without impairing protein function. Finally, if these failed, the protein was permitted only to evolve in segments interrupted by regions of high stem-loop potential. Remarkably, traces of this primitive arrangement can be discerned in some modern genes(12). In the compact genome of C. elegans stem-loops are abundant and 43% of these occur in introns, which represent only 20% of the genome(15).

  1. Hurst,L.D. Nature 371, 381-382 (1994).
  2. Gilbert,W. & Glynias,M. Gene 135, 137-144 (1994).
  3. Stoltzfus et al. Science 265, 202-207 (1994).
  4. Hawkins, J.D. Nucleic Acids Res. 16, 9853-9905 (1988).
  5. Pfeifer, K. & Tilghman, S.M. Genes Devel. 8, 1867-1874 (1994).
  6. Sobell, H.M. Proc. natn. Acad. Sci. USA 69, 2483-2487 (1972).
  7. Wagner, R.E. & Radman, M. Proc. natn. Acad. Sci. USA 72, 3619-3622 (1975).
  8. Doyle, G.G. J. Theor. Biol. 70, 171-184 (1978).
  9. Kleckner, N. & Weiner, B.M. Cold Spring Harbor Symp. Quant. Biol. 58, 553-565 (1993).
  10. Kleckner, N., Padmore, R. & Bishop, D.K. Cold Spring Harbor Symp. Quant. Biol. 56, 729-743 (1991).
  11. Reed et al. J. Mol. Evol. 38, 352-362 (1994).
  12. Forsdyke, D.R. FASEB.J. 8, 1395A (1994).
  13. Joyce, G. F. & Orgel, L. E. The RNA World, 1-25 (Cold Spring Harbor Laboratory Press, New York, 1993).
  14. Bernstein, C. & Bernstein, H. Aging, Sex and DNA Repair, (Academic Press, San Diego, 1991).
  15. Wilson et al. Nature 368, 32-38 (1994).


From Andrew J. Roger:

If you look at the current phylogenetic distribution of spliceosomal introns, they are restricted to eukaryotic genomes. Currently the best estimate of global phylogeny is a tree where the root lies between the eubacteria on the one hand, and an archaebacterial/eukaryotic clade on the other hand. Both eubacteria and archaebacteria lack ANY spliceosomal introns. So if introns were present in the common ancestor of all these lineages they must have been COMPLETELY extinguished in both eubacteria and archaebacteria. If one then considers what is known about eukaryotic phylogeny, and one considers information about the frequency of introns in eukaryotic genes, it becomes clear that high intron density (more than 4/kilobase) is restricted to recently evolved clades such as animals, plants and SOME fungi. The deepest eukaryotic lineages -- protists like Giardia, Trichomonas, Trypanosomes, Entamoebids and Heteroloboseans -- either completely lack introns (as far as we can tell now) or have them at very low densities. Thus, multiple independent outgroup lineages to animals, plants and fungi appear to have few if any introns. It is likely that the common ancestor of all eukaryotes, if it had introns, had very very few (much less than 1 per kilobase of mRNA). The alternative "introns early" interpretation is that introns keep on getting cataclysmically lost multiple independent times in evolution, yet are mysteriously retained in the common ancestors of all of the eukaryotic lineages. This is just not very parsimonious. We would not wish to argue that fingernails were ancestral to all life simply because some vertebrates have them -- I suggest that we shouldn't argue that high intron density is ancestral to all life simply because some recently evolved eukaryotic clades have it.

The problem with any introns early theory which concerns itself with spliceosomal introns is that the phylogenetic evidence suggests that they are NOT ancient. This doesn't mean that stem-loops couldn't have played an important role in the RNA world -- it's just that they likely never turned into spliceosomal introns.


From Arlin Stoltzfus: (quoting D.R. Forsdyke): "Thus, introns interrupt genetic information, not just protein-encoding information. It is not too surprising then, that it is difficult to associate exons with domains of protein structure or function(2,3)."

Arlin: Uh, it was surprising to the many who believed that protein genes evolved originally by combinatorial assembly of exons, each exon contributing to some discrete structural or functional feature of the protein. This view, which was called the "introns-early" view until about a year ago (:->), was presented as almost-established-fact in several textbooks from the 1980's.

Quoting Forsdyke: "It does not follow that this [lack of correspondence] disposes of the introns early viewpoint."

Arlin: This is not what was argued in ref. 2. Instead, it was argued that the weight of phylogenetic evidence (among other things) strongly favored a recent origin of spliceosomal introns, given that they are found only in some eukaryotes. To propose that spliceosomal introns as a family are ancient is like proposing that meiosis or mitochondria or microtubules are ancient. No one would even consider such a view unless there were some compelling logical or empirical grounds to doubt the clear (phylogenetic) evidence that these are derived characters. In the case of introns, it was felt that there really was specific evidence -- namely a general exon-protein correspondence -- that could only be accommodated by an introns-early view. The point of ref. 2 was that the absence of any reliable evidence for such a correspondence, though it does not constitute proof, deprives the introns-early view of its only evidentiary argument.

Quoting Ralph Bernstein: "I think the point of this was to shore up the idea of introns early. The 'kissing-loop' idea is a really strong support of this concept."

Arlin: I fail to see how this "shores up" the introns-early position. It has been adequately demonstrated by Forsdyke and others that phylogenetically widely dispersed genomes have a statistical excess of inversed-repeat sequences over random expectations, even when local base composition is taken into account. This includes organisms with and without introns.

In organisms with introns, Forsdyke suggests on arguable grounds that inversed-repeats are more common in the introns than the exons. This is interpreted, again arguably, to mean that the introns were always there, and (again arguably) that they exist for the sake of containing the inversed-repeat sequences so as to stimulate recombination.

The conclusion that the excess is due to selection is arguable because the alternative that the inversed-repeats arise (whether in introns, in exons, or in intronless bacterial genomes) due to mutational biases is simply never addressed. Instead, it is assumed that all deviations from randomness must arise from selection subsequent to mutation.

The suggestion that the introns are ancient is gratuitous. It is clear from phylogenetic comparisons of intron-containing genes that most intron positions are recent acquisitions. The minor conclusion that inversed-repeats are more common in introns than in exons is also arguable because Forsdyke (see his paper in the most recent Mol. Biol. Evol.) must exclude long exons in order to support this conclusion statistically. His rationale is that the long exons are long because they were able to evolve the requisite inversed-repeat sequences without including introns. The problem with this sub-division of the dataset is that, if one holds to the strict no-insertion introns-early position, there is no such thing as an ancestral long exon. When homologues of a gene are sequenced from many different organisms, large numbers of different intron positions are found (e.g., 45 in GAPDH, 24 in TPI, probably 70 in tubulin, 40 in actin, ca. 20 in SOD, etc. -- with more introns being found every month in newly sequenced genes). A "long exon" in one organism is broken up many times by the intron positions found in homologues, whereas Forsdyke's view would imply that a long inversed-repeat-containing exon in one organism represents an ancestral state that does not need to be broken up by introns. If one does not hold to the strict introns-early position, and instead allows that all or most intron positions have arisen recently (as is obvious from the data), then the inversed repeats may simply have arisen recently in introns, more commonly than in exons. And finally, if a long exon arose by deletion of intervening introns, yet was still able to evolve inversed-repeats, then this suggests yet again that the inversed repeats do not have to be ancient. So, any way one looks at it, one must allow that inversed repeats can arise recently in introns and exons, so that there is no need to propose additionally that the specific pattern of inversed repeats is ancient.

More importantly, the likelihood that nearly all spliceosomal intron positions are recent (i.e., subsequent to eukaryotes) in origin in no way contradicts Forsdyke's major suggestion that the inversed-repeats exist to stimulate recombination. If there is indeed selection favoring the genesis of inversed-repeats for the sake of recombinational pairing, then such repeats will arise in introns, exons, intronless bacterial genes, in inter-genic spacers, and in repeat DNA (wouldn't this be the best way to do it -- have a self-replicating repetitive family bearing inverted repeats, that could spread throughout the genome?). Again, as Forsdyke argues in his recent paper, if constraints on sequences are lower in introns than exons, the inversed repeats will be more likely to arise and be maintained there, rather than in the exons.

Although the "kissing" theory doesn't shed light on the origin of introns, it does point to a general feature of genomes (with or without introns) that requires an explanation, probably a very interesting one.


JM: He has done computer simulations with random DNA, and he reviews this work on pages 273-288. It did not involve a full-length run of DNA (10^30 nts), but enough simulated DNA was used to search for genes and other things.

From Dave Oldridge: Simulation just won't cut it here. His simulation assumes too much of the theory is true to be a true test. Computer emulations can sometimes help us disprove a theory or support it, but I want to see physical (in this case biological) tests.

All that happened was that a program that Senapathy wrote (or had written) behaved according to his expectations. He may (I won't concede that without seeing the whole program) have shown that his thesis is possible; he has not shown yet that it is probable.

And recent work on self-replicating molecules points in a somewhat different direction. I can't remember the exact reference, but I'm sure someone will come up with it. In the past year I read in Scientific American of an experiment with some very simple self-replicating molecules that showed that, even at this level, mutation and selection can occur. It begins to seem quite likely that DNA itself is the product of an evolution.


From Keith Robison: (quoting JM) However, Senapathy provides much detail, based on his own research over many years, for the most important parts of the theory -- the formation of genes from random DNA.

Keith: Jeff, you have never answered my arguments from information theory about how absurd Senapathy's theory of gene formation is. To wit: the probability of assembling a gene is not aided by splicing, and in any case the amount of information needed to build a modern organism rules out finding a working genome in Senapathy's pond.

JM: Well, I cannot say that I accept your information theory argument as being relevant to this. I've got Yockey's book from the library (it's sitting right here by my feet, as you say), and if you tell me exactly where to look for any parts that are relevant to finding split genes, then maybe I can get past that barrier. Anyway, Yockey doesn't like the pond idea AT ALL -- so how does he figure things got started?

Keith: Jeff, it does not matter if the genes are split or not. In order for a genetic message to be readable, there must be a way of decoding the message without knowing the message in advance. Senapathy's stuff works only because he is looking for known messages -- he has provided no mechanism for decoding the random sequence into intelligible messages. We've been over all this before -- stop codons are not splice sites!

JM: OK, I can forget about the stop codons. But eukaryote genes are random (according to Dr. Senapathy's research), and the splice signals are not chosen by Senapathy -- they are the result of chemical processes that just happen to work correctly, hence we are alive to examine them. If there were a different set of splice signals, we would be contemplating those instead. You once said I had proved you did not exist. It seems to me you are now proving that life does not exist. It does not matter who says it is improbable to find genes in DNA -- they are there, and the eukaryote DNA they are found in is random. I must be missing something in your argument. Could you try again, or refer me directly to Yockey's discussion about this?

JM: (continuing) I cannot say that I accept your information theory argument as being relevant to this. I've got Yockey's book from the library (it's sitting right here by my feet, as you say), and if you tell me exactly where to look for any parts that are relevant to finding split genes, then maybe I can get past that barrier.

Keith: The real beauty of the IT approach in this situation is that splicing is irrelevant. The IT estimate is the probability of finding such a pattern at random after any arbitrary sequence of deterministic transformations. I.e., it does not matter if you invert the whole sequence, translate them according to a table, etc., so long as you follow deterministic rules (this is the whole reason Shannon invented the theory after all -- to predict the behavior of messages under compression, encryption, etc.).

So Senapathy's theory of the emergence of modern genes and genomes from random soup is completely preposterous from a statistical standpoint.
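The claim that splicing does not change the statistics of a random sequence can be checked with a toy simulation. The sketch below is only an illustration under stated assumptions: the donor and acceptor motifs are hypothetical stand-ins for real splice signals, the left-to-right excision rule is a simplification, and only reading frame 0 is examined. It compares the stop-codon frequency and the longest stop-free run before and after splicing.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}

def random_dna(n, rng):
    """Uniform random sequence of n nucleotides."""
    return "".join(rng.choices("ACGT", k=n))

def splice(seq, donor, acceptor):
    """Remove every donor...acceptor stretch, scanning left to right."""
    kept, pos = [], 0
    while True:
        d = seq.find(donor, pos)
        if d == -1:
            break
        a = seq.find(acceptor, d + len(donor))
        if a == -1:
            break
        kept.append(seq[pos:d])       # keep the "exon" before the donor
        pos = a + len(acceptor)       # resume after the acceptor
    kept.append(seq[pos:])
    return "".join(kept)

def stop_fraction(seq):
    """Fraction of frame-0 codons that are stop codons."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return sum(c in STOP_CODONS for c in codons) / len(codons)

def longest_open_run(seq):
    """Longest frame-0 stretch (in nt) containing no stop codon."""
    best = run = 0
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            run = 0
        else:
            run += 3
            best = max(best, run)
    return best

rng = random.Random(0)
raw = random_dna(600_000, rng)
# Hypothetical motifs, chosen only for illustration.
spliced = splice(raw, donor="GTAAGT", acceptor="TTTCAG")

for name, s in (("raw", raw), ("spliced", spliced)):
    print(name, len(s), round(stop_fraction(s), 4), longest_open_run(s))
```

Under these assumptions both sequences should show a stop-codon fraction close to 3/64 (about 0.047), apart from a small edge effect because the chosen donor motif itself contains TAA. This does not settle the information-theory argument; it only shows what "looks statistically like the input" means in practice.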

JM: As for the odds of finding an entire organism, I will "reuse" your argument that the odds against you or me existing are so great that we cannot exist. OK, so that doesn't strictly apply -- I just couldn't help myself. How about this: the state of going from non-life to life could have been multi-step. That is, you allow "selection" to operate in evolution (hence you can ADD the probabilities of each step), and there very well could have been some form of "selection" in the pond (although not at a living level) that caused the results of chemical processes in the pond to migrate from pure garbage to viability. That is, the results of certain chemical processes could have contributed to those processes remaining around. As I've said before, this part of the new theory may require your imagination, and you refuse to let yourself think along these lines. (This has nothing to do with Darwin. -- Keith: how do YOU explain the origins of life?)

Keith: Good question -- I don't really try to. Clearly there was a transition from clearly-not-life to clearly-life, probably going through a long phase of somewhere-in-between. However, that early clearly-life stage was much simpler than the common ancestor of all living things (CAoALT), and the CAoALT was a descendant.


From Keith Robison: (via email) What I was saying is not that life is improbable (we'll get to that), but that Information Theory says that Senapathy's scenario is improbable -- modern genes will not spring in toto from as small a pool of random sequence as he posits (I hope you did get a good laugh out of his claim that the probability of finding 1 gene = the probability of finding all genes).

JM: I know he discussed the probabilities of finding one gene and of finding any gene, but I don't recall that he equated "one to all." Where was that? Can you give me a page, date, or other reference?

Keith: Page 288 -- you can't miss it :-)

JM: Once again, I have no trouble finding an explanation. Regarding: "The probability for finding millions of genes is the same as the probability for finding one gene." Just read a little further. In the last paragraph on the following page, you will find: "if one typical gene could probabilistically occur in the USP, then almost any gene for any particular biochemical function ... would occur in the USP." So, since he can find one "typical" gene in the USP and since there are millions of other genes with similar characteristics to that typical gene (needed for the "multitudes of unique biochemical functions"), he can find any of those other genes ("almost any gene for any particular biochemical function"), all millions of them, in the same USP each with the same probability. He is NOT saying P(1) = 10^6 * P(1), as you alleged. You saw a patently ridiculous statement there -- because that was what you were looking for. I agree that his statement on page 288 is confusing and it could have been worded more clearly, but you refused to see his true meaning, and instead were complaining about style.

Keith: No, it's typical Senapathian sloppiness. Think about how you would really go about calculating the probability of finding every known gene given the probability of finding one gene. This is just simple statistics (a place where Dr. S. tends to slip up frequently).

Oh well, same old thing...

JM: OK, I have been thinking.

1. On page 288, he is not talking about "every known gene." He is clearly discussing any one gene of a group of millions of genes that are similar to a "typical" gene. That was the subject of my previous message.

2. Given any gene ("g") that is similar in length and exon/intron makeup to the given "typical" gene "t" (specifically, it has no exons that are longer than those found in the typical gene), then I would compute the probability of finding that gene as follows:

    P(finding g) = P(finding t) = almost 1

That's what he's doing on page 288.

3. The probability of finding every one of those known genes is:

    P(all known genes) = P(t)^n

where "n" is the number of known genes. But I don't see what sense that makes. For a certain seed cell, I only have to find the 20,000 or so particular genes needed. So:

    P(genome) = P(t)^20000

Even this is like asking me to compute the probability of "Keith" or "Jeff" being conceived. You taught me that doing that is senseless, and this is, too, for the same reason. So let's back up and look at the problem....

Please read again page 287 which leads up to his "millions of genes" section. He is not computing probabilities there, he is computing the amount of DNA needed to find (with a very high likelihood) a typical gene and also find one with reasonable intron lengths. Once he computes how much DNA is needed, then you will find ANY and ALL such typical (or shorter) genes in that length of DNA.

This is analogous to computing how long a string of random letters would be needed to find a certain three-letter word ("NOT") with a very high likelihood (you need the expected mean length times six -- about 10^5 characters), and then saying you can find ANY three-letter word ("AAA" through "ZZZ") in that random string. (Reference: pages 223-225 and Chapter 7 footnotes 9 & 10.) There is NOTHING wrong with that logic or that math. He computes the amount of DNA needed using the same math as for the three-letter word example. Finding any long (600 nt) exon seems ridiculous, but, given enough random DNA, it is not.
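The arithmetic behind the three-letter-word example can be sketched as follows. This is a minimal sketch, assuming a 26-letter alphabet with equally likely letters and treating each position in the string as an independent trial (an approximation, since overlapping positions are not strictly independent). The function names are illustrative only.

```python
ALPHABET_SIZE = 26
WORD_LENGTH = 3

def mean_length(alphabet_size, word_length):
    """Number of equally likely words; also the mean spacing between hits."""
    return alphabet_size ** word_length

def chance_of_given_word(n_positions, p_hit):
    """P(at least one hit for one given word) over independent positions."""
    return 1.0 - (1.0 - p_hit) ** n_positions

mean_len = mean_length(ALPHABET_SIZE, WORD_LENGTH)  # 26^3 = 17576
needed = 6 * mean_len                                # "mean length times six"
p = chance_of_given_word(needed, 1.0 / mean_len)

print(mean_len, needed, round(p, 4))  # 17576 105456 0.9975
```

So a string of roughly 10^5 letters gives about a 99.75% chance of containing one specified word such as "NOT", under the assumptions above.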

BTW, not all 600 nt runs would be a valid exon, so not all combinations of 600 nt will ever need to be found. That is what you mean, I assume, by "every known" gene -- only the valid, known ones.

Although you point out that there are a few exons longer than 600 nt, most are also much shorter than 600 nt. Why there are a few longer ones is a good question, but that one needs to be answered in a different discussion (there are plenty of exceptional cases allowed to evolution, too). For the time being, I think Senapathy is being generous by using 600 nt so often rather than a smaller value.

Now, do you want me to find an entire genome in the random DNA? OK, I still can, although with a lower degree of certainty. But we're dealing with likelihoods so close to one that even raising them to the 20,000 power (the number of genes in an average genome) does not hurt much, and you can use more DNA very easily because there is plenty more DNA available. The probability of finding all genes needed for one genome will be smaller, since P(t) is raised to the power 2x10^4, but look at the numbers on pages 287-288:

    P(t) = Probability of finding one gene in 6 x 10^20 nt = 99.9999%
    (Reference: footnote #42 on page 602)

    P(genome) = P(t)^20000 = 98%
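The 98% figure can be reproduced directly. The sketch below assumes, as the formula above does, that each gene is found independently with the same probability P(t) = 99.9999%. The larger gene counts are illustrative inputs only, included to show how the result changes with genome size.

```python
P_ONE_GENE = 0.999999  # 99.9999%, the figure quoted from pages 287-288

def p_all(p_one, n_genes):
    """Chance that n_genes independent searches all succeed."""
    return p_one ** n_genes

for n in (20_000, 50_000, 100_000):
    print(n, round(p_all(P_ONE_GENE, n), 4))
# 20000 0.9802
# 50000 0.9512
# 100000 0.9048
```

The result is only as good as the input: it depends entirely on P(t) being correct and on the independence assumption.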

Keith: (reprise) No, it's typical Senapathian sloppiness.

JM: So, what problem do you find here? Can you be specific and show me the sloppiness on pages 287-288? I agree with you that the wording is confusing, but let's look at his meaning and his math, which I find very clear and understandable. You obviously don't, but I don't understand why, and I'd like to know.

Keith, this is deja vu -- we went through this process on point mutations, and you ended up agreeing that the Senapathy/Mattox math was correct, saying it was the model that must be bogus. If you recall, it was YOU who originally posted some sloppy math while saying Senapathy's math was wrong. I'd just like to resolve this one, too.


From Keith Robison: (via email) I shouldn't have done this -- neither of us seems to be very good at convincing the other, and we can just go round forever.

I think Senapathy's statement depends on how you define "every gene" -- and I would say he is clearly arguing that every known gene can be found in a pond of the size he is stating. Since neither of us can jump in his head, I don't see a real resolution.

JM: (reprise) He computes the amount of DNA needed using the same math as for the three-letter word example. Finding any long (600 nt) exon seems ridiculous, but, given enough random DNA, it is not.

Keith: True -- but you need a lot more than Senapathy's pond has. If I remember my calculations correctly, his pond contains every possible 50-mer or so -- but certainly only a small fraction of the 600-mers.
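The scale of the k-mer counting argument can be sketched with a simple upper bound: a pool of N nucleotides has at most N start positions, so it cannot contain every k-mer unless 4^k <= N. The pond size is not stated in this part of the discussion, so the sketch below uses the 6 x 10^20 nt figure quoted earlier purely as a stand-in; substitute another value for POOL_NT to test a different pond. The function names are illustrative.

```python
import math

def distinct_kmers(k):
    """Number of distinct nucleotide sequences of length k."""
    return 4 ** k

def largest_k_fully_covered(pool_nt):
    """Largest k with 4**k <= pool_nt (necessary, not sufficient, for full coverage)."""
    k = 0
    while 4 ** (k + 1) <= pool_nt:
        k += 1
    return k

POOL_NT = 6 * 10**20  # assumption: stand-in pool size, not the pond's actual size

print(largest_k_fully_covered(POOL_NT))           # 34
print(round(math.log10(distinct_kmers(50)), 1))   # 30.1
print(round(math.log10(distinct_kmers(600)), 1))  # 361.2

# Upper bound on the fraction of all 600-mers the pool could hold, as a power of ten.
log10_fraction = math.log10(POOL_NT) - math.log10(distinct_kmers(600))
print(round(log10_fraction, 1))                   # -340.5
```

On this bound, holding every 50-mer needs a pool of at least about 10^30 nt, while any pool of plausible size holds only a vanishing fraction of all possible 600-mers. Whether all possible 600-mers are the right target is exactly what the two sides dispute.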

JM: (reprise) For the time being, I think Senapathy is being generous by using 600 nt so often rather than a smaller value.

Keith: Of course, the key point here is that the distribution of exon lengths looks nothing like what Senapathy claims (it's not a simple exponential distribution). So there's a lot more of a problem than the long exons.

There really isn't an average genome, but rather they are stratified into a few size categories. There are many, MANY, genomes with more like 50,000-100,000 genes in them.

Okay, perhaps if S could get the calculation right for the probability of one gene, then maybe he's not completely off base. But, as I have said before, his calculation (p.287) is so grossly flawed as to make it meaningless.

  1. He neglects stop codons in the estimate.
  2. More importantly, he grossly underestimates the information content of a protein (as I've pointed out before).
  3. The optimization approach he describes at the bottom is completely out of left field -- it has no basis in biology.

Anyway, we could both go on at this forever. I think your time would be much more productively spent if you possibly did some of the following:
