The Exon/Gene Search Engine

Introduction:

A fundamental aspect of the independent birth theory is the ability to find complete genes in a string of random DNA in the primordial pond. Because of the length of an average gene, the probability of finding just one unsplit (prokaryote) gene in any amount of DNA is vanishingly small, even if we consumed all of the mass in the universe when making the DNA. However, when a gene is split into numerous exons, the gene can be found in a far shorter and very manageable length of random DNA. In fact, almost any gene can be found in a small amount of random DNA -- if the gene is split into exons. [Reference: Independent Birth of Organisms, Chapter 7, pages 223 - 253]

Furthermore, because of codon and amino acid degeneracies, the probability of finding a split gene that codes for a particular protein function is even higher in a reasonable amount of DNA. [Reference: pages 254 - 266]

This search engine demonstrates how easily a complete gene composed of exons (or words) can be found in a random string of DNA (or alphabetic characters). You specify a set of exons and their length, and the search engine generates a random sequence, searches for the exons in order (and hence the complete gene), and reports the results.

Read more about it:

Why are split genes easier to find? More details from Dr. Senapathy.
A discussion of split genes by Dr. Senapathy (Q&A #3)
Introns and Exons (SBE discussions)
The primordial pond (SBE discussions)
Pasteur Institute listing of analysis tools

Data entry instructions:

There are two search options:

To search for a complete gene consisting of one or more exons in a sequence of random DNA, enter your desired exons using the characters 'a', 't', 'c', or 'g'. You do not need to enter a value for all exons. The longer your longest exon, the longer the search will take (up to about 10 seconds). It is easiest to experiment with short exons of two to four characters.

Example:
Although the exon specifications are limited to 10 characters, codon and amino acid degeneracies make a search for a 10-nt exon mathematically equivalent to finding a 185-nt exon with degeneracies. With one 10-nt and five 9-nt exons, the equivalent gene length is about 1000 nucleotides.
"Use full text searches" will allow you to search for any text ('a' to 'z') in a random string of alphabetic characters. Since this requires a much longer random string, the search is limited to "exons" of four characters or less.
You can include a "wild" character in your exons that will match any character in the random string -- just type a '?' character.
Example:
"Do not display long runs of unmatched DNA" will reduce the length and transfer time of the output. This is automatic if the amount of DNA used is greater than about 60K.
"Total all occurrences" will scan the entire random DNA looking for all non-overlapping instances of each specified exon.
"Tally all 4-mers" will look for all DNA 4-mers (including overlapping sequences). This option makes the most sense to use when your longest exon is four characters long (because the amount of DNA computed is based on the longest exon). However, you can experiment and search for all 4-mers in a longer or shorter random sequence by changing the length of your longest exon. You can even search for the DNA 4-mers in the random strings of alphabetic characters.
"Iterate to reduce gene length" will automatically repeat the search from 5 to 1000 times (using a different random sequence each time) looking for the shortest span of the specified exons. This demonstrates that very short genes can be found if just a few hundred times more DNA is used. The number of iterations is computed to limit the search time to a reasonable value (about a minute, maximum).
To search for stop codons in a fixed-length sequence of random DNA, select "Show stop codons". You may select a codon phase ('1', '2', or '3'). Use '0' to show the stop codons found in all phases.
This search mode examines the random DNA for "frames" (the DNA between the stop codons) and generates a histogram of the lengths of the frames. The histogram will show that the reading frame lengths are exponentially distributed. Also, note that the stop codons are so numerous that a gene must be split if an average-length gene is needed for a vital protein.

In both cases, you may specify a "seed" value from 1 to 8 for the random number generator. By default, the seed is '1'. Changing the seed will result in a different random sequence of DNA or text. For any particular seed value, the random sequence will be the same from one search to the next.
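The stop-codon mode and the seed behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names random_dna and frame_lengths are hypothetical, Python's generator is not the one used by this search engine (so the sequences will differ from the site's), frame lengths are counted in codons, and the stretch before the first stop codon is counted as a frame.

```python
import random
from collections import Counter

# The three standard stop codons, written in DNA letters.
STOP_CODONS = {"taa", "tag", "tga"}

def random_dna(length, seed=1):
    """Return a reproducible random DNA string (hypothetical helper).

    The same seed always yields the same sequence.
    """
    rng = random.Random(seed)
    return "".join(rng.choice("atcg") for _ in range(length))

def frame_lengths(dna, phase=1):
    """Lengths, in codons, of the runs between stop codons in one phase (1, 2, or 3)."""
    lengths = []
    run = 0
    for i in range(phase - 1, len(dna) - 2, 3):
        if dna[i:i + 3] in STOP_CODONS:
            lengths.append(run)  # close the frame at this stop codon
            run = 0
        else:
            run += 1
    return lengths

dna = random_dna(30000, seed=1)
histogram = Counter(frame_lengths(dna, phase=1))
# Print the first few histogram rows; the bar lengths fall off roughly exponentially.
for length in sorted(histogram)[:10]:
    print(f"{length:3d} codons: {'#' * histogram[length]}")
```

Since 3 of the 64 codons are stop codons, frames in random DNA average roughly 20 codons, which is why long unsplit reading frames are rare.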

How much random DNA (or text) is needed:

The exon/gene search engine computes the amount of random DNA (or text) needed to find a typical "gene" with a very high likelihood. You can likely find any gene with the same size or shorter exons in that amount of DNA.

Steps:

Compute the probability of finding the longest exon. This is simply p = [1/4]^n, where "n" is the length of the longest exon. If you are allowing a full text search, then this is p = [1/26]^n.
Compute the "expected mean length" (EML) of random DNA needed to find an exon of length n. This is 1/p. The EML means that for every such length of DNA searched you would expect to find, on average, one occurrence of the exon. If you only have one longest exon, the odds are about 63% that that exon, or any other exon, will be found within a single expected mean length. The chances of finding the shorter exons closely surrounding the longest exon are extremely high.

Multiply by 6. Unless you have more than one longest exon, this practically guarantees that the longest exon can be found. The probability of finding any exon in length 6 * EML is 99.75% (Although Independent Birth of Organisms, Chapter 7 footnotes #10 and #42 say 99.9999%).

Other multiples and probabilities of success:

    EML*1, P = 63.21%     EML*10, P = 99.9955%      
    EML*2, P = 86.47%     EML*11, P = 99.9983%
    EML*3, P = 95.02%     EML*12, P = 99.9994%
    EML*4, P = 98.17%     EML*13, P = 99.9998%
    EML*5, P = 99.33%     EML*14, P = 99.9999%
    EML*6, P = 99.75%     EML*15, P = 99.99997%
    EML*7, P = 99.91%     EML*16, P = 99.99999%
    EML*8, P = 99.97%     EML*17, P = 99.999996%
    EML*9, P = 99.988%    EML*18, P = 99.999998%

These numbers are based on the following equations:

    n = length of exon
    p = (1/4)^n
    EML = 1/p
    Len = EML * multiple
    P =  p * sigma[ (1-p)^j ]; summed for j=0 to j=Len

Reference: "Distribution and repetition of sequence elements in eukaryotic DNA: New insights by computer-aided statistical analyses," Senapathy, 1988, Molecular Genetics (Life Sciences Advances), 7:53-65.
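The equations above can be checked numerically. The sketch below evaluates the geometric sum in closed form, P = 1 - (1-p)^(Len+1), and compares it with the large-n limit 1 - e^(-multiple), which is what the tabulated percentages correspond to. The function names are hypothetical and chosen only for this illustration.

```python
import math

def success_probability(n, multiple, alphabet_size=4):
    """P = p * sum_{j=0}^{Len} (1-p)^j, evaluated in closed form."""
    p = (1 / alphabet_size) ** n
    length = multiple * alphabet_size ** n  # Len = EML * multiple
    return 1 - (1 - p) ** (length + 1)

def limiting_probability(multiple):
    """Large-n limit of the sum above: 1 - e^(-multiple)."""
    return 1 - math.exp(-multiple)

for multiple in (1, 2, 3, 6, 10):
    exact = success_probability(10, multiple)
    limit = limiting_probability(multiple)
    print(f"EML*{multiple}: exact {exact:.4%}, limit {limit:.4%}")
```

For very short exons (n of 1 or 2) the exact sum is somewhat higher than the limit, so the tabulated values are best read as the large-n figures.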

(Optional -- not used by this search engine) If there are additional longest exons of the same length in the gene, increase the amount of DNA by 50% for each additional exon. This ensures that the likelihood of finding a second long exon remains high. After "consuming" roughly 50% of the DNA allocated for finding the first long exon, you will still have enough DNA left over to find the second long exon, and so on. In practice, the chances of finding an exon are so high that it is usually found long before reaching the middle of the random sequence, and so making this adjustment is unnecessary.
Since this adjustment is not made in this search engine, specifying a complete set of equally long exons might result in a search failure. In a typical gene, there is only one "longest" exon -- the other exons are shorter and are easily found in the random DNA relatively close to the longest exon. This search engine does not increase the amount of DNA so that you can experiment and do searches for failing sequences.

Note that the computations do not involve the total number of exons or the length of the other exons (unless two or more exons are of the same longest length).

This method can also be used to compute how long a string of random characters would be needed to find a certain three-letter word (e.g., "not") with a very high likelihood. You would take the expected mean length of characters needed to find the longest word, and multiply by six -- for a three-letter word this is 105,456 characters. With a string of random characters that long, you should be able to find almost any three-letter word ("aaa" through "zzz") in the string (with a probability of 99.75% for each word). In the word example used in the book, the objective is to find the sentence "To be or not to be" in a random sequence of characters, and to find that sentence with the shortest possible total span from start to finish. A sentence with any one four-letter word would require a string of about 3 million random characters. Check "Use full text searches" to experiment with text searches.

In this exon search engine, the following values are used:

   Longest    |-- for atcg exons --|    |-- for full text --|
   exon (n)   4^n (EML)      6*(4^n)  26^n (EML)     6*(26^n)
   --------   ---------    ---------    --------    ---------  
      1               4           24          26          156
      2              16           96         676        4,056
      3              64          384      17,576      105,456   
      4             256        1,536     456,976    2,741,856
      5           1,024        6,144
      6           4,096       24,576
      7          16,384       98,304
      8          65,536      393,216
      9         262,144    1,572,864
     10       1,048,576    6,291,456

Computer resources (RAM and search time) are finite, hence this search engine limits the size and number of exons that you can specify to exons of maximum length 10 (or length 4 for full text searches).
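As a quick check of the table, the sequence lengths can be recomputed directly from 6 * (alphabet size)^n. The helper name sequence_length and the way the limits are stored are assumptions made for this sketch, not the search engine's actual code.

```python
# Maximum exon length per alphabet size, as stated in the text
# (10 for a/t/c/g exons, 4 for full text searches).
MAX_EXON_LENGTH = {4: 10, 26: 4}

def sequence_length(n, alphabet_size=4, multiple=6):
    """Amount of random DNA (or text) used for a longest exon of length n."""
    if not 1 <= n <= MAX_EXON_LENGTH[alphabet_size]:
        raise ValueError("exon length outside the supported range")
    return multiple * alphabet_size ** n

for n in range(1, 11):
    text = f"{sequence_length(n, 26):,}" if n <= MAX_EXON_LENGTH[26] else ""
    print(f"{n:2d} {4 ** n:>10,} {sequence_length(n):>10,} {text:>10}")
```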

The search algorithm:

The search engine looks forward through the random sequence for each exon in turn, and then scans in the reverse direction from the last exon. This method finds the shortest span, and it is detailed in Independent Birth of Organisms, figure 7.20, page 276:

                                   <----- Backward search ------.
                                  +     +        +   +          +
 .---- Forward search ----->                                    |             
 |        *          *  *                            *          *
 | B  C   A   C   E  B  C         A  C  B    E   C   D     B    E
 --|--|---|---|---|--|--|- - - - -|--|--|----|---|---|-----|----|--

                Identified gene: -|-----|--------|---|----------|-
                                  A     B        C   D          E

The method of identifying the gene of our interest in a random sequence. Let A, B, C, D, and E be the exons of the gene and let D be the longest of them. In a long random sequence, all the shorter exons can occur multiple times before the longest exon occurs. In order to find the shortest possible and the first pattern of ABCDE, we search for the first occurrence of A, then the next occurrence of B, and so on until the last exon E is found. Then, we search backwards from E for the first occurrence of D, then C, B, and lastly A. This approach ensures that the first shortest occurring gene in the random sequence is identified. The exons with "*" above them are the ones that we find in the forward search. The exons with "+" are found by backward search. In reality, many more As, Bs, and Cs will be found than shown in the figure before we reach the longest exon D; and the length of the random sequence before D may also be considerably longer than that shown in the figure. The gap "- - - -" indicates the long sequence in which many As, Bs, Cs, and Es can occur.
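The forward and backward passes can be sketched as follows. This is a minimal illustration, not the search engine's own code: the function names are hypothetical, exons are assumed not to overlap one another, and '?' is treated as the wild-card character described in the data entry instructions.

```python
def matches(seq, exon, i):
    """True if exon matches seq at index i; '?' in the exon matches any character."""
    return all(e == "?" or e == s for e, s in zip(exon, seq[i:i + len(exon)]))

def find_forward(seq, exon, start):
    """Index of the first match starting at or after start, or -1."""
    for i in range(start, len(seq) - len(exon) + 1):
        if matches(seq, exon, i):
            return i
    return -1

def find_backward(seq, exon, end):
    """Index of the last match that ends at or before end, or -1."""
    for i in range(end - len(exon), -1, -1):
        if matches(seq, exon, i):
            return i
    return -1

def find_gene(seq, exons):
    """Return the start index of each exon, or None if the gene is not found."""
    starts, pos = [], 0
    for exon in exons:  # forward pass: first A, then the next B, and so on
        i = find_forward(seq, exon, pos)
        if i < 0:
            return None
        starts.append(i)
        pos = i + len(exon)
    limit = starts[-1]  # keep the last exon where the forward pass found it
    for k in range(len(exons) - 2, -1, -1):  # backward pass: D, C, B, then A
        starts[k] = find_backward(seq, exons[k], limit)
        limit = starts[k]
    return starts

# The figure's sequence, with each letter standing for a one-character exon.
print(find_gene("bcacebcacbecdbe", list("abcde")))
```

The backward pass always succeeds once the forward pass has, because the occurrences found going forward are still available; it only replaces them with later ones that lie closer to the last exon.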

The effects of codon and amino acid degeneracies:

If codon degeneracy in genes and amino acid degeneracy in proteins are taken into account, the probability of finding a particular gene function goes up dramatically. Conceptually, this translates as follows:

You can search for any one of a multitude of different nucleotide sequences that code for the same function. These are called "variant" exons.
You can look for a shorter (but "invariant") exon.

In either case, the same amount of random DNA is needed to assure finding the particular gene function, but the second case is easier to simulate here.

Specifically, the effects of the degeneracies reduce the probability exponent by a factor of 0.0542 (or about 1/20). So, in this exon/gene search engine, instead of looking for many 100-mer exons that code for a given function (a whopping 4^95, or about 10^57, of the 4^100, or about 10^60, possible exons will work), you can instead look for a particular 5-mer invariant exon (this is a 95% degeneracy). The probabilities of finding the function and the amount of random DNA needed will be the same. For the extreme case of searching for a 600-nt variant exon (allowing any one of many sequences), you could instead search for a particular 32-mer invariant exon.

Reference: Independent Birth of Organisms, pages 287-288 and Chapter 7 footnote #51.
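The conversions between variant and invariant exon lengths quoted on this page follow from the 0.0542 factor. The sketch below simply applies that factor in both directions; the function names are hypothetical, and truncating to a whole number of nucleotides is an assumption made to match the 5-mer and 32-mer figures.

```python
DEGENERACY_FACTOR = 0.0542  # exponent reduction quoted in the text (about 1/20)

def invariant_length(variant_length):
    """Length of the invariant exon equivalent to a degenerate (variant) exon."""
    return variant_length * DEGENERACY_FACTOR

def variant_length(invariant_len):
    """Length of the degenerate exon equivalent to a fully specified exon."""
    return invariant_len / DEGENERACY_FACTOR

print(int(invariant_length(100)))       # 100-mer variant exon
print(int(invariant_length(600)))       # 600-nt variant exon
print(round(variant_length(10)))        # 10-nt invariant exon
print(round(variant_length(10 + 5 * 9)))  # one 10-nt and five 9-nt exons
```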

In this exon search engine, you can create degeneracies by using the wild-card character (a question mark) in place of a nucleotide letter. For example, you could specify "a?????????" (one fixed letter followed by nine wild cards, filling the 10-character field) to simulate a 90% degeneracy.

Searching for a gene consisting of a set of specified exons (without wild cards) is equivalent to searching for a much longer gene (20 times as long) having a set of degenerate exons.

Interesting things to look for:

As you experiment with the search engine, look for these characteristics:

Finding exon "aaaa" is no more difficult than finding "atcc" or any other "chaotic" exon. There is nothing special about an exon that looks orderly.
Making all the exons the same value (and the same size) has the same odds of success as if the values are different. That is, finding "aaaa" and then another "aaaa" is no more difficult than searching for "aaaa" and then, say, "ccgt". The odds that "ccgt" will occur within any particular distance from "aaaa" are the same as for another "aaaa" occurring.
All of the individual words of "to be or not to be" can easily be found in a random string of about 100,000 characters. However, to find "tobeornottobe" (i.e., a complete "gene" without "introns") would require about 10^20 characters.
Turn on "Total all occurrences" and you will see that the longest exons are all found, on average, about six times. (Try this using all 3-nt exons.) This is because the length of the random DNA used is six times the expected mean length required to find the longest exon.
Notice the high number of total occurrences of shorter exons. Once you have enough DNA to find the longest exon, finding the other exons is easy.
After doing a search, change all but one or two exons and search again. The new gene will probably be found in a completely different location in the same random DNA. Thus the exons that were not changed can be found in multiple locations in the DNA, demonstrating that partial similarity can occur in totally independent genes (proteins) from the common pool of primordial DNA.
See how close the short exons are found to the longest exon (measured by the "span"). If all exons are the same size, they are spread out. Which leads to...
The only sure way to fail is to have three or more longest exons. If you have more than one longest exon (i.e., two or more of the same longest length), it is less likely you will find them all in the random DNA because the amount of DNA is computed assuming there is just one longest exon.
Five long exons will, on average, consume about 5/6 (83%) of the random string.
Look at the search result and then change your exons to try to make the search fail. It will be very difficult unless you keep the exons short so you can see all of the random string and make all of your exons the same length. Remember, this is cheating -- but go ahead anyway.
Specify all exons as the same length, and try to make the search fail. When it does, make just one of the exons one character longer, and they will now all be found.
Specify one exon of length four and check "Tally all 4-mers." Then, look to see if all 4-mers were found. There is a 99.75% chance that any particular exon will be found in a random DNA string of length EML*6, so, treating the 256 possible 4-mers as independent, there is roughly a 47% chance (1 - 0.9975^256) that one or more of them will be missing.
Check the "Iterate to reduce gene length" option to search for short genes. Note that the shortest gene found is much shorter than the average length of the same gene found in the other random sequences. Thus, having a longer sequence of random DNA will allow all genes to be found with relatively short lengths (spans).
In the "Show stop codons" search engine, notice that the histogram of frame lengths traces an exponential function.
To account for splice sites and stop codons, simply add 2 or 3 characters to each exon. Make the exons shorter if necessary so your specifications will still fit into the 10-character fields. Although this is not a perfect simulation of the actual splice sequences, the methods of computing the length of random DNA and the probabilities are the same.
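The "Tally all 4-mers" experiment from the list above can also be tried offline. The following is a rough sketch under stated assumptions: tally_kmers is a hypothetical helper, overlapping occurrences are counted (as the option's description says), and Python's random generator stands in for the search engine's, so the exact counts will not match the site's output.

```python
import random
from collections import Counter
from itertools import product

def tally_kmers(seq, k=4):
    """Count every k-mer in seq, including overlapping occurrences."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

rng = random.Random(1)  # seed 1, mirroring the default seed in the text
dna = "".join(rng.choice("atcg") for _ in range(6 * 4 ** 4))  # EML*6 for n = 4

counts = tally_kmers(dna)
all_4mers = ["".join(letters) for letters in product("atcg", repeat=4)]
missing = [kmer for kmer in all_4mers if counts[kmer] == 0]

mean_count = sum(counts.values()) / len(all_4mers)
print(f"mean occurrences per 4-mer: {mean_count:.2f}")
print(f"4-mers not found: {len(missing)}")
```

The mean count comes out very close to six because the sequence is six expected mean lengths long; changing the seed shows how often one or more 4-mers happen to be absent.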