Furthermore, because of codon and amino acid degeneracies, the probability of finding a split gene that codes for a particular protein function is even higher in a reasonable amount of DNA. [Reference: pages 254 - 266]
This search engine demonstrates how easily a complete gene composed of exons (or words) can be found in a random string of DNA (or alphabetic characters). You specify a set of exons and their length, and the search engine generates a random sequence, searches for the exons in order (and hence the complete gene), and reports the results.
This search mode examines the random DNA for "frames" (the DNA between the stop codons) and generates a histogram of the lengths of the frames. The histogram will show that the reading frame lengths are exponentially distributed. Also, note that the stop codons are so numerous that a gene must be split if an average-length gene is needed for a vital protein.
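The frame-length search described above can be sketched in a few lines of Python. This is only an illustration, not the search engine's own code: the function names (`random_dna`, `frame_lengths`, `histogram`), the 300,000-nucleotide sequence length, and the bin width are all assumptions, and a "frame" is taken here to mean the run of codons between two consecutive in-frame stop codons.

```python
import random
from collections import Counter

STOP_CODONS = {"taa", "tag", "tga"}

def random_dna(length, seed=1):
    """Random DNA string; the same seed always yields the same sequence."""
    rng = random.Random(seed)
    return "".join(rng.choice("atcg") for _ in range(length))

def frame_lengths(dna, offset=0):
    """Codons between consecutive in-frame stop codons, for one reading frame."""
    lengths, run, seen_stop = [], 0, False
    for i in range(offset, len(dna) - 2, 3):
        if dna[i:i + 3] in STOP_CODONS:
            if seen_stop:  # ignore the partial stretch before the first stop
                lengths.append(run)
            seen_stop, run = True, 0
        else:
            run += 1
    return lengths

def histogram(lengths, bin_width=10):
    """Count frames per length bin, keyed by the bin's lower edge."""
    return Counter((n // bin_width) * bin_width for n in lengths)

lengths = frame_lengths(random_dna(300_000))
hist = histogram(lengths)
mean_length = sum(lengths) / len(lengths)
for lower in sorted(hist)[:8]:
    print(f"{lower:3d}-{lower + 9:3d} codons: {hist[lower]}")
print(f"mean frame length: {mean_length:.1f} codons")
```

Because 3 of the 64 codons are stop codons, the mean frame length in this sketch should come out near 61/3, or roughly 20 codons, and each successive bin should hold fewer frames than the one before it.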
In both cases, you may specify a "seed" value from 1 to 8 for the random number generator. By default, the seed is '1'. Changing the seed will result in a different random sequence of DNA or text. For any particular seed value, the random sequence will be the same from one search to the next.
Steps:
Other multiples and probabilities of success:
EML*1, P = 63.21%        EML*10, P = 99.9955%
EML*2, P = 86.47%        EML*11, P = 99.9983%
EML*3, P = 95.02%        EML*12, P = 99.9994%
EML*4, P = 98.17%        EML*13, P = 99.9998%
EML*5, P = 99.33%        EML*14, P = 99.9999%
EML*6, P = 99.75%        EML*15, P = 99.99997%
EML*7, P = 99.91%        EML*16, P = 99.99999%
EML*8, P = 99.97%        EML*17, P = 99.999996%
EML*9, P = 99.988%       EML*18, P = 99.999998%
These numbers are based on the following equations:
n = length of exon
p = (1/4)^n
EML = 1/p
Len = EML * multiple
P = p * sigma[ (1-p)^j ]; summed for j=0 to j=Len
Reference: "Distribution and repetition of sequence elements in eukaryotic DNA: New insights by computer-aided statistical analyses," Senapathy, 1988, Molecular Genetics (Life Sciences Advances), 7:53-65.
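The sum in the last equation is a geometric series, so it has the closed form P = 1 - (1-p)^(Len+1), which for small p approaches 1 - e^(-multiple). The following Python sketch evaluates both forms; the function name `success_probability` and its `alphabet` parameter are illustrative assumptions, not part of the search engine.

```python
import math

def success_probability(n, multiple, alphabet=4):
    """P = p * sum_{j=0}^{Len} (1-p)^j, evaluated in closed form."""
    p = (1 / alphabet) ** n          # chance of a match at one position
    eml = alphabet ** n              # expected mean length, 1/p
    length = eml * multiple          # Len = EML * multiple
    # Geometric series: p * sum_{j=0}^{Len} (1-p)^j = 1 - (1-p)^(Len+1).
    # log1p/expm1 keep the result accurate when p is very small.
    return -math.expm1((length + 1) * math.log1p(-p))

for multiple in (1, 2, 6, 10):
    exact = success_probability(10, multiple)
    limit = 1 - math.exp(-multiple)  # large-EML approximation
    print(f"EML*{multiple}: exact {100 * exact:.4f}%, limit {100 * limit:.4f}%")
```

For a 10-nucleotide exon the exact value and the limit agree to several decimal places, and the limit reproduces the tabulated percentages (for example, 99.75% at EML*6). For very short exons the exact value is slightly higher than the limit.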
Since this adjustment is not made in this search engine, specifying a complete set of equally long exons might result in a search failure. In a typical gene, there is only one "longest" exon -- the other exons are shorter and are easily found in the random DNA relatively close to the longest exon. This search engine does not increase the amount of DNA so that you can experiment and do searches for failing sequences.
Note that the computations do not involve the total number of exons or the length of the other exons (unless two or more exons are of the same longest length).
This method can also be used to compute how long a string of random characters is needed to find a particular three-letter word (e.g., "not") with very high likelihood. Take the expected mean length of characters needed to find the longest word (26^n for an n-letter word) and multiply by six; for a three-letter word this is 6 * 17,576 = 105,456 characters. In a string of random characters that long, you should be able to find almost any three-letter word ("aaa" through "zzz"), with a probability of 99.75% for each word. In the word example used in the book, the objective is to find the sentence "To be or not to be" in a random sequence of characters, with the shortest possible total span from start to finish. A sentence whose longest word has four letters would require a string of about 2.7 million random characters (6 * 26^4 = 2,741,856). Check "Use full text searches" to experiment with text searches.
In this exon search engine, the following values are used:
Longest    |-- for atcg exons --|   |-- for full text ---|
exon (n)   4^n (EML)      6*(4^n)   26^n (EML)    6*(26^n)
--------   ---------   ----------   ----------   ---------
    1              4           24           26         156
    2             16           96          676       4,056
    3             64          384       17,576     105,456
    4            256        1,536      456,976   2,741,856
    5          1,024        6,144
    6          4,096       24,576
    7         16,384       98,304
    8         65,536      393,216
    9        262,144    1,572,864
   10      1,048,576    6,291,456
Computer resources (RAM and search time) are finite, hence this search engine limits the size and number of exons that you can specify to exons of maximum length 10 (or length 4 for full text searches).
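The values in the table above follow directly from the alphabet size and the length of the longest exon. A minimal Python sketch, assuming a hypothetical helper named `search_lengths` and the multiple of six used by this search engine:

```python
def search_lengths(n, alphabet_size, multiple=6):
    """Return (EML, amount of random sequence searched) for a longest exon of length n."""
    eml = alphabet_size ** n         # expected mean length, e.g. 4^n or 26^n
    return eml, multiple * eml

MAX_DNA_EXON = 10   # limit for atcg exons stated in the text
MAX_TEXT_WORD = 4   # limit for full text searches stated in the text

for n in range(1, MAX_DNA_EXON + 1):
    dna_eml, dna_len = search_lengths(n, 4)
    row = f"{n:2d}  {dna_eml:>9,}  {dna_len:>9,}"
    if n <= MAX_TEXT_WORD:
        text_eml, text_len = search_lengths(n, 26)
        row += f"  {text_eml:>9,}  {text_len:>9,}"
    print(row)
```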
                                    <----- Backward search ------.
                                   +     +        +   +          +
.---- Forward search ----->                                      |
|          *          *  *                            *          *
|   B  C   A   C   E  B  C         A  C  B    E   C   D     B    E
  --|--|---|---|---|--|--|- - - - -|--|--|----|---|---|-----|----|--
Identified gene:                  -|-----|--------|---|----------|-
                                   A     B        C   D          E
The method of identifying the gene of interest in a random sequence. Let A, B, C, D, and E be the exons of the gene, and let D be the longest of them. In a long random sequence, all the shorter exons can occur multiple times before the longest exon occurs. To find the first and shortest occurrence of the pattern ABCDE, we search for the first occurrence of A, then the next occurrence of B, and so on until the last exon, E, is found. Then we search backwards from E for the nearest preceding occurrence of D, then C, then B, and lastly A. This approach ensures that the first, shortest occurrence of the gene in the random sequence is identified. The exons with "*" above them are the ones found in the forward search. The exons with "+" above them are found in the backward search. In reality, many more As, Bs, and Cs will be found before the longest exon D than are shown in the figure, and the length of random sequence before D may be considerably longer than shown. The gap "- - - -" indicates the long sequence in which many As, Bs, Cs, and Es can occur.
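The forward and backward passes described in the caption can be sketched as follows. This is a minimal Python illustration, not the search engine's actual code; the function name `find_gene` is an assumption, and each exon in the example is reduced to a single letter so that the sequence matches the figure.

```python
def find_gene(sequence, exons):
    """Return the start index of each exon in the first, shortest gene, or None."""
    # Forward pass: first occurrence of A, then the next B after it, and so on.
    forward, pos = [], 0
    for exon in exons:
        start = sequence.find(exon, pos)
        if start == -1:
            return None              # the gene does not occur in this sequence
        forward.append(start)
        pos = start + len(exon)      # the next exon must begin after this one ends

    # Backward pass: from the last exon, take the nearest preceding occurrence
    # of each earlier exon. It always exists, because the forward pass found one.
    starts = [forward[-1]]
    limit = forward[-1]              # the preceding exon must end at or before here
    for exon in reversed(exons[:-1]):
        start = sequence.rfind(exon, 0, limit)
        starts.append(start)
        limit = start
    starts.reverse()
    return starts

figure_sequence = "BCACEBCACBECDBE"   # the exon order drawn in the figure
print(find_gene(figure_sequence, ["A", "B", "C", "D", "E"]))
```

For the figure's sequence the sketch selects the exons at zero-based indices 7, 9, 11, 12, and 14, which are the ones marked "+" in the figure. The same function works unchanged on lowercase `atcg` strings with multi-letter exons.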
In either case, the same amount of random DNA is needed to assure finding the particular gene function, but the second case is easier to simulate here.
Specifically, the effect of the degeneracies is to reduce the probability exponent by a factor of 0.0542 (or about 1/20). So, in this exon/gene search engine, instead of looking for any of the many 100-mer exons that code for a given function (a whopping 4^95, or about 10^57, of the 4^100, or about 10^60, possible exons will work), you can look for one particular invariant 5-mer exon (this is a 95% degeneracy). The probabilities of finding the function and the amount of random DNA needed will be the same. For the extreme case of searching for a 600-nt variant exon (allowing any one of many sequences), you could instead search for a particular 32-mer invariant exon.
Reference: Independent Birth of Organisms, pages 287-288 and Chapter 7 footnote #51.
In this exon search engine, you can create degeneracies by using the wild-card character (a question mark) in place of a nucleic acid letter. For example, you could specify "a??????????" to simulate a 90% degeneracy.
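One way to picture wild-card matching is to translate each question mark into a pattern that accepts any nucleotide. The Python sketch below does this with the standard `re` module; the helper names (`exon_to_regex`, `find_exon`, `degeneracy`) are hypothetical, and nothing here is claimed to be how the search engine itself implements wild cards.

```python
import re

def exon_to_regex(exon):
    """Compile an exon pattern in which '?' matches any single nucleotide."""
    parts = ("[atcg]" if ch == "?" else re.escape(ch) for ch in exon)
    return re.compile("".join(parts))

def find_exon(dna, exon, start=0):
    """Index of the first match at or after start, or -1 if there is none."""
    match = exon_to_regex(exon).search(dna, start)
    return match.start() if match else -1

def degeneracy(exon):
    """Fraction of positions in the exon that are wild cards."""
    return exon.count("?") / len(exon)

print(find_exon("ggattacag", "t?a"))   # 't', any nucleotide, then 'a'
print(degeneracy("a?????????"))        # one fixed letter, nine wild cards
```

In this sketch a pattern with one fixed letter followed by nine wild cards has a degeneracy of 0.9, and it matches at the first occurrence of that letter that has at least nine nucleotides after it.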
Searching for a gene consisting of a set of specified exons (without wild cards) is equivalent to searching for a much longer gene (20 times as long) having a set of degenerate exons.