Splice sites are the key signal sequences that determine the
boundaries of exons. A method for splice site detection should
ideally be based on a thorough understanding of the complex
eukaryotic splicing process. We trained a feedforward neural
network with one layer of hidden units, using backpropagation, to
recognize 5' and 3' splice sites, using a representative data set
(Drosophila melanogaster data set). We only consider genes whose
splice sites are constrained to the consensus dinucleotides, i.e.,
'GT' for the 5' and 'AG' for the 3' splice site. The output of the
network is
a score between 0 and 1 for a potential splice site.
The neural network method is described in detail in the References and the Abstract.
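The scoring scheme described above can be sketched roughly as follows. This is a minimal illustration, not the trained network: the window size (7 bases upstream and 8 downstream of the boundary), the number of hidden units, the one-hot input encoding, and the random weights from the hypothetical helper `random_network` are all assumptions made for the example. Only the overall shape (scan for 'GT' candidates, feed a window through one hidden layer, emit a score between 0 and 1) follows the text.

```python
import math
import random

BASES = "ACGT"

def candidate_sites(seq, dinucleotide):
    """Return 0-based start indices of every occurrence of the dinucleotide."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinucleotide]

def one_hot(window):
    """Encode a DNA window as a flat list with four inputs per base (assumed encoding)."""
    vec = []
    for base in window.upper():
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def random_network(n_inputs, n_hidden, seed=0):
    """Hypothetical stand-in for trained weights: small random values."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                for _ in range(n_hidden)]
    b_hidden = [0.0] * n_hidden
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    return w_hidden, b_hidden, w_out, 0.0

def score(window, network):
    """Forward pass through one hidden layer; the output lies strictly between 0 and 1."""
    w_hidden, b_hidden, w_out, b_out = network
    x = one_hot(window)
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

def score_donor_candidates(seq, network, left=7, right=8):
    """Score every 'GT' that has enough flanking sequence on both sides."""
    results = []
    for i in candidate_sites(seq, "GT"):
        if i - left >= 0 and i + right <= len(seq):
            results.append((i, score(seq[i - left:i + right], network)))
    return results

# Example call with untrained weights (the scores themselves are meaningless here).
example_network = random_network(n_inputs=4 * 15, n_hidden=4)
example_scores = score_donor_candidates("CCAAAACAGGTAAGTACCCGTTT", example_network)
```

With real trained weights in place of `random_network`, each returned pair would be a candidate position and its splice site score; the same scan with 'AG' would cover 3' candidates.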
A randomly chosen, independent test set of 43 human genes (/sequence/human-datasets.html), containing no sequences related to the training set, gave the following results:
#### Human 5' Splice Site prediction:
+------------+-----------+----------------+------------+
| threshold | % | % | correlation|
| | sites | false positive | coefficient|
| | recognized| sites | (CC) |
+------------+-----------+----------------+------------+
| | | | |
| 0.99 | 26.0% | 0.1% | 0.46 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.95 | 50.4% | 0.7% | 0.65 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.90 | 64.1% | 1.1% | 0.73 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.85 | 72.7% | 1.4% | 0.78 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.80 | 74.4% | 1.9% | 0.78 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.75 | 77.8% | 1.9% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.70 | 81.6% | 2.7% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.65 | 85.0% | 3.2% | 0.83 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.60 | 88.0% | 3.5% | 0.84 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.55 | 89.3% | 3.7% | 0.84 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.50 | 91.5% | 4.2% | 0.85 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.45 | 93.2% | 4.7% | 0.85 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.40 | 93.2% | 5.2% | 0.84 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.35 | 93.6% | 5.3% | 0.84 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.30 | 94.9% | 5.8% | 0.84 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.25 | 95.3% | 6.2% | 0.84 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.20 | 96.2% | 6.7% | 0.83 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.15 | 96.6% | 8.2% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.10 | 97.9% | 9.1% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.05 | 98.3% | 11.1% | 0.78 |
| | | | |
+------------+-----------+----------------+------------+
These percentages are defined by:

$$\text{sites recognized} = \frac{\text{correctly predicted sites}}{\text{all observed sites}} = \frac{TP}{TP + FN}$$

$$\text{false positive sites} = \frac{\text{non-sites predicted as sites}}{\text{all observed non-sites}} = \frac{FP}{FP + TN}$$

$$\text{correlation coefficient (CC)} = \frac{(TP \times TN) - (FN \times FP)}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}}$$

where

TP = true positives = observed sites predicted as sites (sites recognized)
TN = true negatives = observed non-sites predicted as non-sites (non-sites recognized)
FP = false positives = observed non-sites predicted as sites
FN = false negatives = observed sites predicted as non-sites
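As a worked illustration of these definitions, the three quantities can be computed from the four counts as sketched below. The function name `splice_metrics` and the example counts are hypothetical and are not taken from the test sets. When the denominator of CC is zero (for example, when nothing is predicted as a site at a very strict threshold), the coefficient is undefined and the sketch returns `None`.

```python
import math

def splice_metrics(tp, tn, fp, fn):
    """Return (sites recognized, false positive sites, CC) as fractions."""
    sites_recognized = tp / (tp + fn)
    false_positive_sites = fp / (fp + tn)
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # CC is undefined when any of the four marginal sums is zero.
    cc = ((tp * tn) - (fn * fp)) / denominator if denominator else None
    return sites_recognized, false_positive_sites, cc

# Hypothetical counts, for illustration only.
recognized, false_pos, cc = splice_metrics(tp=80, tn=970, fp=30, fn=20)
print(f"sites recognized: {recognized:.1%}")
print(f"false positive sites: {false_pos:.1%}")
print(f"CC: {cc:.2f}")
```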
#### Human 3' Splice Site prediction:
+------------+-----------+----------------+------------+
| threshold | % | % | correlation|
| | sites | false positive | coefficient|
| | recognized| sites | (CC) |
+------------+-----------+----------------+------------+
| | | | |
| 0.99 | 7.3% | 0.0% | 0.25 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.95 | 33.3% | 0.4% | 0.52 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.90 | 47.9% | 0.5% | 0.64 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.85 | 57.7% | 0.6% | 0.70 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.80 | 61.2% | 0.9% | 0.72 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.75 | 65.4% | 1.1% | 0.74 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.70 | 69.7% | 1.3% | 0.77 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.65 | 73.5% | 1.5% | 0.79 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.60 | 76.5% | 1.8% | 0.80 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.55 | 79.1% | 2.0% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.50 | 80.8% | 2.4% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.45 | 82.5% | 2.9% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.40 | 83.8% | 3.1% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.35 | 86.8% | 3.7% | 0.82 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.30 | 88.5% | 4.0% | 0.82 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.25 | 88.5% | 4.5% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.20 | 90.2% | 4.8% | 0.82 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.15 | 91.0% | 6.0% | 0.80 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.10 | 92.3% | 7.9% | 0.77 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.05 | 94.9% | 10.4% | 0.74 |
| | | | |
+------------+-----------+----------------+------------+
Neural network based "consensi" sequences: Extensive analysis of the perceptron neural network weight matrices has revealed the following "refined" 5' and 3' splice site consensus and non-consensus sequences:
5' Splice Site:
position:       -7   -6   -5   -4   -3   -2   -1   /    +1   +2   +3   +4   +5   +6   +7   +8
consensus:      a    a    a    A    C|a  A    G    /    G    T    A    A    G    T    -    c
non-consensus:  g    g    g    G    G|T  G|T  A|T  /    -    -    C|t  g|t  -    -    t    -
3' Splice Site:
-21 -20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 -1
consensus: - T T T|c T T|C T|C T|c T|c T|c T|c T|c T|c T|C T|c T|C T|c A T|C A G
non-consensus: G
+1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 +20
consensus: G T c - - - g g - g g|a c g a a a|c a g - -
non-consensus: c|t t g|t
Capital letters indicate strong weights and lower case letters weaker weights.
"|" means "or".
"-" means no significant weight.
"non-consensus" indicates bases that are very unlikely to appear at this position.
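The notation in this legend can be read mechanically, as in the sketch below. It parses the 5' consensus and non-consensus rows given above and counts how many positions of a 15-base window agree with each. The function names are hypothetical, and the sketch deliberately ignores the strong/weak distinction carried by letter case, so the counts are only a crude proxy for the network weights.

```python
def parse_consensus(row):
    """Turn a row such as 'a a a A C|a A G' into a list of allowed-base sets.

    '-' (no significant weight) becomes None; '/' marks the exon/intron
    boundary and is skipped because it is not a sequence position.
    """
    positions = []
    for token in row.split():
        if token == "/":
            continue
        if token == "-":
            positions.append(None)
        else:
            positions.append({base.upper() for base in token.split("|")})
    return positions

def count_matches(window, positions):
    """Count constrained positions where the window base is in the allowed set."""
    if len(window) != len(positions):
        raise ValueError("window length must equal the number of positions")
    return sum(1 for base, allowed in zip(window.upper(), positions)
               if allowed is not None and base in allowed)

# Rows copied from the 5' splice site listing (positions -7 to +8).
FIVE_PRIME_CONSENSUS = parse_consensus("a a a A C|a A G / G T A A G T - c")
FIVE_PRIME_NON_CONSENSUS = parse_consensus("g g g G G|T G|T A|T - - C|t g|t - - t -")

window = "AAAACAGGTAAGTAC"
agree = count_matches(window, FIVE_PRIME_CONSENSUS)
disagree = count_matches(window, FIVE_PRIME_NON_CONSENSUS)
```

A window that agrees with many consensus positions and few non-consensus positions resembles the pattern the network weights favor, although the actual score depends on the weights themselves.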
A randomly chosen, independent test set of 41 genes (Drosophila melanogaster gene set), containing no sequences related to the training set, gave the following results:
#### _Drosophila melanogaster_ 5' Splice Site prediction:
+------------+-----------+----------------+------------+
| threshold | % | % | correlation|
| | sites | false positive | coefficient|
| | recognized| sites | (CC) |
+------------+-----------+----------------+------------+
| | | | |
| 0.99 | 0.0% | 0.0% | - |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.95 | 22.9% | 0.0% | 0.44 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.90 | 53.3% | 0.0% | 0.69 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.85 | 61.9% | 0.0% | 0.75 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.80 | 66.7% | 0.0% | 0.78 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.75 | 69.5% | 0.8% | 0.78 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.70 | 77.1% | 0.8% | 0.83 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.65 | 78.1% | 1.0% | 0.83 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.60 | 81.9% | 1.0% | 0.86 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.55 | 82.9% | 1.0% | 0.86 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.50 | 88.6% | 1.8% | 0.88 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.45 | 90.5% | 2.5% | 0.88 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.40 | 91.4% | 3.0% | 0.88 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.35 | 91.4% | 4.0% | 0.85 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.30 | 94.3% | 4.8% | 0.86 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.25 | 96.2% | 5.3% | 0.86 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.20 | 97.1% | 5.8% | 0.86 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.15 | 97.1% | 8.0% | 0.82 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.10 | 99.1% | 10.3% | 0.80 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.05 | 99.1% | 15.1% | 0.73 |
| | | | |
+------------+-----------+----------------+------------+
#### _Drosophila melanogaster_ 3' Splice Site prediction:
+------------+-----------+----------------+------------+
| threshold | % | % | correlation|
| | sites | false positive | coefficient|
| | recognized| sites | (CC) |
+------------+-----------+----------------+------------+
| | | | |
| 0.99 | 1.9% | 0.0% | 0.12 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.95 | 11.4% | 0.0% | 0.30 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.90 | 28.6% | 0.6% | 0.46 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.85 | 44.8% | 0.6% | 0.60 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.80 | 53.3% | 1.1% | 0.65 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.75 | 60.1% | 2.0% | 0.69 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.70 | 69.5% | 2.3% | 0.74 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.65 | 73.3% | 2.5% | 0.76 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.60 | 76.2% | 3.1% | 0.77 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.55 | 79.0% | 4.2% | 0.77 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.50 | 83.8% | 5.4% | 0.78 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.45 | 87.6% | 5.9% | 0.80 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.40 | 90.5% | 6.5% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.35 | 92.4% | 7.0% | 0.81 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.30 | 94.3% | 9.0% | 0.79 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.25 | 94.3% | 10.7% | 0.77 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.20 | 96.2% | 13.0% | 0.75 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.15 | 96.2% | 14.7% | 0.73 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.10 | 96.2% | 17.5% | 0.69 |
| | | | |
+------------+-----------+----------------+------------+
| | | | |
| 0.05 | 97.1% | 30.7% | 0.56 |
| | | | |
+------------+-----------+----------------+------------+
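One way to use a table like the one above is to pick the score threshold that recognizes the most sites while keeping false positives under a chosen limit. The sketch below does this for a subset of rows copied from the Drosophila melanogaster 3' table; the function name `best_threshold` and the 6% limit are assumptions chosen for illustration, not recommendations.

```python
# (threshold, % sites recognized, % false positive sites, CC)
# Subset of rows from the Drosophila melanogaster 3' splice site table.
DROSOPHILA_3PRIME_ROWS = [
    (0.90, 28.6, 0.6, 0.46),
    (0.70, 69.5, 2.3, 0.74),
    (0.50, 83.8, 5.4, 0.78),
    (0.40, 90.5, 6.5, 0.81),
    (0.35, 92.4, 7.0, 0.81),
    (0.30, 94.3, 9.0, 0.79),
    (0.10, 96.2, 17.5, 0.69),
]

def best_threshold(rows, max_false_positive_pct):
    """Return the row with the most sites recognized within the false positive limit.

    Returns None when no row satisfies the limit.
    """
    allowed = [row for row in rows if row[2] <= max_false_positive_pct]
    if not allowed:
        return None
    return max(allowed, key=lambda row: row[1])

choice = best_threshold(DROSOPHILA_3PRIME_ROWS, max_false_positive_pct=6.0)
if choice is not None:
    threshold, recognized, false_pos, cc = choice
    print(f"threshold {threshold:.2f}: {recognized}% recognized, "
          f"{false_pos}% false positives, CC {cc}")
```

Selecting on CC instead would only require changing the key to `row[3]`.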