Martin G. Reese, Nomi L. Harris and Frank H. Eeckman.
Lawrence Berkeley Laboratory
Genome Informatics Group
1 Cyclotron Road
Berkeley, CA, 94720
mgreese@lbl.gov
We analyze the structure of the individual elements within promoters and splice sites using a novel technique that combines neural networks with weight pruning. A neural network is trained to recognize promoter or splice site elements until it reaches a local minimum. The pruning procedure then deletes the weights that contribute least to the overall prediction. After pruning, the network is retrained until it settles in a new minimum. This cycle is repeated until a defined error level is reached. The distribution of the weights that remain in the pruned network indicates which positions in the promoter element or splice junction are most important.
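The train, prune, retrain cycle described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: a single logistic unit stands in for the network, weight magnitude is assumed as the pruning criterion (the abstract does not say how "lowest predictive value" is measured), and `make_toy_data`, `max_error` and `prune_per_round` are hypothetical names introduced only for this example.

```python
import math
import random

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a flat list with four inputs per position."""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]

def predict(weights, bias, x):
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))

def train(weights, bias, mask, data, epochs=50, lr=0.5):
    """Per-sample gradient descent; pruned weights (mask 0) are never updated."""
    for _ in range(epochs):
        for x, y in data:
            err = predict(weights, bias, x) - y
            bias -= lr * err
            for i, xi in enumerate(x):
                if mask[i] and xi:
                    weights[i] -= lr * err * xi
    return weights, bias

def error_rate(weights, bias, data):
    wrong = sum((predict(weights, bias, x) >= 0.5) != (y == 1) for x, y in data)
    return wrong / len(data)

def prune_and_retrain(data, n_inputs, max_error=0.05, prune_per_round=4):
    """Repeat prune + retrain until the error would exceed max_error."""
    mask = [1] * n_inputs
    weights, bias = train([0.0] * n_inputs, 0.0, mask, data)
    while sum(mask) > prune_per_round:
        # Assumed criterion: drop the active weights with the smallest magnitude.
        active = sorted((i for i in range(n_inputs) if mask[i]),
                        key=lambda i: abs(weights[i]))
        trial_w, trial_mask = weights[:], mask[:]
        for i in active[:prune_per_round]:
            trial_w[i], trial_mask[i] = 0.0, 0
        trial_w, trial_b = train(trial_w, bias, trial_mask, data)
        if error_rate(trial_w, trial_b, data) > max_error:
            break  # keep the last network that met the error level
        weights, bias, mask = trial_w, trial_b, trial_mask
    return weights, bias, mask

def make_toy_data(n=80, length=8, seed=0):
    """Hypothetical data: positives carry TATA at positions 2-5."""
    rng = random.Random(seed)
    data = []
    for k in range(n):
        seq = "".join(rng.choice(BASES) for _ in range(length))
        while seq[2:6] == "TATA":  # make sure negatives lack the motif
            seq = "".join(rng.choice(BASES) for _ in range(length))
        label = k % 2
        if label:
            seq = seq[:2] + "TATA" + seq[6:]
        data.append((one_hot(seq), label))
    return data

data = make_toy_data()
weights, bias, mask = prune_and_retrain(data, n_inputs=32)
# Sequence positions that still have at least one surviving weight.
surviving_positions = sorted({i // 4 for i, m in enumerate(mask) if m})
```

Inspecting `surviving_positions` after the loop mirrors the idea in the text: positions whose weights survive pruning are the ones the classifier relies on.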
To predict promoter sites, we use time-delay neural networks (TDNNs) to combine the predictions made for each of the individual promoter elements. TDNNs are well suited to recognizing promoter elements because they can combine multiple features, even features that appear at different relative positions in different sequences. A further advantage is the high selectivity of the TDNN, which is essential in a promoter prediction system to avoid generating too many false positives.
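To illustrate why shared weights across positions tolerate features at varying relative positions, here is a small sketch under strong simplifying assumptions. The same detector weights are applied to every window of the sequence (the time-delay idea), each detector's responses are max-pooled over positions (a simplification chosen for this example), and a single sigmoid unit combines them. The motifs `TATAAA` and `CAG`, the hand-set weights, and all function names are hypothetical and are not taken from the trained networks described here.

```python
import math

BASES = "ACGT"

def motif_weights(motif, match=1.0, mismatch=-1.0):
    """Hand-set detector: one weight per base at each offset in the window."""
    return [{b: (match if b == m else mismatch) for b in BASES} for m in motif]

def time_delay_layer(seq, weights):
    """Apply the same weights to every window of the sequence (shared weights).

    Assumes len(seq) >= len(weights).
    """
    width = len(weights)
    return [
        sum(weights[k][base] for k, base in enumerate(seq[i:i + width]))
        for i in range(len(seq) - width + 1)
    ]

def promoter_score(seq, detectors, output_weights, bias):
    """Combine the best response of each element detector into one score."""
    pooled = [max(time_delay_layer(seq, w)) for w in detectors]
    z = bias + sum(ow * p for ow, p in zip(output_weights, pooled))
    return 1.0 / (1.0 + math.exp(-z))

detectors = [motif_weights("TATAAA"), motif_weights("CAG")]
# Two sequences with the same elements at different spacings, and one without.
close = promoter_score("GGTATAAAGGCAGG", detectors, [1.0, 1.0], -6.0)
far = promoter_score("TATAAAGGGGGGCAG", detectors, [1.0, 1.0], -6.0)
none = promoter_score("GGGGGGGGGGGGGG", detectors, [1.0, 1.0], -6.0)
```

In this toy setup `close` and `far` receive the same score because each detector fires wherever its element occurs, while `none` scores near zero.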
Our TDNN predicts most of the annotated promoters in a set of human genes from GenBank (version 86.0). As an example, the TDNN finds the annotated promoter in a 13,865 base-pair test gene, HUMTFPB, with a false positive rate of about 0.04% (6 false positive predictions in 13,865 positions). On a test set containing 42 known human gene promoters and 84 random DNA sequences, we were able to recognize 50% of the human gene promoters without any false positive classifications (correlation coefficient of 0.61).
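For readers who want to relate the test-set counts to the correlation coefficient, the sketch below assumes the usual Matthews form computed from a 2x2 confusion table; the abstract does not state which definition was used. The exact number of detected promoters is also an assumption here, since "50%" of 42 could correspond to 20 or 21 hits.

```python
import math

def correlation_coefficient(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# 42 promoters, 84 random sequences, no false positives (fp = 0, tn = 84).
cc_20_hits = correlation_coefficient(tp=20, fp=0, tn=84, fn=22)
cc_21_hits = correlation_coefficient(tp=21, fp=0, tn=84, fn=21)
print(round(cc_20_hits, 2), round(cc_21_hits, 2))  # prints: 0.61 0.63
```

Under these assumptions, 20 detected promoters reproduces the stated value of 0.61, while exactly half (21) would give about 0.63.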
We have applied this network and the splice site prediction networks to our most recently produced sequences and will present data at the conference.