Paper Summary

Title : An Evolutionary Algorithm Approach for Feature Generation from Sequence Data and its Application to DNA Splice-Site Prediction
Authors: Uday Kamath, Jack Compton, Rezarta Islamaj Dogan, Kenneth A De Jong, and Amarda Shehu

Abstract: Associating functional information with biological sequences remains a challenge for machine learning methods. The performance of these methods often depends on deriving predictive features from the sequences sought to be classified. Feature generation is a difficult problem, as the connection between the sequence features and the sought property is not known a priori. It is often the task of domain experts or exhaustive feature enumeration techniques to generate a few features whose predictive power is then tested in the context of classification. This paper proposes an evolutionary algorithm approach to effectively explore a large feature space and automatically generate predictive features from sequence data. The effectiveness of the algorithm is demonstrated on an important component of the gene-finding problem, DNA splice-site prediction. This application is chosen due to the demonstrated complexity of the features needed to obtain high classification accuracy and precision. Our results test the effectiveness of the obtained features in the context of classification by Support Vector Machines and show significant improvement in accuracy and precision over state-of-the-art approaches.

A. SOFTWARE DETAILS

The evolutionary algorithm described in the paper is a feature generation algorithm that can be applied to any feature-based sequence classification problem. The implementation of this algorithm is made available under an open-source license. The description below consists of two sections. The first provides basic information on how to evolve kernels and provides a link to the code. The second provides details for developers who may want to extend it.

1. Basics

2. Developer Details

JavaDoc detailing all the GP terminals, non-terminals, utilities, and parsers is available here.
This is a more detailed description for developers who want to either tune our code for other problems or need more information.

1. sequence.params: This is the configuration file for running ECJ with the GP-based Feature Generator. Note that it defines a sample set of nodes. All biological features are mapped to a GP-based representation in an extensible and flexible manner. For instance, the ERC IntegerTerminal has 3 entries: one with [min,max] of [0,162] for position-based integers, one with [0,80] for upstream position integers, and one with [82,162] for downstream position integers, where [81,82] are the GT and AG terminals. Thus, the same IntegerTerminal in combination with the PositionBased function can generate both Regional and Positional features. Similarly, Correlational features are explicitly defined with 2 motifs, a start position, and a closeness or shift (1, 2, 3, etc.). We can also remove this Correlational feature and let evolution find it by combining two positional features with an AND operator. Thus, there are various ways to map biological features to GP trees. The strategy of choice may be problem- or data-dependent.
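As a rough illustration of how a correlational feature can emerge from two positional features joined by AND, the sketch below models features as predicates over a sequence. The class and method names (FeatureSketch, matchesAtPosition, matchesInRegion) and the 0-based indexing are assumptions made for this example; they are not taken from the released code or from ECJ.

```java
import java.util.function.Predicate;

// Hypothetical sketch: features as boolean predicates over a sequence string.
class FeatureSketch {
    // Positional feature: true when `motif` occurs in the sequence
    // starting exactly at the 0-based index `pos`.
    static Predicate<String> matchesAtPosition(String motif, int pos) {
        return seq -> seq.regionMatches(pos, motif, 0, motif.length());
    }

    // Regional feature: true when `motif` starts at any index in [lo, hi].
    static Predicate<String> matchesInRegion(String motif, int lo, int hi) {
        return seq -> {
            for (int p = Math.max(lo, 0); p <= hi && p + motif.length() <= seq.length(); p++) {
                if (seq.startsWith(motif, p)) {
                    return true;
                }
            }
            return false;
        };
    }

    public static void main(String[] args) {
        // Two positional features combined with AND behave like one
        // correlational feature (two motifs a fixed distance apart).
        Predicate<String> correlational =
                matchesAtPosition("AG", 2).and(matchesAtPosition("T", 5));
        System.out.println(correlational.test("CCAGGTAA")); // prints true
    }
}
```

Composing predicates this way mirrors the idea that the same building blocks can yield positional, regional, or correlational features depending on how they are combined.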

2.  SequenceClassificationProblem (GPProblem implementation for Fitness):

This class measures Fitness of an individual by running it across the positive and negative sequences, measuring the True Positive Ratio and Discriminating Factor (Information Gain between positive and negative sequences). As data size increases, evaluation becomes more costly. In those cases, network master-slave deployment may be useful. More information on configuring and converting sequence.params to master.params and slave.params can be found here.
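The two quantities named above can be sketched for a single boolean feature as follows. This is a minimal illustration that assumes the standard entropy-based definition of information gain; the formula used in SequenceClassificationProblem, and the way the two measures are combined into one fitness value, may differ. All names are hypothetical.

```java
// Hypothetical sketch of the two fitness ingredients for one boolean feature.
class FitnessSketch {
    // Binary entropy (in bits) of a two-class split given by raw counts.
    static double entropy(double a, double b) {
        double n = a + b;
        if (n == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (double c : new double[] {a, b}) {
            if (c > 0) {
                double p = c / n;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // Fraction of positive sequences on which the feature matches.
    static double truePositiveRatio(int posMatched, int posTotal) {
        return posTotal == 0 ? 0.0 : (double) posMatched / posTotal;
    }

    // Information gain of the feature with respect to the class label:
    // class entropy minus the weighted entropy after splitting on the feature.
    static double informationGain(int posMatched, int posTotal, int negMatched, int negTotal) {
        int total = posTotal + negTotal;
        if (total == 0) {
            return 0.0;
        }
        int matched = posMatched + negMatched;
        int unmatched = total - matched;
        double before = entropy(posTotal, negTotal);
        double after = ((double) matched / total) * entropy(posMatched, negMatched)
                + ((double) unmatched / total) * entropy(posTotal - posMatched, negTotal - negMatched);
        return before - after;
    }

    public static void main(String[] args) {
        System.out.println(truePositiveRatio(8, 10));
        System.out.println(informationGain(8, 10, 1, 10));
    }
}
```

Because both quantities depend only on four match counts, an evaluation can gather the counts on slave nodes and compute the fitness centrally.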

3. ThreadedSequenceFeatureInterpreter (Interpreter for Features)
This code reads the features stored in a file (the hall-of-fame output) and generates files in the LibSVM format. It also outputs the features separately, to the file SSCleanFeatures.txt, for those who want to analyze them further. The parallelization of feature matching is controlled by the Threads input argument. The sequences are processed in chunks: the first threads-1 threads each get an equal share of the sequences, and the last thread gets all the remaining sequences when the total does not divide evenly. For faster throughput, the number of threads should equal the number of cores/processors on the machine. This class additionally performs some simplifications: statements such as (AND true true) or (OR (NOT false)) are reduced. It can also be used to simplify the features and remove redundancies such as those present in the feature matchesAtPosition motif3 AGT @ 45 AND matchesAtPosition motif1 T at 47.
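One reading of the chunking scheme is that every thread except the last receives an equal share of the sequences and the last thread also takes whatever is left over. The sketch below implements that reading; the class name, method name, and half-open range convention are assumptions for illustration and are not taken from ThreadedSequenceFeatureInterpreter.

```java
// Hypothetical sketch of splitting sequence indices across worker threads.
class ChunkSketch {
    // Returns one [start, end) index range per thread. The first threads-1
    // ranges have equal size; the last range absorbs the remainder.
    static int[][] chunkRanges(int numSequences, int threads) {
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be at least 1");
        }
        int share = numSequences / threads;
        int[][] ranges = new int[threads][2];
        for (int t = 0; t < threads; t++) {
            ranges[t][0] = t * share;
            ranges[t][1] = (t == threads - 1) ? numSequences : (t + 1) * share;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 10 sequences on 4 threads: [0,2) [2,4) [4,6) [6,10)
        for (int[] r : chunkRanges(10, 4)) {
            System.out.println(r[0] + " to " + r[1]);
        }
    }
}
```

With this scheme the last thread can end up with noticeably more work when the remainder is large, which is one reason to match the thread count to the number of cores rather than exceed it.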

4. StatisticPluginForHallofFame and HallOfFame (Capture Hall of Fame Features)
This class serves as the plugin for HallOfFame. It provides a hook that is invoked when evaluations are done, at which point the individuals are ranked by fitness. The top individuals are written to the hall-of-fame file, which can then be used for interpretation, as described above.
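Keeping only the top individuals can be sketched with a bounded min-heap, as below. This is an illustrative stand-in, not the HallOfFame class from the released code; the names, the fixed capacity, and the assumption that a higher fitness value is better are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a bounded hall of fame (higher fitness assumed better).
class HallOfFameSketch {
    static final class Entry {
        final String feature;
        final double fitness;

        Entry(String feature, double fitness) {
            this.feature = feature;
            this.fitness = fitness;
        }
    }

    private static final Comparator<Entry> BY_FITNESS =
            Comparator.comparingDouble(e -> e.fitness);

    private final int capacity;
    // Min-heap: the weakest retained entry sits at the head and is evicted first.
    private final PriorityQueue<Entry> heap = new PriorityQueue<>(BY_FITNESS);

    HallOfFameSketch(int capacity) {
        this.capacity = capacity;
    }

    // Called once per evaluated individual.
    void offer(String feature, double fitness) {
        heap.add(new Entry(feature, fitness));
        if (heap.size() > capacity) {
            heap.poll();
        }
    }

    // Retained features, best first, ready to be written to a file.
    List<String> bestFirst() {
        List<Entry> sorted = new ArrayList<>(heap);
        sorted.sort(BY_FITNESS.reversed());
        List<String> out = new ArrayList<>();
        for (Entry e : sorted) {
            out.add(e.feature);
        }
        return out;
    }

    public static void main(String[] args) {
        HallOfFameSketch hof = new HallOfFameSketch(2);
        hof.offer("featureA", 0.2);
        hof.offer("featureB", 0.9);
        hof.offer("featureC", 0.5);
        System.out.println(hof.bestFirst()); // prints [featureB, featureC]
    }
}
```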

5. Tools
In the process of classifying sequences, we have built many tools, provided in the code linked above. Some of these tools are described here in detail, in case they are useful for similar efforts.
          sequenceId start1 end1
          sequenceId start2 end2
and the sequence file can have Sequence sequenceId. It is run using:
java ec.app.sequence.ExonParser exonDataFile sequenceFile
java ec.app.sequence.FastaParserGenome source maxPositive maxNegative outputFile
The output file with positive and negative samples then can be used for fitness evaluation.

It can be run using
java ec.app.sequence.RNASequenceParser dataFile donorSaveFile acceptorSaveFile positiveDonorFile positiveAcceptorFile sequenceId

   where dataFile contains the whole sequences, each with its sequenceId and actual sequence,
         donorSaveFile receives all possible donor sequences, both true positives and true negatives,
         acceptorSaveFile receives all possible acceptor sequences, both true positives and true negatives,
         positiveDonorFile and positiveAcceptorFile are the files obtained from ExonParser, used to separate true positives from true negatives,
         and sequenceId is the sequence we are interested in.
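The idea of enumerating all candidate sites and labeling them from exon annotations can be sketched as follows. The sketch assumes exon lines of the form "sequenceId start end" with 1-based inclusive coordinates, a donor GT immediately after an exon end, and an acceptor AG immediately before an exon start. The class and method names are hypothetical and do not describe the internals of RNASequenceParser or ExonParser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: candidate splice sites and their true positions.
class SpliceSiteSketch {
    // All 0-based indices where the dinucleotide occurs (case-insensitive).
    static List<Integer> candidateSites(String seq, String dinucleotide) {
        List<Integer> out = new ArrayList<>();
        String s = seq.toLowerCase();
        String d = dinucleotide.toLowerCase();
        for (int i = s.indexOf(d); i >= 0; i = s.indexOf(d, i + 1)) {
            out.add(i);
        }
        return out;
    }

    // Parses "sequenceId start end" into {start, end}, 1-based inclusive.
    static int[] parseExon(String line) {
        String[] f = line.trim().split("\\s+");
        return new int[] {Integer.parseInt(f[1]), Integer.parseInt(f[2])};
    }

    // 0-based index of the GT that follows the exon (1-based end + 1).
    static int donorIndex(int[] exon) {
        return exon[1];
    }

    // 0-based index of the AG that precedes the exon (1-based start - 2).
    static int acceptorIndex(int[] exon) {
        return exon[0] - 3;
    }

    public static void main(String[] args) {
        String seq = "aaagtccccagttt"; // toy sequence: exon, intron gt...ag, exon
        int[] first = parseExon("toy 1 3");
        int[] second = parseExon("toy 12 14");
        for (int i : candidateSites(seq, "GT")) {
            System.out.println("donor candidate " + i + " positive=" + (i == donorIndex(first)));
        }
        for (int i : candidateSites(seq, "AG")) {
            System.out.println("acceptor candidate " + i + " positive=" + (i == acceptorIndex(second)));
        }
    }
}
```

Every candidate that does not coincide with an annotated exon boundary is treated as a negative example in this sketch, which is why negatives greatly outnumber positives.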
   



B. DATA

Various data-sets and samples used in our TCBB 2012 paper are available below.

We have provided a sample data set of 3000 sequences (2500 negative and 500 positive) drawn from the total set.
This is the full data set used for training.
Here is the data set we used for testing the trained SVM and for annotation.
A comparison tool compares every sequence in the training set against the testing set to detect overlap.

Here is the data set after we removed the overlap.




C. MODELS, RESULTS, PARAMETERS, FEATURES

Best model(s) according to ROC measure:
    C=1.0    shift=2

Best model(s) according to PRC measure:
    C=1.0    shift=2

Best model(s) according to accuracy measure:
    C=10.0    shift=2

Detailed results (+ marks the best value in each column):
    C       shift    ROC       PRC       Accuracy (at threshold 0)
    0.1     0         99.1%     85.6%     96.3%
    0.1     1         99.2%     87.6%     96.3%
    0.1     2         99.3%     88.4%     96.3%
    1.0     0         99.1%     85.8%     97.7%
    1.0     1         99.3%     87.7%     98.2%
    1.0     2        +99.3%    +88.7%     98.3%
    10.0    0         99.0%     85.5%     98.2%
    10.0    1         99.2%     87.1%     98.5%
    10.0    2         99.3%     88.6%    +98.6%

SVM Cross-validation with Spectrum Kernel

We performed some cross-validation benchmark runs using the basic Spectrum Kernel. Here are the auPRC curves obtained for acceptor and donor on sampled data sets of 20K worm sequences. The spectrum kernel did not do very well, as it is basically composition-based and lacks the information available in positional, regional, correlational, and other more complex features.



Acceptor 20K run with 3 fold cross validation
Donor 20K run with 5 fold cross validation


SVM LibLinear Model for FG-EA

Gene Annotation for 5 Sequences (Scores at all positions, Exon markings, TP/FP data)

We took 5 random sequences from the B2Hum GeneSplicer data set. For each window (162 nt, the same as the training window), sliding from the start of the sequence to its end, we predicted donor and acceptor scores at window position +80 (83). We did not include the first 80 and last 80 positions, as they would be outliers due to random filler positions.
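The window scan described above can be sketched as follows. The scorer is passed in as a function because the trained SVM model is not part of this example. The offset of 80 within the 162-nt window, the class name, and the use of NaN for unscored boundary positions are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch of scoring every position with a sliding window.
class WindowScanSketch {
    static final int WINDOW = 162; // same length as the training window
    static final int OFFSET = 80;  // assumed index of the scored position inside the window

    // Returns one score per sequence position; positions too close to
    // either end never fall at OFFSET of a full window and stay NaN.
    static double[] scan(String seq, ToDoubleFunction<String> scorer) {
        double[] scores = new double[seq.length()];
        Arrays.fill(scores, Double.NaN);
        for (int start = 0; start + WINDOW <= seq.length(); start++) {
            String window = seq.substring(start, start + WINDOW);
            scores[start + OFFSET] = scorer.applyAsDouble(window);
        }
        return scores;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("acgt".charAt(i % 4));
        }
        // Placeholder scorer: fraction of 'g' in the window, standing in for an SVM.
        double[] scores = scan(sb.toString(),
                w -> w.chars().filter(c -> c == 'g').count() / (double) w.length());
        System.out.println(scores[80]);
    }
}
```

Skipping the boundary positions rather than padding them avoids scoring windows that contain filler, which is the reason given above for excluding the first and last stretches of each sequence.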

Sequence 1:

Sequence Data:

tgtgggccaaagctgaccaactcccccaccgtcatcgtcatggtgggcctccccgcccgggg
caagacctacatctccaagaagctgactcgctacctcaactggattggcgtccccacaaaag
gtgagactgggtctcgaggccggacccctgctcgtgcaaaaacttgaccttccatctcagtt
gctggtttcacaggcaggaccacccctgggcgcctctgtcccctggtcggggagtctgatcc
tgactcccatgcagctgggtccccttcacactgtgtctacaccctttttcttctgttggggt
tttctcctcactggccattctctttcccactcccctaagtcctgcccgactcagatttcacc
tccttcaaagccatctccagccacctcagtcctttacattcctgtcgcactcagtcctagct
aacattcctctgctgatagccctaagcactgcctaccgtagatcaagggtcttcaaccccca
ggctgtggactgctatgttgggtggcctgttaggaacctggctgcaggaggtgagtggcagt
gagtgcgcattcctgcctgagctccgcctccagtctgatcagtggtggcattagattcccat
gggagtgaaccctattgtgaactgggcatgtgatggatctaggttgtgtgctccttatgata
atctaatgcctgatgatctgaggtggaacagtttcatcccaaagccgtacacccacaacctc
cctgtccgtggcaaaattgtcttccacgaaagtggtccctggtgccaaaaaggatggggact
gctgttgtggatgacacatttaagaatctctgcttggtcttcccaactagatggggaacatc
ttgtggataggcactgggtattttttttttttcaattaattttattttagattcgggtacat
gtgcatgtttgttacctgagtatattgcatattgattggggattgggcttctagtgaacctg
ttacccagacagtgaacattgtacccaactggttattttttaacccttgctcatctcccacc
ctcccctcttttgaaaggatggtttttatgtttaccatccccgctgtgcccagctcagtgcc
caggaaccatgggacacatgtctccggccctggccctggcggctgcaccagcagaagggttg
acctagatctggcggtgggcatgaggccccactcgctttctgataccattggtggaaatgtg
ttgtccctatttttgctagacttggggcctccccccacccatcctcacagtctcagtctctg
gtttaggatattgtgggactgagattgtagacttggctggaggggacagtgtcgcagagcat
gcggcaggggcattgcagctggactggggaagccgggcagcctgtgccctgctgtcctcagg
agttgagggtagcgtaggactgggcgccggcattgcttccacaccatctcattcaagtcggt
gtggttggcctggtggctcttcctttggtcgatcctatggtcccggtgtgagctggcccctt
cctcctgctcgatcatccagactgttctctttcccgtccacagtgttcaacgtcggggagta
tcgccgggaggctgtgaagcagtacagctcctacaacttcttccgccccgacaatgaggaag
ccatgaaagtccggaagtaaggctgggccgcgggcgtagggctgggctgtgggaataaggct
gggccgcgggcataaggctgggctgcaggagtaaggctgggccgcgggcgtagggctgggct
gtgggaataaggctgggctgcggggctgcgggtgtaaggctgggctgcgggcttgggcttgc
tttgttctgagccaggctctgtaggaggcagagaaagccggcctgcgggtgggtcctgctgg
tggcttcagggctgtgcagtgtgggaacaaggttgctgctctcatctgttccctgagggtct
cgctgagctgaccctcactgggacagagccgcagaagcaccctcactgagaattagcaagtc
ctgttcatcgggagtgtctgtttatgtttggaaaggatttcttgttaagtgcgagttgattg
tgaaaccctagtgcaagggtgttcggccccggctgtatgctggagtctcctgggggatgcta
aaatccctgatgccaggccccaccccagaccagttagtgcaactctgtgggtggctctgggc
agccatttcttagcagctccccaggagtggttcccgcgtgcagcccggtgcgagcaccgcct
cccgagggttcctgaggccacccgtgtggggccccaggttggaagcctctgggcctgagggg
tgctggccagcctgcccagcgctaagcagtgtagaatggaaccaccagttactctgcttgtc
ggggcgcgtcgggagttgcttgggctgccccagcggatgcaccagcgctgacattcgggcaa
tgtcttgtgcattccaggcaatgtgccttagctgccttgagagatgtcaaaagctacctggc
gaaagaagggggacaaattgcggtaagtccaggcaatgtagccggctcggtggtccagtccc
acccatgagggttgtcctcaccggcctgggttgtacccgagaccctgtgcctctgcccctga
gggtggaggaaagcaggtgccaggccagcccaatggctcaggaaggcggatgtggggcggac
tcagggacccaccccctctgtggccagtcccacacctccagggcgggcaggactcgagttac
ccacaggaccactggcccttccttgtacactgtcccattgcacattactgttttggttgttt
gcttttggggtatttcccccacgctaccacccctgagtgctgttggtagttgcgggaggcct
gcctcctaagcactccctggagaggactgtgggcaggcacaagctcccctgaagcacttttg
gggcttattcctggagatgatgggtctggcatctttgctgttctggtaaacgttgttaaact
caaacttagagttttcttcctggccctttgctctccatatcaggttttcgatgccaccaata
ctactagagagaggagacacatgatccttcattttgccaaagaaaatgactttaaggtgagc
tgagcgtcttgtgttgtgctagagggctcgggagagaggaaaaggcctcaggaagcggcctg
agtcctgcttggtccttcccctcctgctgccatgttccttagtcctgctcggtccctcccct
cctgctgcccacgttttgtggccttaaatttgggcctttgtgtggtagggctcttgggggcc
gtggggcgttccagaggctgcagtgaggaggtgggctttcctgtgtttgggagttggtgggt
ggcgtgcttccgccttgtccgggtgatgacatcgcagtgatggcaaggttgatccttcgtgg
tgtttgttttctaacatccccacctcacttccaccaggcgtttttcatcgagtcggtgtgcg
acgaccctacagttgtggcctccaatatcatggtaagacagccgggagccccgtgcttctgc
ggcagcgtagaccacaagggcatttgcagtctgaggaagtgggggaccgggtccagctcgga
agcgcagctgagcccttag

Exons
45437_AB012229 1 124
45437_AB012229 1594 1690
45437_AB012229 2498 2564
45437_AB012229 3082 3156
45437_AB012229 3510 3677

Predictions:
1. The top line shows the actual exon start-end positions.
2. SVM score, scaled to [0,1], for each position, using a window of length 162.

Observations:
1. Exon2, Exon3, and Exon4 correctly identified.
2. Initial exon acceptor not considered due to the boundary condition (the first 80 positions are treated as outliers).
3. Final exon donor not considered due to the boundary condition.
4. One false positive donor at position 80, corresponding to a GT.




Sequence 2:

Sequence Data:

atggtaggtgtttcgtttcttgcttctttttccttgccggcggagacccctaagctgtattc
ccattgcccctagtcatccactccctaccatggtcgggggttccaggctgcgcatggccgcc
tgcggggcagggtggccggcgcgggcccggggcggggctcccggagccgtgtgttaggcccg
cggttcggatctctaggacacgcgggcccctgcgctaccgtggtgagacctcacggccctga
gcggatcggtaccctcagctttcccaaacgctccagaagttaggtctttgacccacaggctt
acaggaccatctcggctggcgggcatcgccccctgcccctaattccttaggccttaccacca
agctttttccacacagccatccagactgaggaagacccggaaacttaggggccacgtgagcc
acggccacggccgcataggtaagtgccggcttcccctcggggtgggccttgggctctcttcg
ggtgcttagctagtctggagatcggtagcctataagtgggttagaataagacctttttgtgg
tcaagttgcacagctgttgatttttttctgacgatcctctagtattccagttctaaggaatt
tcacatcagtggggtaataggaattgagcaggcacggtattgggttagttgaagacatggag
tactgtgggaatgctgtgatgtggaacctgaaaagatgtttcacccggaatcctaaagtaat
cgcattgctgaaaaccggcatcggtaggtgggaacagcgtaagcgggacacagaagtctggg
aaacactctgcttttgtgcgaagaagtattgagatgcatgaagaagctgtgtggtgcatgta
gcttttttgtgtgtgtgtgagactgatcactgtcgcccaggctggagtgcagtggcgaaatc
tcggctcactgcaacccctgcctgccgggctcaagcgattctcctgcctcagcctcccgagt
agctgatattacaggtgtgcgccacgacgcccggctaattctttttctatttttagtagagt
cgggggtttctccgtgttggccaggctggtctcgaactcctgacctcaggtgatctaccccc
cctcgtcctcctaaaatgctgggattacaggcatgagccatcacaccggccccacgtagctt
tgtattcctgcaggcaagcaccggaagcaccccggcggccgcggtaatgctggtggtctgca
tcaccaccggatcaacttcgacaaatagtaagtgtccttggactgcttttattgaaacagct
tgggaggtaggggcagagagagggctggcttaaacaaaaagtttagaagcaagccttgccta
ttgctgttttttaccaagttaacacttggtgtgaactgagaacctgtcatcgaggctagagt
cacgcttgggtatcggctattgcctgagtgtgctagagtcctcgaagagtaactgctgacct
tattcactggctgtgggccttatggcacagtcagtcaccaggttagagacatgcttcacatt
cacctacccacaaactagtggatgataaattttggctattcagaagacgtttattataggag
tatgtagattttccatagagtgctgttatgtgacttgaattttagtctcggccctgcctctg
acattgtcggtggtttatcctggttccaggaaataagactagccttttcctcatgatagtct
ttggtggtttttaaaacagttgtttaagtcaacagatgtatcatatgcctgacactgctcta
caccagtgaataatttacactctaatagggggtggtaactataaagatgataaacatagcat
cttaattggagtgtgtatgaaggtggttgttacctcttcctagccacccaggctactttggg
aaagttggtatgaagcattaccacttaaagaggaaccagagcttctgcccaactgtcaacct
tgacaaattgtggactttggtcagtgaacagacacgggtgaatgctgctaaaaacaagactg
gggctgctcccatcattgatgtggtgcgatcggtaagttaattggatgtttttctgtacttc
cataccttcccttacaaaactctggcttaatctaatccacttatataatctgtacttcccag
ttacctaccagacattgatattcttcctgtggtagaattatcataggtagttccctatccgt
agcagtgcctactgtcactgcccaggttgtatcaggtttgcatttcgtgcttgaactatagc
tggttttcactgagcacagctcttggcccttcatgttctccagataatagaatcctaatatg
ttccattgatactcagtgccatgcattatctgaagagattttcccccaaaacagatgtatta
tgtctgtccttgcgggggttctggtccctgtgtcagtcttaactctcatgaatatagaggta
gtgttaagaggccagaacctagggacgctttaaattcacttcccagcctatttaatgtccat
tgagtagttctggtggtcaggaaggtggttgtcttcttttgcttagcagggggtatttgagc
aggaggaggcttatgctttgccgagactagagtcacatcctgacacaactcttgtcctggtg
tgctagagtactcgaagagaatctactggtcttgattcactggtgggggcagtcggtgcccc
cgttagtgcccagatcagaaacatacataccctgcctagggatttagaaagtgggttggcag
tctttcctcacgcccatcacgcagttggtacctactacagtgtattgtaaacttttttctct
gttcttctagggctactacaaagttctgggaaagggaaagctcccaaagcagcctgtcatcg
tgaaggccaaattcttcagcagaagagctgaggagaagattaagagtgttgggggggcctgt
gtcctggtggcttga

Exons
45514_AB020236 1 3
45514_AB020236 389 452
45514_AB020236 1192 1267
45514_AB020236 1904 2078
45514_AB020236 2863 2991

Predictions:
1. The top line shows the actual exon start-end positions.
2. SVM score, scaled to [0,1], for each position, using a window of length 162.

Observations:
1. Exon2, Exon3, and Exon4 correctly identified.
2. Initial exon acceptor and donor not considered due to the boundary condition (the first 80 positions are treated as outliers).
3. Final exon donor not considered due to the boundary condition.
4. One false positive donor at position 80, corresponding to a GT.




Sequence 3:

Sequence Data:

atgctgctcctgttcctcctcttcgagggtctctgctgtcctggggaaaatacagcaggtaa
gaagagtgcaggtggaaagatacctatggtagggcaccagagggctgagaggaagctctggg
gaggtcctgggggagggagcagtactcttctaggatgcccttggaatatgcctttcaggcta
gttccaggcagagaattcttgctctcagtctcagtttttgtctctgattttggagaaaggaa
gctggccccacaggaaaagggtattggagtatgtacaagctacctaactgtctctcatctct
gggttccttttttccctttggcatcacttttcccatccctttacattctctctacttgtcat
ttccctctctctcagctccccaggctctacaatcctatcatctagcagcagaggagcagctg
tccttccgcatgctccaaacttcctcctttgccaaccacagctgggcacacagtgagggctc
aggatggctgggtgacctgcagactcatggctgggacactgtcttgggcaccatccgctttc
tgaagccctggtcccatggaaacttcagcaagcaggagctgaaaaacttacagtcactgttc
cagttatacttccatagttttatccagatagtgcaagcttctgctggtcaatttcagcttga
atgtaagttcgttgctctaagctgataatttgcctgggaacaccaactatttccaaatgaag
atagatatatagactctgaccatcatttaaccttactaaccttgttccccactctctgactc
ccactccctcctctgcttcacccttcaccaccacccacactccaccatatacacaaaagggc
ctgcatgtacatatctcaacatgaatatagcttcatgtctggctctttggaatgattgtctc
ctctggatcttctgcccctcattcctgccctcagactcagcctttctcaaccctctttctgc
ccttctttatcctttgcctgagtgttgacatggactggcctgtacctaaccactttcacgtg
aattatttatgaccaatctcctatcttcctgatagccttcccatctaccactttcccattag
ttatttcaaagtatctttattatcttcaaattttcttcccacaaattttcttcccctttgcc
agtaaactctagtctccatatgatttcccagcaaatttttcttcccttgaatctctttactg
tctaaattgtttgtttttcttccttgtcattctttccataatgatctctcttccctgtccac
tctcagaccccttcgagatccagatattagctggctgtagaatgaatgccccacaaatcttc
ttaaatatggcatatcaagggtcagatttcctgagtttccaaggaatttcctgggagccatc
tccaggagcagggatccgggcccagaacatctgtaaagtgctcaatcgctacctagatatta
aggaaatactgcaaagccttcttggtcacacctgccctcgatttctagcggggctcatggaa
gcaggggagtcagaactgaaacggaaaggtgagcccaactctctctctcccctcttgttcct
agtactataactctcatatttgaatttgcctctcatcatcattttgaaagacatagtgagag
actagagaatgagatgtgtgggttcaggactgtttcttagacaagagaaagaagtgattact
aaatcactcttagtattattacaaaggcacctgagtctctgagctctggcctggggtgccct
tcaaaattccattttttttctatcttcttcttcctagtgaagccagaggcctggctgtcctg
tggccccagtcctggccctggccgtctgcagcttgtgtgccatgtctcaggattctacccaa
agcccgtgtgggtgatgtggatgcggggtgagcaggagcagcggggcactcagcgaggggac
gtcctgcctaatgctgacgagacatggtatctccgagcaaccctggatgtggcggctgggga
ggcagctggcctgtcctgtcgggtgaaacacagcagtctagggggccatgatctaatcatcc
attggggtgagaaacagctgaggctctgctgggaaataatgaaaatagccctggggcttttg
agtgtggggctgaggaaatgggtaggaatgctaggtacaagaagggtaaaactgggacaatc
aaaataaagaaggatagagtatgacagtagttaaattttaagaaaatggaagtagagaatta
gacatactaacagaaaaaggaggaggaactagtgatttagtgggagagggttgggaggagat
cacagacaaaggatcaggaggaattgaaatgagggctttggaaaacccagatgaaaattcta
ggaaggtcccacccttgtgaaatgggaaatctcagcttggtggaatagagtattttagggtt
ggtattcttattctatccccaaccaggtggatattccatctttctcatcctgatctgtttga
ctgtgatagttaccctggtcatattggttgtagttgactcacggttaaaaaaacagaggtga

Exons
45868_HSCDIR2 1 58
45868_HSCDIR2 418 684
45868_HSCDIR2 1309 1578
45868_HSCDIR2 1836 2114
45868_HSCDIR2 2507 2604

Predictions:
1. The top line shows the actual exon start-end positions.
2. SVM score, scaled to [0,1], for each position, using a window of length 162.

Observations:
1. Exon2, Exon3, and Exon4 correctly identified.
2. Initial exon acceptor and donor not considered due to the boundary condition (the first 80 positions are treated as outliers).
3. Final exon donor not considered due to the boundary condition.
4. One false positive donor at position 81, corresponding to a GT, and one false positive acceptor around position 80.




Sequence 4:

Sequence Data:

ttcattccacagacacacacagcctctctgcccacctctgcttcctctaggaacacaggtaa
gagcttcaagcctctccagcttaataacatgaattatttttgagaataataatgatactgtg
ttctatatcatgcatctcctgcattctgtctgattatattttacttattctgccagagcaaa
attaaaatacctatttcatctgatttgtcctttatctaaattgcttagttccaagtaaacca
aggcacttttagaacacagagggagagtgccttgcagccagagagtcttgaaggagatgtca
gggacgcatcttaacagctggttggatgtgatccacagaggtctcctgttagcattcattgt
aaagccatcctacctagctctagtgtaaccagcaatgaaagaaagataaagaaagataaaga
gggtcgattacttatttacaatagtctttaaaaacgtagttttgtaagccttctaattagga
cattaatatatttaatatatgcacattgtagaaagattgaagcgttaaaaataagagaaaaa
ctttaaatgtcaaaatctcacaacccagatatatcatttctttaagaaaattgtactacaaa
ataccattccatttattaaagtcattctgacaggaatctgatgcttttccaggagttccaga
tcacatcgagttcaccatgaattcactcagtgaagccaacaccaagttcatgttcgatctgt
tccaacagttcagaaaatcaaaagagaacaacatcttctattcccctatcagcatcacatca
gcattagggatggtcctcttaggagccaaagacaacactgcacaacaaattagcaaggtagc
tatcagcatcattacgttgtcctgttgcagtttttctctggttccgtcggctagcacgcaga
tggtaatagatgtggtggtctgatgggtagcacagggggctgtgcaggaattcccataactg
tgagaccactgacttaaacagatcttttgagtaaagttttcttgtcccgcttcatgtctctt
ccaggttcttcactttgatcaagtcacagagaacaccacagaaaaagctgcaacatatcat

Exons
45378_AB005548 1 58
45378_AB005548 673 863
45378_AB005548 1059 1115

Predictions:
1. The top line shows the actual exon start-end positions.
2. SVM score, scaled to [0,1], for each position, using a window of length 162.

Observations:
1. Exon2 correctly identified.
2. Initial exon acceptor and donor not considered due to the boundary condition (the first 80 positions are treated as outliers).
3. Final exon acceptor and donor not considered due to the boundary condition.
4. No false positives among acceptors or donors.



Sequence 5:

Sequence Data:

atgcttgcgggtgccgggaggcctggcctcccccagggccgccacctctgctggttgctctgt
gctttcaccttaaagctctggtaaggagagtaaggggagggaaaagagtcactggaggacaga
ttcgttccccggcctccgggctggggagattcttatccatggacagcgcaggagggggtgccc
tattttctgggttttgacttttgtcccttcccccaactccgctacagccaagcagaggctccc
gtgcaggaagagaagctgtcaggtgggtgattgtctgggaacccgacacgtcccgtcgtttgt
ttttggttttagtatcttccccatataccaagagcacaggtggaaaaggccggcaccacatgg
aatgagggcaagcgctgtgccgggaagttggatgggcagagcctgggagaaatgttttctgaa
tggcctggggctggcatttatctttcctttcagcaagcacctcaaatttgccatgctggctgg
tggaagagtttgtggtagcagaagagtgctctccatgctctaatttccgggctgtgagtatct
atttctctaccctttttttttttaaacatctaagcaaacacttaagcaacatcttaacatcca
agtcttattaagctgcctttgcatcatctttttttctttcttcttgcagtctcagtgtatgtc
caaatctgtacttgtctttcccctgaaaatgtgcccctttctttccaagctaatggcagtatt
caactgataactatggtttttttgtttgtttttgagacggagtctcactctgttgcccaggct
ggagtgcagtggcgagatctcggctcaccacaacctccgcctcccgggttcaagcgattctcc
tgcctcagcctcctgagtagctgtgactaaaggcgcgttgcaccatgcctggctaatttttgt
atttttagtagactcgggtttcactgtgttggccaggctagtctctaactccttacctcgtga
tctgcccgcctcagcctcccaaagtgctgggattataggcgtgagccgccacacccggcctca
aataactatgttttattcacttttagtatagtaggctctggaatggaatgtatctttgccact
cctagactgttgcccctgaagtgttctaacatacattcgtaatcatgcaaccaccacctccac
catccgcatcagaactctttcattagctctctgttgcctaccactccaaaccatagcagttgg
gcacctgcaccttctgaatggcagcctttttgtttatcctgttgccccttcctaacatgtact
ttgctccttttctcctggcagaaaactacccctgagtgtggtcccacaggatatgtagagaaa
atcacatgcagctcatctaagagaaatgagttcaaaaggtgagtgctgtttctcttctctgag
atcacctcctttgctcctatgactttagatggatggtcccaccttttcccctggtctgggggt
agcatggaatatctgaaagaagagggggccgggctcacacctataatcccagtacttagggag
gctgaggctaacagatcacctgaggtcaggagttcgagatcagcctggccaacgtggtgaaac
cccatctctactaaaaatacaaaaattagctggtgtggtgggtgcccataatcccagctactt
gggaggctgaggcagggagaatcacttgaacccaggaggcagaggttagagtgagccaagatc
atgccactgcactccagcctgggcaatggagcgagactgtgtgtcaaaaaaaaaaaaaaagag
ctgggagggtggctcacgcctgtaatcccagcactttgggagccgagcggtggatcacgagtt
aggagttcaagaccagcctgccaagatggtgaaactcatctctcttaaagatacaaaaattag
ctgggcatgatggcgggcacctgtaatcccagctacttgggaggctgaagcagaaaactgctt
aaacccgggaggtggaggttgcagtgagccaaggtcacgccactgcactccagtctgggtgac
agagcgagacaccatctcaaaaataataagaaagaagacgggatttggagccaaattacttga
acccaagtcccagctctactcctcagcactatgttattggccatattttctcatcagtaaaat
tggttaatactagctctgcttatccagagaataaattaaatgatctatttgagagggatttat
gtactatacacacaatagaagcctggctctgatttcctcttgagtgtatcctggctaggctga
agtagaagggaaattagagggagaagctgtctggccttccttgggaataactattccttcctg
ccctcagctgccgctcagctttgatggaacaacgcttattttggaagttcgaaggggctgtcg
tgtgtgtggccctgatcttcgcttgtcttgtcatcattcgtcagcgacaattggacagaaagg
ctctggaaaaggtccggaagcaaatcgagtccatatag

Exons
45463_AB016492 1 83
45463_AB016492 237 274
45463_AB016492 475 557
45463_AB016492 1345 1424
45463_AB016492 2402 2558

Predictions:
1. The top line shows the actual exon start-end positions.
2. SVM score, scaled to [0,1], for each position, using a window of length 162.

Observations:
1. Exon2, Exon3, and Exon4 correctly identified.
2. Initial exon acceptor not considered due to the boundary condition (the first 80 positions are treated as outliers).
3. Final exon donor not considered due to the boundary condition.
4. One false positive donor at position 82.



Training, Testing, Interpretation Times

Some recorded times are provided below. We used a Windows XP machine with 16 GB of memory and 4 cores.

Data set                Fasta files size   Sparse LibSVM/LibLinear files   Evolutionary run time   Feature interpretation time   SVM cross-validation time
Acceptor/Donor (20K)    2.8 MB             563 MB                          2.5 hours               1 hour                        15 minutes (LibSVM), 2 minutes (LibLinear)
Acceptor/Donor (180K)   25.68 MB           8 GB                            6 hours                 2.5 hours                     5 minutes (LibLinear)




Citation

Please cite: Uday Kamath, Jack Compton, Rezarta Islamaj Dogan, Kenneth A. De Jong, and Amarda Shehu. An Evolutionary Algorithm Approach for Feature Generation from Sequence Data and its Application to DNA Splice-Site Prediction. Trans Comp Biol and Bioinf 2012, 9(5):1387-1398.



Copyrights and Trademarks


1. LibSVM: Copyright (c) 2000-2010 Chih-Chung Chang and Chih-Jen Lin.
All rights reserved, as stated in the LibSVM source code.

2. ECJ: ECJ is licensed under the Academic Free License, version 3.0, included in the package.

3. Java: Java is a registered trademark of Oracle.

4. BioJava: BioJava (http://biojava.org/wiki/Main_Page) is licensed under LGPL 2.1.

5. LibLinear: Copyright (c) 2007-2010 The LIBLINEAR Project.