Install and Run

Contents:
- Prerequisites
- Running EFFECT Algorithm
- Running Statistical Algorithm Tests
- Running Spectrum/KMer Tests
- Running Gibbs Sampling Tests

Prerequisites:
The following software is required to run EFFECT end-to-end:
1. Java (TM): http://www.oracle.com/technetwork/java/index.html The base code is written in Java.
2. Bio-Java (TM): http://biojava.org/wiki/Main_Page
version 3.0.3 onwards. Bio-Java is used for sequence pattern matching and I/O.
3. ECJ (TM): http://cs.gmu.edu/~eclab/projects/ecj/
version 20 onwards. ECJ is the base framework on which the GP-based feature construction algorithm, EFC, in EFFECT works.
4. JSTACS (TM): http://www.jstacs.de/index.php/Main_Page
version 2.1 onwards. JSTACS is used for the comparative algorithms in the statistical experiments.
5. SHOGUN (TM): http://www.shogun-toolbox.org/
version 2.1 onwards. SHOGUN is used for running SVM-based kernel methods like Weighted Degree and Weighted Degree with shift.
6. WEKA (TM): http://www.cs.waikato.ac.nz/~ml/weka/
version 3.7.* onwards. WEKA is used for running training/testing and obtaining basic evaluation metrics such as auROC and auPRC. WEKA's feature selection and attribute evaluation are also used for feature selection in EFFECT. The base sequences with class labels are split into 10 train/test samples using stratified sampling in WEKA.
Running EFFECT Algorithm:
We give step-by-step instructions on how to run EFFECT:
Step 1: Perform stratified sampling of the sequence data into 10 folds using WEKA's filter, with the sequences as a string attribute and the class labels as the class attribute: weka.filters.supervised.instance.StratifiedRemoveFolds -S 0 -N 10 -F 1
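As a rough illustration of what the stratified filter does (a hypothetical standalone sketch, not WEKA's implementation), stratified fold assignment shuffles each class separately and deals its instances round-robin across folds, so every fold keeps the original class proportions:

```java
import java.util.*;

// Hypothetical sketch of stratified k-fold assignment.
public class StratifiedFolds {
    // Returns fold[i] = fold index assigned to instance i.
    public static int[] assign(int[] labels, int numFolds, long seed) {
        // Group instance indices by class label.
        Map<Integer, List<Integer>> byClass = new LinkedHashMap<>();
        for (int i = 0; i < labels.length; i++)
            byClass.computeIfAbsent(labels[i], c -> new ArrayList<>()).add(i);
        int[] fold = new int[labels.length];
        Random rnd = new Random(seed);
        for (List<Integer> idx : byClass.values()) {
            Collections.shuffle(idx, rnd);      // randomize within the class
            for (int j = 0; j < idx.size(); j++)
                fold[idx.get(j)] = j % numFolds; // deal round-robin across folds
        }
        return fold;
    }
}
```

Because each class is dealt out separately, each fold receives an (almost) equal share of every class, which is what keeps the train/test class ratios stable across the 10 folds.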
Step 2: For each training fold (or on the entire data when all of it is used for training), run the EFC algorithm using the EFC GP code and a sequence.params file with the right Problem class, parameters, etc.: java ec.Evolve -file sequence.params
Step 3: The Hall of Fame (the features generated as specified in the sequence.params file) can be run through the interpreter to generate a machine learning file in libsvm format. For example, to run the NN269 interpreter:
java org.java.evolutionary.sequence.NN269SequenceFeatureInterpreter C:\Research\Software\ECJ-Trunk-Latest\NN269Features.txt NN269Train.libsvm 1 C:\Research\datasets\SpliceData\NN269\splice.train-real.A C:\Research\datasets\SpliceData\NN269\splice.train-false.A
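The interpreter's output is in the standard libsvm sparse format: a class label followed by index:value pairs (1-based indices), with zero-valued features omitted. A minimal hypothetical helper (LibsvmWriter is not part of EFFECT) that emits one instance in this format:

```java
// Hypothetical helper: format one labeled feature vector as a libsvm line.
public class LibsvmWriter {
    public static String toLibsvmLine(int label, double[] features) {
        StringBuilder sb = new StringBuilder(String.valueOf(label));
        for (int i = 0; i < features.length; i++) {
            // libsvm is sparse: zero-valued features are simply omitted.
            if (features[i] != 0.0)
                sb.append(' ').append(i + 1).append(':').append(features[i]);
        }
        return sb.toString();
    }
}
```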
Step 4: Using the same features file, run the testing fold exactly as in Step 3, but with the positive and negative testing files, to generate a testing machine learning file. For example:
java org.java.evolutionary.sequence.NN269SequenceFeatureInterpreter C:\Research\Software\ECJ-Trunk-Latest\NN269Features.txt NN269Test.libsvm 1 C:\Research\datasets\SpliceData\NN269\splice.test-real.A C:\Research\datasets\SpliceData\NN269\splice.test-false.A
Step 5: Load the training and testing files into WEKA (per-fold when cross-validating, or the whole sets when training and testing separately). Open the libsvm files and save them in WEKA's ARFF format in the WEKA Explorer. Manually, or via the UI, change the class attribute from numeric to categorical.
Step 6: Run the training and testing data through the Explorer for evaluation using the meta-learner AttributeSelectedClassifier. Evolutionary feature selection is done using GeneticSearch, and fitness is evaluated using CfsSubsetEval:
weka.classifiers.meta.AttributeSelectedClassifier -E "weka.attributeSelection.CfsSubsetEval " -S "weka.attributeSelection.GeneticSearch -Z 20 -G 20 -C 0.6 -M 0.033 -R 20 -S 1" -W weka.classifiers.bayes.NaiveBayes --
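As a conceptual sketch of what the GeneticSearch step does (a hypothetical re-implementation, not WEKA's code), a genetic search over feature-subset bitmasks with the settings above — population 20, 20 generations, crossover probability 0.6, mutation probability 0.033 — can look like this; the Fitness interface is a stand-in for CfsSubsetEval:

```java
import java.util.*;

// Hypothetical genetic search for feature-subset selection.
public class GeneticFeatureSearch {
    public interface Fitness { double of(boolean[] subset); }

    public static boolean[] search(int numFeatures, Fitness fitness, long seed) {
        final int popSize = 20, generations = 20;     // -Z 20 -G 20
        final double pCross = 0.6, pMutate = 0.033;   // -C 0.6 -M 0.033
        Random rnd = new Random(seed);
        // Random initial population of feature bitmasks.
        boolean[][] population = new boolean[popSize][numFeatures];
        for (boolean[] ind : population)
            for (int j = 0; j < numFeatures; j++) ind[j] = rnd.nextBoolean();
        boolean[] best = population[0].clone();
        double bestFit = fitness.of(best);
        for (int g = 0; g < generations; g++) {
            boolean[][] next = new boolean[popSize][];
            for (int i = 0; i < popSize; i++) {
                boolean[] a = tournament(population, fitness, rnd);
                boolean[] b = tournament(population, fitness, rnd);
                boolean[] child = a.clone();
                if (rnd.nextDouble() < pCross) {      // one-point crossover
                    int cut = rnd.nextInt(numFeatures);
                    for (int j = cut; j < numFeatures; j++) child[j] = b[j];
                }
                for (int j = 0; j < numFeatures; j++) // bit-flip mutation
                    if (rnd.nextDouble() < pMutate) child[j] = !child[j];
                next[i] = child;
                double f = fitness.of(child);
                if (f > bestFit) { bestFit = f; best = child.clone(); }
            }
            population = next;
        }
        return best;
    }

    // Binary tournament selection: keep the fitter of two random individuals.
    static boolean[] tournament(boolean[][] pop, Fitness fitness, Random rnd) {
        boolean[] a = pop[rnd.nextInt(pop.length)];
        boolean[] b = pop[rnd.nextInt(pop.length)];
        return fitness.of(a) >= fitness.of(b) ? a : b;
    }
}
```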
Step 7: Note the auROC and auPRC from the evaluator.
Step 8: For cross-validation, we perform this 10 times for the different training/testing splits of the original data and report the mean auROC and mean auPRC.
Running Statistical Algorithm Tests:
Running the statistical tests only requires changing the main() method to call the test we want to perform and passing the arguments, e.g.:
java org.java.statistics.NN269StatisticalMethodsTest C:\Research\datasets\SpliceData\NN269\Acceptor_Train_PositiveMinusN.fasta C:\Research\datasets\SpliceData\NN269\Acceptor_Train_Negative.fasta C:\Research\datasets\SpliceData\NN269\splice.test-real.A C:\Research\datasets\SpliceData\NN269\splice.test-false.A
Running Spectrum/KMer Tests:
Running the KMer tests requires running:
java org.java.featurebased.kmer.KMerMotifFeatureGenerator C:\Research\datasets\SpliceData\NN269\Acceptor_Train_PositiveMinusN.fasta C:\Research\datasets\SpliceData\NN269\Acceptor_Train_Negative.fasta NN269DonorKmer.libsvm 8
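The k-mer (spectrum) features counted by the generator can be illustrated with a minimal sketch (the class name KmerSpectrum is hypothetical): slide a window of length k over the sequence and count each distinct substring:

```java
import java.util.*;

// Hypothetical sketch of spectrum (k-mer count) feature extraction.
public class KmerSpectrum {
    public static Map<String, Integer> countKmers(String seq, int k) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Slide a window of length k across the sequence.
        for (int i = 0; i + k <= seq.length(); i++) {
            String kmer = seq.substring(i, i + k);
            counts.merge(kmer, 1, Integer::sum); // increment this k-mer's count
        }
        return counts;
    }
}
```

The resulting counts, one dimension per observed k-mer, form the feature vector that the spectrum kernel methods operate on.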
Running Gibbs Sampling Tests:
Gibbs sampling motif generation is done using:
java org.java.featurebased.gibbs.GibbSamplingMotifFeatureGenerator C:\Research\datasets\SpliceData\NN269\Acceptor_Train_PositiveMinusN.fasta C:\Research\datasets\SpliceData\NN269\Acceptor_Train_Negative.fasta NN269AcceptorGibbs.libsvm 8
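Gibbs sampling for motif discovery can be sketched as follows (a simplified, hypothetical version, not the EFFECT implementation): repeatedly hold one sequence out, build a position-weight profile from the current motif positions in the remaining sequences, and resample the held-out sequence's motif start position in proportion to how well each window matches the profile:

```java
import java.util.*;

// Hypothetical, simplified Gibbs sampler for motif discovery over DNA.
public class GibbsMotifSketch {
    static final String ALPHABET = "ACGT";

    // Returns one motif start position per sequence for motif width w.
    public static int[] sample(List<String> seqs, int w, long seed) {
        Random rnd = new Random(seed);
        int n = seqs.size();
        int[] starts = new int[n];
        for (int i = 0; i < n; i++) // random initial motif positions
            starts[i] = rnd.nextInt(seqs.get(i).length() - w + 1);
        for (int iter = 0; iter < 200; iter++) {
            for (int i = 0; i < n; i++) {
                // Profile built from all sequences except the held-out one.
                double[][] profile = profileExcluding(seqs, starts, w, i);
                String s = seqs.get(i);
                int m = s.length() - w + 1;
                double[] weights = new double[m];
                double total = 0;
                for (int p = 0; p < m; p++) {
                    double score = 1.0; // profile likelihood of this window
                    for (int j = 0; j < w; j++)
                        score *= profile[j][ALPHABET.indexOf(s.charAt(p + j))];
                    weights[p] = score;
                    total += score;
                }
                // Resample the start position proportionally to the weights.
                double r = rnd.nextDouble() * total;
                int chosen = 0;
                for (int p = 0; p < m; p++) {
                    r -= weights[p];
                    if (r <= 0) { chosen = p; break; }
                }
                starts[i] = chosen;
            }
        }
        return starts;
    }

    // Position-weight matrix (with pseudocounts) excluding one sequence.
    static double[][] profileExcluding(List<String> seqs, int[] starts,
                                       int w, int excluded) {
        double[][] counts = new double[w][4];
        for (double[] row : counts) Arrays.fill(row, 1.0); // pseudocounts
        for (int i = 0; i < seqs.size(); i++) {
            if (i == excluded) continue;
            for (int j = 0; j < w; j++)
                counts[j][ALPHABET.indexOf(seqs.get(i).charAt(starts[i] + j))]++;
        }
        for (int j = 0; j < w; j++) {
            double rowSum = 0;
            for (double c : counts[j]) rowSum += c;
            for (int k = 0; k < 4; k++) counts[j][k] /= rowSum;
        }
        return counts;
    }
}
```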
- We took 1% of the training data, with a 50-50 split between the classes, as the validation/evaluation data, which we did not use in train-test or cross-validation. We either used the well-known parameters for some techniques (by contacting the original developers or using their manuscripts as guidance) or used grid-based search for parameter tuning.
- We also ensured fair parameter usage. For example, in all experiments we used K = 1 to 8, since K-mers up to length 8 gave the best results in most cases; thus the Gibbs sampling motifs, our own technique's (EFFECT's) motif length, and the weighted degree and weighted degree with shift kernels all used the same maximum length as a constraint.
- For SVM-based methods, we used values for C, gamma, and lambda taken either from past research on the same dataset (such as NN269) or from the values that gave the best performance on the validation dataset. We used grid search for tuning, with large enough ranges (e.g., C = [0.00001, 10], epsilon/gamma = [0.00001, 10]) and medium step sizes (0.1).
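The grid search described above can be sketched as a pair of nested loops over the C and gamma ranges, keeping the pair that scores best on the validation set (the validationScore function here is a placeholder for training an SVM and evaluating it on the validation data):

```java
import java.util.function.BiFunction;

// Hypothetical grid search over (C, gamma) on a validation set.
public class GridSearch {
    // Returns {bestC, bestGamma, bestScore}.
    public static double[] search(double[] cValues, double[] gammaValues,
                                  BiFunction<Double, Double, Double> validationScore) {
        double bestScore = Double.NEGATIVE_INFINITY;
        double bestC = cValues[0], bestGamma = gammaValues[0];
        for (double c : cValues)
            for (double g : gammaValues) {
                double s = validationScore.apply(c, g); // e.g., auROC on validation
                if (s > bestScore) { bestScore = s; bestC = c; bestGamma = g; }
            }
        return new double[]{bestC, bestGamma, bestScore};
    }
}
```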
- For statistical methods, there are various parameters to choose, such as the stopping criterion and its epsilon. Each algorithm also has a choice of Markov chain order, etc. We used some default parameters after consulting experts from JSTACS. For some choices, such as the Markov chain order, we used the evaluation data and went to 2nd or 4th order based on performance on that data.
- We have given the final parameters used for each experiment as a table in the Supporting Information document accompanying the paper.