Install and Run

Contents:
- Prerequisites
- Running EFFECT Algorithm
- Running Statistical Algorithm Tests
- Running Spectrum/KMer Tests
- Running Gibbs Sampling Tests

Prerequisites:
The following software is required to run EFFECT end-to-end:
1. Java (TM): http://www.oracle.com/technetwork/java/index.html The base code is written in Java.
2. Bio-Java (TM): http://biojava.org/wiki/Main_Page
version 3.0.3 onwards. Bio-Java is used for sequence pattern matching and I/O.
3. ECJ (TM): http://cs.gmu.edu/~eclab/projects/ecj/
version 20 onwards. ECJ is the base framework on which the GP-based feature construction algorithm, EFC, in EFFECT works.
4. JSTACS (TM): http://www.jstacs.de/index.php/Main_Page
version 2.1 onwards. JSTACS is used for the comparative algorithms in the statistical experiments.
5. SHOGUN (TM): http://www.shogun-toolbox.org/
version 2.1 onwards. SHOGUN is used for running SVM-based kernel methods like Weighted Degree and Weighted Degree with shift.
6. WEKA (TM): http://www.cs.waikato.ac.nz/~ml/weka/
version 3.7.* onwards. WEKA is used for running training/testing and obtaining basic evaluation metrics such as auROC and auPRC. WEKA's feature selection and attribute evaluation are also used for feature selection in EFFECT. The base sequences with class labels are split into 10 train/test samples using stratified sampling in WEKA.
Running EFFECT Algorithm:
We give step-by-step instructions on how to run EFFECT:
Step 1: Perform stratified sampling of the sequence data into 10 folds using WEKA's filter, with the sequences as a string attribute and the class labels as the class attribute: weka.filters.supervised.instance.StratifiedRemoveFolds -S 0 -N 10 -F 1
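As a rough illustration of what the stratified filter does (a hypothetical standalone sketch, not WEKA's implementation), stratified fold assignment shuffles each class separately and deals its instances round-robin across folds, so every fold keeps the original class proportions:

```java
import java.util.*;

// Hypothetical sketch of stratified k-fold assignment.
public class StratifiedFolds {
    // Returns fold[i] = fold index assigned to instance i.
    public static int[] assign(int[] labels, int numFolds, long seed) {
        // Group instance indices by class label.
        Map<Integer, List<Integer>> byClass = new LinkedHashMap<>();
        for (int i = 0; i < labels.length; i++)
            byClass.computeIfAbsent(labels[i], c -> new ArrayList<>()).add(i);
        int[] fold = new int[labels.length];
        Random rnd = new Random(seed);
        for (List<Integer> idx : byClass.values()) {
            Collections.shuffle(idx, rnd);      // randomize within the class
            for (int j = 0; j < idx.size(); j++)
                fold[idx.get(j)] = j % numFolds; // deal round-robin across folds
        }
        return fold;
    }
}
```

Because each class is dealt out separately, each fold receives an (almost) equal share of every class, which is what keeps the train/test class ratios stable across the 10 folds.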
Step 2: For each training fold (or on the entire data when all of it is used for training), run the EFC algorithm using the EFC GP code and a sequence.params file with the right Problem class, parameters, etc.: java ec.Evolve -file sequence.params
Step 3: The Hall of Fame (the features generated as specified in the sequence.params file) can be run through the interpreter to generate a machine learning file in libsvm format. For example, to run the NN269 interpreter:
java org.java.evolutionary.sequence.NN269SequenceFeatureInterpreter C:\Research\Software\ECJ-Trunk-Latest\NN269Features.txt NN269Train.libsvm 1 C:\Research\datasets\SpliceData\NN269\splice.train-real.A C:\Research\datasets\SpliceData\NN269\splice.train-false.A
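The interpreter's output is in the standard libsvm sparse format: a class label followed by index:value pairs (1-based indices), with zero-valued features omitted. A minimal hypothetical helper (LibsvmWriter is not part of EFFECT) that emits one instance in this format:

```java
// Hypothetical helper: format one labeled feature vector as a libsvm line.
public class LibsvmWriter {
    public static String toLibsvmLine(int label, double[] features) {
        StringBuilder sb = new StringBuilder(String.valueOf(label));
        for (int i = 0; i < features.length; i++) {
            // libsvm is sparse: zero-valued features are simply omitted.
            if (features[i] != 0.0)
                sb.append(' ').append(i + 1).append(':').append(features[i]);
        }
        return sb.toString();
    }
}
```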
Step 4: Using the same features file, run the testing fold exactly as in Step 3, but with the positive and negative testing files, to generate a testing machine learning file. For example:
java org.java.evolutionary.sequence.NN269SequenceFeatureInterpreter C:\Research\Software\ECJ-Trunk-Latest\NN269Features.txt NN269Test.libsvm 1 C:\Research\datasets\SpliceData\NN269\splice.test-real.A C:\Research\datasets\SpliceData\NN269\splice.test-false.A
Step 5: Load the training and testing files into WEKA (per-fold when cross-validating, or the whole sets when training and testing separately). Open the libsvm files and save them in WEKA's ARFF format in the WEKA Explorer. Manually, or via the UI, change the class attribute from numeric to categorical.
Step 6: Run the training and testing data through the Explorer for evaluation using the meta-learner AttributeSelectedClassifier. Evolutionary feature selection is done using GeneticSearch, and fitness is evaluated using CfsSubsetEval:
weka.classifiers.meta.AttributeSelectedClassifier -E "weka.attributeSelection.CfsSubsetEval " -S "weka.attributeSelection.GeneticSearch -Z 20 -G 20 -C 0.6 -M 0.033 -R 20 -S 1" -W weka.classifiers.bayes.NaiveBayes --
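As a conceptual sketch of what the GeneticSearch step does (a hypothetical re-implementation, not WEKA's code), a genetic search over feature-subset bitmasks with the settings above — population 20, 20 generations, crossover probability 0.6, mutation probability 0.033 — can look like this; the Fitness interface is a stand-in for CfsSubsetEval:

```java
import java.util.*;

// Hypothetical genetic search for feature-subset selection.
public class GeneticFeatureSearch {
    public interface Fitness { double of(boolean[] subset); }

    public static boolean[] search(int numFeatures, Fitness fitness, long seed) {
        final int popSize = 20, generations = 20;     // -Z 20 -G 20
        final double pCross = 0.6, pMutate = 0.033;   // -C 0.6 -M 0.033
        Random rnd = new Random(seed);
        // Random initial population of feature bitmasks.
        boolean[][] population = new boolean[popSize][numFeatures];
        for (boolean[] ind : population)
            for (int j = 0; j < numFeatures; j++) ind[j] = rnd.nextBoolean();
        boolean[] best = population[0].clone();
        double bestFit = fitness.of(best);
        for (int g = 0; g < generations; g++) {
            boolean[][] next = new boolean[popSize][];
            for (int i = 0; i < popSize; i++) {
                boolean[] a = tournament(population, fitness, rnd);
                boolean[] b = tournament(population, fitness, rnd);
                boolean[] child = a.clone();
                if (rnd.nextDouble() < pCross) {      // one-point crossover
                    int cut = rnd.nextInt(numFeatures);
                    for (int j = cut; j < numFeatures; j++) child[j] = b[j];
                }
                for (int j = 0; j < numFeatures; j++) // bit-flip mutation
                    if (rnd.nextDouble() < pMutate) child[j] = !child[j];
                next[i] = child;
                double f = fitness.of(child);
                if (f > bestFit) { bestFit = f; best = child.clone(); }
            }
            population = next;
        }
        return best;
    }

    // Binary tournament selection: keep the fitter of two random individuals.
    static boolean[] tournament(boolean[][] pop, Fitness fitness, Random rnd) {
        boolean[] a = pop[rnd.nextInt(pop.length)];
        boolean[] b = pop[rnd.nextInt(pop.length)];
        return fitness.of(a) >= fitness.of(b) ? a : b;
    }
}
```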
Step 7: Note the auROC and auPRC from the evaluator.
Step 8: For cross-validation, we perform this 10 times for the different training/testing splits of the original data and report the mean auROC and mean auPRC.
Running Statistical Algorithm Tests:
Running the statistical tests only requires changing the main() method to call the test we want to perform and passing the arguments, e.g.:
java org.java.statistics.NN269StatisticalMethodsTest C:\Research\datasets\SpliceData\NN269\Acceptor_Train_PositiveMinusN.fasta C:\Research\datasets\SpliceData\NN269\Acceptor_Train_Negative.fasta C:\Research\datasets\SpliceData\NN269\splice.test-real.A C:\Research\datasets\SpliceData\NN269\splice.test-false.A
Running Spectrum/KMer Tests:
Running the KMer tests requires running:
java org.java.featurebased.kmer.KMerMotifFeatureGenerator C:\Research\datasets\SpliceData\NN269\Acceptor_Train_PositiveMinusN.fasta C:\Research\datasets\SpliceData\NN269\Acceptor_Train_Negative.fasta NN269DonorKmer.libsvm 8
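The k-mer (spectrum) features counted by the generator can be illustrated with a minimal sketch (the class name KmerSpectrum is hypothetical): slide a window of length k over the sequence and count each distinct substring:

```java
import java.util.*;

// Hypothetical sketch of spectrum (k-mer count) feature extraction.
public class KmerSpectrum {
    public static Map<String, Integer> countKmers(String seq, int k) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Slide a window of length k across the sequence.
        for (int i = 0; i + k <= seq.length(); i++) {
            String kmer = seq.substring(i, i + k);
            counts.merge(kmer, 1, Integer::sum); // increment this k-mer's count
        }
        return counts;
    }
}
```

The resulting counts, one dimension per observed k-mer, form the feature vector that the spectrum kernel methods operate on.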
Running Gibbs Sampling Tests:
Gibbs sampling motif generation is done using:
java org.java.featurebased.gibbs.GibbSamplingMotifFeatureGenerator C:\Research\datasets\SpliceData\NN269\Acceptor_Train_PositiveMinusN.fasta C:\Research\datasets\SpliceData\NN269\Acceptor_Train_Negative.fasta NN269AcceptorGibbs.libsvm 8
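Gibbs sampling for motif discovery can be sketched as follows (a simplified, hypothetical version, not the EFFECT implementation): repeatedly hold one sequence out, build a position-weight profile from the current motif positions in the remaining sequences, and resample the held-out sequence's motif start position in proportion to how well each window matches the profile:

```java
import java.util.*;

// Hypothetical, simplified Gibbs sampler for motif discovery over DNA.
public class GibbsMotifSketch {
    static final String ALPHABET = "ACGT";

    // Returns one motif start position per sequence for motif width w.
    public static int[] sample(List<String> seqs, int w, long seed) {
        Random rnd = new Random(seed);
        int n = seqs.size();
        int[] starts = new int[n];
        for (int i = 0; i < n; i++) // random initial motif positions
            starts[i] = rnd.nextInt(seqs.get(i).length() - w + 1);
        for (int iter = 0; iter < 200; iter++) {
            for (int i = 0; i < n; i++) {
                // Profile built from all sequences except the held-out one.
                double[][] profile = profileExcluding(seqs, starts, w, i);
                String s = seqs.get(i);
                int m = s.length() - w + 1;
                double[] weights = new double[m];
                double total = 0;
                for (int p = 0; p < m; p++) {
                    double score = 1.0; // profile likelihood of this window
                    for (int j = 0; j < w; j++)
                        score *= profile[j][ALPHABET.indexOf(s.charAt(p + j))];
                    weights[p] = score;
                    total += score;
                }
                // Resample the start position proportionally to the weights.
                double r = rnd.nextDouble() * total;
                int chosen = 0;
                for (int p = 0; p < m; p++) {
                    r -= weights[p];
                    if (r <= 0) { chosen = p; break; }
                }
                starts[i] = chosen;
            }
        }
        return starts;
    }

    // Position-weight matrix (with pseudocounts) excluding one sequence.
    static double[][] profileExcluding(List<String> seqs, int[] starts,
                                       int w, int excluded) {
        double[][] counts = new double[w][4];
        for (double[] row : counts) Arrays.fill(row, 1.0); // pseudocounts
        for (int i = 0; i < seqs.size(); i++) {
            if (i == excluded) continue;
            for (int j = 0; j < w; j++)
                counts[j][ALPHABET.indexOf(seqs.get(i).charAt(starts[i] + j))]++;
        }
        for (int j = 0; j < w; j++) {
            double rowSum = 0;
            for (double c : counts[j]) rowSum += c;
            for (int k = 0; k < 4; k++) counts[j][k] /= rowSum;
        }
        return counts;
    }
}
```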
- We took 1% of the training data, with a 50-50 split between the classes, as the validation/evaluation data, which we did not use in train-test or cross-validation. We either used the well-known parameters for some techniques (by contacting the original developers or using their manuscripts as guidance) or used grid-based search for parameter tuning.
- We also ensured fair parameter usage. For example, in all experiments we used K = 1 to 8, since K-mers up to length 8 gave the best results in most cases; thus the Gibbs sampling motifs, our own technique's (EFFECT's) motif length, and the weighted degree and weighted degree with shift kernels all used the same maximum length as a constraint.
- For SVM-based methods, we used values for C, gamma, and lambda taken either from past research on the same dataset (such as NN269) or from the values that gave the best performance on the validation dataset. We used grid search for tuning, with large enough ranges (e.g., C = [0.00001, 10], epsilon/gamma = [0.00001, 10]) and medium step sizes (0.1).
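The grid search described above can be sketched as a pair of nested loops over the C and gamma ranges, keeping the pair that scores best on the validation set (the validationScore function here is a placeholder for training an SVM and evaluating it on the validation data):

```java
import java.util.function.BiFunction;

// Hypothetical grid search over (C, gamma) on a validation set.
public class GridSearch {
    // Returns {bestC, bestGamma, bestScore}.
    public static double[] search(double[] cValues, double[] gammaValues,
                                  BiFunction<Double, Double, Double> validationScore) {
        double bestScore = Double.NEGATIVE_INFINITY;
        double bestC = cValues[0], bestGamma = gammaValues[0];
        for (double c : cValues)
            for (double g : gammaValues) {
                double s = validationScore.apply(c, g); // e.g., auROC on validation
                if (s > bestScore) { bestScore = s; bestC = c; bestGamma = g; }
            }
        return new double[]{bestC, bestGamma, bestScore};
    }
}
```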
- For statistical methods, there are various parameters to choose, such as the stopping criterion and its epsilon. Each algorithm also has a choice of Markov chain order, etc. We used some default parameters after consulting experts from JSTACS. For some choices, such as the Markov chain order, we used the evaluation data and went to 2nd or 4th order based on performance on that data.
- We have given the final parameters used for each experiment as a table in the Supporting Information document accompanying the paper.