Contents:
Prerequisites
Running EFFECT Algorithm
Running Statistical Algorithm
Running Spectrum
Running Gibbs
Tuning Algorithms

Install and Run

Prerequisites:
The following software is required to run EFFECT end-to-end:

1. Java (TM): http://www.oracle.com/technetwork/java/index.html
The base code is written in Java.

2. Bio-Java (TM): http://biojava.org/wiki/Main_Page
version 3.0.3 onwards. Bio-Java is used for sequence pattern matching and I/O.

3. ECJ (TM): http://cs.gmu.edu/~eclab/projects/ecj/
version 20 onwards. ECJ is the base framework on which EFC, the GP-based feature construction algorithm in EFFECT, runs.

4. JSTACS (TM): http://www.jstacs.de/index.php/Main_Page
version 2.1 onwards. JSTACS is used for the comparative algorithms in the statistical experiments.

5. SHOGUN (TM): http://www.shogun-toolbox.org/
version 2.1 onwards. SHOGUN is used for running SVM-based kernel methods such as Weighted Degree and Weighted Degree with Shift.

6. WEKA (TM): http://www.cs.waikato.ac.nz/~ml/weka/
version 3.7.* onwards. WEKA is used for training/testing and for computing basic evaluation metrics such as auROC and auPRC. WEKA's feature selection and attribute evaluation are also used for feature selection in EFFECT. The base sequences with class labels are split into 10 train/test samples using stratified sampling in WEKA.

Running EFFECT Algorithm:

We give step-by-step instructions on how to run EFFECT:

Step 1: Perform stratified sampling of the sequence data into 10 folds using WEKA's filter, with the sequences as a string attribute and the class labels as the class attribute:

weka.filters.supervised.instance.StratifiedRemoveFolds -S 0 -N 10 -F 1

Run the filter once per fold (varying -F); add -V to invert the selection and obtain the complementary training split for each fold.

Step 2: For each training fold (or on the entire data set when all of it is used for training), run the EFC algorithm using the EFC GP code with a sequence.params file specifying the right problem, parameters, etc.:

java ec.Evolve -file sequence.params
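
The contents of sequence.params depend on the problem being run; the fragment below is only an illustrative sketch. The parameter names follow ECJ's standard conventions (parent files, generations, subpopulation size), but the problem class name here is hypothetical and must be replaced with the actual EFC problem class.

```
# Inherit ECJ's standard Koza-style GP defaults
parent.0 = koza.params

# Hypothetical problem class name for the sequence feature-construction task
eval.problem = org.java.evolutionary.sequence.SequenceFeatureProblem

# Run length and population size (tune per data set)
generations = 50
pop.subpop.0.size = 1024
```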

Step 3: The Hall of Fame (the features generated, as specified in the sequence.params file) can be run through the interpreter to generate a machine-learning file in libsvm format. For example, to run the NN269 interpreter:

java org.java.evolutionary.sequence.NN269SequenceFeatureInterpreter C:\Research\Software\ECJ-Trunk-Latest\NN269Features.txt NN269Train.libsvm 1 C:\Research\datasets\SpliceData\NN269\splice.train-real.A C:\Research\datasets\SpliceData\NN269\splice.train-false.A
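
The generated file is in the standard libsvm sparse format: each line is a class label followed by index:value pairs for the non-zero features. A minimal sketch of reading one such line (a hypothetical helper, not part of EFFECT):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LibsvmLine {
    /** Parses one libsvm-format line: "<label> <index>:<value> ...". */
    static Map<Integer, Double> parse(String line, int[] labelOut) {
        String[] tokens = line.trim().split("\\s+");
        labelOut[0] = Integer.parseInt(tokens[0]);          // class label
        Map<Integer, Double> features = new LinkedHashMap<>();
        for (int i = 1; i < tokens.length; i++) {
            String[] kv = tokens[i].split(":");             // index:value pair
            features.put(Integer.parseInt(kv[0]), Double.parseDouble(kv[1]));
        }
        return features;
    }

    public static void main(String[] args) {
        int[] label = new int[1];
        Map<Integer, Double> f = parse("1 3:0.5 7:1.0", label);
        System.out.println(label[0] + " " + f);   // prints: 1 {3=0.5, 7=1.0}
    }
}
```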

Step 4: Using the same features file, run the testing fold exactly as in Step 3, but with the positive and negative testing files, to generate a testing machine-learning file. For example:

java org.java.evolutionary.sequence.NN269SequenceFeatureInterpreter C:\Research\Software\ECJ-Trunk-Latest\NN269Features.txt NN269Test.libsvm 1 C:\Research\datasets\SpliceData\NN269\splice.test-real.A C:\Research\datasets\SpliceData\NN269\splice.test-false.A

Step 5: Load the training and testing files (the folds when cross-validating, or the entire sets when training and testing separately) into WEKA. Open the libsvm files in the WEKA Explorer and save them in WEKA's ARFF format. Change the class attribute from numeric to nominal, either manually or through the UI.

Step 6: Run the training and testing data through the Explorer for evaluation using the meta-learner AttributeSelectedClassifier. Evolutionary feature selection is done with GeneticSearch (-Z population size, -G generations, -C crossover probability, -M mutation probability, -R report frequency, -S seed) and fitness is evaluated with CfsSubsetEval:

weka.classifiers.meta.AttributeSelectedClassifier -E "weka.attributeSelection.CfsSubsetEval " -S "weka.attributeSelection.GeneticSearch -Z 20 -G 20 -C 0.6 -M 0.033 -R 20 -S 1" -W weka.classifiers.bayes.NaiveBayes --

Step 7: Note the auROC and auPRC reported by the evaluator.

Step 8: For cross-validation, repeat Steps 2-7 for each of the 10 train/test splits of the original data and report the mean auROC and mean auPRC.
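
The final aggregation is a simple average over the fold scores. A small sketch (the per-fold auROC values below are made-up placeholders, not results from the paper):

```java
public class CrossValSummary {
    /** Returns the arithmetic mean of the per-fold scores. */
    static double mean(double[] scores) {
        double sum = 0.0;
        for (double s : scores) sum += s;   // accumulate fold scores
        return sum / scores.length;
    }

    public static void main(String[] args) {
        // Hypothetical per-fold auROC values noted from the WEKA evaluator
        double[] auROC = {0.95, 0.96, 0.94, 0.97, 0.95,
                          0.96, 0.95, 0.94, 0.96, 0.95};
        System.out.printf("mean auROC = %.3f%n", mean(auROC));
    }
}
```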

Running Statistical Algorithm Tests:

Running the statistical tests only requires changing main() to call the desired test and passing the appropriate arguments, e.g.:

java org.java.statistics.NN269StatisticalMethodsTest C:\Research\datasets\SpliceData\NN269\Acceptor_Train_PositiveMinusN.fasta C:\Research\datasets\SpliceData\NN269\Acceptor_Train_Negative.fasta C:\Research\datasets\SpliceData\NN269\splice.test-real.A C:\Research\datasets\SpliceData\NN269\splice.test-false.A

Running Spectrum/KMer Tests

Running the k-mer tests requires running:

java org.java.featurebased.kmer.KMerMotifFeatureGenerator C:\Research\datasets\SpliceData\NN269\Acceptor_Train_PositiveMinusN.fasta C:\Research\datasets\SpliceData\NN269\Acceptor_Train_Negative.fasta NN269DonorKmer.libsvm 8
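
The spectrum/k-mer representation counts the occurrences of every length-k substring in each sequence. A minimal sketch of that counting step (hypothetical illustration, not the KMerMotifFeatureGenerator itself):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class KmerCounter {
    /** Counts every length-k substring of seq with a sliding window. */
    static Map<String, Integer> count(String seq, int k) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            String kmer = seq.substring(i, i + k);
            counts.merge(kmer, 1, Integer::sum);   // increment this k-mer's count
        }
        return counts;
    }

    public static void main(String[] args) {
        // 2-mers of a short DNA fragment
        System.out.println(count("ACGTACG", 2));   // prints: {AC=2, CG=2, GT=1, TA=1}
    }
}
```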

Running Gibbs Sampling Tests

Gibbs-sampling motif generation is run using:

java org.java.featurebased.gibbs.GibbSamplingMotifFeatureGenerator C:\Research\datasets\SpliceData\NN269\Acceptor_Train_PositiveMinusN.fasta C:\Research\datasets\SpliceData\NN269\Acceptor_Train_Negative.fasta NN269AcceptorGibbs.libsvm 8
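
For background, the general Gibbs-sampling motif-search scheme works by holding out one sequence at a time, building a profile from the current motif positions in the other sequences, and resampling the held-out sequence's motif start in proportion to the profile score. The sketch below is a compact, seeded illustration of that scheme under simplifying assumptions (uniform background, fixed pseudocounts); it is not the project's GibbSamplingMotifFeatureGenerator.

```java
import java.util.Random;

public class GibbsMotif {
    static final String ALPHA = "ACGT";

    /** Simplified Gibbs motif sampler: returns one motif start per sequence. */
    static int[] sample(String[] seqs, int w, int iters, long seed) {
        Random rnd = new Random(seed);
        int n = seqs.length;
        int[] starts = new int[n];
        for (int i = 0; i < n; i++)
            starts[i] = rnd.nextInt(seqs[i].length() - w + 1);  // random init

        for (int t = 0; t < iters; t++) {
            int held = t % n;                                   // sequence to resample
            double[][] profile = new double[w][4];
            for (double[] row : profile)
                java.util.Arrays.fill(row, 1.0);                // pseudocounts
            for (int i = 0; i < n; i++) {
                if (i == held) continue;
                for (int j = 0; j < w; j++)
                    profile[j][ALPHA.indexOf(seqs[i].charAt(starts[i] + j))]++;
            }
            for (int j = 0; j < w; j++) {                       // normalize columns
                double sum = 0;
                for (double v : profile[j]) sum += v;
                for (int a = 0; a < 4; a++) profile[j][a] /= sum;
            }
            // score every window of the held-out sequence under the profile
            int m = seqs[held].length() - w + 1;
            double[] weight = new double[m];
            double total = 0;
            for (int p = 0; p < m; p++) {
                double score = 1.0;
                for (int j = 0; j < w; j++)
                    score *= profile[j][ALPHA.indexOf(seqs[held].charAt(p + j))];
                weight[p] = score;
                total += score;
            }
            // sample a new start position proportional to its score
            double r = rnd.nextDouble() * total;
            int p = 0;
            while (p < m - 1 && (r -= weight[p]) > 0) p++;
            starts[held] = p;
        }
        return starts;
    }

    public static void main(String[] args) {
        // Toy sequences with a planted ACGT motif
        String[] seqs = {"TTTTACGTTT", "GGACGTGGGG", "CCCCCACGTC"};
        int[] s = sample(seqs, 4, 300, 42);
        for (int i = 0; i < seqs.length; i++)
            System.out.println(seqs[i].substring(s[i], s[i] + 4));
    }
}
```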

Tuning Algorithms (Methodology and Parameters)