Source Code
The entire EFFECT code for the Evolutionary, Feature-based (Spectrum and Gibbs Sampling), Kernel-based, and other Statistical Methods is available here. We highlight some important classes below as guidance for developers and users.
Evolutionary Code
The code is split across two packages:
org.java.evolutionary.sequence.* contains all the utilities and the fitness function for evolution;
org.java.evolutionary.sequence.func.* contains all the low-level GP functions that carry out the evolutionary machinery.
The code uses basic Java, with BioJava for matching and parsing. Below we summarize some important Java files and map them to the paper for further clarity.
org.java.evolutionary.sequence.*
- sequence.params: This is the configuration file used to run ECJ for the GP-based Feature Generator. Note that it defines a sample set of nodes. All biological features are mapped to the GP-based representation in an extensible and flexible manner. Definitions of building blocks and higher-order features, such as motif, correlational, compositional, positional, regional, conjunctive, and disjunctive features, are found at this level. A correlational feature, for example, takes 4 explicitly defined parameters: 2 motifs, a start position, and a closeness or shift (1, 2, 3, etc.).
- org.java.evolutionary.sequence.HypersensitiveSiteRecognitionProblem.java: This is the Fitness function in the GP/Evolutionary world. This class measures the fitness of an individual by running it across the positive and negative sequences, measuring the True Positive Ratio and Discriminating Factor (Information Gain between positive and negative sequences).
- StatisticPluginForHallofFame.java and HallOfFame.java: These classes provide the plugin for the hall of fame (the hall of fame is described in the paper). The hall of fame can then be used to interpret the top features it contains at the end of a run. The file is a parameter in sequence.params.
- HypersensitiveSequenceFeatureInterpreter.java: This code reads the features stored in a file (the hall of fame output) and generates files in the LibSVM-specific format. It also outputs the features separately, to the file SSCleanFeatures.txt, for those who would like to analyze them further. The parallelization of feature matching is controlled by the Threads input argument. The sequences are split into chunks: the first threads-1 threads each get an equal share, and the last thread gets all the remaining sequences when the total does not divide evenly. For faster throughput, the number of threads should equal the number of cores/processors on the machine. This class additionally performs some simplifications: statements such as (AND true true) or (OR (NOT false)) are reduced.
- Compositional.java: This function, or non-terminal, is used for "Compositional" matching, i.e., position-independent testing for compositions like AGGG anywhere in the sequence. It takes just one parameter, motif. It returns a boolean.
- PositionBased.java: This function or non-terminal is used for generating "Positional" features. It takes two parameters, motif and position. If position is any uniformly generated integer, it will find positional features. It returns a boolean.
- RegionDownstreamMatch.java and RegionUpstreamMatch.java: These implement "Regional" features, as described in the paper. Each takes two parameters, motif and region. Each returns a boolean.
- ShiftPositional.java: Implements the "Shift Positional" feature, i.e., matching a pattern at a position with some tolerance for error in the position. It takes three parameters: motif, position, and the tolerance (1, 2, or 3 positions left/right). It returns a boolean.
- Correlational.java: This can be used for generating explicit correlational features. It takes 4 parameters: 2 motifs for comparison, startPosition, and a closeness integer (1, 2, etc.). For example, AGGT, TAC, 24, 3 means "check for the correlation of having AGGT at position 24 and TAC at position 27". It returns a boolean.
- And.java: This allows obtaining conjunctive features. Its two parameters are features. It returns a boolean.
- Or.java: This allows obtaining disjunctive features. Its two parameters are features. It returns a boolean.
- Not.java: This allows obtaining negational features. Its only argument is a feature. It returns a boolean.
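The boolean features listed above can be illustrated with a minimal sketch. This is not the EFFECT code: the class name FeatureSketch and its method names are hypothetical, sequences are plain Java strings rather than BioJava objects, and positions are assumed to be 0-based. Conjunction, disjunction, and negation are expressed with the standard java.util.function.Predicate combinators.

```java
import java.util.function.Predicate;

/** Hypothetical stand-ins for the GP non-terminals; not the EFFECT classes themselves. */
final class FeatureSketch {
    private FeatureSketch() {}

    /** Compositional: the motif occurs anywhere in the sequence. */
    static Predicate<String> compositional(String motif) {
        return seq -> seq.contains(motif);
    }

    /** Positional: the motif starts exactly at the given index (assumed 0-based). */
    static Predicate<String> positional(String motif, int pos) {
        // startsWith returns false for out-of-range offsets instead of throwing.
        return seq -> seq.startsWith(motif, pos);
    }

    /** Shift positional: the motif starts within +/- tolerance of the given index. */
    static Predicate<String> shiftPositional(String motif, int pos, int tolerance) {
        return seq -> {
            for (int p = pos - tolerance; p <= pos + tolerance; p++) {
                if (seq.startsWith(motif, p)) {
                    return true;
                }
            }
            return false;
        };
    }

    /** Correlational: motif1 at start and motif2 at start + closeness. */
    static Predicate<String> correlational(String motif1, String motif2,
                                           int start, int closeness) {
        return seq -> seq.startsWith(motif1, start)
                && seq.startsWith(motif2, start + closeness);
    }

    public static void main(String[] args) {
        String seq = "TTAGGTACGG";
        // (OR (AND compositional shiftPositional) (NOT positional))
        Predicate<String> feature = compositional("ACG")
                .and(shiftPositional("AGGT", 3, 1))
                .or(positional("CCC", 0).negate());
        System.out.println(feature.test(seq));
    }
}
```

Because every feature returns a boolean, higher-order features compose freely, which is what lets a GP tree nest And, Or, and Not nodes over the motif-level functions.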
Statistical Methods:
As described in the paper, we used Jstacs (http://www.jstacs.de/index.php/Main_Page) to employ different statistical methods for sequence classification. A list of these methods is in the paper. The code is in the package org.java.statistics. Since each problem has different tuning parameters and different specifications (e.g., length), we have implemented 4 classes for the 4 different problems, along with utilities to run each method.
HSS
HypersensitiveStatisticalMethodsTest.java: Implements all the techniques for comparisons mentioned above for the HSS dataset.
NN269
NN269StatisticalMethodsTest.java: Implements all the techniques for comparisons mentioned above for the NN269 (splice site) dataset.
CELEGANS
C_ElegansStatisticalMethodsTest.java: Implements all the techniques for comparisons mentioned above for the Worm (splice site) dataset.
ALU
AluStatisticalMethodsTest.java: Implements all the techniques for comparisons mentioned above for the ALU dataset.
Feature-based Methods:
Two such methods are implemented, Kmer-based and Gibbs Sampling. We highlight the base elements of the code, found in the org.java.featurebased package.
1. org.java.featurebased.FeatureBasedMotif.java: This is the basic interface for generating features and files to train/test.
2. org.java.featurebased.AbstractFeatureBased.java: This is the base class inherited by both the Kmer-based and Gibbs Sampling methods. The responsibility of the class is to implement parsing of the sequence for boolean matching of features.
Gibbs Sampling Feature Generator
GibbSamplingMotifFeatureGenerator.java: This implements Gibbs Sampling for feature generation. It delegates to SimpleGibbsAligner.java, giving it the DNA sequences (positive dataset). The iterative search for motifs that are overrepresented by alignment is carried out in SimpleGibbsAligner until the criterion defined by StoppingCriteria.java is met.
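For readers unfamiliar with the technique, the following is a simplified Gibbs site sampler. It is a sketch and does not reproduce SimpleGibbsAligner: the class GibbsSketch and its methods are hypothetical, a fixed number of sweeps stands in for StoppingCriteria, no background model is used, and every sequence is assumed to contain only uppercase A, C, G, T and to be at least as long as the motif width.

```java
import java.util.Arrays;
import java.util.Random;

/** Simplified Gibbs site sampler; a hypothetical sketch, not SimpleGibbsAligner. */
final class GibbsSketch {
    private static final String BASES = "ACGT";

    private GibbsSketch() {}

    /** Returns one sampled motif start per sequence after a fixed number of sweeps. */
    static int[] sample(String[] seqs, int width, int sweeps, long seed) {
        Random rng = new Random(seed);
        int[] starts = new int[seqs.length];
        for (int i = 0; i < seqs.length; i++) {
            starts[i] = rng.nextInt(seqs[i].length() - width + 1);
        }
        for (int sweep = 0; sweep < sweeps; sweep++) {
            for (int held = 0; held < seqs.length; held++) {
                // Build the profile from all other sequences, then resample this one.
                double[][] profile = profileWithout(seqs, starts, width, held);
                starts[held] = drawStart(seqs[held], profile, width, rng);
            }
        }
        return starts;
    }

    /** Per-column base frequencies (pseudocount 1) from all sequences except 'held'. */
    static double[][] profileWithout(String[] seqs, int[] starts, int width, int held) {
        double[][] profile = new double[width][4];
        for (double[] column : profile) {
            Arrays.fill(column, 1.0);
        }
        for (int i = 0; i < seqs.length; i++) {
            if (i == held) {
                continue;
            }
            for (int j = 0; j < width; j++) {
                profile[j][BASES.indexOf(seqs[i].charAt(starts[i] + j))] += 1.0;
            }
        }
        double total = (seqs.length - 1) + 4.0; // observed sites plus four pseudocounts
        for (double[] column : profile) {
            for (int b = 0; b < 4; b++) {
                column[b] /= total;
            }
        }
        return profile;
    }

    /** Samples a start index with probability proportional to its profile score. */
    static int drawStart(String seq, double[][] profile, int width, Random rng) {
        int candidates = seq.length() - width + 1;
        double[] weights = new double[candidates];
        double sum = 0.0;
        for (int s = 0; s < candidates; s++) {
            double w = 1.0;
            for (int j = 0; j < width; j++) {
                w *= profile[j][BASES.indexOf(seq.charAt(s + j))];
            }
            weights[s] = w;
            sum += w;
        }
        double r = rng.nextDouble() * sum;
        for (int s = 0; s < candidates; s++) {
            r -= weights[s];
            if (r < 0) {
                return s;
            }
        }
        return candidates - 1; // guards against floating-point rounding
    }
}
```

The substrings at the returned start positions are the aligned motif occurrences; in a feature generator these would then be turned into motif features for boolean matching.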
KMER based Feature Generator
KMerMotifFeatureGenerator.java: This extends AbstractFeatureBased. It implements generateFeatures() to generate Kmers of a specified length. For now, this is specific to the DNA alphabet {A,C,G,T} but can be extended to any alphabet.
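Kmer generation over an alphabet can be sketched as below. This is an illustration rather than the body of generateFeatures(): the class KmerSketch and the methods generateKmers and featureVector are hypothetical names, and the alphabet is passed as a parameter to show how the approach extends beyond DNA.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of Kmer feature generation; not KMerMotifFeatureGenerator. */
final class KmerSketch {
    static final char[] DNA = {'A', 'C', 'G', 'T'};

    private KmerSketch() {}

    /** Enumerates all alphabet^k strings of length k in lexicographic alphabet order. */
    static List<String> generateKmers(char[] alphabet, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be >= 1");
        }
        List<String> kmers = new ArrayList<>();
        extend(alphabet, new StringBuilder(), k, kmers);
        return kmers;
    }

    private static void extend(char[] alphabet, StringBuilder prefix, int k, List<String> out) {
        if (prefix.length() == k) {
            out.add(prefix.toString());
            return;
        }
        for (char symbol : alphabet) {
            prefix.append(symbol);
            extend(alphabet, prefix, k, out);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }

    /** One boolean per Kmer: true if the Kmer occurs anywhere in the sequence. */
    static boolean[] featureVector(String seq, List<String> kmers) {
        boolean[] vector = new boolean[kmers.size()];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = seq.contains(kmers.get(i));
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> kmers = generateKmers(DNA, 2);
        boolean[] vector = featureVector("ACGT", kmers);
        for (int i = 0; i < vector.length; i++) {
            if (vector[i]) {
                System.out.println(kmers.get(i));
            }
        }
    }
}
```

The number of Kmers grows as 4^k for DNA, so the feature vector length rises quickly with the chosen length.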
Kernel Based Methods:
This is the SVM-light and Shogun 2.1-based kernel classifier that was run on the different datasets. It is provided in classifier_svmlight_cross_wdshift.py.