Source Code

The entire EFFECT code for the Evolutionary, Feature-based (Spectrum and Gibbs Sampling), Kernel-based, and other Statistical Methods is available here. We highlight some important classes below as guidance for developers and users.

Evolutionary Code

The evolutionary code is in two packages: org.java.evolutionary.sequence.* contains the utilities and the fitness function for evolution; org.java.evolutionary.sequence.func.* contains the low-level GP functions that carry out the evolutionary machinery. The code is plain Java and uses BioJava for matching and parsing. Below we summarize some important Java files and map them to the paper for further clarity.

org.java.evolutionary.sequence.*

  1. sequence.params: This is the configuration file to run ECJ for the GP-based Feature Generator. Note that it defines a sample set of nodes. All biological features are mapped to the GP-based representation in an extensible and flexible manner. Definitions of building blocks and higher-order features, such as motif, correlational, compositional, positional, regional, conjunctive, and disjunctive, are found at this level. A correlational feature, for example, takes 4 explicitly defined parameters: 2 motifs, a start position, and a closeness or shift (1, 2, 3, etc.).
  2. org.java.evolutionary.sequence.HypersensitiveSiteRecognitionProblem.java: This is the Fitness function in the GP/Evolutionary world. This class measures the fitness of an individual by running it across the positive and negative sequences, measuring the True Positive Ratio and Discriminating Factor (Information Gain between positive and negative sequences).
  3. StatisticPluginForHallofFame.java and HallOfFame.java: These classes provide the plugin for the hall of fame (the hall of fame is described in the paper). The top features it contains at the end of a run can then be used for interpretation. The file is specified as a parameter in sequence.params.
  4. HypersensitiveSequenceFeatureInterpreter.java: This code reads the features stored in a file (the hall of fame output) and generates files in LibSVM format. It also outputs the features separately, to the file SSCleanFeatures.txt, for those who would like to analyze them further. Parallel feature matching is controlled by the Threads input argument. The sequences are split into chunks: the first threads-1 threads each receive an equal share, and the last thread receives all remaining sequences when the total does not divide evenly. For the best throughput, the number of threads should equal the number of cores/processors on the machine. This class additionally performs some simplifications: statements such as (AND true true) or (OR (NOT false)) are reduced.
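
The chunking scheme described for HypersensitiveSequenceFeatureInterpreter.java can be sketched roughly as follows. This is an illustrative sketch, not the EFFECT implementation: the class ChunkingSketch and the methods chunkRanges and countMatches are hypothetical names, and it assumes each of the first threads-1 workers receives total/threads sequences while the last worker takes whatever remains.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; names and details are assumptions, not EFFECT code.
class ChunkingSketch {

    // Split [0, total) into `threads` contiguous ranges {from, to}.
    // The first threads-1 ranges hold total/threads items each;
    // the last range takes everything that is left.
    static int[][] chunkRanges(int total, int threads) {
        int size = total / threads;
        int[][] ranges = new int[threads][2];
        for (int t = 0; t < threads; t++) {
            ranges[t][0] = t * size;
            ranges[t][1] = (t == threads - 1) ? total : (t + 1) * size;
        }
        return ranges;
    }

    // Count the sequences that contain `motif`, using one task per chunk.
    static int countMatches(List<String> seqs, String motif, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> parts = new ArrayList<>();
            for (int[] r : chunkRanges(seqs.size(), threads)) {
                final int from = r[0];
                final int to = r[1];
                parts.add(pool.submit(() -> {
                    int hits = 0;
                    for (int i = from; i < to; i++) {
                        if (seqs.get(i).contains(motif)) hits++;
                    }
                    return hits;
                }));
            }
            int total = 0;
            for (Future<Integer> f : parts) total += f.get();
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With 10 sequences and 3 threads, chunkRanges yields the index ranges [0,3), [3,6), and [6,10), so the last worker handles the extra sequence.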

org.java.evolutionary.sequence.func.*

  1. Compositional.java: This function or non-terminal is used for "Compositional" matching, i.e., position-independent testing for compositions like AGGG anywhere in the sequence. It takes just one parameter, motif. It returns a boolean.
  2. PositionBased.java: This function or non-terminal is used for generating "Positional" features. It takes two parameters, motif and position. If position is any uniformly generated integer, it will find positional features. It returns a boolean.
  3. RegionDownstreamMatch.java and RegionUpstreamMatch.java: These implement "Regional" features, as described in the paper. Each takes two parameters, motif and region. Each returns a boolean.
  4. ShiftPositional.java: Implements the "Shift Positional" feature, i.e., matching a pattern at a position with some tolerance for error in the position. It takes three parameters: motif, position, and the tolerance (1, 2, 3) in positions left/right. It returns a boolean.
  5. Correlational.java: This can be used for generating explicit correlational features. It takes 4 parameters: 2 motifs for comparison, startPosition, and a closeness integer (1, 2, etc.). For example, AGGT, TAC, 24, 3 means "check for the correlation of having AGGT at position 24 and TAC at position 27". It returns a boolean.
  6. And.java: This allows obtaining conjunctive features. Its two parameters are features. It returns a boolean.
  7. Or.java: This allows obtaining disjunctive features. Its two parameters are features. It returns a boolean.
  8. Not.java: This allows obtaining negational features. Its only argument is a feature. It returns a boolean.
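
As an illustration of the semantics listed above, the matching functions can be approximated with plain string operations. This is a hedged sketch only: the class FeatureFunctionsSketch and its method names are hypothetical, it does not use ECJ or BioJava as the actual classes do, it assumes 0-based positions, and it omits the regional functions because their region encoding is not specified here.

```java
// Hypothetical sketch of the feature semantics; not the EFFECT GP node classes.
class FeatureFunctionsSketch {

    // Compositional: motif occurs anywhere in the sequence.
    static boolean compositional(String seq, String motif) {
        return seq.contains(motif);
    }

    // Positional: motif occurs exactly at `pos` (0-based, an assumption).
    // String.startsWith returns false for out-of-range offsets.
    static boolean positional(String seq, String motif, int pos) {
        return seq.startsWith(motif, pos);
    }

    // Shift positional: motif occurs within `tol` positions left or right of `pos`.
    static boolean shiftPositional(String seq, String motif, int pos, int tol) {
        for (int p = pos - tol; p <= pos + tol; p++) {
            if (positional(seq, motif, p)) return true;
        }
        return false;
    }

    // Correlational: m1 at `start` and m2 at `start + closeness`.
    static boolean correlational(String seq, String m1, String m2,
                                 int start, int closeness) {
        return positional(seq, m1, start)
            && positional(seq, m2, start + closeness);
    }

    public static void main(String[] args) {
        String seq = "TTAGGTACGG";
        // Conjunction, disjunction, and negation map to &&, ||, and !.
        boolean feature = compositional(seq, "AGGT")
            && !positional(seq, "AGGT", 0);
        System.out.println(feature);
        System.out.println(correlational(seq, "AGGT", "TAC", 2, 3));
    }
}
```

In the sequence TTAGGTACGG, AGGT starts at index 2 and TAC at index 5, so the correlational check with start 2 and closeness 3 succeeds, mirroring the AGGT, TAC, 24, 3 example with smaller numbers.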

Statistical Methods:

As described in the paper, we used Jstacs (http://www.jstacs.de/index.php/Main_Page) to apply different statistical methods for sequence classification. A list of these methods is in the paper. The code is in the package org.java.statistics. Since each problem has different tuning parameters and different specifications (e.g., length), we have implemented 4 classes for the 4 different problems, along with utilities to run each method.

HSS

HypersensitiveStatisticalMethodsTest.java: Implements all the techniques for comparisons mentioned above for the HSS dataset.

NN269

NN269StatisticalMethodsTest.java: Implements all the techniques for comparisons mentioned above for the NN269 (splice site) dataset.

CELEGANS

C_ElegansStatisticalMethodsTest.java: Implements all the techniques for comparisons mentioned above for the Worm (splice site) dataset.

ALU

AluStatisticalMethodsTest.java: Implements all the techniques for comparisons mentioned above for the ALU dataset.

Feature-based Methods:

Two such methods are implemented, Kmer-based and Gibbs Sampling. We highlight the base elements of the code, found in the org.java.featurebased package.

1. org.java.featurebased.FeatureBasedMotif.java:
This is the basic interface for generating features and files to train/test.

2. org.java.featurebased.AbstractFeatureBased.java:
This is the base class inherited by both the Kmer-based and Gibbs Sampling methods. The responsibility of the class is to parse the sequence for boolean matching of features.

Gibbs Sampling Feature Generator

  GibbSamplingMotifFeatureGenerator.java: This implements Gibbs Sampling for feature generation. It delegates to SimpleGibbsAligner.java, giving it the DNA sequences (the positive dataset). SimpleGibbsAligner iteratively searches for motifs that are overrepresented by alignment until the criteria defined by StoppingCriteria.java are met.
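
For readers unfamiliar with the technique, the following is a minimal textbook-style Gibbs site sampler, not the code in SimpleGibbsAligner.java. All names (GibbsSketch, profile, sample) are hypothetical. The sketch assumes sequences over {A,C,G,T} only, one motif occurrence per sequence, +1 pseudocounts, no background model, and a fixed iteration count standing in for whatever StoppingCriteria.java actually checks.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of a generic Gibbs site sampler; not EFFECT code.
class GibbsSketch {
    static final String ALPHABET = "ACGT";

    // Column-wise base frequencies (with +1 pseudocounts) of the current
    // motif placements, leaving out the sequence at index `skip`.
    static double[][] profile(String[] seqs, int[] pos, int w, int skip) {
        double[][] pwm = new double[w][4];
        for (double[] col : pwm) Arrays.fill(col, 1.0);
        for (int s = 0; s < seqs.length; s++) {
            if (s == skip) continue;
            for (int j = 0; j < w; j++) {
                pwm[j][ALPHABET.indexOf(seqs[s].charAt(pos[s] + j))] += 1.0;
            }
        }
        double total = (seqs.length - 1) + 4.0;
        for (double[] col : pwm) {
            for (int b = 0; b < 4; b++) col[b] /= total;
        }
        return pwm;
    }

    // Returns one motif start position per sequence after `iterations` sweeps.
    static int[] sample(String[] seqs, int w, int iterations, long seed) {
        Random rnd = new Random(seed);
        int[] pos = new int[seqs.length];
        for (int s = 0; s < seqs.length; s++) {
            pos[s] = rnd.nextInt(seqs[s].length() - w + 1);
        }
        for (int it = 0; it < iterations; it++) {
            for (int s = 0; s < seqs.length; s++) {
                double[][] pwm = profile(seqs, pos, w, s);
                int n = seqs[s].length() - w + 1;
                double[] weight = new double[n];
                double sum = 0.0;
                // Score every window of the held-out sequence under the profile.
                for (int p = 0; p < n; p++) {
                    double prob = 1.0;
                    for (int j = 0; j < w; j++) {
                        prob *= pwm[j][ALPHABET.indexOf(seqs[s].charAt(p + j))];
                    }
                    weight[p] = prob;
                    sum += prob;
                }
                // Draw a new start position proportionally to its weight.
                double r = rnd.nextDouble() * sum;
                int chosen = n - 1;
                for (int p = 0; p < n; p++) {
                    r -= weight[p];
                    if (r <= 0) { chosen = p; break; }
                }
                pos[s] = chosen;
            }
        }
        return pos;
    }
}
```

Because the sampler is stochastic, the returned positions depend on the seed and the number of sweeps; the windows they select would then serve as candidate motif features.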

KMER based Feature Generator

  KMerMotifFeatureGenerator.java: This extends AbstractFeatureBased. It implements generateFeatures() to generate Kmers of a specified length. For now, this is specific to the DNA alphabet {A,C,G,T} but can be extended to any alphabet.
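
Enumerating all Kmers of a given length over an alphabet can be sketched as below. This is an assumption-laden illustration rather than the body of generateFeatures(): the class KmerSketch and the method generateKmers, including its signature, are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the KMerMotifFeatureGenerator implementation.
class KmerSketch {

    // Build every string of length k over `alphabet` by extending
    // all shorter prefixes one symbol at a time.
    static List<String> generateKmers(int k, char[] alphabet) {
        List<String> kmers = new ArrayList<>();
        kmers.add("");
        for (int i = 0; i < k; i++) {
            List<String> next = new ArrayList<>(kmers.size() * alphabet.length);
            for (String prefix : kmers) {
                for (char c : alphabet) {
                    next.add(prefix + c);
                }
            }
            kmers = next;
        }
        return kmers;
    }

    public static void main(String[] args) {
        // 4^3 = 64 DNA 3-mers, from AAA to TTT.
        List<String> kmers = generateKmers(3, "ACGT".toCharArray());
        System.out.println(kmers.size());
    }
}
```

Passing the alphabet as a parameter keeps the sketch usable beyond DNA; for example, a protein alphabet would only require a different character array.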

Kernel Based Methods:

The kernel classifier is based on SVM-light and shogun 2.1 and was run on the different datasets. It is provided in classifier_svmlight_cross_wdshift.py.