JBCBKernelSupplementPage

Paper Summary

Title : A Two-Stage Evolutionary Approach for Effective Classiffication of Hypersensitive DNA Sequences

Authors: Uday Kamath, Amarda Shehu, and Kenneth A De Jong

Abstract: Hypersensitive (HS) sites in genomic sequences are reliable markers of DNA regulatory regions that control gene expression. Annotation of regulatory regions is important in understanding phenotypical differences among cells and diseases linked to pathologies in protein expression. Several computational techniques are devoted to mapping out regulatory regions in DNA by initially identifying HS sequences. Statistical learning techniques like Support Vector Machines (SVM), for instance, are employed to classify DNA sequences sequences as HS or non-HS. This paper proposes a method to automate the basic steps in designing an SVM that improves the accuracy of such classification. The method proceeds in two stages and makes use of evolutionary algorithms. An evolutionary algorithm first designs optimal sequence motifs to associate explicit discriminating feature vectors with input DNA sequences. A second evolutionary algorithm then designs SVM kernel functions and parameters that optimally separate the HS and non-HS classes. Results show that this two-stage method significantly improves SVM classification accuracy. The method promises to be generally useful in automating the analysis of biological sequences.

Software Details

The evolutionary algorithm described in the paper can generally be applied to finding optimal kernels for a given problem at hand. The implementation of this algorithm is made available under the Open Source License. The description below consists of two sections. The first provides basic information on how to evolve kernels and provides a link to the code. The second provides development details.

Basics

Prerequisites:

Need Java 1.5 and above
Need ECJ version 19. (It will work with future versions too, but may need small API tweaks. Some interfaces, like Problem, have changed). If not familiar with ECJ, we recommend the tutorials on the ECJ website.
ECJ version 19 needs modification

GPNodeConstraints.java needs to have "final" removed from the method setup. This is fixed in later versions. Since we extend this class in our version, we need a non-final method.

LibSVM formatted data as input.

Software:

GP Code: GPKernels code has source code of various functions we employed for kernels and terminals. See this detailed Javadoc .
LibSVM Code: LibSVM code has source code modifications so that kernel evaluations employ GP Trees.

Installation

Copy the GP code and make it dependent on ECJ (in Eclipse) or the ECJ jar in normal stand-alone.
Copy the LibSVM code. Make the GP Code dependent on this (in Eclipse) or the jar of libsvm.

Running

java -Xmx1536m ec.Evolve -file kernel.params -p stat.gather-full = true

2. Developer Details

This is a more detailed description for developers who want to either tune our code for other problems or need more information.

1. SVMEvaluator (GPProblem implementation for Fitness):

This class does many things like creating a ThreadEvaluator, passing all the information needed like Kernel Individuals, Kernel static parameters, and Kernel dynamic parameters to it, and waiting in a loop for 15 minutes (can be changed) for kernel evaluation. If kernel returns, the accuracy is obtained as Koza fitness. Otherwise, the kernel is judged to be bad and is associated a small fitness value.

2.LibSVM Changes:

svm_parameter.java: This class is extended to have a new parameter that is the the reference to the ThreadEvaluator.
Kernel.java:

Constructor is extended to store reference to ThreadEvaluator.
evalKernel is extended to call ThreadEvaluator to get all elements of GP and eval GP Tree. The GP Tree gets executed with the operator and operand values properly passed, and so kernel function gets evaluated.

3. Ephemeral Random Constants (ERCs)

The paper explains that the ERCs maintaned in our GP trees keep track of the SVM cost parameter C and various kernel parameters. These constants have different roles and range of values. Mutation of ERCs employs the following techniques:

Real ERCs (for C, gamma, sigma, etc.)

Simple Random Scaling Method: In this method, we have two simple ERC used by everyone. HighGammaERC and LowGammaERC. HighGammaERC scales a random number generated uniformly [0,1] to higher range, say 2^5 and LowGammaERC scales a random number generated uniformly [0,1] to lower range, say 2^-5.
Range Based Random Method: In this method,we have RangeBasedConstraints. We can define user ranges like [-5,+15] and power bases 2, 10, etc. If the power is 2, the method generates a random ERC in the range 2^-5...2^15.
Range Based with Incremental Exploitative Method: In this method, users can set up an incremental range-based search. For example, [-5,15] can be the range, increment can be 1, powerbase can be 2. Users can also specify the number of times N the ERC can stay in the range, where N is user-defined (e.g, 50). The method keeps count of how many iterations the ERC has stayed in the range. It also does incremental linear grid expansion. If, for instance, the chosen ERC is 5.003, it is in the [2,3] range (in powers of 2, i.e. 2^2-2^3 range). The method finds another real number in the [1,4] range, essentially expanding the grid in both directions by 1. The count is incremented by 1 to make sure the "getting stuck" effect is removed. At the same time, a bit of exploitation is done around this range.
Combination: Since all the real ERCs have the same return type, there is no restriction on combining the above to let the algorithm evolve ERCs.

Integer ERCs (for order, etc)

Simple Random Scaling Method: In this method, we just get a random integer in a given range.

Citation

Cite this paper
U. Kamath, A. Shehu, and K. De Jong, “A two-stage evolutionary approach for effective classification of hypersensitive DNA sequences,” J. Bioinf. & Comp. Biol., 2011.

Copy Rights and Trade marks

1. LIbSVM : Copy Rights of LIBSVM 2000-2010 Chih-Chung Chang and Chih-Jen Lin
All rights reserved are in LibSVM source code.

2. ECJ: ECJ is licensed under the Academic Free License, version 3.0, included in the package.

3.Java: Java is registered trademark of Oracle.