Paper Summary
Title: A Two-Stage Evolutionary Approach for Effective Classification of
Hypersensitive DNA Sequences
Authors: Uday Kamath, Amarda Shehu, and Kenneth A. De Jong
Abstract:
Hypersensitive (HS) sites in genomic sequences are reliable markers of
DNA regulatory regions that control gene expression. Annotation of
regulatory regions is important in understanding phenotypical
differences among cells and diseases linked to pathologies in protein
expression. Several computational techniques are devoted to mapping
out regulatory regions in DNA by initially identifying HS
sequences. Statistical learning techniques like Support Vector
Machines (SVM), for instance, are employed to classify DNA sequences
as HS or non-HS. This paper proposes a method to automate
the basic steps in designing an SVM that improves the accuracy of such
classification. The method proceeds in two stages and makes use of
evolutionary algorithms. An evolutionary algorithm first designs
optimal sequence motifs to associate explicit discriminating feature
vectors with input DNA sequences. A second evolutionary algorithm then
designs SVM kernel functions and parameters that optimally separate
the HS and non-HS classes. Results show that this two-stage method
significantly improves SVM classification accuracy. The method
promises to be generally useful in automating the analysis of
biological sequences.
Software Details
The evolutionary algorithm described in the paper can be applied more
generally to find optimal kernels for a given problem. The
implementation of this algorithm is made available under an open
source license. The description below consists of two sections. The
first gives basic information on how to evolve kernels and links to
the code. The second provides development details.
1. Basics
- Prerequisites:
  - Java 1.5 or later.
  - ECJ version 19. (Later versions also work but may need small API
    tweaks, since some interfaces, such as Problem, have changed.) If
    you are not familiar with ECJ, we recommend the tutorials on the
    ECJ website.
  - ECJ version 19 needs one modification: remove "final" from the
    setup method in GPNodeConstraints.java. We extend this class in
    our version, so we need a non-final method. This is fixed in
    later ECJ versions.
  - Input data in LibSVM format.
- Software:
  - GP code: The GPKernels code contains the source of the various
    functions and terminals we employed for kernels. See the detailed
    Javadoc.
  - LibSVM code: The LibSVM code contains our source modifications so
    that kernel evaluations employ GP trees.
- Installation:
  - Copy the GP code and make it depend on ECJ (as a project
    dependency in Eclipse) or on the ECJ jar in a stand-alone setup.
  - Copy the LibSVM code and make the GP code depend on it (in
    Eclipse) or on the libsvm jar.
- Running:
  - java -Xmx1536m ec.Evolve -file kernel.params -p stat.gather-full=true
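
The prerequisites call for input data in LibSVM format, in which each line holds a label followed by sparse index:value pairs. As a reminder of that format, here is a minimal Java sketch of a parser for one such line. The class LibSvmLine and its parse method are hypothetical names used only for illustration; they are not part of GPKernels or LibSVM, and the sketch does no validation of malformed input.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of GPKernels or LibSVM.
class LibSvmLine {
    final double label;
    final Map<Integer, Double> features; // feature index -> value, in file order

    LibSvmLine(double label, Map<Integer, Double> features) {
        this.label = label;
        this.features = features;
    }

    // Parses one line of the form "<label> <index>:<value> <index>:<value> ...".
    static LibSvmLine parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        double label = Double.parseDouble(tokens[0]);
        Map<Integer, Double> features = new LinkedHashMap<Integer, Double>();
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":");
            features.put(Integer.parseInt(pair[0]), Double.parseDouble(pair[1]));
        }
        return new LibSvmLine(label, features);
    }
}
```

For example, LibSvmLine.parse("+1 1:0.5 3:-2") yields a row with label 1.0 and two features; indices that do not appear are treated as zero by convention.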
2. Developer Details
This section gives more detail for developers who want to adapt our
code to other problems or who need more information.
1. SVMEvaluator (GPProblem implementation for fitness):
This class creates a ThreadEvaluator, passes it all the information it
needs (the kernel individuals, the static kernel parameters, and the
dynamic kernel parameters), and waits in a loop for up to 15 minutes
(this limit can be changed) for the kernel evaluation. If the
evaluation returns in time, the accuracy is obtained as the Koza
fitness. Otherwise, the kernel is judged to be bad and is assigned a
small fitness value.
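
The wait-with-timeout behaviour described above can be sketched with the standard java.util.concurrent classes. This is not the actual SVMEvaluator or ThreadEvaluator code: TimedFitness, evaluate, and PENALTY_FITNESS are hypothetical names, the penalty value of 0.0 is an assumption, and the sketch blocks on a Future instead of polling in a loop.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative only: these names are hypothetical and do not come from
// GPKernels, ECJ, or LibSVM.
class TimedFitness {
    // Assumed "small" fitness given to kernels that do not finish in time.
    static final double PENALTY_FITNESS = 0.0;

    // Runs the evaluation on a worker thread and waits at most timeoutMillis.
    // Returns the reported accuracy, or PENALTY_FITNESS on timeout or failure.
    static double evaluate(Callable<Double> kernelEvaluation, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<Double> result = worker.submit(kernelEvaluation);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the slow evaluation
            return PENALTY_FITNESS;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return PENALTY_FITNESS;
        } catch (Exception e) {
            return PENALTY_FITNESS; // the evaluation itself failed
        } finally {
            worker.shutdownNow(); // release the worker thread in every case
        }
    }
}
```

With a 15-minute limit, the call would look like TimedFitness.evaluate(task, TimeUnit.MINUTES.toMillis(15)), where task wraps the SVM training and accuracy computation for one kernel.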
2. LibSVM Changes:
- svm_parameter.java: This class is extended with a new parameter
  that holds the reference to the ThreadEvaluator.
- Kernel.java:
  - The constructor is extended to store the reference to the
    ThreadEvaluator.
  - evalKernel is extended to call the ThreadEvaluator to get all
    elements of the GP individual and evaluate the GP tree. The GP
    tree is executed with the operator and operand values properly
    passed in, and so the kernel function gets evaluated.
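
To illustrate what evaluating a kernel through a GP tree means, the following sketch builds a tiny expression tree whose terminals are basic kernels and whose function nodes combine them. It is a stand-alone toy under stated assumptions: KernelTreeSketch and its nested types are hypothetical, it uses dense double[] vectors, and it does not use the ECJ GPNode API or the modified LibSVM Kernel class.

```java
// Hypothetical miniature of a GP kernel tree; none of these types come from
// GPKernels, ECJ, or LibSVM.
final class KernelTreeSketch {
    interface Node {
        double eval(double[] x, double[] y);
    }

    // Terminal: linear kernel <x, y>.
    static final class DotProduct implements Node {
        public double eval(double[] x, double[] y) {
            double sum = 0.0;
            for (int i = 0; i < x.length; i++) {
                sum += x[i] * y[i];
            }
            return sum;
        }
    }

    // Terminal: RBF kernel exp(-gamma * ||x - y||^2); gamma stands in for an ERC.
    static final class Rbf implements Node {
        private final double gamma;

        Rbf(double gamma) {
            this.gamma = gamma;
        }

        public double eval(double[] x, double[] y) {
            double squaredDistance = 0.0;
            for (int i = 0; i < x.length; i++) {
                double d = x[i] - y[i];
                squaredDistance += d * d;
            }
            return Math.exp(-gamma * squaredDistance);
        }
    }

    // Function node: sum of two sub-kernels.
    static final class Add implements Node {
        private final Node left;
        private final Node right;

        Add(Node left, Node right) {
            this.left = left;
            this.right = right;
        }

        public double eval(double[] x, double[] y) {
            return left.eval(x, y) + right.eval(x, y);
        }
    }

    // Function node: product of two sub-kernels.
    static final class Multiply implements Node {
        private final Node left;
        private final Node right;

        Multiply(Node left, Node right) {
            this.left = left;
            this.right = right;
        }

        public double eval(double[] x, double[] y) {
            return left.eval(x, y) * right.eval(x, y);
        }
    }
}
```

A tree such as new KernelTreeSketch.Add(new KernelTreeSketch.DotProduct(), new KernelTreeSketch.Rbf(0.5)) is then evaluated once per pair of input vectors, which is the role the GP tree plays inside the kernel evaluation. Sums and products of valid kernels are themselves valid kernels, which is why these two operators are natural function nodes for such a sketch.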
3. Ephemeral Random Constants (ERCs)
The paper explains that the ERCs maintained in our GP trees keep track
of the SVM cost parameter C and various kernel parameters. These
constants have different roles and ranges of values. Mutation of ERCs
employs the following techniques:
- Real ERCs (for C, gamma, sigma, etc.)
  - Simple Random Scaling Method: This method uses two simple
    general-purpose ERCs, HighGammaERC and LowGammaERC. HighGammaERC
    scales a random number drawn uniformly from [0,1] to a higher
    range, say 2^5, and LowGammaERC scales a random number drawn
    uniformly from [0,1] to a lower range, say 2^-5.
  - Range Based Random Method: This method uses
    RangeBasedConstraints. Users can define ranges such as [-5,+15]
    and power bases such as 2 or 10. If the power base is 2, the
    method generates a random ERC in the range 2^-5 to 2^15.
  - Range Based with Incremental Exploitative Method: In this method,
    users can set up an incremental range-based search. For example,
    the range can be [-5,15], the increment 1, and the power base 2.
    Users can also specify the number of times N (e.g., 50) that the
    ERC can stay in the same range. The method keeps count of how
    many iterations the ERC has stayed in the range. It also performs
    an incremental linear grid expansion. If, for instance, the
    chosen ERC is 5.003, it lies in the [2,3] range (in powers of 2,
    i.e., between 2^2 and 2^3). The method finds another real number
    in the [1,4] range, expanding the grid in both directions by 1.
    The count is incremented by 1 to make sure the "getting stuck"
    effect is removed. At the same time, a bit of exploitation is
    done around this range.
  - Combination: Since all the real ERCs have the same return type,
    there is no restriction on combining the methods above to let
    the algorithm evolve ERCs.
- Integer ERCs (for order, etc.)
  - Simple Random Scaling Method: This method draws a random integer
    in a given range.
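
The range-based methods for real ERCs can be sketched as follows. This is an illustration, not the GPKernels implementation: ErcSampling and its method names are hypothetical, drawing the exponent uniformly is an assumption about how the random value is generated, and the stay counter N is left out because its exact use is not spelled out here.

```java
import java.util.Random;

// Hypothetical helper; names are illustrative and not taken from GPKernels.
class ErcSampling {
    // Range-based random method: draw an exponent from [lo, hi) and return
    // base^exponent, e.g. a value between 2^-5 and 2^15 for lo=-5, hi=15, base=2.
    // Assumption: the exponent, not the value itself, is uniformly distributed.
    static double rangeBased(Random rng, double lo, double hi, double base) {
        double exponent = lo + rng.nextDouble() * (hi - lo);
        return Math.pow(base, exponent);
    }

    // Incremental grid expansion: find the unit exponent cell that contains
    // value, widen it by increment on each side, and clamp to [lo, hi].
    static double[] expandedCell(double value, double base, double increment,
                                 double lo, double hi) {
        double exponent = Math.log(value) / Math.log(base);
        double cellLo = Math.floor(exponent);
        double cellHi = cellLo + 1.0;
        return new double[] {
            Math.max(lo, cellLo - increment),
            Math.min(hi, cellHi + increment)
        };
    }

    // One exploitative mutation step: resample inside the expanded cell.
    static double mutate(Random rng, double value, double base, double increment,
                         double lo, double hi) {
        double[] cell = expandedCell(value, base, increment, lo, hi);
        return rangeBased(rng, cell[0], cell[1], base);
    }
}
```

For the value 5.003 with base 2 and increment 1, expandedCell locates the exponent cell [2,3] and returns [1,4], so mutate produces a new value between 2^1 and 2^4, matching the worked example in the list above.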
Citation
Cite this paper
U. Kamath, A. Shehu, and K. De Jong, “A two-stage evolutionary approach
for effective classification of hypersensitive DNA sequences,” J.
Bioinf. & Comp. Biol., 2011.
Copyrights and Trademarks
1. LibSVM: Copyright 2000-2010 Chih-Chung Chang and Chih-Jen Lin. All
rights reserved. The copyright notice is in the LibSVM source code.
2. ECJ: ECJ is licensed under the Academic Free License,
version 3.0, included in the package.
3. Java: Java is a registered trademark of Oracle.