ECJ

## Tutorial 4: Build a Multivalued Genetic Programming Symbolic Regression Problem

In the fourth tutorial, we will use Koza-style GP to evolve solutions to a simple multivalued symbolic regression problem. This will walk through many of the complexities of evolving s-expression trees. We will also learn how to print out trees in a variety of formats, including ones readable by LaTeX. We'll also add a dash of elitism to the traditional Koza breeding mix.

### Create an app subdirectory

Go into the ec/app directory and create a directory called tutorial4.

### How ECJ's GP System Works

[This sections assumes you are familiar with Koza-style genetic programming]

ECJ implements genetic programming individuals as forests of trees. Each tree is implemented as a tree of GPNodes. A GPIndividual is a subclass of Individual which holds an array of GPTrees. In most GP problems, the GPIndividual uses KozaFitness, a Fitness subclass which implements Koza's raw and standardized fitness and hits measure.

A GPTree contains a pointer to the root GPNode of its tree. GPNodes contain an array of pointers to their children in the tree. At present, all GPNodes contain this array, even the terminals, for whom the array is of zero length. For convenience, GPNodes also maintain a pointer to their parent in the tree (always a GPNode, except the root's parent, which is the owning GPTree) and an integer indicating which number child they are of their parent (their argposition).

The species of GPIndividuals is GPSpecies. Just as Individuals have a many-to-one relationship with Species in order to provide a global storage area, GPTrees and GPNodes also have many-to-one relationships with GPTreeConstraints and GPNodeConstraints objects. At present the primary function of these "constraints" objects is to specify the arity of the respective GPNodes, and to specify the GPTypes associated with the GPNodes and GPTrees.

ECJ supports atomic and set-based strong typing in order to place constraints on where various GPNodes may appear in the tree. Atomic strong typing assigns an atomic symbol to each return value and argument slot of a GPNode, and to the return value of the GPTree. GPNodes are permitted to be children of other GPNode if the return value type symbol of the child GPNode matches the relevant argument slot type symbol of the parent. Similarly, a GPNode is permitted to be the root if its return value type symbol matches the return value type symbol of the GPTree. Atomic types are defined with a subclass of GPType called GPAtomicType.

Set-based strong typing extends the notion of matching. Types are no longer just atomic symbols can also be sets of atomic symbols. A set matches with another set if their intersection is nonempty. A set matches with an atomic symbol if the atomic symbol is found in the set. Set types are defined with a subclass of GPType called GPSetType. Set-based strong typing is sufficient for a wide variety of needs, including type inheritance (subclassing) and certain forms of general types. Set-based typing is not full polymorphic typing: for example, it assumes that the number of types is finite, so ECJ's typing facility is not appropriate for matrix multiplication operators, etc.

While ECJ supports strong typing, in fact most GP applications only need a single type (they are "typeless"). For convenience, ECJ's default parameter files for GP define a single type called (for want of a better name) nil, and some default GPNodeConstraints of various arities which only use the type nil

Each GPTree subscribes to a GPFunctionSet which defines the GPNodes legally permitted in the tree. GPFunctionSets store, in a variety of redundant tables, the relevant GPNodes. When an algorithm needs to generate a tree, it does so by finding the appropriate GPFunctionSet choosing GPNodes from the set, and cloning them to produce nodes to hang in the tree.

GP builds trees using tree generation algorithms, all of which are subclasses of GPNodeBuilder. Tree generation algorithms are usually used in one of two places: population initialization and mutation. Mutation algorithms have their own parameters where you can specify the GPNodeBuilder subclass you'd like. Additionally, each GPTree has a reference (not shown on the graph) to a GPNodeBuilder resonsible for initialization of that tree.

ECJ's GP package has its own special GPInitializer subclass of Initializer. The reason for this is that GP needs some location to set up the GPFunctionSets, GPTreeConstraints, and GPNodeConstraints before evolution begins. Initializer is the most obvious location. The ec.gp.koza package adds the Koza-specific features to the general notion of tree-based genetic programming.

### Define the GPData Object

Whew!

What we will do is use GPTrees to evolve s-expression parse trees representing two-variable equations. Our function set will contain five GPNodes representing the functions add (+), sub (-), mul (*), X, and Y. The objective is to evolve an equation most closing a set of <X,Y,f(X,Y)> data.

We begin by defining a GPData object. This object will be passed from GPNodes to their children, and usually more importantly, from the children back to the parents (storing the child node's return value after the child is evaluated). It provides a general mechanism for nodes to transfer information to one another.

In our case, we're interested only in the return values of GPNodes, and those values are doubles. So we'll make a GPData object which consists solely of a double. Create a file called DoubleData.java. In this file add:

package ec.app.tutorial4;
import ec.util.*;
import ec.*;
import ec.gp.*;

public class DoubleData extends GPData
{
public double x;    // return value

public void copyTo(final GPData gpd)   // copy my stuff to another DoubleData
{ ((DoubleData)gpd).x = x; }
}


### Define the Function Set

Next, let's start writing the functions.

package ec.app.tutorial4;
import ec.*;
import ec.gp.*;
import ec.util.*;

{
public String toString() { return "+"; }

public int expectedChildren() { return 2; }
}


Add needs to specify its printed form ("+"), and a function (expectedChildren(...) ) which give Add a chance to do a sanity check when first loaded, to verify that the child array size is correct. Instead of expectedChildren(...), you could have instead written:

    // public int expectedChildren() { return 2; }

// You could do this instead.  It'd give you a chance to do more verifications
// and checks.  The default implementation of checkConstraints just calls
// expectedChildren(), which is why you can just implement expectedChildren
// instead in most simple cases.
public void checkConstraints(final EvolutionState state,
final int tree,
final GPIndividual typicalIndividual,
final Parameter individualBase)
{
super.checkConstraints(state,tree,typicalIndividual,individualBase);
if (children.length!=2)
state.output.error("Incorrect number of children for node " +
toStringForError() + " at " +
individualBase);
}

ECJ will call the checkConstraints() method, which gives you a chance to do more sophisticated checks. Above we're doing more or less what the default version of checkConstraints() does anyway, which is just call expectedChildren() and verify that the children.length variable is the same as it. Thus if you don't need to do anything more than a sanity check on the number of child nodes, just implement expectedChildren() and don't bother with checkConstraints().

Next we will define the central function to the GPNode: the eval method, which is called when the GPNode must do its part during the execution of the tree as s-expression computer code. eval is called by the GPNode's parent to execute the GPNode and get its return value.

The eval method is passed a bunch of stuff, including the EvolutionState, the current thread number (so you can get a random number via state.random[thread], an ADFStack (don't worry about this -- just pass it on to your children), the GPIndividual which is executing, the Problem which is executing the GPIndividual, and the GPData used to pass information to and from the GPNode's parent.

Here we go:

    public void eval(final EvolutionState state,
final GPData input,
final GPIndividual individual,
final Problem problem)
{
double result;
DoubleData rd = ((DoubleData)(input));

result = rd.x;

rd.x = result + rd.x;
}
}


What we're doing here is simple enough: we evaluate our left child, then we evaluate our right child, then we add the two together and return that result. Note that we reuse the GPData object rather than allocating new ones. You are free to allocate a new GPData object to pass to your children, but you should use the GPData object provided you when you return a value to your parent. But I suggest you reuse the GPData object to keep from allocating millions of objects during evaluation.

Sub and Mul. These functions are set up nearly identically. Create a file called Sub.java and add to it the following:

package ec.app.tutorial4;
import ec.*;
import ec.gp.*;
import ec.util.*;

public class Sub extends GPNode
{
public String toString() { return "-"; }

public int expectedChildren() { return 2; }

public void eval(final EvolutionState state,
final GPData input,
final GPIndividual individual,
final Problem problem)
{
double result;
DoubleData rd = ((DoubleData)(input));

result = rd.x;

rd.x = result - rd.x;
}
}


Similarly, create a file called Mul.java, and add to it:

package ec.app.tutorial4;
import ec.*;
import ec.gp.*;
import ec.util.*;

public class Mul extends GPNode
{
public String toString() { return "*"; }

public int expectedChildren() { return 2; }

public void eval(final EvolutionState state,
final GPData input,
final GPIndividual individual,
final Problem problem)
{
double result;
DoubleData rd = ((DoubleData)(input));

result = rd.x;

rd.x = result * rd.x;
}
}


X and Y. So far we have defined nonterminals (nonleaf nodes in the tree), all of which have arity 2. Now we'll finish by defining two terminals (leaf nodes in the tree) which of course have arity 0 (no children). Before a tree is evaluated, the X and Y values will be set, then as the tree is evaluated, the X and Y terminals will simply return these values when they are evaluated. We begin by creating a file called X.java and adding to it:

package ec.app.tutorial4;
import ec.*;
import ec.gp.*;
import ec.util.*;

public class X extends GPNode
{
public String toString() { return "x"; }

public int expectedChildren() { return 0; }



Note that there are no children. Now what we'll do is define in our Problem two public doubles called currentX and currentY. These are the values that these nodes return when evaluated. Hence we add:

    public void eval(final EvolutionState state,
final GPData input,
final GPIndividual individual,
final Problem problem)
{
DoubleData rd = ((DoubleData)(input));
rd.x = ((MultiValuedRegression)problem).currentX;
}
}


The Y class is created similarly. Make a file called "Y.java", and add to it:

package ec.app.tutorial4;
import ec.*;
import ec.gp.*;
import ec.util.*;

public class Y extends GPNode
{
public String toString() { return "y"; }

public int expectedChildren() { return 0; }

public void eval(final EvolutionState state,
final GPData input,
final GPIndividual individual,
final Problem problem)
{
DoubleData rd = ((DoubleData)(input));
rd.x = ((MultiValuedRegression)problem).currentY;
}
}


### Define the Problem

We don't need to define a GPIndividual or a GPTree -- the default forms for them are usually fine. So all that's left for us is to define the Problem. To keep things simple for this tutorial, we will do certain things. First, we will hard-code to 10 the number of <X,Y,f(X,Y)> presentations given to the GPindividual. Second, we rather than pre-generating the <X,Y,f(X,Y)> data elements, we will just generate a random one each time we need to provide a new presentation. They will be generated at random from the function x*x*y + x*y + y. Third, we will ignore floating point rounding errors which might make achieving a perfect 0.0 fitness difficult. Fourth, our randomly-generated elements will range from 0.0 to 1.0. Fifth, we will assume that nailing an answer within 0.01 counts as one hit.

Create a file called "MultiValuedRegression.java". In it add the following:

package ec.app.tutorial4;
import ec.util.*;
import ec.*;
import ec.gp.*;
import ec.gp.koza.*;
import ec.simple.*;

public class MultiValuedRegression extends GPProblem implements SimpleProblemForm
{
public static final String P_DATA = "data";

public double currentX;
public double currentY;

public void setup(final EvolutionState state,
final Parameter base)
{
// very important, remember this
super.setup(state,base);

// verify our input is the right class (or subclasses from it)
if (!(input instanceof DoubleData))
state.output.fatal("GPData class must subclass from " + DoubleData.class,
base.push(P_DATA), null);
}


GPProblem defines an instance variable called input, which contains a GPData instance loaded from our parameters. Here, we get a chance to verify that this instance is of a class we can use. In our case, we want it to be a DoubleData class or some subclass of that.

Last we need to define the evaluate method, which actually evaluates the individual and sets its fitness. We will do so by testing the individual against ten data points, then setting its fitness to the sum of how close it got in each case. Thus 0 is the ideal fitness -- in GP, 0 is the ideal and infinity is worse than the worst possible fitness. The hit measure (an auxillary measure) is simply how often the system got "reasonably close".

    public void evaluate(final EvolutionState state,
final Individual ind,
final int subpopulation,
{
if (!ind.evaluated)  // don't bother reevaluating
{
DoubleData input = (DoubleData)(this.input);

int hits = 0;
double sum = 0.0;
double expectedResult;
double result;
for (int y=0;y<10;y++)
{
expectedResult = currentX*currentX*currentY + currentX*currentY + currentY;
((GPIndividual)ind).trees[0].child.eval(

result = Math.abs(expectedResult - input.x);
if (result <= 0.01) hits++;
sum += result;
}

// the fitness better be KozaFitness!
KozaFitness f = ((KozaFitness)ind.fitness);
f.setStandardizedFitness(state, sum);
f.hits = hits;
ind.evaluated = true;
}
}
}


### Set up the Parameter File

Strongly-typed, multi-tree GP has a lot of parameters. Fortunately ECJ has written nearly all of them for you already, in the ec/gp/koza/koza.params file. Take a minute and peruse through it, and also through the ec/gp/gp.params file which is its parent. You'll see lots of clues there about what kinds of things you can do to bend ECJ's GP system to your will. ECJ differs from Koza-I in very slight modifications of RAMPED-HALF-AND-HALF (ECJ uses lil-gp's approach), the use of 7-tournament selection, and a default population size of 1024.

What we'll need to define in our own parameter file: the function set, the Problem, and the GPData object. Create a file called tutorial4.params, and in it add:

parent.0 = ../../gp/koza/koza.params

# the next four items are already defined in koza.params, but we
# put them here to be clear.

# We have one function set, of class GPFunctionSet
gp.fs.size = 1
gp.fs.0 = ec.gp.GPFunctionSet
# We'll call the function set "f0".
gp.fs.0.name = f0

# We have five functions in the function set.  They are:
gp.fs.0.size = 5
gp.fs.0.func.0 = ec.app.tutorial4.X
gp.fs.0.func.0.nc = nc0
gp.fs.0.func.1 = ec.app.tutorial4.Y
gp.fs.0.func.1.nc = nc0
gp.fs.0.func.2.nc = nc2
gp.fs.0.func.3 = ec.app.tutorial4.Sub
gp.fs.0.func.3.nc = nc2
gp.fs.0.func.4 = ec.app.tutorial4.Mul
gp.fs.0.func.4.nc = nc2


The "nc0" and "nc2" are names for GPNodeConstraints objects defined in the koza.params file. nc0 means a GPNodeConstraints object defining nodes with no children and nil as the return type. nc2 means a GPNodeConstraints object defining nodes with two children (each with nil), plus nil as the return type. Continuing, we add to the file information indicating our Problem and GPData choices:

eval.problem = ec.app.tutorial4.MultiValuedRegression
eval.problem.data = ec.app.tutorial4.DoubleData


### Run the Problem

Save the files and close them. Compile the java files. The run the problem with java ec.Evolve -file tutorial4.params

On most Java VM's ECJ should discover the solution in about five generations (it's an easy problem). The out.stat file looks something like this:

Generation: 0
Best Individual:
Subpopulation 0:
Evaluated: true
Tree 0:
(- (+ (- y x) (+ x x)) (* (* y y) (- x x)))

Generation: 1
Best Individual:
Subpopulation 0:
Evaluated: true
Tree 0:
(- (+ y x) (* x x))

Generation: 2
Best Individual:
Subpopulation 0:
Evaluated: true
Tree 0:
(+ (* (* y x) (* x y)) (- (+ (* y x) x) (-
x y)))

Generation: 3
Best Individual:
Subpopulation 0:
Evaluated: true
Tree 0:
(+ (* (* y x) (* x y)) (- (+ (* y x) x) (-
x y)))

Generation: 4
Best Individual:
Subpopulation 0:
Evaluated: true
Tree 0:
(- (+ (* x y) y) (* (* y y) (- x x)))

Generation: 5
Best Individual:
Subpopulation 0:
Evaluated: true
Tree 0:
(+ (* x (+ y (* x y))) y)

Best Individual of Run:
Subpopulation 0:
Evaluated: true
Tree 0:
(+ (* x (+ y (* x y))) y)


Another option is to print out tabular statistics. We'll change the statistics from the default KozaStatistics to the line-by-line data format of KozaShortStatistics. Run the system as java ec.Evolve -file tutorial4.params -p stat=ec.gp.koza.KozaShortStatistics and we might get something like:

0 0.11815579608850547 0.5106663312374864 0.5106663312374864
1 0.16516424252767573 0.7553260554636446 0.7553260554636446
2 0.2032414840372659 0.6840854153019551 0.7553260554636446
3 0.23092182973239558 1.0 1.0

The four columns show the generation number, the mean fitness for the population each generation, the best fitness for the population each generation, and the best fitness so far in the run. You can extend this with additional statistical columns (describing time and space usage) by saying: java ec.Evolve -file tutorial4.params -p stat=ec.gp.koza.KozaShortStatistics -p stat.do-size=true -p stat.do-time-true As a result, you might get something like this:
0 67 22 [ 23.740234375 ] 23 23 9.0 9.0 0.12096272693400785 0.6076038106112858 0.6076038106112858
1 14 2 [ 16.232421875 ] 16 19 7.0 7.0 0.17651978477818045 0.6208479277810929 0.6208479277810929
2 7 1 [ 13.9375 ] 13 17 13.0 13.0 0.2093994867069308 0.623960607294017 0.623960607294017
3 4 1 [ 13.740234375 ] 13 16 11.0 11.0 0.23955326657131645 0.9999999999999998 0.9999999999999998
4 5 1 [ 14.25390625 ] 14 16 11.0 11.0 0.2636658620971829 0.9999999999999998 0.9999999999999998
5 4 1 [ 13.8359375 ] 13 15 11.0 11.0 0.28518720248196483 1.0 1.0
6 3 2 [ 14.99609375 ] 14 15 9.0 9.0 0.3058538096982688 1.0 1.0


Consult the class documentation for KozaStatistics, or more usefully, the ECJ Owner's Manual, to understand these column.

GPNodes are the dominant part of an ECJ GP process. If you run this process under memory profiling, you'll find that in Java 7 it takes up about 1500K of space. That's not a lot but at about 14 GPNodes per individual in generation 3, or about 14300 GPNodes per population, or about 28600 GPNodes in memory at a time during breeding (the prior and current generations), this comes to about 50 bytes per node. This ignores the non-GPNode memory usage of course: in my experience ECJ takes up about 30 bytes per GPNode. Compare this to lil-gp, which uses 4 bytes per node, or 17 bytes for an Ephemeral Random Constant. This is a packed array approach, and compared it, ECJ is a memory hog. This is partially due to Java, and partially due to ECJ using a tree structure rather than a packed array structure. The upside is that ECJ's trees are far easier to modify and customize than packed arrays of simple numbers.

Since GPNodes absolutely dominate the memory usage, we only bother looking at them. Here's where the memory goes:

• 8 bytes in the GPNode object header (on HotSpot anyway)
• 4 bytes to point to the parent
• 4 bytes to point to the child array
• 12 bytes in the array's header (on HotSpot)
• 4 bytes for the pointer from the parent's array to the object
• 1 byte for the argposition
• 1 byte for the constraints

If you've been counting, that totals to 34 bytes. HotSpot is clever when dealing with zero-sized arrays, which I believe accounts for the slight lowering in usage. PLUS: to evolve a population, ECJ must keep around the old population to breed into it. That doubles your expected memory.

### Print out the trees for LaTeX

Here's a nifty trick. GPTrees can be printed out in LaTeX form rather than in s-expression form. Run the system with the following command: java ec.Evolve -file tutorial4.params -p gp.tree.print-style=latex and the trees will be outputted in a form that can be stuck into a LaTeX file and printed.

The best individual of the generation now looks like this:

Best Individual of Run:
Evaluated: true
Tree 0:
\begin{bundle}{\gpbox{+}}\chunk{\gpbox{y}}\chunk{\begin{bundle}{\gpbox{+}}\chunk{\begin{bundle}{\gpbox{*}}
\chunk{\gpbox{x}}\chunk{\gpbox{y}}\end{bundle}}\chunk{\begin{bundle}{\gpbox{*}}\chunk{\gpbox{y}}
\chunk{\begin{bundle}{\gpbox{*}}\chunk{\gpbox{x}}\chunk{\gpbox{x}}\end{bundle}}\end{bundle}}\end{bundle}}
\end{bundle}

The chunk in blue can be snipped out and inserted into a LaTeX file with a few extra items as shown:
\documentclass[]{article}
\usepackage{epic}	% required by ecltree and fancybox packages
\usepackage{ecltree}	% to draw the GP trees
\usepackage{fancybox}	% required by \Ovalbox

\begin{document}

% minimum distance between nodes on the same line
\setlength{\GapWidth}{1em}

% draw with a thick dashed line, very nice looking
\thicklines \drawwith{\dottedline{2}}

% draw an oval and center it with the rule.  You may want to fool with the
% rule values, though these seem to work quite well for me.  If you make the
% rule smaller than the text height, then the GP nodes may not line up with
% each other horizontally quite right, so watch out.
\newcommand{\gpbox}[1]{\Ovalbox{#1\rule[-.7ex]{0ex}{2.7ex}}}

\begin{bundle}{\gpbox{+}}\chunk{\gpbox{y}}\chunk{\begin{bundle}{\gpbox{+}}\chunk{\begin{bundle}{\gpbox{*}}
\chunk{\gpbox{x}}\chunk{\gpbox{y}}\end{bundle}}\chunk{\begin{bundle}{\gpbox{*}}\chunk{\gpbox{y}}
\chunk{\begin{bundle}{\gpbox{*}}\chunk{\gpbox{x}}\chunk{\gpbox{x}}\end{bundle}}\end{bundle}}\end{bundle}}
\end{bundle}
\end{document}


Run latex on this file, and you'll get out a picture like the one shown at right.

### Print out the trees in C-ish format

ECJ can print out trees in a somewhat C-style format, where nodes with exactly two children are printed as if the node was an operator (+, *, etc.), and nodes with one child or three or greater children would be printed out in a function style a-la func(arg, arg, ...). For example,

(exp (- (sin (+ 4 (cos x))) (* x x)) x)

(sin(4 + cos(x)) - (x * x)) exp x

We prefer a Lisp style, but this might be easier for you to read. To do this, you can run the command like this: java ec.Evolve -file tutorial4.params -p gp.tree.print-style=c

### Print out the trees in Graphviz

Graphviz is a graph-layout language and renderer developed by AT&T. ECJ can generate trees that Graphviz renders quite nicely to PDF and other formats for inclusion in documents as well. To do this, you can run the command like this: java ec.Evolve -file tutorial4.params -p gp.tree.print-style=dot

### Try some Elitism

Now we'll make the model elitist, which in this case will be keeping the two best individuals in the subpopulation 0 of the population (you can specify different elitism values for each subpopulation). SimpleBreeder has an elitist feature. Run the system as java ec.Evolve -file tutorial4.params -p breed.elite.0=2

The symbolic regression example here is fairly simple. More sophisticated (but one-variable) versions can be found in the ec.app.regression package. At this point understanding them should be pretty easy.