ECJ

An Evolutionary Computation and Genetic Programming System

Extension to support Gene Expression Programming (GEP)

 

Bob Orchard (bob.orchard@nrc-cnrc.gc.ca)

National Research Council of Canada

Institute for Information Technology

 

What is Gene Expression Programming?

 

How Do I Use GEP in the ECJ System? Some Simple Examples.

 

Some Other Examples



Classification Examples

 

Time Series Examples

 

Logical Problem Examples

 

A Detailed Look at the GEP Parameter File (gep.param)

 

Some of the Implementation Details

 

 

What is Gene Expression Programming?

 

There are a number of topics that fall under the general category of Evolutionary Computing. These include Genetic Algorithms (GA) and Genetic Programs (GP). Gene Expression Programming (GEP) is a ‘new’ evolutionary algorithm (developed by Cândida Ferreira) that evolves computer programs in a way similar to GP. But the representation of the programs is more efficient and leads to less complex implementation of the genetic operators. The individuals of gene expression programming are encoded in linear chromosomes which are expressed or translated into expression trees when evaluation is required. Thus, in GEP, the genotype (the linear chromosomes) and the phenotype (the expression trees) are different entities (both structurally and functionally).

 

As in nature, the linear chromosomes consist of the genetic material that is passed on with modification to the next generation. Therefore, in GEP, all the genetic modifications take place in the chromosomes, and only the chromosomes are transmitted in the process of reproduction. After reproduction the new chromosomes are expressed forming the body or expression trees (ETs).

 

The ETs are computer programs evolved to solve a particular problem and are selected according to their capabilities in solving the problem at hand. With time, populations of such computer programs discover new traits and become better adapted to a particular selection environment (for instance, a set of experimental results) and, hopefully, a good solution evolves.

 

This is variation of Genetic Programming and not a ‘new’ type of evolutionary computing. However, due to the genotype/phenotype representation and to the multigenic organization of GEP chromosomes, this new algorithm generally leads to better performance. It may also generate better solutions in many cases.

 

For all the details of this new algorithm, see the seminal GEP paper by Ferreira (published in Complex Systems) where the algorithm is fully described and applied to a set of problems, including symbolic regression, boolean concept learning, and cellular automata.

 

Now much of the above is taken directly from the Wikipedia definition of GEP and as it says, the details can be found in Ferreira’s paper and in much greater detail in her excellent book, Gene Expression programming: mathematical modeling by an artificial intelligence. 2nd Edition. Springer-Verlag: 2006. ISBN 3-540-32796-7. There is also a commercial implementation of the algorithm by the company founded by Ferreira, Gepsoft (www.gepsoft.com). The product is called GeneXproTools. You might also look at the documentation provided with the fully functioning (but limited) download of GeneXproTools to gain much more insight and detail about gep, or just view the short tutorial (http://www.gepsoft.com/tutorial002.htm).

 

The intent of this ECJ implementation of the gep algorithm is to provide a free, and hopefully useful and accurate implementation of the software, with most of the major functionality. And, since the source code is provided, researchers and other interested parties can experiment with ways to enhance or improve some aspect of this approach to evolutionary computing.

 

How Do I Use GEP in the ECJ System? Some Simple Examples.

 

One doesn’t need to understand the detailed implementation of gep in ECJ to use the system but it is necessary to know about some things so that the appropriate parameters can be set and your data can be processed.

Please NOTE that the output from any examples in this documentation may be different than the output you see when you execute the problems. The code is changing and this can cause changes to the output even with the same input (especially if code changes affect the way or order in which random numbers are allocated).

 

You need to supply 3 things to run an experiment: a parameter file that defines how to do the evolution, a very simple Java class (actually you don't even need to do any programming at all ... more about that option later), and some data (the data can be in a file or encoded in your simple Java class). We’ll start by looking at a very simple problem, showing the simplest of parameter files and the minimal Java program.

 

The example is to determine a model (equation) that satisfies the following set of data (stored in the file test1.txt). The variables x and y are independent variables and z is the dependent variable, so  z = f(x,y).

 

x,     y,    z

0.5,  1.0,  1.25

1.0,  2.0,  3.0

2.0,  4.0,  8.0

3.0, 12.0, 21.0

 

You might quickly see that this is ‘perfect’ sample data for the function z = x*x + y. Data files that are currently supported are the so called CSV, comma separated value, text files. NOTE: gzip'd files can also be used directly. So if a file ends in .gz or .gzip we will handle the de-compression of the file. For example in this case we could have compressed test1.txt to create test1.txt.gz and identified the gzip'd file rather than the .txt file in the parameter file (see below).

Generally one will use either a comma, a space or a tab to be the field separator. The example above uses a comma. The first row of the file must identify the independent and dependent variable names with the last column being the dependent variable. You’ll see later that for time series problems there is another file format that is used with ‘raw’ time series data, but for most problems the training data, when in a file, has this format. So here we have 3 columns that identify the 2 independent variables x and y plus the dependent variable z. There are 4 rows of numeric data that we will use as 'training' data during the evolutionary process. 

The parameter file that tells the gep system how to behave as it generates models and proceeds from generation to generation is test1.param. It has the following content: 

 

# This parameter file supports a simple test example to solve x^2 + y.
#
# The data (including variable names) is in the file test1.txt.

# My parent:
parent.0 =                ../../../gep/gep.params

# Problem
#===============================
eval.problem = ec.app.gep.test1.test1

# run for 100 generations, quit prematurely if I find something ideal
generations =                    100
quit-on-run-complete =           true

# ec.Subpopulation
# ==============================

# subpop size is 50 individuals
pop.subpop.0.size =            50

# ec.Species
# ==============================

gep.species.numgenes = 2
gep.species.gene-headsize = 5
gep.species.gene-linking-function = +

# Problem type must be one of: functionfinding, classification, timeseries, logical
# Set default to be 'unknown' so user is forced to specify in the problem params file
gep.species.problemtype = functionfinding

# specify the symbols ... functions and terminals (variables)
#
# the terminal symbols (variable names) will be obtained from the data file (1st row)
gep.species.symbolset.terminalfilename = ec/app/gep/test1/test1.txt
# 4 functions + - * /  are used in this problem
gep.species.symbolset.functionsize = 4
gep.species.symbolset.function.0 = Add
gep.species.symbolset.function.1 = Sub
gep.species.symbolset.function.2 = Mul
gep.species.symbolset.function.3 = Div

# Statistics
# ==============================

# output statistics to the file "test1.stat" in the directory
# the run was started in
stat =                ec.gep.GEPSimpleStatistics
stat.file =           test1.stat

 

Let's look at the parameters in some detail. First we see that the file references a parent file (../../../gep/gep.params). This is the parameter file that holds all of the default settings for a gep problem. You can over-ride any of the settings in this file by including new values in your own param file (in this case, test1.param). But your param file must reference this file so that the settings that you are not concerned about are properly defined. If you looked at the gep.params file you'd see the full set of parameters that can be specified (more on this later). You'd also see that it references the basic ecj param file as well. But for now we're only concerned with the parameters that are set for this simple example. 

 

Next we see that the parameter eval.problem is set to ec.app.gep.test1.test1. This is the name of the class that contains the code we need to supply for our problem. We'll look at this in more detail below. 

 

The next 2 parameters identify the maximum number of generations (in this case 100) that we should run when trying to find a solution to the problem and whether we should quit when we find a 'suitable' solution. In this case we say that we should quit when the solution is found, rather than continuing until 100 generations have been run. The parameter pop.subpop.0.size tells how many instances of possible models (or chromosomes or individuals in gep terminology) we should try in each generation. In this example we use 50.  At this time a gep population consists of only 1 subpopulation so we will always use pop.subpop.0 to refer to the population.

 

The following set of parameters that start with gep.species identify various things that are very specific to gep problems. Each individual in the population is a chromosome that can be composed of 1 or more genes (there is an expectation that you've taken the time to read external documents that describe the gep approach and that you will be familiar with the concepts at this point; this documentation will refer to these concepts without great explanations). Each gene is identified by its 'head' size and if there are multiple genes one needs to specify the operator that will link the gene expressions together. Here we use chromosomes with 2 genes, each with a head size of 5 and linked by addition. (NOTE: unlike Ferreira’s implementation, any operator can be used as the linking function and not just +, -, * and / or ‘or’, ‘and’, ‘xor’, ‘nor’, and ‘nand’ for logical problems. Most commonly the linking operator is addition. If operators other than these standard ones are used then the type of operator (logical or non-logical) must be correct and the arity must agree with the number of genes in a chromosome).

There are 4 types of problems that can be handled by the gep approach (at this time). In this case we are doing a function finding problem. Genes can be composed of various symbols that represent the functions and terminals (independent variables) that can be used. There can also be constants but more on that later. We need to identify these symbols so that the system can build appropriate genes. The functions are identified by telling how many there are (functionsize) and what they are (function.0, function.1, etc.). In this problem we have decided to allow the genes to use the 4 functions: addition, subtraction, multiplication and addition. There are a very large number of supported functions. These can be found in the ec/gep/symbols directory. It is quite simple to add more functions as needed. Just follow the example of a similar type of function and create a new class. The terminal or independent variable symbols can also be defined in the parameter file but for this example we have a training data file that contains the terminal symbols in the 1st row of the file. So we specify that gep.species.symbolset.terminalfilename is ec/app/gep/test1/test1.txt. As shown above this will identify the 2 terminal symbols x and y (as well as the dependent variable symbol z). 

 

The final entries in the parameter file describe what we want in the way of output from the system. This is called the statistics output. In this example we use the class ec.gep.GEPSimpleStatistics to generate our output and specify that the output will go to the file stats1.txt. Users may want to build their own statistics classes to control the type of information that is generated.

Finally we need to provide a simple program that tells the system how to evaluate the fitness of chromosomes so decisions can be made as the individuals are discarded or moved to the next generation (possibly with modifications -- mutations, inversions, and various forms of recombination and transposition). The code for the class ec.gep.test.test1 is shown below.


import ec.*;
import ec.gep.*;
import ec.simple.*;

/**
 *  The problem is to find the equation x*x + y
 * 
 *  Data and variable names are in the data file test1.txt
 */
public class test1 extends GEPProblem implements SimpleProblemForm
{
  static double IDEAL_FITNESS_MINIMUM = 999.9999;
   
  public void evaluate(EvolutionState state, Individual ind, int threadnum)
  {
    if (!ind.evaluated)  // don't bother reevaluating
    {
      // Mean Squared Error (MSE) fitness is normalized between 0 and 1000 (1000 * (1/(1+MSE))
      double fitness = GEPFitnessFunction.MSEfitness(true, (GEPIndividual)ind);
           
      // the fitness better be SimpleFitness!
      SimpleFitness f = ((SimpleFitness)ind.fitness);
      f.setFitness(state,(float)fitness, fitness >= IDEAL_FITNESS_MINIMUM);
      ind.evaluated = true;
          
      if (fitness >= IDEAL_FITNESS_MINIMUM)
      {   
         ((GEPIndividual)ind).printIndividualForHumans(state, 1, 1);
      }   
    }
  }
}

This is the basic format that should be used in the user program. The outer test just says that if we've already evaluated the fitness of the individual (chromosome/model) then don't do anything. Otherwise use one of the fitness functions to evaluate the model (in this case we use Mean Squared Error), set the fitness indicating whether it is an ideal solution to the problem and then indicate that the model was evaluated. Printing the model is not required but was done in this case. It will be captured in the statistics file output. There are a number of possible fitness functions to choose from (see ec.gep.GEPFitnessFunction class) and you can easily create your own. 

 

Note: the 1st argument to a fitness function evaluation is a boolean that specifies whether training data (true) or testing data (false) should be used when calculating the fitness of the function. The training data will always be used when measuring the fitness of functions (models) created during the evolutionary process. The fitness of the model with testing data will normally be used to validate that ‘good’ models selected from the training (evolutionary) exercise are still good when subjected to other data. 

 

So now we have all of the required pieces to define our problem and to execute it we can use the command line or the ECJ console. To run from a command line execute something like:

D:\ecj>java -classpath ./;./start/javacsv.jar ec.Evolve -file ec/app/gep/test1/test1.params

You'll notice that there is a jar file called javacsv.jar that must be included in the classpath. This is used to read the CSV files and can be found at http://sourceforge.net/projects/javacsv/. Assuming all goes well you will see the following output (or very similar output) on the console.

| ECJ
| An evolutionary computation system (version 15)
| By Sean Luke
| Contributors: L. Panait, G. Balan, S. Paus, Z. Skolicki,
|               J. Bassett, R. Hubley, and A. Chircop
| URL: http://cs.gmu.edu/~eclab/projects/ecj/
| Mail: ecj-help@cs.gmu.edu
|       (better: join ECJ-INTEREST at URL above)
| Date: April 4, 2006
| Current Java: 1.5.0_10 / Java HotSpot(TM) Client VM-1.5.0_10-b03
| Required Minimum Java: 1.3

Threads:  breed/1 eval/1
Seed: 4357
Job: 0
Setting up
Initializing Generation 0
WARNING:
Weight for GEP Function must be > 0; defaulting to 1)
PARAMETER: pop.subpop.0.species.symbolset.function.0.weight
     ALSO: gep.species.symbolset.function.0.weight
WARNING:
Weight for GEP Function must be > 0; defaulting to 1)
PARAMETER: pop.subpop.0.species.symbolset.function.1.weight
     ALSO: gep.species.symbolset.function.1.weight
WARNING:
Weight for GEP Function must be > 0; defaulting to 1)
PARAMETER: pop.subpop.0.species.symbolset.function.2.weight
     ALSO: gep.species.symbolset.function.2.weight
WARNING:
Weight for GEP Function must be > 0; defaulting to 1)
PARAMETER: pop.subpop.0.species.symbolset.function.3.weight
     ALSO: gep.species.symbolset.function.3.weight
Generation 1

Evaluated: T
Fitness: 1000.0
Linking function: +
Gene 0
*.x.x.y.y.x.x.y.x.x.x
Gene 1
y./.x.+.+.x.x.x.x.x.x

(x*x) + y

Found Ideal Individual


The warnings about weight for the functions can be ignored. There is an option to identify weights for the functions so that the ones with higher weights are selected more often when building and evolving genes. The default of 1 was used by not specifying the weights. In the code we chose to display the best individual when it was found and thus the output showing the details for the chromosome with an ideal fitness of 1000. It identifies the linking function used (+), the contents of the 2 genes (in Karva format ... see the gep documentation in papers by Ferreira, etc.) and the expression of the chromosome in a mathematical formula:

    (x*x) + y

This is clearly the solution we were looking for. You will also find the statistics for the run in the file test1.stat. This should contain the following information:


Seed(s) used in this job: 4357
Problem Type: functionfinding

Generation: 0
Best Individual:
Evaluated: T
Fitness: 274.6781
Linking function: +
Gene 0
-.x.-.y.y.y.y.x.y.y.y
Gene 1
+.x.y.y.-.x.y.y.y.y.x

(x-(y-y)) + (x+y)

Generation: 1
Best Individual:
Evaluated: T
Fitness: 1000.0<small><code>
Linking function: +
Gene 0
*.x.x.y.y.x.x.y.x.x.x
Gene 1
y./.x.+.+.x.x.x.x.x.x

(x*x) + y

Best Individual of Run:
Found at Generation: 1
Evaluated: T
Fitness: 1000.0
Linking function: +
Gene 0
*.x.x.y.y.x.x.y.x.x.x
Gene 1
y./.x.+.+.x.x.x.x.x.x

(x*x) + y

Size of program: 4
Variables used(count): x(2), y(1)

***** TRAINING data results *****
Statistics:
MSE:  0.0        RAE:  0.0        RSE:  0.0
RMSE: 0.0        MAE:  0.0        RRSE: 0.0

Observed    Computed
1.25        1.25
3.0         3.0
8.0         8.0
21.0        21.0

The output shows the best individual found in each generation. There were only 2 generations needed to find a solution to this problem since it was a simple one. Notice that in the first generation (generation 0), the best solution found was (x-(y-y) + (x+y). It had a fitness value of 274.6781, well below the perfect value of 1000 that we get with the best solution of generation 1. The statistics also provide some other information. The size of the best solution found during the run was 4. This means that there were 4 symbols used in the expression (excluding the linking function). So we had 2 x terminal symbols, 1 y terminal symbol and the * function symbol in the solution. We also display some statistical values including mean squared error (MSE), root mean squared error (RMSE), etc. Finally we display the values expected or observed (from our training data) and the values produced by the best model in the run, in this case a perfect model.

 

Note that in more recent versions of ECJ-GEP you can request that the mathematical expressions found be simplified. In which case you would see in the results and statistics file something like:

 

MATH

(y+(x*x))

 

MATH (SIMPLIFIED)

x^2+y

 

But so note that in most cases it is best to do your experiments without simplification turned on since the software system used to do the simplification of expressions (meditor) has been know to cause crashes or non terminating loops for some expressions. Also the meditor.jar file must be in the ‘start’ directory. To turn on simplification the following line must be in the parameter file:

 

pop.subpop.0.species.ind.simplify-expressions = true

 

So, that's all there is to it. Let's look at 2 more simple problems, test2 and test3, to see some minor variations in the process. First the test2 example is solving a problem that should result in a model with the expression x^4 + x^3 + x^2 + x. It has a single independent variable, x. And in this example, instead of getting the terminal symbols and data from a file we define the terminal symbols in the param file, test2.param, and define the data in the program file, test2.java. The relevant part of the param file is:


gep.species.symbolset.terminalsize = 1
gep.species.symbolset.terminal.0 = x
gep.species.symbolset.functionsize = 4
gep.species.symbolset.function.0 = Add
gep.species.symbolset.function.0.weight = 1
gep.species.symbolset.function.1 = Sub
gep.species.symbolset.function.1.weight = 1
gep.species.symbolset.function.2 = Mul
gep.species.symbolset.function.2.weight = 1
gep.species.symbolset.function.3 = Div
gep.species.symbolset.function.3.weight = 1

Here we've identified that there is 1 terminal symbol, x. We've also explicitly specified that the weight for each function should be 1. Specifying the data in the program is a bit more complex but not much.


package ec.gep.test;
import ec.*;
import ec.gep.*;
import ec.simple.*;

public class test2 extends GEPProblem implements SimpleProblemForm
{
    public static final double IDEAL_FITNESS_MINIMUM = 0.9999999999;

    public double xvalues[] = { 9.500366, -6.130432, 3.252685, 7.88797, 9.090484,
            1.485199, -3.950531, 10.003326, -0.607453, 5.469299};   
    public double zvalues[] = { 9103.54918799079, 1213.47815867463, 160.18146840485,
            4432.23529247904, 7671.79392962917, 11.8327154232663, 193.570365560718,
            11124.3786278048, -0.326443121119579, 1093.78835980998};  // x^4 + x^3 + x^2 + x
   
   
    public double[] getDataValues( String label )
    {
        if (label.equals("x"))
            return (xvalues);
        else if (label.equals("dependentVariable")) // always called 'dependentVariable'
            return (zvalues);
        else
            return null;          
    }

    public void evaluate(EvolutionState state, Individual ind, int threadnum)
    {
        if (!ind.evaluated)  // don't bother reevaluating
        {
            // Mean Squared Error (MSE) fitness is normalized between 0 and 1000 (1000 * (1/(1+MSE))
            double fitness = GEPFitnessFunction.MSEfitness((GEPIndividual)ind);
           
            // the fitness better be SimpleFitness!
            SimpleFitness f = (true, (SimpleFitness)ind.fitness);
            f.setFitness(state,(float)fitness, fitness >= IDEAL_FITNESS_MINIMUM);
            ind.evaluated = true;
          
        if (fitness >= IDEAL_FITNESS_MINIMUM)
        {   
            ((GEPIndividual)ind).printIndividualForHumans(state, 1, 1);
        }   
        }
    }
}

The main difference here is that we have to provide a method getDataValues to supply the data values for the terminal symbols identified in the param file. In this case that is the x values. The ECJ system will call this function during initialization to get the values, passing the symbol "x" as an argument. The user program must provide a double[] result. In the previous example the data was in a file and the first row had the symbols for the independent variable and the dependent variable. In the parameter file we don't specify the name of the dependent variable so the system will ask for the dependent variable values using the label "dependentVariable". It is probably most common to use a file to hold the training data. However, if the training data is to be provided by performing a calculation (via a simulation or some other means for example), then using the user program to provide the data might make sense.  For test2 the statistics file will look like:


Seed(s) used in this job: 4357
Problem Type: functionfinding

Generation: 0
Best Individual:
Evaluated: T
Fitness: 0.004819467
Linking function: +
Gene 0
-.-.-.x.*.*.-.x.x.x.x.x.x.x.x.x.x
Gene 1
*.+.-.x./.x././.x.x.x.x.x.x.x.x.x
Gene 2
*.+.+.x.*.*.x.x.x.x.x.x.x.x.x.x.x

((x-(x*x))-((x*x)-(x-x))) + ((x+((x/x)/x))*(x-(x/x))) + ((x+(x*x))*((x*x)+x))

Generation: 1
Best Individual:
Evaluated: T
Fitness: 410.7618
Linking function: +
Gene 0
-.-.*.x.*.*.-.x.x.x.x.x.x.x.x.x.x
Gene 1
*.+.-.x./.x././.x.x.x.x.x.x.x.x.x
Gene 2
*.+.+.x.*.*./.x.x.x.x.x.x.x.x.x.x

((x-(x*x))-((x*x)*(x-x))) + ((x+((x/x)/x))*(x-(x/x))) + ((x+(x*x))*((x*x)+(x/x)))

Generation: 6
Best Individual:
Evaluated: T
Fitness: 500.0
Linking function: +
Gene 0
-.-.*.x.x.x.x.x.x.x.x.x.x.x.x.x.x
Gene 1
*.+./.x./.x././.x.x.x.x.x.x.x.x.x
Gene 2
*.+.+.x.*.*./.x.x.x.x.x.x.x.x.x.x

((x-x)-(x*x)) + ((x+((x/x)/x))*(x/(x/x))) + ((x+(x*x))*((x*x)+(x/x)))

Generation: 14
Best Individual:
Evaluated: T
Fitness: 1000.0
Linking function: +
Gene 0
-.-.*.x.x.x.x.x.x.x.x.x.x.x.x.x.x
Gene 1
*.x.x.x.x.x.x.+.x.x.x.x.x.x.x.x.x
Gene 2
*.+.+.x.*.*./.x.x.x.x.x.x.x.x.x.x

((x-x)-(x*x)) + (x*x) + ((x+(x*x))*((x*x)+(x/x)))

Best Individual of Run:
Found at Generation: 14
Evaluated: T
Fitness: 1000.0
Linking function: +
Gene 0
-.-.*.x.x.x.x.x.x.x.x.x.x.x.x.x.x
Gene 1
*.x.x.x.x.x.x.+.x.x.x.x.x.x.x.x.x
Gene 2
*.+.+.x.*.*./.x.x.x.x.x.x.x.x.x.x

((x-x)-(x*x)) + (x*x) + ((x+(x*x))*((x*x)+(x/x)))

Size of program: 23
Variables used(count): x(13)

***** TRAINING data results *****
Statistics:
MSE:  7.182716450022835E-24        RAE:  4.832470396077297E-16        RSE:  4.344516272977063E-31
RMSE: 2.680059038533076E-12        MAE:  1.7715995337397316E-12        RRSE: 6.591294465411982E-16

Observed            Computed
9103.54918799079    9103.549187990786
1213.47815867463    1213.4781586746294
160.18146840485     160.18146840484988
4432.23529247904    4432.235292479036
7671.79392962917    7671.793929629167
11.8327154232663    11.832715423266338
193.570365560718    193.57036556071805
11124.3786278048    11124.378627804794
-0.326443121119579  -0.3264431211195794
1093.78835980998    1093.7883598099795

In this case it took 14 generations to solve the problem with an ideal (fitness = 1000) solution. The solution could use some simplification so that we see the first term is -x*x, the second is x*x (which cancel each other) and the third is (x+x*x)(x+1) or x^4+x^3+x^2+x, the desired solution. The errors in the statistics are not exactly zero but are quite small. This is due to loss of precision as the complex (not simplified) expression is evaluated. It also demonstrates why we use double and not float for all calculations. One final thing to note is that we don't print the best result after each generation. We only printed the results when there was an improvement in the fitness of the best of a generation (at generations 1, 6 and 14). This can be controlled by a parameter for the statistics. You can print the results for every generation, only when the fitness improves (as we did here by default) or only the final generation of the run (see the gep.params files for details).

 

Again, if simplification of math expressions was turned on you would see something like the following in the output:

 

MATH

((x-x)-(x*x)) + (x*x) + ((x+(x*x))*((x*x)+(x/x)))

 

MATH (SIMPLIFIED)

x+x^2+x^3+x^4



Test3 is just test1 again, but this time we do not have to supply a Java program at all, we'll use the default one. The test3.param file has the following changes:


# Problem
#===============================
eval.problem = ec.gep.GEPDefaultUserProg
eval.problem.fitness-function = RH
eval.problem.fitness-function-arg0 = 2.0

Instead of specifying some user program like ec.gep.test.test3 we use ec.gep.GEPDefaultUserProg. But because we do this we need to identify which fitness function to use. There are more than 25 fitness functions provided and most require a single argument, the individual being evaluated. Some require one or 2 extra numeric arguments. In the case of test3 we've used the Relative Hits (RH) fitness function. It requires one numeric argument, the precision. It is not necessarily the best choice for this problem but does illustrate the use of the parameters to specify the fitness function and an argument. The RH function counts the number of hits (matches) between the expected results in the training set and the computed results. A comparison is considered to be a hit if the percentage difference between the expected and computed values is less than the precision specified, in this case 2%. So the test is:


       
 100*(abs(expected-computed)/expected) < precision

The maximum number of hits possible is the number of sets of training data. The GEPDefaultUserProg uses the RH function to determine fitness. Check out the GEPFitnessFunction class to see the selection of functions available and to add your own if so inclined. So for test3 we only had to create a parameter file and a testing data file. The results of the run should look something like:


Seed(s) used in this job: 4357
Problem Type: functionfinding

Generation: 0
Best Individual:
Evaluated: T
Fitness: 3.0
Linking function: +
Gene 0
*.-.-.+.y.y.x.x.y.x.x
Gene 1
-.y.-.y.y.y.y.y.x.y.x

(((x+y)-y)*(y-x)) + (y-(y-y))

Generation: 1
Best Individual:
Evaluated: T
Fitness: 4.0
Linking function: +
Gene 0
*.x.x.y.y.x.x.y.x.x.x
Gene 1
y./.x.+.+.x.x.x.x.x.x

(x*x) + y

Best Individual of Run:
Found at Generation: 1
Evaluated: T
Fitness: 4.0
Linking function: +
Gene 0
*.x.x.y.y.x.x.y.x.x.x
Gene 1
y./.x.+.+.x.x.x.x.x.x

(x*x) + y

Size of program: 4
Variables used(count): x(2), y(1)

***** TRAINING data results *****
Statistics:
MSE:  0.0        RAE:  0.0        RSE:  0.0
RMSE: 0.0        MAE:  0.0        RRSE: 0.0

Observed    Computed
1.25        1.25
3.0         3.0
8.0         8.0
21.0        21.0

As you can see the final result is the same, we get the x*x+y equation that the data supports. The maximum fitness value in this case was 4, since there were only 4 training sets of data.

This concludes the introduction to using the ECJ GEP extension. The next section identifies a number of other examples that have been implemented, showing more complex examples from the four types of problems: function finding, logical, classification and time series. We'll explain some of the particular parameters and considerations for these types of problems as we briefly describe the examples.

 

Some Other Examples

Function Finding Examples


These examples are similar to the test1, test2 and test3 examples we explored in the previous section. We'll just describe them very briefly and point out any interesting new things that are introduced.

1. Acot Example - this is an attempt to discover the ArcCotangent function using a set of 100 training cases. There is only 1 gene, 30 individuals in the population and the RRSE (Root Relative Squared Error) fitness function is used. There is one independent variable, d0, specified in the Acot.param file. There are 17 functions used (+, -, *, /, Sqrt, Exp, Floor, Ceiling, Abs, Neg, Sin, Cos, Tan, Asin, Acos, Atan, and Ln). The add, subtract, multiply and divide functions have a weight of 3 and the others a weight of 1. These are just variations of things we've seen before. The new thing is that in this example we allow the genes to encode not only functions and terminals but also constants. This is described in Ferreira's book and the GeneXproSoft software. But in order to use constants you do need to supply some extra information.

gep.species.use-constants             = true
gep.species.numconstantspergene       = 2
gep.species.integer-constants         = false
gep.species.constants-lowerlimit      = -10
gep.species.constants-upperlimit      = 10
gep.species.rnc-mutation-prob         = 0.01
gep.species.dc-mutation-prob          = 0.044
gep.species.dc-inversion-prob         = 0.1
gep.species.dc-istransposition-prob   = 0.1


First we identify that we are going to use constants and that there will be 2 constants allocated per gene. We are using floating point constants since the integer-constants parameter is set to false. The values of the constants are allowed to range between -10.0 and +10.0. The other parameters relate to how constants evolve from generation to generation. This again is covered in Ferreira's documentation. But these values are the ones she suggests as defaults that work well for most problems.

After running the example we get the output in the Acot.stat file.

Seed(s) used in this job: 4357
Problem Type: functionfinding

Generation: 0
Best Individual:
Evaluated: T
Fitness: 499.85913
Linking function: +
Gene 0
ceil.tan.C1.tan.d0.C0.tan.acos.d0.d0.d0.C1.C1.C1.d0.C1.C1
C0: -8.147180372252937
C1: 5.652730175877368

ceiling(tan(5.652730175877368))

Generation: 10
Best Individual:
Evaluated: T
Fitness: 503.19714
Linking function: +
Gene 0
/.cos.C1.tan.tan.sin.d0.d0.d0.C1.C0.C1.C0.C0.d0.C0.d0
C0: -1.9238205968468058
C1: 5.576872726631521

(cos(tan(tan(sin(d0))))/5.576872726631521)

Generation: 14
Best Individual:
Evaluated: T
Fitness: 614.31683
Linking function: +
Gene 0
/.cos.d0.floor.C1.d0.floor.d0.d0.C1.C0.C1.C1.C0.d0.C1.d0
C0: -8.147180372252937
C1: 5.652730175877368

(cos(floor(5.652730175877368))/d0)

Generation: 21
Best Individual:
Evaluated: T
Fitness: 873.75574
Linking function: +
Gene 0
sin././.cos./.neg.*.floor.d0.C1.C0.C0.C1.C1.d0.d0.d0
C0: -8.147180372252937
C1: 5.652730175877368

sin((((floor(5.652730175877368)/d0)/(-5.652730175877368))/cos(((-8.147180372252937)*(-8.147180372252937)))))

Generation: 366
Best Individual:
Evaluated: T
Fitness: 1000.0
Linking function: +
Gene 0
atan././.d0.d0.d0./.C1.d0.d0.C1.C0.C0.d0.C0.C1.C1
C0: -8.147180372252937
C1: -0.24815065988408236

atan(((d0/d0)/d0))

Best Individual of Run:
Found at Generation: 366
Evaluated: T
Fitness: 1000.0
Linking function: +
Gene 0
atan././.d0.d0.d0./.C1.d0.d0.C1.C0.C0.d0.C0.C1.C1
C0: -8.147180372252937
C1: -0.24815065988408236

atan(((d0/d0)/d0))

Size of program: 6
Variables used(count): d0(3)

***** TRAINING data results *****
Statistics:
MSE:  6.5223928682607005E-31       RAE:  1.4189629717074661E-15       RSE:  5.2194614261805544E-30
RMSE: 8.076133275436148E-16        MAE:  3.738676035425215E-16        RRSE: 2.284614065040429E-15

Observed              Computed
-0.204002518946359    -0.2040025189463591
-0.339365885046348    -0.3393658850463476
-0.387522092113786    -0.3875220921137857
-0.23995309898349     -0.23995309898349046
0.515703851271719      0.5157038512717189
0.310364838997948      0.31036483899794765
    ...
0.146141817137733      0.14614181713773297
-0.124890152806863    -0.1248901528068634
-0.182368282843355    -0.18236828284335482
0.160813070832659      0.1608130708326594
0.139880646214395      0.13988064621439456

It took 366 generations in this case (with the seed provided) and found the expression atan(((d0/d0)/d0)) which is equivalent to atan(1/d0), the definition of acot(d0). There were improvements in the best fitness value at generations 10, 14, 21, and 366. So there was a lot of exploration of models through evolution to improve between generation 21 and 366.

2. Data mining example - This is an interesting variation of the test2 problem described above. This high-dimensional toy problem illustrates the use of a great deal of extraneous information being filtered during evolution. In this example 15 out of 16 independent variables are meaningless. As in test2 the target function is the quartic polynomial f(x) = x^4+x^3+x^2+x. The data contained 161 sets with 15 independent variable values plus the dependent variable values. As we can see from the (partial) output shown below, it takes 1596 generations (with the given seed as the starting point) to get to the best solution.

Seed(s) used in this job: 4357
Problem Type: functionfinding

Generation: 0
Best Individual:
Evaluated: T
Fitness: 459.41364
Linking function: +
Gene 0
+./.-.*.*./.-.*.d0.d2.d15.d13.d4.d12.d10.d6.d6
Gene 1
+.d6.*.-.+.d5./.*.d3.d12.d3.d13.d0.d8.d14.d12.d7
Gene 2
-.*.-./.-.d8.*./.d0.d13.d8.d7.d12.d0.d11.d4.d5

((((d6*d6)*d0)/(d2*d15))+((d13/d4)-(d12-d10))) + (d6+((d5-(d12/d3))*((d13*d0)+d3))) + ((((d0/d11)/d0)*(d13-d8))-(d8-(d7*d12)))

Generation: 3
Best Individual:
Evaluated: T
Fitness: 461.31992
Linking function: +
Gene 0
+.-.-.*.*./.-.*.d0.d2.d15.d13.d4.d12.d10.d6.d6
Gene 1
+.d6.*.-.+.d5./.*.d3.d12.d3.d13.d0.d8.d14.d12.d7
Gene 2
-.*.-./.-.d8.*./.d0.d13.d8.d7.d12.d0.d11.d4.d5

((((d6*d6)*d0)-(d2*d15))+((d13/d4)-(d12-d10))) + (d6+((d5-(d12/d3))*((d13*d0)+d3))) + ((((d0/d11)/d0)*(d13-d8))-(d8-(d7*d12)))

    ...

Generation: 1347
Best Individual:
Evaluated: T
Fitness: 998.6959
Linking function: +
Gene 0
+.*.d6.*.d0.d0.d0.-.d5.d12.d7.d5.d8.d4.d10.d5.d4
Gene 1
-.*.d1.d0.+.*.d0.*.d0.d0.d0.d10.d11.d8.d10.d5.d10
Gene 2
+.-.d0.d1.+.d6.d1.d1.d3.d6.d0.d8.d5.d8.d0.d2.d14

(((d0*d0)*d0)+d6) + ((d0*(((d0*d0)*d0)+d0))-d1) + ((d1-(d6+d1))+d0)

Generation: 1596
Best Individual:
Evaluated: T
Fitness: 1000.0
Linking function: +
Gene 0
+.-.d0.d1.d6.-.d1.*.d0.d12.d2.d9.d5.d6.d9.d12.d12
Gene 1
-.*.d1.+.d0.*.d0.*.d0.d0.d0.d7.d5.d15.d9.d12.d1
Gene 2
+.*.d6.*.*./.d0.d0.d0.d0.d0.d10.d5.d8.d5.d8.d7

((d1-d6)+d0) + (((((d0*d0)*d0)+d0)*d0)-d1) + ((((d0/d0)*d0)*(d0*d0))+d6)

Best Individual of Run:
Found at Generation: 1596
Evaluated: T
Fitness: 1000.0
Linking function: +
Gene 0
+.-.d0.d1.d6.-.d1.*.d0.d12.d2.d9.d5.d6.d9.d12.d12
Gene 1
-.*.d1.+.d0.*.d0.*.d0.d0.d0.d7.d5.d15.d9.d12.d1
Gene 2
+.*.d6.*.*./.d0.d0.d0.d0.d0.d10.d5.d8.d5.d8.d7

((d1-d6)+d0) + (((((d0*d0)*d0)+d0)*d0)-d1) + ((((d0/d0)*d0)*(d0*d0))+d6)

Size of program: 27
Variables used(count): d0(11), d1(2), d6(2)

***** TRAINING data results *****
Statistics:
MSE:  7.99571803020474E-23        RAE:  1.0804622056054273E-15        RSE:  4.6576363300204115E-30
RMSE: 8.941877895724556E-12       MAE:  3.3320664912135643E-12        RRSE: 2.1581557705643983E-15

Observed            Computed
1418.74707069507    1418.7470706950726
75.3238688522697    75.3238688522697
11019.4252335609    11019.425233560933
1526.50067316805    1526.500673168051
11309.522753932     11309.522753931999
6755.18648304066    6755.186483040658
    ...
537.603328324592    537.603328324592
533.236668818159    533.2366688181589
3742.26191483226    3742.261914832265
6664.98407320731    6664.984073207315
8312.11655791031    8312.116557910313


You'll also notice that for this particular run the solution looks pretty complex but it simplifies to:

    d1-d6+d0 + d0^4+d0^2-d1 + d0^3+d6
or
    d0 + d0^4 + d0^2 + d0^3

the expected solution. There is a second version of the program, dataMining2 that has the data in a CSV text file.

3. Production example - In this example we have real-world production data from 446 firms. The goal is to explain the Production as a function of Labor, Material, and Capital. There is nothing new in the parameter file that we haven't seen before. The data is provided in the user program. The only unique thing we haven't seen before is that the program runs to completion at 3000 generations without reaching an 'ideal' or 'perfect' solution. This is probably more like real problems with real (imperfect) data. We only explored the solution using the functions add, subtract, multiply and divide. Perhaps there are more complicated relations ships and trying some trigonometric functions might help. In any case, this attempt provided a solution with a fitness of 8878.4472. In real investigations we might run this problem 500 or 1000 times with different seeds for 10000 generations each to try to find a set of best possible solutions. The best solution of the run is extracted from the statistics output file and shown below. Note the role of constants in the solution.

Best Individual of Run:
Found at Generation: 2999
Evaluated: T
Fitness: 878.4472
Linking function: +
Gene 0
/.*./.-.-./.material./.material.capital.material.C0.labour.material.material.C0.C0
C0: 7.249937058678704
C1: -0.5794351361736538

Gene 1
material.material.material./.C0.+.labour.labour.capital.capital.labour.C1.material.material.capital.capital.C1
C0: 3.697751463839939
C1: 2.8929911331385565

Gene 2
*./././.-.C1./.-.material.C0.material.C1.material.material.labour.labour.labour
C0: -3.2099741094356737
C1: -0.49638400660778004

((((material/material)-material)*(capital-material))/((7.249937058678704/labour)/material)) + material + ((((material-labour)/material)/((-3.2099741094356737)-material))*((-0.49638400660778004)/((-0.49638400660778004)/material)))

Size of program: 31
Variables used(count): labour(2), material(10), capital(1)

4.  Analog Circuit Design Example - The goal is to find the transfer function expressing the yield of an analog circuit in terms of three parameter tolerances. The training set, consisting of 40 pairs of tolerances and their corresponding yields, was obtained from n runs of Monte Carlo simulations. The source of the problem is:

Zielinski, L. and J. Rutkowski, 2004. Design Tolerancing with Utilization of Gene
Expression Programming and Genetic Algorithm. In Proceedings of the International
Conference on Signals and Electronic Systems, Poznan, Poland.


Again this is similar to other function finding problems we've seen. However, this problem sets parameters to: use 10 integer constants per gene; run to 5000 generations; weight add, subtract and multiply at 3; and to use ln, sqrt, and pow functions. Check out the files analogCircuit.java, analogCircuit.param and analogCircuit.stat.
   

Classification Examples


Classification examples we haven't seen yet. Typically we'll have a set of examples of things that belong to various classes. Each of the examples will have a set of attributes that we hope allow us to predict which class the examples belong to and as new (unclassified) examples arrive we can use this knowledge to determine the class of the new example. In the gep approach we work with one class at a time by taking the training examples we have and identifying the selected class if the example belongs to the class or does not belong to the class. In this method, the dependent variable will have the values 0 and 1, representing whether the example does not or does belong to the class. By repeating for all of the classes we can construct a set of equations that allows up to predict whether a new example belongs to each class or not. The model (equation) produced will need to result in a zero or a 1. To do this we select a threshold value. If the equations results in a value greater than this value then return a 1 (in the class) and if below the threshold return a 0 (not in the class). Let's look at a few examples and see how to set things up in the parameter files. 

1. Iris Virginica Example - In this problem the goal is to classify three different types of irises based on four measurements: sepal length, sepal width, petal length, and petal width. The original iris dataset contains fifty examples each of three types of  iris: Iris setosa, Iris versicolor, and Iris virginica. Here the sub-problem Virginica versus Not Virginica is analyzed, where 100 randomly chosen samples are used for training and the remaining 50 for testing. The new things in the parameter file (IrisVirginica.param) are shown below.

# Problem type must be one of: functionfinding, classification, timeseries, logical
# Set default to be 'unknown' so user is forced to specify in the problem params file
gep.species.problemtype = classification

# if the problem is a classification type problem a threshold value (used
# to convert real values to 0 or 1 during fitness calculations) must be
# specified (do not specify this unless it IS a classification type problem).
gep.species.classification-threshold  = 0.5

gep.species.symbolset.terminalfilename = ec/app/gep/IrisVirginica/IrisVirginica.txt
gep.species.symbolset.testingdatafilename = ec/app/gep/IrisVirginica/IrisVirginicaTest.txt

We see that the problem type is set to classification. With classification problems we expect to see a classification threshold specified, In this case the value is set to 0.5. The data is in CSV files. The training data set is in IrisVirginica.txt. There are 100 sets of data in the file along with the first record that holds the names of the terminals (variables) as d0, d1, d2, and D.V. We see also that there is a testing data set file specified as IrisVirginicaTest.txt. This file holds the 50 testing sets of data and does not have the names of the variables in the first row! When the best model is determined from the run, the gep system will automatically access the test data and compare the results of using the model with this data against the expected values. This will help in the assessment of the quality of the model that was chosen. Let's look at the output from a run (IrisVirginica.stat). We set the problem to run for 5000 generations.


Seed(s) used in this job: -1368642168
Problem Type: classification
Classification rounding threshold: 0.5

Generation: 0
Best Individual:
Evaluated: T
Fitness: 801.24774
Linking function: +
Gene 0
*./.+.+.*./.-.d2.d3.d0.d1.d3.d3.d3.d1.d3.d3
C0: 4.19396659531718
C1: 4.2865486803954775

Gene 1
+./.d3.-.+./.d2.d1.d1.d3.d0.C1.d2.d1.C1.d2.d1
C0: 2.4700381059064878
C1: 8.534256473201435

Gene 2
/.d1.*.-.*.d1.d2.d1.d0.d2.d2.d0.C1.d1.d0.C0.C0
C0: 4.031580228177383
C1: 1.4026131599950642

(((d2+d3)/(d0*d1))*((d3/d3)+(d3-d1))) + ((((d3/d0)-d2)/(d1+d1))+d3) + (d1/((d1-d2)*(d1*d0)))

    ...

Generation: 9
Best Individual:
Evaluated: T
Fitness: 969.69696
Linking function: +
Gene 0
-.d3./.+.d2.*./.d0.d1.d1.d1.d1.d3.d1.d2.d2.d2
C0: 0.2103929964396789
C1: 4.2865486803954775

Gene 1
-.*./.-.d3.d3.C1.d2.d1.d3.d2.C1.d0.d2.d3.d2.d1
C0: 2.4700381059064878
C1: 8.534256473201435

Gene 2
/.d1.*.-.*.d1.d2.d1.C1.d1.d3.d3.C0.d3.d0.C1.C0
C0: 4.031580228177383
C1: 1.4026131599950642

(d3-(((d0*d1)+(d1/d1))/d2)) + (((d2-d1)*d3)-(d3/8.534256473201435)) + (d1/((d1-d2)*(d1*1.4026131599950642)))

Generation: 421
Best Individual:
Evaluated: T
Fitness: 984.8485
Linking function: +
Gene 0
-.-././.d3.d0.d3.*.d1.d1.d2.C0.d2.d3.C0.d2.d2
C0: 9.274259134965515
C1: 4.873682980602775

Gene 1
-.d3./.+.d2.*.C1.d0.d1.d2.d2.d2.d2.C0.d2.C1.d0
C0: 3.9855714658085004
C1: 8.534256473201435

Gene 2
-.d2././.d2.d2.C1.-.C0.C0.d1.d0.d0.d0.C1.d3.d0
C0: 1.8621216876649604
C1: 6.2762163390852646

((((d1*d2)/d1)-d3)-(d0/d3)) + (d3-(((d0*d1)+8.534256473201435)/d2)) + (d2-((d2/6.2762163390852646)/d2))

Best Individual of Run:
Found at Generation: 421
Evaluated: T
Fitness: 984.8485
Linking function: +
Gene 0
-.-././.d3.d0.d3.*.d1.d1.d2.C0.d2.d3.C0.d2.d2
C0: 9.274259134965515
C1: 4.873682980602775

Gene 1
-.d3./.+.d2.*.C1.d0.d1.d2.d2.d2.d2.C0.d2.C1.d0
C0: 3.9855714658085004
C1: 8.534256473201435

Gene 2
-.d2././.d2.d2.C1.-.C0.C0.d1.d0.d0.d0.C1.d3.d0
C0: 1.8621216876649604
C1: 6.2762163390852646

((((d1*d2)/d1)-d3)-(d0/d3)) + (d3-(((d0*d1)+8.534256473201435)/d2)) + (d2-((d2/6.2762163390852646)/d2))

Size of program: 27
Variables used(count): d0(2), d1(3), d2(5), d3(3)

***** TRAINING data results *****
Confusion Matrix:
                Predicted Value
               |  Yes   |  No
               |----------------
            Yes|  34    |  0
Actual Value   |----------------
            No |  1     |  65
               |----------------


***** TEST data results *****
Number of Calculation Errors: 0 out of 50 test sets
Confusion Matrix:
                Predicted Value
               |  Yes   |  No
               |----------------
            Yes|  16    |  0
Actual Value   |----------------
            No |  0     |  34
               |----------------

Observed    Computed
1.0         1.0
1.0         1.0
0.0         0.0
      ...

We see from the 'confusion matrix' that the model produced from this run predicted all but 1 data set correctly for the training data (a false positive). And for the test data set it predicted all of the examples correctly.  We can also see that the output has been truncated. In the stat file you'll see that all of the observed and computed values are displayed, including the results for the test data values (after the training data values). One last note: the line that identifies the number of calculation errors refers to the number of times that the model produced an arithmetic error when applied to the test set of data. This can happen, for example, when the model has a division and the test values lead to a zero divisor.

 

2. Breast Cancer Example - In this example of diagnosis of breast cancer the task is to classify a tumor as either benign (0) or malignant (1) based on nine different cell analysis (clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses). Data obtained from PROBEN1 (Prechelt, L., 1994. PROBEN1 - A set of neural network benchmark problems and benchmarking rules. Technical Report 21/94, Univ. Karlsruhe, Germany). Both the technical report and the data set cancer1 used here are available for anonymous FTP from Neural Bench archive at Carnegie Mellon University (machine ftp.cs.cmu.edu, directory /afs/cs/project/connect/bench/contrib/prechelt) and from machine ftp.ira.uka.de in directory /pub/neuron. The file name in both cases is proben1.tar.gz.

 

Nothing new here except that there is a large amount of data (training and testing). Run the problem or check out the results in BreastCancer.stat.

 

3. DNA MicroArrays Example - The objective of this problem is to distinguish between 2 types of leukemia, ALL (0) and AML (1). There are 38 training data sets and 34 testing data sets. The major difference of this example is the number of attributes. There are 7129 independent variables! One minor change from the other examples is that we specified a tab as the separator in the data files rather than the default comma (gep.species.symbolset.terminalfileseparator = tab). The data sets are available at:  http://sdmc.lit.org.sg/GEDatasets/Data/ALL-AML_Leukemia.zip. Note that these files have very long lines and many editors will have difficulty reading or editing them. With the default seed value an ideal solution is found after 2924 generations. But as you can see from the output the model did not perform well with the test data. A good model for the training data doesn't mean we have a solution. More runs will be required to find solutions that can match the test and training data well.


Best Individual of Run:
Found at Generation: 2924
Evaluated: T
Fitness: 1000.0
Linking function: +
Gene 0
-.d2452.*.d4739.*.*.+.*.d4572.d4791.d1108.d4210.d670.d4458.d1985.d5621.d5374
Gene 1
+.d6935.*.+.+.*.+.d4555.d221.d1529.d3911.d2841.d1838.d2397.d4492.d6959.d1045
Gene 2
*.-.*.*.d6801.-.*.*.d11.d749.d6727.d2841.d7054.d5751.d6694.d436.d6498

(d2452-(d4739*(((d4210*d670)*d4572)*(d4791+d1108)))) + (d6935+(((d1529*d3911)+(d2841+d1838))*(d4555+d221))) + ((((d5751*d6694)*d11)-d6801)*((d749-d6727)*(d2841*d7054)))

Size of program: 41
Variables used(count): d11(1), d221(1), d670(1), d749(1), d1108(1), d1529(1), d1838(1), d2452(1), d2841(2), d3911(1), d4210(1), d4555(1), d4572(1), d4739(1), d4791(1), d5751(1), d6694(1), d6727(1), d6801(1), d6935(1), d7054(1)

***** TRAINING data results *****
Confusion Matrix:
                Predicted Value
               |  Yes   |  No
               |----------------
            Yes|  11    |  0
Actual Value   |----------------
            No |  0     |  27
               |----------------


***** TEST data results *****
Number of Calculation Errors: 0 out of 34 test sets
Confusion Matrix:
                Predicted Value
               |  Yes   |  No
               |----------------
            Yes|  4     |  10
Actual Value   |----------------
            No |  2     |  18
               |----------------

Also see the publication:

    "Molecular Classification of Cancer: Class Discovery and Class Prediction by
    Gene Expression Monitoring". Science, 286:531-537, October 1999.

 

Time Series Examples

 

The special thing about time series problems (as found in gep) is that the data is a single sequence of data. From a training sequence of time series data we'd like to create a model that can predict new values (i.e. given a test sequence of time series data what would the next value in the sequence be). To this end the sequential time series data is preprocessed to create a set of training data sets and a set of testing data sets. To do this the user must provide a bit of information about how the processing should be done. The user must specify 3 things in the param file:

timeseries-delay - This integer value determines how to use the values in the timeseries data. If the value is 1, then use every time series value, if it is 2, then use every other one, etc.

timeseries-embeddingdimension - This integr value determines the number of timeseries points to use as independent variables when transforming the set of time series data. Another data point is used as the dependent variable value. So the time series 'raw' data consisting of a list of single values is processed by splitting the data into groups (rows) of size embeddingdimension+1. From the end of the time series data embeddingdimension+1 values are chosen (if delay is 1 all values are chosen, if it is 2 every other one is chosen). The last value is the independent variable value. Then the next row is selected by moving 'delay' values from the end and choosing another embeddingdimension+1 values. This is repeated until no more sets of size embeddingdimension+1 can be chosen. If this produces n sets of data then timeseries-testingpredictions of them are used for testing and (n - timeseries-testingpredictions) are used for training.

timeseries-testingpredictions - an integer that specifies the number of sets of data to devote to testing.

So if we had the time series data:

        1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

and delay was 1 and timeseries-embeddingdimension was 4 then we'd process the set into the following 17 data sets. If timeseries-testingpredictions was 6 then the 1st 11 would be used for training and the LAST 6 for testing. 

        iv1 iv2 iv3 iv4  dv
          1   2   3   4   5
          2   3   4   5   6
          3   4   5   6   7
              . . .
         14  15  16  17  18
         15  16  17  18  19
         16  17  18  19  20
         17  18  19  20  21

If delay was 2 then 7 sets would be formed as:

        iv1 iv2 iv3 iv4  dv
          1   3   5   7   9
          3   5   7   9  11
              . . .
          9  11  13  15  17
         11  13  15  17  19
         13  15  17  19  21


 Of course the user can provide the data already preprocessed into the 2 data sets as we have done before, but that could require a lot more effort. The timeseries data file must not have any information regarding the names of the independent variables, since these are automatically created.

1. Dow Jones Example - This example looks at the monthly closings of the Dow-Jones industrial index, Aug. 1968 - Aug. 1981. (Source: Hipel and Mcleod (1994). Time Series Modelling of Water Resources and Environmental Systems, Elsevier. The original time series data file DJ.DAT came from:

    Hyndman, R.J. (n.d.) Time Series Data Library, http://www.robhyndman.info/TSDL/.
    Accessed on June 3, 2006.

For the integral time series used in this example, each value from the original raw timeseries was replaced by its moving average using a smoothing period p = 10.

Of interest in the example is that there are 40 functions used to build the models and that the 3 timeseries values are set to:

gep.species.timeseries-delay   =  1
gep.species.timeseries-embeddingdimension = 10
gep.species.timeseries-testingpredictions = 10

Some of the statistics output:

Seed(s) used in this job: 4357
Problem Type: timeseries

Generation: 0
Best Individual:
Evaluated: T
Fitness: 862.8322
Linking function: +
Gene 0
gau.comp.*.v6.C1.*.v7.mul3.v6.C0.v3.v3.v6.v8.v8.v3.v5.v6.v6.v9.v6.v1.C0.v0.v4.v2.v0.v2.v9.v3.C1.v8.v9
C0: -1.8928414198253574
C1: -9.528899601444923

Gene 1
max2.X3.v9.C0.exp.-.v0.v9.v0.v8.v3.v9.v6.v2.v1.v7.v6.v6.v0.v2.v3.v7.v6.v7.v3.v3.v1.v8.v7.v3.v8.v3.v7
C0: 5.353542014548175
C1: -9.236307382044629

Gene 2
pi.v4.X5.v0.*.comp.min2.v7.v3.C1.v9.v3.v5.v1.v2.v9.v7.C0.v7.v8.v6.v5.v3.v8.v9.v1.v8.v0.v9.v2.v5.v0.v1
C0: 9.111412390259325
C1: 9.94532215973475

exp(-pow((1-(v6*(-9.528899601444923))),2)) + max((5.353542014548175^3),v9) + pi

Generation: 2
Best Individual:
Evaluated: T
Fitness: 866.1386
Linking function: +
Gene 0
gau.comp.*.v7.C1.*.v7.mul3.v6.C0.v4.v3.v6.v8.v8.v3.v5.v6.v6.v9.v6.v5.C0.v0.v4.v2.v0.v2.v9.v3.C1.v8.C1
C0: -1.8928414198253574
C1: -9.528899601444923

Gene 1
max2.X3.v9.C0.exp.-.v0.v9.v0.v8.v3.v9.v6.v2.v1.v7.v6.v6.v0.v2.v3.v7.v6.v7.v3.v3.v1.v8.v7.v3.v8.v3.v7
C0: 5.353542014548175
C1: -9.236307382044629

Gene 2
gau.comp.*.v6.C1.*.v7.mul3.v6.C0.v1.v2.v6.v8.v3.v3.v5.v2.v6.v9.v6.v1.C1.v0.v4.v2.v0.v2.v9.v3.C0.v8.v9
C0: 9.111412390259325
C1: 9.94532215973475

exp(-pow((1-(v7*(-9.528899601444923))),2)) + max((5.353542014548175^3),v9) + exp(-pow((1-(v6*9.94532215973475)),2))

    ...

Generation: 238
Best Individual:
Evaluated: T
Fitness: 934.1794
Linking function: +
Gene 0
-.v9.v8.sech.v1.v7.v9.v4.v7.C1.v2.v0.v3.v8.v3.v1.v5.v8.v8.v2.v8.v6.v2.v9.v2.v2.v1.v7.v8.v7.v0.v0.C0
C0: -1.8928414198253574
C1: 4.140535916050434

Gene 1
sin.min3.avg2.v2.v9.v8.v5.v0.v5.v8.v2.v9.v8.v8.v3.v0.v5.v5.v3.C1.v2.v0.v6.v9.v9.v2.v7.v5.C1.v8.v7.v0.v6
C0: 7.570619662652462
C1: -1.542568610489699

Gene 2
max2.v9.v9.acsc.v7.v8.add3.tan.v4.v0.v0.v0.v8.C1.v1.v3.v5.v2.v8.v2.v5.v9.v2.v5.v4.v1.v2.v8.v7.v6.v0.v5.C0
C0: -9.226903090150953
C1: 0.7626090735460895

(v9-v8) + sin(min(((v8+v5)/2),v2,v9)) + max(v9,v9)

    ...

Best Individual of Run:
Found at Generation: 4293
Evaluated: T
Fitness: 936.5758
Linking function: +
Gene 0
-.v9.v8.v0.v9.v7.acsc.v8.v2.v0.v3.v2.C1.v4.C0.v5.v9.v3.v5.v0.v2.v7.v7.v5.v9.v6.v2.v9.v2.v3.v9.v5.v5
C0: 5.233037180754705
C1: -7.143261425650365

Gene 1
sin.min3.mul3.v2.v6.cot.v6.mul4.v3.v6.v9.v8.v9.v5.v3.v0.v0.v0.v9.v5.v6.v0.v8.v4.v2.v8.v5.v4.v1.v0.C1.v5.v5
C0: 3.8794116272712955
C1: 7.460702927280668

Gene 2
v9.*.v8.v9.v8.v2.v7.*.v6.v4.v2.v7.v1.v9.v8.v8.v2.v6.v1.v8.v3.C0.v3.v3.v2.C0.C0.v9.v4.v2.v2.v7.v0
C0: 1.24681550366709
C1: -0.13544436934751047

(v9-v8) + sin(min((cot(v3)*v6*(v6*v9*v8*v9)),v2,v6)) + v9

Size of program: 17
Variables used(count): v2(1), v3(1), v6(3), v8(2), v9(4)

***** TRAINING data results *****
Statistics:
MSE:  24.239096329510474       RAE:  0.06769687388092159      RSE:  0.004585892048274829
RMSE: 4.923321676420349        MAE:  4.017510790472716        RRSE: 0.06771921476416298


***** TEST data results *****
Number of Calculation Errors: 0 out of 10 test sets
Statistics:
MSE:  16.72184789340559        RAE:  0.11520996098236828      RSE:  0.019535079218796565
RMSE: 4.089235612361508        MAE:  3.011454736524365        RRSE: 0.13976794775196696

Observed    Computed
811.63994    808.827861961812
797.92192    797.2229981918084
786.41891    785.1324338242816
771.09691    774.3511732579256
763.19992    756.108499923999

    ...

858.957    853.8591859872175
869.964    871.6444114907323
878.539    880.63390351821

890.288    886.6045236016611
900.373    901.3867552973694
916.525    910.849758599105
932.277    932.416393052961
947.579    947.2086536125466
960.562    962.9643013882095
966.205    973.9101611582124
970.634    970.9536648254193
972.626    974.1782314966911
968.324    975.5769759596939


2. Exchange Rate Example - This example follows the exchange rate of the Australian dollar against the US dollar. The original time series data file USEXCHM.DAT can be obtained from:

    Hyndman, R.J. (n.d.) Time Series Data Library, http://www.robhyndman.info/TSDL/.
    Accessed on June 3, 2006.

For the integral time series used in this run, each value from the last 100 records of the original raw series was replaced by its moving average using a smoothing period p = 12.

3. Earthquake Data Example - This example follows the number of earthquakes per year of magnitude 7.0 or greater from 1900 to 1998. The source is the US National Earthquake Information Center and the original time series data file EARTHQ.DAT can be obtained from:

    Hyndman, R.J. (n.d.) Time Series Data Library, http://www.robhyndman.info/TSDL/. Accessed on June 3, 2006.

For the integral time series used in this run, each value from original raw series was replaced by its moving average using a smoothing period p = 10.

4. Sunspost Data Example - This example uses the Wolfer sunspots data that can be found at http://mason.gmu.edu/~csutton/sun554.html or do a search on the internet. For this example we have 3 variations to show how the timeseries files can be accessed. The Sunspots.param variation gets the data from a timeseries text file, Sunspots.txt. Sunspots2.param gets the data from a user program. Sunspots3.param get the data from a text file but it is preprocessed already, so one should not specify the delay, or other 2 parameters that are required when the data must be processed.

 

Logical Problem Examples

 

Logical problems have data values that are all logical, i.e. true or false (represented by 1 and 0 in gep). They can use only logical functions such as and, or, not, etc. and the linking function must be a logical function as well (from the list: and, or, nand, nor, xor, nxor). 

1. XOR example - Given a set of x, y and result values (where the result is the exclusive or of the x and y values) we try to discover an equivalent function using only the functions and, or and not. The Xor.param file contains:


gep.species.numgenes = 2
gep.species.gene-headsize = 8
gep.species.problemtype = logical
gep.species.symbolset.terminalfilename = ec/gep/test/Xor.txt
gep.species.symbolset.functionsize = 3
gep.species.symbolset.function.0 = Not
gep.species.symbolset.function.0.weight = 1
gep.species.symbolset.function.1 = And
gep.species.symbolset.function.1.weight = 1
gep.species.symbolset.function.2 = Or
gep.species.symbolset.function.2.weight = 1

The resulting statistic file shows that an expression was discovered that matches the requirement exactly.

Seed(s) used in this job: 4357
Problem Type: logical

Generation: 0
Best Individual:
Evaluated: T
Fitness: 500.0
Linking function: and
Gene 0
not.and.d1.and.d1.d0.not.d0.d0.d0.d0.d1.d1.d0.d0.d1.d1
Gene 1
not.and.d0.d1.d0.d1.d0.d1.d0.d0.d0.d0.d1.d0.d1.d0.d0

(not (d1 and (d1 and d0))) and (not (d0 and d1))

Generation: 2
Best Individual:
Evaluated: T
Fitness: 1000.0
Linking function: and
Gene 0
not.and.d0.d1.d0.and.d1.not.d0.d1.d1.d0.d0.d1.d1.d0.d0
Gene 1
or.d0.or.d1.d1.not.not.not.d1.d0.d0.d1.d0.d0.d1.d0.d0

(not (d0 and d1)) and (d0 or (d1 or d1))

Best Individual of Run:
Found at Generation: 2
Evaluated: T
Fitness: 1000.0
Linking function: and
Gene 0
not.and.d0.d1.d0.and.d1.not.d0.d1.d1.d0.d0.d1.d1.d0.d0
Gene 1
or.d0.or.d1.d1.not.not.not.d1.d0.d0.d1.d0.d0.d1.d0.d0

(not (d0 and d1)) and (d0 or (d1 or d1))

Size of program: 9
Variables used(count): d0(2), d1(3)

***** TRAINING data results *****
Confusion Matrix:
                Predicted Value
               |  Yes   |  No
               |----------------
            Yes|  2     |  0
Actual Value   |----------------
            No |  0     |  2
               |----------------

Observed    Computed
0.0         0.0
1.0         1.0
1.0         1.0
0.0         0.0


2. Majority Function Example - This example looks for a logical expression that can determine if given three inputs (x, y and z) there are at least 2 true (1) values. The solution uses only the operators not, and and or.

3. Odd 3-Parity Example - This example looks at 3 boolean input variables (x, y, and z) and looks for a logical expression that can decide if there are an odd number of true (1) values. The solution uses only the operators not, and and or.

4. 6 Bit Multiplexer Example - This example is an attempt to discover a 6 bit multiplexer function using and, or and not operators. It looks at 6 boolean variables representing 2 control bits and 4 data bits. The 1st 2 bits are the control bits and they identify which of the next 4 input bits will be in the output. For example, if the control bits are 00 then the 3rd bit value is value of the output; if the control bits are 01 then the 4th bit value is the output value; if the control bits are 10 then the 5th bit value is the output value; if the control bits are 11 then the 6th bit value is the output value. This can be expressed as:

    A&~c1&~c2 v B&~c1&c2 v C&c1&~c2 v D&c1&c2 = output

    where c1 is control bit 1 and c2 is control bit 2
              A, B, C, D are the 4 input bits

So if we have 6 bits labelled d0, d1, d2, d3, d4, d5 and d0=c1, d1=c2, A=d2, B=d3, C=d4, D=d5 then an excellent solution to the problem will be:

    output = d2&~d0&~d1 v d3&~d0&d1 v d4&d0&~d1 v d5&d0&d1

 

A Detailed Look at the GEP Parameter File (gep.params)


The file ec/gep/gep.params has most of the required details in comments. But here we list the parameter settings that are of concern for gep problems.

eval.problem - The user code to support the problem. For example, ec.app.gep.test2.test2. These classes will be of type GEPProblem. If the user wants to just take a default route and not have to create any code at all he can use the default GEPProblem class, ec.gep.GEPDefaultUserProb. Then he will need to also specify a 'fitness-function' to use by setting the parameter eval.problem.fitness-function and possibly one or more parameters of the form eval.problem.fitness-function-arg0, eval.problem.fitness-function-arg1

evalthreads and breedthreads  - should both be 1 always. It's possible to have multiple threads but it would NOT work as per the GeneXProTools system and I expect the code as is now will not work (at tleast the code would need to be modified so an adjustment is made so that the multiple paths through breeding  would each reduce the rates for inversion, mutation, etc.)

checkpoint - not well tested but it should work OK. If not let me know. May not be important since rerunning a problem may be fast enough.
state - Should be ec.simple.SimpleEvolutionState; do not change this!
init - Should be ec.gep.GEPInitializer; probably should not need to change this.
finish - Should be ec.simple.SimpleFinisher; probably should not need to change this.
exch - Should be ec.simple.SimpleExchanger; probably should not need to change this.
breed -Should be ec.simple.SimpleBreeder; do not change this.
eval
- Should be ec.simple.SimpleEvaluator; do not change this without some serious thought.
generations - As for standard ecj, maximum number of generations to run
quit-on-run-complete - As for standard ecj, true if  will quit when an 'ideal' individual is found in a generation; false if will run until all generations are completed. an 'ideal' individual is normally when the maximum fitness value (or something very vlose to maximum) is found. This can be controlled by the user written code (see eval.problem) or by modifying the default user code.
fitness-function-ideal-percent-threshold - specifies a percentage of the maximum fitness value (MaxThreshold) that must be reached to be considered an ideal fitness value; the default value is 99.999999; so if the maximum fitness value is 1000 and the fitness-function-ideal-percent-threshold is set to 99.9 then a fitness value >= 999 will be considered ideal; by default a value >= 999.99999 will be ideal.


pop - Should be ec.Population. - As for standard ecj.
pop.subpops - must be set to 1 (do not change since only 1 subpop makes sense).
pop.subpop.0 - must be ec.Subpopulation. Do not change.
pop.subpop.0.duplicate-retries - should be 0? Haven't looked closely at this.
pop.subpop.0.size - application dependent; number of individuals in each generation

pop.subpop.0.species - must be ec.gep.GEPSpecies.
pop.subpop.0.species.fitness - Should be ec.simple.SimpleFitness. Could change this as required (for example to support a more complex fitness or to use doubles rather than floats).
pop.subpop.0.species.ind - must be ec.gep.GEPIndividual.

The following are the standard defaults for the evolutionary parameters as suggested by Ferreira. These can be changed as required. I called these probabilities but they are actually 'rates' the way Ferreira uses them. If they were probabilities as I first thought then the number of affected individuals in each generation would vary according to the probability. But instead these rates are used to determine the number of individuals that will be affected in every generation (constant). The individuals are randomly chosen.
gep.species.inversion-prob                = 0.1
gep.species.mutation-prob                 = 0.044
gep.species.istransposition-prob       = 0.1
gep.species.ristransposition-prob      = 0.1
gep.species.onepointrecomb-prob     = 0.3
gep.species.twopointrecomb-prob     = 0.3
gep.species.generecomb-prob           = 0.1
gep.species.genetransposition-prob  = 0.1

Constants can not be used with Logical problem types and will be ignored if specified.
gep.species.use-constants - set to true if using constants else false.
gep.species.numconstantspergene - Number of constants generated for each gene.
gep.species.integer-constants - Set to true if using integer constants or false to use floating point constants.
gep.species.constants-lowerlimit - Minimum value of the constants chosen.
gep.species.constants-upperlimit - Maximum value of the constants chosen.
These are default to these values as suggested by Ferreira. They can be changed as required for the problem, but will need to understand what they mean (read the information in Ferreira's book, etc.)
gep.species.rnc-mutation-prob         = 0.01
gep.species.dc-mutation-prob          = 0.044
gep.species.dc-inversion-prob         = 0.1
gep.species.dc-istransposition-prob   = 0.1

gep.species.numchromosomes - The number of chromosomes to be used for the problem. This is not part of Ferreira's implementation of GEP and is set to 1 normally when using the standard GEP of Ferreira. It is a new feature added at NRC to deal with multi criteria decision making problems (represented by multiple dependent variables in the data set). When used, it will require a special fitness function to deal with the multi chromosomes: like the SEWD (Sammon Error with Dissimilarity) fitness function.
gep.species.numgenes - The number of genes to be used for the problem.
gep.species.gene-headsize - The length of the head of each gene.

The linking-function is the function to use to combine the values of the expressions for each gene. In early versions the linking-function could be one of +, -, * or / for numerical problem types and for logic problems it could be one of: and, or, nand, nor, xor, nxor. This was as provided in Ferreira’s implementation. These choices are still available, but now any function defined as a GEPFunctionSymbol (such as Add, Sub, Mul, Div, Ifgt4, Ifeq4, Pow, etc) that has arity of 2 or more can be used. It must be specified with the same ‘case’ as the class name and there must be a ‘correct’ number of genes that suit the arity of the function. For a function with arity of 2 any number of genes is OK; for arity 3 there must be 3, 5, 7, 9, etc. genes; for arity 4 there must be 4, 7, 10, 13, etc. genes; For 1 gene then no function is needed.
gep.species.gene-linking-function - The function to use when combining the values the expressions for each gene.

Problem type must be one of: functionfinding, classification, timeseries, logical. Set default to be 'unknown' so user is forced to specify in the problem params file.
gep.species.problemtype = unknown

gep.species.classification-threshold - If the problem is a classification type problem then a threshold value (used to convert real values to 0 or 1 during fitness calculations) must be specified.

If the problem type is timeseries then the user needs to specify a number of parameters. See ec.gep.gep.params for full details.
gep.species.timeseries-delay 
gep.species.timeseries-embeddingdimension
gep.species.timeseries-testingpredictions
 

gep.species.symbolset  - Must be ec.gep.GEPSymbolSet; do NOT change.
gep.species.symbolset.name - can be any string but must be specified; not actually used (at one point it seemed we might have multiple symbols sets but this will not likely happen0; should be removed at some point.

Examples for specifiying the functions to be used in the problem and their weights. The set of functions can be found by looking at the directory ec/.gep/symbols.
gep.species.symbolset.functionsize = 4
gep.species.symbolset.function.0 = Add
gep.species.symbolset.function.0.weight = 1
gep.species.symbolset.function.1 = Sub
gep.species.symbolset.function.1.weight = 1
gep.species.symbolset.function.2 = Mul
gep.species.symbolset.function.2.weight = 1
gep.species.symbolset.function.3 = Div
gep.species.symbolset.function.3.weight = 1

The number and names of the terminal symbols (independent variables) can be specified as shown below:
gep.species.symbolset.terminalsize = 1
gep.species.symbolset.terminal.0 = x
Or for the terminal set, the info can be in a text file where the file has the names of the terminal symbols and the dependent variable in the first record/line (comma separated, space or tab separated, etc) and the training values for the terminal symbols in the other records; the number of terminals is computed from the number of terminal symbols in the 1st record. See details in ec/gep/gep.params and look at some examples.
gep.species.symbolset.terminalfilename = filename
gep.species.symbolset.terminalfileseparator = ,
or for a tab
gep.species.symbolset.terminalfileseparator = tab
or for a space
gep.species.symbolset.terminalfileseparator = space

If there is testing data as well it can be specified. Again see full details in ec/gep/gep.params since there are some slightly complex situations (timeseries data etc.).
gep.species.symbolset.testingterminalfilename = testDataFileName

For gep systems breeding should be as shown below. Changing theseis possible but will not be as per original gep by Ferreira (as in GeneXProTools) and will require some in-depth knowledge!
gep.species.pipe = ec.gep.breed.DcGeneticOperatorsPipeline
gep.dcgeneticoperators.source.0 = ec.gep.breed.GenerecombinationPipeline
gep.generecombination.source.0 = ec.gep.breed.TwopointrecombinationPipeline
gep.twopointrecombination.source.0 = ec.gep.breed.OnepointrecombinationPipeline
gep.onepointrecombination.source.0 = ec.gep.breed.GenetranspositionPipeline
gep.genetransposition.source.0 = ec.gep.breed.RIStranspositionPipeline
gep.RIStransposition.source.0 = ec.gep.breed.IStranspositionPipeline
gep.IStransposition.source.0 = ec.gep.breed.InversionPipeline
gep.inversion.source.0 = ec.gep.breed.MutationPipeline
# FitProportionateSelection (roulette selection) is used exclusively by Ferreira - could be changed
gep.mutation.source.0 =
ec.select.FitProportionateSelection

breed.elite.0 - Should be set to 1 since this is the only option used by Ferreira; however you might want to experiment with this to see if it improves the evolution for your problem or just to force the best n individuals to be carried forward.

We supply a not so simple GEPSimpleStatistics class to handle the basic capture of information from a gep run. For the basic setting of statistics output you can do something like the following.

stat = ec.gep.GEPSimpleStatistics
stat.file = ourProblem.stat
stat.file-observed-versus-computed = ourOvsC.stat
stat.file-observed-versus-computed-test = ourOvsCtest.stat
stat.detail-to-log = change
stat.number-of-best-to-log = 1

This contains all of the parameters that you might like to use but variations can be done depending what you want to capture. As shown above, GEPSimpleStatistics has a number of parameters it can use to control the output:
stat.file - The main output is directed to the file specified by the file parameter.
file-observed-versus-computed - this file, if specified, gets the results of the best of run model by listing the observed values versus the model's predicted values (useful for plotting). By default this will include the training and testing data values (with test values appended to the end if they exist). To get the training and testing values separated identify a second file in the parameter:
file-observed-versus-computed-test. If file-observed-versus-computed is not specified it prints to the main output file, if specified, or to the console.
file-observed-versus-computed-test - If specified, this identifies a file with only the computed test data values. If not specified it prints to the console.
no-observed-versus-computed - If true then don't print any observed/computed values even if files are specified. Default is false.
detail-to-log - the user can specify the detail of the output to the main log file (param 'stat.file'). Values can be one of:
        all - display the best model from every generation (can be a large amount of output)
        change - logs the best model of the generation when the fitness improves (default)
        final - logs only the best of run model information
number-of-best-to-log - this determines the number of best individuals to display in the final results. By default it is 1.

For the multiple chromosome case there is another not so simple GEPSimpleStatisticsMultipleChromosomes class to handle the capture of information from a gep run. For the basic setting of statistics output you can do something like the following.

stat = ec.gep.GEPSimpleStatisticsMultipleChromosomes
stat.file = ourProblem.stat
stat.file-computed-values = ourOvsC.stat
stat.no-computed = false
stat.detail-to-log = change
stat.number-of-best-to-log = 1

These are similar to the parameters in the GEPSimpleStatistics case but we don't display the observed/computed pairs (just the computed values for the multiple dependent variables) so we only need the stat.file-computed-values and stat.no-computed parameters. At some point we should (will?) get the parameters to be the same so we don't need different ones for the 2 cases.

More complex things can be done as per standard ECJ. Consider the example below examples for setting multiple statistic packages.
stat.num-children = 3
stat.child.0 = ec.gep.test.SimpleXYSeriesChartStatistics
stat.child.0.title = Best of Generation
stat.child.0.x-axis-label = generation
stat.child.0.y-axis-label = fitness
stat.child.1 = ec.simple.SimpleShortStatistics
stat.child.1.file = simpleshortstats.txt
stat.child.1.gather-full = true
stat.child.2 = ec.gep.GEPSimpleStatistics
stat.child.2.file = simplestats.txt
stat.child.2.file-observed-versus-computed = simplestatsOvsC.txt


Note that as revisions are made, new parameters may be added or changed, so it is best to also look at the gep.params file for details.

Some of the Implementation Details

The system has been tested using Java version 1.6.

One thing to note about the gep implementation is that all values are double, rather than float as in other ECJ parts. The only thing not in double is the fitness and this only because we used the SimpleFitness provided with ecj. When complicated expressions are created during the evolution, considerable loss of precision creeps in when they are evaluated as floats rather than doubles. The penalty may be a loss of speed but I don't think it's too great. The gep implementation is still quite speedy and we've done a few demanding examples (50 individuals in the population, 5 genes, head size of 15, 2 constants per gene, 13 functions, 25000 generations) and the performance is quite acceptable. Try some of the examples or your own problems and I think you'll see that the performance is quite good. Note also that the logical problems, although they use only values 0 and 1 are still done with double values. This could be much improved I suppose, but no effort has been made to do so. It works fairly quickly as is and these are not problems we were all that interested in for our research.

This is the first version of this gep extension and the code could use some reorganization, optimization and perhaps better adherence to the ECJ structures (although it isn't too bad).

I did modify a couple of the standard ECJ files. These were very minor modifications. First the
ec.display.Console was changed slightly to allow the seed that is used in a run to be stored for access later in statistics output. This also required some changes to other files (below). The change was near line 935 in ec.display.Console:

        state.evalthreads = evalthreads;
        state.breedthreads = breedthreads;
        state.seeds = seeds;  // this line was added


Corresponding changes were made to files ec.Evolve and ec.EvolutionState to support the need to have access to the seed value(s) for reporting in statistics. This is important so that one can rerun the program with the same seed at some later time and there was no way to record this in the standard version of ECJ. In ec.Evolve near line 371 we added the code:

        state.breedthreads = breedthreads;
        state.randomSeedOffset = randomSeedOffset;
        state.seeds = seeds;  // this line was added

and near line 187 in ec.EvolutionState we added the code:

    /**
     * The seeds used for the current job
     */
    public int seeds[];



The two areas where users might want to add functionality to this gep implementation are
new mathematical/logical functions and new fitness functions. All mathematical functions are derived from the abstract class GEPFunctionSymbol. Any new function will have to provide just a few simple methods. Let's look at the Add function.

package ec.gep.symbols;
import ec.EvolutionState;
import ec.gep.GEPFunctionSymbol;
import ec.gp.GPData;
import ec.gp.GPIndividual;
import ec.util.Parameter;

public class Add extends GEPFunctionSymbol {

    public Add()
    {
        super("+", 2);
    }

    public double eval(double params[])
    {
        //should check that there are 2 params
        return (params[0] + params[1]);
    }
   
    public boolean isLogicalFunction()
    {
        return false;
    }

    public String getMathExpressionAsString( String p[] )
    {
        return "(" + p[0] + "+" + p[1] + ")";
    }
}


The constructor will simply call the super constructor and pass the symbol to be used when printing the Karva representation of the expressions formed in the genes. It will also pass the 'arity' of the function (how many arguments it needs when executed). It must also implement an eval method to calculate it's value using the arguments passed to it. As the comment says some checking could (should?) be done to ensure there are 2 arguments available in the params array, but this is probably not the place to do it. It should be done in the place where eval is called so every function doesn't have to do it (i.e. in the eval method of GEPExpressionTreeNode). There are places where it is necessary to know if this is a logical function or not (so we can ensure that logical problems use only logical functions and other problem types don't use logical functions. so we must provide the simple isLogicalFunction method. Finally in order to display the expressions in a mathematical notation it is necessary for the function to provide an appropriate string representation for this purpose. I'd suggest looking at a similar function to see how it was done and proceed from there. The array passed to getMathExpressionAsString contains the string representations of its parameters.

Adding new fitness functions is done by adding some methods to the class ec.gep.GEPFitnessFunction. For each of the fitness functions there are 3 methods that need to be supplied: one to calculate the fitness (with a value from 0 to some maximum); one to give the raw fitness value, the value prior to being mapped between 0 and the maximum value; and one to provide the maximum value that the fitness value can take for that fitness function. Actually the raw value is not always needed (see for example the specialized fitness function WCorrRMSE (Weighted correlation coefficient and Root Mean Squared Error). There is no raw value for this since it combines to other fitness values. but since some users might want access to the value before it is converted to a value from 0 to a maximum value, it seemed prudent to make this available whenever possible. By convention we use the names XXXXfitness, XXXXrawFitness and XXXXmaxFitness for all of the fitness functions. XXXX is the pneumonic for the fitness function (e.g. RRSE, MAE, etc.). This allows a user to specifiy a fitness function to be used in the default user program by specifying only the short pneumonic for the fitness function (they don't provide ANY code if they use this - check the code in GEPDefaultUserProg.java). The XXXXmaxFitness functions all have 1 arg when most don't need one. This was just to make it consistent and easier to call the method based on the XXXX part only (again see GEPDefaultUserProg.java). For example the functions for the MSE fitness functions are called as:

      GEPFitnessFunction.MSEfitness(useTrainingData, gepindividual );
          - calculates the fitness using the individual's gene expressions

    - it uses either the training date (useTrainingData is true) or the testing data (useTrainingData is false)
          - it gets the raw fitness first from MSErawFitness (the mean squared error
            of the predicted values versus the text/expected values)
          - then it normalizes the result between 0 and 1000 
              (1000 * (1/(1 + raw MSE))
      GEPFitnessFunction.MSErawFitness(useTrainingData, gepindividual, chromosomeNum);
          - sum((predicted valuei - expected valuei)**2)/n
      GEPFitnessFunction.NHmaxFitness( gepindividual );
          - in this case max is always 1000 but some fitness functions
            have a maximum that depends on the number of training data sets

Again let's look at a the code of a fitness function to see what's required.

      /**

       * Calculates the 'raw' fitness for the RAE (Relative Absolute Error) type

       * fitness (before the normalization from 0 to max value is done).

       * @param useTrainingData true if using training data, else use testing data

       * @param ind the GEP individual that needs its fitness calculated.

       * @param chromosomeNum which chromosome in the individual to use to calc the raw RAE

       * @return the 'raw' fitness value before normalization from 0 to max value

       */

      public static double RAErawFitness(boolean useTrainingData, GEPIndividual ind, int chromosomeNum)

      {

        double sumOfAbsoluteError = 0.0;

        double expectedResult;

        double result;

        double error;

        GEPDependentVariable dv;

 

        if (useTrainingData)

            dv = GEPDependentVariable.trainingData;

        else

            dv = GEPDependentVariable.testingData;

       

        double dvValues[] = dv.getDependentVariableValues(chromosomeNum);

        double dvSumOfAbsoluteError = dv.getDependentVariableSumOfAbsoluteError(chromosomeNum);

 

        for (int i=0; i<dvValues.length; i++)

        {

            expectedResult = dvValues[i];

            result = ind.eval(chromosomeNum, useTrainingData, i);

            error = result - expectedResult;

            sumOfAbsoluteError += Math.abs(error);

        }

       

        if (dvSumOfAbsoluteError == 0.0)

        {   dvSumOfAbsoluteError = RELATIVE_ERROR_ZERO_FACTOR;

            System.err.println("Warning: sum of error for dependent variable is 0 in RAE fitness calculation. Adjusting to avoid division by zero.");

        }

        // the raw fitness ... RAE

        return (sumOfAbsoluteError/dvSumOfAbsoluteError);

      }

     

      /**

       * Calculates the fitness for the RAE (Relative Absolute Error) type fitness.

       * Gets the raw fitness and then normalizes between 0 and max value as (maxValue * (1/(1+RAE)).

       * @param useTrainingData true if using training data, else use testing data

       * @param ind the GEP individual that needs its fitness calculated.

       * @return the fitness value after normalization from 0 to max value

       */

      public static double RAEfitness(boolean useTrainingData, GEPIndividual ind)

      {

        double RAE = RAErawFitness(useTrainingData, ind, 0);

        // raw fitness is normalized between 0 and 1000  (1000 * (1/(1+RAE))

        return (1000.0)/(1.0+RAE);

      }

 

 

      /**

       * The max value for this type of fitness is always 1000.

       * @param ind the GEP individual that needs its fitness calculated.

       * @return value 1000.0

       */

      public static double RAEmaxfitness(GEPIndividual ind)

      {

            // always 1000

            return 1000.0;

      }

 

 The code is fairly straight forward. the only thing to note is that at one point we accessed the sum of the absolute error of the values of the dependent variable data in the training set using - GEPDependentVariable.getDependentVariableSumOfAbsoluteError(). When the training and testing data for the program is obtained, a number of values are calculated and stored statically in the class GEPDependentVariable. This is so these values, which are used by several fitness functions, don't have to be recalculated many times as the evolutionary process measures the fitness of many individuals. The values available for use in the fitness functions are:

 

    /** Mean value of the training values for the dependent variable */
    static double dependentVariableMean = 0.0;
    /** Sum of the absolute error between dependent variables training values and their mean value. The
     *  absolute error is Abs(dv[i] - dvMean).
     */
    static double dependentVariableSumOfAbsoluteError = 0.0;
    /** Sum of the squares of the absolute errors between dependent variables training values and their mean
     *  value. The absolute error is Abs(dv[i] - dvMean).
     */
    static double dependentVariableSumOfSquaredAbsoluteError = 0.0;
    /** Sum of the relative errors between dependent variables training values and their mean value.
     *  The relative error is Abs((dv[i] - dvMean)/dvMean).
     */
    static double dependentVariableSumOfRelativeError = 0.0;
    /** Sum of the squares relative errors between dependent variables training values and their mean value.
     *  The relative error is Abs((dv[i] - dvMean)/dvMean).
     */
    static double dependentVariableSumOfSquaredRelativeError = 0.0;

These values are available for both the training data as shown above and for the test data as well. See the class ec.gep.GEPDependentVariable.