Code and Documentation:


Source Code for Evolutionary Feature Construction (EFC) is available here: EFC_Source.zip

File Descriptions for Code:

AMPProblem.java: This code reads the training data files in FASTA format (positives and negatives, separately). It evaluates each GP feature tree by interpreting it on every positive and negative peptide, using the fitness function described in the paper.
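
As a rough illustration of the file-reading step only, the sketch below parses FASTA-formatted text into a list of peptide sequences. It is not taken from AMPProblem.java: the class name FastaSketch and its methods are hypothetical, and the fitness evaluation is left out because it is defined in the paper.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the EFC source.
class FastaSketch {

    // Parses FASTA lines: a line starting with '>' is a header that begins a
    // new record; all other non-empty lines are joined into that record's sequence.
    static List<String> parse(List<String> lines) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (line.startsWith(">")) {
                if (current != null && current.length() > 0) {
                    sequences.add(current.toString());
                }
                current = new StringBuilder();
            } else {
                if (current == null) {
                    current = new StringBuilder(); // tolerate a missing first header
                }
                current.append(line);
            }
        }
        if (current != null && current.length() > 0) {
            sequences.add(current.toString());
        }
        return sequences;
    }

    // Reads one FASTA file from disk (UTF-8) and returns its sequences.
    static List<String> readFile(Path file) throws IOException {
        return parse(Files.readAllLines(file));
    }
}
```

Under these assumptions, the positive and negative sets would be loaded with two separate calls to readFile, one per training file.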

HallOfFame.java and StatisticalPluginForHallOfFame.java: Code in these files collects the top features from each generation and stores them in the hall of fame. For persistence, the hall of fame is written to a file.
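
The idea of keeping the top features from each generation can be sketched as follows. This is an illustration under assumptions, not the code in HallOfFame.java: the class HallOfFameSketch, the representation of a feature as a string, the convention that higher fitness is better, and the tab-separated output layout are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not part of the EFC source.
class HallOfFameSketch {
    private final int topPerGeneration;
    private final List<String> lines = new ArrayList<>();

    HallOfFameSketch(int topPerGeneration) {
        this.topPerGeneration = topPerGeneration;
    }

    // Adds the best features of one generation.
    // Assumption: higher fitness is better; features.get(i) has fitness fitnesses.get(i).
    void addGeneration(int generation, List<String> features, List<Double> fitnesses) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < features.size(); i++) {
            order.add(i);
        }
        // Sort indices by fitness, best first (stable, so ties keep input order).
        order.sort(Comparator.comparingDouble((Integer i) -> fitnesses.get(i)).reversed());
        int keep = Math.min(topPerGeneration, order.size());
        for (int k = 0; k < keep; k++) {
            int i = order.get(k);
            lines.add(generation + "\t" + fitnesses.get(i) + "\t" + features.get(i));
        }
    }

    // Returns one "generation<TAB>fitness<TAB>feature" line per stored feature.
    List<String> toLines() {
        return new ArrayList<>(lines);
    }

    // Writes the hall of fame to a file so it survives the run.
    void save(Path file) throws IOException {
        Files.write(file, lines);
    }
}
```

Features are stored with their generation number so that the file records when each one was found; whether the original code does this is not stated here.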

AMPSequenceFeatureInterpreter.java: This code takes the features in the hall of fame and runs them against the training/test data files to prepare data files that can be used to train a machine learning model. The data files are in the format of dtbsvm files, i.e., featurenumber:boolean (1/0). The code also cleans up some features, removing trivial redundancies.
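
To make the featurenumber:boolean layout concrete, here is a minimal sketch of writing one data row and of dropping repeated features. It is not the code in AMPSequenceFeatureInterpreter.java. The class FeatureRowSketch is hypothetical, and so are three choices it makes: feature numbers start at 1, each row begins with a class label, and "trivial redundancies" is read as exact duplicate feature strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical sketch, not part of the EFC source.
class FeatureRowSketch {

    // Formats one peptide as "label 1:v 2:v ...", where v is 1 if the
    // feature matched the peptide and 0 otherwise.
    // Assumptions: a leading class label and 1-based feature numbers.
    static String formatRow(int label, boolean[] featureValues) {
        StringBuilder sb = new StringBuilder();
        sb.append(label);
        for (int i = 0; i < featureValues.length; i++) {
            sb.append(' ').append(i + 1).append(':').append(featureValues[i] ? 1 : 0);
        }
        return sb.toString();
    }

    // Removes exact duplicate features while keeping first-seen order.
    // Assumption: this is only one possible reading of "trivial redundancies".
    static List<String> removeDuplicates(List<String> features) {
        return new ArrayList<>(new LinkedHashSet<>(features));
    }
}
```

For example, a positive peptide that matches the first and third of three features would be written as "1 1:1 2:0 3:1" under these assumptions.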

amp.params: This is the parameter file used to set the mutation and crossover rates in EFC, the number of non-terminals and terminals, the constraints on their values, and various tuning parameters, such as maximum positions and training file locations.