HierCost - Hierarchical Cost Sensitive Classification

HierCost - Hierarchical Cost Sensitive Learning

HierCost is a toolkit for supervised single-label and multi-label hierarchical classification, built on a cost-sensitive logistic regression classifier and written in Python. It consists of the following scripts:

train.py is the program to learn "n" one-versus-rest classification models, where "n" is the number of terminal labels in the hierarchical data set.

predict.py is the program to predict the class labels given a test data set.

The source code/executables can be downloaded here: HierCost Source

Command Line Interface for train.py

Usage

python train.py [-h] -d DATASET -f FEATURES -t HIERARCHY -m MODEL_DIR [-c COST_TYPE] [-r RHO] [-u] [-i] [-n NODES]

Input/Options

-h, --help

Print help and exit.

-d DATASET, --dataset DATASET

Location of the training dataset file in LibSVM format (see file formats).

-f FEATURES, --features FEATURES

Integer value representing the number of training features.

-t HIERARCHY, --hierarchy HIERARCHY

Hierarchy in edge-list format (see file formats).

-m MODEL_DIR, --model_dir MODEL_DIR

Directory/Folder where the model output files are saved. Any existing files will be overwritten.

-c COST_TYPE, --cost_type COST_TYPE

Cost type refers to the strategy used to derive costs from the hierarchy. Valid values for COST_TYPE are:
lr -- Standard Logistic Regression (Default)
trd -- Tree Distance
nca -- Number of Common Ancestors
etrd -- Exponentiated Tree Distance
See Reference for further explanation.

-r RHO, --rho RHO

Value of the regularization parameter, which should be a positive floating-point number. Default = 1.

-u, --multi

Train models for multi-label classification. Default is single-label classification.

-i, --imbalance

Include imbalance costs.

See Reference for further explanation.

-n NODES, --nodes NODES

Comma-separated list of nodes to train (no spaces around commas). By default, models are trained for all the leaf nodes.
E.g. -n 2,1,33
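
The hierarchy-based cost types (trd, nca, etrd) are named after two quantities measured on the tree: the distance between two nodes and the number of ancestors they share. The sketch below computes only these raw quantities on a small hypothetical hierarchy. It is not HierCost's implementation, and it does not reproduce how the toolkit converts them into costs (see Reference for that). The function names and the convention that a node's root-to-node path includes the node itself are assumptions made for illustration.

```python
def path_from_root(node, parent):
    """Return the nodes on the path from the root down to `node`, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path[::-1]


def common_ancestors(a, b, parent):
    """Count the nodes shared by the root-to-a and root-to-b paths."""
    count = 0
    for x, y in zip(path_from_root(a, parent), path_from_root(b, parent)):
        if x != y:
            break
        count += 1
    return count


def tree_distance(a, b, parent):
    """Number of edges on the unique path between a and b in the tree."""
    depth_a = len(path_from_root(a, parent))
    depth_b = len(path_from_root(b, parent))
    return depth_a + depth_b - 2 * common_ancestors(a, b, parent)


# Hypothetical hierarchy stored as child -> parent:
# 1 is the root, 2 and 3 are its children, 4 and 5 are children of 2.
parent = {2: 1, 3: 1, 4: 2, 5: 2}

print(tree_distance(4, 5, parent))     # 2 (4 -> 2 -> 5)
print(tree_distance(4, 3, parent))     # 3 (4 -> 2 -> 1 -> 3)
print(common_ancestors(4, 5, parent))  # 2 (nodes 1 and 2)
```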

Output

For each node in the hierarchy for which a model is trained, the program outputs a model to a file with the name <node_id>.p in the directory provided by the file system path MODEL_DIR.
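
As an illustration of how the options above combine, the following sketch assembles a train.py invocation as an argument list. The helper name build_train_command and the file names (train.txt, hierarchy.txt, models) are hypothetical; only the flags come from the usage line above.

```python
VALID_COST_TYPES = {"lr", "trd", "nca", "etrd"}


def build_train_command(dataset, features, hierarchy, model_dir,
                        cost_type=None, rho=None, multi=False,
                        imbalance=False, nodes=None):
    """Assemble the argument list for train.py from the documented options."""
    if cost_type is not None and cost_type not in VALID_COST_TYPES:
        raise ValueError("unknown cost type: %r" % cost_type)
    cmd = ["python", "train.py",
           "-d", dataset, "-f", str(features),
           "-t", hierarchy, "-m", model_dir]
    if cost_type is not None:
        cmd += ["-c", cost_type]
    if rho is not None:
        cmd += ["-r", str(rho)]
    if multi:
        cmd.append("-u")
    if imbalance:
        cmd.append("-i")
    if nodes:
        # -n expects a comma-separated list with no spaces.
        cmd += ["-n", ",".join(str(n) for n in nodes)]
    return cmd


# Hypothetical paths and parameter values.
cmd = build_train_command("train.txt", 1000, "hierarchy.txt", "models",
                          cost_type="etrd", rho=0.1,
                          imbalance=True, nodes=[2, 1, 33])
print(" ".join(cmd))
# python train.py -d train.txt -f 1000 -t hierarchy.txt -m models -c etrd -r 0.1 -i -n 2,1,33
```

The list can be passed to subprocess.run(cmd, check=True) from the directory containing train.py. With the node list above, the models would be expected as 2.p, 1.p and 33.p inside the models directory.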

Command Line Interface for predict.py

Usage

python predict.py [-h] -d DATASET -f FEATURES -t HIERARCHY -m MODEL_DIR [-u] -p PRED_PATH

Input/Options

-h, --help

Print help and exit.

-d DATASET, --dataset DATASET

Location of the test dataset file in LibSVM format (see file formats).

-f FEATURES, --features FEATURES

Integer value representing the number of features. This should match the value used in training.

-t HIERARCHY, --hierarchy HIERARCHY

Hierarchy in edge-list format (see file formats).

-m MODEL_DIR, --model_dir MODEL_DIR

Directory/Folder containing the model files saved by the training script.

-u, --multi

Predict in multi-label mode. Set this flag if and only if the models were trained with -u; the mode must match the one used in training.

-p PRED_PATH, --pred_path PRED_PATH

File location for the predicted output (see file formats).
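
A matching sketch for prediction is shown below. It reuses the same feature count, hierarchy, model directory and -u setting that were used for training, since those must agree between the two scripts. The helper name build_predict_command and all file names are hypothetical.

```python
def build_predict_command(dataset, features, hierarchy, model_dir,
                          pred_path, multi=False):
    """Assemble the argument list for predict.py from the documented options."""
    cmd = ["python", "predict.py",
           "-d", dataset, "-f", str(features),
           "-t", hierarchy, "-m", model_dir]
    if multi:
        # Only pass -u when the models were trained with -u.
        cmd.append("-u")
    cmd += ["-p", pred_path]
    return cmd


# Hypothetical settings shared between training and prediction.
shared = {"features": 1000, "hierarchy": "hierarchy.txt",
          "model_dir": "models", "multi": True}

cmd = build_predict_command("test.txt", pred_path="predictions.txt", **shared)
print(" ".join(cmd))
# python predict.py -d test.txt -f 1000 -t hierarchy.txt -m models -u -p predictions.txt
```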

File Formats

DATASET

The dataset for training/testing should be provided in LibSVM format. For a multi-label dataset, option -u must be set.

Format for input file:

<label1,label2,...> <index1>:<value1> <index2>:<value2> ...

Example for input file (Single Label):

1 1:0.01 2:1.5 3:1.25 
2 1:1.1 4:5.5 
 ...

Example for input file (Multi Label):

1,2 1:0.01 2:1.5 3:1.25 
2,4,5 1:1.1 4:5.5 
 ...
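
To make the line layout concrete, here is a minimal parsing sketch in plain Python. It is not part of HierCost; the function name parse_line is hypothetical, and the code assumes integer labels and feature indices, as in the examples above.

```python
def parse_line(line):
    """Split one dataset line into (labels, features).

    labels   -- list of integer labels (one entry for single-label data)
    features -- dict mapping feature index to feature value
    """
    label_field, *pairs = line.split()
    labels = [int(label) for label in label_field.split(",")]
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return labels, features


labels, features = parse_line("2,4,5 1:1.1 4:5.5")
print(labels)    # [2, 4, 5]
print(features)  # {1: 1.1, 4: 5.5}
```

The same function handles single-label lines, where the label list has exactly one entry.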

HIERARCHY

Hierarchy is a text file representing the hierarchy in edge-list format. Each line of the file represents an edge between a parent and child node.

Format for hierarchy:

parent_node_id child_node_id
parent_node_id child_node_id
parent_node_id child_node_id
 ...

Example for hierarchy:
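
As an illustration only, the sketch below reads a small edge list in this format and finds its leaf nodes, which are the nodes train.py trains models for by default. The node IDs are hypothetical and not taken from any HierCost data set; the function names are also assumptions, and the code assumes whitespace-separated integer IDs.

```python
# Hypothetical hierarchy: root 1 with children 2 and 3;
# node 2 has children 4 and 5.
HIERARCHY_TEXT = """\
1 2
1 3
2 4
2 5
"""


def read_edges(lines):
    """Parse 'parent_node_id child_node_id' lines into (parent, child) pairs."""
    edges = []
    for line in lines:
        if line.strip():
            parent, child = line.split()
            edges.append((int(parent), int(child)))
    return edges


def leaf_nodes(edges):
    """Return the nodes that appear as a child but never as a parent."""
    parents = {p for p, _ in edges}
    children = {c for _, c in edges}
    return sorted(children - parents)


edges = read_edges(HIERARCHY_TEXT.splitlines())
print(leaf_nodes(edges))  # [3, 4, 5]
```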

PREDICTIONS

Predictions are saved in a text file. Each line contains the predicted labels for the corresponding instance from the test data set.

Example for single-label prediction:

1
2
1
1
3
...

Example for multi-label prediction:

1
1,2
1,3
1
3,5
...
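
Because each prediction line corresponds to one test instance, the file can be compared line by line against the true labels. The sketch below is not part of HierCost: the function names and the true labels are hypothetical, and exact-match rate is used only as a simple illustrative score.

```python
def parse_predictions(lines):
    """Turn prediction lines such as '1,2' into sets of integer labels."""
    return [{int(label) for label in line.strip().split(",")}
            for line in lines if line.strip()]


def exact_match_rate(predicted, actual):
    """Fraction of instances whose predicted label set equals the true set."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(predicted)


predicted = parse_predictions(["1", "1,2", "1,3", "1", "3,5"])
actual = [{1}, {1, 2}, {3}, {1}, {3, 5}]  # hypothetical true labels

print(exact_match_rate(predicted, actual))  # 0.8
```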

Contact Information

If you have any questions or problems with HierCost please send an email to acharuva@gmu.edu.

Citing HierCost

When citing HierCost in your papers, please use the following reference:

"HierCost: Improving Large Scale Hierarchical Classification with Cost Sensitive Learning". Anveshi Charuvaka and Huzefa Rangwala. Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2015, Porto, Portugal.

Copyright and License Information
HierCost is primarily written and maintained by Anveshi Charuvaka (George Mason University) and is copyrighted by George Mason University. It can be freely used for educational and research purposes by non-profit institutions and US government agencies only. Other organizations are allowed to use HierCost only for evaluation purposes, and any further uses will require prior approval.
The software may not be sold or redistributed without prior approval. One may make copies of the software for their use provided that the copies are not sold or distributed and are used under the same terms and conditions.

As unestablished research software, this code is provided on an "as is" basis without warranty of any kind, either expressed or implied. Downloading or executing any part of this software constitutes an implicit agreement to these terms. These terms and conditions are subject to change at any time without prior notice.
Funding Provided by NSF Grants IIS 1252318 and 0905117