HierCost - Hierarchical Cost Sensitive Learning


The HierCost toolkit is a set of Python programs for supervised single-label and multi-label hierarchical classification, based on a cost-sensitive logistic regression classifier. The train.py and predict.py scripts are documented below.
The source code/executables can be downloaded here: HierCost Source

Command Line Interface for train.py

Usage
python train.py [-h] -d DATASET -f FEATURES -t HIERARCHY -m MODEL_DIR
[-c COST_TYPE] [-r RHO] [-u] [-i] [-n NODES]
Input/Options
-h, --help
Print help and exit.
-d DATASET, --dataset DATASET
Location of the training dataset file in LibSVM format (see file formats).
-f FEATURES, --features FEATURES
Integer value representing the number of training features.
-t HIERARCHY, --hierarchy HIERARCHY
Hierarchy in edge-list format (see file formats).
-m MODEL_DIR, --model_dir MODEL_DIR
Directory/Folder where the model output files are saved. Any existing files will be overwritten.
-c COST_TYPE, --cost_type COST_TYPE
Strategy used to derive misclassification costs from the hierarchy. Valid values for COST_TYPE are:
lr   -- Standard Logistic Regression (default)
trd  -- Tree Distance
nca  -- Number of Common Ancestors
etrd -- Exponentiated Tree Distance

See Reference for further explanation.
-r RHO, --rho RHO
Value of regularization parameter, which should be a positive floating point value. Default = 1.
-u, --multi
Train models for multi-label classification. Default is single-label classification.
-i, --imbalance
Include imbalance costs.

See Reference for further explanation.
-n NODES, --nodes NODES
Comma-separated list of node ids to train models for (no spaces around commas). By default, models are trained for all the leaf nodes.
E.g. -n 2,1,33
Output
For each node in the hierarchy for which a model is trained, the program writes the model to a file named <node_id>.p in the directory MODEL_DIR.
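
As an illustration of how the options above combine, the following Python sketch assembles a train.py command line. The helper name build_train_command and the file names train.txt, hierarchy.txt and models are hypothetical placeholders, not part of the toolkit; only the flags come from the usage text above.

```python
import sys


def build_train_command(dataset, features, hierarchy, model_dir,
                        cost_type="lr", rho=1.0, multi=False,
                        imbalance=False, nodes=None):
    """Assemble the argument list for train.py (hypothetical helper)."""
    cmd = [sys.executable, "train.py",
           "-d", dataset, "-f", str(features),
           "-t", hierarchy, "-m", model_dir,
           "-c", cost_type, "-r", str(rho)]
    if multi:
        cmd.append("-u")   # multi-label training
    if imbalance:
        cmd.append("-i")   # include imbalance costs
    if nodes:
        # -n expects a comma-separated list with no spaces around commas
        cmd += ["-n", ",".join(str(n) for n in nodes)]
    return cmd


if __name__ == "__main__":
    # File names below are placeholders for illustration only.
    cmd = build_train_command("train.txt", 1000, "hierarchy.txt", "models",
                              cost_type="etrd", imbalance=True,
                              nodes=[2, 1, 33])
    print(" ".join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.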

Command Line Interface for predict.py

Usage
python predict.py [-h] -d DATASET -f FEATURES -t HIERARCHY -m MODEL_DIR [-u]
                  -p PRED_PATH
Input/Options
-h, --help
Print help and exit.
-d DATASET, --dataset DATASET
Location of the test dataset file in LibSVM format (see file formats).
-f FEATURES, --features FEATURES
Integer value representing the number of training features.
-t HIERARCHY, --hierarchy HIERARCHY
Hierarchy in edge-list format (see file formats).
-m MODEL_DIR, --model_dir MODEL_DIR
Directory/Folder containing the model files written by the training script.
-u, --multi
Predict in multi-label mode. Default is single-label. This setting must match the one used for training.
-p PRED_PATH, --pred_path PRED_PATH
File location for the predicted output (see file formats).
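
A minimal sketch of the prediction step, assuming models were already written to MODEL_DIR as <node_id>.p files. The helpers trained_node_ids and build_predict_command and all file names are hypothetical; they only wrap the flags listed above.

```python
import os
import sys
import tempfile


def trained_node_ids(model_dir):
    """Return the node ids that have a <node_id>.p file in model_dir."""
    return sorted(name[:-2] for name in os.listdir(model_dir)
                  if name.endswith(".p"))


def build_predict_command(dataset, features, hierarchy, model_dir,
                          pred_path, multi=False):
    """Assemble the argument list for predict.py (hypothetical helper)."""
    cmd = [sys.executable, "predict.py",
           "-d", dataset, "-f", str(features),
           "-t", hierarchy, "-m", model_dir,
           "-p", pred_path]
    if multi:
        cmd.append("-u")   # must match the setting used for training
    return cmd


if __name__ == "__main__":
    # Simulate a model directory with two model files (placeholders).
    with tempfile.TemporaryDirectory() as model_dir:
        for name in ("1.p", "33.p"):
            open(os.path.join(model_dir, name), "w").close()
        print("models found for nodes:", trained_node_ids(model_dir))
        cmd = build_predict_command("test.txt", 1000, "hierarchy.txt",
                                    model_dir, "pred.txt")
        print(" ".join(cmd))
```

Listing the model files first is a cheap way to confirm that training finished for the expected nodes before running prediction.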

File Formats

DATASET
The dataset for training/testing should be provided in LibSVM format. For a multi-label dataset, option -u must be set.
Format for input file:

<label1,label2,...> <index1>:<value1> <index2>:<value2> ... 

Example for input file (Single Label):

1 1:0.01 2:1.5 3:1.25 
2 1:1.1 4:5.5
...

Example for input file (Multi Label):

1,2 1:0.01 2:1.5 3:1.25 
2,4,5 1:1.1 4:5.5
...
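
The line format above can be read with a few lines of Python. This is a sketch of the format only, not the toolkit's own loader; the function names are hypothetical, and it assumes that labels and feature indices are integers.

```python
def parse_line(line):
    """Split one dataset line into (labels, features).

    labels   -- list of integer labels (one for single-label data)
    features -- dict mapping feature index to value (sparse row)
    """
    parts = line.split()
    labels = [int(tok) for tok in parts[0].split(",")]
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return labels, features


def largest_feature_index(lines):
    """Largest feature index seen; a lower bound for the -f value
    if indices are 1-based (an assumption, not stated in the text)."""
    return max(max(parse_line(line)[1]) for line in lines)


if __name__ == "__main__":
    rows = ["1,2 1:0.01 2:1.5 3:1.25", "2,4,5 1:1.1 4:5.5"]
    for row in rows:
        print(parse_line(row))
    print("largest feature index:", largest_feature_index(rows))
```

A single-label line such as `2 1:1.1 4:5.5` parses the same way and yields a one-element label list.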


HIERARCHY
Hierarchy is a text file representing the hierarchy in edge-list format. Each line of the file represents an edge between a parent and child node.
Format for hierarchy:

parent_node_id child_node_id
parent_node_id child_node_id
parent_node_id child_node_id
...

Example for hierarchy:

0 1
0 66
0 69
0 12
1 9
1 2
...
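
The sketch below loads an edge list like the one above, lists the leaf nodes (the nodes trained by default), and computes two graph quantities that the trd and nca cost types are named after. It only illustrates those quantities; how the toolkit turns them into costs is described in the reference, and all function names here are hypothetical.

```python
def load_hierarchy(lines):
    """Return (parent, children) maps from 'parent child' edge lines."""
    parent, children = {}, {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        p, c = fields
        parent[c] = p
        children.setdefault(p, []).append(c)
        children.setdefault(c, [])
    return parent, children


def leaf_nodes(children):
    """Nodes without children."""
    return [node for node, kids in children.items() if not kids]


def path_to_root(node, parent):
    """Nodes from `node` up to the root, including `node` itself."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def tree_distance(a, b, parent):
    """Number of edges on the path between a and b."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for j, n in enumerate(path_to_root(b, parent)):
        if n in steps_from_a:
            return steps_from_a[n] + j
    raise ValueError("nodes are not in the same tree")


def common_ancestors(a, b, parent):
    """Count of nodes shared by the two root paths. Each node is counted
    on its own path, which is a convention assumed for this sketch."""
    return len(set(path_to_root(a, parent)) & set(path_to_root(b, parent)))


if __name__ == "__main__":
    edges = ["0 1", "0 66", "0 69", "0 12", "1 9", "1 2"]
    parent, children = load_hierarchy(edges)
    print("leaves:", leaf_nodes(children))
    print("distance 9-2:", tree_distance("9", "2", parent))
    print("distance 9-66:", tree_distance("9", "66", parent))
    print("common ancestors 9-2:", common_ancestors("9", "2", parent))
```

Node ids are kept as strings so that non-numeric ids, if any, would also work.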

PREDICTIONS
Predictions are saved in a text file. Each line contains the predicted labels for the corresponding instance from the test dataset.
Example for single-label prediction:

1
2
1
1
3
...

Example for multi-label prediction:

1
1,2
1,3
1
3,5
...
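
A prediction file in this format can be compared against the true labels with a short script. The sketch below is not part of the toolkit: the function names are hypothetical, labels are assumed to be integers, and the exact-match rate is just one simple way to score the output.

```python
def parse_label_line(line):
    """Turn '1,3' into {1, 3}; works for single-label lines too."""
    return {int(tok) for tok in line.strip().split(",") if tok}


def read_predictions(lines):
    """One label set per non-blank line. Assumes every instance has at
    least one predicted label, so blank lines can be skipped safely."""
    return [parse_label_line(line) for line in lines if line.strip()]


def exact_match_rate(predicted, actual):
    """Fraction of instances whose predicted label set equals the truth."""
    if len(predicted) != len(actual):
        raise ValueError("prediction and label counts differ")
    if not predicted:
        return 0.0
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(predicted)


if __name__ == "__main__":
    predicted = read_predictions(["1", "1,2", "1,3", "1", "3,5"])
    actual = [{1}, {1, 2}, {3}, {1}, {3, 5}]  # made-up labels for the demo
    print("exact-match rate:", exact_match_rate(predicted, actual))
```

In practice the lines would come from iterating over the file given as PRED_PATH, and the true labels from the first field of each line of the test dataset.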

Contact Information

If you have any questions or problems with HierCost please send an email to acharuva@gmu.edu.

Citing HierCost

When citing HierCost in your papers, please use the following reference:

"HierCost: Improving Large Scale Hierarchical Classification with Cost Sensitive Learning". Anveshi Charuvaka and Huzefa Rangwala. Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2015, Porto, Portugal.

Copyright and License Information

HierCost is primarily written and maintained by Anveshi Charuvaka (George Mason University) and is copyrighted by George Mason University. It can be freely used for educational and research purposes by non-profit institutions and US government agencies only. Other organizations are allowed to use HierCost only for evaluation purposes, and any further uses will require prior approval.

The software may not be sold or redistributed without prior approval. One may make copies of the software for their own use, provided that the copies are not sold or distributed and are used under the same terms and conditions.

As unestablished research software, this code is provided on an "as is" basis without warranty of any kind, either expressed or implied. Downloading or executing any part of this software constitutes an implicit agreement to these terms. These terms and conditions are subject to change at any time without prior notice.

Funding provided by NSF Grants IIS 1252318 and 0905117.