Fairkit-learn: A fairness evaluation and comparison toolkit

Overview

Modern software relies heavily on data and machine learning, and affects decisions that shape our world. Unfortunately, recent studies have shown that because of biases in data, software systems frequently inject bias into their decisions, from producing better closed caption transcriptions of men’s voices than of women’s voices to overcharging people of color for financial loans. To address bias in machine learning, data scientists need tools that help them understand the trade-offs between model quality and fairness in their specific data domains. Toward that end, we present fairkit-learn, a toolkit for helping data scientists reason about and understand fairness. Fairkit-learn works with state-of-the-art machine learning tools and uses the same interfaces to ease adoption. It can evaluate thousands of models produced by multiple machine learning algorithms, hyperparameters, and data permutations, and compute and visualize a small Pareto-optimal set of models that describe the optimal trade-offs between fairness and quality. We evaluate fairkit-learn via a user study with 54 students, showing that students using fairkit-learn produce models that provide a better balance between fairness and quality than students using the scikit-learn and IBM AI Fairness 360 toolkits. With fairkit-learn, users can select models that are up to 67% more fair and 10% more accurate than the models they are likely to train with scikit-learn.

Collaborators

Brittany Johnson
Jesse Bartola, Hubspot
Rico Angell, University of Massachusetts Amherst
Sam Witty, University of Massachusetts Amherst
Stephen Giguere, University of Texas at Austin
Yuriy Brun, University of Massachusetts Amherst

Study Materials

Tools Under Investigation

Our study compared our prototype tool, fairkit-learn, to two other state-of-the-art machine learning tools: scikit-learn and AI Fairness 360.

Scikit-learn
Scikit-learn is a machine learning toolkit that provides algorithms and metrics for training and evaluating machine learning models. While scikit-learn supports model evaluation using metrics such as accuracy, precision, and recall, it does not support training or evaluating models for fairness. It also has no built-in support for exploring the space of machine learning model configurations, and it evaluates models by only one metric at a time.
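To make the quality metrics named above concrete, the following is a minimal plain-Python sketch of what accuracy, precision, and recall compute for a binary classifier. It does not call scikit-learn; the function names and the example labels are illustrative assumptions, not part of the study materials.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def precision(y_true, y_pred):
    """Fraction of predicted positives that are truly positive."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    """Fraction of true positives that the model recovers."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

# Each metric is a separate call that yields a single number.
print(accuracy(y_true, y_pred))
print(precision(y_true, y_pred))
print(recall(y_true, y_pred))
```

Each function returns one score in isolation, which mirrors the one-metric-at-a-time workflow described above: comparing models across several metrics is left to the user.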

AI Fairness 360
IBM AI Fairness 360 provides datasets, models, algorithms, and metrics that pertain to machine learning model fairness. Along with this large set of functionalities, the website provides detailed documentation and examples for using the various components of the toolkit. AI Fairness 360 is built using scikit-learn and, like scikit-learn, does not provide built-in support for exploring the space of models and configurations, nor does it support evaluating trade-offs between multiple metrics.
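As an illustration of the kind of fairness metric discussed here, the sketch below computes two common group-fairness measures, statistical parity difference and disparate impact, in plain Python. It is not AI Fairness 360's API; the function names, the group labels "a" and "b", and the example predictions are assumptions made for this example.

```python
def positive_rate(y_pred, groups, value):
    """Share of favorable (1) predictions among members of one group."""
    selected = [p for p, g in zip(y_pred, groups) if g == value]
    if not selected:
        raise ValueError(f"no samples for group {value!r}")
    return sum(selected) / len(selected)


def statistical_parity_difference(y_pred, groups, unprivileged, privileged):
    """Unprivileged positive rate minus privileged positive rate.

    0 means both groups receive favorable predictions at the same rate;
    negative values mean the unprivileged group is favored less often.
    """
    return (positive_rate(y_pred, groups, unprivileged)
            - positive_rate(y_pred, groups, privileged))


def disparate_impact(y_pred, groups, unprivileged, privileged):
    """Ratio of unprivileged to privileged positive rates (1 means parity).

    Raises ZeroDivisionError if the privileged group has no favorable
    predictions.
    """
    return (positive_rate(y_pred, groups, unprivileged)
            / positive_rate(y_pred, groups, privileged))


# Made-up predictions: group "a" is treated as privileged, "b" as unprivileged.
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

spd = statistical_parity_difference(y_pred, groups, "b", "a")
di = disparate_impact(y_pred, groups, "b", "a")
print(spd, di)
```

In this toy data the privileged group receives the favorable outcome 75% of the time and the unprivileged group 25% of the time, so the difference is negative and the ratio is well below 1.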

Fairkit-learn
Fairkit-learn is an open-source Python toolkit designed to help data scientists evaluate and explore machine learning models with respect to quality and fairness metrics simultaneously. Fairkit-learn builds on top of scikit-learn and AI Fairness 360 and supports all metrics and learning algorithms available in both. It also supports all of the bias-mitigating pre- and post-processing algorithms available in AI Fairness 360, and provides extension points for adding more metrics and algorithms.
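The Overview describes selecting a small Pareto-optimal set of models that captures the best trade-offs between fairness and quality. The sketch below shows one simple way such a set could be computed once every candidate model has been scored. It is not fairkit-learn's implementation or API; the function names, the model labels, and the scores are hypothetical, and every metric is assumed to be oriented so that higher is better.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b on every metric
    and strictly better on at least one (higher is better)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(scores):
    """Return the names of models that no other model dominates.

    scores maps a model name to a tuple of metric values,
    here (accuracy, fairness).
    """
    return [
        name
        for name, s in scores.items()
        if not any(dominates(other, s)
                   for other_name, other in scores.items()
                   if other_name != name)
    ]


# Hypothetical (accuracy, fairness) scores for five candidate models.
candidates = {
    "model_a": (0.90, 0.60),
    "model_b": (0.85, 0.80),
    "model_c": (0.80, 0.75),  # dominated by model_b
    "model_d": (0.70, 0.95),
    "model_e": (0.88, 0.55),  # dominated by model_a
}

front = pareto_front(candidates)
print(front)
```

Only the non-dominated models remain, so a user can choose among them according to how much accuracy they are willing to trade for fairness. This pairwise comparison is quadratic in the number of models, which is adequate for a sketch but may need a faster approach when thousands of configurations are evaluated.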

Task Notebooks

Each task notebook includes a walkthrough tutorial with instructions on how to use the tool to complete various tasks, followed by a set of tasks to complete using that tool. Each file is a Jupyter notebook that can be opened and run in any Jupyter notebook environment.