{ "cells": [ { "cell_type": "markdown", "metadata": { "comet_cell_id": "b4bcefbe0a41f" }, "source": [ "Before beginning Task 3, make sure to run the following cell to import all necessary packages. If you need any additional packages, add the import statement(s) to the cell below and re-run it before running any code that uses them.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "comet_cell_id": "2c7cd3e562e7" }, "outputs": [], "source": [ "# Load all necessary packages\n", "import numpy as np\n", "import sklearn as skl\n", "import six\n", "\n", "# dataset\n", "from aif360.datasets import AdultDataset\n", "\n", "# models\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.svm import SVC\n", "\n", "# metric\n", "from sklearn.metrics import accuracy_score" ] }, { "cell_type": "markdown", "metadata": { "comet_cell_id": "46991f420accb" }, "source": [ "# Tutorial 3: scikit-learn\n", "\n", "Now we show you how to train and evaluate models using scikit-learn. You will use the knowledge from this tutorial to complete Task 3, so please read thoroughly and execute the code cells in order.\n", "\n", "## Step 1: Import the dataset\n", "\n", "First we need to import the dataset we will use for training and testing our model.\n", "\n", "Below, we provide code that imports the Adult dataset. **Note: a warning may pop up when you run this cell. 
As long as you don't see any errors in the code, it is fine to continue.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "comet_cell_id": "936840797dfba" }, "outputs": [], "source": [ "data_orig = AdultDataset()" ] }, { "cell_type": "markdown", "metadata": { "comet_cell_id": "1f7cb8ab1c822" }, "source": [ "## Step 2: Split the dataset into train and test data\n", "\n", "Now that the dataset has been imported, we need to split it into training and test data.\n", "\n", "The code to do so is as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "comet_cell_id": "8f3e98f0712d1" }, "outputs": [], "source": [ "data_orig_train, data_orig_test = data_orig.split([0.7], shuffle=False)" ] }, { "cell_type": "markdown", "metadata": { "comet_cell_id": "8e6392efce817" }, "source": [ "## Step 3: Initialize model\n", "\n", "Next, we need to initialize our model. We can initialize a model by passing the default parameter values explicitly (see the documentation), by passing no parameters (which also uses the default values), or by modifying parameter values.\n", "\n", "For the tutorial, we use the Logistic Regression model with default hyper-parameter values; you will be able to use any of the scikit-learn models listed above, and modify hyper-parameter values, when completing the exercise.\n", "\n", "Below we provide code for initializing the Logistic Regression model with default hyper-parameter values. We also provide (commented) code that reminds you how to initialize each model available during this exercise." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "comet_cell_id": "e89b66337a2a6" }, "outputs": [], "source": [ "# model is populated with default values; modifying parameters is allowed but optional\n", "model = LogisticRegression(penalty='l2', dual=False,tol=0.0001,C=1.0,\n", " fit_intercept=True,intercept_scaling=1,class_weight=None,\n", " random_state=None,solver='liblinear',max_iter=100, \n", " multi_class='warn',verbose=0,warm_start=False,\n", " n_jobs=None)\n", "\n", "#model = KNeighborsClassifier(n_neighbors=5,weights='uniform',algorithm='auto',\n", "# leaf_size=30,p=2,metric='minkowski',metric_params=None,\n", "# n_jobs=None)\n", "\n", "#model = RandomForestClassifier(n_estimators='warn',criterion='gini',max_depth=None,\n", "# min_samples_leaf=1,min_weight_fraction_leaf=0.0,\n", "# min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, \n", "# random_state=None, verbose=0, warm_start=False, class_weight=None)\n", "\n", "#model = SVC(C=1.0, kernel='rbf', degree=3, gamma='auto_deprecated', coef0=0.0, shrinking=True, \n", "# probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, \n", "# max_iter=-1, decision_function_shape='ovr', random_state=None)" ] }, { "cell_type": "markdown", "metadata": { "comet_cell_id": "7bba44eb7e995" }, "source": [ "## Step 4: Train the model\n", "\n", "After initializing the model, we train it using the training dataset.\n", "\n", "Below we provide code that prepares our dataset to be used with scikit-learn and trains the model using our prepared data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "comet_cell_id": "470c3e83a0934" }, "outputs": [], "source": [ "# prepare data for use with scikit-learn\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "scaler = StandardScaler()\n", "\n", "x_train = scaler.fit_transform(data_orig_train.features)\n", "y_train = data_orig_train.labels.ravel()\n", "\n", "\n", "model.fit(x_train, y_train)" ] }, { "cell_type": "markdown", "metadata": { "comet_cell_id": "483779469d731" }, "source": [ "## Step 5: Evaluate the model\n", "\n", "Now we're ready to evaluate the trained model on the test data using the accuracy metric provided by scikit-learn.\n", "\n", "Below we provide code that shows how to evaluate a model's performance using scikit-learn." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "comet_cell_id": "f3c98baf23fd4" }, "outputs": [], "source": [ "# reuse the scaler fitted on the training data; do not refit it on the test data\n", "x_test = scaler.transform(data_orig_test.features)\n", 
"\n", "predictions = model.predict(x_test)\n", "accuracy = accuracy_score(data_orig_test.labels.ravel(), predictions)\n", "\n", "print ('Accuracy = ' + str(accuracy))\n" ] }, { "cell_type": "markdown", "metadata": { "comet_cell_id": "f42e810454232" }, "source": [ "# Task 3: Model evaluation with scikit-learn\n", "\n", "Your turn! Use what you learned in the above tutorial to train and evaluate models for performance, fairness, and overall quality. 