<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Julio Cárdenas-Rodríguez</title>
    <description>I am a data scientist, passionate about helping people using mathematics, programming, and chemistry</description>
    <link>https://jdatascientist.silvrback.com/feed</link>
    <atom:link href="https://jdatascientist.silvrback.com/feed" rel="self" type="application/rss+xml"/>
    <category domain="jdatascientist.silvrback.com">Content Management/Blog</category>
    <language>en-us</language>
      <pubDate>Thu, 08 Feb 2018 11:43:56 -0700</pubDate>
    <managingEditor>jdatascientist@gmail.com (Julio Cárdenas-Rodríguez)</managingEditor>
      <item>
        <guid>https://jdatascientist.silvrback.com/a-simple-code-to-calculate-the-performance-metrics-of-binary-classifier#36602</guid>
          <pubDate>Thu, 08 Feb 2018 11:43:56 -0700</pubDate>
        <link>https://jdatascientist.silvrback.com/a-simple-code-to-calculate-the-performance-metrics-of-binary-classifier</link>
        <title>A simple code to calculate the performance metrics of a binary classifier</title>
        <description>Specificity, Sensitivity, PPV, and NPV</description>
        <content:encoded><![CDATA[<h3 id="1-code">1. Code</h3>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>

<span class="k">def</span> <span class="nf">perf_metrics_2X2</span><span class="p">(</span><span class="n">yobs</span><span class="p">,</span> <span class="n">yhat</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;</span>
<span class="sd">    Returns the specificity, sensitivity, positive predictive value, and negative predictive value </span>
<span class="sd">    of a 2X2 table.</span>

<span class="sd">    where:</span>
<span class="sd">    0 = negative case</span>
<span class="sd">    1 = positive case</span>

<span class="sd">    Parameters</span>
<span class="sd">    ----------</span>
<span class="sd">    yobs :  array of positive and negative ``observed`` cases</span>
<span class="sd">    yhat : array of positive and negative ``predicted`` cases</span>

<span class="sd">    Returns</span>
<span class="sd">    -------</span>
<span class="sd">    sensitivity  = TP / (TP+FN)</span>
<span class="sd">    specificity  = TN / (TN+FP)</span>
<span class="sd">    pos_pred_val = TP/ (TP+FP)</span>
<span class="sd">    neg_pred_val = TN/ (TN+FN)</span>

<span class="sd">    Author: Julio Cardenas-Rodriguez</span>
<span class="sd">    &quot;&quot;&quot;</span>
    <span class="n">TP</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="kp">sum</span><span class="p">(</span>  <span class="n">yobs</span><span class="p">[</span><span class="n">yobs</span><span class="o">==</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="n">yhat</span><span class="p">[</span><span class="n">yobs</span><span class="o">==</span><span class="mi">1</span><span class="p">]</span> <span class="p">)</span>
    <span class="n">TN</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="kp">sum</span><span class="p">(</span>  <span class="n">yobs</span><span class="p">[</span><span class="n">yobs</span><span class="o">==</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="n">yhat</span><span class="p">[</span><span class="n">yobs</span><span class="o">==</span><span class="mi">0</span><span class="p">]</span> <span class="p">)</span>
    <span class="n">FP</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="kp">sum</span><span class="p">(</span>  <span class="n">yobs</span><span class="p">[</span><span class="n">yobs</span><span class="o">==</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="n">yhat</span><span class="p">[</span><span class="n">yobs</span><span class="o">==</span><span class="mi">0</span><span class="p">]</span> <span class="p">)</span>
    <span class="n">FN</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="kp">sum</span><span class="p">(</span>  <span class="n">yobs</span><span class="p">[</span><span class="n">yobs</span><span class="o">==</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="n">yhat</span><span class="p">[</span><span class="n">yobs</span><span class="o">==</span><span class="mi">1</span><span class="p">]</span> <span class="p">)</span>

    <span class="n">sensitivity</span>  <span class="o">=</span> <span class="n">TP</span> <span class="o">/</span> <span class="p">(</span><span class="n">TP</span><span class="o">+</span><span class="n">FN</span><span class="p">)</span>
    <span class="n">specificity</span>  <span class="o">=</span> <span class="n">TN</span> <span class="o">/</span> <span class="p">(</span><span class="n">TN</span><span class="o">+</span><span class="n">FP</span><span class="p">)</span>
    <span class="n">pos_pred_val</span> <span class="o">=</span> <span class="n">TP</span><span class="o">/</span> <span class="p">(</span><span class="n">TP</span><span class="o">+</span><span class="n">FP</span><span class="p">)</span>
    <span class="n">neg_pred_val</span> <span class="o">=</span> <span class="n">TN</span><span class="o">/</span> <span class="p">(</span><span class="n">TN</span><span class="o">+</span><span class="n">FN</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">sensitivity</span><span class="p">,</span> <span class="n">specificity</span><span class="p">,</span> <span class="n">pos_pred_val</span><span class="p">,</span> <span class="n">neg_pred_val</span>
</pre></div>
<h3 id="2-test">2. Test</h3>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
<span class="n">y</span>     <span class="o">=</span>    <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">])</span>
<span class="n">y_hat</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">])</span>

<span class="n">metrics</span>  <span class="o">=</span>  <span class="n">perf_metrics_2X2</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">y_hat</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span> <span class="nb">dict</span><span class="p">(</span> <span class="n">Metric</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;Sensitivity&#39;</span><span class="p">,</span> <span class="s1">&#39;Specificity&#39;</span><span class="p">,</span> <span class="s1">&#39;PPV&#39;</span><span class="p">,</span><span class="s1">&#39;NPV&#39;</span><span class="p">],</span> 
                          <span class="n">Performance</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="n">metrics</span><span class="p">,</span><span class="mi">3</span><span class="p">))))</span>

        <span class="n">Metric</span>  <span class="n">Performance</span>
<span class="mi">0</span>  <span class="n">Sensitivity</span>        <span class="mf">0.800</span>
<span class="mi">1</span>  <span class="n">Specificity</span>        <span class="mf">0.600</span>
<span class="mi">2</span>          <span class="n">PPV</span>        <span class="mf">0.667</span>
<span class="mi">3</span>          <span class="n">NPV</span>        <span class="mf">0.750</span>
</pre></div>]]></content:encoded>
      </item>
      <item>
        <guid>https://jdatascientist.silvrback.com/how-to-create-an-alias-in-windows-powershell-to-launch-jupyter-and-python#35690</guid>
          <pubDate>Fri, 01 Dec 2017 09:14:38 -0700</pubDate>
        <link>https://jdatascientist.silvrback.com/how-to-create-an-alias-in-windows-powershell-to-launch-jupyter-and-python</link>
        <title>How to create an alias in Windows Powershell to launch Jupyter and Python</title>
        <description></description>
        <content:encoded><![CDATA[<h3 id="why-am-i-doing-this">- Why am I doing this?</h3>

<p>Running <code>Jupyter</code> and <code>PIP</code> becomes painful if you don&#39;t have admin privileges for your computer; a solution to this is to enter the entire path to <code>PIP</code> and/or <code>Jupyter</code>, but it can be very time consuming and inefficient. A way around this is to create an alias for the terminal to interpret a short command as if you were entering the entire path to <code>PIP</code> and/or <code>Jupyter</code>.</p>

<h3 id="1-find-the-location-of-your-executables">1. Find the location of your executables</h3>

<p>These are the  locations for <code>Python</code>, <code>pip</code>, and <code>jupyter</code> in my computer:</p>
<div class="highlight"><pre><span></span>&gt; ~<span class="se">\A</span>ppData<span class="se">\L</span>ocal<span class="se">\C</span>ontinuum<span class="se">\A</span>naconda3<span class="se">\p</span>ython.exe
&gt;~<span class="se">\A</span>ppData<span class="se">\L</span>ocal<span class="se">\C</span>ontinuum<span class="se">\A</span>naconda3<span class="se">\S</span>cripts<span class="se">\p</span>ip.exe
&gt; ~<span class="se">\A</span>ppData<span class="se">\L</span>ocal<span class="se">\C</span>ontinuum<span class="se">\A</span>naconda3<span class="se">\S</span>cripts<span class="se">\j</span>upyter-notebook.exe
</pre></div>
<h3 id="2-create-a-powershell-profile">2. Create a PowerShell profile</h3>

<p>Just type the following in your PowerShell terminal:</p>
<div class="highlight"><pre><span></span>&gt; New-Item -Type file -Force <span class="nv">$profile</span>
</pre></div>
<h3 id="3-find-the-location-of-your-profile-file">3. Find the location of your profile file</h3>

<p>For my computer the location is the following:</p>
<div class="highlight"><pre><span></span>&gt; <span class="se">\D</span>ocuments<span class="se">\W</span>indowsPowerShell<span class="se">\M</span>icrosoft.PowerShell_profile.ps1
</pre></div>
<h3 id="4-edit-your-profile-file-with-the-following-aliases">4. Edit your profile file with the following aliases</h3>
<div class="highlight"><pre><span></span>Set-Alias py   <span class="s2">&quot;~\AppData\Local\Continuum\Anaconda3\python.exe&quot;</span>
Set-Alias pip  <span class="s2">&quot;~\AppData\Local\Continuum\Anaconda3\Scripts\pip.exe&quot;</span>
Set-Alias jup  <span class="s2">&quot;~\AppData\Local\Continuum\Anaconda3\Scripts\jupyter-notebook.exe&quot;</span>
</pre></div>
<p>Remember that the path above are specific for my computer, you should update according to the file structure in your own computer. </p>

<h3 id="5-close-powershell-and-open-a-new-session">5. Close PowerShell and open a new session</h3>

<h3 id="6-test-the-aliases-in-powershell">6. Test the aliases in PowerShell</h3>

<p>For example, type the following to launch a Jupyter notebook in any directory:</p>
<div class="highlight"><pre><span></span>&gt; <span class="nb">cd</span> Documents
&gt; jup
</pre></div>]]></content:encoded>
      </item>
      <item>
        <guid>https://jdatascientist.silvrback.com/tuning-scikit-learn-parameters-using-optimization-instead-of-random-search#35366</guid>
          <pubDate>Mon, 13 Nov 2017 11:01:49 -0700</pubDate>
        <link>https://jdatascientist.silvrback.com/tuning-scikit-learn-parameters-using-optimization-instead-of-random-search</link>
        <title>Tuning scikit-learn parameters using optimization instead of random search</title>
        <description>Work smart not hard</description>
        <content:encoded><![CDATA[<p><img alt="Silvrback blog image_righ" class="sb_float" src="https://static1.squarespace.com/static/50baa49de4b0e51d69257e33/50baae28e4b078f69f31dbc5/50baae2ce4b078f69f31dc7d/1354411582995/" /></p>

<h2 id="the-problem">The problem</h2>

<p>In previous posts I described how to perform non-linear curve fitting in Python and Julia. At their core, non-linear and linear curve fitting (or regression) are optimization problems in which we find the parameters that minimize an objective function. The entire field of mathematical optimization is concerned with finding the most efficient and accurate methods to minimize such functions.<br>
On the other hand, the current standard to find the optimal values for the parameters of the algorithms used in machine learning is to perform a <code>random search</code> or a <code>grid search</code> throughout the space of the possible values that such parameters can take. These approaches have several limitations:</p>

<ol>
<li>They are not computationally efficient for large data sets</li>
<li>the parameters tested are not informed in any way by the results from the previous step.</li>
</ol>

<h2 id="the-solution">The solution</h2>

<p>However, the implementation of optimization-driven approaches for <code>scikit-learn</code> is not a trivial matter. Thankfully, <a href="http://www.eng.uwaterloo.ca/%7Ejbergstr/">James Bergstra</a> and other brave souls have created <code>hyperopt</code>, a <a href="http://hyperopt.github.io/hyperopt/">Python library</a> for optimizing over awkward search spaces with real-valued, discrete, and conditional dimensions, which makes it ideal for tuning hyper parameters with <code>scikit-learn</code>.</p>

<h2 id="what-we-need">What we need</h2>

<p>In order to tune the parameters of a <code>scikit-learn</code> estimator, <code>hyperopt</code> needs the following:<br>
1. Data<br>
2. The objective function to be minimized<br>
3. The search space from which to sample the parameters<br>
4. The algorithm to be used for the minimization of the objective function, and the number of times the optimization should be run</p>

<h2 id="python-implementation">Python implementation</h2>
<div class="highlight"><pre><span></span><span class="c1">#modules</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics.regression</span> <span class="kn">import</span> <span class="n">mean_absolute_error</span> <span class="k">as</span> <span class="n">mae</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">make_scorer</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span><span class="p">,</span> <span class="n">cross_val_score</span>
<span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">GradientBoostingRegressor</span>
<span class="kn">from</span> <span class="nn">hyperopt</span> <span class="kn">import</span> <span class="n">hp</span><span class="p">,</span> <span class="n">fmin</span><span class="p">,</span> <span class="n">tpe</span>
<span class="kn">from</span> <span class="nn">hyperopt.pyll</span> <span class="kn">import</span> <span class="n">scope</span>

<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>

<span class="c1"># hyperopt object for </span>
<span class="n">scope</span><span class="o">.</span><span class="n">define</span><span class="p">(</span><span class="n">GradientBoostingRegressor</span><span class="p">)</span>    

<span class="k">def</span> <span class="nf">train_GradientBoostingRegressor</span><span class="p">(</span><span class="n">Xdata</span><span class="p">,</span> <span class="n">Ydata</span><span class="p">,</span> <span class="n">loss</span><span class="o">=</span><span class="s1">&#39;ls&#39;</span> <span class="p">,</span><span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.50</span><span class="p">,</span> <span class="n">cv</span> <span class="o">=</span> <span class="mi">2</span><span class="p">,</span> <span class="n">n_steps</span> <span class="o">=</span> <span class="mi">10</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;</span>
<span class="sd">    Trains a Gradient Boosting Regressor using Bayesian optimization </span>
<span class="sd">    Parameters</span>
<span class="sd">    ----------</span>
<span class="sd">    Xdata: numpy array of size KxN and composed of floating and/or integers</span>
<span class="sd">    Ydata: numpy array of size K (1D array) of floating</span>
<span class="sd">    loss: loss function to be optimized.</span>
<span class="sd">    alpha: quantile for the quantile and Huber loss; floating &lt; 1.0 and &gt; 0.0</span>
<span class="sd">    CV: K-fold cross-validation size for the training procedure</span>
<span class="sd">    n_steps: Number of times the `hyperopt` minimizer will run to find the optimal parameters</span>

<span class="sd">    Returns</span>
<span class="sd">    -------</span>
<span class="sd">    Regressor : A scikit-learn object with the trained Gradient Boosting Regressor</span>

<span class="sd">    &quot;&quot;&quot;</span>
    <span class="c1">#split data</span>
    <span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">Xdata</span><span class="p">,</span> <span class="n">Ydata</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=.</span><span class="mi">33</span><span class="p">,</span> <span class="n">random_state</span> <span class="o">=</span> <span class="mi">42</span><span class="p">)</span>

    <span class="c1"># create and objective function</span>
    <span class="k">def</span> <span class="nf">objective_function_regression</span><span class="p">(</span><span class="n">estimator</span><span class="p">):</span>
        <span class="n">mae_array</span> <span class="o">=</span> <span class="n">cross_val_score</span><span class="p">(</span> <span class="n">estimator</span><span class="p">,</span> <span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span> <span class="n">cv</span><span class="p">,</span> <span class="n">n_jobs</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> 
                                    <span class="n">scoring</span> <span class="o">=</span> <span class="n">make_scorer</span><span class="p">(</span><span class="n">mae</span><span class="p">)</span> <span class="p">)</span>
        <span class="k">return</span> <span class="n">mae_array</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>


    <span class="c1"># search space</span>
    <span class="n">n_estimators</span>     <span class="o">=</span> <span class="n">hp</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="s1">&#39;n_estimators&#39;</span><span class="p">,</span><span class="mi">1000</span><span class="p">)</span> 
    <span class="n">learning_rate</span>    <span class="o">=</span> <span class="n">hp</span><span class="o">.</span><span class="n">loguniform</span><span class="p">(</span><span class="s1">&#39;learning_rate&#39;</span><span class="p">,</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">max_depth</span>        <span class="o">=</span> <span class="n">hp</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="s1">&#39;max_depth&#39;</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
    <span class="n">max_features</span>     <span class="o">=</span> <span class="n">hp</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="s1">&#39;max_features&#39;</span><span class="p">,</span><span class="n">X_train</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">min_samples_leaf</span> <span class="o">=</span> <span class="n">hp</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="s1">&#39;min_samples_leaf&#39;</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
    <span class="n">criterion</span>        <span class="o">=</span> <span class="n">hp</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="s1">&#39;criterion&#39;</span><span class="p">,</span> <span class="p">[</span><span class="s1">&#39;friedman_mse&#39;</span><span class="p">])</span>

    <span class="c1"># model / estimator to be optimized</span>
    <span class="n">est0</span> <span class="o">=</span> <span class="p">(</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">scope</span><span class="o">.</span><span class="n">GradientBoostingRegressor</span><span class="p">(</span> <span class="n">loss</span> <span class="o">=</span> <span class="n">loss</span><span class="p">,</span>
                                                 <span class="n">alpha</span> <span class="o">=</span> <span class="n">alpha</span><span class="p">,</span>
                                                 <span class="n">n_estimators</span>  <span class="o">=</span> <span class="n">n_estimators</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span>
                                                <span class="n">learning_rate</span> <span class="o">=</span> <span class="n">learning_rate</span><span class="p">,</span>
                                                <span class="n">max_depth</span> <span class="o">=</span> <span class="n">max_depth</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span>
                                                <span class="n">max_features</span> <span class="o">=</span> <span class="n">max_features</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span>
                                                <span class="n">min_samples_leaf</span> <span class="o">=</span> <span class="n">min_samples_leaf</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span>
                                                <span class="n">criterion</span> <span class="o">=</span> <span class="n">criterion</span><span class="p">,</span>
                                                <span class="n">random_state</span><span class="o">=</span> <span class="mi">101</span><span class="p">)</span> 
                  <span class="p">)</span>  

    <span class="c1"># search space</span>
    <span class="n">search_space_regression</span> <span class="o">=</span> <span class="n">hp</span><span class="o">.</span><span class="n">pchoice</span><span class="p">(</span><span class="s1">&#39;estimator&#39;</span><span class="p">,</span> <span class="p">[</span><span class="n">est0</span><span class="p">])</span>

    <span class="k">print</span><span class="p">(</span><span class="s1">&#39;--&#39;</span><span class="o">*</span><span class="mi">20</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s1">&#39;Finding optimal parameters&#39;</span><span class="p">)</span>
    <span class="c1"># perform the optimization</span>
    <span class="n">best</span> <span class="o">=</span> <span class="n">fmin</span><span class="p">(</span><span class="n">fn</span><span class="o">=</span> <span class="n">objective_function_regression</span><span class="p">,</span>
                             <span class="n">space</span><span class="o">=</span> <span class="n">search_space_regression</span><span class="p">,</span>
                              <span class="n">algo</span> <span class="o">=</span> <span class="n">tpe</span><span class="o">.</span><span class="n">suggest</span><span class="p">,</span> 
                             <span class="n">max_evals</span> <span class="o">=</span> <span class="n">n_steps</span><span class="p">,</span>
                             <span class="n">verbose</span> <span class="o">=</span> <span class="mi">0</span>  <span class="c1"># The number of iterations</span>
                             <span class="p">)</span>

    <span class="c1"># Allocate optimized parameters and apply to test data set</span>
    <span class="n">Regressor</span> <span class="o">=</span> <span class="n">GradientBoostingRegressor</span><span class="p">(</span> <span class="n">loss</span> <span class="o">=</span> <span class="n">loss</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="n">alpha</span><span class="p">,</span>
                          <span class="n">learning_rate</span> <span class="o">=</span> <span class="n">best</span><span class="p">[</span><span class="s1">&#39;learning_rate&#39;</span><span class="p">],</span>
                          <span class="n">max_depth</span> <span class="o">=</span> <span class="n">best</span><span class="p">[</span><span class="s1">&#39;max_depth&#39;</span><span class="p">],</span>
                          <span class="n">max_features</span> <span class="o">=</span> <span class="n">best</span><span class="p">[</span><span class="s1">&#39;max_features&#39;</span><span class="p">],</span>
                          <span class="n">min_samples_leaf</span> <span class="o">=</span> <span class="n">best</span><span class="p">[</span><span class="s1">&#39;min_samples_leaf&#39;</span><span class="p">],</span>
                          <span class="n">n_estimators</span> <span class="o">=</span> <span class="n">best</span><span class="p">[</span><span class="s1">&#39;n_estimators&#39;</span><span class="p">],</span>
                          <span class="n">random_state</span> <span class="o">=</span> <span class="mi">101</span>
                                       <span class="p">)</span>
    <span class="c1"># fit</span>
    <span class="n">Regressor</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span><span class="n">y_train</span><span class="p">)</span>

    <span class="c1">#evaluate</span>
    <span class="n">yhat</span> <span class="o">=</span> <span class="n">Regressor</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span> <span class="p">;</span>
    <span class="n">error_pct</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">round</span><span class="p">(</span> <span class="n">np</span><span class="o">.</span><span class="n">median</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">abs</span><span class="p">(</span><span class="n">yhat</span> <span class="o">-</span> <span class="n">y_test</span><span class="p">)),</span> <span class="mi">2</span><span class="p">)</span>
    <span class="c1">#print(&#39;--&#39;*20)</span>

    <span class="k">print</span><span class="p">(</span>  
        <span class="s2">&quot;{} {}&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s1">&#39;The Median Abs. Error (%) for the test set is :&#39;</span><span class="p">,</span> <span class="n">error_pct</span><span class="p">)</span> 
          <span class="p">)</span>

    <span class="k">return</span> <span class="n">Regressor</span><span class="p">,</span> <span class="n">y_test</span><span class="p">,</span> <span class="n">yhat</span>
</pre></div>
<p>Now, we can use the Boston housing data set to test our <em>beautiful</em> code:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">load_boston</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>

<span class="n">D</span><span class="o">=</span> <span class="n">load_boston</span><span class="p">()</span>

<span class="n">R1</span><span class="p">,</span> <span class="n">ytest1</span><span class="p">,</span> <span class="n">yhat1</span> <span class="o">=</span> <span class="n">train_GradientBoostingRegressor</span><span class="p">(</span> <span class="n">D</span><span class="o">.</span><span class="n">data</span><span class="p">,</span> 
                                                                                                                     <span class="n">D</span><span class="o">.</span><span class="n">target</span><span class="p">,</span>  
                                                                                                                     <span class="n">loss</span><span class="o">=</span><span class="s1">&#39;quantile&#39;</span><span class="p">,</span> 
                                                                                                                     <span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.50</span><span class="p">,</span> 
                                                                                                                     <span class="n">n_steps</span> <span class="o">=</span> <span class="mi">50</span><span class="p">)</span>
</pre></div>]]></content:encoded>
      </item>
  </channel>
</rss>