Machine Learning Workflow: Classifiers With Nonlinear Features

Biostat 212A

Author

Dr. Jin Zhou @ UCLA

Published

January 27, 2026

Display system information for reproducibility.

sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.7.3

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.5.2    fastmap_1.2.0     cli_3.6.5        
 [5] tools_4.5.2       htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10      
 [9] rmarkdown_2.29    knitr_1.50        jsonlite_2.0.0    xfun_0.56        
[13] digest_0.6.37     rlang_1.1.6       evaluate_1.0.5   
import IPython
print(IPython.sys_info())
{'commit_hash': '5ed988a91',
 'commit_source': 'installation',
 'default_encoding': 'utf-8',
 'ipython_path': '/Users/jinjinzhou/.virtualenvs/r-tensorflow/lib/python3.10/site-packages/IPython',
 'ipython_version': '8.33.0',
 'os_name': 'posix',
 'platform': 'macOS-15.7.3-arm64-arm-64bit',
 'sys_executable': '/Users/jinjinzhou/.virtualenvs/r-tensorflow/bin/python',
 'sys_platform': 'darwin',
 'sys_version': '3.10.16 (main, Mar  3 2025, 20:01:33) [Clang 16.0.0 '
                '(clang-1600.0.26.6)]'}

1 Overview

We illustrate the typical machine learning workflow for classification problems using the Default data set from the R ISLR2 package. Our goal is to classify whether a credit card customer will default or not. The steps are:

  1. Initial splitting into test and non-test sets.

  2. Pre-processing of data: one-hot encoding of categorical variables and adding nonlinear features (B-splines) for some continuous predictors.

  3. Choose a set of candidate classifiers: logistic regression, LDA, QDA, NB, KNN.

  4. Tune the hyper-parameter(s) (n_knots in SplineTransformer, classifier) using 10-fold cross-validation (CV) on the non-test data.

  5. Choose the best model by CV and refit it on the whole non-test data.

  6. Final classification on the test data.

2 Default data

Documentation of the Default data is here. The goal is to classify whether a credit card customer will default or not.

##| eval: true
library(GGally)
library(ISLR2)
library(tidymodels)
library(tidyverse)

Default <- as_tibble(Default) %>%
  print(width = Inf)
# A tibble: 10,000 × 4
   default student balance income
   <fct>   <fct>     <dbl>  <dbl>
 1 No      No         730. 44362.
 2 No      Yes        817. 12106.
 3 No      No        1074. 31767.
 4 No      No         529. 35704.
 5 No      No         786. 38463.
 6 No      Yes        920.  7492.
 7 No      No         826. 24905.
 8 No      Yes        809. 17600.
 9 No      No        1161. 37469.
10 No      No           0  29275.
# ℹ 9,990 more rows
# Numerical summaries
summary(Default)
 default    student       balance           income     
 No :9667   No :7056   Min.   :   0.0   Min.   :  772  
 Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
                       Median : 823.6   Median :34553  
                       Mean   : 835.4   Mean   :33517  
                       3rd Qu.:1166.3   3rd Qu.:43808  
                       Max.   :2654.3   Max.   :73554  

Graphical summary.

# Graphical summaries
ggpairs(
  data = Default, 
  mapping = aes(alpha = 0.25), 
  lower = list(continuous = "smooth")
  ) + 
  labs(title = "Default Data")
# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt

# Set font sizes in plots
sns.set(font_scale = 1.2)
# Display all columns
pd.set_option('display.max_columns', None)

Default = pd.read_csv("../data/Default.csv")
Default
     default student      balance        income
0         No      No   729.526495  44361.625074
1         No     Yes   817.180407  12106.134700
2         No      No  1073.549164  31767.138947
3         No      No   529.250605  35704.493935
4         No      No   785.655883  38463.495879
...      ...     ...          ...           ...
9995      No      No   711.555020  52992.378914
9996      No      No   757.962918  19660.721768
9997      No      No   845.411989  58636.156984
9998      No      No  1569.009053  36669.112365
9999      No     Yes   200.922183  16862.952321

[10000 rows x 4 columns]
# Numerical summaries
Default.describe()
            balance        income
count  10000.000000  10000.000000
mean     835.374886  33516.981876
std      483.714985  13336.639563
min        0.000000    771.967729
25%      481.731105  21340.462903
50%      823.636973  34552.644802
75%     1166.308386  43807.729272
max     2654.322576  73554.233495

Graphical summary:

# Graphical summaries
plt.figure()
sns.pairplot(data = Default);
plt.show()

3 Initial split into test and non-test sets

It is a good idea to keep the proportion of default = 'Yes' roughly the same between the test and non-test data.

##| eval: false
# For reproducibility
set.seed(425)
data_split <- initial_split(
  Wage, 
  # # stratify by percentiles
  # strata = "Salary", 
  prop = 0.75
  )

Wage_other <- training(data_split)
dim(Wage_other)
[1] 2250   11
Wage_test <- testing(data_split)
dim(Wage_test)
[1] 750  11
from sklearn.model_selection import train_test_split

Default_other, Default_test = train_test_split(
  Default, 
  train_size = 0.75,
  random_state = 425, # seed
  stratify = Default.default
  )
Default_test.shape
(2500, 4)
Default_other.shape
(7500, 4)
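A quick sanity check of what the stratify argument buys us, on toy labels rather than the Default data (an illustrative sketch): when the per-class counts divide evenly, the split preserves the class proportions exactly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 80 negatives, 20 positives (20% positive rate)
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=0)
# Both splits keep the 20% positive rate: 15/75 train, 5/25 test
```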

Separate \(X\) and \(y\).

# Non-test X and y
X_other = Default_other[['balance', 'income', 'student']]
y_other = Default_other.default
# Test X and y
X_test = Default_test[['balance', 'income', 'student']]
y_test = Default_test.default

4 Recipe (R) or Preprocessing (Python)

##| eval: false
norm_recipe <- 
  recipe(
    wage ~ year + age + education, 
    data = Wage_other
  ) %>%
  # create traditional dummy variables
  step_dummy(all_nominal()) %>%
  # zero-variance filter
  step_zv(all_predictors()) %>% 
  # B-splines of age
  step_bs(age, deg_free = 5) %>%
  # B-splines of year
  step_bs(year, deg_free = 4) %>%
  # center and scale numeric data
  step_normalize(all_predictors())#  %>%
  # estimate the means and standard deviations
  # prep(training = Wage_other, retain = TRUE)
norm_recipe

Pre-processor for one-hot encoding of the categorical variable and generating B-spline features from the continuous predictors.

from sklearn.preprocessing import OneHotEncoder, StandardScaler, SplineTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Transformer for numeric variables
numeric_tf = Pipeline(steps = [
  # ("scalar", StandardScaler()),
  ("bspline", SplineTransformer(degree = 3, extrapolation = 'linear'))
])
# Transformer for categorical variables
categorical_tf = Pipeline(steps = [
  ("encoder", OneHotEncoder(drop = 'first'))
])

# Column transformer
col_tf = ColumnTransformer(transformers = [
  ('num', numeric_tf, ['balance', 'income']),
  ('cat', categorical_tf, ['student'])
])

5 Model

Let’s skip this step for now, because the model (or classifier) itself is being tuned by CV.

##| eval: false
enet_mod <- 
  # mixture = 0 (ridge), mixture = 1 (lasso)
  linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_engine("glmnet")
enet_mod
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = tune()

Computational engine: glmnet 

6 Workflow (R) or Pipeline (Python)

Here we bundle the preprocessing step (Python) or recipe (R) with the model. Remember that KNN is just a placeholder here; later we will choose the best classifier according to cross-validation.

##| eval: false
lr_wf <- 
  workflow() %>%
  add_model(enet_mod) %>%
  add_recipe(norm_recipe)
lr_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_dummy()
• step_zv()
• step_bs()
• step_bs()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = tune()

Computational engine: glmnet 
from sklearn.neighbors import KNeighborsClassifier

pipe = Pipeline(steps = [
  ("col_tf", col_tf),
  ("model", KNeighborsClassifier())
  ])
pipe
Pipeline(steps=[('col_tf',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('bspline',
                                                                   SplineTransformer(extrapolation='linear'))]),
                                                  ['balance', 'income']),
                                                 ('cat',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['student'])])),
                ('model', KNeighborsClassifier())])

7 Tuning grid

Set up the 2D grid for tuning.

##| eval: false
param_grid <- grid_regular(
  penalty(range = c(-5, 0), trans = log10_trans()), 
  mixture(range = c(0, 1)),
  levels = c(penalty = 50, mixture = 6)
  )
param_grid
# A tibble: 300 × 2
     penalty mixture
       <dbl>   <dbl>
 1 0.00001         0
 2 0.0000126       0
 3 0.0000160       0
 4 0.0000202       0
 5 0.0000256       0
 6 0.0000324       0
 7 0.0000409       0
 8 0.0000518       0
 9 0.0000655       0
10 0.0000829       0
# ℹ 290 more rows
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

# Tune hyper-parameter(s)
nknots_grid = np.array(range(2, 6))
classifiers = [
  LogisticRegression(),
  LinearDiscriminantAnalysis(),
  QuadraticDiscriminantAnalysis(),
  GaussianNB(),
  KNeighborsClassifier(n_neighbors = 5)
  ]
tuned_parameters = {
  "col_tf__num__bspline__n_knots": nknots_grid,
  "model": classifiers
  }
tuned_parameters  
{'col_tf__num__bspline__n_knots': array([2, 3, 4, 5]), 'model': [LogisticRegression(), LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis(), GaussianNB(), KNeighborsClassifier()]}
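The long key col_tf__num__bspline__n_knots follows scikit-learn's convention of joining nested step names with double underscores. A small standalone sketch (with made-up step names, not the ones above) of how such keys are formed and used:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical two-step pipeline; "scale" and "model" are arbitrary labels
p = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])
# Nested hyper-parameters are addressed as <step>__<param>
p.set_params(model__C=0.5)
print(p.get_params()["model__C"])
```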

8 Cross-validation (CV)

Set cross-validation partitions.

##| eval: false
set.seed(250)
folds <- vfold_cv(Wage_other, v = 10)
folds
#  10-fold cross-validation 
# A tibble: 10 × 2
   splits             id    
   <list>             <chr> 
 1 <split [2025/225]> Fold01
 2 <split [2025/225]> Fold02
 3 <split [2025/225]> Fold03
 4 <split [2025/225]> Fold04
 5 <split [2025/225]> Fold05
 6 <split [2025/225]> Fold06
 7 <split [2025/225]> Fold07
 8 <split [2025/225]> Fold08
 9 <split [2025/225]> Fold09
10 <split [2025/225]> Fold10

Fit cross-validation.

##| eval: false
enet_fit <- 
  lr_wf %>%
  tune_grid(
    resamples = folds,
    grid = param_grid,
    )
enet_fit
# Tuning results
# 10-fold cross-validation 
# A tibble: 10 × 4
   splits             id     .metrics           .notes          
   <list>             <chr>  <list>             <list>          
 1 <split [2025/225]> Fold01 <tibble [600 × 6]> <tibble [0 × 4]>
 2 <split [2025/225]> Fold02 <tibble [600 × 6]> <tibble [0 × 4]>
 3 <split [2025/225]> Fold03 <tibble [600 × 6]> <tibble [0 × 4]>
 4 <split [2025/225]> Fold04 <tibble [600 × 6]> <tibble [0 × 4]>
 5 <split [2025/225]> Fold05 <tibble [600 × 6]> <tibble [0 × 4]>
 6 <split [2025/225]> Fold06 <tibble [600 × 6]> <tibble [0 × 4]>
 7 <split [2025/225]> Fold07 <tibble [600 × 6]> <tibble [0 × 4]>
 8 <split [2025/225]> Fold08 <tibble [600 × 6]> <tibble [0 × 4]>
 9 <split [2025/225]> Fold09 <tibble [600 × 6]> <tibble [0 × 4]>
10 <split [2025/225]> Fold10 <tibble [600 × 6]> <tibble [0 × 4]>

Visualize CV criterion.

##| eval: false
enet_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "rmse") %>%
  ggplot(mapping = aes(x = penalty, y = mean)) + 
  geom_point() + 
  geom_line(aes(group = mixture)) + 
  labs(x = "Penalty", y = "CV RMSE") + 
  scale_x_log10(labels = scales::label_number())
# A tibble: 600 × 8
   penalty mixture .metric .estimator   mean     n std_err .config          
     <dbl>   <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>            
 1 0.00001     0   rmse    standard   35.1      10  0.431  pre0_mod001_post0
 2 0.00001     0   rsq     standard    0.300    10  0.0134 pre0_mod001_post0
 3 0.00001     0.2 rmse    standard   35.0      10  0.441  pre0_mod002_post0
 4 0.00001     0.2 rsq     standard    0.302    10  0.0135 pre0_mod002_post0
 5 0.00001     0.4 rmse    standard   35.0      10  0.441  pre0_mod003_post0
 6 0.00001     0.4 rsq     standard    0.302    10  0.0135 pre0_mod003_post0
 7 0.00001     0.6 rmse    standard   35.0      10  0.441  pre0_mod004_post0
 8 0.00001     0.6 rsq     standard    0.302    10  0.0135 pre0_mod004_post0
 9 0.00001     0.8 rmse    standard   35.0      10  0.441  pre0_mod005_post0
10 0.00001     0.8 rsq     standard    0.302    10  0.0135 pre0_mod005_post0
# ℹ 590 more rows

Show the top 5 models (\(\lambda\) values).

##| eval: false
enet_fit %>%
  show_best(metric = "rmse")
# A tibble: 5 × 8
  penalty mixture .metric .estimator  mean     n std_err .config          
    <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>            
1  0.0596     1   rmse    standard    35.0    10   0.438 pre0_mod228_post0
2  0.0754     1   rmse    standard    35.0    10   0.437 pre0_mod234_post0
3  0.0754     0.8 rmse    standard    35.0    10   0.438 pre0_mod233_post0
4  0.0954     0.6 rmse    standard    35.0    10   0.438 pre0_mod238_post0
5  0.0954     0.8 rmse    standard    35.0    10   0.437 pre0_mod239_post0

Let’s select the best model.

##| eval: false
best_enet <- enet_fit %>%
  select_best(metric = "rmse")
best_enet
# A tibble: 1 × 3
  penalty mixture .config          
    <dbl>   <dbl> <chr>            
1  0.0596       1 pre0_mod228_post0

Set up CV partitions and the CV criterion. Again, it is a good idea to keep the case proportions roughly the same between splits. According to the GridSearchCV documentation, StratifiedKFold is used automatically for classifiers.

from sklearn.model_selection import GridSearchCV

# Set up CV
n_folds = 10
search = GridSearchCV(
  pipe,
  tuned_parameters,
  cv = n_folds, 
  scoring = "roc_auc",
  # Refit the best model on the whole data set
  refit = True
  )
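To see the stratification at work, here is an illustrative sketch on toy labels (not the Default data): with an 80/20 class mix, every fold of a 10-fold StratifiedKFold holds the same number of positive cases.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))
skf = StratifiedKFold(n_splits=10)
# Each 10-observation fold gets exactly 2 of the 20 positives
fold_pos = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(fold_pos)
```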

Fit CV. This is typically the most time-consuming step.

# Fit CV
search.fit(X_other, y_other)
GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('col_tf',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('bspline',
                                                                                          SplineTransformer(extrapolation='linear'))]),
                                                                         ['balance',
                                                                          'income']),
                                                                        ('cat',
                                                                         Pipeline(steps=[('encoder',
                                                                                          OneHotEncoder(drop='first'))]),
                                                                         ['student'])])),
                                       ('model', KNeighborsClassifier())]),
             param_grid={'col_tf__num__bspline__n_knots': array([2, 3, 4, 5]),
                         'model': [LogisticRegression(),
                                   LinearDiscriminantAnalysis(),
                                   QuadraticDiscriminantAnalysis(),
                                   GaussianNB(), KNeighborsClassifier()]},
             scoring='roc_auc')

Visualize CV results.

Code
cv_res = pd.DataFrame({
  "nknots": np.array(search.cv_results_["param_col_tf__num__bspline__n_knots"]),
  "classifier": np.array(search.cv_results_["param_model"]),
  "aucroc": search.cv_results_["mean_test_score"]
  })

plt.figure()
sns.relplot(
  kind = "line",
  data = cv_res,
  x = "nknots",
  y = "aucroc",
  hue = "classifier"
  ).set(
    xlabel = "Number of Knots",
    ylabel = "CV AUC"
);
plt.show()

Best CV AUC-ROC:

search.best_score_
np.float64(0.9444744827586206)
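Besides best_score_, a fitted GridSearchCV also exposes best_params_ and (because refit = True) best_estimator_. A standalone toy sketch on synthetic data with a made-up grid, not the search fitted above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)
gs = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 7]},
    cv=5, scoring="roc_auc", refit=True)
gs.fit(X, y)
print(gs.best_params_)     # winning grid point
print(gs.best_score_)      # its mean CV AUC
gs.best_estimator_         # already refit on all of (X, y)
```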

9 Finalize our model

Now we are done tuning. Finally, let’s fit the final model to the whole non-test data and use the test data to estimate the model performance we expect to see with new data.

##| eval: false
# Final workflow
final_wf <- lr_wf %>%
  finalize_workflow(best_enet)
final_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_dummy()
• step_zv()
• step_bs()
• step_bs()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 0.0596362331659464
  mixture = 1

Computational engine: glmnet 
# Fit the whole training set, then predict the test cases
final_fit <- 
  final_wf %>%
  last_fit(data_split)
final_fit
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits             id               .metrics .notes   .predictions .workflow 
  <list>             <chr>            <list>   <list>   <list>       <list>    
1 <split [2250/750]> train/test split <tibble> <tibble> <tibble>     <workflow>
# Test metrics
final_fit %>% collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config        
  <chr>   <chr>          <dbl> <chr>          
1 rmse    standard      35.8   pre0_mod0_post0
2 rsq     standard       0.249 pre0_mod0_post0

Since we called GridSearchCV with refit = True, the best model fit on the whole non-test data is readily available.

search.best_estimator_
Pipeline(steps=[('col_tf',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('bspline',
                                                                   SplineTransformer(extrapolation='linear',
                                                                                     n_knots=np.int64(3)))]),
                                                  ['balance', 'income']),
                                                 ('cat',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(drop='first'))]),
                                                  ['student'])])),
                ('model', LogisticRegression())])

The final AUC on the test set is

from sklearn.metrics import roc_auc_score

roc_auc_score(
  y_test, 
  search.best_estimator_.predict_proba(X_test)[:, 1]
  )
np.float64(0.9647327414747946)
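As a reminder of what this number measures: AUC is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. A tiny hand-checkable sketch:

```python
from sklearn.metrics import roc_auc_score

# Of the 4 (negative, positive) pairs, 3 are ordered correctly:
# (0.1, 0.35), (0.1, 0.8), (0.4, 0.8) yes; (0.4, 0.35) no
auc = roc_auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3/4 = 0.75
```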