Display system information for reproducibility.
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.7.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.5.2 fastmap_1.2.0 cli_3.6.5
[5] tools_4.5.2 htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10
[9] rmarkdown_2.29 knitr_1.50 jsonlite_2.0.0 xfun_0.56
[13] digest_0.6.37 rlang_1.1.6 evaluate_1.0.5
Overview
We illustrate the typical machine learning workflow for support vector machines using the Heart data set. The outcome is AHD (Yes or No).
- Initial splitting into test and non-test sets.
- Pre-processing of data: dummy coding categorical variables, standardizing numerical variables, imputing missing values, …
- Tune the SVM algorithm using 5-fold cross-validation (CV) on the non-test data.
- Choose the best model by CV and refit it on the whole non-test data.
- Final classification on the test data.
Heart data
The goal is to predict the binary outcome AHD (Yes or No) of patients.
# Load libraries
library(GGally)
library(gtsummary)
library(kernlab)
library(tidyverse)
library(tidymodels)

# Load the `Heart.csv` data.
Heart <- read_csv("../data/Heart.csv") %>%
  # first column is patient ID, which we don't need
  select(-1) %>%
  # RestECG is categorical with value 0, 1, 2
  mutate(RestECG = as.character(RestECG)) %>%
  print(width = Inf)
# A tibble: 303 × 14
Age Sex ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 63 1 typical 145 233 1 2 150 0 2.3 3
2 67 1 asymptomatic 160 286 0 2 108 1 1.5 2
3 67 1 asymptomatic 120 229 0 2 129 1 2.6 2
4 37 1 nonanginal 130 250 0 0 187 0 3.5 3
5 41 0 nontypical 130 204 0 2 172 0 1.4 1
6 56 1 nontypical 120 236 0 0 178 0 0.8 1
7 62 0 asymptomatic 140 268 0 2 160 0 3.6 3
8 57 0 asymptomatic 120 354 0 0 163 1 0.6 1
9 63 1 asymptomatic 130 254 0 2 147 0 1.4 2
10 53 1 asymptomatic 140 203 1 2 155 1 3.1 3
Ca Thal AHD
<dbl> <chr> <chr>
1 0 fixed No
2 3 normal Yes
3 2 reversable Yes
4 0 normal No
5 0 normal No
6 0 normal No
7 2 normal Yes
8 0 normal No
9 1 reversable Yes
10 0 reversable Yes
# ℹ 293 more rows
# Numerical summaries stratified by the outcome `AHD`.
Heart %>% tbl_summary(by = AHD)
Characteristic       No, N = 164         Yes, N = 139
Age                  52 (45, 59)         58 (52, 62)
Sex                  92 (56%)            114 (82%)
ChestPain
  asymptomatic       39 (24%)            105 (76%)
  nonanginal         68 (41%)            18 (13%)
  nontypical         41 (25%)            9 (6.5%)
  typical            16 (9.8%)           7 (5.0%)
RestBP               130 (120, 140)      130 (120, 145)
Chol                 235 (209, 268)      249 (217, 284)
Fbs                  23 (14%)            22 (16%)
RestECG
  0                  95 (58%)            56 (40%)
  1                  1 (0.6%)            3 (2.2%)
  2                  68 (41%)            80 (58%)
MaxHR                161 (149, 172)      142 (125, 157)
ExAng                23 (14%)            76 (55%)
Oldpeak              0.20 (0.00, 1.05)   1.40 (0.50, 2.50)
Slope
  1                  106 (65%)           36 (26%)
  2                  49 (30%)            91 (65%)
  3                  9 (5.5%)            12 (8.6%)
Ca
  0                  130 (81%)           46 (33%)
  1                  21 (13%)            44 (32%)
  2                  7 (4.3%)            31 (22%)
  3                  3 (1.9%)            17 (12%)
  Unknown            3                   1
Thal
  fixed              6 (3.7%)            12 (8.7%)
  normal             129 (79%)           37 (27%)
  reversable         28 (17%)            89 (64%)
  Unknown            1                   1
Median (Q1, Q3); n (%)
# Graphical summary:
# Heart %>% ggpairs()
Initial split into test and non-test sets
We randomly split the data into 25% test data and 75% non-test data, stratifying on AHD.
# For reproducibility
set.seed(203)

data_split <- initial_split(
  Heart,
  # stratify by AHD
  strata = "AHD",
  prop = 0.75
)
data_split
<Training/Testing/Total>
<227/76/303>
Heart_other <- training(data_split)
dim(Heart_other)
Heart_test <- testing(data_split)
dim(Heart_test)
Recipe (R) and Preprocessing (Python)
A data dictionary (roughly) is at https://keras.io/examples/structured_data/structured_data_classification_with_feature_space/ .
We have the following features:
- Numerical features: Age, RestBP, Chol, Slope (1, 2 or 3), MaxHR, ExAng, Oldpeak, Ca (0, 1, 2 or 3).
- Categorical features coded as integer: Sex (0 or 1), Fbs (0 or 1), RestECG (0, 1 or 2).
- Categorical features coded as string: ChestPain, Thal.
There are missing values in Ca and Thal. Since the missing proportion is not high, we will use simple mean imputation (for the numerical feature Ca) and mode imputation (for the categorical feature Thal).
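As a sketch of what the two imputation steps compute (the helper objects `ca_mean` and `thal_mode` below are illustrative, not part of the recipe):

```r
# step_impute_mean() fills NA in Ca with the (training-set) mean:
ca_mean <- mean(Heart_other$Ca, na.rm = TRUE)

# step_impute_mode() fills NA in Thal with the most frequent value:
thal_mode <- names(which.max(table(Heart_other$Thal)))
```

In a recipe these statistics are estimated from the training data at `prep()` time and then reused unchanged when processing new data, which avoids information leaking from the test set.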
svm_recipe <-
  recipe(
    AHD ~ .,
    data = Heart_other
  ) %>%
  # mean imputation for Ca
  step_impute_mean(Ca) %>%
  # mode imputation for Thal
  step_impute_mode(Thal) %>%
  # create traditional dummy variables (necessary for svm)
  step_dummy(all_nominal_predictors()) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>%
  # center and scale numeric data
  step_normalize(all_numeric_predictors()) # %>%
  # estimate the means and standard deviations
  # prep(training = Heart_other, retain = TRUE)
svm_recipe
Model
svm_mod <-
  svm_poly(
    mode = "classification",
    cost = tune(),
    degree = tune(),
    # scale_factor = tune()
  ) %>%
  set_engine("kernlab")
svm_mod
Polynomial Support Vector Machine Model Specification (classification)
Main Arguments:
cost = tune()
degree = tune()
Computational engine: kernlab
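`svm_poly()` maps to kernlab's polynomial kernel, K(x, y) = (scale * ⟨x, y⟩ + offset)^degree. As an illustration (the vectors `x` and `y` here are made up, not from the Heart data), the kernel value can be checked by hand:

```r
library(kernlab)

# Polynomial kernel with explicit parameters
k <- polydot(degree = 2, scale = 1, offset = 1)

x <- c(1, 2)
y <- c(3, 4)

k(x, y)               # kernlab's kernel value
(sum(x * y) + 1)^2    # by hand: (3 + 8 + 1)^2 = 144
```

With `degree = 1` (the value ultimately selected below), the kernel reduces to an affine function of the inner product, i.e. a linear SVM.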
Workflow in R and pipeline in Python
Here we bundle the preprocessing step (Python) or recipe (R) and model.
svm_wf <- workflow () %>%
add_recipe (svm_recipe) %>%
add_model (svm_mod)
svm_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: svm_poly()
── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps
• step_impute_mean()
• step_impute_mode()
• step_dummy()
• step_zv()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Polynomial Support Vector Machine Model Specification (classification)
Main Arguments:
cost = tune()
degree = tune()
Computational engine: kernlab
Tuning grid
Here we tune the cost and the polynomial degree.
param_grid <- grid_regular(
  cost(range = c(-3, 2)),
  degree(range = c(1, 5)),
  # scale_factor(range = c(-1, 1)),
  levels = c(5)
)
param_grid
# A tibble: 25 × 2
cost degree
<dbl> <dbl>
1 0.125 1
2 0.297 1
3 0.707 1
4 1.68 1
5 4 1
6 0.125 2
7 0.297 2
8 0.707 2
9 1.68 2
10 4 2
# ℹ 15 more rows
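`dials::cost()` is tuned on a log2 scale, so `range = c(-3, 2)` with 5 levels spaces the candidates evenly in log2 units; the cost values in the grid above can be reproduced by hand:

```r
# 5 equally spaced exponents from -3 to 2, back-transformed from log2:
2^seq(-3, 2, length.out = 5)
# 0.125 0.297 0.707 1.68 4  (matching the cost column above)
```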
Cross-validation (CV)
Set cross-validation partitions.
set.seed(203)

folds <- vfold_cv(Heart_other, v = 5)
folds
# 5-fold cross-validation
# A tibble: 5 × 2
splits id
<list> <chr>
1 <split [181/46]> Fold1
2 <split [181/46]> Fold2
3 <split [182/45]> Fold3
4 <split [182/45]> Fold4
5 <split [182/45]> Fold5
Fit cross-validation.
svm_fit <- svm_wf %>%
  tune_grid(
    resamples = folds,
    grid = param_grid,
    metrics = metric_set(roc_auc, accuracy)
  )
svm_fit
# Tuning results
# 5-fold cross-validation
# A tibble: 5 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [181/46]> Fold1 <tibble [50 × 6]> <tibble [0 × 4]>
2 <split [181/46]> Fold2 <tibble [50 × 6]> <tibble [0 × 4]>
3 <split [182/45]> Fold3 <tibble [50 × 6]> <tibble [0 × 4]>
4 <split [182/45]> Fold4 <tibble [50 × 6]> <tibble [0 × 4]>
5 <split [182/45]> Fold5 <tibble [50 × 6]> <tibble [0 × 4]>
Visualize CV results:
svm_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "roc_auc") %>%
  ggplot(mapping = aes(x = cost, y = mean, color = factor(degree))) +
  geom_point() +
  geom_line() +
  labs(x = "Cost", y = "CV AUC", color = "Degree") +
  scale_x_log10()
# A tibble: 50 × 8
cost degree .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.125 1 accuracy binary 0.828 5 0.0187 pre0_mod01_post0
2 0.125 1 roc_auc binary 0.911 5 0.00945 pre0_mod01_post0
3 0.125 2 accuracy binary 0.771 5 0.0293 pre0_mod02_post0
4 0.125 2 roc_auc binary 0.845 5 0.0149 pre0_mod02_post0
5 0.125 3 accuracy binary 0.832 5 0.0154 pre0_mod03_post0
6 0.125 3 roc_auc binary 0.877 5 0.0280 pre0_mod03_post0
7 0.125 4 accuracy binary 0.78 5 0.0213 pre0_mod04_post0
8 0.125 4 roc_auc binary 0.858 5 0.0128 pre0_mod04_post0
9 0.125 5 accuracy binary 0.802 5 0.0300 pre0_mod05_post0
10 0.125 5 roc_auc binary 0.868 5 0.0394 pre0_mod05_post0
# ℹ 40 more rows
Show the top 5 models.
svm_fit %>%
  show_best(metric = "roc_auc")
# A tibble: 5 × 8
cost degree .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.707 1 roc_auc binary 0.917 5 0.0110 pre0_mod11_post0
2 4 1 roc_auc binary 0.917 5 0.0119 pre0_mod21_post0
3 1.68 1 roc_auc binary 0.915 5 0.0118 pre0_mod16_post0
4 0.297 1 roc_auc binary 0.912 5 0.0124 pre0_mod06_post0
5 0.125 1 roc_auc binary 0.911 5 0.00945 pre0_mod01_post0
Let’s select the best model.
best_svm <- svm_fit %>%
  select_best(metric = "roc_auc")
best_svm
# A tibble: 1 × 3
cost degree .config
<dbl> <dbl> <chr>
1 0.707 1 pre0_mod11_post0
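`select_best()` picks the single configuration with the highest mean AUC. An alternative not used in this analysis is the one-standard-error rule, which prefers the simplest model (here, the smallest cost) whose performance is within one standard error of the numerically best:

```r
library(tidymodels)

# One-SE rule: among models within one std. error of the best mean AUC,
# pick the first after sorting by cost (least complex candidates first).
svm_fit %>%
  select_by_one_std_err(metric = "roc_auc", cost)
```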
Finalize our model
Now we are done tuning. Finally, let’s fit this final model to the whole training data and use our test data to estimate the model performance we expect to see with new data.
# Final workflow
final_wf <- svm_wf %>%
  finalize_workflow(best_svm)
final_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: svm_poly()
── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps
• step_impute_mean()
• step_impute_mode()
• step_dummy()
• step_zv()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Polynomial Support Vector Machine Model Specification (classification)
Main Arguments:
cost = 0.707106781186548
degree = 1
Computational engine: kernlab
# Fit the whole training set, then predict the test cases
final_fit <-
  final_wf %>%
  last_fit(data_split)
final_fit
# Resampling results
# Manual resampling
# A tibble: 1 × 6
splits id .metrics .notes .predictions .workflow
<list> <chr> <list> <list> <list> <list>
1 <split [227/76]> train/test split <tibble> <tibble> <tibble> <workflow>
# Test metrics
final_fit %>%
  collect_metrics()
# A tibble: 3 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.789 pre0_mod0_post0
2 roc_auc binary 0.860 pre0_mod0_post0
3 brier_class binary 0.155 pre0_mod0_post0
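`last_fit()` also stores the test-set predictions, so we can go beyond scalar metrics. A sketch (not part of the original output) plotting the test ROC curve; `.pred_No` is used because No is the first factor level of AHD:

```r
library(tidymodels)

# Pull the saved test-set predictions and plot the ROC curve.
final_fit %>%
  collect_predictions() %>%
  roc_curve(AHD, .pred_No) %>%
  autoplot()
```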
# Prepare baked (preprocessed) train/test data for the variable
# importance plot below.
library(doParallel)

set.seed(101)
split_obj <- initial_split(data = Heart, prop = 0.7, strata = AHD)
train <- training(split_obj)
test <- testing(split_obj)

# Create the recipe
recipe(AHD ~ ., data = train) %>%
  # mean imputation for Ca
  step_impute_mean(Ca) %>%
  # mode imputation for Thal
  step_impute_mode(Thal) %>%
  # create traditional dummy variables (necessary for svm)
  step_dummy(all_nominal_predictors()) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>%
  # center and scale numeric data
  step_normalize(all_numeric_predictors()) %>%
  # estimate the means and standard deviations
  prep() -> recipe_obj

# Bake
train <- bake(recipe_obj, new_data = train)
test <- bake(recipe_obj, new_data = test)
library(vip)

final_fit %>%
  pluck(".workflow", 1) %>%
  # extract_fit_parsnip() replaces the deprecated pull_workflow_fit()
  extract_fit_parsnip() %>%
  # vip(method = "permute", train = Heart)
  vip(method = "permute",
      target = "AHD", metric = "accuracy",
      pred_wrapper = kernlab::predict, train = train)
svm_rbf_spec <- svm_rbf() %>%
  set_mode("classification") %>%
  set_engine("kernlab")

svm_rbf_fit <- svm_rbf_spec %>%
  fit(AHD ~ ., data = train[, c('Ca', 'ExAng', 'AHD')])

svm_rbf_fit %>%
  extract_fit_engine() %>%
  plot()