Display system information for reproducibility.
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.7.4
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.5.2 fastmap_1.2.0 cli_3.6.5
[5] tools_4.5.2 htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10
[9] rmarkdown_2.29 knitr_1.50 jsonlite_2.0.0 xfun_0.56
[13] digest_0.6.37 rlang_1.1.6 evaluate_1.0.5
Overview
We illustrate the typical machine learning workflow for support vector machines using the Heart data set. The outcome is AHD (Yes or No).
- Initial splitting into test and non-test sets.
- Pre-processing of data: dummy coding categorical variables, standardizing numerical variables, imputing missing values, …
- Tune the SVM algorithm using 5-fold cross-validation (CV) on the non-test data.
- Choose the best model by CV and refit it on the whole non-test data.
- Final classification on the test data.
Heart data
The goal is to predict the binary outcome AHD (Yes or No) of patients.
# Load libraries
library(GGally)
library(gtsummary)
library(kernlab)
library(tidyverse)
library(tidymodels)

# Load the `Heart.csv` data.
Heart <- read_csv("../data/Heart.csv") %>%
  # first column is patient ID, which we don't need
  select(-1) %>%
  # RestECG is categorical with value 0, 1, 2
  mutate(RestECG = as.character(RestECG)) %>%
  print(width = Inf)
# A tibble: 303 × 14
Age Sex ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 63 1 typical 145 233 1 2 150 0 2.3 3
2 67 1 asymptomatic 160 286 0 2 108 1 1.5 2
3 67 1 asymptomatic 120 229 0 2 129 1 2.6 2
4 37 1 nonanginal 130 250 0 0 187 0 3.5 3
5 41 0 nontypical 130 204 0 2 172 0 1.4 1
6 56 1 nontypical 120 236 0 0 178 0 0.8 1
7 62 0 asymptomatic 140 268 0 2 160 0 3.6 3
8 57 0 asymptomatic 120 354 0 0 163 1 0.6 1
9 63 1 asymptomatic 130 254 0 2 147 0 1.4 2
10 53 1 asymptomatic 140 203 1 2 155 1 3.1 3
Ca Thal AHD
<dbl> <chr> <chr>
1 0 fixed No
2 3 normal Yes
3 2 reversable Yes
4 0 normal No
5 0 normal No
6 0 normal No
7 2 normal Yes
8 0 normal No
9 1 reversable Yes
10 0 reversable Yes
# ℹ 293 more rows
# Numerical summaries stratified by the outcome `AHD`.
Heart %>% tbl_summary(by = AHD)
Characteristic       No, N = 164         Yes, N = 139
Age                  52 (45, 59)         58 (52, 62)
Sex                  92 (56%)            114 (82%)
ChestPain
  asymptomatic       39 (24%)            105 (76%)
  nonanginal         68 (41%)            18 (13%)
  nontypical         41 (25%)            9 (6.5%)
  typical            16 (9.8%)           7 (5.0%)
RestBP               130 (120, 140)      130 (120, 145)
Chol                 235 (209, 268)      249 (217, 284)
Fbs                  23 (14%)            22 (16%)
RestECG
  0                  95 (58%)            56 (40%)
  1                  1 (0.6%)            3 (2.2%)
  2                  68 (41%)            80 (58%)
MaxHR                161 (149, 172)      142 (125, 157)
ExAng                23 (14%)            76 (55%)
Oldpeak              0.20 (0.00, 1.05)   1.40 (0.50, 2.50)
Slope
  1                  106 (65%)           36 (26%)
  2                  49 (30%)            91 (65%)
  3                  9 (5.5%)            12 (8.6%)
Ca
  0                  130 (81%)           46 (33%)
  1                  21 (13%)            44 (32%)
  2                  7 (4.3%)            31 (22%)
  3                  3 (1.9%)            17 (12%)
  Unknown            3                   1
Thal
  fixed              6 (3.7%)            12 (8.7%)
  normal             129 (79%)           37 (27%)
  reversable         28 (17%)            89 (64%)
  Unknown            1                   1
Median (Q1, Q3); n (%)
# Graphical summary:
# Heart %>% ggpairs()
Initial split into test and non-test sets
We randomly split the data into 25% test data and 75% non-test data, stratifying on AHD.
# For reproducibility
set.seed(203)

data_split <- initial_split(
  Heart,
  # stratify by AHD
  strata = "AHD",
  prop = 0.75
)
data_split
<Training/Testing/Total>
<227/76/303>
Heart_other <- training(data_split)
dim(Heart_other)
Heart_test <- testing(data_split)
dim(Heart_test)
Recipe (R) and Preprocessing (Python)
A data dictionary (roughly) is at https://keras.io/examples/structured_data/structured_data_classification_with_feature_space/ .
We have the following features:
- Numerical features: Age, RestBP, Chol, Slope (1, 2 or 3), MaxHR, ExAng, Oldpeak, Ca (0, 1, 2 or 3).
- Categorical features coded as integer: Sex (0 or 1), Fbs (0 or 1), RestECG (0, 1 or 2).
- Categorical features coded as string: ChestPain, Thal.
There are missing values in Ca and Thal. Since the missing proportion is not high, we will use simple mean imputation (for the numerical feature Ca) and mode imputation (for the categorical feature Thal).
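As a sketch of what the two imputation steps compute (the helper objects `ca_mean` and `thal_mode` below are illustrative, not part of the recipe):

```r
# step_impute_mean() fills NA in Ca with the (training-set) mean:
ca_mean <- mean(Heart_other$Ca, na.rm = TRUE)

# step_impute_mode() fills NA in Thal with the most frequent value:
thal_mode <- names(which.max(table(Heart_other$Thal)))
```

In a recipe these statistics are estimated from the training data at `prep()` time and then reused unchanged when processing new data, which avoids information leaking from the test set.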
svm_recipe <-
  recipe(
    AHD ~ .,
    data = Heart_other
  ) %>%
  # mean imputation for Ca
  step_impute_mean(Ca) %>%
  # mode imputation for Thal
  step_impute_mode(Thal) %>%
  # create traditional dummy variables (necessary for svm)
  step_dummy(all_nominal_predictors()) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>%
  # center and scale numeric data
  step_normalize(all_numeric_predictors()) # %>%
  # estimate the means and standard deviations
  # prep(training = Heart_other, retain = TRUE)
svm_recipe
Model
svm_mod <-
  svm_poly(
    mode = "classification",
    cost = tune(),
    degree = tune(),
    # scale_factor = tune()
  ) %>%
  set_engine("kernlab")
svm_mod
Polynomial Support Vector Machine Model Specification (classification)
Main Arguments:
cost = tune()
degree = tune()
Computational engine: kernlab
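`svm_poly()` maps to kernlab's polynomial kernel, K(x, y) = (scale * ⟨x, y⟩ + offset)^degree. As an illustration (the vectors `x` and `y` here are made up, not from the Heart data), the kernel value can be checked by hand:

```r
library(kernlab)

# Polynomial kernel with explicit parameters
k <- polydot(degree = 2, scale = 1, offset = 1)

x <- c(1, 2)
y <- c(3, 4)

k(x, y)               # kernlab's kernel value
(sum(x * y) + 1)^2    # by hand: (3 + 8 + 1)^2 = 144
```

With `degree = 1` (the value ultimately selected below), the kernel reduces to an affine function of the inner product, i.e. a linear SVM.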
Workflow in R and pipeline in Python
Here we bundle the preprocessing step (Python) or recipe (R) and model.
svm_wf <- workflow () %>%
add_recipe (svm_recipe) %>%
add_model (svm_mod)
svm_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: svm_poly()
── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps
• step_impute_mean()
• step_impute_mode()
• step_dummy()
• step_zv()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Polynomial Support Vector Machine Model Specification (classification)
Main Arguments:
cost = tune()
degree = tune()
Computational engine: kernlab
Tuning grid
Here we tune the cost and the polynomial degree.
param_grid <- grid_regular(
  cost(range = c(-3, 2)),
  degree(range = c(1, 5)),
  # scale_factor(range = c(-1, 1)),
  levels = c(5)
)
param_grid
# A tibble: 25 × 2
cost degree
<dbl> <dbl>
1 0.125 1
2 0.297 1
3 0.707 1
4 1.68 1
5 4 1
6 0.125 2
7 0.297 2
8 0.707 2
9 1.68 2
10 4 2
# ℹ 15 more rows
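`dials::cost()` is tuned on a log2 scale, so `range = c(-3, 2)` with 5 levels spaces the candidates evenly in log2 units; the cost values in the grid above can be reproduced by hand:

```r
# 5 equally spaced exponents from -3 to 2, back-transformed from log2:
2^seq(-3, 2, length.out = 5)
# 0.125 0.297 0.707 1.68 4  (matching the cost column above)
```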
Cross-validation (CV)
Set cross-validation partitions.
set.seed(203)

folds <- vfold_cv(Heart_other, v = 5)
folds
# 5-fold cross-validation
# A tibble: 5 × 2
splits id
<list> <chr>
1 <split [181/46]> Fold1
2 <split [181/46]> Fold2
3 <split [182/45]> Fold3
4 <split [182/45]> Fold4
5 <split [182/45]> Fold5
Fit cross-validation.
svm_fit <- svm_wf %>%
  tune_grid(
    resamples = folds,
    grid = param_grid,
    metrics = metric_set(roc_auc, accuracy)
  )
svm_fit
# Tuning results
# 5-fold cross-validation
# A tibble: 5 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [181/46]> Fold1 <tibble [50 × 6]> <tibble [0 × 4]>
2 <split [181/46]> Fold2 <tibble [50 × 6]> <tibble [0 × 4]>
3 <split [182/45]> Fold3 <tibble [50 × 6]> <tibble [0 × 4]>
4 <split [182/45]> Fold4 <tibble [50 × 6]> <tibble [0 × 4]>
5 <split [182/45]> Fold5 <tibble [50 × 6]> <tibble [0 × 4]>
Visualize CV results:
svm_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "roc_auc") %>%
  ggplot(mapping = aes(x = cost, y = mean, color = factor(degree))) +
  geom_point() +
  geom_line() +
  labs(x = "Cost", y = "CV AUC", color = "Degree") +
  scale_x_log10()
# A tibble: 50 × 8
cost degree .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.125 1 accuracy binary 0.828 5 0.0187 pre0_mod01_post0
2 0.125 1 roc_auc binary 0.911 5 0.00945 pre0_mod01_post0
3 0.125 2 accuracy binary 0.771 5 0.0293 pre0_mod02_post0
4 0.125 2 roc_auc binary 0.845 5 0.0149 pre0_mod02_post0
5 0.125 3 accuracy binary 0.832 5 0.0154 pre0_mod03_post0
6 0.125 3 roc_auc binary 0.877 5 0.0280 pre0_mod03_post0
7 0.125 4 accuracy binary 0.78 5 0.0213 pre0_mod04_post0
8 0.125 4 roc_auc binary 0.858 5 0.0128 pre0_mod04_post0
9 0.125 5 accuracy binary 0.802 5 0.0300 pre0_mod05_post0
10 0.125 5 roc_auc binary 0.868 5 0.0394 pre0_mod05_post0
# ℹ 40 more rows
Show the top 5 models.
svm_fit %>%
  show_best(metric = "roc_auc")
# A tibble: 5 × 8
cost degree .metric .estimator mean n std_err .config
<dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.707 1 roc_auc binary 0.917 5 0.0110 pre0_mod11_post0
2 4 1 roc_auc binary 0.917 5 0.0119 pre0_mod21_post0
3 1.68 1 roc_auc binary 0.915 5 0.0118 pre0_mod16_post0
4 0.297 1 roc_auc binary 0.912 5 0.0124 pre0_mod06_post0
5 0.125 1 roc_auc binary 0.911 5 0.00945 pre0_mod01_post0
Let’s select the best model.
best_svm <- svm_fit %>%
  select_best(metric = "roc_auc")
best_svm
# A tibble: 1 × 3
cost degree .config
<dbl> <dbl> <chr>
1 0.707 1 pre0_mod11_post0
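`select_best()` picks the single configuration with the highest mean AUC. An alternative not used in this analysis is the one-standard-error rule, which prefers the simplest model (here, the smallest cost) whose performance is within one standard error of the numerically best:

```r
library(tidymodels)

# One-SE rule: among models within one std. error of the best mean AUC,
# pick the first after sorting by cost (least complex candidates first).
svm_fit %>%
  select_by_one_std_err(metric = "roc_auc", cost)
```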
Finalize our model
Now we are done tuning. Finally, let’s fit this final model to the whole training data and use our test data to estimate the model performance we expect to see with new data.
# Final workflow
final_wf <- svm_wf %>%
  finalize_workflow(best_svm)
final_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: svm_poly()
── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps
• step_impute_mean()
• step_impute_mode()
• step_dummy()
• step_zv()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Polynomial Support Vector Machine Model Specification (classification)
Main Arguments:
cost = 0.707106781186548
degree = 1
Computational engine: kernlab
# Fit the whole training set, then predict the test cases
final_fit <-
  final_wf %>%
  last_fit(data_split)
final_fit
# Resampling results
# Manual resampling
# A tibble: 1 × 6
splits id .metrics .notes .predictions .workflow
<list> <chr> <list> <list> <list> <list>
1 <split [227/76]> train/test split <tibble> <tibble> <tibble> <workflow>
# Test metrics
final_fit %>%
  collect_metrics()
# A tibble: 3 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy binary 0.789 pre0_mod0_post0
2 roc_auc binary 0.860 pre0_mod0_post0
3 brier_class binary 0.155 pre0_mod0_post0
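`last_fit()` also stores the test-set predictions, so we can go beyond scalar metrics. A sketch (not part of the original output) plotting the test ROC curve; `.pred_No` is used because No is the first factor level of AHD:

```r
library(tidymodels)

# Pull the saved test-set predictions and plot the ROC curve.
final_fit %>%
  collect_predictions() %>%
  roc_curve(AHD, .pred_No) %>%
  autoplot()
```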
# Prepare baked (preprocessed) train/test data for the variable
# importance plot below.
library(doParallel)

set.seed(101)
split_obj <- initial_split(data = Heart, prop = 0.7, strata = AHD)
train <- training(split_obj)
test <- testing(split_obj)

# Create the recipe
recipe(AHD ~ ., data = train) %>%
  # mean imputation for Ca
  step_impute_mean(Ca) %>%
  # mode imputation for Thal
  step_impute_mode(Thal) %>%
  # create traditional dummy variables (necessary for svm)
  step_dummy(all_nominal_predictors()) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>%
  # center and scale numeric data
  step_normalize(all_numeric_predictors()) %>%
  # estimate the means and standard deviations
  prep() -> recipe_obj

# Bake
train <- bake(recipe_obj, new_data = train)
test <- bake(recipe_obj, new_data = test)
library(vip)

final_fit %>%
  pluck(".workflow", 1) %>%
  # extract_fit_parsnip() replaces the deprecated pull_workflow_fit()
  extract_fit_parsnip() %>%
  # vip(method = "permute", train = Heart)
  vip(method = "permute",
      target = "AHD", metric = "accuracy",
      pred_wrapper = kernlab::predict, train = train)
svm_rbf_spec <- svm_rbf() %>%
  set_mode("classification") %>%
  set_engine("kernlab")

svm_rbf_fit <- svm_rbf_spec %>%
  fit(AHD ~ ., data = train[, c('Ca', 'ExAng', 'AHD')])

svm_rbf_fit %>%
  extract_fit_engine() %>%
  plot()