43  分而治之-应用决策树和规则进行分类

43.1 基本概念

  • 适用范围: 分类、回归。

  • 基本概念: 决策树基于分而治之的决策过程,通过把复杂问题逐层拆分为一系列条件判断(分支结点)来进行预测。决策树具有训练速度快、可解释性强的特点,既可以用于回归,也可以用于分类。生成决策树的方法很多,包括ID3、C4.5、C5.0、CART、CHAID等,在实践中可以同时使用并进行比较。

  • 应用场景:决策树的结果可以通过可视化方式直接展示,因此非常适合需要揭示机理的场景。例如,若需要建立用户违约模型并理解违约原因,就可以利用决策树对用户违约的模式进行辨识。实际上,凡是符合分段判断逻辑的场景,都适合使用决策树。

  • 随机森林是一种基于决策树的集成机器学习模型,将在章节 48 中详细介绍。

注记

决策树基于树结构进行决策,本质上是通过条件判断对样本空间进行划分。从根节点开始,将样本的属性值与节点的测试条件进行比较,根据比较结果选择相应的分支向下遍历,直到到达叶节点,叶节点所代表的类别或值就是该样本的预测结果。
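上述从根节点到叶节点的遍历过程,可以用嵌套的条件判断来示意(以下划分规则纯属假设,仅作说明,并非真实训练出的模型):

```r
# 用嵌套条件判断模拟从根节点到叶节点的遍历
# (以下划分规则为假设,仅作示意)
predict_toy_tree <- function(checking_balance, months_loan_duration) {
  if (checking_balance == "unknown") {
    "no" # 叶节点:预测不违约
  } else if (months_loan_duration > 24) {
    "yes" # 叶节点:预测违约
  } else {
    "no" # 叶节点:预测不违约
  }
}
predict_toy_tree("unknown", 36) # "no"
predict_toy_tree("< 0 DM", 36)  # "yes"
```

真实的决策树算法所做的,就是从数据中自动学出这一系列判断条件及其先后次序。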

43.2 构建决策树

43.2.1 C5.0算法

  1. 选择最优分裂属性:C5.0算法基于信息增益准则选择最优分裂属性,信息增益越高,按某一特征分割后得到的分组就越均匀(纯度越高)。
  2. 修剪决策树:在构建决策树时,要设定停止分裂的条件,以防止过拟合的发生,这个过程就叫做修剪。C5.0算法的优点之一是可以进行自动修剪。
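信息增益的含义可以手工演示如下:信息增益 = 父节点的熵 − 分割后各子节点熵的加权平均。下面的样本数为假设的玩具数据,仅作示意:

```r
# 信息增益(基于熵)的手工计算示例
entropy <- function(p) {
  p <- p[p > 0] # 约定 0 * log2(0) = 0
  -sum(p * log2(p))
}

# 假设父节点有 10 个样本:6 个 no、4 个 yes
parent <- entropy(c(6, 4) / 10)

# 按某特征分割成两个子节点:左(5 no, 1 yes),右(1 no, 3 yes)
left  <- entropy(c(5, 1) / 6)
right <- entropy(c(1, 3) / 4)

# 信息增益 = 父节点熵 - 子节点熵的加权平均
gain <- parent - (6 / 10 * left + 4 / 10 * right)
round(gain, 3) # 0.256
```

算法在每个节点上对各候选特征计算这一数值,选择增益最大者作为分裂属性。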

43.2.2 C5.0算法的实现

43.2.2.1 载入数据

pacman::p_load(
  tidymodels,
  tidyverse,
  rules,
  readr,
  tibble,
  dplyr,
  tidyr,
  purrr,
  stringr,
  forcats
)

credit <- read_csv(
  "D:/Document/0.Study R/0.R4DS/data/practice-data/credit.csv"
) %>%
  mutate(across(
    where(is.character),
    as.factor
  ))
glimpse(credit)
Rows: 1,000
Columns: 17
$ checking_balance     <fct> < 0 DM, 1 - 200 DM, unknown, < 0 DM, < 0 DM, unkn…
$ months_loan_duration <dbl> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48, 12…
$ credit_history       <fct> critical, good, critical, good, poor, good, good,…
$ purpose              <fct> furniture/appliances, furniture/appliances, educa…
$ amount               <dbl> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 3…
$ savings_balance      <fct> unknown, < 100 DM, < 100 DM, < 100 DM, < 100 DM, …
$ employment_duration  <fct> > 7 years, 1 - 4 years, 4 - 7 years, 4 - 7 years,…
$ percent_of_income    <dbl> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2, 4, 4…
$ years_at_residence   <dbl> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4, 2, 4…
$ age                  <dbl> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 24, 2…
$ other_credit         <fct> none, none, none, none, none, none, none, none, n…
$ housing              <fct> own, own, own, other, other, other, own, rent, ow…
$ existing_loans_count <dbl> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2…
$ job                  <fct> skilled, skilled, unskilled, skilled, skilled, un…
$ dependents           <dbl> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ phone                <fct> yes, no, no, no, no, yes, no, yes, no, no, no, no…
$ default              <fct> no, yes, no, no, yes, no, no, no, no, yes, yes, y…
table(credit$default)

 no yes 
700 300 

我们使用C5.0算法建立一个简单的信贷审批模型,找出与较高贷款违约风险相关的因素。使用的数据包含了1000行,17列,数据中货币的单位为DM(德国马克)。

  • checking_balance、saving_balance:支票和储蓄账户的余额情况。
  • default:贷款申请者是否符合约定的付款条件或他们是否陷入违约情况。是我们的目标变量。数据集中有30%的申请者违约,在划分数据时,要考虑这个比例。

43.2.2.2 划分数据集

set.seed(123)
credit_split <- initial_split(credit, prop = 0.9, strata = default)
train_data <- training(credit_split)
test_data <- testing(credit_split)
# 验证训练集和测试集中 no/yes 的比例是否仍保持 7:3。
prop.table(table(train_data$default))

 no yes 
0.7 0.3 
prop.table(table(test_data$default))

 no yes 
0.7 0.3 

43.2.2.3 训练模型

# Define the model
c50_model <- decision_tree() %>% # 创建决策树模型对象
  set_engine("C5.0") %>% # 指定C5.0算法
  set_mode("classification") # 设置模型类型为分类

# 创建工作流
c50_wf <- workflow() %>%
  add_model(c50_model) %>%
  add_formula(default ~ .)
c50_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
default ~ .

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)

Computational engine: C5.0 
# Fit the model
c50_fit <- c50_wf %>%
  fit(data = train_data)
c50_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
default ~ .

── Model ───────────────────────────────────────────────────────────────────────

Call:
C5.0.default(x = x, y = y, trials = 1, control = C50::C5.0Control(minCases =
 2, sample = 0))

Classification Tree
Number of samples: 900 
Number of predictors: 16 

Tree size: 58 

Non-standard options: attempt to group attributes
# predict on test data
c50_pred <- test_data %>%
  bind_cols(predict(c50_fit, new_data = test_data)) %>%
  bind_cols(predict(c50_fit, new_data = test_data, type = "prob")) %>%
  select(default, .pred_class, .pred_no, .pred_yes)
c50_pred
# A tibble: 100 × 4
   default .pred_class .pred_no .pred_yes
   <fct>   <fct>          <dbl>     <dbl>
 1 no      no             0.814    0.186 
 2 no      yes            0.108    0.892 
 3 yes     no             0.835    0.165 
 4 no      no             0.967    0.0333
 5 no      no             0.866    0.134 
 6 yes     no             0.866    0.134 
 7 no      no             0.866    0.134 
 8 no      no             0.940    0.0600
 9 no      no             0.866    0.134 
10 no      yes            0.350    0.650 
# ℹ 90 more rows
# Evaluate the model
## matrix of confusion
conf_mat(c50_pred, truth = default, estimate = .pred_class)
          Truth
Prediction no yes
       no  60  18
       yes 10  12
## metrics
metrics(c50_pred, truth = default, estimate = .pred_class, .pred_no)
# A tibble: 4 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.72 
2 kap         binary         0.278
3 mn_log_loss binary         0.642
4 roc_auc     binary         0.678

可以看到,目前模型的准确率为0.72,roc_auc为0.678,说明模型还有进一步优化的空间。

43.2.2.4 模型调优

C5.0 算法有一些超参数可以调整,例如 trials(提升迭代次数)和 minCases(叶节点最小样本数)。在 tidymodels 中,minCases 对应 decision_tree() 的 min_n 参数,而 trials 等引擎专有参数则需要通过 set_engine() 传入。

# 定义调参后的模型
set.seed(123)
c50_tune <- decision_tree(
  min_n = tune()
) %>%
  set_engine("C5.0", trials = 10) %>%
  set_mode("classification")

# 创建调参工作流
c50_tune_wf <- workflow() %>%
  add_model(c50_tune) %>%
  add_formula(default ~ .)

# 创建交叉验证
c50_folds <- vfold_cv(train_data, v = 5, strata = default)

# 定义调优网格
c50_grid <- grid_regular(
  min_n(c(2, 20)) # 最小样本数范围
)

# 进行调参
c50_tune_res <- c50_tune_wf %>%
  tune_grid(
    resamples = c50_folds,
    grid = c50_grid,
    metrics = metric_set(accuracy, roc_auc)
  )

# 选择最优模型
best_c50_model <- select_best(c50_tune_res, metric = "accuracy")

# 重新训练最优模型
c50_final_fit <- c50_tune_wf %>%
  finalize_workflow(best_c50_model) %>%
  fit(data = train_data)
c50_final_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
default ~ .

── Model ───────────────────────────────────────────────────────────────────────

Call:
C5.0.default(x = x, y = y, trials = 10, control = C50::C5.0Control(minCases
 = 20L, sample = 0))

Classification Tree
Number of samples: 900 
Number of predictors: 16 

Number of boosting iterations: 10 
Average tree size: 7.2 

Non-standard options: attempt to group attributes, minimum number of cases: 20
# 预测测试数据
c50_final_pred <- test_data %>%
  bind_cols(predict(c50_final_fit, new_data = test_data)) %>%
  bind_cols(predict(c50_final_fit, new_data = test_data, type = "prob")) %>%
  select(default, .pred_class, .pred_no, .pred_yes)
c50_final_pred
# A tibble: 100 × 4
   default .pred_class .pred_no .pred_yes
   <fct>   <fct>          <dbl>     <dbl>
 1 no      no             0.918    0.0818
 2 no      yes            0.194    0.806 
 3 yes     yes            0.320    0.680 
 4 no      no             0.830    0.170 
 5 no      no             0.833    0.167 
 6 yes     no             0.845    0.155 
 7 no      no             0.840    0.160 
 8 no      no             0.911    0.0888
 9 no      no             0.909    0.0910
10 no      yes            0.415    0.585 
# ℹ 90 more rows
# 评估模型
conf_mat(c50_final_pred, truth = default, estimate = .pred_class)
          Truth
Prediction no yes
       no  62  17
       yes  8  13
metrics(c50_final_pred, truth = default, estimate = .pred_class, .pred_no)
# A tibble: 4 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.75 
2 kap         binary         0.349
3 mn_log_loss binary         1.14 
4 roc_auc     binary         0.799
roc_curve(c50_final_pred, truth = default, .pred_no) %>%
  autoplot()

可以看到,经过调参后,模型的准确率提升到了0.75,roc_auc提升到了0.799,模型的性能有明显提升。

43.3 决策树可视化

TODO: rpart.plot, partykit, treeheatr
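在补充上述内容之前,可以先用 C50 包自带的绘图方法(底层依赖 partykit)快速查看树结构。以下为示意代码,其中为便于复现改用内置的 iris 数据:

```r
pacman::p_load(C50, partykit)

# 用内置的 iris 数据训练一棵 C5.0 树,便于复现
iris_tree <- C5.0(Species ~ ., data = iris)

# plot.C5.0 会把树转换为 partykit 的 party 对象并绘制
plot(iris_tree)

# 对前文的 workflow,可先提取底层引擎对象再绘图:
# c50_final_fit %>% extract_fit_engine() %>% plot()
```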