4 Data import

library(tidyverse)
student <- read_csv("D:/Document/0.Study R/0.R4DS/data/students.csv")
student

# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
2            2 Barclay Lynn     French fries       Lunch only          5    
3            3 Jayendra Lyne    N/A                Breakfast and lunch 7    
4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
6            6 Güvenç Attila    Ice cream          Lunch only          6

4.1 Practical advice

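The first step after importing data is usually to transform it so that later analysis is easier. Common transformations include renaming variables, selecting variables, filtering observations, and creating new variables. Taking the student data as an example: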

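The favourite.food variable contains the string N/A, which should be converted to a real missing value, NA.
The first two column names are wrapped in backticks (`), which means they contain special characters (here, spaces) and are therefore non-syntactic variable names.
The types of the mealPlan and AGE columns look wrong:
- AGE was read as character because one value is the word five, which should be the number 5.
- mealPlan takes only a small, fixed set of values, so it is better stored as a factor.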

student <- read_csv(
  "D:/Document/0.Study R/0.R4DS/data/students.csv",
  na = c("N/A", "") # treat "N/A" and empty fields as missing values
) |>
  janitor::clean_names() |> # convert all column names to snake_case
  mutate(
    meal_plan = factor(meal_plan), # convert meal_plan to a factor
    age = as.numeric(ifelse(age == "five", "5", age)) # replace "five" with "5", then convert age to numeric
  )
student

# A tibble: 6 × 5
  student_id full_name        favourite_food     meal_plan             age
       <dbl> <chr>            <chr>              <fct>               <dbl>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
2          2 Barclay Lynn     French fries       Lunch only              5
3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
4          4 Leon Rossini     Anchovies          Lunch only             NA
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
6          6 Güvenç Attila    Ice cream          Lunch only              6
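
Because the file path above is specific to one machine, the same cleaning steps can be tried on inline data. The following is a minimal sketch, not the notes' own code: the `csv` object is a hypothetical stand-in that copies four of the rows shown above, `rename()` is used in place of `janitor::clean_names()` so that only readr and dplyr are needed, and `parse_number()` with `if_else()` replaces `as.numeric()` with `ifelse()`.

```r
library(readr)
library(dplyr)

# Hypothetical inline copy of a few rows of students.csv.
# I() tells readr that the string is literal data, not a file path.
csv <- I(paste(
  "Student ID,Full Name,favourite.food,mealPlan,AGE",
  "1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4",
  "3,Jayendra Lyne,N/A,Breakfast and lunch,7",
  "4,Leon Rossini,Anchovies,Lunch only,",
  "5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five",
  sep = "\n"
))

students <- read_csv(csv, na = c("N/A", ""), show_col_types = FALSE) |>
  # Rename every column by hand instead of calling janitor::clean_names().
  rename(
    student_id = `Student ID`,
    full_name = `Full Name`,
    favourite_food = favourite.food,
    meal_plan = mealPlan,
    age = AGE
  ) |>
  mutate(
    meal_plan = factor(meal_plan),
    # Replace the word "five" with "5", then parse the column as a number.
    age = parse_number(if_else(age == "five", "5", age))
  )

students
```

Renaming by hand is more verbose than `clean_names()`, but it avoids the extra dependency and makes the mapping from old to new names explicit.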

4.2 Controlling column types

The readr package (Wickham et al. 2026) uses a heuristic to guess the type of each column, but the guess is sometimes wrong.
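
As a small illustration of a wrong guess, the sketch below reads a hypothetical one-column dataset in which a missing value was recorded as a period. The inline `csv` string and the variable names are assumptions made for this example only.

```r
library(readr)

# Hypothetical inline data: a numeric column where "." marks a missing value.
csv <- I("x\n10\n.\n20\n30\n")

# The stray "." makes readr fall back to a character column.
guessed <- read_csv(csv, show_col_types = FALSE)
class(guessed$x)

# Declaring "." as a missing-value marker lets the column be read as numbers.
fixed <- read_csv(csv, na = ".", show_col_types = FALSE)
class(fixed$x)
```

The first call yields a character column and the second a numeric one, so a single unexpected value is enough to change the guessed type of a whole column.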

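The col_types argument lets you specify the type of each column explicitly; see the help page for read_csv() for details. Column types can also be given as a compact string, with one character per column: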

c = character
i = integer
n = number
d = double
l = logical
f = factor
D = date
T = date time
t = time
? = guess
_ or - = skip
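
The compact notation can be tried on inline data. This is a minimal sketch: the `csv` object and its column names are hypothetical and exist only for this example.

```r
library(readr)

# Hypothetical inline data; I() marks the string as literal CSV rather than a path.
csv <- I("id,name,score,passed\n1,Ann,90.5,TRUE\n2,Bob,72,FALSE\n")

# One character per column, in order: integer, character, double, logical.
typed <- read_csv(csv, col_types = "icdl")
sapply(typed, class)

# "_" skips the name column and "?" leaves the score column to be guessed.
partial <- read_csv(csv, col_types = "i_?l")
names(partial)
```

The string must describe the columns in file order. When only a few columns need an explicit type, a named specification such as `col_types = cols(id = col_integer())` may be easier to read than a long string.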