# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4
2            2 Barclay Lynn     French fries       Lunch only          5
3            3 Jayendra Lyne    N/A                Breakfast and lunch 7
4            4 Leon Rossini     Anchovies          Lunch only          <NA>
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five
6            6 Güvenç Attila    Ice cream          Lunch only          6
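4 Data import
4.1 Practical advice
After importing data, the first step is usually to transform it so that it is easier to work with in later analysis. Common transformations include renaming variables, selecting variables, filtering observations, and creating new variables. Take the student data as an example: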
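- The favourite.food column contains the value N/A, which should be converted to the missing value NA.
- The names of the first two columns are wrapped in backticks ("`"), which means they contain special characters (spaces, in this case) and are not syntactic variable names.
- The types of the mealPlan and AGE columns look wrong:
  - AGE contains the value five, which should presumably be 5.
  - The values in mealPlan come from a small, fixed set of options, so the column is better stored as a factor.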
student <- read_csv(
"D:/Document/0.Study R/0.R4DS/data/students.csv",
na = c("N/A", "")
) |>
janitor::clean_names() |>
mutate(
meal_plan = factor(meal_plan), # convert meal_plan to a factor
age = as.numeric(ifelse(age == "five", "5", age)) # replace "five" with "5" in age, then convert to numeric
)
student
# A tibble: 6 × 5
  student_id full_name        favourite_food     meal_plan             age
       <dbl> <chr>            <chr>              <fct>               <dbl>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
2          2 Barclay Lynn     French fries       Lunch only              5
3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
4          4 Leon Rossini     Anchovies          Lunch only             NA
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
6          6 Güvenç Attila    Ice cream          Lunch only              6
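The file path above is specific to one machine, so here is a minimal sketch that reproduces the same clean-up on an inline copy of the data. It assumes readr 2.0 or later, where I() marks a string as literal data. It swaps janitor::clean_names() for an explicit dplyr::rename(), and uses parse_number() rather than as.numeric(). The names students_csv and student_inline are made up for this illustration.

```r
library(readr)
library(dplyr)

# Inline copy of the raw data shown at the top, so no file path is needed
students_csv <- I("Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6
")

student_inline <- read_csv(
  students_csv,
  na = c("N/A", ""),       # treat both "N/A" and empty fields as missing
  show_col_types = FALSE
) |>
  rename(                  # explicit renaming instead of janitor::clean_names()
    student_id = `Student ID`,
    full_name = `Full Name`,
    favourite_food = favourite.food,
    meal_plan = mealPlan,
    age = AGE
  ) |>
  mutate(
    meal_plan = factor(meal_plan),                        # finite set of options
    age = parse_number(if_else(age == "five", "5", age))  # fix "five", then parse
  )

student_inline
```

Renaming by hand is more verbose than clean_names(), but it avoids the extra dependency and makes the new names explicit.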
4.2 Controlling column types
The readr package (Wickham et al. 2026) uses heuristics to guess the type of each column, but the guess is not always right.
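As a small illustration of the guessing step, readr exports guess_parser(), which reports the type it would pick for a character vector. The input vectors below are made up to mirror the AGE column of the student data.

```r
library(readr)

# Every value looks numeric, so the heuristic picks "double"
guess_parser(c("4", "5", "7"))
#> [1] "double"

# A single non-numeric value makes the whole column fall back to "character"
guess_parser(c("4", "5", "five"))
#> [1] "character"
```

This is why AGE was read as a character column in the student data: one stray "five" is enough to change the guessed type.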
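You can use the col_types argument to specify the type of each column explicitly; see the help page for read_csv() for details. Column types can also be written as a compact string in which each character stands for one column: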
- c = character
- i = integer
- n = number
- d = double
- l = logical
- f = factor
- D = date
- T = date time
- t = time
- ? = guess
- _ or - = skip
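The compact codes above can be sketched on a small inline data set, assuming readr 2.0 or later for I(). The data and the names csv_text, students_compact and students_explicit are invented for this example. The string "icfd" assigns integer, character, factor and double to the four columns in order; the second call spells out the same specification with cols().

```r
library(readr)

# Hypothetical inline data; I() tells readr this is literal data, not a path
csv_text <- I("student_id,full_name,meal_plan,age
1,Sunil Huffmann,Lunch only,4
2,Barclay Lynn,Lunch only,5
3,Jayendra Lyne,Breakfast and lunch,7
")

# Compact form: one character per column, in column order
students_compact <- read_csv(csv_text, col_types = "icfd")

# Equivalent verbose form using cols() and the col_*() helpers
students_explicit <- read_csv(
  csv_text,
  col_types = cols(
    student_id = col_integer(),
    full_name  = col_character(),
    meal_plan  = col_factor(),
    age        = col_double()
  )
)

# Both calls should yield the same column classes
sapply(students_compact, class)
```

The compact string is convenient for narrow files, while cols() is easier to read when there are many columns, because each type is tied to a column name rather than a position.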