4  Data import

library(tidyverse)
student <- read_csv("D:/Document/0.Study R/0.R4DS/data/students.csv")
student
# A tibble: 6 × 5
  `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
         <dbl> <chr>            <chr>              <chr>               <chr>
1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
2            2 Barclay Lynn     French fries       Lunch only          5    
3            3 Jayendra Lyne    N/A                Breakfast and lunch 7    
4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
6            6 Güvenç Attila    Ice cream          Lunch only          6    

4.1 Practical advice

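After importing data, the first step is usually to transform it so that later analysis is easier. Common transformations include renaming variables, selecting variables, filtering observations, and creating new variables. Taking the student data as an example: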

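  • The favourite.food variable contains the string N/A, which should be converted to the missing value NA.
  • The first two column names are wrapped in backticks (`), which means they contain special characters (here, a space) and are non-syntactic variable names.
  • The types of the mealPlan and AGE columns look wrong:
    • AGE contains the value five, which presumably should be 5; this single text value makes the whole column parse as character.
    • mealPlan takes only a small, fixed set of values, so it is better stored as a factor.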
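student <- read_csv(
  "D:/Document/0.Study R/0.R4DS/data/students.csv",
  na = c("N/A", "") # treat both "N/A" and empty strings as missing values
) |>
  janitor::clean_names() |> # convert column names to snake_case (requires the janitor package)
  mutate(
    meal_plan = factor(meal_plan), # convert meal_plan to a factor
    age = as.numeric(ifelse(age == "five", "5", age)) # replace "five" with "5", then convert age to numeric
  )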
student
# A tibble: 6 × 5
  student_id full_name        favourite_food     meal_plan             age
       <dbl> <chr>            <chr>              <fct>               <dbl>
1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
2          2 Barclay Lynn     French fries       Lunch only              5
3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
4          4 Leon Rossini     Anchovies          Lunch only             NA
5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
6          6 Güvenç Attila    Ice cream          Lunch only              6
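
Because the file path above is specific to one machine, the same cleaning steps can also be sketched on inline text. The following is a minimal sketch, not the original workflow: the inline CSV is an assumed excerpt built from three of the rows printed above, the name students_demo is made up for illustration, and dplyr's rename() stands in for janitor::clean_names() so that only readr and dplyr are needed.

```r
library(readr)
library(dplyr)

# Assumed excerpt of students.csv (rows 3 to 5 as printed above).
# The empty last field in the second row is a missing age.
csv_text <- paste(
  "Student ID,Full Name,favourite.food,mealPlan,AGE",
  "3,Jayendra Lyne,N/A,Breakfast and lunch,7",
  "4,Leon Rossini,Anchovies,Lunch only,",
  "5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five",
  sep = "\n"
)

students_demo <- read_csv(
  I(csv_text),               # I() marks the string as literal data, not a path
  na = c("N/A", ""),         # both "N/A" and empty fields become NA
  show_col_types = FALSE
) |>
  rename(                    # non-syntactic names need backticks when referenced
    student_id = `Student ID`,
    full_name = `Full Name`,
    favourite_food = favourite.food,
    meal_plan = mealPlan,
    age = AGE
  ) |>
  mutate(
    meal_plan = factor(meal_plan),
    age = parse_number(if_else(age == "five", "5", age))
  )

students_demo
```

Renaming by hand makes the backtick requirement visible; clean_names() is the more convenient choice when there are many columns.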

4.2 Controlling column types

The readr package (Wickham et al. 2026) uses heuristics to guess the type of each column, but the guess is sometimes wrong.
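
A small illustration of a wrong guess, as a sketch on made-up inline data: a single "." used as a missing-value marker is enough for the column to be read as character, and declaring that marker through the na argument restores the numeric type. The variable names here are hypothetical.

```r
library(readr)

# Made-up data: "." marks a missing value in an otherwise numeric column.
csv_text <- "x\n10\n.\n20\n30\n"

# The "." cannot be parsed as a number, so x is guessed as character.
guessed <- read_csv(I(csv_text), show_col_types = FALSE)
class(guessed$x)

# Declaring "." as a missing-value marker lets x be guessed as double.
fixed <- read_csv(I(csv_text), na = ".", show_col_types = FALSE)
class(fixed$x)
```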

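Use the col_types argument to specify the type of each column explicitly; see the help page for read_csv() for details. Column types can be given in a compact string notation, with one character per column: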

  • c = character
  • i = integer
  • n = number
  • d = double
  • l = logical
  • f = factor
  • D = date
  • T = date time
  • t = time
  • ? = guess
  • _ or - = skip
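
The compact notation can be tried on a small inline table. This is a minimal sketch with made-up data and hypothetical variable names; each character in the col_types string applies to the column in the same position.

```r
library(readr)

# Made-up five-column table.
csv_text <- paste(
  "id,name,score,passed,group",
  "1,Ann,90.5,TRUE,a",
  "2,Bo,72,FALSE,b",
  sep = "\n"
)

# i = integer, c = character, d = double, l = logical, f = factor
typed <- read_csv(I(csv_text), col_types = "icdlf")
spec(typed)

# "_" and "-" both skip a column: keep only id and score.
skipped <- read_csv(I(csv_text), col_types = "i_d--")
names(skipped)
```

The string must account for every column in the file, including the ones to skip, so it is mainly convenient for narrow tables; for wide tables, naming individual columns with cols() is usually easier to read.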