使用dplyr进行数据分析:入门篇

Stay hungry, stay foolish!

tidyverse系列应该算是R语言数据分析中的瑞士军刀了,统一的格式,简洁的代码,管道符便于阅读的形式,都能让大家快速上手。R数据科学就是专门讲这个系列的,但是对于很多函数的用法和细节问题,都没有说,所以在使用时还是会经常遇到各种问题。

我根据R数据科学和tidyverse官网的教程,整理了几篇笔记,主要是对tidyverse的各种函数的用法进行详细的演示。

前面已经介绍过了forcats包处理因子型数据,lubridate包处理日期时间格式数据。

下面介绍dplyr包。

在处理数据时,要明确以下几个问题:

  • 明确你的目的
  • 用计算机程序的方式描述你的任务
  • 执行程序

dplyr包可以帮你又快又简单地处理这些问题。tidyr包主要聚焦于把数据变成整洁数据,dplyr包主要功能在于对整洁数据进行各种操作,比如新增、筛选、汇总、合并等。

  • 安装
  • 数据集:starwars
  • 针对单个数据集的操作
    • filter()根据条件筛选行
    • arrange()进行排序
    • slice()根据位置选择行
    • select()选择列
    • mutate()新建列
    • relocate()重排列的位置
    • summarise()汇总

安装

install.packages("tidyverse")

数据集:starwars

下面使用星战(starwars)数据集演示基本的dplyr用法。

“starwars数据集共有87行,14列,记录了星战里面的87个人物(机器人、外星人等等)的14个特点,比如姓名、身高、体重、头发颜色、眼睛颜色、种族等。

library(dplyr)
## 
## 载入程辑包:'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

dim(starwars)
## [1] 87 14
glimpse(starwars)
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or~
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N~
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "~
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",~
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",~
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini~
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T~
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma~
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return~
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp~
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",~

针对单个数据集的操作

今天介绍的都是主要针对单个数据集进行操作的各种函数,也是最常见的类型。

根据作用方式不同,大致可以分为以下3类:

  • 作用于行
    • filter()
    • slice()
    • arrange()
  • 作用于列
    • select()
    • rename()
    • mutate()
    • relocate()
  • 作用于一组数据
    • summarise()

filter()根据条件筛选行

filter()函数用于筛选符合条件的行,可以用各种表达式进行筛选,比如筛选眼睛颜色是brown并且皮肤颜色是light的行,注意这里不需要使用 & 符号:

starwars %>% filter(skin_color == "light", eye_color == "brown")
## # A tibble: 7 x 14
##   name     height  mass hair_color skin_color eye_color birth_year sex    gender
##   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr> 
## 1 Leia Or~    150    49 brown      light      brown             19 female femin~
## 2 Biggs D~    183    84 black      light      brown             24 male   mascu~
## 3 Cordé       157    NA brown      light      brown             NA female femin~
## 4 Dormé       165    NA brown      light      brown             NA female femin~
## 5 Raymus ~    188    79 brown      light      brown             NA male   mascu~
## 6 Poe Dam~     NA    NA brown      light      brown             NA male   mascu~
## 7 Padmé A~    165    45 brown      light      brown             46 female femin~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

但是需要注意,filter()函数不支持直接使用行号进行筛选,比如说你想选择第1行到第3行,下面这种写法是错误的:

starwars %>% filter(1:3)

这种情况应该使用slice()函数:

starwars %>% slice(1:3)
## # A tibble: 3 x 14
##   name     height  mass hair_color skin_color  eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr> 
## 1 Luke Sk~    172    77 blond      fair        blue              19 male  mascu~
## 2 C-3PO       167    75 <NA>       gold        yellow           112 none  mascu~
## 3 R2-D2        96    32 <NA>       white, blue red               33 none  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

arrange()进行排序

arrange()函数是用来排序的,根据某一列进行排序。

starwars %>% arrange(height, mass)
## # A tibble: 87 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Yoda         66    17 white      green      brown            896 male  mascu~
##  2 Ratts T~     79    15 none       grey, blue unknown           NA male  mascu~
##  3 Wicket ~     88    20 brown      brown      brown              8 male  mascu~
##  4 Dud Bolt     94    45 none       blue, grey yellow            NA male  mascu~
##  5 R2-D2        96    32 <NA>       white, bl~ red               33 none  mascu~
##  6 R4-P17       96    NA none       silver, r~ red, blue         NA none  femin~
##  7 R5-D4        97    32 <NA>       white, red red               NA none  mascu~
##  8 Sebulba     112    40 none       grey, red  orange            NA male  mascu~
##  9 Gasgano     122    NA none       white, bl~ black             NA male  mascu~
## 10 Watto       137    NA black      blue, grey yellow            NA male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

desc()函数可以进行倒序:

starwars %>% arrange(desc(height))
## # A tibble: 87 x 14
##    name    height  mass hair_color skin_color  eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr> 
##  1 Yarael~    264    NA none       white       yellow          NA   male  mascu~
##  2 Tarfful    234   136 brown      brown       blue            NA   male  mascu~
##  3 Lama Su    229    88 none       grey        black           NA   male  mascu~
##  4 Chewba~    228   112 brown      unknown     blue           200   male  mascu~
##  5 Roos T~    224    82 none       grey        orange          NA   male  mascu~
##  6 Grievo~    216   159 none       brown, whi~ green, y~       NA   male  mascu~
##  7 Taun We    213    NA none       grey        black           NA   fema~ femin~
##  8 Rugor ~    206    NA none       green       orange          NA   male  mascu~
##  9 Tion M~    206    80 none       grey        black           NA   male  mascu~
## 10 Darth ~    202   136 none       white       yellow          41.9 male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

slice()根据位置选择行

选择第5-10行的数据:

starwars %>% slice(5:10)
## # A tibble: 6 x 14
##   name     height  mass hair_color  skin_color eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Leia Or~    150    49 brown       light      brown             19 fema~ femin~
## 2 Owen La~    178   120 brown, grey light      blue              52 male  mascu~
## 3 Beru Wh~    165    75 brown       light      blue              47 fema~ femin~
## 4 R5-D4        97    32 <NA>        white, red red               NA none  mascu~
## 5 Biggs D~    183    84 black       light      brown             24 male  mascu~
## 6 Obi-Wan~    182    77 auburn, wh~ fair       blue-gray         57 male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

这其实是一组函数,还有各种变体,比如下面这个,选择前4行:

starwars %>% slice_head(n = 4)
## # A tibble: 4 x 14
##   name     height  mass hair_color skin_color  eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr> 
## 1 Luke Sk~    172    77 blond      fair        blue            19   male  mascu~
## 2 C-3PO       167    75 <NA>       gold        yellow         112   none  mascu~
## 3 R2-D2        96    32 <NA>       white, blue red             33   none  mascu~
## 4 Darth V~    202   136 none       white       yellow          41.9 male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

选择最后面的10%的行:

starwars %>% slice_tail(prop = 0.1)
## # A tibble: 8 x 14
##   name     height  mass hair_color skin_color eye_color birth_year sex    gender
##   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr> 
## 1 Sly Moo~    178    48 none       pale       white             NA <NA>   <NA>  
## 2 Tion Me~    206    80 none       grey       black             NA male   mascu~
## 3 Finn         NA    NA black      dark       dark              NA male   mascu~
## 4 Rey          NA    NA brown      light      hazel             NA female femin~
## 5 Poe Dam~     NA    NA brown      light      brown             NA male   mascu~
## 6 BB8          NA    NA none       none       black             NA none   mascu~
## 7 Captain~     NA    NA unknown    unknown    unknown           NA <NA>   <NA>  
## 8 Padmé A~    165    45 brown      light      brown             46 female femin~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

随机选择10%的行:

starwars %>% slice_sample(prop = 0.1) # n=2
## # A tibble: 8 x 14
##   name    height  mass hair_color skin_color   eye_color birth_year sex   gender
##   <chr>    <int> <dbl> <chr>      <chr>        <chr>          <dbl> <chr> <chr> 
## 1 Lobot      175    79 none       light        blue              37 male  mascu~
## 2 Zam We~    168    55 blonde     fair, green~ yellow            NA fema~ femin~
## 3 Ric Ol~    183    NA brown      fair         blue              NA <NA>  <NA>  
## 4 R4-P17      96    NA none       silver, red  red, blue         NA none  femin~
## 5 Lando ~    177    79 black      dark         brown             31 male  mascu~
## 6 Greedo     173    74 <NA>       green        black             44 male  mascu~
## 7 Ackbar     180    83 none       brown mottle orange            41 male  mascu~
## 8 Rugor ~    206    NA none       green        orange            NA male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

随机选择10%的行,可以重复:

starwars %>% slice_sample(n=10, replace = T)
## # A tibble: 10 x 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Jek To~    180   110 brown       fair       blue              NA male  mascu~
##  2 Quarsh~    183    NA black       dark       brown             62 <NA>  <NA>  
##  3 Arvel ~     NA    NA brown       fair       brown             NA male  mascu~
##  4 Darth ~    175    80 none        red        yellow            54 male  mascu~
##  5 Finn        NA    NA black       dark       dark              NA male  mascu~
##  6 Ric Ol~    183    NA brown       fair       blue              NA <NA>  <NA>  
##  7 Mace W~    188    84 none        dark       brown             72 male  mascu~
##  8 Jango ~    183    79 black       tan        brown             66 male  mascu~
##  9 San Hi~    191    NA none        grey       gold              NA male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray         57 male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

选择某一列中最大或者最小的几个值所在的行,注意不能有NA值:

starwars %>% filter(!is.na(height)) %>% 
  slice_max(height, n = 5) # 选择的这列不能有NA
## # A tibble: 5 x 14
##   name     height  mass hair_color skin_color eye_color birth_year sex   gender 
##   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>  
## 1 Yarael ~    264    NA none       white      yellow            NA male  mascul~
## 2 Tarfful     234   136 brown      brown      blue              NA male  mascul~
## 3 Lama Su     229    88 none       grey       black             NA male  mascul~
## 4 Chewbac~    228   112 brown      unknown    blue             200 male  mascul~
## 5 Roos Ta~    224    82 none       grey       orange            NA male  mascul~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

select()选择列

直接根据列名选择列,列名不需要使用引号:

starwars %>% select(hair_color, skin_color, eye_color)
## # A tibble: 87 x 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>    
##  1 blond         fair        blue     
##  2 <NA>          gold        yellow   
##  3 <NA>          white, blue red      
##  4 none          white       yellow   
##  5 brown         light       brown    
##  6 brown, grey   light       blue     
##  7 brown         light       blue     
##  8 <NA>          white, red  red      
##  9 black         light       brown    
## 10 auburn, white fair        blue-gray
## # ... with 77 more rows

选择列名中以color结尾的列:

starwars %>% select(ends_with("color"))
## # A tibble: 87 x 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>    
##  1 blond         fair        blue     
##  2 <NA>          gold        yellow   
##  3 <NA>          white, blue red      
##  4 none          white       yellow   
##  5 brown         light       brown    
##  6 brown, grey   light       blue     
##  7 brown         light       blue     
##  8 <NA>          white, red  red      
##  9 black         light       brown    
## 10 auburn, white fair        blue-gray
## # ... with 77 more rows

选择列名中包含color字样的列:

starwars %>% select(contains("color"))
## # A tibble: 87 x 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>    
##  1 blond         fair        blue     
##  2 <NA>          gold        yellow   
##  3 <NA>          white, blue red      
##  4 none          white       yellow   
##  5 brown         light       brown    
##  6 brown, grey   light       blue     
##  7 brown         light       blue     
##  8 <NA>          white, red  red      
##  9 black         light       brown    
## 10 auburn, white fair        blue-gray
## # ... with 77 more rows

重命名列:

starwars %>% rename(home_world = homeworld)
## # A tibble: 87 x 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: home_world <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

mutate()新建列

starwars %>% mutate(height_m = height/100, .before = 1)
## # A tibble: 87 x 15
##    height_m name   height  mass hair_color skin_color eye_color birth_year sex  
##       <dbl> <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1     1.72 Luke ~    172    77 blond      fair       blue            19   male 
##  2     1.67 C-3PO     167    75 <NA>       gold       yellow         112   none 
##  3     0.96 R2-D2      96    32 <NA>       white, bl~ red             33   none 
##  4     2.02 Darth~    202   136 none       white      yellow          41.9 male 
##  5     1.5  Leia ~    150    49 brown      light      brown           19   fema~
##  6     1.78 Owen ~    178   120 brown, gr~ light      blue            52   male 
##  7     1.65 Beru ~    165    75 brown      light      blue            47   fema~
##  8     0.97 R5-D4      97    32 <NA>       white, red red             NA   none 
##  9     1.83 Biggs~    183    84 black      light      brown           24   male 
## 10     1.82 Obi-W~    182    77 auburn, w~ fair       blue-gray       57   male 
## # ... with 77 more rows, and 6 more variables: gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

这个效果和上面是一样的,都是把新建的列放在最前面:

starwars %>% 
  mutate(height_m = height/100) %>% 
  select(height_m, everything())
## # A tibble: 87 x 15
##    height_m name   height  mass hair_color skin_color eye_color birth_year sex  
##       <dbl> <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1     1.72 Luke ~    172    77 blond      fair       blue            19   male 
##  2     1.67 C-3PO     167    75 <NA>       gold       yellow         112   none 
##  3     0.96 R2-D2      96    32 <NA>       white, bl~ red             33   none 
##  4     2.02 Darth~    202   136 none       white      yellow          41.9 male 
##  5     1.5  Leia ~    150    49 brown      light      brown           19   fema~
##  6     1.78 Owen ~    178   120 brown, gr~ light      blue            52   male 
##  7     1.65 Beru ~    165    75 brown      light      blue            47   fema~
##  8     0.97 R5-D4      97    32 <NA>       white, red red             NA   none 
##  9     1.83 Biggs~    183    84 black      light      brown           24   male 
## 10     1.82 Obi-W~    182    77 auburn, w~ fair       blue-gray       57   male 
## # ... with 77 more rows, and 6 more variables: gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

新建的列可以直接被使用:

starwars %>%
  mutate(
    height_m = height / 100,
    BMI = mass / (height_m^2)
  ) %>%
  select(BMI, everything())
## # A tibble: 87 x 16
##      BMI name     height  mass hair_color  skin_color eye_color birth_year sex  
##    <dbl> <chr>     <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr>
##  1  26.0 Luke Sk~    172    77 blond       fair       blue            19   male 
##  2  26.9 C-3PO       167    75 <NA>        gold       yellow         112   none 
##  3  34.7 R2-D2        96    32 <NA>        white, bl~ red             33   none 
##  4  33.3 Darth V~    202   136 none        white      yellow          41.9 male 
##  5  21.8 Leia Or~    150    49 brown       light      brown           19   fema~
##  6  37.9 Owen La~    178   120 brown, grey light      blue            52   male 
##  7  27.5 Beru Wh~    165    75 brown       light      blue            47   fema~
##  8  34.0 R5-D4        97    32 <NA>        white, red red             NA   none 
##  9  25.1 Biggs D~    183    84 black       light      brown           24   male 
## 10  23.2 Obi-Wan~    182    77 auburn, wh~ fair       blue-gray       57   male 
## # ... with 77 more rows, and 7 more variables: gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>,
## #   height_m <dbl>

只保留新建的列,其他列不要了:

starwars %>%
  transmute(
    height_m = height / 100,
    BMI = mass / (height_m^2)
  )
## # A tibble: 87 x 2
##    height_m   BMI
##       <dbl> <dbl>
##  1     1.72  26.0
##  2     1.67  26.9
##  3     0.96  34.7
##  4     2.02  33.3
##  5     1.5   21.8
##  6     1.78  37.9
##  7     1.65  27.5
##  8     0.97  34.0
##  9     1.83  25.1
## 10     1.82  23.2
## # ... with 77 more rows

relocate()重排列的位置

主要是使用.before.after参数,控制位置:

starwars %>% relocate(sex:homeworld, .before = height)
## # A tibble: 87 x 14
##    name     sex    gender homeworld height  mass hair_color skin_color eye_color
##    <chr>    <chr>  <chr>  <chr>      <int> <dbl> <chr>      <chr>      <chr>    
##  1 Luke Sk~ male   mascu~ Tatooine     172    77 blond      fair       blue     
##  2 C-3PO    none   mascu~ Tatooine     167    75 <NA>       gold       yellow   
##  3 R2-D2    none   mascu~ Naboo         96    32 <NA>       white, bl~ red      
##  4 Darth V~ male   mascu~ Tatooine     202   136 none       white      yellow   
##  5 Leia Or~ female femin~ Alderaan     150    49 brown      light      brown    
##  6 Owen La~ male   mascu~ Tatooine     178   120 brown, gr~ light      blue     
##  7 Beru Wh~ female femin~ Tatooine     165    75 brown      light      blue     
##  8 R5-D4    none   mascu~ Tatooine      97    32 <NA>       white, red red      
##  9 Biggs D~ male   mascu~ Tatooine     183    84 black      light      brown    
## 10 Obi-Wan~ male   mascu~ Stewjon      182    77 auburn, w~ fair       blue-gray
## # ... with 77 more rows, and 5 more variables: birth_year <dbl>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

summarise()汇总

一般来说和group_by()连用才能发挥威力。

starwars %>% summarise(height = mean(height, na.rm = T))
## # A tibble: 1 x 1
##   height
##    <dbl>
## 1   174.

今天主要是对dplyr有一个大致的认识,熟悉下最常见的操作,后面会根据不同的应用场景继续介绍更多的内容。

以上就是今天的内容,希望对你有帮助哦!欢迎点赞、在看、关注、转发

欢迎在评论区留言或直接添加我的微信!

End

欢迎关注公众号:医学和生信笔记

医学和生信笔记 公众号主要分享:1.医学小知识、肛肠科小知识;2.R语言和Python相关的数据分析、可视化、机器学习等;3.生物信息学学习资料和自己的学习笔记!

未经允许不得转载:木盒主机 » 使用dplyr进行数据分析:入门篇

赞 (0)

相关推荐

    暂无内容!