Stay hungry, stay foolish!
tidyverse
系列应该算是R语言数据分析中的瑞士军刀了,统一的格式,简洁的代码,管道符便于阅读的形式,都能让大家快速上手。R数据科学就是专门讲这个系列的,但是对于很多函数的用法和细节问题,都没有说,所以在使用时还是会经常遇到各种问题。
我根据R数据科学和tidyverse官网的教程,整理了几篇笔记,主要是对tidyverse的各种函数的用法进行详细的演示。
前面已经介绍过了forcats
包处理因子型数据,lubridate
包处理日期时间格式数据。
下面介绍dplyr
包。
在处理数据时,要明确以下几个问题:
- 明确你的目的
- 用计算机程序的方式描述你的任务
- 执行程序
dplyr
包可以帮你又快又简单地处理这些问题。tidyr
包主要聚焦于把数据变成整洁数据,dplyr
包主要功能在于对整洁数据进行各种操作,比如新增、筛选、汇总、合并等。
- 安装
- 数据集:starwars
- 针对单个数据集的操作
- filter()根据条件筛选行
- arrange()进行排序
- slice()根据位置选择行
- select()选择列
- mutate()新建列
- relocate()重排列的位置
- summarise()汇总
安装
install.packages("tidyverse")
数据集:starwars
下面使用星战(starwars)数据集演示基本的dplyr
用法。
“starwars数据集共有87行,14列,记录了星战里面的87个人物(机器人、外星人等等)的14个特点,比如姓名、身高、体重、头发颜色、眼睛颜色、种族等。
library(dplyr)
##
## 载入程辑包:'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dim(starwars)
## [1] 87 14
glimpse(starwars)
## Rows: 87
## Columns: 14
## $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or~
## $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~
## $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N~
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "~
## $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",~
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~
## $ sex <chr> "male", "none", "none", "male", "female", "male", "female",~
## $ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini~
## $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T~
## $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma~
## $ films <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return~
## $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp~
## $ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",~
针对单个数据集的操作
今天介绍的都是主要针对单个数据集进行操作的各种函数,也是最常见的类型。
根据作用方式不同,大致可以分为以下3类:
- 作用于行
filter()
slice()
arrange()
- 作用于列
select()
rename()
mutate()
relocate()
- 作用于一组数据
summarise()
filter()根据条件筛选行
filter()
函数用于筛选符合条件的行,可以用各种表达式进行筛选,比如筛选眼睛颜色是brown并且皮肤颜色是light的行,注意这里不需要使用 & 符号:
starwars %>% filter(skin_color == "light", eye_color == "brown")
## # A tibble: 7 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Leia Or~ 150 49 brown light brown 19 female femin~
## 2 Biggs D~ 183 84 black light brown 24 male mascu~
## 3 Cordé 157 NA brown light brown NA female femin~
## 4 Dormé 165 NA brown light brown NA female femin~
## 5 Raymus ~ 188 79 brown light brown NA male mascu~
## 6 Poe Dam~ NA NA brown light brown NA male mascu~
## 7 Padmé A~ 165 45 brown light brown 46 female femin~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
但是需要注意,filter()
函数不支持直接使用行号进行筛选,比如说你想选择第1行到第3行,下面这种写法是错误的:
starwars %>% filter(1:3)
这种情况应该使用slice()
函数:
starwars %>% slice(1:3)
## # A tibble: 3 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, blue red 33 none mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
arrange()进行排序
arrange()
函数是用来排序的,根据某一列进行排序。
starwars %>% arrange(height, mass)
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yoda 66 17 white green brown 896 male mascu~
## 2 Ratts T~ 79 15 none grey, blue unknown NA male mascu~
## 3 Wicket ~ 88 20 brown brown brown 8 male mascu~
## 4 Dud Bolt 94 45 none blue, grey yellow NA male mascu~
## 5 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 6 R4-P17 96 NA none silver, r~ red, blue NA none femin~
## 7 R5-D4 97 32 <NA> white, red red NA none mascu~
## 8 Sebulba 112 40 none grey, red orange NA male mascu~
## 9 Gasgano 122 NA none white, bl~ black NA male mascu~
## 10 Watto 137 NA black blue, grey yellow NA male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
desc()
函数可以进行倒序:
starwars %>% arrange(desc(height))
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yarael~ 264 NA none white yellow NA male mascu~
## 2 Tarfful 234 136 brown brown blue NA male mascu~
## 3 Lama Su 229 88 none grey black NA male mascu~
## 4 Chewba~ 228 112 brown unknown blue 200 male mascu~
## 5 Roos T~ 224 82 none grey orange NA male mascu~
## 6 Grievo~ 216 159 none brown, whi~ green, y~ NA male mascu~
## 7 Taun We 213 NA none grey black NA fema~ femin~
## 8 Rugor ~ 206 NA none green orange NA male mascu~
## 9 Tion M~ 206 80 none grey black NA male mascu~
## 10 Darth ~ 202 136 none white yellow 41.9 male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
slice()根据位置选择行
选择第5-10行的数据:
starwars %>% slice(5:10)
## # A tibble: 6 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Leia Or~ 150 49 brown light brown 19 fema~ femin~
## 2 Owen La~ 178 120 brown, grey light blue 52 male mascu~
## 3 Beru Wh~ 165 75 brown light blue 47 fema~ femin~
## 4 R5-D4 97 32 <NA> white, red red NA none mascu~
## 5 Biggs D~ 183 84 black light brown 24 male mascu~
## 6 Obi-Wan~ 182 77 auburn, wh~ fair blue-gray 57 male mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
这其实是一组函数,还有各种变体,比如下面这个,选择前4行:
starwars %>% slice_head(n = 4)
## # A tibble: 4 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, blue red 33 none mascu~
## 4 Darth V~ 202 136 none white yellow 41.9 male mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
选择最后面的10%的行:
starwars %>% slice_tail(prop = 0.1)
## # A tibble: 8 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Sly Moo~ 178 48 none pale white NA <NA> <NA>
## 2 Tion Me~ 206 80 none grey black NA male mascu~
## 3 Finn NA NA black dark dark NA male mascu~
## 4 Rey NA NA brown light hazel NA female femin~
## 5 Poe Dam~ NA NA brown light brown NA male mascu~
## 6 BB8 NA NA none none black NA none mascu~
## 7 Captain~ NA NA unknown unknown unknown NA <NA> <NA>
## 8 Padmé A~ 165 45 brown light brown 46 female femin~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
随机选择10%的行:
starwars %>% slice_sample(prop = 0.1) # n=2
## # A tibble: 8 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Lobot 175 79 none light blue 37 male mascu~
## 2 Zam We~ 168 55 blonde fair, green~ yellow NA fema~ femin~
## 3 Ric Ol~ 183 NA brown fair blue NA <NA> <NA>
## 4 R4-P17 96 NA none silver, red red, blue NA none femin~
## 5 Lando ~ 177 79 black dark brown 31 male mascu~
## 6 Greedo 173 74 <NA> green black 44 male mascu~
## 7 Ackbar 180 83 none brown mottle orange 41 male mascu~
## 8 Rugor ~ 206 NA none green orange NA male mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
随机选择10%的行,可以重复:
starwars %>% slice_sample(n=10, replace = T)
## # A tibble: 10 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Jek To~ 180 110 brown fair blue NA male mascu~
## 2 Quarsh~ 183 NA black dark brown 62 <NA> <NA>
## 3 Arvel ~ NA NA brown fair brown NA male mascu~
## 4 Darth ~ 175 80 none red yellow 54 male mascu~
## 5 Finn NA NA black dark dark NA male mascu~
## 6 Ric Ol~ 183 NA brown fair blue NA <NA> <NA>
## 7 Mace W~ 188 84 none dark brown 72 male mascu~
## 8 Jango ~ 183 79 black tan brown 66 male mascu~
## 9 San Hi~ 191 NA none grey gold NA male mascu~
## 10 Obi-Wa~ 182 77 auburn, wh~ fair blue-gray 57 male mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
选择某一列中最大或者最小的几个值所在的行,注意不能有NA值:
starwars %>% filter(!is.na(height)) %>%
slice_max(height, n = 5) # 选择的这列不能有NA
## # A tibble: 5 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yarael ~ 264 NA none white yellow NA male mascul~
## 2 Tarfful 234 136 brown brown blue NA male mascul~
## 3 Lama Su 229 88 none grey black NA male mascul~
## 4 Chewbac~ 228 112 brown unknown blue 200 male mascul~
## 5 Roos Ta~ 224 82 none grey orange NA male mascul~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
select()选择列
直接根据列名选择列,列名不需要使用引号:
starwars %>% select(hair_color, skin_color, eye_color)
## # A tibble: 87 x 3
## hair_color skin_color eye_color
## <chr> <chr> <chr>
## 1 blond fair blue
## 2 <NA> gold yellow
## 3 <NA> white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
## 7 brown light blue
## 8 <NA> white, red red
## 9 black light brown
## 10 auburn, white fair blue-gray
## # ... with 77 more rows
选择列名中以color结尾的列:
starwars %>% select(ends_with("color"))
## # A tibble: 87 x 3
## hair_color skin_color eye_color
## <chr> <chr> <chr>
## 1 blond fair blue
## 2 <NA> gold yellow
## 3 <NA> white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
## 7 brown light blue
## 8 <NA> white, red red
## 9 black light brown
## 10 auburn, white fair blue-gray
## # ... with 77 more rows
选择列名中包含color字样的列:
starwars %>% select(contains("color"))
## # A tibble: 87 x 3
## hair_color skin_color eye_color
## <chr> <chr> <chr>
## 1 blond fair blue
## 2 <NA> gold yellow
## 3 <NA> white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
## 7 brown light blue
## 8 <NA> white, red red
## 9 black light brown
## 10 auburn, white fair blue-gray
## # ... with 77 more rows
重命名列:
starwars %>% rename(home_world = homeworld)
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke S~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth ~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia O~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen L~ 178 120 brown, grey light blue 52 male mascu~
## 7 Beru W~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 <NA> white, red red NA none mascu~
## 9 Biggs ~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wa~ 182 77 auburn, wh~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 5 more variables: home_world <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
mutate()新建列
starwars %>% mutate(height_m = height/100, .before = 1)
## # A tibble: 87 x 15
## height_m name height mass hair_color skin_color eye_color birth_year sex
## <dbl> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 1.72 Luke ~ 172 77 blond fair blue 19 male
## 2 1.67 C-3PO 167 75 <NA> gold yellow 112 none
## 3 0.96 R2-D2 96 32 <NA> white, bl~ red 33 none
## 4 2.02 Darth~ 202 136 none white yellow 41.9 male
## 5 1.5 Leia ~ 150 49 brown light brown 19 fema~
## 6 1.78 Owen ~ 178 120 brown, gr~ light blue 52 male
## 7 1.65 Beru ~ 165 75 brown light blue 47 fema~
## 8 0.97 R5-D4 97 32 <NA> white, red red NA none
## 9 1.83 Biggs~ 183 84 black light brown 24 male
## 10 1.82 Obi-W~ 182 77 auburn, w~ fair blue-gray 57 male
## # ... with 77 more rows, and 6 more variables: gender <chr>, homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
这个效果和上面是一样的,都是把新建的列放在最前面:
starwars %>%
mutate(height_m = height/100) %>%
select(height_m, everything())
## # A tibble: 87 x 15
## height_m name height mass hair_color skin_color eye_color birth_year sex
## <dbl> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 1.72 Luke ~ 172 77 blond fair blue 19 male
## 2 1.67 C-3PO 167 75 <NA> gold yellow 112 none
## 3 0.96 R2-D2 96 32 <NA> white, bl~ red 33 none
## 4 2.02 Darth~ 202 136 none white yellow 41.9 male
## 5 1.5 Leia ~ 150 49 brown light brown 19 fema~
## 6 1.78 Owen ~ 178 120 brown, gr~ light blue 52 male
## 7 1.65 Beru ~ 165 75 brown light blue 47 fema~
## 8 0.97 R5-D4 97 32 <NA> white, red red NA none
## 9 1.83 Biggs~ 183 84 black light brown 24 male
## 10 1.82 Obi-W~ 182 77 auburn, w~ fair blue-gray 57 male
## # ... with 77 more rows, and 6 more variables: gender <chr>, homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>
新建的列可以直接被使用:
starwars %>%
mutate(
height_m = height / 100,
BMI = mass / (height_m^2)
) %>%
select(BMI, everything())
## # A tibble: 87 x 16
## BMI name height mass hair_color skin_color eye_color birth_year sex
## <dbl> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 26.0 Luke Sk~ 172 77 blond fair blue 19 male
## 2 26.9 C-3PO 167 75 <NA> gold yellow 112 none
## 3 34.7 R2-D2 96 32 <NA> white, bl~ red 33 none
## 4 33.3 Darth V~ 202 136 none white yellow 41.9 male
## 5 21.8 Leia Or~ 150 49 brown light brown 19 fema~
## 6 37.9 Owen La~ 178 120 brown, grey light blue 52 male
## 7 27.5 Beru Wh~ 165 75 brown light blue 47 fema~
## 8 34.0 R5-D4 97 32 <NA> white, red red NA none
## 9 25.1 Biggs D~ 183 84 black light brown 24 male
## 10 23.2 Obi-Wan~ 182 77 auburn, wh~ fair blue-gray 57 male
## # ... with 77 more rows, and 7 more variables: gender <chr>, homeworld <chr>,
## # species <chr>, films <list>, vehicles <list>, starships <list>,
## # height_m <dbl>
只保留新建的列,其他列不要了:
starwars %>%
transmute(
height_m = height / 100,
BMI = mass / (height_m^2)
)
## # A tibble: 87 x 2
## height_m BMI
## <dbl> <dbl>
## 1 1.72 26.0
## 2 1.67 26.9
## 3 0.96 34.7
## 4 2.02 33.3
## 5 1.5 21.8
## 6 1.78 37.9
## 7 1.65 27.5
## 8 0.97 34.0
## 9 1.83 25.1
## 10 1.82 23.2
## # ... with 77 more rows
relocate()重排列的位置
主要是使用.before
和.after
参数,控制位置:
starwars %>% relocate(sex:homeworld, .before = height)
## # A tibble: 87 x 14
## name sex gender homeworld height mass hair_color skin_color eye_color
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <chr>
## 1 Luke Sk~ male mascu~ Tatooine 172 77 blond fair blue
## 2 C-3PO none mascu~ Tatooine 167 75 <NA> gold yellow
## 3 R2-D2 none mascu~ Naboo 96 32 <NA> white, bl~ red
## 4 Darth V~ male mascu~ Tatooine 202 136 none white yellow
## 5 Leia Or~ female femin~ Alderaan 150 49 brown light brown
## 6 Owen La~ male mascu~ Tatooine 178 120 brown, gr~ light blue
## 7 Beru Wh~ female femin~ Tatooine 165 75 brown light blue
## 8 R5-D4 none mascu~ Tatooine 97 32 <NA> white, red red
## 9 Biggs D~ male mascu~ Tatooine 183 84 black light brown
## 10 Obi-Wan~ male mascu~ Stewjon 182 77 auburn, w~ fair blue-gray
## # ... with 77 more rows, and 5 more variables: birth_year <dbl>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
summarise()汇总
一般来说和group_by()
连用才能发挥威力。
starwars %>% summarise(height = mean(height, na.rm = T))
## # A tibble: 1 x 1
## height
## <dbl>
## 1 174.
今天主要是对dplyr
有一个大致的认识,熟悉下最常见的操作,后面会根据不同的应用场景继续介绍更多的内容。
以上就是今天的内容,希望对你有帮助哦!欢迎点赞、在看、关注、转发!
欢迎在评论区留言或直接添加我的微信!
End
欢迎关注公众号:医学和生信笔记
“医学和生信笔记 公众号主要分享:1.医学小知识、肛肠科小知识;2.R语言和Python相关的数据分析、可视化、机器学习等;3.生物信息学学习资料和自己的学习笔记!
未经允许不得转载:木盒主机 » 使用dplyr进行数据分析:入门篇