Shixiang Wang

>上士闻道
勤而行之

R-数据操作

王诗翔 · 2018-08-25

分类: r  
标签: r   dplyr   sqldf   data.table   rlist  

本文内容:

数据框的本质是一个由向量构成的列表,由于列长度相同,所以可以当做矩阵进行访问和操作。比如选择满足特定条件的行,使用[]符号,第一个参数提供一个逻辑向量,第二个参数留空。

本文大部分的代码都是基于一组产品的虚拟数据。我们先将数据载入,然后学习怎么用不同的方法操作数据。

if(!require(readr)) install.packages("readr")
#> Loading required package: readr
product_info = read_csv("../../../static/datasets/product-info.csv")
#> Parsed with column specification:
#> cols(
#>   id = col_character(),
#>   name = col_character(),
#>   type = col_character(),
#>   class = col_character(),
#>   released = col_character()
#> )
product_info
#> # A tibble: 6 x 5
#>   id    name      type  class   released
#>   <chr> <chr>     <chr> <chr>   <chr>   
#> 1 T01   SupCar    toy   vehicle yes     
#> 2 T02   SupPlane  toy   vehicle no      
#> 3 M01   JeepX     model vehicle yes     
#> 4 M02   AircraftX model vehicle yes     
#> 5 M03   Runner    model people  yes     
#> 6 M04   Dancer    model people  no

当数据以数据框的形式载入内存后,我们可以使用下面的代码查看每一列的类型:

sapply(product_info, class)
#>          id        name        type       class    released 
#> "character" "character" "character" "character" "character"

注意read_csv函数载入的数据框与内置函数read.csv函数是不同的,主要体现在不会将字符串转换为因子变量,当然前者的速度要快得多。

接下来我们正式学习用R内置的函数操作数据框进行分析和统计的一些方法。

内置函数操作数据框

选取typetoy的行:

product_info[product_info$type == "toy", ]
#> # A tibble: 2 x 5
#>   id    name     type  class   released
#>   <chr> <chr>    <chr> <chr>   <chr>   
#> 1 T01   SupCar   toy   vehicle yes     
#> 2 T02   SupPlane toy   vehicle no

或选取releasedno的行:

product_info[product_info$released == "no", ]
#> # A tibble: 2 x 5
#>   id    name     type  class   released
#>   <chr> <chr>    <chr> <chr>   <chr>   
#> 1 T02   SupPlane toy   vehicle no      
#> 2 M04   Dancer   model people  no

对列进行筛选需要将第1个参数留空,给第2个参数提供字符向量。

product_info[, c("id", "name", "type")]
#> # A tibble: 6 x 3
#>   id    name      type 
#>   <chr> <chr>     <chr>
#> 1 T01   SupCar    toy  
#> 2 T02   SupPlane  toy  
#> 3 M01   JeepX     model
#> 4 M02   AircraftX model
#> 5 M03   Runner    model
#> 6 M04   Dancer    model

行列筛选也是可以的,我们只要组合前面的两种情况即可。

product_info[product_info$type == "toy", c("name", "class", "released")]
#> # A tibble: 2 x 3
#>   name     class   released
#>   <chr>    <chr>   <chr>   
#> 1 SupCar   vehicle yes     
#> 2 SupPlane vehicle no

内置函数subset()可以简化取子集操作的过程:

subset(product_info,
       subset = type == "model" & released == "yes",
       select = name:class)
#> # A tibble: 3 x 3
#>   name      type  class  
#>   <chr>     <chr> <chr>  
#> 1 JeepX     model vehicle
#> 2 AircraftX model vehicle
#> 3 Runner    model people

使用with()函数在数据框的语义中计算表达式,即可以直接使用数据框的列名,而不必重复指定数据框:

with(product_info, name[released == "no"])
#> [1] "SupPlane" "Dancer"

除了构建子集,表达式还可以用来统计每列各个可能值出现的频数。

with(product_info, table(type[released == "yes"]))
#> 
#> model   toy 
#>     3     1

除了产品信息表,还有一张产品属性的统计表:

product_stats = read_csv("../../../static/datasets/product-stats.csv")
#> Parsed with column specification:
#> cols(
#>   id = col_character(),
#>   material = col_character(),
#>   size = col_double(),
#>   weight = col_double()
#> )
product_stats
#> # A tibble: 6 x 4
#>   id    material  size weight
#>   <chr> <chr>    <dbl>  <dbl>
#> 1 T01   Metal      120   10  
#> 2 T02   Metal      350   45  
#> 3 M01   Plastics    50   NA  
#> 4 M02   Plastics    85    3  
#> 5 M03   Wood        15   NA  
#> 6 M04   Wood        16    0.6

如果现在要获取尺寸最大的前3个产品的名字该怎么办?

一种方法是将product_stats按尺寸降序排列,选择前3个记录的id,然后用id值筛选product_info的行:

top3_id = unlist(product_stats[order(product_stats$size, decreasing = TRUE), "id"])[1:3]
product_info[product_info$id %in% top3_id, ]
#> # A tibble: 3 x 5
#>   id    name      type  class   released
#>   <chr> <chr>     <chr> <chr>   <chr>   
#> 1 T01   SupCar    toy   vehicle yes     
#> 2 T02   SupPlane  toy   vehicle no      
#> 3 M02   AircraftX model vehicle yes

我们用比较冗长的方式完成了任务。但仔细在想想,两个数据框是通过id连接到一起的,我们可以把它们合并到一起,然后执行提取操作:

product_table = merge(product_info, product_stats, by = "id")
product_table
#>    id      name  type   class released material size weight
#> 1 M01     JeepX model vehicle      yes Plastics   50     NA
#> 2 M02 AircraftX model vehicle      yes Plastics   85    3.0
#> 3 M03    Runner model  people      yes     Wood   15     NA
#> 4 M04    Dancer model  people       no     Wood   16    0.6
#> 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0
#> 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0

现在通过合并的数据框,我们可以根据任意一列排序数据框,而不需要处理其他的表格数据:

product_table[order(product_table$size), ]
#>    id      name  type   class released material size weight
#> 3 M03    Runner model  people      yes     Wood   15     NA
#> 4 M04    Dancer model  people       no     Wood   16    0.6
#> 1 M01     JeepX model vehicle      yes Plastics   50     NA
#> 2 M02 AircraftX model vehicle      yes Plastics   85    3.0
#> 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0
#> 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0

前面的问题我们也可以利用合并的数据框加以解决:

product_table[order(product_table$size, decreasing = TRUE), "name"][1:3]
#> [1] "SupPlane"  "SupCar"    "AircraftX"

有时候我们需要生成新数据框来对原始数据基础上进行调整和处理,从而避免破坏原始数据。transform()函数可以帮助我们完成这类任务,例如:

transform(product_table,
          released = ifelse(released == "yes", TRUE, FALSE),
          density = weight / size)
#>    id      name  type   class released material size weight density
#> 1 M01     JeepX model vehicle     TRUE Plastics   50     NA      NA
#> 2 M02 AircraftX model vehicle     TRUE Plastics   85    3.0  0.0353
#> 3 M03    Runner model  people     TRUE     Wood   15     NA      NA
#> 4 M04    Dancer model  people    FALSE     Wood   16    0.6  0.0375
#> 5 T01    SupCar   toy vehicle     TRUE    Metal  120   10.0  0.0833
#> 6 T02  SupPlane   toy vehicle    FALSE    Metal  350   45.0  0.1286

前面数据中我们看到有一些缺失值(用NA表示),很多时候我们不希望数据出现任何缺失值,因此需要某种办法处理它们。为了演示处理的方法,我们再载入一张包含缺失值的表,包含每件产品的质量、耐久性、防水性的测试结果。

product_tests = read_csv("../../../static/datasets/product-tests.csv")
#> Parsed with column specification:
#> cols(
#>   id = col_character(),
#>   quality = col_double(),
#>   durability = col_double(),
#>   waterproof = col_character()
#> )
product_tests
#> # A tibble: 6 x 4
#>   id    quality durability waterproof
#>   <chr>   <dbl>      <dbl> <chr>     
#> 1 T01        NA         10 no        
#> 2 T02        10          9 no        
#> 3 M01         6          4 yes       
#> 4 M02         6          5 yes       
#> 5 M03         5         NA yes       
#> 6 M04         6          6 yes

na.omit()可以删除所有包含缺失值的行:

na.omit(product_tests)
#> # A tibble: 4 x 4
#>   id    quality durability waterproof
#>   <chr>   <dbl>      <dbl> <chr>     
#> 1 T02        10          9 no        
#> 2 M01         6          4 yes       
#> 3 M02         6          5 yes       
#> 4 M04         6          6 yes

另外,函数complete.cases()可以返回逻辑向量,表明某行是否完整。

complete.cases(product_tests)
#> [1] FALSE  TRUE  TRUE  TRUE FALSE  TRUE

利用该函数可以筛选数据框,比如获得不含缺失值的id值:

product_tests[complete.cases(product_tests), "id"]
#> # A tibble: 4 x 1
#>   id   
#>   <chr>
#> 1 T02  
#> 2 M01  
#> 3 M02  
#> 4 M04

前面给出的3个表格有共同的id列,可惜R里面内置函数只能一次合并2个数据框:

product_full = merge(product_table, product_tests, by = "id")
product_full
#>    id      name  type   class released material size weight quality durability
#> 1 M01     JeepX model vehicle      yes Plastics   50     NA       6          4
#> 2 M02 AircraftX model vehicle      yes Plastics   85    3.0       6          5
#> 3 M03    Runner model  people      yes     Wood   15     NA       5         NA
#> 4 M04    Dancer model  people       no     Wood   16    0.6       6          6
#> 5 T01    SupCar   toy vehicle      yes    Metal  120   10.0      NA         10
#> 6 T02  SupPlane   toy vehicle       no    Metal  350   45.0      10          9
#>   waterproof
#> 1        yes
#> 2        yes
#> 3        yes
#> 4        yes
#> 5         no
#> 6         no

对完全合并好的表格,我们利用tapply()函数(apply家族成员)可以进行统计,该函数专门用于处理表格数据,使用某些方法根据某列队另一列的数据进行统计。

例如根据type列计算quality列的均值:

mean_quality1 = tapply(product_full$quality,
                       list(product_full$type),
                       mean, na.rm=TRUE)
mean_quality1
#> model   toy 
#>  5.75 10.00

返回的结果看起来是个数值向量,我们使用str()看看:

str(mean_quality1)
#>  num [1:2(1d)] 5.75 10
#>  - attr(*, "dimnames")=List of 1
#>   ..$ : chr [1:2] "model" "toy"

实际上,这是个一维数组

is.array(mean_quality1)
#> [1] TRUE

tapply()返回的是一个数组,而不是简单的数值向量,因此可以方便地计算多组操作。

例如计算每一对typeclass组合的quality均值:

mean_quality2 = tapply(product_full$quality,
                       list(product_full$type, product_full$class),
                       mean, na.rm = TRUE)
mean_quality2
#>       people vehicle
#> model    5.5       6
#> toy       NA      10

对于二维数组,我们可以使用两个参数来获取其中的值:

typeof(mean_quality2)
#> [1] "double"
class(mean_quality2)
#> [1] "matrix" "array"
mean_quality2["model", "vehicle"]
#> [1] 6

同理我们可以根据多列分组,使用with()可以避免反复输入product_full

mean_quality3 = with(product_full,
                     tapply(quality, list(type, material, released),
                            mean, na.rm = TRUE))
mean_quality3
#> , , no
#> 
#>       Metal Plastics Wood
#> model    NA       NA    6
#> toy      10       NA   NA
#> 
#> , , yes
#> 
#>       Metal Plastics Wood
#> model    NA        6    5
#> toy     NaN       NA   NA

使用3个参数可以获取单元格中的值:

mean_quality3["model", "Wood", "yes"]
#> [1] 5

reshape2重塑数据框

前面我们学习了如何筛选、排序、合并和汇总数据框,有时候我们需要做些更复杂的操作。

例如下面数据包含两种产品不同日期的质量和耐久性的测试结果:

toy_tests = read_csv("../../../static/datasets/product-toy-tests.csv")
#> Parsed with column specification:
#> cols(
#>   id = col_character(),
#>   date = col_double(),
#>   sample = col_double(),
#>   quality = col_double(),
#>   durability = col_double()
#> )
toy_tests
#> # A tibble: 8 x 5
#>   id        date sample quality durability
#>   <chr>    <dbl>  <dbl>   <dbl>      <dbl>
#> 1 T01   20160201    100       9          9
#> 2 T01   20160302    150      10          9
#> 3 T01   20160405    180       9         10
#> 4 T01   20160502    140       9          9
#> 5 T02   20160201     70       7          9
#> 6 T02   20160303     75       8          8
#> 7 T02   20160403     90       9          8
#> 8 T02   20160502     85      10          9

如果需要同时比较两种产品的质量和耐久性,这种格式就比较麻烦,如果是下面的格式就好了:

date    T01 T02
20160201    9   9
2016    10  9

reshape2包就是用来搞定这种任务的,如果没有安装,运行下面代码:

install.packages("reshape2")

安装成功后,我们就可以使用dcast()来转换数据,用于比较:

library(reshape2)
toy_quality = dcast(toy_tests, date ~ id, value.var = "quality")
toy_quality
#>       date T01 T02
#> 1 20160201   9   7
#> 2 20160302  10  NA
#> 3 20160303  NA   8
#> 4 20160403  NA   9
#> 5 20160405   9  NA
#> 6 20160502   9  10

上述代码重塑了toy_testsdate列被共享,id值被单独分割为列,每个dateid对应的值是quality

可以看到数据中存在缺失值,有一种叫末次观测值结转法(LOCF)可以填补缺失值,当非缺失值后面紧跟一个缺失值时,就用该缺失值填补后面的缺失值,直到所有缺失值都被填满。zoo包提供了LOCF的一个实现,使用下面代码安装:

install.packages("zoo")

下面用一组简单的向量演示:

library(zoo)
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric
na.locf(c(1, 2, NA, NA, 3, 1, NA, 2, NA))
#> [1] 1 2 2 2 3 1 1 2 2

同样的方法我们可以应用于现在处理的数据:

na.locf(toy_quality$T01)
#> [1]  9 10 10 10  9  9

如果需要填补的数据很多,包含上千个产品,更好的做法是使用lapply进行自动分配:

toy_quality[-1] = lapply(toy_quality[-1], na.locf )
toy_quality
#>       date T01 T02
#> 1 20160201   9   7
#> 2 20160302  10   7
#> 3 20160303  10   8
#> 4 20160403  10   9
#> 5 20160405   9   9
#> 6 20160502   9  10

这里数据虽然已经没有了缺失值,但每一行数据的含义却发生了变化。原始数据中产品T01在20160303这天并没有测试,所以这一天的值应该被解释为在此之前的最后一次quality的测试值。另一个问题是两种产品都是按月测试的,但重塑后的数据框没有以固定的频率对其date。

下面方法进问题进行修正。

toy_tests$ym = substr(toy_tests$date, 1, 6)
toy_tests
#> # A tibble: 8 x 6
#>   id        date sample quality durability ym    
#>   <chr>    <dbl>  <dbl>   <dbl>      <dbl> <chr> 
#> 1 T01   20160201    100       9          9 201602
#> 2 T01   20160302    150      10          9 201603
#> 3 T01   20160405    180       9         10 201604
#> 4 T01   20160502    140       9          9 201605
#> 5 T02   20160201     70       7          9 201602
#> 6 T02   20160303     75       8          8 201603
#> 7 T02   20160403     90       9          8 201604
#> 8 T02   20160502     85      10          9 201605

我们只提取年月信息,然后利用它进行重塑。

toy_quality = dcast(toy_tests, ym ~ id, value.var = "quality")
toy_quality
#>       ym T01 T02
#> 1 201602   9   7
#> 2 201603  10   8
#> 3 201604   9   9
#> 4 201605   9  10

现在,两种产品每月的质量得分自然地展示出来,而且每月缺失值。

有时候,我们需要将许多列合并为1列,用于表示被测量的对象,另外用1列存储对应的结果值。下面用melt()函数将原始数据两种测量组合到一起:

toy_tests2 = melt(toy_tests, id.vars = c("id", "ym"), 
                  measure.vars = c("quality", "durability"),
                  variable.name = "measure")
toy_tests2
#>     id     ym    measure value
#> 1  T01 201602    quality     9
#> 2  T01 201603    quality    10
#> 3  T01 201604    quality     9
#> 4  T01 201605    quality     9
#> 5  T02 201602    quality     7
#> 6  T02 201603    quality     8
#> 7  T02 201604    quality     9
#> 8  T02 201605    quality    10
#> 9  T01 201602 durability     9
#> 10 T01 201603 durability     9
#> 11 T01 201604 durability    10
#> 12 T01 201605 durability     9
#> 13 T02 201602 durability     9
#> 14 T02 201603 durability     8
#> 15 T02 201604 durability     8
#> 16 T02 201605 durability     9

这种格式正是ggplot2所喜爱的长格式数据,我们可以来画图:

library(ggplot2)
ggplot(toy_tests2, aes(x = ym, y = value)) + 
    geom_point() + 
    facet_grid(id ~ measure)

我们得到了按照产品id和measure分组,以ym为x轴,以value为y轴的散点图,可以清晰对比分组后两种产品质量差异(以年月)。

我们还可以用不同的颜色来表示产品,下图可以给出与上图相同的信息:

ggplot(toy_tests2, aes(x = ym, y = value, color = id)) + 
    geom_point() + facet_grid(. ~ measure)

通过sqldf包使用SQL查询数据框

有没有一种方法,能够直接使用SQL进行数据框查询,就像数据框是关系型数据库中的表一样呢?sqldf包给出肯定答案。该包吸收了SQLite轻量结构和易于嵌入R会话的优点,可以用下面代码安装:

install.packages("sqldf")

首先加载包:

library(sqldf)
#> Loading required package: gsubfn
#> Loading required package: proto
#> Loading required package: RSQLite

注意加载sqldf包时,几个依赖包会自动加载进来。sql包的实现依赖这些包,它基本上是在R和SQLite之间传输数据和转换数据类型

读入前面使用的产品表格:

product_info = read_csv("../../../static/datasets/product-info.csv")
#> Parsed with column specification:
#> cols(
#>   id = col_character(),
#>   name = col_character(),
#>   type = col_character(),
#>   class = col_character(),
#>   released = col_character()
#> )
product_stats = read_csv("../../../static/datasets/product-stats.csv")
#> Parsed with column specification:
#> cols(
#>   id = col_character(),
#>   material = col_character(),
#>   size = col_double(),
#>   weight = col_double()
#> )
product_tests = read_csv("../../../static/datasets/product-tests.csv")
#> Parsed with column specification:
#> cols(
#>   id = col_character(),
#>   quality = col_double(),
#>   durability = col_double(),
#>   waterproof = col_character()
#> )
toy_tests = read_csv("../../../static/datasets/product-toy-tests.csv")
#> Parsed with column specification:
#> cols(
#>   id = col_character(),
#>   date = col_double(),
#>   sample = col_double(),
#>   quality = col_double(),
#>   durability = col_double()
#> )

sqldf包的神奇之处在于我们可以使用SQL语句查询工作环境中的数据框,例如:

sqldf("select * from product_info")
#>    id      name  type   class released
#> 1 T01    SupCar   toy vehicle      yes
#> 2 T02  SupPlane   toy vehicle       no
#> 3 M01     JeepX model vehicle      yes
#> 4 M02 AircraftX model vehicle      yes
#> 5 M03    Runner model  people      yes
#> 6 M04    Dancer model  people       no

sqldf与SQLite一样,支持简单的选择性请求。

比如选择特定列:

sqldf("select id, name, class from product_info")
#>    id      name   class
#> 1 T01    SupCar vehicle
#> 2 T02  SupPlane vehicle
#> 3 M01     JeepX vehicle
#> 4 M02 AircraftX vehicle
#> 5 M03    Runner  people
#> 6 M04    Dancer  people

根据条件筛选记录:

sqldf("select id, name from product_info where released = 'yes' ")
#>    id      name
#> 1 T01    SupCar
#> 2 M01     JeepX
#> 3 M02 AircraftX
#> 4 M03    Runner

除了基本的数据库操作和分组统计,该包还支持查询多个数据框,比如:

sqldf("select * from product_info join product_stats using (id)")
#>    id      name  type   class released material size weight
#> 1 T01    SupCar   toy vehicle      yes    Metal  120   10.0
#> 2 T02  SupPlane   toy vehicle       no    Metal  350   45.0
#> 3 M01     JeepX model vehicle      yes Plastics   50     NA
#> 4 M02 AircraftX model vehicle      yes Plastics   85    3.0
#> 5 M03    Runner model  people      yes     Wood   15     NA
#> 6 M04    Dancer model  people       no     Wood   16    0.6

不过sqldf包的缺点也很明显:

  1. sqldf默认基于SQLite,因此SQLite的局限性就是该包的局限性,比如内置的分组汇总函数是有限的,而R本身的统计汇总函数要多得多
  2. 不方便动态编程
  3. SQL的限制性也限制了该包,我们难以像操作dplyr包一样用sqldf进行表格数据的操作、变换等等

如果你喜欢这个包并想用起来,阅读sqldf更多操作例子:https://github.com/ggrothendieck/sqldf#examples ## 使用data.table包操作数据

data.table包提供了一个加强版的data.frame,它运行效率极高,而且能够处理适合内存的大数据集,它使用[]实现了一种自然地数据操作语法。使用下面命令进行安装:

install.packages("data.table")

载入包:

library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:reshape2':
#> 
#>     dcast, melt

注意,data.table包提供了加强版的dcast()melt(),它们的功能更强大、性能更高,内存使用也更高效。

创建data.table与创建data.frame类似:

dt = data.table(x = 1:3, y = rnorm(3), z = letters[1:3])
dt
#>    x      y z
#> 1: 1  0.265 a
#> 2: 2 -0.955 b
#> 3: 3  1.109 c

检查它的结构:

str(dt)
#> Classes 'data.table' and 'data.frame':   3 obs. of  3 variables:
#>  $ x: int  1 2 3
#>  $ y: num  0.265 -0.955 1.109
#>  $ z: chr  "a" "b" "c"
#>  - attr(*, ".internal.selfref")=<externalptr>

可以看到,dt的类是data.tabledata.frame,也就是说data.table继承了data.frame的一些行为,但增强了其他部分。

**data.table的基本语法是dt[i, j, by],简单说就是使用i选择行,用by分组,然后计算j**。接下来我们看看data.table`继承了什么,增强了什么。

首先,我们仍然载入之前用到的产品数据,不过这里我们使用data.table包提供的fread()函数,它非常高效和智能,默认返回data.table

product_info = fread("../../../static/datasets/product-info.csv")
product_stats = fread("../../../static/datasets/product-stats.csv")
product_tests = fread("../../../static/datasets/product-tests.csv")
toy_tests = fread("../../../static/datasets/product-toy-tests.csv")

如果查看表格信息,你会发现它和data.frame没什么两样:

product_info
#>     id      name  type   class released
#> 1: T01    SupCar   toy vehicle      yes
#> 2: T02  SupPlane   toy vehicle       no
#> 3: M01     JeepX model vehicle      yes
#> 4: M02 AircraftX model vehicle      yes
#> 5: M03    Runner model  people      yes
#> 6: M04    Dancer model  people       no

再看结构:

str(product_info)
#> Classes 'data.table' and 'data.frame':   6 obs. of  5 variables:
#>  $ id      : chr  "T01" "T02" "M01" "M02" ...
#>  $ name    : chr  "SupCar" "SupPlane" "JeepX" "AircraftX" ...
#>  $ type    : chr  "toy" "toy" "model" "model" ...
#>  $ class   : chr  "vehicle" "vehicle" "vehicle" "vehicle" ...
#>  $ released: chr  "yes" "no" "yes" "yes" ...
#>  - attr(*, ".internal.selfref")=<externalptr>

data.frame不同,如果只提供一个参数用来构建子集,data.table是选择行而不是列:

product_info[1]
#>     id   name type   class released
#> 1: T01 SupCar  toy vehicle      yes
product_info[1:3]
#>     id     name  type   class released
#> 1: T01   SupCar   toy vehicle      yes
#> 2: T02 SupPlane   toy vehicle       no
#> 3: M01    JeepX model vehicle      yes

如果提供的是负数,那么将删除指定的行:

product_info[-1]
#>     id      name  type   class released
#> 1: T02  SupPlane   toy vehicle       no
#> 2: M01     JeepX model vehicle      yes
#> 3: M02 AircraftX model vehicle      yes
#> 4: M03    Runner model  people      yes
#> 5: M04    Dancer model  people       no

data.table提供了许多特殊符号,它们是data.table的重要组成.N是最常用的符号之一,它表示当前分组中,对象的数目(就不用调用nrow函数啦)。在[]使用它指提取最后一行。

product_info[.N]
#>     id   name  type  class released
#> 1: M04 Dancer model people       no
product_info[c(1, .N)]
#>     id   name  type   class released
#> 1: T01 SupCar   toy vehicle      yes
#> 2: M04 Dancer model  people       no

在对data.table构建子集时,能够自动根据语义计算表达式,因此可以直接使用列名,像with()subset()那样。

比如:

product_info[released == "yes"]
#>     id      name  type   class released
#> 1: T01    SupCar   toy vehicle      yes
#> 2: M01     JeepX model vehicle      yes
#> 3: M02 AircraftX model vehicle      yes
#> 4: M03    Runner model  people      yes

方括号内的第1个参数是行筛选器,第2个则对筛选后的数据进行适当的计算。

例如提取列:

product_info[released == "yes", id]
#> [1] "T01" "M01" "M02" "M03"

在这里使用"id"结果不同,返回的必然是个data.table。

product_info[released == "yes", "id"]
#>     id
#> 1: T01
#> 2: M01
#> 3: M02
#> 4: M03

第二个参数可以是表达式,例如生成一张表,反应每种typeclass组合中releasedyes的数量:

product_info[released == "yes", table(type, class)]
#>        class
#> type    people vehicle
#>   model      1       2
#>   toy        0       1

要注意,给第2个参数提供list(),结果仍然转换为data.table

product_info[released == "yes", list(id, name)]
#>     id      name
#> 1: T01    SupCar
#> 2: M01     JeepX
#> 3: M02 AircraftX
#> 4: M03    Runner

我们可以替换原有列,生成新的data.table:

product_info[, list(id, name, released = released == "yes")]
#>     id      name released
#> 1: T01    SupCar     TRUE
#> 2: T02  SupPlane    FALSE
#> 3: M01     JeepX     TRUE
#> 4: M02 AircraftX     TRUE
#> 5: M03    Runner     TRUE
#> 6: M04    Dancer    FALSE

还可以创建新列:

product_stats[, list(id, material, size, weight, density = size/weight)]
#>     id material size weight density
#> 1: T01    Metal  120   10.0   12.00
#> 2: T02    Metal  350   45.0    7.78
#> 3: M01 Plastics   50     NA      NA
#> 4: M02 Plastics   85    3.0   28.33
#> 5: M03     Wood   15     NA      NA
#> 6: M04     Wood   16    0.6   26.67

为了简化,data.table使用.()作为list()的缩写,这两者等价

product_info[, .(id, name, type, class)]
#>     id      name  type   class
#> 1: T01    SupCar   toy vehicle
#> 2: T02  SupPlane   toy vehicle
#> 3: M01     JeepX model vehicle
#> 4: M02 AircraftX model vehicle
#> 5: M03    Runner model  people
#> 6: M04    Dancer model  people
product_info[released == "yes", .(id, name)]
#>     id      name
#> 1: T01    SupCar
#> 2: M01     JeepX
#> 3: M02 AircraftX
#> 4: M03    Runner

提供排序索引可以对记录排序:

product_stats[order(size, decreasing = TRUE)]
#>     id material size weight
#> 1: T02    Metal  350   45.0
#> 2: T01    Metal  120   10.0
#> 3: M02 Plastics   85    3.0
#> 4: M01 Plastics   50     NA
#> 5: M04     Wood   16    0.6
#> 6: M03     Wood   15     NA

前面都是在构建子集后,又创建新的data.table。这样挺麻烦的,因此data.table包提供了对列进行原地赋值的符号:=,例如product_stats开始是这样的:

product_stats
#>     id material size weight
#> 1: T01    Metal  120   10.0
#> 2: T02    Metal  350   45.0
#> 3: M01 Plastics   50     NA
#> 4: M02 Plastics   85    3.0
#> 5: M03     Wood   15     NA
#> 6: M04     Wood   16    0.6

使用:=直接在上面数据框创建新列:

product_stats[, density := size / weight]

虽然没有任何返回,但数据已经被修改了:

product_stats
#>     id material size weight density
#> 1: T01    Metal  120   10.0   12.00
#> 2: T02    Metal  350   45.0    7.78
#> 3: M01 Plastics   50     NA      NA
#> 4: M02 Plastics   85    3.0   28.33
#> 5: M03     Wood   15     NA      NA
#> 6: M04     Wood   16    0.6   26.67

使用:=替换已有的列:

product_info[, released := released == "yes"]
product_info
#>     id      name  type   class released
#> 1: T01    SupCar   toy vehicle     TRUE
#> 2: T02  SupPlane   toy vehicle    FALSE
#> 3: M01     JeepX model vehicle     TRUE
#> 4: M02 AircraftX model vehicle     TRUE
#> 5: M03    Runner model  people     TRUE
#> 6: M04    Dancer model  people    FALSE

使用键获取值

索引支持是data.table另一个独特功能,即我们可以创建键(key),使用键获取记录及其高效。

例如,使用setkey()id设置为product_info中的一个键:

setkey(product_info, id)

同样的,函数无任何返回,但我们已经为原始数据设置了键,而且原来的数据看起来也没变化:

product_info
#>     id      name  type   class released
#> 1: M01     JeepX model vehicle     TRUE
#> 2: M02 AircraftX model vehicle     TRUE
#> 3: M03    Runner model  people     TRUE
#> 4: M04    Dancer model  people    FALSE
#> 5: T01    SupCar   toy vehicle     TRUE
#> 6: T02  SupPlane   toy vehicle    FALSE

但键已生成:

key(product_info)
#> [1] "id"

现在我们可以用它来获取数据了,比如提供一个id值:

product_info["M01"]
#>     id  name  type   class released
#> 1: M01 JeepX model vehicle     TRUE

也可以使用setkeyv()来设置键,但它只接受字符向量:

setkeyv(product_stats, "id")

key是一个动态变化的向量时,这个函数会非常好用

product_stats["M02"]
#>     id material size weight density
#> 1: M02 Plastics   85      3    28.3

如果两个表格有相同的键,我们可以轻松把他们连接到一起:

product_info[product_stats]
#>     id      name  type   class released material size weight density
#> 1: M01     JeepX model vehicle     TRUE Plastics   50     NA      NA
#> 2: M02 AircraftX model vehicle     TRUE Plastics   85    3.0   28.33
#> 3: M03    Runner model  people     TRUE     Wood   15     NA      NA
#> 4: M04    Dancer model  people    FALSE     Wood   16    0.6   26.67
#> 5: T01    SupCar   toy vehicle     TRUE    Metal  120   10.0   12.00
#> 6: T02  SupPlane   toy vehicle    FALSE    Metal  350   45.0    7.78

data.table的键可以不止一个。例如使用iddate定位toy_tests中的记录:

setkey(toy_tests, id, date)

现在提供key中的两个元素就可以获取记录了

toy_tests[.("T01", 20160201)]
#>     id     date sample quality durability
#> 1: T01 20160201    100       9          9

如果提供第一个元素,会返回匹配的多个值:

toy_tests["T01"]
#>     id     date sample quality durability
#> 1: T01 20160201    100       9          9
#> 2: T01 20160302    150      10          9
#> 3: T01 20160405    180       9         10
#> 4: T01 20160502    140       9          9

key不能错序,因此不能单独提供第2个元素以及反序排列。

toy_tests[20160201]
#>      id date sample quality durability
#> 1: <NA>   NA     NA      NA         NA
toy_tests[.(20160202,"T01")]
#> Error in bmerge(i, x, leftcols, rightcols, roll, rollends, nomatch, mult, : Incompatible join types: x.id (character) and i.V1 (double)

对数据进行分组汇总

by是data.table中另一个重要参数(即方括号内的第3个参数),它可以将数据按照by值进行分组,并对分组计算第2个参数。

接下来,我们学习如何通过by以简便的方式实现数据的分组汇总。

最简单的用法是计算每组的记录条数:

product_info[, .N, by = released]
#>    released N
#> 1:     TRUE 4
#> 2:    FALSE 2

分组的变量可以不止一个,例如由typeclass确定一个分组:

product_info[, .N, by = .(type, class)]
#>     type   class N
#> 1: model vehicle 2
#> 2: model  people 2
#> 3:   toy vehicle 2

可以对每个分组进行统计计算,这里计算防水和非防水产品的质量得分均值:

product_tests[, mean(quality, na.rm = TRUE), by = .(waterproof)]
#>    waterproof    V1
#> 1:         no 10.00
#> 2:        yes  5.75

可以看到结果存储在V1列中,我们可以手动指定列名:

product_tests[, .(mean_quality = mean(quality, na.rm = TRUE)), by = .(waterproof)]
#>    waterproof mean_quality
#> 1:         no        10.00
#> 2:        yes         5.75

注意操作需要�放在list中进行(.())。

我们可以将多个[]按顺序连接起来,形成工作流(类似管道%>%)。

下面的例子中,首先使用通用键id将product_info和product_tests连接起来,然后筛选已发布的产品,再按type和class进行分组,最后计算每组的quality和durability的均值。

type_class_test0 = product_info[product_tests][released == TRUE,
                                               .(mean_quality = mean(quality, na.rm=TRUE),
                                                 mean_durability = mean(durability, na.rm=TRUE)),
                                               by = .(type, class)]
type_class_test0
#>     type   class mean_quality mean_durability
#> 1:   toy vehicle          NaN            10.0
#> 2: model vehicle            6             4.5
#> 3: model  people            5             NaN

在返回的data.table中,by所对应的组合中的值是唯一的,虽然实现了目标,但结果中没有设置键:

key(type_class_test0)
#> NULL

这种情况下,我们可以使用keyby来确保结果的data.table自动将keyby对应的分组向量设置为键。一般data.table会保持原来的顺序返回,有时候我们想要设定排序,keyby也可以实现,所以是一举两得:

type_class_test = product_info[product_tests][released == TRUE, 
                                              .(mean_quality = mean(quality, na.rm = TRUE),
                                                mean_durability = mean(durability, na.rm = TRUE)),
                                              keyby = .(type, class)]
type_class_test
#>     type   class mean_quality mean_durability
#> 1: model  people            5             NaN
#> 2: model vehicle            6             4.5
#> 3:   toy vehicle          NaN            10.0
key(type_class_test)
#> [1] "type"  "class"

下面可以直接用键来获取值:

type_class_test[.("model", "vehicle"), mean_quality]
#> [1] 6

对大数据集使用键进行搜索,能够比迭代使用逻辑比较快得多,因为键搜索利用了二进制搜索,而迭代在不必要的计算上浪费了时间

下面举例说明,首先创建有1000万行的数据,其中一列是索引列id,其他两列是随机数:

n = 10000000
test1 = data.frame(id = 1:n, x = rnorm(n), y = rnorm(n))

现在查找id为876543的行,看要花多少时间:

system.time(row <- test1[test1$id == 876543, ])
#>    user  system elapsed 
#>   0.146   0.021   0.180

作为对比,我们使用data.table来完成这个任务,使用setDT()将数据框转换为data.table,该函数可以原地转换,不需要复制,并可以设定键。

setDT(test1, key = "id")
class(test1)
#> [1] "data.table" "data.frame"

现在我们搜索相同的元素:

system.time(row <- test1[.(876543)])
#>    user  system elapsed 
#>   0.001   0.000   0.001

结果一致,但data.table用的时间要少得多。

重塑data.table

data.table扩展包为data.table对象提供了更强更快得dcast()melt()函数。

例如将toy_tests的每个产品质量得分按照年和月进行对齐

toy_tests[, ym := substr(date, 1, 6)]
toy_quality = dcast(toy_tests, ym ~ id, value.var = "quality")
toy_quality
#>        ym T01 T02
#> 1: 201602   9   7
#> 2: 201603  10   8
#> 3: 201604   9   9
#> 4: 201605   9  10

data.table::dcast()提供了更强大的多变量支持:

toy_tests2 = data.table::dcast(toy_tests, ym ~ id, value.var = c("quality", "durability"))
toy_tests2
#>        ym quality_T01 quality_T02 durability_T01 durability_T02
#> 1: 201602           9           7              9              9
#> 2: 201603          10           8              9              8
#> 3: 201604           9           9             10              8
#> 4: 201605           9          10              9              9

看到没,data.table可以自动将id值与质量分类连接起来。

此时ym是键:

key(toy_tests2)
#> [1] "ym"

我们可以利用它提取数据:

toy_tests2["201602"]
#>        ym quality_T01 quality_T02 durability_T01 durability_T02
#> 1: 201602           9           7              9              9

使用原地设置函数

我们知道R存在复制修改机制,这在进行大数据计算时开销很大,data.table提供了一系列支持语义的set函数,它们可以原地修改data.table,因此避免不必要的复制。

仍以product_stats为例,我们可以使用setDF()函数不要任何复制就可以将data.table变成data.frame。

product_stats
#>     id material size weight density
#> 1: M01 Plastics   50     NA      NA
#> 2: M02 Plastics   85    3.0   28.33
#> 3: M03     Wood   15     NA      NA
#> 4: M04     Wood   16    0.6   26.67
#> 5: T01    Metal  120   10.0   12.00
#> 6: T02    Metal  350   45.0    7.78
setDF(product_stats)
class(product_stats)
#> [1] "data.frame"

setDT()可以将任意的data.frame转换为data.table,并设置键。

setDT(product_stats, key = "id")
class(product_stats)
#> [1] "data.table" "data.frame"

使用setnames()可以对列重命名:

setnames(product_stats, "size", "volume")
product_stats
#>     id material volume weight density
#> 1: M01 Plastics     50     NA      NA
#> 2: M02 Plastics     85    3.0   28.33
#> 3: M03     Wood     15     NA      NA
#> 4: M04     Wood     16    0.6   26.67
#> 5: T01    Metal    120   10.0   12.00
#> 6: T02    Metal    350   45.0    7.78

如果给行添加索引,使用:

product_stats[, i := .I]
product_stats
#>     id material volume weight density i
#> 1: M01 Plastics     50     NA      NA 1
#> 2: M02 Plastics     85    3.0   28.33 2
#> 3: M03     Wood     15     NA      NA 3
#> 4: M04     Wood     16    0.6   26.67 4
#> 5: T01    Metal    120   10.0   12.00 5
#> 6: T02    Metal    350   45.0    7.78 6

为方便,索引一般在第1列,所以我们要修改列的顺序:

setcolorder(product_stats, c("i", "id", "material", "weight", "volume", "density"))
product_stats
#>    i  id material weight volume density
#> 1: 1 M01 Plastics     NA     50      NA
#> 2: 2 M02 Plastics    3.0     85   28.33
#> 3: 3 M03     Wood     NA     15      NA
#> 4: 4 M04     Wood    0.6     16   26.67
#> 5: 5 T01    Metal   10.0    120   12.00
#> 6: 6 T02    Metal   45.0    350    7.78

data.table的动态作用域

我们不仅可以直接使用列,也可以提前定义注入.N.I.SD来指代数据中的重要部分。

为演示,我们先创建新的data.table,命名为market_data,其中date列是连续的。

market_data = data.table(date = as.Date("2015-05-01") + 0:299)
head(market_data)
#>          date
#> 1: 2015-05-01
#> 2: 2015-05-02
#> 3: 2015-05-03
#> 4: 2015-05-04
#> 5: 2015-05-05
#> 6: 2015-05-06

向调用函数一样,我们给data.table添加数据列:

set.seed(123)
market_data[, `:=`(
    price = round(30 * cumprod(1 + rnorm(300, 0.001, 0.05)), 2),
    volume = rbinom(300, 5000, 0.8)
)]

注意这里的price和volumn都是服从正态分布的随机数:

head(market_data)
#>          date price volume
#> 1: 2015-05-01  29.2   4021
#> 2: 2015-05-02  28.9   4000
#> 3: 2015-05-03  31.2   4033
#> 4: 2015-05-04  31.3   4036
#> 5: 2015-05-05  31.5   3995
#> 6: 2015-05-06  34.3   3955

我们以图形的方式展示数据:

plot(price ~ date, data = market_data,
     type = "l",
     main = "Market data")

数据准备好后,我们看看动态作用域如何让事情变得简单。

看下时间范围:

market_data[, range(date)]
#> [1] "2015-05-01" "2016-02-24"

将数据整合缩减为月度数据:

monthly = market_data[,
                      .(open = price[[1]], high = max(price),
                        low = min(price), close = price[[.N]]),
                      keyby = .(year = year(date), month = month(date))]
head(monthly)
#>    year month open high  low close
#> 1: 2015     5 29.2 37.7 26.1  28.4
#> 2: 2015     6 28.1 37.6 28.1  37.2
#> 3: 2015     7 36.3 41.0 32.1  41.0
#> 4: 2015     8 41.5 50.0 30.9  30.9
#> 5: 2015     9 30.5 34.5 22.9  27.0
#> 6: 2015    10 25.7 33.2 24.6  29.3

计算过程为:先根据by表达式将原始数据分割,分割后的每个部分都是原始数据的一个子集,并且原始数据和子集都是data.table。然后在每个子集data.table的语义中计算j表达式

下面代码没有按组聚合数据,而是画了每年的价格图:

oldpar = par(mfrow = c(1, 2))
market_data[, {
    plot(price ~ date, type = "l",
         main = sprintf("Market data (%d)", year))
}, by = .(year = year(date))]
par(oldpar)

这里我们没有为plot()设定data参数,图像也成功绘制,这是因为该操作是在data.table的语义中进行的。

此外,j表达式还可以用于构建模型的代码,下面是一个批量拟合线性模型的例子。这里使用diamonds数据集。

data("diamonds", package = "ggplot2")
setDT(diamonds)
head(diamonds)
#>    carat       cut color clarity depth table price    x    y    z
#> 1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
#> 2:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
#> 3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
#> 4:  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
#> 5:  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
#> 6:  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

该数据集包含超过5万条钻石信息的记录,每条记录了钻石的10个属性,现在我们队cut列中的每种切割类型都你拟合一个线性回归模型,由此观察每种切割类型中carat与depth是如何反映log(price)的信息。

diamonds[, {
    m = lm(log(price) ~ carat + depth)
    as.list(coef(m))
}, keyby = .(cut)]
#>          cut (Intercept) carat    depth
#> 1:      Fair        7.73  1.26 -0.01498
#> 2:      Good        7.08  1.97 -0.01460
#> 3: Very Good        6.29  2.09 -0.00289
#> 4:   Premium        5.93  1.85  0.00594
#> 5:     Ideal        8.50  2.13 -0.03808

动态作用域允许我们组合使用data.table内部或外部预定义的符号。举例,我们定义一个函数,计算market_data中由用户定义的列的年度均值:

average = function(column){
    market_data[, .(average = mean(.SD[[column]])),
                by = .(year = year(date))]
}

这里我们使用.SD[[x]]提取x列的值,这跟通过名字从列表中提取成分或元素相同。

下面计算每年的平均价格:

average("price")
#>    year average
#> 1: 2015    32.3
#> 2: 2016    32.4

每年平均数量:

average("volume")
#>    year average
#> 1: 2015    4000
#> 2: 2016    4003

我们可以利用此包专门的语法创造一个列数动态变化的组合,并且组合中的列是由动态变化的名称决定的。

这里我们假设添加额外的3列数据,每一列都是原始价格加了随机噪声生成的。不用重复调用market_date[, price1 := ...],而是使用market_data[, (columns) := list(...)]来动态设定列,其中columns是一个包含列名的字符向量,list(...)是每个列对应的值:

price_cols = paste0("price", 1:3)
market_data[, (price_cols) := lapply(1:3,
                                     function(i) round(price + rnorm(.N, 0, 5), 2))]
head(market_data)
#>          date price volume price1 price2 price3
#> 1: 2015-05-01  29.2   4021   30.6   27.4   33.2
#> 2: 2015-05-02  28.9   4000   29.7   20.4   36.0
#> 3: 2015-05-03  31.2   4033   34.3   26.9   27.2
#> 4: 2015-05-04  31.3   4036   29.3   29.0   28.0
#> 5: 2015-05-05  31.5   3995   36.0   32.1   34.8
#> 6: 2015-05-06  34.3   3955   30.1   31.0   35.2

另一方面,如果表格有很多列,并且需要对它们的子集进行一些计算,也可以用类似的语法来解决。

举例,我们现在需要对每个价格列调用na.locf()以去掉缺失值,先获取所有的价格列:

cols = colnames(market_data)
price_cols = cols[grep("^price", cols)]
price_cols
#> [1] "price"  "price1" "price2" "price3"

然后我们用类似的语法,并添加一个参数.SDcols = price_cols,这是为了让.SD中的列只是我们想要的那些价格列。

market_data[, (price_cols) := lapply(.SD, zoo::na.locf), .SDcols =  price_cols]
head(market_data)
#>          date price volume price1 price2 price3
#> 1: 2015-05-01  29.2   4021   30.6   27.4   33.2
#> 2: 2015-05-02  28.9   4000   29.7   20.4   36.0
#> 3: 2015-05-03  31.2   4033   34.3   26.9   27.2
#> 4: 2015-05-04  31.3   4036   29.3   29.0   28.0
#> 5: 2015-05-05  31.5   3995   36.0   32.1   34.8
#> 6: 2015-05-06  34.3   3955   30.1   31.0   35.2

最后,更多操作请前往https://github.com/Rdatatable/data.table/wiki查看完整功能列表。

使用dplyr操作数据框

关于dplyr的基本操作我已经写过很多笔记了,不再赘述,想学习的读者请参阅下面几篇文章,这部分我只挑没接触过的学习下。

data.table类似,dplyr也提供了do()函数来对每组数据进行任意操作。

例如将diamondscut分组,每组都按log(price) ~ carat拟合一个线性模型。和data.table不同的是,我们需要为操作指定一个名称,以便将结果存储在列中。而且do()表达式不能直接在分组数据的语义下计算 ,我们需要使用.来表示数据。

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#> 
#>     between, first, last
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data("diamonds", package = "ggplot2")
models = diamonds %>% 
    group_by(cut) %>% 
    do(lmod = lm(log(price) ~ carat, data = .))
models
#> # A tibble: 5 x 2
#> # Rowwise: 
#>   cut       lmod  
#>   <ord>     <list>
#> 1 Fair      <lm>  
#> 2 Good      <lm>  
#> 3 Very Good <lm>  
#> 4 Premium   <lm>  
#> 5 Ideal     <lm>

注意结果创建了一个新列,该列不是典型的原子向量,每个元素都是模型的结果,包含线性回归对象的列表。我们可以通过索引来提取模型结果:

models$lmod[[1]]
#> 
#> Call:
#> lm(formula = log(price) ~ carat, data = .)
#> 
#> Coefficients:
#> (Intercept)        carat  
#>        6.78         1.25

在需要完成高度定制的操作时,do()的优势非常明显。下面举例。

假如我们需要分析toy_tests数据,要对每种产品的质量和耐久性进行汇总。如果只需要样本数最多的3个测试记录,并且每个产品的质量和耐久性是经样本数加权的平均数,下面是做法。

toy_tests %>% 
    group_by(id) %>% 
    arrange(desc(sample)) %>% 
    do(head(., 3)) %>% 
    summarise(
        quality = sum(quality * sample) / sum(sample),
        durability = sum(durability * sample) / sum(sample)
    )
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 3
#>   id    quality durability
#>   <chr>   <dbl>      <dbl>
#> 1 T01      9.32       9.38
#> 2 T02      9.04       8.34

为了查看中间结果,可以运行do()之前的代码:

toy_tests %>% 
    group_by(id) %>% 
    arrange(desc(sample))
#> # A tibble: 8 x 6
#> # Groups:   id [2]
#>   id        date sample quality durability ym    
#>   <chr>    <int>  <int>   <int>      <int> <chr> 
#> 1 T01   20160405    180       9         10 201604
#> 2 T01   20160302    150      10          9 201603
#> 3 T01   20160502    140       9          9 201605
#> 4 T01   20160201    100       9          9 201602
#> 5 T02   20160403     90       9          8 201604
#> 6 T02   20160502     85      10          9 201605
#> 7 T02   20160303     75       8          8 201603
#> 8 T02   20160201     70       7          9 201602

使用rlist包处理嵌套数据结构

在R中,最常见的嵌套数据结构是列表对象,之前关注的都是操作表格数据,这部分我们一起玩转rlist包,它是针对非表格数据设计的包

rlist的设计与dplyr非常相似,它提供了针对列表对象的映射、筛选、选择、排序和聚合功能。

安装:

install.packages("rlist")

载入:

library(rlist)

为了把数据以列表形式载入R,我们需要用jsonlite::fromJSON()或者直接使用rlist提供的list.load()函数(这里我找不到原作者的数据文件,所以自己谷歌找了类似的学习下):

products = list.load("../../../static/datasets/products.json")

products的每个成分都包含产品所有信息,使用list.map()可以在每个成分的语义中计算表达式:

str(list.map(products, id))
#> List of 66
#>  $ : chr "amazonite"
#>  $ : chr "amber"
#>  $ : chr "amethyst"
#>  $ : chr "ametrine"
#>  $ : chr "apophyllite"
#>  $ : chr "aquamarine"
#>  $ : chr "aragonite"
#>  $ : chr "azurite"
#>  $ : chr "barite"
#>  $ : chr "bismuth"
#>  $ : chr "black_tourmaline"
#>  $ : chr "bloodstone"
#>  $ : chr "blue_calcite"
#>  $ : chr "kyanite"
#>  $ : chr "carnelian"
#>  $ : chr "celestite"
#>  $ : chr "chrysocolla"
#>  $ : chr "chrysoprase"
#>  $ : chr "citrine"
#>  $ : chr "copper"
#>  $ : chr "dalmatian_stone"
#>  $ : chr "danburite"
#>  $ : chr "desert_rose"
#>  $ : chr "emerald"
#>  $ : chr "epidote"
#>  $ : chr "fluorite"
#>  $ : chr "garnet"
#>  $ : chr "green_calcite"
#>  $ : chr "hematite"
#>  $ : chr "honey_calcite"
#>  $ : chr "howlite"
#>  $ : chr "jasper"
#>  $ : chr "jet"
#>  $ : chr "kunzite"
#>  $ : chr "labradorite"
#>  $ : chr "lapis_lazuli"
#>  $ : chr "lemon_quartz"
#>  $ : chr "lemurian_quartz"
#>  $ : chr "lepidolite"
#>  $ : chr "libyan_gold_tektite"
#>  $ : chr "magnesite"
#>  $ : chr "malachite"
#>  $ : chr "meteorite"
#>  $ : chr "moldavite"
#>  $ : chr "moonstone"
#>  $ : chr "orange_calcite"
#>  $ : chr "peridot"
#>  $ : chr "pyrite"
#>  $ : chr "quartz"
#>  $ : chr "rhodochrosite"
#>  $ : chr "rhodonite"
#>  $ : chr "rose_quartz"
#>  $ : chr "ruby"
#>  $ : chr "scolecite"
#>  $ : chr "selenite"
#>  $ : chr "serpentine"
#>  $ : chr "shungite"
#>  $ : chr "smoky_quartz"
#>  $ : chr "sodalite"
#>  $ : chr "tektite"
#>  $ : chr "tibetan_black_quartz"
#>  $ : chr "tigers_eye"
#>  $ : chr "titanium_quartz"
#>  $ : chr "tourmalined_quartz"
#>  $ : chr "turquoise"
#>  $ : chr "vanadinite"

list.mapv()简化这个列表,返回一个向量:

list.mapv(products, name)
#>  [1] "Amazonite"            "Amber"                "Amethyst"            
#>  [4] "Ametrine"             "Apophyllite"          "Aquamarine"          
#>  [7] "Aragonite"            "Azurite"              "Barite"              
#> [10] "Bismuth"              "Black Tourmaline"     "Bloodstone"          
#> [13] "Blue Calcite"         "Blue Kyanite"         "Carnelian"           
#> [16] "Celestite"            "Chrysocolla"          "Chrysoprase"         
#> [19] "Citrine"              "Copper"               "Dalmatian Stone"     
#> [22] "Danburite"            "Desert Rose"          "Emerald"             
#> [25] "Epidote"              "Fluorite"             "Garnet"              
#> [28] "Green Calcite"        "Hematite"             "Honey Calcite"       
#> [31] "Howlite"              "Jasper"               "Jet"                 
#> [34] "Kunzite"              "Labradorite"          "Lapis Lazuli"        
#> [37] "Lemon Quartz"         "Lemurian Quartz"      "Lepidolite"          
#> [40] "Libyan Gold Tektite"  "Magnesite"            "Malachite"           
#> [43] "Meteorite"            "Moldavite"            "Moonstone"           
#> [46] "Orange Calcite"       "Peridot"              "Pyrite"              
#> [49] "Quartz"               "Rhodochrosite"        "Rhodonite"           
#> [52] "Rose Quartz"          "Ruby"                 "Scolecite"           
#> [55] "Selenite"             "Serpentine"           "Shungite"            
#> [58] "Smoky Quartz"         "Sodalite"             "Tektite"             
#> [61] "Tibetan Black Quartz" "Tiger Eye"            "Titanium Quartz"     
#> [64] "Tourmalined Quartz"   "Turquoise"            "Vanadinite"

为了对products进行筛选,我们可以使用list.filter(),只有所有条件都为TRUEproducts元素才会被返回:

liveTRUE = list.filter(products, livemode = TRUE)
list.mapv(liveTRUE, name)
#>  [1] "Amazonite"            "Amber"                "Amethyst"            
#>  [4] "Ametrine"             "Apophyllite"          "Aquamarine"          
#>  [7] "Aragonite"            "Azurite"              "Barite"              
#> [10] "Bismuth"              "Black Tourmaline"     "Bloodstone"          
#> [13] "Blue Calcite"         "Blue Kyanite"         "Carnelian"           
#> [16] "Celestite"            "Chrysocolla"          "Chrysoprase"         
#> [19] "Citrine"              "Copper"               "Dalmatian Stone"     
#> [22] "Danburite"            "Desert Rose"          "Emerald"             
#> [25] "Epidote"              "Fluorite"             "Garnet"              
#> [28] "Green Calcite"        "Hematite"             "Honey Calcite"       
#> [31] "Howlite"              "Jasper"               "Jet"                 
#> [34] "Kunzite"              "Labradorite"          "Lapis Lazuli"        
#> [37] "Lemon Quartz"         "Lemurian Quartz"      "Lepidolite"          
#> [40] "Libyan Gold Tektite"  "Magnesite"            "Malachite"           
#> [43] "Meteorite"            "Moldavite"            "Moonstone"           
#> [46] "Orange Calcite"       "Peridot"              "Pyrite"              
#> [49] "Quartz"               "Rhodochrosite"        "Rhodonite"           
#> [52] "Rose Quartz"          "Ruby"                 "Scolecite"           
#> [55] "Selenite"             "Serpentine"           "Shungite"            
#> [58] "Smoky Quartz"         "Sodalite"             "Tektite"             
#> [61] "Tibetan Black Quartz" "Tiger Eye"            "Titanium Quartz"     
#> [64] "Tourmalined Quartz"   "Turquoise"            "Vanadinite"

注意,rlist的设计与dplyr函数非常类似,我们可以利用管道符号将数据往下传递:

products %>% 
    list.filter(livemode = TRUE) %>% 
    list.mapv(name)
#>  [1] "Amazonite"            "Amber"                "Amethyst"            
#>  [4] "Ametrine"             "Apophyllite"          "Aquamarine"          
#>  [7] "Aragonite"            "Azurite"              "Barite"              
#> [10] "Bismuth"              "Black Tourmaline"     "Bloodstone"          
#> [13] "Blue Calcite"         "Blue Kyanite"         "Carnelian"           
#> [16] "Celestite"            "Chrysocolla"          "Chrysoprase"         
#> [19] "Citrine"              "Copper"               "Dalmatian Stone"     
#> [22] "Danburite"            "Desert Rose"          "Emerald"             
#> [25] "Epidote"              "Fluorite"             "Garnet"              
#> [28] "Green Calcite"        "Hematite"             "Honey Calcite"       
#> [31] "Howlite"              "Jasper"               "Jet"                 
#> [34] "Kunzite"              "Labradorite"          "Lapis Lazuli"        
#> [37] "Lemon Quartz"         "Lemurian Quartz"      "Lepidolite"          
#> [40] "Libyan Gold Tektite"  "Magnesite"            "Malachite"           
#> [43] "Meteorite"            "Moldavite"            "Moonstone"           
#> [46] "Orange Calcite"       "Peridot"              "Pyrite"              
#> [49] "Quartz"               "Rhodochrosite"        "Rhodonite"           
#> [52] "Rose Quartz"          "Ruby"                 "Scolecite"           
#> [55] "Selenite"             "Serpentine"           "Shungite"            
#> [58] "Smoky Quartz"         "Sodalite"             "Tektite"             
#> [61] "Tibetan Black Quartz" "Tiger Eye"            "Titanium Quartz"     
#> [64] "Tourmalined Quartz"   "Turquoise"            "Vanadinite"

使用list.select可以筛选字段以及创建新的字段:

products %>% 
    list.filter(livemode = TRUE) %>% 
    list.select(id, name, caption) %>% 
    str()
#> List of 66
#>  $ :List of 3
#>   ..$ id     : chr "amazonite"
#>   ..$ name   : chr "Amazonite"
#>   ..$ caption: chr "love,calm,communication"
#>  $ :List of 3
#>   ..$ id     : chr "amber"
#>   ..$ name   : chr "Amber"
#>   ..$ caption: chr "cleansing,vitality,protection"
#>  $ :List of 3
#>   ..$ id     : chr "amethyst"
#>   ..$ name   : chr "Amethyst"
#>   ..$ caption: chr "cleansing,protection,clarity"
#>  $ :List of 3
#>   ..$ id     : chr "ametrine"
#>   ..$ name   : chr "Ametrine"
#>   ..$ caption: chr "Clarity,motivation,confidence"
#>  $ :List of 3
#>   ..$ id     : chr "apophyllite"
#>   ..$ name   : chr "Apophyllite"
#>   ..$ caption: chr "insight,intuition,vitality"
#>  $ :List of 3
#>   ..$ id     : chr "aquamarine"
#>   ..$ name   : chr "Aquamarine"
#>   ..$ caption: chr "Clarity,calm,cleansing"
#>  $ :List of 3
#>   ..$ id     : chr "aragonite"
#>   ..$ name   : chr "Aragonite"
#>   ..$ caption: chr "balance,strength,confidence"
#>  $ :List of 3
#>   ..$ id     : chr "azurite"
#>   ..$ name   : chr "Azurite"
#>   ..$ caption: chr "Calm,balance,dreamwork"
#>  $ :List of 3
#>   ..$ id     : chr "barite"
#>   ..$ name   : chr "Barite"
#>   ..$ caption: chr "Dreamwork,insight,intuition"
#>  $ :List of 3
#>   ..$ id     : chr "bismuth"
#>   ..$ name   : chr "Bismuth"
#>   ..$ caption: chr "Transformation,vitality,focus"
#>  $ :List of 3
#>   ..$ id     : chr "black_tourmaline"
#>   ..$ name   : chr "Black Tourmaline"
#>   ..$ caption: chr "grounding,protection,cleansing"
#>  $ :List of 3
#>   ..$ id     : chr "bloodstone"
#>   ..$ name   : chr "Bloodstone"
#>   ..$ caption: chr "courage,cleansing,vitality,strength,manifestation,transformation"
#>  $ :List of 3
#>   ..$ id     : chr "blue_calcite"
#>   ..$ name   : chr "Blue Calcite"
#>   ..$ caption: chr "Creativity,clarity,intuition,calm,protection,dreamwork"
#>  $ :List of 3
#>   ..$ id     : chr "kyanite"
#>   ..$ name   : chr "Blue Kyanite"
#>   ..$ caption: chr "communication,intuition,insight"
#>  $ :List of 3
#>   ..$ id     : chr "carnelian"
#>   ..$ name   : chr "Carnelian"
#>   ..$ caption: chr "confidence,courage,strength,passion,motivation,vitality"
#>  $ :List of 3
#>   ..$ id     : chr "celestite"
#>   ..$ name   : chr "Celestite"
#>   ..$ caption: chr "Intuition,protection,calm"
#>  $ :List of 3
#>   ..$ id     : chr "chrysocolla"
#>   ..$ name   : chr "Chrysocolla"
#>   ..$ caption: chr "communication,strength,insight"
#>  $ :List of 3
#>   ..$ id     : chr "chrysoprase"
#>   ..$ name   : chr "Chrysoprase"
#>   ..$ caption: chr "Balance,joy,transformation"
#>  $ :List of 3
#>   ..$ id     : chr "citrine"
#>   ..$ name   : chr "Citrine"
#>   ..$ caption: chr "confidence,love,passion,manifestation,intuition,creativity"
#>  $ :List of 3
#>   ..$ id     : chr "copper"
#>   ..$ name   : chr "Copper"
#>   ..$ caption: chr "balance,vitality,grounding"
#>  $ :List of 3
#>   ..$ id     : chr "dalmatian_stone"
#>   ..$ name   : chr "Dalmatian Stone"
#>   ..$ caption: chr "protection,calm,joy"
#>  $ :List of 3
#>   ..$ id     : chr "danburite"
#>   ..$ name   : chr "Danburite"
#>   ..$ caption: chr "Calm,joy,communication"
#>  $ :List of 3
#>   ..$ id     : chr "desert_rose"
#>   ..$ name   : chr "Desert Rose"
#>   ..$ caption: chr "insight,courage,clarity"
#>  $ :List of 3
#>   ..$ id     : chr "emerald"
#>   ..$ name   : chr "Emerald"
#>   ..$ caption: chr "Love,Vitality,Balance"
#>  $ :List of 3
#>   ..$ id     : chr "epidote"
#>   ..$ name   : chr "Epidote"
#>   ..$ caption: chr "Transformation,manifestation,motivation"
#>  $ :List of 3
#>   ..$ id     : chr "fluorite"
#>   ..$ name   : chr "Fluorite"
#>   ..$ caption: chr "focus,clarity,insight"
#>  $ :List of 3
#>   ..$ id     : chr "garnet"
#>   ..$ name   : chr "Garnet"
#>   ..$ caption: chr "Creativity,manifestation,passion"
#>  $ :List of 3
#>   ..$ id     : chr "green_calcite"
#>   ..$ name   : chr "Green Calcite"
#>   ..$ caption: chr "Calm,Balance,Love"
#>  $ :List of 3
#>   ..$ id     : chr "hematite"
#>   ..$ name   : chr "Hematite"
#>   ..$ caption: chr "balance,grounding,manifestation"
#>  $ :List of 3
#>   ..$ id     : chr "honey_calcite"
#>   ..$ name   : chr "Honey Calcite"
#>   ..$ caption: chr "confidence,motivation,manifestation"
#>  $ :List of 3
#>   ..$ id     : chr "howlite"
#>   ..$ name   : chr "Howlite"
#>   ..$ caption: chr "focus,calm,creativity"
#>  $ :List of 3
#>   ..$ id     : chr "jasper"
#>   ..$ name   : chr "Jasper"
#>   ..$ caption: chr "Grounding,Balance,Strength"
#>  $ :List of 3
#>   ..$ id     : chr "jet"
#>   ..$ name   : chr "Jet"
#>   ..$ caption: chr "Grounding,cleansing,vitality"
#>  $ :List of 3
#>   ..$ id     : chr "kunzite"
#>   ..$ name   : chr "Kunzite"
#>   ..$ caption: chr "Love,joy,vitality"
#>  $ :List of 3
#>   ..$ id     : chr "labradorite"
#>   ..$ name   : chr "Labradorite"
#>   ..$ caption: chr "Creativity,joy,vitality"
#>  $ :List of 3
#>   ..$ id     : chr "lapis_lazuli"
#>   ..$ name   : chr "Lapis Lazuli"
#>   ..$ caption: chr "insight,calm,transformation"
#>  $ :List of 3
#>   ..$ id     : chr "lemon_quartz"
#>   ..$ name   : chr "Lemon Quartz"
#>   ..$ caption: chr "Joy,Balance,Vitality"
#>  $ :List of 3
#>   ..$ id     : chr "lemurian_quartz"
#>   ..$ name   : chr "Lemurian Quartz"
#>   ..$ caption: chr "Love,strength,balance"
#>  $ :List of 3
#>   ..$ id     : chr "lepidolite"
#>   ..$ name   : chr "Lepidolite"
#>   ..$ caption: chr "Calm,Balance,Cleansing"
#>  $ :List of 3
#>   ..$ id     : chr "libyan_gold_tektite"
#>   ..$ name   : chr "Libyan Gold Tektite"
#>   ..$ caption: chr "Motivation,confidence,manifestation,protection,creativity,strength"
#>  $ :List of 3
#>   ..$ id     : chr "magnesite"
#>   ..$ name   : chr "Magnesite"
#>   ..$ caption: chr "balance,calm,insight"
#>  $ :List of 3
#>   ..$ id     : chr "malachite"
#>   ..$ name   : chr "Malachite"
#>   ..$ caption: chr "creativity,strength,confidence"
#>  $ :List of 3
#>   ..$ id     : chr "meteorite"
#>   ..$ name   : chr "Meteorite"
#>   ..$ caption: chr "Vitality,Intuition,Insight"
#>  $ :List of 3
#>   ..$ id     : chr "moldavite"
#>   ..$ name   : chr "Moldavite"
#>   ..$ caption: chr "Cleansing,Manifestation,Vitality,Insight,Transformation,Dreamwork"
#>  $ :List of 3
#>   ..$ id     : chr "moonstone"
#>   ..$ name   : chr "Moonstone"
#>   ..$ caption: chr "insight,intuition,dreamwork"
#>  $ :List of 3
#>   ..$ id     : chr "orange_calcite"
#>   ..$ name   : chr "Orange Calcite"
#>   ..$ caption: chr "Creativity,vitality,passion"
#>  $ :List of 3
#>   ..$ id     : chr "peridot"
#>   ..$ name   : chr "Peridot"
#>   ..$ caption: chr "manifestation,vitality,transformation"
#>  $ :List of 3
#>   ..$ id     : chr "pyrite"
#>   ..$ name   : chr "Pyrite"
#>   ..$ caption: chr "manifestation,creativity,confidence,passion,strength,focus"
#>  $ :List of 3
#>   ..$ id     : chr "quartz"
#>   ..$ name   : chr "Quartz"
#>   ..$ caption: chr "clarity,transformation,cleansing"
#>  $ :List of 3
#>   ..$ id     : chr "rhodochrosite"
#>   ..$ name   : chr "Rhodochrosite"
#>   ..$ caption: chr "Love,courage,joy"
#>  $ :List of 3
#>   ..$ id     : chr "rhodonite"
#>   ..$ name   : chr "Rhodonite"
#>   ..$ caption: chr "Love,transformation,clarity"
#>  $ :List of 3
#>   ..$ id     : chr "rose_quartz"
#>   ..$ name   : chr "Rose Quartz"
#>   ..$ caption: chr "love,calm,creativity"
#>  $ :List of 3
#>   ..$ id     : chr "ruby"
#>   ..$ name   : chr "Ruby"
#>   ..$ caption: chr "Passion,vitality,courage,strength,protection,motivation"
#>  $ :List of 3
#>   ..$ id     : chr "scolecite"
#>   ..$ name   : chr "Scolecite"
#>   ..$ caption: chr "calm,insight,dreamwork"
#>  $ :List of 3
#>   ..$ id     : chr "selenite"
#>   ..$ name   : chr "Selenite"
#>   ..$ caption: chr "Clarity,cleansing,motivation"
#>  $ :List of 3
#>   ..$ id     : chr "serpentine"
#>   ..$ name   : chr "Serpentine"
#>   ..$ caption: chr "Manifestation,transformation,vitality"
#>  $ :List of 3
#>   ..$ id     : chr "shungite"
#>   ..$ name   : chr "Shungite"
#>   ..$ caption: chr "clarity,transformation,cleansing"
#>  $ :List of 3
#>   ..$ id     : chr "smoky_quartz"
#>   ..$ name   : chr "Smoky Quartz"
#>   ..$ caption: chr "calm,clarity,protection"
#>  $ :List of 3
#>   ..$ id     : chr "sodalite"
#>   ..$ name   : chr "Sodalite"
#>   ..$ caption: chr "Insight,Intuition,Dreamwork"
#>  $ :List of 3
#>   ..$ id     : chr "tektite"
#>   ..$ name   : chr "Tektite"
#>   ..$ caption: chr "Communication,transformation,vitality"
#>  $ :List of 3
#>   ..$ id     : chr "tibetan_black_quartz"
#>   ..$ name   : chr "Tibetan Black Quartz"
#>   ..$ caption: chr "Protection,balance,cleansing"
#>  $ :List of 3
#>   ..$ id     : chr "tigers_eye"
#>   ..$ name   : chr "Tiger Eye"
#>   ..$ caption: chr "Balance,confidence,strength"
#>  $ :List of 3
#>   ..$ id     : chr "titanium_quartz"
#>   ..$ name   : chr "Titanium Quartz"
#>   ..$ caption: chr "strength,focus,confidence"
#>  $ :List of 3
#>   ..$ id     : chr "tourmalined_quartz"
#>   ..$ name   : chr "Tourmalined Quartz"
#>   ..$ caption: chr "Cleansing,protection,clarity"
#>  $ :List of 3
#>   ..$ id     : chr "turquoise"
#>   ..$ name   : chr "Turquoise"
#>   ..$ caption: chr "strength,balance,communication"
#>  $ :List of 3
#>   ..$ id     : chr "vanadinite"
#>   ..$ name   : chr "Vanadinite"
#>   ..$ caption: chr "creativity,motivation,vitality"

创建新字段:

products %>% 
    list.filter(livemode = TRUE) %>% 
    list.select(id, name, caption) %>% 
    list.select(connect = paste(id, name, sep = "-")) %>% 
    str()
#> List of 66
#>  $ :List of 1
#>   ..$ connect: chr "amazonite-Amazonite"
#>  $ :List of 1
#>   ..$ connect: chr "amber-Amber"
#>  $ :List of 1
#>   ..$ connect: chr "amethyst-Amethyst"
#>  $ :List of 1
#>   ..$ connect: chr "ametrine-Ametrine"
#>  $ :List of 1
#>   ..$ connect: chr "apophyllite-Apophyllite"
#>  $ :List of 1
#>   ..$ connect: chr "aquamarine-Aquamarine"
#>  $ :List of 1
#>   ..$ connect: chr "aragonite-Aragonite"
#>  $ :List of 1
#>   ..$ connect: chr "azurite-Azurite"
#>  $ :List of 1
#>   ..$ connect: chr "barite-Barite"
#>  $ :List of 1
#>   ..$ connect: chr "bismuth-Bismuth"
#>  $ :List of 1
#>   ..$ connect: chr "black_tourmaline-Black Tourmaline"
#>  $ :List of 1
#>   ..$ connect: chr "bloodstone-Bloodstone"
#>  $ :List of 1
#>   ..$ connect: chr "blue_calcite-Blue Calcite"
#>  $ :List of 1
#>   ..$ connect: chr "kyanite-Blue Kyanite"
#>  $ :List of 1
#>   ..$ connect: chr "carnelian-Carnelian"
#>  $ :List of 1
#>   ..$ connect: chr "celestite-Celestite"
#>  $ :List of 1
#>   ..$ connect: chr "chrysocolla-Chrysocolla"
#>  $ :List of 1
#>   ..$ connect: chr "chrysoprase-Chrysoprase"
#>  $ :List of 1
#>   ..$ connect: chr "citrine-Citrine"
#>  $ :List of 1
#>   ..$ connect: chr "copper-Copper"
#>  $ :List of 1
#>   ..$ connect: chr "dalmatian_stone-Dalmatian Stone"
#>  $ :List of 1
#>   ..$ connect: chr "danburite-Danburite"
#>  $ :List of 1
#>   ..$ connect: chr "desert_rose-Desert Rose"
#>  $ :List of 1
#>   ..$ connect: chr "emerald-Emerald"
#>  $ :List of 1
#>   ..$ connect: chr "epidote-Epidote"
#>  $ :List of 1
#>   ..$ connect: chr "fluorite-Fluorite"
#>  $ :List of 1
#>   ..$ connect: chr "garnet-Garnet"
#>  $ :List of 1
#>   ..$ connect: chr "green_calcite-Green Calcite"
#>  $ :List of 1
#>   ..$ connect: chr "hematite-Hematite"
#>  $ :List of 1
#>   ..$ connect: chr "honey_calcite-Honey Calcite"
#>  $ :List of 1
#>   ..$ connect: chr "howlite-Howlite"
#>  $ :List of 1
#>   ..$ connect: chr "jasper-Jasper"
#>  $ :List of 1
#>   ..$ connect: chr "jet-Jet"
#>  $ :List of 1
#>   ..$ connect: chr "kunzite-Kunzite"
#>  $ :List of 1
#>   ..$ connect: chr "labradorite-Labradorite"
#>  $ :List of 1
#>   ..$ connect: chr "lapis_lazuli-Lapis Lazuli"
#>  $ :List of 1
#>   ..$ connect: chr "lemon_quartz-Lemon Quartz"
#>  $ :List of 1
#>   ..$ connect: chr "lemurian_quartz-Lemurian Quartz"
#>  $ :List of 1
#>   ..$ connect: chr "lepidolite-Lepidolite"
#>  $ :List of 1
#>   ..$ connect: chr "libyan_gold_tektite-Libyan Gold Tektite"
#>  $ :List of 1
#>   ..$ connect: chr "magnesite-Magnesite"
#>  $ :List of 1
#>   ..$ connect: chr "malachite-Malachite"
#>  $ :List of 1
#>   ..$ connect: chr "meteorite-Meteorite"
#>  $ :List of 1
#>   ..$ connect: chr "moldavite-Moldavite"
#>  $ :List of 1
#>   ..$ connect: chr "moonstone-Moonstone"
#>  $ :List of 1
#>   ..$ connect: chr "orange_calcite-Orange Calcite"
#>  $ :List of 1
#>   ..$ connect: chr "peridot-Peridot"
#>  $ :List of 1
#>   ..$ connect: chr "pyrite-Pyrite"
#>  $ :List of 1
#>   ..$ connect: chr "quartz-Quartz"
#>  $ :List of 1
#>   ..$ connect: chr "rhodochrosite-Rhodochrosite"
#>  $ :List of 1
#>   ..$ connect: chr "rhodonite-Rhodonite"
#>  $ :List of 1
#>   ..$ connect: chr "rose_quartz-Rose Quartz"
#>  $ :List of 1
#>   ..$ connect: chr "ruby-Ruby"
#>  $ :List of 1
#>   ..$ connect: chr "scolecite-Scolecite"
#>  $ :List of 1
#>   ..$ connect: chr "selenite-Selenite"
#>  $ :List of 1
#>   ..$ connect: chr "serpentine-Serpentine"
#>  $ :List of 1
#>   ..$ connect: chr "shungite-Shungite"
#>  $ :List of 1
#>   ..$ connect: chr "smoky_quartz-Smoky Quartz"
#>  $ :List of 1
#>   ..$ connect: chr "sodalite-Sodalite"
#>  $ :List of 1
#>   ..$ connect: chr "tektite-Tektite"
#>  $ :List of 1
#>   ..$ connect: chr "tibetan_black_quartz-Tibetan Black Quartz"
#>  $ :List of 1
#>   ..$ connect: chr "tigers_eye-Tiger Eye"
#>  $ :List of 1
#>   ..$ connect: chr "titanium_quartz-Titanium Quartz"
#>  $ :List of 1
#>   ..$ connect: chr "tourmalined_quartz-Tourmalined Quartz"
#>  $ :List of 1
#>   ..$ connect: chr "turquoise-Turquoise"
#>  $ :List of 1
#>   ..$ connect: chr "vanadinite-Vanadinite"

list.sort()函数,可以按照指定字段或值对列表元素进行排序,再用list.stack()将所有元素变成数据框:

products %>% 
    list.filter(livemode = TRUE) %>% 
    list.select(id, name, caption) %>% 
    list.sort(id) %>% 
    list.stack()
#>                      id                 name
#> 1             amazonite            Amazonite
#> 2                 amber                Amber
#> 3              amethyst             Amethyst
#> 4              ametrine             Ametrine
#> 5           apophyllite          Apophyllite
#> 6            aquamarine           Aquamarine
#> 7             aragonite            Aragonite
#> 8               azurite              Azurite
#> 9                barite               Barite
#> 10              bismuth              Bismuth
#> 11     black_tourmaline     Black Tourmaline
#> 12           bloodstone           Bloodstone
#> 13         blue_calcite         Blue Calcite
#> 14            carnelian            Carnelian
#> 15            celestite            Celestite
#> 16          chrysocolla          Chrysocolla
#> 17          chrysoprase          Chrysoprase
#> 18              citrine              Citrine
#> 19               copper               Copper
#> 20      dalmatian_stone      Dalmatian Stone
#> 21            danburite            Danburite
#> 22          desert_rose          Desert Rose
#> 23              emerald              Emerald
#> 24              epidote              Epidote
#> 25             fluorite             Fluorite
#> 26               garnet               Garnet
#> 27        green_calcite        Green Calcite
#> 28             hematite             Hematite
#> 29        honey_calcite        Honey Calcite
#> 30              howlite              Howlite
#> 31               jasper               Jasper
#> 32                  jet                  Jet
#> 33              kunzite              Kunzite
#> 34              kyanite         Blue Kyanite
#> 35          labradorite          Labradorite
#> 36         lapis_lazuli         Lapis Lazuli
#> 37         lemon_quartz         Lemon Quartz
#> 38      lemurian_quartz      Lemurian Quartz
#> 39           lepidolite           Lepidolite
#> 40  libyan_gold_tektite  Libyan Gold Tektite
#> 41            magnesite            Magnesite
#> 42            malachite            Malachite
#> 43            meteorite            Meteorite
#> 44            moldavite            Moldavite
#> 45            moonstone            Moonstone
#> 46       orange_calcite       Orange Calcite
#> 47              peridot              Peridot
#> 48               pyrite               Pyrite
#> 49               quartz               Quartz
#> 50        rhodochrosite        Rhodochrosite
#> 51            rhodonite            Rhodonite
#> 52          rose_quartz          Rose Quartz
#> 53                 ruby                 Ruby
#> 54            scolecite            Scolecite
#> 55             selenite             Selenite
#> 56           serpentine           Serpentine
#> 57             shungite             Shungite
#> 58         smoky_quartz         Smoky Quartz
#> 59             sodalite             Sodalite
#> 60              tektite              Tektite
#> 61 tibetan_black_quartz Tibetan Black Quartz
#> 62           tigers_eye            Tiger Eye
#> 63      titanium_quartz      Titanium Quartz
#> 64   tourmalined_quartz   Tourmalined Quartz
#> 65            turquoise            Turquoise
#> 66           vanadinite           Vanadinite
#>                                                               caption
#> 1                                             love,calm,communication
#> 2                                       cleansing,vitality,protection
#> 3                                        cleansing,protection,clarity
#> 4                                       Clarity,motivation,confidence
#> 5                                          insight,intuition,vitality
#> 6                                              Clarity,calm,cleansing
#> 7                                         balance,strength,confidence
#> 8                                              Calm,balance,dreamwork
#> 9                                         Dreamwork,insight,intuition
#> 10                                      Transformation,vitality,focus
#> 11                                     grounding,protection,cleansing
#> 12   courage,cleansing,vitality,strength,manifestation,transformation
#> 13             Creativity,clarity,intuition,calm,protection,dreamwork
#> 14            confidence,courage,strength,passion,motivation,vitality
#> 15                                          Intuition,protection,calm
#> 16                                     communication,strength,insight
#> 17                                         Balance,joy,transformation
#> 18         confidence,love,passion,manifestation,intuition,creativity
#> 19                                         balance,vitality,grounding
#> 20                                                protection,calm,joy
#> 21                                             Calm,joy,communication
#> 22                                            insight,courage,clarity
#> 23                                              Love,Vitality,Balance
#> 24                            Transformation,manifestation,motivation
#> 25                                              focus,clarity,insight
#> 26                                   Creativity,manifestation,passion
#> 27                                                  Calm,Balance,Love
#> 28                                    balance,grounding,manifestation
#> 29                                confidence,motivation,manifestation
#> 30                                              focus,calm,creativity
#> 31                                         Grounding,Balance,Strength
#> 32                                       Grounding,cleansing,vitality
#> 33                                                  Love,joy,vitality
#> 34                                    communication,intuition,insight
#> 35                                            Creativity,joy,vitality
#> 36                                        insight,calm,transformation
#> 37                                               Joy,Balance,Vitality
#> 38                                              Love,strength,balance
#> 39                                             Calm,Balance,Cleansing
#> 40 Motivation,confidence,manifestation,protection,creativity,strength
#> 41                                               balance,calm,insight
#> 42                                     creativity,strength,confidence
#> 43                                         Vitality,Intuition,Insight
#> 44  Cleansing,Manifestation,Vitality,Insight,Transformation,Dreamwork
#> 45                                        insight,intuition,dreamwork
#> 46                                        Creativity,vitality,passion
#> 47                              manifestation,vitality,transformation
#> 48         manifestation,creativity,confidence,passion,strength,focus
#> 49                                   clarity,transformation,cleansing
#> 50                                                   Love,courage,joy
#> 51                                        Love,transformation,clarity
#> 52                                               love,calm,creativity
#> 53            Passion,vitality,courage,strength,protection,motivation
#> 54                                             calm,insight,dreamwork
#> 55                                       Clarity,cleansing,motivation
#> 56                              Manifestation,transformation,vitality
#> 57                                   clarity,transformation,cleansing
#> 58                                            calm,clarity,protection
#> 59                                        Insight,Intuition,Dreamwork
#> 60                              Communication,transformation,vitality
#> 61                                       Protection,balance,cleansing
#> 62                                        Balance,confidence,strength
#> 63                                          strength,focus,confidence
#> 64                                       Cleansing,protection,clarity
#> 65                                     strength,balance,communication
#> 66                                     creativity,motivation,vitality

除此之外,list.group()函数可以用指定变量字段值对元素分组,list.table()提供了一个加强版的table()用于处理嵌套列表的表格处理。

更多rlist函数,请阅读rlist教程(https://renkun.me/rlist-tutorial),另外purrr是基于其他理念处理嵌套数据结构的包,有兴趣可以看看。