Shixiang Wang

stringr: working with strings

Shixiang Wang (王诗翔) · 2018-09-17

Categories: r
Tags: r   stringr   string   text-processing

Load the packages:

library(tidyverse)
#> ── Attaching packages ──────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
#> ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
#> ✓ tibble  3.0.3     ✓ dplyr   1.0.0
#> ✓ tidyr   1.1.0     ✓ stringr 1.4.0
#> ✓ readr   1.3.1     ✓ forcats 0.5.0
#> ── Conflicts ─────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
library(stringr)
x = c("\"", "\\")

Show the raw contents of the strings (without R's escaping):

writeLines(x)
#> "
#> \

String length

str_length(c("a", "R for data science", NA))
#> [1]  1 18 NA

Combining strings

Combine two or more strings:

str_c("x", "y")
#> [1] "xy"
str_c("x", "y", "z")
#> [1] "xyz"

Control the separator with sep:

str_c("x", "y", sep = ",")
#> [1] "x,y"

Missing values are contagious. To print NA as the literal string "NA", wrap the vector in str_replace_na():

x = c("abc", NA)
str_c("|-", x, "-|")
#> [1] "|-abc-|" NA
str_c("|-", str_replace_na(x), "-|")
#> [1] "|-abc-|" "|-NA-|"

str_c() is vectorized:

str_c("prefix-", c("a", "b", "c"), "-suffix")
#> [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"

Collapse a character vector into a single string:

str_c(c("x", "y", "z"), collapse = ", ")
#> [1] "x, y, z"
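As a small sketch of how sep and collapse interact (the keys and vals vectors are made up for illustration): sep joins the arguments element by element, and collapse then joins the resulting vector into one string.

```r
library(stringr)

keys = c("a", "b", "c")   # hypothetical example data
vals = c("1", "2", "3")

# sep works element-wise and returns a vector of the same length
pairs = str_c(keys, vals, sep = "=")

# collapse is applied afterwards and returns a single string
joined = str_c(keys, vals, sep = "=", collapse = "&")

print(pairs)
print(joined)
```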

Subsetting strings

x = c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
#> [1] "App" "Ban" "Pea"

Negative numbers count backwards from the end of the string:

str_sub(x, -3, -1)
#> [1] "ple" "ana" "ear"

Note that str_sub() does not fail when the string is too short; it returns as many characters as it can:

str_sub("a", 1, 5)
#> [1] "a"

The assignment form of str_sub() modifies strings:

str_sub(x, 1, 1) = str_to_lower(str_sub(x, 1, 1))
x
#> [1] "apple"  "banana" "pear"

Locales

The rules for handling strings can differ between countries and regions, for example when changing case:

str_to_upper(c("i", "l"))
#> [1] "I" "L"
str_to_upper(c("i", "l"), locale = "tr")
#> [1] "İ" "L"

Sorting is also locale-dependent:

x = c("apple", "eggplant", "banana")
str_sort(x, locale = "en")
#> [1] "apple"    "banana"   "eggplant"
str_sort(x, locale = "haw")
#> [1] "apple"    "eggplant" "banana"

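Using regular expressions

We can learn regular expressions with str_view() and str_view_all(). Both functions take a character vector and a regular expression.

Basic matches

Match an exact string: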

x = c("apple", "banana", "pear")
str_view(x, "an")

The next step up in complexity is ., which matches any character except a newline:

str_view(x, ".a.")

Anchors

  • ^ matches at the start of the string
  • $ matches at the end of the string

str_view(x, "^a")
str_view(x, "a$")

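Character classes and alternatives

Besides ., there are four other common kinds of character class:

  • \d matches any digit
  • \s matches any whitespace character
  • [abc] matches a, b, or c
  • [^abc] matches any character except a, b, or c

Because \ must itself be escaped inside an R string, the regular expression \s is written as "\\s" in R code, and the same applies to the other classes.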
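The double escaping can be checked directly. In this sketch (the input strings are invented for illustration), writeLines() prints the pattern as the regex engine sees it, and a literal dot shows the same rule applied to a metacharacter.

```r
library(stringr)

pattern = "\\d+"       # R string literal; the regex engine receives \d+
writeLines(pattern)    # shows the pattern without R's escaping

# Extract the first run of digits from a made-up string
digits = str_extract("room 101", pattern)

# A literal dot is the regex \. and therefore the R string "\\."
has_dot = str_detect(c("a.b", "ab"), "\\.")

print(digits)
print(has_dot)
```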

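| chooses between alternative patterns; for example, abc|xyz matches abc or xyz. The operator has very low precedence, so use parentheses to make the intent explicit: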

str_view(c("grey", "gray"), "gr(e|a)y")

Repetition

These operators control how many times a pattern may match:

  • ?: 0 or 1 time
  • +: 1 or more times
  • *: 0 or more times

x = "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
str_view(x ,"CC+")

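Specify the number of matches exactly:

  • {n}: exactly n times
  • {n,}: n or more times
  • {,m}: at most m times (if the regex engine rejects this form, {0,m} is equivalent)
  • {n,m}: between n and m times
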
str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")

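Matching is greedy by default: the regular expression matches the longest string it can. Adding ? after a quantifier makes it lazy, so that it matches the shortest string possible.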

str_view(x, "C{2,3}?")
str_view(x, "C[LX]+?")
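Because str_view() renders its result as an HTML widget, the difference is easier to see in the console with str_extract(). The following sketch reuses the same string and patterns and only swaps the viewing function:

```r
library(stringr)

x = "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

greedy_c = str_extract(x, "C{2,3}")    # takes as many C's as allowed
lazy_c   = str_extract(x, "C{2,3}?")   # stops at the minimum of two

greedy_lx = str_extract(x, "C[LX]+")   # C followed by every L/X that follows
lazy_lx   = str_extract(x, "C[LX]+?")  # C followed by a single L or X

print(c(greedy_c, lazy_c, greedy_lx, lazy_lx))
```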

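Grouping and backreferences

Parentheses remove ambiguity in complex expressions, and they also define groups. A group can be referred to later in the pattern with a backreference such as \1 or \2.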

str_view(fruit, "(..)\\1", match = TRUE)
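As a console-friendly sketch of the same pattern, the example below uses a small hand-picked vector (an assumption for illustration, not the full fruit dataset) and returns the elements in which some pair of characters is immediately repeated.

```r
library(stringr)

candidates = c("banana", "coconut", "apple", "papaya")  # illustrative subset

# (..) captures any two characters; \\1 requires the same two again
repeated_pair = str_subset(candidates, "(..)\\1")

# The matched text itself: the pair followed by its repeat
matched_text = str_extract(repeated_pair, "(..)\\1")

print(repeated_pair)
print(matched_text)
```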

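Tools

This section covers the stringr functions that detect, extract, replace, split on, and locate matches.

Detecting matches

To determine whether each element of a character vector matches a pattern, use str_detect():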

x = c("apple", "banana", "pear")
str_detect(x, "e")
#> [1]  TRUE FALSE  TRUE

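In a numeric context FALSE is treated as 0 and TRUE as 1, so applying sum() and mean() to the result gives the number and the proportion of matches, which is often very useful.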

sum(str_detect(words, "^t"))
#> [1] 65
mean(str_detect(words, "[aeiou]$"))
#> [1] 0.277

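When the logical condition is complex, combining several str_detect() calls with logical operators is often easier than writing a single regular expression.

For example, here are two ways to find all words that contain no vowels: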

no_vowel_1 = !str_detect(words, "[aeiou]")
no_vowel_2 = str_detect(words, "^[^aeiou]+$")
identical(no_vowel_1, no_vowel_2)
#> [1] TRUE

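Both approaches give the same result, but the first is easier to understand.

A common use of str_detect() is to select the elements that match a pattern and then subset. The str_subset() wrapper does both steps at once: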

words[str_detect(words, "x$")]
#> [1] "box" "sex" "six" "tax"
str_subset(words, "x$")
#> [1] "box" "sex" "six" "tax"

Strings are usually a column of a data frame, so we can combine str_detect() with filter():

df = tibble(
    word = words,
    i = seq_along(words)
)
df %>% 
    filter(str_detect(word, "x$"))
#> # A tibble: 4 x 2
#>   word      i
#>   <chr> <int>
#> 1 box     108
#> 2 sex     747
#> 3 six     772
#> 4 tax     841

A variation on str_detect() is str_count(), which returns the number of matches in each string:

str_count(x, "a")
#> [1] 1 3 1

str_count() combines naturally with mutate():

df %>% 
    mutate(
        vowels = str_count(word, "[aeiou]"),
        consonants = str_count(word, "[^aeiou]")
    )
#> # A tibble: 980 x 4
#>    word         i vowels consonants
#>    <chr>    <int>  <int>      <int>
#>  1 a            1      1          0
#>  2 able         2      2          2
#>  3 about        3      3          2
#>  4 absolute     4      4          4
#>  5 accept       5      2          4
#>  6 account      6      3          4
#>  7 achieve      7      4          3
#>  8 across       8      2          4
#>  9 act          9      1          2
#> 10 active      10      3          3
#> # … with 970 more rows

Note that matches never overlap. For example, in "abababa" the pattern "aba" matches 2 times, not 3:

str_count("abababa", "aba")
#> [1] 2
str_view_all("abababa", "aba")

str_view_all() highlights every match rather than only the first.
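If overlapping occurrences do need to be counted, one common workaround is a zero-width lookahead. This is a sketch that goes beyond the original text; it assumes the regex engine supports lookahead, which stringr's default engine does. The lookahead matches at every position where the pattern starts, without consuming any characters.

```r
library(stringr)

s = "abababa"

# Ordinary matches consume text, so the second match starts after the first ends
non_overlapping = str_count(s, "aba")

# (?=aba) matches the empty string at each position where "aba" begins
overlapping = str_count(s, "(?=aba)")

print(c(non_overlapping = non_overlapping, overlapping = overlapping))
```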

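Extracting matches

We can use str_extract() to extract the actual text of a match. As a more complex example, we use the Harvard sentences (described on Wikipedia), available in stringr as sentences.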

length(sentences)
#> [1] 720
head(sentences)
#> [1] "The birch canoe slid on the smooth planks." 
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."     
#> [4] "These days a chicken leg is a rare dish."   
#> [5] "Rice is often served in round bowls."       
#> [6] "The juice of lemons makes fine punch."

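Suppose we want to find all sentences that contain a color. We first create a vector of color names and then turn it into a single regular expression: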

colors = c(
    "red", "orange", "yellow", "green", "blue", "purple"
)
color_match = str_c(colors, collapse = "|")
color_match
#> [1] "red|orange|yellow|green|blue|purple"

Now we select the sentences that contain a color and then extract the color:

has_color = str_subset(sentences, color_match)
matches = str_extract(has_color, color_match)
head(matches)
#> [1] "blue" "blue" "red"  "red"  "red"  "blue"

Note that str_extract() extracts only the first match. Selecting the sentences that have more than one match makes this easier to see:

more = sentences[str_count(sentences, color_match) > 1]
str_view_all(more, color_match)
str_extract(more, color_match)
#> [1] "blue"   "green"  "orange"

This is a common pattern in stringr functions: working with a single match allows a simpler data structure. To get all matches, use str_extract_all(), which returns a list:

str_extract_all(more, color_match)
#> [[1]]
#> [1] "blue" "red" 
#> 
#> [[2]]
#> [1] "green" "red"  
#> 
#> [[3]]
#> [1] "orange" "red"

With simplify = TRUE the result is a matrix in which rows with fewer matches are padded with empty strings to the length of the longest:

str_extract_all(more, color_match, simplify = TRUE)
#>      [,1]     [,2] 
#> [1,] "blue"   "red"
#> [2,] "green"  "red"
#> [3,] "orange" "red"
x = c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
#>      [,1] [,2] [,3]
#> [1,] "a"  ""   ""  
#> [2,] "a"  "b"  ""  
#> [3,] "a"  "b"  "c"

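Grouped matches

Parentheses in a regular expression clarify precedence and also define groups that can be backreferenced while matching. We can therefore use parentheses to extract the individual parts of a complex match.

For example, suppose we want to extract nouns from the sentences. As a heuristic, we look for any word that follows a or the. Defining a "word" with a regular expression is somewhat tricky, so we use a simple approximation: a sequence of at least one character that is not a space.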

noun = "(a|the) ([^ ]+)"
has_noun = sentences %>% 
    str_subset(noun) %>% 
    head(10)
has_noun %>% 
    str_extract(noun)
#>  [1] "the smooth" "the sheet"  "the depth"  "a chicken"  "the parked"
#>  [6] "the sun"    "the huge"   "the ball"   "the woman"  "a helps"

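str_extract() gives the complete match, whereas str_match() gives each individual group. Instead of a character vector, str_match() returns a matrix: the first column is the complete match, followed by one column for each group: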

has_noun %>% 
    str_match(noun)
#>       [,1]         [,2]  [,3]     
#>  [1,] "the smooth" "the" "smooth" 
#>  [2,] "the sheet"  "the" "sheet"  
#>  [3,] "the depth"  "the" "depth"  
#>  [4,] "a chicken"  "a"   "chicken"
#>  [5,] "the parked" "the" "parked" 
#>  [6,] "the sun"    "the" "sun"    
#>  [7,] "the huge"   "the" "huge"   
#>  [8,] "the ball"   "the" "ball"   
#>  [9,] "the woman"  "the" "woman"  
#> [10,] "a helps"    "a"   "helps"

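This heuristic for detecting nouns does not work very well: it also picks up adjectives such as smooth and parked.

If the data is stored in a tibble, tidyr::extract() is often easier to use. It works like str_match(), but requires a name for each group and places the groups in new columns with those names: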

tibble(sentences = sentences) %>% 
    tidyr::extract(
        sentences, c("article", "noun"), "(a|the) ([^ ]+)",
        remove = FALSE
    )
#> # A tibble: 720 x 3
#>    sentences                                   article noun   
#>    <chr>                                       <chr>   <chr>  
#>  1 The birch canoe slid on the smooth planks.  the     smooth 
#>  2 Glue the sheet to the dark blue background. the     sheet  
#>  3 It's easy to tell the depth of a well.      the     depth  
#>  4 These days a chicken leg is a rare dish.    a       chicken
#>  5 Rice is often served in round bowls.        <NA>    <NA>   
#>  6 The juice of lemons makes fine punch.       <NA>    <NA>   
#>  7 The box was thrown beside the parked truck. the     parked 
#>  8 The hogs were fed chopped corn and garbage. <NA>    <NA>   
#>  9 Four hours of steady work faced us.         <NA>    <NA>   
#> 10 Large size in stockings is hard to sell.    <NA>    <NA>   
#> # … with 710 more rows

As with str_extract(), finding all matches in each string requires the _all variant, str_match_all().
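A minimal sketch of str_match_all() on one of the sentences shown above. It returns a list with one matrix per input string, and each row of a matrix is one match with its groups.

```r
library(stringr)

noun = "(a|the) ([^ ]+)"
s = "Glue the sheet to the dark blue background."

# One list element per input string; [[1]] is the matrix for s
m = str_match_all(s, noun)[[1]]
print(m)

# Column 1 is the full match, columns 2 and 3 are the two groups
articles = m[, 2]
following_words = m[, 3]
```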

Replacing matches

str_replace() and str_replace_all() replace matches with new strings. The simplest use is to replace a pattern with a fixed string:

x = c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple"  "p-ar"   "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-"  "p--r"   "b-n-n-"

By supplying a named vector, str_replace_all() can perform several replacements at once:

x = c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house"    "two cars"     "three people"

Instead of a fixed string, the replacement can use backreferences to insert the matched groups. The following code swaps the second and third words:

sentences %>% head(5)
#> [1] "The birch canoe slid on the smooth planks." 
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."     
#> [4] "These days a chicken leg is a rare dish."   
#> [5] "Rice is often served in round bowls."
sentences %>% 
    str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
    head(5)
#> [1] "The canoe birch slid on the smooth planks." 
#> [2] "Glue sheet the to the dark blue background."
#> [3] "It's to easy tell the depth of a well."     
#> [4] "These a days chicken leg is a rare dish."   
#> [5] "Rice often is served in round bowls."

Splitting

str_split() splits a string into pieces. For example, we can split sentences into words:

sentences %>% 
    head(5) %>% 
    str_split(" ")
#> [[1]]
#> [1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
#> [8] "planks."
#> 
#> [[2]]
#> [1] "Glue"        "the"         "sheet"       "to"          "the"        
#> [6] "dark"        "blue"        "background."
#> 
#> [[3]]
#> [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."
#> 
#> [[4]]
#> [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
#> [8] "rare"    "dish."  
#> 
#> [[5]]
#> [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."

Because each sentence may produce a different number of words, the function returns a list. To get a matrix instead, set simplify = TRUE.

sentences %>% 
    head(5) %>% 
    str_split(" ", simplify = TRUE)
#>      [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]         
#> [1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks."    
#> [2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background."
#> [3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"          
#> [4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"       
#> [5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls." ""           
#>      [,9]   
#> [1,] ""     
#> [2,] ""     
#> [3,] "well."
#> [4,] "dish."
#> [5,] ""

We can also set the maximum number of pieces:

fields = c("Name:诗翔", "Country:CN", "Age:24")
fields %>% str_split(":", n = 2, simplify = TRUE)
#>      [,1]      [,2]  
#> [1,] "Name"    "诗翔"
#> [2,] "Country" "CN"  
#> [3,] "Age"     "24"
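The effect of n is clearest when the value itself contains the separator. In this sketch (the stamp string is a made-up field, not part of the original data) splitting stops after the first separator, and the remainder is kept intact in the last piece.

```r
library(stringr)

stamp = "Time:12:30:00"   # hypothetical field whose value contains ":"

all_pieces = str_split(stamp, ":")[[1]]          # split at every ":"
two_pieces = str_split(stamp, ":", n = 2)[[1]]   # at most two pieces

print(all_pieces)
print(two_pieces)
```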

Instead of a pattern, we can also split strings at character, line, sentence, and word boundaries with boundary():

x = "This is a sentence. This is another sentence."
str_view_all(x, boundary("word"))
str_split(x, " ")[[1]]
#> [1] "This"      "is"        "a"         "sentence." "This"      "is"       
#> [7] "another"   "sentence."
str_split(x, boundary("word"))[[1]]
#> [1] "This"     "is"       "a"        "sentence" "This"     "is"       "another" 
#> [8] "sentence"

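Locating matches

str_locate() and str_locate_all() give the start and end positions of each match. They are especially useful when no other function does exactly what is needed: use str_locate() to find the matching pattern, then str_sub() to extract or modify the matched text.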
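The workflow described above might look like the following sketch. The variable names are illustrative; the calls are the standard str_locate(), str_locate_all(), and str_sub().

```r
library(stringr)

x = c("apple", "banana", "pear")

# One row per string with columns "start" and "end"; NA where nothing matches
loc = str_locate(x, "an")
print(loc)

# Extract the located text; strings without a match yield NA
found = str_sub(x, loc[, "start"], loc[, "end"])

# Modify the located span in a single string
y = "banana"
first = str_locate(y, "an")
str_sub(y, first[1, "start"], first[1, "end"]) = "AN"

# All match positions in one string, as a matrix inside a list
all_locs = str_locate_all("banana", "an")[[1]]

print(found)
print(y)
print(all_locs)
```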

Other types of pattern

When a plain string is used as a pattern, stringr automatically wraps it in a call to regex():

# The regular call
str_view(fruit, "nana")
# is shorthand for
str_view(fruit, regex("nana"))

We can therefore control the details of the match through the other arguments of regex().
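Two of those arguments, ignore_case and multiline, are sketched below with made-up input. They are shown only as examples of the idea, not as a complete list of options.

```r
library(stringr)

bananas = c("banana", "Banana", "BANANA")

# The default match is case-sensitive
case_sensitive = str_detect(bananas, "banana")

# ignore_case = TRUE matches regardless of letter case
case_insensitive = str_detect(bananas, regex("banana", ignore_case = TRUE))

txt = "Line 1\nLine 2\nLine 3"

# By default ^ anchors only at the start of the whole string
single = str_extract_all(txt, "^Line")[[1]]

# multiline = TRUE lets ^ and $ anchor at the start and end of each line
multi = str_extract_all(txt, regex("^Line", multiline = TRUE))[[1]]

print(case_sensitive)
print(case_insensitive)
print(length(single))
print(length(multi))
```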

Besides regex(), three other functions can be used to describe a pattern: fixed(), coll(), and boundary().

microbenchmark::microbenchmark(
    fixed = str_detect(sentences, fixed("the")),
    regex = str_detect(sentences, "the"),
    times = 20
)
#> Unit: microseconds
#>   expr   min    lq mean median  uq  max neval
#>  fixed  90.2  94.7  177   98.3 102 1605    20
#>  regex 318.0 334.5  353  347.4 359  434    20
stringi::stri_locale_info()
#> $Language
#> [1] "en"
#> 
#> $Country
#> [1] "US"
#> 
#> $Variant
#> [1] ""
#> 
#> $Name
#> [1] "en_US"
x = "This is  a sentence"
str_view_all(x, boundary("word"))
str_extract_all(x, boundary("word"))
#> [[1]]
#> [1] "This"     "is"       "a"        "sentence"

Other uses of regular expressions

Two commonly used functions in base R also accept regular expressions, apropos() and dir():

apropos("replace")
#> [1] "%+replace%"       "replace"          "replace_na"       "setReplaceMethod"
#> [5] "str_replace"      "str_replace_all"  "str_replace_na"   "theme_replace"
head(dir(pattern = "\\.Rmd$"))
#> [1] "2017-10-09-microArray-data-analysis.Rmd"
#> [2] "2017-10-27-RNAseq-data-analysis.Rmd"    
#> [3] "2018-08-10-r-installation.Rmd"          
#> [4] "2019-06-21-baseplot-addplot.Rmd"        
#> [5] "2019-06-21-baseplot-multiplots.Rmd"     
#> [6] "2019-07-07-use-rstatix.Rmd"

stringi

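stringr is built on top of stringi. stringr is relatively easy to learn (the book calls it very easy; I personally disagree) because it exposes only a small, carefully chosen set of functions that cover most common string operations. stringi, in contrast, is designed to be comprehensive: it contains almost every function we might ever need, 234 in total.

Moving from stringr to stringi is fairly easy, because the corresponding functions mostly differ by the prefix, which changes from str_ to stri_.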
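A few correspondences, as a rough sketch. The pairs below were chosen for illustration, and the mapping is not always a pure prefix swap: stringi often adds a suffix such as _regex to name the matching engine.

```r
library(stringr)
library(stringi)

x = c("apple", "banana", "pear")

# Same name apart from the prefix
same_length = all(str_length(x) == stri_length(x))

# stringi names the matching engine explicitly
same_detect = all(str_detect(x, "an") == stri_detect_regex(x, "an"))

# Case conversion lives under the stri_trans_ family
same_upper = all(str_to_upper(x) == stri_trans_toupper(x))

print(c(same_length, same_detect, same_upper))
```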

Session info

sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] forcats_0.5.0   stringr_1.4.0   dplyr_1.0.0     purrr_0.3.4    
#> [5] readr_1.3.1     tidyr_1.1.0     tibble_3.0.3    ggplot2_3.3.2  
#> [9] tidyverse_1.3.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.1.0     xfun_0.15            haven_2.3.1         
#>  [4] colorspace_1.4-1     vctrs_0.3.2          generics_0.0.2      
#>  [7] htmltools_0.5.0      yaml_2.2.1           utf8_1.1.4          
#> [10] blob_1.2.1           rlang_0.4.7          pillar_1.4.6        
#> [13] glue_1.4.1           withr_2.2.0          DBI_1.1.0           
#> [16] dbplyr_1.4.4         modelr_0.1.8         readxl_1.3.1        
#> [19] lifecycle_0.2.0      munsell_0.5.0        blogdown_0.20       
#> [22] gtable_0.3.0         cellranger_1.1.0     rvest_0.3.6         
#> [25] htmlwidgets_1.5.1    evaluate_0.14        knitr_1.29          
#> [28] fansi_0.4.1          broom_0.7.0          Rcpp_1.0.5          
#> [31] scales_1.1.1         backports_1.1.8      jsonlite_1.7.0      
#> [34] fs_1.4.2             microbenchmark_1.4-7 hms_0.5.3           
#> [37] digest_0.6.25        stringi_1.4.6        bookdown_0.20       
#> [40] grid_4.0.2           cli_2.0.2            tools_4.0.2         
#> [43] magrittr_1.5         crayon_1.3.4         pkgconfig_2.0.3     
#> [46] ellipsis_0.3.1       xml2_1.3.2           reprex_0.3.0        
#> [49] lubridate_1.7.9      assertthat_0.2.1     rmarkdown_2.3       
#> [52] httr_1.4.2           rstudioapi_0.11      R6_2.4.1            
#> [55] compiler_4.0.2