stringr: Working with Strings
王诗翔 · 2018-09-17
Categories: r
Tags: r, stringr, string, text-processing
Load the packages:
library(tidyverse)
#> ── Attaching packages ──────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
#> ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
#> ✓ tibble 3.0.3 ✓ dplyr 1.0.0
#> ✓ tidyr 1.1.0 ✓ stringr 1.4.0
#> ✓ readr 1.3.1 ✓ forcats 0.5.0
#> ── Conflicts ─────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
library(stringr)
x = c("\"", "\\")
Print the raw contents of the strings, with the escapes resolved:
writeLines(x)
#> "
#> \
String length
str_length(c("a", "R for data science", NA))
#> [1] 1 18 NA
Combining strings
Combine two or more strings:
str_c("x", "y")
#> [1] "xy"
str_c("x", "y", "z")
#> [1] "xyz"
Control the separator with sep:
str_c("x", "y", sep = ",")
#> [1] "x,y"
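Missing values are contagious: combining anything with NA yields NA. To print NA as the literal string "NA" instead, wrap the vector in str_replace_na():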
x = c("abc", NA)
str_c("|-", x, "-|")
#> [1] "|-abc-|" NA
str_c("|-", str_replace_na(x), "-|")
#> [1] "|-abc-|" "|-NA-|"
str_c() is vectorized and recycles shorter vectors to the length of the longest:
str_c("prefix-", c("a", "b", "c"), "-suffix")
#> [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"
Collapse a character vector into a single string with collapse:
str_c(c("x", "y", "z"), collapse = ", ")
#> [1] "x, y, z"
Subsetting strings
x = c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
#> [1] "App" "Ban" "Pea"
Negative positions count backwards from the end of the string:
str_sub(x, -3, -1)
#> [1] "ple" "ana" "ear"
Note that str_sub() does not fail when the string is too short; it returns as many characters as it can:
str_sub("a", 1, 5)
#> [1] "a"
str_sub() can also be used on the left-hand side of an assignment to modify strings:
str_sub(x, 1, 1) = str_to_lower(str_sub(x, 1, 1))
x
#> [1] "apple" "banana" "pear"
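Locales
The rules for handling strings differ between languages and regions, so functions such as str_to_upper() accept a locale argument.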
str_to_upper(c("i", "l"))
#> [1] "I" "L"
str_to_upper(c("i", "l"), locale = "tr")
#> [1] "İ" "L"
Sorting depends on the locale as well:
x = c("apple", "eggplant", "banana")
str_sort(x, locale = "en")
#> [1] "apple" "banana" "eggplant"
str_sort(x, locale = "haw")
#> [1] "apple" "eggplant" "banana"
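Using regular expressions
str_view() and str_view_all() are handy for learning regular expressions. Each takes a character vector and a regular expression and shows how they match.
Basic matches
The simplest patterns match an exact string: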
x = c("apple", "banana", "pear")
str_view(x, "an")
The next step up in complexity is ., which matches any character except a newline:
str_view(x, ".a.")
Anchors
- ^ matches at the start of the string
- $ matches at the end of the string
str_view(x, "^a")
str_view(x, "a$")
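Using both anchors together forces the pattern to match the entire string. The following is a minimal sketch, not part of the original post; the input vector is made up for illustration, and str_detect() is used instead of str_view() so the result can be inspected as a logical vector.

```r
library(stringr)

# Hypothetical input, chosen only to illustrate anchoring
dishes = c("apple pie", "apple", "apple cake")

# Without anchors, "apple" may match anywhere in the string
unanchored = str_detect(dishes, "apple")

# With ^ and $, the whole string must be exactly "apple"
anchored = str_detect(dishes, "^apple$")

print(unanchored)
print(anchored)
```

Only the second element survives the anchored pattern, because the other two contain extra text after "apple".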
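Character classes and alternatives
Besides ., there are four other common character classes:
- \d matches any digit
- \s matches any whitespace character
- [abc] matches a, b, or c
- [^abc] matches any character except a, b, or c
Because \ must itself be escaped inside an R string, you write "\\s" in R to match whitespace, and likewise for the other classes.
| separates alternative patterns: abc|xyz matches abc or xyz. This operator has very low precedence, so use parentheses to make the intended grouping explicit.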
str_view(c("grey", "gray"), "gr(e|a)y")
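The double backslash is easy to get wrong, so here is a small sketch of the character classes in use. It is an illustration rather than code from the original post, and the sample strings are invented.

```r
library(stringr)

# Invented sample strings; the last one contains a tab character
samples = c("room 101", "abc", "tab\there")

# "\\d" in R source is the regular expression \d (any digit)
has_digit = str_detect(samples, "\\d")
first_number = str_extract(samples, "\\d+")

# "\\s" matches any whitespace, including spaces and tabs
has_space = str_detect(samples, "\\s")

# A negated class: the first character that is not a, b, or c
first_other = str_extract("cabbage", "[^abc]")

# writeLines() shows the pattern as the regex engine sees it
writeLines("\\d")
```

Here first_number is NA for the strings without digits, and first_other is "g" because every earlier letter of "cabbage" is a, b, or c.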
Repetition
Quantifiers control how many times a pattern may match:
- ?: 0 or 1 time
- +: 1 or more times
- *: 0 or more times
x = "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
str_view(x ,"CC+")
The number of matches can also be specified exactly:
- {n}: exactly n times
- {n,}: n or more times
- {,m}: at most m times
- {n,m}: between n and m times
str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")
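By default, matching is greedy: the regular expression matches the longest string possible. Appending ? to a quantifier makes it lazy, so that it matches the shortest string possible.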
str_view(x, "C{2,3}?")
str_view(x, "C[LX]+?")
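Because str_view() renders its result in a viewer, the difference between greedy and lazy matching is easier to check with str_extract(), which returns the matched text. The sketch below is an addition to the original post and uses only the Roman numeral part of the example string.

```r
library(stringr)

# The Roman numeral from the example above, on its own
numeral = "MDCCCLXXXVIII"

# Greedy: take as many C's as the quantifier allows
greedy = str_extract(numeral, "C{2,3}")
# Lazy: stop as soon as the minimum is satisfied
lazy = str_extract(numeral, "C{2,3}?")

# The same contrast with a character class after the C
greedy_class = str_extract(numeral, "C[LX]+")
lazy_class = str_extract(numeral, "C[LX]+?")

print(c(greedy, lazy, greedy_class, lazy_class))
```

The greedy patterns return "CCC" and "CLXXX", while their lazy counterparts return "CC" and "CL".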
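Grouping and backreferences
Parentheses not only disambiguate complex expressions, they also define groups, which can be referred to with backreferences such as \1, \2, and so on. The pattern below finds the fruits that contain a repeated pair of letters: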
str_view(fruit, "(..)\\1", match = TRUE)
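To see exactly which text a backreference matched, str_subset() and str_extract() can stand in for str_view(). This is a small sketch with a hand-picked vector instead of the full fruit dataset; it is not part of the original post.

```r
library(stringr)

# Hand-picked examples rather than the full fruit vector
fruits_demo = c("banana", "coconut", "apple")

# (..) captures any two characters; \\1 requires the same two again
repeated_pair = str_subset(fruits_demo, "(..)\\1")
pair_text = str_extract(fruits_demo, "(..)\\1")

# A single repeated character needs a one-character group
double_letter = str_detect(fruits_demo, "(.)\\1")

print(repeated_pair)
print(pair_text)
print(double_letter)
```

"banana" matches through "anan" and "coconut" through "coco"; "apple" has a doubled letter but no repeated pair, so it only matches the one-character version.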
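Tools
stringr provides functions that let you:
- determine which strings match a pattern
- find the positions of matches
- extract the content of matches
- replace matches with new values
- split a string based on a match
Detecting matches
To find out whether each element of a character vector matches a pattern, use str_detect():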
x = c("apple", "banana", "pear")
str_detect(x, "e")
#> [1] TRUE FALSE TRUE
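In a numeric context FALSE becomes 0 and TRUE becomes 1, so applying sum() and mean() to the result gives the number and the proportion of matches, which is often very useful.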
sum(str_detect(words, "^t"))
#> [1] 65
mean(str_detect(words, "[aeiou]$"))
#> [1] 0.277
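When the logical condition is complex, combining several str_detect() calls with logical operators is often easier than writing a single regular expression.
For example, either of the following finds all words that contain no vowels: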
no_vowel_1 = !str_detect(words, "[aeiou]")
no_vowel_2 = str_detect(words, "^[^aeiou]+$")
identical(no_vowel_1, no_vowel_2)
#> [1] TRUE
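The two approaches give the same result, but the first is easier to understand.
A common use of str_detect() is to select the elements that match a pattern by logical subsetting. The wrapper str_subset() performs both steps in one call: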
words[str_detect(words, "x$")]
#> [1] "box" "sex" "six" "tax"
str_subset(words, "x$")
#> [1] "box" "sex" "six" "tax"
Strings are usually a column of a data frame, in which case str_detect() combines naturally with filter():
df = tibble(
word = words,
i = seq_along(words)
)
df %>%
  filter(str_detect(word, "x$"))
#> # A tibble: 4 x 2
#> word i
#> <chr> <int>
#> 1 box 108
#> 2 sex 747
#> 3 six 772
#> 4 tax 841
str_count(), a variation on str_detect(), returns the number of matches in each string:
str_count(x, "a")
#> [1] 1 3 1
str_count() works naturally with mutate():
df %>%
mutate(
vowels = str_count(word, "[aeiou]"),
consonants = str_count(word, "[^aeiou]")
)
#> # A tibble: 980 x 4
#> word i vowels consonants
#> <chr> <int> <int> <int>
#> 1 a 1 1 0
#> 2 able 2 2 2
#> 3 about 3 3 2
#> 4 absolute 4 4 4
#> 5 accept 5 2 4
#> 6 account 6 3 4
#> 7 achieve 7 4 3
#> 8 across 8 2 4
#> 9 act 9 1 2
#> 10 active 10 3 3
#> # … with 970 more rows
Note that matches never overlap. For example, in abababa the pattern aba matches 2 times, not 3:
str_count("abababa", "aba")
#> [1] 2
str_view_all("abababa", "aba")
str_view_all() highlights every match, whereas str_view() shows only the first.
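If overlapping occurrences are what you actually need, one option is to compare every window of the right length against the target. The sketch below is an addition, not part of the original post; it treats the pattern as a literal string rather than a regular expression, and the variable names are illustrative.

```r
library(stringr)

s = "abababa"
target = "aba"

# Regular expression matches consume the text, so they cannot overlap
non_overlapping = str_count(s, target)

# Every possible start position of a window as long as the target
starts = seq_len(str_length(s) - str_length(target) + 1)

# str_sub() is vectorized over start and end, giving one window per start
windows = str_sub(s, starts, starts + str_length(target) - 1)

# Count the windows equal to the target, overlaps included
overlapping = sum(windows == target)

print(c(non_overlapping, overlapping))
```

The windows are "aba", "bab", "aba", "bab", "aba", so the overlapping count is 3 while str_count() reports 2.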
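Extracting matches
To extract the actual text of a match, use str_extract(). As a more complex example we use the Harvard sentences (described on Wikipedia), which stringr provides as sentences.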
length(sentences)
#> [1] 720
head(sentences)
#> [1] "The birch canoe slid on the smooth planks."
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."
#> [4] "These days a chicken leg is a rare dish."
#> [5] "Rice is often served in round bowls."
#> [6] "The juice of lemons makes fine punch."
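Suppose we want to find all sentences that contain a color. We first create a vector of color names and then turn it into a single regular expression: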
colors = c(
"red", "orange", "yellow", "green", "blue", "purple"
)
color_match = str_c(colors, collapse = "|")
color_match
#> [1] "red|orange|yellow|green|blue|purple"
Now we can select the sentences that contain a color and then extract the color:
has_color = str_subset(sentences, color_match)
matches = str_extract(has_color, color_match)
head(matches)
#> [1] "blue" "blue" "red" "red" "red" "blue"
Note that str_extract() extracts only the first match. This is easiest to see by selecting the sentences that have more than one match:
more = sentences[str_count(sentences, color_match) > 1]
str_view_all(more, color_match)
str_extract(more, color_match)
#> [1] "blue" "green" "orange"
This is a common pattern in stringr functions: working with a single match allows simpler data structures. To get all matches, use str_extract_all(), which returns a list:
str_extract_all(more, color_match)
#> [[1]]
#> [1] "blue" "red"
#>
#> [[2]]
#> [1] "green" "red"
#>
#> [[3]]
#> [1] "orange" "red"
With simplify = TRUE the result is a matrix, in which rows with fewer matches are padded with empty strings to the width of the longest:
str_extract_all(more, color_match, simplify = TRUE)
#> [,1] [,2]
#> [1,] "blue" "red"
#> [2,] "green" "red"
#> [3,] "orange" "red"
x = c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
#> [,1] [,2] [,3]
#> [1,] "a" "" ""
#> [2,] "a" "b" ""
#> [3,] "a" "b" "c"
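Grouped matches
Parentheses in a regular expression clarify precedence and also define groups, which can be used as backreferences while matching. They can likewise be used to extract the individual parts of a complex match.
For example, suppose we want to extract nouns from the sentences. As a heuristic, we look for any word that comes after a or the. Defining a "word" with a regular expression is a little tricky, so we use a simple approximation: a sequence of at least one character that is not a space.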
noun = "(a|the) ([^ ]+)"
has_noun = sentences %>%
str_subset(noun) %>%
head(10)
has_noun %>%
str_extract(noun)
#> [1] "the smooth" "the sheet" "the depth" "a chicken" "the parked"
#> [6] "the sun" "the huge" "the ball" "the woman" "a helps"
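str_extract() gives the complete match, while str_match() gives each individual group. Instead of a character vector, str_match() returns a matrix with one column for the complete match followed by one column for each group: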
has_noun %>%
str_match(noun)
#> [,1] [,2] [,3]
#> [1,] "the smooth" "the" "smooth"
#> [2,] "the sheet" "the" "sheet"
#> [3,] "the depth" "the" "depth"
#> [4,] "a chicken" "a" "chicken"
#> [5,] "the parked" "the" "parked"
#> [6,] "the sun" "the" "sun"
#> [7,] "the huge" "the" "huge"
#> [8,] "the ball" "the" "ball"
#> [9,] "the woman" "the" "woman"
#> [10,] "a helps" "a" "helps"
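This heuristic for detecting nouns does not work very well: it also picks up adjectives such as smooth and parked.
If the data are in a tibble, tidyr::extract() is often easier to use. It works like str_match(), but you give each group a name and the groups are placed in new columns: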
tibble(sentences = sentences) %>%
tidyr::extract(
sentences, c("article", "noun"), "(a|the) ([^ ]+)",
remove = FALSE
)
#> # A tibble: 720 x 3
#> sentences article noun
#> <chr> <chr> <chr>
#> 1 The birch canoe slid on the smooth planks. the smooth
#> 2 Glue the sheet to the dark blue background. the sheet
#> 3 It's easy to tell the depth of a well. the depth
#> 4 These days a chicken leg is a rare dish. a chicken
#> 5 Rice is often served in round bowls. <NA> <NA>
#> 6 The juice of lemons makes fine punch. <NA> <NA>
#> 7 The box was thrown beside the parked truck. the parked
#> 8 The hogs were fed chopped corn and garbage. <NA> <NA>
#> 9 Four hours of steady work faced us. <NA> <NA>
#> 10 Large size in stockings is hard to sell. <NA> <NA>
#> # … with 710 more rows
As with str_extract(), use str_match_all() if you want all the matches in each string.
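The difference between the two functions can be sketched on a single made-up sentence. The example is an addition to the original post; the noun pattern is the one defined earlier, repeated here so the block runs on its own.

```r
library(stringr)

# Same heuristic pattern as above: an article followed by one word
noun = "(a|the) ([^ ]+)"

# Made-up sentence containing two article-word pairs
s = "the cat chased a mouse"

# str_match(): one row per input string, first match only
first_match = str_match(s, noun)

# str_match_all(): a list with one matrix per input string,
# one row per match
all_matches = str_match_all(s, noun)[[1]]

print(first_match)
print(all_matches)
```

Column 1 holds the complete match and columns 2 and 3 hold the groups, so all_matches[, 3] collects every word that followed an article.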
Replacing matches
str_replace() and str_replace_all() replace matches with new strings. The simplest use is to replace a pattern with a fixed string:
x = c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-" "p--r" "b-n-n-"
By supplying a named vector to str_replace_all(), we can perform several replacements at once:
x = c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house" "two cars" "three people"
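Instead of a fixed string, the replacement can use backreferences to insert the matched groups. The following code swaps the second and third words of each sentence: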
sentences %>% head(5)
#> [1] "The birch canoe slid on the smooth planks."
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."
#> [4] "These days a chicken leg is a rare dish."
#> [5] "Rice is often served in round bowls."
sentences %>%
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
head(5)
#> [1] "The canoe birch slid on the smooth planks."
#> [2] "Glue sheet the to the dark blue background."
#> [3] "It's to easy tell the depth of a well."
#> [4] "These a days chicken leg is a rare dish."
#> [5] "Rice often is served in round bowls."
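The same technique handles simple reformatting tasks. As a hedged illustration that is not from the original post, the sketch below turns "last, first" names into "first last"; the names are placeholders.

```r
library(stringr)

# Placeholder names in "last, first" order
people = c("Doe, Jane", "Smith, John")

# Group 1 is the last name, group 2 the first name;
# the replacement writes them back in the opposite order
flipped = str_replace(people, "(\\w+), (\\w+)", "\\2 \\1")

print(flipped)
```

In the replacement string, "\\1" and "\\2" refer to the text captured by the first and second pair of parentheses.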
Splitting
str_split() splits a string into pieces. For example, we can split sentences into words:
sentences %>%
head(5) %>%
str_split(" ")
#> [[1]]
#> [1] "The" "birch" "canoe" "slid" "on" "the" "smooth"
#> [8] "planks."
#>
#> [[2]]
#> [1] "Glue" "the" "sheet" "to" "the"
#> [6] "dark" "blue" "background."
#>
#> [[3]]
#> [1] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well."
#>
#> [[4]]
#> [1] "These" "days" "a" "chicken" "leg" "is" "a"
#> [8] "rare" "dish."
#>
#> [[5]]
#> [1] "Rice" "is" "often" "served" "in" "round" "bowls."
Because each sentence may contain a different number of words, the result is a list. To get a matrix instead, set simplify = TRUE:
sentences %>%
head(5) %>%
str_split(" ", simplify = TRUE)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
#> [2,] "Glue" "the" "sheet" "to" "the" "dark" "blue" "background."
#> [3,] "It's" "easy" "to" "tell" "the" "depth" "of" "a"
#> [4,] "These" "days" "a" "chicken" "leg" "is" "a" "rare"
#> [5,] "Rice" "is" "often" "served" "in" "round" "bowls." ""
#> [,9]
#> [1,] ""
#> [2,] ""
#> [3,] "well."
#> [4,] "dish."
#> [5,] ""
We can also cap the number of pieces with n:
fields = c("Name:诗翔", "Country:CN", "Age:24")
fields %>% str_split(":", n = 2, simplify = TRUE)
#> [,1] [,2]
#> [1,] "Name" "诗翔"
#> [2,] "Country" "CN"
#> [3,] "Age" "24"
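The n argument matters most when the separator can also appear inside the value. The following sketch uses an invented field to show the difference; it is an illustration, not code from the original post.

```r
library(stringr)

# Invented field whose value itself contains the separator
stamp = "Time:12:30:00"

# Without n, every colon splits the string
all_pieces = str_split(stamp, ":")[[1]]

# With n = 2, splitting stops after the first colon,
# so the value stays in one piece
two_pieces = str_split(stamp, ":", n = 2)[[1]]

print(all_pieces)
print(two_pieces)
```

With n = 2 the result is the key "Time" and the intact value "12:30:00".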
Instead of a pattern, strings can also be split at character, line, sentence, or word boundaries by using boundary():
x = "This is a sentence. This is another sentence."
str_view_all(x, boundary("word"))
str_split(x, " ")[[1]]
#> [1] "This" "is" "a" "sentence." "This" "is"
#> [7] "another" "sentence."
str_split(x, boundary("word"))[[1]]
#> [1] "This" "is" "a" "sentence" "This" "is" "another"
#> [8] "sentence"
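Locating matches
str_locate() and str_locate_all() give the start and end positions of each match. They are most useful when no other function does exactly what you need: use str_locate() to find the matching pattern, then str_sub() to extract or modify the matched text.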
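The locate-then-subset workflow might look like the sketch below. It is an addition to the original post, with an invented input string and illustrative variable names.

```r
library(stringr)

# Invented input string
s = "banana split"

# First match: a one-row matrix with "start" and "end" columns
loc = str_locate(s, "an")
first_start = loc[1, "start"]
first_end = loc[1, "end"]

# Extract the matched text by position
matched = str_sub(s, first_start, first_end)

# Modify the matched text in place, working on a copy
modified = s
str_sub(modified, first_start, first_end) = "AN"

# Every match: a list with one matrix per input string
all_locs = str_locate_all(s, "an")[[1]]

print(matched)
print(modified)
print(all_locs)
```

For this input the first match spans positions 2 to 3, and str_locate_all() also reports the second occurrence at positions 4 to 5.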
Other types of pattern
When a pattern is given as a plain string, stringr automatically wraps it in a call to regex():
# The usual call
str_view(fruit, "nana")
# is shorthand for
str_view(fruit, regex("nana"))
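We can therefore use the other arguments of regex() to control the details of the match:
- ignore_case = TRUE lets characters match both their uppercase and lowercase forms.
- multiline = TRUE makes ^ and $ match the start and end of each line rather than the start and end of the whole string.
- comments = TRUE allows comments and whitespace inside a complex regular expression to make it easier to read. Whitespace in the pattern is then ignored, so match a literal space with "\\ ".
phone = regex("
  \\(?     # optional opening parenthesis
  (\\d{3}) # area code
  [) -]?   # optional closing parenthesis, space, or dash
  (\\d{3}) # another three digits
  [ -]?    # optional space or dash
  (\\d{3}) # three more digits
  ", comments = TRUE)
str_match("514-791-8141", phone)
#>      [,1]          [,2]  [,3]  [,4]
#> [1,] "514-791-814" "514" "791" "814"
- dotall = TRUE lets . match every character, including \n.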
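The effect of the other regex() flags can be checked on small inputs. The sketch below is an addition to the original post, and the input strings are invented for the purpose.

```r
library(stringr)

# ignore_case: "^b" also matches an uppercase B
case_insensitive = str_detect(
  c("Banana", "banana", "cherry"),
  regex("^b", ignore_case = TRUE)
)

# multiline: ^ matches at the start of every line, not just the string
text = "Line 1\nLine 2\nLine 3"
default_starts = str_extract_all(text, "^Line")[[1]]
multiline_starts = str_extract_all(text, regex("^Line", multiline = TRUE))[[1]]

# dotall: . is allowed to match the newline character
dot_default = str_detect("a\nb", "a.b")
dot_all = str_detect("a\nb", regex("a.b", dotall = TRUE))

print(case_insensitive)
print(length(default_starts))
print(length(multiline_starts))
print(c(dot_default, dot_all))
```

Without multiline = TRUE only the first "Line" is found; with it, all three are.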
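Besides regex(), there are three other functions that can be used in its place:
fixed() matches exactly the specified sequence of bytes. It ignores all special regular expression characters and operates at a very low level, so no escaping is needed and it can be much faster than a regular expression. In the simple benchmark below it is roughly three times faster: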
microbenchmark::microbenchmark(
fixed = str_detect(sentences, fixed("the")),
regex = str_detect(sentences, "the"),
times = 20
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> fixed 90.2 94.7 177 98.3 102 1605 20
#> regex 318.0 334.5 353 347.4 359 434 20
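Apart from speed, fixed() changes what the pattern means. A short sketch, not from the original post, makes the point with a literal dot:

```r
library(stringr)

s = "a.b.c"

# As a regular expression, . matches every character
count_regex = str_count(s, ".")

# As a fixed string, . matches only a literal dot
count_fixed = str_count(s, fixed("."))

# The regex equivalent requires escaping the dot
count_escaped = str_count(s, "\\.")

print(c(count_regex, count_fixed, count_escaped))
```

The regular expression counts all five characters, while fixed(".") and the escaped pattern both count the two dots.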
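coll() compares strings using standard collation rules. This is useful for case-insensitive matching, but it is comparatively slow. coll() takes a locale argument that determines which rules are used to compare characters, because these rules differ around the world. The following code shows the default locale: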
stringi::stri_locale_info()
#> $Language
#> [1] "en"
#>
#> $Country
#> [1] "US"
#>
#> $Variant
#> [1] ""
#>
#> $Name
#> [1] "en_US"
boundary() matches boundaries, and it can be used with the other string functions:
x = "This is a sentence"
str_view_all(x, boundary("word"))
str_extract_all(x, boundary("word"))
#> [[1]]
#> [1] "This" "is" "a" "sentence"
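Other uses of regular expressions
Two commonly used base R functions also accept regular expressions:
apropos() searches all objects available from the global environment, which is useful when you cannot quite remember the name of a function: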
apropos("replace")
#> [1] "%+replace%" "replace" "replace_na" "setReplaceMethod"
#> [5] "str_replace" "str_replace_all" "str_replace_na" "theme_replace"
dir() lists the files in a directory; its pattern argument takes a regular expression:
head(dir(pattern = "\\.Rmd$"))
#> [1] "2017-10-09-microArray-data-analysis.Rmd"
#> [2] "2017-10-27-RNAseq-data-analysis.Rmd"
#> [3] "2018-08-10-r-installation.Rmd"
#> [4] "2019-06-21-baseplot-addplot.Rmd"
#> [5] "2019-06-21-baseplot-multiplots.Rmd"
#> [6] "2019-07-07-use-rstatix.Rmd"
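stringi
stringr is built on top of stringi. stringr is comparatively easy to learn (the book says very easy; personally I do not agree) because it exposes only a small, carefully chosen set of functions that handle the most common string manipulation tasks. stringi, by contrast, is designed to be comprehensive: it contains almost every function we might ever need, 234 in total.
Moving from stringr to stringi is fairly straightforward, because the corresponding functions mostly differ only in their prefix: str_ becomes stri_.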
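A few side-by-side calls give a feel for the correspondence. This sketch is an addition to the original post; it also shows that the mapping is not always a plain prefix swap, since stringi names tend to spell out the operation.

```r
library(stringr)

x = c("apple", "banana", "pear")

# Pattern detection: str_detect() vs stri_detect_regex()
detect_stringr = str_detect(x, "an")
detect_stringi = stringi::stri_detect_regex(x, "an")

# String length: str_length() vs stri_length()
length_stringr = str_length(x)
length_stringi = stringi::stri_length(x)

# Upper case: str_to_upper() vs stri_trans_toupper()
upper_stringr = str_to_upper(x)
upper_stringi = stringi::stri_trans_toupper(x)

print(all(detect_stringr == detect_stringi))
print(all(length_stringr == length_stringi))
print(all(upper_stringr == upper_stringi))
```

Each pair returns the same values for this input.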
Session info
sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.0 purrr_0.3.4
#> [5] readr_1.3.1 tidyr_1.1.0 tibble_3.0.3 ggplot2_3.3.2
#> [9] tidyverse_1.3.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.1.0 xfun_0.15 haven_2.3.1
#> [4] colorspace_1.4-1 vctrs_0.3.2 generics_0.0.2
#> [7] htmltools_0.5.0 yaml_2.2.1 utf8_1.1.4
#> [10] blob_1.2.1 rlang_0.4.7 pillar_1.4.6
#> [13] glue_1.4.1 withr_2.2.0 DBI_1.1.0
#> [16] dbplyr_1.4.4 modelr_0.1.8 readxl_1.3.1
#> [19] lifecycle_0.2.0 munsell_0.5.0 blogdown_0.20
#> [22] gtable_0.3.0 cellranger_1.1.0 rvest_0.3.6
#> [25] htmlwidgets_1.5.1 evaluate_0.14 knitr_1.29
#> [28] fansi_0.4.1 broom_0.7.0 Rcpp_1.0.5
#> [31] scales_1.1.1 backports_1.1.8 jsonlite_1.7.0
#> [34] fs_1.4.2 microbenchmark_1.4-7 hms_0.5.3
#> [37] digest_0.6.25 stringi_1.4.6 bookdown_0.20
#> [40] grid_4.0.2 cli_2.0.2 tools_4.0.2
#> [43] magrittr_1.5 crayon_1.3.4 pkgconfig_2.0.3
#> [46] ellipsis_0.3.1 xml2_1.3.2 reprex_0.3.0
#> [49] lubridate_1.7.9 assertthat_0.2.1 rmarkdown_2.3
#> [52] httr_1.4.2 rstudioapi_0.11 R6_2.4.1
#> [55] compiler_4.0.2