Often one wants to work with a data set in which the values in a column were not entered consistently. For example, say there is a column for material containing the values wood, Wood, and WooD. Effectively we have three separate factor levels here, but really they are just typos. We want them all to say wood because that allows more accurate grouping for purposes like summarizing or graphing. So how does one enforce consistency in strings (and by extension factors) in R? It turns out there are a few ways. Here we look at the base R functions sub() (replaces the first match in each element of a vector) and gsub() (replaces all matches in each element), as well as the stringr functions str_replace() (first match in each element) and str_replace_all() (all matches in each element).
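The difference between the first-match and all-matches functions is easiest to see on a small vector. The following is a minimal sketch in base R; the vector x is made up for illustration and does not come from the data used in this post.

```r
# Hypothetical example vector (an assumption, not from the post's data)
x <- c("banana", "a", "cab")

# sub() replaces only the first match within each element
first_only <- sub("a", "X", x)
first_only
# returns: "bXnana" "X" "cXb"

# gsub() replaces every match within each element
all_matches <- gsub("a", "X", x)
all_matches
# returns: "bXnXnX" "X" "cXb"
```

Both functions are vectorized over x, so every element is processed; they differ only in how many matches are replaced inside each element.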
Create some data
library(tidyverse)  # loads tibble, dplyr, and stringr, which this post uses

df <- tribble(
  ~col1, ~col2,
  "a", 1,
  "b", 2,
  "c", 3,
  "ab", 4,
  "ac", 5,
  "ag", 6
)
gsub() from base R
There are two key functions: sub(), which replaces the first match in each element, and gsub(), which replaces all matches. The basic syntax is gsub(pattern, replacement, x). There are other handy arguments like ignore.case and fixed. Some examples follow.
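As a sketch of how ignore.case addresses the wood/Wood/WooD problem from the introduction, the example below uses a made-up material vector (an assumption for illustration). It also shows fixed = TRUE, another base R argument to gsub(), which treats the pattern as a literal string rather than a regular expression.

```r
# Hypothetical vector with inconsistent capitalization
material <- c("wood", "Wood", "WooD", "steel")
nlevels(factor(material))  # 4 levels before cleaning

# ignore.case = TRUE matches any capitalization of "wood";
# the ^ and $ anchors restrict the match to whole values
cleaned <- gsub("^wood$", "wood", material, ignore.case = TRUE)
cleaned
# returns: "wood" "wood" "wood" "steel"
nlevels(factor(cleaned))  # 2 levels after cleaning

# fixed = TRUE matches "." literally instead of as "any character"
dashed <- gsub(".", "-", "a.b.c", fixed = TRUE)
dashed
# returns: "a-b-c"
```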
Change every instance of a in col1 to X.
gsub("a", "X", df$col1)
[1] "X" "b" "c" "Xb" "Xc" "Xg"
To apply this change to the column, assign the result back: df$col1 <- gsub("a", "X", df$col1)
Note that it is possible to use regular expressions in the first argument.
# ab OR ac
gsub("ab|ac", "D", df$col1)
[1] "a" "b" "c" "D" "D" "ag"
Starts with a
gsub("^a", "X", df$col1)
[1] "X" "b" "c" "Xb" "Xc" "Xg"
Ends with c or with g
gsub("c$|g$", "X", df$col1)
[1] "a" "b" "X" "ab" "aX" "aX"
stringr::str_replace_all()
Use stringr::str_replace_all() to perform find and replace. Use | as the or operator.
str_replace_all(df$col1, pattern = "ab|ac", replacement = "D")
[1] "a" "b" "c" "D" "D" "ag"
This can be assigned back to the column: df$col1 <- str_replace_all(df$col1, pattern = "ab|ac", replacement = "D")
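For comparison, here is a minimal sketch contrasting str_replace() with str_replace_all(), plus a case-insensitive match using stringr's regex() helper. The vectors x and material are made up for illustration and are not part of the post's data.

```r
library(stringr)

# Hypothetical example vector (an assumption, not from the post's data)
x <- c("banana", "a", "cab")

# str_replace() replaces only the first match in each element
first_only <- str_replace(x, pattern = "a", replacement = "X")
first_only
# returns: "bXnana" "X" "cXb"

# str_replace_all() replaces every match in each element
all_matches <- str_replace_all(x, pattern = "a", replacement = "X")
all_matches
# returns: "bXnXnX" "X" "cXb"

# regex(ignore_case = TRUE) is the stringr counterpart of ignore.case in gsub()
material <- c("wood", "Wood", "WooD")
cleaned <- str_replace_all(material, regex("^wood$", ignore_case = TRUE), "wood")
cleaned
# returns: "wood" "wood" "wood"
```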
Use stringr::str_replace_all() inside a dplyr::mutate() function to perform find and replace.
df %>%
  mutate(col1 = str_replace_all(col1, pattern = "ab|ac", replacement = "D"),
         col1 = str_replace_all(col1, pattern = "ag", replacement = "E"))
# A tibble: 6 × 2
col1 col2
<chr> <dbl>
1 a 1
2 b 2
3 c 3
4 D 4
5 D 5
6 E 6
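When several replacements are applied to the same column, str_replace_all() also accepts a named character vector in which the names are patterns and the values are replacements. The sketch below reproduces the result above under that approach; the replacements vector is a name introduced here for illustration, and col1 is rebuilt as a plain vector so the example is self-contained.

```r
library(stringr)

# Same values as col1 in the example data
col1 <- c("a", "b", "c", "ab", "ac", "ag")

# Names are patterns, values are replacements (hypothetical helper vector)
replacements <- c("ab|ac" = "D", "ag" = "E")

recoded <- str_replace_all(col1, replacements)
recoded
# returns: "a" "b" "c" "D" "D" "E"
```

The replacements are applied in order, so a later pattern can match text produced by an earlier one. Inside mutate(), the same call would be col1 = str_replace_all(col1, replacements).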
Regular expression ^ for starts with.
df %>%
  mutate(col1 = str_replace_all(col1, pattern = "^a", replacement = "E"))
# A tibble: 6 × 2
col1 col2
<chr> <dbl>
1 E 1
2 b 2
3 c 3
4 Eb 4
5 Ec 5
6 Eg 6
Regular expression $ for ends with.
df %>%
  mutate(col1 = str_replace_all(col1, pattern = "c$", replacement = "E"))
# A tibble: 6 × 2
col1 col2
<chr> <dbl>
1 a 1
2 b 2
3 E 3
4 ab 4
5 aE 5
6 ag 6
Combine ^ (starts with) and $ (ends with) to match only values that consist entirely of the pattern.
df %>%
  mutate(col1 = str_replace_all(col1, pattern = "^ac$", replacement = "E"))
# A tibble: 6 × 2
col1 col2
<chr> <dbl>
1 a 1
2 b 2
3 c 3
4 ab 4
5 E 5
6 ag 6
If there is a long list of strings that all need to conform to a single value, replace() is an option to consider.
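A minimal sketch of that idea, assuming the function meant is base R's replace(x, list, values) and using a made-up material vector: collect the unwanted spellings in one vector, then overwrite every element found in it.

```r
# Hypothetical data and list of variants (assumptions for illustration)
material <- c("wood", "Wood", "WooD", "timber", "steel")
variants <- c("Wood", "WooD", "timber")

# replace() overwrites the elements selected by the logical index
cleaned <- replace(material, material %in% variants, "wood")
cleaned
# returns: "wood" "wood" "wood" "wood" "steel"
```

Unlike gsub(), this matches whole values exactly rather than patterns, which keeps a long list of variants readable and avoids accidental partial matches.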
Citation
@online{craig2023,
author = {Craig, Nathan},
title = {Find and Replace in {R} Data Frames},
date = {2023-10-21},
url = {https://ncraig.netlify.app/posts/2023-10-21-find-and-replace/index.html},
langid = {en}
}