Generally, it is not a good idea to put multiple values into a single table cell. This is not tidy and it is cumbersome to deal with fields so structured. However, given the constraints of some data entry systems, one may need to encode compound values to describe the observation. Suppose an individual is recording an object made of three materials but the data entry form only offers a single field (Figure 1). We don’t want to log just one value; we would rather capture all three materials. When confronted with such situations, people often enter some kind of delimited value like apple/pear/orange
where /
is the separator for three values. What follows is a way of dealing with this situation in R.
# Construct data frame
df <- data.frame(
id = 1:5,
material = c(
"stone",
"stone/bone/sinew",
"ceramic/paint",
"metal/stone/bone/sinew/paint",
NA
),
location = "NM"
)
df
id | material | location |
---|---|---|
1 | stone | NM |
2 | stone/bone/sinew | NM |
3 | ceramic/paint | NM |
4 | metal/stone/bone/sinew/paint | NM |
5 | NA | NM |
Now we have a data frame where material contains entries representing more than one value. With stringi
R can parse that complex string into a list.
# Split the string into a list
df$material <- str_split(df$material, "/")
It is possible to access a given list item with unlist()
. This can be done for individual records, ranges of records, or the entire column.
# Get the list item from the second record
df[2,][,2]
[[1]]
[1] "stone" "bone" "sinew"
# Unpack the list associated with that record
unlist(df[2,][,2])
[1] "stone" "bone" "sinew"
# Unpack the entire list column as a single list
x <- unlist(df[,2])
x
[1] "stone" "stone" "bone" "sinew" "ceramic" "paint" "metal"
[8] "stone" "bone" "sinew" "paint" NA
Let’s turn the list of all values into its own data frame.
df_x <- tibble(x)
To see how many of each type of object we have, we can use tidyverse
verbs to group by and summarize (Table 2).
x | n |
---|---|
bone | 2 |
ceramic | 1 |
metal | 1 |
paint | 2 |
sinew | 2 |
stone | 3 |
NA | 1 |
Let’s search for a specific material and filter the table returning only records that meet the search criteria (Table 3).
# Test for the presence of a string
# returns a TRUE/FALSE vector
a <- stri_detect_regex(df$material, "stone")
# Get the numerical value of the record when TRUE
b <- which(a == TRUE)
# Call the rows based on the index
df[b,]
id | material | location | |
---|---|---|---|
1 | 1 | stone | NM |
2 | 2 | stone, bone , sinew | NM |
4 | 4 | metal, stone, bone , sinew, paint | NM |
Citation
@online{craig2022,
author = {Craig, Nathan},
title = {Dealing with Compound Fields},
date = {2022-10-13},
url = {https://ncraig.netlify.app/posts/2022-10-13-compound-fields/index.html},
langid = {en}
}