Problem
I need to find all duplicate object records in a database so that information from the two records (including linked tables) can be merged into a single comprehensive record.
Information about museum objects is tracked by object numbers (Simmons and Kiser 2020; NPS Information Management Team (IMT) 2019; Chenhall 1975). Museum accession numbers are generally based on a binomial system following the formula year.accession, while object numbers are based on a trinomial system following the formula year.acquisition.object (Simmons and Kiser 2020, pt. 4). This system encodes key pieces of information about accessions and objects. Each object number should be completely unique, and the museum database should contain no duplicate records.
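To make the trinomial encoding concrete, an object number can be decomposed into its three parts (a minimal sketch; the example number is illustrative):

```r
# Decompose a trinomial object number: year.acquisition.object
parts <- strsplit("1991.45.03", ".", fixed = TRUE)[[1]]
year        <- parts[1]  # "1991" - year of the acquisition
acquisition <- parts[2]  # "45"   - acquisition within that year
object      <- parts[3]  # "03"   - object within that acquisition
```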
Unfortunately, over time different individuals registering objects in PastPerfect inconsistently entered year values in both the accession and object id fields. In earlier cases, year was entered as a two-digit value (i.e., ##); now year is entered as a four-digit value (i.e., ####). Additionally, several individuals recently added new records to the database without first checking whether existing records had been logged with two-digit year encoding. As a result, there are now several duplicate records in the database: early records have two-digit year encoding and the subsequent duplicates have four digits. Each record contains independent linked information like location and conservation history, so it is valuable to merge this information into a single comprehensive canonical record. To get there, one must first identify the duplicate records (Figure 1).
So that this matter does not arise again, identify all records with two-digit year encoding and modify their accessions and accession numbers in PastPerfect.
Methods
The following reads the data, collapses four-digit year encoding into a new column, establishes counts by this new collapsed number, identifies ids with multiple entries, isolates these in a new table, and joins them to the complete record. That product gets reported as a reactive DT table.
Read and Clean Data
This probably loads an unnecessary number of libraries, but deploys verbs I'm familiar with. For example, I believe base tapply could substitute for the dplyr group_by and summarize functions.
Data were first exported from PastPerfect. This export file is read into R. From there, field types are assigned and names are cleaned consistent with R conventions. A separate script is called to do some standard wrangling.
# Read data
library(readxl)   # read_excel()
library(magrittr) # %>% pipe
df <- read_excel(here::here("posts/2023-09-28-duplicate-detection/data/inventory_2023_02_23.xlsx")) %>%
  janitor::clean_names()
# Clean columns
source(here::here("posts/2023-09-28-duplicate-detection/scripts/past_perfect_clean_fields.R"))
Create New Object Number w. Truncated Year
To compare two digit and four digit year encoding, four digit year encoding were stripped to two digits. Doing this relies on a regular expression or regex. To build the proper expression, I had to fiddle a bit. I worked my way to it by making a minimum reproducible example that isolated the problem (see Section 4).
# Add column composed of two-digit objectid
df$objectid2 <- str_remove_all(df$objectid, "^19|^20")
Summarize New Object Number w. Truncated Year
With two digit year encoding in place, group by the new object id field and summarize based on that grouping.
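As a sketch, that grouping step might look like the following (assuming dplyr is loaded, df carries the objectid2 column created above, and df_objectid2 is the name the filtering step expects):

```r
library(dplyr)

# Count how many records share each truncated object number
df_objectid2 <- df %>%
  group_by(objectid2) %>%
  summarize(n = n())
```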
Isolate Duplicates
Filter only those object numbers whose counts are greater than 1. This results in a list of duplicate records.
# Isolate duplicates
df_dupes <- df_objectid2 |>
  filter(n > 1)
Join Duplicate Table with Complete Record
Join the duplicate record list to the full set of object attributes.
# Join table of duplicates with full record info
df_joined <- left_join(df_dupes, df, by = "objectid2", multiple = "all")
Results
There are 44 duplicate records in the database.
Plot Reactive Table Using JavaScript
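A minimal sketch of rendering the joined duplicates as a reactive table with DT (the rownames and filter options here are illustrative choices, not necessarily the post's exact configuration):

```r
library(DT)

# Render the joined duplicates as a searchable, sortable HTML table
tbl <- datatable(df_joined, rownames = FALSE, filter = "top")
tbl
```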
2023-03-01: flagdate is not reporting properly. Need to track that down.
Reproducible Example
The following is a reproducible example illustrating the key pieces of the procedure above. Making a small example that isolates the problem is a really good way to find a solution.
Code
x <- c("1991.45.03",
       "91.45.03",
       "2002.34.55",
       "2002.34.75")
Return only numbers that begin with 19, using stringr verbs.
Code
str_detect(x, "^19")
[1] TRUE FALSE FALSE FALSE
Code
str_starts(x, "19")
[1] TRUE FALSE FALSE FALSE
Use stringr::str_remove_all with the regex to strip four-digit year encoding to two digits.
Code
# Removes all when objectid starts with 19 or 20
str_remove_all(x, "^19|^20")
[1] "91.45.03" "91.45.03" "02.34.55" "02.34.75"
References
Citation
@online{craig2023,
author = {Craig, Nathan},
title = {Duplicate {Detection}},
date = {2023-09-28},
url = {https://ncraig.netlify.app/posts/2023-09-28-duplicate-detection/index.html},
langid = {en}
}