Problem
I need to find all duplicate object records in a database so that information from the two records (including linked tables) can be merged into a single comprehensive record.
Information about museum objects is tracked by object numbers (Simmons and Kiser 2020; NPS Information Management Team (IMT) 2019; Chenhall 1975). Museum accession numbers are generally based on a binomial system following the formula year.accession, while object numbers are based on a trinomial system following the formula year.acquisition.object (Simmons and Kiser 2020, pt. 4). This system encodes key pieces of information about accessions and objects. Each object number should be completely unique, and the museum database should contain no duplicate records.
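To make the trinomial encoding concrete, an object number can be decomposed into its three parts (a minimal sketch; the example number is illustrative):

```r
# Decompose a trinomial object number: year.acquisition.object
parts <- strsplit("1991.45.03", ".", fixed = TRUE)[[1]]
year        <- parts[1]  # "1991" - year of the acquisition
acquisition <- parts[2]  # "45"   - acquisition within that year
object      <- parts[3]  # "03"   - object within that acquisition
```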
Unfortunately, over time different individuals registering objects in PastPerfect inconsistently entered year values in both the accession and object id fields. In earlier cases, year was entered as a two-digit value (i.e., ##); now year is entered as a four-digit value (i.e., ####). Additionally, several individuals recently added new records to the database without first checking whether existing records had been logged with two-digit year encoding. As a result, there are now several duplicate records in the database: early records have two-digit year encoding and the subsequent duplicates have four digits. Each record contains independent linked information like location and conservation history, so it is valuable to merge this information into a single comprehensive canonical record. To get there, one must first identify the duplicate records (Figure 1).
So that this matter does not arise again, identify all records with two-digit year encoding and modify their accessions and accession numbers in PastPerfect.
Methods
The following reads the data, collapses four-digit year encoding into a new column, establishes counts by this new collapsed number, identifies ids with multiple entries, isolates these in a new table, and joins them to the complete record. That product gets reported as a reactive DT table.
Read and Clean Data
This probably loads an unnecessary number of libraries, but deploys verbs I'm familiar with. For example, I believe base tapply could substitute for the dplyr group_by and summarize functions.
Data were first exported from PastPerfect. This export file is read into R. From there, field types are assigned and names are cleaned consistent with R conventions. A separate script is called to do some standard wrangling.
# Read data
library(readxl)   # read_excel()
library(magrittr) # %>% pipe
df <- read_excel(here::here("posts/2023-09-28-duplicate-detection/data/inventory_2023_02_23.xlsx")) %>%
  janitor::clean_names()
# Clean columns
source(here::here("posts/2023-09-28-duplicate-detection/scripts/past_perfect_clean_fields.R"))
Create New Object Number w. Truncated Year
To compare two digit and four digit year encoding, four digit year encoding were stripped to two digits. Doing this relies on a regular expression or regex. To build the proper expression, I had to fiddle a bit. I worked my way to it by making a minimum reproducible example that isolated the problem (see Section 4).
# Add column composed of two-digit objectid
df$objectid2 <- str_remove_all(df$objectid, "^19|^20")
Summarize New Object Number w. Truncated Year
With two digit year encoding in place, group by the new object id field and summarize based on that grouping.
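As a sketch, that grouping step might look like the following (assuming dplyr is loaded, df carries the objectid2 column created above, and df_objectid2 is the name the filtering step expects):

```r
library(dplyr)

# Count how many records share each truncated object number
df_objectid2 <- df %>%
  group_by(objectid2) %>%
  summarize(n = n())
```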
Isolate Duplicates
Filter only those object numbers whose counts are greater than 1. This results in a list of duplicate records.
# Isolate duplicates
df_dupes <- df_objectid2 |>
  filter(n > 1)
Join Duplicate Table with Complete Record
Join the duplicate record list to the full set of object attributes.
# Join table of duplicates with full record info
df_joined <- left_join(df_dupes, df, by = "objectid2", multiple = "all")
Results
There are 44 duplicate records in the database.
Plot Reactive Table Using JavaScript
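A minimal sketch of rendering the joined duplicates as a reactive table with DT (the rownames and filter options here are illustrative choices, not necessarily the post's exact configuration):

```r
library(DT)

# Render the joined duplicates as a searchable, sortable HTML table
tbl <- datatable(df_joined, rownames = FALSE, filter = "top")
tbl
```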
2023-03-01: flagdate is not reporting properly. Need to track that down.
Reproducible Example
The following is a reproducible example illustrating the key pieces of the procedure above. Making a small example that isolates the problem is a really good way to find a solution.
Code
x <- c("1991.45.03",
       "91.45.03",
       "2002.34.55",
       "2002.34.75")
Return only numbers that begin with 19, using stringr verbs.
Code
str_detect(x, "^19")
[1] TRUE FALSE FALSE FALSE
Code
str_starts(x, "19")
[1] TRUE FALSE FALSE FALSE
Use stringr::str_remove_all with the regex to strip four-digit year encoding to two digits.
Code
# Removes all when objectid starts with 19 or 20
str_remove_all(x, "^19|^20")
[1] "91.45.03" "91.45.03" "02.34.55" "02.34.75"
References
Citation
@online{craig2023,
author = {Craig, Nathan},
title = {Duplicate {Detection}},
date = {2023-09-28},
url = {https://ncraig.netlify.app/posts/2023-09-28-duplicate-detection/index.html},
langid = {en}
}