How to Identify Repeating Data Entries when the Repeated Entries are Spelled or Constructed Differently

Question

I have a dataset of entries with a variable identifying the owner of each entry. Some owners occur more than once, but their names are sometimes written differently. Eventually I want to aggregate the other data by owner. These are the names of business owners, so sometimes it's a single name, sometimes it's more than one name, and sometimes it's just the company name. Here are some of the name styles in the data:

DOE JOHN
DOE JOHN J
DOE, JOHN
DOE, JOHN + JANE
DOE JOHN + JANE
JOHN DOE J ETAL % JOHN J DOE
COMPANY CO

I've never done anything like this before. How could I go about identifying some of the same people? Is there a way to create an index to identify the similarity between these groups? Most of the ones I've seen are for longer text. Is there an index well suited for this?
I apologize if this is too basic a question. I'm new to doing things like this and I'm not sure if I know exactly what to search for. I'm most comfortable with Stata and R but I've used Python before and I could eventually figure out how to do something with that.

arne · Accepted Answer

For R: have a look at the stringr package. For example, I would use the str_detect() function as follows: str_detect(column_of_different_names, "DOE|company_name"). The pattern is treated as a regular expression, so this returns TRUE for each string that contains "DOE" or the text given in place of company_name (for instance "DOE|COMPANY CO"), and FALSE otherwise.
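For anyone who would rather try this outside R, the same idea can be sketched in Python with only the standard library's re module. This is a minimal illustration, not a full solution: the names list copies the examples from the question, while the "DOEHRING MARY" entry and the \b word boundaries are assumptions added here to show how an unrelated surname can be kept from matching.

```python
import re

# Owner strings copied from the question, plus one hypothetical entry
# ("DOEHRING MARY") added only to show why word boundaries can matter.
names = [
    "DOE JOHN",
    "DOE JOHN J",
    "DOE, JOHN",
    "DOE, JOHN + JANE",
    "DOE JOHN + JANE",
    "JOHN DOE J ETAL % JOHN J DOE",
    "COMPANY CO",
    "DOEHRING MARY",  # hypothetical, not in the original data
]

# \b marks a word boundary, so "DOE" must appear as a whole word.
# The boundaries are an assumption; the R call in the answer uses plain "DOE".
doe_pattern = re.compile(r"\bDOE\b")

# Alternation (|) matches either the surname or the company string,
# mirroring the "DOE|company_name" pattern from the answer.
either_pattern = re.compile(r"\bDOE\b|\bCOMPANY CO\b")

# One boolean per name, analogous to the logical vector str_detect() returns.
doe_flags = [bool(doe_pattern.search(name)) for name in names]
either_flags = [bool(either_pattern.search(name)) for name in names]

for name, is_doe, is_either in zip(names, doe_flags, either_flags):
    print(f"{name!r}: DOE={is_doe}, DOE or COMPANY CO={is_either}")
```

Keep in mind that this only flags rows containing a surname or company string you already know to look for. It does not by itself decide that two rows refer to the same owner.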

Correct answer by arne on August 14, 2021
