TransWikia.com

How to Identify Repeating Data Entries when the Repeated Entries are Spelled or Constructed Differently

Data Science Asked on August 14, 2021

I have a dataset of entries with a variable for the owner of each entry. Some owners occur more than once, but their names are sometimes written differently. Eventually I want to aggregate the other data by owner. These are the names of business owners, so sometimes it’s a single name, sometimes it’s more than one name, and sometimes it’s just the company name. Here’s an example of some of the styles of names in the data:

  • DOE JOHN
  • DOE JOHN J
  • DOE, JOHN
  • DOE, JOHN + JANE
  • DOE JOHN + JANE
  • JOHN DOE J ETAL % JOHN J DOE
  • COMPANY CO

I’ve never done anything like this before. How could I go about identifying some of the same people? Is there a way to create an index to identify the similarity between these groups? Most of the ones I’ve seen are for longer text. Is there an index well suited for this?

I apologize if this is too basic a question. I’m new to doing things like this and I’m not sure if I know exactly what to search for. I’m most comfortable with Stata and R but I’ve used Python before and I could eventually figure out how to do something with that.

One Answer

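For R: have a look at the stringr package. For example, I would use the str_detect() function as follows: str_detect(column_of_different_names, "DOE|COMPANY CO"). The second argument is a regular expression, so this returns TRUE for each string that contains "DOE" or "COMPANY CO"; substitute the surname and company name you are looking for. Note that it flags every entry containing that text, so unrelated owners who share the surname will match as well.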
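Pattern matching flags entries but does not score how similar two names are, which is what the question asks about. Since the question mentions Python as an option, here is a minimal sketch using only the standard library. It is not part of the answer above. The helper names name_key and similarity are made up for illustration, and the NOISE set is an assumption based on the sample names. The idea is to normalize each name (uppercase, drop punctuation, drop filler tokens, de-duplicate and sort the words) and then compare the normalized strings with difflib.SequenceMatcher.

```python
import re
from difflib import SequenceMatcher

# Filler tokens that carry no identity information.
# Assumption: chosen from the sample names in the question; extend as needed.
NOISE = {"ETAL"}


def name_key(raw):
    """Return a normalized key: uppercase words, no punctuation,
    no filler tokens, duplicates removed, words sorted alphabetically."""
    tokens = re.findall(r"[A-Z]+", raw.upper())
    return " ".join(sorted(set(tokens) - NOISE))


def similarity(a, b):
    """Similarity ratio between 0 and 1, computed on the normalized keys."""
    return SequenceMatcher(None, name_key(a), name_key(b)).ratio()


names = [
    "DOE JOHN",
    "DOE JOHN J",
    "DOE, JOHN",
    "DOE, JOHN + JANE",
    "DOE JOHN + JANE",
    "JOHN DOE J ETAL % JOHN J DOE",
    "COMPANY CO",
]

# Print the score of every name against the first one.
for other in names[1:]:
    print(f"{names[0]!r} vs {other!r}: {similarity(names[0], other):.2f}")
```

Entries with identical keys can be grouped directly. For the rest, a cutoff on the ratio has to be chosen by inspecting the data, and entries listing two people (such as "DOE, JOHN + JANE") still need a decision on whether they count as the same owner.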

Correct answer by arne on August 14, 2021
