TransWikia.com

String similarity algorithms for string containment (rather than string equality)

Data Science Asked by Gilad Deutsch on July 17, 2021

I saw that there are a lot of string similarity algorithms to whether two strings are the same. I have a slightly different problem – I get two strings "a" and "b" and I need a similarity algorithm to whether "b" contains "a" ("b" is likely to contain "a" with some "error"). I haven’t found any algorithms to solve this problem and I’d be happy to hear about some (if they exist).

One Answer

My first thought was that it's difficult to formally define this concept:

"b" is likely to contain "a" with some "error"

  • On the one hand, there is the idea that a is a substring of b. This question should have a boolean answer: either it does contain it, or it doesn't.
  • On the other hand there is the idea of approximate matching: b should contain a substring c which is "similar enough" to a. In general the question of similarity between two strings is answered with a numerical value, usually a real number between 0 and 1.

As far as I know the only way to solve this issue is to consider that there is a threshold on the similarity score between a and c, where c is any substring of b. This way the answer becomes boolean.

However there might be a way around this by considering the substring operation as part of the similarity calculation. In particular the Levenshtein edit distance can account for insertions/deletions of characters, which is what a substring is w.r.t the string which contains it.

More interestingly, it is possible to assign a different cost to any particular edit operation in the Levenshtein distance. So it's probably possible to define a variant of Levenshtein where insertions at the beginning or at the end would have cost 0, thus making the final distance $x$ between a and b equivalent to "b contains a substring c which has distance $x$ against a".

The way I would try to implement this is:

  1. Calculate the regular edit distance, keeping the matrix used for the calculation
  2. From the matrix, count how many insertions were done but only at the beginning and at the end, then substract this value from the distance.

Note that there might be flaws in my idea, I didn't try it :D

Answered by Erwan on July 17, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP