
Similarity between two words

Data Science Asked by gogasca on January 23, 2021

I’m looking for a Python library that helps me identify the similarity between two words or sentences.

I will be doing audio-to-text conversion, which will result in English dictionary or non-dictionary words (these could be person or company names). After that, I need to compare the result to a known word or words.

Example:

1) Audio-to-text result: "Thanks for calling America Expansion",
   which will be compared to "American Express".

Both sentences are similar but not identical.

It looks like I may need to look at how many characters they share. Any ideas would be great. I'm looking for functionality like the Google search "did you mean" feature.

5 Answers

The closest would be, like Jan has mentioned in his answer, the Levenshtein distance (also popularly called the edit distance).

In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

It is a very commonly used metric for identifying similar words. NLTK already has an implementation of the edit distance metric, which can be invoked in the following way:

import nltk
nltk.edit_distance("humpty", "dumpty")

The above code would return 1, as only one letter is different between the two words.
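The same call also works on the whole phrases from the question; note that it counts character-level edits, so this is a rough sketch rather than a word-level comparison:

import nltk

a = "Thanks for calling America Expansion"
b = "Thanks for calling American Express"

# Character-level edit distance between the two phrases;
# a smaller value means the strings are more alike.
print(nltk.edit_distance(a, b))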

Correct answer by Dawny33 on January 23, 2021

I think the function get_close_matches in the difflib module could be more suitable for such a requirement.

get_close_matches(word, possibilities, n=3, cutoff=0.6)

possibilities -> the list of candidate words to match against
n -> the maximum number of close matches to return
cutoff -> the similarity threshold in [0, 1]; candidates scoring below it are ignored


from difflib import get_close_matches

data = ["drain", "rain", "brain", "stackexchange"]
word = "rainnn"

# get_close_matches returns the matching words themselves (best first),
# not their indices, so the result can be used directly.
matches = get_close_matches(word, data, n=3, cutoff=0.7)
if matches:
    print(matches[0])  # -> "rain"

This piece of code prints the best match, which is the word "rain".

Answered by JP Chauhan on January 23, 2021

Apart from the very good responses here, you may try SequenceMatcher in the difflib Python library.

https://docs.python.org/2/library/difflib.html

import difflib

a = 'Thanks for calling America Expansion'
b = 'Thanks for calling American Express'

seq = difflib.SequenceMatcher(None,a,b)
d = seq.ratio()*100
print(d) 
### OUTPUT: 87.323943

Now consider the code below:

a = 'Thanks for calling American Expansion'
b = 'Thanks for calling American Express'

seq = difflib.SequenceMatcher(None,a,b)
d = seq.ratio()*100
print(d)
### OUTPUT: 88.88888

You can now compare the two d values to evaluate the similarity.

Answered by SVK on January 23, 2021

An old and well-known technique for comparison is the Soundex algorithm. The idea is to compare not the words themselves but approximations of how they are pronounced. To what extent this actually improves the quality of the results I don't know.
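To make the idea concrete, here is a minimal, self-contained sketch of the classic American Soundex rules (a rough illustration, not a production implementation; it assumes a single alphabetic word):

def soundex(word):
    # Map consonants to Soundex digits; vowels and h, w, y carry no code.
    codes = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
             "l": "4", "mn": "5", "r": "6"}

    def code(ch):
        for letters, digit in codes.items():
            if ch in letters:
                return digit
        return ""

    word = word.lower()
    digits = ""
    prev = code(word[0])
    for ch in word[1:]:
        d = code(ch)
        if d and d != prev:
            digits += d
        if ch not in "hw":  # h and w do not separate double letters
            prev = d
    # Keep the first letter plus three digits, zero-padded.
    return (word[0].upper() + digits + "000")[:4]

print(soundex("America"), soundex("American"))    # A562 A562 - identical codes
print(soundex("Expansion"), soundex("Express"))   # E215 E216 - a near miss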

However, it feels a bit strange to apply something like Soundex to the results of a speech-to-text recognition engine: first you throw away information about how the words are pronounced, then you try to add it back again. It would be better to combine these two phases.

Hence, I expect the state-of-the-art technology in this area to do exactly that, and to be some form of adaptive classification, e.g. based on neural networks. Google does return recent research on speech recognition with neural networks.

Answered by reinierpost on January 23, 2021

If your dictionary is not too big, a common approach is to take the Levenshtein distance, which basically counts how many changes you have to make to get from one word to another. Changes include substituting a character, removing a character, or adding a character. An example from Wikipedia:

lev(kitten, sitting) = 3

  • kitten -> sitten (substitute "s" for "k")
  • sitten -> sittin (substitute "i" for "e")
  • sittin -> sitting (append "g")

There are some Python implementations on Wikibooks.
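For reference, a minimal dynamic-programming sketch along the same lines (it keeps only two rows of the standard DP table and is not tuned for speed):

def levenshtein(a, b):
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3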

The algorithm to compute this distance is not cheap, however. If you need to do this at scale, there are ways to use cosine similarity on bigram vectors that are a lot faster and easy to distribute if you need to find matches for a lot of words at once. They are, however, only an approximation of this distance.
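A minimal sketch of that approximation (character-bigram counts compared with cosine similarity; the helper names here are illustrative, not from any library):

from collections import Counter
import math

def bigrams(word):
    # Character bigram counts, e.g. "rain" -> {"ra": 1, "ai": 1, "in": 1}.
    return Counter(word[i:i + 2] for i in range(len(word) - 1))

def cosine_similarity(a, b):
    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[k] * vb[k] for k in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * \
           math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

# Values close to 1.0 mean the strings share most of their bigrams.
print(cosine_similarity("America Expansion", "American Express"))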

Answered by Jan van der Vegt on January 23, 2021
