TransWikia.com

Is there any package in python that can identify similarity between alphanumeric alias names of a parameter?

Data Science Asked by rb173 on February 2, 2021

For example: for a parameter like input voltage,

     Alias names : V_INPUT, VIN etc.

Now, I want the software to recognize each of the alias names as the same parameter. Is there any package/method by which I can achieve this?

NLTK only seems to handle dictionary words.

2 Answers

If you know there are only specific variants, you can obviously make a look-up table yourself (i.e. a Python dictionary).
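As a rough sketch of the look-up table idea: normalizing each alias first (lower-casing and dropping underscores and other separators) keeps the table small. The alias entries and the names `ALIASES`, `normalize` and `canonical_name` are only illustrative assumptions, not taken from any library.

```python
import re

# Hypothetical alias table: keys are normalized aliases,
# values are the canonical parameter names.
ALIASES = {
    "vinput": "input_voltage",
    "vin": "input_voltage",
    "inputvoltage": "input_voltage",
    "voutput": "output_voltage",
    "vout": "output_voltage",
}


def normalize(name: str) -> str:
    """Lower-case the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def canonical_name(alias: str):
    """Return the canonical parameter name, or None if the alias is unknown."""
    return ALIASES.get(normalize(alias))


for alias in ["V_INPUT", "VIN", "v-in", "I_LOAD"]:
    print(f"{alias:<8} -> {canonical_name(alias)}")
```

Unknown aliases come back as `None`, so they can be collected and added to the table by hand.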

Otherwise you could try using a fuzzy matching library, like fuzzywuzzy.

This will give you a "closeness" score for your search term, based on your list of parameters (measurements). Here is an example of how you could use it:

In [1]: from fuzzywuzzy import process

In [2]: measurements = ["Voltage", "Current", "Resistance", "Power"]

In [3]: variants = ["VOLT", "voltage_in", "resistnce", "pwr", "amps"] # notice typos etc.

In [4]: for variant in variants:
   ...:     results = process.extract(variant, measurements, limit=2)
   ...:     print(f"{variant:<11} -> {results}")  # See which two were found to be closest 
   ...:     best = results[0]                     # Take the best match by score (first in the list)
   ...:     if best[1] < 70:                      # Set a threshold at 70%
   ...:         print(f"Rejected best match for '{variant}': {best}")

VOLT        -> [('Voltage', 90), ('Current', 22)]
voltage_in  -> [('Voltage', 82), ('Resistance', 30)]
resistnce   -> [('Resistance', 95), ('Current', 38)]
pwr         -> [('Power', 75), ('Current', 30)]
amps        -> [('Voltage', 26), ('Resistance', 22)]
Rejected best match for 'amps': ('Voltage', 26)

So most worked out pretty well, including the typo example.

Obviously, this is not any kind of semantic search, so amps does not get related to Current in any way.


To go the way of semantic encodings, you might want to look into "word embeddings", which try to capture the meaning of words rather than their spelling. To start, you could look into Word2Vec or GloVe embeddings. Perhaps there is even a tool or library that already offers this capability.
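To give a feel for how embeddings are compared, here is a minimal sketch using cosine similarity. The three-dimensional vectors in `TOY_VECTORS` are made up purely for illustration; real Word2Vec or GloVe vectors come from a trained model and have far more dimensions. The function names are hypothetical as well.

```python
import math

# Made-up vectors for illustration only; real embeddings would be
# loaded from a trained model.
TOY_VECTORS = {
    "voltage": [0.9, 0.1, 0.0],
    "current": [0.1, 0.9, 0.0],
    "volts": [0.8, 0.2, 0.1],
    "amps": [0.2, 0.8, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_measurement(term, measurements=("voltage", "current")):
    """Pick the measurement whose vector is most similar to the term's vector."""
    query = TOY_VECTORS[term]
    return max(measurements, key=lambda m: cosine_similarity(query, TOY_VECTORS[m]))


print(closest_measurement("amps"))
print(closest_measurement("volts"))
```

With these toy vectors, "amps" lands next to "current" even though the two strings share almost no characters, which is exactly what fuzzy string matching cannot do.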

These approaches will not inherently deal with things like typos, so for best results, you could even combine the two approaches.
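One way such a combination might look, as a sketch under stated assumptions: a small hand-written synonym table stands in for the semantic step (a real setup would query an embedding model instead), and the standard library's `difflib.get_close_matches` stands in for fuzzywuzzy so the example has no third-party dependencies. The names `SYNONYMS`, `MEASUREMENTS` and `resolve`, and the cutoff of 0.6, are hypothetical choices.

```python
import difflib

MEASUREMENTS = ["voltage", "current", "resistance", "power"]

# Stand-in for a semantic lookup; assumed entries, not derived from a model.
SYNONYMS = {
    "amps": "current",
    "ohms": "resistance",
    "watts": "power",
}


def resolve(variant: str, cutoff: float = 0.6):
    """Map a variant to a measurement name, or None if nothing is close enough."""
    key = variant.strip().lower()
    # Step 1: semantic step (here just a synonym table).
    if key in SYNONYMS:
        return SYNONYMS[key]
    # Step 2: fuzzy string matching to catch typos and abbreviations.
    matches = difflib.get_close_matches(key, MEASUREMENTS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for variant in ["VOLT", "resistnce", "amps", "xyz"]:
    print(f"{variant:<10} -> {resolve(variant)}")
```

The cutoff plays the same role as the 70% threshold in the fuzzywuzzy example, although `difflib` scores are on a 0 to 1 scale and are not directly comparable to fuzzywuzzy's.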

Answered by n1k31t4 on February 2, 2021

Yes, there are a couple. My favorite is PyDictionary.

If you're using pip, make sure it's up to date, then run this command in a terminal: pip install PyDictionary. Hope this helps.

Answered by Dummy Scripts on February 2, 2021
