TransWikia.com

how to create word2vec for phrases and then calculate cosine similarity

Data Science Asked by user3778289 on August 10, 2021

I have just started using word2vec, and I have no idea how to create vectors (using word2vec) for two lists, each containing a set of words and phrases, and then how to calculate the cosine similarity between these two lists.

For example:

list1 = ['blogs', 'vmware', 'server', 'virtual', 'oracle update',
         'virtualization', 'application', 'infrastructure', 'management']
list2 = ['microsoft visual studio', 'desktop virtualization',
         'microsoft exchange server', 'cloud computing', 'windows server 2008']

Any help would be appreciated.

2 Answers

You cannot apply word2vec to a multi-word phrase directly. You should use something like doc2vec, which gives a vector for each phrase:

phrase = model.infer_vector(['microsoft', 'visual', 'studio'])

You can also average or sum the vectors of words (from word2vec) in each phrase, e.g.

phrase = w2v('microsoft') + w2v('visual') + w2v('studio')

This way, a phrase vector has the same dimensionality as a word vector, so the two can be compared. Still, methods like doc2vec generally work better than a simple average or sum. You can then compare each word in the first list to every phrase in the second list and find the closest phrase.
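As a concrete sketch of the averaging approach, here is a minimal example using toy vectors in place of a trained word2vec model (the `w2v` dictionary and its values are made-up stand-ins):

```python
import numpy as np

# Hypothetical toy embeddings standing in for a real trained word2vec model.
w2v = {
    "microsoft": np.array([0.9, 0.1, 0.0]),
    "visual":    np.array([0.5, 0.5, 0.1]),
    "studio":    np.array([0.4, 0.6, 0.2]),
    "server":    np.array([0.8, 0.2, 0.1]),
}

def phrase_vector(phrase):
    """Average the vectors of all in-vocabulary words in the phrase."""
    vecs = [w2v[w] for w in phrase.split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = phrase_vector("microsoft visual studio")
v2 = phrase_vector("server")
print(cosine(v1, v2))
```

Because the phrase vector is an average of word vectors, it lives in the same space as single-word vectors, so the same `cosine` function compares a word against a phrase.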

Note that a phrase like "cloud computing" has a completely different meaning from the word "cloud". Such phrases, especially frequent ones, are better treated as a single token, e.g.

phrase = w2v('cloud_computing')
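One simple way to get such tokens is to concatenate known phrases in the corpus before training, so word2vec sees each phrase as one word. A minimal stdlib sketch (the phrase list here is a hypothetical example):

```python
# Hypothetical list of multi-word terms to treat as single tokens.
PHRASES = ["cloud computing", "visual studio", "windows server 2008"]

def merge_phrases(text, phrases=PHRASES):
    """Replace each known phrase with an underscore-joined token.

    Longer phrases are replaced first so that e.g. 'windows server 2008'
    is not partially consumed by a shorter overlapping phrase.
    """
    for p in sorted(phrases, key=len, reverse=True):
        text = text.replace(p, p.replace(" ", "_"))
    return text

print(merge_phrases("we moved to cloud computing on windows server 2008"))
# -> "we moved to cloud_computing on windows_server_2008"
```

Libraries such as gensim also offer automatic collocation detection (its `Phrases` class) as an alternative to a hand-curated list.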

Extra directions:

  1. Here is an answer by Astariul on Stack Overflow that uses a function from the word2vec package to calculate the similarity between two sets of words.

  2. Take a look at fastText, which works better when there are many misspelled or out-of-vocabulary words.

Answered by Esmailian on August 10, 2021


Vector representations of phrases (called term vectors) are used in projects such as search-result ranking and question answering.

A textbook example is "Chinese river" ~ {"Yangtze_River","Qiantang_River"} (https://code.google.com/archive/p/word2vec/)

The example above identifies phrases based on nouns listed in the Freebase database. Alternatives include:

  1. Identify nouns and other phrases based on POS tagging
  2. Identify all bi-grams and tri-grams

Then filter the list based on usage (e.g., only retain terms that occur at least 500 times in a large corpus such as Wikipedia).
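The bi-gram extraction and frequency filtering can be sketched with the standard library alone (the corpus and the threshold of 2 are toy values; a real run would use a large corpus and a threshold like 500):

```python
from collections import Counter

# Toy corpus; in practice this would be a large corpus such as Wikipedia.
corpus = [
    ["cloud", "computing", "on", "windows", "server"],
    ["desktop", "virtualization", "and", "cloud", "computing"],
    ["cloud", "computing", "needs", "a", "windows", "server"],
]

# Count every adjacent word pair (bi-gram) across all sentences.
counts = Counter(
    (a, b) for sent in corpus for a, b in zip(sent, sent[1:])
)

# Keep only bi-grams seen at least twice (500+ for a real corpus).
frequent = {bg for bg, c in counts.items() if c >= 2}
print(sorted(frequent))
```

The surviving bi-grams are the candidate terms to concatenate into single tokens before training.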

Once the terms have been identified, the word-vector algorithm works as is:

  1. Train a word-vector model
  2. Concatenate phrases into single tokens and retrain the model
  3. Merge the two models

The following patent from Google has more details:

https://patents.google.com/patent/CN106776713A/en

Other papers with examples of domains where term vectors have been evaluated or used:

https://arxiv.org/ftp/arxiv/papers/1801/1801.01884.pdf

https://www.cs.cmu.edu/~lbing/pub/KBS2018_bing.pdf

https://www.sciencedirect.com/science/article/pii/S1532046417302769

http://resources.mpi-inf.mpg.de/departments/d5/teaching/ws15_16/irdm/slides/irdm2015-ch13-part2-handout.pdf

Answered by Shamit Verma on August 10, 2021
