Using pre-trained word2vec vs re-training a word2vec model

Data Science Asked by Niro on July 3, 2021

I am relatively new to using word2vec. I am interested in solving the topic-word intrusion task introduced here, using the word vector spaces generated by word2vec together with an SVC.

I have a corpus with a vocabulary of 8,000 words, all of which are contained in Google’s pre-trained word2vec model. I was wondering which model would provide a better representation of these words: the pre-trained model with its 3M-word vocabulary, or a model trained only on the 8,000 words appearing in my corpus?
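
As a first sanity check, vocabulary coverage against the pre-trained model can be verified along these lines (a minimal sketch assuming gensim 4.x and the standard GoogleNews-vectors-negative300.bin file; the vocab list here is a hypothetical stand-in for the real 8,000 words):

    from gensim.models import KeyedVectors

    # Load Google's pre-trained 300-d vectors (file path is an assumption).
    kv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True
    )

    # Hypothetical stand-in for the real 8,000-word corpus vocabulary.
    vocab = ["king", "man", "queen", "woman"]

    # In gensim 4.x, key_to_index maps every in-vocabulary word to its row.
    missing = [w for w in vocab if w not in kv.key_to_index]
    print(f"{len(missing)} of {len(vocab)} words missing from the pre-trained model")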

One Answer

As always, the answer to "is A better than B?" depends on what you consider better: accuracy, speed, etc.

The dumb but correct answer to your question is: "try both and see which one is better".
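
Concretely, "try both" could look like the sketch below (assuming gensim 4.x and scikit-learn; `sentences`, `words`, and `labels` are hypothetical placeholders for your actual tokenized corpus and labelled intrusion data):

    from gensim.models import KeyedVectors, Word2Vec
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Option A: Google's pre-trained vectors (file path is an assumption).
    pretrained = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True
    )

    # Option B: vectors trained only on your own corpus.
    # `sentences` is a placeholder for your tokenized corpus.
    sentences = [["the", "king", "ruled"], ["the", "queen", "ruled"]]
    own = Word2Vec(sentences, vector_size=300, min_count=1).wv

    def score(wv, words, labels):
        # Cross-validate an SVC on the word vectors from either model.
        X = [wv[w] for w in words]
        return cross_val_score(SVC(), X, labels, cv=3).mean()

    # Compare both representations on the same labelled intrusion data
    # (words/labels are placeholders for your annotated examples):
    # print(score(pretrained, words, labels), score(own, words, labels))

Whichever representation yields the better cross-validated score on your task is the one to keep.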

The performance of these techniques always depends to some extent on the data. What you need to remember about word vectors is that they are learned from a particular context. If the context used to train Google's model is similar to yours, you might be better off using their model; but if it is different, you might run into problems.

Just imagine the following case. You have four words: King, Man, Queen, Woman. Which pairs of two words would you create? Depending on the context, you could make a case for several pairings (see the sketch after this list):

  • King/Man and Queen/Woman because of gender
  • Man/Woman and King/Queen because of how the words are used, etc.
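
Which pairing a given model actually favours can be read straight off the cosine similarities (a sketch assuming gensim 4.x and the GoogleNews vectors):

    from gensim.models import KeyedVectors

    kv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True
    )

    # Higher cosine similarity = the pairing the training context favours.
    print("king~man:   ", kv.similarity("king", "man"))
    print("queen~woman:", kv.similarity("queen", "woman"))
    print("king~queen: ", kv.similarity("king", "queen"))
    print("man~woman:  ", kv.similarity("man", "woman"))

A model trained on your own corpus could rank these pairs quite differently, which is exactly the context-dependence at issue.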

Answered by Valentin Calomme on July 3, 2021
