word2vec: usefulness of context vectors in classification

Data Science: Asked by Ingolifs on September 4, 2021

I've been working on an NN-based classification system that accepts document vectors as input. I can't really talk about what I'm specifically training the neural net on, so I'm hoping for a more general answer.

Up to now, the word vectors I've been using (specifically, from the GloVe implementation in the text2vec package for R) have been the target vectors. Until now I wasn't aware that word2vec-style training also produces context vectors, and quite frankly I'm not sure what exactly they represent. (It's not part of the main question, but if anybody could point me to resources on what context vectors are for and what they do, that would be greatly appreciated.)

My question is: how useful are these context word vectors in any kind of classification scheme? Am I missing out on useful information to feed into the neural net?

Qualitatively speaking, how would these four schemes fare? (Each is sketched in code after the list.)

  1. Target word vectors only.
  2. Context word vectors only.
  3. Averaged target and context vectors.
  4. Concatenated target and context vectors (i.e. a 100-dimensional word2vec model ends up with length-200 vectors)
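For reference, here is roughly how I would construct each variant with text2vec; this is a sketch assuming a recent text2vec API in which fit_transform returns the main (target) vectors and the $components field holds the context vectors (parameter names such as rank differ across versions):

```r
library(text2vec)

# Assumes `tcm` is a term co-occurrence matrix already built with create_tcm().
glove <- GlobalVectors$new(rank = 100, x_max = 10)    # 100-dimensional embeddings
wv_target  <- glove$fit_transform(tcm, n_iter = 20)   # target ("main") vectors, words x 100
wv_context <- t(glove$components)                     # context vectors, transposed to words x 100

# The four schemes from the list above:
emb_target  <- wv_target                       # 1. target vectors only
emb_context <- wv_context                      # 2. context vectors only
emb_average <- (wv_target + wv_context) / 2    # 3. averaged target and context vectors
emb_concat  <- cbind(wv_target, wv_context)    # 4. concatenated vectors (words x 200)
```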

3 Answers

I am a bit confused about your question, because word2vec basically outputs a word vector/embedding for each word that is completely independent of that word's context in the document.

Word2vec is a simple and effective embedding technique that Google came up with; it uses a neural network to produce word embeddings via either a Skip-gram or Continuous Bag of Words (CBOW) approach. If you would like more details, you should read their paper.

To overcome this limitation of word2vec - that its embeddings are not context-dependent - there are two well-known approaches: ELMo (which uses LSTMs) and BERT (which uses Transformers). These are the state of the art now, as they can learn context and long-range dependencies. I don't have experience with R, so I do not know whether they can be used from R or whether there is a package for them, but they work with most deep learning frameworks such as PyTorch or TensorFlow.


Answered by Dev Tech on September 4, 2021

Essentially the target vectors $\mathbf{w}$ and the context vectors $\mathbf{c}$ learn the same thing (they jointly factorise a matrix of word co-occurrence statistics and are two sides of the same coin). You can use either equivalently.

However, a better result is given by taking their average. This is because the embeddings are trained so that $\mathbf{w}^\top\mathbf{c}$ approximates a useful statistic. If the $\mathbf{c}$'s are discarded and only the $\mathbf{w}$'s are used, then $\mathbf{w}^\top\mathbf{w}$ recovers the encoded statistic, but with an error. Using $\mathbf{w}^\top\mathbf{c}$ is obviously preferable, as it recovers the statistic with minimal error, but it requires keeping both sets of embeddings (twice the memory). Under some simplifying assumptions, using their average $\mathbf{a}=\tfrac{\mathbf{w}+\mathbf{c}}{2}$ can be seen to halve the error: not as good as keeping both (e.g. concatenating), but better than using either $\mathbf{w}$ or $\mathbf{c}$ alone.

More detail is given in this paper.
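As a rough intuition for the "halves the error" claim (a simplified sketch under an independent-noise assumption, not the paper's actual derivation): if both sets of embeddings are treated as noisy views of the same underlying vector, averaging preserves the signal while halving the noise variance.

$$\mathbf{w} = \mathbf{v} + \boldsymbol{\varepsilon}_w, \qquad \mathbf{c} = \mathbf{v} + \boldsymbol{\varepsilon}_c, \qquad \operatorname{Var}(\boldsymbol{\varepsilon}_w) = \operatorname{Var}(\boldsymbol{\varepsilon}_c) = \sigma^2 I$$

$$\mathbf{a} = \frac{\mathbf{w}+\mathbf{c}}{2} = \mathbf{v} + \frac{\boldsymbol{\varepsilon}_w+\boldsymbol{\varepsilon}_c}{2}, \qquad \operatorname{Var}\!\left(\frac{\boldsymbol{\varepsilon}_w+\boldsymbol{\varepsilon}_c}{2}\right) = \frac{\sigma^2}{2}\,I$$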

Answered by Carl on September 4, 2021

So, from what I understand of the question, you want to get an idea of how word2vec works so you can assess how well the resulting context vectors will help discriminate between words by their meaning.

Word2vec works on the premise of the distributional hypothesis, which essentially states that words that appear in similar contexts have similar meanings (e.g. "the dog ate the food" / "the cat ate the food": both dog and cat appear in the same context, so they are semantically close to each other).

So word2vec formulates this with a CBOW model, which is effectively a feedforward neural network that takes the surrounding context of the target word as a series of one-hot encoded vectors and aims to predict the target word (with the assumption, of course, that the context is treated as a bag of words, i.e. a word's meaning does not depend on the word ordering of its context).

After training, the model's weights are then used to form the word embeddings, which represent a word's meaning in semantic space. (For further reference)
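To make that concrete, here is a minimal, purely illustrative CBOW forward pass in base R (toy vocabulary, random weights; not the word2vec or text2vec API): the context embeddings are averaged and a softmax over the vocabulary scores the target word.

```r
set.seed(1)
vocab <- c("the", "dog", "cat", "ate", "food")
V <- length(vocab); d <- 3                        # vocabulary size, embedding dimension
W_in  <- matrix(rnorm(V * d, sd = 0.1), V, d)     # input weights: one context embedding per word
W_out <- matrix(rnorm(d * V, sd = 0.1), d, V)     # output weights: used to score the target word

context <- c("the", "ate", "the", "food")         # context window around the target "dog"
target  <- "dog"

h      <- colMeans(W_in[match(context, vocab), , drop = FALSE])  # average of the context embeddings
scores <- h %*% W_out                                            # one score per vocabulary word
probs  <- exp(scores) / sum(exp(scores))                         # softmax over the vocabulary
probs[, match(target, vocab)]                                    # predicted probability of the target
```

Training adjusts W_in and W_out so that this probability increases for observed (context, target) pairs; afterwards the rows of W_in (and/or the columns of W_out) serve as the word embeddings.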

Answered by shepan6 on September 4, 2021
