
Negative Sampling in Word2Vec - Embedding Vector / Number of Samples

Asked by Maxbit on Cross Validated, February 22, 2021

I understand that negative sampling in the skip-gram model of word2vec changes the task from "given a center word c, what is the probability that a context word o occurs?" to "given a tuple (c, o), how likely is it that the two words appear in the same context?".

However, I've got two questions. First, the loss function of the negative sampling approach is often stated along the lines of $\log \sigma(w^\top c) + k \cdot \mathbb{E}_{c_N \sim P(w)}\left[\log \sigma(-w^\top c_N)\right]$. What we can see here, and what is already stated in [1], is that for each positive training sample we consider $k$ negative training samples. Why do we consider $k$ negative samples for each positive one? Why doesn't this bias our data, given that we end up with far more negative samples than positive ones?
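For concreteness, if I replace the expectation by $k$ explicit samples from the noise distribution, I read the per-pair objective as

$$\log \sigma(w^\top c) + \sum_{i=1}^{k} \log \sigma\left(-w^\top c_{N_i}\right), \qquad c_{N_i} \sim P(w),$$

i.e. one positive term plus $k$ negative terms for every positive pair.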

And what exactly does "consider" mean in this context? We can only feed one training sample into our network at a time. Does it mean we input a positive sample and do backprop, and then $k$ times input a negative sample and do backprop?
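To make this concrete, here is a rough numpy sketch of the first reading I can imagine (all names, sizes and the uniform noise distribution below are made up by me, not taken from the paper or from [2]): one loss term per positive pair that already bundles the $k$ negatives, so there would be a single backprop step per positive pair rather than $k+1$ separate ones.

import numpy as np

rng = np.random.default_rng(0)
V, D, k = 1000, 50, 5                     # vocab size, embedding dim, negatives per positive
W_in = rng.normal(size=(V, D)) * 0.01     # "word" embeddings
W_out = rng.normal(size=(V, D)) * 0.01    # "context" embeddings
noise_prob = np.full(V, 1.0 / V)          # stand-in for the unigram^(3/4) noise distribution

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_loss(c, o):
    # one objective per positive pair (c, o) that already contains the k negative terms,
    # so a single gradient step would cover the positive pair and its k negatives together
    negs = rng.choice(V, size=k, p=noise_prob)
    pos_term = np.log(sigmoid(W_out[o] @ W_in[c]))
    neg_term = np.sum(np.log(sigmoid(-W_out[negs] @ W_in[c])))
    return -(pos_term + neg_term)

print(pair_loss(3, 7))

Is this roughly what "consider" means, or are the negatives really fed through the network as separate training samples?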

Secondly, [2] provides a great tutorial on how to implement negative sampling. The source code there is as follows:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# corpus, word_to_ix, wordProb, generate_negative_sample and EMBEDDING_DIM
# are assumed to be defined earlier in the tutorial

posTrainSet = []

# add positive examples: pair each word with its left and right neighbour
for document in corpus:
    for i in range(1, len(document)-1):
        word = word_to_ix[document[i]]
        context_words = [word_to_ix[document[i-1]], word_to_ix[document[i+1]]]
        for context in context_words:
            posTrainSet.append((word, context))

n_pos_examples = len(posTrainSet)

# add the same number of negative examples
n_neg_examples = 0
negTrainSet = []

while n_neg_examples < n_pos_examples:
    (word, context) = generate_negative_sample(wordProb)
    # convert to indices
    word, context = word_to_ix[word], word_to_ix[context]
    if (word, context) not in posTrainSet:
        negTrainSet.append((word, context))
        n_neg_examples += 1

X = np.concatenate([np.array(posTrainSet), np.array(negTrainSet)], axis=0)
y = np.concatenate([[1]*n_pos_examples, [0]*n_neg_examples])
N_WORDS = len(word_to_ix)

# each training example is a (word, context) index pair, hence input_shape=(2,)
embedding_layer = layers.Embedding(N_WORDS, EMBEDDING_DIM,
                                   embeddings_initializer="RandomNormal",
                                   input_shape=(2,))

model = keras.Sequential([
  embedding_layer,                        # (batch, 2) -> (batch, 2, EMBEDDING_DIM)
  layers.GlobalAveragePooling1D(),        # average over the pair -> (batch, EMBEDDING_DIM)
  layers.Dense(1, activation='sigmoid'),  # probability that the pair is a real (word, context) pair
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(X, y, batch_size=X.shape[0])  # trains on the whole data set as a single batch

First of all, don't they use only one negative sample per positive sample here?

Second, I do not understand how the embedding layer works here, and with negative sampling in general. Without negative sampling, the input was the one-hot encoded vector of a single word, so one row of the embedding layer's weight matrix corresponded to that word's embedding. With negative sampling, we input two words into the neural network. How does that even work? In this implementation the input dimension of the embedding layer is set to 2, but given the tuple (center word, context word), why would we want to feed the context word through the word embedding layer (without negative sampling, we had separate context embeddings)? And I don't understand what the output of the embedding layer is if we feed it two words instead of one; it has to return a single embedding in the end.

So this question has two sides: first, I do not understand the general idea of inputting a tuple of words into the NN, and second, I don't see how this implementation approaches that problem.
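To make the shape question concrete, this is the kind of check I did in isolation (the vocabulary size, embedding dimension and index pairs below are arbitrary values I picked, not the ones from [2]):

import numpy as np
from tensorflow.keras import layers

N_WORDS, EMBEDDING_DIM = 100, 8                # arbitrary small values for the check

emb = layers.Embedding(N_WORDS, EMBEDDING_DIM)
pairs = np.array([[3, 17], [42, 5]])           # two (word, context) index pairs, shape (2, 2)

out = emb(pairs)                               # shape (2, 2, EMBEDDING_DIM): one embedding row per word in the pair
pooled = layers.GlobalAveragePooling1D()(out)  # shape (2, EMBEDDING_DIM): average of the two embeddings
print(out.shape, pooled.shape)

So the pooling layer seems to average the word and context embeddings into a single vector before the sigmoid, but that averaging step is exactly the part whose motivation I don't follow.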

Thank you very much!

[1] How does negative sampling work in word2vec?

[2] https://www.jasonosajima.com/ns
