How can I label (predict) an unseen set of data based on an existing model?

Data Science Asked on January 8, 2021

I’m working on a multi-label classification learning project, for which I’ve taken 16K lines of text and manually classified them, achieving around 94% accuracy/recall (across three models).

Good results, I’d say.

I then thought I was ready to use my model to predict the labels for a set of new, similar, but previously unseen texts.
However, it appears that, at least with the sklearn models, I can’t simply run predict against the new dataset, as the new data produces a feature array of a different size.

I am surely missing something, but at this stage I wonder what, considering that I always thought classification was meant to help with exactly this kind of task.
If I need to already know the "answer", I struggle to see the benefit of the approach.

Below is, in short, the approach taken:

from gensim import corpora

corpus = df_train.Terms.to_list()

# build a dictionary
texts = [
    word_tokenizer(document, False)
    for document in corpus
]

dictionary = corpora.Dictionary(texts)

from gensim.models.tfidfmodel import TfidfModel

# create the tfidf vector
new_corpus = [dictionary.doc2bow(text) for text in texts]
tfidf_model = TfidfModel(new_corpus, smartirs='Lpc')
corpus_tfidf = tfidf_model[new_corpus]

# convert into a format usable by sklearn
from gensim.matutils import corpus2csc

X = corpus2csc(corpus_tfidf).transpose()  # docs x terms sparse matrix


# Fit and predict

from sklearn.naive_bayes import ComplementNB
clf = ComplementNB()
clf.fit(X.toarray(), y)  # y: the labels from df_train (defined elsewhere)

y_pred = clf.predict(X.toarray())

# At this stage I have my model trained on the 16K labelled texts.

# Running again almost the same code as above, up to X = corpus2csc(corpus_tfidf).transpose().
# Supplying a new dataframe should give me a new matrix that I can predict via clf.predict(X.toarray())

corpus = df.Query.to_list()

# build a dictionary
.....
.....

X = corpus2csc(corpus_tfidf).transpose()
y_pred = clf.predict(X.toarray()) # here I get the error

[screenshot of the dimension-mismatch error]

So everything works fine using df_train (shape (16496, 2)); by the time I repeat the above with my new dataset df (shape (831, 1)), I get the error mentioned above.
Of course, the second dimension of the first dataset contains the labels, which are used by the fit method, so the problem is not there.

The error is due to the fact that the much smaller corpus generated only 778 columns, whereas the first dataset, with 16K rows, generated 3226 columns. This is because I vectorised my corpus, as I wanted TF-IDF to give the terms some weight. Perhaps this is the error?
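To make the mismatch concrete, here is a minimal sketch (toy tokens, not the actual data) showing that two separately built dictionaries produce feature matrices of different widths:

from gensim import corpora
from gensim.matutils import corpus2csc

train_texts = [["apple", "banana", "cherry"], ["banana", "cherry", "date"]]
new_texts = [["banana", "fig"]]

dict_train = corpora.Dictionary(train_texts)  # 4 unique terms -> 4 columns
dict_new = corpora.Dictionary(new_texts)      # 2 unique terms -> 2 columns

X_train = corpus2csc([dict_train.doc2bow(t) for t in train_texts]).transpose()
X_new = corpus2csc([dict_new.doc2bow(t) for t in new_texts]).transpose()

print(X_train.shape, X_new.shape)  # (2, 4) vs (1, 2): incompatible widths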

I understand that there are techniques like PCA that can reduce dimensionality, but I’m not sure about the opposite.

Can anybody kindly explain?

UPDATE

Nicholas helped me figure out where the error was, though a new one is now appearing, again in connection with some missing columns.

See below the code and errors as they stand.

from gensim import corpora

corpus = df_train.Terms.to_list()

# build a dictionary
texts = [
    word_tokenizer(document, False)
    for document in corpus
]

dictionary = corpora.Dictionary(texts)

from gensim.models.tfidfmodel import TfidfModel

# create the tfidf vector
new_corpus = [dictionary.doc2bow(text) for text in texts]
tfidf_model = TfidfModel(new_corpus, smartirs='Lpc')
corpus_tfidf = tfidf_model[new_corpus]

# convert into a format usable by sklearn
from gensim.matutils import corpus2csc

X = corpus2csc(corpus_tfidf).transpose()  # docs x terms sparse matrix


# Fit and predict

from sklearn.naive_bayes import ComplementNB
clf = ComplementNB()
clf.fit(X.toarray(), y)  # y: the labels from df_train (defined elsewhere)

y_pred = clf.predict(X.toarray())

# At this stage I have my model trained on the 16K labelled texts.

corpus = df.Query.to_list()

unseen_tokens = [word_tokenizer(document, False) for document in corpus]
unseen_bow = [dictionary.doc2bow(t) for t in unseen_tokens]
unseen_vectors = tfidf_model[unseen_bow]
X = corpus2csc(unseen_vectors).transpose() # here I get the errors in the first screenshot

y_pred = clf.predict(X.toarray()) # here I get the errors in the second screenshot

[first screenshot: error raised by corpus2csc]

[second screenshot: error raised by clf.predict, again about missing columns]

UPDATE 2

I’ve also tried a second approach, using TfidfVectorizer from sklearn. I did it just in case I was missing something obvious in the previous implementation (you know … the KISS principle).

In that case the output is as expected and I get a prediction. So I’m not sure, but I suspect there is a problem somewhere with the corpus2csc function.
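For reference, a minimal sketch of what that second approach might look like (the post doesn’t show its exact code; the column names and y are assumed from the earlier snippets):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(df_train.Terms)  # learn vocabulary and idf from the training data

clf = ComplementNB()
clf.fit(X_train, y)  # y: the labels, as before

# transform (not fit_transform) keeps the training feature space
X_unseen = vectorizer.transform(df.Query)
y_pred = clf.predict(X_unseen)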

UPDATE 3

I’ve uploaded the datasets here and here if you want to try them. Also, a gist is available here.

Cheers

2 Answers

Kudos to @Nicholas for putting me on the right path.

The specific reason why this was not working with the corpora model is what I had guessed over time: without being told the vocabulary size, corpus2csc infers the number of terms from the (smaller) unseen corpus, thereby dropping the columns for terms that never appear in it.

The solution is to specify the length of the dictionary when transposing the values.
Therefore, X = corpus2csc(unseen_vectors).transpose() has to become X = corpus2csc(unseen_vectors, num_terms=len(dictionary)).transpose().
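Putting it together, the prediction block from the update becomes (reusing the dictionary, tfidf_model and clf defined during training):

corpus = df.Query.to_list()

unseen_tokens = [word_tokenizer(document, False) for document in corpus]
unseen_bow = [dictionary.doc2bow(t) for t in unseen_tokens]
unseen_vectors = tfidf_model[unseen_bow]

# num_terms pins the matrix width to the full training vocabulary,
# even when the unseen corpus never uses the highest term ids
X = corpus2csc(unseen_vectors, num_terms=len(dictionary)).transpose()
y_pred = clf.predict(X.toarray())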

Hope this may help somebody one day.


Correct answer by Andrea Moro on January 8, 2021

You need to use the same preprocessing elements (dictionary, TF-IDF model, etc.) that you used to create your TF-IDF matrix during training when you come to apply your model to unseen data.

Do not create a new dictionary, tfidf_model, etc. for the unseen data, or else:

  • the dimensionality of the data you are passing to your model may not be the same;
  • you will lose the information you learned by doing the TF-IDF on your training data.

Straight after the line

corpus = df.Query.to_list()

you want something like

unseen_tokens = [word_tokenizer(document, False) for document in corpus]
unseen_bow = [dictionary.doc2bow(t) for t in unseen_tokens]
unseen_vectors = tfidf_model[unseen_bow]

i.e. not creating a new tfidf model or a new dictionary, but using the ones you created and used in training.
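As an aside, doc2bow silently drops tokens that are not in the dictionary it was built from, which is exactly what keeps the unseen vectors inside the training feature space. A quick toy illustration:

from gensim import corpora

dictionary = corpora.Dictionary([["apple", "banana"]])
# "kiwi" was never seen during training, so it is silently dropped
print(dictionary.doc2bow(["apple", "kiwi"]))  # -> [(0, 1)]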

Answered by Nicholas James Bailey on January 8, 2021
