TransWikia.com

How do I use TF*IDF scores for my machine learning model?

Data Science Asked by Apollo on December 4, 2020

I have applied TF*IDF on the ‘Ad-topic line’ column of my dataset.
For every ad-topic line, I get the same output:

Output

Firstly, I am unable to make sense of the output. The TF*IDF values are mentioned to the right, but what exactly are the numbers in brackets?

I plan to use these for my logistic regression model for classification. How exactly do I feed these values to the algorithm?

3 Answers

The numbers on the left are also important, they are basically the indexes in the following format

(document_number, token_number)

TF-IDF is computed for all the unique tokens in all the documents.

Let me know if you have any further doubt.

Vote me if i was able to help ;)

Answered by William Scott on December 4, 2020

From the output you have shared it can be understood that you have around 1000 rows of data and 300+ features/columns/words which the tfidf function created based on your selection of ngrams parameter. Now, in the brackets (x,y) signifies: X as the number of row of your data And Y as the nth feature in the features list.

Assuming you have written something similar to get the above output-

tfidf_matrix =  tf.fit_transform(data)

this will give you a list of feature names 
feature_names = tf.get_feature_names()

And now you can check any Y value in the brackets, for example lets take the first value from your output- (0,53)

feature_names[53]

This will give the name of the feature(which is basically a word or a combination of words) and right side value is the the tfidf score of that feature in the 0th row of your data.

Answered by cap on December 4, 2020

First of all, I'm not sure how you have applied tfidf vectorizers on the data as no code snippet has been attached. Tfidf vectorizers are applied on text to convert the text into numerical vectors. Speciality of tfidf vectorization is that it gives more importance to rarely occuring words than the words which occur a lot of time ex: stop words or filler words which occur a lot of times yet they add no special meaning to a sentence. Nevertheless, I've applied tfidf vectorization on the dataset and I have posted them here and here. Try to recreate this. Hope it helps

Answered by karthikeyan mg on December 4, 2020

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP