Bag-of-words model: Boolean vs. TF-IDF

Question

When designing a document classifier with traditional feature engineering, I would prefer a tf-idf model over a Boolean model for representing a document as a vector, because intuitively the Boolean model loses information about how important each word is for assigning a document to a given class.

In other words, when each dimension represents a term, a Boolean representation gives the document a less meaningful position in n-dimensional vector space than tf-idf-based feature extraction does, because a discrete value (0 or 1) ignores the differences in weight between terms that a continuous value would capture. This is the case even though parameter tuning may still optimize the coefficient of each term when a linear combination is used for document classification.

Am I justified in thinking that Boolean features are not a good choice for extracting a bag-of-words feature vector from a document, for the reason given above?

I am already aware of recent approaches such as representation learning and dimensionality reduction, e.g. word embeddings or the BERT language model. My question is limited to traditional feature extraction from document data.

Erwan · Answer

Your reasoning is correct: for most tasks related to information retrieval and/or document classification based on the semantics of the documents, it's recommended to take into account the importance of the terms (both inside the document and across all documents, hence TF and IDF).

However, TF-IDF is not always the best choice:

There are some classification tasks which are not based on the semantics of the document. For example if the goal is to classify documents by writing style (e.g. find documents by the same author) then the topic doesn't matter and therefore IDF is not relevant. 
With a very small dataset and/or very short documents, using TF-IDF scores can lead to overfitting. In such cases using Boolean values might perform better because the simpler features make the model's job easier.
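
To make the difference between the two representations concrete, here is a minimal pure-Python sketch that builds both vectors for a toy corpus. The three documents, the whitespace tokenization, the function names and the particular smoothed IDF formula, log((1 + N) / (1 + df)) + 1, are all assumptions chosen for illustration; other TF-IDF variants weight terms differently.

```python
import math
from collections import Counter


def build_vocab(docs):
    """Sorted list of distinct whitespace-separated tokens across all documents."""
    return sorted({tok for doc in docs for tok in doc.lower().split()})


def boolean_vector(doc, vocab):
    """1 if the term occurs in the document at least once, else 0."""
    present = set(doc.lower().split())
    return [1 if term in present else 0 for term in vocab]


def tfidf_vectors(docs, vocab):
    """Raw term count multiplied by a smoothed IDF (one variant among several)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vectors


# Toy corpus, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell on monday",
]

vocab = build_vocab(docs)
bool_vecs = [boolean_vector(d, vocab) for d in docs]
tfidf_vecs = tfidf_vectors(docs, vocab)

# Compare the two representations of the first document, term by term.
for term, b, w in zip(vocab, bool_vecs[0], tfidf_vecs[0]):
    print(f"{term:8s} boolean={b} tfidf={w:.3f}")
```

Both vectors are non-zero in exactly the same positions; they differ only in the magnitudes. In the Boolean vector every present term counts the same, whereas here "mat" (found in one document) gets a larger weight than "cat" (found in two), and "the" is boosted by occurring twice. When every term occurs at most once per document, as is typical for very short texts, the term-frequency part contributes nothing and only the IDF weights separate the two representations.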
