Treating recommender systems as a multiclass or binary classification problem

Question

I'm considering the following two approaches for building a classifier-based recommender system that recommends products from implicit data:

1. Treat it as a multi-class classification problem. The features of the model are the user features and the target is the item. This is the approach used in this Google documentation.
2. Treat it as a binary classification problem. The features of the model are the user and item features, and the target is a binary variable indicating whether the user purchased the item. This is the approach used in TensorFlow Recommenders.

What are the advantages and disadvantages of using one or the other? Is the first approach implemented in any recommendation systems library?

Brian Spiering · Accepted Answer

One of the primary advantages of framing it as a multi-class classification problem is the ability to build a single model for all items and make direct comparisons between predictions for different items. The typical setup for multi-class classification is a softmax output layer trained with a cross-entropy loss. Softmax provides the relative probability of each item, so the probabilities across the catalog sum to one.
This is also one of the biggest disadvantages. It requires the data to be homogeneous. If not all items have the same features, modeling will be more difficult or not possible.
One of the primary advantages of binary classification framing is robustness to heterogeneous data. For example, sparse or missing data is more easily handled. Since each item is independent, items with very few examples or missing data can be modeled differently than items with dense or complete data.
The disadvantage is the effort of training, and predicting with, a separate model for each item.
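
To make the multi-class framing above concrete, here is a minimal sketch using only the Python standard library. The item scores are made-up numbers standing in for the output of some model, not values from any real system. It shows how softmax turns one score per item into probabilities that can be compared directly, and how the cross-entropy loss is computed for the item the user actually bought.

```python
import math

def softmax(logits):
    """Turn raw item scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log-probability of the item the user actually bought."""
    return -math.log(probs[target_index])

# Hypothetical scores a model assigns to three catalog items for one user.
item_logits = [2.0, 1.0, 0.1]
probs = softmax(item_logits)

# Because all items share one output layer, their probabilities are comparable
# and can be sorted to produce a ranking (best item first).
ranking = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

# Loss if the user actually purchased item 0.
loss = cross_entropy(probs, target_index=0)
```

A single forward pass yields the full ranking, which is the "direct comparison" advantage mentioned above.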

user702846 · Answer

The biggest challenge is probably how to measure the performance of your model. In binary classification you can use accuracy or AUC, for example, but in multi-class it is harder.

Measuring error in recommendation systems is tricky in general and differs from typical classification problems: scoring an absolutely amazing item as terrible has a different cost than presenting a terrible item to users as a diamond.

This matters especially if your items are hierarchically related. For example, say you have the following items: A: toilet paper, B: kitchen paper, C: beer. Is predicting B when the truth is A as bad as predicting C? To capture this, one has to define a cost matrix X, where i and j are items and the entry x_ij is the cost of predicting j when the true item is i.
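
The cost matrix idea can be sketched as follows. The penalty values are purely illustrative assumptions (confusing the two paper products is taken to be cheap, confusing either with beer expensive), and `mean_cost` is a hypothetical helper name.

```python
ITEMS = ["toilet paper", "kitchen paper", "beer"]  # indices 0, 1, 2

# COST[i][j]: assumed penalty for predicting item j when the true item is i.
COST = [
    [0.0, 0.2, 1.0],
    [0.2, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

def mean_cost(true_items, predicted_items, cost=COST):
    """Average penalty over all predictions, looked up in the cost matrix."""
    total = sum(cost[t][p] for t, p in zip(true_items, predicted_items))
    return total / len(true_items)

true_items = [0, 0, 1, 2]
near_misses = [1, 0, 0, 2]  # two errors, both between the paper products
far_misses = [2, 0, 2, 2]   # two errors, both predicting beer

near_cost = mean_cost(true_items, near_misses)
far_cost = mean_cost(true_items, far_misses)
```

Both prediction lists have the same accuracy (two out of four correct), yet the cost-weighted error separates them: roughly 0.1 for the near misses versus 0.5 for the far misses.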

Interpretable models: it is more complicated to investigate which feature helps identify which class in a multi-class setup. This is not the case for binary classification.

Imbalanced data: you have to make sure the model is 'fair' and has seen enough examples of each class, in both types of modeling. This is probably more severe in multi-class. You would have to make tons of figures and tables to explain where your model fails, for example in which scenarios it mistakes A for B or C. In other words, you would be dealing with an n×n error (confusion) matrix. As you can imagine, this is not a headache in binary classification.
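
As a rough illustration of the n×n error matrix mentioned above, the sketch below builds one by hand for three items and derives per-item recall from it. The labels and predictions are invented, and item 0 is deliberately over-represented to mimic imbalance.

```python
def confusion_matrix(true_items, predicted_items, n_items):
    """matrix[i][j] counts how often true item i was predicted as item j."""
    matrix = [[0] * n_items for _ in range(n_items)]
    for t, p in zip(true_items, predicted_items):
        matrix[t][p] += 1
    return matrix

def per_item_recall(matrix):
    """Fraction of each true item that was predicted correctly (None if unseen)."""
    recalls = []
    for i, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[i] / total if total else None)
    return recalls

# Invented data: item 0 appears four times, item 2 only once.
true_items = [0, 0, 0, 0, 1, 1, 2]
predicted_items = [0, 0, 1, 2, 1, 0, 2]

matrix = confusion_matrix(true_items, predicted_items, n_items=3)
recalls = per_item_recall(matrix)
```

With n items there are n·(n−1) off-diagonal cells to inspect, which is why the reporting effort grows so quickly compared with the four cells of a binary problem.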

There are subtle differences between the two approaches you linked. The DNN works with user queries, which are not necessarily structured. It can even take time into account, as they also mention. For example, if you notice users are spending time on your page, you can recommend them something. Perhaps they are searching for specific information, which is different from the classic Netflix-style recommender system (user-item matrix).

As boring as this may sound, part of the decision should come from the business side. Contacting users to 'recommend them something' in order to avoid churn perhaps has a different value than 'up-selling'. The features used to model each can also be very different. In the first scenario, for example, one should use how long the user has been with the platform, whereas this feature is most likely irrelevant for up-selling.

Recommender systems are an active research field. There is an annual conference called RecSys (https://recsys.acm.org/) where companies and universities present their latest work. There are tons of different methods out there and there is no single solution! After 16 years of research, Netflix has improved the recall of its recommendation system by only 4%. So be prepared for a narrow range of performance improvement.

UPDATE
One way to decide between binary and multi-class classification is to apply PCA (or any other dimensionality reduction method) and color the points by class. With the binary framing you end up with a two-color plot; with the multi-class framing you get one color per class. You can then visually inspect which setup gives better separation.

David Masip · Answer

I got an answer to this same question here.
Mainly, what it says is:

In general, a softmax over the catalog implies a fixed set of output items. Thus, whenever new items are added to the catalog, you'll have to change the output layer and retrain the model. In addition, training with a large softmax layer is time-consuming. Typically, the softmax layer is constrained (to a limited set of items) to speed up training, or a negative sampling approach is adopted. See TripAdvisor's and YouTube's final layer here.
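
The negative sampling idea mentioned above can be sketched like this, under simplifying assumptions: scores are dot products between a user vector and item vectors, negatives are drawn uniformly at random, and the correction for the sampling distribution that production implementations usually apply is omitted. The function name and all vectors are hypothetical.

```python
import math
import random

def sampled_softmax_loss(user_vec, item_vecs, target, num_negatives, rng):
    """Cross-entropy over the target item plus a few random negatives,
    instead of over the whole catalog."""
    candidates = [i for i in range(len(item_vecs)) if i != target]
    negatives = rng.sample(candidates, num_negatives)
    chosen = [target] + negatives  # the target sits at position 0
    logits = [sum(u * v for u, v in zip(user_vec, item_vecs[i])) for i in chosen]
    m = max(logits)  # log-sum-exp with the max subtracted for stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[0], chosen

rng = random.Random(0)
# A made-up catalog of 1000 items with 4-dimensional embeddings.
item_vecs = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(1000)]
user_vec = [0.5, -0.2, 0.1, 0.3]

loss, chosen = sampled_softmax_loss(
    user_vec, item_vecs, target=42, num_negatives=5, rng=rng
)
```

Each training step here touches 6 item vectors rather than 1000, which is where the speed-up comes from.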

On the other hand, a sigmoid over a user-item pair can work with any number of items, as well as with new items (provided the item embeddings are available). That said, it requires one prediction for each user-item pair and could be costly if there are many pairs. (Contrast this with the softmax approach, where you only need to predict once to get probabilities for all items.) Most implementations I've seen adopt this approach (see other approaches in this post).
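
A minimal sketch of the sigmoid-per-pair framing, assuming the score is simply the dot product of a user embedding and an item embedding. The embeddings and the `purchase_probability` helper are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def purchase_probability(user_vec, item_vec):
    """Independent probability that this user buys this item."""
    return sigmoid(sum(u * v for u, v in zip(user_vec, item_vec)))

user_vec = [0.5, -0.2, 0.1]
item_embeddings = {
    "A": [1.0, 0.0, 0.5],
    "B": [-0.5, 1.0, 0.0],
}

# A new item only needs an embedding; there is no output layer to resize.
item_embeddings["C"] = [0.9, -0.1, 0.4]

# One prediction per user-item pair.
scores = {
    name: purchase_probability(user_vec, vec)
    for name, vec in item_embeddings.items()
}
```

Note that the scores are independent probabilities and do not sum to one across items, unlike the softmax output, and that scoring a full catalog means one call per item.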
