TransWikia.com

Data splitting for a binary classification model

Data Science Asked by hina10531 on April 5, 2021

I’m trying to build a binary classification model that predicts who is going to buy the product and who is not. I’ve heard that splitting the dataset into two subsets is a common practice when preparing the input data.

[ ================ Training Data 80% ================= ] [ ==== Test Set 20% ==== ]

Is it really just splitting off a chunk of the dataset at some fixed proportion, like above? Is it that simple?
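To make the question concrete, here is a minimal sketch of that naive proportional split on the sample data below (the DataFrame `df` and its columns are just an illustration of my data):

```python
import pandas as pd

# Hypothetical DataFrame mirroring the sample data in the question.
df = pd.DataFrame({
    "UserId": [1] * 10,
    "AppId": list(range(1, 11)),
    "Purchased": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

# Take the first 80% of rows as training data and the rest as the test set.
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]
print(len(train), len(test))  # 8 2
```

Note that with this ordering the test set ends up containing only `Purchased == 0` rows, which is exactly the situation described below.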

Imagine I have this simple dataset below.

UserId,UserName,AppId,Purchased
1,Lianne,1,1
1,Lianne,2,1
1,Lianne,3,1
1,Lianne,4,1
1,Lianne,5,1
1,Lianne,6,0
1,Lianne,7,0
1,Lianne,8,0
1,Lianne,9,0
1,Lianne,10,0   

As is commonly recommended, I split it into two groups.

// Training Data Set
UserId,UserName,AppId,Purchased
1,Lianne,1,1
1,Lianne,2,1
1,Lianne,3,1
1,Lianne,4,1
1,Lianne,5,1
1,Lianne,6,0
1,Lianne,7,0
1,Lianne,8,0

// Test Set
UserId,UserName,AppId,Purchased
1,Lianne,9,0
1,Lianne,10,0

Would this work? It seemed not, and it turned out it actually didn’t. The model was wrong when predicting for AppIds 6, 7, 8, and 9: it predicted that user number one would buy them with a fairly high probability. The metrics looked like this:

  • TP : 5
  • FP : 4
  • FN : 1
  • Accuracy : 0.5
  • Auc : NaN
  • F1Score : NaN
  • Precision : 0
  • Negative Precision : 1
  • Negative Recall : 0.5

What should my test dataset look like for this sample training data in order to build a proper model?

3 Answers

My 2 cents: the number of records in this data set is very small. If we look at the data, the target variable is split exactly 50:50, which means the prior probability is one half. It’s like flipping a coin to get heads or tails.

The training set contains a known output, and the model learns from this data in order to generalize to other data later on. The independent variables and the dependent variable should be separated first, and then the train/test split applied.

You can use scikit-learn for this: from sklearn.model_selection import train_test_split
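A short sketch of how that import is typically used, applied to hypothetical features built from the question’s sample data (the column choices here are an assumption, not part of the answer):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical features/target mirroring the sample data in the question.
df = pd.DataFrame({
    "AppId": list(range(1, 11)),
    "Purchased": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})
X, y = df[["AppId"]], df["Purchased"]

# Shuffled 80/20 split; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (8, 1) (2, 1)
```

By default `train_test_split` shuffles the rows before splitting, which already avoids the purely positional split shown in the question.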

Answered by Sunil on April 5, 2021

If this is how your data looks, it would be a good idea to split by users rather than by raw rows: the training examples would contain users with information about all of their related apps, while in the test data you would give your model about 80% of a user’s information and the model would have to fill in the hidden 20%. In some cases you have to split data in a problem-specific way.
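One way this per-user hold-out idea could be sketched, assuming a hypothetical two-user version of the question’s data (the `per_user_holdout` helper and the second user are illustrative assumptions):

```python
import pandas as pd

# Hypothetical data for two users, mirroring the question's schema.
df = pd.DataFrame({
    "UserId": [1] * 10 + [2] * 10,
    "AppId": list(range(1, 11)) * 2,
    "Purchased": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0] * 2,
})

def per_user_holdout(group, test_frac=0.2, seed=0):
    # Randomly hide test_frac of each user's apps; the rest is training data.
    test_part = group.sample(frac=test_frac, random_state=seed)
    train_part = group.drop(test_part.index)
    return train_part, test_part

parts = [per_user_holdout(g) for _, g in df.groupby("UserId")]
train = pd.concat(p[0] for p in parts)
test = pd.concat(p[1] for p in parts)
print(len(train), len(test))  # 16 4
```

Because the hold-out is drawn per user, every user appears in both splits, so the model is always evaluated on users it has partially seen, which matches the "fill in the covered 20%" framing.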

Answered by quester on April 5, 2021

Assuming that the data set posted is just an illustrative example (and therefore so small):

The problem is that your test data has a very different distribution of the dependent variable compared to your training data (in your example split, it does not contain any examples of class 1).

When splitting the data into train and test sets, you need to include some randomness to fix this. However, with such a small data set that might still lead to very different empirical distributions between the training and test data.

What you can do is to apply a split which keeps the distribution of the target variable the same for the training and test data (i.e. both sets will have the same share of examples with y==1 and y==0).

Scikit-learn offers the stratify parameter for this (I copied your data to a CSV file):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

raw_data = pd.read_csv('binary classification examples')
data = pd.get_dummies(raw_data)
X = data[["UserId", "UserName_Lianne", "AppId"]]
y = data[["Purchased"]]
X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.2)

With stratify=y I have forced the split to keep the same share of each class of y in the train and test split:

>>> y_train
Out[40]: 
   Purchased
7          0
3          1
0          1
5          0
1          1
9          0
8          0
4          1

>>> y_test
Out[41]: 
   Purchased
2          1
6          0 

As you can see, both the training and the test data now contain 50% items with y==0 and 50% with y==1. With this data, a DecisionTreeClassifier can easily classify the training and test data correctly:

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
print("Train score: {}\tTest score: {}".format(
        model.score(X_train, y_train),
        model.score(X_test, y_test)))

Gives the following scores:

Train score: 1.0        Test score: 1.0

Answered by Sammy on April 5, 2021
