# Naive Bayes always predicting the same label

Data Science Asked on January 2, 2022

I have been trying to write a naive Bayes classifier from scratch that is supposed to predict the class label of the nominal car.arff dataset. However, the classifier always predicts the most common class. I have tried log probabilities and Laplace correction, both to no avail. I have also noticed that the conditional probabilities for any attribute are always greatest for the most common label. Is this because of my dataset? What can be done about it?

Here is my code:

import numpy as np
import pandas as pd
from scipy.io import arff

def parser(path):
    """
    Function which parses the data from an arff file.
    @param path: string containing the path to the file
    @return: array of per-instance dictionaries and the list of attributes
    @raise FileNotFoundError: in case the path does not point to a valid file
    """

    start = 0  # set to 1 once the @data section has been reached
    # Section names as constants to avoid misspelling in code
    RELATION = 'relation'
    ATTRIBUTE = 'attribute'
    DATA = 'data'

    # Create dictionary holding the arff information
    data = {RELATION: [],
            ATTRIBUTE: [],
            DATA: []}

    # Read the file and analyse the data
    with open(path) as file:
        for line in file:
            # Skip empty lines
            if line.strip() == '':
                continue

            # Check if the line contains the relation
            elif '@' + RELATION in line:
                data[RELATION].append(line.replace('@' + RELATION, '').strip())

            # Check if the line contains an attribute
            elif line.startswith('@attribute'):
                tmp = line.replace("{", "").replace("}", "").replace("\n", "").replace("'", "")
                # Check whether whitespace occurs between the commas of the value list
                if len(tmp.split(" ")) > 3:
                    values = tmp.replace(",", "").split(" ")[2:]
                else:
                    values = tmp.split(" ")[2].split(",")

                data[ATTRIBUTE].append({'name': tmp.split(" ")[1], 'values': values})

            # Check if @data has been reached
            elif '@' + DATA in line:
                start = 1

            # If the line is none of the others, it has to be data
            elif start:
                line = line.split(',')
                # Strip each element of the line
                for i in range(len(line)):
                    line[i] = line[i].strip()
                data[DATA].append(line)

    # Turn each data row into a dictionary keyed by attribute name
    attributes = data[ATTRIBUTE]
    out = []
    for i in range(len(data[DATA])):
        data_dict = {}
        for j in range(len(attributes)):
            data_dict[attributes[j]['name']] = data[DATA][i][j]
        out.append(data_dict)
    out = np.array(out)
    return out, data[ATTRIBUTE]

class NaiveBayes():

    def __init__(self, data, atts, class_label):
        self.data = data
        self.atts = atts
        self.class_label = class_label

    def prior(self):
        # Count the instances per class, then normalise by the dataset size
        prior_probabilities = [0, 0, 0, 0]
        for i in range(len(self.data)):
            if self.data[i]['class'] == 'unacc': prior_probabilities[0] += 1
            if self.data[i]['class'] == 'acc': prior_probabilities[1] += 1
            if self.data[i]['class'] == 'good': prior_probabilities[2] += 1
            if self.data[i]['class'] == 'vgood': prior_probabilities[3] += 1
        prior_probabilities = [x / len(self.data) for x in prior_probabilities]

        return prior_probabilities

    def conditionalProbability(self, key, value, length):
        # Returns (in our case) a 4-vector for one attribute with the
        # probabilities for each outcome
        conditional_probabilities = [0] * length
        # Definitely not the most efficient way
        for i in range(len(self.data)):
            if self.data[i][key] == value:
                if self.data[i]['class'] == 'unacc': conditional_probabilities[0] += 1
                if self.data[i]['class'] == 'acc': conditional_probabilities[1] += 1
                if self.data[i]['class'] == 'good': conditional_probabilities[2] += 1
                if self.data[i]['class'] == 'vgood': conditional_probabilities[3] += 1

        s = np.sum(conditional_probabilities)
        conditional_probabilities = [x / s for x in conditional_probabilities]

        return conditional_probabilities

    def classification(self, instance):

        cprobs = []
        probs = self.prior()
        # Get the conditional probability vector for every attribute
        for key in instance.keys():
            cprobs.append(self.conditionalProbability(key, instance[key], 4))
        print(cprobs)

        predicted_class = "unacc"

        # Multiply the prior with the conditionals; the last entry belongs
        # to the class attribute itself, so it is skipped
        for i in range(len(cprobs) - 1):
            for j in range(4):
                probs[j] *= cprobs[i][j]

        # print(instance)
        print(probs)

        return probs.index(max(probs))

raw, atts = parser('car.arff')
class_attribute = 'class'

classifier = NaiveBayes(raw, atts, class_attribute)
print(classifier.data)
print(classifier.prior())

# classification expects a single instance, not the whole dataset
print(classifier.classification(classifier.data[0]))
'''
results = [0, 0, 0, 0]
for i in range(len(classifier.data)):
    results[classifier.classification(classifier.data[i])] += 1
print(results)
'''


Here is the relevant part of the arff header, which describes the dataset:

% 5. Number of Instances: 1728
%    (instances completely cover the attribute space)
%
% 6. Number of Attributes: 6
%
% 7. Attribute Values:
%
%    buying       v-high, high, med, low
%    maint        v-high, high, med, low
%    doors        2, 3, 4, 5-more
%    persons      2, 4, more
%    lug_boot     small, med, big
%    safety       low, med, high
%
% 8. Missing Attribute Values: none
%
% 9. Class Distribution (number of instances per class)
%
%    class      N          N[%]
%    -----------------------------
%    unacc     1210     (70.023 %)
%    acc        384     (22.222 %)
%    good        69     ( 3.993 %)
%    v-good      65     ( 3.762 %)


and here is some sample data:

low,low,5more,more,small,low,unacc
low,low,5more,more,small,med,acc
low,low,5more,more,small,high,good
low,low,5more,more,med,low,unacc
low,low,5more,more,med,med,good
low,low,5more,more,med,high,vgood
low,low,5more,more,big,low,unacc
low,low,5more,more,big,med,good
low,low,5more,more,big,high,vgood


The complete dataset can be found here

Looking at your class distribution, it is heavily imbalanced, and this can skew the model towards predicting the majority class, which in this case is 'unacc'. So, one recommendation would be to balance out the classes, typically by oversampling the minority classes until they match the size of the majority class, as sketched below.
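For instance, here is a minimal oversampling sketch using pandas and sklearn.utils.resample; the DataFrame df and the helper oversample are hypothetical names, and it assumes the instances have been loaded into a DataFrame with a 'class' column as in the question:

import pandas as pd
from sklearn.utils import resample

def oversample(df, label_col='class', random_state=0):
    # The size of the largest class is the target size for every class
    majority_size = df[label_col].value_counts().max()

    balanced_parts = []
    for label, group in df.groupby(label_col):
        # Sample with replacement until each class reaches the majority size
        balanced_parts.append(resample(group,
                                       replace=True,
                                       n_samples=majority_size,
                                       random_state=random_state))
    return pd.concat(balanced_parts).reset_index(drop=True)

# Hypothetical usage, e.g. df = pd.DataFrame(list(raw)) from the parser above:
# balanced = oversample(df)

Note that sampling with replacement duplicates minority rows, so it should only ever be applied to the training split, never to the evaluation data.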

Also, looking at your sample data, there seems to be little, if any, variation in the buying, maint, doors and persons features, so it looks like these features would not impact the classification decision.

In this case, I would go back to exploring the data and seeing which features could affect the classification decision. This can be done with bar plots and histograms. When doing this, divide the data into the classes and plot the distribution of each feature by class, so you can see whether there is any noticeable variation in the distribution of these features across classes, as in the sketch below.
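As a sketch of this kind of per-class exploration, assuming again a DataFrame df with the attribute columns from the arff header (plot_feature_by_class is a hypothetical helper):

import pandas as pd
import matplotlib.pyplot as plt

def plot_feature_by_class(df, feature, label_col='class'):
    # Cross-tabulate the feature's value counts against the class labels
    counts = pd.crosstab(df[feature], df[label_col])
    # Normalise each class column so differently sized classes stay comparable
    proportions = counts / counts.sum(axis=0)
    proportions.plot(kind='bar')
    plt.title(f'Distribution of {feature} by class')
    plt.ylabel('proportion within class')
    plt.show()

# Hypothetical usage:
# plot_feature_by_class(df, 'safety')

If a feature such as safety shows clearly different value distributions across the classes, it is informative; if all classes show roughly the same distribution, the feature contributes little to the decision.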

Answered by shepan6 on January 2, 2022
