TransWikia.com

ValueError trying to use a pickled scikit-learn model

Data Science Asked on December 5, 2020

I am new to data science and trying to learn. I trained a model that predicts with 98% accuracy and saved it with pickle. Now, when I try to predict with the loaded model, I get the error below.

trainFile=os.path.join(r'D:\PYPrograms','Data','POS','collected.csv')
#load the data
train  = pd.read_csv(trainFile)
dataTemp=train
nullInTrain=train.shape[0] - train.dropna().shape[0]
print("Null values in Train data "+str(nullInTrain))
dataTemp.columns = dataTemp.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')
dataTemp.loc[:,"title"] = dataTemp.title.apply(lambda x : " ".join(re.findall(r'[\w]+',x)))
df1 = dataTemp.dropna()
cv1 = CountVectorizer()
df_x = df1["tickettype"]+" "+df1["title"]
df_y = df1["type"]
X_train, X_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=0)
x_traincv = cv1.fit_transform(X_train)
x_testcv = cv1.transform(X_test)
clf = RandomForestClassifier(n_estimators = 1000, max_depth = 6)
clf.fit(x_traincv,y_train)
pred=clf.predict(x_testcv)
pred
#make prediction and check model's accuracy
predictions_test = clf.predict(x_testcv)
acc =  accuracy_score(np.array(y_test),predictions_test)
print ('The accuracy of Random Forest is {}'.format(acc))
import pickle
modelFile=os.path.join(r'D:\PYPrograms','Data','model2')
with open(modelFile, 'wb') as picklefile:
    pickle.dump(clf,picklefile)

with open(modelFile, 'rb') as training_model:
    model = pickle.load(training_model)

cv2 = CountVectorizer()
File=os.path.join(r'D:\PYPrograms','Data','POS','Report_one_wk08.csv')
data = pd.read_csv(File)
data.columns = dataTemp.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')
test = cv2.fit_transform(data['title'])
model.predict(test)

Error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-205-c0ac8462bce6> in <module>
----> 1 model.predict(test)

~\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\ensemble\forest.py in predict(self, X)
    543             The predicted classes.
    544         """
--> 545         proba = self.predict_proba(X)
    546 
    547         if self.n_outputs_ == 1:

~\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\ensemble\forest.py in predict_proba(self, X)
    586         check_is_fitted(self, 'estimators_')
    587         # Check data
--> 588         X = self._validate_X_predict(X)
    589 
    590         # Assign chunk of trees to jobs

~\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\ensemble\forest.py in _validate_X_predict(self, X)
    357                                  "call `fit` before exploiting the model.")
    358 
--> 359         return self.estimators_[0]._validate_X_predict(X, check_input=True)
    360 
    361     @property

~\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\tree\tree.py in _validate_X_predict(self, X, check_input)
    400                              "match the input. Model n_features is %s and "
    401                              "input n_features is %s "
--> 402                              % (self.n_features_, n_features))
    403 
    404         return X

ValueError: Number of features of the model must match the input. Model n_features is 6639 and input n_features is 3 

Data available at https://drive.google.com/open?id=1xaKKSXzpr7THezqU_8jycfvAueg0nnCQ

One Answer

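You have to use the same fitted CountVectorizer instance on all data. The model was trained on the 6639 columns produced by cv1, but cv2 is fitted from scratch on the new file, so it builds a different vocabulary and the feature counts no longer match. Transform new data with cv1 (transform, not fit_transform), and have a method to handle tokens that did not occur in the training sample.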
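
A minimal sketch of this idea, not the answerer's own code: fit the vectorizer once on the training text, pickle it together with the model, and call only transform on new text. The toy texts, labels and the dictionary keys are made up for illustration, and pickle.dumps/loads stands in for writing to a file as in the question. CountVectorizer.transform ignores tokens that are not in the fitted vocabulary, which is one simple way to handle unseen tokens.

```python
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the training titles and labels (hypothetical data).
train_texts = [
    "printer not working",
    "printer out of paper",
    "cannot login to account",
    "password reset for account",
]
train_labels = ["hardware", "hardware", "access", "access"]

# Fit the vectorizer once, on the training text only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

clf = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0)
clf.fit(X_train, train_labels)

# Persist the fitted vectorizer together with the model.
# (In-memory here; pickle.dump to a file works the same way.)
blob = pickle.dumps({"vectorizer": vectorizer, "model": clf})

# Later, possibly in another session: load both objects back.
loaded = pickle.loads(blob)

# New text contains words never seen in training ("jammed", "expired").
new_texts = ["printer jammed again", "login password expired"]

# transform (not fit_transform) reuses the training vocabulary,
# so the column count matches what the model expects.
X_new = loaded["vectorizer"].transform(new_texts)
predictions = loaded["model"].predict(X_new)

print(X_train.shape[1], X_new.shape[1])
print(predictions)
```

An alternative with the same effect is to wrap the vectorizer and the classifier in a single sklearn.pipeline.Pipeline and pickle that one object, so the two cannot get out of sync.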

Answered by Brian Spiering on December 5, 2020
