
How can I improve the accuracy of my model? (Cab Cancellation Prediction)

Data Science · Asked by frigidelirium on July 30, 2021

I am trying to predict, based on several parameters like trip type, car type, source of booking, start time, lead time (start − book), and a few other params, whether or not a customer will cancel. From the code below, default.ct, the first classification tree I fit, gives me an accuracy of 75%. deeper.ct, the deeper tree that I generate, gives me an accuracy of 70%. The accuracy of the pruned tree also stays roughly the same across cp values. Boosting with the adabag package takes far too long because I have nearly 500,000 observations across 19 variables. xgboost gives me the best mlogloss value, at about 0.43.

What can I do to improve the accuracy of the model?

    library(rpart)        # classification trees
    library(rpart.plot)   # prp() for tree plots
    library(caret)        # confusionMatrix()

    # Generate classification tree
    default.ct <- rpart(tag ~ ., data = train.df, method = "class",
                        control = rpart.control(minsplit = 2, minbucket = 1, cp = 0.001))
    summary(default.ct)   # detailed split and node information
    printcp(default.ct)   # complexity-parameter table

    # plot the tree
    prp(default.ct, type = 1, extra = 1, under = TRUE, split.font = 1,
        varlen = -10)

    # generate confusion matrix for training data
    default.ct.point.pred.train <- predict(default.ct, train.df, type = "class")
    confusionMatrix(default.ct.point.pred.train, train.df$tag)

    # grow a much deeper tree
    deeper.ct <- rpart(tag ~ ., data = train.df, method = "class", cp = 0,
                       minsplit = 1)
    # count number of leaves
    length(deeper.ct$frame$var[deeper.ct$frame$var == "<leaf>"])

    ## Use cross-validation to prune the tree
    cv.ct <- rpart(tag ~ ., data = train.df, method = "class", cp = 0,
                   minsplit = 5, xval = 5)
    # print the complexity-parameter table
    printcp(cv.ct)

    # store the validation accuracy for each cp value in acc
    acc <- list()

    for (i in 1:nrow(cv.ct$cptable)) {
      pruned.ct <- prune(cv.ct, cp = cv.ct$cptable[i, "CP"])
      pruned.ct.point.pred.valid <- predict(pruned.ct, valid.df, type = "class")
      acc[i] <- confusionMatrix(pruned.ct.point.pred.valid, valid.df$tag)$overall[1]
    }
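
A natural follow-up, not in the original code, is to prune at the cp value with the highest validation accuracy instead of hard-coding a row of the table:

    # pick the cp whose pruned tree scored best on the validation set
    best.i <- which.max(unlist(acc))
    best.ct <- prune(cv.ct, cp = cv.ct$cptable[best.i, "CP"])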


    # prune the tree with the second-largest cp and use it to predict validation data
    pruned.ct <- prune(cv.ct, cp = cv.ct$cptable[2, "CP"])
    length(pruned.ct$frame$var[pruned.ct$frame$var == "<leaf>"])
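
For reference, a minimal sketch of the xgboost run mentioned above, assuming tag is a two-level factor and using binary:logistic with logloss (the mlogloss value quoted suggests a multi:softprob setup, which works the same way); the one-hot encoding step is an assumption, not part of the original code:

    library(xgboost)

    # assumption: one-hot encode factor predictors into numeric design matrices
    X.train <- model.matrix(tag ~ . - 1, data = train.df)
    X.valid <- model.matrix(tag ~ . - 1, data = valid.df)
    dtrain  <- xgb.DMatrix(X.train, label = as.numeric(train.df$tag) - 1)
    dvalid  <- xgb.DMatrix(X.valid, label = as.numeric(valid.df$tag) - 1)

    # track logloss on the validation set and stop when it stops improving
    params <- list(objective = "binary:logistic", eval_metric = "logloss",
                   max_depth = 6, eta = 0.1)
    bst <- xgb.train(params, dtrain, nrounds = 200,
                     watchlist = list(valid = dvalid),
                     early_stopping_rounds = 20)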

One Answer

  1. Create additional variables. E.g., the gap between booking time and start time can become a time-to-book feature.
  2. Reduce variables with many classes, if present (part of EDA).
  3. Standardize numeric variables: (val − mean) / sigma.
  4. A single tree is a very weak classifier; you will have to do bagging or boosting (e.g., AdaBoost, GBM, or random forest). A sketch of points 3 and 4 follows this list.
  5. Try parameter tuning. I am not pasting any links since I don't know the policy on linking here, but just search "GBM parameter tuning" on Google and you'll get multiple results.
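
A minimal sketch of points 3 and 4, assuming train.df and valid.df as in the question, a factor target tag, and the randomForest package; the numeric-column handling is an assumption:

    library(randomForest)

    # assumption: standardize numeric columns using training-set means/sds
    num.cols <- names(train.df)[sapply(train.df, is.numeric)]
    mu    <- sapply(train.df[num.cols], mean)
    sigma <- sapply(train.df[num.cols], sd)
    train.df[num.cols] <- scale(train.df[num.cols], center = mu, scale = sigma)
    valid.df[num.cols] <- scale(valid.df[num.cols], center = mu, scale = sigma)

    # bagged tree ensemble; much stronger than a single rpart tree
    # note: randomForest only accepts factors with <= 53 levels (hence point 2)
    rf <- randomForest(tag ~ ., data = train.df, ntree = 200)
    confusionMatrix(predict(rf, valid.df), valid.df$tag)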

You have mentioned running-time issues when you tried ensemble methods. To address that:

  1. Take a subset of the data and try the ensemble on that; once you have finalized the model, run it on the whole dataset.
  2. If you are using XGBoost, you can pass a previously trained model back in as a starting point, so you can train in batches (see the sketch below).
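
A minimal sketch of that batch-wise training, using the xgb_model argument of xgb.train; X, y, and the five-way split are assumptions for illustration:

    library(xgboost)

    params <- list(objective = "binary:logistic", eval_metric = "logloss")

    # assumption: X is the full numeric design matrix, y the 0/1 label vector
    chunk.id <- ceiling(seq_len(nrow(X)) / ceiling(nrow(X) / 5))
    batches  <- split(seq_len(nrow(X)), chunk.id)

    model <- NULL
    for (idx in batches) {
      dbatch <- xgb.DMatrix(X[idx, ], label = y[idx])
      # xgb_model = NULL trains from scratch; otherwise continues from model
      model <- xgb.train(params, dbatch, nrounds = 50, xgb_model = model)
    }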

Answered by Rohan on July 30, 2021
