TransWikia.com

Resampling train and test data in R

Data Science Asked by znoris007 on April 27, 2021

I need to try out a few different machine learning methods (SVM, logistic regression, etc.) to predict a value that is either true or false, and record the AUC and accuracy of their predictions.
I have already done that successfully. I now have two matrices, one for AUC and one for accuracy, each filled with the results from SVM and logistic regression (one row).

Now I need to build the SVM and logistic regression models 10 more times (I should use bootstrap sampling), so that I end up with 10 rows of AUC and accuracy data. I have read multiple articles, guides and tutorials, but I can't figure out how to achieve this. I also found and tried a few libraries (one is ROSE, the other is boot), and neither worked for me. If I understand the assignment correctly, I need to draw 10 different samples from my dataset and then separate each one into train and test sets, so I can compare the models' AUC and accuracy and see how good those models actually are.

As I said, I found multiple sources, and the best thing I came up with is this:

for (i in 1:10) {
  set.seed(123)
  ##########################
  ##########################
  boot.sample = sample(n, 1000, replace = TRUE)
  bootSample = dataset[boot.sample, ]
  bootSample

  split = sample.split(bootSample$blueWins, SplitRatio = 0.80)
  training = subset(bootSample, split == TRUE, replace = TRUE)
  test = subset(bootSample, split == FALSE, replace = TRUE)
  print(training)
}

But with this approach I think set.seed messes everything up, because the loop works with the same data every time. However, I think the assignment wants me to use the same seed for every machine learning model.

I may have overcomplicated the whole thing; I am new to R.

Hope someone can clear these things up.
Thanks

2 Answers

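Try using a different seed for each loop iteration. Resetting the same seed at the top of every pass makes sample() return the same rows each time. You can do it like this: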

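library(caTools)  # provides sample.split()

n <- nrow(dataset)   # assumption: n is the number of rows in dataset
my_seeds <- 1:10     # ten seeds, 1, 2, ..., 10; change to whatever you like
for (i in 1:10) {
  set.seed(my_seeds[i])   # a different seed each pass, so each sample differs
  ##########################
  ##########################
  # Draw 1000 row indices with replacement (the bootstrap sample)
  boot.sample <- sample(n, 1000, replace = TRUE)
  bootSample <- dataset[boot.sample, ]

  # 80/20 split of the bootstrap sample, keeping the blueWins class ratio
  split <- sample.split(bootSample$blueWins, SplitRatio = 0.80)
  training <- subset(bootSample, split == TRUE)
  test <- subset(bootSample, split == FALSE)
  print(training)
}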
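
To see why the seed placement matters, here is a minimal sketch in base R. It is only an illustration (the names same_seed and seed_once are made up for this example): it draws three small samples with the seed reset to the same value on every pass, and three more with the seed set once before the loop.

```r
# Same seed on every pass: the "random" draw is identical each time.
same_seed <- sapply(1:3, function(i) {
  set.seed(123)
  paste(sample(100, 5, replace = TRUE), collapse = " ")
})

# Seed set once, before the loop: each pass continues the random stream.
set.seed(123)
seed_once <- sapply(1:3, function(i) {
  paste(sample(100, 5, replace = TRUE), collapse = " ")
})

print(same_seed)   # three identical strings
print(seed_once)   # first string matches same_seed[1], later draws differ
```

Both variants are reproducible from run to run; only the first one repeats the same sample inside a single run.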

Answered by bstrain on April 27, 2021

You can set the seed once, outside the loop. The whole run stays reproducible, but each iteration draws a different sample:

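library(caTools)  # provides sample.split()

n <- nrow(dataset)   # assumption: n is the number of rows in dataset
set.seed(123)        # set once; every iteration continues the same random stream
for (i in 1:10) {
  ##########################
  ##########################
  # Draw 1000 row indices with replacement (the bootstrap sample)
  boot.sample <- sample(n, 1000, replace = TRUE)
  bootSample <- dataset[boot.sample, ]

  # 80/20 split of the bootstrap sample, keeping the blueWins class ratio
  split <- sample.split(bootSample$blueWins, SplitRatio = 0.80)
  training <- subset(bootSample, split == TRUE)
  test <- subset(bootSample, split == FALSE)
  print(training)
}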
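
To fill the 10 rows of results the question asks for, the loop can store one row per iteration. The following is a rough sketch under several assumptions, not the asker's actual setup: the data frame is synthetic, the predictors x1 and x2 are invented, auc_rank is a hypothetical helper that computes AUC from ranks so no extra package is needed, only logistic regression is fitted, and a plain random 80/20 split stands in for sample.split.

```r
# Hypothetical stand-in data; replace with the real dataset.
set.seed(123)                         # set once so the whole run is reproducible
n <- 500
dataset <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dataset$blueWins <- rbinom(n, 1, plogis(1.5 * dataset$x1 - dataset$x2))

# AUC from the rank-sum (Mann-Whitney) statistic; assumes both classes are present.
auc_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

n_rounds <- 10
results <- matrix(NA_real_, nrow = n_rounds, ncol = 2,
                  dimnames = list(NULL, c("accuracy", "auc")))

for (i in seq_len(n_rounds)) {
  # Bootstrap sample: draw n row indices with replacement.
  boot_idx <- sample(n, n, replace = TRUE)
  boot <- dataset[boot_idx, ]

  # Plain random 80/20 split (unstratified, unlike caTools::sample.split).
  train_idx <- sample(nrow(boot), size = round(0.8 * nrow(boot)))
  training <- boot[train_idx, ]
  test <- boot[-train_idx, ]

  # Logistic regression on the training part, scored on the test part.
  fit <- glm(blueWins ~ x1 + x2, data = training, family = binomial)
  prob <- predict(fit, newdata = test, type = "response")

  results[i, "accuracy"] <- mean((prob > 0.5) == (test$blueWins == 1))
  results[i, "auc"] <- auc_rank(prob, test$blueWins)
}

print(round(results, 3))
```

An SVM could be fitted inside the same loop and written to its own columns or a second matrix. Note that a bootstrap sample contains duplicated rows, so the same observation can end up in both training and test; testing on the rows that were not drawn is a common alternative if that matters for the assignment.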

Answered by Ruin Donas on April 27, 2021
