
Avoiding hardware limitations while competing on Kaggle?

Data Science, asked on July 12, 2021

I’ve learned machine learning via textbooks and examples, which don’t delve into the engineering challenges of working with “big-ish” data like Kaggle’s.

As a specific example, I’m working on the New York taxi trip challenge. It’s a regression task for ~ 2 million rows and 20 columns.

My laptop (4 GB of RAM) can barely handle EDA with pandas and matplotlib in a Jupyter Notebook. When I try to build a random forest with 1,000 trees, it hangs (the Jupyter kernel dies and restarts).

To combat this, I set up a desktop with 16 GB of RAM. I then ssh in, start a browser-less Jupyter Notebook kernel, and connect my local Notebook to that kernel. However, I still max out that machine.

At this point, I’m guessing that I need to run my model training code as a script.

  • Will this help with my machine hanging?
  • What’s the workflow to store the model results and use it later for prediction? Do you use a Makefile to keep this reproducible?
  • Doing this also sacrifices the interactivity of Jupyter Notebook — is there a workflow that maintains the interactivity?

My current toolkit is RStudio, Jupyter Notebook, and Emacs, but I'm willing to pick up new things.

One Answer

  • Yes: a plain Python script has less overhead than a Jupyter Notebook (see the sketch after this list).
  • Pickle is the standard way to store a scikit-learn model; see the model persistence documentation.
  • The two primary ways to scale Jupyter Notebooks are vertical scaling (rent a bigger machine from a cloud provider) and horizontal scaling (spin up a cluster).
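A minimal sketch of that workflow, assuming scikit-learn's RandomForestRegressor and joblib for persistence (the linked docs recommend pickle or joblib; the file names, the trip_duration target column, and the hyperparameters below are illustrative, and feature engineering is omitted):

    # train.py -- train as a standalone script and persist the fitted model
    import joblib
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    # Load the training data; in practice you would select/encode features first.
    df = pd.read_csv("train.csv")            # illustrative path
    X = df.drop(columns=["trip_duration"])   # assumed target column
    y = df["trip_duration"]

    # n_jobs=-1 uses all CPU cores; start with fewer trees than 1,000 and grow
    # only if the extra trees actually improve validation error.
    model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
    model.fit(X, y)

    # Persist the fitted model so another script or notebook can reuse it.
    joblib.dump(model, "rf_model.joblib")

A separate script (or a lightweight notebook, which preserves some of the interactivity the question asks about) can then reload the model for prediction without retraining:

    # predict.py -- load the persisted model and write predictions
    import joblib
    import pandas as pd

    model = joblib.load("rf_model.joblib")
    X_test = pd.read_csv("test.csv")         # illustrative path
    out = pd.DataFrame({"prediction": model.predict(X_test)})
    out.to_csv("submission.csv", index=False)

Invoking train.py and predict.py from Makefile targets is one simple way to keep the train-then-predict pipeline reproducible, as the question suggests.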

Answered by Brian Spiering on July 12, 2021
