
Why is Spark's LinearRegressionWithSGD very slow locally?

Data Science Asked by Nikola Stojiljkovic on August 26, 2021

I have been trying for some time to run the linear regression with SGD found in Spark MLlib, and I am experiencing huge performance problems. All the examples I have looked at set the number of iterations to 100 and claim that Spark MLlib can run very fast on big data.

This is the code I am having problems with:

    def train(data: RDD[TrainFeature], scalerModel: StandardScalerModel): LinearRegressionModel = {
      val labeledData = data map { trainFeature =>
        LabeledPoint(trainFeature.relevance.value, toVector(trainFeature.feature, scalerModel))
      }
      labeledData.cache()

      val algorithm = new LinearRegressionWithSGD()
      algorithm.optimizer
        .setNumIterations(10)
        .setRegParam(0.01)
        .setStepSize(0.1)

      algorithm run labeledData
    }

    private def toVector(feature: Feature, scalerModel: StandardScalerModel): Vector = scalerModel transform toVector(feature)

    private def toVector(feature: Feature): Vector = Vectors dense feature.coordinates.toArray
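
For completeness, the scalerModel passed into train() is fit beforehand roughly as below (a minimal sketch; the withMean/withStd settings are assumptions on my part, and TrainFeature/toVector come from the code above):

    import org.apache.spark.mllib.feature.{StandardScaler, StandardScalerModel}
    import org.apache.spark.rdd.RDD

    // Fit the scaler on the raw (unscaled) feature vectors.
    // The withMean/withStd values are assumptions for this sketch.
    private def fitScaler(data: RDD[TrainFeature]): StandardScalerModel =
      new StandardScaler(withMean = true, withStd = true)
        .fit(data map { trainFeature => toVector(trainFeature.feature) })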

I scale the data first and then run the algorithm to train the model. Even with 10 iterations it takes around 10 minutes to train the model on 70,000 entries with feature vectors of size 2, and the results I get are not good at all. I only start getting decent results around setNumIterations(1000), but that would take ages.

Is it normal for linear regression with SGD to be this slow for 70,000 vectors of size 2?

My JVM's initial and maximum memory are set to 4g.
I have also tried setting the following (a desperate attempt): System.setProperty("spark.executor.memory", "3g")

I am running this locally, and since an ordinary linear regression written in MATLAB would finish the job very quickly, I am wondering what I am doing wrong.

Edit: When I look at the jobs section of the Spark UI, I see that it is creating far too many jobs for gradient descent. Is there a way to tell Spark to create very few jobs, i.e., not split the data and run everything in a single thread? Maybe that would help me debug the problem further.
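
For reference, my understanding is that forcing a single thread and a single partition would look roughly like this (a sketch; local[1] and coalesce are standard Spark API, but I have not verified that this helps):

    import org.apache.spark.{SparkConf, SparkContext}

    // Debugging setup: one local worker thread, one data partition,
    // so each gradient-descent iteration runs as a single task.
    val conf = new SparkConf()
      .setAppName("lr-sgd-debug")
      .setMaster("local[1]") // a single thread instead of one per core
    val sc = new SparkContext(conf)

    // ...build labeledData as in train() above, then:
    // val singlePartition = labeledData.coalesce(1).cache()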

One Answer

Spark is designed to be distributed across a cluster, and LinearRegressionWithSGD uses stochastic gradient descent (SGD) to optimize linear regression.

There is overhead from the cluster infrastructure (even when the "cluster" is a single local node). Also, SGD is an iterative method: every iteration is scheduled as a distributed job over the data, which is why you see so many jobs in the Spark UI.

Given that your problem is only 70,000 rows with 2 features, you would be better off with a single-node framework (e.g., scikit-learn) and ordinary least squares (OLS), a closed-form solution to linear regression. Those two changes will greatly speed up training.
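
If you would rather stay in Spark, the DataFrame-based spark.ml API also exposes a closed-form solver via the normal equations. A minimal sketch, assuming Spark 2.x and placeholder column names:

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("ols").getOrCreate()
    import spark.implicits._

    // Placeholder data: two feature columns and a label.
    val df = Seq(
      (1.0, 2.0, 5.0),
      (2.0, 1.0, 4.0),
      (3.0, 4.0, 11.0)
    ).toDF("x1", "x2", "label")

    val assembled = new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("features")
      .transform(df)

    // solver = "normal" solves the normal equations in closed form -- no SGD iterations.
    val model = new LinearRegression()
      .setSolver("normal")
      .fit(assembled)

    println(s"coefficients=${model.coefficients} intercept=${model.intercept}")

With 70,000 rows and 2 features, the normal equations are tiny, so either this or scikit-learn should train in well under a second.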

Answered by Brian Spiering on August 26, 2021
