
What are the alternatives to Python + Spark (pyspark)?

Data Science Asked by stackoverflower on December 14, 2020

I like Python, and I like Spark, but they don't fit together very well. In particular,

  1. it is hard to use Python functions in Spark, since every call has to cross the JVM/Python boundary (illustrated below)
  2. it is hard to debug PySpark, with Py4J sitting in the middle

So I wonder: are there any alternatives to PySpark that support Python natively instead of via an adapter layer?
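To make the first point concrete, here is a minimal sketch of calling a plain Python function on a Spark DataFrame column (the column names and toy data are made up for illustration). Every row is serialized in the JVM, shipped to a Python worker process, transformed, and serialized back:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])

    # A Python UDF: each invocation crosses the JVM/Python boundary,
    # which is where most of the overhead (and debugging pain) lives
    @udf(returnType=IntegerType())
    def double(x):
        return x * 2

    df.withColumn("y", double("x")).show()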


2 Answers

Try checking out Dask. It's a distributed computing library that is native to Python and builds on pandas and NumPy, so it is like using pandas with a thin wrapper for distributed computation.
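For example, a minimal sketch, assuming a local CSV file named data.csv with key and value columns (both names are made up for illustration):

    import dask.dataframe as dd

    # Looks exactly like pandas, but builds a lazy task graph
    df = dd.read_csv("data.csv")

    # Nothing runs until .compute(); the work is then split across
    # cores (or a cluster, if a distributed scheduler is configured)
    result = df.groupby("key")["value"].mean().compute()
    print(result)

The same code scales from a laptop to a cluster by pointing dask.distributed's Client at a scheduler address, which is the main appeal over rewriting your logic for Spark.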

Correct answer by avinash raghuthu on December 14, 2020

Try Parallel Python. https://www.parallelpython.com/

I use it for my bespoke data integrations, which can scale to multiple machines.

With the bespoke option, you have the flexibility to process data with whatever tools you like.

E.g. algorithmic processing with dataframes takes a long time, but if you use OpenCL or other GPU abstraction libraries, you can cut your processing time in half, provided you are willing to refactor and vectorise your algorithms.

It takes a while to build an "Integration Template" with Parallel Python, but it is worth it once you have it. A minimal sketch follows below.

You will be able to build many integrations. Whether you are distributing your data-pulling task, your data-pushing task, or your data-processing task, a bespoke strategy gives you options and flexibility, whereas an off-the-shelf integration framework tightly couples you to its product.
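For example, here is a minimal sketch of distributing a CPU-bound task with the classic pp API (the prime-counting workload and chunk size are made up for illustration; newer releases of pp also run on Python 3):

    import math
    import pp

    # Start a job server on all local CPUs; remote nodes can be added
    # with the ppservers argument, e.g. pp.Server(ppservers=("node1:60000",))
    job_server = pp.Server()

    def is_prime(n):
        if n < 2:
            return False
        return all(n % i for i in range(2, int(math.sqrt(n)) + 1))

    def count_primes(lo, hi):
        return sum(1 for n in range(lo, hi) if is_prime(n))

    # Submit each chunk as an independent job; depfuncs lists helper
    # functions the workers need, modules lists imports to replicate
    jobs = [
        job_server.submit(count_primes, (lo, lo + 25000),
                          depfuncs=(is_prime,), modules=("math",))
        for lo in range(0, 100000, 25000)
    ]

    # Each job object is a callable that blocks until its result is ready
    print(sum(job() for job in jobs))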

Answered by user40285 on December 14, 2020
