
Repeated filtering on Spark Dataframe?

Stack Overflow Asked by Jeff Gong on November 16, 2021

I have a large Spark dataframe that contains a variety of financial information, say the schema looks something like:

Amount

account_id | amount
0            10.00
1            15.15
...

I have another large Spark dataframe that contains payment information, where each row represents a payment towards that amount. For example:

Payment

account_id | paid_amount
0            5.00
0            5.00
1            15.15

What I want to do is to iterate through each unique ID in my amount dataframe, and one at a time, filter out the payments that were associated with that ID in order to perform some other calculations.
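In code, the pattern I have in mind looks roughly like this (names are illustrative):

account_ids = [row.account_id for row in amount_df.select("account_id").distinct().collect()]

for acct in account_ids:
    acct_payments = payment_df.filter(payment_df.account_id == acct)
    # ... perform the other calculations on acct_payments ...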

Is this a slow operation or ill-advised? It seems like looping through all these Account IDs in a linear fashion is throwing away a lot of the optimization that Spark is providing.

What would be a better alternative, should one exist?

Thanks!

2 Answers

Here's how you can approach this problem.

Make both DataFrames as small as possible (e.g. perhaps run payments_df.groupBy("account_id").sum()), write them out to disk, and see which one is smaller.
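A minimal sketch of that pre-aggregation, assuming the payments DataFrame is called payments_df:

from pyspark.sql import functions as F

# Collapse payments to one row per account before joining
payments_agg = (
    payments_df
    .groupBy("account_id")
    .agg(F.sum("paid_amount").alias("total_paid"))
)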

If one of the DataFrames is small enough to be broadcast, just do a broadcast join with big_df.join(broadcast(small_df), "id", "inner").
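For example, assuming the pre-aggregated payments side is the smaller one and the join key is account_id (as in the question's schemas):

from pyspark.sql.functions import broadcast

# Broadcasting the small side ships it to every executor and avoids
# shuffling the large DataFrame
joined = amount_df.join(broadcast(payments_agg), "account_id", "inner")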

If you can use Spark 3, try the join and see if Adaptive Query Execution gives you the performance you need.
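Adaptive Query Execution is on by default from Spark 3.2 onward; on earlier Spark 3.x versions it can be enabled explicitly:

# Enable Adaptive Query Execution (default in Spark 3.2+)
spark.conf.set("spark.sql.adaptive.enabled", "true")

joined = amount_df.join(payments_agg, "account_id", "inner")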

If that's not good enough, look at optimizing a sort merge join.
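One common way to speed up a sort merge join is to bucket and sort both tables on the join key when writing them out, so the join can skip the shuffle and sort phases. A sketch, with made-up table names:

(amount_df.write
    .bucketBy(200, "account_id")
    .sortBy("account_id")
    .saveAsTable("amounts_bucketed"))

(payments_df.write
    .bucketBy(200, "account_id")
    .sortBy("account_id")
    .saveAsTable("payments_bucketed"))

joined = spark.table("amounts_bucketed").join(spark.table("payments_bucketed"), "account_id")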

Definitely don't iterate over rows in a DataFrame one-by-one!

Answered by Powers on November 16, 2021

Join both DataFrames and perform your calculations on the joined result:

df = amount_df.join(payment_df, 'account_id', 'inner')
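For example, the "other calculations" from the question can then be done set-wise on the joined result; a sketch, assuming the goal is a remaining balance per account:

from pyspark.sql import functions as F

# Remaining balance per account, computed in one pass instead of a per-ID loop
balance = (
    df.groupBy("account_id", "amount")
      .agg(F.sum("paid_amount").alias("total_paid"))
      .withColumn("remaining", F.col("amount") - F.col("total_paid"))
)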

Answered by Shubham Jain on November 16, 2021
