
Repeated filtering on Spark Dataframe?

Stack Overflow Asked by Jeff Gong on November 16, 2021

I have a large Spark dataframe that contains a variety of financial information, say the schema looks something like:

Amount

account_id | amount
0            10.00
1            15.15
...

I have another large Spark dataframe that contains payment information, where each row represents a payment towards that amount. For example:

Payment

account_id | paid_amount
0            5.00
0            5.00
1            15.15

What I want to do is to iterate through each unique ID in my amount dataframe, and one at a time, filter out the payments that were associated with that ID in order to perform some other calculations.
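In code, the pattern I have in mind looks roughly like this (names are illustrative):

account_ids = [row.account_id for row in amount_df.select("account_id").distinct().collect()]

for acct in account_ids:
    acct_payments = payment_df.filter(payment_df.account_id == acct)
    # ... perform the other calculations on acct_payments ...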

Is this a slow operation or ill-advised? It seems like looping through all these Account IDs in a linear fashion is throwing away a lot of the optimization that Spark is providing.

What would be a better alternative, should one exist?

Thanks!

2 Answers

Here's how you can approach this problem.

Make both DataFrames as small as possible (e.g. perhaps run payments_df.groupBy("account_id").sum()), write them out to disk, and see which one is smaller.
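A minimal sketch of that pre-aggregation, assuming the payments DataFrame is called payments_df:

from pyspark.sql import functions as F

# Collapse payments to one row per account before joining
payments_agg = (
    payments_df
    .groupBy("account_id")
    .agg(F.sum("paid_amount").alias("total_paid"))
)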

If one of the DataFrames is small enough to be broadcast, just do a broadcast join with big_df.join(broadcast(small_df), "id", "inner").
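For example, assuming the pre-aggregated payments side is the smaller one and the join key is account_id (as in the question's schemas):

from pyspark.sql.functions import broadcast

# Broadcasting the small side ships it to every executor and avoids
# shuffling the large DataFrame
joined = amount_df.join(broadcast(payments_agg), "account_id", "inner")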

If you can use Spark 3, try the join and see if Adaptive Query Execution gives you the performance you need.
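Adaptive Query Execution is on by default from Spark 3.2 onward; on earlier Spark 3.x versions it can be enabled explicitly:

# Enable Adaptive Query Execution (default in Spark 3.2+)
spark.conf.set("spark.sql.adaptive.enabled", "true")

joined = amount_df.join(payments_agg, "account_id", "inner")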

If that's not good enough, look at optimizing a sort merge join.
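One common way to speed up a sort merge join is to bucket and sort both tables on the join key when writing them out, so the join can skip the shuffle and sort phases. A sketch, with made-up table names:

(amount_df.write
    .bucketBy(200, "account_id")
    .sortBy("account_id")
    .saveAsTable("amounts_bucketed"))

(payments_df.write
    .bucketBy(200, "account_id")
    .sortBy("account_id")
    .saveAsTable("payments_bucketed"))

joined = spark.table("amounts_bucketed").join(spark.table("payments_bucketed"), "account_id")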

Definitely don't iterate over rows in a DataFrame one-by-one!

Answered by Powers on November 16, 2021

Join both DataFrames and perform your calculations on the joined result:

df = amount_df.join(payment_df, 'account_id', 'inner')
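For example, the "other calculations" from the question can then be done set-wise on the joined result; a sketch, assuming the goal is a remaining balance per account:

from pyspark.sql import functions as F

# Remaining balance per account, computed in one pass instead of a per-ID loop
balance = (
    df.groupBy("account_id", "amount")
      .agg(F.sum("paid_amount").alias("total_paid"))
      .withColumn("remaining", F.col("amount") - F.col("total_paid"))
)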

Answered by Shubham Jain on November 16, 2021
