TransWikia.com

PySpark: How do I specify dropna axis in PySpark transformation?

Data Science Asked by Horbaje on January 14, 2021

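I would like to drop columns that contain only null values using dropna(). With pandas you can do this by setting the keyword argument axis='columns' in dropna() (together with how='all', since the default how='any' drops a column as soon as it contains a single null). Here is an example in a GitHub post.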
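As a rough illustration of the pandas behaviour described above, here is a minimal sketch. It reuses the column names from the example data further down and assumes `how="all"` is the intended setting; the variable name `cleaned` is made up for illustration.

```python
import pandas as pd

# Same shape as the example data in the question: one column that is entirely NaN.
df = pd.DataFrame({
    "furniture": [float("nan")] * 3,
    "myid": ["1-12", "0-11", "2-12"],
    "clothing": ["pants", "shoes", "socks"],
})

# axis="columns" makes dropna look at columns instead of rows;
# how="all" drops a column only if every value in it is missing.
cleaned = df.dropna(axis="columns", how="all")
print(list(cleaned.columns))
```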

How do I do this in PySpark? dropna() is available as a transformation in PySpark, but axis is not an accepted keyword argument.

Note: I do not want to transpose my dataframe for this to work.

How would I drop the furniture column from this DataFrame?

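import numpy as np
import pandas as pd

# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
data_2 = {'furniture': [np.nan, np.nan, np.nan], 'myid': ['1-12', '0-11', '2-12'], 'clothing': ["pants", "shoes", "socks"]}

df_1 = pd.DataFrame(data_2)
# Assumes an existing SparkSession named spark.
ddf_1 = spark.createDataFrame(df_1)
ddf_1.show()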

2 Answers

You should be able to drop the column by name on the Spark DataFrame:

ddf_1 = ddf_1.drop('furniture')

Answered by Nitish Sahay on January 14, 2021

I know this is a bit late, but I struggled with this as well. This is my attempt at removing all-null columns from a Spark DataFrame.

from pyspark.sql.functions import count

# count(c) ignores nulls, so a column that is entirely null has a count of 0.
non_null_counts = df.select([count(c).alias(c) for c in df.columns]).first().asDict()
names_of_null_cols = [name for name, n in non_null_counts.items() if n == 0]
df = df.drop(*names_of_null_cols)
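One caveat: Spark treats NaN and null as different values, and a pandas column of np.nan may arrive in Spark as NaN rather than null, depending on how the conversion is configured. The following is a minimal sketch, not a definitive implementation, that treats both as missing. The helper names `names_with_zero_count` and `drop_empty_columns` are made up for illustration, and the NaN check is applied only to float and double columns on the assumption that other types cannot hold NaN.

```python
def names_with_zero_count(counts):
    """Given {column name: number of usable values}, return the names to drop."""
    return [name for name, n in counts.items() if n == 0]


def drop_empty_columns(df):
    """Drop columns of a Spark DataFrame that hold only null or NaN values."""
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType, FloatType

    exprs = []
    for field in df.schema.fields:
        c = F.col(field.name)
        usable = c.isNotNull()
        # NaN only exists for floating-point columns, so check it only there.
        if isinstance(field.dataType, (DoubleType, FloatType)):
            usable = usable & ~F.isnan(c)
        # when(...) yields null for unusable values, which count() skips.
        exprs.append(F.count(F.when(usable, 1)).alias(field.name))

    # A single aggregation pass returns one row of per-column counts.
    counts = df.select(exprs).first().asDict()
    return df.drop(*names_with_zero_count(counts))
```

With the DataFrame from the question this would be called as `ddf_1 = drop_empty_columns(ddf_1)`. The aggregation triggers a full scan of the data, so on large tables it is worth running once and reusing the result.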

Answered by pm2020 on January 14, 2021
