TransWikia.com

Is there any way to read an xlsx file in PySpark? I also want to read the strings of a column by column name

Data Science Asked by shalu on December 4, 2020

The pandas module (pd) is one way of reading Excel files, but it is not available on my cluster. I want to read Excel without the pd module. Code 1 and Code 2 are the two implementations I want in PySpark.

Code 1: Reading Excel

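import pandas as pd

pdf = pd.read_excel("Name.xlsx")  # the file name must be a quoted string
sparkDF = sqlContext.createDataFrame(pdf)
df = sparkDF.rdd.map(list)  # RDD with one list per row
type(df)

I want to implement this without the pandas module.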

Code 2: gets list of strings from column colname in dataframe df

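# The original snippet showed only the body; the function name is illustrative.
def get_strings(df, colname):
    stringsList = []
    columnList = list(df[colname])
    for value in columnList:
        if type(value) != float:  # empty cells arrive as float NaN
            stringsList.append(value.lower())
        else:
            stringsList.append(u'')
    return stringsList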

I want to implement this in pyspark.

2 Answers

Is pandas itself available on the cluster? If so, you can use its built-in read_excel().

You may also try the HadoopOffice library. It contains a Spark DataSource and is also available as a Spark Package, so you can test it without installing anything:

$SPARK_HOME/bin/pyspark --packages com.github.zuinnote:spark-hadoopoffice-ds_2.11:1.0.4

Some people also recommend the Spark Excel dependency.
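
If no extra package can be installed at all, it may help to know that an .xlsx file is a zip archive of XML parts, so simple sheets can be read with the Python standard library alone. The following is only a minimal sketch under several assumptions: the function name read_xlsx_rows is made up for illustration, the worksheet is assumed to live at xl/worksheets/sheet1.xml, numeric cells come back as strings, and cells that Excel omitted are not padded back into position. Dates, formulas and inline strings are not handled.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the SpreadsheetML parts inside an .xlsx archive.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = {"m": MAIN_NS}


def read_xlsx_rows(data, sheet="xl/worksheets/sheet1.xml"):
    """Return the rows of one worksheet as lists of strings (or None).

    data  -- the raw bytes of an .xlsx file
    sheet -- archive path of the worksheet (assumed default, not guaranteed)
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        # Text cells store an index into a shared string table.
        shared = []
        if "xl/sharedStrings.xml" in zf.namelist():
            sst = ET.fromstring(zf.read("xl/sharedStrings.xml"))
            for si in sst.findall("m:si", NS):
                # A string item may be split into several <t> runs.
                shared.append("".join(t.text or "" for t in si.iter("{%s}t" % MAIN_NS)))

        rows = []
        sheet_root = ET.fromstring(zf.read(sheet))
        for row in sheet_root.iterfind("m:sheetData/m:row", NS):
            values = []
            for cell in row.findall("m:c", NS):
                v = cell.find("m:v", NS)
                if v is None:
                    values.append(None)                 # empty or unsupported cell
                elif cell.get("t") == "s":
                    values.append(shared[int(v.text)])  # shared-string lookup
                else:
                    values.append(v.text)               # number etc., kept as text
            rows.append(values)
        return rows
```

On a cluster, one possible way to feed this function is sc.binaryFiles(path), which yields (file name, bytes) pairs. Each workbook is then parsed inside a single task, so this only suits files that fit comfortably in one executor's memory.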

Answered by Dominik on December 4, 2020

You need the crealytics spark-excel jar. Download the jar and add it to your Spark classpath.

Try this, it should help:

def get_df_from_excel(sqlContext, file_name):
    """
    Create a dataframe from an Excel file.
    :param sqlContext: sqlContext
    :param file_name: path of the Excel file
    :return: dataframe
    """
    # The parentheses let the chained calls span several lines.
    # Option names vary between spark-excel releases (newer ones use
    # "header" instead of "useHeader"); check the README of your version.
    return (sqlContext.read.format("com.crealytics.spark.excel")
        .option("useHeader", "true")
        .option("treatEmptyValuesAsNulls", "true")
        .option("inferSchema", "true")
        .option("addColorColumns", "false")
        .option("maxRowsInMemory", 2000)
        .option("sheetName", "Import")
        .load(file_name))
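
For the second part of the question (Code 2), something along these lines might work once the Excel file is loaded as a Spark DataFrame. It is a minimal sketch, not a tested cluster recipe: the names to_lower_or_empty and get_strings_from_column are hypothetical, Python 3 is assumed, and the column is assumed to be small enough to collect to the driver. It also differs slightly from Code 2: any non-string value (None from an empty cell, numbers) becomes an empty string, instead of only floats.

```python
def to_lower_or_empty(value):
    """Lowercase strings; map anything else (e.g. None from an empty cell) to u''."""
    return value.lower() if isinstance(value, str) else u''


def get_strings_from_column(df, colname):
    """Collect one column of a Spark DataFrame as a list of lowercased strings.

    Assumes df is a pyspark.sql.DataFrame whose column fits in driver memory.
    """
    rows = df.select(colname).collect()  # list of Row objects with one field each
    return [to_lower_or_empty(row[0]) for row in rows]
```

With the reader above this would be called as stringsList = get_strings_from_column(get_df_from_excel(sqlContext, file_name), colname). If the result should stay distributed instead of being collected, pyspark.sql.functions offers lower and coalesce, which can express the same rule on the column itself.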

Answered by Rahul on December 4, 2020
