Issues with pandas chunk merge

Data Science · Asked by Sany on June 22, 2021

I'm trying to solve a Kaggle competition.

Since the data is too large to fit in memory at once, I'm trying to clean and process it in chunks and save the results back to disk. This is the code I'm using:

import json
from ast import literal_eval

import pandas as pd
from pandas.io.json import json_normalize  # pd.json_normalize in pandas >= 1.0


# Process individual chunks to normalize JSON columns and remove unwanted columns
def process_chunk(df):
    # parse the stringified list in 'hits', keep only its first entry,
    # and drop rows where nothing is left
    df['hits'] = df['hits'].apply(literal_eval)
    df['hits'] = df['hits'].str[0]
    df = df[pd.notnull(df['hits'])]

    # same treatment for 'customDimensions'
    df['customDimensions'] = df['customDimensions'].apply(literal_eval)
    df['customDimensions'] = df['customDimensions'].str[0]
    df = df[pd.notnull(df['customDimensions'])]

    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource', 'hits', 'customDimensions']

    # flatten each JSON column into its own dataframe and merge it back in
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df.drop(column, axis=1, inplace=True)
        df = df.merge(column_as_df, right_index=True, left_index=True)

    # remove some columns to reduce size
    # we only need a few hits columns, I suppose
    want_hits_col = ['hits.type', 'hits.hour', 'hits.hitNumber']
    all_hits = df[df.columns[pd.Series(df.columns).str.startswith('hits')]].columns
    hits_columns_remove = list(set(all_hits) - set(want_hits_col))

    df = df.drop(hits_columns_remove, axis=1)
    return df



# divide the data into chunks and save processed chunks incrementally
def save_parse(csv_path, chunksize=500):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']

    chunks = pd.read_csv(csv_path,
                         converters={column: json.loads for column in JSON_COLUMNS},
                         dtype={'fullVisitorId': 'str'},  # Important!!
                         chunksize=chunksize)
    chunk_counter = 1
    for chunk in chunks:
        processed = process_chunk(chunk)
        # show me the progress
        print("Processed chunk number - {} with shape {}".format(chunk_counter, processed.shape))
        chunk_counter += 1
        # save to csv
        processed.to_csv('train_chunks_appended.csv', mode='a')
    print("Done parsing")

For the first chunk, the process_chunk method outputs the correct result (df.shape = (211, 59)). Strangely, from the second chunk onwards, the chunks are not processed properly.

I think there is a problem with merging the normalized JSON columns back into the dataframe at this line:

df = df.merge(column_as_df, right_index=True, left_index=True)

which is why the shape becomes (0, 27) for the rest of the chunks. I am not able to figure out why this is happening.
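To make my suspicion concrete, here is a minimal sketch with hypothetical toy data (not the competition CSV): json_normalize returns a frame with a fresh RangeIndex starting at 0, while every chunk after the first keeps the row labels it had in the original file (500, 501, ... for chunksize=500), so an index-on-index merge would find no common labels:

import pandas as pd

# simulate the second chunk: row labels continue where the first chunk ended
chunk = pd.DataFrame(
    {'device': [{'browser': 'Chrome'}, {'browser': 'Safari'}]},
    index=[500, 501],
)

normalized = pd.json_normalize(chunk['device'])
print(normalized.index)  # RangeIndex(start=0, stop=2, step=1)

merged = chunk.drop('device', axis=1).merge(
    normalized, left_index=True, right_index=True
)
print(merged.shape)  # (0, 1): no overlapping index labels, so the result is empty

If that is really the cause, I suppose resetting each chunk's index (e.g. with reset_index(drop=True)) before normalizing would realign them, but I'd like to understand what is actually going on. Any pointers would be great.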
