
Python __hash__ performance for bulky data

Asked by user14848693 on Stack Overflow, February 7, 2021

Tech stack:

  • Python 3.8

When this function (which restructures data into an acceptable format) is executed for

  • a few hundred timestamps, it runs efficiently (0.022 s)
  • bulk input on the order of 100,000+ timestamps, it takes a long time (~40 seconds)

where the length of grouped_values is 250+.

from typing import Dict, List

def re_struct_data(all_timestamps: List, grouped_values: Dict[str, Dict[int, int]]):
    tm_count = len(all_timestamps)

    start_tm = 1607494871
    get_tms = lambda: [None] * tm_count  # fresh row of None placeholders
    data_matrix = {'runTime': get_tms()}

    for i_idx, tm in enumerate(all_timestamps):

        # Offset of this timestamp from the fixed start time
        data_matrix['runTime'][i_idx] = float(tm) - start_tm
        for cnl_nm in grouped_values:
            # Lazily create a row for each channel the first time it is seen
            if cnl_nm not in data_matrix:
                data_matrix[cnl_nm] = get_tms()

            value_dict = grouped_values[cnl_nm]
            if tm in value_dict:
                data_matrix[cnl_nm][i_idx] = value_dict[tm]
    return data_matrix

When I profiled this code, I found that a significant amount of time goes into hashing cnl_nm for the presence/absence check against data_matrix.
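
For context, every channel key is already inserted into data_matrix during the first timestamp iteration, so the check is redundant on all subsequent iterations. Below is a minimal sketch of hoisting the initialization out of the loop so the inner loop never probes data_matrix at all (the function name re_struct_data_preinit is hypothetical, not from my original code):

def re_struct_data_preinit(all_timestamps, grouped_values):
    tm_count = len(all_timestamps)
    start_tm = 1607494871

    # Create every row up front: one 'runTime' row plus one row per channel.
    # This removes the "cnl_nm not in data_matrix" lookup from the hot loop.
    data_matrix = {'runTime': [None] * tm_count}
    for cnl_nm in grouped_values:
        data_matrix[cnl_nm] = [None] * tm_count

    for i_idx, tm in enumerate(all_timestamps):
        data_matrix['runTime'][i_idx] = float(tm) - start_tm
        for cnl_nm, value_dict in grouped_values.items():
            if tm in value_dict:
                data_matrix[cnl_nm][i_idx] = value_dict[tm]
    return data_matrix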

I tried switching to

  • setdefault() -> since it does the same check under the hood (see the sketch after this list)
  • using .items() -> tuple conversion + unpacking

But both took more time.
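
For reference, the setdefault() attempt presumably looked something like the following (a guess at the change, not my original code). It does not help because setdefault() still hashes cnl_nm on every inner-loop pass, and its default argument is evaluated eagerly, so get_tms() allocates a throwaway list even when the key already exists:

def re_struct_data_setdefault(all_timestamps, grouped_values):
    tm_count = len(all_timestamps)
    start_tm = 1607494871
    get_tms = lambda: [None] * tm_count
    data_matrix = {'runTime': get_tms()}

    for i_idx, tm in enumerate(all_timestamps):
        data_matrix['runTime'][i_idx] = float(tm) - start_tm
        for cnl_nm, value_dict in grouped_values.items():
            # Same hash lookup as the 'in' check, plus an eager list allocation
            row = data_matrix.setdefault(cnl_nm, get_tms())
            if tm in value_dict:
                row[i_idx] = value_dict[tm]
    return data_matrix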

Any suggestions for improving this?
