
Python __hash__ performance for bulky data

Asked by user14848693 on Stack Overflow, February 7, 2021

Tech stack:

  • Python 3.8

When this function (which restructures data into an acceptable format) is executed for

  • a few hundred timestamps, it runs efficiently (0.022 s)
  • bulk input on the order of 100,000+ timestamps, it takes a long time (~40 seconds)

where the length of grouped_values is 250+.

from typing import Dict, List

def re_struct_data(all_timestamps: List, grouped_values: Dict[str, Dict[int, int]]):
    tm_count = len(all_timestamps)

    start_tm = 1607494871
    get_tms = lambda: [None] * tm_count  # fresh row of None placeholders
    data_matrix = {'runTime': get_tms()}

    for i_idx, tm in enumerate(all_timestamps):

        # Offset of this timestamp from the fixed start time
        data_matrix['runTime'][i_idx] = float(tm) - start_tm
        for cnl_nm in grouped_values:
            # Lazily create a row for each channel the first time it is seen
            if cnl_nm not in data_matrix:
                data_matrix[cnl_nm] = get_tms()

            value_dict = grouped_values[cnl_nm]
            if tm in value_dict:
                data_matrix[cnl_nm][i_idx] = value_dict[tm]
    return data_matrix

When I profiled this code, I found that a significant amount of time goes into hashing cnl_nm for the presence/absence check against data_matrix.
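
For context, every channel key is already inserted into data_matrix during the first timestamp iteration, so the check is redundant on all subsequent iterations. Below is a minimal sketch of hoisting the initialization out of the loop so the inner loop never probes data_matrix at all (the function name re_struct_data_preinit is hypothetical, not from my original code):

def re_struct_data_preinit(all_timestamps, grouped_values):
    tm_count = len(all_timestamps)
    start_tm = 1607494871

    # Create every row up front: one 'runTime' row plus one row per channel.
    # This removes the "cnl_nm not in data_matrix" lookup from the hot loop.
    data_matrix = {'runTime': [None] * tm_count}
    for cnl_nm in grouped_values:
        data_matrix[cnl_nm] = [None] * tm_count

    for i_idx, tm in enumerate(all_timestamps):
        data_matrix['runTime'][i_idx] = float(tm) - start_tm
        for cnl_nm, value_dict in grouped_values.items():
            if tm in value_dict:
                data_matrix[cnl_nm][i_idx] = value_dict[tm]
    return data_matrix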

I tried switching to

  • setdefault() -> since it does the same check under the hood (see the sketch after this list)
  • using .items() -> tuple conversion + unpacking

But both took more time.
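
For reference, the setdefault() attempt presumably looked something like the following (a guess at the change, not my original code). It does not help because setdefault() still hashes cnl_nm on every inner-loop pass, and its default argument is evaluated eagerly, so get_tms() allocates a throwaway list even when the key already exists:

def re_struct_data_setdefault(all_timestamps, grouped_values):
    tm_count = len(all_timestamps)
    start_tm = 1607494871
    get_tms = lambda: [None] * tm_count
    data_matrix = {'runTime': get_tms()}

    for i_idx, tm in enumerate(all_timestamps):
        data_matrix['runTime'][i_idx] = float(tm) - start_tm
        for cnl_nm, value_dict in grouped_values.items():
            # Same hash lookup as the 'in' check, plus an eager list allocation
            row = data_matrix.setdefault(cnl_nm, get_tms())
            if tm in value_dict:
                row[i_idx] = value_dict[tm]
    return data_matrix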

Any suggestions for improving this?
