TransWikia.com

Pandas - avoid iterrows() assembling a multi-index data frame from another time-series multi-index data frame

Data Science Asked by Alexander Willer on July 4, 2021

For analyzing pairs of currencies, I have two dataframes. The first one contains information about the symbols constituting every unique pair (as in "pair metadata"):

    X           Y           handle
0   ETHEUR E3   XMRUSD E3   ETHEUR E3_XMRUSD E3
1   ETHEUR E3   TRXETH E3   ETHEUR E3_TRXETH E3
2   ETHEUR E3   ETHUSD E3   ETHEUR E3_ETHUSD E3
3   ETHEUR E3   BTCEUR E3   ETHEUR E3_BTCEUR E3
4   ETHEUR E3   ETHBTC E3   ETHEUR E3_ETHBTC E3
...     ...     ...     ...

The second one is a multi-index dataframe providing historic data, having the symbol as level 0 index and time as a level 1 index. This specific structure is desired (and given by some other library) as not every symbol has data for every point in time.

                                    close       high        low         open        volume
symbol      time                    
ETHEUR E3   2021-01-18 04:41:00     988.730000  988.730000  988.726611  988.726611  8.404723
            2021-01-18 04:42:00     988.730000  988.730000  988.726611  988.726611  0.000000
            2021-01-18 04:43:00     988.730000  988.730000  988.726611  988.726611  0.000000
            2021-01-18 04:44:00     990.620360  990.620360  989.640000  989.640000  0.158394
            2021-01-18 04:45:00     995.010000  995.010000  995.010000  995.010000  0.006665
...     ...     ...     ...     ...     ...     ...
ANTETH E3   2021-01-18 04:56:00     0.003745    0.003745    0.003745    0.003745    0.000000
            2021-01-18 04:57:00     0.003745    0.003745    0.003745    0.003745    0.000000
            2021-01-18 04:58:00     0.003745    0.003745    0.003745    0.003745    0.000000
            2021-01-18 04:59:00     0.003745    0.003745    0.003745    0.003745    0.000000
            2021-01-18 05:00:00     0.003745    0.003745    0.003745    0.003745    0.000000

My goal is to assemble a pair-wise history as follows, with ‘X’ and ‘Y’ containing the data from the previous dataframe for the first and second symbol of the pair:

                                                X                                                           Y
                                                close       high        low         open        volume      close       high        low         open        volume
handle                  time                                        
ETHEUR E3_XMRUSD E3     2021-01-18 04:41:00     988.730000  988.730000  988.726611  988.726611  8.404723    153.450000  153.450000  153.250000  153.250000  0.000000
                        2021-01-18 04:42:00     988.730000  988.730000  988.726611  988.726611  0.000000    153.450000  153.450000  153.250000  153.250000  0.000000
                        2021-01-18 04:43:00     988.730000  988.730000  988.726611  988.726611  0.000000    153.450000  153.450000  153.250000  153.250000  0.000000
                        2021-01-18 04:44:00     990.620360  990.620360  989.640000  989.640000  0.158394    153.450000  153.450000  153.250000  153.250000  0.000000
                        2021-01-18 04:45:00     995.010000  995.010000  995.010000  995.010000  0.006665    153.860000  153.860000  153.360000  153.360000  4.845876
...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
EOSETH E3_ANTETH E3     2021-01-18 04:56:00     0.002259    0.002259    0.002259    0.002259    0.000000    0.003745    0.003745    0.003745    0.003745    0.000000
                        2021-01-18 04:57:00     0.002259    0.002259    0.002259    0.002259    0.000000    0.003745    0.003745    0.003745    0.003745    0.000000
                        2021-01-18 04:58:00     0.002259    0.002259    0.002259    0.002259    0.000000    0.003745    0.003745    0.003745    0.003745    0.000000
                        2021-01-18 04:59:00     0.002259    0.002259    0.002259    0.002259    0.000000    0.003745    0.003745    0.003745    0.003745    0.000000
                        2021-01-18 05:00:00     0.002259    0.002259    0.002259    0.002259    0.000000    0.003745    0.003745    0.003745    0.003745    0.000000

This pair-wise history is used for further time-series style calculations between the two symbols.

Currently, I use the following code to generate the shown dataframe:

def assemble_pairs_from_history(history):
    # Build unique pair combinations from symbols
    pair_tuples = list(combinations(history.index.unique('symbol'), 2))
    pair_info = pd.DataFrame(pair_tuples, columns=['X', 'Y'])
    # Designate a handle for each combination
    pair_info['handle'] = pair_info['X'].str.cat(pair_info['Y'], sep="_")
    
    display(pair_info)
    display(history)
    
    # Iterate over all pair combinations, assembling a dict of historical data DFs for X and Y symbol of the pair
    pair_history = {}
    for i, pi in pair_info.iterrows():
        pair_history[pi.handle] = pd.concat([history.loc[pi.X], history.loc[pi.Y]], axis=1, keys=['X', 'Y'])
        # Clean the time series so that only rows having values for both symbols of the pair remain
        pair_history[pi.handle].dropna(inplace=True)
    
    
    # Assemble into one big dataframe, index by handle -> time with X and Y top level columns
    pair_history = pd.concat(pair_history.values(), keys=pair_history.keys(), names=['handle'])

    display(pair_history)

The issue with this code is that the performance is pretty bad, with the iterrows() and subsequent concat() part taking 10 seconds for a result dataframe size of 5000 rows.

I already tried various combinations of apply(), join() and merge() as well as building a multiindex manually, but was not able to achieve the desired output format.

How can the performance of this task be improved?

One Answer

Not sure if you need the multi index for the columns, but would something like this be what you're looking for?

Given the following two input dataframes

X Y handle
ETHEUR E3 XMRUSD E3 ETHEUR E3_XMRUSD E3
ETHEUR E3 TRXETH E3 ETHEUR E3_TRXETH E3
ETHEUR E3 ETHUSD E3 ETHEUR E3_ETHUSD E3
ETHEUR E3 BTCEUR E3 ETHEUR E3_BTCEUR E3
ETHEUR E3 ETHBTC E3 ETHEUR E3_ETHBTC E3
symbol time close high low open volume
ETHEUR E3 2021-01-18 04:41:00 988.73 988.73 988.727 988.727 8.40472
ETHEUR E3 2021-01-18 04:42:00 988.73 988.73 988.727 988.727 0
ETHEUR E3 2021-01-18 04:43:00 988.73 988.73 988.727 988.727 0
ETHEUR E3 2021-01-18 04:44:00 990.62 990.62 989.64 989.64 0.158394
ETHEUR E3 2021-01-18 04:45:00 995.01 995.01 995.01 995.01 0.006665
ANTETH E3 2021-01-18 04:56:00 0.003745 0.003745 0.003745 0.003745 0
ANTETH E3 2021-01-18 04:57:00 0.003745 0.003745 0.003745 0.003745 0
ANTETH E3 2021-01-18 04:58:00 0.003745 0.003745 0.003745 0.003745 0
ANTETH E3 2021-01-18 04:59:00 0.003745 0.003745 0.003745 0.003745 0
ANTETH E3 2021-01-18 05:00:00 0.003745 0.003745 0.003745 0.003745 0

you can use pandas.merge to join the tables twice for each of the currency pairs.

combined = (
    df1
    # join data for first currency pair
    .merge(df2, how="left", left_on="X", right_on="symbol")
    # join data for second currency pair
    .merge(df2, how="left", left_on=["Y", "time"], right_on=["symbol", "time"], suffixes=("_X", "_Y"))
)

# get columns names for multi index
cols = [x.split("_")[::-1] if "_" in x else ["", x] for x in combined.columns]
# set new column names
combined.columns = pd.MultiIndex.from_arrays(list(zip(*cols)))
# select required columns
combined = combined[[x for x in combined.columns if x[1] in ["handle", "time", "close", "high", "low", "open", "volume"]]]

Which results in the following dataframe:

('', 'handle') ('', 'time') ('X', 'close') ('X', 'high') ('X', 'low') ('X', 'open') ('X', 'volume') ('Y', 'close') ('Y', 'high') ('Y', 'low') ('Y', 'open') ('Y', 'volume')
ETHEUR E3_XMRUSD E3 2021-01-18 04:41:00 988.73 988.73 988.727 988.727 8.40472 nan nan nan nan nan
ETHEUR E3_XMRUSD E3 2021-01-18 04:42:00 988.73 988.73 988.727 988.727 0 nan nan nan nan nan
ETHEUR E3_XMRUSD E3 2021-01-18 04:43:00 988.73 988.73 988.727 988.727 0 nan nan nan nan nan
ETHEUR E3_XMRUSD E3 2021-01-18 04:44:00 990.62 990.62 989.64 989.64 0.158394 nan nan nan nan nan
ETHEUR E3_XMRUSD E3 2021-01-18 04:45:00 995.01 995.01 995.01 995.01 0.006665 nan nan nan nan nan

Answered by Oxbowerce on July 4, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP