How to speed up pandas apply for string matching

Asked on Stack Overflow on November 29, 2020

I have a large number of files, and for each one I have to carry out calculations based on string columns. The relevant columns look like this:

df = pd.DataFrame({'A': ['A', 'B', 'A', 'B'], 'B': ['B', 'C', 'D', 'A'], 'C': ['A', 'B', 'D', 'D'], 'D': ['A', 'C', 'C', 'B'],})

    A   B   C   D
0   A   B   A   A
1   B   C   B   C
2   A   D   D   C
3   B   A   D   B

I have to create new columns containing the number of occurrences of certain strings in each row. I do this as follows:

for elem in ['A', 'B', 'C', 'D']:
    df['n_{}'.format(elem)] = df[['A', 'B', 'C', 'D']].apply(lambda x: (x == elem).sum(), axis=1)

   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1

However, this is taking minutes per file, and I have to do this for around 900 files. Is there any way I can speed this up?

4 Answers

Rather than use apply to loop over each row, I looped over each column to compute the sum for each letter:

for l in ['A','B','C','D']:
    df['n_' + l] = (df == l).sum(axis=1)

This seems to be an improvement in this example, but (from quick testing not shown) it can be roughly equal or worse depending on the shape and size of the data (and probably on how many strings you are looking for).
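
One caveat worth noting (my addition, not from the original answer): (df == l) compares every column of the frame, including the n_ count columns added by earlier iterations of the loop. The counts stay correct, because the integer columns never equal the letter strings, but restricting the comparison to the string columns avoids the wasted work:

cols = ['A', 'B', 'C', 'D']
for l in cols:
    df['n_' + l] = (df[cols] == l).sum(axis=1)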

Some time comparisons:

%%timeit
for elem in ['A', 'B', 'C', 'D']:
    df['n_{}'.format(elem)] = df[['A', 'B', 'C', 'D']].apply(lambda x: (x == elem).sum(), axis=1)    
#6.77 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
for l in ['A','B','C','D']:
    df['n_' + l] = (df == l).sum(axis=1)
#1.95 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And for the other answers here:

%%timeit
df1 = df.join(df.stack().str.get_dummies().sum(level=0).add_prefix('n_'))
#3.59 ms ± 62.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df1=df.join(pd.get_dummies(df,prefix='n',prefix_sep='_').sum(1,level=0))
#5.82 ms ± 52.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
counts = df.apply(lambda s: s.value_counts(), axis=1).fillna(0)
counts.columns = [f'n_{col}' for col in counts.columns]
df.join(counts)
#5.58 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
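
For completeness, a fully vectorized variant (a sketch of mine, not part of the original comparison): drop to NumPy and count all letters in a single broadcast comparison. On larger frames this kind of approach often wins, but that should be verified on real data:

import numpy as np

letters = ['A', 'B', 'C', 'D']
arr = df[letters].to_numpy()                                  # (n_rows, 4) array of strings
counts = (arr[:, :, None] == np.array(letters)).sum(axis=1)   # (n_rows, 4) counts per letter
out = df[letters].join(
    pd.DataFrame(counts, columns=['n_' + l for l in letters], index=df.index))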

Correct answer by Tom on November 29, 2020

You could do:

counts = df.apply(lambda s: s.value_counts(), axis=1).fillna(0)
counts.columns = [f'n_{col}' for col in counts.columns]
df.join(counts)
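
Two small caveats (my notes, not from the original answer): fillna leaves the counts as floats, and a letter that never occurs anywhere gets no column at all. A variant that addresses both, assuming the full set of expected letters is known up front:

letters = ['A', 'B', 'C', 'D']
counts = (df.apply(lambda s: s.value_counts(), axis=1)
            .reindex(columns=letters, fill_value=0)  # ensure a column for every letter
            .fillna(0)                               # per-row misses become 0
            .astype(int))
counts.columns = [f'n_{col}' for col in counts.columns]
df.join(counts)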

Answered by Ayoub ZAROU on November 29, 2020

Try get_dummies and sum with level=0; here we do not need stack :-)

df=df.join(pd.get_dummies(df,prefix='n',prefix_sep='_').sum(1,level=0))
Out[57]: 
   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1
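
Note that summing with the level keyword (sum(1, level=0)) was deprecated in later pandas releases and has since been removed. On current pandas, an equivalent (a sketch, untested against the timings above) groups the duplicate n_* column labels explicitly:

dummies = pd.get_dummies(df, prefix='n', prefix_sep='_')
df = df.join(dummies.T.groupby(level=0).sum().T)  # group the duplicate 'n_*' labels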

Answered by BEN_YO on November 29, 2020

Use stack + str.get_dummies, then sum on level=0 and join the result back with df:

df1 = df.join(df.stack().str.get_dummies().sum(level=0).add_prefix('n_'))

Result:

print(df1)
   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1
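
The same level= deprecation applies here: on newer pandas, sum(level=0) can be replaced with a groupby on the first index level (a sketch):

df1 = df.join(df.stack().str.get_dummies().groupby(level=0).sum().add_prefix('n_'))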

Answered by Shubham Sharma on November 29, 2020
