In most cases, Python code runs faster with a vectorized process than with the generic pandas DataFrame.apply() method. Here is an example from my work.
In our study, the participants wear accelerometers to record their physical activity. I need to summarize the activity and classify it as sedentary, light, moderate, or vigorous. Here meterplus.data is a list of movement counts, one per 30-second epoch, with len(meterplus.data) = 20639. Another list, valid, has the same length as meterplus.data and is generated by a function validate(). Finally, cutoff is a list of thresholds.
The first way to apply a function to both meterplus.data and valid is the apply method of a pandas DataFrame.
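import numpy as np
import pandas as pd

# One row per 30-second epoch: movement count, timestamp, and validity flag.
meterplus.df = pd.DataFrame({
    'y': meterplus.data,
    'timestamp': timestamp,
    'valid': validate(meterplus.data)
})

def categorize(row):
    """Label one epoch (one DataFrame row) by activity intensity."""
    if row['valid'] == 0:
        return 'nonvalid'
    elif row['valid'] == 1:
        if cutoff[0] <= row['y'] < cutoff[1]:
            return 'sed'
        elif cutoff[1] <= row['y'] < cutoff[2]:
            return 'light'
        elif cutoff[2] <= row['y'] < cutoff[3]:
            return 'mod'
        elif cutoff[3] <= row['y']:  # <= so a count exactly at cutoff[3] is labeled 'vig'
            return 'vig'
        else:
            return np.nan
    else:
        return np.nan

# axis=1 calls categorize once per row.
meterplus.df['mvpa'] = meterplus.df.apply(categorize, axis=1)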
Using %timeit and %prun -l from IPython, we can see the performance of this process:
In [7]: %timeit acc2.label(acctest)
1 loop, best of 3: 1.72 s per loop
In [9]: %prun -l 3 acc2.label(acctest)
5169897 function calls (5079010 primitive calls) in 2.827 seconds
Ordered by: internal time
List reduced from 320 to 3 due to restriction <3>
ncalls tottime percall cumtime percall filename:lineno(function)
1048969 0.143 0.000 0.263 0.000 {built-in method builtins.isinstance}
185798 0.131 0.000 0.294 0.000 dtypes.py:74(is_dtype)
20639 0.125 0.000 0.125 0.000 internals.py:2202()
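Outside IPython, a similar measurement can be sketched with the standard-library timeit, cProfile, and pstats modules. This is only an illustration: the label function and its input below are hypothetical stand-ins, not the acc2.label and acctest used in the session above.

```python
import cProfile
import io
import pstats
import timeit


def label(values):
    # Hypothetical stand-in for the function being timed.
    return ['sed' if v < 100 else 'active' for v in values]


# Made-up input with the same length as meterplus.data.
data = list(range(20639))

# Rough equivalent of %timeit: best of 3 repeats, one call per repeat.
best = min(timeit.repeat(lambda: label(data), number=1, repeat=3))
print(f"best of 3: {best:.4f} s per loop")

# Rough equivalent of %prun -l 3: profile one call, sort by internal
# time (tottime), and print only the top 3 rows.
profiler = cProfile.Profile()
profiler.enable()
label(data)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('tottime').print_stats(3)
report = stream.getvalue()
print(report)
```

The tottime column is the time spent inside a function itself, while cumtime also includes the functions it calls, which is why the two differ in the listings here.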
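Another way to apply a function to both lists is vectorization: here, mapping the function over the two lists with the built-in map() instead of going row by row through a DataFrame.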
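def categorize(valid, y):
    """Label one epoch from its validity flag and movement count."""
    if valid == 0:
        return 'nonvalid'
    elif valid == 1:
        if cutoff[0] <= y < cutoff[1]:
            return 'sed'
        elif cutoff[1] <= y < cutoff[2]:
            return 'light'
        elif cutoff[2] <= y < cutoff[3]:
            return 'mod'
        elif cutoff[3] <= y:  # <= so a count exactly at cutoff[3] is labeled 'vig'
            return 'vig'
        else:
            return 'NA'
    else:
        return 'NA'

# map() walks both lists in parallel and passes one pair per call.
mvpa = list(map(categorize, valid, meterplus.data))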
Again, %timeit and %prun -l time the performance:
In [10]: %timeit acc1.label(acctest)
10 loops, best of 3: 174 ms per loop
In [11]: %prun -l 3 acc1.label(acctest)
230767 function calls (230755 primitive calls) in 0.205 seconds
Ordered by: internal time
List reduced from 248 to 3 due to restriction <3>
ncalls tottime percall cumtime percall filename:lineno(function)
41290 0.048 0.000 0.048 0.000 {method 'replace' of 'datetime.datetime' objects}
1 0.039 0.039 0.039 0.039 {pandas.tslib.datetime_to_datetime64}
20641 0.033 0.000 0.093 0.000 tzinfo.py:179(fromutc)
Vectorization easily outperformed the apply method, with an almost 10-fold increase in speed (174 ms versus 1.72 s per loop).
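Note that map() still calls the Python function once per element. As a possible next step, the same rules can be written as array-level operations with NumPy's np.select, which evaluates each condition over the whole array at once. The sketch below is not the code that was timed above: the cutoff values and the toy inputs are hypothetical, and it assumes a count exactly equal to cutoff[3] should be labeled 'vig'. No timing is claimed for it.

```python
import numpy as np

# Hypothetical thresholds and toy inputs, for illustration only.
cutoff = [0, 100, 2000, 6000]
y = np.array([50, 150, 2500, 7000, 6000, 30, 10])
valid = np.array([1, 1, 1, 1, 1, 0, 2])

is_valid = valid == 1

# Each condition is a boolean array over all epochs at once.
# np.select picks, per element, the choice of the first condition that is True.
conditions = [
    valid == 0,
    is_valid & (cutoff[0] <= y) & (y < cutoff[1]),
    is_valid & (cutoff[1] <= y) & (y < cutoff[2]),
    is_valid & (cutoff[2] <= y) & (y < cutoff[3]),
    is_valid & (y >= cutoff[3]),  # assumption: top boundary is inclusive
]
choices = ['nonvalid', 'sed', 'light', 'mod', 'vig']

# Anything matching no condition (unknown flag, count below cutoff[0]) gets 'NA'.
mvpa = np.select(conditions, choices, default='NA')
print(mvpa.tolist())
# → ['sed', 'light', 'mod', 'vig', 'vig', 'nonvalid', 'NA']
```

With real data, y and valid would be built with np.asarray(meterplus.data) and np.asarray(valid), and the result could be assigned directly to a DataFrame column.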