vectorized_pandas's Issues
Vectorize apply calls at any location
Currently, only apply calls within assignments are replaced. We should replace apply calls wherever they are located. They can appear:
- Inside a function
- Inside a method
- As a return of a method/function
- Within the condition of an if statement
- As part of a method chain (e.g. s.apply(...).isna())
Supporting these cases probably involves refactoring how apply calls are identified. Typically, an ast.Call visitor could be used instead of the current implementation, which only detects top-level apply calls within assignments.
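A minimal sketch of the visitor idea, making no assumption about the project's internals: the class name ApplyCallFinder and the sample source are hypothetical. It only illustrates that an ast.NodeVisitor with a visit_Call method reaches apply calls inside functions, conditions, return statements and method chains.

```python
import ast


class ApplyCallFinder(ast.NodeVisitor):
    """Collect every call of the form `<expr>.apply(...)`, wherever it appears."""

    def __init__(self):
        self.calls = []

    def visit_Call(self, node):
        func = node.func
        if isinstance(func, ast.Attribute) and func.attr == "apply":
            self.calls.append(node)
        # Keep descending so that nested or chained calls are found too.
        self.generic_visit(node)


# Hypothetical input: one apply call in an if condition (inside a chain),
# another one in a return statement.
source = '''
def f(s):
    if s.apply(g).isna().any():
        return s.apply(h)
    return s
'''

finder = ApplyCallFinder()
finder.visit(ast.parse(source))

# Names of the functions passed to apply, sorted for a stable order.
applied = sorted(call.args[0].id for call in finder.calls)
print(applied)
```

Matching on the attribute name alone also catches apply methods of unrelated objects, so a real implementation would probably need additional checks before rewriting a call.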
Support sep.join(str_list)
ENH: Vectorize str operations
For instance
def func(val):
    return val.upper()
s = s.apply(func)
should be converted to
s = s.str.upper()
While pandas string operations are not necessarily efficient, this kind of replacement still improves performance on DataFrames, as it avoids the overhead of casting each DataFrame row into a Series within the applied method.
ENH: Vectorize for loops
For instance
df = pd.DataFrame({"A": [0, 1], "B": [0, 2]})
for i in range(len(df)):
    df["B"][i] = df["A"][i] + df["B"][i]
should be converted to
df = pd.DataFrame({"A": [0, 1], "B": [0, 2]})
df["B"] = df["A"] + df["B"]
BUG: Remove extra str methods
Some string methods in apply.STR_METHODS actually correspond to Python builtin functions that take strings as parameters, such as len. Thus the replacement mechanism is different from the one for string methods such as startswith.
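To illustrate the difference, here is a small sketch with made-up sample data. A string method is called on the value and keeps its position under the .str accessor, whereas a builtin such as len wraps the value and has to be rewritten into an accessor call.

```python
import pandas as pd

s = pd.Series(["apple", "kiwi", "avocado"])

# String method: `val.startswith("a")` maps directly to `s.str.startswith("a")`.
starts_applied = s.apply(lambda val: val.startswith("a"))
starts_vectorized = s.str.startswith("a")

# Builtin function: `len(val)` takes the value as an argument, so the call
# itself has to be turned into `s.str.len()`.
len_applied = s.apply(len)
len_vectorized = s.str.len()

print(starts_vectorized.tolist(), len_vectorized.tolist())
```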
ENH: Support f-strings
For instance
def func(val):
    return f"{val}_0"
s = s.apply(func)
should be converted to
s = s.astype(str) + "_0"
BUG: Multiple conditional statements not vectorized
replace_apply on the following code does not vectorize func into np.select; it simply drops the function definition (cf. #17):
def func(row):
    if row["A"] != 0:
        if row["A"] == 1:
            if row["B"] == 1:
                return row["C"]
            else:
                return row["D"]
        else:
            if row["B"] == 2:
                return row["E"]
            else:
                return row["F"]
    else:
        return 0.0
df["SL"] = df.apply(func, axis=1)
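For reference, one possible hand-written np.select equivalent is sketched below. It is not the output of replace_apply. The sample DataFrame is invented for the comparison, and each leaf of the nested if/else is flattened into the conjunction of the conditions on its path.

```python
import numpy as np
import pandas as pd


def func(row):
    if row["A"] != 0:
        if row["A"] == 1:
            if row["B"] == 1:
                return row["C"]
            else:
                return row["D"]
        else:
            if row["B"] == 2:
                return row["E"]
            else:
                return row["F"]
    else:
        return 0.0


# Made-up data: one row per leaf of the nested conditions.
df = pd.DataFrame({
    "A": [0, 1, 1, 2, 2],
    "B": [1, 1, 3, 2, 5],
    "C": [10.0, 11.0, 12.0, 13.0, 14.0],
    "D": [20.0, 21.0, 22.0, 23.0, 24.0],
    "E": [30.0, 31.0, 32.0, 33.0, 34.0],
    "F": [40.0, 41.0, 42.0, 43.0, 44.0],
})

expected = df.apply(func, axis=1)

a_nonzero = df["A"] != 0
a_is_one = df["A"] == 1

# One condition per `return` statement; the outer `else` becomes the default.
conditions = [
    a_nonzero & a_is_one & (df["B"] == 1),
    a_nonzero & a_is_one & ~(df["B"] == 1),
    a_nonzero & ~a_is_one & (df["B"] == 2),
    a_nonzero & ~a_is_one & ~(df["B"] == 2),
]
choices = [df["C"], df["D"], df["E"], df["F"]]

df["SL"] = np.select(conditions, choices, default=0.0)
```

The sample data contains no missing values. With NaN in A or B the comparisons evaluate to False, so NaN handling would need to be checked separately.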
Support in operator
This operator is tricky as it can be used with both lists and strings, so we can't always be sure whether we should use isin or str.contains.
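The two readings can be sketched as follows, with made-up data. When the right-hand operand is a list literal the membership test maps to isin, and when the value itself is the right-hand operand of a string test it maps to str.contains. The ambiguous case is a right-hand operand whose type is not visible in the source, such as a plain variable name.

```python
import pandas as pd

s = pd.Series(["ab", "cd", "a"])

# `val in ["ab", "a"]`: membership of the value in a list -> isin
in_list_applied = s.apply(lambda val: val in ["ab", "a"])
in_list_vectorized = s.isin(["ab", "a"])

# `"a" in val`: substring test on the value -> str.contains
# (regex=False keeps the plain substring semantics of `in`).
substring_applied = s.apply(lambda val: "a" in val)
substring_vectorized = s.str.contains("a", regex=False)

print(in_list_vectorized.tolist(), substring_vectorized.tolist())
```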
Support len method
As with the in operator (#32), it might be ambiguous whether len is called on a string or on a list.
Support inline conditional expression
For instance
def func(val):
    return val if val == 0 else 0
s = s.apply(func)
should be converted to
s = np.select(condlist=[(s == 0)], choicelist=[s], default=0)
Vectorize code located within methods
For instance
def func(val):
    return val + 1
class A():
    def method(self):
        s = pd.Series(range(100))
        s = s.apply(func)
should be converted to
class A():
    def method(self):
        s = pd.Series(range(100))
        s = s + 1
ENH: Support nested UDF calls.
For instance
def add_one(value):
    return value + 1
def func(value):
    return add_one(value)
s1 = pd.Series([0, 1])
s2 = s1.apply(func)
should be converted to
s1 = pd.Series([0, 1])
s2 = s1 + 1
ENH: Vectorize DataFrame.apply(axis=1)
For instance
def func(row):
    return row["A"] + row["B"]
df = pd.DataFrame({"A": [0, 1], "B": [0, 2]})
df["sum"] = df.apply(func, axis=1)
should be converted to
df = pd.DataFrame({"A": [0, 1], "B": [0, 2]})
df["sum"] = df["A"] + df["B"]
ENH: Vectorize numpy mathematical functions.
Currently, only pd.isna / pd.isnull / np.isnan are supported. Check for additional ones, at least pd.notna.
See numpy mathematical functions.
ENH: Support row.field notation
For instance
def func(row):
    return row.A + row.B
df["sum"] = df.apply(func, axis=1)
should be converted to
df["sum"] = df.A + df.B
BUG: unvectorizable function definition is dropped
replace_apply on the following code drops the func definition, although it should not.
from json import loads
def func(row):
    return loads(row)
x = df.apply(func)
BUG: Import numpy when using np.select
If we generate an np.select statement from a UDF using conditions, we should add import numpy as np if numpy is not already imported, or reuse the current numpy alias.
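One way this could be approached is sketched below. The helper names find_numpy_alias and ensure_numpy_import are hypothetical and not part of the project. The sketch only looks at module-level import numpy statements and prepends the import naively, ignoring docstrings, from __future__ imports and from numpy import ... forms.

```python
import ast


def find_numpy_alias(source):
    """Return the name bound to the numpy module at module level, or None."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name == "numpy":
                    # `import numpy` binds "numpy"; `import numpy as xp` binds "xp".
                    return alias.asname or "numpy"
    return None


def ensure_numpy_import(source):
    """Return (source, alias), prepending `import numpy as np` if numpy is missing."""
    alias = find_numpy_alias(source)
    if alias is None:
        return "import numpy as np\n" + source, "np"
    return source, alias


code = "import pandas as pd\ns = pd.Series([0, 1])\n"
new_code, alias = ensure_numpy_import(code)
print(alias)
print(new_code)
```

The returned alias is what the generated statement would use as a prefix, for instance f"{alias}.select(...)".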
ENH: Support all augmented assignments in applied functions.
Currently, only += is supported.
For instance
def func(val):
    val -= 1
    return val
s = s.apply(func)
should be converted to
s = s - 1
BUG: Unexpected KeyError while referencing a module
replace_apply called on the following code raises a KeyError:
import json
def func(row):
    return json.loads(row)
x = df.apply(func, axis=1)
-->
...
self = <apply.FunctionBodyParser object at 0x7eff77764c10>
expr_node = <ast.Name object at 0x7eff77764910>
dependencies = {'row': ApplyFuncArg()}
    def _resolve_expr(self, expr_node: Optional[ast.expr], dependencies: Dict[str, Expr]) -> Expr:
        """
        Build an Expr object from a ast.expr node, using the current dependencies.
        """
        assert expr_node is not None, "Trying to resolve an empty expression node"
        if isinstance(expr_node, ast.Constant):
            return Constant(expr_node.value)
        if isinstance(expr_node, ast.Name):
            if expr_node.id in self.numpy_alias or expr_node.id in self.pandas_alias:
                return Alias(expr_node.id)
>           return dependencies[expr_node.id]
E           KeyError: 'json'
apply.py:288: KeyError
ENH: Support reference to global variables
For instance
A = 1
def func(value):
    return value == A
s = s.apply(func)
should be converted to
A = 1
s = s == A
ENH: Vectorize iterrows
For instance
df = pd.DataFrame({"A": [0, 1], "B": [0, 2]})
res = []
for index, row in df.iterrows():
    res.append(row["A"])
should be converted to
df = pd.DataFrame({"A": [0, 1], "B": [0, 2]})
res = df["A"].tolist()
ENH: Vectorize type cast
For instance
s = s.apply(int)
should be converted to
s = s.astype(int)