vectorized_pandas's People

Contributors

lucasg0

vectorized_pandas's Issues

ENH: Vectorize DataFrame.apply(axis=1)

For instance

def func(row):
    return row["A"] + row["B"]

df = pd.DataFrame({"A": [0, 1], "B": [0, 2]})
df["sum"] = df.apply(func, axis=1)

should be converted to

df = pd.DataFrame({"A": [0, 1], "B": [0, 2]})
df["sum"] = df["A"] + df["B"]

Support inline conditional expression

For instance

def func(val):
    return val if val == 0 else 0
s = s.apply(func)

should be converted to

s = np.select(condlist=[s == 0], choicelist=[s], default=0)

Vectorize apply calls at any location

Currently, only apply calls within assignments are replaced. We should replace apply calls wherever they are located. A call can be:

  • Inside a function
  • Inside a method
  • As a return of a method/function
  • Within the condition of an if statement
  • Part of a chain of method calls (e.g. s.apply(...).isna())

Supporting these cases probably involves refactoring how apply calls are identified. Typically, an ast.Call visitor could be used instead of the current implementation, which only detects top-level apply calls within assignments.
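As a rough illustration of the visitor idea, here is a minimal sketch built on the standard-library ast.NodeVisitor. The names ApplyCallFinder and find_apply_calls are hypothetical and are not part of the current implementation; the sketch only locates calls and does not rewrite them.

```python
import ast


class ApplyCallFinder(ast.NodeVisitor):
    """Collect every `<expr>.apply(...)` call, wherever it appears."""

    def __init__(self):
        self.calls = []

    def visit_Call(self, node):
        # A method call is a Call whose func is an Attribute node.
        if isinstance(node.func, ast.Attribute) and node.func.attr == "apply":
            self.calls.append(node)
        # Keep walking so nested and chained calls are visited too.
        self.generic_visit(node)


def find_apply_calls(source):
    """Hypothetical helper: return all apply Call nodes found in `source`."""
    finder = ApplyCallFinder()
    finder.visit(ast.parse(source))
    return finder.calls


SOURCE = '''
def helper(s):
    return s.apply(func)

class A:
    def method(self, s):
        if s.apply(func).isna().any():
            return None
        return s

t = s.apply(func)
'''

calls = find_apply_calls(SOURCE)
print(sorted(call.lineno for call in calls))  # [3, 7, 11]
```

The three hits correspond to a return statement inside a function, an if condition inside a method (as part of a call chain), and a plain assignment.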

BUG: Import numpy when using np.select

If we generate an np.select statement from a UDF that uses conditions, we should add import numpy as np when numpy is not already imported, and otherwise reuse the existing numpy alias.
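One possible way to handle this, sketched with the standard-library ast module. The helper names find_numpy_alias and ensure_numpy_import are hypothetical. The sketch only looks at module-level `import numpy` statements and naively prepends the import, which would be wrong for files that start with a module docstring or a `from __future__` import.

```python
import ast


def find_numpy_alias(tree):
    """Return the name bound to numpy at module level, or None."""
    for node in tree.body:
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name == "numpy":
                    # `import numpy` binds "numpy"; `import numpy as xp` binds "xp".
                    return alias.asname or "numpy"
    return None


def ensure_numpy_import(source):
    """Return (source, alias), adding `import numpy as np` only if needed."""
    alias = find_numpy_alias(ast.parse(source))
    if alias is not None:
        return source, alias
    return "import numpy as np\n" + source, "np"


print(ensure_numpy_import("import numpy as xp\nx = 1\n")[1])  # xp
print(ensure_numpy_import("x = 1\n")[0].splitlines()[0])      # import numpy as np
```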

ENH: Vectorize str operations

For instance

def func(val):
    return val.upper()

s = s.apply(func)

should be converted to

s = s.str.upper()

While pandas string operations are not necessarily efficient, this kind of replacement still improves performance on DataFrames, as it avoids the overhead of casting each DataFrame row into a Series within the applied method.

BUG: Multiple conditional statements not vectorized

Calling replace_apply on the following code does not vectorize func into np.select; it simply drops the function definition (cf. #17):

def func(row):
    if row["A"] != 0:
        if row["A"] == 1:
            if row["B"] == 1:
                return row["C"]
            else:
                return row["D"]
        else:
            if row["B"] == 2:
                return row["E"]
            else:
                return row["F"]
    else:
        return 0.0

df["SL"] = df.apply(func, axis=1)
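For reference, one vectorized form the nested branches could be flattened into is sketched below. This is an assumption about a reasonable target, not the output the tool currently produces. The sample data is made up for illustration, and the sketch relies on np.select picking the first condition that matches.

```python
import numpy as np
import pandas as pd

# Made-up sample data, one row per branch of func.
df = pd.DataFrame({
    "A": [0, 1, 1, 2, 2],
    "B": [1, 1, 0, 2, 0],
    "C": [10.0, 11.0, 12.0, 13.0, 14.0],
    "D": [20.0, 21.0, 22.0, 23.0, 24.0],
    "E": [30.0, 31.0, 32.0, 33.0, 34.0],
    "F": [40.0, 41.0, 42.0, 43.0, 44.0],
})


def func(row):
    if row["A"] != 0:
        if row["A"] == 1:
            return row["C"] if row["B"] == 1 else row["D"]
        return row["E"] if row["B"] == 2 else row["F"]
    return 0.0


expected = df.apply(func, axis=1)  # row-wise reference result

a_nonzero = df["A"] != 0
a_is_one = df["A"] == 1
conditions = [
    a_nonzero & a_is_one & (df["B"] == 1),  # -> C
    a_nonzero & a_is_one,                   # remaining A == 1 rows -> D
    a_nonzero & (df["B"] == 2),             # remaining A != 0 rows -> E
    a_nonzero,                              # everything else with A != 0 -> F
]
choices = [df["C"], df["D"], df["E"], df["F"]]
df["SL"] = np.select(conditions, choices, default=0.0)

print(df["SL"].tolist())  # [0.0, 11.0, 22.0, 33.0, 44.0]
```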

Support len method

As with the in operator (#32), it might be ambiguous whether len is called on a string or on a list.
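One observation that may reduce the ambiguity: pandas documents Series.str.len as computing the length of each element, where the element may be a string or another sized container such as a list. The small check below, on made-up data, suggests the same replacement could serve both cases.

```python
import pandas as pd

strings = pd.Series(["a", "bcd", ""])
lists = pd.Series([[1, 2], [], [3]])

# The same accessor handles strings and lists element-wise.
print(strings.str.len().tolist())  # [1, 3, 0]
print(lists.str.len().tolist())    # [2, 0, 1]

# Both match the row-by-row apply(len) results.
strings_match = strings.str.len().tolist() == strings.apply(len).tolist()
lists_match = lists.str.len().tolist() == lists.apply(len).tolist()
```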

ENH: Vectorize iterrows

For instance

df = pd.DataFrame({"A": [0, 1], "B": [0, 2]})

res = []
for index, row in df.iterrows():
    res.append(row["A"])

should be converted to

df = pd.DataFrame({"A": [0, 1], "B": [0, 2]})

res = df["A"].tolist()

BUG: Unexpected KeyError while referencing a module

replace_apply called on the following code raises a KeyError:

import json
def func(row):
    return json.loads(row)

x = df.apply(func, axis=1)

which results in:

...
self = <apply.FunctionBodyParser object at 0x7eff77764c10>
expr_node = <ast.Name object at 0x7eff77764910>
dependencies = {'row': ApplyFuncArg()}

    def _resolve_expr(self, expr_node: Optional[ast.expr], dependencies: Dict[str, Expr]) -> Expr:
        """
        Build an Expr object from a ast.expr node, using the current dependencies.
        """
    
        assert expr_node is not None, "Trying to resolve an empty expression node"
    
        if isinstance(expr_node, ast.Constant):
            return Constant(expr_node.value)
    
        if isinstance(expr_node, ast.Name):
            if expr_node.id in self.numpy_alias or expr_node.id in self.pandas_alias:
                return Alias(expr_node.id)
>           return dependencies[expr_node.id]
E           KeyError: 'json'

apply.py:288: KeyError

ENH: Support nested UDF calls

For instance

def add_one(value):
    return value + 1

def func(value):
    return add_one(value)

s1 = pd.Series([0, 1])
s2 = s1.apply(func)

should be converted to

s1 = pd.Series([0, 1])
s2 = s1 + 1

Support in operator

This operator is tricky as it can be used both with lists and strings, so we can't always be sure whether we should use isin or str.contains.
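To make the two readings concrete, here is a small sketch on made-up data. Which replacement applies depends on which side of in the applied value appears, something the tool would have to infer from the UDF.

```python
import pandas as pd

s = pd.Series(["apple", "banana", "cherry"])

# Case 1: the value is tested against a list -> Series.isin
#   def func(val): return val in ["apple", "cherry"]
membership = s.isin(["apple", "cherry"])

# Case 2: a substring is tested against the value -> Series.str.contains
#   def func(val): return "an" in val
# regex=False gives a literal match, like Python's `in` on strings.
substring = s.str.contains("an", regex=False)

print(membership.tolist())  # [True, False, True]
print(substring.tolist())   # [False, True, False]
```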

ENH: Vectorize for loops

For instance

df = pd.DataFrame({"A": [0, 1], "B": [0, 2]})

for i in range(len(df)):
    df["B"][i] = df["A"][i] + df["B"][i]

should be converted to

df = pd.DataFrame({"A": [0, 1], "B": [0, 2]})

df["B"] = df["A"] + df["B"]

Vectorize code located within methods

For instance

def func(val):
    return val + 1

class A():
    def method(self):
        s = pd.Series(range(100))
        s = s.apply(func)

should be converted to

class A():
    def method(self):
        s = pd.Series(range(100))
        s = s + 1

ENH: Support f-strings

For instance

def func(val):
    return f"{val}_0"
s = s.apply(func)

should be converted to

s = s.astype(str) + "_0"

BUG: Remove extra str methods

Some entries in apply.STR_METHODS actually correspond to Python built-in functions that take a string as a parameter, such as len. The replacement mechanism for these differs from the one used for true string methods such as startswith.

ENH: Support row.field notation

For instance

def func(row):
    return row.A + row.B

df["sum"] = df.apply(func, axis=1)

should be converted to

df["sum"] = df.A + df.B
