eyaltrabelsi / pandas-log

The goal of pandas-log is to provide feedback about basic pandas operations. It provides simple wrappers around the most common pandas functions that add extra logging.

License: MIT License

Makefile 3.95% Python 96.05%

pandas-log's Introduction

Hi there, I'm Eyal 👋

Enthusiastic Software Engineer👷
Who appreciates good software engineering 🙏
I have a big passion for Python 🐍, Machine Learning 🤖, Databases 🛢️, Scale and Performance Optimisations 🦸 and making all of these easy to use.


pandas-log's People

Contributors

Stargazers

Watchers

Forkers

deanla yatrik11 mikitachab californiapolicylab devenlu stjordanis al1p-r rsorma04 dmil korsbakken sourcery-ai-bot iq-scm

pandas-log's Issues

Ease up finding bottlenecks

Brief Description

I would like to propose adding each operation's percentage of total execution time to make bottlenecks easier to find.
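One way to picture this proposal is the sketch below: given per-step timings, it computes each step's share of the total. The function name `percent_of_total` and the sample timings are hypothetical and are not part of pandas-log.

```python
def percent_of_total(step_seconds):
    """Return each step's share of total execution time, in percent."""
    total = sum(step_seconds.values())
    if total == 0:
        # Avoid division by zero when nothing was timed.
        return {name: 0.0 for name in step_seconds}
    return {name: 100.0 * secs / total for name, secs in step_seconds.items()}


# Hypothetical timings (in seconds) for three chained steps.
timings = {"query": 0.002, "merge": 0.006, "groupby": 0.002}
shares = percent_of_total(timings)

# The step with the largest share is the first candidate bottleneck.
slowest = max(shares, key=shares.get)
print(slowest, round(shares[slowest], 1))
```

Reporting a percentage next to the absolute time would let a reader spot the dominant step without comparing raw numbers.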

A way to enable globally?

Brief Description

I'm looking for a way to enable pandas-log globally without using the context manager. Is that possible right now? If not, what do you think about this feature? Thanks.

Add tips/watch out section

Brief Description

I would like to propose a section with various tips, such as:

warn if using iterrows
use resample for group by on timestamp

This could use dovpanda once it stabilizes.
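A tip such as "warn if using iterrows" could be sketched as a decorator that emits a warning whenever the wrapped method is called. The names `warn_on_call` and `Frame` are hypothetical stand-ins; this is not how pandas-log or dovpanda implement tips.

```python
import warnings
from functools import wraps


def warn_on_call(tip):
    """Build a decorator that emits `tip` as a UserWarning on every call."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            warnings.warn(tip, UserWarning, stacklevel=2)
            return fn(*args, **kwargs)
        return wrapper
    return decorator


class Frame:
    """Hypothetical stand-in for a DataFrame-like object."""

    def __init__(self, rows):
        self.rows = rows

    @warn_on_call("iterrows is slow; prefer vectorized operations where possible")
    def iterrows(self):
        return iter(enumerate(self.rows))


# Capture warnings so the tip can be inspected instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    rows = list(Frame(["a", "b"]).iterrows())
```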

pd.merge nonetype object has no attribute 'memory_usage'

Brief Description

System Information

Operating system: Windows
OS details (optional):
Python version (required): Python 3.6

installed via pip

Minimally Reproducible Code

import pandas as pd
import pandas_log

df_a = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
df_b = pd.DataFrame({'c': [11, 12, 13], 'b': ['a', 'b', 'c']})

with pandas_log.enable():
    df = (
        pd.merge(df_a, df_b, on='b')
    )

Error Messages

--------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-c5cf824082a8> in <module>
      1 with pandas_log.enable():
      2     df = (
----> 3         pd.merge(df_a,df_b,on='b')
      4     )

~\.conda\envs\empl\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     79         copy=copy,
     80         indicator=indicator,
---> 81         validate=validate,
     82     )
     83     return op.get_result()

~\.conda\envs\empl\lib\site-packages\pandas\core\reshape\merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    624             self.right_join_keys,
    625             self.join_names,
--> 626         ) = self._get_merge_keys()
    627 
    628         # validate the merge keys dtypes. We may need to coerce

~\.conda\envs\empl\lib\site-packages\pandas\core\reshape\merge.py in _get_merge_keys(self)
   1031 
   1032         if right_drop:
-> 1033             self.right = self.right._drop_labels_or_levels(right_drop)
   1034 
   1035         return left_keys, right_keys, join_names

~\.conda\envs\empl\lib\site-packages\pandas\core\generic.py in _drop_labels_or_levels(self, keys, axis)
   1860             # Handle dropping columns labels
   1861             if labels_to_drop:
-> 1862                 dropped.drop(labels_to_drop, axis=1, inplace=True)
   1863         else:
   1864             # Handle dropping column levels

~\.conda\envs\empl\lib\site-packages\pandas_flavor\register.py in __call__(self, *args, **kwargs)
     27             @wraps(method)
     28             def __call__(self, *args, **kwargs):
---> 29                 return method(self._obj, *args, **kwargs)
     30 
     31         register_dataframe_accessor(method.__name__)(AccessorMethod)

~\.conda\envs\empl\lib\site-packages\pandas_log\pandas_log.py in wrapped(*args, **fn_kwargs)
    134                 full_signature,
    135                 silent,
--> 136                 verbose,
    137             )
    138             return output_df

~\.conda\envs\empl\lib\site-packages\pandas_log\pandas_log.py in _run_method_and_calc_stats(fn, fn_args, fn_kwargs, input_df, full_signature, silent, verbose)
    104 
    105         output_df, execution_stats = get_execution_stats(
--> 106             fn, input_df, fn_args, fn_kwargs
    107         )
    108 

~\.conda\envs\empl\lib\site-packages\pandas_log\pandas_execution_stats.py in get_execution_stats(fn, input_df, fn_args, fn_kwargs)
     35 
     36     input_memory_size = StepStats.calc_df_series_memory(input_df)
---> 37     output_memory_size = StepStats.calc_df_series_memory(output_df)
     38 
     39     ExecutionStats = namedtuple(

~\.conda\envs\empl\lib\site-packages\pandas_log\pandas_execution_stats.py in calc_df_series_memory(df_or_series)
     78     @staticmethod
     79     def calc_df_series_memory(df_or_series):
---> 80         memory_size = df_or_series.memory_usage(index=True, deep=True)
     81         return (
     82             humanize.naturalsize(memory_size.sum())

AttributeError: 'NoneType' object has no attribute 'memory_usage'
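Reading the traceback, the wrapped call appears to be `drop(..., inplace=True)`, which returns `None`; that `None` then reaches `calc_df_series_memory`. A minimal defensive sketch, assuming a guard is an acceptable fix, is shown below. `safe_memory_bytes`, `FakeUsage`, and `FakeFrame` are hypothetical names used for illustration and are not the library's code.

```python
def safe_memory_bytes(obj):
    """Return total memory in bytes, or None when obj has no memory_usage (e.g. None)."""
    memory_usage = getattr(obj, "memory_usage", None)
    if memory_usage is None:
        # In-place operations return None, so there is nothing to measure.
        return None
    return int(memory_usage(index=True, deep=True).sum())


class FakeUsage:
    """Hypothetical stand-in for the Series returned by memory_usage."""

    def __init__(self, values):
        self.values = values

    def sum(self):
        return sum(self.values)


class FakeFrame:
    """Hypothetical stand-in for a DataFrame."""

    def memory_usage(self, index=True, deep=True):
        return FakeUsage([128, 24, 24])


print(safe_memory_bytes(None))
print(safe_memory_bytes(FakeFrame()))
```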

Pandas regression test

Brief Description

Run the pandas test suite to make sure no method's functionality was affected.

No module humanize or pandas-flavor

Brief Description

Are the required dependencies for this package up to date? I ran pip install pandas-log, then got module import errors when I tried importing pandas_log in my notebook:

--> 114             import pandas_log
    115 
    116             with pandas_logs.enable():

~/GitHub/simple-tech-challenges/venv/lib/python3.8/site-packages/pandas_log/__init__.py in <module>
      2 
      3 """Top-level package for pandas-log."""
----> 4 from .pandas_log import *
      5 
      6 __author__ = """Eyal Trabelsi"""

~/GitHub/simple-tech-challenges/venv/lib/python3.8/site-packages/pandas_log/pandas_log.py in <module>
     15     restore_pandas_func_copy,
     16 )
---> 17 from pandas_log.pandas_execution_stats import StepStats, get_execution_stats
     18 
     19 __all__ = ["auto_enable", "auto_disable", "enable"]

~/GitHub/simple-tech-challenges/venv/lib/python3.8/site-packages/pandas_log/pandas_execution_stats.py in <module>
     22 with warnings.catch_warnings():
     23     warnings.simplefilter("ignore")
---> 24     import humanize
     25 
     26 

ModuleNotFoundError: No module named 'humanize'
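A quick way to check which dependencies are missing before importing the package is sketched below. The helper `missing_modules` is hypothetical; the two module names come from the issue title and the tracebacks on this page.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules that pandas_log imports, according to the tracebacks on this page.
required = ["humanize", "pandas_flavor"]
print(missing_modules(required))
```

Any name printed here would need to be installed separately (for example with pip) if the package's own install did not pull it in.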

Adding memory footprint of each operation

Brief Description

It would be nice if each operation also reported the dataframe's memory footprint.

Integrate with Python logging module

Integrate with Python logging

I would love a way to integrate this with the standard Python logging module, or a drop-in replacement thereof, such as loguru.

Such an integration would make it more useful when running production data-science code, and ease adoption of this library, which I think is a really interesting idea!
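As a rough illustration of what such an integration might look like, the sketch below routes step messages through the standard `logging` module instead of printing them. The logger name, the `log_step` helper, and the `ListHandler` class are assumptions made for this example; pandas-log does not necessarily expose anything like them.

```python
import logging

# Hypothetical logger name; callers could attach their own handlers to it.
logger = logging.getLogger("pandas_log_sketch")


def log_step(step_name, message, level=logging.INFO):
    """Send one step's summary to the logger rather than to stdout."""
    logger.log(level, "%s: %s", step_name, message)


class ListHandler(logging.Handler):
    """Collects formatted messages in memory, handy for tests."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.INFO)

log_step("query", "removed 10 rows, 90 rows remaining")
```

Because the messages go through a named logger, production code could filter, format, or redirect them with ordinary logging configuration.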

All logs should be suppressed after disable being called

Brief Description

Currently, after disabling, some methods still produce logs although they shouldn't, because the reference to the pandas method is different from the instance method.

Minimally Reproducible Code

with enable():
    df = pd.read_csv("../examples/pokemon.csv")
    (
        df.query("legendary==0")
        .query("type_1=='fire' or type_2=='fire'")
    )
df.query("legendary==1")
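The expected behaviour can be sketched with a context manager that patches a method on the class and always restores the original on exit, so calls made after the block are not logged. `Frame`, `enable_logging`, and `calls` are hypothetical names; this is an illustration of the idea, not pandas-log's implementation.

```python
from contextlib import contextmanager


class Frame:
    """Hypothetical stand-in for a DataFrame-like class."""

    def query(self, expr):
        return f"result of {expr}"


calls = []  # records every logged call


@contextmanager
def enable_logging(cls, method_name):
    """Patch cls.method_name with a logging wrapper, restoring it on exit."""
    original = getattr(cls, method_name)

    def wrapped(self, *args, **kwargs):
        calls.append(method_name)
        return original(self, *args, **kwargs)

    setattr(cls, method_name, wrapped)
    try:
        yield
    finally:
        # Restore on the class so existing instances also stop logging.
        setattr(cls, method_name, original)


df = Frame()
with enable_logging(Frame, "query"):
    df.query("legendary==0")
df.query("legendary==1")  # outside the block: should not be logged
```

Patching and restoring at the class level means instances created inside the block pick up the restored method too, since method lookup goes through the class.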

Add CI/CD

Brief Description

Add Travis for CI/CD

TypeError: data type not understood

Brief Description

I'm trying to run pandas-log on my chain and it fails with the error:

TypeError: data type not understood

System Information

Python version (required): Python 3.8.5
Pandas version: 1.3.2

Minimally Reproducible Code

import pandas as pd
autos = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip')
def to_tz(df_, time_col, tz_offset, tz_name):
    return (df_
             .groupby(tz_offset)
             [time_col]
             .transform(lambda s: pd.to_datetime(s)
                 .dt.tz_localize(s.name, ambiguous=True)
                 .dt.tz_convert(tz_name))
            )


def tweak_autos(autos):
    cols = ['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive', 'eng_dscr', 
        'fuelCost08', 'make', 'model', 'trany', 'range', 'createdOn', 'year']
    return (autos
     [cols]
     .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),
             displ=autos.displ.fillna(0).astype('float16'),
             drive=autos.drive.fillna('Other').astype('category'),
             automatic=autos.trany.str.contains('Auto'),
             speeds=autos.trany.str.extract(r'(\d)+').fillna('20').astype('int8'),
             tz=autos.createdOn.str.extract(r'\d\d:\d\d ([A-Z]{3}?)').replace('EDT', 'EST5EDT'),
             str_date=(autos.createdOn.str.slice(4,19) + ' ' + autos.createdOn.str.slice(-4)),
             createdOn=lambda df_: to_tz(df_, 'str_date', 'tz', 'US/Eastern'),
             ffs=autos.eng_dscr.str.contains('FFS')
            )
     .pipe(show, rows=2, title='New Cols')            
     .astype({'highway08': 'int8', 'city08': 'int16', 'comb08': 'int16', 'fuelCost08': 'int16',
              'range': 'int16',  'year': 'int16', 'make': 'category'})
     .drop(columns=['trany', 'eng_dscr'])
    )
import pandas_log
with pandas_log.enable():
    tweak_autos(autos)

Error Messages

1) fillna(value: 'object | ArrayLike | None' ="20", method: 'FillnaOptions | None' = None, axis: 'Axis | None' = None, inplace: 'bool' = False, limit=None, downcast=None):
	Metadata:
	* Filled 837 with 20.
	Execution Stats:
	* Execution time: Step Took 0.001512 seconds.

1) replace(to_replace="EDT", value="EST5EDT", inplace: 'bool' = False, limit=None, regex: 'bool' = False, method: 'str' = 'pad'):
	Execution Stats:
	* Execution time: Step Took 0.001215 seconds.

1) groupby(by="tz", axis: 'Axis' = 0, level: 'Level | None' = None, as_index: 'bool' = True, sort: 'bool' = True, group_keys: 'bool' = True, squeeze: 'bool | lib.NoDefault' = <no_default>, observed: 'bool' = False, dropna: 'bool' = True):
	Metadata:
	* Grouping by tz resulted in 2 groups like 
		EST,
		EST5EDT,
	  and more.
	Execution Stats:
	* Execution time: Step Took 0.006409 seconds.
/home/matt/envs/menv/lib/python3.8/site-packages/pandas_log/patched_logs_functions.py:249: UserWarning: Some pandas logging may involve copying dataframes, which can be time-/memory-intensive. Consider passing copy_ok=False to the enable/auto_enable functions in pandas_log if issues arise.
  warnings.warn(COPY_WARNING_MSG)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-f6bfc55c635b> in <module>
     33 import pandas_log
     34 with pandas_log.enable():
---> 35     tweak_autos(autos)

<ipython-input-1-f6bfc55c635b> in tweak_autos(autos)
     14     cols = ['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive', 'eng_dscr', 
     15         'fuelCost08', 'make', 'model', 'trany', 'range', 'createdOn', 'year']
---> 16     return (autos
     17      [cols]
     18      .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),

~/envs/menv/lib/python3.8/site-packages/pandas_flavor/register.py in __call__(self, *args, **kwargs)
     27             @wraps(method)
     28             def __call__(self, *args, **kwargs):
---> 29                 return method(self._obj, *args, **kwargs)
     30 
     31         register_dataframe_accessor(method.__name__)(AccessorMethod)

~/envs/menv/lib/python3.8/site-packages/pandas_log/pandas_log.py in wrapped(*args, **fn_kwargs)
    184 
    185             input_df, fn_args = args[0], args[1:]
--> 186             output_df = _run_method_and_calc_stats(
    187                 fn,
    188                 fn_args,

~/envs/menv/lib/python3.8/site-packages/pandas_log/pandas_log.py in _run_method_and_calc_stats(fn, fn_args, fn_kwargs, input_df, full_signature, silent, verbose, copy_ok, calculate_memory)
    168             output_df,
    169         )
--> 170         step_stats.log_stats_if_needed(silent, verbose, copy_ok)
    171         if isinstance(output_df, pd.DataFrame) or isinstance(output_df, pd.Series):
    172             step_stats.persist_execution_stats()

~/envs/menv/lib/python3.8/site-packages/pandas_log/pandas_execution_stats.py in log_stats_if_needed(self, silent, verbose, copy_ok)
    106 
    107         if verbose or self.fn.__name__ not in DATAFRAME_ADDITIONAL_METHODS_TO_OVERIDE:
--> 108             s = self.__repr__(verbose, copy_ok)
    109             if s:
    110                 # If this method isn't patched and verbose is False, __repr__ will give an empty string, which

~/envs/menv/lib/python3.8/site-packages/pandas_log/pandas_execution_stats.py in __repr__(self, verbose, copy_ok)
    147 
    148         # Step Metadata stats
--> 149         logs, tips = self.get_logs_for_specifc_method(verbose, copy_ok)
    150         metadata_stats = f"\033[4mMetadata\033[0m:\n{logs}" if logs else ""
    151         metadata_tips = f"\033[4mTips\033[0m:\n{tips}" if tips else ""

~/envs/menv/lib/python3.8/site-packages/pandas_log/pandas_execution_stats.py in get_logs_for_specifc_method(self, verbose, copy_ok)
    128 
    129         log_method = partial(log_method, self.output_df, self.input_df)
--> 130         logs, tips = log_method(*self.fn_args, **self.fn_kwargs)
    131         return logs, tips
    132 

~/envs/menv/lib/python3.8/site-packages/pandas_log/patched_logs_functions.py in log_assign(output_df, input_df, **kwargs)
    250             # If copying is ok, we can check how many values actually changed
    251             for col in changed_cols:
--> 252                 values_changed, values_unchanged = num_values_changed(
    253                     input_df[col], output_df[col]
    254                 )

~/envs/menv/lib/python3.8/site-packages/pandas_log/patched_logs_functions.py in num_values_changed(input_obj, output_obj)
    127         isinstance(input_obj, pd.Series)
    128         and isinstance(output_obj, pd.Series)
--> 129         and input_obj.dtype != output_obj.dtype
    130     ):
    131         # Comparing values for equality across dtypes wouldn't be well-defined so we just say they all changed

TypeError: Cannot interpret 'datetime64[ns, US/Eastern]' as a data type
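The last frame shows the failure coming from the `input_obj.dtype != output_obj.dtype` comparison. One possible workaround, sketched below under the assumption that an uncomparable pair can simply be treated as "different", wraps the comparison in a `try`/`except`. `dtypes_differ` and `PickyDtype` are hypothetical names; this is not the library's fix.

```python
def dtypes_differ(left_dtype, right_dtype):
    """Return True when the dtypes differ or cannot be compared at all."""
    try:
        return bool(left_dtype != right_dtype)
    except TypeError:
        # Some dtype pairs raise instead of returning a boolean.
        return True


class PickyDtype:
    """Hypothetical dtype whose comparison raises, mimicking the traceback."""

    def __ne__(self, other):
        raise TypeError("Cannot interpret 'datetime64[ns, US/Eastern]' as a data type")


print(dtypes_differ(PickyDtype(), "object"))
print(dtypes_differ("int8", "int8"))
```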

Accessible Medium post

Description

Writing a Medium post in addition to the docs may give readers a more comprehensive understanding of both the need and the usage.

Relevant Context

Tutorial for writing a Medium post

Allow pretty html output

Brief Description

I would like to propose the ability to generate HTML from the execution history.
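A very small sketch of the idea: render a list of (step, seconds) pairs as an HTML table. The function `history_to_html` and the shape of the history list are assumptions for illustration; the timings are taken from the log output quoted elsewhere on this page.

```python
from html import escape


def history_to_html(steps):
    """Render (step_name, seconds) pairs as a simple HTML table."""
    rows = "\n".join(
        f"<tr><td>{escape(name)}</td><td>{seconds:.6f}</td></tr>"
        for name, seconds in steps
    )
    return (
        "<table>\n<tr><th>Step</th><th>Seconds</th></tr>\n"
        f"{rows}\n</table>"
    )


# Step timings as they appear in the log output quoted on this page.
history = [("fillna", 0.001512), ("replace", 0.001215), ("groupby", 0.006409)]
page = history_to_html(history)
print(page)
```

Escaping the step names keeps expressions such as comparison operators from being interpreted as markup.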

fix link to open new issue

Brief Description of Fix

Currently, the link to submit an issue points to ericmjl, probably left over from a cookiecutter template.

I would like to propose a change, such that now the docs...

Relevant Context

readthedocs contributing page

Link to documentation page
