
pyllars's Introduction

pyllars

This project contains supporting utilities for Python 3.

Installation

This package is available on PyPI.

pip3 install pyllars

Alternatively, the package can be installed from source.

git clone https://github.com/bmmalone/pyllars
cd pyllars
pip3 install .

(The trailing period is required.)

If possible, I recommend installing inside a virtual environment or with conda.

Please see the documentation for more details.

pytorch and ray installation

The pytorch.torch submodule requires pytorch and ray-tune. Due to the various configuration options (CPU vs. GPU, etc.), the pyllars installation does not attempt to install these dependencies; they need to be installed manually, though this can be done after installing pyllars with no problems. I suggest using these within an anaconda or similar environment. Please see the official documentation for installing pytorch and ray-tune for details.

pyllars's People

Contributors

bmmalone, stianlagstad

pyllars's Issues

pandas_utils requires shared build of python

misc.pandas_utils imports fastparquet, which in turn imports numba. It seems that numba requires a shared build of python:

  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/misc/pandas_utils.py", line 19, in <module>
    import fastparquet
  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/fastparquet/__init__.py", line 8, in <module>
    from .core import read_thrift
  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/fastparquet/core.py", line 13, in <module>
    from . import encoding
  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/fastparquet/encoding.py", line 8, in <module>
    import numba
  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/__init__.py", line 12, in <module>
    from .special import typeof, prange
  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/special.py", line 3, in <module>
    from .typing.typeof import typeof
  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/typing/__init__.py", line 2, in <module>
    from .context import BaseContext, Context
  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/typing/context.py", line 10, in <module>
    from numba.typeconv import Conversion, rules
  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/typeconv/rules.py", line 3, in <module>
    from .typeconv import TypeManager, TypeCastingRules
  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/typeconv/typeconv.py", line 3, in <module>
    from . import _typeconv, castgraph, Conversion
ImportError: libpython3.5m.so.1.0: cannot open shared object file: No such file or directory

Removing the import resolves the problem.

The easiest option is to move the fastparquet import inside the respective functions. Then, using fastparquet simply requires python to be built with --enable-shared.

Add a `sample_dirichlet_multinomial` function to `stats_utils`

Here is a basic example:

import numpy as np

def sample_dirichlet_multinomial(dirichlet_alphas:np.ndarray, num_samples:int) -> np.ndarray:
    """ Sample multinomial probabilities from the Dirichlet prior, then sample counts """
    pvals = np.random.dirichlet(dirichlet_alphas)
    sampled_counts = np.random.multinomial(n=num_samples, pvals=pvals)
    return sampled_counts

Document dataset manager

All of the package could use improved documentation; however, the dataset manager is used in several external projects. The fields it exposes, and how it determines them, should be explained in much more detail.

Allow `None` for the client in dask_utils

It would be convenient if the various apply functions in dask_utils could accept None as the dask_client. In that case, the implementation can fall back to the respective function in pd_utils.
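A minimal sketch of the proposed fallback, assuming an apply-style helper; the function name and signature here are illustrative rather than the actual dask_utils API:

```python
def apply_iter(items, func, dask_client=None):
    """Apply func to each item, distributing via dask when a client is given."""
    if dask_client is None:
        # no client: fall back to a plain serial map, as the
        # corresponding pd_utils implementation would
        return [func(item) for item in items]

    # with a client, submit the work and gather the results
    futures = [dask_client.submit(func, item) for item in items]
    return dask_client.gather(futures)
```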

Add `union_` and `intersect_masks` functions to `pandas_utils`

Here are example implementations:

from typing import Sequence
import numpy as np

def intersect_masks(masks:Sequence[np.ndarray]) -> np.ndarray:
    m_intersect = np.all(list(masks), axis=0)
    return m_intersect

def union_masks(masks:Sequence[np.ndarray]) -> np.ndarray:
    m_union = np.any(list(masks), axis=0)
    return m_union
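The intended behavior can be checked directly with numpy's np.all/np.any, which the implementations above delegate to (a quick usage sketch):

```python
import numpy as np

m1 = np.array([True, True, False])
m2 = np.array([True, False, False])

# intersection: True only where every mask is True
m_intersect = np.all([m1, m2], axis=0)
# union: True where any mask is True
m_union = np.any([m1, m2], axis=0)

print(m_intersect.tolist())  # -> [True, False, False]
print(m_union.tolist())      # -> [True, True, False]
```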

Add a "get_class" function

This function should take in a class name and return the class. It is just a light wrapper around importlib.import_module and should look similar to the following.

import importlib

def get_class(fully_qualified_class_name):
    """ Convert the string version of a class to the class object
    
    For example, for the input "keras.optimizers.Adam", this function
    will return the Adam class. It could then be called to create
    an instance of an Adam optimizer.
    
    Parameters
    ----------
    fully_qualified_class_name : str
        The name of the class
        
    Returns
    -------
    clazz : type
        The class object
    """
    sp = fully_qualified_class_name.split(".")

    module_name = ".".join(sp[:-1])
    class_name = sp[-1]
    
    m = importlib.import_module(module_name)
    clazz = getattr(m, class_name)
    
    return clazz
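As a usage sketch, the same two-step resolution applied to a standard-library class:

```python
import importlib
from collections import OrderedDict

# split "collections.OrderedDict" into its module and class parts,
# exactly as get_class does internally
module_name, class_name = "collections.OrderedDict".rsplit(".", 1)
clazz = getattr(importlib.import_module(module_name), class_name)

assert clazz is OrderedDict  # the resolved object is the real type
instance = clazz(a=1)        # and it can be instantiated directly
```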

Update deprecated `sheetname` when reading excel files in pd_utils

In particular, reading an excel file currently raises the following warning or similar:

DEBUG    : The guessed filetype was: excel
/home/bmalone/.virtualenvs/nes-ehr/lib/python3.6/site-packages/pandas/util/_decorators.py:188: FutureWarning: The `sheetname` keyword is deprecated, use `sheet_name` instead
  return func(*args, **kwargs)

Update sheetname to sheet_name.

Build documentation with sphinx

The internal documentation is largely compatible with sphinx (sklearn-style). Fix any improperly formatted documentation and build it with sphinx.

Add an `apply_rolling_window` helper to pd_utils.

The idea is that this function applies a function to overlapping rows in a data frame (that is, a "rolling window"). A basic implementation is as follows:

import tqdm

def apply_rolling_window(df, func, window_size, progress_bar=False):
    """ Apply func to each overlapping window_size-row slice of df """
    # only start windows which fit entirely within the data frame
    num_windows = len(df) - window_size + 1

    it = range(num_windows)

    if progress_bar:
        it = tqdm.trange(num_windows)

    ret = [
        func(df.iloc[i: i+window_size])
            for i in it
    ]

    return ret

Document modules

We need to document the following modules:

  • collection_utils
  • dask_utils
  • deprecated_decorator
  • external_sparse_matrix_list
  • gene_ontology_utils
  • hyperparameter_utils
  • incremental_gaussian_estimator
  • latex_utils
  • logging_utils
  • math_utils
  • matrix_utils
  • missing_data_utils
  • ml_utils
  • mpl_utils
  • mygene_utils
  • nlp_utils
  • pandas_utils
  • physionet_utils
  • scip_utils
  • shell_utils
  • sparse_vector
  • ssh_utils
  • stats_utils
  • string_utils
  • suppress_stdout_stderr
  • utils
  • validation_utils

Handle old versions of pyyaml gracefully

Right now, old versions of pyyaml cause the following error: AttributeError: module 'yaml' has no attribute 'full_load'. This is due to having an old version of pyyaml (and falling back to a bare "load" call there would be unsafe).
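One way to degrade gracefully is to feature-detect full_load and fall back to safe_load on older pyyaml versions (a sketch; the wrapper name is illustrative):

```python
import yaml

def yaml_load(stream):
    """Load YAML with full_load when it is available (pyyaml >= 5.1);
    otherwise fall back to safe_load rather than the unsafe bare load."""
    if hasattr(yaml, "full_load"):
        return yaml.full_load(stream)
    return yaml.safe_load(stream)
```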

Add a `get_index_and_reverse_map` function

This function should return two maps which map from items to indices and back from indices to items. It should also check that the items are unique (or the reverse map will not work). That is, this should be a bijective mapping. Here is a basic example:

    def get_index_and_reverse_map(items):
        index_map = {c:i for i, c in enumerate(items)}

        # verify the items are unique, so the reverse map is valid
        if len(index_map) != len(items):
            raise ValueError("the items are not unique")

        reverse_index_map = {i:c for c,i in index_map.items()}
        return index_map, reverse_index_map

Handle timeouts in dask_utils.collect_results

For large jobs, dask sometimes times out when retrieving individual future results. It is not clear why this happens. future.result has a timeout parameter, so that can be used to avoid indefinite hangs waiting for specific jobs.

The function should probably also have an option to return the timed-out futures.
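A sketch of that behavior using the standard-library concurrent.futures API, whose result(timeout=...) signature matches dask's distributed futures; the function name and the return_timeouts option mirror the suggestion above but are illustrative:

```python
import concurrent.futures

def collect_results(futures, timeout=None, return_timeouts=False):
    """Gather future results, skipping any that exceed the timeout."""
    results, timed_out = [], []
    for future in futures:
        try:
            results.append(future.result(timeout=timeout))
        except concurrent.futures.TimeoutError:
            # record the stalled future instead of hanging indefinitely
            timed_out.append(future)

    if return_timeouts:
        return results, timed_out
    return results
```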

Simplify mpl_utils.plot_roc_curve

Currently, this function is very complicated, and its parameters are highly non-obvious. Simplify it to plot a single ROC curve. Users can call it multiple times if desired.

Add a `remove_nans` function

This function returns the non-nan values from an array. This should probably be a part of math_utils, or maybe matrix_utils. A basic implementation is as follows.

import pandas as pd

def remove_nans(vals):
    # pd.isnull also catches None, unlike np.isnan
    m_nan = pd.isnull(vals)
    vals = vals[~m_nan]
    return vals
