
pyllars's Introduction

pyllars

This project contains supporting utilities for Python 3.

Installation

This package is available on PyPI.

pip3 install pyllars

Alternatively, the package can be installed from source.

git clone https://github.com/bmmalone/pyllars
cd pyllars
pip3 install .

(The trailing period is required.)

If possible, I recommend installing inside a virtual environment or with conda.

Please see the documentation for more details.

pytorch and ray installation

The pytorch.torch submodule requires pytorch and ray-tune. Due to the various configuration options (CPU vs. GPU, etc.), the pyllars installation does not attempt to install these dependencies; they need to be installed manually, though this can be done after installing pyllars with no problems. I suggest using these within an anaconda or similar environment. Please see the official documentation for installing pytorch and ray-tune for details.

pyllars's People

Contributors

bmmalone, stianlagstad

pyllars's Issues

pandas_utils requires shared build of python

misc.pandas_utils imports fastparquet, which in turn imports numba. It seems that numba requires a shared build of python:

  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/misc/pandas_utils.py", line 19, in <module>
    import fastparquet
  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/fastparquet/__init__.py", line 8, in <module>
    from .core import read_thrift
  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/fastparquet/core.py", line 13, in <module>
    from . import encoding
  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/fastparquet/encoding.py", line 8, in <module>
    import numba
  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/__init__.py", line 12, in <module>
    from .special import typeof, prange
  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/special.py", line 3, in <module>
    from .typing.typeof import typeof
  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/typing/__init__.py", line 2, in <module>
    from .context import BaseContext, Context
  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/typing/context.py", line 10, in <module>
    from numba.typeconv import Conversion, rules
  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/typeconv/rules.py", line 3, in <module>
    from .typeconv import TypeManager, TypeCastingRules
  File "/beegfs/homes/bmalone/.virtualenvs/rpbp-test/lib/python3.5/site-packages/numba/typeconv/typeconv.py", line 3, in <module>
    from . import _typeconv, castgraph, Conversion
ImportError: libpython3.5m.so.1.0: cannot open shared object file: No such file or directory

Removing the import resolves the problem.

The easiest option is to move the fastparquet import inside the respective functions. Then, using fastparquet simply requires python to be built with --enable-shared.

Add a `sample_dirichlet_multinomial` function to `stats_utils`

Here is a basic example:

import numpy as np

def sample_dirichlet_multinomial(dirichlet_alphas:np.ndarray, num_samples:int) -> np.ndarray:
    """ Sample multinomial probabilities from the Dirichlet prior, then sample counts """
    pvals = np.random.dirichlet(dirichlet_alphas)
    sampled_counts = np.random.multinomial(n=num_samples, pvals=pvals)
    return sampled_counts

Document dataset manager

All of the package could use improved documentation; however, the dataset manager is used in several external projects. The fields it exposes, and how it determines them, should be explained in much more detail.

Allow `None` for the client in dask_utils

It would be convenient if the various apply functions in dask_utils could accept None as the dask_client. In that case, the implementation can fall back to the respective function in pd_utils.
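A minimal sketch of the proposed fallback, assuming an apply-style helper; the function name and signature here are illustrative rather than the actual dask_utils API:

```python
def apply_iter(items, func, dask_client=None):
    """Apply func to each item, distributing via dask when a client is given."""
    if dask_client is None:
        # no client: fall back to a plain serial map, as the
        # corresponding pd_utils implementation would
        return [func(item) for item in items]

    # with a client, submit the work and gather the results
    futures = [dask_client.submit(func, item) for item in items]
    return dask_client.gather(futures)
```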

Add `union_` and `intersect_masks` functions to `pandas_utils`

Here are example implementations:

from typing import Sequence
import numpy as np

def intersect_masks(masks:Sequence[np.ndarray]) -> np.ndarray:
    m_intersect = np.all(list(masks), axis=0)
    return m_intersect

def union_masks(masks:Sequence[np.ndarray]) -> np.ndarray:
    m_union = np.any(list(masks), axis=0)
    return m_union
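The intended behavior can be checked directly with numpy's np.all/np.any, which the implementations above delegate to (a quick usage sketch):

```python
import numpy as np

m1 = np.array([True, True, False])
m2 = np.array([True, False, False])

# intersection: True only where every mask is True
m_intersect = np.all([m1, m2], axis=0)
# union: True where any mask is True
m_union = np.any([m1, m2], axis=0)

print(m_intersect.tolist())  # -> [True, False, False]
print(m_union.tolist())      # -> [True, True, False]
```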

Add a "get_class" function

This function should take in a class name and return the class. It is just a light wrapper around importlib.import_module and should look similar to the following.

import importlib

def get_class(fully_qualified_class_name):
    """ Convert the string version of a class to the class object
    
    For example, for the input "keras.optimizers.Adam", this function
    will return the Adam class. It could then be called to create
    an instance of an Adam optimizer.
    
    Parameters
    ----------
    fully_qualified_class_name : str
        The name of the class
        
    Returns
    -------
    clazz : type
        The class object
    """
    sp = fully_qualified_class_name.split(".")

    module_name = ".".join(sp[:-1])
    class_name = sp[-1]
    
    m = importlib.import_module(module_name)
    clazz = getattr(m, class_name)
    
    return clazz
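As a usage sketch, the same two-step resolution applied to a standard-library class:

```python
import importlib
from collections import OrderedDict

# split "collections.OrderedDict" into its module and class parts,
# exactly as get_class does internally
module_name, class_name = "collections.OrderedDict".rsplit(".", 1)
clazz = getattr(importlib.import_module(module_name), class_name)

assert clazz is OrderedDict  # the resolved object is the real type
instance = clazz(a=1)        # and it can be instantiated directly
```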

Update deprecated `sheetname` when reading excel files in pd_utils

In particular, reading an excel file currently raises the following warning or similar:

DEBUG    : The guessed filetype was: excel
/home/bmalone/.virtualenvs/nes-ehr/lib/python3.6/site-packages/pandas/util/_decorators.py:188: FutureWarning: The `sheetname` keyword is deprecated, use `sheet_name` instead
  return func(*args, **kwargs)

Update sheetname to sheet_name.

Build documentation with sphinx

The internal documentation is largely compatible with sphinx (sklearn-style). Fix any improperly formatted documentation and build it with sphinx.

Add an `apply_rolling_window` helper to pd_utils.

The idea is that this function applies a function to overlapping rows in a data frame (that is, a "rolling window"). A basic implementation is as follows:

import tqdm

def apply_rolling_window(df, func, window_size, progress_bar=False):
    """ Apply func to each overlapping window_size-row slice of df """
    # only start windows which fit entirely within the data frame
    num_windows = len(df) - window_size + 1

    it = range(num_windows)

    if progress_bar:
        it = tqdm.trange(num_windows)

    ret = [
        func(df.iloc[i: i+window_size])
            for i in it
    ]

    return ret

Document modules

We need to document the following modules:

  • collection_utils
  • dask_utils
  • deprecated_decorator
  • external_sparse_matrix_list
  • gene_ontology_utils
  • hyperparameter_utils
  • incremental_gaussian_estimator
  • latex_utils
  • logging_utils
  • math_utils
  • matrix_utils
  • missing_data_utils
  • ml_utils
  • mpl_utils
  • mygene_utils
  • nlp_utils
  • pandas_utils
  • physionet_utils
  • scip_utils
  • shell_utils
  • sparse_vector
  • ssh_utils
  • stats_utils
  • string_utils
  • suppress_stdout_stderr
  • utils
  • validation_utils

Handle old versions of pyyaml gracefully

Right now, old versions of pyyaml cause the following error: AttributeError: module 'yaml' has no attribute 'full_load'. This is due to having an old version of pyyaml (and falling back to a bare "load" call there would be unsafe).
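One way to degrade gracefully is to feature-detect full_load and fall back to safe_load on older pyyaml versions (a sketch; the wrapper name is illustrative):

```python
import yaml

def yaml_load(stream):
    """Load YAML with full_load when it is available (pyyaml >= 5.1);
    otherwise fall back to safe_load rather than the unsafe bare load."""
    if hasattr(yaml, "full_load"):
        return yaml.full_load(stream)
    return yaml.safe_load(stream)
```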

Add a `get_index_and_reverse_map` function

This function should return two maps which map from items to indices and back from indices to items. It should also check that the items are unique (or the reverse map will not work). That is, this should be a bijective mapping. Here is a basic example:

    def get_index_and_reverse_map(items):
        index_map = {c:i for i, c in enumerate(items)}

        # verify the items are unique, so the reverse map is valid
        if len(index_map) != len(items):
            raise ValueError("the items are not unique")

        reverse_index_map = {i:c for c,i in index_map.items()}
        return index_map, reverse_index_map

Handle timeouts in dask_utils.collect_results

For large jobs, dask sometimes times out when retrieving individual future results. It is not clear why this happens. future.result has a timeout parameter, so that can be used to avoid indefinite hangs waiting for specific jobs.

The function should probably also have an option to return the timed-out futures.
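A sketch of that behavior using the standard-library concurrent.futures API, whose result(timeout=...) signature matches dask's distributed futures; the function name and the return_timeouts option mirror the suggestion above but are illustrative:

```python
import concurrent.futures

def collect_results(futures, timeout=None, return_timeouts=False):
    """Gather future results, skipping any that exceed the timeout."""
    results, timed_out = [], []
    for future in futures:
        try:
            results.append(future.result(timeout=timeout))
        except concurrent.futures.TimeoutError:
            # record the stalled future instead of hanging indefinitely
            timed_out.append(future)

    if return_timeouts:
        return results, timed_out
    return results
```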

Simplify mpl_utils.plot_roc_curve

Currently, this function is very complicated, and its parameters are highly non-obvious. Simplify it to plot a single ROC curve. Users can call it multiple times if desired.

Add a `remove_nans` function

This function returns the non-nan values from an array. This should probably be a part of math_utils, or maybe matrix_utils. A basic implementation is as follows.

import pandas as pd

def remove_nans(vals):
    # pd.isnull also catches None, unlike np.isnan
    m_nan = pd.isnull(vals)
    vals = vals[~m_nan]
    return vals
