toolz's Introduction

Toolz

A set of utility functions for iterators, functions, and dictionaries.

See the PyToolz documentation at https://toolz.readthedocs.io

LICENSE

New BSD. See License File.

Install

toolz is on the Python Package Index (PyPI):

pip install toolz

Structure and Heritage

toolz is implemented in three parts:

itertoolz, for operations on iterables. Examples: groupby, unique, interpose.

functoolz, for higher-order functions. Examples: memoize, curry, compose.

dicttoolz, for operations on dictionaries. Examples: assoc, update_in, merge.

These functions come from the legacy of functional languages for list processing. They interoperate well to accomplish common complex tasks.
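
For a quick taste of how the pieces compose, here is a brief illustrative session with one function each from itertoolz and dicttoolz (outputs assume a Python where dicts preserve insertion order):

>>> from toolz import groupby, merge
>>> groupby(len, ['cat', 'mouse', 'dog'])
{3: ['cat', 'dog'], 5: ['mouse']}
>>> merge({1: 'one'}, {2: 'two'})
{1: 'one', 2: 'two'}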

Read our API Documentation for more details.

Example

This builds a standard wordcount function from pieces within toolz:

>>> def stem(word):
...     """ Stem word to primitive form """
...     return word.lower().rstrip(",.!:;'-\"").lstrip("'\"")

>>> from toolz import compose, frequencies
>>> from toolz.curried import map
>>> wordcount = compose(frequencies, map(stem), str.split)

>>> sentence = "This cat jumped over this other cat!"
>>> wordcount(sentence)
{'this': 2, 'cat': 2, 'jumped': 1, 'over': 1, 'other': 1}

Dependencies

toolz supports Python 3.7+ with a common codebase. It is pure Python and requires no dependencies beyond the standard library.

It is, in short, a lightweight dependency.

CyToolz

The toolz project has been reimplemented in Cython. The cytoolz project is a drop-in replacement for the pure-Python implementation. See the CyToolz GitHub page for more details.

See Also

  • Underscore.js: A similar library for JavaScript
  • Enumerable: A similar library for Ruby
  • Clojure: A functional language whose standard library has several counterparts in toolz
  • itertools: The Python standard library for iterator tools
  • functools: The Python standard library for function tools

Contributions Welcome

toolz aims to be a repository for utility functions, particularly those that come from the functional programming and list processing traditions. We welcome contributions that fall within this scope.

We also try to keep the API small to keep toolz manageable. The ideal contribution is significantly different from existing functions and has precedent in a few other functional systems.

Please take a look at our issue page for contribution ideas.

Community

See our mailing list. We're friendly.

toolz's Issues

Add API documentation

Good documentation will increase uptake of this library.

We have good docstrings. Rather than making people read the source code, it might be helpful to auto-generate API docs from the code and host them on GitHub.

Should `dicttoolz.merge` take many arguments or just one?

Currently merge takes several arguments

>>> merge({1: 'one'}, {2: 'two'})
{1: 'one', 2: 'two'}

This is convenient when dealing with only a few arguments but inconvenient for large lists. We might want something like the following instead

>>> merge([{1: 'one'}, {2: 'two'}])
{1: 'one', 2: 'two'}

This is better when collections of dicts are stored as variables

>>> dicts = [{1: 'one'}, {2: 'two'}]
>>> merge(dicts)
{1: 'one', 2: 'two'}

This question comes up with sum, sorted, min, and max.

min/max accept either. Maybe that's the way to go.
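
A sketch of that min/max-style resolution (not toolz's actual implementation; the dict type check matches the disambiguation strategy discussed later under "Variadic argument conventions"):

def merge(*dicts):
    # A single non-dict argument is treated as a sequence of dicts
    if len(dicts) == 1 and not isinstance(dicts[0], dict):
        dicts = dicts[0]
    result = {}
    for d in dicts:
        result.update(d)
    return result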

Behavior of `intersection`

First, I think the behavior of itertoolz.intersection could be made clearer:

  1. it is only lazy with respect to the first sequence
  2. it may not be clear that list(intersection([1, 2, 3, 2], [2, 3, 4])) gives [2, 3, 2]
  3. the first sequence may be an iterator or container
  4. currently, the other sequences must be containers

My first proposal is to change the function signature to intersection(seq, *seqs), because the first sequence is not optional, and this makes it clearer that the first one is special.

Two options for my second proposal regarding iterables in seqs are:

(2a) raise a TypeError:

for item in seqs:
    if iter(item) is item:
        raise TypeError("no iterators, yo!")

(2b) consume the iterators into memory, such as:

def intersection(seq, *seqs):
    targets = []
    for coll in seqs:
        if iter(coll) is coll:
            # consume each iterator into its own set
            targets.append(set(coll))
        else:
            targets.append(coll)

    return (item for item in seq
            if all(item in coll for coll in targets))

I don't have a specific use case that has a problem with the current implementation, nor a need to have it changed. I began thinking about input types thanks to the recently fixed issue with accumulate (#87). Currently, I think the only functions in itertoolz that don't handle iterators nicely are intersection (this PR) and isdistinct (#89).

Faster implementation of `interpose`

Any thoughts on this implementation of interpose? Compared to the current version, it's about 3x faster for large sequences, and (arguably) easier to comprehend.

import itertools

from toolz import concat, drop

def interpose(el, seq):
    """ Introduce element between each pair of elements in seq

    >>> list(interpose("a", [1, 2, 3]))
    [1, 'a', 2, 'a', 3]
    """
    combined = zip(itertools.repeat(el), seq)
    return drop(1, concat(combined))


def test_interpose():
    assert list(interpose(0, itertools.repeat(1, 4))) == [1, 0, 1, 0, 1, 0, 1]
    assert list(interpose('.', ['a', 'b', 'c'])) == ['a', '.', 'b', '.', 'c']

from toolz import partition fails

>>> from toolz import partition
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name partition
>>> 

Tracing

Woah!

I was curious what it would be like to trace the input and output of toolz functions and user-defined functions. As a proof-of-concept, I created this branch:

https://github.com/eriknw/toolz/tree/trace_with_q

Simply do from toolz.traced import * and voilà! In another terminal, watch the output in real time via tail -f /tmp/toolz.

To trace a user function use trace as a decorator or function.

The results are astounding. I would paste example traces here, but I think you guys have got to try this out yourself.

q was copied from https://github.com/zestyping/q and was slightly modified to output to "/tmp/toolz" instead of "/tmp/q".

As I said above, this was meant as a proof-of-concept. It raises the question, though, of whether such functionality should be added to toolz, how it should behave, etc. Tracing can be very handy for debugging and as an educational tool for new users.

If you encounter any bugs in the above branch, please post here.

Thoughts and reactions?

Indexing in `nth` not always correct

For efficiency's sake, nth tries to index into the sequence in the hope that it supports indexing:

try:
    return seq[n]

If this doesn't work, it resorts to iterating through the sequence using islice to get to the correct element.

This behavior is incorrect on some datatypes, such as dictionaries, where indexing and iterating have significantly different meanings.

Some discussion on this took place on #57
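
One possible fix is to index only into types where indexing and iteration agree, and iterate otherwise. A minimal sketch (the exact set of whitelisted types is a judgment call, not settled in this issue):

from itertools import islice

def nth(n, seq):
    if isinstance(seq, (tuple, list, str)):
        return seq[n]                       # indexing and iteration agree
    return next(islice(seq, n, n + 1))      # e.g. dicts iterate over keys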

Possible bug in `update_in`

I think there's a bug, or at least inconsistent behaviour, in the update_in function when called with keys which are not present in the dictionary to be updated.

For example:

>>> # let's feed it an empty dictionary
>>> update_in({}, [4], lambda x:1 if x else -1)
{4: -1}
>>> update_in({}, [4, 5], lambda x:1 if x else -1)
Traceback (most recent call last):
  File "<pyshell#38>", line 1, in <module>
    update_in({}, [4, 5], lambda x:1 if x else -1)
  File "<pyshell#4>", line 29, in update_in
    keys[1:], f))
  File "<pyshell#4>", line 26, in update_in
    return assoc(d, keys[0], f(d.get(keys[0], None)))
AttributeError: 'NoneType' object has no attribute 'get'

It can handle one missing key if the function passed to it happens to accept None as its only argument, but it always chokes if there is more than one missing key.

Assuming I'm right about this (I may have misunderstood the function), there are two possible fixes: it could be modified to throw an explicit KeyError, or modified to follow the behaviour of Clojure's update-in, which creates nested dictionaries to the depth specified by the keys.

The latter would look something like:

from toolz import assoc

def clojure_update_in(d, keys, f):
    k, ks = keys[0], keys[1:]
    if ks:
        return assoc(d, k, clojure_update_in(d.get(k, {}), ks, f))
    else:
        return assoc(d, k, f(d.get(k, None)))

>>> clojure_update_in({}, [4], lambda x:1 if x else -1)
{4: -1}  # same behaviour with one missing key
>>> clojure_update_in({}, [4, 5], lambda x:1 if x else -1)
{4: {5: -1}}
>>> clojure_update_in({}, [4, 5, 9, 10, 11], lambda x:1 if x else -1)
{4: {5: {9: {10: {11: -1}}}}}
>>> clojure_update_in({}, [4, 5, 9, 10, 11], identity)
{4: {5: {9: {10: {11: None}}}}}

Any thoughts?

Scope and namespaces

What is the scope of toolz? How do we organize toolz functions into cohesive groups?

Grouping: One extreme is to have lots of functions swimming around in a common namespace (from toolz import *). Another is to rigidly structure the namespaces so that few functions are within each (from toolz.itertoolz.random import *). A hybrid is to structure functions into namespaces but then have a public top-level namespace of very commonly used functions.

Scope: At what point do we say, "this function should be in a different package"? One extreme is to accept any pure-python function that anyone deems useful. The other is to restrict ourselves to an exclusive orthogonal set.

Toolz is now growing at a rate where I would like to develop some kind of guideline. We don't want to introduce an API that we don't want to maintain into the future. I believe that these two issues are related because we can ease the problems of a large scope with organization.

Need Partition

>>> list(partition(2, range(1, 11)))
[[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]

Called chunked in more_itertools, partition in Clojure.
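
A minimal sketch matching the requested behavior (the eventual toolz partition may differ, e.g. in whether a short final chunk is padded, truncated, or kept):

from itertools import islice

def partition(n, seq):
    """ Yield successive n-sized chunks from seq """
    it = iter(seq)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk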

Build set of common examples

These should be in an examples directory and can be used for various publications

Examples should possess some of the following virtues:

  1. Simpler than we think they should be
  2. Not require understanding of other fields (e.g. linear algebra, genomics)
  3. Cover some common use cases (e.g. word counting)
  4. Be exciting
  5. ...

Possible Examples include

  1. Word counting
  2. ...

Ideas to publicize pytoolz

  • Blogpost on Planet Clojure
  • Blogpost on Planet Python
  • Link on underscorejs.org (they have a list of related projects)
  • Paper on arXiv?
  • Presentation at ChiPy
  • Presentation at SF Python meetup
  • Tutorials at conferences
  • Troll stackoverflow.com, supplying toolz functions as solutions to relevant problems

complement

I make frequent use of complement.

Something like,

def complement(f):
    def inner(*args, **kwargs):
        return not f(*args, **kwargs)
    return inner

Needed it today and put it in my local toolbelt.
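
A small usage illustration:

>>> is_even = lambda n: n % 2 == 0
>>> list(filter(complement(is_even), range(6)))
[1, 3, 5]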

PyPy

Should we be PyPy compatible? And how much faster is toolz on PyPy for common tasks?

`get` on iterators

It was brought up in #93 that get didn't work on iterators, but it could be made to do so. This would act as a generalized nth that could accept a list of indices to fetch. I thought this was interesting, so I developed the following:

import itertools

def iterget(ind, seq):
    # Sort the requested indices, remembering each one's original position
    indices = sorted(zip(ind, itertools.count()))
    results = []
    val = next(seq)
    prev_index = 0
    for index, count in indices:
        if index != prev_index:
            # advance the iterator to the next requested index
            n = index - prev_index - 1
            val = next(itertools.islice(seq, n, n + 1))
            prev_index = index
        results.append((count, val))
    return tuple(item for _, item in sorted(results))

and some output:

>>> iterget([1, 3, 5, 2, 3], iter(range(100, 200)))
(101, 103, 105, 102, 103)

>>> iterget(range(10) + range(9, -1, -1), iter(range(10)))
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)

>>> iterget([100000000, 1, 100000000, 5], iter(xrange(1000000000)))
(100000000, 1, 100000000, 5)

>>> iterget([100000000000000, 1, 5], iter(xrange(1000000000000000)))
# will need to wait a long, long time, but at least the memory footprint is low!

I don't know if this is the best or cleanest way to do it, and it currently doesn't accept default arguments, but I thought this was neat enough to share.

So, should we allow get to work on iterators? Should there instead be a separate function for getting multiple items from an iterator?

Benchmarks

Performance matters, particularly because we're in pure Python. I was able to speed up frequencies by a factor of three or so through profiling on a particular dataset. We should keep this in mind. The examples issue can provide a stopgap for this.

Multiple dispatch project

Multiple dispatch might be something we could work on. It's a fairly important issue in scientific Python and is something that Clojure does well via protocols. There are a few implementations out there for single dispatch (including one in the Py3 standard lib) but nothing dominant for multiple dispatch. This is probably because it's a hard problem with some challenging decisions to make.
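
As a toy illustration of the core idea (purely hypothetical, not a proposed API): keep a registry keyed on the tuple of argument types and dispatch on it.

class multimethod:
    """ Toy multiple dispatch: select an implementation by argument types """
    def __init__(self):
        self.registry = {}

    def register(self, *types):
        def decorator(func):
            self.registry[types] = func
            return func
        return decorator

    def __call__(self, *args):
        return self.registry[tuple(type(arg) for arg in args)](*args)

area = multimethod()

@area.register(int, int)
def _(width, height):
    return width * height   # area(3, 4) -> 12

The hard decisions (inheritance, ambiguity resolution, performance) all live beyond this sketch.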

Module-level higher-order functions (HOFs)

This thread is to continue the discussions from #69 (comment) and related to #67.

We may want to have several HOFs, such as:

  • curried
  • traced (possibly many options)
  • tupled (to change iterator outputs to tuples)
  • several options for performing parallel operations
  • options to change the variadic use of functions

and so on.

We already have curried, and a proof-of-concept traced is being explored in #69. We ought to explore how we wish to support HOFs. Usefulness, ease of use, and ease of code-maintenance are all extremely important.

To combine HOFs at the module level, I propose the following kind of API:

In [1]: import hof

In [2]: hof.magic.double.triple.three()
Out[2]: 18

In [3]: hof.magic.inc.inc.inc.one()
Out[3]: 4

The above is real output from a toy package I threw together. A possible issue is one can't generally do from hof.magic.inc.inc.inc import one, two, three, because the module hof.magic.inc.inc.inc is auto-generated and won't exist in sys.modules. Right now I see two options to work around this:

  1. inc3 = hof.magic.inc.inc.inc
  2. pre-initialize the most useful (or all?) combinations in sys.modules so importing will work as expected.

I'll explore further, but feedback for what kind of API and behavior you wish to see is also very important. Oh, and what HOFs you think may be useful.

README.md looks bad on PyPI

PyPI (and its cousins) expects reStructuredText for the long description of a package. Markdown renders partially correctly, but much of it is rendered incorrectly as can be seen here:
https://pypi.python.org/pypi/toolz
https://crate.io/packages/toolz/

"pandoc" can convert from markdown to rst, although a couple changes to README.md may be required to convert everything smoothly. For example, [itertoolz](https://github.com/pytoolz/toolz/blob/master/toolz/itertoolz/core.py), hasn't converted properly for me (maybe a newer version of pandoc or different command line options would help), but changing itertools to itertools works.

Currying standard functions

I'd like to curry a large number of functions in toolz/__init__.py.

This would allow us to use idioms like

map(take(2), list_of_lists)

rather than

map(lambda L: take(2, L), list_of_lists)

In particular, it would make thread_first and thread_last much more Pythonic.
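
A brief illustrative session using toolz's existing curry (the list calls are needed because take returns an iterator):

>>> from toolz import curry, take
>>> take = curry(take)
>>> list_of_lists = [[1, 2, 3], [4, 5, 6]]
>>> [list(seq) for seq in map(take(2), list_of_lists)]
[[1, 2], [4, 5]]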

bare import of `toolz` fails

(env)02:39:00 lbnerc (master) >  cd /tmp
(env)02:39:03 tmp  >  python
Python 2.6.8 (unknown, Nov 17 2012, 21:23:35) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import toolz
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jacobsen/env/lib/python2.6/site-packages/toolz-0.2-py2.6.egg/toolz/__init__.py", line 1, in <module>
    from .itertoolz import (groupby, countby, frequencies, reduceby,
ImportError: No module named itertoolz

Variadic argument conventions

I have been thinking about variadic arguments with a goal of making the itertoolz API "just work" in a way that is user-friendly and intuitive. I am proposing a uniform API that allows functions to be used variadically and non-variadically by accepting a sequence container of iterables.

In dicttoolz, merge and merge_with accept dicts variadically as *dicts. This may be used variadically and non-variadically. For example, these are equivalent: merge(d1, d2) and merge([d1, d2]). This is very convenient. Furthermore, there can be no ambiguity over what is intended, because it checks input for dict type.

In itertoolz, we cannot check for types as in dicttoolz, so the same strategy of allowing variadic and non-variadic input can result in ambiguities. Ambiguities can lead to incorrect edge cases that stymie novices and experts alike. This is bad... right?

An API that follows several competing conventions is also bad. It makes the library harder to learn and use. When does a function accept multiple inputs variadically? Which ones require a sequence of iterables? Should there be duplicates of functions that handle inputs differently? Why does X do this and not Y? A unified flexible API cannot completely avoid edge cases under all uses, but it can control them, and I believe the benefits are worth the costs.

First, a summary. Here is a list of itertoolz (and itertools) functions that accept variadic input of iterables: merge_sorted, intersection, concatv (chain, map, zip, product). Here are the functions that accept a sequence of iterables: interleave, concat, mapcat (chain.from_iterable, starmap).

TL;DR: Read Here

The proposal: if the variadic argument is a single sequence container of iterables, then unpack the container and treat the iterables as arguments. (If we document this somewhere, somebody else should come up with better wording!). For example, f([[1, 2], [3, 4]]) is the same as f([1, 2], [3, 4]). The simplest way to show this condition programmatically is:

def f(*seqs):
    try:
        if len(seqs) == 1 and iter(seqs[0][0]):
            seqs = seqs[0]
    except TypeError:
        pass

Note that we can add more conditionals to the above in order to avoid the performance penalty of raising and catching exceptions.

Behaviors:

  1. always works if iterators over data are used
  2. always works if non-variadic form is used
  3. always works with more than one input (for variadic and non-variadic)
  4. always works if variadic form is used with one input that is not a nested sequence

Hence, the only failure is when the variadic form is attempted on a single nested container. For example, f([1, 2]) will work, but f([(1, 2)]) will not, because a single nested sequence will be interpreted as a sequence of iterables.

Here are more examples of sequences with a nested data type:

  • seqs1 = [[[1, 2]]]
    • f(seqs1) works (2)
    • f(*seqs1) fails
  • seqs2 = [iter([[1, 2]])]
    • f(seqs2) works (1, 2)
    • f(*seqs2) works (1)
  • seqs3 = [[[1, 2]], [[3, 4]]]
    • f(seqs3) works (2, 3)
    • f(*seqs3) works (3)

Hence, failures can easily be avoided by not expanding the arguments (i.e., the *seqs operation).

There is one final ambiguity: f([]). Is this an empty list, or no input? We can make the two cases equivalent.

We lose one more thing by using this convention: seqs cannot generally be an iterator of iterables, because we require it to be a sequence container of iterables. This is the same convention used in dicttoolz.merge, which accepts a sequence (but not an iterator) of dicts. Fortunately, an iterator of iterables can be easily handled as f(list(seqs)) or f(*seqs), although the latter version with unpacking has the same conditions as stated earlier. The reason for not supporting an iterator of iterables is simple: nested data is common, and we should support it as painlessly as possible without introducing weird edge cases. As long as one works with iterators over data (or does not use the variadic form, or explicitly uses the variadic form with more than one input), this convention is guaranteed to work regardless of data type.

I don't know about you guys, but this convention would "just work" for virtually everything I do. However, I don't doubt that edge cases can arise, and my work flow is not necessarily the same as yours or other users. I don't claim to have thought about everything. I welcome discussion and counter-examples.

Misleading comment in "thread_first" and "thread_last" docstrings

The docstring for thread_first includes the following:

>>> thread_first(1, (add, 4), (pow, 2))  # pow(add(4, 1), 2)
25

Note that # pow(add(4, 1), 2) is misleading and should be # pow(add(1, 4), 2)

Similarly, the docstring for thread_last includes the following:

>>> thread_last(1, (add, 4), (pow, 2))  # pow(2, add(1, 4))
32

and # pow(2, add(1, 4)) should be # pow(2, add(4, 1))

Let me illustrate using another example:

thread_first(1, (pow, 2), (pow, 3)) is pow(pow(1, 2), 3) is (1**2)**3 equals 1
and
thread_last(1, (pow, 2), (pow, 3)) is pow(3, pow(2, 1)) is 3**(2**1) equals 9

Lots of new PEP-8 violations

Can we have a consensus on whether or not to follow PEP-8? As @mrocklin well knows, my vote is to slavishly follow the dictates of the pep8 tool, as an aid to consistency and readability (without long discussions about style). @eriknw, do you have an opinion? Anyways, here are some outstanding errors, which I'm happy to fix if we can come to a final decision on this.

(env)09:57:08 toolz (lazy-remove) >  pep8 .
./bench/test_curry.py:5:1: E302 expected 2 blank lines, found 1
./doc/source/conf.py:6:80: E501 line too long (80 characters)
./doc/source/conf.py:14:11: E401 multiple imports on one line
./examples/graph.py:11:11: E221 multiple spaces before operator
./examples/wordcount.py:12:1: W391 blank line at end of file
./toolz/dicttoolz/tests/test_core.py:35:1: E303 too many blank lines (3)
(env)09:57:34 toolz (lazy-remove) >  

jackknife function

What do you think of the following function:

import itertools

no_replace = '__no_replace__'  # sentinel, so that replace=None is a usable value

def jackknife(seq, replace=no_replace):
    """ Repeatedly iterate over seq, each time omitting a successive element

    Elements may be replaced by a value instead of omitted.

    >>> list(list(x) for x in jackknife(range(4)))
    [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]

    >>> list(list(x) for x in jackknife(range(4), replace=None))
    [[None, 1, 2, 3], [0, None, 2, 3], [0, 1, None, 3], [0, 1, 2, None]]

    See Also:
        itertools.combinations
    """
    if replace is no_replace:
        replace = ()
    else:
        replace = (replace,)
    itcounter = iter(seq)
    for i, _ in enumerate(itcounter):
        it = iter(seq)
        yield itertools.chain(itertools.islice(it, i), replace,
                              itertools.islice(it, 1, None))

This is often found in statistics packages, and they typically accept a jackfun function to calculate some statistic over each group of data.

See: http://en.wikipedia.org/wiki/Resampling_(statistics)#Jackknife

I bet there's a good PyToolz example dying to use this function, and it's a perfect candidate to make parallel (as in #24). In fact, the following even accepts a 'UseParallel' keyword:

http://www.mathworks.com/help/stats/jackknife.html

LRU Cache

We should implement a few different caching/memoization options. Least Recently Used is a standard and often used system.

functools in Python 3.2 has an lru_cache decorator.

I actually think that the way to go is to build a number of dict-like objects (dict, limited-space-dict, ...) and pass these into memoize which will act the same regardless of the data structure used to store the results. This follows principles of composition and encapsulation.

An LRU dict could be implemented from OrderedDict. I've seen recipes online for limited space dictionaries. I even have one lying around work somewhere.
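
A minimal sketch of the composition idea (names, sizes, and the positional-args-only memoize are simplifications): the eviction policy lives in the dict-like object, and memoize stays policy-agnostic.

from collections import OrderedDict

class LRUDict(OrderedDict):
    """ A dict that evicts the least recently used key beyond maxsize """
    def __init__(self, maxsize=128):
        super().__init__()
        self.maxsize = maxsize

    def __getitem__(self, key):
        value = OrderedDict.__getitem__(self, key)
        self.move_to_end(key)          # mark as recently used
        return value

    def __setitem__(self, key, value):
        OrderedDict.__setitem__(self, key, value)
        if len(self) > self.maxsize:
            self.popitem(last=False)   # evict the least recently used key

def memoize(func, cache=None):
    """ Memoize func into any dict-like cache (positional args only) """
    if cache is None:
        cache = {}
    def memoized(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return memoized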

Need consistent naming of function arguments

Part of having a clean API is having a consistent API, which includes consistent names of arguments. This needs to be reviewed. For example, an argument that is a function is named in a variety of ways:

  • f
  • fn
  • func
  • *funcs
  • *functions
  • *forms
  • predicate
  • binop
  • key
  • keyfn

*forms, predicate, and binop are acceptable (and should always be used where appropriate), because they are descriptive. key and keyfn are also appropriate, but only one should be used. Having all of f, fn, func, *funcs, and *functions is excessive and needs to be made consistent. How about choosing func and *funcs?

Heh, standard Python libraries aren't much help in establishing a precedent, because func, function, key, and keyfunc are all used.

It is very easy for this to happen with an evolving library (and it is easy to ignore if you are already familiar with a library). I have only looked closely at names of arguments that are functions, but other argument types should be looked at and simplified as well.

multiprocessing/threading

One benefit of naming and abstracting away common control structures (e.g. map) is that we can swap out their implementations for new technologies. In particular, we may want to implement parallel versions of many operations, including map, filter, and groupby, using multiprocessing. These could exist in a separate namespace so that code could be parallelized simply by changing imports.
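
A minimal sketch of such a parallel map (the pmap name and the idea of a separate parallel namespace are hypothetical; func must be picklable for multiprocessing):

from multiprocessing import Pool

def pmap(func, seq, processes=None):
    """ An eager, process-parallel version of map """
    with Pool(processes) as pool:
        return pool.map(func, seq)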

Sampling or Random module

Toolz supports streaming data well. Sampling and random algorithms are natural pairs with this idea.

Do we want to exploit this synergy within toolz?

If so, what utilities would we add?
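
One natural candidate, given the streaming emphasis: reservoir sampling, which draws k items uniformly from an iterator of unknown length in a single pass. A sketch (the name and signature are hypothetical):

import random

def reservoir_sample(k, seq, seed=None):
    """ Uniformly sample k items from a possibly unbounded iterator """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(seq):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)      # replace with decreasing probability
            if j < k:
                reservoir[j] = item
    return reservoir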

What's up with `hashable`?

It's not used anywhere, and it's not part of the API of toolz. Should we:

  1. Delete it
  2. Add it to toolz API

This was discovered while performing coverage testing. It's the last uncovered code.

Add `get_in` function to dicttoolz?

I've ported Clojure's get-in function, which I think might be a useful addition to dicttoolz and, if accepted, would allow for a more efficient implementation of dicttoolz.update_in.

One thing to bear in mind is that, at first blush, it appears to occupy similar territory to itertoolz.get, although functionally they are distinct, with get using an index to extract multiple values from a collection, and get_in using an index to extract a single value from a nested structure.

So, whilst get provides an alternative interface to operator.itemgetter, get_in generalises operator.getitem for nested data structures.

Good idea? Bad idea?

from functools import reduce
import operator

no_default = '__no__default__'

def get_in(seq, index, default=no_default):
    """ Returns seq[i0][i1]...[iX] where [i0, i1, ..., iX]==index.

    If seq[i0][i1]...[iX] cannot be found, returns ``default`` if it is
    specified, otherwise raises KeyError or IndexError.

    ``get_in`` is a generalization of ``operator.getitem`` for nested data 
    structures such as dictionaries and lists.

    >>> transaction = {'name': 'Alice',
    ...                'purchase': {'items': ['Apple', 'Orange'],
    ...                             'costs': [0.50, 1.25]},
    ...                'credit card': '5555-1234-1234-1234'}
    >>> get_in(transaction, ['purchase', 'items'])
    ['Apple', 'Orange']
    >>> get_in(transaction, ['purchase', 'items', 0])
    'Apple'
    >>> get_in(transaction, ['purchase', 'date'], default='2013-11-12')
    '2013-11-12'

    >>> nested_list = [0, [1, [2, [3]]], 4]
    >>> get_in(nested_list, [1, 1, 1, 0])
    3
    >>> get_in(nested_list, [1, 1, 1, 1], default='foo')
    'foo'

    See Also:
        itertoolz.get
        operator.getitem
    """
    try:
        return reduce(operator.getitem, index, seq)
    except (KeyError, IndexError) as e:
        if default is no_default:
            raise e
        else:
            return default


def test_get_in():
    assert get_in([0, 1], [0]) == operator.getitem([0, 1], 0) == 0
    try:
        get_in([0, 1], [2])
    except IndexError:
        pass
    assert get_in([0, 1], [2], default='foo') == 'foo'
    d = {1: {2: {3: ['a', 'b']}}}
    assert get_in(d, [1, 2, 3, 0]) == d[1][2][3][0] == 'a'
    try:
        get_in(d, [1, 2, 4])
    except KeyError:
        pass
    assert get_in(d, [1, 2, 4], default='bar') == 'bar'

Meta: Quick PRs before tutorial

I'm giving a tutorial on pytoolz at PyData. In creating notebooks for the tutorial I'm running into a few small issues. I plan to merge these quickly (after my next flight) so immediate review would be nice.

These are
#71
#72
#73

I'll try to post tutorial materials in a github repo soon.

Behavior of `isdistinct`

itertoolz.isdistinct does not accept iterators, and it does not short-circuit. Both of these issues can be resolved as follows:

def isdistinct(seq):
    seen = set()
    for item in seq:
        if item in seen:
            return False
        seen.add(item)
    return True

Here are benchmarks when all elements are distinct:

l1 = range(1)
l10 = range(10)
l100 = range(100)
l1000 = range(1000)

%timeit isdistinct(l1)
Old: 1000000 loops, best of 3: 968 ns per loop
New: 1000000 loops, best of 3: 1.02 µs per loop

%timeit isdistinct(l10)
Old: 100000 loops, best of 3: 1.86 µs per loop
New: 100000 loops, best of 3: 3.77 µs per loop

%timeit isdistinct(l100)
Old: 100000 loops, best of 3: 8.49 µs per loop
New: 10000 loops, best of 3: 30.1 µs per loop

%timeit isdistinct(l1000)
Old: 10000 loops, best of 3: 65.9 µs per loop
New: 1000 loops, best of 3: 283 µs per loop

Thoughts? Can you think of a better way? Should we allow the current method to be chosen (such as if seq is expected to be distinct nearly all the time)?

code coverage

Several Python packages use https://coveralls.io/ to show line coverage information. It integrates seamlessly with TravisCI. This can show coverage statistics on the main page of GitHub and PyPI (and http://crate.io). For example, see:

https://github.com/coagulant/coveralls-python

I think coverage statistics should be shown for toolz, and we should shoot for 100% coverage from tests. It's a warm fuzzy that makes people feel better about using a package, and complete coverage is something that can be touted whenever toolz is written about.
