toolz's Introduction

Toolz

A set of utility functions for iterators, functions, and dictionaries.

See the PyToolz documentation at https://toolz.readthedocs.io

LICENSE

New BSD. See License File.

Install

toolz is on the Python Package Index (PyPI):

pip install toolz

Structure and Heritage

toolz is implemented in three parts:

itertoolz, for operations on iterables. Examples: groupby, unique, interpose.

functoolz, for higher-order functions. Examples: memoize, curry, compose.

dicttoolz, for operations on dictionaries. Examples: assoc, update_in, merge.

These functions come from the legacy of functional languages for list processing. They interoperate well to accomplish common complex tasks.
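
For a quick taste of how the pieces compose, here is a brief illustrative session with one function each from itertoolz and dicttoolz (outputs assume a Python where dicts preserve insertion order):

>>> from toolz import groupby, merge
>>> groupby(len, ['cat', 'mouse', 'dog'])
{3: ['cat', 'dog'], 5: ['mouse']}
>>> merge({1: 'one'}, {2: 'two'})
{1: 'one', 2: 'two'}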

Read our API Documentation for more details.

Example

This builds a standard wordcount function from pieces within toolz:

>>> def stem(word):
...     """ Stem word to primitive form """
...     return word.lower().rstrip(",.!:;'-\"").lstrip("'\"")

>>> from toolz import compose, frequencies
>>> from toolz.curried import map
>>> wordcount = compose(frequencies, map(stem), str.split)

>>> sentence = "This cat jumped over this other cat!"
>>> wordcount(sentence)
{'this': 2, 'cat': 2, 'jumped': 1, 'over': 1, 'other': 1}

Dependencies

toolz supports Python 3.7+ with a common codebase. It is pure Python and requires no dependencies beyond the standard library.

It is, in short, a lightweight dependency.

CyToolz

The toolz project has been reimplemented in Cython. The cytoolz project is a drop-in replacement for the pure-Python implementation. See the CyToolz GitHub page for more details.

See Also

  • Underscore.js: A similar library for JavaScript
  • Enumerable: A similar library for Ruby
  • Clojure: A functional language whose standard library has several counterparts in toolz
  • itertools: The Python standard library for iterator tools
  • functools: The Python standard library for function tools

Contributions Welcome

toolz aims to be a repository for utility functions, particularly those that come from the functional programming and list processing traditions. We welcome contributions that fall within this scope.

We also try to keep the API small to keep toolz manageable. The ideal contribution is significantly different from existing functions and has precedent in a few other functional systems.

Please take a look at our issue page for contribution ideas.

Community

See our mailing list. We're friendly.

toolz's Issues

Add API documentation

Good documentation will increase uptake of this library.

We have good docstrings. Rather than making people read the source code, it might be helpful to auto-generate API docs from the code and host them on GitHub.

Should `dicttoolz.merge` take many arguments or just one?

Currently merge takes several arguments

>>> merge({1: 'one'}, {2: 'two'})
{1: 'one', 2: 'two'}

This is convenient when dealing with only a few arguments but inconvenient for large lists. We might want something like the following instead

>>> merge([{1: 'one'}, {2: 'two'}])
{1: 'one', 2: 'two'}

This is better when collections of dicts are stored as variables

>>> dicts = [{1: 'one'}, {2: 'two'}]
>>> merge(dicts)
{1: 'one', 2: 'two'}

This question comes up with sum, sorted, min, and max.

min/max accept either. Maybe that's the way to go.
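
A sketch of that min/max-style resolution (not toolz's actual implementation; the dict type check matches the disambiguation strategy discussed later under "Variadic argument conventions"):

def merge(*dicts):
    # A single non-dict argument is treated as a sequence of dicts
    if len(dicts) == 1 and not isinstance(dicts[0], dict):
        dicts = dicts[0]
    result = {}
    for d in dicts:
        result.update(d)
    return result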

Behavior of `intersection`

First, I think the behavior of itertoolz.intersection could be made clearer:

  1. it is only lazy with respect to the first sequence
  2. it may not be clear that list(intersection([1, 2, 3, 2], [2, 3, 4])) gives [2, 3, 2]
  3. the first sequence may be an iterator or container
  4. currently, the other sequences must be containers

My first proposal is to change the function signature to intersection(seq, *seqs), because the first sequence is not optional, and this makes it clearer that the first one is special.

Two options for my second proposal regarding iterables in seqs are:

(2a) raise a TypeError:

for item in seqs:
    if iter(item) is item:
        raise TypeError("no iterators, yo!")

(2b) consume the iterators into memory, such as:

def intersection(seq, *seqs):
    targets = []
    for coll in seqs:
        if iter(coll) is coll:
            # consume each iterator into its own set
            targets.append(set(coll))
        else:
            targets.append(coll)

    return (item for item in seq
            if all(item in coll for coll in targets))

I don't have a specific use case that has a problem with the current implementation, nor a need to have it changed. I began thinking about input types thanks to the recently fixed issue with accumulate (#87). Currently, I think the only functions in itertoolz that don't handle iterators nicely are intersection (this PR) and isdistinct (#89).

Faster implementation of `interpose`

Any thoughts on this implementation of interpose? Compared to the current version, it's about 3x faster for large sequences, and (arguably) easier to comprehend.

import itertools

from toolz import concat, drop

def interpose(el, seq):
    """ Introduce element between each pair of elements in seq

    >>> list(interpose("a", [1, 2, 3]))
    [1, 'a', 2, 'a', 3]
    """
    combined = zip(itertools.repeat(el), seq)
    return drop(1, concat(combined))


def test_interpose():
    assert list(interpose(0, itertools.repeat(1, 4))) == [1, 0, 1, 0, 1, 0, 1]
    assert list(interpose('.', ['a', 'b', 'c'])) == ['a', '.', 'b', '.', 'c']

from toolz import partition fails

>>> from toolz import partition
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name partition
>>> 

Tracing

Woah!

I was curious what it would be like to trace the input and output of toolz functions and user-defined functions. As a proof-of-concept, I created this branch:

https://github.com/eriknw/toolz/tree/trace_with_q

Simply do from toolz.traced import * and voilà! In another terminal, watch the output in real time via tail -f /tmp/toolz.

To trace a user function use trace as a decorator or function.

The results are astounding. I would paste example traces here, but I think you guys have got to try this out yourself.

q was copied from https://github.com/zestyping/q and was slightly modified to output to "/tmp/toolz" instead of "/tmp/q".

As I said above, this was meant as a proof-of-concept. It raises the question, though, of whether such functionality should be added to toolz, how it should behave, etc. Tracing can be very handy for debugging and as an educational tool for new users.

If you encounter any bugs in the above branch, please post here.

Thoughts and reactions?

Indexing in `nth` not always correct

For efficiency's sake, nth tries to index into the sequence in the hope that it supports indexing:

try:
    return seq[n]

If this doesn't work, it resorts to iterating through the sequence using islice to get to the correct element.

This behavior is incorrect on some datatypes, such as dictionaries, where indexing and iterating have significantly different meanings.

Some discussion on this took place on #57
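
One possible fix is to index only into types where indexing and iteration agree, and iterate otherwise. A minimal sketch (the exact set of whitelisted types is a judgment call, not settled in this issue):

from itertools import islice

def nth(n, seq):
    if isinstance(seq, (tuple, list, str)):
        return seq[n]                       # indexing and iteration agree
    return next(islice(seq, n, n + 1))      # e.g. dicts iterate over keys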

Possible bug in `update_in`

I think there's a bug, or at least inconsistent behaviour, in the update_in function when called with keys which are not present in the dictionary to be updated.

For example:

>>> # let's feed it an empty dictionary
>>> update_in({}, [4], lambda x:1 if x else -1)
{4: -1}
>>> update_in({}, [4, 5], lambda x:1 if x else -1)
Traceback (most recent call last):
  File "<pyshell#38>", line 1, in <module>
    update_in({}, [4, 5], lambda x:1 if x else -1)
  File "<pyshell#4>", line 29, in update_in
    keys[1:], f))
  File "<pyshell#4>", line 26, in update_in
    return assoc(d, keys[0], f(d.get(keys[0], None)))
AttributeError: 'NoneType' object has no attribute 'get'

It can handle one missing key if the function passed to it happens to accept None as its only argument, but it always chokes if there is more than one missing key.

Assuming I'm right about this (I may have misunderstood the function), there are two possible fixes: it could be modified to throw an explicit KeyError, or modified to follow the behaviour of Clojure's update-in, which creates nested dictionaries to the depth specified by the keys.

The latter would look something like:

from toolz import assoc

def clojure_update_in(d, keys, f):
    k, ks = keys[0], keys[1:]
    if ks:
        return assoc(d, k, clojure_update_in(d.get(k, {}), ks, f))
    else:
        return assoc(d, k, f(d.get(k, None)))

>>> clojure_update_in({}, [4], lambda x:1 if x else -1)
{4: -1}  # same behaviour with one missing key
>>> clojure_update_in({}, [4, 5], lambda x:1 if x else -1)
{4: {5: -1}}
>>> clojure_update_in({}, [4, 5, 9, 10, 11], lambda x:1 if x else -1)
{4: {5: {9: {10: {11: -1}}}}}
>>> clojure_update_in({}, [4, 5, 9, 10, 11], identity)
{4: {5: {9: {10: {11: None}}}}}

Any thoughts?

Scope and namespaces

What is the scope of toolz? How do we organize toolz functions into cohesive groups?

Grouping: One extreme is to have lots of functions swimming around in a common namespace (from toolz import *). Another is to rigidly structure the namespaces so that few functions are within each (from toolz.itertoolz.random import *). A hybrid is to structure functions into namespaces but then have a public top-level namespace of very commonly used functions.

Scope: At what point do we say, "this function should be in a different package"? One extreme is to accept any pure-python function that anyone deems useful. The other is to restrict ourselves to an exclusive orthogonal set.

Toolz is now growing at a rate where I would like to develop some kind of guideline. We don't want to introduce an API that we don't want to maintain into the future. I believe that these two issues are related because we can ease the problems of a large scope with organization.

Need Partition

>>> list(partition(2, range(1, 11)))
[[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]

Called chunked in more_itertools, partition in Clojure.
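
A minimal sketch matching the requested behavior (the eventual toolz partition may differ, e.g. in whether a short final chunk is padded, truncated, or kept):

from itertools import islice

def partition(n, seq):
    """ Yield successive n-sized chunks from seq """
    it = iter(seq)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk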

Build set of common examples

These should be in an examples directory and can be used for various publications

Examples should possess some of the following virtues:

  1. Simpler than we think they should be
  2. Not require understanding of other fields (e.g. linear algebra, genomics)
  3. Cover some common use cases (e.g. word counting)
  4. Be exciting
  5. ...

Possible Examples include

  1. Word counting
  2. ...

Ideas to publicize pytoolz

  • Blogpost on Planet Clojure
  • Blogpost on Planet Python
  • Link on underscorejs.org (they have a list of related projects)
  • Paper on arXiv?
  • Presentation at ChiPy
  • Presentation at SF Python meetup
  • Tutorials at conferences
  • Troll stackoverflow.com, supplying toolz functions as solutions to relevant problems

complement

I make frequent use of complement.

Something like,

def complement(f):
    def inner(*args, **kwargs):
        return not f(*args, **kwargs)
    return inner

Needed it today and put it in my local toolbelt.
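
A small usage illustration:

>>> is_even = lambda n: n % 2 == 0
>>> list(filter(complement(is_even), range(6)))
[1, 3, 5]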

PyPy

Should we be PyPy compatible? And how much faster is toolz on PyPy for common tasks?

`get` on iterators

It was brought up in #93 that get didn't work on iterators, but it could be made to do so. This would act as a generalized nth that could accept a list of indices to fetch. I thought this was interesting, so I developed the following:

import itertools

def iterget(ind, seq):
    # Sort the requested indices, remembering each one's original position
    indices = sorted(zip(ind, itertools.count()))
    results = []
    val = next(seq)
    prev_index = 0
    for index, count in indices:
        if index != prev_index:
            # advance the iterator to the next requested index
            n = index - prev_index - 1
            val = next(itertools.islice(seq, n, n + 1))
            prev_index = index
        results.append((count, val))
    return tuple(item for _, item in sorted(results))

and some output:

>>> iterget([1, 3, 5, 2, 3], iter(range(100, 200)))
(101, 103, 105, 102, 103)

>>> iterget(range(10) + range(9, -1, -1), iter(range(10)))
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)

>>> iterget([100000000, 1, 100000000, 5], iter(xrange(1000000000)))
(100000000, 1, 100000000, 5)

>>> iterget([100000000000000, 1, 5], iter(xrange(1000000000000000)))
# will need to wait a long, long time, but at least the memory footprint is low!

I don't know if this is the best or cleanest way to do it, and it currently doesn't accept default arguments, but I thought this was neat enough to share.

So, should we allow get to work on iterators? Should there instead be a separate function for getting multiple items from an iterator?

Benchmarks

Performance matters, particularly because we're in pure Python. I was able to speed up frequencies by a factor of three or so through profiling on a particular dataset. We should keep this in mind. The examples issue can provide a stopgap for this.

Multiple dispatch project

Multiple dispatch might be something we could work on. It's a fairly important issue in scientific Python and is something that Clojure does well via protocols. There are a few implementations out there for single dispatch (including one in the Py3 standard lib) but nothing dominant for multiple dispatch. This is probably because it's a hard problem with some challenging decisions to make.
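
As a toy illustration of the core idea (purely hypothetical, not a proposed API): keep a registry keyed on the tuple of argument types and dispatch on it.

class multimethod:
    """ Toy multiple dispatch: select an implementation by argument types """
    def __init__(self):
        self.registry = {}

    def register(self, *types):
        def decorator(func):
            self.registry[types] = func
            return func
        return decorator

    def __call__(self, *args):
        return self.registry[tuple(type(arg) for arg in args)](*args)

area = multimethod()

@area.register(int, int)
def _(width, height):
    return width * height   # area(3, 4) -> 12

The hard decisions (inheritance, ambiguity resolution, performance) all live beyond this sketch.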

Module-level higher-order functions (HOFs)

This thread is to continue the discussions from #69 (comment) and related to #67.

We may want to have several HOFs, such as:

  • curried
  • traced (possibly many options)
  • tupled (to change iterator outputs to tuples)
  • several options for performing parallel operations
  • options to change the variadic use of functions

and so on.

We already have curried, and a proof-of-concept traced is being explored in #69. We ought to explore how we wish to support HOFs. Usefulness, ease of use, and ease of code-maintenance are all extremely important.

To combine HOFs at the module level, I propose the following kind of API:

In [1]: import hof

In [2]: hof.magic.double.triple.three()
Out[2]: 18

In [3]: hof.magic.inc.inc.inc.one()
Out[3]: 4

The above is real output from a toy package I threw together. A possible issue is one can't generally do from hof.magic.inc.inc.inc import one, two, three, because the module hof.magic.inc.inc.inc is auto-generated and won't exist in sys.modules. Right now I see two options to work around this:

  1. inc3 = hof.magic.inc.inc.inc
  2. pre-initialize the most useful (or all?) combinations in sys.modules so importing will work as expected.

I'll explore further, but feedback for what kind of API and behavior you wish to see is also very important. Oh, and what HOFs you think may be useful.

README.md looks bad on PyPI

PyPI (and its cousins) expects reStructuredText for the long description of a package. Markdown renders partially correctly, but much of it is rendered incorrectly as can be seen here:
https://pypi.python.org/pypi/toolz
https://crate.io/packages/toolz/

"pandoc" can convert from markdown to rst, although a couple changes to README.md may be required to convert everything smoothly. For example, [itertoolz](https://github.com/pytoolz/toolz/blob/master/toolz/itertoolz/core.py), hasn't converted properly for me (maybe a newer version of pandoc or different command line options would help), but changing itertools to itertools works.

Currying standard functions

I'd like to curry a large number of functions in toolz/__init__.py.

This would allow us to use idioms like

map(take(2), list_of_lists)

rather than

map(lambda L: take(2, L), list_of_lists)

In particular, it would make thread_first and thread_last much more Pythonic.
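
A brief illustrative session using toolz's existing curry (the list calls are needed because take returns an iterator):

>>> from toolz import curry, take
>>> take = curry(take)
>>> list_of_lists = [[1, 2, 3], [4, 5, 6]]
>>> [list(seq) for seq in map(take(2), list_of_lists)]
[[1, 2], [4, 5]]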

bare import of `toolz` fails

(env)02:39:00 lbnerc (master) >  cd /tmp
(env)02:39:03 tmp  >  python
Python 2.6.8 (unknown, Nov 17 2012, 21:23:35) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import toolz
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jacobsen/env/lib/python2.6/site-packages/toolz-0.2-py2.6.egg/toolz/__init__.py", line 1, in <module>
    from .itertoolz import (groupby, countby, frequencies, reduceby,
ImportError: No module named itertoolz

Variadic argument conventions

I have been thinking about variadic arguments with a goal of making the itertoolz API "just work" in a way that is user-friendly and intuitive. I am proposing a uniform API that allows functions to be used variadically and non-variadically by accepting a sequence container of iterables.

In dicttoolz, merge and merge_with accept dicts variadically as *dicts. This may be used variadically and non-variadically. For example, these are equivalent: merge(d1, d2) and merge([d1, d2]). This is very convenient. Furthermore, there can be no ambiguity over what is intended, because it checks input for dict type.

In itertoolz, we cannot check for types as in dicttoolz, so the same strategy of allowing variadic and non-variadic input can result in ambiguities. Ambiguities can lead to incorrect edge cases that stymie novices and experts alike. This is bad... right?

An API that follows several competing conventions is also bad. It makes the library harder to learn and use. When does a function accept multiple inputs variadically? Which ones require a sequence of iterables? Should there be duplicates of functions that handle inputs differently? Why does X do this and not Y? A unified flexible API cannot completely avoid edge cases under all uses, but it can control them, and I believe the benefits are worth the costs.

First, a summary. Here is a list of itertoolz (and itertools) functions that accept variadic input of iterables: merge_sorted, intersection, concatv (chain, map, zip, product). Here are the functions that accept a sequence of iterables: interleave, concat, mapcat (chain.from_iterable, starmap).

TL;DR: Read Here

The proposal: if the variadic argument is a single sequence container of iterables, then unpack the container and treat the iterables as arguments. (If we document this somewhere, somebody else should come up with better wording!). For example, f([[1, 2], [3, 4]]) is the same as f([1, 2], [3, 4]). The simplest way to show this condition programmatically is:

def f(*seqs):
    try:
        if len(seqs) == 1 and iter(seqs[0][0]):
            seqs = seqs[0]
    except TypeError:
        pass

Note that we can add more conditionals to the above in order to avoid the performance penalty of raising and catching exceptions.

Behaviors:

  1. always works if iterators over data are used
  2. always works if non-variadic form is used
  3. always works with more than one input (for variadic and non-variadic)
  4. always works if variadic form is used with one input that is not a nested sequence

Hence, the only failure is when the variadic form is attempted on a single nested container. For example, f([1, 2]) will work, but f([(1, 2)]) will not, because a single nested sequence will be interpreted as a sequence of iterables.

Here are more examples of sequences with a nested data type:

  • seqs1 = [[[1, 2]]]
    • f(seqs1) works (2)
    • f(*seqs1) fails
  • seqs2 = [iter([[1, 2]])]
    • f(seqs2) works (1, 2)
    • f(*seqs2) works (1)
  • seqs3 = [[[1, 2]], [[3, 4]]]
    • f(seqs3) works (2, 3)
    • f(*seqs3) works (3)

Hence, failures can easily be avoided by not expanding the arguments (i.e., the *seqs operation).

There is one final ambiguity: f([]). Is this an empty list, or no input? We can make the two cases equivalent.

We lose one more thing by using this convention: seqs cannot generally be an iterator of iterables, because we require it to be a sequence container of iterables. This is the same convention used in dicttoolz.merge, which accepts a sequence (but not an iterator) of dicts. Fortunately, an iterator of iterables can be easily handled as f(list(seqs)) or f(*seqs), although the latter version with unpacking has the same conditions as stated earlier. The reason for not supporting an iterator of iterables is simple: nested data is common, and we should support it as painlessly as possible without introducing weird edge cases. As long as one works with iterators over data (or does not use the variadic form, or explicitly uses the variadic form with more than one input), this convention is guaranteed to work regardless of data type.

I don't know about you guys, but this convention would "just work" for virtually everything I do. However, I don't doubt that edge cases can arise, and my work flow is not necessarily the same as yours or other users. I don't claim to have thought about everything. I welcome discussion and counter-examples.

Misleading comment in "thread_first" and "thread_last" docstrings

The docstring for thread_first includes the following:

>>> thread_first(1, (add, 4), (pow, 2))  # pow(add(4, 1), 2)
25

Note that # pow(add(4, 1), 2) is misleading and should be # pow(add(1, 4), 2)

Similarly, the docstring for thread_last includes the following:

>>> thread_last(1, (add, 4), (pow, 2))  # pow(2, add(1, 4))
32

and # pow(2, add(1, 4)) should be # pow(2, add(4, 1))

Let me illustrate using another example:

thread_first(1, (pow, 2), (pow, 3)) is pow(pow(1, 2), 3) is (1**2)**3 equals 1
and
thread_last(1, (pow, 2), (pow, 3)) is pow(3, pow(2, 1)) is 3**(2**1) equals 9

Lots of new PEP-8 violations

Can we have a consensus on whether or not to follow PEP-8? As @mrocklin well knows, my vote is to slavishly follow the dictates of the pep8 tool, as an aid to consistency and readability (without long discussions about style). @eriknw, do you have an opinion? Anyways, here are some outstanding errors, which I'm happy to fix if we can come to a final decision on this.

(env)09:57:08 toolz (lazy-remove) >  pep8 .
./bench/test_curry.py:5:1: E302 expected 2 blank lines, found 1
./doc/source/conf.py:6:80: E501 line too long (80 characters)
./doc/source/conf.py:14:11: E401 multiple imports on one line
./examples/graph.py:11:11: E221 multiple spaces before operator
./examples/wordcount.py:12:1: W391 blank line at end of file
./toolz/dicttoolz/tests/test_core.py:35:1: E303 too many blank lines (3)
(env)09:57:34 toolz (lazy-remove) >  

jackknife function

What do you think of the following function:

import itertools

no_replace = '__no_replace__'  # sentinel, so that replace=None is a usable value

def jackknife(seq, replace=no_replace):
    """ Repeatedly iterate over seq, each time omitting a successive element

    Elements may be replaced by a value instead of omitted.

    >>> list(list(x) for x in jackknife(range(4)))
    [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]

    >>> list(list(x) for x in jackknife(range(4), replace=None))
    [[None, 1, 2, 3], [0, None, 2, 3], [0, 1, None, 3], [0, 1, 2, None]]

    See Also:
        itertools.combinations
    """
    if replace is no_replace:
        replace = ()
    else:
        replace = (replace,)
    itcounter = iter(seq)
    for i, _ in enumerate(itcounter):
        it = iter(seq)
        yield itertools.chain(itertools.islice(it, i), replace,
                              itertools.islice(it, 1, None))

This is often found in statistics packages, and they typically accept a jackfun function to calculate some statistic over each group of data.

See: http://en.wikipedia.org/wiki/Resampling_(statistics)#Jackknife

I bet there's a good PyToolz example dying to use this function, and it's a perfect candidate to make parallel (as in #24). In fact, the following even accepts a 'UseParallel' keyword:

http://www.mathworks.com/help/stats/jackknife.html

LRU Cache

We should implement a few different caching/memoization options. Least Recently Used is a standard and often used system.

functools in Python 3.2 has an lru_cache decorator.

I actually think that the way to go is to build a number of dict-like objects (dict, limited-space-dict, ...) and pass these into memoize which will act the same regardless of the data structure used to store the results. This follows principles of composition and encapsulation.

An LRU dict could be implemented from OrderedDict. I've seen recipes online for limited space dictionaries. I even have one lying around work somewhere.
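
A minimal sketch of the composition idea (names, sizes, and the positional-args-only memoize are simplifications): the eviction policy lives in the dict-like object, and memoize stays policy-agnostic.

from collections import OrderedDict

class LRUDict(OrderedDict):
    """ A dict that evicts the least recently used key beyond maxsize """
    def __init__(self, maxsize=128):
        super().__init__()
        self.maxsize = maxsize

    def __getitem__(self, key):
        value = OrderedDict.__getitem__(self, key)
        self.move_to_end(key)          # mark as recently used
        return value

    def __setitem__(self, key, value):
        OrderedDict.__setitem__(self, key, value)
        if len(self) > self.maxsize:
            self.popitem(last=False)   # evict the least recently used key

def memoize(func, cache=None):
    """ Memoize func into any dict-like cache (positional args only) """
    if cache is None:
        cache = {}
    def memoized(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return memoized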

Need consistent naming of function arguments

Part of having a clean API is having a consistent API, which includes consistent names of arguments. This needs to be reviewed. For example, an argument that is a function is named in a variety of ways:

  • f
  • fn
  • func
  • *funcs
  • *functions
  • *forms
  • predicate
  • binop
  • key
  • keyfn

*forms, predicate, and binop are acceptable (and should always be used where appropriate), because they are descriptive. key and keyfn are also appropriate, but only one should be used. Having all of f, fn, func, *funcs, and *functions is excessive and needs to be made consistent. How about choosing func and *funcs?

Heh, standard Python libraries aren't much help in establishing a precedent, because func, function, key, and keyfunc are all used.

It is very easy for this to happen with an evolving library (and it is easy to ignore if you are already familiar with a library). I have only looked closely at names of arguments that are functions, but other argument types should be looked at and simplified as well.

multiprocessing/threading

One benefit of naming and abstracting away common control structures (e.g. map) is that we can swap out their implementations for new technologies. In particular, we may want to implement parallel versions of many operations, including map, filter, and groupby, using multiprocessing. These could exist in a separate namespace so that code could be parallelized simply by changing imports.
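
A minimal sketch of such a parallel map (the pmap name and the idea of a separate parallel namespace are hypothetical; func must be picklable for multiprocessing):

from multiprocessing import Pool

def pmap(func, seq, processes=None):
    """ An eager, process-parallel version of map """
    with Pool(processes) as pool:
        return pool.map(func, seq)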

Sampling or Random module

Toolz supports streaming data well. Sampling and random algorithms are natural pairs with this idea.

Do we want to exploit this synergy within toolz?

If so, what utilities would we add?
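
One natural candidate, given the streaming emphasis: reservoir sampling, which draws k items uniformly from an iterator of unknown length in a single pass. A sketch (the name and signature are hypothetical):

import random

def reservoir_sample(k, seq, seed=None):
    """ Uniformly sample k items from a possibly unbounded iterator """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(seq):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)      # replace with decreasing probability
            if j < k:
                reservoir[j] = item
    return reservoir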

What's up with `hashable`?

It's not used anywhere, and it's not part of the API of toolz. Should we:

  1. Delete it
  2. Add it to toolz API

This was discovered while performing coverage testing. It's the last uncovered code.

Add `get_in` function to dicttoolz?

I've ported Clojure's get-in function, which I think might be a useful addition to dicttoolz and, if accepted, would allow for a more efficient implementation of dicttoolz.update_in.

One thing to bear in mind is that, at first blush, it appears to occupy similar territory to itertoolz.get, although functionally they are distinct, with get using an index to extract multiple values from a collection, and get_in using an index to extract a single value from a nested structure.

So, whilst get provides an alternative interface to operator.itemgetter, get_in generalises operator.getitem for nested data structures.

Good idea? Bad idea?

from functools import reduce
import operator

no_default = '__no__default__'

def get_in(seq, index, default=no_default):
    """ Returns seq[i0][i1]...[iX] where [i0, i1, ..., iX]==index.

    If seq[i0][i1]...[iX] cannot be found, returns ``default`` if it is
    specified, otherwise raises KeyError or IndexError.

    ``get_in`` is a generalization of ``operator.getitem`` for nested data 
    structures such as dictionaries and lists.

    >>> transaction = {'name': 'Alice',
    ...                'purchase': {'items': ['Apple', 'Orange'],
    ...                             'costs': [0.50, 1.25]},
    ...                'credit card': '5555-1234-1234-1234'}
    >>> get_in(transaction, ['purchase', 'items'])
    ['Apple', 'Orange']
    >>> get_in(transaction, ['purchase', 'items', 0])
    'Apple'
    >>> get_in(transaction, ['purchase', 'date'], default='2013-11-12')
    '2013-11-12'

    >>> nested_list = [0, [1, [2, [3]]], 4]
    >>> get_in(nested_list, [1, 1, 1, 0])
    3
    >>> get_in(nested_list, [1, 1, 1, 1], default='foo')
    'foo'

    See Also:
        itertoolz.get
        operator.getitem
    """
    try:
        return reduce(operator.getitem, index, seq)
    except (KeyError, IndexError) as e:
        if default is no_default:
            raise e
        else:
            return default


def test_get_in():
    assert get_in([0, 1], [0]) == operator.getitem([0, 1], 0) == 0
    try:
        get_in([0, 1], [2])
    except IndexError:
        pass
    assert get_in([0, 1], [2], default='foo') == 'foo'
    d = {1: {2: {3: ['a', 'b']}}}
    assert get_in(d, [1, 2, 3, 0]) == d[1][2][3][0] == 'a'
    try:
        get_in(d, [1, 2, 4])
    except KeyError:
        pass
    assert get_in(d, [1, 2, 4], default='bar') == 'bar'

Meta: Quick PRs before tutorial

I'm giving a tutorial on pytoolz at PyData. In creating notebooks for the tutorial I'm running into a few small issues. I plan to merge these quickly (after my next flight) so immediate review would be nice.

These are
#71
#72
#73

I'll try to post tutorial materials in a github repo soon.

Behavior of `isdistinct`

itertoolz.isdistinct does not accept iterators, and it does not short-circuit. Both of these issues can be resolved as follows:

def isdistinct(seq):
    seen = set()
    for item in seq:
        if item in seen:
            return False
        seen.add(item)
    return True

Here are benchmarks when all elements are distinct:

l1 = range(1)
l10 = range(10)
l100 = range(100)
l1000 = range(1000)

%timeit isdistinct(l1)
Old: 1000000 loops, best of 3: 968 ns per loop
New: 1000000 loops, best of 3: 1.02 µs per loop

%timeit isdistinct(l10)
Old: 100000 loops, best of 3: 1.86 µs per loop
New: 100000 loops, best of 3: 3.77 µs per loop

%timeit isdistinct(l100)
Old: 100000 loops, best of 3: 8.49 µs per loop
New: 10000 loops, best of 3: 30.1 µs per loop

%timeit isdistinct(l1000)
Old: 10000 loops, best of 3: 65.9 µs per loop
New: 1000 loops, best of 3: 283 µs per loop

Thoughts? Can you think of a better way? Should we allow the current method to be chosen (such as if seq is expected to be distinct nearly all the time)?

code coverage

Several Python packages use https://coveralls.io/ to show line coverage information. It integrates seamlessly with TravisCI. This can show coverage statistics on the main page of GitHub and PyPI (and http://crate.io). For example, see:

https://github.com/coagulant/coveralls-python

I think coverage statistics should be shown for toolz, and we should shoot for 100% coverage from tests. It's a warm fuzzy that makes people feel better about using a package, and complete coverage is something that can be touted whenever toolz is written about.
