
blaze's Introduction



Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze gives Python users a familiar interface for querying data that lives in other data storage systems.

Example

We point Blaze at a simple dataset in a foreign database (PostgreSQL). Instantly we see results as we would in a Pandas DataFrame.

>>> import blaze as bz
>>> iris = bz.Data('postgresql://localhost::iris')
>>> iris
    sepal_length  sepal_width  petal_length  petal_width      species
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa

These results appear immediately. Blaze does not pull data out of Postgres; instead, it translates your Python commands into SQL (or another backend's language).

>>> iris.species.distinct()
           species
0      Iris-setosa
1  Iris-versicolor
2   Iris-virginica

>>> bz.by(iris.species, smallest=iris.petal_length.min(),
...                      largest=iris.petal_length.max())
           species  largest  smallest
0      Iris-setosa      1.9       1.0
1  Iris-versicolor      5.1       3.0
2   Iris-virginica      6.9       4.5

This same example would have worked with a wide range of databases, on-disk text or binary files, or remote data.

What Blaze is not

Blaze does not perform computation. It relies on other systems like SQL, Spark, or Pandas to do the actual number crunching. It is not a replacement for any of these systems.

Blaze does not implement the entire NumPy/Pandas API, nor does it interact with libraries intended to work with NumPy/Pandas. This is the cost of using more and larger data systems.

Blaze is a good way to inspect data living in a large database, perform a small but powerful set of operations to query that data, and then transform your results into a format suitable for your favorite Python tools.

In the Abstract

Blaze separates the computations that we want to perform:

>>> accounts = Symbol('accounts', 'var * {id: int, name: string, amount: int}')

>>> deadbeats = accounts[accounts.amount < 0].name

From the representation of data

>>> L = [[1, 'Alice',   100],
...      [2, 'Bob',    -200],
...      [3, 'Charlie', 300],
...      [4, 'Denis',   400],
...      [5, 'Edith',  -500]]

Blaze enables users to solve data-oriented problems

>>> list(compute(deadbeats, L))
['Bob', 'Edith']

But the separation of expression from data allows us to switch between different backends.

Here we solve the same problem using Pandas instead of Pure Python.

>>> df = DataFrame(L, columns=['id', 'name', 'amount'])

>>> compute(deadbeats, df)
1      Bob
4    Edith
Name: name, dtype: object

Blaze doesn't compute these results, Blaze intelligently drives other projects to compute them instead. These projects range from simple Pure Python iterators to powerful distributed Spark clusters. Blaze is built to be extended to new systems as they evolve.

Getting Started

Blaze is available on conda or on PyPI

conda install blaze
pip install blaze

Development builds are accessible

conda install blaze -c blaze
pip install http://github.com/blaze/blaze --upgrade

You may want to view the docs, the tutorial, some blogposts, or the mailing list archives.

Development setup

The quickest way to install all Blaze dependencies with conda is as follows

conda install blaze spark -c blaze -c anaconda-cluster -y
conda remove odo blaze blaze-core datashape -y

After running these commands, clone odo, blaze, and datashape from GitHub directly; these three projects release together. Run python setup.py develop in each to create development installations.

License

Released under BSD license. See LICENSE.txt for details.

Blaze development is sponsored by Continuum Analytics.

blaze's People

Contributors

aterrel, brittainhard, chdoig, cowlicks, cpcloud, ehebert, electronwill, esc, francescalted, ilanschnell, jcrist, kwmsmith, llllllllll, maggie-m, majidaldo, markflorisson, mrocklin, mwiebe, postelrich, pratapvardhan, quasiben, scls19fr, sdiehl, seibert, skrah, stefanseefeld, talumbau, teoliphant, will-so, yarko


blaze's Issues

Declaring dependencies

I understand that it is recommended that users use Anaconda instead of attempting to build all of the dependencies. However, it would be great if blaze's dependencies were listed somewhere. setup.py references numpy and cython, and there's a commented-out reference to llvmpy, but there's no mention of ply, for example. Here are some possibilities:

  • Use setuptools/distribute and use install_requires parameter in setup.py
  • Add a pip requirements.txt file
  • Put the required packages into the documentation

I'm open to writing the code or docs.

After following install instructions "blaze" module won't import

I followed the instructions here to install:

http://blaze.pydata.org/docs/install.html

using the conda install ... approach, and then looked at the quick start example here:

http://blaze.pydata.org/docs/quickstart.html

but import blaze doesn't work:

In [1]: import blaze
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-5c5ee3cb747a> in <module>()
----> 1 import blaze

ImportError: No module named blaze

I tried conda install blaze and pip install blaze but both failed.

Incidentally, it would be nice if conda install blaze installed ply and blosc.

Please advise what I am doing wrong. I should note that I've tried this both in Wakari and on my own laptop.

Vlen implementation issues on Windows

Mark Wiebe reported this problem on Windows (master branch):


I'm getting the following failures in blaze master:

Cheers,
Mark

======================================================================
ERROR: blaze.tests.test_vlen.test_object_persistent_blob
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "D:\Develop\blaze\blaze\tests\test_vlen.py", line 54, in test_object_persistent_blob
    for i, v in enumerate(c):
  File "D:\Develop\blaze\blaze\table.py", line 240, in __getitem__
    return retrieve(cc, indexer)
  File "D:\Develop\blaze\blaze\layouts\query.py", line 36, in retrieve
    return getitem(cc, indexer)
  File "D:\Develop\blaze\blaze\layouts\query.py", line 81, in getitem
    datum = elt.read(elt, lc)
  File "D:\Develop\blaze\blaze\sources\chunked.py", line 148, in read
    return self.ca.__getitem__(key)
  File "carrayExtension.pyx", line 1654, in blaze.carray.carrayExtension.carray.__getitem__ (blaze/carray\carrayExtension.c:18053)
  File "carrayExtension.pyx", line 1609, in blaze.carray.carrayExtension.carray.getitem_object (blaze/carray\carrayExtension.c:17783)
  File "carrayExtension.pyx", line 629, in blaze.carray.carrayExtension.chunks.__getitem__ (blaze/carray\carrayExtension.c:7465)
  File "carrayExtension.pyx", line 609, in blaze.carray.carrayExtension.chunks.read_chunk (blaze/carray\carrayExtension.c:7094)
ValueError: chunkfile c:\users\mwiebe\appdata\local\temp\tmppkq0gg\c\data\__10.blp not found

======================================================================
ERROR: blaze.tests.test_vlen.test_object_persistent_blob_reopen
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "D:\Develop\blaze\blaze\tests\test_vlen.py", line 69, in test_object_persistent_blob_reopen
    c2 = blaze.open(tmppath)
  File "D:\Develop\blaze\blaze\toplevel.py", line 69, in open
    source = CArraySource(params=parms)
  File "D:\Develop\blaze\blaze\sources\chunked.py", line 63, in __init__
    self.ca = carray.carray(data, rootdir=rootdir, cparams=cparams)
  File "carrayExtension.pyx", line 874, in blaze.carray.carrayExtension.carray.__cinit__ (blaze/carray\carrayExtension.c:10006)
  File "carrayExtension.pyx", line 1120, in blaze.carray.carrayExtension.carray.read_meta (blaze/carray\carrayExtension.c:13320)
IOError: [Errno 2] No such file or directory: '\\users\\mwiebe\\appdata\\local\\temp\\tmpeskzuq\\c\\meta\\sizes'

Apparently this works on Unix.

Meaning of blaze.toplevel.open() with no arguments?

While writing some more tests, I noticed that blaze.toplevel.open() can be called with no arguments. The open() function tries to instantiate a CArraySource with no arguments, which doesn't work:

Traceback (most recent call last):
  File "/Users/stan/anaconda/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/work/projects/blaze-core/blaze/tests/test_toplevel.py", line 4, in test_open_uri_none
    toplevel.open()
  File "/work/projects/blaze-core/blaze/toplevel.py", line 44, in open
    source = CArraySource()
  File "/work/projects/blaze-core/blaze/sources/chunked.py", line 49, in __init__
    (params.get('storage'))
AttributeError: 'NoneType' object has no attribute 'get'

What is the intended meaning of calling blaze.toplevel.open() with a uri set to None?

Opening CTable fails

Datashape related failure on CTable open.

ERROR: blaze.tests.test_toplevel.test_open_ctable

   Traceback (most recent call last):
    /home/stephen/continuum/anaconda/lib/python2.7/site-packages/nose/case.py line 197 in runTest
      self.test(*self.arg)
    blaze/tests/test_toplevel.py line 40 in test_open_ctable
      c = toplevel.open(uri)
    blaze/toplevel.py line 60 in open
      source = CTableSource(params=parms)
    blaze/sources/chunked.py line 206 in __init__
      self.ca = ctable(data, rootdir=rootdir, cparams=cparams)
    blaze/carray/ctable.py line 199 in __init__
      self.open_ctable()
    blaze/carray/ctable.py line 291 in open_ctable
      self.cols.read_meta_and_open()
    blaze/carray/ctable.py line 46 in read_meta_and_open
      self._cols[str(name)] = carray(rootdir=dir_, mode=self.mode)
    carrayExtension.pyx line 892 in blaze.carray.carrayExtension.carray.__cinit__ (blaze/carray/carrayExtension.c:10069)

    carrayExtension.pyx line 1170 in blaze.carray.carrayExtension.carray.read_meta (blaze/carray/carrayExtension.c:13872)

    /home/stephen/continuum/anaconda/lib/python2.7/site-packages/numpy/core/_internal.py line 166 in _commastring
      (len(result)+1, astr))
   ValueError: format number 1 of "[('x', '<i4'), ('y', '<i4')]" is not recognized

Clean up blaze.array methods/attributes

Many of these were put in there to express desired capabilities. I think we should remove them, and rather have a design document describing how we want it to work.

In [1]: import blaze

In [2]: a = blaze.array([1,2,3])

In [3]: [x for x in dir(a) if not x.startswith('_')]
Out[3]: ['axes', 'capabilities', 'dshape', 'expr', 'labels', 'user', 'view']

The printing code should support general datashapes

The next code shows the problem:

In []: a = blaze.array([(1, 2.1, "23")], dshape='1, { x: int32; y: float32; z: string }')

In []: a.dshape
Out[]: dshape("1, { x : int32; y : float32; z : string }")

In []: print a
exception raised in fillFormat: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
exception raised in fillFormat: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
exception raised in fillFormat: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
<snip>
/Users/faltet/software/blaze/blaze/_printing/_arrayprint.pyc in _to_numpy(ds)
     36 
     37 def _to_numpy(ds):
---> 38     res = _internal_to_numpy(ds)
     39     res = res if type(res) is tuple else ((), res)
     40     return res

/Users/faltet/software/blaze/blaze/datashape/coretypes.py in to_numpy(ds)
   1013 
   1014     # The datashape dimensions
-> 1015     for dim in ds[:-1]:
   1016         if isinstance(dim, IntegerConstant):
   1017             shape += (dim,)

/Users/faltet/software/blaze/blaze/datashape/coretypes.py in __getitem__(self, key)
    785 
    786     def __getitem__(self, key):
--> 787         return self.__fdict[key]
    788 
    789     def __eq__(self, other):

TypeError: unhashable type

However, dynd supports this:

In []: dynd.nd.array([('1.2', '1', 'sds')], dtype='{x: float32; y: int8; z: string}')
Out[]: nd.array([[1.2, 1, "sds"]], strided_dim<{x : float32; y : int8; z : string}>)

I would say that we should start using dynd instead of numpy for printing blaze arrays.

Finally, note that the code above fails even if we declare the string as a fixed length (which should be supported by numpy):

In []: print blaze.array([(1, 2.1, "23")], dshape='1, { x: int32; y: float32; z: string(10) }')

Problems building the docs

zsh» make docs
cd docs; make html
make[1]: Entering directory `/home/esc/git-working/blaze/docs'
sphinx-build -b html -d build/doctrees   source build/html
Making output directory...
Running Sphinx v1.1.3
pdfTeX 3.1415926-1.40.10-2.2 (TeX Live 2009/Debian)
kpathsea version 5.0.0
Copyright 2009 Peter Breitenlohner (eTeX)/Han The Thanh (pdfTeX).
There is NO warranty.  Redistribution of this software is
covered by the terms of both the pdfTeX copyright and
the Lesser GNU General Public License.
For more information about these matters, see the file
named COPYING and the pdfTeX source.
Primary author of pdfTeX: Peter Breitenlohner (eTeX)/Han The Thanh (pdfTeX).
Compiled with libpng 1.2.44; using libpng 1.2.44
Compiled with zlib 1.2.3.4; using zlib 1.2.3.4
Compiled with poppler version 0.12.4


Exception occurred:
  File "/home/esc/anaconda/lib/python2.7/site-packages/sphinx/pycode/pgen2/pgen.py", line 15, in __init__
    stream = open(filename)
IOError: [Errno 2] No such file or directory: '/home/esc/anaconda/lib/python2.7/site-packages/sphinx/pycode/Grammar.txt'
The full traceback has been saved in /tmp/sphinx-err-ocrnkt.log, if you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message can be provided next time.
Either send bugs to the mailing list at <http://groups.google.com/group/sphinx-dev/>,
or report them in the tracker at <http://bitbucket.org/birkenfeld/sphinx/issues/>. Thanks!
make[1]: *** [html] Error 1
make[1]: Leaving directory `/home/esc/git-working/blaze/docs'
make: *** [docs] Error 2

The full traceback is

# Sphinx version: 1.1.3
# Python version: 2.7.3
# Docutils version: 0.9.1 release
# Jinja2 version: 2.6
Traceback (most recent call last):
  File "/home/esc/anaconda/lib/python2.7/site-packages/sphinx/cmdline.py", line 188, in main
    warningiserror, tags)
  File "/home/esc/anaconda/lib/python2.7/site-packages/sphinx/application.py", line 114, in __init__
    self.setup_extension(extension)
  File "/home/esc/anaconda/lib/python2.7/site-packages/sphinx/application.py", line 247, in setup_extension
    mod = __import__(extension, None, None, ['setup'])
  File "/home/esc/anaconda/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 26, in <module>
    from sphinx.pycode import ModuleAnalyzer, PycodeError
  File "/home/esc/anaconda/lib/python2.7/site-packages/sphinx/pycode/__init__.py", line 25, in <module>
    pygrammar = driver.load_grammar(_grammarfile)
  File "/home/esc/anaconda/lib/python2.7/site-packages/sphinx/pycode/pgen2/driver.py", line 126, in load_grammar
    g = pgen.generate_grammar(gt)
  File "/home/esc/anaconda/lib/python2.7/site-packages/sphinx/pycode/pgen2/pgen.py", line 383, in generate_grammar
    p = ParserGenerator(filename)
  File "/home/esc/anaconda/lib/python2.7/site-packages/sphinx/pycode/pgen2/pgen.py", line 15, in __init__    
    stream = open(filename)
IOError: [Errno 2] No such file or directory: '/home/esc/anaconda/lib/python2.7/site-packages/sphinx/pycode/Grammar.txt'

Missing dynd-python dependency requirement

Hi,

I'm trying to use and test blaze from the master branch. Should 'DyND' be in the requirements.txt or README file (and thus the related C++ libdynd library)?

'doc/source/install.rst' says that dynd can be optional; is that true? Maybe update the install docs.

Thanks.
Damien G.

blaze.zeros() slowness

It seems that blaze.zeros() has undergone some significant slowdown lately, as the next script shows:

import blaze as blz
import numpy as np
from time import time


len_ = np.prod((100,100,100))
print "len for array:", len_

t0 = time()
a = np.arange(len_)
print "numpy creation time: %.3f" % (time() - t0,)

t0 = time()
b = blz.Array(a, dshape='%d, int32' % (len_,))
t1 = time() - t0
print "Final datashape:", b.datashape
print "blaze.Array creation time: %.3f" % (t1,)

t0 = time()
c = blz.zeros(dshape='%d, int32'% (len_,))
t2 = time() - t0
print "Final datashape:", c.datashape
print "blaze.zeros creation time: %.3f" % (t2,)
print "time ratio blaze.Array vs blaze.zeros: %.1fx" % (t2 / t1,)

and the output on my laptop is:

len for array: 1000000
numpy creation time: 0.008
Final datashape: 1000000, int32
blaze.Array creation time: 0.013
Final datashape: 1000000, int32
blaze.zeros creation time: 2.479
time ratio blaze.Array vs blaze.zeros: 184.3x

Perhaps it is a bit soon for this, but we should start considering a performance-regression tool like FunkLoad or codespeed (or whatever).
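As a stopgap before adopting a full regression tool, a micro-benchmark along these lines could track the ratio over commits; the names and structure here are illustrative only, using NumPy as a stand-in for the blaze calls:

```python
# Illustrative micro-benchmark: time two creation paths and report a ratio
# that a regression harness could record per commit.
from time import perf_counter

import numpy as np


def timed(fn, repeat=3):
    # Best-of-N wall-clock timing to reduce noise.
    best = float("inf")
    for _ in range(repeat):
        t0 = perf_counter()
        fn()
        best = min(best, perf_counter() - t0)
    return best


n = 1_000_000
t_arange = timed(lambda: np.arange(n, dtype=np.int32))
t_zeros = timed(lambda: np.zeros(n, dtype=np.int32))
print("zeros/arange time ratio: %.1fx" % (t_zeros / max(t_arange, 1e-9)))
```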

BLZ `format` '' is not supported.

In [43]: dname = 'persisted.blz'

In [44]: store = blaze.Storage(dname)

ipython shows:

ValueError                                Traceback (most recent call last)
/home/drill/blaze/samples/basics/<ipython-input-44-465fd73ab60e> in <module>()
----> 1 store = blaze.Storage(dname)

/usr/local/lib/python2.7/dist-packages/blaze/storage.pyc in __init__(self, uri, mode, permanent)
     91         self._mode = mode
     92         if self._format != 'blz':
---> 93             raise ValueError("BLZ `format` '%s' is not supported." % self._format)
     94         if not permanent:
     95             raise ValueError(

ValueError: BLZ `format` '' is not supported.

Quickstart first example doesn't work as described

If I run the code given there:

from blaze import Array, dshape
ds = dshape('2, 2, int')

a = Array([1,2,3,4], ds)

I get an object that has the datashape described, but behaves like a one-dimensional array. This code should throw an exception, in my opinion, because the data doesn't match the datashape.

The following is what the code should look like:

from blaze import Array, dshape
ds = dshape('2, 2, int')

a = Array([[1,2],[3,4]], ds)
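A sketch of the validation the constructor could perform, using NumPy to compare the supplied data's shape against the declared dimensions (validate_shape is a hypothetical helper, not Blaze API):

```python
import numpy as np


def validate_shape(data, dims):
    # Hypothetical check, not Blaze API: reject data whose shape does not
    # match the dimensions declared in the datashape.
    actual = np.asarray(data).shape
    if actual != tuple(dims):
        raise ValueError(
            "data shape %r does not match datashape dims %r" % (actual, tuple(dims))
        )


validate_shape([[1, 2], [3, 4]], (2, 2))   # nested data: accepted
try:
    validate_shape([1, 2, 3, 4], (2, 2))   # flat data vs. 2-D datashape
    raised = False
except ValueError:
    raised = True
```

Under this rule the quickstart's flat `[1,2,3,4]` against a `2, 2, int` datashape would raise, as the issue argues it should.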

many import errors

I ran many of the examples in the docs and they didn't work. Has much changed in the 0.2 dev version? Please update the documentation to match. Thanks in advance!

blz storage r/w mode is wrong

if you run the array_creation.py script in samples, you will write data to the specified blz location

However, the repr of that storage says mode='r', even though it clearly isn't read-only, because we just wrote data to it.

Add a caching mechanism to the blaze catalog

Two things: in-memory caching, and on-disk caching.

The dynd-based server demo used naive in-memory caching of all its arrays. The blaze server doesn't, in lieu of putting a proper caching mechanism in place. This means the blaze server is really slow if the data backing it gets bigger, e.g. a few gigabytes.

For on-disk caching, we need a file format which fully supports the generality of the blaze data model. I think a format memory-mappable by dynd is the way to go, analogous to the numpy .npy format.
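The .npy analogy can be illustrated with NumPy itself: write once, then memory-map on later reads so data is paged in lazily rather than loaded whole. This is only a sketch of the on-disk idea, not the proposed blaze format:

```python
import os
import tempfile

import numpy as np

# Write the array once...
path = os.path.join(tempfile.mkdtemp(), "cache.npy")
np.save(path, np.arange(10, dtype=np.int64))

# ...then memory-map it on later reads: elements are paged in on access
# instead of the whole file being read into memory.
mapped = np.load(path, mmap_mode="r")
print(mapped[3], mapped.shape)   # 3 (10,)
```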

Can't create large multidimensional array in BLZ

[Adapted from https://github.com/Blosc/bcolz/issues/25]

With Numpy I can do something like this:

foo = np.zeros([ 2 ] * 20)

And get an ndarray with the corresponding shape. I can then:

ac = blaze.blz.barray(foo)

To get a barray object. Great. But I'm playing around with blaze.blz because I want to use array sizes that are larger than could otherwise fit in memory, and [2] * 20 is an easy shape for Numpy to handle, so for me it's a baseline of sorts.

Looking to explore the capabilities of blaze.blz, I try to create the object directly, without the intermediate Numpy step:

ac = blaze.blz.zeros([2] * 20)

But I get an error:

In [11]: blz.zeros([2]*20)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-32ca191bdff9> in <module>()
----> 1 blz.zeros([2]*20)

/home/faltet/software/blaze/blaze/blz/bfuncs.pyc in zeros(shape, dtype, **kwargs)
    239     """
    240     dtype = np.dtype(dtype)
--> 241     return fill(shape=shape, dflt=np.zeros((), dtype), dtype=dtype, **kwargs)
    242 
    243 

/home/faltet/software/blaze/blaze/blz/bfuncs.pyc in fill(shape, dflt, dtype, **kwargs)
    203     # Then fill it
    204     # We need an array for the defaults so as to keep the atom info
--> 205     dflt = np.array(obj.dflt, dtype=dtype)
    206     # Making strides=(0,) below is a trick to create the array fast and
    207     # without memory consumption

ValueError: number of dimensions must be within [0, 32]

Which leads me to wonder: is something like this possible using blaze.blz?

Cannot create a blaze array with a Record datashape

The record datashape itself can be created as:

"""
In []: blaze.datashape.dshape("Var, {x: int32; y:bool}")
Out[]: dshape("Var, { x : int32; y : bool }")
"""

but it does not work when using it for creating arrays:

"""
In []: blaze.array([(1, True), (2, False)], dshape="Var, { x : int32; y : bool }")
Out[]: exception raised in fillFormat: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
exception raised in fillFormat: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

exception raised in fillFormat: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

TypeError Traceback (most recent call last)
in ()
----> 1 blaze.array([(1, True), (2, False)], dshape="Var, { x : int32; y : bool }")
[clip]
/Users/faltet/software/blaze/blaze/datashape/coretypes.pyc in __getitem__(self, key)
    785
    786     def __getitem__(self, key):
--> 787         return self.__fdict[key]
    788
    789     def __eq__(self, other):

TypeError: unhashable type
"""

Notice that there are strange exceptions coming from fillFormat too.

Skipping cffi test on travis

Looks like the CFFI tests are failing and being skipped on travis and jenkins.

ERROR: test_1d_array (blaze.datadescriptor.tests.test_cffi_membuf_data_descriptor.TestCFFIMemBufDataDescriptor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/mark/blaze/blaze/datadescriptor/tests/test_cffi_membuf_data_descriptor.py", line 34, in test_1d_array
    self.assertEqual(dd.dshape, blaze.dshape('32, int16'))
AttributeError: 'module' object has no attribute 'dshape'

======================================================================
ERROR: test_2d_array (blaze.datadescriptor.tests.test_cffi_membuf_data_descriptor.TestCFFIMemBufDataDescriptor)

carray "object" support needs more testing

Fixing issue #14 surfaced some issues in carray when dealing with objects. Object support is incomplete and needs its own set of tests to make sure all behavior is tested and expected behavior is properly documented.

Shape is not preserved when deserializing blaze objects

The shape of an ND array is converted into a 1D object, i.e. an array that is stored as:

barray: Array
  datashape := 3, 4, float64 
  values    := [CArray(ptr=4376427488)] 
  metadata  := [manifest, arraylike] 
  layout    := Chunked(dim=0) 

it is retrieved as:

barray2: Array
  datashape := 12, float64 
  values    := [CArray(ptr=4376462352)] 
  metadata  := [manifest, arraylike] 
  layout    := Chunked(dim=0) 

The next code snippet shows the issue:

import os.path
import shutil
import numpy as np
import blaze as blz

shape = (3,4)
arr = np.ones(shape)

dshape = "%s,%s, float64" % (shape[0], shape[1])
path = "p.blz"
if os.path.exists(path):
    shutil.rmtree(path)
bparams = blz.params(storage=path)
barray = blz.Array(arr, dshape, params=bparams)
print "barray:", repr(barray)

barray2 = blz.open(path)
print "barray2:", repr(barray2)

assert(str(barray.datashape) == str(barray2.datashape))

Disagreement in size of "float" between blaze and numpy

The test_dtype_compat() test is currently failing because of a confusion between the "float" dtype in blaze and the "float" dtype in numpy:

======================================================================
FAIL: test_dtype_compat (blaze.tests.test_numpy_compat.TestToNumPy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "blaze/tests/test_numpy_compat.py", line 23, in test_dtype_compat
    self.assertEqual(to_numpy(blaze.float_), np.float_)
AssertionError: dtype('float32') != <type 'numpy.float64'>

----------------------------------------------------------------------

The issue is that blaze.float_ is defined in blaze.datashape.coretypes to be:

 float_     = CType('float')

Then CType.to_dtype() has a special case:

if self.name == "float":
    return np.dtype("f")

And np.dtype("f") is a 32-bit float.

Fixing this is easy, but I think the root confusion is whether the blaze.float_ dshape is supposed to be the C "float" (32-bit) or the Python "float" (64-bit). Because of this mismatch, I personally would vote to either:

  • Eliminate "float" from Blaze entirely and force people to be explicit and use the existing blaze.float32 or blaze.float64 dshapes.
  • Use the Python/numpy convention and make float_ = float64.

Happy to make a PR, but this is a design question for you guys. :)
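The mismatch in concrete NumPy terms: the C format code "f" is a 32-bit float, while Python's float maps to 64 bits, which is exactly the ambiguity the two options above would resolve.

```python
import numpy as np

c_float = np.dtype("f")      # the C 'float' format code: 32-bit
py_float = np.dtype(float)   # Python's float: 64-bit (float64)

print(c_float.itemsize, py_float.itemsize)   # 4 8
```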

Open command should be from_*

I think blaze.open should match blaze.load.

  1. open is taken by a builtin, and we don't support many of the things it does.
  2. We don't have a load, and if we want to be API-compatible with numpy, one would be necessary.

Server compute context

The compute context which worked with the server on top of dynd hasn't been ported to the server in blaze. We need to discuss and figure out how we want it to work, based on the compute mechanisms built in blaze.

Do not require uri for local file

We seem to require the network protocol for all URIs when opening files. For example

store = blaze.Storage('csv:///tmp/test.csv')

instead of

store = blaze.Storage('/tmp/test.csv')

Not requiring the network protocol is far better, since this abuse of URIs will be misunderstood. Let's dispatch on file extension where possible.
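Extension-based dispatch could be sketched as follows; infer_backend and the extension map are illustrative names, not Blaze API. A plain path is resolved by its extension, while an explicit scheme still wins when one is present:

```python
import os
from urllib.parse import urlparse

# Illustrative extension -> backend map; not Blaze's actual registry.
_EXT_BACKENDS = {".csv": "csv", ".json": "json", ".blz": "blz"}


def infer_backend(uri):
    # An explicit scheme like 'csv://...' still takes precedence...
    scheme = urlparse(uri).scheme
    if scheme:
        return scheme
    # ...otherwise dispatch on the file extension of the plain path.
    ext = os.path.splitext(uri)[1].lower()
    return _EXT_BACKENDS.get(ext)


print(infer_backend("/tmp/test.csv"))        # csv
print(infer_backend("csv:///tmp/test.csv"))  # csv
```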

Blaze master still using the old parser

One result is that datashapes still need commas on input, but print semicolons on output.

The problem seems to be in blaze/datashape/__init__.py, where it says

from parse import parse

I've tried changing it to

import parser

and have it call parser.parse instead, and this causes the dshape with the semicolon below to work, but causes many errors in the test suite.

In [1]: import blaze

In [2]: blaze.dshape('{x:int32,y:int32}')
Out[2]: dshape("{ x : int32; y : int32 }")

In [3]: blaze.dshape('{x:int32;y:int32}')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\Anaconda\lib\site-packages\IPython\core\interactiveshell.pyc in run_code(self, code_obj)
   2730             self.CustomTB(etype,value,tb)
   2731         except:
-> 2732             self.showtraceback()
   2733         else:
   2734             outflag = 0

C:\Anaconda\lib\site-packages\IPython\core\interactiveshell.pyc in showtraceback(self, exc_tuple, filename, tb_offset, exception_only)
   1718                                             value, tb, tb_offset=tb_offset)
   1719 
-> 1720                     self._showtraceback(etype, value, stb)
   1721                     if self.call_pdb:
   1722                         # drop into debugger

C:\Anaconda\lib\site-packages\IPython\zmq\zmqshell.pyc in _showtraceback(self, etype, evalue, stb)
    537             u'traceback' : stb,
    538             u'ename' : unicode(etype.__name__),
--> 539             u'evalue' : safe_unicode(evalue)
    540         }
    541 

C:\Anaconda\lib\site-packages\IPython\zmq\zmqshell.pyc in safe_unicode(e)
    443     """
    444     try:
--> 445         return unicode(e)
    446     except UnicodeError:
    447         pass

C:\Anaconda\lib\site-packages\blaze\error.pyc in __str__(self)
     54             filename = self.filename,
     55             lineno   = self.lineno,
---> 56             line     = self.text.split()[self.lineno],
     57             pointer  = ' '*self.col_offset + '^',
     58             msg      = self.msg,

TypeError: list indices must be integers, not str

Iterators in BLZ should be in their own class

Right now, the barray and btable objects in BLZ implement __iter__ in the same class, and this can create problems in different situations:

  1. the len() cannot be shared between the iterator and the underlying object (e.g. nd.array(b.where(a<5)) uses the len(b) to fill the object).

  2. two iterators cannot be run simultaneously (e.g. zip(b.where(a<5), b.where(a>1 && a<6)))

Making the iterator an independent object will solve these issues.
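A minimal sketch of the proposed separation, with hypothetical names: each where() call returns a fresh iterator object, so len() stays meaningful on the container and two scans can run simultaneously.

```python
class WhereIter:
    # Independent iterator object: holds its own scan state.
    def __init__(self, data, pred):
        self._it = (x for x in data if pred(x))

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)


class Table:
    # Stand-in for a btable-like container (hypothetical, not BLZ API).
    def __init__(self, data):
        self._data = list(data)

    def __len__(self):
        # Unaffected by any iterator currently running.
        return len(self._data)

    def where(self, pred):
        # Every call produces a brand-new, independent iterator.
        return WhereIter(self._data, pred)


t = Table(range(10))
# Two simultaneous scans over the same container, zipped together:
pairs = list(zip(t.where(lambda a: a < 5), t.where(lambda a: 1 < a < 6)))
```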

Circular imports?

When I try to run the tests in blaze/tests/test_quickstart.py with nose, I get an error that looks like a circular import:

======================================================================
ERROR: tests.test_quickstart.test_sqlite
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/work/projects/blaze/env/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/work/projects/blaze/tests/test_quickstart.py", line 52, in test_sqlite
    from blaze import open
  File "/work/projects/blaze/env/lib/python2.7/site-packages/blaze/__init__.py", line 5, in <module>
    from lib import *
  File "/work/projects/blaze/env/lib/python2.7/site-packages/blaze/lib.py", line 41, in <module>
    from blaze.rts.funcs import PythonFn, install, lift
  File "/work/projects/blaze/env/lib/python2.7/site-packages/blaze/rts/funcs.py", line 30, in <module>
    from blaze.metadata import all_prop
  File "/work/projects/blaze/env/lib/python2.7/site-packages/blaze/metadata.py", line 2, in <module>
    from blaze.expr.utils import Symbol as S
  File "/work/projects/blaze/env/lib/python2.7/site-packages/blaze/expr/__init__.py", line 1, in <module>
    import ops
  File "/work/projects/blaze/env/lib/python2.7/site-packages/blaze/expr/ops.py", line 1, in <module>
    from graph import Op
  File "/work/projects/blaze/env/lib/python2.7/site-packages/blaze/expr/graph.py", line 24, in <module>
    from blaze.expr import nodes, catalog
ImportError: cannot import name nodes

I think the problem is that blaze.expr.__init__ ultimately performs an absolute import of blaze.expr again when trying to import blaze.expr.nodes.
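One common way to break such a cycle is to defer the offending import into the function that needs it, so that neither module requires the other to be fully initialized at import time. The following is a generic sketch with hypothetical module names (mod_a, mod_b), not the actual blaze fix:

```python
import importlib
import pathlib
import sys
import tempfile

# Two hypothetical modules that mirror the blaze cycle in miniature:
# each one needs a name from the other.
MOD_A = """
CONSTANT = 41

def compute():
    # Deferred import: resolved at call time, after both modules
    # have finished loading, so no partially-initialized module is seen.
    from mod_b import helper
    return helper()
"""

MOD_B = """
def helper():
    from mod_a import CONSTANT  # deferred for the same reason
    return CONSTANT + 1
"""

tmpdir = tempfile.mkdtemp()
pathlib.Path(tmpdir, "mod_a.py").write_text(MOD_A)
pathlib.Path(tmpdir, "mod_b.py").write_text(MOD_B)
sys.path.insert(0, tmpdir)

result = importlib.import_module("mod_a").compute()
```

With top-level imports the two modules can deadlock on each other's partially-initialized namespaces; moving the imports into the function bodies makes the dependency lazy.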

installation failed

In July I installed successfully; recently the updated version could not be installed.

drill@goldMINER:~/blaze$ sudo python setup.py install
* Found Cython 0.19.1 package installed.
* Found numpy 1.6.1 package installed.
running install
running build
running build_py
running build_ext
skipping 'blaze/blz/blz_ext.c' Cython extension (up-to-date)
Rebuilding the datashape parser...
Traceback (most recent call last):
  File "setup.py", line 301, in <module>
    'build'     : make_build(build),
  File "/usr/lib/python2.7/distutils/core.py", line 152, in setup
    dist.run_commands()
  File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
  File "/usr/lib/python2.7/distutils/command/install.py", line 601, in run
    self.run_command('build')
  File "/usr/lib/python2.7/distutils/cmd.py", line 326, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
  File "setup.py", line 256, in run
    build_command.run(self)
  File "/usr/lib/python2.7/distutils/command/build.py", line 128, in run
    self.run_command(cmd_name)
  File "/usr/lib/python2.7/distutils/cmd.py", line 326, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
  File "setup.py", line 260, in run
    rebuild_parse_tables()
  File "setup.py", line 265, in rebuild_parse_tables
    from blaze.datashape.parser import rebuild
  File "build/lib.linux-x86_64-2.7/blaze/__init__.py", line 13, in <module>
    from .array import Array
  File "build/lib.linux-x86_64-2.7/blaze/array.py", line 14, in <module>
    from blaze.ops import ufuncs
  File "build/lib.linux-x86_64-2.7/blaze/ops/ufuncs.py", line 43, in <module>
    @elementwise('A -> A -> bool')
  File "build/lib.linux-x86_64-2.7/blaze/function.py", line 26, in decorator
    return overload(signature, elementwise=True)(f)
  File "build/lib.linux-x86_64-2.7/blaze/overloading.py", line 69, in decorator
    signature = dshape(signature)
  File "build/lib.linux-x86_64-2.7/blaze/datashape/util.py", line 67, in dshape
    ds = _dshape(o, multi)
  File "build/lib.linux-x86_64-2.7/blaze/datashape/util.py", line 75, in _dshape
    return parser.parse(o)
  File "build/lib.linux-x86_64-2.7/blaze/datashape/parser.py", line 481, in parse
    ds = _parse(pattern)
  File "build/lib.linux-x86_64-2.7/blaze/datashape/parser.py", line 463, in _parse
    raise RuntimeError("Parse tables not built, run install script.")
RuntimeError: Parse tables not built, run install script.

setup.py fails with DistutilsClassError

installed anaconda CE
got the blaze-core code from git
ran make build

python --version: Python 2.7.3 :: AnacondaCE 1.3.0 (x86_64)
OS: OSX Version 10.6.8

executed

"python setup.py test"

result:

* Found Cython 0.17.4 package installed.
* Found numpy 1.7.0rc1 package installed.
Traceback (most recent call last):
  File "setup.py", line 317, in <module>
    'clean' : CleanCommand,
  File "/Users/ekimbrel/anaconda/lib/python2.7/distutils/core.py", line 138, in setup
    ok = dist.parse_command_line()
  File "/Users/ekimbrel/anaconda/lib/python2.7/distutils/dist.py", line 467, in parse_command_line
    args = self._parse_command_opts(parser, args)
  File "/Users/ekimbrel/anaconda/lib/python2.7/distutils/dist.py", line 531, in _parse_command_opts
    "command class %s must subclass Command" % cmd_class
distutils.errors.DistutilsClassError: command class <class 'unittest.runner.TextTestRunner'> must subclass Command

Blaze docs can't be read on iPhone

Matt Knox emailed the mailing list:

Just wanted to point out that the blaze docs can't be read on an
iPhone (at least on my 4S with the latest iOS). The nav menu
hovers and blocks most of the text as you scroll.

Can't import blaze due to canonical.py

Based on a fresh checkout (I'm currently on c287f4c):

$ python -c 'import blaze'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ehiggs/.virtualenvs/pandas/local/lib/python2.7/site-packages/blaze/__init__.py", line 5, in <module>
    from lib import *
  File "/home/ehiggs/.virtualenvs/pandas/local/lib/python2.7/site-packages/blaze/lib.py", line 41, in <module>
    from blaze.rts.funcs import PythonFn, install, lift
  File "/home/ehiggs/.virtualenvs/pandas/local/lib/python2.7/site-packages/blaze/rts/funcs.py", line 30, in <module>
    from blaze.metadata import all_prop
  File "/home/ehiggs/.virtualenvs/pandas/local/lib/python2.7/site-packages/blaze/metadata.py", line 2, in <module>
    from blaze.expr.utils import Symbol as S
  File "/home/ehiggs/.virtualenvs/pandas/local/lib/python2.7/site-packages/blaze/expr/__init__.py", line 1, in <module>
    import ops
  File "/home/ehiggs/.virtualenvs/pandas/local/lib/python2.7/site-packages/blaze/expr/ops.py", line 1, in <module>
    from graph import Op
  File "/home/ehiggs/.virtualenvs/pandas/local/lib/python2.7/site-packages/blaze/expr/graph.py", line 26, in <module>
    from blaze.sources.canonical import PythonSource
  File "/home/ehiggs/.virtualenvs/pandas/local/lib/python2.7/site-packages/blaze/sources/canonical.py", line 6, in <module>
    from blaze.sources.descriptors.byteprovider import ByteProvider
ImportError: No module named descriptors.byteprovider

Parsing datashapes with "type Name = ..." in them returns None

Test code:

>>> from blaze import dshape
>>> dshape('{x:int32; y:int32}')
dshape("{ x : int32; y : int32 }")
>>> dshape('type P = {x:int32; y:int32}')
>>> dshape("""type P = {x:int32; y:int32}
...    3, P""")
>>>

The behavior I expected is for it to return the last datashape declared in the string.

Provide a `.to_numpy()` function or method for dtypes

On some occasions, especially in code that is in flux while converting from numpy to dynd, you need to convert arrays and types bidirectionally between the two packages. In dynd we already have nd.to_numpy(), but we lack a way to convert types to numpy types.

I have tried this:

In [7]: dt = nd.type_of(nd.empty('2,2,int32'))

In [8]: np.dtype(str(dt.dtype))
Out[8]: dtype('int32')

but the user can get unpleasant surprises:

In [9]: np.dtype(dt.dtype)
Violació de segment

("Violació de segment" is Catalan for "segmentation fault".)

I think it would be nice to offer something like a ndt.to_numpy() function or a .to_numpy() method for converting dynd types to numpy equivalents (when possible).
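Such a helper could be sketched around the string round-trip shown above. This is a minimal sketch, not the real ndt.to_numpy(): `to_numpy_dtype` is a hypothetical name, and a real implementation would need to handle dynd-specific types that have no numpy equivalent.

```python
import numpy as np

def to_numpy_dtype(dt):
    """Convert a type object to a numpy dtype via its string form.

    Raises a descriptive TypeError instead of crashing when the
    string form has no numpy equivalent.
    """
    try:
        return np.dtype(str(dt))
    except TypeError as exc:
        raise TypeError("no numpy equivalent for %r" % (dt,)) from exc

dt32 = to_numpy_dtype("int32")
```

The point of the wrapper is precisely to turn the "unpleasant surprise" above into a clean Python exception.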

fromiter silently catches exceptions thrown by generators, generates bad matrices

Right now:

In [43]: def gen(rows):
   ....:     for i in rows:
   ....:         yield 0.1*i
   ....:

In [44]: blaze.fromiter(gen(100), 'x, { f1: int; f2: int }')
Out[44]: 
Array
  datashape := 0, { f1 : int32; f2 : int32 } 
  values    := [CArray(ptr=140531395163520)] 
  metadata  := [manifest, arraylike] 
  layout    := Chunked(dim=0) 
[]

In numpy the exception passes through:

In [45]: np.fromiter(gen(100), np.float32)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-45-4665ba131c41> in <module>()
----> 1 np.fromiter(gen(100), np.float32)

<ipython-input-43-ee234889df46> in gen(rows)
      1 def gen(rows):
----> 2     for i in rows:
      3         yield 0.1*i
      4 

TypeError: 'int' object is not iterable

It seems that numpy's behavior is more sensible.

Note: if passing a dshape with a non-variable dimension:

In [50]: blaze.fromiter(gen(100), '100, float')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-50-c413bd6d80b0> in <module>()
----> 1 blaze.fromiter(gen(100), '100, float')

/Users/ovillellas/continuum/blaze-core/blaze/toplevel.pyc in fromiter(iterable, dshape, params)
    189         return open(rootdir)
    190     else:
--> 191         ica = carray.fromiter(iterable, dtype, count=count, cparams=cparams)
    192         source = CArraySource(ica, params=params)
    193         return Array(source)

/Users/ovillellas/continuum/blaze-core/blaze/carray/toplevel.pyc in fromiter(iterable, dtype, count, **kwargs)
    183             blen = chunklen
    184         if count != sys.maxint:
--> 185             chunk = np.fromiter(iterable, dtype=dtype, count=blen)
    186         else:
    187             try:

ValueError: iterator too short

It may be that blaze is interpreting any exception as an end-of-iteration signal.
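The expected behavior can be sketched as follows. This is a minimal pure-Python sketch, not the actual blaze or carray implementation: only StopIteration ends the fill, and anything else the generator raises propagates to the caller, as in np.fromiter.

```python
def strict_fromiter(iterable, count=None):
    """Fill a list from an iterator without swallowing its exceptions.

    Only StopIteration is treated as end-of-data; any other exception
    raised by the generator body reaches the caller.
    """
    out = []
    it = iter(iterable)
    while count is None or len(out) < count:
        try:
            out.append(next(it))
        except StopIteration:
            if count is not None:
                # A fixed-size request that ran dry is an error, not EOF
                raise ValueError("iterator too short")
            break
        # Note there is no blanket `except Exception` here: a TypeError
        # from the generator propagates instead of yielding an empty result.
    return out


def gen(rows):
    for i in rows:
        yield 0.1 * i

values = strict_fromiter(gen(range(5)))   # five floats: 0.0, 0.1, ...
try:
    strict_fromiter(gen(100))             # 100 is not iterable
except TypeError:
    propagated = True
```

With this structure, the gen(100) mistake surfaces as a TypeError, and a too-short iterator against a fixed count surfaces as ValueError, rather than either being silently absorbed.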

dshape parser behaving inconsistently with large datashape

Running the command dshape(s) in the following example produces different error messages when run repeatedly. It should produce the same error message each time. After a while, it repeatedly gives the 'list index out of range' error, the last one listed below.

Error Messages:

C:\Anaconda\lib\site-packages\blaze\datashape\parser.pyc in p_error(p)
    273             p.lexpos,
    274             '<stdin>',
--> 275             p.lexer.lexdata,
    276         )
    277     else:

DatashapeSyntaxError: 

  File <stdin>, line 32
    {
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       ^

DatashapeSyntaxError: invalid syntax
DatashapeSyntaxError: 

  File <stdin>, line 63
    in
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       ^

DatashapeSyntaxError: invalid syntax
DatashapeSyntaxError: 

  File <stdin>, line 94
    type:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       ^

DatashapeSyntaxError: invalid syntax
DatashapeSyntaxError: 

  File <stdin>, line 125
    #
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       ^

DatashapeSyntaxError: invalid syntax
DatashapeSyntaxError: 

  File <stdin>, line 156
    {
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       ^

DatashapeSyntaxError: invalid syntax
DatashapeSyntaxError: 

  File <stdin>, line 187
    processed_date:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       ^

DatashapeSyntaxError: invalid syntax
DatashapeSyntaxError: 

  File <stdin>, line 218
    bulkEntries:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       ^

DatashapeSyntaxError: invalid syntax
C:\Anaconda\lib\site-packages\blaze\error.pyc in __str__(self)
     54             filename = self.filename,
     55             lineno   = self.lineno,
---> 56             line     = self.text.split()[self.lineno],
     57             pointer  = ' '*self.col_offset + '^',
     58             msg      = self.msg,

IndexError: list index out of range

Code:

from blaze import dshape

s = """5, {
    id: int64;
    name: string;
    description: {
        languages: VarDim, string(2);
        texts: json # map<string(2), string>;
    };
    status: string; # LoanStatusType;
    funded_amount: float64;
    basket_amount: json; # Option(float64);
    paid_amount: json; # Option(float64);
    image: {
        id: int64;
        template_id: int64;
    };
    video: json; # Option({
    #    id: int64;
    #    youtube_id: string;
    #});
    activity: string;
    sector: string;
    use: string;
    # For 'delinquent', saw values \"null\" and \"true\" in brief search, map null -> false on import?
    delinquent: bool;
    location: {
        country_code: string(2);
        country: string;
        town: json; # Option(string);
        geo: {
            level: string; # GeoLevelType
            pairs: string; # latlong
            type: string; # GeoTypeType
        }
    };
    partner_id: int64;
    posted_date: json; # datetime<seconds>;
    planned_expiration_date: json; # Option(datetime<seconds>);
    loan_amount: float64;
    currency_exchange_loss_amount: json; # Option(float64);
    borrowers: VarDim, {
        first_name: string;
        last_name: string;
        gender: string(1); # GenderType
        pictured: bool;
    };
    terms: {
        disbursal_date: json; # datetime<seconds>;
        disbursal_currency: json; # Option(string);
        disbursal_amount: float64;
        loan_amount: float64;
        local_payments: VarDim, {
            due_date: json; # datetime<seconds>;
            amount: float64;
        };
        scheduled_payments: VarDim, {
            due_date: json; # datetime<seconds>;
            amount: float64;
        };
        loss_liability: {
            nonpayment: string; # Categorical(string, [\"lender\", \"partner\"]);
            currency_exchange: string;
            currency_exchange_coverage_rate: json; # Option(float64);
        }
    };
    payments: VarDim, {
        amount: float64;
        local_amount: float64;
        processed_date: json; # datetime<seconds>;
        settlement_date: json; # datetime<seconds>;
        rounded_local_amount: float64;
        currency_exchange_loss_amount: float64;
        payment_id: int64;
        comment: json; # Option(string);
    };
    funded_date: json; # datetime<seconds>;
    paid_date: json; # datetime<seconds>;
    journal_totals: {
        entries: int64;
        bulkEntries: int64;
    }
}

type KivaLoansFile = {
    header: {
        total: int64;
        page: int64;
        date: string;
        page_size: int64;
    };
    loans: VarDim, KivaLoan;
}"""

dshape(s)

Example in quick start docs does not work

The example in: http://blaze.pydata.org/docs/quickstart.html#custom-dshapes fails with this error:

======================================================================
ERROR: blaze.tests.test_table.test_custom_dshape
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/faltet/anaconda/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/faltet/software/blaze-core/blaze/tests/test_table.py", line 50, in test_custom_dshape
    from blaze import int32, string
ImportError: cannot import name string

It also fails even without using strings, with this error:

======================================================================
ERROR: blaze.tests.test_table.test_custom_dshape
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/faltet/anaconda/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/faltet/software/blaze-core/blaze/tests/test_table.py", line 60, in test_custom_dshape
    a = Table([(120, 153)], CustomStock)
  File "/Users/faltet/software/blaze-core/blaze/table.py", line 453, in __init__
    self._axes = self._datashape[-1].names
TypeError: 'DeclMeta' object does not support indexing

Implicit Coercion to 64bit

In [23]: blaze.ones('32, float32')
Out[23]:
Array
  datashape := 32, float64
  values    := [CArray(ptr=140628285688896)]
  metadata  := [manifest, arraylike]
  layout    := Chunked(dim=0)
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
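For comparison, NumPy preserves the requested precision. This is a quick illustration of the expected behavior using plain NumPy rather than blaze:

```python
import numpy as np

a = np.ones(32, dtype=np.float32)
# The dtype requested is the dtype delivered -- no silent widening to 64-bit
assert a.dtype == np.float32
```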
