ibis-project / ibis

the portable Python dataframe library

License: Apache License 2.0

Python 98.42% Shell 0.07% CMake 0.04% C++ 0.92% Nix 0.24% Dockerfile 0.02% JavaScript 0.11% R 0.04% Just 0.10% Visual Basic 6.0 0.04%

python impala pandas database clickhouse postgresql sqlite mysql datafusion sql


ibis's Issues

[CLOSED] Provide DISTINCT support such as Impala is capable of it

Issue by wesm
Wednesday Jan 07, 2015 at 23:17 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/49

Support metrics such as:

table[column].distinct().count()

Simple case with a single COUNT(DISTINCT ...) will be straightforward. Impala does not have support for multiple DISTINCT clauses at the moment; we will have to work around via joins or raise exceptions in the event that we can't get something Impala will execute.

For now let's limit issue scope to having a single DISTINCT per SELECT set. Error should be raised on SQL translation if there are multiple distincts (bit of a hack, we'll have to examine separately and see if DISTINCT is found multiple times...)
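
A minimal sketch of the check described above, assuming the SELECT set is available as a flat list of aggregate expressions. The `Agg` tuple and `check_single_distinct` are hypothetical names used for illustration only, not part of Ibis:

```python
from collections import namedtuple

# Hypothetical stand-in for one aggregate expression in a SELECT set.
Agg = namedtuple('Agg', ['func', 'column', 'distinct'])


def check_single_distinct(aggs):
    """Raise if more than one DISTINCT aggregate appears in the SELECT set."""
    n_distinct = sum(1 for agg in aggs if agg.distinct)
    if n_distinct > 1:
        raise ValueError(
            'found %d DISTINCT aggregates; only one per SELECT is supported'
            % n_distinct)
    return aggs


ok = [Agg('count', 'a', True), Agg('sum', 'b', False)]
check_single_distinct(ok)  # passes: only one DISTINCT in the set
```

Raising at translation time keeps the failure on the client side instead of sending Impala a query it cannot execute.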

Handling of 0 row UDAs

Issue by bittorf
Thursday Dec 18, 2014 at 18:42 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/22

Currently, if a UDA is called with 0 rows, it will return NULL.

For example,

SELECT ibis_uda(stuff, foo) FROM table WHERE false;

will return NULL. It will not call the Init or Finalize function for the user.

Add like / regex / rlike support a la Impala

Issue by wesm
Wednesday Jan 07, 2015 at 22:43 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/45

http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_operators.html

Ibis UDA persistence in Impala

Issue by wesm
Tuesday Dec 09, 2014 at 21:51 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/14

Some way to create a UDA that can be reused between sessions (and by other users)

Method for user to indicate format of data to be passed to UDA subclasses

Issue by wesm
Wednesday Dec 10, 2014 at 00:35 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/15

For example, the user might want all data to be wrapped in pandas Series objects. One approach is class level variables and creating a metaclass for all UDA subclasses to inherit from (perhaps optionally).

[CLOSED] Front-end error checking for ibis_uda calls

Issue by wesm
Wednesday Dec 10, 2014 at 21:10 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/19

Currently select ibis_uda("any string") from table will segfault impalad

Timestamp support in Impala UDA interface

Issue by wesm
Tuesday Dec 09, 2014 at 20:04 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/1

String support in Impala UDA interface

Issue by wesm
Tuesday Dec 09, 2014 at 20:04 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/3

[CLOSED] Support array/list literals for set membership and other operations

Issue by wesm
Wednesday Jan 07, 2015 at 18:34 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/37

Given SQL like:

select some_key, sum(value) as metric
from table
where other_key in (value1, value2, value3)
group by 1

It would be useful to be able to use pandas syntax to write:

cond = table['other_key'].isin([value1, value2, value3])
filtered_table = table[cond]
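
One way the list literal could be turned into the SQL above is sketched here. `in_clause` is a hypothetical helper, and the quoting rule (doubling single quotes in strings) is an assumption about the target dialect:

```python
def render_literal(value):
    """Render a Python value as a SQL literal (strings are single-quoted)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def in_clause(column, values):
    """Render `column IN (...)` from a Python list of literals."""
    if not values:
        raise ValueError('IN requires at least one value')
    return '%s IN (%s)' % (column, ', '.join(render_literal(v) for v in values))


clause = in_clause('other_key', [1, 2, 3])
print(clause)
```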

Impala Code Style

Issue by bittorf
Tuesday Dec 09, 2014 at 20:10 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/8

Need to do a pass to prepare for review by impala team; such as:

handling of namespaces, formatting rules for functions, etc.

[CLOSED] Prototype for deferred expression IR

Issue by wesm
Monday Jan 05, 2015 at 19:56 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/pull/27

Implement a first design for a deferred expression intermediate representation (IR) that can be compiled to SQL with Ibis. Basic type system, type promotions in arithmetic, and a set of composable scalar, array, tabular / relational algebra operations, and associated tests. No concrete connection to Impala yet.

wesm included the following code: http://github.mtv.cloudera.com/wesm/ibis/pull/27/commits

Semaphore array management and cleanup in production

Issue by wesm
Tuesday Dec 09, 2014 at 21:15 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/12

In pathological cases, Impala nodes could run out of semaphore arrays, which are currently being used for IPC. One solution is to clear them all out when spinning up the cluster. Here's one method to do that:

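#!/bin/bash
# Remove every System V semaphore array owned by the current user.
USER=$(whoami)
# ipcs -s prints "key semid owner ..."; keep the semid column for our rows.
SEMAPHORES=$(ipcs -s | grep -E "0x[0-9a-f]+ [0-9]+" | grep "$USER" | cut -f2 -d" ")
for id in $SEMAPHORES; do
  ipcrm -s "$id"
done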

Implement a "tier" type numeric field transform

Issue by wesm
Tuesday Jan 06, 2015 at 18:08 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/34

As a companion to #33, we can allow the user to indicate specific bucket edges, and under the hood we can generate a case statement that bins values into these buckets, e.g.:

table['field'].bucket([0, 100, 1000, 10000])
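
A rough sketch of the case statement that could be generated under the hood. `bucket_case_sql` is a hypothetical helper. The closed-left / open-right intervals and the NULL result for out-of-range values are assumptions, since the issue does not specify them:

```python
def bucket_case_sql(column, edges):
    """Build a CASE expression binning `column` into [edges[i], edges[i+1])."""
    if len(edges) < 2:
        raise ValueError('need at least two bucket edges')
    lines = ['CASE']
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        lines.append('  WHEN %s >= %s AND %s < %s THEN %d'
                     % (column, lo, column, hi, i))
    # Assumption: values outside all buckets map to NULL.
    lines.append('  ELSE NULL')
    lines.append('END')
    return '\n'.join(lines)


sql = bucket_case_sql('field', [0, 100, 1000, 10000])
print(sql)
```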

Provide Impala output sinks for table expressions in various compatible formats

Issue by wesm
Wednesday Jan 07, 2015 at 19:36 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/41

[CLOSED] Implement support for compound aggregate expressions

Issue by wesm
Tuesday Dec 30, 2014 at 01:16 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/24

For example:

log(sum(foo))

Or something more complex like (in the implied Python API):

foo.sum().log() / bar.sum().log() - 1

The actual aggregations are buried in the expression tree; internally we must recognize the expression as an aggregation and verify all column references are valid (all come from the same table)
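
To illustrate the two checks, here is a toy expression tree, not the Ibis IR. The `Expr` class, the `AGG_OPS` set, and both helper functions are hypothetical:

```python
class Expr(object):
    """Toy expression node: an operation name plus its arguments."""

    def __init__(self, op, *args):
        self.op = op
        self.args = args


AGG_OPS = {'sum', 'mean', 'count', 'min', 'max'}


def column(table, name):
    return Expr('column', table, name)


def is_aggregation(expr):
    """True if a reduction appears anywhere in the tree."""
    if not isinstance(expr, Expr):
        return False
    if expr.op in AGG_OPS:
        return True
    return any(is_aggregation(arg) for arg in expr.args)


def root_tables(expr):
    """Collect the names of all tables referenced by column nodes."""
    if not isinstance(expr, Expr):
        return set()
    if expr.op == 'column':
        return {expr.args[0]}
    tables = set()
    for arg in expr.args:
        tables |= root_tables(arg)
    return tables


# foo.sum().log() / bar.sum().log() - 1
foo, bar = column('t', 'foo'), column('t', 'bar')
ratio = Expr('div', Expr('log', Expr('sum', foo)), Expr('log', Expr('sum', bar)))
expr = Expr('sub', ratio, 1)
```

An expression would be accepted as a valid aggregate when `is_aggregation(expr)` is true and `root_tables(expr)` contains exactly one table.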

[CLOSED] Decide on API required to support self-join workflows

Issue by wesm
Monday Jan 05, 2015 at 20:14 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/28

To allow self-joins to work, we need a way to reference the logical table entities when forming predicates and other expressions. For example:

SELECT left.key, sum(left.value - right.value) as total_deltas
FROM table left
  INNER JOIN table right
    ON left.current_period = right.previous_period + 1
GROUP BY 1

With our deferred expression API, one way might be to create table views:

left = table_expr
right = table_expr.view()

agg_expr = (left['value'] - right['value']).sum()
join_exprs = [left['current_period'] == (right['previous_period'] + 1)]
(left.inner_join(right, join_exprs)
     .aggregate({'total_deltas': agg_expr}, by=[left['key']]))

Under the hood, the table and its view would be treated as distinct ancestors and thus cause no ambiguous relation issues during SQL generation.

[CLOSED] Wrap impyla connections to Impala, enable to yield table expressions from existing tables

Issue by wesm
Wednesday Jan 07, 2015 at 19:35 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/40

It would be nice for the code to look something like:

db = ibis.impala.connect(...)
table_expr = db.table('my_table_name')

Show subargument names in repr for various expression operation types

Issue by wesm
Tuesday Jan 06, 2015 at 17:59 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/32

Have a general way to display lists (and other kinds of collections) of expressions in the repr. For example: in Aggregation the aggregation exprs and grouping exprs should be nested under some headings for better readability. Same goes for projections and filters

ibis.server doesn't close when Impala killed

Issue by bittorf
Tuesday Dec 09, 2014 at 20:09 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/7

[CLOSED] Zero argument UDAs

Issue by bittorf
Thursday Dec 18, 2014 at 18:43 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/23

Currently, we do not support UDAs which take 0 arguments. For example, you can't call:

SELECT ibis_uda("foobar") FROM tbl;

We have the requirement that there is at least one argument (in addition to the bytecode). Is this a reasonable restriction or should we expand support for 0 args?

Remove work from destructors

Issue by bittorf
Thursday Dec 18, 2014 at 00:26 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/21

We should avoid doing work in C++ class destructors. We should have trivial destructors (that do nothing) and instead have a Close(), Free(), Delete() method as appropriate.

The issue with destructors is that there is no way to handle errors if destructors fail (it will crash Impala); also exceptions caused in destructors can cause an immediate termination (if another exception is already in flight).

[CLOSED] Define API and implement single-case and multi-case expressions in expression IR

Issue by wesm
Tuesday Jan 06, 2015 at 17:55 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/31

In SQL Server parlance, these correspond to simple and searched case expressions:

http://msdn.microsoft.com/en-us/library/ms181765.aspx

Most of the work here is in handling various type promotion cases. From the above article:

Returns the highest precedence type from the set of types in
result_expressions and the optional else_result_expression. For
more information, see Data Type Precedence (Transact-SQL).

Question: how does data type precedence differ between databases (Impala vs others)?
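
The "highest precedence type" rule could be sketched as follows. The `PRECEDENCE` ordering is an assumption chosen for illustration; as the question above notes, actual precedence may differ between databases:

```python
# Assumed precedence order, lowest to highest; real databases may differ.
PRECEDENCE = ['int8', 'int16', 'int32', 'int64', 'float', 'double']


def highest_precedence(types):
    """Return the type that ranks highest in PRECEDENCE.

    Raises ValueError for a type that is not in the list.
    """
    return max(types, key=PRECEDENCE.index)


# Result type of a CASE whose branches yield int8, int32 and double values.
result_type = highest_precedence(['int8', 'int32', 'double'])
```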

[CLOSED] Add support for post-aggregation predicates (HAVING clauses)

Issue by wesm
Wednesday Jan 07, 2015 at 22:47 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/47

e.g.

HAVING count(*) > 100

[CLOSED] Ibis task workers are not released after UDA execution

Issue by wesm
Wednesday Dec 10, 2014 at 19:03 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/16

[CLOSED] Cherry pick to cdh5-trunk

Issue by bittorf
Tuesday Dec 09, 2014 at 20:10 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/9

Propagate Ibis exceptions conservatively to impala-shell

Issue by wesm
Friday Dec 12, 2014 at 00:33 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/20

For example, import errors should be immediately visible in the Impala shell

Maintaining aggregation state between task calls

Issue by wesm
Tuesday Dec 09, 2014 at 20:08 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/6

The goal is to reduce serialization / data transfer costs between consecutive Update tasks. In the case of a group-wise aggregation, many aggregation states might exist simultaneously in a particular Python worker process. These can be identified by a UUID, and the UUID will have to be made known to the parent Impala process.

One problem with this is that we would have to keep sending tasks to the same worker process, so releasing the worker back to the pool could become problematic.
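
A minimal sketch of UUID-keyed state inside one worker process. `StateStore` and its methods are hypothetical, and a running sum stands in for a real aggregation state:

```python
import uuid


class StateStore(object):
    """Hypothetical per-worker registry of live aggregation states."""

    def __init__(self):
        self._states = {}

    def create(self, initial):
        # The key is what would be handed back to the parent Impala process.
        key = uuid.uuid4().hex
        self._states[key] = initial
        return key

    def update(self, key, values):
        # State stays in the worker; only new values cross the process boundary.
        self._states[key] += sum(values)

    def finalize(self, key):
        return self._states.pop(key)


store = StateStore()
group_a = store.create(0)
group_b = store.create(0)
store.update(group_a, [1, 2, 3])
store.update(group_b, [10, 5])
store.update(group_a, [4])
```

Because the state lives in one process, every task for `group_a` has to reach the same worker, which is the pooling problem noted above.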

[CLOSED] Memory mapped files are not reused at all

Issue by wesm
Tuesday Dec 09, 2014 at 20:06 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/5

Instigate a common optimized fork of cloudpickle

Issue by wesm
Wednesday Jan 07, 2015 at 19:32 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/39

Spartan features a significantly optimized fork of PiCloud's cloudpickle:

https://github.com/spartan-array/spartan

Being Apache-licensed code, it would be beneficial to factor this out into a standalone module in PyPI that we (and other libraries needing the enhanced pickling capability) can rely on.

[CLOSED] Add .between yielding boolean values

Issue by wesm
Wednesday Jan 07, 2015 at 22:39 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/44

Both the left and right exprs must be comparable

Design a UDF/UDA extension API for Ibis

Issue by wesm
Wednesday Jan 07, 2015 at 23:57 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/50

Not going to get done in a first cut of the project, but we ought to think about how to weave user-defined functions through the expression IR and subsequent SQL generation. For example, a UDF operating on strings would be made available as a new instance method on StringValue, or multiple UDFs could be coalesced into a single Python API with some options, which could down the road be compiled to the correct concrete SQL UDF call.

# new UDF registration code omitted

transformed_expr = table['some_strings'].my_udf()

Implement a histogram operation type for numeric fields

Issue by wesm
Tuesday Jan 06, 2015 at 18:07 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/33

This can exist as a first-class operator in the expression IR. For the Impala implementation, this will require an aggregation inline view (to get the min/max), a [cross] join (to make the min/max available to all rows), and an expression to get the histogram bucket number.

It might be interesting to enable group-wise histograms (a different bucketing for each group).
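
The bucket-number expression could look like the sketch below, assuming equal-width bins and that the maximum value is clamped into the last bin. `histogram_bucket` is a hypothetical name, and the min/max are computed in Python here rather than through the inline view and join:

```python
import math


def histogram_bucket(value, lo, hi, nbins):
    """Equal-width bin index in [0, nbins - 1]; the max lands in the last bin."""
    if hi == lo:
        return 0
    idx = int(math.floor((value - lo) / (hi - lo) * nbins))
    return min(idx, nbins - 1)


values = [0.0, 2.5, 5.0, 7.5, 10.0]
lo, hi = min(values), max(values)
buckets = [histogram_bucket(v, lo, hi, 4) for v in values]
```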

Implement Ibis operators resulting in data ingest into Impala tables

Issue by wesm
Tuesday Jan 06, 2015 at 19:31 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/35

This issue will cover end-to-end issues here: creating the appropriate table expression and schema from a file, compiling to Impala DDL and issuing multiple queries as appropriate to ingest data and execute any queries built on the resulting table

related: #136, #139, #56

[CLOSED] Datetime support

Issue by wesm
Tuesday Dec 09, 2014 at 20:04 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/2

[CLOSED] Implement APIs / test use cases for semi/anti joins

Issue by wesm
Friday Jan 02, 2015 at 20:28 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/25

Covering IN / NOT IN and EXISTS / NOT EXISTS use cases. Related: #48

[CLOSED] Implement/test filter predicates involving aggregates

Issue by wesm
Wednesday Jan 07, 2015 at 19:59 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/43

Consider the following dplyr code:

dat %>%
  group_by(name, job) %>%
  filter(job != "Boss" | year == min(year)) %>%
  mutate(cumu_job2 = cumsum(job2))

cf http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly/27718317#27718317

Support and implementation in SQL may depend on the database. see for example:

http://stackoverflow.com/questions/6319183/aggregate-function-in-sql-where-clause

Enable configurable row batchsize for passing to Python

Issue by wesm
Tuesday Dec 09, 2014 at 20:22 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/11

One way would be to add an extra parameter to the current ibis_uda C++ UDA, but there might be another way

[CLOSED] Add a convenient API for top-K dimension filtering

Issue by wesm
Friday Jan 02, 2015 at 20:33 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/26

For example, filtering an aggregation (across some other dimensions) to the top K cities ranked by some metric. This is common but tedious to write by hand, and we can make it much simpler:

SELECT t1.foo, t1.bar, sum(t1.baz)
FROM table t1
  LEFT SEMI JOIN (
    SELECT city, avg(qux) as metric
    FROM table
    GROUP BY 1
    ORDER BY metric DESC
    LIMIT K
  ) t2 ON t1.city = t2.city
GROUP BY 1, 2

In many databases this would need to be expressed with WHERE city IN (SELECT city ...)

related to #87
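
The `WHERE city IN (SELECT city ...)` form can be tried against SQLite from the Python standard library. This is only a sketch: the table `t` and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (city TEXT, foo TEXT, baz REAL, qux REAL)')
rows = [
    ('NYC', 'a', 1.0, 9.0),
    ('NYC', 'b', 2.0, 7.0),
    ('SF', 'a', 3.0, 5.0),
    ('LA', 'a', 4.0, 1.0),
]
conn.executemany('INSERT INTO t VALUES (?, ?, ?, ?)', rows)

K = 2
# Keep only rows whose city is among the top K cities by average qux,
# then aggregate across the remaining dimension.
query = """
SELECT foo, sum(baz) AS total
FROM t
WHERE city IN (
    SELECT city FROM t
    GROUP BY city
    ORDER BY avg(qux) DESC
    LIMIT ?
)
GROUP BY foo
ORDER BY foo
"""
result = conn.execute(query, (K,)).fetchall()
```

Here LA has the lowest average `qux`, so it is filtered out before `baz` is summed.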

Consider (and benchmark) IPC alternative to semaphore arrays

Issue by wesm
Tuesday Dec 09, 2014 at 21:16 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/13

They are very low latency, but there might be a more robust / similarly fast approach.

Resolve / test for concurrent access of PythonWorkerPool

Issue by wesm
Wednesday Dec 10, 2014 at 19:04 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/17

This class is not threadsafe at the moment, I don't think

[CLOSED] Fusing projections in the expression IR

Issue by wesm
Tuesday Jan 06, 2015 at 06:47 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/29

For example, if we have

t = table
t2 = t.add_column(t['f'] * 2, 'foo')
t2 = t2.add_column(t['f'] * 4, 'bar')

the expression tree for t2 looks like:

ref_0
  TableView[table]
  a : int8
  b : int16
  c : int32
  d : int64
  e : float
  f : double
  g : string
  h : boolean

ref_1
Projection[table]
  Table: ref_0
  Table: ref_0
  Multiply[array(double)]
    Column[double] 'f' from table ref_0
    Literal[int8] 2

Projection[table]
  Table: ref_1
  Table: ref_1
  Multiply[array(double)]
    Column[double] 'f' from table ref_0
    Literal[int8] 4

Unless there is some dependency on an intermediate projection, no reason not to roll these up into a single projection rather than creating nested projections that have to be fused at evaluation / SQL compilation time (we should do that anyway, but this aids readability)
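
A toy model of the roll-up, not the Ibis IR. `Table`, `Projection`, and this `add_column` signature are hypothetical: a new column is fused into the existing projection only when it depends on the projection's parent rather than on the projection itself.

```python
class Table(object):
    def __init__(self, name):
        self.name = name


class Projection(object):
    def __init__(self, parent, columns):
        self.parent = parent          # table being projected from
        self.columns = list(columns)  # (name, source_table, expr_text) tuples


def add_column(table, source, expr_text, name):
    """Add a derived column, fusing into an existing projection when safe."""
    new_col = (name, source, expr_text)
    # Fuse when the new column depends only on the projection's parent.
    if isinstance(table, Projection) and source is table.parent:
        return Projection(table.parent, table.columns + [new_col])
    return Projection(table, [new_col])


t = Table('table')
t2 = add_column(t, t, 'f * 2', 'foo')
t2 = add_column(t2, t, 'f * 4', 'bar')          # fused: still one projection
nested = add_column(t2, t2, 'foo + 1', 'baz')   # depends on t2: stays nested
```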

[CLOSED] LIMIT and OFFSET support

Issue by wesm
Wednesday Jan 07, 2015 at 22:46 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/46

Address Ibis UDF security issues

Issue by wesm
Wednesday Dec 10, 2014 at 20:55 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/18

In many environments, admin privileges are required to create UDFs or UDAs. Since we'll be effectively opening a backdoor via ibis for the user to run arbitrary Python code, to address security concerns we will have to ensure that this code execution takes place within an environment where the user cannot hurt anything.

Possible solutions

Run ibis daemon/workers inside a container
Run ibis processes under a special user account with limited privileges (just enough to be able to see the data passed from Impala)

[CLOSED] Implement boolean scalar/array logical binary operations

Issue by wesm
Tuesday Jan 06, 2015 at 17:42 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/30

These are already stubbed out in BooleanValue

[CLOSED] Add a .sql API (a la Spark SQL) for generating well-schema'd TableExpr from a SQL query

Issue by wesm
Wednesday Jan 07, 2015 at 01:18 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/36

This will be on the Impala connection wrapper that is yet to be built.

Create user API for specifying tabular data to be written back to Impala

Issue by wesm
Tuesday Dec 09, 2014 at 20:11 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/10

User-defined aggregations right now will have their results stored as pickled Python objects in the column of a result set / new Impala table. We might wish to eventually emit data that Impala can recognize as a primitive tuple or a batch of rows that can be worked with independent of Python.

[CLOSED] Group-wise aggregation will create 1 Python worker per group

Issue by wesm
Tuesday Dec 09, 2014 at 20:05 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/4

Workers are held for the duration by the IbisBackend object; when a task is finished running, the worker should be released to the worker pool (so that it can be used by other IbisBackend objects)
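
One way to guarantee the release is to scope each task with a context manager. This is a sketch only: `WorkerPool` here is a hypothetical Python stand-in, not the actual pool class, and strings stand in for worker processes.

```python
import queue
from contextlib import contextmanager


class WorkerPool(object):
    def __init__(self, workers):
        self._idle = queue.Queue()
        for worker in workers:
            self._idle.put(worker)

    @contextmanager
    def worker(self):
        acquired = self._idle.get()  # blocks until a worker is free
        try:
            yield acquired
        finally:
            # Always return the worker, even if the task raised.
            self._idle.put(acquired)

    def idle_count(self):
        return self._idle.qsize()


pool = WorkerPool(['worker-0'])
used = []
for group in ['a', 'b', 'c']:
    with pool.worker() as w:
        used.append((group, w))  # run one task for this group
```

With release tied to the end of each task, three groups are served by a single worker instead of one worker per group.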

[CLOSED] Achieve test coverage for gamut of scalar, correlated, and uncorrelated subqueries

Issue by wesm
Wednesday Jan 07, 2015 at 22:51 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/48

Refer to: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_subqueries.html

Correlated subqueries don't yet fit neatly into the Ibis expression model and will require some consideration.

related: #25

Will track the status here. Nothing is yet supported when the subquery originates from a different table.

scalar subqueries

t1 = ir.table(
    [
        ('job', 'string'),
        ('dept_id', 'string'),
        ('year', 'int32'),
        ('y', 'double')
    ], 't1')

t2 = ir.table(
    [('x', 'double'), ('job', 'string')], 't2')

t1[t1.y > t2.x.max()]

uncorrelated subqueries

t1[t1.job.isin(t2.job)]

correlated subqueries

using example from Impala docs, unsure about precise syntax

t3 = t1.view()
correlated_stat = t3[t1.dept_id == t3.dept_id].salary.mean()
t1[t1.salary > correlated_stat]

Integration with Impala "local mode" (once available)

Issue by wesm
Wednesday Jan 07, 2015 at 19:37 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/42

Ability to interact with files on the local filesystem

[CLOSED] First cut of a SQL/DDL pipeline generator for Impala

Issue by wesm
Wednesday Jan 07, 2015 at 19:26 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/38

Some basic principles:

DDL / SQL fragments must have some AST representation that can be programmatically generated
Needs to be as testable as possible without having to compare generated SQL query strings. That valid (and hopefully human-legible) SQL is generated can and should be tested separately
Evaluating an expression may involve multiple queries (e.g., data ingest followed by analytics, and teardown in some cases)
Adding new functions (e.g. built-ins) must be as lightweight as possible
Impala-specific nuances should be abstracted away to the extent possible (so if someone swoops in and wants to add PostgreSQL support someday, they won't be tearing their hair out necessarily)

I'm in favor of a clean room design; I'm not planning to look at dplyr or anything that solves similar problems right now. Can look later when we have something more fully baked and see if there's anything to learn.

Will keep a TODO list here as this starts to take shape. Anything here can be split out into a separate subissue, and we continue to reference those subissues here with a checklist. Work will span multiple PRs

SQL generator extracts inline views used multiple times (or any times at all, if you want!) into a WITH clause: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_with.html
Joins with no predicates to be translated to CROSS JOIN because Impala requires explicit cartesian products (cf http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_joins.html)
Fuse nested projections (related to #29)
Fuse nested filters (WHERE clause equivalents) separated by some other non-filter table operations
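
As an illustration of the CROSS JOIN item in the list above, a generator could pick the join keyword from the presence of predicates. `format_join` is a hypothetical helper working on plain strings, whereas the real generator would work on an AST:

```python
def format_join(left, right, predicates=()):
    """Render a two-table join; no predicates means an explicit CROSS JOIN."""
    if not predicates:
        return '%s\n  CROSS JOIN %s' % (left, right)
    return '%s\n  INNER JOIN %s\n    ON %s' % (
        left, right, ' AND '.join(predicates))


cross = format_join('t1', 't2')
inner = format_join('t1', 't2', ['t1.key = t2.key'])
```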