ibis-project / ibis
the portable Python dataframe library
Home Page: https://ibis-project.org
License: Apache License 2.0
Issue by wesm
Wednesday Jan 07, 2015 at 23:17 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/49
Support metrics such as:
table[column].distinct().count()
Simple case with a single COUNT(DISTINCT ...) will be straightforward. Impala does not have support for multiple DISTINCT clauses at the moment; we will have to work around via joins or raise exceptions in the event that we can't get something Impala will execute.
For now let's limit issue scope to having a single DISTINCT per SELECT set. Error should be raised on SQL translation if there are multiple distincts (bit of a hack, we'll have to examine separately and see if DISTINCT is found multiple times...)
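A minimal sketch of the check described above, assuming the select set is available as a list of already-translated SQL strings. The name check_single_distinct and the regex-based detection are illustrative assumptions, not the Ibis implementation:

```python
import re

# Matches an aggregate call of the form FUNC(DISTINCT ...), case-insensitively.
_DISTINCT_RE = re.compile(r"\w+\s*\(\s*DISTINCT\b", re.IGNORECASE)


def check_single_distinct(select_exprs):
    """Raise if more than one DISTINCT aggregate appears in a select set.

    Returns the number of DISTINCT aggregates found (0 or 1) otherwise.
    """
    n = sum(len(_DISTINCT_RE.findall(expr)) for expr in select_exprs)
    if n > 1:
        raise ValueError(
            "found %d DISTINCT aggregates; only one per SELECT is supported" % n
        )
    return n
```

Scanning the generated SQL text is the "hack" version; a check on the expression tree before translation would likely be more robust.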
Issue by bittorf
Thursday Dec 18, 2014 at 18:42 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/22
Currently, if a UDA is called with 0 rows, it will return NULL.
For example,
SELECT ibis_uda(stuff, foo) FROM table WHERE false;
will return NULL. It will not call the Init or Finalize function for the user.
Issue by wesm
Wednesday Jan 07, 2015 at 22:43 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/45
Issue by wesm
Tuesday Dec 09, 2014 at 21:51 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/14
Some way to create a UDA that can be reused between sessions (and by other users)
Issue by wesm
Wednesday Dec 10, 2014 at 00:35 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/15
For example, the user might want all data to be wrapped in pandas Series objects. One approach is class level variables and creating a metaclass for all UDA subclasses to inherit from (perhaps optionally).
Issue by wesm
Wednesday Dec 10, 2014 at 21:10 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/19
Currently, select ibis_uda("any string") from table will segfault impalad.
Issue by wesm
Tuesday Dec 09, 2014 at 20:04 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/1
Issue by wesm
Tuesday Dec 09, 2014 at 20:04 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/3
Issue by wesm
Wednesday Jan 07, 2015 at 18:34 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/37
Given SQL like:
select some_key, sum(value) as metric
from table
where other_key in (value1, value2, value3)
group by 1
It would be useful to be able to use pandas syntax to write:
cond = table['other_key'].isin([value1, value2, value3])
filtered_table = table[cond]
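One way the isin predicate could be rendered to SQL is sketched below. The helpers format_literal and isin_to_sql are hypothetical names for illustration; they handle only strings and numbers and are not part of any existing API:

```python
def format_literal(value):
    """Render a Python value as a SQL literal (strings and numbers only)."""
    if isinstance(value, str):
        # Escape embedded single quotes by doubling them.
        return "'" + value.replace("'", "''") + "'"
    return repr(value)


def isin_to_sql(column, values):
    """Translate column.isin(values) into a SQL IN predicate string."""
    rendered = ", ".join(format_literal(v) for v in values)
    return "%s IN (%s)" % (column, rendered)
```

The resulting string would then be placed in the WHERE clause of the query generated for filtered_table.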
Issue by bittorf
Tuesday Dec 09, 2014 at 20:10 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/8
Need to do a pass to prepare for review by the Impala team, covering items such as handling of namespaces, formatting rules for functions, etc.
Issue by wesm
Monday Jan 05, 2015 at 19:56 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/pull/27
Implement a first design for a deferred expression intermediate representation (IR) that can be compiled to SQL with Ibis. Basic type system, type promotions in arithmetic, and a set of composable scalar, array, tabular / relational algebra operations, and associated tests. No concrete connection to Impala yet.
wesm included the following code: http://github.mtv.cloudera.com/wesm/ibis/pull/27/commits
Issue by wesm
Tuesday Dec 09, 2014 at 21:15 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/12
In pathological cases, Impala nodes could run out of semaphore arrays, which are currently being used for IPC. One solution is to clear them all out when spinning up the cluster. Here's one method to do that:
#!/bin/bash
# Remove every System V semaphore array owned by the current user.
USER=$(whoami)
SEMAPHORES=$(ipcs -s | grep -E "0x[0-9a-f]+ [0-9]+" | grep "$USER" | cut -f2 -d" ")
for id in $SEMAPHORES; do
  ipcrm -s "$id"
done
Issue by wesm
Tuesday Jan 06, 2015 at 18:08 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/34
As a companion to #33, we can allow the user to indicate specific bucket edges, and under the hood we can generate a case statement that bins values into these buckets, e.g.:
table['field'].bucket([0, 100, 1000, 10000])
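A rough sketch of the CASE statement such a bucket call might generate. The function name bucket_case, the half-open interval convention (closed on the left), and the NULL fallthrough are assumptions; the issue does not specify them:

```python
def bucket_case(column, edges):
    """Build a CASE expression assigning each value to a half-open bucket.

    Bucket i covers edges[i] <= value < edges[i + 1]; values outside every
    bucket fall through to NULL.
    """
    if len(edges) < 2:
        raise ValueError("need at least two bucket edges")
    lines = ["CASE"]
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        lines.append(
            "  WHEN %s >= %s AND %s < %s THEN %d" % (column, lo, column, hi, i)
        )
    lines.append("  ELSE NULL")
    lines.append("END")
    return "\n".join(lines)


print(bucket_case("field", [0, 100, 1000, 10000]))
```

Four edges produce three buckets, numbered 0 through 2.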
Issue by wesm
Wednesday Jan 07, 2015 at 19:36 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/41
Issue by wesm
Tuesday Dec 30, 2014 at 01:16 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/24
For example:
log(sum(foo))
Or something more complex like (in the implied Python API):
foo.sum().log() / bar.sum().log() - 1
The actual aggregations are buried in the expression tree; internally we must recognize the expression as an aggregation and verify all column references are valid (all come from the same table)
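The recognition step could look roughly like the following. The node classes here are a toy stand-in for the expression IR (all names are hypothetical): an expression counts as an aggregation if it contains a reduction and no column is referenced outside of one, and the set of source tables can be checked separately.

```python
class Node:
    """Base expression node; `args` holds child nodes."""

    def __init__(self, *args):
        self.args = args


class Column(Node):
    def __init__(self, table, name):
        super().__init__()
        self.table, self.name = table, name


class Literal(Node):
    def __init__(self, value):
        super().__init__()
        self.value = value


class Sum(Node):
    """A reduction (aggregate) over its single argument."""


class Log(Node):
    """A scalar function applied to its single argument."""


class Divide(Node):
    """Binary arithmetic on two arguments."""


class Subtract(Node):
    """Binary arithmetic on two arguments."""


def has_reduction(expr):
    return isinstance(expr, Sum) or any(has_reduction(a) for a in expr.args)


def has_bare_column(expr):
    """True if some column is referenced outside of any reduction."""
    if isinstance(expr, Sum):
        return False
    if isinstance(expr, Column):
        return True
    return any(has_bare_column(a) for a in expr.args)


def is_aggregation(expr):
    return has_reduction(expr) and not has_bare_column(expr)


def source_tables(expr):
    """Collect the set of tables referenced anywhere in the tree."""
    if isinstance(expr, Column):
        return {expr.table}
    tables = set()
    for arg in expr.args:
        tables |= source_tables(arg)
    return tables
```

With this, the second example above would be accepted when is_aggregation returns True and source_tables returns a single table.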
Issue by wesm
Monday Jan 05, 2015 at 20:14 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/28
To allow self-joins to work, we need a way to reference the logical table entities when forming predicates and other expressions. For example:
SELECT left.key, sum(left.value - right.value) as total_deltas
FROM table left
INNER JOIN table right
ON left.current_period = right.previous_period + 1
GROUP BY 1
With our deferred expression API, one way might be to create table views:
left = table_expr
right = table_expr.view()
agg_expr = (left['value'] - right['value']).sum()
join_exprs = [left['current_period'] == (right['previous_period'] + 1)]
(left.inner_join(right, join_exprs)
 .aggregate({'total_deltas': agg_expr}, by=[left['key']]))
Under the hood, the table and its view would be treated as distinct ancestors and thus cause no ambiguous relation issues during SQL generation.
Issue by wesm
Wednesday Jan 07, 2015 at 19:35 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/40
It would be nice for code to look something like:
db = ibis.impala.connect(...)
table_expr = db.table('my_table_name')
Issue by wesm
Tuesday Jan 06, 2015 at 17:59 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/32
Have a general way to display lists (and other kinds of collections) of expressions in the repr. For example: in Aggregation the aggregation exprs and grouping exprs should be nested under some headings for better readability. Same goes for projections and filters
Issue by bittorf
Tuesday Dec 09, 2014 at 20:09 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/7
Issue by bittorf
Thursday Dec 18, 2014 at 18:43 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/23
Currently, we do not support UDAs which take 0 arguments. For example, you can't call:
SELECT ibis_uda("foobar") FROM tbl;
We have the requirement that there is at least one argument (in addition to the bytecode). Is this a reasonable restriction or should we expand support for 0 args?
Issue by bittorf
Thursday Dec 18, 2014 at 00:26 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/21
We should avoid doing work in C++ class destructors. We should have trivial destructors (that do nothing) and instead have a Close(), Free(), Delete() method as appropriate.
The issue with destructors is that there is no way to handle errors if destructors fail (it will crash Impala); also exceptions caused in destructors can cause an immediate termination (if another exception is already in flight).
Issue by wesm
Tuesday Jan 06, 2015 at 17:55 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/31
In SQL Server parlance, these correspond to simple and searched case expressions:
http://msdn.microsoft.com/en-us/library/ms181765.aspx
Most of the work here is in handling various type promotion cases. From the above article:
Returns the highest precedence type from the set of types in
result_expressions and the optional else_result_expression. For
more information, see Data Type Precedence (Transact-SQL).
Question: how does data type precedence differ between databases (Impala vs others)?
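As a starting point for the promotion logic, a sketch that picks the highest-precedence type among the result expressions. The precedence order below is an assumption for illustration only (it uses the numeric type names that appear elsewhere in these issues); the real order would have to be confirmed per database:

```python
# Assumed numeric precedence, lowest to highest. Not taken from Impala or
# SQL Server documentation; adjust per backend.
PRECEDENCE = ["int8", "int16", "int32", "int64", "float", "double"]


def promote(result_types):
    """Return the highest-precedence type among CASE result expressions."""
    if not result_types:
        raise ValueError("at least one result type is required")
    unknown = [t for t in result_types if t not in PRECEDENCE]
    if unknown:
        raise TypeError("cannot promote non-numeric types: %r" % unknown)
    return max(result_types, key=PRECEDENCE.index)
```

Keeping the precedence table as data rather than code would make it easy to swap in a different ordering for each backend.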
Issue by wesm
Wednesday Jan 07, 2015 at 22:47 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/47
e.g.
HAVING count(*) > 100
Issue by wesm
Wednesday Dec 10, 2014 at 19:03 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/16
Issue by bittorf
Tuesday Dec 09, 2014 at 20:10 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/9
Issue by wesm
Friday Dec 12, 2014 at 00:33 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/20
For example, import errors should be immediately visible in the Impala shell
Issue by wesm
Tuesday Dec 09, 2014 at 20:08 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/6
In order to reduce serialization / data transfer costs between consecutive Update tasks. In the case of a group-wise aggregation, many aggregation states might exist simultaneously in a particular Python worker process. These can be identified by a UUID, and the UUID will have to be made known to the parent Impala process.
One problem with this is that we would have to keep sending tasks to the same worker process, so releasing the worker back to the pool could become problematic.
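A minimal sketch of what keeping states in the worker could look like, assuming a simple in-process registry keyed by UUID. The class StateRegistry and its method names are hypothetical; only the UUID handle would cross the process boundary:

```python
import uuid


class StateRegistry:
    """Holds in-progress aggregation states inside one worker process."""

    def __init__(self):
        self._states = {}

    def init(self, initial_state):
        """Register a new state and return the UUID handle for it."""
        state_id = uuid.uuid4().hex
        self._states[state_id] = initial_state
        return state_id

    def update(self, state_id, update_fn, batch):
        """Apply update_fn(state, batch) in place; no state is serialized."""
        self._states[state_id] = update_fn(self._states[state_id], batch)

    def finalize(self, state_id):
        """Remove the state from the registry and return it."""
        return self._states.pop(state_id)
```

Because the state lives only in this process, every update for a given handle has to be routed to the same worker, which is the pooling problem noted above.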
Issue by wesm
Tuesday Dec 09, 2014 at 20:06 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/5
Issue by wesm
Wednesday Jan 07, 2015 at 19:32 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/39
Spartan features a significantly optimized fork of PiCloud's cloudpickle:
https://github.com/spartan-array/spartan
Being Apache-licensed code, it would be beneficial to factor this out into a standalone module in PyPI that we (and other libraries needing the enhanced pickling capability) can rely on.
Issue by wesm
Wednesday Jan 07, 2015 at 22:39 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/44
Both the left and right exprs must be comparable
Issue by wesm
Wednesday Jan 07, 2015 at 23:57 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/50
Not going to get done in a first cut of the project, but we ought to think about how to weave user-defined functions through the expression IR and subsequent SQL generation. For example, a UDF operating on strings would be made available as a new instance method on StringValue, or multiple UDFs could be coalesced into a single Python API with some options, which could down the road be compiled to the correct concrete SQL UDF call.
# new UDF registration code omitted
transformed_expr = table['some_strings'].my_udf()
Issue by wesm
Tuesday Jan 06, 2015 at 18:07 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/33
This can exist as a first-class operator in the expression IR. For the Impala implementation, this will require an aggregation inline view (to get the min/max), a [cross] join (to make the min/max available to all rows), and an expression to get the histogram bucket number.
It might be interesting to enable group-wise histograms (a different bucketing for each group).
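The bucket-number expression can be sketched in plain Python as follows. This assumes equal-width bins and that the maximum value is placed in the last bucket; the function name histogram_bucket is illustrative only:

```python
import math


def histogram_bucket(value, min_value, max_value, nbins):
    """Return the 0-based equal-width bucket index for value."""
    if nbins < 1:
        raise ValueError("nbins must be at least 1")
    if max_value == min_value:
        return 0  # all values are identical, so there is only one bucket
    width = (max_value - min_value) / nbins
    index = math.floor((value - min_value) / width)
    # The maximum value would otherwise land in bucket nbins; clamp it.
    return min(index, nbins - 1)
```

In SQL, min_value and max_value would come from the aggregation inline view, and the arithmetic would be applied to each row after the cross join.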
Issue by wesm
Tuesday Jan 06, 2015 at 19:31 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/35
This issue will cover end-to-end issues here: creating the appropriate table expression and schema from a file, compiling to Impala DDL and issuing multiple queries as appropriate to ingest data and execute any queries built on the resulting table
Issue by wesm
Tuesday Dec 09, 2014 at 20:04 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/2
Issue by wesm
Friday Jan 02, 2015 at 20:28 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/25
Covering IN / NOT IN and EXISTS / NOT EXISTS use cases. Related: #48
Issue by wesm
Wednesday Jan 07, 2015 at 19:59 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/43
Consider the following dplyr code:
dat %>%
group_by(name, job) %>%
filter(job != "Boss" | year == min(year)) %>%
mutate(cumu_job2 = cumsum(job2))
Support and implementation in SQL may depend on the database. See, for example:
http://stackoverflow.com/questions/6319183/aggregate-function-in-sql-where-clause
Issue by wesm
Tuesday Dec 09, 2014 at 20:22 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/11
One way would be to add an extra parameter to the current ibis_uda C++ UDA, but there might be another way.
Issue by wesm
Friday Jan 02, 2015 at 20:33 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/26
For example, filtering an aggregation (across some other dimensions) to the top K cities ranked by some metric. This is common, but tedious to write by hand. We can make it much simpler:
SELECT t1.foo, t1.bar, sum(t1.baz)
FROM table t1
LEFT SEMI JOIN (
SELECT city, avg(qux) as metric
FROM table
GROUP BY 1
ORDER BY metric DESC
LIMIT K
) t2 ON t1.city = t2.city
GROUP BY 1, 2
In many databases this would need to be expressed with WHERE city IN (SELECT city ...)
related to #87
Issue by wesm
Tuesday Dec 09, 2014 at 21:16 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/13
They are very low latency, but there might be a more robust / similarly fast approach.
Issue by wesm
Wednesday Dec 10, 2014 at 19:04 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/17
This class is not threadsafe at the moment, I don't think
Issue by wesm
Tuesday Jan 06, 2015 at 06:47 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/29
For example, if we have
t = table
t2 = t.add_column(t['f'] * 2, 'foo')
t2 = t2.add_column(t['f'] * 4, 'bar')
the expression tree for t2 looks like:
ref_0
TableView[table]
a : int8
b : int16
c : int32
d : int64
e : float
f : double
g : string
h : boolean
ref_1
Projection[table]
Table: ref_0
Table: ref_0
Multiply[array(double)]
Column[double] 'f' from table ref_0
Literal[int8] 2
Projection[table]
Table: ref_1
Table: ref_1
Multiply[array(double)]
Column[double] 'f' from table ref_0
Literal[int8] 4
Unless there is some dependency on an intermediate projection, there is no reason not to roll these up into a single projection rather than creating nested projections that have to be fused at evaluation / SQL compilation time (we should do that anyway, but this aids readability).
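A toy sketch of the roll-up rule, assuming each added expression records which table it reads from. The Projection class and add_column function here are simplified stand-ins, not the actual IR classes:

```python
class Projection:
    """Adds named expressions on top of a parent table."""

    def __init__(self, parent, exprs):
        self.parent = parent      # base table name or another Projection
        self.exprs = dict(exprs)  # column name -> (source table, expression)


def add_column(table, source, expr, name):
    """Add a column, fusing into an existing projection when possible.

    `source` is the table that `expr` reads from. If `table` is already a
    projection and `expr` does not read from it, the new column cannot depend
    on the intermediate result, so it joins the existing projection.
    """
    if isinstance(table, Projection) and source is not table:
        merged = dict(table.exprs)
        merged[name] = (source, expr)
        return Projection(table.parent, merged)
    return Projection(table, {name: (source, expr)})
```

In the example above, both new columns read from t, so the second add_column call would extend the first projection instead of nesting a new one.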
Issue by wesm
Wednesday Jan 07, 2015 at 22:46 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/46
Issue by wesm
Wednesday Dec 10, 2014 at 20:55 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/18
In many environments, admin privileges are required to create UDFs or UDAs. Since we'll be effectively opening a backdoor via ibis for the user to run arbitrary Python code, to address security concerns we will have to ensure that this code execution takes place within an environment where the user cannot hurt anything.
Possible solutions
Issue by wesm
Tuesday Jan 06, 2015 at 17:42 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/30
These are already stubbed out in BooleanValue
Issue by wesm
Wednesday Jan 07, 2015 at 01:18 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/36
This will be on the Impala connection wrapper that is yet to be built.
Issue by wesm
Tuesday Dec 09, 2014 at 20:11 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/10
User-defined aggregations right now will have their results stored as pickled Python objects in the column of a result set / new Impala table. We might wish to eventually emit data that Impala can recognize as a primitive tuple or a batch of rows that can be worked with independent of Python.
Issue by wesm
Tuesday Dec 09, 2014 at 20:05 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/4
Workers are held for the duration by the IbisBackend object; when a task is finished running, the worker should be released to the worker pool (so that it can be used by other IbisBackend objects).
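In Python terms, the release-after-task behavior could be sketched with a context manager. This is an illustration of the idea only: WorkerPool and checkout are hypothetical names, and the actual backend is a C++ object.

```python
import queue
from contextlib import contextmanager


class WorkerPool:
    """A pool of idle workers shared by several backends."""

    def __init__(self, workers):
        self._idle = queue.Queue()
        for worker in workers:
            self._idle.put(worker)

    @contextmanager
    def checkout(self):
        """Borrow a worker for one task and always return it afterwards."""
        worker = self._idle.get()
        try:
            yield worker
        finally:
            # Released even if the task raises, so the pool never leaks.
            self._idle.put(worker)

    def idle_count(self):
        return self._idle.qsize()
```

Each task would run inside a `with pool.checkout() as worker:` block, so the worker is held only for the duration of that task.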
Issue by wesm
Wednesday Jan 07, 2015 at 22:51 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/48
Correlated subqueries don't yet fit neatly into the Ibis expression model and will require some consideration.
related: #25
will track the status here. Nothing is yet supported when the subquery originates from a different table.
t1 = ir.table(
[
('job', 'string'),
('dept_id', 'string'),
('year', 'int32'),
('y', 'double')
], 't1')
t2 = ir.table(
[('x', 'double'), ('job', 'string')], 't2')
t1[t1.y > t2.x.max()]
t1[t1.job.isin(t2.job)]
using example from Impala docs, unsure about precise syntax
t3 = t1.view()
correlated_stat = t3[t1.dept_id == t3.dept_id].salary.mean()
t1[t1.salary > correlated_stat]
Issue by wesm
Wednesday Jan 07, 2015 at 19:37 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/42
Ability to interact with files on the local filesystem
Issue by wesm
Wednesday Jan 07, 2015 at 19:26 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/38
Some basic principles:
I'm in favor of a clean room design; I'm not planning to look at dplyr or anything that solves similar problems right now. Can look later when we have something more fully baked and see if there's anything to learn.
Will keep a TODO list here as this starts to take shape. Anything here can be split out into a separate subissue, and we continue to reference those subissues here with a checklist. Work will span multiple PRs
WITH clause: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_with.html
CROSS JOIN, because Impala requires explicit cartesian products (cf http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_joins.html)