ibis-project / ibis

the portable Python dataframe library

Home Page: https://ibis-project.org

License: Apache License 2.0

Python 98.42% Shell 0.07% CMake 0.04% C++ 0.92% Nix 0.24% Dockerfile 0.02% JavaScript 0.11% R 0.04% Just 0.10% Visual Basic 6.0 0.04%
python impala pandas database clickhouse postgresql sqlite mysql datafusion sql

ibis's People

Contributors

anjakefala, chloeh13q, cpcloud, datapythonista, deepyaman, emilyreff7, gerrymanoim, gforsyth, github-actions[bot], ibis-squawk-bot[bot], icexelloss, jcrist, krzysztof-kwitt, kszucs, laserson, lostmygithubaccount, matthewmturner, mesejo, ncclementi, nickcrews, nicoretti, pre-commit-ci[bot], renovate-bot, renovate[bot], saulpw, semantic-release-bot, timothydijamco, tswast, wesm, xmnlab

ibis's Issues

[CLOSED] Provide DISTINCT support to the extent Impala is capable of it

Issue by wesm
Wednesday Jan 07, 2015 at 23:17 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/49


Support metrics such as:

table[column].distinct().count()

Simple case with a single COUNT(DISTINCT ...) will be straightforward. Impala does not have support for multiple DISTINCT clauses at the moment; we will have to work around via joins or raise exceptions in the event that we can't get something Impala will execute.

For now let's limit issue scope to having a single DISTINCT per SELECT set. Error should be raised on SQL translation if there are multiple distincts (bit of a hack, we'll have to examine separately and see if DISTINCT is found multiple times...)
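The translation-time check described above could look roughly like the following. This is only a sketch over a toy expression tree: the Node class and the count_distinct / check_select helpers are hypothetical names for illustration, not part of Ibis.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical expression node: an operation name plus child nodes."""
    op: str
    children: list = field(default_factory=list)


def count_distinct(node):
    """Count DISTINCT operations anywhere in the expression tree."""
    own = 1 if node.op == "distinct" else 0
    return own + sum(count_distinct(child) for child in node.children)


def check_select(metrics):
    """Raise if a single SELECT would need more than one DISTINCT."""
    total = sum(count_distinct(m) for m in metrics)
    if total > 1:
        raise ValueError(f"{total} DISTINCT clauses in one SELECT are not supported")
    return total


# table[column].distinct().count() written as a toy tree
single = Node("count", [Node("distinct", [Node("column")])])
check_select([single])  # passes; two such metrics in one SELECT would raise
```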

[CLOSED] Prototype for deferred expression IR

Issue by wesm
Monday Jan 05, 2015 at 19:56 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/pull/27


Implement a first design for a deferred expression intermediate representation (IR) that can be compiled to SQL with Ibis. Basic type system, type promotions in arithmetic, and a set of composable scalar, array, tabular / relational algebra operations, and associated tests. No concrete connection to Impala yet.
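To make the idea concrete, here is a small sketch of a deferred expression with type promotion in arithmetic. Everything in it (the Expr class, the column helper, the linear TYPE_ORDER ladder) is an assumption made for illustration; the actual prototype may be structured quite differently, and real promotion rules are more involved than a single ordered list.

```python
# Hypothetical numeric type ladder, narrowest to widest (a simplification).
TYPE_ORDER = ["int8", "int16", "int32", "int64", "float", "double"]


def promote(left, right):
    """Return the wider of two numeric type names."""
    return max(left, right, key=TYPE_ORDER.index)


class Expr:
    """Deferred expression: records an operation and its type, computes nothing."""

    def __init__(self, op, dtype, args=()):
        self.op, self.dtype, self.args = op, dtype, args

    def __add__(self, other):
        return Expr("add", promote(self.dtype, other.dtype), (self, other))

    def __mul__(self, other):
        return Expr("multiply", promote(self.dtype, other.dtype), (self, other))


def column(name, dtype):
    """Leaf node referring to a column by name."""
    return Expr("column", dtype, (name,))


# Builds a tree only; no data is touched until something compiles it.
expr = column("a", "int8") * column("f", "double") + column("c", "int32")
```

Here expr is an "add" node whose left child is a "multiply" node, and its result type is the widest type seen in the tree.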


wesm included the following code: http://github.mtv.cloudera.com/wesm/ibis/pull/27/commits

Semaphore array management and cleanup in production

Issue by wesm
Tuesday Dec 09, 2014 at 21:15 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/12


In pathological cases, Impala nodes could run out of semaphore arrays, which are currently being used for IPC. One solution is to clear them all out when spinning up the cluster. Here's one method to do that:

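#!/bin/bash
# Remove every System V semaphore array owned by the current user.
USER=$(whoami)
# Keep rows of the form "<hex key> <semid> ..." and take the semid (field 2).
SEMAPHORES=$(ipcs -s | grep -E "0x[0-9a-f]+ [0-9]+" | grep "$USER" | cut -f2 -d" ")
for id in $SEMAPHORES; do
  ipcrm -s "$id"
done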

[CLOSED] Implement support for compound aggregate expressions

Issue by wesm
Tuesday Dec 30, 2014 at 01:16 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/24


For example:

log(sum(foo))

Or something more complex like (in the implied Python API):

foo.sum().log() / bar.sum().log() - 1

The actual aggregations are buried in the expression tree; internally we must recognize the expression as an aggregation and verify all column references are valid (all come from the same table)
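A rough sketch of the two checks mentioned above (is there a reduction buried in the tree, and do all column references come from one table). The Column and Op classes and the REDUCTIONS set are hypothetical stand-ins, not the real IR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    table: str
    name: str


@dataclass(frozen=True)
class Op:
    name: str    # e.g. "sum", "log", "divide"
    args: tuple  # child expressions or plain literals


REDUCTIONS = {"sum", "mean", "min", "max", "count"}


def is_aggregate(expr):
    """True if a reduction appears anywhere in the tree."""
    if isinstance(expr, Op):
        return expr.name in REDUCTIONS or any(is_aggregate(a) for a in expr.args)
    return False


def source_tables(expr):
    """Collect the tables referenced by column leaves."""
    if isinstance(expr, Column):
        return {expr.table}
    if isinstance(expr, Op):
        return set().union(*(source_tables(a) for a in expr.args))
    return set()  # literals reference no table


# foo.sum().log() / bar.sum().log() - 1
foo, bar = Column("t", "foo"), Column("t", "bar")
ratio = Op("divide", (Op("log", (Op("sum", (foo,)),)), Op("log", (Op("sum", (bar,)),))))
expr = Op("subtract", (ratio, 1))

valid = is_aggregate(expr) and len(source_tables(expr)) == 1
```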

[CLOSED] Decide on API required to support self-join workflows

Issue by wesm
Monday Jan 05, 2015 at 20:14 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/28


To allow self-joins to work, we need a way to reference the logical table entities when forming predicates and other expressions. For example:

SELECT left.key, sum(left.value - right.value) as total_deltas
FROM table left
  INNER JOIN table right
    ON left.current_period = right.previous_period + 1
GROUP BY 1

With our deferred expression API, one way might be to create table views:

left = table_expr
right = table_expr.view()

agg_expr = (left['value'] - right['value']).sum()
join_exprs = [left['current_period'] == (right['previous_period'] + 1)]
left.inner_join(right, join_exprs) \
    .aggregate({'total_deltas': agg_expr}, by=[left['key']])

Under the hood, the table and its view would be treated as distinct ancestors and thus cause no ambiguous relation issues during SQL generation.
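One way to picture "distinct ancestors": each reference object gets its own SQL alias even when both point at the same physical table. The TableRef class and assign_aliases helper below are hypothetical, and the t0/t1 alias scheme is an assumption chosen only for this sketch.

```python
import itertools


class TableRef:
    """Hypothetical logical table reference; object identity distinguishes it."""

    def __init__(self, name):
        self.name = name

    def view(self):
        # A new reference object over the same physical table.
        return TableRef(self.name)


def assign_aliases(refs):
    """Give each distinct reference object its own SQL alias."""
    counter = itertools.count()
    return {id(ref): f"t{next(counter)}" for ref in refs}


left = TableRef("table")
right = left.view()
aliases = assign_aliases([left, right])
l, r = aliases[id(left)], aliases[id(right)]

sql = (
    f"SELECT {l}.key, sum({l}.value - {r}.value) AS total_deltas\n"
    f"FROM {left.name} {l}\n"
    f"  INNER JOIN {right.name} {r}\n"
    f"    ON {l}.current_period = {r}.previous_period + 1\n"
    f"GROUP BY 1"
)
```

Because the aliases are keyed on the reference object rather than the table name, left['value'] and right['value'] stay unambiguous in the generated query.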

[CLOSED] Zero argument UDAs

Issue by bittorf
Thursday Dec 18, 2014 at 18:43 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/23


Currently, we do not support UDAs which take 0 arguments. For example, you can't call:

SELECT ibis_uda("foobar") FROM tbl;

We have the requirement that there is at least one argument (in addition to the bytecode). Is this a reasonable restriction or should we expand support for 0 args?

Remove work from destructors

Issue by bittorf
Thursday Dec 18, 2014 at 00:26 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/21


We should avoid doing work in C++ class destructors. We should have trivial destructors (that do nothing) and instead have a Close(), Free(), or Delete() method as appropriate.

The issue with destructors is that there is no way to handle errors if destructors fail (it will crash Impala); also exceptions caused in destructors can cause an immediate termination (if another exception is already in flight).

[CLOSED] Define API and implement single-case and multi-case expressions in expression IR

Issue by wesm
Tuesday Jan 06, 2015 at 17:55 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/31


In SQL Server parlance, these correspond to simple and searched case expressions:

http://msdn.microsoft.com/en-us/library/ms181765.aspx

Most of the work here is in handling various type promotion cases. From the above article:

Returns the highest precedence type from the set of types in
result_expressions and the optional else_result_expression. For
more information, see Data Type Precedence (Transact-SQL).

Question: how does data type precedence differ between databases (Impala vs others)?

Maintaining aggregation state between task calls

Issue by wesm
Tuesday Dec 09, 2014 at 20:08 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/6


To reduce serialization / data transfer costs between consecutive Update tasks, aggregation state could be kept in the Python worker process between calls. In the case of a group-wise aggregation, many aggregation states might exist simultaneously in a particular Python worker process. These can be identified by a UUID, and the UUID will have to be made known to the parent Impala process.

One problem with this is that we would have to keep sending tasks to the same worker process, so releasing the worker back to the pool could become problematic.
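A sketch of what UUID-keyed state inside a worker might look like. The AggregationWorker class and its init / update / finalize methods are made up for illustration (a running sum stands in for an arbitrary aggregation state); nothing here reflects the actual worker protocol.

```python
import uuid


class AggregationWorker:
    """Hypothetical worker that keeps aggregation states in memory between tasks."""

    def __init__(self):
        self._states = {}

    def init(self):
        state_id = uuid.uuid4().hex
        self._states[state_id] = 0.0
        return state_id  # the parent process must remember this id

    def update(self, state_id, values):
        # Only the new values cross the process boundary, not the state itself.
        self._states[state_id] += sum(values)

    def finalize(self, state_id):
        # Hand back the result and drop the state so the worker can be released.
        return self._states.pop(state_id)


worker = AggregationWorker()
group_a, group_b = worker.init(), worker.init()
worker.update(group_a, [1.0, 2.0])
worker.update(group_b, [10.0])
worker.update(group_a, [3.0])
```

The sketch also shows the pinning problem: every update for group_a has to reach this same worker object until finalize is called.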

Design a UDF/UDA extension API for Ibis

Issue by wesm
Wednesday Jan 07, 2015 at 23:57 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/50


Not going to get done in a first cut of the project, but we ought to think about how to weave user-defined functions through the expression IR and subsequent SQL generation. For example, a UDF operating on strings would be made available as a new instance method on StringValue, or multiple UDFs could be coalesced into a single Python API with some options, which could down the road be compiled to the correct concrete SQL UDF call.

# new UDF registration code omitted

transformed_expr = table['some_strings'].my_udf()
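The omitted registration step could be sketched along these lines. StringValue here is a toy stand-in that only carries a SQL fragment, and register_udf is a hypothetical helper; the real extension API is still to be designed.

```python
class StringValue:
    """Toy stand-in for a string expression; holds a SQL fragment."""

    def __init__(self, sql):
        self.sql = sql


def register_udf(expr_class, method_name, sql_name):
    """Attach an instance method that wraps the expression in a SQL function call."""

    def method(self):
        return expr_class(f"{sql_name}({self.sql})")

    setattr(expr_class, method_name, method)


register_udf(StringValue, "my_udf", "my_udf")
transformed_expr = StringValue("some_strings").my_udf()
```

After registration, my_udf is available on every StringValue, and the result is again a StringValue, so calls can be chained.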

Implement a histogram operation type for numeric fields

Issue by wesm
Tuesday Jan 06, 2015 at 18:07 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/33


This can exist as a first-class operator in the expression IR. For the Impala implementation, this will require an aggregation inline view (to get the min/max), a [cross] join (to make the min/max available to all rows), and an expression to get the histogram bucket number.

It might be interesting to enable group-wise histograms (a different bucketing for each group).
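The query shape described above (aggregation inline view, cross join, bucket expression) can be tried out with SQLite from the Python standard library. SQLite stands in for Impala here, and the table t, column x, and bin count are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [(v,) for v in [0.0, 1.0, 2.5, 5.0, 9.9, 10.0]],
)

NBINS = 4
query = """
SELECT t.x,
       -- scale into [0, nbins], truncate, and clamp the max value into the last bucket
       MIN(CAST((t.x - s.lo) * ? / (s.hi - s.lo) AS INTEGER), ? - 1) AS bucket
FROM t
  CROSS JOIN (SELECT min(x) AS lo, max(x) AS hi FROM t) s
ORDER BY t.x
"""
rows = conn.execute(query, (NBINS, NBINS)).fetchall()
buckets = [bucket for _, bucket in rows]
```

Note that the sketch does not handle the degenerate case where min and max are equal. A group-wise variant would compute lo and hi per group in the inline view and join on the group key instead of cross joining.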

[CLOSED] Implement/test filter predicates involving aggregates

Issue by wesm
Wednesday Jan 07, 2015 at 19:59 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/43


Consider the following dplyr code:

dat %>%
  group_by(name, job) %>%
  filter(job != "Boss" | year == min(year)) %>%
  mutate(cumu_job2 = cumsum(job2))

cf http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly/27718317#27718317

Support and implementation in SQL may depend on the database. See, for example:

http://stackoverflow.com/questions/6319183/aggregate-function-in-sql-where-clause
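Since aggregates are not allowed directly in a WHERE clause, one portable rewrite joins against a grouped inline view. The sketch below runs the filter part of the dplyr example against SQLite (standing in for Impala); the sample rows are invented, and the cumulative sum from the mutate step is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dat (name TEXT, job TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO dat VALUES (?, ?, ?)",
    [
        ("ann", "Boss", 2010),
        ("ann", "Boss", 2011),
        ("ann", "Clerk", 2008),
        ("ann", "Clerk", 2009),
    ],
)

# filter(job != "Boss" | year == min(year)) within each (name, job) group
query = """
SELECT d.name, d.job, d.year
FROM dat d
  INNER JOIN (
    SELECT name, job, min(year) AS first_year
    FROM dat
    GROUP BY name, job
  ) g ON d.name = g.name AND d.job = g.job
WHERE d.job != 'Boss' OR d.year = g.first_year
ORDER BY d.year
"""
rows = conn.execute(query).fetchall()
```

Only the 2011 Boss row is dropped: it is a Boss row that is not the first year of its group.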

[CLOSED] Add a convenient API for top-K dimension filtering

Issue by wesm
Friday Jan 02, 2015 at 20:33 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/26


For example, filtering an aggregation (across some other dimensions) to the top K cities ranked by some metric. This is common, but tedious to write by hand. We can make it much simpler:

SELECT t1.foo, t1.bar, sum(t1.baz)
FROM table t1
  LEFT SEMI JOIN (
    SELECT city, avg(qux) as metric
    FROM table
    GROUP BY 1
    ORDER BY metric DESC
    LIMIT K
  ) t2 ON t1.city = t2.city
GROUP BY 1, 2

In many databases this would need to be expressed with WHERE city IN (SELECT city ...)

related to #87
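For databases without LEFT SEMI JOIN, the WHERE ... IN form mentioned above can be exercised with SQLite from the Python standard library. The table name tbl, the sample rows, and K = 2 are assumptions for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (city TEXT, foo TEXT, baz REAL, qux REAL)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        ("NYC", "a", 1.0, 9.0),
        ("NYC", "b", 2.0, 7.0),
        ("SF", "a", 3.0, 5.0),
        ("LA", "a", 4.0, 1.0),
    ],
)

K = 2
# Keep only rows whose city is among the top K by average qux, then aggregate.
query = """
SELECT foo, sum(baz) AS total
FROM tbl
WHERE city IN (
  SELECT city
  FROM tbl
  GROUP BY city
  ORDER BY avg(qux) DESC
  LIMIT ?
)
GROUP BY foo
ORDER BY foo
"""
rows = conn.execute(query, (K,)).fetchall()
```

With these rows the top two cities are NYC and SF, so the LA row is excluded from the sums.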

[CLOSED] Fusing projections in the expression IR

Issue by wesm
Tuesday Jan 06, 2015 at 06:47 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/29


For example, if we have

t = table
t2 = t.add_column(t['f'] * 2, 'foo')
t2 = t2.add_column(t['f'] * 4, 'bar')

the expression tree for t2 looks like:

ref_0
  TableView[table]
  a : int8
  b : int16
  c : int32
  d : int64
  e : float
  f : double
  g : string
  h : boolean

ref_1
Projection[table]
  Table: ref_0
  Table: ref_0
  Multiply[array(double)]
    Column[double] 'f' from table ref_0
    Literal[int8] 2

Projection[table]
  Table: ref_1
  Table: ref_1
  Multiply[array(double)]
    Column[double] 'f' from table ref_0
    Literal[int8] 4

Unless there is some dependency on an intermediate projection, there is no reason not to roll these up into a single projection rather than creating nested projections that have to be fused at evaluation / SQL compilation time (we should do that anyway, but this aids readability).
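A sketch of the eager roll-up, under the simplifying assumption that the caller states whether the new expression depends on a column produced by the intermediate projection. Table, Projection, and this free-standing add_column are hypothetical; a real implementation would inspect the expression's column references instead of taking a flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    name: str


@dataclass(frozen=True)
class Projection:
    parent: object  # a Table or another Projection
    exprs: tuple    # (name, expression) pairs added at this level


def add_column(parent, expr, name, depends_on_parent=False):
    """Add a column, extending the existing projection when that is safe."""
    if isinstance(parent, Projection) and not depends_on_parent:
        # The new expression only uses base-table columns, so fuse.
        return Projection(parent.parent, parent.exprs + ((name, expr),))
    return Projection(parent, ((name, expr),))


t = Table("table")
t2 = add_column(t, "f * 2", "foo")
t2 = add_column(t2, "f * 4", "bar")

# An expression that uses 'foo' must stay nested on top of t2.
t3 = add_column(t2, "foo + 1", "baz", depends_on_parent=True)
```

Here t2 ends up as a single projection over the base table with both new columns, while t3 keeps t2 as its parent.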

Address Ibis UDF security issues

Issue by wesm
Wednesday Dec 10, 2014 at 20:55 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/18


In many environments, admin privileges are required to create UDFs or UDAs. Since we'll be effectively opening a backdoor via ibis for the user to run arbitrary Python code, to address security concerns we will have to ensure that this code execution takes place within an environment where the user cannot hurt anything.

Possible solutions

  • Run ibis daemon/workers inside a container
  • Run ibis processes under a special user account with limited privileges (just enough to be able to see the data passed from Impala)

[CLOSED] Achieve test coverage for gamut of scalar, correlated, and uncorrelated subqueries

Issue by wesm
Wednesday Jan 07, 2015 at 22:51 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/48


Refer to: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_subqueries.html

Correlated subqueries don't yet fit neatly into the Ibis expression model and will require some consideration.

related: #25

Will track the status here. Nothing is yet supported when the subquery originates from a different table.

  • scalar subqueries

t1 = ir.table(
    [
        ('job', 'string'),
        ('dept_id', 'string'),
        ('year', 'int32'),
        ('y', 'double')
    ], 't1')

t2 = ir.table(
    [('x', 'double'), ('job', 'string')], 't2')

t1[t1.y > t2.x.max()]

  • uncorrelated subqueries

t1[t1.job.isin(t2.job)]

  • correlated subqueries

using example from Impala docs, unsure about precise syntax

t3 = t1.view()
correlated_stat = t3[t1.dept_id == t3.dept_id].salary.mean()
t1[t1.salary > correlated_stat]

[CLOSED] First cut of a SQL/DDL pipeline generator for Impala

Issue by wesm
Wednesday Jan 07, 2015 at 19:26 GMT
Originally opened as http://github.mtv.cloudera.com/wesm/ibis/issues/38


Some basic principles:

  • DDL / SQL fragments must have some AST representation that can be programmatically generated
  • Needs to be as testable as possible without having to compare generated SQL query strings. That valid (and hopefully human-legible) SQL is generated can and should be tested separately
  • Evaluating an expression may involve multiple queries (e.g., data ingest followed by analytics, and teardown in some cases)
  • Adding new functions (e.g. built-ins) must be as lightweight as possible
  • Impala-specific nuances should be abstracted away to the extent possible (so if someone swoops in and wants to add PostgreSQL support someday, they won't be tearing their hair out necessarily)

I'm in favor of a clean room design; I'm not planning to look at dplyr or anything that solves similar problems right now. Can look later when we have something more fully baked and see if there's anything to learn.

Will keep a TODO list here as this starts to take shape. Anything here can be split out into a separate subissue, and we continue to reference those subissues here with a checklist. Work will span multiple PRs
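To illustrate the first three principles, here is a very small sketch of DDL / SQL fragments as plain data objects that can be compared structurally in tests and compiled to strings separately. The Select and CreateTableAs classes are invented for this example and say nothing about how the eventual generator is designed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Select:
    columns: tuple
    table: str
    where: tuple = ()
    limit: int = None

    def compile(self):
        sql = f"SELECT {', '.join(self.columns)}\nFROM {self.table}"
        if self.where:
            sql += "\nWHERE " + " AND ".join(self.where)
        if self.limit is not None:
            sql += f"\nLIMIT {self.limit}"
        return sql


@dataclass(frozen=True)
class CreateTableAs:
    name: str
    select: Select

    def compile(self):
        return f"CREATE TABLE {self.name} AS\n{self.select.compile()}"


# One expression may expand into several statements run in order.
pipeline = [
    CreateTableAs("tmp", Select(("a", "b"), "t", where=("a > 0",))),
    Select(("*",), "tmp", limit=10),
]
statements = [stmt.compile() for stmt in pipeline]
```

Tests can assert on the pipeline objects themselves (dataclass equality), and a separate, smaller set of tests can check that compile() yields legible SQL.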
