datalab's People

Contributors

amshali, blois, brandondutra, bryantgipson, chmeyers, corrieann, craigcitro, daweihuang, di-ku, drewbryant, ekrogers, fischman, gramster, haavardw, harmon, jimmc, jmmaldonado, mdhedley, mikehcheng, mvanwyk, nikhilk, ojarjur, parthea, qimingj, rajivpb, rileyjbauer, rnabel, umang-sh, yebrahim, yixinshi

datalab's Issues

Robust available port selection

The DataLab server is currently using a heuristic, stop-gap approach for finding open ports (used when creating new kernel instances). Need to implement a more robust scheme to find available ports.

The approach used by IPython is to create a socket with no port specification, which results in a random (but available) port being selected. This socket is then immediately closed. Normally the socket would go into TIME_WAIT and be temporarily unavailable, but IPython either sets SO_REUSEADDR or sets the socket's linger time to zero. IPython then assumes the port is still available (which may not hold, due to a race condition) and uses it for setting up kernel communication.
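
For reference, the bind-to-port-0 technique described above can be sketched roughly as follows. This is a minimal illustration, not IPython's actual code; the function name find_free_port is hypothetical, and the sketch uses SO_REUSEADDR rather than a zero linger time.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0.

    Hypothetical helper. The port is only known to be free at the moment
    of the bind; another process may grab it after the socket is closed,
    which is the race condition mentioned in this issue.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Allow the port to be re-bound right after this socket is closed.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # Port 0 tells the OS to choose any available port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


port = find_free_port()
print("OS-assigned port:", port)
```

A more robust scheme would probably keep the bound socket open and hand it to the kernel process instead of closing it, which removes the window in which the port can be taken.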

Minimal Markdown cell directive

Unlike the code editor cell, which has two regions that co-exist visually (input and output), the Markdown cell has two regions that are mutually exclusive visually: edit-mode and render-mode. See an IPython notebook's Markdown cell for an example.

"Executing" the cell causes it to switch from edit-mode to render-mode. Selecting the cell (a double-click in IPython) causes it to enter edit-mode, if it is not already in edit-mode. In render-mode, the HTML representation of the Markdown content is displayed.

Has only an input attribute for data-binding.

Builds upon a rendered-markdown directive and the editor-cell directive.

Cell magic syntax for BQ functionality

This will apply to other stuff besides BigQuery eventually, but BQ is what we have for the moment.

Current:

%%bq_sql ...
%%bq_udf ...

Proposed:
Make it look more like a normal command line.

%%bigquery sql [--name:<name>] [--help]
%%bigquery udf --name:<name> [--help]

Other possible commands (in the future):

schema --name:<table name>
table --name:<table name>
dataset --name:<dataset name>
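
As a rough sketch of how the proposed line syntax might be parsed (the sub-command followed by --key:value options and bare --flags), assuming whitespace-separated tokens and values without spaces. The function name parse_magic_line is hypothetical and not part of any existing API.

```python
def parse_magic_line(line):
    """Split a magic line such as 'sql --name:foo --help' into its parts.

    Returns (command, options), where options maps each '--key:value'
    to its string value and each bare '--flag' to True.
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("missing sub-command")
    command, options = tokens[0], {}
    for token in tokens[1:]:
        if not token.startswith("--"):
            raise ValueError("unexpected argument: %r" % token)
        # '--name:foo' -> key 'name', value 'foo'; '--help' -> key 'help'.
        key, sep, value = token[2:].partition(":")
        options[key] = value if sep else True
    return command, options


print(parse_magic_line("sql --name:my_query"))
print(parse_magic_line("udf --name:my_udf --help"))
```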

Log level support

RE: app/common/logging:Logger

Redirect log.<level> to the appropriate console.<level> method if it exists.

Add polyfill that redirects all levels to console.log in absence of level-specific methods on the console object.

Logging configurable by scope

RE: app/common/logging:Logger

Add support for configuring logging output by scope; for example, disable all logging only for scope "foo" or allow only level >= "warn" for scope bar with default/global logging level of "info".
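
The scoping rules in the example above could look something like the following. This is a language-neutral sketch written in Python; the class and method names are hypothetical and are not the app/common/logging:Logger API.

```python
LEVELS = {"debug": 10, "info": 20, "warn": 30, "error": 40}


class ScopedLogConfig:
    """Decides whether a message should be logged for a given scope."""

    def __init__(self, default_level="info"):
        self.default_level = default_level
        self.scope_levels = {}   # scope name -> minimum level name
        self.disabled = set()    # scopes with logging turned off entirely

    def set_level(self, scope, level):
        self.scope_levels[scope] = level

    def disable(self, scope):
        self.disabled.add(scope)

    def is_enabled(self, scope, level):
        if scope in self.disabled:
            return False
        # Fall back to the global level when the scope has no override.
        threshold = self.scope_levels.get(scope, self.default_level)
        return LEVELS[level] >= LEVELS[threshold]


config = ScopedLogConfig(default_level="info")
config.disable("foo")            # no logging at all for scope "foo"
config.set_level("bar", "warn")  # only warn and above for scope "bar"
print(config.is_enabled("bar", "info"), config.is_enabled("bar", "warn"))
```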

Add logging support and configuration

Currently the notebook server is making use of console.log for logging issues/info throughout, but these logs are never persisted anywhere for the moment.

Task is to replace console.log usage with appropriately configured bunyan loggers. See #73 for example

Streamline VM-based deployment model for IPython

See https://github.com/GoogleCloudPlatform/datalab/blob/master/deploy/ipython/vm.sh to deploy instances, and then use SSH tunneling to access the deployed instance.

Since we'll be using that to get our stuff to initial users, we should streamline the experience with that script. I'll list a couple of items on my mind, but it would be great if others could try it and list issues/ideas for improvement as comments on this.

  • Wait for the VM to come up as well as for the Docker instance to start
  • Launch the browser at the end (don't know if a shell script can do so)

IPython docker container telemetry

Need to implement logging for the IPython docker container using Google Analytics.

Events to track on the server:

  • Docker container instantiation
  • Completed authentication flows

Events to track on the client in addition to standard GA tracking:

  • New notebook creation

Dimensions to add to standard GA tracking:

  • Project ID

Content navigation widget

Create an Angular directive that generates the notebook list HTML with appropriate interactions/hooks.

Potential per-notebook fields to display:

  • Name
  • Type (IPyNB | DataLabNB)
  • Last modified time
  • Created time

Timestamps could be rendered as "k days/hours/minutes ago" or the absolute date.
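
A small sketch of the "k days/hours/minutes ago" rendering, shown in Python for brevity (the directive itself would be client-side code). The function name time_ago and the "just now" cutoff below one minute are assumptions.

```python
from datetime import datetime, timedelta


def time_ago(then, now):
    """Render the gap between two datetimes as 'k days/hours/minutes ago'."""
    seconds = int((now - then).total_seconds())
    if seconds < 60:
        return "just now"
    # Use the largest unit that fits at least once.
    for unit, size in (("day", 86400), ("hour", 3600), ("minute", 60)):
        count = seconds // size
        if count >= 1:
            return "%d %s%s ago" % (count, unit, "" if count == 1 else "s")


now = datetime(2015, 1, 10, 12, 0, 0)
print(time_ago(now - timedelta(days=3), now))
print(time_ago(now - timedelta(minutes=45), now))
```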

Directive is driven by an external data object (e.g., array of notebook objects) by setting up two-way binding on one of the directive attributes. That is, the containing scope/controller drives the directive's data.

Page toolbar layout directive

Decomposes a page into a horizontal toolbar region at the top of the page, and a main content region underneath the toolbar.

 ---------
| toolbar |
 ---------
|         |
|  main   |
|         |
 ---------

Used for generating the high-level layout of the notebooks edit page.

Better autocompletion for our Python APIs

Right now the autocomplete in IPython leaves a lot to be desired, because it uses introspection on an object. So if you have a variable x, and do 'x.', IPython will introspect on x. This works okay if x is already bound from a previous cell, but not if it is bound in the current cell. Also, if you chain method calls in the current cell, it is useless:

Table('....').<tab>

gives autocompletion based off the working directory, so you may get something like:

.git/
.gitignore

I think this is something that could be fixed in IPython itself. Ideally it would use an autocomplete library that can do type inference, like Jedi, although a simpler approach that may be enough for our purposes would be to use function annotations on our APIs. That said, we don't really want to customize IPython at this point, so this is something we should do in the new UX.
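
To illustrate the function-annotation idea: if our API methods carry return annotations, a completer could look up the result type of a call without executing it. The sketch below is an assumption about how that might work; Table and Schema here are stand-in classes, not the real API, and completions_for_call is a hypothetical helper.

```python
import inspect


class Schema:
    def fields(self) -> list:
        return []


class Table:
    def __init__(self, name: str):
        self.name = name

    def schema(self) -> Schema:
        return Schema()

    def to_file(self, path: str) -> None:
        pass


def completions_for_call(callee):
    """List public attributes of the type that calling `callee` would return."""
    if inspect.isclass(callee):
        # Calling a class yields an instance of that class.
        result_type = callee
    else:
        try:
            result_type = inspect.signature(callee).return_annotation
        except (TypeError, ValueError):
            return []
    # Signature.empty is itself a class, so it has to be excluded explicitly.
    if result_type is inspect.Signature.empty or not inspect.isclass(result_type):
        return []
    return sorted(name for name in dir(result_type) if not name.startswith("_"))


print(completions_for_call(Table))         # candidates for Table('...').<tab>
print(completions_for_call(Table.schema))  # candidates for ....schema().<tab>
```

Only class-level attributes are found this way; attributes set in __init__ (such as name above) would need a real type-inference library.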

Avoid src/*.ts recompilation when building tests/*.ts

Find a way to avoid needing to rebuild the src/*.ts files when compiling tests.

A possible solution involves generating the .d.ts typedefs when building src/ and correctly symlinking these built files to the test directory before building the tests.

First attempt at this approach was failing for reasons that I don't fully understand (tsc was attempting to use the src/ .ts files rather than the src/ .d.ts files when compiling tests).

Set up lint rules and tools

Set up lint for the codebase for the following:

  • python (already started with pylint-based rules)
  • java
  • typescript

Enable BigQuery ETL

Enable users to use BigQuery to do ETL workflows.

We need at least the following features:

  1. Ability to run a query against a specified source - so developers can use a sample query during development and then switch to a full table/larger query when they want to run their transforms for their complete dataset.
  2. Ability to write query results to a permanent table, rather than a temp one, or loading within the notebook.

Use consistent TypeScript module filepath naming

Module filepath naming between client- and server-side code is inconsistent.

client-side:

/path/to/SomeModule.ts

server-side:

/path/to/somemodule.ts

Should pick one naming strategy or the other and be consistent. I prefer the server-side naming approach since it seems more compatible with nodejs naming practices.

Any objections/suggestions/thoughts here?

Feature: cost estimation for BQ queries

This will require some thinking on the API and UX fronts.
Many customers want the ability to get a cost estimate before firing off a query on a large dataset. BQ apparently has an API that provides the proxy values. We should consider exposing that API in some suitable Python/SQL-friendly form.

Ability to download/export data from BQ

  • Ability to stream down data into a local csv file (without loading data into memory)
  • Ability to insert an export job to have BQ write out results into GCS
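
For the first item, a minimal sketch of streaming rows into a CSV file one page at a time, so that the full result set is never held in memory. The fetch_pages generator is a hypothetical stand-in for paged BQ results and does not call any real API.

```python
import csv
import io


def fetch_pages(page_count, page_size):
    """Hypothetical stand-in for paged query results (one list per page)."""
    for page in range(page_count):
        start = page * page_size
        yield [{"id": i, "word": "w%d" % i} for i in range(start, start + page_size)]


def iter_rows(pages):
    """Flatten pages lazily so only one page is in memory at a time."""
    for page in pages:
        for row in page:
            yield row


def write_csv(rows, out_file, fieldnames):
    """Write rows to an open text file as CSV and return the row count."""
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# An in-memory buffer stands in for a local file opened with newline="".
buffer = io.StringIO()
written = write_csv(iter_rows(fetch_pages(2, 2)), buffer, ["id", "word"])
print(written, "rows written")
```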

Help for GCP/BQ objects

For beta/GA
Add help documentation for GCP/BQ classes added via "Help" option in the menu bar

Cross-platform Gradle 2.0 installer

Currently the initonce.sh setup script only checks for the availability of Gradle and does not do any installation if it is unavailable. Automate the installation in a cross-platform way for at least Linux and OS X.

Implement session ID generation

Currently the notebook server is simply using the user's IP address for the session id as a stop-gap solution.

This approach works if you assume all users will have unique IP addresses (not necessarily true) and that a user is only working with a single notebook at a time (we want to support working with multiple notebooks simultaneously).

Suggestion from nikhilk@

I think something as simple as a JWT encoding of userid + notebookid would be enough. The JWT encoding makes sure you can validate it was this server that generated the session id.

One thing to keep in mind is notebook ids likely change when renaming. In which case, maybe just a JWT encoding of the kernel id?
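
A rough sketch of the suggestion, building an HS256-signed JWT by hand with the Python standard library so the server can check that it issued a given session id. The function names, claim names, and the secret are all assumptions; a real implementation would likely use a JWT library and add an expiry claim.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data):
    """Base64url-encode bytes without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret, header, payload):
    signing_input = (header + "." + payload).encode("utf-8")
    return _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())


def make_session_id(secret, claims):
    """Encode claims (e.g. user id and kernel id) as a signed token."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode("utf-8"))
    payload = _b64url(json.dumps(claims, sort_keys=True).encode("utf-8"))
    return ".".join([header, payload, _sign(secret, header, payload)])


def verify_session_id(secret, token):
    """Return the claims if this server's secret signed the token, else None."""
    try:
        header, payload, signature = token.split(".")
    except ValueError:
        return None
    expected = _sign(secret, header, payload)
    if not hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8")):
        return None
    return json.loads(_b64url_decode(payload))


secret = b"server-side-secret"  # hypothetical; would come from server config
token = make_session_id(secret, {"user": "user-1", "kernel": "kernel-42"})
print(verify_session_id(secret, token))
```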

Support for Google Charting API in IPython

Support for generating charts using Google Charts API.
Proposal:

%%chart --data <name of list/dataframe variable>
[optional chart configuration in json format per charting API functionality]

Side note - also add json syntax coloring support in notebook and trigger that mode for %%chart.

Cross-platform Node.js installer

Currently the initonce.sh setup script only checks for the availability of Node.js and does not do any installation if it is unavailable. Automate the installation in a cross-platform way for at least Linux and OS X.

Cell input region directive

Adds styling and markup for an input region component. Both Markdown cells and code editor cells share this common "input region" component.

Provides data-binding for a single "content" attribute, which is the plain text content of the input region.

Builds upon the CodeMirror code editor directive.

Support for configuring log levels at runtime

Add support for a query string debug flag that can increase the logging verbosity of a running instance without rebuilding.

For example:
/notebooks/123?debug=true

Which would override the default logging levels to enable full debug output. This is useful for diagnosing a deployed instance without requiring a code change to bump (and then another change later to reduce) log levels.
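
A sketch of how the flag could be read from the request URL, assuming the accepted values are "true" and "1" and that the default level is "info". The function name effective_log_level is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def effective_log_level(url, default_level="info"):
    """Return 'debug' when the URL carries ?debug=true, else the default."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; use the last occurrence if repeated.
    flag = query.get("debug", ["false"])[-1].lower()
    return "debug" if flag in ("true", "1") else default_level


print(effective_log_level("/notebooks/123?debug=true"))
print(effective_log_level("/notebooks/123"))
```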

Support for using variable placeholders in SQL for column names

The %%bq_sql cell has support for using variables/placeholders for the FROM parts and literals in the WHERE clauses.

Need to extend that to support using variables in the SELECT part - specifically the column names. This can be done by implementing _repr_sql on TableField objects that are contained within TableSchema objects.
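
A sketch of the idea, under assumptions: the TableField class below is a stand-in rather than the real one, the $name placeholder syntax is assumed for illustration, and string literals are quoted with json.dumps as an approximation of SQL quoting.

```python
import json
from string import Template


class TableField:
    """Stand-in for a schema field; only the name matters here."""

    def __init__(self, name, field_type="STRING"):
        self.name = name
        self.field_type = field_type

    def _repr_sql(self):
        # A field expands to its bare column name, not a quoted literal.
        return self.name


def to_sql(value):
    """Convert a Python value into the text to splice into the SQL."""
    repr_sql = getattr(value, "_repr_sql", None)
    if callable(repr_sql):
        return repr_sql()
    if isinstance(value, str):
        return json.dumps(value)  # approximate string-literal quoting
    return str(value)


def expand_sql(sql, variables):
    """Replace $name placeholders using each variable's SQL representation."""
    return Template(sql).substitute({k: to_sql(v) for k, v in variables.items()})


field = TableField("word")
print(expand_sql("SELECT $col FROM t WHERE n > $min", {"col": field, "min": 5}))
```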

Ability to insert/upload data into BQ

  • Ability to create a table with a schema
  • Ability to append data into a table (from a list/dataframe) using the streaming insert API
  • Ability to create an upload job to upload data from GCS into a BQ table

More telemetry: user agent for BQ and Dataflow stats

Add a unique user agent for DataLab to track queries sent / dataflows run

BQ apparently is already set up while dataflow is not. Will log an issue for dataflow separately - though likely won't get it in before beta.

gcloud command for launching DataLab

For beta, we want a more streamlined way to integrate DataLab with GCP. This is 1 of 2 items (the second one is Pantheon).

Gcloud has its own cycle / reviews etc so it would be better to start on this one early. It is also likely simpler than Pantheon integration and follows the API - CLI - UI sequence.

One wrinkle is that we would like to support local usage as well which is not gcloud's purpose. We could treat local launch as either a degenerate case and use gcloud for it as well or continue to provide a non-gcloud script.

Markdown rendering directive

Directive for rendering plain text Markdown content to html.

Provides data-binding for a single "content" attribute, which contains plain text. Directive displays the content as rendered Markdown (i.e., html content).

Feature: auto-completion for BQ

Low pri but potentially useful.
Suggest limited auto-completion for BQ. I don't think it is worth blocking on maximal auto-completion (in "select" clause for instance). Providing it in clauses that come after "from" may be worthwhile.

Lazy-load CSS assets

Currently all css assets are being loaded up front within index.html.

Lazy-loading of CSS assets has at least two use cases:

  • Route/page-specific css rules
  • Plugin-specific css rules

Initial sidebar navigation content

Add content for the common navigation sidebar. This would include links to all of the "entity" pages:

  • Content
  • Sessions

In the future also:

  • Datasets
  • Jobs
  • Pipelines
  • etc.

Minimal code editor cell directive

Adds markup/styling for cell framing of input/output regions.

Directive exposes data-binding attributes for controlling input/output region content. Input region content is plain text. Output region content supports html (which can degrade to a single text node).

Output region is selectively shown based upon the existence of output content.

Builds upon the code-editor component directive.

Enforce TypeScript interface naming consistency across existing UI code

The existing UI code may be a bit inconsistent with the TypeScript interface naming used in the server-side/node code; would be good to make it consistent.

Enforce the following policy for when a TypeScript interface is I-prefixed:

  • If the interface exists so that a class (or classes) can implement it, it should be I-prefixed: class Foo implements IFoo
  • If the interface is defining a function signature, no I-prefix: interface FooHandler { // signature }
  • If the interface is defining a data-only type, no I-prefix; these are basically for defining the equivalent of POJOs/tuples

Sample: Streaming data scenarios with BQ

Details TBD
Notebook in intro folder that covers use of streamed data in BQ. There are good examples of streaming data into BQ (inserts) already so this would be complementary.
