googledatalab / datalab

Interactive tools and developer experiences for Big Data on Google Cloud Platform.

License: Apache License 2.0

Shell 6.10% Python 11.36% JavaScript 9.56% CSS 2.57% TypeScript 47.84% HTML 16.87% Smarty 0.08% Jupyter Notebook 0.15% Dockerfile 1.60% Go 3.88%

datalab's Introduction

Google Cloud DataLab

Datalab is deprecated. Vertex AI Workbench provides a notebook-based environment that offers capabilities beyond Datalab. We recommend that you use Vertex AI Workbench for new projects and migrate your Datalab notebooks to Vertex AI Workbench. For more information, see Deprecation information. To get help migrating Datalab projects to Vertex AI Workbench see Get help.

datalab's People

deflaux ustcldf ljzzju malves1982 corerax pombredanne falltodis saurabhprakash fhoffa markedmondson1234 m170897017 initaldk bbandaru obulpathi mbrukman feczo j450h1 kesuskim vasbala mlifemaker vad-babushkin codeaudit aman-ebay vkaya caigaojiang samuelhuylebroeck zhuangkechen belvo tosato3 kgov1 miquelnoguer elibixby pkpp1233 yangtree8 jackcox mrb1b0 abdiiwan1841 jmhehir brecht-d-m datastark cognitivetouch brucemetallians ojarjur wangjiahong digideskio jjsong angelapper macs720 tmatsuo majaengvall draa andrewverte dwmclary jajohe reedboehringer drewbryant yukoga phpmind yebrahim bdacode maccam912 markneville minhpascal zelladoor rxminus slietz craigcitro jgschmitz yukotan sobakavich mdbconsulting blois e-lin ecoblockchain susieyy datavizi neilsh maniacs-satm rodriguezjf bbarnes52-zz brianfarrar nitrek manifestlifeinc mrsrujankv ktiyab stewart-r alexxnica kryndex labbros clockfly khanhdinh ciandt-d1 vochicong rajivpb jkrauwer mrgoogol pchalcol jayden11 dgretton pulkitpahwa

datalab's Issues

Robust available port selection

The DataLab server is currently using a heuristic, stop-gap approach for finding open ports (used when creating new kernel instances). Need to implement a more robust scheme to find available ports.

The approach used by IPython is to create a socket with no port specification, which results in a random (but available) port being selected. This socket is then immediately closed. Normally the socket would go into TIME_WAIT and be temporarily unavailable, but IPython either sets SO_REUSEADDR or sets the socket's linger time to zero. IPython then assumes the port is still available (which may not hold, due to a race condition) and uses it for setting up kernel comm.
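
A minimal Python sketch of the technique described above, for illustration only. The function names find_free_port and bind_free_port are hypothetical and not part of DataLab or IPython. The first function reproduces the bind-to-port-0, close, and reuse pattern (including its race window); the second avoids the race by handing back the socket while it is still bound.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused port, then release it (racy by design)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # SO_REUSEADDR lets a later bind succeed even if the port lingers.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # Port 0 means "let the OS pick any available port".
        sock.bind((host, 0))
        return sock.getsockname()[1]


def bind_free_port(host="127.0.0.1"):
    """Return (socket, port) with the socket still bound, so nobody can steal the port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    return sock, sock.getsockname()[1]
```

The second form is only usable when the bound socket itself can be passed to whatever needs the port; when only a port number can be passed along, the race in the first form remains.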

Cleanup python setup in docker container

Resolve the mix of apt-get and pip installs.

Also, figure out any additional packages to be installed.

Minimal Markdown cell directive

Unlike the code editor cell which has two regions that co-exist visually (input and output), the Markdown cell has two regions that are mutually exclusive visually -- edit-mode and render-mode. See an IPy notebook's Markdown cell for example.

"Executing" the cell causes it to switch from edit-mode to render-mode. Selecting (double-click is IPy action) the cell causes it to enter edit-mode (if not already in edit-mode). In render mode, the html representation of the Markdown content is displayed.

Has only an input attribute for data-binding.

Builds upon a rendered-markdown directive and the editor-cell directive.

Cell magic syntax for BQ functionality

This will apply to other stuff besides BigQuery eventually, but BQ is what we have for the moment.

Current:

%%bq_sql ...
%%bq_udf ...

Proposed:
Make it look more like a normal command line.

%%bigquery sql [--name:<name>] [--help]
%%bigquery udf --name:<name> [--help]

Other possible commands (in the future):

schema --name:<table name>
table --name:<table name>
dataset --name:<dataset name>
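
As a rough illustration of the proposed syntax, the sketch below parses the text that follows %%bigquery into a command and its options. The function name parse_magic_line is hypothetical, and treating a bare flag such as --help as a boolean is an assumption; the issue does not specify parsing rules.

```python
def parse_magic_line(line):
    """Split 'sql --name:foo --help' into ('sql', {'name': 'foo', 'help': True})."""
    tokens = line.split()
    if not tokens:
        raise ValueError("expected a command such as 'sql' or 'udf'")
    command, options = tokens[0], {}
    for token in tokens[1:]:
        if not token.startswith("--"):
            raise ValueError("unexpected argument: " + token)
        # '--name:value' carries a value; '--help' is a bare flag.
        key, separator, value = token[2:].partition(":")
        options[key] = value if separator else True
    return command, options
```

For example, parse_magic_line("udf --name:my_udf --help") yields the command "udf" with the options name and help set.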

IPython docker container needs to support authentication of users

Authenticate users, and authorize them against the containing cloud project.

In order to do this without changing IPython, we will add a node.js front-end server that will act as a reverse proxy, and along the way implement support for authN/Z.

Log level support

RE: app/common/logging:Logger

Redirect log.<level> to the appropriate console.<level> method if it exists.

Add polyfill that redirects all levels to console.log in absence of level-specific methods on the console object.

Logging configurable by scope

RE: app/common/logging:Logger

Add support for configuring logging output by scope; for example, disable all logging only for scope "foo", or allow only level >= "warn" for scope "bar", with a default/global logging level of "info".
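
The per-scope rule can be sketched as a small lookup. The Logger in question is TypeScript; the Python below is only a language-neutral illustration. The level names, their numeric ordering, and the "off" pseudo-level are assumptions, not the actual Logger configuration format.

```python
# Assumed ordering of levels; "off" is a pseudo-level that silences a scope.
LEVELS = {"debug": 10, "info": 20, "warn": 30, "error": 40, "off": 100}


def is_enabled(scope, level, default_level="info", scope_levels=None):
    """Return True if a message at `level` from `scope` should be logged."""
    # A scope-specific threshold wins; otherwise fall back to the global default.
    threshold = (scope_levels or {}).get(scope, default_level)
    return LEVELS[level] >= LEVELS[threshold]


# The example from the issue: silence "foo", allow only >= warn for "bar".
config = {"foo": "off", "bar": "warn"}
```

With this config, is_enabled("bar", "info", scope_levels=config) is False, while any other scope still logs at "info" and above.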

Support for themes / built in dark theme

This is definitely post-alpha and could be post-beta. An intangible delighter to create a more emotional attachment to the tool and the experience of using it. In particular, a sizeable segment prefers dark themes. Sublime Text is a great example that defaults to such a theme and gets a lot of mileage out of it.

See http://www.damian.oquanta.info/posts/48-themes-for-your-ipython-notebook.html for some work that has been done.

Add logging support and configuration

Currently the notebook server is making use of console.log for logging issues/info throughout, but these logs are never persisted anywhere for the moment.

Task is to replace console.log usage with appropriately configured bunyan loggers. See #73 for example

Gradle build support for python libraries

Gradle plugin/task for building python source distributions (pygcp, ipython extensions).

Streamline VM-based deployment model for IPython

See https://github.com/GoogleCloudPlatform/datalab/blob/master/deploy/ipython/vm.sh to deploy instances, and then use SSH tunneling to access the deployed instance.

Since we'll be using that to get our stuff to initial users, we should streamline the experience with that script. I'll list a couple of items that are on my mind, but it would be great if others could try it and list issues/ideas for improvement as comments on this.

Wait for the VM to come up, as well as for the Docker instance to start
Launch the browser at the end (don't know if a shell script can do so)

IPython docker container telemetry

Need to implement logging for the IPython docker container using Google Analytics.

Events to track on the server:

Docker container instantiation
Completed authentication flows

Events to track on the client in addition to standard GA tracking:

New notebook creation

Dimensions to add to standard GA tracking:

Project ID

Content navigation widget

Create an Angular directive that generates the notebooks list HTML with appropriate interactions/hooks.

Potential per-notebook fields to display:

Name
Type (IPyNB | DataLabNB)
Last modified time
Created time

Timestamps could be rendered as "k days/hours/minutes ago" or the absolute date.

Directive is driven by an external data object (e.g., array of notebook objects) by setting up two-way binding on one of the directive attributes. That is, the containing scope/controller drives the directive's data.
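
The "k days/hours/minutes ago" rendering mentioned for timestamps could look roughly like the sketch below. The widget itself would be an Angular directive; this Python version only illustrates the bucketing logic. The function name relative_time and the "just now" cutoff under one minute are assumptions.

```python
from datetime import datetime, timedelta


def relative_time(then, now):
    """Render the gap between two datetimes as 'k days/hours/minutes ago'."""
    seconds = int((now - then).total_seconds())
    if seconds < 60:
        return "just now"
    # Try the largest unit first and stop at the first one that fits.
    for unit, size in (("day", 86400), ("hour", 3600), ("minute", 60)):
        count = seconds // size
        if count >= 1:
            plural = "" if count == 1 else "s"
            return f"{count} {unit}{plural} ago"
```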

IPython docker container should use a self-signed cert

This is for enabling HTTPS access.

The less-than-desirable aspect of this is a browser warning...

Static configuration support for modifying logger verbosity

RE: app/common/logging:Logger

Support the ability to configure the Logger verbosity (e.g., silence debug-level statements).

Page toolbar layout directive

Decomposes a page into a horizontal toolbar region at the top of the page, and a main content region underneath the toolbar.

 ---------
| toolbar |
 ---------
|         |
|  main   |
|         |
 ---------

Used for generating the high-level layout of the notebooks edit page.

Better autocompletion for our Python APIs

Right now the autocomplete in IPython leaves a lot to be desired, because it uses introspection on an object. So if you have a variable x, and do 'x.', IPython will introspect on x. This works okay if x is already bound from a previous cell, but not if it is bound in the current cell. Also, if you chain method calls in the current cell, it is useless:

Table('....').<tab>

gives autocompletion based off the working directory, so you may get something like:

.git/
.gitignore

I think this is something that could be fixed in IPython itself. Ideally it would use an autocomplete library that can do type inference, like Jedi, although a simpler approach that may be enough for our purposes would be to use function annotations on our APIs. That said, we don't really want to customize IPython at this point, so this is something we should do in the new UX.

BigQuery table schema objects should have better string representations

Right now these get persisted as class names in the notebook string representation.

Need to implement __repr__ and/or __str__ on various objects, including schema objects.
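
A minimal sketch of what such string representations might look like. The TableField and TableSchema classes here are simplified stand-ins written for this example (their constructor arguments are assumptions), not the actual library classes.

```python
class TableField:
    """Stand-in for a schema field: a name, a type, and a mode."""

    def __init__(self, name, data_type, mode="NULLABLE"):
        self.name = name
        self.data_type = data_type
        self.mode = mode

    def __repr__(self):
        # Unambiguous form, useful when the object is echoed in a notebook cell.
        return (f"TableField(name={self.name!r}, "
                f"data_type={self.data_type!r}, mode={self.mode!r})")


class TableSchema:
    """Stand-in for a table schema: an ordered list of fields."""

    def __init__(self, fields):
        self.fields = list(fields)

    def __repr__(self):
        return f"TableSchema({self.fields!r})"

    def __str__(self):
        # Human-friendly form: one field per line.
        return "\n".join(f"{f.name}: {f.data_type} ({f.mode})" for f in self.fields)
```

Without these methods, Python falls back to the default representation that shows only the class name and an object address, which matches the behavior described above.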

Avoid src/*.ts recompilation when building tests/*.ts

Find a way to avoid needing to rebuild the src/*.ts files when compiling tests.

Possible solution involves generating the .d.ts typedefs when building /src/ and correctly symlinking these built files to the test directory, before building the tests.

First attempt at this approach was failing for reasons that I don't fully understand (tsc was attempting to use the src/*.ts files rather than the src/*.d.ts files when compiling tests).

Support for domain-scoped project ids

Deal with project ids such as google.com:foobar

This affects the VM deployment script as well as the GCS notebook manager.
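
One way to handle such IDs is to split the optional domain prefix from the project name before using either part. The sketch below is an assumption about how that might be done; split_project_id and to_resource_name are hypothetical helpers, and replacing ':' and '.' with '-' is an illustrative choice rather than a documented naming rule.

```python
def split_project_id(project_id):
    """Split 'google.com:foobar' into ('google.com', 'foobar'); domain is None if absent."""
    domain, separator, name = project_id.partition(":")
    if not separator:
        return None, project_id
    return domain, name


def to_resource_name(project_id):
    """Hypothetical: derive a name without ':' or '.' for contexts that reject them."""
    return project_id.replace(":", "-").replace(".", "-")
```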

Setup lint rules and tools

Setup lint for the codebase for the following:

python (already started with pylint-based rules)
java
typescript

Enable BigQuery ETL

Enable users to use BigQuery to do ETL workflows.

We need at least the following features:

Ability to run a query against a specified source - so developers can use a sample query during development and then switch to a full table/larger query when they want to run their transforms for their complete dataset.
Ability to write query results to a permanent table, rather than a temp one, or loading within the notebook.

"OK" button in "Rename Notebook" dialog appears disabled when it is enabled

Repro:
In Chrome on Linux (at least; may be the same on Mac):
1. Create a new notebook
2. Click on the notebook name (Untitled...) to rename it
3. In the dialog, type a new name and look at the "OK" button

Expected: OK should look just like Cancel
Actual: OK looks disabled. Cancel looks enabled as expected

Use consistent TypeScript module filepath naming

Module filepath naming between client- and server-side code is inconsistent.

client-side:

/path/to/SomeModule.ts

server-side

/path/to/somemodule.ts

Should pick one naming strategy or the other and be consistent. I prefer the server-side naming approach since it seems more compatible with nodejs naming practices.

Any objections/suggestions/thoughts here?

Feature: cost estimation for BQ queries

This will require some thinking on the API and UX fronts.
Many customers want the ability to get a cost estimate before firing off a query on a large dataset. BQ apparently has an API that provides the proxy values. We should consider exposing that API in some suitable Python/SQL-friendly form.

Ability to download/export data from BQ

Ability to stream down data into a local csv file (without loading data into memory)
Ability to insert an export job to have BQ write out results into GCS
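
For the first item, streaming rows to a local CSV file without holding them all in memory can be sketched with a plain iterator. How rows are fetched from BQ is left out; the rows argument stands in for any iterable that yields one row at a time, and write_rows_to_csv is a hypothetical name.

```python
import csv


def write_rows_to_csv(rows, path, header=None):
    """Write rows one at a time to `path`; returns the number of data rows written."""
    count = 0
    # newline="" is what the csv module expects so it can control line endings.
    with open(path, "w", newline="") as output:
        writer = csv.writer(output)
        if header is not None:
            writer.writerow(header)
        for row in rows:
            # Each row is written and then dropped, so memory use stays flat.
            writer.writerow(row)
            count += 1
    return count
```

Because rows is consumed lazily, a generator that pages through query results would keep only one page in memory at a time.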

Help for GCP/BQ objects

For beta/GA
Add help documentation for GCP/BQ classes added via "Help" option in the menu bar

Cross-platform Gradle 2.0 installer

Currently the initonce.sh setup script only checks for the availability of Gradle and does not do any installation if it is unavailable. Automate the installation in a cross-platform way for at least Linux and OS X.

Implement session ID generation

Currently the notebook server is simply using the user's IP address for the session id as a stop-gap solution.

This approach works if you assume all users will have unique IP addresses (not necessarily true) and that a user is only working with a single notebook at a time (we want to support working with multiple notebooks simultaneously).

Suggestion from nikhilk@

I think something as simple as a JWT encoding of userid + notebookid would be enough. The JWT encoding makes sure you can validate it was this server that generated the session id.

One thing to keep in mind is notebook ids likely change when renaming. In which case, maybe just a JWT encoding of the kernel id?
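
To make the suggestion concrete, here is a minimal sketch of an HS256-signed JWT built with the Python standard library only. The notebook server is Node.js, so this only illustrates the idea. The claim names (user, kernel) and the function names are assumptions, and a production version would likely use a maintained JWT library and add an expiry claim.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data):
    """Base64url-encode bytes without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input, secret):
    return _b64url(hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest())


def make_session_id(claims, secret):
    """Encode `claims` as header.payload.signature, signed with the server secret."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}, separators=(",", ":")).encode("utf-8"))
    payload = _b64url(json.dumps(claims, separators=(",", ":"), sort_keys=True).encode("utf-8"))
    signing_input = header + "." + payload
    return signing_input + "." + _sign(signing_input, secret)


def verify_session_id(token, secret):
    """Return the claims if this server's secret produced the signature, else None."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    header, payload, signature = parts
    expected = _sign(header + "." + payload, secret)
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return None
    return json.loads(_b64url_decode(payload))
```

Signing the kernel id rather than the notebook id, as suggested above, keeps the session id stable across notebook renames.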

Long running tasks in docker container need restart functionality

IPython proxy server - possibly use node cluster APIs (see https://github.com/elad/node-cluster-socket.io and http://schier.co/post/restarting-workers-in-a-nodejs-cluster)

Metadata service emulator - supervisor or node forever

Support for Google Charting API in IPython

Support for generating charts using Google Charts API.
Proposal:

%%chart --data <name of list/dataframe variable>
[optional chart configuration in json format per charting API functionality]

Side note - also add json syntax coloring support in notebook and trigger that mode for %%chart.

Add a git pre-push hook for running tests in sources directory.

Happened a couple of times that I broke the build and test because I did not run the tests before push, hence this issue.

Cross-platform Node.js installer

Currently the initonce.sh setup script only checks for the availability of Node.js and does not do any installation if it is unavailable. Automate the installation in a cross-platform way for at least Linux and OS X.

Group user notebooks into separate folder with navigation to intro or notebook folder

Per discussion this morning. Items include:

Create an extra folder ("notebooks" or something similar) at the top level for user's notebooks
Surface a way to navigate directly to the "notebooks" folder or the current "intro" folder on the initial page.

Cell input region directive

Adds styling and markup for an input region component. Both Markdown cells and code editor cells share this common "input region" component.

Provides data-binding for a single "content" attribute, which is the plain text content of the input region.

Builds upon the CodeMirror code editor directive.

Support for configuring log levels at runtime

Add support for a query string debug flag that can increase the logging verbosity of a running instance without rebuilding.

For example:
/notebooks/123?debug=true

Which would override the default logging levels to enable full debug output. This is useful for diagnosing a deployed instance without requiring a code change to bump (and then another change later to reduce) log levels.
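
The override logic could be as simple as the following sketch. It is written in Python for illustration (the server is Node.js); resolve_log_level is a hypothetical name, and accepting only the literal value "true" is an assumption.

```python
from urllib.parse import parse_qs, urlparse


def resolve_log_level(url, default_level="info"):
    """Return 'debug' when the URL carries ?debug=true, else the default level."""
    query = parse_qs(urlparse(url).query)
    values = query.get("debug", [])
    # If the flag is repeated, the last occurrence wins.
    if values and values[-1].lower() == "true":
        return "debug"
    return default_level
```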

Alpha UI readiness: Change name and hexagon

Cloud DataLab with generic GCP hexagon instead of the current one - maybe with "built on IPython" subtitle.

Support for using variable placeholders in SQL for column names

The %%bq_sql cell has support for using variables/placeholders for the FROM parts and literals in the WHERE clauses.

Need to extend that to support using variables in the SELECT part - specifically the column names. This can be done by implementing _repr_sql on TableField objects that are contained within TableSchema objects.
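
A rough sketch of how that could work. The $name placeholder syntax, the expand_sql helper, and the simplified TableField class are all assumptions made for this example; only the _repr_sql method name comes from the text above. Values without _repr_sql are rendered with str(), which is a simplification and not proper SQL literal quoting.

```python
import re


class TableField:
    """Simplified stand-in: a field that knows how to render itself in SQL."""

    def __init__(self, name):
        self.name = name

    def _repr_sql(self):
        # A column reference is rendered as its bare name.
        return self.name


def expand_sql(template, variables):
    """Replace each $name placeholder with the SQL form of variables[name]."""
    def replace(match):
        value = variables[match.group(1)]
        if hasattr(value, "_repr_sql"):
            return value._repr_sql()
        return str(value)

    return re.sub(r"\$(\w+)", replace, template)
```

For example, a TableField named "title" bound to $col turns "SELECT $col FROM t" into "SELECT title FROM t".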

Notebook magic functions for table schema, table listing etc

Shortcuts like

%%bq_schema <table name>
%%bq_datasets [<project name>]
%%bq_tables [<project name>]

Ability to insert/upload data into BQ

Ability to create a table with a schema
Ability to append data into a table (from a list/dataframe) using the streaming insert API
Ability to create an upload job to upload data from GCS into a BQ table

More telemetry: user agent for BQ and Dataflow stats

Add a unique user agent for DataLab to track queries sent / dataflows run

BQ apparently is already set up while dataflow is not. Will log an issue for dataflow separately - though likely won't get it in before beta.

gcloud command for launching DataLab

For beta, we want to have a more streamlined way for DataLab integration with GCP. This is 1 of 2 items (the second one is Pantheon).

Gcloud has its own cycle / reviews etc so it would be better to start on this one early. It is also likely simpler than Pantheon integration and follows the API - CLI - UI sequence.

One wrinkle is that we would like to support local usage as well which is not gcloud's purpose. We could treat local launch as either a degenerate case and use gcloud for it as well or continue to provide a non-gcloud script.

Markdown rendering directive

Directive for rendering plain text Markdown content to html.

Provides data-binding for a single "content" attribute, which contains plain text. Directive displays the content as rendered Markdown (i.e., html content).

Jasmine test runner Gradle task

Gradle task for executing Jasmine tests.

One possibility: https://github.com/eriwen/gradle-js-plugin

Feature: auto-completion for BQ

Low pri but potentially useful.
Suggest limited auto-completion for BQ. I don't think it is worth blocking on maximal auto-completion (in "select" clause for instance). Providing it in clauses that come after "from" may be worthwhile.

Lazy-load CSS assets

Currently all css assets are being loaded up front within index.html.

Lazy-loading of CSS assets has at least two use cases:

Route/page-specific css rules
Plugin-specific css rules

Initial sidebar navigation content

Add content for the common navigation sidebar. This would include links to all of the "entity" pages

Content
Sessions

In the future also:

Datasets
Jobs
Pipelines
etc.

Minimal code editor cell directive

Adds markup/styling for cell framing of input/output regions.

Directive exposes data-binding attributes for controlling input/output region content. Input region content is plain text. Output region content supports html (which can degrade to a single text node).

Output region is selectively shown based upon the existence of output content.

Builds upon the code-editor component directive.

Enforce TypeScript interface naming consistency across existing UI code

The existing UI code may be a bit inconsistent with the TypeScript interface naming used in the server-side/node code; would be good to make it consistent.

Enforce the following policy for when a TypeScript interface is I-prefixed:

If the interface exists so that a class (or classes) can implement it, it should be I-prefixed: class Foo implements IFoo
If the interface is defining a function signature, no I-prefix: interface FooHandler { // signature }
If the interface is defining a data-only type, no I-prefix; these are basically for defining the equivalent of POJOs/tuples

Sample: Streaming data scenarios with BQ

Details TBD
Notebook in intro folder that covers use of streamed data in BQ. There are good examples of streaming data into BQ (inserts) already so this would be complementary.
