
Spark Notebook


The Spark Notebook is the open source notebook aimed at enterprise environments, providing Data Scientists and Data Engineers with an interactive web-based editor that can combine Scala code, SQL queries, Markup and JavaScript in a collaborative manner to explore, analyse and learn from massive data sets.

[Screenshot: notebook intro]

The Spark Notebook allows performing reproducible analysis with Scala, Apache Spark and the Big Data ecosystem.

Features Highlights

Apache Spark

Apache Spark is available out of the box, and is simply accessed through the variable `sparkContext`, or its shorthand `sc`.
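For example, a cell can use the pre-bound context directly. A minimal sketch (the exact API surface depends on the Spark version bundled with your notebook):

```scala
// `sc` (an alias for `sparkContext`) is pre-bound in every notebook cell;
// no imports or setup are needed to start computing on the cluster.
val numbers = sc.parallelize(1 to 100)

// A trivial distributed computation: sum of 1..100.
val total = numbers.sum()  // 5050.0
```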

Multiple Spark Context Support

One of the most useful features of the Spark Notebook is its isolation of running notebooks: each started notebook spawns a new JVM with its own SparkSession instance. This allows maximal flexibility to:

  • manage dependencies without clashes
  • access different clusters
  • tune each notebook differently
  • schedule externally (on the roadmap)

Metadata-driven configuration

With multiple SparkContexts available, the Spark Notebook achieves maximum flexibility by making their configuration metadata-driven.
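For instance, Spark settings can be declared per notebook in its metadata rather than in code. A hedged sketch (the `customSparkConf` key follows spark-notebook's metadata convention; the values are purely illustrative):

```json
{
  "customSparkConf": {
    "spark.app.name": "my-analysis",
    "spark.master": "local[4]",
    "spark.executor.memory": "2G"
  }
}
```

Because each notebook runs in its own JVM, two notebooks can carry entirely different settings like these without interfering with each other.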

Scala

The Spark Notebook supports exclusively the Scala programming language, the Unpredicted Lingua Franca for Data Science, and extensively exploits the JVM ecosystem of libraries to drive a smooth evolution of data-driven software from exploration to production.

The Spark Notebook is available for *NIX and Windows systems in easy-to-use ZIP/TAR, Docker and DEB packages.

Reactive

All components in the Spark Notebook are dynamic and reactive.

The Spark Notebook comes with dynamic charts, and most (if not all) components can be listened to and can react to events. This is very helpful in many cases, for example:

  • data entering the system live at runtime
  • visual plotting of events
  • multiple interconnected visual components

Dynamic and reactive components mean that you don't have to write the HTML, JS, and server code for basic use cases.

Quick Start

Go to Quick Start for our 5-minute guide to get up and running with the Spark Notebook.

Come on over to Gitter to discuss things, get some help, or start contributing!

Learn more

Testimonials

Skymind - Deeplearning4j

Spark Notebook gives us a clean, useful way to mix code and prose when we demo and explain our tech to customers. The Spark ecosystem needed this.

It allows our analysts and developers (15+ users) to run ad-hoc queries, perform complex data analysis and data visualisations, and prototype machine learning pipelines. In addition, we use it to power our BI dashboards.

Adopters

| Name | Description |
| --- | --- |
| Kensu | Lifting Data Science to the Enterprise level |
| Agile Lab | The only Italian Spark Certified systems integrator |
| CloudPhysics | Data-Driven Insights for Smarter IT |
| Aliyun (Alibaba ECS) | Spark runtime environment on ECS and management tool for Spark clusters running on Aliyun ECS |
| EMBL-EBI | EMBL-EBI provides freely available data from life science experiments, performs basic research in computational biology, and offers an extensive user training programme, supporting researchers in academia and industry. |
| Metail | The best body shape and garment fit company in the world. To create and empower everyone's online body identity. |
| kt NexR | kt NexR has been one of the leading Big Data companies in Korea since 2007. |
| Skymind | At Skymind, we're tackling some of the most advanced problems in data analysis and machine intelligence. We offer state-of-the-art, flexible, scalable deep learning for industry. |
| Amino | A new way to get the facts about your health care choices. |
| Vinted | Online marketplace and social network focused on young women's lifestyle. |
| Vingle | Vingle is the community where you can meet someone like you. |
| 47 Degrees | 47 Degrees is a global consulting firm and certified Typesafe & Databricks Partner specializing in Scala & Spark. |
| Barclays | Barclays is a British multinational banking and financial services company headquartered in London. |
| Swisscom | Swisscom is the leading mobile service provider in Switzerland. |
| Knoldus | Knoldus is a global consulting firm and certified "Select" Lightbend & Databricks Partner specializing in the Scala & Spark ecosystem. |

Contributors

0asa, agile-lab, andypetrella, antonkulaga, bigsnarfdude, cbvoxel, copumpkin, ericacm, eronwright, folone, frbayart, gitter-badger, hanxue, huitseeker, kencoder, maasg, mandubian, meh-ninja, minyk, mrt, nathan-gs, nightscape, paulp, petervandenabeele, rpcmoritz, shijinkui, stevenbeeckman, uberwach, vidma, xtordoir


Issues

Hadoop 2.3 support

A version with Spark 1.1 (or 1.2) and Hadoop 2.3 would be appreciated.

Executing spark while "play run" is failing

This is a tricky one (again), because it looks like it only happens when running in dev mode (i.e. `run`).

Hence it's still hard to debug or develop things involving Spark execution (Spark, SparkInfo, Sql, etc.)... which is unfortunate.

The symptom is a mismatch while writing/reading Tasks with the closure serializer (as far as the current investigation has discovered). The problem occurs for scala.Option and scala.None$. Java serialization reports a bad serial uid, but the shown ids (see below) are for Option and None.type → that's normal. However, the written bytes should only refer to the characters None$, never Option.

No idea how this happens yet; my guess is that the Option field in Task is metrics.

[error] o.a.s.e.Executor - Exception in task 1.0 in stage 0.0 (TID 1)
java.io.InvalidClassException: scala.Option; local class incompatible: stream classdesc serialVersionUID = -2062608324514658839, local class serialVersionUID = 5081326844987135632
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) ~[na:1.7.0_72]
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622) ~[na:1.7.0_72]
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) ~[na:1.7.0_72]
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622) ~[na:1.7.0_72]
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) ~[na:1.7.0_72]
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) ~[na:1.7.0_72]
[error] o.a.s.s.TaskSetManager - Task 7 in stage 0.0 failed 1 times; aborting job

Run notebook on mesos

"Executor Spark home spark.mesos.executor.home is not set!" even if spark.executor.uri is correctly set

Open notebooks from any folder

In IPython, the command `ipython notebook` can be called from any directory, launching a web service with all the .ipynb files in the current directory and its subdirectories.

It would be useful to maintain the same approach, to store the notebooks within projects / git repos / etc.

no response to statement execution

I am testing using the "Simple spark" example. When executing any command there is no response.

terminal-log

Embedded server listening at
http://0.0.0.0:8899
Press any key to stop.
log4j:ERROR Could not find value for key log4j.appender.console
log4j:ERROR Could not instantiate appender named "console".
log4j:WARN No appenders could be found for logger (notebook.kernel.pfork.BetterFork$).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
[WARN] [12/20/2014 16:32:12.241] [main] [EventStream(akka://Remote)] [akka.event-handlers] config is deprecated, use [akka.loggers]

Using :cp loses the env

Since :cp restarts the REPL, the env (mainly variables created from the history) is lost.

A solution is to ask for a new REPL, passing it the history of the previous one...

Dockerized Container

It would be nice to spin it up in a VM or a container, on Kubernetes or Mesos.
I've started working on it, but the command is a bit of a hack.
Security, an easy URL, and sbt are a little painful;
I'll send a PR when I can.

Change default port 9000

Hey guys, I hope this is the right spot to ask questions (and possibly raise an issue).

I would like to run spark-notebook on a server that is also used as the namenode of a Hadoop cluster (for a proof of concept).
When starting spark-notebook, it produces an error:
"org.jboss.netty.channel.ChannelException: Failed to bind to: /0.0.0.0:9000"
Most likely it cannot bind to that port, since the Hadoop filesystem is already bound to port 9000.

Is there a way to change the port of spark-notebook?

Best regards,
Benjamin
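Since the Spark Notebook is built on the Play framework, the HTTP port can typically be overridden at launch time via the standard Play system property. A minimal sketch (flag per Play convention, assuming the default launcher script; port number purely illustrative):

```shell
# Start the notebook on port 9001 instead of the default 9000,
# avoiding the clash with the Hadoop namenode.
./bin/spark-notebook -Dhttp.port=9001
```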

Sanitize the themes

Long story short: lots of CSS imports...

Longer:
Quite a few different themes are in use, like jquery-ui and bootstrap.
Also, libs like bokeh define clashing class names (with jquery-ui, AFAIK) and either ship or rely on certain CSS libs (jquery-ui / bootstrap?)

SparkR integration

Need a new context (:r?)

Need to see how to integrate and use SparkR from scala.

Git Clone on Windows machine

error: unable to create file Spark+on+Mesos+using+C*.snb (Invalid argument)

probably some irregular char in the file name.

Can I submit a PR ?

Enable interpolation in Markup block

It's painful to have to use <h2></h2> code in code blocks just because we need to render a local variable.

The idea here is to make the local variables accessible to the markup renderer.

Integrate with bokeh

Some work on this has already started here.

It needs to be continued; it's a great move and idea! This would ease the integration of drawing capabilities, with at least a first complete one.

Other ideas would be nvd3, c3.js, and so on.

PySpark integration

Need a new context (:py?)

Need to see how to integrate and use PySpark from scala.

Download notebook as code

Add a menu item to download the code part as a Scala file. Maybe consider rendering markdown as comments; :cp and :dp will be more complicated and could be left aside at first.

Forked process won't work using `play run`

This is mainly a Play (sbt) problem actually; it seems that it interferes with the classpath construction and loading, hence the classes are not found in the forked process.

Some things have been tried so far, like this; however, even when the process can now find the classes, it fails weirdly at runtime:
cannot find function f in StringContext.

The thing is that f is actually a macro and should have been injected into the class' bytecode (AFAIK), hence it should be resolved.

A clash in the Scala version is one of the potential problems, but I'm not sure.

adam usage

I see that you added info about configuring spark-notebook with different things. I think a guide about configuring it with ADAM would also be useful.

Allow Inputs to remain visible after execution

Input blocks in the notebook are always hidden after a "successful" execution → meaning no exception was thrown.
This is sometimes a bit awkward, because the result might be a Try in a failed state, or we may simply want to keep seeing the input code.

What could be done is to add a menu entry plus a shortcut for a new mode where inputs are never hidden (à la ALT+A).

Cursor size glitch

It happens on lines with number > 1. The cursor spans more than one line in height; this distracts me so much and is very uncomfortable.

Dependencies are not updated

Executing "Update classpath and Spark's jars", it seems the dependencies are not updated as documented.

At the end of the execution, the key spark.jars is still empty.

Also, trying to add mllib with the following code doesn't have any effect:

:dp org.apache.spark  % "spark-mllib_2.10" % 1.2.0

Create a print css

When printing, only a part of the notebook is printed; the rest is actually hidden.

Header glitch -- disappearing oO

Sometimes, after runs or things like that (even just at first load), the header is not visible and screws up the UI.

It requires a click somewhere around the banner. This is a classic symptom of bad DOM/CSS manipulation.
