
Spark Notebook


The Spark Notebook is the open source notebook aimed at enterprise environments, providing Data Scientists and Data Engineers with an interactive web-based editor that can combine Scala code, SQL queries, Markup and JavaScript in a collaborative manner to explore, analyse and learn from massive data sets.

[Screenshot: notebook intro]

The Spark Notebook allows performing reproducible analysis with Scala, Apache Spark and the Big Data ecosystem.

Features Highlights

Apache Spark

Apache Spark is available out of the box, and is simply accessed through the variable `sparkContext`, or its shorthand `sc`.
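For example, a cell can use the pre-bound context directly. A minimal sketch (the exact API surface depends on the Spark version bundled with your notebook):

```scala
// `sc` (an alias for `sparkContext`) is pre-bound in every notebook cell;
// no imports or setup are needed to start computing on the cluster.
val numbers = sc.parallelize(1 to 100)

// A trivial distributed computation: sum of 1..100.
val total = numbers.sum()  // 5050.0
```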

Multiple Spark Context Support

One of the most useful features of the Spark Notebook is its isolation of running notebooks: each started notebook spawns a new JVM with its own SparkSession instance. This allows maximal flexibility to:

  • manage dependencies without clashes
  • access different clusters
  • tune each notebook differently
  • schedule externally (on the roadmap)

Metadata-driven configuration

With multiple SparkContexts available, the Spark Notebook achieves maximum flexibility by making their configuration metadata-driven.
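For instance, Spark settings can be declared per notebook in its metadata rather than in code. A hedged sketch (the `customSparkConf` key follows spark-notebook's metadata convention; the values are purely illustrative):

```json
{
  "customSparkConf": {
    "spark.app.name": "my-analysis",
    "spark.master": "local[4]",
    "spark.executor.memory": "2G"
  }
}
```

Because each notebook runs in its own JVM, two notebooks can carry entirely different settings like these without interfering with each other.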

Scala

The Spark Notebook supports exclusively the Scala programming language, the Unpredicted Lingua Franca for Data Science, and extensively exploits the JVM ecosystem of libraries to drive a smooth evolution of data-driven software from exploration to production.

The Spark Notebook is available for *NIX and Windows systems in easy-to-use ZIP/TAR, Docker and DEB packages.

Reactive

All components in the Spark Notebook are dynamic and reactive.

The Spark Notebook comes with dynamic charts, and most (if not all) components can be listened to and can react to events. This is very helpful in many cases, for example:

  • data entering the system live at runtime
  • visual plotting of events
  • multiple interconnected visual components

Dynamic and reactive components mean that you don't have to write the HTML, JS, and server code for basic use cases.

Quick Start

Go to Quick Start for our 5-minute guide to get up and running with the Spark Notebook.

Come on over to Gitter to discuss things, get some help, or start contributing!

Learn more

Testimonials

Skymind - Deeplearning4j

Spark Notebook gives us a clean, useful way to mix code and prose when we demo and explain our tech to customers. The Spark ecosystem needed this.

It allows our analysts and developers (15+ users) to run ad-hoc queries, perform complex data analysis and data visualisations, and prototype machine learning pipelines. In addition, we use it to power our BI dashboards.

Adopters

| Name | Description |
| --- | --- |
| Kensu | Lifting Data Science to the Enterprise level |
| Agile Lab | The only Italian Spark Certified systems integrator |
| CloudPhysics | Data-Driven Insights for Smarter IT |
| Aliyun (Alibaba ECS) | Spark runtime environment on ECS and management tool for Spark clusters running on Aliyun ECS |
| EMBL-EBI | EMBL-EBI provides freely available data from life science experiments, performs basic research in computational biology, and offers an extensive user training programme, supporting researchers in academia and industry. |
| Metail | The best body shape and garment fit company in the world. To create and empower everyone's online body identity. |
| kt NexR | kt NexR has been one of the leading Big Data companies in Korea since 2007. |
| Skymind | At Skymind, we're tackling some of the most advanced problems in data analysis and machine intelligence. We offer state-of-the-art, flexible, scalable deep learning for industry. |
| Amino | A new way to get the facts about your health care choices. |
| Vinted | Online marketplace and social network focused on young women's lifestyle. |
| Vingle | Vingle is the community where you can meet someone like you. |
| 47 Degrees | 47 Degrees is a global consulting firm and certified Typesafe & Databricks Partner specializing in Scala & Spark. |
| Barclays | Barclays is a British multinational banking and financial services company headquartered in London. |
| Swisscom | Swisscom is the leading mobile service provider in Switzerland. |
| Knoldus | Knoldus is a global consulting firm and certified "Select" Lightbend & Databricks Partner specializing in the Scala & Spark ecosystem. |

Contributors

0asa, agile-lab, andypetrella, antonkulaga, bigsnarfdude, cbvoxel, copumpkin, ericacm, eronwright, folone, frbayart, gitter-badger, hanxue, huitseeker, kencoder, maasg, mandubian, meh-ninja, minyk, mrt, nathan-gs, nightscape, paulp, petervandenabeele, rpcmoritz, shijinkui, stevenbeeckman, uberwach, vidma, xtordoir


Issues

Hadoop 2.3 support

A version with Spark 1.1 (or 1.2) and Hadoop 2.3 would be appreciated.

Executing spark while "play run" is failing

This is a tricky one (again), because it looks like it only happens when running in dev mode (i.e. `run`).

Hence it's still hard to debug or develop things involving Spark execution (Spark, SparkInfo, Sql, etc.)... which is unfortunate.

The symptom is a mismatch while writing/reading Tasks with the closure serializer (as far as the current investigation has discovered). The problem occurs for scala.Option and scala.None$. Java serialization reports a bad serial uid, but the shown ids (see below) are for Option and None.type → that's normal. However, the written bytes should only refer to the characters None$, never Option.

No idea how this happens yet; my guess is that the Option field in Task is metrics.

[error] o.a.s.e.Executor - Exception in task 1.0 in stage 0.0 (TID 1)
java.io.InvalidClassException: scala.Option; local class incompatible: stream classdesc serialVersionUID = -2062608324514658839, local class serialVersionUID = 5081326844987135632
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) ~[na:1.7.0_72]
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622) ~[na:1.7.0_72]
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) ~[na:1.7.0_72]
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622) ~[na:1.7.0_72]
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) ~[na:1.7.0_72]
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) ~[na:1.7.0_72]
[error] o.a.s.s.TaskSetManager - Task 7 in stage 0.0 failed 1 times; aborting job

Run notebook on mesos

"Executor Spark home spark.mesos.executor.home is not set!" even if spark.executor.uri is correctly set

Open notebooks from any folder

In IPython, the command `ipython notebook` can be called from any directory, launching a web service with all the .ipynb files in the current directory and its subdirectories.

It would be useful to maintain the same approach, to store the notebooks within projects / git repos / etc.

no response to statement execution

I am testing using the "Simple spark" example. When executing any command there is no response.

terminal-log

Embedded server listening at
http://0.0.0.0:8899
Press any key to stop.
log4j:ERROR Could not find value for key log4j.appender.console
log4j:ERROR Could not instantiate appender named "console".
log4j:WARN No appenders could be found for logger (notebook.kernel.pfork.BetterFork$).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
[WARN] [12/20/2014 16:32:12.241] [main] [EventStream(akka://Remote)] [akka.event-handlers] config is deprecated, use [akka.loggers]

Using :cp loses the env

Since :cp restarts the REPL, the env (mainly variables created from the history) is lost.

A solution is to ask for a new REPL, passing it the history of the previous one...

Dockerized Container

It would be nice to spin it up in a VM or a container, on Kubernetes or Mesos.
I've started working on it, but the command is a bit of a hack.
Security, an easy URL, and sbt are a little painful;
I'll send a PR when I can.

Change default port 9000

Hey guys, I hope this is the right spot to ask questions (and possibly raise an issue).

I would like to run spark-notebook on a server that is also used as the namenode of a Hadoop cluster (for a proof of concept).
When starting spark-notebook, it produces an error:
"org.jboss.netty.channel.ChannelException: Failed to bind to: /0.0.0.0:9000"
Most likely it cannot bind to that port, since the Hadoop filesystem is already bound to port 9000.

Is there a way to change the port of spark-notebook?

Best regards,
Benjamin
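Since the Spark Notebook is built on the Play framework, the HTTP port can typically be overridden at launch time via the standard Play system property. A minimal sketch (flag per Play convention, assuming the default launcher script; port number purely illustrative):

```shell
# Start the notebook on port 9001 instead of the default 9000,
# avoiding the clash with the Hadoop namenode.
./bin/spark-notebook -Dhttp.port=9001
```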

Sanitize the themes

Long story short: lots of CSS imports...

Longer:
Quite a few different themes are in use, like jquery-ui and bootstrap.
Also, libs like bokeh define clashing class names (with jquery-ui, AFAIK) and either ship or rely on certain CSS libs (jquery-ui / bootstrap?)

SparkR integration

Need a new context (:r?)

Need to see how to integrate and use SparkR from scala.

Git Clone on Windows machine

error: unable to create file Spark+on+Mesos+using+C*.snb (Invalid argument)

probably some irregular char in the file name.

Can I submit a PR ?

Enable interpolation in Markup block

It's painful to have to use <h2></h2> code in code blocks just because we need to render a local variable.

The idea here is to make the local variables accessible to the markup renderer.

Integrate with bokeh

Some work on this has already started here.

It needs to be continued; it's a great move and idea! This would ease the integration of drawing capabilities, with at least a first complete one.

Other ideas would be nvd3, c3.js, and so on.

PySpark integration

Need a new context (:py?)

Need to see how to integrate and use PySpark from scala.

Download notebook as code

Add a menu item to download the code part as a Scala file. Maybe consider rendering markdown as comments; :cp and :dp will be more complicated and could be left aside at first.

Forked process won't work using `play run`

This is mainly a Play (sbt) problem actually; it seems that it interferes with the classpath construction and loading, hence the classes are not found in the forked process.

Some things have been tried so far, like this; however, even when the process can now find the classes, it fails weirdly at runtime:
cannot find function f in StringContext.

The thing is that f is actually a macro and should have been injected into the class' bytecode (AFAIK), hence it should be resolved.

A clash in the Scala version is one of the potential problems, but I'm not sure.

adam usage

I see that you added info about configuring spark-notebook with different things. I think a guide about configuring it with ADAM would also be useful.

Allow Inputs to remain visible after execution

Input blocks in the notebook are always hidden after a "successful" execution → meaning no exception was thrown.
This is sometimes a bit awkward, because the result might be a Try in a failed state, or we may simply want to keep seeing the input code.

What could be done is to add a menu entry plus a shortcut for a new mode where inputs are never hidden (à la ALT+A).

Cursor size glitch

It happens on lines with number > 1. The cursor spans more than one line in height; this distracts me so much and is very uncomfortable.

Dependencies are not updated

Executing "Update classpath and Spark's jars", it seems the dependencies are not updated as documented.

At the end of the execution, the key spark.jars is still empty.

Also, trying to add mllib with the following code doesn't have any effect:

:dp org.apache.spark  % "spark-mllib_2.10" % 1.2.0

Create a print css

When printing, only a part of the notebook is printed; the rest is actually hidden.

Header glitch -- disappearing oO

Sometimes, after runs or things like that (even just at first load), the header is not visible and screws up the UI.

It requires a click somewhere around the banner. This is a classic symptom of bad DOM/CSS manipulation.
