
spylon-kernel's Introduction

spylon-kernel

Project Status: Inactive – The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.

A Scala Jupyter kernel that uses metakernel in combination with py4j.

Prerequisites

  • Apache Spark 2.1.1 compiled for Scala 2.11
  • Jupyter Notebook
  • Python 3.5+

Install

You can install the spylon-kernel package using pip or conda.

pip install spylon-kernel
# or
conda install -c conda-forge spylon-kernel

Using it as a Scala Kernel

You can use spylon-kernel as a Scala kernel for Jupyter Notebook. Do this when you want to work with Spark in Scala with a bit of Python code mixed in.

Create a kernel spec for Jupyter notebook by running the following command:

python -m spylon_kernel install

Launch Jupyter Notebook and you should see spylon-kernel as an option in the New dropdown menu.

See the basic example notebook for information about how to initialize a Spark session and use it both in Scala and Python.
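For reference, a session-initialization cell might look like the following. This is a minimal sketch using the %%init_spark magic described later on this page; the master URL and application name are placeholder values for your own environment.

%%init_spark
launcher.master = "local[2]"
launcher.conf.spark.app.name = "basic-example"

Once the session is up, Scala cells can use the spark and sc variables directly.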

Using it as an IPython Magic

You can also use spylon-kernel as a magic in an IPython notebook. Do this when you want to mix a little bit of Scala into your primarily Python notebook.

In one cell, register the magics:

from spylon_kernel import register_ipython_magics
register_ipython_magics()

Then, in a separate cell, use the %%scala cell magic:

%%scala
val x = 8
x

Using it as a Library

Finally, you can use spylon-kernel as a Python library. Do this when you want to evaluate a string of Scala code in a Python script or shell.

from spylon_kernel import get_scala_interpreter

interp = get_scala_interpreter()

# Evaluate the result of a scala code block.
interp.interpret("""
    val x = 8
    x
""")

interp.last_result()

Release Process

Push a tag and submit a source dist to PyPI.

git commit -m 'REL: 0.2.1' --allow-empty
git tag -a 0.2.1 # and enter the same message as the commit
git push origin master # or send a PR

# if everything builds / tests cleanly, release to pypi
make release

Then update https://github.com/conda-forge/spylon-kernel-feedstock.

spylon-kernel's People

Contributors

mariusvniekerk, metasim, parente, rgbkrk


spylon-kernel's Issues

Failed running `python -m spylon_kernel install`

I got the following error message

Traceback (most recent call last):
  File "/Users/username/miniconda3/lib/python3.9/runpy.py", line 188, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/Users/username/miniconda3/lib/python3.9/runpy.py", line 147, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "/Users/username/miniconda3/lib/python3.9/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/Users/username/miniconda3/lib/python3.9/site-packages/spylon_kernel/__init__.py", line 3, in <module>
    from .scala_kernel import SpylonKernel
  File "/Users/username/miniconda3/lib/python3.9/site-packages/spylon_kernel/scala_kernel.py", line 5, in <module>
    from .init_spark_magic import InitSparkMagic
  File "/Users/username/miniconda3/lib/python3.9/site-packages/spylon_kernel/init_spark_magic.py", line 3, in <module>
    import spylon.spark
  File "/Users/username/miniconda3/lib/python3.9/site-packages/spylon/spark/__init__.py", line 27, in <module>
    from .launcher import SparkConfiguration, with_spark_context, with_sql_context
  File "/Users/username/miniconda3/lib/python3.9/site-packages/spylon/spark/launcher.py", line 51, in <module>
    import pyspark
  File "/Users/username/miniconda3/lib/python3.9/site-packages/pyspark/__init__.py", line 46, in <module>
    from pyspark.context import SparkContext
  File "/Users/username/miniconda3/lib/python3.9/site-packages/pyspark/context.py", line 31, in <module>
    from pyspark import accumulators
  File "/Users/username/miniconda3/lib/python3.9/site-packages/pyspark/accumulators.py", line 97, in <module>
    from pyspark.cloudpickle import CloudPickler
  File "/Users/username/miniconda3/lib/python3.9/site-packages/pyspark/cloudpickle.py", line 146, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "/Users/username/miniconda3/lib/python3.9/site-packages/pyspark/cloudpickle.py", line 127, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)

Environment:
macOS Monterey, Apple M1 Pro

Does spylon-kernel support Spark 3.0?

We tried to run spylon notebook with Spark 3.0 and run into the following error. Is Spark 3.0 supported by spylon? Thanks.

Error in calling magic 'init_spark' on cell:
Java gateway process exited before sending its port number
args: []
kwargs: {}
Traceback (most recent call last):
  File "/home/yuzhou/miniconda3/lib/python3.7/site-packages/metakernel/magic.py", line 94, in call_magic
    func(*args, **kwargs)
  File "/home/yuzhou/miniconda3/lib/python3.7/site-packages/spylon_kernel/init_spark_magic.py", line 59, in cell_init_spark
    init_spark(conf=self.env['launcher'], capture_stderr=stderr)
  File "/home/yuzhou/miniconda3/lib/python3.7/site-packages/spylon_kernel/scala_interpreter.py", line 99, in init_spark
    spark_context = conf.spark_context(application_name)
  File "/home/yuzhou/miniconda3/lib/python3.7/site-packages/spylon/spark/launcher.py", line 521, in spark_context
    return pyspark.SparkContext(appName=application_name, conf=spark_conf)
  File "/home/yuzhou/spark/spark/python/pyspark/context.py", line 128, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/home/yuzhou/spark/spark/python/pyspark/context.py", line 320, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/home/yuzhou/spark/spark/python/pyspark/java_gateway.py", line 105, in launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number

%%init_spark [--stderr] - starts a SparkContext with a custom
configuration defined using Python code in the body of the cell

Includes a spylon.spark.launcher.SparkConfiguration instance
in the variable launcher. Looks for an application_name
variable to use as the name of the Spark session.

Example

%%init_spark
launcher.jars = ["file://some/jar.jar"]
launcher.master = "local[4]"
launcher.conf.spark.app.name = "My Fancy App"
launcher.conf.spark.executor.cores = 8

Options:

--stderr Capture stderr in the notebook instead of in the kernel log [default: False]

Write NULL File to HDFS

train_drop.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("hdfs:////ECommAI/train.csv")
I use this code to write the file, but the resulting file is empty.

Error with anonymous functions

Simple anonymous functions applied to RDDs using map raise errors.

val x = sc.parallelize(List("spark", "rdd", "example", "sample", "example"), 3)
val y = x.map(r => (r, 1))

x collects successfully:

x.collect
// res2: Array[String] = Array(spark, rdd, example, sample, example)

y raises an error:

y.collect
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
	at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
  ... 37 elided
Caused by: java.lang.ClassNotFoundException: $anonfun$1
  at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
  at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
  at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
  at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
  at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  ... 1 more

Integrate or break dependency on spylon package

maxpoint/spylon is used for configuration purposes in the %%init_spark block and to create the SparkContext from that config. We're uncertain whether we'll keep maintaining the spylon package as a separate entity. Either we should break ties with it and go back to using the plain old SparkConf object with string keys, or merge the pieces that support attribute dot-notation and autocomplete into spylon-kernel.
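To make the trade-off concrete, here is a rough sketch of the two styles being weighed. The attribute form mirrors the launcher examples shown elsewhere on this page; the string-keyed form uses the plain pyspark SparkConf API and is not spylon-kernel code.

# Attribute dot-notation provided by spylon's SparkConfiguration, as used in
# %%init_spark cells:
#   launcher.conf.spark.executor.cores = 8
#   launcher.conf.spark.app.name = "My Fancy App"

# The plain string-keyed alternative, using only pyspark:
from pyspark import SparkConf

conf = SparkConf()
conf.set("spark.executor.cores", "8")
conf.set("spark.app.name", "My Fancy App")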

Output is sometimes suppressed

Similar to an old issue with Toree: sometimes the kernel refuses to print output, even for extremely simple things (e.g., sparkDF.printSchema). I have not noticed a pattern in when this surfaces.

Add support for display from the scala side

Right now there is no direct way of triggering display from the Scala interpreter side.

Doing this fully would likely require a simple Python server thread listening on a random localhost port pre-shared with the interpreter at creation. Injecting a kernel object for controlling some interaction from the Scala side can be done at the same time the SparkSession bindings are created.
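A minimal sketch of the server-thread idea described above; every name here is illustrative and nothing below exists in spylon-kernel today.

import socket
import threading

def start_display_server(handle_payload):
    """Listen on a random localhost port and feed each received line to handle_payload."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    port = server.getsockname()[1]
    server.listen(1)

    def serve():
        # Handle a single connection from the Scala side, for simplicity.
        conn, _ = server.accept()
        with conn, conn.makefile("r") as stream:
            for line in stream:
                handle_payload(line.rstrip("\n"))

    threading.Thread(target=serve, daemon=True).start()
    return port  # pre-share this port with the Scala interpreter at creation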

s3 filesystem not found

Hi, I am trying to read a Parquet file from an S3 bucket using the spylon kernel and I am getting an error like "s3 filesystem scheme not found". I have tried adding the hadoop-aws JAR and the AWS Java SDK; the error changed, but I am still not able to read the file.

Can you help me with this?
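One direction worth trying (a sketch, not a confirmed fix): pull in the hadoop-aws artifact through launcher.packages in an %%init_spark cell, set the s3a credentials, and read the file with the s3a:// scheme instead of s3://. The version below is a placeholder and has to match the Hadoop build your Spark was compiled against.

%%init_spark
# Placeholder version; match your Spark's Hadoop version.
launcher.packages = ["org.apache.hadoop:hadoop-aws:2.7.3"]
launcher.conf.spark.hadoop.fs.s3a.access.key = "YOUR_ACCESS_KEY"
launcher.conf.spark.hadoop.fs.s3a.secret.key = "YOUR_SECRET_KEY"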

Any code is not running

For any code that I run on this kernel, nothing happens: just "Initializing Scala interpreter ..." is written as output and the code never completes. (Screenshot omitted.)

Using spylon-kernel with java?

Currently I'm using the spylon-kernel to access our Spark cluster, which runs on Mesos. This works perfectly from Scala. Now I would like to use the notebook with Spark from Java. I tried the IJava kernel, which however does not manage to create a Spark session on the cluster (it does work with local Spark, but not with the Mesos-hosted Spark cluster).
Is it possible to use spylon-kernel as a base for creating a Java version?

spylon-kernel error : compilation: disabled (not enough contiguous free space left)

I found the following error after training a model in Spark with spylon-kernel (the Scala kernel for Jupyter Notebook) on Mac OS X.

CodeCache: size=245760Kb used=230556Kb max_used=233036Kb free=15203Kb
 bounds [0x00000001176b8000, 0x00000001266b8000, 0x00000001266b8000]
 total_blobs=30250 nmethods=29053 adapters=1097
 compilation: disabled (not enough contiguous free space left)

According to the page "https://confluence.atlassian.com/confkb/confluence-crashes-due-to-codecache-is-full-compiler-has-been-disabled-780864803.html", the error can be resolved by changing the Java startup options.

-XX:ReservedCodeCacheSize=384m
-XX:+UseCodeCacheFlushing

Please comment on how to change these Java startup options on Mac OS X.
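One way to pass such flags to the driver JVM that spylon-kernel launches (a sketch, not verified on Mac OS X) is the standard spark.driver.extraJavaOptions setting in an %%init_spark cell:

%%init_spark
# Enlarge the JIT code cache and allow it to be flushed.
launcher.conf.spark.driver.extraJavaOptions = "-XX:ReservedCodeCacheSize=384m -XX:+UseCodeCacheFlushing"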

Graph Frames modules are missing

GraphX is not part of the module; I am trying to import GraphFrames in my notebook but it keeps throwing an error. Does anyone know how to solve this problem?

import org.graphframes._
import org.graphframes.GraphFrame

Output:

<console>:34: error: object graphframes is not a member of package org.apache
       import org.apache.graphframes._
                         ^
<console>:35: error: object graphframes is not a member of package org
       import org.graphframes.GraphFrame
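GraphFrames is not bundled with Spark, so it has to be pulled in as a package. Below is a sketch using the launcher.packages mechanism shown elsewhere on this page; the coordinate is illustrative and must match your Spark and Scala versions.

%%init_spark
# Placeholder coordinate; pick the graphframes build on spark-packages that
# matches your Spark/Scala versions.
launcher.packages = ["graphframes:graphframes:0.5.0-spark2.1-s_2.11"]

Note also that the import path is org.graphframes.GraphFrame; there is no org.apache.graphframes package, which explains the first error shown above.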

[BUG]: Spark submit fails: No such file or directory: '/opt/spark/python/pyspark/./bin/spark-submit'

Describe the bug

I'm trying to run Spark in a Jupyter notebook using the spylon-kernel, but when I try to run any code it just gets stuck at "Initializing scala interpreter..." The error from the Ubuntu terminal (I'm a Windows user running Spark in WSL, Ubuntu 18.04) is attached below.

To Reproduce

Steps to reproduce the behavior:

  1. Install Anaconda3-2021.11-Linux-x86_64, java8 (openjdk-8-jdk), Spark 3.2.0 and spylon-kernel using the steps described in the attached file: spark installation instructions for Windows users.pdf
  2. Open a jupyter notebook and type sc.version or anything
  3. Observe that it is stuck at "Initializing scala interpreter..."
  4. Go to the ubuntu 18.04 terminal and see error

I used the following to set up the Spark environment variables (my Spark is installed in /opt/spark and my Python path is the Anaconda3 Python path):

echo "export SPARK_HOME=/opt/spark" >> ~/.profile
echo "export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin" >> ~/.profile
echo "export PYSPARK_PYTHON=/home/lai/anaconda3/bin/python" >> ~/.profile 
source ~/.profile

Expected behavior
I expect the scala interpreter to run with no problem and sc.version should output 3.2.0

Screenshots
A screenshot from the Jupyter notebook (omitted).

Error in ubuntu terminal is in the additional context section.

Desktop (please complete the following information):

  • OS: Windows 10
  • Browser: Chrome (for jupyter notebook)
  • Version: I have java8, python 3.9, spark 3.2.0 and hadoop 3.2.

Additional context
The error code is as follows:

[MetaKernelApp] ERROR | Exception in message handler:
Traceback (most recent call last):
  File "/home/lai/anaconda3/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 353, in dispatch_shell
    await result
  File "/home/lai/anaconda3/lib/python3.9/site-packages/ipykernel/kernelbase.py", line 643, in execute_request
    reply_content = self.do_execute(
  File "/home/lai/anaconda3/lib/python3.9/site-packages/metakernel/_metakernel.py", line 397, in do_execute
    retval = self.do_execute_direct(code)
  File "/home/lai/anaconda3/lib/python3.9/site-packages/spylon_kernel/scala_kernel.py", line 141, in do_execute_direct
    res = self._scalamagic.eval(code.strip(), raw=False)
  File "/home/lai/anaconda3/lib/python3.9/site-packages/spylon_kernel/scala_magic.py", line 157, in eval
    intp = self._get_scala_interpreter()
  File "/home/lai/anaconda3/lib/python3.9/site-packages/spylon_kernel/scala_magic.py", line 46, in _get_scala_interpreter
    self._interp = get_scala_interpreter()
  File "/home/lai/anaconda3/lib/python3.9/site-packages/spylon_kernel/scala_interpreter.py", line 568, in get_scala_interpreter
    scala_intp = initialize_scala_interpreter()
  File "/home/lai/anaconda3/lib/python3.9/site-packages/spylon_kernel/scala_interpreter.py", line 163, in initialize_scala_interpreter
    spark_session, spark_jvm_helpers, spark_jvm_proc = init_spark()
  File "/home/lai/anaconda3/lib/python3.9/site-packages/spylon_kernel/scala_interpreter.py", line 99, in init_spark
    spark_context = conf.spark_context(application_name)
  File "/home/lai/anaconda3/lib/python3.9/site-packages/spylon/spark/launcher.py", line 521, in spark_context
    return pyspark.SparkContext(appName=application_name, conf=spark_conf)
  File "/opt/spark/python/pyspark/context.py", line 144, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/opt/spark/python/pyspark/context.py", line 339, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/opt/spark/python/pyspark/java_gateway.py", line 98, in launch_gateway
    proc = Popen(command, **popen_kwargs)
  File "/home/lai/anaconda3/lib/python3.9/site-packages/spylon_kernel/scala_interpreter.py", line 94, in Popen
    spark_jvm_proc = subprocess.Popen(*args, **kwargs)
  File "/home/lai/anaconda3/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/lai/anaconda3/lib/python3.9/subprocess.py", line 1821, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/opt/spark/python/pyspark/./bin/spark-submit'

Thanks for your help!
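A quick diagnostic worth running (a suggestion, not a confirmed fix): check whether the process that launched Jupyter actually sees the SPARK_HOME exported in ~/.profile, since a shell profile is not always sourced for the process that starts the notebook server. From any Python notebook on the same server:

import os

# Expect "/opt/spark" if the profile was picked up by the Jupyter process.
print(os.environ.get("SPARK_HOME"))
# Expect True if Spark is really installed where SPARK_HOME should point.
print(os.path.exists("/opt/spark/bin/spark-submit"))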

Can't use case class in the Scala notebook

the version of docker:
jupyter/all-spark-notebook:latest

the way to start docker:
docker run -it --rm -p 8888:8888 jupyter/all-spark-notebook:latest
or
docker ps -a
docker start -i containerID

The steps:

  1. Visit http://localhost:8888
  2. Start a spylon-kernel notebook
  3. Input the following code

import spark.implicits._
val p = spark.sparkContext.textFile ("../Data/person.txt")
val pmap = p.map ( _.split (","))
pmap.collect()

The output:

res0: Array[Array[String]] = Array(Array(Barack, Obama, 53), Array(George, Bush, 68), Array(Bill, Clinton, 68))

case class Persons (first_name:String,last_name: String,age:Int)
val personRDD = pmap.map ( p => Persons (p(0), p(1), p(2).toInt))
personRDD.take(1)

the error message:

org.apache.spark.SparkDriverExecutionException: Execution error
  at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1186)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1711)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
  at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1354)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
  ... 39 elided
Caused by: java.lang.ArrayStoreException: [LPersons;
  at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:90)
  at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:2043)
  at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:2043)
  at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:59)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1182)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1711)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

The above code works in spark-shell. From the error message, I suspect that the driver program does not correctly map the Persons case class onto the RDD partitions.

%%scala magic is busted

I'm working on a sizable PR to add doc and clean up various code paths. Spotted this along the way.

     40             self._interp = get_scala_interpreter()
     41             # Ensure that spark is available in the python session as well.
---> 42             self.kernel.cell_magics['python'].env['spark'] = self._interp.spark_session
     43             self.kernel.cell_magics['python'].env['sc'] = self._interp.sc
     44 

KeyError: 'python'

I'm on the fence about fixing it vs removing support for IPython-to-Scala entirely from this package with the goal of making it more single-purpose: a minimal yet solid Scala+Spark kernel. Pixiedust and https://github.com/maxpoint/spylon can already be used to do scala in ipython.

ExecutorClassLoader error in Spylon notebook

Transfer from jupyter/docker-stacks#690

What docker image you are using?

jupyter/all-spark-notebook

What complete docker command do you run to launch the container (omitting sensitive values)?

docker run -it --rm -p 8888:8888 -v "$PWD":/home/jovyan/work jupyter/all-spark-notebook

What steps do you take once the container is running to reproduce the issue?

  1. Visit http://localhost:8888
  2. Start a Spylon notebook
  3. Execute the following code:
import spark.implicits._

case class A (a1: String, a2: String)
case class B (b1: String, b2: String)

val as = Seq(A("1", "2"), A("2", "2"), A("1", "3"))
val df = spark.createDataFrame(as)
df.as[A].repartition(2, df.col("a1")).mapPartitions(seq => {
    seq.toSeq.groupBy(_.a1).map{
        case (key, group) => {
            val last = group.maxBy(_.a2)
            B(last.a1, last.a2)
        }
    }.iterator
}).collect()

What do you expect to happen?

We should have the following result:

res0: Array[B] = Array(B(1,3), B(2,2))

What actually happens?

It fails with the following exception:

2018-08-03 12:49:01 ERROR ExecutorClassLoader:91 - Failed to check existence of class <root>.package on REPL class server at spark://dc5457f81c19:39449/classes
java.net.URISyntaxException: Illegal character in path at index 35: spark://dc5457f81c19:39449/classes/<root>/package.class

Note that the same code, typed directly in a spark-shell session (launched using a terminal from Jupyter) gives the correct result.

Long-running cell results are sometimes returned to wrong cell

When executing many cells at once, the occasional long-running (by execution time) cell has its output delivered to the next cell. This seems to happen most often with a minute-runtime cell followed by a millisecond-runtime cell.

I am not able to reproduce it on demand; it just happens when it happens.

Outdated versioneer.py broken for Python 3.12

Python 3.12 removed the long deprecated configparser.SafeConfigParser class which old versions of versioneer depended on.

Expected Behavior

pip install spylon-kernel correctly installs the package.

Current Behavior

pip install spylon-kernel fails with this output:

...
              File "/tmp/pip-install-7tqi8u7z/spylon_2393384d9e244f009732dc4b8192d5fb/versioneer.py", line 412, in get_config_from_root
                parser = configparser.SafeConfigParser()
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            AttributeError: module 'configparser' has no attribute 'SafeConfigParser'. Did you mean: 'RawConfigParser'?
            [end of output]
        
        note: This error originates from a subprocess, and is likely not a problem with pip.
      error: subprocess-exited-with-error

      × Getting requirements to build wheel did not run successfully.
      │ exit code: 1
      ╰─> See above for output.

Steps to Reproduce

pip install spylon-kernel, using Python versions 3.12 and higher.

Detailed Description

SafeConfigParser had been deprecated since Python 3.2 and renamed to simply ConfigParser. In Python 3.12 it has finally been removed.
The versioneer.py file in this repository has been generated with versioneer version 0.17, which generated code that uses SafeConfigParser.
New versions of versioneer use ConfigParser instead, so generating a new versioneer.py with a more recent version fixes the issue.

See: https://docs.python.org/3/whatsnew/3.2.html#configparser
and: https://docs.python.org/3/whatsnew/3.12.html#removed

Possible Solution

Update the code generated by versioneer to a more recent version.

Cannot install spylon-kernel on Ubuntu 22

I was installing spylon-kernel on Ubuntu 22 and encountered the following error:

❯ pip install spylon-kernel
Collecting spylon-kernel
  Using cached spylon-kernel-0.4.1.tar.gz (33 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [1 lines of output]
      ERROR: Can not execute `setup.py` since setuptools is not available in the build environment.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

It turned out that setuptools could not be imported.

  try:
      import setuptools
  except ImportError as error:
      print(
          "ERROR: Can not execute `setup.py` since setuptools is not available in "
          "the build environment.",
          file=sys.stderr,
      )
      sys.exit(1)

This was caused by libffi7 not shipping with Ubuntu 22.

ImportError: libffi.so.7: cannot open shared object file: No such file or directory

I installed libffi7 manually according to this and then spylon-kernel installed successfully. It is probably more of an Ubuntu/Python issue.

stop/start does not work

For my spylon notebook I:

  1. Did spark.stop()
  2. Did not restart the notebook kernel
  3. Ran all the %%init_spark and spark to start up a Spark application again

What I found was that most operations work, like reading datasets using the SparkSession and showing them.

However, when I try to use the SparkContext, it thinks it's not running. Here's the code I was running and the error:

val bRetailersList = (sparkSession.sparkContext
                      .broadcast(trainedModel.itemFactors.select("id")
                                 .rdd.map(x => x(0).asInstanceOf[Int]).collect)
                      )

java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
This stopped SparkContext was created at:

org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:236)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:745)

The currently active SparkContext was created at:

org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
org.apache.spark.ml.util.BaseReadWrite$class.sparkSession(ReadWrite.scala:69)
org.apache.spark.ml.util.MLReader.sparkSession(ReadWrite.scala:189)
org.apache.spark.ml.util.BaseReadWrite$class.sc(ReadWrite.scala:80)
org.apache.spark.ml.util.MLReader.sc(ReadWrite.scala:189)
org.apache.spark.ml.recommendation.ALSModel$ALSModelReader.load(ALS.scala:317)
org.apache.spark.ml.recommendation.ALSModel$ALSModelReader.load(ALS.scala:311)
org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:227)
org.apache.spark.ml.recommendation.ALSModel$.load(ALS.scala:297)
<init>(<console>:53)
<init>(<console>:58)
<init>(<console>:60)
<init>(<console>:62)
<init>(<console>:64)
<init>(<console>:66)
<init>(<console>:68)
<init>(<console>:70)
<init>(<console>:72)
<init>(<console>:74)
<init>(<console>:76)

  at org.apache.spark.SparkContext.assertNotStopped(SparkContext.scala:101)
  at org.apache.spark.sql.SparkSession.<init>(SparkSession.scala:80)
  at org.apache.spark.sql.SparkSession.<init>(SparkSession.scala:77)
  ... 44 elided

Not able to import external packages

I am trying to run Flink in spylon; to do that, I execute this code in a cell.

%%init_spark
launcher.packages = ["org.apache.flink:flink-scala_2.11:1.9.1",
"org.apache.flink:flink-streaming-scala_2.11:1.9.1",
"org.apache.flink:flink-connector-kafka_2.11:1.9.1",
"org.apache.flink:flink-avro:1.9.1",
"org.apache.flink:flink-jdbc_2.11:1.9.1",
"org.apache.flink:flink-metrics-prometheus_2.11:1.9.1"]

launcher.jars = ["file:///C:/Users/david/.ivy2/jars/org.apache.flink_flink-scala_2.11-1.9.1.jar"]
launcher.master = "local[4]"
launcher.conf.spark.executor.cores = 8

launcher.conf.spark.app.name = "MyApp"

Then I start the SparkContext and try to import Flink:

import org.apache.flink

But I get this error (screenshot omitted).

In the Spark UI it looks like the packages are installed (screenshot omitted).

The packages also look like they were downloaded, according to launcher.packages (screenshot omitted).

Would anybody know how to make this work?

Thank you very much!!!

Unable to install spylon kernel

When running the below command (after pip install):
python -m spylon_kernel install

The command hangs for a very long time (> 1 hour) with no CLI feedback, and I had to Ctrl+C the process.

Is anyone else facing the same issue?

I'm using the below versions:

  • macOS Catalina
  • python = 3.7.9
  • spark = 2.4.6
  • jupyter = 4.6.3

Python stacktrace is not returned by the Python magic

When using the %%python magic in a spylon-kernel notebook in Scala mode, the Python stack trace is not returned when Python raises an error. Example:

In:

%%python
1 / 0

Out:

  File ".../lib/python3.5/site-packages/metakernel/magics/python_magic.py", line 25, in exec_code
    exec(code, env)

  File "<string>", line 1, in <module>

getUserJars API changed in Spark 2.2.1

No longer accepts a boolean and no longer includes the YARN JARs.

apache/spark@d10c9dc

py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.util.Utils.getUserJars. Trace:
py4j.Py4JException: Method getUserJars([class org.apache.spark.SparkConf, class java.lang.Boolean]) does not exist
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
        at py4j.Gateway.invoke(Gateway.java:274)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)

I think it's OK to try/except around the newer API call then fall back on the old one.
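A sketch of that fallback; the jvm and jconf handles are illustrative stand-ins for a py4j JVMView and a wrapped SparkConf, not spylon-kernel's actual variable names.

from py4j.protocol import Py4JError

def get_user_jars(jvm, jconf):
    """Try the pre-2.2.1 two-argument signature, then fall back to the new one."""
    try:
        # Spark <= 2.2.0: getUserJars(conf, isShell)
        return jvm.org.apache.spark.util.Utils.getUserJars(jconf, True)
    except Py4JError:
        # Spark >= 2.2.1: getUserJars(conf)
        return jvm.org.apache.spark.util.Utils.getUserJars(jconf)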

Cannot get Hive data

Hi I’m Jennifer and I’m having trouble with getting Hive data using scala kernel.
I’m testing exact same code with same hive-site.xml(hive config file) on both spark-shell and jupyterlab spylon-kernel

Here’s my jupyterlab code:
image

Here’s my spark-shell code:
image

There weren’t many references, and the ones that I’ve tried are:

(Optional) Configuring Spark for Hive Access - Hortonworks Data Platform
https://groups.google.com/g/cloud-dataproc-discuss/c/O5wKJDW9kNQ
There’s no hive or spark related logs on JupyterLab. Here’s the logs

[W 2022-09-23 09:47:49.887 LabApp] Could not determine jupyterlab build status without nodejs
[I 2022-09-23 09:47:50.393 ServerApp] Kernel started: afa4234d-48ac-4505-b6a0-e3fa220161cd
[I 2022-09-23 09:47:50.404 ServerApp] Kernel started: 9404ff88-622f-4ba8-86b1-404d648588fc
[MetaKernelApp] ERROR | No such comm target registered: jupyter.widget.control
[MetaKernelApp] WARNING | No such comm: 94bab30b-35b1-48bf-bb51-6000d46df671
[MetaKernelApp] ERROR | No such comm target registered: jupyter.widget.control
[MetaKernelApp] WARNING | No such comm: 936a8763-707c-4076-8471-7ceed85ccb53
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[I 2022-09-23 09:57:52.541 ServerApp] Saving file at /scala-spark/Untitled.ipynb
[I 2022-09-23 09:59:52.580 ServerApp] Saving file at /scala-spark/Untitled.ipynb
[I 2022-09-23 10:21:55.527 ServerApp] Kernel restarted: 9404ff88-622f-4ba8-86b1-404d648588fc
[MetaKernelApp] ERROR | No such comm target registered: jupyter.widget.control
[MetaKernelApp] WARNING | No such comm: 6eb29ba4-7dab-4314-ace4-88488935840b
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
I’ve also tried this on Jupyter Notebook, removed the kernel and reinstalled it but it was the same 😢

What could I’ve been missing?
Where should I check? Please help.

Thanks
Jennifer
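One thing worth checking (a suggestion, not a confirmed fix): whether the session spylon-kernel builds actually has the Hive catalog enabled and can see your hive-site.xml. With the %%init_spark magic documented earlier on this page, that would look roughly like:

%%init_spark
# Ask Spark to use the Hive catalog; hive-site.xml still needs to be visible
# to the driver, for example in $SPARK_HOME/conf.
launcher.conf.spark.sql.catalogImplementation = "hive"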

Unable to use existing spark server with spylon-kernel

I already have a Spark server running on my machine with 1 master and 1 worker node. However, every time I run anything in Scala, it creates its own Spark cluster.

How can I make it use the existing Spark servers that are running? I can do this with pyspark, but not with spylon-kernel.

It failed with spylon-kernel (screenshot omitted).
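With spylon-kernel the master is normally set through the %%init_spark magic, so pointing it at the running standalone cluster should look roughly like the following sketch (the master URL is a placeholder for your own master host and port):

%%init_spark
# Use the already-running standalone master instead of starting a local one.
launcher.master = "spark://<master-host>:7077"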

Share additional variables between Python and Scala

If possible:

  1. Share a Spark session / context initialized from Python with code running in %%scala cells. As it stands, the first Scala cell tries to initialize another Spark context, which fails.
  2. Share additional variables between Scala and %%python cells. This is already somewhat possible using the interpret and last_result calls but can be made prettier (see the sketch after this list).
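For the second point, here is a sketch of what the existing interpret and last_result calls already allow from the Python side (the variable names are illustrative):

from spylon_kernel import get_scala_interpreter

interp = get_scala_interpreter()

# Push a Python value into Scala by interpolating it into the source text,
# then read the Scala result back into Python.
threshold = 10
interp.interpret("""
    val threshold = {}
    threshold * 2
""".format(threshold))
doubled = interp.last_result()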
