
Introduction


Data-Application Performance Monitoring for data engineers


If you enjoy DataFlint, please give us a ⭐️ and join our Slack community for feature requests, support, and more!

What is DataFlint?

DataFlint is an open-source D-APM (Data-Application Performance Monitoring) tool for Apache Spark, built for big data engineers.

DataFlint's mission is to bring the development experience of APM (Application Performance Monitoring) solutions such as DataDog and New Relic to the big data world.

DataFlint installs in minutes as an open-source library and works on top of the existing Spark UI infrastructure, all to help you solve big data performance issues and debug failures!

Demo


Features

  • 📈 Real-time query and cluster status
  • 📊 Query breakdown with performance heat map
  • 📋 Application Run Summary
  • ⚠️ Performance alerts and suggestions
  • 👀 Identify query failures
  • 🤖 Spark AI Assistant

See Our Features for more information

Installation

Scala

Install DataFlint via sbt:

libraryDependencies += "io.dataflint" %% "spark" % "0.1.7"

Then instruct Spark to load the DataFlint plugin:

val spark = SparkSession
    .builder()
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
    ...
    .getOrCreate()

PySpark

Add these two configs to your PySpark session builder:

builder = pyspark.sql.SparkSession.builder \
    ... \
    .config("spark.jars.packages", "io.dataflint:spark_2.12:0.1.7") \
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin") \
    ...
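For reference, a complete session might look like the following sketch (the app name, master URL, and example query are illustrative, not part of DataFlint; adjust the package suffix to your cluster's Scala version):

import pyspark
from pyspark.sql import functions as F

# Build a session with the DataFlint plugin enabled.
spark = (
    pyspark.sql.SparkSession.builder
    .appName("dataflint-example")   # illustrative name
    .master("local[*]")             # example master; use your cluster's
    .config("spark.jars.packages", "io.dataflint:spark_2.12:0.1.7")
    .config("spark.plugins", "io.dataflint.spark.SparkDataflintPlugin")
    .getOrCreate()
)

# Any query will do; DataFlint observes it through the Spark UI.
spark.range(1_000_000).groupBy((F.col("id") % 10).alias("bucket")).count().show()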

Spark Submit

Alternatively, install DataFlint with no code changes, as a Spark Ivy package, by adding these two lines to your spark-submit command:

spark-submit \
  --packages io.dataflint:spark_2.12:0.1.7 \
  --conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \
  ...

Usage

After installation you will see a "DataFlint" button in the Spark UI; click it to start using DataFlint.


Additional installation options

  • Scala 2.13 is also supported; if your Spark cluster uses Scala 2.13, change the package name to io.dataflint:spark_2.13:0.1.7
  • For more installation options, including for Python and the k8s spark-operator, see the Install on Spark docs
  • To install DataFlint in the Spark History Server, for observability on completed runs, see the install on Spark History Server docs
  • To install DataFlint on DataBricks, see the install on Databricks docs

How it Works

DataFlint is installed as a plugin on the Spark driver and on the History Server.

The plugin exposes additional HTTP resources for metrics not available in the Spark UI, and serves a modern SPA web app that fetches data from Spark without needing to refresh the page.
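To confirm the plugin is active on a live session, you can check the session's configuration and find the UI address (a minimal sketch; assumes an existing SparkSession named spark with the configs from the Installation section):

# Should print io.dataflint.spark.SparkDataflintPlugin
print(spark.conf.get("spark.plugins"))

# Base URL of the Spark UI that hosts the DataFlint tab, e.g. http://driver-host:4040
print(spark.sparkContext.uiWebUrl)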

For more information, see how it works docs

Articles

Fixing small files performance issues in Apache Spark using DataFlint

Compatibility Matrix

DataFlint requires Spark version 3.2 and up, and supports both Scala 2.12 and 2.13.

Spark Platform              DataFlint Realtime   DataFlint History Server
Local                       ✅                   ✅
Standalone                  ✅                   ✅
Kubernetes Spark Operator   ✅                   ✅
EMR                         ✅                   ✅
Dataproc                    ✅                   ❓
HDInsights                  ✅                   ❓
Databricks                  ✅                   ❌

For more information, see supported versions docs

Contributors

dependabot[bot], menishmueli, michael72, omerelazar


Issues

Support to turn off auto refresh

Describe the bug
It's not a bug, but an improvement request.

Environment
Spark version: 3.2
Platform: standalone

To Reproduce
Steps to reproduce the behavior:

  1. Submit a Spark query
  2. Open the Spark history page
  3. Open DataFlint
  4. The page refreshes automatically

Expected behavior
Auto-refresh can be turned off.

Screenshots

Screen.Recording.2024-02-01.at.5.46.42.PM.mov

As you can see, the data keeps updating.


AWS Glue Compatibility

Hello everyone, I know that AWS Glue is not in the supported platforms list, but I decided to give it a try and see if it would work.
This attempt failed, resulting in an error when initializing the Spark Context.
I was wondering if this is a known issue, or if anyone has managed to get this working.

Environment
Spark version: 3.3
Platform: Glue 4.0

To Reproduce
Steps to reproduce the behavior:

  1. Download the jar from the Maven repo
  2. Upload it to S3
  3. Add it to the job's dependent jars
  4. Set the plugin config in the SparkSession builder (or as a --conf property)
  5. Run the script
  6. See the error

Expected behavior
Session and context initialized and job running successfully.

Additional context
Returned error:
File "/tmp/job.py", line 78, in
.getOrCreate()
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 269, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 491, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 197, in init
self._do_init(
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 282, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/context.py", line 410, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1585, in call
return_value = get_return_value(
File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at org.apache.spark.dataflint.DataflintSparkUILoader$.install(DataflintSparkUILoader.scala:17)
at io.dataflint.spark.SparkDataflintDriverPlugin.registerMetrics(SparkDataflintPlugin.scala:26)
at org.apache.spark.internal.plugin.DriverPluginContainer.$anonfun$registerMetrics$1(PluginContainer.scala:75)
at org.apache.spark.internal.plugin.DriverPluginContainer.$anonfun$registerMetrics$1$adapted(PluginContainer.scala:74)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.internal.plugin.DriverPluginContainer.registerMetrics(PluginContainer.scala:74)
at org.apache.spark.SparkContext.$anonfun$new$41(SparkContext.scala:681)
at org.apache.spark.SparkContext.$anonfun$new$41$adapted(SparkContext.scala:681)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.SparkContext.(SparkContext.scala:681)
at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)

Query summary does not show join operator

(been playing a bit, this tool looks pretty cool, thanks for sharing it!)

Tried it with a basic two-way join and some aggregation (on TPC-DS tables).
When viewing a SQL query (summary --> clicking a query), the join operator (ShuffledHashJoin) is presented as SELECT, both in the BASIC view and in the ADVANCED view (see attached screenshots).

I am running Spark 3.5.0; this is the physical plan:

AdaptiveSparkPlan (31)
+- == Final Plan ==
   TakeOrderedAndProject (20)
   +- * HashAggregate (19)
      +- AQEShuffleRead (18)
         +- ShuffleQueryStage (17), Statistics(sizeInBytes=399.0 KiB, rowCount=1.28E+4)
            +- Exchange (16)
               +- * HashAggregate (15)
                  +- * Project (14)
                     +- * ShuffledHashJoin Inner BuildRight (13)
                        :- AQEShuffleRead (6)
                        :  +- ShuffleQueryStage (5), Statistics(sizeInBytes=87.9 MiB, rowCount=2.88E+6)
                        :     +- Exchange (4)
                        :        +- * Filter (3)
                        :           +- * ColumnarToRow (2)
                        :              +- Scan parquet  (1)
                        +- AQEShuffleRead (12)
                           +- ShuffleQueryStage (11), Statistics(sizeInBytes=281.3 KiB, rowCount=1.80E+4)
                              +- Exchange (10)
                                 +- * Filter (9)
                                    +- * ColumnarToRow (8)
                                       +- Scan parquet  (7)

This is how Spark draws it (screenshot attached).

Keep getting "Server Disconnected" error. Bare-metal Cloudera setup.

Describe the bug
The DataFlint tab in the Spark UI keeps showing "server disconnected" and then refreshes. It also doesn't show Spark SQL queries (ongoing or completed) and keeps saying there are no Spark SQL queries.

Environment
Spark version: 3.2
Platform: standalone / Cloudera

To Reproduce
Steps to reproduce the behavior:

  1. Go to the DataFlint tab

It will keep showing "server disconnected", then refresh and show the page for a few moments, and then show "server disconnected" again.

Expected behavior
Two expectations:

  • If the server disconnected, I should still be able to view the existing data if possible.
  • If the job is running and the Spark UI is working fine, DataFlint shouldn't show "server disconnected".

Screenshots

(screenshot attached)


Seeing negative numbers in the Duration

Describe the bug
In the "Duration" field I see negative values (e.g. -34735s), and the DCU is negative as well.

Environment
Spark version: 3.2.0
Scala: 2.12.14
DataFlint jar version: 0.1.1
Cloudera on-prem environment

Screenshots

(screenshots attached)
