
spark-metrics's Introduction

Apache Spark metrics extensions

This is a repository for Apache Spark metrics-related custom classes (e.g. sources, sinks). We originally tried to extend the Spark metrics subsystem with a Prometheus sink, but the PR was not merged upstream. To let others use Prometheus anyway, we have externalized the sink and made it available through this repository, so there is no need to build an Apache Spark fork.
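
A minimal metrics.properties sketch for enabling the sink, assembled from the property names that recur in the issue reports below (the pushgateway address is a placeholder; adjust values for your environment):

*.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
*.sink.prometheus.pushgateway-address-protocol=http
*.sink.prometheus.pushgateway-address=<pushgateway-host>:9091
*.sink.prometheus.period=10
*.sink.prometheus.unit=seconds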

For further information on how we use this extension and the Prometheus sink at Banzai Cloud, please read these posts:


spark-metrics's Issues

Need Hostname value in instance key instead of app id

appid is a redundant value in both job and instance. We want to correlate the executor and driver with the Spark worker, so we need the hostname information. Can we have an optional property to put the hostname in instance instead of appId?
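
For reference, the metrics configurations quoted later on this page set a property that appears to cover this request (whether it applies depends on the sink version in use):

# Enable HostName in Instance instead of Appid (default is false, i.e. instance=${appid})
*.sink.prometheus.enable-hostname-in-instance=true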

Filtering spark metrics before writing in prometheus

Hello,

I am trying to create my own custom Prometheus sink, so that I can filter some of the metrics that are exposed to Prometheus.

To do that I am writing my own PrometheusSink containing a Reporter class that extends the ScheduledReporter class.

class PrometheusSink(
    val property: Properties,
    val registry: MetricRegistry,
    securityMgr: org.apache.spark.SecurityManager)
  extends Sink with Logging {

  protected class Reporter(registry: MetricRegistry)
    extends ScheduledReporter(
      registry,
      "prometheus-reporter",
      new CustomMetricFilter(),
      TimeUnit.SECONDS,
      TimeUnit.MILLISECONDS) {

That filter is used to generate the report:

  public void report() {
        synchronized (this) {
            report(registry.getGauges(filter),
                    registry.getCounters(filter),
                    registry.getHistograms(filter),
                    registry.getMeters(filter),
                    registry.getTimers(filter));
        }
    }

I changed the filter so that it returns the value I need, depending on whether the metric needs to be filtered or not.

public class CustomMetricFilter implements MetricFilter {
    /**
     * Matches no metrics: the filter rejects everything, so nothing should be exported.
     */
    public boolean matches(String name, Metric metric) {
        return false;
    }
}

The result I was expecting is that all the metrics would be filtered in my application and none of them would appear in Prometheus. But surprisingly, no matter what I write in the matches function of the class, all of them are always returned.

It seems that the filter is not taken into account and the default implementation is used instead.

public SortedMap<String, Counter> getCounters() {
        return getCounters(MetricFilter.ALL);
    }

Metrics namespace

Describe the bug
I want to replace the job value field which, from what I understand, should be done by setting spark.metrics.namespace= to the appropriate value, like ${spark.app.name}.
It is not working; the job value stays set to the Spark app.id.

I also briefly tried the .sink.prometheus.metrics-name-capture-regex= without result.

Steps to reproduce the issue:

spark-submit --master yarn --queue default --conf spark.metrics.conf=/mnt/code/infra-hdp-test/metrics.conf  --deploy-mode cluster --repositories https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases --packages org.apache.hadoop:hadoop-aws:2.7.7,org.elasticsearch:elasticsearch-spark-20_2.11:6.5.4,org.apache.hadoop:hadoop-aws:2.7.7,org.elasticsearch:elasticsearch-spark-20_2.11:6.5.4,com.banzaicloud:spark-metrics_2.11:2.3-2.1.0,io.prometheus:simpleclient:0.3.0,io.prometheus:simpleclient_dropwizard:0.3.0,io.prometheus:simpleclient_pushgateway:0.3.0,io.dropwizard.metrics:metrics-core:3.1.2  /tmp/test.py

metrics.conf

*.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink

# Prometheus pushgateway address
*.sink.prometheus.pushgateway-address-protocol=https
*.sink.prometheus.pushgateway-address=ourpushgatewayaddress

#*.sink.prometheus.period=<period> - defaults to 10
#*.sink.prometheus.unit=< unit> - defaults to seconds (TimeUnit.SECONDS)
#*.sink.prometheus.pushgateway-enable-timestamp=<enable/disable metrics timestamp> - defaults to false
# Metrics name processing (version 2.3-1.1.0 +)
#*.sink.prometheus.metrics-name-capture-regex=<regular expression to capture the metric name sections to be replaced>
#*.sink.prometheus.metrics-name-replacement=<replacement for the captured sections>
#*.sink.prometheus.labels=<labels in label=value format separated by comma>
# Support for JMX Collector (version 2.3-2.0.0 +)
*.sink.prometheus.enable-dropwizard-collector=true
*.sink.prometheus.enable-jmx-collector=true
*.sink.prometheus.jmx-collector-config=/mnt/code/infra-hdp-test/jmxCollector.yaml
# Enable HostName in Instance instead of Appid (Default value is false i.e. instance=${appid})
*.sink.prometheus.enable-hostname-in-instance=true
# Enable JVM metrics source for all instances by class name
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource
spark.metrics.namespace='chopchop'

test.py

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import col

conf = (
        SparkConf()
        .set('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
    )
spark = (
    SparkSession.builder
    .config(conf=conf)
    .getOrCreate()
)


spark.range(100).withColumn("test", col("id")/ 2).select(col("test").cast(IntegerType())).distinct().count()

Expected behavior

Screenshots

Additional context

spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1.3.0.1.0-187
      /_/
                        
Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_232
Branch HEAD
Compiled by user jenkins on 2018-09-19T10:10:07Z
Revision fe7bed1ca174a6687ebd2aa0f8ba5fb7bf668399
Url git@github.com:hortonworks/spark2.git

Release Spark provided fix to maven

Describe the bug

spark-core is a compile dependency for the artifact in maven central, causing merge issues when building an assembly.

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.4.4</version>
</dependency>

https://repo1.maven.org/maven2/com/banzaicloud/spark-metrics_2.11/2.4-1.0.5/spark-metrics_2.11-2.4-1.0.5.pom

Expected behavior

spark-core should be a provided dependency.
As a workaround I had to add

"com.banzaicloud" %% "spark-metrics"  % "2.4-1.0.5" excludeAll(ExclusionRule(organization = "org.apache.spark"))

to my dependencies.

Additional context

Fix has already been applied, but not released.
9c5a517

pushgateway info log message to "Fix pushed metrics!"

Prometheus reports that the help messages for some metrics are in conflict. For example, the driver and executor help messages conflict, as seen in the logs:

time="2019-05-08T22:16:33Z" level=info msg="Metric families 
'name:\"java_lang_MemoryPool_CollectionUsage_used\" help:\"java.lang.management.MemoryUsage (java.lang<type=MemoryPool, name=Tenured Gen><CollectionUsage>used)\" type:UNTYPED 
### driver metrics collected for java_lang_MemoryPool_CollectionUsage_used 
metric:<label:<name:\"app_name\" value:\"\" > label:<name:\"instance\" value:\"\" > label:<name:\"job\" value:\"recent\" > label:<name:\"name\" value:\"Tenured Gen\" > label:<name:\"role\" value:\"driver\" > untyped:<value:6.2114752e+07 > > 
metric:<label:<name:\"app_name\" value:\"\" > label:<name:\"instance\" value:\"\" > label:<name:\"job\" value:\"recent\" > label:<name:\"name\" value:\"Survivor Space\" > label:<name:\"role\" value:\"driver\" > untyped:<value:1.433232e+06 > > 
metric:<label:<name:\"app_name\" value:\"\" > label:<name:\"instance\" value:\"\" > label:<name:\"job\" value:\"recent\" > label:<name:\"name\" value:\"Eden Space\" > label:<name:\"role\" value:\"driver\" > untyped:<value:0 > > ' 
and 'name:\"java_lang_MemoryPool_CollectionUsage_used\" help:\"java.lang.management.MemoryUsage (java.lang<type=MemoryPool, name=PS Old Gen><CollectionUsage>used)\" type:UNTYPED 
### sample of executor metrics for java_lang_MemoryPool_CollectionUsage_used
metric:<label:<name:\"app_name\" value:\"\" > label:<name:\"instance\" value:\"\" > label:<name:\"job\" value:\"recent\" > label:<name:\"name\" value:\"PS Old Gen\" > label:<name:\"number\" value:\"5\" > label:<
metric:<label:<name:\"app_name\" value:\"\" > label:<name:\"instance\" value:\"\" > label:<name:\"job\" value:\"recent\" > label:<name:\"name\" value:\"PS Eden Space\" > label:<name:\"number\" value:\"5\" > labe
metric:<label:<name:\"app_name\" value:\"\" > label:<name:\"instance\" value:\"\" > label:<name:\"job\" value:\"recent\" > label:<name:\"name\" value:\"PS Survivor Space\" > label:<name:\"number\" value:\"5\" > 
metric:<label:<name:\"app_name\" value:\"\" > label:<name:\"instance\" value:\"\" > label:<name:\"job\" value:\"recent\" > label:<name:\"name\" value:\"PS Old Gen\" > label:<name:\"number\" value:\"6\" > label:<
metric:<label:<name:\"app_name\" value:\"\" > label:<name:\"instance\" value:\"\" > label:<name:\"job\" value:\"recent\" > label:<name:\"name\" value:\"PS Eden Space\" > label:<name:\"number\" value:\"6\" > labe
have inconsistent help strings. The latter will have priority. This is bad. Fix your pushed metrics!" source="diskmetricstore.go:126"

The java_lang_MemoryPool_CollectionUsage_used help message should be the same for the driver and executor, but it looks like it is used twice: once with the "Tenured Gen" help and once with the "PS Old Gen" help. Is there an updated lib or config that can help address this?
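
One possible mitigation, sketched under the assumption that the jmx-collector-config follows the standard jmx_exporter rule format (pattern/name/labels/help/type keys): rewrite the MemoryPool MBean into one metric with a fixed help string so the driver and executor pushes agree. The pattern mirrors the MBean string visible in the log above; the label name and help text are illustrative and not verified against this sink.

rules:
  - pattern: 'java.lang<type=MemoryPool, name=(.+)><CollectionUsage>used'
    name: java_lang_MemoryPool_CollectionUsage_used
    labels:
      pool: "$1"
    help: "MemoryPool CollectionUsage used (bytes)"
    type: GAUGE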

Add complete example

First of all, thank you for the great job you did.
But it is very hard to use this example.
The ideal solution would be to create Docker images with Prometheus, Spark, etc., where all property files are put into the correct places and all environment variables are set.
A spark-submit command with all its arguments can be put into some sh file, so to see how all of this works a user just has to do the following:

  1. Start the Docker containers (maybe via a docker-compose up command)
  2. Connect to the Spark container
  3. Invoke this sh file
  4. Wait a couple of minutes until the dummy Spark job has finished
  5. See the metrics in Prometheus

For now every user has to do all of this from scratch and spend a lot of time, especially when dealing with metrics for the first time.

Metrics filter doesn't work

Describe the bug

The MetricFilter doesn't filter the metrics.

Steps to reproduce the issue:

Set the corresponding parameters, create an implementation that filters out everything, and observe that everything is still on the pushgateway.

Expected behavior

Well, it should filter the metrics :D

Additional context

I spent some time pondering the code and came to the conclusion that the MetricFilter we register never gets used. The report will use everything in SparkDropwizardExports, created while constructing the sink, and not the arguments passed to the report function (which are the ones that are filtered). A viable approach is creating a new SparkDropwizardExports on every report, built exclusively from the metrics passed to the report method.
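
A minimal sketch of that idea, with illustrative names and under the assumption that the sink exports via a Dropwizard-to-Prometheus collector: build a fresh MetricRegistry from the already-filtered metrics that ScheduledReporter hands to report(), and export only that registry on each push cycle.

import java.util.SortedMap
import scala.collection.JavaConverters._
import com.codahale.metrics.{Counter, Gauge, Histogram, Meter, MetricRegistry, Timer}
import io.prometheus.client.CollectorRegistry
import io.prometheus.client.dropwizard.DropwizardExports

// Hypothetical helper: rebuild the exported collectors from the filtered
// metrics that ScheduledReporter passes to report(), instead of reusing
// the full registry captured when the sink was constructed.
def collectorForFilteredMetrics(gauges: SortedMap[String, Gauge[_]],
                                counters: SortedMap[String, Counter],
                                histograms: SortedMap[String, Histogram],
                                meters: SortedMap[String, Meter],
                                timers: SortedMap[String, Timer]): CollectorRegistry = {
  val filtered = new MetricRegistry
  gauges.asScala.foreach { case (name, m) => filtered.register(name, m) }
  counters.asScala.foreach { case (name, m) => filtered.register(name, m) }
  histograms.asScala.foreach { case (name, m) => filtered.register(name, m) }
  meters.asScala.foreach { case (name, m) => filtered.register(name, m) }
  timers.asScala.foreach { case (name, m) => filtered.register(name, m) }

  // A fresh CollectorRegistry per report avoids duplicate registration and
  // guarantees that only the filtered metrics are pushed in this cycle.
  val pushRegistry = new CollectorRegistry()
  new DropwizardExports(filtered).register(pushRegistry)
  pushRegistry
}

A report(...) override in the sink's Reporter could then push this registry to the gateway instead of the registry captured at construction time.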

specify job name and pushgateway relevant labels via config

Is your feature request related to a problem? Please describe.

For each Spark application a separate pushgateway job, including specific labels, is created. As the pushgateway doesn't support a TTL for groups, we should be able to configure the pushgateway sink to push metrics of the same job (let's say based on the Spark app name) into the same group.

Describe the solution you'd like to see

Implement a configuration property for the sink that points to a label that should be used as the pushgateway job name, and specifies which labels should be part of the URL used to push to the pushgateway.

I'm happy to provide a PR for this 🎉

Describe alternatives you've considered

  • delete groups at pushgateway manually

No Metrics From Spark Executors (Classes are being instantiated)

I am receiving metrics from the Spark driver. I can see the PrometheusSink class being initialized on the Spark executor but nothing is showing up in the pushgateway.

I also don't see any logs/reason why it wouldn't be pushing, even though I have logs at DEBUG level.

Here are my executor logs:

2020-08-12 20:26:05,522 INFO [org.apache.spark.banzaicloud.metrics.sink.PrometheusSink] - Initializing Prometheus Sink...
2020-08-12 20:26:05,522 INFO [org.apache.spark.banzaicloud.metrics.sink.PrometheusSink] - Metrics polling period -> 30 SECONDS
2020-08-12 20:26:05,523 INFO [org.apache.spark.banzaicloud.metrics.sink.PrometheusSink] - Metrics timestamp enabled -> false
2020-08-12 20:26:05,523 INFO [org.apache.spark.banzaicloud.metrics.sink.PrometheusSink] - pushgateway-address -> services-xxxx/xxxxx-pushgateway-dev
2020-08-12 20:26:05,523 INFO [org.apache.spark.banzaicloud.metrics.sink.PrometheusSink] - pushgateway-address-protocol -> https
2020-08-12 20:26:05,523 INFO [org.apache.spark.banzaicloud.metrics.sink.PrometheusSink] - metrics-name-capture-regex -> 
2020-08-12 20:26:05,524 INFO [org.apache.spark.banzaicloud.metrics.sink.PrometheusSink] - metrics-name-replacement -> 
2020-08-12 20:26:05,525 INFO [org.apache.spark.banzaicloud.metrics.sink.PrometheusSink] - labels -> Map()
2020-08-12 20:26:05,525 INFO [org.apache.spark.banzaicloud.metrics.sink.PrometheusSink] - jmx-collector-config -> jmxCollector.yaml
2020-08-12 20:26:35,686 INFO [org.apache.spark.banzaicloud.metrics.sink.PrometheusSink] - metricsNamespace=None, sparkAppName=Some(nokia-sps-decoder-Spark), sparkAppId=Some(application_1584132323547_997813), executorId=Some(6)
2020-08-12 20:26:35,687 INFO [org.apache.spark.banzaicloud.metrics.sink.PrometheusSink] - role=executor, job=application_1584132323547_997813

Is there a way I can get more information on why this isn't working, i.e. more verbose logs?
I've provided my other related configurations as reference below.

Spark Submit:

spark2-submit --master=yarn \
    --deploy-mode=cluster \
    --conf "spark.executor.extraClassPath=spark-metrics_2.11-2.4-1.0.5.jar:metrics-core-3.1.2.jar:simpleclient-0.8.1.jar:simpleclient_dropwizard-0.8.1.jar:simpleclient_pushgateway-0.8.1.jar:simpleclient_common-0.8.1.jar:collector-0.12.0.jar:slf4j-api-1.7.16.jar:guava-26.0-android.jar" \
    --conf "spark.driver.extraClassPath=spark-metrics_2.11-2.4-1.0.5.jar:metrics-core-3.1.2.jar:simpleclient-0.8.1.jar:simpleclient_dropwizard-0.8.1.jar:simpleclient_pushgateway-0.8.1.jar:simpleclient_common-0.8.1.jar:collector-0.12.0.jar:slf4j-api-1.7.16.jar:guava-26.0-android.jar" \
	--executor-memory 4G \
    --num-executors 5 \
    --driver-memory 4G \
    --executor-cores 4 \
	--conf spark.yarn.queue=root.high \
	--files "Config/DEV.ini,Build/src/main/resources/kafka-DEV.jaas,${USRENV},Build/src/main/resources/metrics-prod.properties,/xxxx/truststore.jks,Build/src/main/resources/jmxCollector.yaml,Build/src/main/resources/log4j.properties" \
	--class xxxxx.ProcessStream \
    --jars hdfs://nameservice1/Common/Code/Lib/FOX/Alarming/v2/fox-alarming-v2.jar,spark-metrics_2.11-2.4-1.0.5.jar,metrics-core-3.1.2.jar,simpleclient-0.8.1.jar,simpleclient_dropwizard-0.8.1.jar,simpleclient_pushgateway-0.8.1.jar,simpleclient_common-0.8.1.jar,collector-0.12.0.jar,slf4j-api-1.7.16.jar,guava-26.0-android.jar \
    --conf "spark.metrics.conf=metrics-prod.properties" \
	--conf "spark.driver.extraJavaOptions=-Djavax.net.ssl.trustStore=truststore.jks -Djavax.net.ssl.trustStorePassword=xxxx-Dspark.metrics.conf=metrics-prod.properties -Dlog4j.configuration=log4j.properties" \
	--conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=truststore.jks -Djavax.net.ssl.trustStorePassword=xxxx-Dspark.metrics.conf=metrics-prod.properties -Dlog4j.configuration=log4j.properties" \
	--conf spark.executor.memoryOverhead=5G \
    --conf spark.driver.memoryOverhead=5G \
	Build/target/nokia-sps-decoder-jar-with-dependencies.jar DEV.ini kafka-DEV.jaas

metrics-prod.properties

# Enable Prometheus for all instances by class name
*.sink.prometheus.class=org.apache.spark.banzaicloud.metrics.sink.PrometheusSink
# Prometheus pushgateway address
*.sink.prometheus.pushgateway-address-protocol=https
*.sink.prometheus.pushgateway-address=services-xxxxx/xxxx-pushgateway-dev
*.sink.prometheus.period=30
*.sink.prometheus.unit=seconds
*.sink.prometheus.pushgateway-enable-timestamp=false
# Support for JMX Collector (version 2.3-2.0.0 +)
*.sink.prometheus.enable-dropwizard-collector=false
*.sink.prometheus.enable-jmx-collector=true
*.sink.prometheus.jmx-collector-config=jmxCollector.yaml
# Enable HostName in Instance instead of Appid (Default value is false i.e. instance=${appid})
*.sink.prometheus.enable-hostname-in-instance=true
# Enable JVM metrics source for all instances by class name
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource

jmxCollector.yaml

lowercaseOutputName: false
lowercaseOutputLabelNames: false
whitelistObjectNames: ["*:*"]

Pushgateway UI: (screenshot omitted)

I'd really appreciate any guidance, as it stands I won't be able to use the library without the executor metrics :(

Thanks!

unable to send metrics to Prometheus Push Gateway

Describe the bug
Unable to push the data to Push Gateway

Here is the error stack

19/09/19 08:12:17 ERROR DatabricksMain$DBUncaughtExceptionHandler: Uncaught exception in thread PrometheusMetricsExporter!
java.lang.NoSuchFieldError: timestampMs
	at com.banzaicloud.spark.metrics.DropwizardExports$$anonfun$collect$1$$anonfun$apply$1.apply(DropwizardExports.scala:39)
	at com.banzaicloud.spark.metrics.DropwizardExports$$anonfun$collect$1$$anonfun$apply$1.apply(DropwizardExports.scala:34)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at com.banzaicloud.spark.metrics.DropwizardExports$$anonfun$collect$1.apply(DropwizardExports.scala:33)
	at com.banzaicloud.spark.metrics.DropwizardExports$$anonfun$collect$1.apply(DropwizardExports.scala:29)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at com.banzaicloud.spark.metrics.DropwizardExports.collect(DropwizardExports.scala:28)
	at com.banzaicloud.spark.metrics.DropwizardExportsWithMetricNameCaptureAndReplace.collect(DropwizardExportsWithMetricNameCaptureAndReplace.scala:48)
	at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.findNextElement(CollectorRegistry.java:72)
	at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.nextElement(CollectorRegistry.java:87)
	at io.prometheus.client.CollectorRegistry$MetricFamilySamplesEnumeration.nextElement(CollectorRegistry.java:57)
	at java.util.Collections.list(Collections.java:5240)
	at io.prometheus.client.exporter.common.TextFormat.write004(TextFormat.java:17)
	at com.databricks.DatabricksMain.com$databricks$DatabricksMain$$prometheusMetricsString(DatabricksMain.scala:315)
	at com.databricks.DatabricksMain$$anonfun$startPrometheusMetricsExport$1.apply$mcV$sp(DatabricksMain.scala:298)
	at com.databricks.DatabricksMain$$anonfun$startPrometheusMetricsExport$1.apply(DatabricksMain.scala:297)
	at com.databricks.DatabricksMain$$anonfun$startPrometheusMetricsExport$1.apply(DatabricksMain.scala:297)
	at com.databricks.util.UntrustedUtils$.tryLog(UntrustedUtils.scala:98)
	at com.databricks.threading.NamedTimer$$anon$1$$anonfun$run$1.apply(NamedTimer.scala:53)
	at com.databricks.threading.NamedTimer$$anon$1$$anonfun$run$1.apply(NamedTimer.scala:53)
	at com.databricks.logging.UsageLogging$$anonfun$recordOperation$1.apply(UsageLogging.scala:359)
	at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:235)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
	at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:230)
	at com.databricks.threading.NamedTimer$$anon$1.withAttributionContext(NamedTimer.scala:50)
	at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:268)
	at com.databricks.threading.NamedTimer$$anon$1.withAttributionTags(NamedTimer.scala:50)
	at com.databricks.logging.UsageLogging$class.recordOperation(UsageLogging.scala:345)
	at com.databricks.threading.NamedTimer$$anon$1.recordOperation(NamedTimer.scala:50)
	at com.databricks.threading.NamedTimer$$anon$1.run(NamedTimer.scala:52)
	at java.util.TimerThread.mainLoop(Timer.java:555)
	at java.util.TimerThread.run(Timer.java:505)

My Metrics.properties file

# Enable Prometheus for all instances by class name
*.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
# Prometheus pushgateway address
*.sink.prometheus.pushgateway-address-protocol=http
*.sink.prometheus.pushgateway-address=<IP>:9091
*.sink.prometheus.period=10
*.sink.prometheus.unit=seconds

# Metrics name processing (version 2.3-1.1.0 +)
#*.sink.prometheus.metrics-name-capture-regex=(.+driver_)(.+)
#*.sink.prometheus.metrics-name-replacement=$2
master.sink.prometheus.metrics-name-capture-regex=(.*)
master.sink.prometheus.metrics-name-replacement=master_$1
worker.sink.prometheus.metrics-name-capture-regex=(.*)
worker.sink.prometheus.metrics-name-replacement=worker_$1
executor.sink.prometheus.metrics-name-capture-regex=(.*)
executor.sink.prometheus.metrics-name-replacement=executor_$1
driver.sink.prometheus.metrics-name-capture-regex=(.*)
driver.sink.prometheus.metrics-name-replacement=driver_$1
applications.sink.prometheus.metrics-name-capture-regex=(.*)
applications.sink.prometheus.metrics-name-replacement=app_$1
*.sink.prometheus.labels=name=vijaytest1,name2=test123
# Support for JMX Collector (version 2.3-2.0.0 +)
#*.sink.prometheus.enable-dropwizard-collector=false
#*.sink.prometheus.enable-jmx-collector=true
#*.sink.prometheus.jmx-collector-config=/some/path/jvm_exporter.yml

# Enable HostName in Instance instead of Appid (Default value is false i.e. instance=${appid})
*.sink.prometheus.enable-hostname-in-instance=true

# Enable JVM metrics source for all instances by class name
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource

Steps to reproduce the issue:
I copied the Jar from Maven - https://mvnrepository.com/artifact/com.banzaicloud/spark-metrics_2.11/2.3-2.1.0

Expected behavior

Screenshots

Additional context
Pushgateway version: 0.9.1
Prometheus version: 2.12.0

Similar issues have been reported census-instrumentation/opencensus-java#1215

Configure sink to stop sending job as label/group-key

Is your feature request related to a problem? Please describe.
Whenever I'm running spark-metrics on YARN, I see all metrics being tagged with job=application_yarn_id. This means when I restart the application I create new groups on pushgateway, increasing memory usage by a lot.

Describe the solution you'd like to see
A way to remove job from labels and group-keys. For example, if we override the job on both labels and group-key I would expect spark-metrics not to send application_yarn_id

Describe alternatives you've considered
I tried to configure my application to override the job group-key, with no success:

*.sink.prometheus.group-key=job=streams"

Sink class com.banzaicloud.spark.metrics.sink.PrometheusSink cannot be instantiated

Hello,
on spark 2.3.2
here is my spark-submit:
spark-submit --num-executors 3 --master yarn --deploy-mode client --files /tmp/log4j2.xml#log4j2.xml --repositories https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases --packages com.banzaicloud:spark-metrics_2.11:2.3-2.0.4,io.prometheus:simpleclient:0.3.0,io.prometheus:simpleclient_dropwizard:0.3.0,io.prometheus:simpleclient_pushgateway:0.3.0,io.dropwizard.metrics:metrics-core:3.1.2 --jars /tmp/metrics-core-3.1.2.jar,/tmp/simpleclient-0.3.0.jar,/tmp/simpleclient_dropwizard-0.3.0.jar,/tmp/simpleclient_pushgateway-0.3.0.jar,/tmp/spark-metrics_2.11-2.3-2.0.4.jar --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j2.xml" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j2.xml" --conf "spark.metrics.conf=/tmp/metrics.properties" --class myjob /tmp/drools-kie-spark-1.5.jar -droolsEmbedded

I put metrics.properties on each executor.

Thanks for your help

Does spark-metrics work with spark 2.3

I am trying to use spark-metrics with Spark 2.3 but run into the following error.

2018-05-02 20:49:09 INFO  BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, spark-pi-9e38353c8a563fdfa274d7151aa3d8b3-driver-svc.spark.svc, 7079, None)
2018-05-02 20:49:09 ERROR MetricsSystem:70 - Sink class com.banzaicloud.spark.metrics.sink.PrometheusSink cannot be instantiated
2018-05-02 20:49:09 ERROR SparkContext:91 - Error initializing SparkContext.
java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:200)
	at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:194)
	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
	at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
	at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
	at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:194)
	at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:513)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2486)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
	at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
	at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
Caused by: java.lang.AbstractMethodError
	at org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:99)
	at com.banzaicloud.spark.metrics.sink.PrometheusSink.initializeLogIfNecessary(PrometheusSink.scala:36)
	at org.apache.spark.internal.Logging$class.log(Logging.scala:46)
	at com.banzaicloud.spark.metrics.sink.PrometheusSink.log(PrometheusSink.scala:36)
	at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
	at com.banzaicloud.spark.metrics.sink.PrometheusSink.logInfo(PrometheusSink.scala:36)
	at com.banzaicloud.spark.metrics.sink.PrometheusSink.<init>(PrometheusSink.scala:136)

It seems like there is a compatibility issue. Would you agree?

Spark Metrics Stop Pushing After Pushgateway Restarts

I've been having some problems with my Prometheus Pushgateway in Kubernetes, which result in the pod restarting. I noticed that after a restart, many of the metrics stop being pushed to the Pushgateway from the Spark-Prometheus plugin.

Steps to reproduce the issue:
This should be easily reproducible by restarting the Prometheus Pushgateway when an active Spark app is pushing metrics.

Expected behavior
I would expect that when the Prometheus Pushgateway goes down, the plugin logs an error message, and once the gateway comes back up, the plugin starts pushing to it again, since these are HTTP requests rather than a long-running connection.

Is there a way to achieve this behavior? I appreciate the help.

Screenshot, Pushgateway after restart (Notice only 3 nodes are pushing):


Prometheus Sink is not working with SparkPi

spark-submit command:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client /usr/lib/spark/examples/jars/spark-examples_2.11-2.4.7-amzn-1.jar 1000 --package com.banzaicloud:spark-metrics_2.11:2.4-1.0.5 --jars /home/hadoop/spark-metrics_2.11-2.4-1.0.5.jar,/home/hadoop/simpleclient-0.8.1.jar,/home/hadoop/simpleclient_dropwizard-0.8.1.jar,/home/hadoop/simpleclient_pushgateway-0.8.1.jar,/home/hadoop/simpleclient_common-0.8.1.jar,/home/hadoop/collector-0.12.0.jar,/home/hadoop/snakeyaml-1.26.jar --conf spark.metrics.conf=/home/hadoop/metrics.properties --conf spark.jars.packages=com.banzaicloud:spark-metrics_2.11:2.4-1.0.5

but no metrics are reported to the pushgateway

Parsing error

I'm unable to send any metrics to the Prometheus pushgateway; I'm getting the following error:

2019-01-16 08:16:41 INFO  PrometheusSink:54 - metricsNamespace=None, sparkAppName=None, sparkAppId=None, executorId=None
2019-01-16 08:16:41 INFO  PrometheusSink:54 - role=shuffle, job=shuffle
2019-01-16 08:16:41 INFO  PushGatewayWithTimestamp:217 - Sending metrics data to 'http://fkpr-prometheus-pushgateway.fkpr:9091/metrics/job/shuffle/role/shuffle'
2019-01-16 08:16:41 INFO  PushGatewayWithTimestamp:247 - Error response from http://fkpr-prometheus-pushgateway.fkpr:9091/metrics/job/shuffle/role/shuffle
2019-01-16 08:16:41 INFO  PushGatewayWithTimestamp:250 - text format parsing error in line 244: second HELP line for metric name "HiveExternalCatalog_fileCacheHits"
2019-01-16 08:16:41 ERROR PushGatewayWithTimestamp:255 - Sending metrics failed due to: 
java.io.IOException: Response code from http://fkpr-prometheus-pushgateway.fkpr:9091/metrics/job/shuffle/role/shuffle was 400
	at com.banzaicloud.metrics.prometheus.client.exporter.PushGatewayWithTimestamp.doRequest(PushGatewayWithTimestamp.java:252)
	at com.banzaicloud.metrics.prometheus.client.exporter.PushGatewayWithTimestamp.pushAdd(PushGatewayWithTimestamp.java:168)
	at com.banzaicloud.spark.metrics.sink.PrometheusSink$Reporter.report(PrometheusSink.scala:122)
	at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162)
	at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Spark 2.4.0: Sink class com.banzaicloud.spark.metrics.sink.PrometheusSink cannot be instantiated

I'm using spark 2.4.0.
When I follow the steps mentioned in https://github.com/banzaicloud/spark-metrics/blob/master/PrometheusSink.md, I get the error below:

2019-04-21 19:07:02 ERROR YarnClientSchedulerBackend:70 - YARN application has exited unexpectedly with state FAILED! Check the YARN application logs for more details.
2019-04-21 19:07:02 ERROR YarnClientSchedulerBackend:70 - Diagnostics message: Uncaught exception: java.lang.ClassNotFoundException: com.banzaicloud.spark.metrics.sink.PrometheusSink
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

I'm using below config:

--repositories https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0,com.banzaicloud:spark-metrics_2.11:2.3-2.0.1,io.prometheus:simpleclient:0.3.0,io.prometheus:simpleclient_dropwizard:0.3.0,io.prometheus:simpleclient_pushgateway:0.3.0,io.dropwizard.metrics:metrics-core:3.1.2

In the executor log, I can see the jar being passed to the command as:

--user-class-path \ 
      file:$PWD/com.banzaicloud_spark-metrics_2.11-2.3-2.0.1.jar \ 
--user-class-path \ 
      file:$PWD/io.prometheus_simpleclient-0.3.0.jar \ 
--user-class-path
.
.
.

Can you help me find a solution for this?

problems with metrics-name-replacement

Hi all, thanks for your product. There are some problems when I use metrics-name-replacement. When I try to use the regex (application_\d+\d+.{1,6}_)(.+) to remove the application_id, the pushgateway keeps writing logs like this:

time="2018-11-05T14:04:16+08:00" level=info msg="Metric families 'name:"CodeGenerator_compilationTime" help:"Generated from Dropwizard metric import (metric=application_1539228068007_5477.driver.CodeGenerator.compilationTime, type=com.codahale.metrics.Histogram)" type:SUMMARY metric:<label:<name:"app_name" value:"livy-session-8" > label:<name:"instance" value:"application_1539228068007_5477" > label:<name:"job" value:"application_1539228068007_5477" > label:<name:"role" value:"driver" > summary:<sample_count:3 quantile:<quantile:0.5 value:66 > quantile:<quantile:0.75 value:172 > quantile:<quantile:0.95 value:172 > quantile:<quantile:0.98 value:172 > quantile:<quantile:0.99 value:172 > quantile:<quantile:0.999 value:172 > > > ' and 'name:"CodeGenerator_compilationTime" help:"Generated from Dropwizard metric import (metric=application_1539228068007_5478.driver.CodeGenerator.compilationTime, type=com.codahale.metrics.Histogram)" type:SUMMARY metric:<label:<name:"app_name" value:"livy-session-9" > label:<name:"instance" value:"application_1539228068007_5478" > label:<name:"job" value:"application_1539228068007_5478" > label:<name:"role" value:"driver" > summary:<sample_count:12 quantile:<quantile:0.5 value:32 > quantile:<quantile:0.75 value:56 > quantile:<quantile:0.95 value:56 > quantile:<quantile:0.98 value:56 > quantile:<quantile:0.99 value:56 > quantile:<quantile:0.999 value:56 > > > ' have inconsistent help strings. The latter will have priority. This is bad. Fix your pushed metrics!" source="diskmetricstore.go:122"

The log file size grows very fast. I'm not familiar with Prometheus; could you tell me how to resolve this problem?

Only driver metrics visible on local

Issue : Only driver metrics are visible

scala version 2.11.7
spark version 2.4.0

 val spark: SparkSession = SparkSession
      .builder
      .master("local[*]")
      .appName("KafkaMetricsSourceDemo")
      .config("spark.sql.streaming.metricsEnabled", true)
      .config("spark.metrics.executorMetricsSource.enabled",true)
      .config("spark.executor.processTreeMetrics.enabled",true)
      .config("spark.ui.prometheus.enabled",true)
      .config("spark.metrics.conf.*.sink.prometheus.class", "com.banzaicloud.spark.metrics.sink.PrometheusSink")
      .config("spark.metrics.conf.*.sink.prometheus.pushgateway-address-protocol", "http")
      .config("spark.metrics.conf.*.sink.prometheus.pushgateway-address", "localhost:9091")
      .config("spark.metrics.conf.*.sink.prometheus.period", 15)
      .config("spark.metrics.conf.*.sink.prometheus.unit", "seconds")
      .config("spark.metrics.conf.*.sink.prometheus.pushgateway-enable-timestamp",false)
      .getOrCreate()

My build.sbt contents:

ThisBuild / version := "0.1.0-SNAPSHOT"

ThisBuild / scalaVersion := "2.11.7"

lazy val root = (project in file("."))
  .settings(
    name := "debug"
  )

libraryDependencies += "org.apache.maven.plugins" % "maven-compiler-plugin" % "3.7.0"
libraryDependencies += "org.apache.maven.plugins" % "maven-dependency-plugin" % "3.0.0"

libraryDependencies += "com.banzaicloud" % "spark-metrics_2.11" % "2.3-2.1.0"
libraryDependencies += "io.prometheus" % "simpleclient_pushgateway" % "0.16.0"
libraryDependencies += "io.prometheus" % "simpleclient_dropwizard" % "0.16.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core",
  "org.apache.spark" %% "spark-sql-kafka-0-10",
  "org.apache.spark" %% "spark-sql"
).map(_ % "2.4.4")

could you please help!

Want to understand whether this spark-metrics repo will work with Prometheus in a Hadoop cluster?

I want to understand whether this spark-metrics repo will work with Prometheus in a Hadoop cluster.

Actually, I have a use case where my Java application is deployed in a Hadoop cluster and we run it as a Spark batch job. Now we want to monitor our Spark metrics, including the Java ones, so we want to push the Spark metrics to Prometheus and use Grafana on top of it.
So I just wanted to understand: will this work in the Hadoop cluster as well?

Getting Executor Ids from one query

Is there any option in the metrics to retrieve the list of all executor ids from one query, so that I can create a variable in Grafana and use it to create graphs?

Push dropwizard metrics error, PushGatewayWithTimestamp: text format parsing error in line 64: second HELP line for metric name "HiveExternalCatalog_fileCacheHits"

Describe the bug

Spark version: 2.4.3
spark-metrics version: spark-metrics_2.11-2.3-2.1.1.jar

Error in spark master.log:

PushGatewayWithTimestamp: text format parsing error in line 64: second HELP line for metric name "HiveExternalCatalog_fileCacheHits"

The reason is: the pushgateway may not accept duplicate metrics.

Here is the push request body (logged at debug level):

# HELP HiveExternalCatalog_fileCacheHits Generated from Dropwizard metric import (metric=HiveExternalCatalog.fileCacheHits, type=com.codahale.metrics.Counter)
# TYPE HiveExternalCatalog_fileCacheHits gauge
HiveExternalCatalog_fileCacheHits 0.0
# HELP HiveExternalCatalog_filesDiscovered Generated from Dropwizard metric import (metric=HiveExternalCatalog.filesDiscovered, type=com.codahale.metrics.Counter)
# TYPE HiveExternalCatalog_filesDiscovered gauge
HiveExternalCatalog_filesDiscovered 0.0
# HELP HiveExternalCatalog_hiveClientCalls Generated from Dropwizard metric import (metric=HiveExternalCatalog.hiveClientCalls, type=com.codahale.metrics.Counter)
# TYPE HiveExternalCatalog_hiveClientCalls gauge
HiveExternalCatalog_hiveClientCalls 0.0
# HELP HiveExternalCatalog_parallelListingJobCount Generated from Dropwizard metric import (metric=HiveExternalCatalog.parallelListingJobCount, type=com.codahale.metrics.Counter)
# TYPE HiveExternalCatalog_parallelListingJobCount gauge
HiveExternalCatalog_parallelListingJobCount 0.0
# HELP HiveExternalCatalog_partitionsFetched Generated from Dropwizard metric import (metric=HiveExternalCatalog.partitionsFetched, type=com.codahale.metrics.Counter)
# TYPE HiveExternalCatalog_partitionsFetched gauge
HiveExternalCatalog_partitionsFetched 0.0
# HELP CodeGenerator_compilationTime Generated from Dropwizard metric import (metric=CodeGenerator.compilationTime, type=com.codahale.metrics.Histogram)
# TYPE CodeGenerator_compilationTime summary
CodeGenerator_compilationTime{quantile="0.5"} 0.0
CodeGenerator_compilationTime{quantile="0.75"} 0.0
CodeGenerator_compilationTime{quantile="0.95"} 0.0
CodeGenerator_compilationTime{quantile="0.98"} 0.0
CodeGenerator_compilationTime{quantile="0.99"} 0.0
CodeGenerator_compilationTime{quantile="0.999"} 0.0
CodeGenerator_compilationTime_count 0.0
# HELP CodeGenerator_generatedClassSize Generated from Dropwizard metric import (metric=CodeGenerator.generatedClassSize, type=com.codahale.metrics.Histogram)
# TYPE CodeGenerator_generatedClassSize summary
CodeGenerator_generatedClassSize{quantile="0.5"} 0.0
CodeGenerator_generatedClassSize{quantile="0.75"} 0.0
CodeGenerator_generatedClassSize{quantile="0.95"} 0.0
CodeGenerator_generatedClassSize{quantile="0.98"} 0.0
CodeGenerator_generatedClassSize{quantile="0.99"} 0.0
CodeGenerator_generatedClassSize{quantile="0.999"} 0.0
CodeGenerator_generatedClassSize_count 0.0
# HELP CodeGenerator_generatedMethodSize Generated from Dropwizard metric import (metric=CodeGenerator.generatedMethodSize, type=com.codahale.metrics.Histogram)
# TYPE CodeGenerator_generatedMethodSize summary
CodeGenerator_generatedMethodSize{quantile="0.5"} 0.0
CodeGenerator_generatedMethodSize{quantile="0.75"} 0.0
CodeGenerator_generatedMethodSize{quantile="0.95"} 0.0
CodeGenerator_generatedMethodSize{quantile="0.98"} 0.0
CodeGenerator_generatedMethodSize{quantile="0.99"} 0.0
CodeGenerator_generatedMethodSize{quantile="0.999"} 0.0
CodeGenerator_generatedMethodSize_count 0.0
# HELP CodeGenerator_sourceCodeSize Generated from Dropwizard metric import (metric=CodeGenerator.sourceCodeSize, type=com.codahale.metrics.Histogram)
# TYPE CodeGenerator_sourceCodeSize summary
CodeGenerator_sourceCodeSize{quantile="0.5"} 0.0
CodeGenerator_sourceCodeSize{quantile="0.75"} 0.0
CodeGenerator_sourceCodeSize{quantile="0.95"} 0.0
CodeGenerator_sourceCodeSize{quantile="0.98"} 0.0
CodeGenerator_sourceCodeSize{quantile="0.99"} 0.0
CodeGenerator_sourceCodeSize{quantile="0.999"} 0.0
CodeGenerator_sourceCodeSize_count 0.0
# HELP master_aliveWorkers Generated from Dropwizard metric import (metric=master.aliveWorkers, type=org.apache.spark.deploy.master.MasterSource$$anon$2)
# TYPE master_aliveWorkers gauge
master_aliveWorkers 1.0
# HELP master_apps Generated from Dropwizard metric import (metric=master.apps, type=org.apache.spark.deploy.master.MasterSource$$anon$3)
# TYPE master_apps gauge
master_apps 0.0
# HELP master_waitingApps Generated from Dropwizard metric import (metric=master.waitingApps, type=org.apache.spark.deploy.master.MasterSource$$anon$4)
# TYPE master_waitingApps gauge
master_waitingApps 0.0
# HELP master_workers Generated from Dropwizard metric import (metric=master.workers, type=org.apache.spark.deploy.master.MasterSource$$anon$1)
# TYPE master_workers gauge
master_workers 1.0
# HELP HiveExternalCatalog_fileCacheHits Generated from Dropwizard metric import (metric=HiveExternalCatalog.fileCacheHits, type=com.codahale.metrics.Counter)
# TYPE HiveExternalCatalog_fileCacheHits gauge
HiveExternalCatalog_fileCacheHits 0.0
# HELP HiveExternalCatalog_filesDiscovered Generated from Dropwizard metric import (metric=HiveExternalCatalog.filesDiscovered, type=com.codahale.metrics.Counter)
# TYPE HiveExternalCatalog_filesDiscovered gauge
HiveExternalCatalog_filesDiscovered 0.0
# HELP HiveExternalCatalog_hiveClientCalls Generated from Dropwizard metric import (metric=HiveExternalCatalog.hiveClientCalls, type=com.codahale.metrics.Counter)
# TYPE HiveExternalCatalog_hiveClientCalls gauge
HiveExternalCatalog_hiveClientCalls 0.0
# HELP HiveExternalCatalog_parallelListingJobCount Generated from Dropwizard metric import (metric=HiveExternalCatalog.parallelListingJobCount, type=com.codahale.metrics.Counter)
# TYPE HiveExternalCatalog_parallelListingJobCount gauge
HiveExternalCatalog_parallelListingJobCount 0.0
# HELP HiveExternalCatalog_partitionsFetched Generated from Dropwizard metric import (metric=HiveExternalCatalog.partitionsFetched, type=com.codahale.metrics.Counter)
# TYPE HiveExternalCatalog_partitionsFetched gauge
HiveExternalCatalog_partitionsFetched 0.0
# HELP CodeGenerator_compilationTime Generated from Dropwizard metric import (metric=CodeGenerator.compilationTime, type=com.codahale.metrics.Histogram)
# TYPE CodeGenerator_compilationTime summary
CodeGenerator_compilationTime{quantile="0.5"} 0.0
CodeGenerator_compilationTime{quantile="0.75"} 0.0
CodeGenerator_compilationTime{quantile="0.95"} 0.0
CodeGenerator_compilationTime{quantile="0.98"} 0.0
CodeGenerator_compilationTime{quantile="0.99"} 0.0
CodeGenerator_compilationTime{quantile="0.999"} 0.0
CodeGenerator_compilationTime_count 0.0
# HELP CodeGenerator_generatedClassSize Generated from Dropwizard metric import (metric=CodeGenerator.generatedClassSize, type=com.codahale.metrics.Histogram)
# TYPE CodeGenerator_generatedClassSize summary
CodeGenerator_generatedClassSize{quantile="0.5"} 0.0
CodeGenerator_generatedClassSize{quantile="0.75"} 0.0
CodeGenerator_generatedClassSize{quantile="0.95"} 0.0
CodeGenerator_generatedClassSize{quantile="0.98"} 0.0
CodeGenerator_generatedClassSize{quantile="0.99"} 0.0
CodeGenerator_generatedClassSize{quantile="0.999"} 0.0
CodeGenerator_generatedClassSize_count 0.0
# HELP CodeGenerator_generatedMethodSize Generated from Dropwizard metric import (metric=CodeGenerator.generatedMethodSize, type=com.codahale.metrics.Histogram)
# TYPE CodeGenerator_generatedMethodSize summary
CodeGenerator_generatedMethodSize{quantile="0.5"} 0.0
CodeGenerator_generatedMethodSize{quantile="0.75"} 0.0
CodeGenerator_generatedMethodSize{quantile="0.95"} 0.0
CodeGenerator_generatedMethodSize{quantile="0.98"} 0.0
CodeGenerator_generatedMethodSize{quantile="0.99"} 0.0
CodeGenerator_generatedMethodSize{quantile="0.999"} 0.0
CodeGenerator_generatedMethodSize_count 0.0
# HELP CodeGenerator_sourceCodeSize Generated from Dropwizard metric import (metric=CodeGenerator.sourceCodeSize, type=com.codahale.metrics.Histogram)
# TYPE CodeGenerator_sourceCodeSize summary
CodeGenerator_sourceCodeSize{quantile="0.5"} 0.0
CodeGenerator_sourceCodeSize{quantile="0.75"} 0.0
CodeGenerator_sourceCodeSize{quantile="0.95"} 0.0
CodeGenerator_sourceCodeSize{quantile="0.98"} 0.0
CodeGenerator_sourceCodeSize{quantile="0.99"} 0.0
CodeGenerator_sourceCodeSize{quantile="0.999"} 0.0
CodeGenerator_sourceCodeSize_count 0.0

Steps to reproduce the issue:

Expected behavior

Screenshots


Additional context

Metrics name pre-processing by custom Prometheus sink is working for only one component (driver/executor/applicationMaster)

Hello

As we already know, Spark doesn't support a Prometheus sink by default in versions < 3.0, so I wanted to integrate this custom sink with our AWS EMR clusters. I was successful in publishing metrics to the Prometheus PushGateway, but I am having issues with metric name pre-processing. I followed this link to set up the metric name pre-processing. I am able to do metric name pre-processing for only the driver or executor or applicationMaster at once, but not for all three of them.

This is the regex suggested in the above link, and its name pre-processing works only for the driver. I tried replacing it with executor or applicationMaster, and it works as expected for that respective component:

*.sink.prometheus.metrics-name-capture-regex=(.+driver_)(.+)
*.sink.prometheus.metrics-name-replacement=$2

Below are a few regexes I tried for metric name pre-processing, but I still see the metric name processed for only one of the components instead of all of them (executor/driver/applicationMaster).

(.*driver|.*executor|.*application)[0123456789_]+(.+)
(\w+)((_\d+)_)(\w*(driver|executor|master|application)\w*_)?
(\w+)((_\d+)_)(\w*(driver|executor|applicationMaster)\w*_)?
([a-zA-Z]+)((_[0-9]+)+_)(\w*(executor|master|application)\w*_)?
.*(driver|executor|application)[0123456789_]+(.+)

For all of these regexes, I set the name replacement to $2, as in the above-mentioned link:

*.sink.prometheus.metrics-name-replacement=$2

Versions of the JARs used -

  - spark-metrics_2.12-2.4-1.0.5.jar
  - metrics-core-3.1.2.jar
  - simpleclient_dropwizard-0.3.0.jar
  - simpleclient_pushgateway-0.3.0.jar
  - simpleclient-0.3.0.jar
  - simpleclient_common-0.3.0.jar

Any help with the metric name pre-processing is much appreciated.

Thanks
Druva

Metric naming & application name as tag

Hi folks,

At KLM we're using PrometheusSink with great success. What we're missing though, is the ability to filter on application name. Currently, the metric names themselves are constructed using a lot of variables, including the application (attempt id), so we get metric names like:

application_1534430873940_0001_driver_UserSessionProcessor_StreamingMetrics_streaming_lastCompletedBatch_totalDelay

This makes it hard to aggregate metrics for a specific Spark application over time, across redeployments. I think it makes more sense to remove the application id from the metric name, so the metric name doesn't change when you deploy a new version. If that is not desired in your opinion, the next best thing would be to at least add the application name (UserSessionProcessor in the example) as label, so it becomes slightly easier to aggregate on.

What are your thoughts? If you agree with one or both solutions, I can make a PR for this.
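
As a possible interim workaround with the knobs already shown elsewhere on this page (the sink attaches an app_name label, and the metrics-name-capture-regex/metrics-name-replacement pair can rewrite names), something along these lines might strip the application/attempt id prefix from the metric names; the regex is illustrative and untested:

# Strip the leading "application_<id>_<attempt>_" prefix from metric names
*.sink.prometheus.metrics-name-capture-regex=(application_[0-9]+_[0-9]+_)(.+)
*.sink.prometheus.metrics-name-replacement=$2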

Adding the ability to set custom labels on metrics

Is your feature request related to a problem? Please describe.
When creating batches of jobs, I want the batch number to be a label on the same metric not a part of the metric name. This will make it easier to report both individual batches and all batches over a given time.

Describe the solution you'd like to see
Something like this: #19 (comment)

If you look at the Prometheus best practices, specifically the documentation about the data model and the part about Metric and label naming, you'll see that Prometheus advocates using more generic metric names

I know I can set labels statically using *.sink.<your-sink>.labels=label1=value1,label2=value2 but I want to apply labels where the value is determined at runtime.

Describe alternatives you've considered
I can continue to make the metric name longer by adding to the accumulator, but it would be more in line with Prometheus best practices to use job labels.

Additional context
None that I can think of

Problems with spark-submit.

Hello,

I have some problems running Spark in cluster mode (it works OK in client mode).

I am getting the following error:

18/05/04 09:52:23 ERROR MetricsSystem: Sink class com.banzaicloud.spark.metrics.sink.PrometheusSink cannot be instantiated
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1854)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: com.banzaicloud.spark.metrics.sink.PrometheusSink
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:230)
	at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:198)
	at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:194)
	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
	at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
	at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
	at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:194)
	at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:366)
	at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:201)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:223)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)

The spark-submit is exactly the same as the one provided in the readme.

--repositories https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases 
--packages com.banzaicloud:spark-metrics_2.11:2.2.1-1.0.0,io.dropwizard.metrics:metrics-core:3.1.2 
--jars /home/hadoop/avro-1.8.2.jar,/home/hadoop/simpleclient-0.0.23.jar,/home/hadoop/simpleclient_dropwizard-0.0.23.jar,/home/hadoop/simpleclient_pushgateway-0.0.23.jar

Caused by: java.lang.ClassNotFoundException: com.banzaicloud.spark.metrics.sink.PrometheusSink

Hello,

first of all thanks for putting this lib together. I'm a fan of spark and prometheus but there was nothing to bridge these two worlds and you guys did an amazing job.

I started using your lib a few weeks ago and everything was working fine, until a few days ago a new structured streaming job on EMR 5.14 failed to launch executors.

The error:

18/06/08 12:46:47 ERROR MetricsSystem: Sink class com.banzaicloud.spark.metrics.sink.PrometheusSink cannot be instantiated
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1854)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:293)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: com.banzaicloud.spark.metrics.sink.PrometheusSink
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.spark.util.Utils$.classForName(Utils.scala:235)
        at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:198)
        at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:194)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
        at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:194)
        at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:364)
        at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:200)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:228)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)

The ARGS[] of the aws emr create step:

    --repositories, https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases, \
    --packages, '"com.banzaicloud:spark-metrics_2.11:2.3-1.0.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0"', \
    --conf, spark.driver.extraJavaOptions="${CONFIG_FILE}", \
    --conf, spark.metrics.conf="/tmp/prometheus-sink.conf", \
    --conf, spark.metrics.namespace="interleaving-stream", \

And the libs in the build.sbt:

"io.prometheus" % "simpleclient" % "0.3.0",
      "io.prometheus" % "simpleclient_dropwizard" % "0.3.0",
      "io.prometheus" % "simpleclient_pushgateway" % "0.3.0",
      "io.dropwizard.metrics" % "metrics-core" % "3.1.2",
      ```

Do you guys have any clue how it's possible that the container terminates with the class not found?

Note that pushing to the Prometheus gateway works fine.

Q: Expose AccumulatorSource

Question about exposing spark accumulators to Prometheus.

I think I read about it in the documentation, but couldn't find it now. Could I expose the value of a Spark accumulator?
I assume this should work:

*.source.jvm.class=org.apache.spark.metrics.source.JvmSource
*.source.jvm.class=org.apache.spark.metrics.source.LongAccumulatorSource

but I don't see the Spark accumulator value exposed.
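
Note that both lines above set the same *.source.jvm.class key, so only one of them will actually take effect. More importantly, as far as I can tell a *.source line can only instantiate sources that have a no-argument constructor, whereas an accumulator source needs the accumulators handed to it programmatically. A minimal sketch of that registration, assuming a Spark version that ships LongAccumulatorSource and its register helper (an assumption about your Spark version, not something this sink provides):

import org.apache.spark.SparkContext
import org.apache.spark.metrics.source.LongAccumulatorSource

// Sketch: register a named accumulator as a metrics source so its value is
// exposed as a gauge and flows through whatever sinks are configured.
def registerRecordCounter(sc: SparkContext) = {
  val processed = sc.longAccumulator("records-processed")
  LongAccumulatorSource.register(sc, Map("records-processed" -> processed))
  processed // increment this from the job as usual
}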

Pushgateway Read timed out

I'm experiencing an odd issue where my spark workers will randomly begin reporting that they cannot connect to my push gateway.

2020-04-13 11:41:55 ERROR ScheduledReporter:184 - Exception thrown from Reporter#report. Exception was suppressed.
java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
	at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
	at io.prometheus.client.exporter.PushGateway.doRequest(PushGateway.java:315)
	at io.prometheus.client.exporter.PushGateway.pushAdd(PushGateway.java:182)
	at com.banzaicloud.spark.metrics.sink.PrometheusSink$Reporter.report(PrometheusSink.scala:98)
	at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:242)
	at com.codahale.metrics.ScheduledReporter.lambda$start$0(ScheduledReporter.java:182)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I have verified the pushgateway is up and running and I can connect to it without issue.
However, I did notice that my pushgateway's memory usage keeps climbing. The only things sending metrics to it are my spark workers, via this package/library.

I thought perhaps it could be a pushgateway issue, but then I found this issue on the pushgateway repo:
prometheus/pushgateway#340

That seems to indicate that something is pushing metrics into the gateway (this lib) and is not disposing of the connection properly?

Any assistance would be greatly appreciated. The error is not blocking my workers, but it is very annoying: it spams the logs and destabilizes the pushgateway.
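
For what it's worth, the read timeout in that trace comes from the HttpURLConnection that the Prometheus pushgateway client opens; the sink itself does not expose a timeout setting. As a rough sketch of where that knob lives in the client, assuming a simpleclient_pushgateway version that has the HttpConnectionFactory hook (the sink would have to be patched to actually use something like this):

import java.net.{HttpURLConnection, URL}
import io.prometheus.client.exporter.{HttpConnectionFactory, PushGateway}

// Sketch only: a connection factory with explicit connect/read timeouts.
// This is not wired up by the sink; it just shows where the timeout is
// controlled in the Prometheus Java client.
class TimeoutConnectionFactory(connectMs: Int, readMs: Int) extends HttpConnectionFactory {
  override def create(url: String): HttpURLConnection = {
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    conn.setConnectTimeout(connectMs)
    conn.setReadTimeout(readMs)
    conn
  }
}

// Usage sketch:
// val pg = new PushGateway("pushgateway.example.com:9091")
// pg.setConnectionFactory(new TimeoutConnectionFactory(5000, 30000))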

jars+versions

collector-0.12.0.jar
metrics-core-4.1.2.jar
simpleclient-0.8.1.jar
simpleclient_common-0.8.1.jar
simpleclient_dropwizard-0.8.1.jar
simpleclient_pushgateway-0.8.1.jar
snakeyaml-1.16.jar
spark-metrics_2.11-2.3-3.0.1.jar

Thanks!

Repetitions of last metric value

Hello,

@stoader I have a general question: After a Spark application ends, the metric in the pushgateway becomes “stale” as no more updates are pushed. This results in the following problem:

Problem

The last metric is repeatedly scraped and only stops getting plotted when the gateway server shuts down, which means that the cluster needs to terminate. This is apparently by design, and users on the mailing list have asked for a way to avoid this behaviour, e.g. https://groups.google.com/g/prometheus-users/c/uGYUQhQAdOE/m/0ICfNNHaAQAJ
There seems to be no way around this problem; the authors explicitly decided against implementing something like a metric “timeout”.

Have you also observed this, and do you know a way to solve this problem? Unfortunately, the pull-based approach does not seem to work for multiple executors per node on a YARN cluster.
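
One workaround I am aware of (not something this sink does for you) is to delete the application's metric group from the pushgateway when the Spark application ends, so the stale values stop being scraped. A rough sketch, assuming the job name and grouping key passed to delete match what the sink actually pushed (the sink groups by role and app id/instance, so the full grouping key may be needed):

import io.prometheus.client.exporter.PushGateway
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// Sketch: best-effort cleanup of this application's pushgateway group on shutdown.
// The job name ("spark-app" below) is a placeholder and must match the job label
// under which the metrics were pushed.
class PushgatewayCleanupListener(address: String, job: String) extends SparkListener {
  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = {
    try new PushGateway(address).delete(job)
    catch { case _: Exception => () } // the gateway may be unreachable at shutdown
  }
}

// Usage sketch:
// sc.addSparkListener(new PushgatewayCleanupListener("pushgateway.example.com:9091", "spark-app"))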

Not seeing executor metrics, only driver

Describe the bug
Not seeing executor metrics (only driver).

Steps to reproduce the issue:
Spark 2.3.0 / Hadoop 2.7

metrics.properties:

# Enable JVM metrics source for all instances by class name
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
#*.source.jvm.class=org.apache.spark.metrics.source.JvmSource

# Enable Prometheus for all instances by class name
driver.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
# Prometheus pushgateway address
*.sink.prometheus.pushgateway-address-protocol=http
*.sink.prometheus.pushgateway-address=domain.com:9091/
*.sink.prometheus.period=10
*.sink.prometheus.unit=seconds
*.sink.prometheus.pushgateway-enable-timestamp=false
# Enable HostName in Instance instead of Appid (Default value is false i.e. instance=${appid})
*.sink.prometheus.enable-hostname-in-instance=true
*.sink.prometheus.labels=worker=somename,environment=test,type=hadoop,cluster=360
*.sink.prometheus.enable-dropwizard-collector=false
*.sink.prometheus.enable-jmx-collector=true
*.sink.prometheus.jmx-collector-config=jmxCollector.yml

# Enable jvm source for instance master, worker, driver and executor
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
master.source.executors.class=org.apache.spark.metrics.source.JvmSource

jmxCollector.yml

rules:

  # These come from the master
  # Example: master.aliveWorkers
  - pattern: "metrics<name=master\\.(.*)><>Value"
    name: spark_master_$1

  # These come from the worker
  # Example: worker.coresFree
  - pattern: "metrics<name=worker\\.(.*)><>Value"
    name: spark_worker_$1

  # These come from the application driver
  # Example: app-20160809000059-0000.driver.DAGScheduler.stage.failedStages
  - pattern: "metrics<name=(.*)\\.driver\\.(DAGScheduler|BlockManager|jvm)\\.(.*)><>Value"
    name: spark_driver_$2_$3
    type: GAUGE
    labels:
      app_id: "$1"

  # These come from the application driver
  # Emulate timers for DAGScheduler like messageProcessingTime
  - pattern: "metrics<name=(.*)\\.driver\\.DAGScheduler\\.(.*)><>Count"
    name: spark_driver_DAGScheduler_$2_count
    type: COUNTER
    labels:
      app_id: "$1"

  # HiveExternalCatalog is of type counter
  - pattern: "metrics<name=(.*)\\.driver\\.HiveExternalCatalog\\.(.*)><>Count"
    name: spark_driver_HiveExternalCatalog_$2_total
    type: COUNTER
    labels:
      app_id: "$1"

  # These come from the application driver
  # Emulate histograms for CodeGenerator
  - pattern: "metrics<name=(.*)\\.driver\\.CodeGenerator\\.(.*)><>Count"
    name: spark_driver_CodeGenerator_$2_count
    type: COUNTER
    labels:
      app_id: "$1"

  # These come from the application driver
  # Emulate timer (keep only count attribute) plus counters for LiveListenerBus
  - pattern: "metrics<name=(.*)\\.driver\\.LiveListenerBus\\.(.*)><>Count"
    name: spark_driver_LiveListenerBus_$2_count
    type: COUNTER
    labels:
      app_id: "$1"

  # Get Gauge type metrics for LiveListenerBus
  - pattern: "metrics<name=(.*)\\.driver\\.LiveListenerBus\\.(.*)><>Value"
    name: spark_driver_LiveListenerBus_$2
    type: GAUGE
    labels:
      app_id: "$1"

  # These come from the application driver if it's a streaming application
  # Example: app-20160809000059-0000.driver.com.example.ClassName.StreamingMetrics.streaming.lastCompletedBatch_schedulingDelay
  - pattern: "metrics<name=(.*)\\.driver\\.(.*)\\.StreamingMetrics\\.streaming\\.(.*)><>Value"
    name: spark_driver_streaming_$3
    labels:
      app_id: "$1"
      app_name: "$2"

  # These come from the application driver if it's a structured streaming application
  # Example: app-20160809000059-0000.driver.spark.streaming.QueryName.inputRate-total
  - pattern: "metrics<name=(.*)\\.driver\\.spark\\.streaming\\.(.*)\\.(.*)><>Value"
    name: spark_driver_structured_streaming_$3
    labels:
      app_id: "$1"
      query_name: "$2"

  # These come from the application executors
  # Example: app-20160809000059-0000.0.executor.threadpool.activeTasks (value)
  #  app-20160809000059-0000.0.executor.JvmGCtime (counter)
  - pattern: "metrics<name=(.*)\\.(.*)\\.executor\\.(.*)><>Value"
    name: spark_executor_$3
    type: GAUGE
    labels:
      app_id: "$1"
      executor_id: "$2"

  # Executors counters
  - pattern: "metrics<name=(.*)\\.(.*)\\.executor\\.(.*)><>Count"
    name: spark_executor_$3_total
    type: COUNTER
    labels:
      app_id: "$1"
      executor_id: "$2"

  # These come from the application executors
  # Example: app-20160809000059-0000.0.jvm.threadpool.activeTasks
  - pattern: "metrics<name=(.*)\\.([0-9]+)\\.(jvm|NettyBlockTransfer)\\.(.*)><>Value"
    name: spark_executor_$3_$4
    type: GAUGE
    labels:
      app_id: "$1"
      executor_id: "$2"

  - pattern: "metrics<name=(.*)\\.([0-9]+)\\.HiveExternalCatalog\\.(.*)><>Count"
    name: spark_executor_HiveExternalCatalog_$3_count
    type: COUNTER
    labels:
      app_id: "$1"
      executor_id: "$2"

  # These come from the application driver
  # Emulate histograms for CodeGenerator
  - pattern: "metrics<name=(.*)\\.([0-9]+)\\.CodeGenerator\\.(.*)><>Count"
    name: spark_executor_CodeGenerator_$3_count
    type: COUNTER
    labels:
      app_id: "$1"
      executor_id: "$2"

I see metrics flowing into the push gateway as expected, but only driver metrics, no executor metrics...

spark_driver_jvm_total_used
spark_driver_jvm_total_max
spark_driver_jvm_total_init
....

I found a post from someone who has a similar set of spark/hadoop versions on a cloudera forum, but no answers.
https://community.cloudera.com/t5/Support-Questions/Spark-metrics-sink-doesn-t-expose-executor-s-metrics/td-p/281915

I see the same problem with the graphite metric sink built into spark.
It occurs whether we're in yarn cluster mode or local spark-submit mode.

Can anyone explain what I'm doing wrong here?

Thanks,
Drew

Security Policy violation Binary Artifacts

This issue was automatically created by Allstar.

Security Policy Violation
Project is out of compliance with Binary Artifacts policy: binaries present in source code

Rule Description
Binary Artifacts are an increased security risk in your repository. Binary artifacts cannot be reviewed, allowing the introduction of possibly obsolete or maliciously subverted executables. For more information see the Security Scorecards Documentation for Binary Artifacts.

Remediation Steps
To remediate, remove the generated executable artifacts from the repository.

Artifacts Found

  • maven-repo/releases/com/banzaicloud/spark-metrics_2.11/2.2.1-1.0.0/spark-metrics_2.11-2.2.1-1.0.0-javadoc.jar
  • maven-repo/releases/com/banzaicloud/spark-metrics_2.11/2.2.1-1.0.0/spark-metrics_2.11-2.2.1-1.0.0-sources.jar
  • maven-repo/releases/com/banzaicloud/spark-metrics_2.11/2.2.1-1.0.0/spark-metrics_2.11-2.2.1-1.0.0.jar

Additional Information
This policy is drawn from Security Scorecards, which is a tool that scores a project's adherence to security best practices. You may wish to run a Scorecards scan directly on this repository for more details.


This issue will auto resolve when the policy is in compliance.

Issue created by Allstar. See https://github.com/ossf/allstar/ for more information. For questions specific to the repository, please contact the owner or maintainer.

Make the Prometheus sink a Spark package

Spark Packages is a community index of third-party packages for Apache Spark. Ideally, the Prometheus sink should be a package, as that makes it easier for people outside Banzai Cloud to consume it.

At this time we are not putting effort into this (due to lack of time) but we are open to contributions.

Sample Prometheus queries

Would it be possible for you to share the queries you use to build dashboards or alerts based off Spark metrics?

VictoriaMetrics

Good afternoon! If we don't have a pushgateway, is it possible to send metrics to VictoriaMetrics using your library?

spark metrics namespace configuration issue

According to the documentation at https://spark.apache.org/docs/latest/monitoring.html, under Metrics, we can set the spark.metrics.namespace configuration to something like ${spark.app.name}, which presumably is expanded in the metrics sink to the value of that variable. We opened up our spark-defaults.conf file and added the following line.

spark.metrics.namespace					     ${spark.app.name}

As a result, our metrics changed from

app_20180802190042_0119_driver_DAGScheduler_stage_runningStages{instance="",job="app-20180802190042-0119",role="driver"} 1

... to this ....

Parse_AUT1_records_driver_DAGScheduler_stage_runningStages{instance="",job="${spark.app.name}",role="driver"} 1

The actual name of the metric changed successfully (Parse_AUT1_records is in fact the name we expected), but it seems the job tag did not expand properly (it says ${spark.app.name}). We looked into the source code and are pretty sure there is an issue at https://github.com/banzaicloud/spark-metrics/blob/master/src/main/scala/com/banzaicloud/spark/metrics/sink/PrometheusSink.scala#L77: the namespace seems to be taken verbatim as a string and not expanded as a variable.
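
For illustration only (this is not what the sink currently does), resolving the placeholder before it is used as the job label could look roughly like this, assuming the SparkConf is reachable at that point:

import org.apache.spark.SparkEnv
import scala.util.matching.Regex

object NamespaceResolver {
  private val placeholder: Regex = """\$\{(.+?)\}""".r

  // Expand ${...} placeholders against the running SparkConf; unknown keys are left as-is.
  def resolve(raw: String): String = {
    val conf = SparkEnv.get.conf
    placeholder.replaceAllIn(raw, m => Regex.quoteReplacement(conf.get(m.group(1), m.matched)))
  }
}

// NamespaceResolver.resolve("${spark.app.name}") would then return the application
// name (e.g. Parse_AUT1_records) instead of the literal placeholder.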

Filter metrics

Is your feature request related to a problem? Please describe.

I would like to use the push-based metrics collection provided by this plugin alongside the direct scraping approach introduced in Spark 3.0, as the push-based version still provides benefits for some metrics. In order not to overload the push gateway, I want to filter for a select set of metrics (e.g. batch job metrics that have to be reliably scraped), a significantly smaller set than routing all of them.

Describe the solution you'd like to see

A new configuration parameter that can be used to filter metrics by namespace, or more generally, a regex pattern.
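
To make the request concrete, the kind of filter such a parameter could translate into might look roughly like this on the Dropwizard side (the property name is hypothetical and this is only a sketch, not how the sink currently behaves):

import java.util.regex.Pattern
import com.codahale.metrics.{Metric, MetricFilter}

// Sketch: build a MetricFilter from a regex so only matching metric names are
// reported. A hypothetical *.sink.prometheus.metrics-filter-regex property could
// be parsed into this before the reporter is started.
def filterFromRegex(regex: String): MetricFilter = {
  val pattern = Pattern.compile(regex)
  new MetricFilter {
    override def matches(name: String, metric: Metric): Boolean =
      pattern.matcher(name).find() // substring match; use matches() for a full-name match
  }
}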

Describe alternatives you've considered

  1. It would be better if this were a feature of Spark's MetricsSystem instead, so that arbitrary source-to-sink routing could be established in a plugin-agnostic manner. I've asked on the spark-users mailing list about this, but didn't get an answer so far. Spark doesn't provide such a feature currently according to my investigation, and I don't know how long it would take to get this through.

  2. metrics-name-capture-regex could be altered so that it drops metrics that don't match the regex, while still fulfilling its original purpose of enabling rewriting of the contents of its capture groups. However, this would be a breaking change, as users expect non-matching metrics to be included too.

Additional context

Metric Name RegEx Replacement doesn't work with JMX

My goal, like others' before me, has been to replace all of the metrics on the pushgateway each time a new instance of the application runs, i.e. no reference to the application id and no random ints in the metric names.

I achieved this by doing three things:

  1. Changing the spark.metrics.namespace to the app name.
  2. Using *.sink.prometheus.metrics-name-capture-regex, *.sink.prometheus.metrics-name-replacement to replace rand int and application id. FYI: (application_.*?_.*?_|.*spark_streaming_.*?_.*?_.*?_.*?_.*?_.*?)(.+)
  3. Using the group-key to remove the instance label

The problem is, step no. 2 (*.sink.prometheus.metrics-name-capture-regex, *.sink.prometheus.metrics-name-replacement) only seems to work when jmx collection is disabled. Otherwise it is ignored.
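
For reference, when the rewrite from step 2 is applied it is just a regex replacement on the metric name; here is what it does to a made-up, already-flattened name (plain regex mechanics only, independent of where the sink hooks it in):

// Illustration only; the metric name below is hypothetical.
val captureRegex = "(application_.*?_.*?_|.*spark_streaming_.*?_.*?_.*?_.*?_.*?_.*?)(.+)"
val replacement  = "$2"
val rewritten = "application_1234_0001_driver_jvm_heap_used".replaceAll(captureRegex, replacement)
// rewritten == "driver_jvm_heap_used" -- the app-id prefix is dropped, the rest is kept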

JMX collection gives many valuable stats (specifically, for my application, Hadoop S3A file system metrics and Kafka producer metrics).

Is this an oversight, a bug, or intended behavior?

Would much appreciate any replies.

Support Spark 3.0

Describe the bug

./bin/spark-submit --master spark://KANGTIAN-MB0:7077 --class org.apache.spark.examples.SparkPi     --repositories https://raw.github.com/banzaicloud/spark-metrics/master/maven-repo/releases     --packages com.banzaicloud:spark-metrics_2.11:2.3-2.1.0,io.prometheus:simpleclient:0.3.0,io.prometheus:simpleclient_dropwizard:0.3.0,io.prometheus:simpleclient_pushgateway:0.3.0,io.dropwizard.metrics:metrics-core:3.1.2     ~/Documents/program/spark/examples/jars/spark-examples_2.11-2.4.4.jar  1000

ERROR:

19/11/08 13:22:47 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging$class
	at com.banzaicloud.spark.metrics.sink.PrometheusSink.<init>(PrometheusSink.scala:47)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.spark.metrics.MetricsSystem.$anonfun$registerSinks$1(MetricsSystem.scala:216)
	at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
	at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
	at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
	at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
	at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:196)
	at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:106)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:571)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2555)
	at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:896)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:887)
	at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
	at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:901)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:179)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:202)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:89)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:980)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:989)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.internal.Logging$class
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 32 more

Steps to reproduce the issue:

Expected behavior

Screenshots

Additional context

ClassNotFoundException: com.banzaicloud.spark.metrics.sink.PrometheusSink

Hello,
I'm really interested in using the library but currently I'm failing to use it on EMR/YARN. It is giving a ClassNotFoundException: com.banzaicloud.spark.metrics.sink.PrometheusSink error.
I have tried including the jar in my fat jar but that does not help.
Here is the complete stack trace:

Uncaught exception: java.lang.ClassNotFoundException: com.banzaicloud.spark.metrics.sink.PrometheusSink
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
	at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:198)
	at org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:194)
	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
	at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
	at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
	at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
	at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:194)
	at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
	at org.apache.spark.deploy.yarn.ApplicationMaster.createAllocator(ApplicationMaster.scala:454)
	at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:481)
	at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:773)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
	at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:772)
	at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
	at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:797)
	at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
