
Tokio Metrics

Provides utilities for collecting metrics from a Tokio application, including runtime and per-task metrics.

[dependencies]
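# note: default-features = false still provides task metrics; the rt feature
# (on by default) is only needed for the runtime metrics described below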
tokio-metrics = { version = "0.3.1", default-features = false }

Getting Started With Task Metrics

Use TaskMonitor to instrument tasks before spawning them, and to observe metrics for those tasks. All tasks instrumented with a given TaskMonitor aggregate their metrics together. To split out metrics for different kinds of tasks, use a separate TaskMonitor for each kind (see the sketch after the example below).

use std::time::Duration;

// a stand-in for real application work
async fn do_work() {
    tokio::time::sleep(Duration::from_millis(100)).await;
}

#[tokio::main]
async fn main() {
    // construct a TaskMonitor
    let monitor = tokio_metrics::TaskMonitor::new();

    // print task metrics every 500ms
    {
        let frequency = Duration::from_millis(500);
        let monitor = monitor.clone();
        tokio::spawn(async move {
            for metrics in monitor.intervals() {
                println!("{:?}", metrics);
                tokio::time::sleep(frequency).await;
            }
        });
    }

    // instrument some tasks, spawn them, and await each in turn;
    // awaiting the handle keeps this loop from spawning without bound
    loop {
        tokio::spawn(monitor.instrument(do_work())).await.unwrap();
    }
}
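As noted above, separate TaskMonitors keep metrics for different kinds of tasks apart. A minimal sketch, in which handle_reads and handle_writes are hypothetical stand-ins for two distinct kinds of application work:

use std::time::Duration;

async fn handle_reads() { tokio::time::sleep(Duration::from_millis(10)).await }
async fn handle_writes() { tokio::time::sleep(Duration::from_millis(10)).await }

#[tokio::main]
async fn main() {
    // one monitor per kind of task; each aggregates only its own tasks
    let read_monitor = tokio_metrics::TaskMonitor::new();
    let write_monitor = tokio_metrics::TaskMonitor::new();

    let reads = tokio::spawn(read_monitor.instrument(handle_reads()));
    let writes = tokio::spawn(write_monitor.instrument(handle_writes()));
    let _ = tokio::join!(reads, writes);

    // the cumulative metrics of each monitor now describe the two
    // kinds of work independently
    println!("reads:  {:?}", read_monitor.cumulative());
    println!("writes: {:?}", write_monitor.cumulative());
}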

Task Metrics

Base Metrics

Derived Metrics

Getting Started With Runtime Metrics

This unstable functionality requires tokio_unstable and the rt crate feature. To enable tokio_unstable, pass --cfg tokio_unstable to rustc when compiling. You can do this by setting the RUSTFLAGS environment variable before compiling your application; e.g.:

RUSTFLAGS="--cfg tokio_unstable" cargo build

Or, by creating the file .cargo/config.toml in the root directory of your crate. If you're using a workspace, put this file in the root directory of your workspace instead.

[build]
rustflags = ["--cfg", "tokio_unstable"]
rustdocflags = ["--cfg", "tokio_unstable"] 

Putting .cargo/config.toml files below the workspace or crate root directory may lead to tools like Rust-Analyzer or VS Code not using your .cargo/config.toml, since they invoke cargo from the workspace or crate root, and cargo only looks for the .cargo directory in the current and parent directories; cargo ignores configurations in child directories. More information about where cargo looks for configuration files can be found in the Cargo Book (https://doc.rust-lang.org/cargo/reference/config.html).

If this configuration file is missing during compilation, tokio-metrics will not work, and alternating between builds with and without it will trigger full rebuilds of your project.

The rt feature of tokio-metrics is on by default; simply check that you do not set default-features = false when declaring it as a dependency; e.g.:

[dependencies]
tokio-metrics = "0.3.1"

From within a Tokio runtime, use RuntimeMonitor to monitor key metrics of that runtime.

use std::time::Duration;

// a stand-in for real application work
async fn do_work() {
    tokio::time::sleep(Duration::from_millis(100)).await;
}

// requires the rt feature (on by default) and --cfg tokio_unstable
#[tokio::main]
async fn main() {
    let handle = tokio::runtime::Handle::current();
    let runtime_monitor = tokio_metrics::RuntimeMonitor::new(&handle);

    // print runtime metrics every 500ms
    let frequency = Duration::from_millis(500);
    tokio::spawn(async move {
        for metrics in runtime_monitor.intervals() {
            println!("Metrics = {:?}", metrics);
            tokio::time::sleep(frequency).await;
        }
    });

    // run some tasks
    tokio::spawn(do_work());
    tokio::spawn(do_work());
    tokio::spawn(do_work());

    // keep the runtime alive long enough to print a few intervals
    tokio::time::sleep(Duration::from_secs(2)).await;
}

Runtime Metrics

Base Metrics

Derived Metrics

Relation to Tokio Console

Currently, Tokio Console is primarily intended for local debugging. Tokio metrics is intended to enable reporting of metrics in production to your preferred tools. Longer term, it is likely that tokio-metrics will merge with Tokio Console.

License

This project is licensed under the MIT license.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in tokio-metrics by you, shall be licensed as MIT, without any additional terms or conditions.


tokio-metrics's Issues

Compatibility with Prometheus and pull-based approach in general

Hi there!

I'd like to start exposing Tokio runtime metrics as part of my application's Prometheus metrics. Unfortunately, there are a number of conceptual differences that make tokio-metrics not really suitable for this.

Prometheus usually scrapes an application's metrics by calling an HTTP endpoint at equal time intervals. In my experience, scrape intervals range between 15 seconds and 5 minutes; the choice is a trade-off between resolution requirements and available storage resources. In any case, metric changes between two scrapes are not observable via Prometheus, so the usual best practice is to implement most metrics as non-decreasing counters and derive rates from those.

Also, since each metric scrape is a network interaction, it can fail and be retried with no guarantee of whether the request actually reached the process. Because of that, it's important for a metrics endpoint to be stateless, which the intervals iterator violates. Ideally, retrieving the current state of metrics would cause no state change at all.

Do you think that tokio-metrics is a good place to implement that kind of stuff or do you believe it targets a different type of metrics here?
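A minimal sketch of the stateless, pull-friendly shape described above, under the assumption that one reads cumulative counters via TaskMonitor::cumulative() on every scrape rather than consuming the stateful intervals() iterator (the metric names and plain-text formatting are illustrative, not an official integration):

// a stateless scrape body built from cumulative task counters;
// cumulative() only reads and does not advance interval state,
// so a failed scrape can simply be retried
fn prometheus_text(monitor: &tokio_metrics::TaskMonitor) -> String {
    let m = monitor.cumulative();
    format!(
        "# TYPE tokio_task_polls_total counter\n\
         tokio_task_polls_total {}\n\
         # TYPE tokio_task_poll_seconds_total counter\n\
         tokio_task_poll_seconds_total {}\n",
        m.total_poll_count,
        m.total_poll_duration.as_secs_f64(),
    )
}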

cargo test with 3 failures at main branch (e66d2ff654c72868b887f77bb472cf5d9bbbcc07)

~/github.com/tokio-metrics:main@e66d2ff$ RUSTFLAGS="--cfg tokio_unstable" cargo test --all-features
    Finished test [unoptimized + debuginfo] target(s) in 0.15s
     Running unittests (target/debug/deps/tokio_metrics-ec134d5a58bb3238)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

   Doc-tests tokio-metrics

running 40 tests
test src/task.rs - task::TaskMetrics::first_poll_count (line 607) ... ok
test src/task.rs - task::TaskMetrics::instrumented_count (line 538) ... ok
test src/task.rs - task::TaskMetrics::mean_poll_duration (line 2001) ... ok
test src/task.rs - task::TaskMetrics::dropped_count (line 568) ... ok
test src/task.rs - task::TaskMetrics::total_fast_poll_count (line 1089) ... ok
test src/task.rs - task::TaskMetrics::mean_slow_poll_duration (line 2222) ... ok
test src/task.rs - task::TaskMetrics::mean_fast_poll_duration (line 2131) ... ok
test src/task.rs - task::TaskMetrics::slow_poll_ratio (line 2046) ... ok
test src/task.rs - task::TaskMetrics::mean_idle_duration (line 1881) ... ok
test src/task.rs - task::TaskMetrics::total_fast_poll_duration (line 1144) ... ok
test src/task.rs - task::TaskMetrics::total_first_poll_delay (line 648) ... ok
test src/task.rs - task::TaskMetrics::total_first_poll_delay (line 697) ... ok
test src/task.rs - task::TaskMetrics::total_first_poll_delay (line 731) ... FAILED
test src/task.rs - task::TaskMetrics::total_idle_duration (line 811) ... ok
test src/task.rs - task::TaskMetrics::total_idled_count (line 770) ... ok
test src/task.rs - task::TaskMonitor (line 306) ... ignored
test src/task.rs - task::TaskMonitor (line 321) ... ignored
test src/task.rs - task::TaskMetrics::total_poll_count (line 989) ... ok
test src/task.rs - task::TaskMetrics::total_poll_duration (line 1054) ... ok
test src/task.rs - task::TaskMetrics::total_scheduled_count (line 850) ... ok
test src/task.rs - task::TaskMetrics::mean_first_poll_delay (line 1811) ... ok
test src/task.rs - task::TaskMetrics::total_slow_poll_count (line 1211) ... ok
test src/task.rs - task::TaskMetrics::total_slow_poll_duration (line 1269) ... ok
test src/task.rs - task::TaskMonitor (line 71) - compile ... ok
test src/task.rs - task::TaskMonitor (line 362) ... FAILED
test src/task.rs - task::TaskMonitor (line 388) ... FAILED
test src/task.rs - task::TaskMonitor (line 413) ... ok
test src/lib.rs - (line 12) ... ok
test src/task.rs - task::TaskMonitor::cumulative (line 1571) ... ok
test src/task.rs - task::TaskMonitor (line 452) ... ok
test src/task.rs - task::TaskMonitor::instrument (line 1488) ... ok
test src/task.rs - task::TaskMonitor::instrument (line 1510) ... ok
test src/task.rs - task::TaskMonitor::instrument (line 1530) ... ok
test src/task.rs - task::TaskMonitor (line 281) ... ok
test src/task.rs - task::TaskMetrics::total_scheduled_duration (line 920) ... ok
test src/task.rs - task::TaskMonitor::intervals (line 1632) ... ok
test src/task.rs - task::TaskMonitor::slow_poll_threshold (line 1467) ... ok
test src/task.rs - task::TaskMonitor::with_slow_poll_threshold (line 1406) ... ok
test src/task.rs - task::TaskMetrics::mean_scheduled_duration (line 1920) ... ok
test src/task.rs - task::TaskMonitor (line 24) ... ok

failures:

---- src/task.rs - task::TaskMetrics::total_first_poll_delay (line 731) stdout ----
Test executable failed (exit code 101).

stderr:
thread 'main' panicked at 'overflow when adding duration to instant', library/std/src/time.rs:409:33
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


---- src/task.rs - task::TaskMonitor (line 362) stdout ----
Test executable failed (exit code 101).

stderr:
thread 'main' panicked at 'overflow when adding duration to instant', library/std/src/time.rs:409:33
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


---- src/task.rs - task::TaskMonitor (line 388) stdout ----
Test executable failed (exit code 101).

stderr:
thread 'main' panicked at 'overflow when adding duration to instant', library/std/src/time.rs:409:33
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace



failures:
    src/task.rs - task::TaskMetrics::total_first_poll_delay (line 731)
    src/task.rs - task::TaskMonitor (line 362)
    src/task.rs - task::TaskMonitor (line 388)

test result: FAILED. 35 passed; 3 failed; 2 ignored; 0 measured; 0 filtered out; finished in 7.52s

error: test failed, to rerun pass '--doc'

This is a macOS environment.

Crisper examples of runtime metrics.

For each task metric, it's fairly easy to write a crisp, self-contained example that reliably induces a change in a metric. For runtime metrics, it's currently not so easy to do this, because:

  1. runtime metrics are buffered
  2. some runtime metrics are dependent on scheduling pathologies that are finicky to induce

We could resolve the first obstacle by providing some mechanism to flush metrics on demand. For the second obstacle, I'm not sure there's much we can do.

Emit task metrics for single invocations instead of interval samples

Hello,

This is a feature request for some way to get the TaskMetrics for the invocation of a single future. Something like:

let monitor = tokio_metrics::TaskMonitor::new();

let (metrics, other_return_value) = monitor.instrument_single(some_future()).await;

The API usage above is not intended to be the actual API; it just illustrates the idea. I want this feature so that I can record the overhead of every single execution of the some_future() future.

The ultimate reason is that I'm writing a program that measures the latency of remote service calls, and I want to understand what kind of overhead I'm seeing as a result of using an async runtime, as opposed to a simple blocking, threaded application. I'd like to see this on a per-request basis so that I can confirm that high request latencies come only from the remote system, not from a delay in scheduling the task.
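A hedged sketch of how something close to this could be approximated with today's API: dedicate a fresh TaskMonitor to a single invocation and read its cumulative metrics once the future completes. The instrument_single helper below is hypothetical, not part of tokio-metrics:

// hypothetical helper: per-invocation metrics via a throwaway monitor
async fn instrument_single<F>(fut: F) -> (tokio_metrics::TaskMetrics, F::Output)
where
    F: std::future::Future,
{
    let monitor = tokio_metrics::TaskMonitor::new();
    // only this one future is instrumented, so the cumulative metrics
    // describe exactly this invocation
    let output = monitor.instrument(fut).await;
    (monitor.cumulative(), output)
}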

Is it worth tracking/exposing `num_scheduled`?

I think num_scheduled is going to equal num_polls - num_tasks? Need to double-check this, but if so, it doesn't need to be a field in the Metrics struct; it could be computed in a method, instead.

Should it even be exposed? @carllerche points out that this metric matters much more at the runtime level, since there are multiple ways tasks may be scheduled. For task metrics, what matters more is time spent scheduled. At least internally, we need to account for num_scheduled so we can compute mean_time_scheduled, but maybe num_scheduled doesn't actually need to be exposed.
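For reference, the derived computation mentioned above needs only the scheduled count and a total scheduled duration; a minimal sketch with illustrative names:

use std::time::Duration;

// mean time spent scheduled, derived from internal totals;
// the u32 cast is acceptable for a sketch
fn mean_time_scheduled(total_scheduled: Duration, num_scheduled: u64) -> Duration {
    if num_scheduled == 0 {
        Duration::ZERO
    } else {
        total_scheduled / num_scheduled as u32
    }
}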

0.1 TODOs

  • clean up Cargo.toml, features
    • should time be an optional feature?
  • proofread documentation
  • decide what the runtime metrics MVP is and fill the gaps
  • set up CI
  • update README
  • blog post

Fix based on changes to yield_now

Due to tokio-rs/tokio#5223, some metrics tests were broken. These need to be fixed.

failures:

---- src/task.rs - task::TaskMetrics::mean_scheduled_duration (line 1924) stdout ----
Test executable failed (exit status: 101).

stderr:
thread 'main' panicked at 'assertion failed: interval.mean_scheduled_duration() >= Duration::from_secs(1)', src/task.rs:34:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/core/src/panicking.rs:111:5
   3: rust_out::main::{{closure}}
   4: <core::pin::Pin<P> as core::future::future::Future>::poll
   5: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}::{{closure}}
   6: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}
   7: tokio::runtime::scheduler::current_thread::Context::enter
   8: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}
   9: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
  10: tokio::macros::scoped_tls::ScopedKey<T>::set
  11: tokio::runtime::scheduler::current_thread::CoreGuard::enter
  12: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
  13: tokio::runtime::scheduler::current_thread::CurrentThread::block_on
  14: tokio::runtime::runtime::Runtime::block_on
  15: rust_out::main
  16: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


---- src/task.rs - task::TaskMetrics::total_scheduled_duration (line 922) stdout ----
Test executable failed (exit status: 101).

stderr:
thread 'main' panicked at 'assertion failed: total_scheduled_duration >= Duration::from_millis(1000)', src/task.rs:30:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/core/src/panicking.rs:111:5
   3: rust_out::main::{{closure}}
   4: <core::pin::Pin<P> as core::future::future::Future>::poll
   5: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}::{{closure}}
   6: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}
   7: tokio::runtime::scheduler::current_thread::Context::enter
   8: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}
   9: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
  10: tokio::macros::scoped_tls::ScopedKey<T>::set
  11: tokio::runtime::scheduler::current_thread::CoreGuard::enter
  12: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
  13: tokio::runtime::scheduler::current_thread::CurrentThread::block_on
  14: tokio::runtime::runtime::Runtime::block_on
  15: rust_out::main
  16: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.



failures:
    src/task.rs - task::TaskMetrics::mean_scheduled_duration (line 1924)
    src/task.rs - task::TaskMetrics::total_scheduled_duration (line 922)

test result: FAILED. 56 passed; 2 failed; 2 ignored; 0 measured; 0 filtered out; finished in 35.93s

UI for metrics

Hey there,
is there some kind of (maybe optionally feature gated) integrated UI planned for this?

I'm not really good at web stuff, but I guess I'll integrate a small Chart.js-driven one without data retention into my project for now. Should I share that once it's done?

compatibility with tokio

I wanted to print metrics for a Tokio example at master HEAD, but I get the error below:


error[E0308]: mismatched types
    --> examples/tinyhttp.rs:40:51
     |
40   |         let runtime_monitor = RuntimeMonitor::new(&handle);
     |                               ------------------- ^^^^^^^ expected struct `tokio::runtime::handle::Handle`, found struct `Handle`
     |                               |
     |                               arguments to this function are incorrect
     |
     = note: expected reference `&tokio::runtime::handle::Handle`
                found reference `&Handle`
     = note: perhaps two different versions of crate `tokio` are being used?
note: associated function defined here
    --> /root/github/tokio-metrics/src/runtime.rs:1015:12
     |
1015 |     pub fn new(runtime: &runtime::Handle) -> RuntimeMonitor {
     |            ^^^

For more information about this error, try `rustc --explain E0308`.
error: could not compile `examples` due to previous error

Full change in Tokio:

diff --git a/.cargo/config b/.cargo/config
index df885898..71097e3c 100644
--- a/.cargo/config
+++ b/.cargo/config
@@ -1,2 +1,5 @@
+[build]
+rustflags = ["--cfg", "tokio_unstable"]
+rustdocflags = ["--cfg", "tokio_unstable"]
 # [build]
-# rustflags = ["--cfg", "tokio_unstable"]
\ No newline at end of file
+# rustflags = ["--cfg", "tokio_unstable"]
diff --git a/examples/Cargo.toml b/examples/Cargo.toml
index b35c587b..e628ceb2 100644
--- a/examples/Cargo.toml
+++ b/examples/Cargo.toml
@@ -10,7 +10,7 @@ edition = "2018"
 tokio = { version = "1.0.0", path = "../tokio", features = ["full", "tracing"] }
 tokio-util = { version = "0.7.0", path = "../tokio-util", features = ["full"] }
 tokio-stream = { version = "0.1", path = "../tokio-stream" }
-
+tokio-metrics = { version = "0.1.0", path = "../../tokio-metrics" }
 tracing = "0.1"
 tracing-subscriber = { version = "0.3.1", default-features = false, features = ["fmt", "ansi", "env-filter", "tracing-log"] }
 bytes = "1.0.0"
@@ -24,6 +24,9 @@ httpdate = "1.0"
 once_cell = "1.5.2"
 rand = "0.8.3"

+
+
+
 [target.'cfg(windows)'.dev-dependencies.windows-sys]
 version = "0.42.0"

diff --git a/examples/tinyhttp.rs b/examples/tinyhttp.rs
index fa0bc669..0457406a 100644
--- a/examples/tinyhttp.rs
+++ b/examples/tinyhttp.rs
@@ -18,8 +18,10 @@ use futures::SinkExt;
 use http::{header::HeaderValue, Request, Response, StatusCode};
 #[macro_use]
 extern crate serde_derive;
+use std::time::Duration;
 use std::{env, error::Error, fmt, io};
 use tokio::net::{TcpListener, TcpStream};
+use tokio_metrics::RuntimeMonitor;
 use tokio_stream::StreamExt;
 use tokio_util::codec::{Decoder, Encoder, Framed};

@@ -33,6 +35,18 @@ async fn main() -> Result<(), Box<dyn Error>> {
     let server = TcpListener::bind(&addr).await?;
     println!("Listening on: {}", addr);

+    let handle = tokio::runtime::Handle::current();
+    {
+        let runtime_monitor = RuntimeMonitor::new(&handle);
+        tokio::spawn(async move {
+            for interval in runtime_monitor.intervals() {
+                // pretty-print the metric interval
+                println!("{:?}", interval);
+                // wait 500ms
+                tokio::time::sleep(Duration::from_secs(1)).await;
+            }
+        });
+    }
     loop {
         let (stream, _) = server.accept().await?;
         tokio::spawn(async move {

Command:

RUSTFLAGS="--cfg tokio_unstable" cargo run --example tinyhttp

Metric integrity in long-running applications.

Is storing durations as u64 nanoseconds enough? A u64 of nanoseconds can represent about 584 years, but if 5,000 tasks accumulate time into the same counter, you'll burn through it in about 42 days of uptime. That's plausible for a long-running application. At minimum, we should make sure it doesn't panic on overflow/underflow.
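A back-of-envelope check of those numbers (a quick sketch, not taken from the issue):

fn main() {
    let ns_per_year = 365.25 * 24.0 * 3600.0 * 1e9;
    // u64::MAX nanoseconds is roughly 584.5 years of wall time
    let capacity_years = u64::MAX as f64 / ns_per_year;
    // 5,000 tasks accumulating into one counter divide that capacity
    // by 5,000, leaving roughly 42.7 days
    let days_at_5000_tasks = capacity_years * 365.25 / 5000.0;
    println!("{capacity_years:.1} years, {days_at_5000_tasks:.1} days");
}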
