
diagnostics's Introduction

Timely Diagnostics

Diagnostic tools for timely dataflow computations. Timely dataflows are data-parallel and scale from single threaded execution on your laptop to distributed execution across clusters of computers. Each thread of execution is called a worker.

The tools in this repository have the shared goal of providing insights into timely dataflows of any scale, in order to understand the structure and resource usage of a dataflow.

Each timely worker can be instructed to publish low-level event streams over a TCP socket by setting the TIMELY_WORKER_LOG_ADDR environment variable. In order to cope with the high volume of these logging streams, the diagnostic tools in this repository are themselves timely computations that we can scale out. To avoid confusion, we will refer to the workers of the dataflow being analysed as the source peers, and to the workers of the diagnostic computation as the inspector peers.
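For reference, a minimal source computation might look like the following sketch (assuming a recent timely-dataflow release; the dataflow itself is a placeholder). No logging code is needed: when TIMELY_WORKER_LOG_ADDR is set, timely itself connects each worker's event stream to that address.

use timely::dataflow::operators::{Inspect, ToStream};

fn main() {
    // Run with e.g. `env TIMELY_WORKER_LOG_ADDR="127.0.0.1:51317" cargo run -- -w 2`
    // and each of the two workers streams its low-level log events to that address.
    timely::execute_from_args(std::env::args(), |worker| {
        worker.dataflow::<u64, _, _>(|scope| {
            (0..10)
                .to_stream(scope)
                .inspect(|x| println!("seen: {}", x));
        });
    })
    .unwrap();
}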

This repository contains a library, tdiag-connect, and a command line interface to the diagnostic tools, tdiag.

tdiag-connect (in /connect) is a library of utilities that can be used by inspector peers to source event streams from source peers.

tdiag (in /tdiag) is a unified command line interface to all diagnostic tools (only one is currently available, more are coming).

Getting Started with tdiag

tdiag (on Crates.io) is the CLI to all diagnostic tools. Install it via cargo:

cargo install tdiag

All diagnostic computations require you to specify the number of workers running in the source computation via the --source-peers parameter. This is required in order to know when all source event streams are connected.

graph - Visualize the Source Dataflow

In order to better understand what is happening inside of a dataflow computation, it can be invaluable to visualize the structure of the dataflow. Start the graph diagnosis:

tdiag --source-peers 2 graph --out graph.html

You should be presented with a notice informing you that tdiag is waiting for as many connections as specified via --source-peers (two in this case).

In a separate shell, start your source computation. In this case, we will analyse the Timely PageRank example. From inside the timely-dataflow/timely sub-directory, run:

env TIMELY_WORKER_LOG_ADDR="127.0.0.1:51317" cargo run --example pagerank 1000 1000000 -w 2

Most importantly, env TIMELY_WORKER_LOG_ADDR="127.0.0.1:51317" will cause the source workers to connect to our diagnostic computation. The -w parameter specifies the number of workers we want to run the PageRank example with. Whatever we specify here therefore has to match the --source-peers parameter we used when starting tdiag.

Once the computation is running, head back to the diagnostic shell, where you should now see something like the following:

$ tdiag --source-peers 2 graph --out graph.html

Listening for 2 connections on 127.0.0.1:51317
Trace sources connected
Press enter to generate graph (this will crash the source computation if it hasn't terminated).

At any point, press enter as instructed. This will produce a fully self-contained HTML file at the path specified via --out (graph.html in this example). Open that file in any modern browser and you should see a rendering of the dataflow graph at the time you pressed enter. For the PageRank computation, the rendering should look similar to the following:

PageRank Graph

You can use your mouse or touchpad to move the graph around, and to zoom in and out.

profile - Profile the Source Dataflow

The profile subcommand reports aggregate runtime for each scope/operator.

tdiag --source-peers 2 profile

You should be presented with a notice informing you that tdiag is waiting for as many connections as specified via --source-peers (two in this case).

In a separate shell, start your source computation. In this case, we will analyse the Timely PageRank example. From inside the timely-dataflow/timely sub-directory, run:

env TIMELY_WORKER_LOG_ADDR="127.0.0.1:51317" cargo run --example pagerank 1000 1000000 -w 2

Most importantly, env TIMELY_WORKER_LOG_ADDR="127.0.0.1:51317" will cause the source workers to connect to our diagnostic computation. The -w parameter specifies the number of workers we want to run the PageRank example with. Whatever we specify here therefore has to match the --source-peers parameter we used when starting tdiag.

Once the computation is running, head back to the diagnostic shell, where you should now see something like the following:

$ tdiag --source-peers 2 profile

Listening for 2 connections on 127.0.0.1:51317
Trace sources connected
Press enter to stop collecting profile data (this will crash the source computation if it hasn't terminated).

At any point, press enter as instructed. This will produce an aggregate summary of runtime for each scope/operator. Note that the aggregates for the scopes (denoted by [scope]) include the time of all contained operators.

[scope]	Dataflow	(id=0, addr=[0]):	1.17870668e-1 s
	PageRank	(id=3, addr=[0, 3]):	1.17197194e-1 s
	Feedback	(id=2, addr=[0, 2]):	3.56249e-4 s
	Probe	(id=6, addr=[0, 4]):	7.86e-6 s
	Input	(id=1, addr=[0, 1]):	3.408e-6 s
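To read these numbers: the contained operators sum to roughly 1.1756e-1 s (1.17197194e-1 + 3.56249e-4 + 7.86e-6 + 3.408e-6), slightly below the 1.17870668e-1 s attributed to the enclosing Dataflow scope; the small remainder is presumably time spent in the scope itself rather than in any one contained operator.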

Diagnosing Differential Dataflows

The differential subcommand groups diagnostic tools that are only relevant to timely dataflows that make use of differential dataflow. To enable Differential logging in your own computation, add the following snippet to your code:

if let Ok(addr) = ::std::env::var("DIFFERENTIAL_LOG_ADDR") {
    if let Ok(stream) = ::std::net::TcpStream::connect(&addr) {
        differential_dataflow::logging::enable(worker, stream);
        info!("enabled DIFFERENTIAL logging to {}", addr);
    } else {
        panic!("Could not connect to differential log address: {:?}", addr);
    }
}

With this snippet included in your executable, you can use any of the following tools to analyse differential-specific aspects of your computation.
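If it helps to see the snippet in context: worker is the worker handle available inside the closure passed to timely::execute_from_args, and info! comes from the log crate (any logging works in its place). A rough sketch of the usual placement, at the top of the worker closure before any dataflows are built (the dataflow body is a placeholder):

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        // Enable Differential logging before constructing any dataflows,
        // so that no events are missed.
        if let Ok(addr) = std::env::var("DIFFERENTIAL_LOG_ADDR") {
            if let Ok(stream) = std::net::TcpStream::connect(&addr) {
                differential_dataflow::logging::enable(worker, stream);
            } else {
                panic!("Could not connect to differential log address: {:?}", addr);
            }
        }

        // ... build and drive your differential dataflows here ...
    })
    .unwrap();
}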

differential arrangements - Track the Size of Differential Arrangements

Stateful differential dataflow operators often maintain indexed input traces called arrangements. You will want to understand how these traces grow (through the accumulation of new inputs) and shrink (through compaction) in size, as your computation executes.
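To make this concrete, the following sketch (not part of this repository; names and sizes are illustrative) is a small differential computation whose count operator maintains such an arrangement internally, which is exactly the kind of trace this tool reports on:

use differential_dataflow::input::InputSession;
use differential_dataflow::operators::Count;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        // Feed u32 keys with isize multiplicities.
        let mut input: InputSession<u32, u32, isize> = InputSession::new();

        worker.dataflow(|scope| {
            // count arranges its input internally; the arrangements tool
            // reports how many tuples that arrangement holds over time.
            input
                .to_collection(scope)
                .count()
                .inspect(|x| println!("{:?}", x));
        });

        for round in 0..10u32 {
            input.insert(round % 3); // repeated keys accumulate in the trace
            input.advance_to(round + 1);
            input.flush();
            worker.step(); // drive the computation between rounds
        }
    })
    .unwrap();
}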

tdiag --source-peers 2 differential arrangements

You should be presented with a notice informing you that tdiag is waiting for as many connections as specified via --source-peers (two in this case).

In a separate shell, start your source computation. In this case, we will analyse the Differential BFS example. From inside the differential dataflow repository, run:

export TIMELY_WORKER_LOG_ADDR="127.0.0.1:51317"
export DIFFERENTIAL_LOG_ADDR="127.0.0.1:51318"

cargo run --example bfs 1000 10000 100 20 false -w 2

When analysing differential dataflows (in contrast to pure timely computations), both TIMELY_WORKER_LOG_ADDR and DIFFERENTIAL_LOG_ADDR must be set for the source workers to connect to our diagnostic computation. The -w parameter specifies the number of workers we want to run the BFS example with. Whatever we specify here therefore has to match the --source-peers parameter we used when starting tdiag.

Once the computation is running, head back to the diagnostic shell, where you should now see something like the following:

$ tdiag --source-peers 2 differential arrangements

Listening for 2 Timely connections on 127.0.0.1:51317
Listening for 2 Differential connections on 127.0.0.1:51318
Will report every 1000ms
Trace sources connected

ms	Worker	Op. Id	Name	# of tuples
1000	0	18	Arrange ([0, 4, 6])	654
1000	0	20	Arrange ([0, 4, 7])	5944
1000	0	28	Arrange ([0, 4, 10])	3790
1000	0	30	Reduce ([0, 4, 11])	654
1000	1	18	Arrange ([0, 4, 6])	679
1000	1	20	Arrange ([0, 4, 7])	6006
1000	1	28	Arrange ([0, 4, 10])	3913
1000	1	30	Reduce ([0, 4, 11])	678
2000	0	18	Arrange ([0, 4, 6])	654
2000	0	18	Arrange ([0, 4, 6])	950
2000	0	20	Arrange ([0, 4, 7])	5944
2000	0	20	Arrange ([0, 4, 7])	6937
2000	0	28	Arrange ([0, 4, 10])	3790

Each row of output specifies the time of the measurement, the worker and operator ids, the name of the arrangement, and the number of tuples it maintains. Updated sizes are reported every second by default; this can be controlled via the output-interval parameter.

The tdiag-connect library


tdiag-connect (in /connect) is a library of utilities that can be used by inspector peers to source event streams from source peers.

Documentation is at docs.rs/tdiag-connect.
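To illustrate the general idea (this is not tdiag-connect's actual API; consult the docs above for that), an inspector peer accepts one TCP connection per source peer and replays the received event streams into its own dataflow using timely's capture/replay machinery. A rough single-worker sketch, with the address, peer count, and event types as assumptions:

use std::net::TcpListener;
use std::time::Duration;

use timely::dataflow::operators::capture::{EventReader, Replay};
use timely::dataflow::operators::Inspect;
use timely::logging::TimelyEvent;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        // Accept one logging connection per source peer (two, in this sketch).
        let listener = TcpListener::bind("127.0.0.1:51317").expect("could not bind");
        let sockets: Vec<_> = (0..2)
            .map(|_| listener.incoming().next().unwrap().unwrap())
            .collect();

        // Each socket carries (Duration, worker id, TimelyEvent) log records;
        // replay them into a local dataflow and inspect them.
        worker.dataflow::<Duration, _, _>(|scope| {
            sockets
                .into_iter()
                .map(|socket| EventReader::<Duration, (Duration, usize, TimelyEvent), _>::new(socket))
                .replay_into(scope)
                .inspect(|event| println!("{:?}", event));
        });
    })
    .unwrap();
}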

diagnostics's People

Contributors

comnik, frankmcsherry, li1, maddyblue, namibj, utaal


diagnostics's Issues

[graph] visualization to indicate where a dataflow is stuck

@frankmcsherry:

It is hard to diagnose a "stuck" timely dataflow computation, where for some reason there is a capability (or perhaps message) in the system that prevents forward progress. In the system there is fairly clear information (in the progress tracking) about which pointstamps have non-zero accumulation, and although perhaps not strictly speaking a "visualization" we could imagine extracting and presenting this information.

@antiguru recently had a similar issue, in which he wanted to "complete" a dataflow without simply exiting the worker (to take some measurements), and when he attempted this the dataflow never reported completion. The root cause was ultimately that a forgotten input had been left un-closed.

One idiom that seemed helpful here was to imagine a version of the dataflow graph that reports e.g. whether operators have been tombstoned or not (closed completely, memory reclaimed). This would reveal who was keeping a dataflow open, which is a rougher version of what is holding a dataflow back. We might also look for similar idioms that allow people to ask, for a given timestamp/frontier, which operators have moved past that frontier and which have not, revealing where in the dataflow graph a time is "stuck".

[graph] Support filtering/collapsing of data flow regions to make huge graphs readable

This may be needed for complex graphs like Frank's epic doop graph.

Various snippets from #1:


@frankmcsherry

I've been thinking a bit about how to present these, and one thought was: maybe it makes sense to have two nodes for the feedback node, and to not connect them other than visually. This maybe allows the graph to dangle a bit better, and reveals the acyclic definitions.

@comnik

I think this could be a good use case for a touch of interactivity, e.g. draw the nodes somewhat differently to indicate an outgoing / incoming feedback edge, and then highlight the pair when the user hovers on either of the two nodes.

As an experiment, I built an extra script for adding DataScript into the mix. This is intended to be completely opt-in, without changing anything about the current representation.

I also added a hook to re-render the whole thing reactively.

This should give us a low-overhead (no React!) way to experiment with a few more dynamic features, such as highlighting feedback edges.

It would be helpful to have scopes be exported as well, which would allow us to do things such as collapsing / expanding scopes.

Diagnostics PageRank Example Stuck

Hello. I was trying to follow the example in the README.md, but tdiag gets stuck.

  1. On one terminal I execute: cargo run --release -- --source-peers 2 graph --out graph.html
  2. On a second terminal I execute: env TIMELY_WORKER_LOG_ADDR="127.0.0.1:51317" cargo run --release --example pagerank 1000 100000 -w 2
  3. pagerank runs to completion.
  4. tdiag acknowledges the connections via:
Listening for 2 connections on 127.0.0.1:51317
Trace sources connected
Press enter to generate graph (this will crash the source computation if it hasn't terminated).
  5. I press enter, but tdiag hangs indefinitely.

Looking at the stack trace of tdiag there are two threads. The main thread is waiting on a thread join. Thread2 also seems stuck on await_events. Stack trace for Thread2:

futex_wait_cancelable 0x00007ffff7d9c376
__pthread_cond_wait_common 0x00007ffff7d9c376
__pthread_cond_wait 0x00007ffff7d9c376
std::sys::unix::condvar::Condvar::wait condvar.rs:73
std::sys_common::condvar::Condvar::wait condvar.rs:50
std::sync::condvar::Condvar::wait condvar.rs:200
std::thread::park mod.rs:923
<timely_communication::allocator::thread::Thread as timely_communication::allocator::Allocate>::await_events thread.rs:44
<timely_communication::allocator::generic::Generic as timely_communication::allocator::Allocate>::await_events generic.rs:99
timely::worker::Worker<A>::step_or_park worker.rs:216
timely::execute::execute::{{closure}} execute.rs:206
timely_communication::initialize::initialize_from::{{closure}} initialize.rs:269
std::sys_common::backtrace::__rust_begin_short_backtrace backtrace.rs:130
std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}} mod.rs:475
<std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once panic.rs:318
std::panicking::try::do_call panicking.rs:297
__rust_try 0x000055555661a74d
std::panicking::try panicking.rs:274
std::panic::catch_unwind panic.rs:394
std::thread::Builder::spawn_unchecked::{{closure}} mod.rs:474
core::ops::function::FnOnce::call_once{{vtable-shim}} function.rs:232
<alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once boxed.rs:1034
<alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once boxed.rs:1034
std::sys::unix::thread::Thread::new::thread_start thread.rs:87
start_thread 0x00007ffff7d95609
clone 0x00007ffff7ed1103

Please let me know if I missed something when executing the commands.

This tool seems to be broken

Hi,

I have tried to use this tool with different examples and crate versions.

It always gets stuck: pressing enter has no result.

To reproduce:

FROM ubuntu:latest
RUN apt-get update
RUN apt-get install -y cargo git

RUN git clone https://github.com/TimelyDataflow/timely-dataflow
WORKDIR timely-dataflow

RUN cargo install tdiag

ENV PATH="$PATH:/root/.cargo/bin"

Run:

docker build -t foo .

And run this docker in two different shells:

docker run --name foo_container --rm -it foo tdiag --source-peers 2 graph --out graph.html
docker exec -it foo_container env TIMELY_WORKER_LOG_ADDR="127.0.0.1:51317" cargo run --example pagerank 1000 1000000 -w 2
