
sightglass's Introduction

sightglass

A benchmarking suite and tooling for Wasmtime and Cranelift

A Bytecode Alliance project


About

This repository contains benchmarking infrastructure for Wasmtime and Cranelift, as described in this RFC. In particular, it has

  • a benchmark suite of Wasm applications in benchmarks/*, and

  • a benchmark runner CLI tool to record, analyze, and display benchmark results in crates/cli/*.

We plan to implement a server that periodically runs benchmarks as new commits are pushed to Wasmtime and displays the history of those benchmark results, similar to Firefox's Are We Fast Yet?. However, this work is not yet complete. See issue 93 for details.

Results are always broken down by phase — compilation vs. instantiation vs. execution — for each program in the suite. This allows us to reason about, for example, compiler performance separately from the quality of its generated code. Here is how it all fits together:

  • each benchmark is compiled to a benchmark.wasm module that calls two host functions, bench.start and bench.end, to notify Sightglass of the portion of the execution to measure (see the benchmarks README)
  • we build an engine (e.g., Wasmtime) as a shared library that implements the bench API; the Sightglass infrastructure uses this to measure each phase (see an engine README)
  • the sightglass-cli tool runs benchmarks using the engines and emits measurements for each phase; this is configurable, e.g., with different measurement mechanisms, output formats, aggregations, etc.

This is NOT a General-Purpose WebAssembly Benchmark Suite

This benchmark suite and tooling is specifically designed for Wasmtime and Cranelift, as explained in the benchmarking suite RFC:

It is also worth mentioning this explicit non-goal: we do not intend to develop a general-purpose WebAssembly benchmark suite, used to compare between different WebAssembly compilers and runtimes. We don't intend to trigger a WebAssembly benchmarking war, reminiscent of JavaScript benchmarking wars in Web browsers. Doing so would make the benchmark suite's design high stakes, because engineers would be incentivized to game the benchmarks, and would additionally impose cross-engine portability constraints on the benchmark runner. We only intend to compare the performance of various versions of Wasmtime and Cranelift, where we don't need the cross-engine portability in the benchmark runner, and where gaming the benchmarks isn't incentivized.

Furthermore, general-purpose WebAssembly benchmarking must include WebAssembly on the Web. Doing that well requires including interactions with the rest of the Web browser: JavaScript, rendering, and the DOM. Building and integrating a full Web browser is overkill for our purposes, and represents significant additional complexity that we would prefer to avoid.

Even if someone did manage to get other Wasm engines hooked into this benchmarking infrastructure, comparing results across engines would likely be invalid. The wasmtime-bench-api intentionally does things that will likely hurt its absolute performance numbers but which help us more easily get statistically meaningful results, like randomizing the locations of heap allocations. Without taking great care to level the playing field with respect to these sorts of tweaks, as well as keeping an eye on all engine-specific configuration options, you'll end up comparing apples and oranges.

Usage

You can always see all subcommands and options via

cargo run -- help

There are flags to control how many processes we spawn and take measurements from, how many iterations we perform in each process, and so on.

That said, here are a couple of typical usage scenarios.

Building the Runtime Engine for Wasmtime

$ cd engines/wasmtime && rustc build.rs && ./build && cd ../../

Running the Default Benchmark Suite

$ cargo run -- benchmark --engine engines/wasmtime/libengine.so

This runs all benchmarks listed in default.suite. The output will be a summary of each benchmark program's compilation, instantiation, and execution times.

Running a Single Wasm Benchmark

$ cargo run -- benchmark --engine engines/wasmtime/libengine.so -- path/to/benchmark.wasm

Append multiple *.wasm paths to the end of that command to run multiple benchmarks.

Running All Benchmarks

$ cargo run -- benchmark --engine engines/wasmtime/libengine.so -- benchmarks/all.suite

*.suite files contain a newline-delimited list of benchmark paths, relative to the Sightglass project directory. This is a convenience for organizing benchmarks but is functionally equivalent to listing all of the *.wasm paths at the end of the benchmark command.
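For illustration, a suite file is just plain text with one benchmark path per line and # starting a line comment (the benchmark names below are examples; see the benchmarks/ directory for the actual set):

# Benchmarks to run together as a suite.
benchmarks/pulldown-cmark/benchmark.wasm
benchmarks/bz2/benchmark.wasm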

Comparing a Feature Branch to Main

First, build libwasmtime_bench_api.so (or .dylib or .dll depending on your OS) for the latest main branch:

$ cd ~/wasmtime
$ git checkout main
$ cargo build --release -p wasmtime-bench-api
$ cp target/release/libwasmtime_bench_api.so /tmp/wasmtime_main.so

Then, checkout your feature branch and build its libwasmtime_bench_api.so:

$ git checkout my-feature
$ cargo build --release -p wasmtime-bench-api

Finally, run the benchmarks and supply both versions of libwasmtime_bench_api.so via repeated use of the --engine flag:

$ cd ~/sightglass
$ cargo run -- \
    benchmark \
    --engine /tmp/wasmtime_main.so \
    --engine ~/wasmtime/target/release/libwasmtime_bench_api.so \
    -- \
    benchmarks/all.suite

The output will show a comparison between the main branch's results and your feature branch's results, giving you an effect size and confidence interval (i.e. "we are 99% confident that my-feature is 1.32x to 1.37x faster than main" or "there is no statistically significant difference in performance between my-feature and main") for each benchmark Wasm program in the suite.

As you make further changes to your my-feature branch, you can execute this command whenever you want new, updated benchmark results:

$ cargo build --manifest-path ~/wasmtime/Cargo.toml --release -p wasmtime-bench-api && \
    cargo run --manifest-path ~/sightglass/Cargo.toml -- \
      benchmark \
      --engine /tmp/wasmtime_main.so \
      --engine ~/wasmtime/target/release/libwasmtime_bench_api.so \
      -- \
      benchmarks/all.suite

Collecting Different Kinds of Results

Sightglass ships with several different kinds of measurement mechanisms, called measures. The default measure is cycles, which measures the number of CPU cycles elapsed during each phase (e.g., using RDTSC). The accuracy of this measure is documented here, but note that measuring CPU cycles alone can be problematic (e.g., CPU frequency changes, context switches, etc.).

Several measures can be configured using the --measure option:

  • cycles: the number of CPU cycles elapsed
  • perf-counters: a selection of common perf counters (CPU cycles, instructions retired, cache accesses, cache misses); only available on Linux
  • vtune: record each phase as a VTune task for analysis; see this help documentation for more details
  • noop: no measurement is performed

For example, run:

$ cargo run -- benchmark --measure perf-counters ...

Getting Raw JSON or CSV Results

If you don't want the results to be summarized and displayed in a human-readable format, you can get raw JSON or CSV via the --raw flag:

$ cargo run -- benchmark --raw --output-format csv -- benchmark.wasm

Then you can use your own R/Python/spreadsheets/etc. to analyze and visualize the benchmark results.

Adding a New Benchmark

Add a Dockerfile under benchmarks/<your benchmark> that builds a Wasm file bracketing the work to measure with the bench.start and bench.end host calls. See the benchmarks README for the full set of requirements and the build.sh script for building the file.
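For orientation, here is a minimal sketch of what such a benchmark program can look like in Rust. The import module and function names mirror the bench.start/bench.end naming above, and input.txt and do_work are hypothetical placeholders; see the benchmarks README and the existing benchmarks for the exact conventions.

// Host functions provided by the Sightglass bench API; only the code between
// bench_start and bench_end is measured as the execution phase.
#[link(wasm_import_module = "bench")]
extern "C" {
    #[link_name = "start"]
    fn bench_start();
    #[link_name = "end"]
    fn bench_end();
}

fn main() {
    let input = std::fs::read("input.txt").expect("read workload"); // hypothetical workload file
    unsafe { bench_start() };
    let result = do_work(&input); // the work we actually want to measure
    unsafe { bench_end() };
    println!("{}", result); // write the result so the compiler cannot optimize it away
}

fn do_work(input: &[u8]) -> usize {
    input.iter().map(|b| *b as usize).sum() // placeholder for the real computation
}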


sightglass's Issues

Detect and warn if the samples are not normally distributed

From #138:

Ah, and one more thought: have we considered any statistical analysis that would look for multi-modal distributions (and warn, at least)? If we see that e.g. half of all runs of a benchmark run in 0.3s and half in 0.5s, and the distribution looks like the sum of two Gaussians, it may be better to warn the user "please check settings X, Y, Z; you seem to be alternating between two different configurations randomly" than to just present a mean of 0.4s with some wide variance, while the latter makes more sense if we just have a single Gaussian with truly random noise.

We can use the Shapiro-Wilk test to determine normality: https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test

Account for varying CPU frequency more robustly

Most modern CPUs scale their clock frequency according to demand, and this CPU frequency scaling is always a headache when running benchmarks. There are two main dimensions in which this variance could cause trouble:

  • Varying frequency across time: if the CPU load of benchmarking causes the CPU to ramp up its frequency, then different benchmark runs could observe different results based on different CPU frequency.
  • Varying frequency across space: if different CPU cores are running at different frequencies, then benchmark runs might intermittently experience very different performance if they are not pinned to specific cores.

I've been seeing some puzzling results lately and I suspect at least part of the trouble has to do with the above. I've set my CPU cores to the Linux kernel's performance governor, but even then, on my 12-core Ryzen CPU, I see clock speeds between 3.6GHz and 4.2GHz, likely due to best-effort frequency boost (which is regulated by thermal bounds and so unpredictable).

Note that measuring only cycles does not completely remove the effects of clock speed, because parts of performance are pinned to other clocks -- e.g., memory latency depends on the DDR clock, not the core clock, and L3 cache latency depends on the uncore clock.

The best ways I know to avoid noise from varying CPU performance are:

  • Have longer benchmarks. Some of the benchmarks in this suite are only a few milliseconds long; this is not enough time to reach a steady state.
  • Interleave benchmark runs appropriately. Right now, it looks like the top-level runner does a batch of runs with one engine, then a batch of runs with another. If the runs for different engines/configurations were interleaved at the innermost loop, then system effects that vary over time would at least impact all configurations roughly equally.
  • Pin to a particular CPU core. For single-threaded benchmarks, this is probably the most robust way to have accurate A/B comparisons: if cores have slightly different clock frequencies, just pick one of them. Even better would be to do many runs and average across them all, but in high-core-count systems, removing this noise would take a lot of runs (hundreds of processes); with a few (5-10) process starts, it's entirely possible for the variance in mean core speed to be significant.
  • Observe CPU governor settings when on a known platform (Linux: /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor text file, will usually be ondemand, we want performance) and warn if scaling is turned on

Thoughts? Other ideas?

Use the median instead of/in addition to the average

The average isn't a robust statistic. (Whenever Bill Gates walks into a bar, everyone in the bar becomes a millionaire on average.)

It would be nice to add the median value to the set of benchmarking results, since it tends to be more stable overall.
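For reference, a minimal sketch of computing the median of a sample set, assuming raw samples are available as integers (e.g., cycle counts):

fn median(samples: &[u64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        // Even number of samples: average the two middle values.
        (sorted[mid - 1] as f64 + sorted[mid] as f64) / 2.0
    } else {
        sorted[mid] as f64
    })
}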

Benchmarking initializes a `precision::Precision` for every run, spending a second or two in a calibration loop

I spent some time looking into why my benchmark runs seemed to be "choppy", sitting doing ~nothing between each (very fast) Wasm benchmark run. It seems that we have pauses of up to a few seconds between each run.

It turns out that this is because the wall_cycle measurement provider in the recorder crate invokes precision::Precision::new, building a new precision measurement object, for each run.

The precision crate notes in its docs that "Note that on Linux system, this will perform calibration before returning" and recommends against running it more than necessary.

It seems we could get a significant speedup by saving one Precision object, maybe in a lazy-init'd global or somesuch, and reusing it.
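A minimal sketch of that lazy-init approach using std::sync::OnceLock; the Precision type below is a stand-in, since the real constructor lives in the precision crate and may take a configuration and return a Result:

use std::sync::OnceLock;

// Stand-in for precision::Precision; see the precision crate for the real API.
struct Precision;
impl Precision {
    fn new() -> Self {
        // Imagine the expensive calibration loop running here.
        Precision
    }
}

static PRECISION: OnceLock<Precision> = OnceLock::new();

fn shared_precision() -> &'static Precision {
    // get_or_init runs the constructor (and its calibration) at most once per
    // process; every later measurement reuses the same object.
    PRECISION.get_or_init(Precision::new)
}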

Test docker containers in CI

For the future: I really wish this were running in CI; it's very difficult to know what effect these changes will have without downloading the branch and running the thing on my system. There's no reason that GitHub actions couldn't do this, right?

Originally posted by @abrown in #26 (comment)

Report code size

We should report the code size of each Wasm benchmark when it is compiled to native code.

`--stop-after` isn't working

For example

cargo run -- --stop-after compilation -- benchmarks-next/pulldown-cmark/benchmark.wasm

starts instantiating and executing the benchmark instead of just compiling it.

cc @abrown

sightglass-next: add a format option for Markdown

In order to display tables of significant performance changes, we discussed creating a Markdown table from the analyzed data. Here is a proposed table:

| Benchmark      | Phase       | Event                | `main` (SHA: `0123456`) | PR #42 (SHA `abcdef2`) | Significant Change  |
|----------------|-------------|----------------------|-------------------------|------------------------|---------------------|
| shootout-fib2  | Execution   | cycles               | 100,000,000             | 110,000,000            | -10%                |
| shootout-gimli | Compilation | instructions-retired | ...                     | ...                    | +2%                 |
| ...            | ...         | ...                  | ...                     | ...                    | ...                 |

sightglass-next build process downloads WASI SDK many times

While watching the output of benchmarks-next/build.sh scroll by, I noticed that it seems that every individual benchmark build is reconstructing a Docker image from scratch, including a download of the 48MB WASI SDK tarball as well as a number of Ubuntu packages. This is somewhat wasteful (bandwidth is not free) and also means the build is slower than it could be.

Could we consider building a base Docker image once and then using it to build each benchmark in turn?

sightglass-next: extract native baseline benchmarking from webui_runner

One useful feature of the original sightglass code was the ability to run the benchmarks as native machine code in order to form a baseline for comparison. If we migrate this functionality from webui_runner to benchmarks-next (e.g.), we can then fully replace the old sightglass runner with the new one.

This involves some investigation to determine how to hook into the bench_start() and bench_end() calls with perf.

Emit measurement results immediately

@jameysharp mentioned in #202 (comment) that it would be nice if measurement results were emitted as soon as they were collected. Currently this is not the case: all the measurements for all of the benchmark runs are collected and only emitted at the end. This could be fixed by refactoring how the measurements are serialized by serde.
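A rough sketch of what "emit as soon as collected" could look like; Measurement here is a stand-in for the real type in the sightglass crates:

use std::io::Write;

#[derive(serde::Serialize)]
struct Measurement {
    benchmark: String,
    phase: String,
    value: u64,
}

// Write each measurement as one JSON line and flush immediately, instead of
// buffering every measurement and serializing the whole collection at the end.
fn emit<W: Write>(out: &mut W, measurement: &Measurement) -> anyhow::Result<()> {
    serde_json::to_writer(&mut *out, measurement)?;
    writeln!(out)?;
    out.flush()?;
    Ok(())
}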

sightglass-next: add independence test

We talked about adding an independence test to check that the measurements of multiple runs aren't correlated (@fitzgen, is that right?). We discussed also implementing this in Rust and it would seem to fit in the sightglass-analysis crate.

Warn when CPU governor is not "performance" on Linux

From #138:

Observe CPU governor settings when on a known platform (Linux: /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor text file, will usually be ondemand, we want performance) and warn if scaling is turned on

If anyone knows how to detect the equivalent thing for other OSes, please share :)
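A minimal sketch of the Linux side of this check, reading the sysfs path quoted above (other OSes would need a different mechanism):

use std::fs;

fn warn_if_not_performance_governor() {
    let Ok(entries) = fs::read_dir("/sys/devices/system/cpu") else {
        return; // not Linux, or sysfs unavailable
    };
    for entry in entries.flatten() {
        // Only cpuN directories have a cpufreq/scaling_governor file; other
        // entries are silently skipped when the read fails.
        let path = entry.path().join("cpufreq/scaling_governor");
        if let Ok(governor) = fs::read_to_string(&path) {
            let governor = governor.trim();
            if governor != "performance" {
                eprintln!(
                    "warning: {} is '{}'; expected 'performance' (results may be noisy)",
                    path.display(),
                    governor
                );
            }
        }
    }
}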

Tracking issue for moving to new runner

In #46, we introduced a new test runner implementation and conventions but there are many remaining tasks for improving that PR. I am copying the issues listed in #46 (comment) here and will check them off as they are completed:

  • move and rename webui/sg-history to crates/result-server
  • add the ability to run multiple iterations (e.g. -n 10) to sightglass-recorder
  • fix the perf-counting in sightglass-recorder, probably by using perf-event or, if not, by resolving gz/rust-perfcnt#19
  • add the ability to choose the output format (e.g. CSV vs JSON); this can probably be lifted from what sightglass currently does
  • add a sightglass-analysis crate to aggregate and synthesize results
  • add more benchmarks! (Initially, port the existing benchmarks in shootout and polybench to the new style)
  • add a CLI command (build-engine?) for building engine libraries from name@commit strings in addition to passing paths to pre-built engines (the current approach)
  • eventually, integrate duplicated parts (e.g. sightglass's test runner, webui_runner's Docker infrastructure) and remove unused code
  • re-enable the test (and run it in CI) verifying that sightglass-artifact can build Docker images into Wasm benchmarks despite differences in Docker CLI output (i.e. potentially switch to using a crate exposing the Docker API)--perhaps we should just regenerate all the benchmark Wasm files with benchmarks-next/build.sh and check for changes to ensure that stuff is deterministic
  • add a test (and run it in CI) to run sightglass-recorder on a compiled Wasm benchmark--perhaps we just run all the benchmark Wasm files with benchmarks-next/run.sh to make sure they are all runnable

sightglass-next: Check that benchmarks produced expected results

A few things might lead us to measuring the wrong thing / not measuring what we want:

  • A buggy benchmark program that does the wrong computation
  • A codegen bug in a compiler PR we are testing
  • Not writing the benchmark program's result through I/O, allowing the compiler to optimize it away

I propose we solve these issues by:

  • requiring that benchmark programs write the result of their computation to stdout
  • the bench API's WASI context redirects the benchmark program's stdout to a known log file
  • the known log file is passed into the bench API through a config struct (that will also have the WASI working dir, so that won't be a standalone argument anymore)
  • benchmark programs have default.expected and (if it supports small workloads) small.expected files containing the hash of the expected stdout
  • the benchmark runner hashes the stdout log file and compares it to the expected hash
    • if they match, the runner keeps doing its thing
    • if not, then it exits with an error
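A rough sketch of the runner-side check; the proposal doesn't fix a hash algorithm, so this assumes SHA-256 with a hex-encoded digest stored in the *.expected file:

use std::fs;
use sha2::{Digest, Sha256};

fn verify_stdout(stdout_log: &str, expected_file: &str) -> anyhow::Result<()> {
    // Hash the captured stdout and compare it to the recorded expected digest.
    let actual = format!("{:x}", Sha256::digest(fs::read(stdout_log)?));
    let expected = fs::read_to_string(expected_file)?.trim().to_string();
    anyhow::ensure!(
        actual == expected,
        "benchmark output mismatch: expected {expected}, got {actual}"
    );
    Ok(())
}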

Remove duplication in VM infrastructure

As noted in a few places (e.g., here and here), there is duplication in the code that enables each of the VMs to be run inside sightglass. This would be resolved by factoring out the duplicated parts in webui_runner/plugs into one common implementation.

Interleave benchmark iterations, not just processes

From #138:

Interleave benchmark runs appropriately. Right now, it looks like the top-level runner does a batch of runs with one engine, then a batch of runs with another. If the runs for different engines/configurations were interleaved at the innermost loop, then system effects that vary over time would at least impact all configurations roughly equally.

Faster/slower summary messages are confusing: invert one of the adjectives or one of the ratios

The sightglass-cli benchmark command produces output like the following:

  baseline.so is 1.00x to 1.03x FASTER than target/release/libwasmtime_bench_api.so!
  target/release/libwasmtime_bench_api.so is 0.97x to 1.00x SLOWER than baseline.so!

The intent to show the ratio in both directions is nice; however, these two statements actually contradict each other, I think. If A is n times faster than B, then B is n times slower than A. However, this summary says that B is 1/n times slower than A.

In other words, either both words above should be "FASTER", or the second ratio should also be 1.00x to 1.03x.

This was producing some confusion for me earlier and I had to read the source to figure out which direction is intended.

Weird file CI issues with benchmark_effect_size test

On Windows, the benchmark_effect_size test has started intermittently failing since #170. It is unclear to me what could have modified the execution of the code that copies the built engine to a location provided by tempfile::NamedTempFile. I added an assert in hopes that it would trigger the error sooner but the OS always thinks the file exists:

assert!(alt_engine_path.exists());

One example of this failure is here:

thread 'benchmark::benchmark_effect_size' panicked at 'Unexpected failure.
code=-1073741819
...
command=`"D:\\a\\sightglass\\sightglass\\target\\debug\\sightglass-cli.exe" "benchmark" "--engine" "\\\\?\\D:\\a\\sightglass\\sightglass\\engines\\wasmtime\\engine.dll" "--engine" "C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\.tmpaNVVum" "--processes" "1" "--iterations-per-process" "3" "../../benchmarks-next/noop/benchmark.wasm"`

If this random link is to be believed, the code referenced above is an access violation that might indicate something like a null pointer dereference. But why? The original engine library seems to run fine according to the logs; only the copied engine library has the problem.

sightglass-next: modify lucet to use new runner

The two known users of the original Sightglass code (see src) are:

  • lucet, see the Git submodule and the benchmarks directory
  • webui_runner, a component of Sightglass (see webui_runner)

If lucet migrates to sightglass-next, then it is significantly easier to refactor this repository to remove the old Sightglass code.

Organize benchmarks into suites

I am proposing in this issue that we add a way for sightglass-cli to benchmark a specific set of files together.

Why

  • I have noticed that certain users only run the larger benchmarks (spidermonkey, bz2, pulldown-cmark); in the past I have wanted to only run the shootout-* micro-benchmarks or "all the benchmarks that use SIMD."
  • New benchmarks, like the wasi-nn one submitted in #201, would require special setup and would not run successfully with sightglass-cli benchmark ... benchmarks/*/benchmark.wasm (as the README currently advises the user). These special benchmarks (e.g., due to environment setup, compile-time or runtime flags, etc.) could be separated into their own suite.
  • As described in the RFCs on benchmarking, candidate benchmarks should be "real, widely used programs, or at least extracted kernels of such programs." Though these are good criteria, they can be subjective (e.g., though some shootout benchmarks are "extracted kernels", some may consider them too small to be representative); organizing the benchmarks into suites would recognize and document these differences.
  • #71 requests a default set of benchmarks to run automatically if no benchmark files are specified — one could imagine a default suite for that.

How

The core idea would be to create a set of *.suite files. *.suite files would contain a newline-delimited list of benchmark paths relative to the Sightglass project directory and would accept # line comments. E.g., foo.suite could look like:

# This suite contains benchmarks that call the `foo` function.
benchmarks/foo/benchmark.wasm
benchmarks/bar/benchmark.wasm
benchmarks/baz/specially-crafted-benchmark.wasm

sightglass-cli would accept a new flag — --suite — with the path to the *.suite file to use. It would be an error to specify both a suite and a benchmark file; other than that, running benchmarks by path (the current way) would remain unchanged. Users could create their own *.suite files and manage these outside the repository but the Sightglass repository would contain a few *.suite files. There are many ways to slice this, but I could see the following initial set:

  • default.suite
  • shootout.suite
  • simd.suite

If no benchmark paths or suites are specified, sightglass-cli would run the benchmarks contained in the default.suite.


I would appreciate feedback on this proposal before implementing it. Any thoughts one way or the other are welcome!

Provide option to collect JSON output in separate file

I was unable to work out how to collect JSON output into a machine-readable file when running a sequence of sightglass-cli in-process-benchmark commands. For now I am grepping through and manually cleaning up a transcript of stdout/stderr -- but this is far from ideal. Would it be possible to either add an option to write output to a particular file, or to document this if there is already a way?

Replace sightglass with sightglass-next

At some point, the code in src will no longer be used and can be removed. When this happens, we should be able to remove that directory as well as the benchmarks directory (and rename benchmarks-next to benchmarks). There are several issues to close before this can happen:

  • modify lucet to use new runner, #92
  • integrate the output of the new runner with the webui, #94
  • extract native baseline benchmarking from webui_runner, #96

Allow multiple instantiations per compilation

This would allow us to get more samples in that much less time (could get, say, ten instantiation and execution samples per compilation) but would also let us stress test things like Wasmtime's pooling allocator.

(We can't allow multiple executions per instantiation because WASI's contract for commands is that _start is only called once.)

Allow `summarize` to aggregate multiple benchmarks into one score

When measuring more than one benchmark, it would be nice to be able to aggregate the results into a single score. One common way to do this is to take the geometric mean of the set of results. This issue proposes adding an --aggregate-benchmarks flag to do exactly that. When enabled, sightglass-cli summarize --aggregate-benchmarks would emit a single, geomean-aggregated result for each of the phases, using, e.g., <all benchmarks> as the value of the benchmark column.
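A minimal sketch of the proposed aggregation, computed in log space to avoid overflow on large values:

fn geometric_mean(results: &[f64]) -> Option<f64> {
    // The geometric mean is only defined for a non-empty set of positive values.
    if results.is_empty() || results.iter().any(|&r| r <= 0.0) {
        return None;
    }
    let mean_of_logs = results.iter().map(|r| r.ln()).sum::<f64>() / results.len() as f64;
    Some(mean_of_logs.exp())
}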

sightglass next: Support for multiple input workloads

After chatting with @abrown we came up with the following extension to the benchmark program protocol to support multiple input workloads (e.g., a "default" workload and a "small" workload that is used for quickly testing an idea):

  • the runner has a --small-workload flag
  • this sets a WASM_BENCH_USE_SMALL_WORKLOAD=1 environment variable for benchmark programs
  • the directory where the wasm benchmark program lives is preopened for its execution
  • the wasm benchmark program may read that env var and use that to decide which input files to read and process as its workload

Example:

fn main() {
    // The runner's --small-workload flag sets WASM_BENCH_USE_SMALL_WORKLOAD for the
    // benchmark program; pick the input files accordingly.
    let workload = if std::env::var("WASM_BENCH_USE_SMALL_WORKLOAD").is_ok() {
        read_small()
    } else {
        read_default()
    };
    // Only the work between bench_start and bench_end is measured.
    bench_start();
    process(workload);
    bench_end();
}

Things to note:

  • benchmarks can ignore the env var if they have only a single workload and it Just Works (e.g., microbenchmarks)
  • easier to implement in the runner than a stdin protocol; the work is pushed to the benchmark programs themselves, where it is easier to do than in the runner
  • flexible for situations where a workload is made up of multiple files, whereas stdin would require some sort of JSON wrapping of them or something
  • easy to add new env vars in the future for new extension points, if needed

Change default branch name

As a policy, the Bytecode Alliance is changing the default branch names in all repositories. We would like for all projects to change the default to main by June 26. (We mention June 26th because there is some suggestion that GitHub may be adding the ability to make this process more seamless. Feel free to wait for that, but only up to June 26. We'll provide further support and documentation before that date.)

Please consider this a tracking issue. It is not intended for public debate.

  • Change branch name
  • Update CI
  • Update build scripts
  • Update documentation

Updates for sightglass to target portability and plugin features

@jedisct1 @pchickey @sunfishcode

Hi,

Having had some interest in surveying the performance of various standalone Wasm VMs, I've made several updates to sightglass that I think would be a useful contribution:

I've explicitly split the runner and the UI in master into different projects. One of the reasons for this is that I've also containerized each component so that they can be installed and used independently. In addition to the logical separations and container feature, there are several updates that should be useful:

  1. The runner portion is now plugin based and automated for the combination of VMs and workloads. What this means is that you add a script that defines how to check out and compile a VM and the framework will automatically call this script before running a test suite. It also means that there is a separate script that defines how to build for a particular workload suite and the framework will automatically call this script as well when needed.

  2. I've added script support for lucet and wasmtime but other VMs including (node, wamr, etc) can be plugged in as well. So instead of adding sightglass to lucet, lucet is added to sightglass and updated (and the same for all the other VMs)

  3. Also added polybench http://web.cse.ohio-state.edu/~pouchet.2/software/polybench/ which currently runs for wasmtime_app and lucet_app

  4. Added a driver to make running everything simple. In particular it is possible to send results from the sightglass_runner to the server running the sightglass_viewer.

  5. Added a viewer for sightglass_runner to see results immediately without needing to send them to the server running sightglass_viewer, which is really intended to track history.

To get started with sightglass_runner, build the container and run a command through the driver, i.e.:

./sg_container_build.sh
./sg_container_runner.sh --help
or
./sg_container_runner.sh -r lucet_app -p shootout

To get started with sightglass_ui, build the container and launch the containerized web-server

./sg_webui_build.sh
./sg_webui_run.sh

I'd like some feedback on this. I have found sightglass useful, and I've found these changes useful; I think these contributions could either be contributed back to sightglass as separate branches or, preferably, folded into the CraneStation project https://github.com/CraneStation since there is more visibility there and this would be useful for all projects that depend on CraneStation and WASI.

CLI tests hanging

Not sure what happened, but the CLI tests are hanging for me on my local machine after I pulled the latest main branch. They used to take 10s to complete, which I thought was a little longer than they should have, but I was hopeful that after #135 they would speed up, so I wanted to measure them again. Now they are just hanging. So I did a git bisect to find where they started hanging, and it turns out they hang even at the very first commit that introduced the CLI tests: 4345122. This is perplexing, because they definitely didn't hang when I introduced that commit!

I've also verified that manually running the commands that the tests run does not hang locally.

I'm going to try profiling the tests to see where time is being spent now.

Allow passing paths to wasmtime repo checkouts as engines

Right now we allow either

  • the path to a libwasmtime_bench_api.{so,dll,dylib} shared library
  • a URL to a hosted git repo that we can git clone

to be passed as engines to sightglass-cli benchmark.

We should also allow paths to wasmtime repo checkouts on the local file system.

The second option above would normally work for this, except that the git clone happens from inside a docker container, so we don't have access to that part of the filesystem. Also, re-cloning locally is kinda ridiculous since it would uselessly copy a bunch of files and also force a clean build when we could otherwise do a nice incremental build.

Prettify JSON output

It might be nice to have an option to pretty-print the JSON output for those who don't have jq installed.

sightglass-next: add statistical significance test

Implement (or re-use) some Rust code that will be able to test whether the performance difference between two runs (i.e., the aggregate runs of two different engines) is statistically significant or not. Probably involves some additions to the sightglass-analysis crate and potentially some refactoring.

sightglass-next: add principal component analysis

To minimize the number of benchmarks needed to get a representative result, we could run PCA on the executed benchmarks. This would likely be implemented in R. There is a lot more discussion needed for this but this issue serves as a placeholder for that.

Allow building engine to ordinary file, and specifying ordinary file as engine for run

This feature request needs some background: I'm currently working on the new regalloc, and am wanting to use Sightglass to measure the impact of each possible feature or heuristic tweak as I make it. This is fairly fine-grained work that ideally benefits from a tight feedback loop; unfortunately, the current best practice for Sightglass has quite a lot more overhead than a "build wasmtime locally and measure bz2 runtime 5 times with time" approximate test I'm using now:

  • Sightglass can build an engine based on a git repo and a branch/commit for wasmtime;
  • wasmtime depends on regalloc.rs, and regalloc.rs provides a shim for the new regalloc and then pulls in regalloc2;
  • The main-branch version of wasmtime refers to deps on crates.io, but for local development, I hack Cargo.toml files so that my wasmtime checkout refers to regalloc.rs, which refers to regalloc2, all in sibling directories;
  • So to get Sightglass to build an engine with the new configuration, I would need to: (i) commit my regalloc2 experiment (possibly just a tweak to a constant in some heuristic) and push to GitHub; (ii) adjust Cargo.toml in regalloc.rs to refer to this new commit hash, commit that, and push that to GitHub; (iii) adjust Cargo.toml in cranelift-codegen to refer to the new regalloc.rs commit, commit that, and push that to GitHub; (iv) use that commit hash to ask Sightglass to build a new engine file.

The above gets even more fun if I try to use Cranelift settings to control experiment knobs: I would need to alter the defaults in cranelift-codegen, and commit that. If I have n regalloc2 variants and m different knob settings, I need to make n*m separate commits and push them all to GitHub.

The config-knob issue is separate (#103) so let's focus on the separate-versions-of-engine-code problem below.

IMHO, there is a possibly much easier way: we could make the tool fit into the same workflow that someone doing local testing of any other sort would do, and (i) build an artifact from the code on disk, and (ii) test that artifact.

Concretely, this means that Sightglass would (in this mode at least) no longer manage its own cache of engine .sos. Instead, it would have some command (for example):

$ # (make some tweaks to wasmtime/regalloc2/regalloc.rs source in local checkouts)
$ sightglass-cli compile-engine wasmtime ../../path/to/wasmtime -o wasmtime-variant1.so

Then just test using that file explicitly:

$ sightglass-cli run -f wasmtime-variant1.so ...

This sidesteps all of the questions of syntax for repository paths, what unique name (the "slug" as the source calls it) to use for the engine, etc., and fits much more nicely into the usual Unix tool ecosystem, IMHO; there is no hidden state to worry about.

Then, the infrastructure to build a canonical version of an engine from a clean git checkout with a Dockerized hermetic build could be built on top of this, and used by whatever CI infrastructure we have.

Thoughts?

Support for setting different cranelift flags / wasmtime::Config flags

It would be nice to benchmark these configurations (both one off and continuously over time):

  • builds w/ vs w/out debug info
  • optimized(speed) vs optimized(speed + size) vs not optimized
  • bounds checking strategy (explicit vs virtual memory)
  • with or without fuel

All of these things require dynamically setting flags on cranelift and/or wasmtime.

ci: rebuilding the benchmarks should check for reproducibility

The point of the rebuild CI task is to ensure that we can reliably reproduce the benchmarks included with Sightglass. IIRC, there was a time when that task checked if the benchmark output had changed, e.g., with git diff --exit-code. That check is no longer in place but I believe it should be re-added so we can more confidently claim that we have reproducible benchmark builds.

Latest docker breaks sightglass

It appears that docker build with Docker version 20.10.5, build 55c4c88 (macOS) outputs this as the last line:

Use 'docker scan' to run Snyk tests against images to find vulnerabilities and learn how to fix them

This breaks sightglass: it parses the last word of that line ("them") as the image id for what was just built, causing this error message:

Unable to find image 'them:latest' locally
Error response from daemon: pull access denied for them, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
 ERROR sightglass_artifact::docker > Failed: Child { stdin: None, stdout: None, stderr: None }
Error: failed to execute docker command:

Provide three-state output: "changed", "not changed", "unsure"

Right now, Sightglass uses a single threshold based on a confidence interval computed by Behrens-Fisher to determine whether a sampled statistic shifted between configurations.

The result of this is that we get either "changed" (e.g., the benchmark got 5% faster) or "not changed". However, the latter answer can also appear if we simply don't have enough data points to prove statistical significance, or if the system is too noisy.

This "false negative" is somewhat dangerous: we could make a change, see that it is performance-neutral according to Sightglass, and accept it, but actually we just didn't turn the knobs up high enough.

Ideally, Sightglass should provide a third output of "unsure" if the measurements aren't precise enough to prove either "changed" or "not changed" to the desired confidence.
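As a sketch, the comparison result could become a three-variant enum instead of a boolean (names here are illustrative, not a settled design):

enum Comparison {
    /// The confidence interval excludes 1.0: a statistically significant change.
    Changed { speedup_low: f64, speedup_high: f64 },
    /// The interval is tight around 1.0: confidently performance-neutral.
    NotChanged,
    /// The interval is too wide to conclude either way; gather more samples.
    Unsure,
}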

sightglass-next: implement sightglass-server

In order to report performance results based on PRs, we talked about implementing an HTTP server (e.g. in crates/server) that would:

  • listen for incoming POST requests that contain JSON with the PR URL, commit SHA, etc. necessary for doing a "master vs PR" comparison
  • to avoid DoS, verify that the request is an authorized one (not exactly sure how to do this but the GitHub action will need some form of token)
  • kick off some benchmark running, like sightglass-cli benchmark ... but we could call the same APIs from inside the web service
  • upon success, push a Markdown table of the significant performance differences as a comment to the GitHub PR (implies that sightglass-server has a GitHub token, like one from bytecodealliance-highfive)
  • upon failure, push the error message as a comment to the GitHub PR (implies that we maintain error logs somewhere)

There is a lot more to be done here but that should be a workable start.
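For concreteness, a sketch of the JSON body the first bullet describes (field names are assumptions, not a settled schema):

#[derive(serde::Deserialize)]
struct BenchmarkRequest {
    /// URL of the pull request to benchmark.
    pr_url: String,
    /// Commit SHA of the PR head to compare against the default branch.
    commit_sha: String,
    /// Token used to verify that the request is authorized.
    auth_token: String,
}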
