ziglang / gotta-go-fast
Performance Tracking for Zig
License: MIT License
Wall clock time can have a lot of noise; these other metrics are quite useful.
Justification: test performance relies heavily on thread spawn (for tests without memory checking) or process spawn (for common tests) when not run in parallel (for diagnosis after a problem exists).
Also, I have a PR for extra pipes I would like to bench to figure out if the approach is feasible.
The `maxrss` results of the `random-find` benchmark have some anomalous spikes for unknown reasons:
From IRC:
<squeek502> andrewrk, is it expected that the deflate improvements would lead to a speedup in a lot of the benchmarks (zig-fmt, ast-check-os, etc)?
<andrewrk> mmm I don't think so. there are no dependencies on std lib deflate in the compiler
<squeek502> strange, there's a big spike down at exactly that PR being merged for a lot of benchmarks
<andrewrk> is it around jan 25? that also happens to be the end of the backfill script, so perhaps there was some kind of externality accidentally leaking into the measurements?
<squeek502> ah yeah it is, that makes more sense
<andrewrk> it's most noticeable on Max RSS of random-find
<andrewrk> I have no idea what is happening in this graph
<andrewrk> but the big dip is the end of the backfill script
<andrewrk> ummmm maybe it's the hash table of the records.csv file in the backfill script itself, and we're unintentionally including the parent process memory in the measurement
<andrewrk> so as it goes backwards over time, more measurements collected, more hash table memory
<squeek502> ah that would be interesting, wouldn't explain the CPU instruction count/wall clock/etc dropping though
<andrewrk> agreed, but at least it shows you where the script stopped running
<squeek502> zig-fmt benchmark is the most noticeable for the instruction count/wall clock
<andrewrk> hmm maybe are we accidentally using the (more recent) zig exe that is running the backfill script, instead of using the freshly built zig exe from the relevant commit
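One way to sanity-check the "parent process memory leaking into the measurement" theory is to compare `getrusage()` for `RUSAGE_SELF` versus `RUSAGE_CHILDREN`; a minimal Python sketch (the ballast size and the use of Python's `resource` module are illustrative assumptions, not how gotta-go-fast actually measures):

```python
import resource
import subprocess
import sys

# Allocate ~50 MiB in the parent to inflate its own peak RSS.
ballast = bytearray(50 * 1024 * 1024)

# Run a trivial child process; its own peak memory stays small.
subprocess.run([sys.executable, "-c", "pass"], check=True)

self_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
child_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss

# On Linux ru_maxrss is reported in KiB. If a harness accidentally read
# RUSAGE_SELF, the parent's ballast would show up in the measurement;
# RUSAGE_CHILDREN only reflects what the child processes themselves used.
print(self_rss, child_rss)
```

If the parent's accumulated state (such as a growing hash table in the backfill script) were being counted, `self_rss` would grow with it while `child_rss` would not.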
I've tried testing the theory that the memory usage of the backfill script itself is being included and I ultimately don't think that's what's happening.
To test this, I added:

```zig
// allocate 100MiB to attempt to skew the results to check if the backfill process is included
const dummy = try arena.alloc(u8, 100 * 1024 * 1024);
_ = dummy;
```

to `backfill.zig` and then ran the benchmarks once with `collect-measurements.zig` and then once with `backfill.zig` for the same commit (e498fb155051f548071da1a13098b8793f527275). Here's the `maxrss` of `random-find` for each:
collect-measurements | backfill |
---|---|
2408 | 2404 |
Interestingly, values around ~2400 were only ever produced by `gotta-go-fast` benchmarking during the original runs, and somehow all results since then (most of them produced by the backfill script, AFAIK) seem to be getting larger and larger, for no discernible reason, as neither the benchmark nor the hash map implementation has changed much in that time. Also, if you look at the graph, there are actually quite a few of these mini-spikes, which are also presumably in line with where the backfill script was run previously.
So, still no clue what's going on here, but hopefully I've narrowed it down slightly. Will probably try running the backfill script locally with just `random-find` next, and see if I can reproduce the increasing `maxrss` results.
Compiler version: 0.8.0-dev.2133+ad33e3483
OS: Linux DESKTOP-8AGJOHC 4.19.128-microsoft-standard #1 SMP Tue Jun 23 12:58:10 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
(running in WSL 2)
Consider the following Zig program:
```zig
const std = @import("std");

fn init_array(points: []f32) void {
    var i: u64 = 0;
    while (i < points.len) : (i += 1) {
        points[i] = 5.3;
    }
}

fn sum_points(array: []f32) f64 {
    var sum: f64 = 0;
    for (array) |point| {
        sum = sum + point;
    }
    return sum;
}

pub fn main() anyerror!void {
    var array = try std.heap.page_allocator.alloc(f32, 1000 * 1000 * 30);
    init_array(array);
    var start = std.time.milliTimestamp();
    var sum = sum_points(array);
    var elapsed = std.time.milliTimestamp() - start;
    const stdout = std.io.getStdOut().writer();
    try stdout.print("Elapsed: {any} ms\n", .{elapsed});
    try stdout.print("Sum: {any}\n", .{sum});
}
```
Running the program with `zig build run -Drelease-fast=true` reports that executing the `sum_points` function takes 80ms. Running with `zig build run -Drelease-safe=true` or `-Drelease-small=true` reports that execution takes 23ms. The expected behaviour is that execution with the `-Drelease-fast=true` flag takes 23ms too.
I compared the LLVM IR generated for `sum_points` in the different release modes, but the IR was identical between them. This suggests to me that Zig must be invoking LLVM with different parameters when running with the `-Drelease-fast=true` flag. I suspect that either this mode of invocation, or LLVM's execution of it, now has a performance regression. Version 0.7.1 does not seem to be affected; execution in any of the release modes takes about 23ms.
A question came up in the Reddit thread about whether these kinds of regressions can be tested against.
I started doing this myself https://github.com/LindonKelley/gotta-go-fast, and am just going to use this issue to verify if that's okay and if I'm doing it right so far.
Committing with andrewrk as the author makes it show up in the contribution graph. If you commit with a different author, e.g. `--author="Gotta Go Fast <[email protected]>"`, it won't do that.
The following diff:

```diff
diff --git a/bench.zig b/bench.zig
index 7958d03..e79b4fe 100644
--- a/bench.zig
+++ b/bench.zig
@@ -146,6 +146,20 @@ pub fn bench(options: Options, comptime func: anytype, args: anytype) Results {
 _ = std.os.linux.ioctl(perf_fds[0], PERF.EVENT_IOC.DISABLE, PERF.IOC_FLAG_GROUP);
 }
 _ = std.os.linux.ioctl(perf_fds[0], PERF.EVENT_IOC.RESET, PERF.IOC_FLAG_GROUP);
+const sample_init = Sample{
+    .wall_time = 0,
+    .utime = 0,
+    .stime = 0,
+    .cpu_cycles = readPerfFd(perf_fds[0]),
+    .instructions = readPerfFd(perf_fds[1]),
+    .cache_references = readPerfFd(perf_fds[2]),
+    .cache_misses = readPerfFd(perf_fds[3]),
+    .branch_misses = readPerfFd(perf_fds[4]),
+};
+std.debug.print("start sample: {}\n", .{sample_init});
+if (sample_init.cpu_cycles != 0 or sample_init.instructions != 0 or sample_init.cache_references != 0 or sample_init.cache_misses != 0 or sample_init.branch_misses != 0) {
+    @panic("non-zero starting perf stats");
+}
 const start_rusage = std.os.getrusage(rusage_who);
 const start = timer.read();
 if (!options.use_child_process) {
@@ -168,6 +182,7 @@ pub fn bench(options: Options, comptime func: anytype, args: anytype) Results {
 .cache_misses = readPerfFd(perf_fds[3]),
 .branch_misses = readPerfFd(perf_fds[4]),
 };
+std.debug.print("sample: {}\n", .{samples_buf[sample_index]});
 sample_index += 1;
 }
 const all_samples = samples_buf[0..sample_index];
```
results in:
```
Running 'zig-fmt' for a55dc4a3bcecbd5adc7ee1724c487786f74cbd8e...
start sample: Sample{ .wall_time = 0, .utime = 0, .stime = 0, .cpu_cycles = 0, .instructions = 0, .cache_references = 0, .cache_misses = 0, .branch_misses = 0 }
sample: Sample{ .wall_time = 436870163, .utime = 408370000, .stime = 28025000, .cpu_cycles = 1698681991, .instructions = 4858398474, .cache_references = 16275089, .cache_misses = 1143828, .branch_misses = 5888024 }
start sample: Sample{ .wall_time = 0, .utime = 0, .stime = 0, .cpu_cycles = 1698681991, .instructions = 4858398474, .cache_references = 16275089, .cache_misses = 1143828, .branch_misses = 5888024 }
thread 759674 panic: non-zero starting perf stats
```
(and this panic would also trip for all benchmarks that use child processes)
This means that the results for these benchmarks are effectively cumulative for all samples instead of per-sample.
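Since the counters are never reset between child-process samples, each reading is a running total; per-sample values could in principle be recovered by differencing consecutive readings. A small illustrative Python sketch (the numbers are made up, not taken from the logs in this issue):

```python
def per_sample(cumulative_readings):
    """Convert cumulative perf counter readings into per-sample deltas."""
    deltas = []
    prev = 0
    for reading in cumulative_readings:
        deltas.append(reading - prev)
        prev = reading
    return deltas

# Three samples whose counter was never reset between runs:
cumulative = [1000, 2100, 3300]
print(per_sample(cumulative))  # [1000, 1100, 1200]
```

The real fix, of course, is to make the reset actually take effect for child processes rather than to post-process the totals.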
Note: for the non-child-process benchmarks the panic is not tripped:
```
Running 'self-hosted-parser' for a55dc4a3bcecbd5adc7ee1724c487786f74cbd8e...
start sample: Sample{ .wall_time = 0, .utime = 0, .stime = 0, .cpu_cycles = 0, .instructions = 0, .cache_references = 0, .cache_misses = 0, .branch_misses = 0 }
sample: Sample{ .wall_time = 154560032, .utime = 94207000, .stime = 60073000, .cpu_cycles = 447942230, .instructions = 1030970542, .cache_references = 6631980, .cache_misses = 375555, .branch_misses = 2269160 }
start sample: Sample{ .wall_time = 0, .utime = 0, .stime = 0, .cpu_cycles = 0, .instructions = 0, .cache_references = 0, .cache_misses = 0, .branch_misses = 0 }
sample: Sample{ .wall_time = 151310745, .utime = 118835000, .stime = 31993000, .cpu_cycles = 449910626, .instructions = 1030971846, .cache_references = 6751459, .cache_misses = 371683, .branch_misses = 2269930 }
...etc
```
In theory, the least significant bits of a pointer address should already be close to an ideal hash, so we should be able to use them directly as the hash rather than hashing the pointer address.
Another possible improvement: if we know the alignment of the element type, that guarantees the least significant bits are always zero, so we could shift those zero bits out.
It would look like this:
```zig
pub fn getAutoHashFn(comptime K: type) (fn (K) u32) {
    return struct {
        fn hash(key: K) u32 {
            switch (@typeInfo(K)) {
                .Pointer => |info| if (info.size != .Slice) {
                    // No need to pipe this through Wyhash. The least significant bits of
                    // a pointer address are already nearly perfect for use as a hash.
                    const bits_to_shift = comptime math.log2(info.alignment);
                    return @truncate(u32, @ptrToInt(key) >> bits_to_shift);
                },
                else => {},
            }
            var hasher = Wyhash.init(0);
            autoHash(&hasher, key);
            return @truncate(u32, hasher.final());
        }
    }.hash;
}
```
As usual it would be good to test this before blindly implementing it.
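The intuition behind shifting out the alignment bits can be demonstrated outside Zig; a quick Python sketch (the addresses, alignment, and bucket count are made up for illustration):

```python
import math

alignment = 8  # e.g. pointers to 8-byte-aligned allocations
shift = int(math.log2(alignment))

# Simulated pointer addresses: consecutive 8-byte-aligned slots.
addrs = [0x1000 + i * alignment for i in range(16)]

# Using the raw low bits to pick one of 16 buckets: the low
# log2(alignment) bits are always zero, so only 2 buckets ever get used.
raw_buckets = {a & 0xF for a in addrs}

# Shifting the guaranteed-zero bits out first spreads the keys evenly.
shifted_buckets = {(a >> shift) & 0xF for a in addrs}

print(len(raw_buckets), len(shifted_buckets))  # 2 16
```

Without the shift, aligned pointers collide into a small fraction of the buckets; with it, consecutive allocations map to consecutive buckets.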
Related: #2
I've been trying to make `gotta-go-fast` more consistent, and have been testing by running the benchmarks for a single commit multiple times. I noticed that on the first run, if `bench.zig` needs to be compiled, `maxrss` will be much larger than it will be on subsequent runs:
benchmark | maxrss run 1 | maxrss run 2 | maxrss run 3 |
---|---|---|---|
self-hosted-parser | 534872 | 95520 | 95284 |
zig-fmt | 406304 | 97664 | 97744 |
translate-c-windows-h | 408728 | 69344 | 69692 |
arena-allocator | 461844 | 95512 | 94976 |
std-rand | 461384 | 97816 | 97820 |
std-hash-map | 460512 | 96332 | 95488 |
insert-10M-int | 452404 | 221248 | 221248 |
random-distinct | 453284 | 110736 | 110728 |
random-find | 460872 | 99680 | 95232 |
ast-check-os | 404396 | 69016 | 68724 |
ast-check-AstGen | 404984 | 68940 | 69092 |
ast-check-Sema | 405624 | 69432 | 69164 |
One extremely simple but quite wasteful fix would be to add

```zig
{
    const warmup_run = try execCapture(gpa, main_argv.items, .{ .cwd = bench_cwd });
    defer gpa.free(warmup_run);
}
```

before the `const main_stdout = try execCapture(...)` call.
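The general pattern here, run once and discard, then measure, can be sketched language-agnostically; a hypothetical Python version (the `bench` helper and its parameters are illustrative, not the gotta-go-fast API):

```python
import time

def bench(func, runs=5, warmup=1):
    """Time `func` `runs` times, discarding the first `warmup` calls
    so one-time costs (compilation, cold caches) don't skew results."""
    samples = []
    for i in range(warmup + runs):
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start
        if i >= warmup:  # discard warmup iterations
            samples.append(elapsed)
    return samples

samples = bench(lambda: sum(range(10_000)))
print(len(samples))  # 5
```

The waste the issue mentions is real: the warmup run pays the full cost of the benchmark once per invocation just to stabilize the later measurements.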
The latest run of the backfill script seems to be reporting much lower wall clock/CPU instruction count/etc. results for all the commits included in its run (the oldest commit included in the backfill was 0c1df96b17fcafdb0aa973dfcb1819929719d650, which is exactly where this performance difference is reported).
This is not a real performance difference, though, as the commit reported as the cause is unrelated to any of the current benchmarks. It also only affects certain benchmarks; some don't have any performance difference at this commit.
Really unsure what is causing this, but have noticed one thing: every benchmark that shows the erroneous performance difference (`zig-fmt`, `ast-check-os`, `comptime-guid-parse`, etc.) uses a child process for its benchmark (that is, it calls `useChildProcess()` in its setup fn). Any benchmark that doesn't use a child process does not have the same erroneous performance difference.