ziglang / gotta-go-fast
Performance Tracking for Zig
License: MIT License
Wall clock time can have a lot of noise; these other metrics are quite useful.
Justification: test performance relies heavily on thread spawn (for tests without memory checking) or process spawn (for common tests) when not run in parallel (for diagnosis after a problem exists).
Also, I have a PR for extra pipes I would like to bench to figure out if the approach is feasible.
The `maxrss` results of the `random-find` benchmark have some anomalous spikes for unknown reasons:
From IRC:
<squeek502> andrewrk, is it expected that the deflate improvements would lead to a speedup in a lot of the benchmarks (zig-fmt, ast-check-os, etc)?
<andrewrk> mmm I don't think so. there are no dependencies on std lib deflate in the compiler
<squeek502> strange, there's a big spike down at exactly that PR being merged for a lot of benchmarks
<andrewrk> is it around jan 25? that also happens to be the end of the backfill script, so perhaps there was some kind of externality accidentally leaking into the measurements?
<squeek502> ah yeah it is, that makes more sense
<andrewrk> it's most noticeable on Max RSS of random-find
<andrewrk> I have no idea what is happening in this graph
<andrewrk> but the big dip is the end of the backfill script
<andrewrk> ummmm maybe it's the hash table of the records.csv file in the backfill script itself, and we're unintentionally including the parent process memory in the measurement
<andrewrk> so as it goes backwards over time, more measurements collected, more hash table memory
<squeek502> ah that would be interesting, wouldn't explain the CPU instruction count/wall clock/etc dropping though
<andrewrk> agreed, but at least it shows you where the script stopped running
<squeek502> zig-fmt benchmark is the most noticeable for the instruction count/wall clock
<andrewrk> hmm maybe are we accidentally using the (more recent) zig exe that is running the backfill script, instead of using the freshly built zig exe from the relevant commit
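One way to sanity-check the "parent process memory leaking into the measurement" theory is to compare `getrusage()` for `RUSAGE_SELF` versus `RUSAGE_CHILDREN`; a minimal Python sketch (the ballast size and the use of Python's `resource` module are illustrative assumptions, not how gotta-go-fast actually measures):

```python
import resource
import subprocess
import sys

# Allocate ~50 MiB in the parent to inflate its own peak RSS.
ballast = bytearray(50 * 1024 * 1024)

# Run a trivial child process; its own peak memory stays small.
subprocess.run([sys.executable, "-c", "pass"], check=True)

self_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
child_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss

# On Linux ru_maxrss is reported in KiB. If a harness accidentally read
# RUSAGE_SELF, the parent's ballast would show up in the measurement;
# RUSAGE_CHILDREN only reflects what the child processes themselves used.
print(self_rss, child_rss)
```

If the parent's accumulated state (such as a growing hash table in the backfill script) were being counted, `self_rss` would grow with it while `child_rss` would not.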
I've tried testing the theory that the memory usage of the backfill script itself is being included and I ultimately don't think that's what's happening.
To test this, I added:

```zig
// allocate 100MiB to attempt to skew the results to check if the backfill process is included
const dummy = try arena.alloc(u8, 100 * 1024 * 1024);
_ = dummy;
```

to `backfill.zig` and then ran the benchmarks once with `collect-measurements.zig` and then once with `backfill.zig` for the same commit (e498fb155051f548071da1a13098b8793f527275). Here's the `maxrss` of `random-find` for each:
collect-measurements | backfill |
---|---|
2408 | 2404 |
Interestingly, values around ~2400 were only ever produced by `gotta-go-fast` benchmarking during the original runs, and somehow all results since then (most of them produced by the backfill script, AFAIK) seem to be getting larger and larger, for no discernible reason, as neither the benchmark nor the hash map implementation has changed much in that time. Also, if you look at the graph, there are actually quite a few of these mini-spikes, which are also presumably in line with where the backfill script was run previously.
So, still no clue what's going on here, but hopefully I've narrowed it down slightly. Will probably try running the backfill script locally with just `random-find` next, and see if I can reproduce the increasing `maxrss` results.
Compiler version: 0.8.0-dev.2133+ad33e3483
OS: Linux DESKTOP-8AGJOHC 4.19.128-microsoft-standard #1 SMP Tue Jun 23 12:58:10 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
(running in WSL 2)
Consider the following Zig program:
```zig
const std = @import("std");

fn init_array(points: []f32) void {
    var i: u64 = 0;
    while (i < points.len) : (i += 1) {
        points[i] = 5.3;
    }
}

fn sum_points(array: []f32) f64 {
    var sum: f64 = 0;
    for (array) |point| {
        sum = sum + point;
    }
    return sum;
}

pub fn main() anyerror!void {
    var array = try std.heap.page_allocator.alloc(f32, 1000 * 1000 * 30);
    init_array(array);
    var start = std.time.milliTimestamp();
    var sum = sum_points(array);
    var elapsed = std.time.milliTimestamp() - start;
    const stdout = std.io.getStdOut().writer();
    try stdout.print("Elapsed: {any} ms\n", .{elapsed});
    try stdout.print("Sum: {any}\n", .{sum});
}
```
Running the program with `zig build run -Drelease-fast=true` reports that executing the `sum_points` function takes 80ms. Running with `zig build run -Drelease-safe=true` or `-Drelease-small=true` reports that execution takes 23ms. The expected behaviour is that execution with the `-Drelease-fast=true` flag takes 23ms too.
I compared the LLVM IR generated for `sum_points` in the different release modes, but the IR was identical between them. This suggests to me that Zig must be invoking LLVM with different parameters when running with the `-Drelease-fast=true` flag. I suspect that either this mode of invocation, or LLVM's execution of it, now has a performance regression. Version 0.7.1 does not seem to be affected; execution in any of the release modes takes about 23ms.
A question came up in the Reddit thread about whether these kinds of regressions can be tested against.
I started doing this myself https://github.com/LindonKelley/gotta-go-fast, and am just going to use this issue to verify if that's okay and if I'm doing it right so far.
Committing with andrewrk as the author makes it show up in the contribution graph. If you commit with a different author, e.g. `--author="Gotta Go Fast <[email protected]>"`, it won't do that.
The following diff:

```diff
diff --git a/bench.zig b/bench.zig
index 7958d03..e79b4fe 100644
--- a/bench.zig
+++ b/bench.zig
@@ -146,6 +146,20 @@ pub fn bench(options: Options, comptime func: anytype, args: anytype) Results {
 _ = std.os.linux.ioctl(perf_fds[0], PERF.EVENT_IOC.DISABLE, PERF.IOC_FLAG_GROUP);
 }
 _ = std.os.linux.ioctl(perf_fds[0], PERF.EVENT_IOC.RESET, PERF.IOC_FLAG_GROUP);
+const sample_init = Sample{
+    .wall_time = 0,
+    .utime = 0,
+    .stime = 0,
+    .cpu_cycles = readPerfFd(perf_fds[0]),
+    .instructions = readPerfFd(perf_fds[1]),
+    .cache_references = readPerfFd(perf_fds[2]),
+    .cache_misses = readPerfFd(perf_fds[3]),
+    .branch_misses = readPerfFd(perf_fds[4]),
+};
+std.debug.print("start sample: {}\n", .{sample_init});
+if (sample_init.cpu_cycles != 0 or sample_init.instructions != 0 or sample_init.cache_references != 0 or sample_init.cache_misses != 0 or sample_init.branch_misses != 0) {
+    @panic("non-zero starting perf stats");
+}
 const start_rusage = std.os.getrusage(rusage_who);
 const start = timer.read();
 if (!options.use_child_process) {
@@ -168,6 +182,7 @@ pub fn bench(options: Options, comptime func: anytype, args: anytype) Results {
 .cache_misses = readPerfFd(perf_fds[3]),
 .branch_misses = readPerfFd(perf_fds[4]),
 };
+std.debug.print("sample: {}\n", .{samples_buf[sample_index]});
 sample_index += 1;
 }
 const all_samples = samples_buf[0..sample_index];
```
results in:
```
Running 'zig-fmt' for a55dc4a3bcecbd5adc7ee1724c487786f74cbd8e...
start sample: Sample{ .wall_time = 0, .utime = 0, .stime = 0, .cpu_cycles = 0, .instructions = 0, .cache_references = 0, .cache_misses = 0, .branch_misses = 0 }
sample: Sample{ .wall_time = 436870163, .utime = 408370000, .stime = 28025000, .cpu_cycles = 1698681991, .instructions = 4858398474, .cache_references = 16275089, .cache_misses = 1143828, .branch_misses = 5888024 }
start sample: Sample{ .wall_time = 0, .utime = 0, .stime = 0, .cpu_cycles = 1698681991, .instructions = 4858398474, .cache_references = 16275089, .cache_misses = 1143828, .branch_misses = 5888024 }
thread 759674 panic: non-zero starting perf stats
```
(and this panic would also trip for all benchmarks that use child processes)
This means that the results for these benchmarks are effectively cumulative for all samples instead of per-sample.
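Since the counters are never reset between child-process samples, each reading is a running total; per-sample values could in principle be recovered by differencing consecutive readings. A small illustrative Python sketch (the numbers are made up, not taken from the logs in this issue):

```python
def per_sample(cumulative_readings):
    """Convert cumulative perf counter readings into per-sample deltas."""
    deltas = []
    prev = 0
    for reading in cumulative_readings:
        deltas.append(reading - prev)
        prev = reading
    return deltas

# Three samples whose counter was never reset between runs:
cumulative = [1000, 2100, 3300]
print(per_sample(cumulative))  # [1000, 1100, 1200]
```

The real fix, of course, is to make the reset actually take effect for child processes rather than to post-process the totals.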
Note: for the non-child-process benchmarks the panic is not tripped:
```
Running 'self-hosted-parser' for a55dc4a3bcecbd5adc7ee1724c487786f74cbd8e...
start sample: Sample{ .wall_time = 0, .utime = 0, .stime = 0, .cpu_cycles = 0, .instructions = 0, .cache_references = 0, .cache_misses = 0, .branch_misses = 0 }
sample: Sample{ .wall_time = 154560032, .utime = 94207000, .stime = 60073000, .cpu_cycles = 447942230, .instructions = 1030970542, .cache_references = 6631980, .cache_misses = 375555, .branch_misses = 2269160 }
start sample: Sample{ .wall_time = 0, .utime = 0, .stime = 0, .cpu_cycles = 0, .instructions = 0, .cache_references = 0, .cache_misses = 0, .branch_misses = 0 }
sample: Sample{ .wall_time = 151310745, .utime = 118835000, .stime = 31993000, .cpu_cycles = 449910626, .instructions = 1030971846, .cache_references = 6751459, .cache_misses = 371683, .branch_misses = 2269930 }
...etc
```
In theory, the least significant bits of a pointer address should already be close to an ideal hash, so we should be able to use them directly as the hash rather than hashing the pointer address.
Another possible improvement: if we know the alignment of the element type, that guarantees the least significant bits are always zero, so we could shift those zero bits out.
It would look like this:
```zig
pub fn getAutoHashFn(comptime K: type) (fn (K) u32) {
    return struct {
        fn hash(key: K) u32 {
            switch (@typeInfo(K)) {
                .Pointer => |info| if (info.size != .Slice) {
                    // No need to pipe this through Wyhash. The least significant bits of
                    // a pointer address are already nearly perfect for use as a hash.
                    const bits_to_shift = comptime math.log2(info.alignment);
                    return @truncate(u32, @ptrToInt(key) >> bits_to_shift);
                },
                else => {},
            }
            var hasher = Wyhash.init(0);
            autoHash(&hasher, key);
            return @truncate(u32, hasher.final());
        }
    }.hash;
}
```
As usual it would be good to test this before blindly implementing it.
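The intuition behind shifting out the alignment bits can be demonstrated outside Zig; a quick Python sketch (the addresses, alignment, and bucket count are made up for illustration):

```python
import math

alignment = 8  # e.g. pointers to 8-byte-aligned allocations
shift = int(math.log2(alignment))

# Simulated pointer addresses: consecutive 8-byte-aligned slots.
addrs = [0x1000 + i * alignment for i in range(16)]

# Using the raw low bits to pick one of 16 buckets: the low
# log2(alignment) bits are always zero, so only 2 buckets ever get used.
raw_buckets = {a & 0xF for a in addrs}

# Shifting the guaranteed-zero bits out first spreads the keys evenly.
shifted_buckets = {(a >> shift) & 0xF for a in addrs}

print(len(raw_buckets), len(shifted_buckets))  # 2 16
```

Without the shift, aligned pointers collide into a small fraction of the buckets; with it, consecutive allocations map to consecutive buckets.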
Related: #2
I've been trying to make `gotta-go-fast` more consistent, and have been testing by running the benchmarks for a single commit multiple times. I noticed that on the first run, if `bench.zig` needs to be compiled, `maxrss` will be much larger than it will be on subsequent runs:
benchmark | maxrss run 1 | maxrss run 2 | maxrss run 3 |
---|---|---|---|
self-hosted-parser | 534872 | 95520 | 95284 |
zig-fmt | 406304 | 97664 | 97744 |
translate-c-windows-h | 408728 | 69344 | 69692 |
arena-allocator | 461844 | 95512 | 94976 |
std-rand | 461384 | 97816 | 97820 |
std-hash-map | 460512 | 96332 | 95488 |
insert-10M-int | 452404 | 221248 | 221248 |
random-distinct | 453284 | 110736 | 110728 |
random-find | 460872 | 99680 | 95232 |
ast-check-os | 404396 | 69016 | 68724 |
ast-check-AstGen | 404984 | 68940 | 69092 |
ast-check-Sema | 405624 | 69432 | 69164 |
One extremely simple but quite wasteful fix would be to add

```zig
{
    const warmup_run = try execCapture(gpa, main_argv.items, .{ .cwd = bench_cwd });
    defer gpa.free(warmup_run);
}
```

before the `const main_stdout = try execCapture(...)` call.
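The general pattern here, run once and discard, then measure, can be sketched language-agnostically; a hypothetical Python version (the `bench` helper and its parameters are illustrative, not the gotta-go-fast API):

```python
import time

def bench(func, runs=5, warmup=1):
    """Time `func` `runs` times, discarding the first `warmup` calls
    so one-time costs (compilation, cold caches) don't skew results."""
    samples = []
    for i in range(warmup + runs):
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start
        if i >= warmup:  # discard warmup iterations
            samples.append(elapsed)
    return samples

samples = bench(lambda: sum(range(10_000)))
print(len(samples))  # 5
```

The waste the issue mentions is real: the warmup run pays the full cost of the benchmark once per invocation just to stabilize the later measurements.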
The latest run of the backfill script seems to be reporting much lower wall clock/CPU instruction count/etc. results for all the commits included in its run (the oldest commit included in the backfill was 0c1df96b17fcafdb0aa973dfcb1819929719d650, which is exactly where this performance difference is reported).
This is not a real performance difference, though, as the commit reported as the cause is unrelated to any of the current benchmarks. It also only affects certain benchmarks; some don't have any performance difference at this commit.
Really unsure what is causing this, but have noticed one thing: every benchmark that shows the erroneous performance difference (`zig-fmt`, `ast-check-os`, `comptime-guid-parse`, etc.) uses a child process for its benchmark (that is, it calls `useChildProcess()` in its setup fn). Any benchmark that doesn't use a child process does not have the same erroneous performance difference.