
gotta-go-fast's People

Contributors

andrewrk, marler8997, sahnvour, squeek502, tensorush


gotta-go-fast's Issues

add benchmark for thread and process spawn

Justification: test performance relies heavily on thread spawn (for tests that don't check for memory problems) or process spawn (the common case), at least when tests are not run in parallel (for example, when diagnosing a problem after one is known to exist).

Also, I have a PR for extra pipes I would like to bench to figure out if the approach is feasible.
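As a starting point, here is a minimal sketch of what the thread-spawn half could look like (a sketch only: it assumes a std where std.Thread.spawn takes a config struct, a function, and an argument tuple, and the iteration count and no-op body are placeholders; the process-spawn half would do the same around std.ChildProcess):

const std = @import("std");

fn noop() void {}

// Sketch: measure the average cost of spawning and joining a thread.
pub fn main() !void {
    const iterations: usize = 1000;
    var timer = try std.time.Timer.start();
    var i: usize = 0;
    while (i < iterations) : (i += 1) {
        const thread = try std.Thread.spawn(.{}, noop, .{});
        thread.join();
    }
    const elapsed_ns = timer.read();
    std.debug.print("avg thread spawn+join: {d} ns\n", .{elapsed_ns / iterations});
}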

Weird memory usage of random-find benchmark

The maxrss results of the random-find benchmark have some anomalous spikes for unknown reasons:

[graph: maxrss of random-find over time]

From IRC:

<squeek502> andrewrk, is it expected that the deflate improvements would lead to a speedup in a lot of the benchmarks (zig-fmt, ast-check-os, etc)?
<andrewrk> mmm I don't think so. there are no dependencies on std lib deflate in the compiler
<squeek502> strange, there's a big spike down at exactly that PR being merged for a lot of benchmarks
<andrewrk> is it around jan 25? that also happens to be the end of the backfill script, so perhaps there was some kind of externality accidentally leaking into the measurements?
<squeek502> ah yeah it is, that makes more sense
<andrewrk> it's most noticeable on Max RSS of random-find
<andrewrk> I have no idea what is happening in this graph
<andrewrk> but the big dip is the end of the backfill script
<andrewrk> ummmm maybe it's the hash table of the records.csv file in the backfill script itself, and we're unintentionally including the parent process memory in the measurement
<andrewrk> so as it goes backwards over time, more measurements collected, more hash table memory
<squeek502> ah that would be interesting, wouldn't explain the CPU instruction count/wall clock/etc dropping though
<andrewrk> agreed, but at least it shows you where the script stopped running
<squeek502> zig-fmt benchmark is the most noticeable for the instruction count/wall clock
<andrewrk> hmm maybe are we accidentally using the (more recent) zig exe that is running the backfill script, instead of using the freshly built zig exe from the relevant commit

I've tried testing the theory that the memory usage of the backfill script itself is being included and I ultimately don't think that's what's happening.

To test this, I added:

    // allocate 100MiB to attempt to skew the results to check if the backfill process is included
    const dummy = try arena.alloc(u8, 100 * 1024 * 1024);
    _ = dummy;

to backfill.zig and then ran the benchmarks once with collect-measurements.zig and then once with backfill.zig for the same commit (e498fb155051f548071da1a13098b8793f527275). Here's the maxrss of random-find for each:

collect-measurements  backfill
2408                  2404

Interestingly, values around ~2400 were only ever produced during the original benchmark runs; all results since then (most of them produced by the backfill script, AFAIK) seem to be getting larger and larger, for no discernible reason, as neither the benchmark nor the hash map implementation has changed much in that time. Also, if you look at the graph, there are actually quite a few of these mini-spikes, which presumably also line up with where the backfill script was run previously.

So, still no clue what's going on here, but hopefully I've narrowed it down slightly. Next I'll probably try running the backfill script locally with just random-find, to see if I can reproduce the increasing maxrss results.
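For reference, the maxrss measurement comes from getrusage, and what it covers depends on the `who` argument; bench.zig selects between self and children depending on whether the benchmark runs in a child process. A minimal sketch of the two readings (constant and field names assumed from the std of that era):

const std = @import("std");

// Sketch: the two getrusage readings bench.zig chooses between.
// Under RUSAGE_CHILDREN, maxrss is taken over waited-for child
// processes, so the parent's own memory should not be included.
pub fn main() void {
    const self = std.os.getrusage(std.os.rusage.SELF);
    const children = std.os.getrusage(std.os.rusage.CHILDREN);
    std.debug.print("self maxrss: {d} KB\n", .{self.maxrss});
    std.debug.print("children maxrss: {d} KB\n", .{children.maxrss});
}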

Performance regression in 0.8.0

Compiler version: 0.8.0-dev.2133+ad33e3483
OS: Linux DESKTOP-8AGJOHC 4.19.128-microsoft-standard #1 SMP Tue Jun 23 12:58:10 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux (running in WSL 2)

Consider the following Zig program:

const std = @import("std");

fn init_array(points: []f32) void {
    var i: u64 = 0;
    while(i < points.len) : (i += 1) {
        points[i] = 5.3;
    }
}

fn sum_points(array: []f32) f64 {
    var sum: f64 = 0;
    for (array) |point| {
        sum = sum + point;
    }
    return sum;
}

pub fn main() anyerror!void {
    var array = try std.heap.page_allocator.alloc(f32, 1000 * 1000 * 30);
    init_array(array);

    var start = std.time.milliTimestamp();
    var sum = sum_points(array);
    var elapsed = std.time.milliTimestamp() - start;

    const stdout = std.io.getStdOut().writer();
    try stdout.print("Elapsed: {any} ms\n", .{elapsed});
    try stdout.print("Sum: {any}\n", .{sum});
}

Running the program with zig build run -Drelease-fast=true reports that executing the sum_points function takes 80ms.
Running with zig build run -Drelease-safe=true or -Drelease-small=true reports that execution takes 23ms. The expected behaviour is that execution with the -Drelease-fast=true flag takes 23ms too.
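As a side note on the measurement itself, std.time.Timer provides a monotonic, nanosecond-resolution clock and is a bit more robust than differencing two milliTimestamp() calls; a minimal sketch, separate from the original report:

const std = @import("std");

// Sketch: monotonic timing with std.time.Timer instead of comparing
// two milliTimestamp() values.
pub fn main() !void {
    var timer = try std.time.Timer.start();
    // ... code under measurement goes here ...
    const elapsed_ns = timer.read();
    std.debug.print("Elapsed: {d} ns\n", .{elapsed_ns});
}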

I compared the LLVM IR generated for sum_points across the different release modes, but the IR was identical between them. This suggests to me that Zig must be invoking LLVM with different parameters when running with the -Drelease-fast=true flag, and that either this mode of invocation, or LLVM's handling of it, has a performance regression now. Version 0.7.1 does not seem to be affected: execution in any of the release modes takes about 23ms.

A question came up in the Reddit thread about whether this kind of regression can be tested against.

Child process benchmarks don't reset their perf stats between samples

The following diff

diff --git a/bench.zig b/bench.zig
index 7958d03..e79b4fe 100644
--- a/bench.zig
+++ b/bench.zig
@@ -146,6 +146,20 @@ pub fn bench(options: Options, comptime func: anytype, args: anytype) Results {
             _ = std.os.linux.ioctl(perf_fds[0], PERF.EVENT_IOC.DISABLE, PERF.IOC_FLAG_GROUP);
         }
         _ = std.os.linux.ioctl(perf_fds[0], PERF.EVENT_IOC.RESET, PERF.IOC_FLAG_GROUP);
+        const sample_init = Sample{
+            .wall_time = 0,
+            .utime = 0,
+            .stime = 0,
+            .cpu_cycles = readPerfFd(perf_fds[0]),
+            .instructions = readPerfFd(perf_fds[1]),
+            .cache_references = readPerfFd(perf_fds[2]),
+            .cache_misses = readPerfFd(perf_fds[3]),
+            .branch_misses = readPerfFd(perf_fds[4]),
+        };
+        std.debug.print("start sample: {}\n", .{sample_init});
+        if (sample_init.cpu_cycles != 0 or sample_init.instructions != 0 or sample_init.cache_references != 0 or sample_init.cache_misses != 0 or sample_init.branch_misses != 0) {
+            @panic("non-zero starting perf stats");
+        }
         const start_rusage = std.os.getrusage(rusage_who);
         const start = timer.read();
         if (!options.use_child_process) {
@@ -168,6 +182,7 @@ pub fn bench(options: Options, comptime func: anytype, args: anytype) Results {
             .cache_misses = readPerfFd(perf_fds[3]),
             .branch_misses = readPerfFd(perf_fds[4]),
         };
+        std.debug.print("sample: {}\n", .{samples_buf[sample_index]});
         sample_index += 1;
     }
     const all_samples = samples_buf[0..sample_index];

results in:

Running 'zig-fmt' for a55dc4a3bcecbd5adc7ee1724c487786f74cbd8e...
start sample: Sample{ .wall_time = 0, .utime = 0, .stime = 0, .cpu_cycles = 0, .instructions = 0, .cache_references = 0, .cache_misses = 0, .branch_misses = 0 }
sample: Sample{ .wall_time = 436870163, .utime = 408370000, .stime = 28025000, .cpu_cycles = 1698681991, .instructions = 4858398474, .cache_references = 16275089, .cache_misses = 1143828, .branch_misses = 5888024 }
start sample: Sample{ .wall_time = 0, .utime = 0, .stime = 0, .cpu_cycles = 1698681991, .instructions = 4858398474, .cache_references = 16275089, .cache_misses = 1143828, .branch_misses = 5888024 }
thread 759674 panic: non-zero starting perf stats

(and this panic would also trip for all benchmarks that use child processes)

This means that the results for these benchmarks are effectively cumulative for all samples instead of per-sample.


Note: for the non-child-process benchmarks the panic is not tripped:

Running 'self-hosted-parser' for a55dc4a3bcecbd5adc7ee1724c487786f74cbd8e...
start sample: Sample{ .wall_time = 0, .utime = 0, .stime = 0, .cpu_cycles = 0, .instructions = 0, .cache_references = 0, .cache_misses = 0, .branch_misses = 0 }
sample: Sample{ .wall_time = 154560032, .utime = 94207000, .stime = 60073000, .cpu_cycles = 447942230, .instructions = 1030970542, .cache_references = 6631980, .cache_misses = 375555, .branch_misses = 2269160 }
start sample: Sample{ .wall_time = 0, .utime = 0, .stime = 0, .cpu_cycles = 0, .instructions = 0, .cache_references = 0, .cache_misses = 0, .branch_misses = 0 }
sample: Sample{ .wall_time = 151310745, .utime = 118835000, .stime = 31993000, .cpu_cycles = 449910626, .instructions = 1030971846, .cache_references = 6751459, .cache_misses = 371683, .branch_misses = 2269930 }
...etc
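One plausible direction for a fix, assuming PERF_EVENT_IOC_RESET simply isn't clearing the counts accumulated from inherited child events: snapshot each counter at the start of the sample and subtract the baseline at the end, rather than relying on the reset. A sketch against the code in the diff above (the perf_base array is new; utime/stime stand in for the existing rusage-derived values):

// Baseline snapshot at the start of each sample, right after the
// RESET ioctl; for child-process benchmarks these may be non-zero.
var perf_base: [5]u64 = undefined;
for (perf_fds) |fd, i| {
    perf_base[i] = readPerfFd(fd);
}

// ... run the sample (function call or child process) as before ...

samples_buf[sample_index] = Sample{
    .wall_time = timer.read() - start,
    .utime = utime, // rusage-derived, unchanged from the existing code
    .stime = stime, // likewise
    .cpu_cycles = readPerfFd(perf_fds[0]) - perf_base[0],
    .instructions = readPerfFd(perf_fds[1]) - perf_base[1],
    .cache_references = readPerfFd(perf_fds[2]) - perf_base[2],
    .cache_misses = readPerfFd(perf_fds[3]) - perf_base[3],
    .branch_misses = readPerfFd(perf_fds[4]) - perf_base[4],
};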

performance optimization idea: skip Wyhash for auto hashing of pointers and simply truncate their address to u32

In theory, the least significant bits of a pointer address should already be close to an ideal hash. We should be able to use those directly as the hash rather than hashing the pointer address.

Another possible improvement: if we know the alignment of the element type, the least significant bits are guaranteed to always be zero, so we could shift those zero bits out. For example, for a pointee with 8-byte alignment, the low log2(8) = 3 bits are always zero.

It would look like this:

// (imports as they would already be in scope in std's hash map code)
const std = @import("std");
const math = std.math;
const Wyhash = std.hash.Wyhash;
const autoHash = std.hash.autoHash;

pub fn getAutoHashFn(comptime K: type) (fn (K) u32) {
    return struct {
        fn hash(key: K) u32 {
            switch (@typeInfo(K)) {
                .Pointer => |info| if (info.size != .Slice) {
                    // No need to pipe this through Wyhash. The least significant bits of
                    // a pointer address are already nearly perfect for use as a hash.
                    const bits_to_shift = comptime math.log2(info.alignment);
                    return @truncate(u32, @ptrToInt(key) >> bits_to_shift);
                },
                else => {},
            }
            var hasher = Wyhash.init(0);
            autoHash(&hasher, key);
            return @truncate(u32, hasher.final());
        }
    }.hash;
}

As usual it would be good to test this before blindly implementing it.
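In that spirit, a rough micro-benchmark could compare the two paths directly. A sketch (the iteration count is arbitrary, the accumulator only exists to keep the loops from being optimized away entirely, and timer.lap() resets between the two phases):

const std = @import("std");

// Rough sketch: hash the same pointer address many times, once through
// Wyhash and once by truncating the alignment-shifted address.
pub fn main() !void {
    var x: u32 = 0;
    const address = @ptrToInt(&x);
    const iterations: usize = 10_000_000;

    var timer = try std.time.Timer.start();
    var acc: u32 = 0;
    var i: usize = 0;
    while (i < iterations) : (i += 1) {
        var hasher = std.hash.Wyhash.init(0);
        std.hash.autoHash(&hasher, address);
        acc +%= @truncate(u32, hasher.final());
    }
    std.debug.print("wyhash:   {} ns (acc: {})\n", .{ timer.lap(), acc });

    const shift = comptime std.math.log2(@alignOf(u32));
    i = 0;
    while (i < iterations) : (i += 1) {
        acc +%= @truncate(u32, address >> shift);
    }
    std.debug.print("truncate: {} ns (acc: {})\n", .{ timer.lap(), acc });
}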

Related: #2

`maxrss` is skewed if `bench.zig` needs to be compiled

I've been trying to make gotta-go-fast more consistent, and have been testing by running the benchmarks for a single commit multiple times. I noticed that on the first run, if bench.zig needs to be compiled, maxrss will be much larger than it will on subsequent runs:

benchmark              maxrss run 1   maxrss run 2   maxrss run 3
self-hosted-parser     534872         95520          95284
zig-fmt                406304         97664          97744
translate-c-windows-h  408728         69344          69692
arena-allocator        461844         95512          94976
std-rand               461384         97816          97820
std-hash-map           460512         96332          95488
insert-10M-int         452404         221248         221248
random-distinct        453284         110736         110728
random-find            460872         99680          95232
ast-check-os           404396         69016          68724
ast-check-AstGen       404984         68940          69092
ast-check-Sema         405624         69432          69164

One extremely simple but quite wasteful fix would be to add

{
    const warmup_run = try execCapture(gpa, main_argv.items, .{ .cwd = bench_cwd });
    defer gpa.free(warmup_run);
}

before the const main_stdout = try execCapture(...) call.
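In context, that would look something like the following (a sketch: execCapture, gpa, main_argv, and bench_cwd are the names from the snippet above, and the surrounding code is assumed):

// Throwaway warmup run: absorbs the one-time cost of compiling
// bench.zig so it doesn't inflate the first maxrss sample.
{
    const warmup_run = try execCapture(gpa, main_argv.items, .{ .cwd = bench_cwd });
    defer gpa.free(warmup_run);
}
// The measured run, as before:
const main_stdout = try execCapture(gpa, main_argv.items, .{ .cwd = bench_cwd });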

Weird discrepancies in performance results in latest backfill

The latest run of the backfill script seems to be reporting much lower wall clock / CPU instruction count / etc. results for all the commits included in its run (the oldest commit included in the backfill was 0c1df96b17fcafdb0aa973dfcb1819929719d650, which is exactly where the performance difference appears).

[graph: zig-fmt results showing the jump at the oldest backfilled commit]

This is not a real performance difference, though, as the commit that is reported as the cause is unrelated to any of the current benchmarks. It also only affects certain benchmarks; some show no performance difference at this commit.

Really unsure what is causing this, but have noticed one thing:

  • Every affected benchmark (zig-fmt, ast-check-os, comptime-guid-parse, etc.) uses a child process for its benchmark (that is, it calls useChildProcess() in its setup fn). No benchmark that avoids a child process shows the same erroneous performance difference.
