
napkin-math's Introduction

Napkin Math

The goal of this project is to collect software, numbers, and techniques to quickly estimate the expected performance of systems from first principles. For example, how quickly can you read 1 GB of memory? By composing these resources, you should be able to answer interesting questions like: how much should you expect to pay for log storage for an application handling 100,000 RPS?

The best introduction to this skill is through my talk at SRECON.

The best way to practise napkin math in the grand domain of computers is to work on your own problems. The second-best is to subscribe to this newsletter, where you'll get a problem every few weeks to practise on. It should only take you a few minutes to solve each one as your facility with these techniques improves.

The archive of problems to practise with is here. The solution to each appears in the following newsletter.

Numbers

Below are numbers rounded from runs on a bare-metal Intel Xeon E-2236 3.4 GHz with 12 (virtual) cores.

Note 1: Some throughput and latency numbers don't line up; this is intentional, for ease of calculation.

Note 2: Take the numbers with a grain of salt. E.g. for I/O, fio is the state-of-the-art tool. I am continuously updating these numbers as I learn more and as hardware improves, to improve accuracy.

| Operation | Latency | Throughput | 1 MiB | 1 GiB |
| --- | --- | --- | --- | --- |
| Sequential Memory R/W (64 bytes) | 0.5 ns | | | |
| ├ Single Thread, No SIMD | | 10 GiB/s | 100 μs | 100 ms |
| ├ Single Thread, SIMD | | 20 GiB/s | 50 μs | 50 ms |
| ├ Threaded, No SIMD | | 30 GiB/s | 35 μs | 35 ms |
| ├ Threaded, SIMD | | 35 GiB/s | 30 μs | 30 ms |
| Network Same-Zone | | 10 GiB/s | 100 μs | 100 ms |
| ├ Inside VPC | | 10 GiB/s | 100 μs | 100 ms |
| ├ Outside VPC | | 3 GiB/s | 300 μs | 300 ms |
| Hashing, not crypto-safe (64 bytes) | 25 ns | 2 GiB/s | 500 μs | 500 ms |
| Random Memory R/W (64 bytes) | 50 ns | 1 GiB/s | 1 ms | 1 s |
| Fast Serialization † [8] [9] | N/A | 1 GiB/s | 1 ms | 1 s |
| Fast Deserialization † [8] [9] | N/A | 1 GiB/s | 1 ms | 1 s |
| System Call | 500 ns | N/A | N/A | N/A |
| Hashing, crypto-safe (64 bytes) | 500 ns | 200 MiB/s | 10 ms | 10 s |
| Sequential SSD Read (8 KiB) | 1 μs | 4 GiB/s | 200 μs | 200 ms |
| Context Switch [1] [2] | 10 μs | N/A | N/A | N/A |
| Sequential SSD Write, -fsync (8 KiB) | 10 μs | 1 GiB/s | 1 ms | 1 s |
| TCP Echo Server (32 KiB) | 10 μs | 4 GiB/s | 200 μs | 200 ms |
| Decompression [11] | N/A | 1 GiB/s | 1 ms | 1 s |
| Compression [11] | N/A | 500 MiB/s | 2 ms | 2 s |
| Sequential SSD Write, +fsync (8 KiB) | 1 ms | 10 MiB/s | 100 ms | 2 min |
| Sorting (64-bit integers) | N/A | 200 MiB/s | 5 ms | 5 s |
| Sequential HDD Read (8 KiB) | 10 ms | 250 MiB/s | 2 ms | 2 s |
| Blob Storage same region, 1 file | 50 ms | 500 MiB/s | 2 ms | 2 s |
| Blob Storage same region, n files | 50 ms | NW limit | | |
| Random SSD Read (8 KiB) | 100 μs | 70 MiB/s | 15 ms | 15 s |
| Serialization [8] [9] | N/A | 100 MiB/s | 10 ms | 10 s |
| Deserialization [8] [9] | N/A | 100 MiB/s | 10 ms | 10 s |
| Proxy: Envoy/ProxySQL/Nginx/HAProxy | 50 μs | ? | ? | ? |
| Network within same region | 250 μs | 2 GiB/s | 500 μs | 500 ms |
| Premium network within zone/VPC | 250 μs | 25 GiB/s | 50 μs | 40 ms |
| {MySQL, Memcached, Redis, ..} Query | 500 μs | ? | ? | ? |
| Random HDD Read (8 KiB) | 10 ms | 0.7 MiB/s | 2 s | 30 min |
| Network between regions [6] | Varies | 25 MiB/s | 40 ms | 40 s |
| Network NA Central <-> East | 25 ms | 25 MiB/s | 40 ms | 40 s |
| Network NA Central <-> West | 40 ms | 25 MiB/s | 40 ms | 40 s |
| Network NA East <-> West | 60 ms | 25 MiB/s | 40 ms | 40 s |
| Network EU West <-> NA East | 80 ms | 25 MiB/s | 40 ms | 40 s |
| Network EU West <-> NA Central | 100 ms | 25 MiB/s | 40 ms | 40 s |
| Network NA West <-> Singapore | 180 ms | 25 MiB/s | 40 ms | 40 s |
| Network EU West <-> Singapore | 160 ms | 25 MiB/s | 40 ms | 40 s |

†: "Fast serialization/deserialization" is typically a simple wire-protocol that just dumps bytes, or a very efficient environment. Typically standard serialization such as e.g. JSON will be of the slower kind. We include both here as serialization/deserialization is a very, very broad topic with extremely different performance characteristics depending on data and implementation.

You can run the suite with ./run, which uses the right optimization levels; you won't get representative numbers from a debug build. You can help this project by adding new suites and filling out the blanks.

Note: I'm currently porting the benchmarks over to Criterion.rs, so some are in bench/ now. You can run those by uncommenting the relevant line in ./run.

I am aware of some inefficiencies in this suite. I intend to improve my skills in this area to ensure the numbers are the upper bound of performance you may be able to squeeze out in production. I find it highly unlikely that any of them are more than 2-3x off, which shouldn't be a problem for most uses.

Cost Numbers

Approximate numbers that should be consistent across cloud providers.

| What | Amount | $ / Month | 1y commit $ / Month | Spot $ / Month | Hourly Spot $ |
| --- | --- | --- | --- | --- | --- |
| CPU | 1 | $15 | $10 | $2 | $0.005 |
| GPU | 1 | $5000 | $3000 | $1500 | $2 |
| Memory | 1 GB | $2 | $1 | $0.2 | $0.0005 |
| Storage | | | | | |
| ├ Warehouse Storage | 1 GB | $0.02 | | | |
| ├ Blob (S3, GCS) | 1 GB | $0.02 | | | |
| ├ Zonal HDD | 1 GB | $0.05 | | | |
| ├ Ephemeral SSD | 1 GB | $0.08 | $0.05 | $0.05 | $0.07 |
| ├ Regional HDD | 1 GB | $0.1 | | | |
| ├ Zonal SSD | 1 GB | $0.2 | | | |
| ├ Regional SSD | 1 GB | $0.35 | | | |
| Networking | | | | | |
| ├ Same Zone | 1 GB | $0 | | | |
| ├ Blob | 1 GB | $0 | | | |
| ├ Ingress | 1 GB | $0 | | | |
| ├ Inter-Zone | 1 GB | $0.01 | | | |
| ├ Inter-Region | 1 GB | $0.02 | | | |
| ├ Internet Egress † | 1 GB | $0.1 | | | |
| CDN Egress | 1 GB | $0.05 | | | |
| CDN Fill ‡ | 1 GB | $0.01 | | | |
| Warehouse Query | 1 GB | $0.005 | | | |
| Logs/Traces ♣ | 1 GB | $0.5 | | | |
| Metrics | 1000 | $20 | | | |

† This refers to traffic leaving your cloud provider, e.g. sending data to S3 from GCP, or egress for serving HTML from AWS to a client.

‡ An additional per cache-fill fee is incurred that costs close to blob storage write costs (see just below).

♣ This is standard pricing among a few logging providers, but e.g. Datadog's pricing is different: it charges $0.1 per GB of ingested logs, with $1.5 per 1M log events on top for 7-day retention.

Furthermore, for blob storage (S3/GCS/R2/...), you're charged per read/write operation (fewer, larger files are cheaper):

| Operation | Per 1M | Per 1,000 |
| --- | --- | --- |
| Reads | $0.4 | $0.0004 |
| Writes | $5 | $0.005 |
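To get a feel for how these cost numbers compose, here is a hedged sketch pricing a hypothetical service. The workload (16 CPUs, 64 GB of memory, storage sizes, egress volume, and operation counts) is entirely made up for illustration; the unit prices are the rounded figures from the tables above.

```rust
// Hypothetical monthly bill, built from the rounded unit prices above.
// Every workload number here (CPUs, GB stored, GB egressed, operation
// counts) is an assumption chosen purely for illustration.
fn main() {
    let compute = 16.0 * 15.0 + 64.0 * 2.0;     // 16 CPUs at $15, 64 GB memory at $2
    let storage = 500.0 * 0.2 + 2_000.0 * 0.02; // 500 GB zonal SSD, 2 TB blob storage
    let egress = 1_000.0 * 0.1;                 // 1 TB internet egress at $0.1/GB
    let blob_ops = 50.0 * 0.4 + 5.0 * 5.0;      // 50M blob reads, 5M blob writes

    let total = compute + storage + egress + blob_ops;
    println!("~${total:.0}/month"); // roughly $650/month
}
```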

Compression Ratios

These ratios are drawn from a few sources. [3] [4] [5] Note that compression speeds (but generally not ratios) vary by an order of magnitude depending on the algorithm and the compression level (which trades speed for compression).

I typically ballpark that each additional 1x of compression ratio costs roughly 10x in speed. E.g. we can get a 2x ratio on English Wikipedia at ~200 MiB/s, 3x at ~20 MiB/s, and 4x at ~1 MB/s (see the sketch below the table).

| What | Compression Ratio |
| --- | --- |
| HTML | 2-3x |
| English | 2-4x |
| Source Code | 2-4x |
| Executables | 2-3x |
| RPC | 5-10x |
| SSL | -2% [10] |
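Here is the sketch referenced above: it takes the assumed English-text points (2x at ~200 MiB/s, 3x at ~20 MiB/s, 4x at ~1 MiB/s) and turns them into compression time and stored size for 1 GiB of text.

```rust
// Ballpark: each extra 1x of compression ratio costs roughly 10x in speed.
// The (ratio, MiB/s) pairs are the assumed English-text numbers from the
// paragraph above, not measurements.
fn main() {
    let gib_in_mib = 1024.0;
    for (ratio, mib_per_s) in [(2.0, 200.0), (3.0, 20.0), (4.0, 1.0)] {
        let seconds = gib_in_mib / mib_per_s;
        let stored_mib = gib_in_mib / ratio;
        println!("{ratio}x: ~{seconds:.0} s to compress 1 GiB, ~{stored_mib:.0} MiB stored");
    }
    // => 2x: ~5 s, 512 MiB; 3x: ~51 s, 341 MiB; 4x: ~17 min, 256 MiB.
}
```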

Techniques

  • Don't overcomplicate. If you are basing your calculation on more than 6 assumptions, you're likely making it harder than it should be.
  • Keep the units. They're a good checksum. Wolframalpha has terrific support if you need a hand converting e.g. KiB to TiB.
  • Calculate with exponents. A lot of back-of-the-envelope calculations are done with just coefficients and exponents, e.g. c * 10^e. Your goal is to get within an order of magnitude right--that's just e. c matters a lot less. Only worrying about single-digit coefficients and exponents makes it much easier on a napkin (not to speak of all the zeros you avoid writing).
  • Perform Fermi decomposition. Write down things you can guess at until you can start to hint at an answer. When you want to know the cost of storage for logging, you're going to want to know how big a log line is, how many of those you have per second, what that costs, and so on (a sketch follows this list).
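As an example of such a decomposition, here is a sketch of the logging-cost question from the introduction. The bytes of logs per request and the rounded seconds per month are assumptions; the $0.5/GB price is the Logs/Traces row from the cost table above.

```rust
// Fermi decomposition: monthly logging cost for a service doing 100,000 RPS.
// The log volume per request is an assumption; the $0.5/GB price is the
// rounded Logs/Traces figure from the cost table.
fn main() {
    let rps = 100_000.0;
    let log_bytes_per_request = 1_000.0; // ~1 KB of log lines per request (assumed)
    let seconds_per_month = 2.6e6;       // ~30 days, rounded for napkin math
    let dollars_per_gb = 0.5;            // Logs/Traces row above

    let gb_per_month = rps * log_bytes_per_request * seconds_per_month / 1e9;
    println!("~{:.0} TB/month of logs", gb_per_month / 1e3);                       // ~260 TB
    println!("~${:.0}k/month to store them", gb_per_month * dollars_per_gb / 1e3); // ~$130k
}
```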

Resources

napkin-math's People

Contributors

bouk, bwplotka, dependabot-preview[bot], manumilou, mble, pacak, pviotti, sirupsen


napkin-math's Issues

Suggestion: rename Random SSD Seek to Random SSD Read

My understanding is that a sequential SSD read takes 1us and a random SSD read takes 100us.

In the repository, "Random SSD Seek" should be renamed to "Random SSD Read" so you can compare it to "Sequential SSD Read" easily.

Otherwise, users may think that to do a random SSD read, you first need to do a random SSD seek and then a sequential SSD read.

Add more napkin math problems to https://sirupsen.com/

Hi @sirupsen - not sure if this is the right place to post this, but I've done most of your napkin math problems on your website, and would love to do more, if you could make them.

Granted, I'm sure it takes a while to come up with and solve these problems. So, I noticed we could use ChatGPT to generate these problems and create an initial solution:

These are not the best, but you could make better ones if you give it a topic and some more details.

(Two screenshots of ChatGPT-generated example problems omitted.)

Limit memory allocations by size instead of time

Hey

On my machine, I can write to memory at ~3 GB/s, but I don't have 5 seconds' worth of memory available to allocate. By changing the benchmark duration to 3.5 s, the writes finish successfully (otherwise I get an allocation error).

It'd be nice to be able to limit the memory usage by available memory instead of by time.

[Write Seq Vec<usize>] Iterations in 2004 miliseconds, no overhead: 899,250,576
[Write Seq Vec<usize>] Iterations / second: 448,727,832
[Write Seq Vec<usize>] Bytes handled per iteration: 8 bytes
[Write Seq Vec<usize>] Total bytes processed: 6.700 GiB
[Write Seq Vec<usize>] Throughput: 3.343 GiB/s
[Write Seq Vec<usize>] Avg single iteration: 2.230 ns
[Write Seq Vec<usize>] Avg single iteration cycles: 8.51
[Write Seq Vec<usize>] Time to process 1 MiB: 292 μs
[Write Seq Vec<usize>] Time to process 1 GiB: 299 ms
[Write Seq Vec<usize>] Time to process 1 TiB: 5.10 min
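A minimal sketch of the byte-budget idea proposed in this issue, not the repository's actual benchmark loop: bound the sequential-write loop by a fixed allocation size instead of a duration (timing and reporting omitted). The 4 GiB budget is an arbitrary assumption.

```rust
// Sketch only, not the repository's code: cap the sequential-write benchmark
// by a byte budget rather than a wall-clock duration, so it never allocates
// more memory than the machine can provide. Timing/reporting omitted.
fn main() {
    let byte_budget: usize = 4 << 30; // 4 GiB, an assumed limit
    let mut buf: Vec<u64> = Vec::with_capacity(byte_budget / 8);
    while buf.len() * 8 < byte_budget {
        buf.push(buf.len() as u64); // one 8-byte sequential write per iteration
    }
    println!("wrote {} GiB", (buf.len() * 8) / (1 << 30));
}
```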

Fix random memory read benchmark

Even though we only look at 64 bits, a full cache-line will be transferred. This means we should expect up to 8x the throughput. Let's change it to read a full cache-line instead. This shouldn't matter much for the sequential benchmarks, but they'd likely also be sped up.
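A minimal sketch of that change, not the repository's actual benchmark: touch a full 64-byte cache line per random index rather than a single 8-byte word, so the measured throughput counts the bytes the hardware actually transfers.

```rust
// Illustration only: read a whole 64-byte cache line per random access,
// since the memory system transfers the full line regardless of whether
// we look at 8 bytes or 64.
fn sum_random_cache_lines(data: &[[u8; 64]], indices: &[usize]) -> u64 {
    let mut sum = 0u64;
    for &i in indices {
        // Summing all 64 bytes forces the whole line to be read.
        sum += data[i].iter().map(|&b| b as u64).sum::<u64>();
    }
    sum
}

fn main() {
    let data = vec![[1u8; 64]; 1 << 20]; // 64 MiB of cache lines
    // Stand-in for a shuffled index order, to avoid pulling in an RNG crate.
    let indices: Vec<usize> = (0..data.len()).rev().collect();
    println!("checksum: {}", sum_random_cache_lines(&data, &indices));
}
```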

More metrics

Maybe it's worth adding these?

  • A packet round trip within a datacenter
  • Pinging an external DNS server

Anki flash card

Do you have an Anki flash card deck? You mentioned it in your talk, but I can't find it. And great thanks for this repo.

Disk read benchmark fails if "foo.txt" is not present

Sorry for just reporting instead of fixing; I'm a bit short of time right now and not at all familiar with Rust.

I installed Rust and cloned the repo, but when I try to run the benchmark (cargo run --release) I get the following errors:

    Finished release [optimized] target(s) in 2m 23s
     Running `target\release\base-rates.exe`

[Read Seq Vec<usize>] Iterations in 100 miliseconds, no overhead: 134,217,727
[Read Seq Vec<usize>] Iterations / second: 1,342,177,270
[Read Seq Vec<usize>] Bytes handled per iteration: 8 bytes
[Read Seq Vec<usize>] Total bytes processed: 1024.000 MiB
[Read Seq Vec<usize>] Throughput: 10.000 GiB/s
[Read Seq Vec<usize>] Avg single iteration: 0.752 ns
[Read Seq Vec<usize>] Avg single iteration cycles: 2.71
[Read Seq Vec<usize>] Time to process 1 MiB: 98 μs
[Read Seq Vec<usize>] Time to process 1 GiB: 100 ms
[Read Seq Vec<usize>] Time to process 1 TiB: 103.31 s

[Write Seq Vec<usize>] Iterations in 5655 miliseconds, no overhead: 538,076,672
[Write Seq Vec<usize>] Iterations / second: 95,150,605
[Write Seq Vec<usize>] Bytes handled per iteration: 8 bytes
[Write Seq Vec<usize>] Total bytes processed: 4.009 GiB
[Write Seq Vec<usize>] Throughput: 725.942 MiB/s
[Write Seq Vec<usize>] Avg single iteration: 10.510 ns
[Write Seq Vec<usize>] Avg single iteration cycles: 37.84
[Write Seq Vec<usize>] Time to process 1 MiB: 1377 μs
[Write Seq Vec<usize>] Time to process 1 GiB: 1410 ms
[Write Seq Vec<usize>] Time to process 1 TiB: 24.07 min

[Random Read Vec<usize>] Iterations in 4028 miliseconds, no overhead: 134,217,727
[Random Read Vec<usize>] Iterations / second: 33,321,183
[Random Read Vec<usize>] Bytes handled per iteration: 8 bytes
[Random Read Vec<usize>] Total bytes processed: 1024.000 MiB
[Random Read Vec<usize>] Throughput: 254.220 MiB/s
[Random Read Vec<usize>] Avg single iteration: 30 ns
[Random Read Vec<usize>] Avg single iteration cycles: 108.04
[Random Read Vec<usize>] Time to process 1 MiB: 3933 μs
[Random Read Vec<usize>] Time to process 1 GiB: 4.03 s
[Random Read Vec<usize>] Time to process 1 TiB: 1.15 hours

[Real: Random Write Vec<usize>] Iterations in 5006 miliseconds, no overhead: 475,077,717
[Real: Random Write Vec<usize>] Iterations / second: 94,901,661
[Real: Random Write Vec<usize>] Bytes handled per iteration: 8 bytes
[Real: Random Write Vec<usize>] Total bytes processed: 3.540 GiB
[Real: Random Write Vec<usize>] Throughput: 724.042 MiB/s
[Real: Random Write Vec<usize>] Avg single iteration: 10.539 ns
[Real: Random Write Vec<usize>] Avg single iteration cycles: 37.94
[Real: Random Write Vec<usize>] Time to process 1 MiB: 1381 μs
[Real: Random Write Vec<usize>] Time to process 1 GiB: 1414 ms
[Real: Random Write Vec<usize>] Time to process 1 TiB: 24.13 min
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "The system cannot find the file specified." }', src\libcore\result.rs:1084:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
error: process didn't exit successfully: `target\release\base-rates.exe` (exit code: 101)

With RUST_BACKTRACE=1 I get the following (not very helpful IMHO) backtrace:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "The system cannot find the file specified." }', src\libcore\result.rs:1084:5
stack backtrace:
   0: backtrace::backtrace::trace_unsynchronized
             at C:\Users\VssAdministrator\.cargo\registry\src\github.com-1ecc6299db9ec823\backtrace-0.3.34\src\backtrace\mod.rs:66
   1: std::sys_common::backtrace::_print
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\sys_common\backtrace.rs:47
   2: std::sys_common::backtrace::print
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\sys_common\backtrace.rs:36
   3: std::panicking::default_hook::{{closure}}
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\panicking.rs:200
   4: std::panicking::default_hook
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\panicking.rs:214
   5: std::panicking::rust_panic_with_hook
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\panicking.rs:477
   6: std::panicking::continue_panic_fmt
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\panicking.rs:384
   7: std::panicking::rust_begin_panic
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\panicking.rs:311
   8: core::panicking::panic_fmt
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libcore\panicking.rs:85
   9: core::result::unwrap_failed
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libcore\result.rs:1084
  10: alloc::alloc::box_free
  11: alloc::alloc::box_free
  12: <T as core::any::Any>::type_id
  13: std::rt::lang_start_internal::{{closure}}
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\rt.rs:49
  14: std::panicking::try::do_call<closure-0,i32>
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\panicking.rs:296
  15: panic_unwind::__rust_maybe_catch_panic
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libpanic_unwind\lib.rs:80
  16: std::panicking::try
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\panicking.rs:275
  17: std::panic::catch_unwind
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\panic.rs:394
  18: std::rt::lang_start_internal
             at /rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54\/src\libstd\rt.rs:48
  19: main
  20: invoke_main
             at d:\agent\_work\2\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:78
  21: __scrt_common_main_seh
             at d:\agent\_work\2\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:288
  22: BaseThreadInitThunk
  23: RtlUserThreadStart
error: process didn't exit successfully: `target\release\base-rates.exe` (exit code: 101)

The directory does not have a file called foo.txt. If I create it, then the benchmark is able to execute this test.
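One possible fix, sketched here rather than taken from the repository's source: create foo.txt (zero-filled, with an arbitrarily chosen size) if it is missing before the disk-read benchmark opens it.

```rust
// Sketch of a possible fix, not the repository's actual code: ensure the
// disk benchmark's input file exists before the read test tries to open it.
use std::fs;
use std::path::Path;

fn ensure_benchmark_file(path: &str, size_bytes: usize) -> std::io::Result<()> {
    if !Path::new(path).exists() {
        // Zero-filled contents are fine; the read benchmark only needs bytes.
        fs::write(path, vec![0u8; size_bytes])?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    ensure_benchmark_file("foo.txt", 64 * 1024 * 1024) // 64 MiB is an assumed size
}
```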

Data center level benchmarks?

There is a gap in these numbers at the cloud-vendor data-center level.

I was thinking of contributing some benchmarks via shell scripts. While I was at it, I could also try to get some market prices to integrate cost.

Is this a good model for Azure/GCP too or am I missing a common data center abstraction?

  • In-data-center DNS.
  • Vendor permission authentication (AWS IAM equivalent)
  • Latency/bandwidth/concurrency of blob store (AWS S3 equivalent)
  • Latency with vendor control plane API (AWS CLI equivalent)
  • Latency/bandwidth/concurrency of block storage (AWS EBS equivalent)
  • Latency of vendor stock linux container worker (AWS Lambda equivalent)
  • Latency/bandwidth/concurrency of message ingress (AWS Kinesis equivalent)

Some questions

Hey Simon,

Thanks for the great talk and resource!
I'm quite new to all things perf and also haven't really used Rust before.
I had a couple of questions:

  1. When I ran your tool on an AWS EC2 host running Ubuntu 18.04, all the memory functions failed. I wonder if that's related to some host restrictions?
  2. I didn't see Brendan Gregg in any of your references; I think his work might be worth mentioning here.

Cheers

Any tools to get real metrics reliably

I wonder if there is a tool that can inspect the hardware in detail and output a report with these metrics, so that we can do napkin math more accurately. It would also help with comparing different machines when doing performance experiments.
