
netperf's Introduction

Dashboard

Network Performance Testing

This repo is the common place that other Microsoft-owned networking projects (including Windows itself) use to run, store, and visualize networking performance testing. Currently, the following projects use (or will use) this:

Goal

Historically, networking performance testing has been spotty, inconsistent, not reproducible, and not easily accessible. Different groups or projects test performance in different ways, on different hardware. They even have different definitions of things like throughput and latency. This repo aims to fix that by providing a common, open place to run, store, and visualize networking performance testing. The end result is ultimately a set of dashboards summarizing the performance of the various projects, across various scenarios, and across various platforms.

Documentation

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

netperf's People

Contributors

alan-jowett, ami-gs, dependabot[bot], keith-horton, maolson-msft, microsoft-github-operations[bot], microsoftopensource, mtfriesen, nibanks, projectsbyjackhe, step-security-bot

netperf's Issues

Add Dashboard Dev Automation

Since we're deploying our dashboard with GitHub Pages and using React, we should set up an Actions workflow that automatically updates the /dist directory on the --deploy branch whenever new code is checked in to 'main' under the /dashboard directory.

Netperf Version 1 TODO: Migrate F4 VMs to new Experimental boost VMs.

We currently use Standard_F4s_v2 VMs in Azure, which document a maximum network throughput of 10,000 Mbps. Based on the TCP throughput data we're seeing, it looks like we are hitting that limit:

[screenshot: TCP throughput results plateauing at roughly the 10,000 Mbps limit]

We will probably need to increase the VM size.

The Standard_F8s_v2 size (2x the cost) increases the limit to 12,500 Mbps. The next bandwidth increase isn't until the Standard_F32s_v2 size, which supports 16,000 Mbps but is 8x the cost!

[screenshot: Azure Fsv2-series VM sizes with their expected network bandwidth and cost]

Documentation & script fixes

The script doesn't work by default when following the documentation, so I went through its contents and found the issues below. As written, it is not reliable to run.

-u and -p are also required parameters, but the documented invocation doesn't include them:

bash setup-runner-linux.sh -i <peerip> -g <github token *do this on client only> -n <no reboot *optional>

Why is ssh-copy-id needed afterwards when it is already done in the script?

ssh-copy-id <username of peer>@<peerip>

These permissions are far too open; the private key should be 400:

chmod 777 $HOME/.ssh/id_rsa

I forget the exact behavior, but this script may need to be executed from inside that directory; otherwise it should at least show an error.

bash $HOME/actions-runner/config.sh --url https://github.com/microsoft/netperf --token $githubtoken --labels $runnerlabels --unattended

The sleep in the script is not needed.


~ already expands to the home directory (try echo ~), so hard-coding it is unnecessary:

HOME="/home/$username"

If you want to split the server/client roles (as is already done via the GitHub token), openssh-server and its config are not needed on the client. This is just a recommendation.

sudo apt install openssh-server -y

Collect Crash Dumps on Hang / Timeout

Today in the QUIC runs we're occasionally seeing timeouts that indicate a hang. We do have watchdogs in the code, but if the hang occurs after the watchdog has been stopped, during general app cleanup, then we still have a problem. We need to leverage something like procdump (or notmyfault for kernel-mode hangs) to collect dumps when this happens.
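Something like the following might work as a starting point for the user-mode case. This is only a rough sketch: the command line, timeout, and dump path are placeholders, and it assumes procdump.exe is available on the runner's PATH.

# Hypothetical sketch: run the perf tool under a timeout and grab a full dump if it hangs.
import subprocess

TIMEOUT_SECONDS = 300                       # placeholder watchdog timeout
CMD = ["secnetperf.exe", "-exec:maxtput"]   # placeholder command line

proc = subprocess.Popen(CMD)
try:
    proc.wait(timeout=TIMEOUT_SECONDS)
except subprocess.TimeoutExpired:
    # Capture a full user-mode dump of the hung process before killing it.
    subprocess.run(["procdump.exe", "-accepteula", "-ma", str(proc.pid), "hang.dmp"])
    proc.kill()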

Regression Detection

Test results need to be compared to historical results to determine whether a new value might be a regression.
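One simple way to do this is to compare the new value against the mean and standard deviation of recent history. The window size and threshold below are arbitrary assumptions, not an agreed policy.

# Hypothetical sketch: flag a result as a possible regression if it falls well below
# the recent historical average for the same test/configuration.
import statistics

def is_regression(new_kbps, history_kbps, window=20, num_stddevs=2.0):
    recent = history_kbps[-window:]
    if len(recent) < 5:      # not enough history to judge
        return False
    mean = statistics.mean(recent)
    stddev = statistics.pstdev(recent)
    return new_kbps < mean - num_stddevs * stddev

# Example: compare the latest run against the stored history for this test.
print(is_regression(850000, [930000, 934493, 928000, 941000, 935500]))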

Migrate off of using filenames to pass environment data [NEED TO INCLUDE HYPHEN FOR os_name]

As of today, as part of the SQL automation, we use a fixed format for filenames we save in our pipeline.

That fixed format allows us to build automation that parses for keywords and queries the database.

As a consequence, because we split the names with a hyphen "-" (e.g. test-results-ubuntu-20.05-x64), we use the hyphen as a delimiter.

Notice how the name "ubuntu-20.05" has a hyphen in it even though it's one atomic item we want parsed. This makes the parsing logic extra complicated and requires all future os_name values in the YAML to include a hyphen.

In the short term this is OK, but when we refactor, this is one area to improve (see the parsing sketch below).
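A minimal sketch of how the parsing can cope with this today, assuming the fixed test-results-<os_name>-<arch> convention; the helper name is made up for illustration.

# Hypothetical sketch: the arch is always the last hyphen-separated token, so everything
# between the fixed prefix and the last hyphen must be the OS name, hyphens and all.
def parse_filename(name):
    prefix = "test-results-"
    assert name.startswith(prefix)
    os_name, _, arch = name[len(prefix):].rpartition("-")
    return {"os_name": os_name, "arch": arch}

print(parse_filename("test-results-ubuntu-20.05-x64"))
# {'os_name': 'ubuntu-20.05', 'arch': 'x64'}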

Support Dashboard Data Consumption

As of now, everything infrastructure-wise is set up to run SecNetPerf and produce throughput data. That data gets checked in to the --deploy branch, saved as JSON, e.g. https://microsoft.github.io/netperf/data/secnetperf/2023-12-01-20-25-09._.67ee09354f52d014ad4e9ec85fcb6b9260890134.json/test_result.json

Ideally, we should have an automation that also aggregates all the performance runs from the last ~20 commits/runs and produces one data.json (secnetperf.json?) for each project (XDP, eBPF, ...); a rough sketch of such a step is shown below.

The dashboard will just use this single file. We can add more features to the dashboard to query any number of commits in the past, but that's a future thing.
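A rough sketch of what that aggregation step might look like; the directory layout and file names are assumptions based on the example URL above, not the final design.

# Hypothetical sketch: merge the most recent per-run result files into one JSON file
# that the dashboard can fetch with a single request.
import json
from pathlib import Path

RESULTS_DIR = Path("data/secnetperf")      # assumed: one folder per run, named by date + commit
OUTPUT_FILE = Path("data/secnetperf.json")
MAX_RUNS = 20

run_dirs = sorted(RESULTS_DIR.iterdir(), reverse=True)[:MAX_RUNS]  # date-prefixed names sort newest-first when reversed
aggregate = []
for run_dir in run_dirs:
    result_file = run_dir / "test_result.json"
    if result_file.exists():
        aggregate.append(json.loads(result_file.read_text()))

OUTPUT_FILE.write_text(json.dumps(aggregate, indent=2))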

Support Data Analysis Solution

Since our perf data will be stored on an orphan branch, we still want devs to be able to run intricate queries and analyze details in the perf data that may not be available on the dashboard.

In general, we want to eliminate the dependency on the dashboard.

The solution I have tentatively adopted is to create custom workflow runs that accept a parameter (last N commits) and run an Actions workflow that calls a set of PowerShell / Python scripts to produce a CSV or a SQL-compatible file, so devs can use existing tools like Excel or PostgreSQL to look at the columns and do analysis without having to rely on the dashboard (see the sketch below).

This should be super easy for any MsQuic / XDP / Windows dev: they just need to go to Netperf, run the workflow associated with their project, and download the files to analyze.
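A rough sketch of the CSV-export step such a workflow might run; the field names follow the example DB format discussed in the next issue, and the aggregated input file is an assumption.

# Hypothetical sketch: flatten aggregated run data into a CSV that Excel or
# PostgreSQL's COPY can consume directly.
import csv
import json
from pathlib import Path

runs = json.loads(Path("data/secnetperf.json").read_text())

with open("secnetperf.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["commit", "run_date", "test", "result_line"])
    for run in runs:
        commit = run.get("TestConfig", {}).get("MsQuicCommit", "")
        run_date = run.get("RunDate", "")
        for test_name, output in run.get("TestRuns", {}).items():
            for line in output:
                if line.startswith("Result:"):
                    writer.writerow([commit, run_date, test_name, line])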

Document DB Data Format

I was taking a look at an example output here, and I have a few suggestions. Also, we need to not forget to update api-interface-schema.md with the format once we're done.

TestRuns

  "TestRuns": {
    "Max throughput test using TCP protocol with -upload:10000, -timed:1": [
      "Started!",
      "",
      "Result: 1172439040 bytes @ 934493 kbps (10037.006 ms).",
      "App Main returning status 0"
    ],
    "Max throughput test using QUIC protocol with -upload:10000, -timed:1": [
      "Started!",
      "",
      "Result: 1264713728 bytes @ 908461 kbps (11137.188 ms).",
      "App Main returning status 0"
    ]
  },
  • The file doesn't need to be human-readable on its own. We will document the schema, and the tools will have documentation indicating what certain arguments do, so we don't need the "Max throughput test using..." parts. We should replace them with simply the arguments passed to the tool.

  • This shouldn't be a list of test runs but should instead be a multi-level list of tests, each with a set of multiple runs. I would format it like this:

"Tests": {
  "1": {
    "ServerArgs": "...",
    "ClientArgs": "...",
    "Runs": [
      "Result: 1172439040 bytes @ 934493 kbps (10037.006 ms).",
      "..."
    ]
  }
}
  • Additionally, you will notice I removed everything but the "Result: " line in the Runs list. I'm still thinking about whether we should further preprocess and simplify the output here. Currently, we have several types of results, depending on what we're looking at/for (a parsing sketch follows this list):
  1. For throughput, we only care about the rate number (i.e. in kbps).
  2. For RPS, we only care about the RPS rate number.
  3. For HPS, we only care about the HPS rate number.
  4. For RPS latency, it's more complex because we care about the latency curve, which consists of multiple numbers, likely stored in a separate file; but maybe we store certain percentiles here?
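For the first three cases, pulling out just the rate number could be as simple as the sketch below, which assumes the exact "Result: ... bytes @ ... kbps (...)" format shown above.

# Hypothetical sketch: extract the kbps rate from a secnetperf "Result:" line.
import re

RESULT_RE = re.compile(r"Result: (\d+) bytes @ (\d+) kbps \(([\d.]+) ms\)")

def parse_throughput_kbps(line):
    match = RESULT_RE.match(line)
    return int(match.group(2)) if match else None

print(parse_throughput_kbps("Result: 1172439040 bytes @ 934493 kbps (10037.006 ms)."))
# 934493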

TestEnv

"TestEnv": {
  "Client": {
    "NIC": "Mellanox ConnectX-5",
    "CPU": "Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz",
    "OS": "Windows Server 2022",
    "Arch": "x64"
  },
  "Server": {
    "NIC": "Mellanox ConnectX-5",
    "CPU": "Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz",
    "OS": "Windows Server 2022",
    "Arch": "x64"
  }
},
"RunDate": "2023-12-04-15-15-27",
"MachineName": "netperf-secnetp",
"TestConfig": {
  "MsQuicCommit": "5a6975e82f6ab97b83d10abf92f862e24910bf03",
  "PerfTool": "secnetperf"
}
  • Let's put this first in the file.
  • MachineName should actually be Client and Server specific and not go in the top-level part.
  • Instead of TestConfig, let's call it Tool, put it first, and rename the fields in it:
"Tool": {
  "Name": "secnetperf",
  "Commit": "5a6975e82f6ab97b83d10abf92f862e24910bf03",
  "Config": "-TLS Schannel"
}
  • We need the Config option to handle various build configuration options for the tool itself.
  • For Client and Server, let's put those last. Additionally, for their field values, I think we should try to compress things some, as we really don't need all the explicit details. Maybe something like this:
"Client": {
  "NIC": "CX5",
  "CPU": "8272CL",
  "OS": "WS", # I thought about Windows, Win or WinSer too, but we should eventually support Windows Client as well
  "Ver": "2022"
  "Arch": "x64"
}
  • Should we capture NIC driver versions? What about any other configuration knobs for them?
  • Another thought: Should we really separate out Client and Server here? While it would be nice to test one type against another, we've never done that up until now, and if we do go down that road, the matrix would explode... Food for thought.

General Issues

  • With how this is currently set up, you can only put one set of runs for a single build/OS configuration in a file. In other words, Windows Server 2019 and Windows Server 2022 results go in separate files; Ubuntu 20.04 goes in a different one; and so on. But you name this file test_result.json, which is too generic. So, we need to either:
    • update the file to support multiple configurations in the same file, or
    • we need to name the file based on the configuration.
  • I think putting stuff like that into the file name gets pretty messy, since we'd really need to include things like OS type and version, TLS type, IO type, etc. So we should probably update the JSON format to have a top-level array to store per-configuration test/run data.
  • BUT this gets further complicated by my next thought: I'm not sure using a directory structure based on date/time is the right approach. A few thoughts here:
    • We want to support running a subset of tests when part of a test matrix changes. For instance, if we update the prerelease version of Windows, then run all the Windows tests again on that new build. If MsQuic changes commit, run it on all OS versions.
    • So, how do we store this data, and how do we index and search it appropriately? If the dashboard needs to grab "the latest" data, where does it look? If we have files by date, some of the latest data might be in one file while the rest is in one or more other files.
  • So, I then wonder if we should have just a raw folder that holds each configuration's runs, named by date/time and/or a GUID (for uniqueness), and then another directory structure that can be used as an index pointing to the relevant raw files (or just the latest?); see the sketch after this list.
  • This is where a real DB might be a better option, but I'm not sure.
  • Additionally, should we have a separate directory per tool name? What if we want to support multiple tools for the same XDP commit (i.e., if we get MsQuic tests running for a given XDP commit, plus the normal XDP perf results)?
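For illustration only, the raw-plus-index idea could look roughly like this; the directory layout and configuration key are assumptions, not a decided format.

# Hypothetical sketch: scan raw per-run files and build a small index that maps each
# configuration to its most recent raw result file.
import json
from pathlib import Path

RAW_DIR = Path("raw")                  # assumed: raw/<date-or-guid>.json, one file per run
INDEX_FILE = Path("index/latest.json")

latest = {}
for raw_file in sorted(RAW_DIR.glob("*.json")):   # date-prefixed names sort oldest to newest
    run = json.loads(raw_file.read_text())
    env = run["TestEnv"]["Client"]
    config_key = env["OS"] + "-" + env["Arch"]    # assumed configuration key
    latest[config_key] = str(raw_file)            # later files overwrite earlier ones

INDEX_FILE.parent.mkdir(exist_ok=True)
INDEX_FILE.write_text(json.dumps(latest, indent=2))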

Bottom line, I think this is on the right track, but we need to discuss some of these things for sure.

Refactor data processing pipeline.

As of right now, we generate the .sql file dynamically and have a simple Python script batch-execute all the SQL files generated by the test automation to save data to the database and update other state.

A new proposal for a better data architecture is to offload the generation of the SQL script to the Python script and use intermediary JSON files in between (see the sketch below).

The benefit of this approach is that we have more control over how we generate the SQL script, and we get a clean separation from the PowerShell script that actually runs the jobs.

This is a P2 right now, since the logic we need to generate at the end isn't particularly complicated yet.
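A minimal sketch of that intermediary step; the table and column names are placeholders, and sqlite3 here is only a stand-in for whatever database the pipeline actually targets.

# Hypothetical sketch: turn an intermediary JSON result file into parameterized inserts,
# instead of generating raw .sql text from the PowerShell job.
import json
import sqlite3
from pathlib import Path

run = json.loads(Path("test_result.json").read_text())

conn = sqlite3.connect("netperf.db")
conn.execute("CREATE TABLE IF NOT EXISTS runs (commit_hash TEXT, run_date TEXT, test TEXT, output TEXT)")
for test_name, output in run["TestRuns"].items():
    conn.execute(
        "INSERT INTO runs VALUES (?, ?, ?, ?)",
        (run["TestConfig"]["MsQuicCommit"], run["RunDate"], test_name, "\n".join(output)),
    )
conn.commit()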

Netperf version 1.5 TODO: Automatic Provisioning of Azure VMs

We need to figure out a solution to automatically deploy/provision the Azure VMs on demand. We might be able to do this easily enough by having a 'setup' job/step that uses the Azure CLI to create the VMs and then runs the provisioning script on them. This script would add the VM to the desired runner pool (for a single job). Then all subsequent jobs could leverage the pool.
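A rough sketch of what that setup step might invoke, assuming the Azure CLI is installed and already logged in; the resource group, VM name, and image are placeholders.

# Hypothetical sketch: create an Fsv2 VM with the Azure CLI, then run the existing
# provisioning script on it so it joins the runner pool.
import subprocess

RESOURCE_GROUP = "netperf-rg"        # placeholder
VM_NAME = "netperf-runner-01"        # placeholder

subprocess.run([
    "az", "vm", "create",
    "--resource-group", RESOURCE_GROUP,
    "--name", VM_NAME,
    "--image", "Ubuntu2204",         # placeholder image
    "--size", "Standard_F4s_v2",
    "--generate-ssh-keys",
], check=True)

subprocess.run([
    "az", "vm", "run-command", "invoke",
    "--resource-group", RESOURCE_GROUP,
    "--name", VM_NAME,
    "--command-id", "RunShellScript",
    "--scripts", "@setup-runner-linux.sh",
], check=True)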

Remove dependencies on PATs

PATs need manual rotation, and although #162 would reduce the burden on individual projects, PATs ultimately are error-prone, a maintenance chore, and AFAIK no longer considered a security best practice.
