
netperf's Introduction

Dashboard

Network Performance Testing

This repo is the common place that other Microsoft-owned networking projects (including Windows itself) use to run, store, and visualize networking performance testing. Currently, the following projects use (or will use) this:

Goal

Historically, networking performance testing has been spotty, inconsistent, not reproducible, and not easily accessible. Different groups or projects test performance in different ways, on different hardware. They even have different definitions of things like throughput and latency. This repo aims to fix that by providing a common, open place to run, store, and visualize networking performance testing. The end result is ultimately a set of dashboards summarizing the performance of the various projects, across various scenarios, and across various platforms.

Documentation

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

netperf's People

Contributors

alan-jowett, ami-gs, dependabot[bot], keith-horton, maolson-msft, microsoft-github-operations[bot], microsoftopensource, mtfriesen, nibanks, projectsbyjackhe, step-security-bot

netperf's Issues

Add Dashboard Dev Automation

Since we're deploying our dashboard with GitHub Pages and using React, we should set up an Actions workflow that automatically updates the /dist directory on the --deploy branch whenever new code is checked in to 'main' under the /dashboard directory.

Netperf Version 1 TODO: Migrate F4 VMs to new Experimental boost VMs.

We currently use Standard_F4s_v2 VMs in Azure, which document a maximum network throughput of 10,000 Mbps. Based on the TCP throughput data we're seeing, it looks like we are hitting that limit:

[screenshot: TCP throughput results plateauing at roughly the 10,000 Mbps limit]

We will probably need to increase the VM size.

The Standard_F8s_v2 size (2x the cost) increases the limit to 12,500 Mbps. The next bandwidth increase isn't until the Standard_F32s_v2 size, which supports 16,000 Mbps but is 8x the cost!

[screenshot: Azure Fsv2-series VM sizes with their expected network bandwidth and cost]

Documentation & script fixes

The script doesn't work by default when following the documentation, so I went through its contents and found the issues below. As written, it is not reliable to run.

-u and -p are also required parameters, but the documented invocation doesn't include them:

bash setup-runner-linux.sh -i <peerip> -g <github token *do this on client only> -n <no reboot *optional>

Why is ssh-copy-id needed afterwards when it is already done in the script?

ssh-copy-id <username of peer>@<peerip>

These permissions are far too open; the private key should be 400:

chmod 777 $HOME/.ssh/id_rsa

I forget the exact behavior, but this script may need to be executed from inside that directory; otherwise it should at least show an error.

bash $HOME/actions-runner/config.sh --url https://github.com/microsoft/netperf --token $githubtoken --labels $runnerlabels --unattended

The sleep in the script is not needed.


~ already expands to the home directory (try echo ~), so hard-coding it is unnecessary:

HOME="/home/$username"

If you want to split the server/client roles (as is already done via the GitHub token), openssh-server and its config are not needed on the client. This is just a recommendation.

sudo apt install openssh-server -y

Collect Crash Dumps on Hang / Timeout

Today in the QUIC runs we're occasionally seeing timeouts that indicate a hang. We do have watchdogs in the code, but if the hang occurs after the watchdog has been stopped, during general app cleanup, then we still have a problem. We need to leverage something like procdump (or notmyfault for kernel-mode hangs) to collect dumps when this happens.
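Something like the following might work as a starting point for the user-mode case. This is only a rough sketch: the command line, timeout, and dump path are placeholders, and it assumes procdump.exe is available on the runner's PATH.

# Hypothetical sketch: run the perf tool under a timeout and grab a full dump if it hangs.
import subprocess

TIMEOUT_SECONDS = 300                       # placeholder watchdog timeout
CMD = ["secnetperf.exe", "-exec:maxtput"]   # placeholder command line

proc = subprocess.Popen(CMD)
try:
    proc.wait(timeout=TIMEOUT_SECONDS)
except subprocess.TimeoutExpired:
    # Capture a full user-mode dump of the hung process before killing it.
    subprocess.run(["procdump.exe", "-accepteula", "-ma", str(proc.pid), "hang.dmp"])
    proc.kill()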

Regression Detection

Test results need to be compared to historical results to determine whether a new value might be a regression.
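One simple way to do this is to compare the new value against the mean and standard deviation of recent history. The window size and threshold below are arbitrary assumptions, not an agreed policy.

# Hypothetical sketch: flag a result as a possible regression if it falls well below
# the recent historical average for the same test/configuration.
import statistics

def is_regression(new_kbps, history_kbps, window=20, num_stddevs=2.0):
    recent = history_kbps[-window:]
    if len(recent) < 5:      # not enough history to judge
        return False
    mean = statistics.mean(recent)
    stddev = statistics.pstdev(recent)
    return new_kbps < mean - num_stddevs * stddev

# Example: compare the latest run against the stored history for this test.
print(is_regression(850000, [930000, 934493, 928000, 941000, 935500]))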

Migrate off of using filenames to pass environment data [NEED TO INCLUDE HYPHEN FOR os_name]

As of today, as part of the SQL automation, we use a fixed format for filenames we save in our pipeline.

That fixed format allows us to build automation that parses for keywords and queries the database.

As a consequence, because we split the names with a hyphen "-" (e.g. test-results-ubuntu-20.05-x64), we use the hyphen as a delimiter.

Notice how the name "ubuntu-20.05" has a hyphen in it even though it's one atomic item we want parsed. This makes the parsing logic extra complicated and requires all future os_name values in the YAML to include a hyphen.

In the short term this is OK, but when we refactor, this is one area to improve (see the parsing sketch below).
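A minimal sketch of how the parsing can cope with this today, assuming the fixed test-results-<os_name>-<arch> convention; the helper name is made up for illustration.

# Hypothetical sketch: the arch is always the last hyphen-separated token, so everything
# between the fixed prefix and the last hyphen must be the OS name, hyphens and all.
def parse_filename(name):
    prefix = "test-results-"
    assert name.startswith(prefix)
    os_name, _, arch = name[len(prefix):].rpartition("-")
    return {"os_name": os_name, "arch": arch}

print(parse_filename("test-results-ubuntu-20.05-x64"))
# {'os_name': 'ubuntu-20.05', 'arch': 'x64'}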

Support Dashboard Data Consumption

As of now, everything infrastructure-wise is set up to run SecNetPerf and produce throughput data. That data gets checked in to the --deploy branch, saved as JSON, e.g. https://microsoft.github.io/netperf/data/secnetperf/2023-12-01-20-25-09._.67ee09354f52d014ad4e9ec85fcb6b9260890134.json/test_result.json

Ideally, we should have an automation that also aggregates all the performance runs from the last ~20 commits/runs and produces one data.json (secnetperf.json?) for each project (XDP, eBPF, ...); a rough sketch of such a step is shown below.

The dashboard will just use this single file. We can add more features to the dashboard to query any number of commits in the past, but that's a future thing.
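A rough sketch of what that aggregation step might look like; the directory layout and file names are assumptions based on the example URL above, not the final design.

# Hypothetical sketch: merge the most recent per-run result files into one JSON file
# that the dashboard can fetch with a single request.
import json
from pathlib import Path

RESULTS_DIR = Path("data/secnetperf")      # assumed: one folder per run, named by date + commit
OUTPUT_FILE = Path("data/secnetperf.json")
MAX_RUNS = 20

run_dirs = sorted(RESULTS_DIR.iterdir(), reverse=True)[:MAX_RUNS]  # date-prefixed names sort newest-first when reversed
aggregate = []
for run_dir in run_dirs:
    result_file = run_dir / "test_result.json"
    if result_file.exists():
        aggregate.append(json.loads(result_file.read_text()))

OUTPUT_FILE.write_text(json.dumps(aggregate, indent=2))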

Support Data Analysis Solution

Since our perf data will be stored on an orphan branch, we still want devs to be able to run intricate queries and analyze details in the perf data that may not be available on the dashboard.

In general, we want to eliminate the dependency on the dashboard.

The solution I have tentatively adopted is to create custom workflow runs that accept a parameter (last N commits) and run an Actions workflow that calls a set of PowerShell / Python scripts to produce a CSV or a SQL-compatible file, so devs can use existing tools like Excel or PostgreSQL to look at the columns and do analysis without having to rely on the dashboard (see the sketch below).

This should be super easy for any MsQuic / XDP / Windows dev: they just need to go to Netperf, run the workflow associated with their project, and download the files to analyze.
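A rough sketch of the CSV-export step such a workflow might run; the field names follow the example DB format discussed in the next issue, and the aggregated input file is an assumption.

# Hypothetical sketch: flatten aggregated run data into a CSV that Excel or
# PostgreSQL's COPY can consume directly.
import csv
import json
from pathlib import Path

runs = json.loads(Path("data/secnetperf.json").read_text())

with open("secnetperf.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["commit", "run_date", "test", "result_line"])
    for run in runs:
        commit = run.get("TestConfig", {}).get("MsQuicCommit", "")
        run_date = run.get("RunDate", "")
        for test_name, output in run.get("TestRuns", {}).items():
            for line in output:
                if line.startswith("Result:"):
                    writer.writerow([commit, run_date, test_name, line])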

Document DB Data Format

I was taking a look at an example output here, and I have a few suggestions. Also, we need to not forget to update api-interface-schema.md with the format once we're done.

TestRuns

  "TestRuns": {
    "Max throughput test using TCP protocol with -upload:10000, -timed:1": [
      "Started!",
      "",
      "Result: 1172439040 bytes @ 934493 kbps (10037.006 ms).",
      "App Main returning status 0"
    ],
    "Max throughput test using QUIC protocol with -upload:10000, -timed:1": [
      "Started!",
      "",
      "Result: 1264713728 bytes @ 908461 kbps (11137.188 ms).",
      "App Main returning status 0"
    ]
  },
  • The file doesn't need to be human-readable on its own. We will document the schema, and the tools will have documentation indicating what certain arguments do, so we don't need the "Max throughput test using..." parts. We should replace them with simply the arguments passed to the tool.

  • This shouldn't be a list of test runs but should instead be a multi-level list of tests, each with a set of multiple runs. I would format it like this:

"Tests": {
  "1": {
    "ServerArgs": "...",
    "ClientArgs": "...",
    "Runs": [
      "Result: 1172439040 bytes @ 934493 kbps (10037.006 ms).",
      "..."
    ]
  }
}
  • Additionally, you will notice I removed everything but the "Result: " line in the Runs list. I'm still thinking about whether we should further preprocess and simplify the output here. Currently, we have several types of results, depending on what we're looking at/for (a parsing sketch follows this list):
  1. For throughput, we only care about the rate number (i.e. in kbps).
  2. For RPS, we only care about the RPS rate number.
  3. For HPS, we only care about the HPS rate number.
  4. For RPS latency, it's more complex because we care about the latency curve, which consists of multiple numbers, likely stored in a separate file; but maybe we store certain percentiles here?
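For the first three cases, pulling out just the rate number could be as simple as the sketch below, which assumes the exact "Result: ... bytes @ ... kbps (...)" format shown above.

# Hypothetical sketch: extract the kbps rate from a secnetperf "Result:" line.
import re

RESULT_RE = re.compile(r"Result: (\d+) bytes @ (\d+) kbps \(([\d.]+) ms\)")

def parse_throughput_kbps(line):
    match = RESULT_RE.match(line)
    return int(match.group(2)) if match else None

print(parse_throughput_kbps("Result: 1172439040 bytes @ 934493 kbps (10037.006 ms)."))
# 934493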

TestEnv

"TestEnv": {
  "Client": {
    "NIC": "Mellanox ConnectX-5",
    "CPU": "Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz",
    "OS": "Windows Server 2022",
    "Arch": "x64"
  },
  "Server": {
    "NIC": "Mellanox ConnectX-5",
    "CPU": "Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz",
    "OS": "Windows Server 2022",
    "Arch": "x64"
  }
},
"RunDate": "2023-12-04-15-15-27",
"MachineName": "netperf-secnetp",
"TestConfig": {
  "MsQuicCommit": "5a6975e82f6ab97b83d10abf92f862e24910bf03",
  "PerfTool": "secnetperf"
}
  • Let's put this first in the file.
  • MachineName should actually be Client and Server specific and not go in the top-level part.
  • Instead of TestConfig, let's call it Tool, put it first, and rename the fields in it:
"Tool": {
  "Name": "secnetperf",
  "Commit": "5a6975e82f6ab97b83d10abf92f862e24910bf03",
  "Config": "-TLS Schannel"
}
  • We need the Config option to handle various build configuration options for the tool itself.
  • For Client and Server, let's put those last. Additionally, for their field values, I think we should try to compress things some, as we really don't need all the explicit details. Maybe something like this:
"Client": {
  "NIC": "CX5",
  "CPU": "8272CL",
  "OS": "WS", # I thought about Windows, Win or WinSer too, but we should eventually support Windows Client as well
  "Ver": "2022"
  "Arch": "x64"
}
  • Should we capture NIC driver versions? What about any other configuration knobs for them?
  • Another thought: Should we really separate out Client and Server here? While it would be nice to test one type against another, we've never done that up until now, and if we do go down that road, the matrix would explode... Food for thought.

General Issues

  • With how this is currently set up, you can only put one set of runs for a single build/OS configuration in a file. In other words, Windows Server 2019 and Windows Server 2022 results go in separate files; Ubuntu 20.04 goes in a different one; and so on. But you name this file test_result.json, which is too generic. So, we need to either:
    • update the file to support multiple configurations in the same file, or
    • we need to name the file based on the configuration.
  • I think putting stuff like that into the file name gets pretty messy, since we'd really need to include things like OS type and version, TLS type, IO type, etc. So we should probably update the JSON format to have a top-level array to store per-configuration test/run data.
  • BUT this gets further complicated by my next thought: I'm not sure using a directory structure based on date/time is the right approach. A few thoughts here:
    • We want to support running a subset of tests when part of a test matrix changes. For instance, if we update the prerelease version of Windows, then run all the Windows tests again on that new build. If MsQuic changes commit, run it on all OS versions.
    • So, how do we store this data, and how do we index and search it appropriately? If the dashboard needs to grab "the latest" data, where does it look? If we have files by date, some of the latest data might be in one file while the rest is in one or more other files.
  • So, I then wonder if we should have just a raw folder that holds each configuration's runs, named by date/time and/or a GUID (for uniqueness), and then another directory structure that can be used as an index pointing to the relevant raw files (or just the latest?); see the sketch after this list.
  • This is where a real DB might be a better option, but I'm not sure.
  • Additionally, should we have a separate directory per tool name? What if we want to support multiple tools for the same XDP commit (i.e., if we get MsQuic tests running for a given XDP commit, plus the normal XDP perf results)?
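For illustration only, the raw-plus-index idea could look roughly like this; the directory layout and configuration key are assumptions, not a decided format.

# Hypothetical sketch: scan raw per-run files and build a small index that maps each
# configuration to its most recent raw result file.
import json
from pathlib import Path

RAW_DIR = Path("raw")                  # assumed: raw/<date-or-guid>.json, one file per run
INDEX_FILE = Path("index/latest.json")

latest = {}
for raw_file in sorted(RAW_DIR.glob("*.json")):   # date-prefixed names sort oldest to newest
    run = json.loads(raw_file.read_text())
    env = run["TestEnv"]["Client"]
    config_key = env["OS"] + "-" + env["Arch"]    # assumed configuration key
    latest[config_key] = str(raw_file)            # later files overwrite earlier ones

INDEX_FILE.parent.mkdir(exist_ok=True)
INDEX_FILE.write_text(json.dumps(latest, indent=2))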

Bottom line, I think this is on the right track, but we need to discuss some of these things for sure.

Refactor data processing pipeline.

As of right now, we generate the .sql file dynamically and have a simple Python script batch-execute all the SQL files generated by the test automation to save data to the database and update other state.

A new proposal for a better data architecture is to offload the generation of the SQL script to the Python script and use intermediary JSON files in between (see the sketch below).

The benefit of this approach is that we have more control over how we generate the SQL script, and we get a clean separation from the PowerShell script that actually runs the jobs.

This is a P2 right now, since the logic we need to generate at the end isn't particularly complicated yet.
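A minimal sketch of that intermediary step; the table and column names are placeholders, and sqlite3 here is only a stand-in for whatever database the pipeline actually targets.

# Hypothetical sketch: turn an intermediary JSON result file into parameterized inserts,
# instead of generating raw .sql text from the PowerShell job.
import json
import sqlite3
from pathlib import Path

run = json.loads(Path("test_result.json").read_text())

conn = sqlite3.connect("netperf.db")
conn.execute("CREATE TABLE IF NOT EXISTS runs (commit_hash TEXT, run_date TEXT, test TEXT, output TEXT)")
for test_name, output in run["TestRuns"].items():
    conn.execute(
        "INSERT INTO runs VALUES (?, ?, ?, ?)",
        (run["TestConfig"]["MsQuicCommit"], run["RunDate"], test_name, "\n".join(output)),
    )
conn.commit()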

Netperf version 1.5 TODO: Automatic Provisioning of Azure VMs

We need to figure out a solution to automatically deploy/provision the Azure VMs on demand. We might be able to do this easily enough by having a 'setup' job/step that uses the Azure CLI to create the VMs and then runs the provisioning script on them. This script would add the VM to the desired runner pool (for a single job). Then all subsequent jobs could leverage the pool.
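A rough sketch of what that setup step might invoke, assuming the Azure CLI is installed and already logged in; the resource group, VM name, and image are placeholders.

# Hypothetical sketch: create an Fsv2 VM with the Azure CLI, then run the existing
# provisioning script on it so it joins the runner pool.
import subprocess

RESOURCE_GROUP = "netperf-rg"        # placeholder
VM_NAME = "netperf-runner-01"        # placeholder

subprocess.run([
    "az", "vm", "create",
    "--resource-group", RESOURCE_GROUP,
    "--name", VM_NAME,
    "--image", "Ubuntu2204",         # placeholder image
    "--size", "Standard_F4s_v2",
    "--generate-ssh-keys",
], check=True)

subprocess.run([
    "az", "vm", "run-command", "invoke",
    "--resource-group", RESOURCE_GROUP,
    "--name", VM_NAME,
    "--command-id", "RunShellScript",
    "--scripts", "@setup-runner-linux.sh",
], check=True)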

Remove dependencies on PATs

PATs need manual rotation, and although #162 would reduce the burden on individual projects, PATs ultimately are error-prone, a maintenance chore, and AFAIK no longer considered a security best practice.
