Crater is a laboratory for running experiments across a large body of Rust source code. Its primary purpose is to detect regressions in the Rust compiler, and it does this by building large numbers of crates, running their test suites, and comparing the results between two versions of Rust.
It can operate completely locally, with only a dependency on docker, or it can run distributed on AWS. It should work on Windows.
Some of the goals of Crater:
- Discover Rust codebases from crates.io and GitHub
- Download all Rust code to a local disk
- Build and manage custom Rust toolchains
- Run cargo build and cargo test over all codebases
- Cache dependencies to avoid unnecessary rebuilds
  - Lockfiles shared between runs
  - Dependencies fetched ahead of time
  - Building and testing is --frozen - no dependency updates or network access
- Run arbitrary tests over all codebases
- Resume partial test runs
- Generate summary HTML and text reports
- Run on Linux and Windows
- Isolate tests into docker containers on Linux and Windows
- Test against Linux-based cross targets under docker
- Hosted, distributed testing on AWS
Crater is a successor to https://github.com/brson/taskcluster-crater. It was subsequently named cargobomb before resuming the Crater name, so for now the code still refers to cargobomb in many places (Being addressed in #134).
Warning: do not run Crater in an unsandboxed environment. Crater executes malicious code that will destroy what you love.
These commands will run Crater, in local configuration, on the demo crate set. This is safe to run unsandboxed because the set of crates tested is limited to the 'demo' set. This requires that the user have access to the docker daemon.
Today Crater expects to be run out of its source directory, and all of its output goes into the ./work directory, where it maintains its own rustup installation, crate mirrors, etc.
cargo run -- prepare-local --docker-env mini
cargo run -- define-ex --crate-select=demo stable beta
cargo run -- prepare-ex
cargo run -- run
cargo run -- gen-report work/ex/default/
This will output a report to ./work/ex/default/index.html.
Delete things with
cargo run -- delete-all-target-dirs
cargo run -- delete-all-results
cargo run -- delete-ex
Each command except prepare-local optionally takes an --ex argument to identify the experiment being referred to. If not supplied, this defaults to 'default'. Here's what each of the steps does:
- prepare-local - sets up the stable toolchain for internal use, builds the docker container, builds lists of crates. This needs to be rerun periodically, but not between every experiment.
- define-ex - defines a new experiment performing a build-test experiment on the 'demo' set of crates.
- prepare-ex - fetches repos from github and captures their commit shas, downloads all crates, hacks up Cargo.toml files, captures lockfiles, fetches all dependencies, and prepares toolchains.
- run - runs tests on crates in the experiment, against both toolchains
- gen-report - summarize the experiment results to work/ex/default/index.html
- delete-all-target-dirs/delete-all-results/delete-ex - clean up everything relating to this experiment
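The steps above can be chained for a named experiment. The following is a minimal sketch, not part of Crater itself: the function name and the CRATER_CMD override are hypothetical, and it assumes that gen-report accepts --ex alongside the output path and that a named experiment lives under work/ex/NAME.

```shell
# Sketch only: CRATER_CMD is a hypothetical override, not a Crater feature.
# Setting CRATER_CMD=echo previews the sequence without touching cargo or docker.
crater_local_run() {
    ex="${1:-default}"
    cmd="${CRATER_CMD:-cargo run --}"
    # prepare-local is the only step that does not accept --ex.
    $cmd prepare-local --docker-env mini &&
    $cmd define-ex --crate-select=demo --ex "$ex" stable beta &&
    $cmd prepare-ex --ex "$ex" &&
    $cmd run --ex "$ex" &&
    $cmd gen-report --ex "$ex" "work/ex/$ex/"
}
```

Each step only runs if the previous one succeeded, so a failed prepare-ex will not be followed by a run against a half-prepared experiment.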
Toolchains for Rust PRs that have been built by asking bors to try a PR can be specified using try#<SHA1 of try merge>. You will probably want to specify the comparison commit as master#<SHA1 of master before try merge>.
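As a rough illustration of the try#/master# syntax, the helper below assembles the two toolchain arguments from a pair of SHAs. The function name is hypothetical, and the 40-character check is an assumption added here to catch abbreviated SHAs early; Crater's own validation may differ.

```shell
# Sketch: build the two toolchain arguments for a PR run.
# The SHAs are whatever you copied from GitHub; nothing is looked up.
pr_toolchains() {
    master_sha="$1"   # master commit just before the try merge
    try_sha="$2"      # merge commit that bors created for the try build
    for sha in "$master_sha" "$try_sha"; do
        # Assumption: reject anything that is not a full 40-character hex SHA1.
        if ! printf '%s' "$sha" | grep -Eq '^[0-9a-f]{40}$'; then
            echo "not a full SHA1: $sha" >&2
            return 1
        fi
    done
    # Baseline first, candidate second.
    echo "master#$master_sha try#$try_sha"
}
```

The output can then be spliced into define-ex in place of "stable beta", for example: cargo run -- define-ex --crate-select=demo $(pr_toolchains MASTER_SHA TRY_SHA).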
There are three 'official' Crater machines:
- cargobomb-test (54.177.234.51) - 1 core, 4GB RAM, for experimenting
- cargobomb-try (54.241.86.211) - 8 core, 30GB RAM, for doing PR runs
- cargobomb-prod (54.177.126.219) - 8 core, 30GB RAM, for doing beta runs (but can do PR runs if free)
These can only be accessed via the bastion: you ssh to the bastion, then ssh to the Crater machine. The bastion has restricted access and you will need a static IP address (if you have a long-running server in the cloud, that's usually fine) and a public SSH key (you should add the key to github and then link to https://github.com/yourusername.keys; once you have access to the bastion you can manage your own keys).
With these two pieces of information in hand, ask acrichto to add you to the bastion and all three machines and they'll let you know the bastion IP. You can now edit your ~/.ssh/config on your static IP machine to contain
Host rust-bastion
# Bastion IP below
HostName 0.0.0.0
User bastionusername
Host cargobomb-test
HostName 54.177.234.51
ProxyCommand ssh -q rust-bastion nc -q0 %h 22
User ec2-user
# [...and so on for cargobomb-try and cargobomb-prod...]
which will let you run ssh cargobomb-test (and so on) from your static IP machine. If you have a recent OpenSSH (7.3 or later), you can use ProxyJump instead of ProxyCommand.
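For the ProxyJump variant, a stanza might look like the output of the sketch below. The helper name is hypothetical, and it assumes the rust-bastion host entry shown above is already in your ~/.ssh/config.

```shell
# Sketch: print a ~/.ssh/config stanza that uses ProxyJump instead of
# ProxyCommand. Assumes a "Host rust-bastion" entry already exists.
proxyjump_stanza() {
    host="$1"
    ip="$2"
    printf 'Host %s\n' "$host"
    printf '    HostName %s\n' "$ip"
    printf '    ProxyJump rust-bastion\n'
    printf '    User ec2-user\n'
}

proxyjump_stanza cargobomb-test 54.177.234.51
```

Review the printed stanza before appending it to your config. For a one-off connection, ssh -J rust-bastion ec2-user@54.177.234.51 should be equivalent.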
The Crater servers use a terminal multiplexer (a way to keep multiple terminals running on a server). Enter the multiplexer by logging onto a server and running byobu. You'll notice a bit of text along the bottom saying something like "0:master 1:tc1 2:tc2" - these are the 'windows' in the terminal multiplexer. The one highlighted and with a * next to it is the current window. Sending commands to the multiplexer is achieved by pressing Ctrl+Z, and then another key.
Some useful operations:
- Ctrl+Z d - detach from the multiplexer (or you can just close your terminal)
- Ctrl+Z 0 - switch to window 0 (or any other number)
- Ctrl+Z PageUp - scroll upwards on the terminal. This will enter a sort of 'scrolling mode', so you can use PageUp and PageDown freely (to the limit of terminal scrollback). To return to normal terminal mode, hit Ctrl+C - be sure to only press it once, or you risk returning to normal mode and then killing the process running in the current terminal!
- Ctrl+Z c - create a new window, useful if you accidentally closed one
- Ctrl+Z , - rename a window, useful after recreating an accidentally closed window (hit enter to accept new name)
On your day for Crater triage, open the sheet. Click the top left cell and make sure every PR on that list has an entry on the sheet and make sure every row on the sheet without 'Complete' or 'Failed' is listed on the GitHub search. You may need to update PR tags or add rows to the sheet as appropriate.
Next, you should follow the steps below for each requested run on the sheet that does not have a status of 'Complete' or 'Failed'.
- Pending
  - Is try or prod available? (prioritise beta runs to go on prod, no matter how far down the pending list they are) If not, go to next run.
  - Log onto appropriate box and connect to multiplexer.
  - Double check each multiplexer window to make sure nothing is running.
  - Switch to the master multiplexer window.
  - Run docker ps to make sure no containers are running.
  - Run df -h /home/ec2-user/cargobomb/work, disk usage should be <250GB of the 1TB disk (a full run may consume 600GB)
    - If disk usage is greater, there are probably target directories left over from a previous run. Run du -sh work/local/target-dirs/*, find the culprit (likely a directory with >100GB).
    - The directory name is the name of an experiment, e.g. MY_EX, so run cargo run --release -- delete-all-target-dirs --ex MY_EX.
  - Run docker ps -aq | xargs --no-run-if-empty docker rm to clean up all terminated Docker containers.
  - Run git stash && git pull && git stash pop to get the latest Crater changes. If this fails, it means there were local changes that conflict with upstream changes. Ping aidanhs and tomprince on IRC.
  - Run cargo run --release -- prepare-local. This may take between 5s and 5min, depending on what needs doing.
  - Log EX_NAME, EX_START and EX_END in the spreadsheet, where:
    - If doing a run for PR 12345, EX_NAME is pr-12345, EX_END is try#deadbeef2... (deadbeef2 is in the bors comment "Trying commit abcdef with merge deadbeef2" - click through and copy from the URL to get the full commitish) and EX_START is master#deadbeef1... (deadbeef1 is on the page you clicked through to get deadbeef2..., just below the commit message, the left hand commit of "2 parents deadbeef1 and bcdef1" - click through and copy from the URL to get the full commitish, make sure the commit is an auto merge from bors). Just to emphasise, the second commitish you copied goes in EX_START.
    - If doing a beta run, EX_NAME is stable-STABLE_VERSION-beta-BETA_VERSION, EX_START is LAST_STABLE and EX_END is BETA_DATE. STABLE_VERSION is the version number from curl -sSL static.rust-lang.org/dist/channel-rust-stable.toml | grep -A1 -F '[pkg.rust]', BETA_VERSION is the version number from curl -sSL static.rust-lang.org/dist/channel-rust-beta.toml | grep -A1 -F '[pkg.rust]' and BETA_DATE is the date from curl -sSL static.rust-lang.org/dist/channel-rust-beta.toml | grep '^date =' (it is not necessarily the same date as retrieved in the BETA_VERSION command).
  - Run cargo run --release -- define-ex --crate-select=full --ex EX_NAME EX_START EX_END. This will complete in a few seconds.
  - Run cargo run --release -- prepare-ex --ex EX_NAME.
  - Change status to 'Preparing'.
  - Update either the PR or the person requesting the run to let them know the run has started.
  - Go to next run.
- Preparing
  - Log onto appropriate box and connect to multiplexer.
  - Switch to the master multiplexer window.
  - If preparation is ongoing, go to next run.
  - If preparation failed, fix it. Known errors:
    - "missing sha for ..." - remove the referenced repository from gh-apps.txt and gh-candidates.txt (may be present in one or both). Make the same change locally and make a PR against Crater. Use cargo run --release -- delete-all-target-dirs --ex EX_NAME and cargo run --release -- delete-ex --ex EX_NAME, then jump to start of 'Pending'.
  - Switch to the tc1 multiplexer window.
  - Run cargo run --release -- run-tc --ex EX_NAME EX_START.
  - Switch to the tc2 multiplexer window.
  - Run cargo run --release -- run-tc --ex EX_NAME EX_END.
  - Change status to 'Running'.
  - Go to next run.
- Running
  - Log onto appropriate box and connect to multiplexer.
  - Switch to the master multiplexer window.
  - Run docker ps. If any container has been running for more than 30min (may need to follow these steps more than once):
    - Take solace in us someday fixing this for good with docker limits. TODO: actually fix. Seems to only be a problem on prod with pleingres, our existing limits should catch it.
    - Run docker top CONTAINER_ID.
    - If there's no mention of pleingres, raise an issue with the output of the previous docker top command.
    - The process at the bottom of the list is the lowest in the process tree, and should have a value in the TIME column of >30min. Find the value in the PID column and run kill PID.
    - Wait a few seconds, then check the container has now exited.
  - If the run is ongoing in either the tc1 or tc2 multiplexer windows, go to next run.
  - Switch to the master multiplexer window.
  - Run du -sh work/ex/EX_NAME, output should be <2GB. If not:
    - Run find work/ex/EX_NAME -type f -size +100M | xargs du -sh, there will likely only be a couple of files listed and they should be in the res directory (TODO: blacklist pleingres as the main culprit here once it's possible, and update these instructions to suggest adding things to the blacklist).
    - For each file found, run truncate --size='<100M' FILE.
    - Check du -sh work/ex/EX_NAME is now an appropriate size.
  - Run cargo run --release -- publish-report --ex EX_NAME s3://cargobomb-reports/EX_NAME.
  - Change status to 'Uploading'.
  - (optional but much appreciated: come back to this run in 30mins as the upload will be complete)
  - Go to next run.
- Uploading
  - Switch to the master multiplexer window.
  - If the upload is ongoing, go to the next run.
  - If the upload failed, fix it. Known errors:
    - <Error><Code>InternalError</Code><Message>... - probably an s3 failure, try running upload again.
  - Run cargo run --release -- delete-all-target-dirs --ex EX_NAME. This will take ~2min.
  - Change status to 'Complete' and add the results link, http://cargobomb-reports.s3.amazonaws.com/EX_NAME/index.html.
  - Update either the PR or the person requesting the beta run. Template is:
    Crater results: <url>. 'Blacklisted' crates (spurious failures etc) can be found [here](https://github.com/rust-lang-nursery/crater/blob/master/blacklist.md). If you see any spurious failures not on the list, please make a PR against that file.
  - Give yourself a pat on the back! Good job!
  - Go to next run.
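The disk check in the 'Pending' steps could be scripted roughly as follows. This is a sketch under stated assumptions: the 250GB threshold and the paths come from the steps above, while the function names and the df/du parsing are illustrative and not part of Crater.

```shell
# Sketch: the 250GB threshold comes from the 'Pending' steps; the function
# names and the df parsing are illustrative assumptions.
LIMIT_GB=250

# Succeeds when a usage figure in kilobytes is under the limit.
under_limit() {
    used_kb="$1"
    [ "$used_kb" -lt $((LIMIT_GB * 1024 * 1024)) ]
}

# Reads the "Used" column (1K blocks) for the filesystem holding a path.
used_kb_for() {
    df -Pk "$1" | awk 'NR == 2 { print $3 }'
}

preflight() {
    work="${1:-/home/ec2-user/cargobomb/work}"
    if under_limit "$(used_kb_for "$work")"; then
        echo "disk ok"
    else
        echo "disk over ${LIMIT_GB}GB, largest target dirs:"
        # Sort by size so leftover experiments stand out.
        du -sk "$work"/local/target-dirs/* 2>/dev/null | sort -rn | head -n 5
    fi
}
```

Running preflight on a Crater box prints either "disk ok" or the five largest target directories, whose names are the experiments to pass to delete-all-target-dirs.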
(The runs can be stopped and restarted at any time. - really? How? asks aidanhs)
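For beta runs, the EX_NAME described in the 'Pending' steps can be derived from the channel manifests. The helpers below are a hedged sketch: the function names are hypothetical, and they assume "version number" means the leading dotted number on the version line, with any suffix, hash or date dropped.

```shell
# Extract the bare version number from channel TOML on stdin.
# Assumption: only the leading dotted number of the version string is wanted.
pkg_rust_version() {
    grep -A1 -F '[pkg.rust]' | sed -n 's/^version = "\([0-9][0-9.]*\).*/\1/p'
}

# Extract the manifest date from channel TOML on stdin.
channel_date() {
    sed -n 's/^date = "\(.*\)"$/\1/p'
}

# Assemble EX_NAME from the stable and beta version numbers.
beta_ex_name() {
    echo "stable-$1-beta-$2"
}

# Example use against the live manifests (not executed here):
#   stable=$(curl -sSL static.rust-lang.org/dist/channel-rust-stable.toml | pkg_rust_version)
#   beta=$(curl -sSL static.rust-lang.org/dist/channel-rust-beta.toml | pkg_rust_version)
#   beta_ex_name "$stable" "$beta"
```

Check the derived values against the raw curl output before logging them in the spreadsheet, since the manifest layout is not guaranteed to stay the same.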
If a beta run has completed, regressions need reporting (PR runs are left to the people involved in the PR). To report regressions you'll need to navigate to the results page, wait for a bit (<30s) for the results to load (the buttons will be populated with numbers) and then click 'regressed'. The triage process (e.g. checking the cause of a regression) is 'crowd-sourced', we just report the issues (for now).
You can follow whatever process you like for working through regressions, but a suggested workflow is described below, per regression:
- Open the regression log (i.e. 'toolchain 2').
- If the regression is on the blacklist, skip it.
- If the breakage is 'obviously deliberate', e.g. a lint changing to deny by default, find the original PR and double check it went through a Crater run. Skip reporting if so.
- If the regression is in a dependency, it will have probably caused multiple regressions so make sure to deal with the dependency first and then ignore any duplicates.
- If this is not a .1 beta (i.e. it's a second beta run), search for the regression already being reported. If it was closed as "wanted regression" skip reporting, if it was closed as "fixed" then reopen with a link to the log.
- Report the regression per the template below:
This template varies depending on crate source (crates.io or a git repo):
[CRATENAME-1.0.1](https://crates.io/crates/cratename) regressed from stable to beta - http://cargobomb-reports.../log.txt, cc @AUTHOR
[AUTHOR/REPO#COMMITISH](https://github.com/author/repo/tree/COMMITISH) regressed from stable to beta - http://cargobomb-reports.../log.txt, cc @AUTHOR
where AUTHOR is the github username of the crate author (may not be available if the crate is from crates.io in rare cases). You should also paste a snippet of the error in the issue.
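To avoid typos when filing many reports, the two template lines can be generated. This is a small sketch with hypothetical function names; the log URL is passed in as-is and nothing is looked up.

```shell
# Sketch of the two template lines; function names are illustrative.
report_crates_io() {
    name="$1"; version="$2"; log_url="$3"; author="$4"
    echo "[$name-$version](https://crates.io/crates/$name) regressed from stable to beta - $log_url, cc @$author"
}

report_git_repo() {
    author="$1"; repo="$2"; commit="$3"; log_url="$4"
    echo "[$author/$repo#$commit](https://github.com/$author/$repo/tree/$commit) regressed from stable to beta - $log_url, cc @$author"
}
```

The error snippet still has to be pasted into the issue by hand.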
When in doubt file an issue. It's best to force the Rust devs to acknowledge the regression.
If you are interested in triaging once the issues are raised, you can follow the rough instructions below (to be made clearer):
To triage the reports I use another sandboxed Rust environment to verify the regressions before filing them. Make sure the current nightly/beta/stable toolchains are installed.
- Find the git repo. If I can't find it (rare) I just skip the crate.
- Check out the git repo
- If the repo has version tags, check out the corresponding version, otherwise use master (if master fails to reproduce I will poke around the commit history a bit to see if I can pull out a failing revision)
- Run cargo +stable test to verify that stable works.
  - If stable does not work I will run it some more to see if it's a flaky test, and add it to the blacklist.
  - I will run cargo +PREVIOUS_RELEASE test and see if that fails too, and if so move on.
- Run cargo +beta test to verify that it fails. Note that this is checking 'beta' even if Crater was against 'nightly'. If that succeeds then I move on to cargo +nightly test.
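The stable/beta/nightly checks above can be sketched as a single function. This is an illustration only: the function name and the CARGO override are hypothetical, test output is discarded for brevity, and the flaky-test and PREVIOUS_RELEASE follow-ups are left to the person triaging.

```shell
# Sketch: CARGO is a hypothetical override so the logic can be exercised
# without real toolchains; by default the real cargo is used.
verify_regression() {
    dir="${1:-.}"
    cargo_cmd="${CARGO:-cargo}"
    if ! (cd "$dir" && $cargo_cmd +stable test >/dev/null 2>&1); then
        echo "stable fails: check for a flaky test or an older breakage"
    elif ! (cd "$dir" && $cargo_cmd +beta test >/dev/null 2>&1); then
        echo "confirmed: passes on stable, fails on beta"
    elif ! (cd "$dir" && $cargo_cmd +nightly test >/dev/null 2>&1); then
        echo "confirmed on nightly only"
    else
        echo "not reproduced"
    fi
}
```

Run it from inside the sandboxed environment, passing the checked-out repository as the argument, and rerun the failing toolchain by hand to capture the error for the issue.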
MIT / Apache 2.0