Comments (12)
To summarize the email chain: we were thinking we could use checksums to capture the user-visible state when Checkpoint is called. For simplicity, we could use a hashmap keyed on file paths whose values are the checksums. Each checkpoint would generate a new hashmap by walking the directory structure of the file system and checksumming the files found.
from crashmonkey.
We may also want to checksum at least some of the data available from calls to stat() (e.g., file size and permissions, but none of the modified/accessed/created times) so that we can catch user-visible metadata errors as well. @vijay03 may have meant this when he said "(directory tree + data)", but I would like to state it explicitly.
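A minimal sketch of the idea above: walk a directory tree and build a hashmap of path → checksum over each file's contents plus its size and permission bits from stat(), deliberately skipping the time fields. The FNV-1a checksum and the function names here are illustrative stand-ins, not CrashMonkey's actual code.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <sys/stat.h>
#include <unordered_map>

namespace fs = std::filesystem;

// FNV-1a: a simple placeholder checksum; a real harness might use CRC32 or MD5.
static uint64_t fnv1a(uint64_t h, const char* data, size_t len) {
  for (size_t i = 0; i < len; ++i) {
    h ^= static_cast<unsigned char>(data[i]);
    h *= 1099511628211ULL;
  }
  return h;
}

// Walk `root` and build a map of path -> checksum over file data plus
// size and permission bits (date/time fields deliberately excluded).
std::unordered_map<std::string, uint64_t> checkpoint(const fs::path& root) {
  std::unordered_map<std::string, uint64_t> sums;
  for (const auto& entry : fs::recursive_directory_iterator(root)) {
    if (!entry.is_regular_file()) continue;
    uint64_t h = 14695981039346656037ULL;  // FNV-1a offset basis
    std::ifstream in(entry.path(), std::ios::binary);
    char buf[4096];
    while (in.read(buf, sizeof(buf)) || in.gcount() > 0)
      h = fnv1a(h, buf, static_cast<size_t>(in.gcount()));
    struct stat st {};
    if (stat(entry.path().c_str(), &st) == 0) {
      h = fnv1a(h, reinterpret_cast<const char*>(&st.st_size), sizeof(st.st_size));
      h = fnv1a(h, reinterpret_cast<const char*>(&st.st_mode), sizeof(st.st_mode));
    }
    sums[entry.path().string()] = h;
  }
  return sums;
}
```

Two calls to `checkpoint()` over an unchanged tree yield equal maps; any data, size, or permission change shows up as a differing checksum.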
To consolidate what we've said so far and what I've been thinking about this issue:
Overview:
In short, we want to have some mechanism to know what data/metadata to expect in each crash state. The idea is to allow users to call Checkpoint, which captures the user-visible state (directory tree + data) of the file system somewhere. On a crash, we go back to the latest Checkpoint and see if we have all the data in there.
The user-space CrashMonkey test harness needs to be able to receive Checkpoint requests from other processes. Since we are also expanding CrashMonkey to run in the background and let the user kick off their own workload (not one that implements CrashMonkey's BaseTestCase), we cannot assume that the workload will be a child process of CrashMonkey itself. Therefore, the Checkpoint feature must be able to communicate with processes it does not have a parent-child relationship with. The Checkpoint() call should be available to users regardless of whether they implement BaseTestCase and let CrashMonkey run their workload, or they run CrashMonkey in the background and then run their workload themselves.
```
               Checkpoint()        workload continues
                    |                      |
Workload    ---A------------------D-----------
                \                /
CrashMonkey -----B--------------C-------------
                        |
                 walk file system
```
Collecting Data:
In the CrashMonkey test harness, a call to Checkpoint() should cause CrashMonkey to walk the directory structure on the snapshot for the current workload (this should be /dev/cow_ram_snapshot1_0). During the file system walk, CrashMonkey should checksum the data of each file (i.e., read the file and compute a checksum) as well as some of the file metadata obtainable by calling stat(). The checksummed metadata should not include date/time fields, as they are prone to change but generally don't affect program correctness; it should include things like file permissions and file size. These checksums can then be stored in something like a hashmap, and a new hashmap covering the entire file system should be created on each call to Checkpoint().
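To make the metadata side concrete, here is a sketch of folding only the stable, user-visible fields of a stat() result into a checksum, with the time fields left out as described above. The field selection and function name are illustrative assumptions.

```cpp
#include <cstdint>
#include <sys/stat.h>

// Fold the user-visible metadata fields of a stat() result into a hash.
// Timestamps (st_atim/st_mtim/st_ctim) are intentionally excluded, since
// they change frequently without affecting program correctness.
uint64_t metadata_checksum(const struct stat& st) {
  uint64_t h = 14695981039346656037ULL;  // FNV-1a offset basis
  auto mix = [&h](uint64_t v) {
    for (int i = 0; i < 8; ++i) {
      h ^= (v >> (i * 8)) & 0xff;
      h *= 1099511628211ULL;
    }
  };
  mix(static_cast<uint64_t>(st.st_mode));   // permissions + file type
  mix(static_cast<uint64_t>(st.st_size));   // file size
  mix(static_cast<uint64_t>(st.st_uid));    // owner
  mix(static_cast<uint64_t>(st.st_gid));    // group
  mix(static_cast<uint64_t>(st.st_nlink));  // hard-link count
  return h;
}
```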
Implementation Thoughts:
As the cow_brd.c module currently only allows snapshots based on the base disk (/dev/cow_ram0), and the workload runs on a snapshot rather than the base disk, this can be a synchronous call to start out. This should be achievable with a stub the user can call which tells CrashMonkey to do a Checkpoint operation and waits for CrashMonkey to reply.
I was planning on using local sockets when implementing #1, which gives us flexibility down the line if we want to allow RPC calls into CrashMonkey functionality. The implementation here could also use local sockets, as they allow bidirectional communication across processes and can be treated much like files in C code. On a local machine they may not be as flexible as shared memory regions, but they avoid some of the synchronization/locking issues of shm while still allowing easy modification if we decide to allow RPC calls down the road.
We also need to be able to associate checkpoints with points in our logged bio sequence, so we should timestamp when the checkpoint was done. Logged bios will also need timestamps, as that information is not currently recorded.
We can assume that the user has just completed a sync operation of some form when Checkpoint() is called.
Sockets sound reasonable. To associate Checkpoints with the stream of data sent to the device, we could have a file inside the device (let's say called Flag) that is written to every time there is a checkpoint. Using the writes to Flag, we can then locate each checkpoint in the data stream.
Another approach would be to have an in-kernel counter that is incremented every time the user calls Checkpoint (via an ioctl, for example). Using the counter, we associate checkpoints with bios.
@domingues @ashmrtn progress seems to have stalled on this? Are we blocked on something?
I have two questions at this point:
- Should I ignore the lost+found folder?
- On every crash tested (test_check_random_permutations()), if the user test fails (test_loader.get_instance()->check_test()), should we check that the data is equal to the last checkpoint made?
Let's ignore lost+found for now.
By "user test", do you mean a test that the user runs on top of the mounted file system? If so, this is the default version of that user test: once the file system mounts, we are basically testing that the data/metadata we expect is in there.
If the file system does not mount at all, we just report an error and return.
I think implementing checkpoints is a very large task that is unlikely to be merged in a single pull request. @domingues, could you merge in parts of it with pull requests as you code it up?
@vijay03 I think it might be advantageous to split the functionality of the original checkpoint idea into two things:
- a checkpoint type operation that will be passed to user tests, denoting the most recent checkpoint reached in the generated crash state (only available after sync/fsync)
- a watch type operation where the user passes a file path to CrashMonkey, and CrashMonkey then monitors that path to make sure no changes occur to it after that point in generated crash states. This will require support from (1) as well. (only available after sync/fsync)
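The proposed watch operation could be sketched as: remember a checksum for each watched path when the watch is registered, then flag any difference seen in a later crash state. The class and method names (and the FNV-1a checksum) are illustrative assumptions, not an existing CrashMonkey API.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_map>

// Sketch of a "watch": record a baseline checksum per watched path, then
// report whether the file's contents differ from that baseline later on.
class WatchList {
 public:
  void watch(const std::string& path) { baseline_[path] = checksum(path); }

  // Returns true if the watched file has changed since the watch was set.
  bool changed(const std::string& path) const {
    auto it = baseline_.find(path);
    return it != baseline_.end() && checksum(path) != it->second;
  }

 private:
  static uint64_t checksum(const std::string& path) {
    uint64_t h = 14695981039346656037ULL;  // FNV-1a offset basis
    std::ifstream in(path, std::ios::binary);
    char c;
    while (in.get(c)) {
      h ^= static_cast<unsigned char>(c);
      h *= 1099511628211ULL;
    }
    return h;
  }

  std::unordered_map<std::string, uint64_t> baseline_;
};
```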
I'm going to split this issue up into several smaller ones, since both checkpoints and watches are somewhat complicated and require support across different parts of CrashMonkey.
Should we close this issue now @ashmrtn ?