Comments (12)

ashmrtn commented on September 25, 2024

To summarize the email chain, we were thinking that we could use checksums to help us save the user-visible state when Checkpoint is called. For simplicity, we could likely use a hashmap keyed on file paths whose values are the checksums. Each checkpoint could generate a hashmap by walking the directory structure of the file system and checksumming the files found.
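
A minimal sketch of what that walk could look like, assuming a C++17 harness; FNV-1a here is only a stand-in for whatever checksum we eventually pick, and the function names are placeholders:

```cpp
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <unordered_map>

namespace fs = std::filesystem;

// FNV-1a is just a stand-in for whatever checksum we end up using
// (crc32, md5, ...); the point is the path -> checksum map, not the hash.
uint64_t Fnv1aUpdate(const char* data, size_t len, uint64_t hash) {
  for (size_t i = 0; i < len; ++i) {
    hash ^= static_cast<unsigned char>(data[i]);
    hash *= 1099511628211ull;
  }
  return hash;
}

// Walk the mounted snapshot and build a map of file path -> checksum of the
// file's data. mount_point would be wherever the snapshot device is mounted.
std::unordered_map<std::string, uint64_t> ChecksumTree(
    const std::string& mount_point) {
  std::unordered_map<std::string, uint64_t> sums;
  for (const auto& entry : fs::recursive_directory_iterator(mount_point)) {
    if (!entry.is_regular_file()) continue;
    std::ifstream in(entry.path(), std::ios::binary);
    uint64_t hash = 1469598103934665603ull;  // FNV offset basis
    char buf[4096];
    while (in.read(buf, sizeof(buf)) || in.gcount() > 0) {
      hash = Fnv1aUpdate(buf, static_cast<size_t>(in.gcount()), hash);
    }
    sums[entry.path().string()] = hash;
  }
  return sums;
}
```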

ashmrtn commented on September 25, 2024

We may also want to checksum at least some of the data available from calls to stat() (e.g., file size and permissions, but none of the modified/accessed/created times) so that we can catch user-visible metadata errors as well. @vijay03 may have meant that when he said "(directory tree + data)", but I would like to state it explicitly.
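
A sketch of how those stat() fields could be folded into a checksum of their own; the field list is only a suggestion, and the hash is again a placeholder:

```cpp
#include <sys/stat.h>
#include <sys/types.h>

#include <cstdint>
#include <string>

// Fold the metadata we care about into a single value, deliberately leaving
// out st_atime/st_mtime/st_ctime. A real implementation could instead feed
// these fields into the same checksum used for the file's data.
uint64_t MetadataChecksum(const std::string& path) {
  struct stat st;
  if (stat(path.c_str(), &st) != 0) return 0;
  uint64_t hash = 1469598103934665603ull;  // FNV offset basis
  auto mix = [&hash](uint64_t v) {
    hash ^= v;
    hash *= 1099511628211ull;
  };
  mix(static_cast<uint64_t>(st.st_size));
  mix(static_cast<uint64_t>(st.st_mode));   // file type + permission bits
  mix(static_cast<uint64_t>(st.st_uid));
  mix(static_cast<uint64_t>(st.st_gid));
  mix(static_cast<uint64_t>(st.st_nlink));
  return hash;
}
```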

ashmrtn commented on September 25, 2024

To consolidate what we've said so far and what I've been thinking about this issue:

Overview:

In short, we want to have some mechanism to know what data/metadata to expect in each crash state. The idea is to allow users to call Checkpoint, which captures the user-visible state (directory tree + data) of the file system somewhere. On a crash, we go back to the latest Checkpoint and see if we have all the data in there.

The user-space CrashMonkey test harness needs to be able to receive Checkpoint requests from other processes. Since we are also expanding CrashMonkey to run in the background and have the user kick off their own workload (not one that implements CrashMonkey's BaseTestCase), we cannot assume that the workload will be a child process of CrashMonkey itself. Therefore, the Checkpoint feature must be able to communicate with processes it has no parent-child relationship with. The Checkpoint() call should be available to users regardless of whether they implement BaseTestCase and let CrashMonkey run their workload, or run CrashMonkey in the background and then run the workload themselves.

      Checkpoint()   Workload continues
               |         |
Workload    ---A---------D-----------
                \       /
CrashMonkey -----B-----C-------------
                 |
             walk file system

Collecting Data:
In the CrashMonkey test harness, a call to Checkpoint() should cause CrashMonkey to walk the directory structure on the snapshot for the current workload (this should be /dev/cow_ram_snapshot1_0). During the file system walk, CrashMonkey should checksum the data of each file (e.g., read the file and compute a checksum) as well as checksum some of the file metadata obtainable by calling stat(). The metadata that is checksummed should not include date/time fields, as they are prone to change but generally do not affect program correctness; it should include things like file permissions and file size. These checksums can then be stored in something like a hashmap. A new hashmap containing checksums for the entire file system should be created on each call to Checkpoint().
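
Once each Checkpoint() has produced such a map, checking a recovered crash state could then be as simple as comparing its map against the one from the latest checkpoint. A rough sketch, with the helper name and strictness as placeholders:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Compare the map built at the last Checkpoint() against one built the same
// way from the recovered crash state. Returns true iff every checkpointed
// file is present with the same checksum. Extra files in the crash state are
// not flagged here, though a stricter check could treat them as errors too.
bool MatchesCheckpoint(
    const std::unordered_map<std::string, uint64_t>& checkpoint,
    const std::unordered_map<std::string, uint64_t>& crash_state) {
  for (const auto& kv : checkpoint) {
    auto it = crash_state.find(kv.first);
    if (it == crash_state.end() || it->second != kv.second) return false;
  }
  return true;
}
```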

Implementation Thoughts:
As the cow_brd.c module currently only allows snapshots based on the base disk (/dev/cow_ram0), and the workload runs on a snapshot rather than the base disk, this can be a synchronous call to start out. This should be achievable by having a stub the user can call which tells CrashMonkey to do a Checkpoint operation and waits for CrashMonkey to reply.

I was planning on using local sockets when implementing #1, thus giving us flexibility down the line if we want to allow RPC calls into CrashMonkey functionality. I believe the implementation for this could also use local sockets as they allow bidirectional communication across processes and can be treated much like files in C code. On a local machine, they may not be as flexible as shared memory regions, but they avoid some of the synchronization/locking issues of shm in addition to allowing easy modification if we decide to allow RPC calls down the road.
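
A rough sketch of what the synchronous Checkpoint() stub could look like over a Unix domain socket; the socket path and wire format below are placeholders, not decided:

```cpp
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#include <cstring>

// Hypothetical socket path; the harness would pick its own location and
// likely a richer message format than a bare string.
static const char kCrashMonkeySocket[] = "/tmp/crashmonkey.sock";

// Synchronous Checkpoint() stub: connect to the CrashMonkey harness over a
// local (Unix domain) socket, request a checkpoint, and block until the
// harness replies that its file system walk is done. Returns 0 on success.
int Checkpoint() {
  int sock = socket(AF_UNIX, SOCK_STREAM, 0);
  if (sock < 0) return -1;

  struct sockaddr_un addr;
  memset(&addr, 0, sizeof(addr));
  addr.sun_family = AF_UNIX;
  strncpy(addr.sun_path, kCrashMonkeySocket, sizeof(addr.sun_path) - 1);

  int res = -1;
  if (connect(sock, reinterpret_cast<struct sockaddr*>(&addr),
              sizeof(addr)) == 0) {
    const char request[] = "checkpoint";
    if (write(sock, request, sizeof(request)) ==
        static_cast<ssize_t>(sizeof(request))) {
      char reply[16];
      res = (read(sock, reply, sizeof(reply)) > 0) ? 0 : -1;
    }
  }
  close(sock);
  return res;
}
```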

ashmrtn commented on September 25, 2024

We also need to be able to associate checkpoints with points in our logged bio sequence so we should timestamp when the checkpoint was done. Logged bios will also need timestamps as that information is not currently recorded.
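
For illustration, associating a logged bio with the most recent preceding checkpoint could then be a simple timestamp comparison; the record types below are hypothetical stand-ins for whatever log structures we end up with:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical record shapes; the real log structures will differ. The point
// is just that both carry a timestamp taken from the same clock.
struct CheckpointRecord { uint64_t timestamp_ns; unsigned id; };
struct LoggedBio        { uint64_t timestamp_ns; /* sector, size, data... */ };

// Return the id of the most recent checkpoint taken at or before this bio,
// or 0 if the bio precedes every checkpoint. Assumes checkpoints are stored
// in the order they were taken, so their timestamps are nondecreasing.
unsigned CheckpointForBio(const std::vector<CheckpointRecord>& checkpoints,
                          const LoggedBio& bio) {
  unsigned latest = 0;
  for (const CheckpointRecord& cp : checkpoints) {
    if (cp.timestamp_ns > bio.timestamp_ns) break;
    latest = cp.id;
  }
  return latest;
}
```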

We can assume that the user has just completed a sync operation of some form when Checkpoint() is called.

vijay03 commented on September 25, 2024

Sockets sound reasonable. To associate Checkpoints with the stream of data sent to the device, one approach is to have a file inside the device (let's say called Flag) that is written to every time there is a checkpoint. Using the writes to Flag, we can then associate each checkpoint with a point in the data stream.

Another approach would be to have an in-kernel counter that is incremented every time the user calls Checkpoint (via an ioctl, for example). Using the counter, we can associate checkpoints with bios.
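
A sketch of what the user-space side of the counter approach might look like; the ioctl command number and device path are placeholders, not an existing interface:

```cpp
#include <fcntl.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>
#include <unistd.h>

// Placeholder command number for the proposed "bump the checkpoint counter"
// ioctl; the real kernel module would define its own request codes.
#define CM_CHECKPOINT _IO(0xAB, 0x20)

// Ask the kernel module to increment its checkpoint counter so the bio log
// can be split at this point. device_path would be whatever device the
// module exposes. Returns 0 on success, -1 on failure.
int SendCheckpoint(const char* device_path) {
  int fd = open(device_path, O_RDONLY);
  if (fd < 0) return -1;
  int res = ioctl(fd, CM_CHECKPOINT, 0);
  close(fd);
  return res < 0 ? -1 : 0;
}
```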

vijay03 commented on September 25, 2024

@domingues @ashmrtn progress seems to have stalled on this? Are we blocked on something?

domingues commented on September 25, 2024

I have two questions at the moment:

  1. Should I ignore the lost+found folder?
  2. So, for every crash state tested (test_check_random_permutations()), if the user test fails (test_loader.get_instance()->check_test()), should we check whether the data matches the last checkpoint made?

vijay03 commented on September 25, 2024

Let's ignore lost+found for now.

By "user test", do you mean a test that the user runs on top of the mounted file system? If so, this is the default version of that user test. Once the file system mounts, we are basically testing that the data/metadata we expect is in there.

If the file system does not mount at all, we just report an error and return.

vijay03 commented on September 25, 2024

I think implementing checkpoints is a very large task that is unlikely to be merged in a single pull request. @domingues, could you merge in parts of it with pull requests as you code it up?

ashmrtn commented on September 25, 2024

@vijay03 I think it might be advantageous to split the functionality of the original checkpoint idea into two things:

  1. a checkpoint-type operation that will be passed to user tests, denoting the most recent checkpoint reached in the generated crash state (only available after sync/fsync)
  2. a watch-type operation where the user passes a file path to CrashMonkey, and CrashMonkey then monitors that path to make sure no changes occur to it in generated crash states after that point (only available after sync/fsync). This will require support from (1) as well; a rough sketch of the bookkeeping follows below.
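
A rough sketch of the bookkeeping a watch might need, with hypothetical types layered on top of the checkpoint maps described above:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical bookkeeping for a watch: record the file's checksum when the
// watch is registered, then verify it is unchanged in every generated crash
// state at or past the checkpoint the watch is tied to.
struct Watch {
  std::string path;
  unsigned checkpoint_id;  // the watch only applies once this checkpoint is reached
  uint64_t expected_sum;   // checksum recorded when the watch was registered
};

bool WatchHolds(const Watch& watch, unsigned crash_state_checkpoint,
                const std::unordered_map<std::string, uint64_t>& crash_state) {
  if (crash_state_checkpoint < watch.checkpoint_id) return true;  // not in effect yet
  auto it = crash_state.find(watch.path);
  return it != crash_state.end() && it->second == watch.expected_sum;
}
```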

ashmrtn commented on September 25, 2024

I'm going to split this issue up into several smaller ones, since both checkpoints and watches are somewhat complicated and require support across different parts of CrashMonkey.

vijay03 commented on September 25, 2024

Should we close this issue now @ashmrtn?
