famfs's Introduction

Famfs Shared Memory Filesystem Framework - User Space Repo

This is the famfs user space repo. A famfs-enabled Linux kernel is also required; a viable kernel can be obtained in one of the following ways:

What is Famfs?

Famfs is a scale-out shared-memory file system. If two or more hosts have shared access to memory, a famfs file system can be created in that memory, such that data sets can be saved to files in the shared memory.

For apps that can memory map files, memory-mapping a famfs file provides direct access to the memory without any page cache involvement (and no faults involving data movement at all).

Consuming data from famfs files works the same as with any other file system, meaning almost any app that can use data in files can use data in shared memory.

Files default to read-only on client nodes, but famfs fully supports converting any file to writable on any client - in which case the app/user is responsible for managing cache coherency. Famfs maintains cache coherency for its structures and metadata, but does not attempt to do so for other apps.

Famfs figure 1

Famfs effectively does the following:

  • Enables disaggregated shared memory for any app that can use shared files
  • De-duplicates memory, because multiple nodes can share a single copy of data in memory
  • Reduces or avoids shuffling
  • Enables larger in-memory data sets than previously possible, because FAM could be larger than the memory limit of any one server

Background

In the coming years, the emerging CXL standard will enable shared, disaggregated memory in multiple forms - including multi-port memory and fabric-attached memory (FAM). Famfs is intended to provide a viable usage pattern for FAM that many apps can use without modification. Early shared memory implementations already exist as of early 2024.

Famfs is an fs-dax file system that allows multiple hosts to mount the same file system from the same shared memory. The file system is administered by a Master, but can be concurrently mounted by one or more Clients, which get read-only access to files by default (though writable access can be granted).

Why do we need a new fs-dax file system when others (e.g. xfs and ext4) exist? Because the existing fs-dax file systems use write-back metadata (as pretty much all conventional file systems do). Write-back metadata is not compatible with scale-out shared memory access, because two or more hosts have no way to agree on the definitive state of metadata (not to mention space allocation).

Famfs was introduced at the 2023 Linux Plumbers Conference. The linked page contains the abstract, plus links to the slides and a YouTube video of the talk.

What is dax?

In Linux, special purpose memory is exposed as a dax device (e.g. /dev/dax0.0 or /dev/pmem0). Applications can memory map dax memory by opening a dax device and calling the mmap() system call on the resulting file descriptor.
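
As a minimal sketch (the device path and the 2MiB mapping length are illustrative assumptions), mapping a dax device looks like this:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 2UL << 20;               /* one 2MiB huge page (assumed size) */
    int fd = open("/dev/dax0.0", O_RDWR); /* device path is an assumption */

    if (fd < 0) {
        perror("open");
        return 1;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    /* loads and stores through p access the dax memory directly */
    munmap(p, len);
    close(fd);
    return 0;
}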

Dax memory can be onlined as system-ram, but that is not appropriate if the memory is shared. The first of many reasons for this is that Linux zeroes memory that gets onlined, which would wipe any shared contents.

In CXL V3 and beyond, dynamic capacity devices (DCDs) support shared memory. A DCD is really just a memory device with an allocator and access control built in. Sharable memory has a mandatory Tag (UUID), which is assigned when the memory is allocated; all hosts with shared access identify the memory by its Tag. "Tagged Capacity" will be exposed under Linux as tagged dax devices (e.g. /sys/devices/dax/<tag> - the specific recipe is TBD).

What is fs-dax?

Fs-dax is a means of creating a file system that resides in dax memory. A file in an fs-dax file system is just a convenient means of accessing the subset of dax memory that is allocated to that file. If an application opens an fs-dax file and calls mmap() on the file descriptor, the resulting pointer provides direct load/store access to the memory - without ever moving data in and out of the page cache (as is the case with mmap() on "normal" files).

POSIX read() and write() are also supported, but they are not the optimal use case for fs-dax; read and write effectively amount to a memcpy() in or out of the file's memory.
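
A sketch of the mmap() access pattern described above, assuming a hypothetical file /mnt/famfs/frame0 of at least 2MiB (conventional fs-dax file systems like xfs/ext4 often add MAP_SYNC; whether famfs requires it is not covered here):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 2UL << 20;                      /* illustrative length */
    int fd = open("/mnt/famfs/frame0", O_RDWR);  /* hypothetical famfs file */

    if (fd < 0)
        return 1;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    memcpy(p, "hello", 6);   /* stores go directly to the file's memory */
    munmap(p, len);
    close(fd);
    return 0;
}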

How can famfs do interesting work?

It is common in data science and AI workloads to share large datasets (e.g. data frames) among many compute jobs that share the data. Many components in these tool chains can memory map datasets from files. The "zero-copy formats" (e.g. Apache Arrow) are of particular interest because they are already oriented to formatting data sets in a way that can be efficiently memory-mapped.

Famfs enables a number of advantages when the compute jobs scale out:

  • Large datasets can exist as one shared copy in a famfs file, effectively de-duplicating memory
  • Shuffling and sharding can be avoided if enough FAM is available, potentially removing a data-distribution cost that grows quadratically with the number of nodes
  • When an app memory-maps a famfs file, it is directly accessing the memory; unlike block-based file systems, data is not read (or faulted) into local memory in order to be accessed

Jobs like these can be adapted to use famfs without even recompiling any components, via a procedure like the following (a shell sketch appears after the list).

  1. Wrangle data into a zero-copy format
  2. Copy the zero-copy files into a shared famfs file system
  3. Component jobs from any node in the cluster can access the data via mmap() from files in the shared famfs file system
  4. When a job is over, unmount the famfs file system and make the memory available for the next job
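
A hedged shell sketch of that procedure (device names, mount points, and exact flags may differ; see the famfs cli reference):

# On the Master node:
sudo mkfs.famfs /dev/dax0.0
sudo famfs mount /dev/dax0.0 /mnt/famfs
sudo famfs cp dataset.arrow /mnt/famfs/    # step 2: copy zero-copy files in

# On each Client node (mounting the same shared memory):
sudo famfs mount /dev/dax0.0 /mnt/famfs
sudo famfs logplay /mnt/famfs              # pick up files logged since mount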

Famfs Requirements

  1. Must support a file system abstraction backed by sharable dax memory
  2. Files must efficiently handle VMA faults
  3. Must support metadata distribution in a sharable way
  4. Must handle clients with a stale copy of metadata

A few observations about the requirements

Requirement 1: Making shared memory accessible as files means that most apps that can consume data from files (especially the ones that use mmap()) can experiment with disaggregated shared memory.
Requirement 2: Efficient VMA fault handling is absolutely mandatory for famfs to perform at "memory speeds". Famfs caches file extent lists in the kernel, and forces all allocations to be huge-page aligned for efficient memory mapping and fault handling.
Requirement 3: There are existing fs-dax file systems (e.g. xfs, ext4), but they use cached, write-back metadata. This cannot be reconciled with more than one host concurrently (read-write) mounting those file systems from the same shared memory. Famfs does not use write-back metadata; that, along with some annoying limitations, solves the shared metadata problems.
Requirement 4: The same annoying restrictions mean that, in its current form, famfs does not need to track whether clients have consumed all of the metadata.

Theory of Operation

Famfs is a Linux file system that is administered from user space by the code in this repo. The host device is a dax memory device (e.g. /dev/dax0.0).

The file infrastructure lives in the Linux kernel, which is necessary for Requirement #2 (must efficiently handle VMA faults). But the majority of the code lives in user space and executes via the famfs cli and library.

The "Master" node is the system that created a famfs file system (by running mkfs.famfs). The system UUID of the master node is stored in the superblock, and the famfs cli and library prevent client nodes from mutating famfs metadata for a file system that they did not create.

As files and directories are allocated and created, the Master adds those files to the famfs append-only metadata log. Clients gain visibility of files by periodically re-playing the log.

Famfs On-media Format

The mkfs.famfs command formats an empty famfs file system on a dax memory device. An empty famfs file system consists of a superblock at offset 0 and a metadata log at offset 2MiB. The format looks like this:

Image of empty famfs media

Note that this is not to scale. Currently the superblock is 2MiB, the log defaults to 8MiB, and the minimum supported memory device size is 4GiB.

After a file has been created, the media format looks like this:

Image of famfs file system with one file

The following figure shows a slightly more complex famfs format:

Image of famfs filesystem with directory

During mkfs.famfs and famfs mount, the superblock and log are accessed via raw mmap of the dax device. Once the file system is mounted, the famfs user space only accesses the superblock and log via meta files:

# mount | grep famfs
/dev/dax1.0 on /mnt/famfs type famfs (rw,nosuid,nodev,noexec,noatime)
# ls -al /mnt/famfs/.meta
total 4096
drwx------ 2 root root       0 Feb 20 10:34 .
drwxr-xr-x 3 root root       0 Feb 20 10:34 ..
-rw-r--r-- 1 root root 8388608 Feb 20 10:34 .log
-r--r--r-- 1 root root 2097152 Feb 20 10:34 .superblock

The famfs meta files are special-case files that are not in the log. The superblock is a known location and size (offset 0, 2MiB - i.e. a single PMD page). The superblock contains the offset and size of the log. All additional files and directories are created via log entries.
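
As a rough illustration of what the text above implies (field names and layout here are hypothetical, not the actual famfs on-media definitions):

#include <stdint.h>

/* Hypothetical sketch only - NOT the real famfs on-media format */
struct famfs_superblock_sketch {
    uint8_t  master_uuid[16]; /* system UUID of the Master (set by mkfs.famfs) */
    uint64_t log_offset;      /* offset of the metadata log (default 2MiB) */
    uint64_t log_size;        /* size of the metadata log (default 8MiB) */
};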

The famfs kernel module never accesses the memory of the dax device - not even the superblock and log. This has RAS benefits. If a memory error occurs (non-correctable errors, poison, connectivity problems, etc.) the process that accessed the memory should receive a SIGBUS, but kernel code should not be affected.

Famfs currently does not include any stateful or continuously running processes. When the library creates files, allocation and log append are serialized via a flock() call on the metadata log file. This is sufficient since only the Master node can perform operations that write the log.
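
A sketch of that serialization (illustrative, not the actual library code; the log path is the meta file shown earlier):

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Run fn() with the metadata log locked, serializing allocate + append */
int with_log_locked(void (*fn)(void))
{
    int fd = open("/mnt/famfs/.meta/.log", O_RDWR);

    if (fd < 0)
        return -1;

    if (flock(fd, LOCK_EX) == 0) {  /* blocks until the log is ours */
        fn();                       /* allocate space, append a log entry */
        flock(fd, LOCK_UN);
    }
    close(fd);
    return 0;
}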

Famfs Operations

Famfs files can be accessed through normal means (read/write/mmap), but creating famfs files requires the famfs user space library and/or cli. File creation consists of the following (slightly oversimplified) steps:

  1. Allocate memory for the file
  2. Write a famfs metadata log entry that commits the existence of the file
  3. Instantiate the file in the mounted famfs instance
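
For example (hedged: the -s size flag is assumed here; check the cli reference for exact options):

sudo famfs creat -s 2097152 /mnt/famfs/frame0   # 2MiB: allocate, log, instantiate
sudo famfs cp dataset.arrow /mnt/famfs/         # same steps, plus copying contents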

Directory creation is slightly simpler, consisting of:

  1. Commit the directory creation in the log
  2. Instantiate the directory in the mounted famfs instance

The core operations:

  • mkfs.famfs - The host that creates a famfs file system becomes the Master. Only the Master can create files in a famfs file system, though Clients can read files (and optionally be given write permission).
  • famfs mount - Master and Clients: mount a famfs file system from a dax or pmem device. Files default to read-only on Clients.
  • famfs logplay - Master and Clients: any files and directories created in the log are instantiated in the mounted instance of famfs. Can be re-run to detect and create files and directories logged since the last logplay.
  • famfs fsck - Master and Clients: the allocation map is checked and errors are reported.
  • famfs check - Master and Clients: any invalid famfs files are detected. This can happen if famfs files are modified (e.g. created or truncated) by non-famfs tools. Famfs prevents I/O to such files. The file system can be restored to a valid state by unmounting and remounting.
  • famfs mkdir [-p] - Master only: create a directory (optionally including missing parent directories) in the mounted file system.
  • famfs creat - Master only: create and allocate a famfs file.
  • famfs cp [-r] - Master only: copy one or more files into a famfs file system.
  • famfs flush - Master and Clients: flush or invalidate the processor cache for a file; this may be needed when mutating files.
  • read()/write() - Master and Clients: any file can be read or written, provided the caller has appropriate permissions.
  • mmap() - Master and Clients: any file can be mmapped read/write or read-only, provided the caller has sufficient permissions.

For more detail, see the famfs cli reference.

Missing File System Operations

Famfs currently lacks some standard file system operations: rm, append, and truncate are not supported. By omitting these operations, we avoid the complex distributed computing problems of 1) figuring out when it is safe to re-use freed space, and 2) allocating space dynamically.

  • rm - The remove (rm) operation is not supported, which allows famfs to avoid the problem of freeing space and determining when it is safe to re-use that space for a different file. Many use cases can work around this by freeing whole famfs file systems rather than individual files - just free the memory (if it's a DCD, CXL supports freeing allocations, including forced remove; a new famfs file system can then be created in a new sharable DCD allocation).
  • append - Appending to a file requires additional allocation sooner or later. Since famfs files are strictly pre-allocated, append is not allowed.
  • truncate - The truncate operation can make a file either smaller or larger. Making a file smaller raises the same issues as rm; making it larger raises the same issues as append. Thus famfs does not support truncate.

Rogue File System Operations

Famfs requires that any operation that creates files be done via the famfs API or cli, in order to properly allocate space and log the metadata. Operations that affect allocation are not allowed after a file has been created, which is why famfs has no commands/APIs for rm, append, or truncate. However, famfs currently has no way to prevent a user with sufficient permissions from attempting these operations via the standard file system tools.

If a famfs file is removed or its size is changed by standard utilities, it will be treated as invalid by famfs, and famfs will prevent writing or reading until the file is repaired.

Most of the operations resulting in invalid files are recoverable.

  • Linux rm - If the caller has sufficient permission, the file disappears from the mounted famfs file system. Recovery: the file reappears if you replay the log, as in famfs logplay /mnt/famfs.
  • Linux ftruncate - If a file's size is changed from the allocated size, the file is treated as invalid until it is repaired. (Logged truncate seems like a less likely requirement than logged delete, but let's talk if you need it.) Recovery: umount/remount, possibly logplay.
  • Linux cp - Using the standard cp where the destination is famfs is invalid. The cp fails, but it may leave behind an empty, invalid file at the destination. Recovery: umount/remount.

Are These Limitations Permanent?

No! Well, some maybe, but most of the famfs limitations can be mitigated or eliminated if we decide to write enough code. Those Byzantine generals need to be managed, etc.

The current famfs limitations are chosen to keep the problem manageable and the code reasonably sized in the early stages.

What is Famfs NOT?

Famfs is not a general purpose file system, and unlike most file systems, it is not a data storage tool. Famfs is a data sharing tool.

famfs's People

Contributors

jagalactic, jjacob512, zhijianli88

famfs's Issues

Compile fails because extent_type was renamed to famfs_extent_type

The compile fails with the following errors:
#make all
make[3]: Entering directory '/root/famfs/debug'
[ 3%] Building C object CMakeFiles/libfamfs.dir/src/famfs_lib.c.o
/root/famfs/src/famfs_lib.c:423:28: warning: ‘enum extent_type’ declared inside parameter list will not be visible outside of this definition or declaration
423 | enum extent_type *type)
| ^~~~~~~~~~~
/root/famfs/src/famfs_lib.c:421:1: error: conflicting types for ‘famfs_get_device_size’; have ‘int(const char *, size_t *, enum extent_type *)’ {aka ‘int(const char *, long unsigned int *, enum extent_type *)’}
421 | famfs_get_device_size(const char *fname,
| ^~~~~~~~~~~~~~~~~~~~~
In file included from /root/famfs/src/famfs_lib.c:35:
/root/famfs/src/famfs_lib.h:21:12: note: previous declaration of ‘famfs_get_device_size’ with type ‘int(const char *, size_t *, enum famfs_extent_type *)’ {aka ‘int(const char *, long unsigned int *, enum famfs_extent_type *)’}
21 | extern int famfs_get_device_size(const char *fname, size_t *size, enum famfs_extent_type *type);
| ^~~~~~~~~~~~~~~~~~~~~
/root/famfs/src/famfs_lib.c: In function ‘famfs_mkfs’:
/root/famfs/src/famfs_lib.c:3812:14: error: variable ‘type’ has initializer but incomplete type
3812 | enum extent_type type = SIMPLE_DAX_EXTENT;
| ^~~~~~~~~~~
/root/famfs/src/famfs_lib.c:3812:26: error: storage size of ‘type’ isn’t known
3812 | enum extent_type type = SIMPLE_DAX_EXTENT;
| ^~~~
/root/famfs/src/famfs_lib.c:3812:26: warning: unused variable ‘type’ [-Wunused-variable]
make[3]: *** [CMakeFiles/libfamfs.dir/build.make:76: CMakeFiles/libfamfs.dir/src/famfs_lib.c.o] Error 1
make[3]: Leaving directory '/root/famfs/debug'
make[2]: *** [CMakeFiles/Makefile2:184: CMakeFiles/libfamfs.dir/all] Error 2
make[2]: Leaving directory '/root/famfs/debug'
make[1]: *** [Makefile:146: all] Error 2
make[1]: Leaving directory '/root/famfs/debug'
make: *** [Makefile:13: debug] Error 2

pcq enhancements - enabling useful automated testing

  • When consuming an entry from a queue, if the crc is bad, should invalidate the cache (for the entry) and retry the consume
  • Add a -s|--status option to print status during the run; needs a status_worker() thread
  • Keep stats on the number of times the producer sees a full queue
  • Keep stats on the number of times a consumer sees an empty queue
  • Keep stats on the number of times a consumer needs to retry reading a valid entry due to crc mismatch (which is either a bad entry or a cache incoherency occurrence)
  • pcq --setperm option to set permission for p|c|b|n (producer, consumer, both, neither) to help with tests

logplay should rigorously avoid following symlinks

I added a test that manually puts a symlink where a directory should be, and then runs logplay. The link was to a directory (/tmp), and logplay just thinks the directory already exists (i.e. stat() followed the link to its destination and reported on that).

We may need to use fstatat() in logplay (rather than stat()) to avoid this, or always do an lstat() before the stat(). Preventing symlink creation might also be a good idea, if there's a way.
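
A minimal sketch of the lstat()-first check (illustrative; not the actual logplay code):

#include <stdio.h>
#include <sys/stat.h>

/* Return -1 if path is itself a symlink; lstat() does not follow links */
int refuse_symlink(const char *path)
{
    struct stat st;

    if (lstat(path, &st) == 0 && S_ISLNK(st.st_mode)) {
        fprintf(stderr, "logplay: %s is a symlink; refusing to follow\n", path);
        return -1;
    }
    return 0;
}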

Code Coverage: Umbrella issue

We use gcov to test code coverage. We currently use (and will probably continue to use) a combination of smoke and unit tests to measure "official" coverage (see https://github.com/cxl-micron-reskit/famfs/blob/master/markdown/getting-started.md).

But I would like to see our unit-test-only coverage increase, for various reasons. One is that smoke tests may not be runnable on Github, since they need an actual dax device. Moreover, there are some inherently hard-to-cover branches - mmap failures and file open failures are at the top of the list. We probably need to enable mocking for open and mmap in order to properly test failures in those functions.

So please work on unit (or smoke) test coverage improvements - especially unit test coverage - and send PRs.

Better unit test coverage will be good since enabling Github coverage reporting probably can't include coverage from smoke tests (but please correct me if this seems wrong).

Add clflush as needed to support concurrent cache-incoherent mounts

clflush is needed:

  • after mkfs, on the superblock and log
  • after log append (on the log)
  • before logplay (on the log)
  • after 'famfs cp'
  • after 'famfs creat', if the file was initialized

The user or app will be responsible for flushing the cache on any data written by apps other than the famfs cli.

Still thinking about how the generalized cases should work. Probably a "famfs flush" command and api call that should flush data as appropriate...
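
A sketch of what such a flush might look like on x86 (illustrative only, not the famfs implementation; compile with -mclflushopt):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

/* Flush every cache line in [addr, addr + len) */
static void flush_range(void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHELINE)
        _mm_clflushopt((void *)p);
    _mm_sfence();   /* order the flushes before subsequent stores */
}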

File creation will hork if a duplicate relative path is created after a rogue delete of that relative path

When famfs creates a file (famfs_mkfile() in the api, normally from 'famfs cp' or 'famfs creat'), it will fail if the file already exists. But if a rogue delete had taken place, and then a cp or creat tried to create the same relative path, it would not see the file in the mounted filesystem - and would proceed to create and log the file instance.

A rogue delete is any 'rm' that did not occur through the famfs api/cli - and since the api/cli currently does not support delete, it's any delete.

This would effectively make a mess of things.

  • The log would contain two file creates of the same relative path, with different allocation
  • Logplay would instantiate the first, and ignore the second
  • The Master node would have the second file (hey, only the master can create files), but any client that has not experienced a rogue delete would have the first file (which would not map to the same memory as the "rogue create" on the master).

This could be solved by building a hash table of relative paths during logplay (master only), and 1) detecting relative path collisions in the log, and 2) detecting famfs_mkfile() or famfs_mkdir() calls that generate relative path collisions in the log (which are detectable via the mounted namespace if there have been no rogue namespace operations)

One downside to this is that it will make the O() order of file creation worse; it's already kinda expensive because space allocation plays the log to get the free/available bitmap - which is not persisted. There is not an "easy" way to persist the hash table either, so that might need to be re-generated on each [batch of] file create. (batches because 'cp *' and 'cp -r' and 'mkdir -p' lock the log and build the bitmap once for a batch of creates).

Hmm - we could persist the bitmap in a new meta file, and only expose that on the master. Even the hash table could be handled that way. Extend the flock(log) to cover those files, and it may be a fully working approach worth considering... eventually.

This has not been observed in the wild; we will put this "on ice" initially, but it may need to be revisited.

We need a unit test that overflows the famfs metadata log

Log overflow has not been tested yet...

Mind you the default log size has space for >25000 entries, and the minimum allocation unit is 2MiB (but if you're creating files <=2MiB, you're probably missing the point of famfs).

Still: will add such a test

famfs mount: works even if there is no superblock & log

The kernel mount is performed, and then famfs_mkmeta() discovers that there is no valid superblock and bails out on creating the meta files.

Need to think through the right way to handle this gracefully. Probably issue a umount on the way out.

mkfs.famfs: support configurable metadata log size?

Currently the log is always 8MiB, which currently holds something over 25000 log entries. Reasons we might support a configurable log size:

  1. 25K entries is not enough
  2. 8MiB is massive overkill (e.g. the device is going to be used for a single file, or a small set). In this case, the minimum log size will be 2MiB - one huge page.

Under no circumstances should we support a log size that is not a 2MiB multiple. Honestly I'm not sure this needs to be addressed, but it's a question that has come up several times.

The trickiest part would be test coverage.

Need a test that verifies that mapping faults are 2MiB or 1GiB, and not 4KiB

The famfs kernel RFC v1 has fault counters that can be enabled via /sys/fs/famfs/... These were implemented because we had a bug in the kernel module at one point that resulted in 4K mapping faults rather than 2M, and it killed performance for many concurrent computational processes hammering on the same data frame(s) in memory.

Famfs files are constrained to multiples of 2MiB, and mmap addresses are constrained to 2MiB alignment - both for this reason. But we need a good way to catch a regression.

However, the counters proved controversial (see thread [1]), so they will probably be dropped from the next version of the patch set. Dan Williams pointed out a user space test that ndctl uses to verify something similar [2], but I'm hoping we can do better. The first place to look is the rest of the thread at [1], to see if a prescription becomes clear.

[1] https://lore.kernel.org/linux-fsdevel/3jwluwrqj6rwsxdsksfvdeo5uccgmnkh7rgefaeyxf2gu75344@ybhwncywkftx/T/#m69d2b66e54f9657c38e6e0a0da94ab4b3eca7586
[2] https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31

Does FAMFS support write now?

When I was trying some simple commands to see the behavior, I noticed that if I use a command like:
echo 1234 >> /mnt/famfs/test
then famfs_dax_write_iter is used to change the contents.
However, no log entry is added, so famfs verify/fsck cannot pass.
Is that an issue? Or is there another way to write something to the file?

Integrate fio tests into famfs test suite (probably in smoke...)

Something along the lines of what we did in this sub-thread: https://lore.kernel.org/linux-fsdevel/w5cqtmdgqtjvbnrg5okdgmxe45vjg5evaxh6gg3gs6kwfqmn5p@wgakpqcumrbt/.

Test artifacts should be captured in log files. Extra points for scriptage that sanity-checks results (though this could get complicated, since not all memory has the same performance...). I already have some script setup, but it will need to check core count and memdev size in order to set up properly on any system/memdev.

famfs_map_superblock_and_log_raw(): size of log is assumed

This function needs to either:

  • Check the log size after the mmap, and remap if necessary to get the right size
  • Or mmap the superblock first, get the log size, then mmap the log

The first option leaves the cleanup almost exactly the same, whereas the second leaves two things to munmap.

This is a bug, but it's only a latent bug until we support more than one log size.
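
A sketch of the second option (struct and field names are hypothetical, echoing the superblock sketch earlier; not the real famfs code):

#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

#define SB_SIZE (2UL << 20)  /* superblock is one 2MiB PMD page */

/* hypothetical superblock fields; not the real famfs layout */
struct sb_sketch {
    uint8_t  uuid[16];
    uint64_t log_offset;
    uint64_t log_size;
};

int map_sb_and_log(int fd, void **sbp, void **logp)
{
    /* map the superblock first... */
    struct sb_sketch *sb = mmap(NULL, SB_SIZE, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
    if (sb == MAP_FAILED)
        return -1;

    /* ...then map the log at the size the superblock declares */
    void *log = mmap(NULL, sb->log_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, (off_t)sb->log_offset);
    if (log == MAP_FAILED) {
        munmap(sb, SB_SIZE);  /* two mappings now mean two munmaps on cleanup */
        return -1;
    }
    *sbp = sb;
    *logp = log;
    return 0;
}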
