Comments (14)
I think being able to share between the host and container images is more than a nice-to-have; if we're doing a major architectural rework, I'd call it a requirement, because it makes containerization much closer to zero-cost (e.g. assuming that your glibc is shared between the host and base images).
from composefs.
I've been discussing the backing-files GC issue with @giuseppe quite a bit in the context of containers/storage. And the best approach I've come up with is this:
Suppose you have /composefs/ like above; inside it you would have a layout something like:
├── files
│   ├── 00
│   │   └── 1234.file
│   ├── aa
│   │   └── 5678.file
│   └── bc
│       └── abcd.file
└── images
    ├── foo
    │   ├── image.cfs
    │   ├── 00
    │   │   └── 1234.file
    │   └── aa
    │       └── 5678.file
    └── bar
        ├── image.cfs
        ├── 00
        │   └── 1234.file
        └── bc
            └── abcd.file
So, a shared backing file dir with all files from all images, and then each image has a directory with only the files for that image. However, the backing files would be hardlinked. Then, to remove an image you delete the image file and the directory, and then you make a pass over the shared dir and delete any files with n_link==1.
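A minimal sketch of that removal pass (assuming the files/ + images/ layout shown earlier; function names and error handling are illustrative, not composefs code):

```python
import os
import shutil

def remove_image(storage, image):
    """Delete one image, then prune shared backing files that no
    image references any more (st_nlink == 1 means only the shared
    dir itself still links to the file)."""
    shared_dir = os.path.join(storage, "files")
    image_dir = os.path.join(storage, "images", image)
    # Deleting the per-image dir drops that image's hardlinks.
    shutil.rmtree(image_dir)
    # One pass over the shared dir: anything down to a single link
    # is no longer referenced by any image.
    for root, _dirs, files in os.walk(shared_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.stat(path).st_nlink == 1:
                os.unlink(path)
```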
Updating a structure like this can be atomic I believe. You start by hardlinking from the shared dir into the per-image dir, and on ENOFILE you create new files, sync and then try to hardlink the new files back to all-files. On EEXIST failure, start over with the smaller set of failed files.
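A sketch of what that atomic update path could look like (illustrative only; this uses the variant of creating the file under the per-image dir first, which is also discussed later in the thread, so new files never sit in the shared dir with a link count of 1):

```python
import os

def add_object(shared_dir, image_dir, name, data):
    """Add one content-addressed backing file to an image, sharing
    it with other images via hardlinks into the shared dir.
    (Sketch of the scheme above; names are illustrative.)"""
    shared = os.path.join(shared_dir, name)
    per_image = os.path.join(image_dir, name)
    try:
        # Fast path: object already published; hardlink it into the image.
        os.link(shared, per_image)
        return
    except FileNotFoundError:
        pass
    # Slow path: create the file under the per-image dir first, so it
    # always starts with a stable link and cannot be pruned concurrently.
    with open(per_image, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    try:
        os.link(per_image, shared)  # publish into the shared dir
    except FileExistsError:
        # Lost a race: another writer published the same content.
        # Re-link from the shared copy so the data stays deduplicated.
        os.unlink(per_image)
        os.link(shared, per_image)
```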
from composefs.
Updating a structure like this can be atomic I believe. You start by hardlinking from the shared dir into the per-image dir, and on ENOFILE you create new files, sync and then try to hardlink the new files back to all-files. On EEXIST failure, start over with the smaller set of failed files.
s/EEXIST/ENOENT/ right? i.e. on a failure to linkat we want to try writing the file and hardlinking again.

Although actually I think the logic may end up being simpler if for non-existent files we actually create the file in the per-image dir first, and then hardlink to the base files directory, ensuring it always has a link count of 2 to start and won't be concurrently pruned.
That is what I mean with EEXIST. You do what you said, but it could race with someone else; then when you link() it you get EEXIST, so you start over trying to link from shared to per-image.
from composefs.
Then, to remove an image you delete the image file and the directory, and then you make a pass over the shared dir and delete any files with n_link==1.
I think we can optimize this by scanning just the composefs image that we're removing instead, and then only unlinking entries in the shared dir that end up with n_link==1.
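A sketch of this targeted prune (assuming the per-image dir can serve as the list of objects the image references, rather than parsing the .cfs itself; names are illustrative):

```python
import os
import shutil

def remove_image_targeted(storage, image):
    """Remove an image, then prune only the shared backing files it
    referenced, instead of scanning the entire shared dir."""
    shared_dir = os.path.join(storage, "files")
    image_dir = os.path.join(storage, "images", image)
    # The per-image dir mirrors the image's backing files, so it
    # doubles as the candidate list for pruning.
    candidates = []
    for root, _dirs, files in os.walk(image_dir):
        for name in files:
            if name == "image.cfs":
                continue
            rel = os.path.relpath(os.path.join(root, name), image_dir)
            candidates.append(rel)
    shutil.rmtree(image_dir)
    # Only these files can have dropped to a single link.
    for rel in candidates:
        path = os.path.join(shared_dir, rel)
        try:
            if os.stat(path).st_nlink == 1:
                os.unlink(path)
        except FileNotFoundError:
            pass  # already pruned by a concurrent removal
```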
from composefs.
But yeah, it would be cool with a global, namespaced version of this, because then we can easily get sharing between the ostree rootfs and container images.
from composefs.
The scheme you describe makes sense to me offhand. The simplicity is very appealing; there's no explicit locking (e.g. flock()
) and no databases (sqlite, json, etc.). It's basically pushing refcounting down into the kernel inodes, the same as ostree does. However IME one downside of this is that adding/removing images incurs metadata traffic (i.e. dirties inodes) on the order of number of files. That's already a cost paid with ostree today though.
Updating a structure like this can be atomic I believe. You start by hardlinking from the shared dir into the per-image dir, and on ENOFILE you create new files, sync and then try to hardlink the new files back to all-files. On EEXIST failure, start over with the smaller set of failed files.
s/EEXIST/ENOENT/ right? i.e. on a failure to linkat we want to try writing the file and hardlinking again.
Although actually I think the logic may end up being simpler if for non-existent files we actually create the file in the per-image dir first, and then hardlink to the base files directory, ensuring it always has a link count of 2 to start and won't be concurrently pruned.
from composefs.
And basically once we have this shared scheme, I think we can seamlessly convert an ostree repository into this format (for composefs-only cases). And that then significantly reduces the logic in ostree core and I think simplifies the composefs integration.
from composefs.
Parallel to the above, flatpak stores a .ref file for each deploy dir, and whenever we run an app we pass bwrap --lock-file $apppath/.ref --lock-file $runtimepath/.ref, which takes a (shared) read lock on the .ref files. Then we can try to take a write lock on a file to see if it is in use.
The general approach for remove in flatpak is:
- atomically move $dir/deploy/foo to $dir/removed/foo
- Loop over $dir/removed
  - Try to lock $dir/removed/$subdir/.ref
  - If we can lock, remove the directory
This way we can atomically remove things, yet still keep running instances.
We can maybe do the same, but just lock the image file.
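A sketch of that locking scheme with flock() (illustrative; the function names and layout are made up here, not flatpak's actual code):

```python
import fcntl
import os
import shutil

def hold_ref(deploy_dir):
    """Running instance: take a shared (read) lock on the .ref file.
    The fd must stay open for the lifetime of the instance."""
    fd = os.open(os.path.join(deploy_dir, ".ref"), os.O_RDONLY)
    fcntl.flock(fd, fcntl.LOCK_SH)
    return fd

def remove_deploy(base, name):
    """Atomically retire a deploy dir, then reap unused ones."""
    removed = os.path.join(base, "removed")
    os.makedirs(removed, exist_ok=True)
    # Step 1: atomic rename out of the active namespace; running
    # instances keep working because their locks follow the inode.
    os.rename(os.path.join(base, "deploy", name),
              os.path.join(removed, name))
    # Step 2: reap anything in removed/ that is no longer locked.
    for sub in os.listdir(removed):
        subdir = os.path.join(removed, sub)
        fd = os.open(os.path.join(subdir, ".ref"), os.O_RDONLY)
        try:
            # An exclusive lock succeeds only if no instance still
            # holds the shared lock, i.e. nothing is running.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os.close(fd)
            continue  # still in use; leave it for a later pass
        shutil.rmtree(subdir)
        os.close(fd)
```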
from composefs.
I was thinking about this more today, because we just hit a really bad ostree bug that of course only affected ostree, not rpm and not containers/storage.
In a future where we share tooling between host updates and containers there's much less chance of bugs that affect just one of the two, and we get all the page cache sharing etc.
But... what I really came here to say is that while this all sounds good, so far in many scenarios in FCOS (and generally ostree systems) we've been encouraging people to provision a separate /var mount. There are multiple advantages to this: it more strongly decouples OS updates from "system state". But today that "system state" includes /var/lib/containers/images
...
And if we eventually try to do something like upgrading users who are currently using separate ostree and container storage into a more unified model, we now have uncomfortable tradeoffs around disk sizing.
I guess ultimately we'd need to detect this situation when / and /var/lib/containers are separate filesystems and just keep the composefs storage separate going forward. (But, I do think it's likely that we start doing more "system container" type stuff in / again).
EDIT: Hmmm... I guess in theory, nothing stops us from at least doing something like cherry-picking "high value objects to share" (e.g. glibc) and deduping them between the "host object storage" and the "app object storage". Maybe something like just having a plain old symlink from /var/lib/containers/objects/aa/1234.object -> /composefs/objects/aa/1234.object ...and then also adding a "placeholder" hardlink image reference to it in the host storage.
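A sketch of what that cherry-picked dedup could look like (speculative; the .appref placeholder name and layout are invented here to illustrate the symlink-plus-placeholder-hardlink idea):

```python
import os

def dedup_object(host_objects, app_objects, relpath):
    """Replace an app-storage object with a symlink to the host copy,
    keeping a placeholder hardlink so host GC still sees a reference."""
    host_obj = os.path.join(host_objects, relpath)
    app_obj = os.path.join(app_objects, relpath)
    # Placeholder hardlink in host storage pins the object against
    # the n_link-based pruning pass.
    os.link(host_obj, host_obj + ".appref")
    # Swap the app copy for a symlink to the shared host object.
    tmp = app_obj + ".tmp"
    os.symlink(host_obj, tmp)
    os.replace(tmp, app_obj)  # atomic rename over the old copy
```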
from composefs.