
Comments (14)

cgwalters commented on June 11, 2024

I think being able to share between the host and container images is more than a nice-to-have; if we're doing a major architectural rework, I'd call it a requirement because it makes containerization a more zero-cost thing. (e.g. assuming that your glibc is shared between host and base images)

from composefs.

alexlarsson commented on June 11, 2024

I've been discussing the backing-files GC issue with @giuseppe quite a bit in the context of containers/storage. And the best approach I've come up with is this:

Suppose you have /composefs/ as above; inside it you would have a layout something like:

├── files
│   ├── 00
│   │   └── 1234.file
│   ├── aa
│   │   └── 5678.file
│   └── bc
│       └── abcd.file
└── images
    ├── foo
    │   ├── image.cfs
    │   ├── 00
    │   │   └── 1234.file
    │   └── aa
    │       └── 5678.file
    └── bar
        ├── image.cfs
        ├── 00
        │   └── 1234.file
        └── bc
            └── abcd.file

So, a shared backing file dir with all files from all images, and then each image has a directory with only the files for that image. However, the backing files would be hardlinked. Then, to remove an image you delete the image file and the directory, and then you make a pass over the shared dir and delete any files with n_link==1.
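The removal pass described above can be sketched like this (illustrative Python; the layout follows the tree above, and remove_image is a hypothetical helper, not composefs API):

```python
import os
import shutil

def remove_image(root, name):
    """Remove one image and prune backing files nobody references anymore."""
    # Deleting the per-image directory drops one hard link per backing file.
    shutil.rmtree(os.path.join(root, "images", name))

    # Any file in the shared dir whose link count fell to 1 is now only
    # referenced by the shared dir itself and can be deleted.
    files = os.path.join(root, "files")
    for sub in os.listdir(files):
        subdir = os.path.join(files, sub)
        for f in os.listdir(subdir):
            path = os.path.join(subdir, f)
            if os.stat(path).st_nlink == 1:
                os.unlink(path)
```

The kernel's inode link count is doing the reference counting; no database or lock is needed for the common case.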

Updating a structure like this can be atomic I believe. You start by hardlinking from the shared dir into the per-image dir, and on ENOFILE you create new files, sync and then try to hardlink the new files back to all-files. On EEXIST failure, start over with the smaller set of failed files.


alexlarsson commented on June 11, 2024

Updating a structure like this can be atomic I believe. You start by hardlinking from the shared dir into the per-image dir, and on ENOFILE you create new files, sync and then try to hardlink the new files back to all-files. On EEXIST failure, start over with the smaller set of failed files.

s/EEXIST/ENOENT/ right? i.e. on a failure to linkat we want to try writing the file and hardlinking again.

Although actually I think the logic may end up being simpler if for non-existent files we actually create the file in the per-image dir first, and then hardlink to the base files directory, ensuring it always has a link count of 2 to start and won't be concurrently pruned.

That is what I mean with EEXIST. You do what you said, but it could race with someone else; then when you link() it you get EEXIST, so you start over trying to link from shared to per-image.


cgwalters commented on June 11, 2024

Then, to remove an image you delete the image file and the directory, and then you make a pass over the shared dir and delete any files with n_link==1.

I think we can optimize this by scanning the composefs image that we're removing instead, and then only unlinking files in the shared dir whose n_link==1
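That optimization could look like this (a sketch under the same illustrative layout: only the entries the removed image referenced are candidates for pruning, so there is no full scan of the shared dir):

```python
import os
import shutil

def remove_image_fast(root, name):
    """Remove one image, pruning only the backing files it referenced."""
    image_dir = os.path.join(root, "images", name)

    # Collect this image's backing files before deleting anything.
    candidates = []
    for sub in os.listdir(image_dir):
        subdir = os.path.join(image_dir, sub)
        if not os.path.isdir(subdir):
            continue  # skip the image.cfs file itself
        for f in os.listdir(subdir):
            candidates.append(os.path.join(root, "files", sub, f))

    shutil.rmtree(image_dir)  # drops one hard link per backing file

    # Only files this image referenced can have dropped to a single link.
    for path in candidates:
        if os.path.exists(path) and os.stat(path).st_nlink == 1:
            os.unlink(path)
```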


alexlarsson commented on June 11, 2024

But yeah, it would be cool to have a global, namespaced version of this, because then we could easily get sharing between the ostree rootfs and container images.


cgwalters commented on June 11, 2024

The scheme you describe makes sense to me offhand. The simplicity is very appealing: there's no explicit locking (e.g. flock()) and no databases (sqlite, json, etc.). It's basically pushing refcounting down into the kernel inodes, the same as ostree does. However, IME one downside of this is that adding/removing images incurs metadata traffic (i.e. dirties inodes) on the order of the number of files. That's already a cost paid with ostree today, though.

Updating a structure like this can be atomic I believe. You start by hardlinking from the shared dir into the per-image dir, and on ENOFILE you create new files, sync and then try to hardlink the new files back to all-files. On EEXIST failure, start over with the smaller set of failed files.

s/EEXIST/ENOENT/ right? i.e. on a failure to linkat we want to try writing the file and hardlinking again.

Although actually I think the logic may end up being simpler if for non-existent files we actually create the file in the per-image dir first, and then hardlink to the base files directory, ensuring it always has a link count of 2 to start and won't be concurrently pruned.
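The create-first variant could be sketched like this (illustrative Python; FileNotFoundError and FileExistsError are Python's ENOENT and EEXIST, and add_backing_file is a hypothetical helper, not any project's API). Because the file is born in the per-image dir, it always starts with a stable link and reaches the shared dir with a link count of 2, so a concurrent GC pass never sees it at n_link==1:

```python
import os

def add_backing_file(shared_path, image_path, data):
    """Materialize one backing file for an image, create-first variant."""
    while True:
        try:
            # Fast path: content is already in the shared dir; link it in.
            os.link(shared_path, image_path)
            return
        except FileNotFoundError:  # ENOENT: we have to create the content
            pass
        # Create the file in the per-image dir first and sync it.
        with open(image_path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        try:
            # Publish to the shared dir; link count is now 2.
            os.link(image_path, shared_path)
            return
        except FileExistsError:
            # EEXIST: another writer published it first. Drop our copy and
            # start over, linking from shared to per-image.
            os.unlink(image_path)
```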


cgwalters commented on June 11, 2024

And basically once we have this shared scheme, I think we can seamlessly convert an ostree repository into this format (for composefs-only cases). And that then significantly reduces the logic in ostree core and I think simplifies the composefs integration.


alexlarsson commented on June 11, 2024

Parallel to the above, flatpak stores a .ref file for each deploy dir, and whenever we run an app we pass bwrap --lock-file $apppath/.ref --lock-file $runtimepath/.ref, which takes a (shared) read lock on each .ref file. Then we can try to take a write lock on the file to see whether it is in use.

The general approach for remove in flatpak is:

  • atomically move $dir/deploy/foo to $dir/removed/foo
  • Loop over $dir/removed
    • Try to lock $dir/removed/$subdir/.ref
    • If we can lock, remove directory

This way we can atomically remove things, yet still keep running instances.

We can maybe do the same, but just lock the image file.
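The .ref protocol can be sketched with POSIX advisory locks (illustrative Python using flock; in flatpak the shared lock is actually held by bwrap via --lock-file, and the helper names here are hypothetical):

```python
import errno
import fcntl
import os
import shutil

def hold_ref(ref_path):
    """Running instance: hold a shared (read) lock on the .ref file."""
    fd = os.open(ref_path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.flock(fd, fcntl.LOCK_SH)
    return fd  # caller keeps this open; closing it releases the lock

def try_remove(deploy_dir):
    """Remover: delete a dir moved under removed/ only if nobody holds .ref."""
    fd = os.open(os.path.join(deploy_dir, ".ref"), os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # A non-blocking exclusive lock fails while any shared lock is held.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as e:
        os.close(fd)
        if e.errno in (errno.EAGAIN, errno.EACCES):
            return False  # still in use; try again on a later pass
        raise
    shutil.rmtree(deploy_dir)
    os.close(fd)
    return True
```

The atomic rename into removed/ plus the lock check is what lets removal succeed immediately while running instances keep their already-open files.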


cgwalters commented on June 11, 2024

I was thinking about this more today, because we just hit a really bad ostree bug that of course only affected ostree, not rpm and not containers/storage.

In a future where we share tooling between host updates and containers there's much less chance of bugs that affect just one of the two, and we get all the page cache sharing etc.

But...what I really came here to say is that while this all sounds good, so far in many scenarios in FCOS (and generally ostree systems) we've been encouraging people to provision a separate /var mount. There are multiple advantages to this - it more strongly decouples OS updates from "system state". But today that "system state" includes /var/lib/containers/images...

And if we eventually try to do something like upgrading users who are currently using separate ostree and container storage into a more unified model, we now have uncomfortable tradeoffs around disk sizing.

I guess ultimately we'd need to detect this situation when / and /var/lib/containers are separate filesystems and just keep the composefs storage separate going forward. (But, I do think it's likely that we start doing more "system container" type stuff in / again).

EDIT: Hmmm....I guess in theory, nothing stops us from at least doing something like cherry-picking "high value objects to share" (e.g. glibc) and deduping them between the "host object storage" and the "app object storage". Maybe something like just having a plain old symlink from /var/lib/containers/objects/aa/1234.object -> /composefs/objects/aa/1234.object...and then also adding a "placeholder" hardlink image reference to it in the host storage.
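That cherry-picking idea could be sketched like this (illustrative Python; the paths and the share_object helper are hypothetical, not an implemented scheme). The placeholder hard link keeps the host object pinned for the host-side GC, and the rename atomically swaps the app's copy for a symlink:

```python
import os

def share_object(host_obj, app_obj, placeholder_ref):
    """Dedupe one high-value object between host and app storage."""
    # Placeholder "image" reference in host storage: the host object now has
    # n_link >= 2, so host GC won't prune it while the app uses it.
    os.link(host_obj, placeholder_ref)
    # Atomically replace the app storage's copy with a symlink to the
    # host object (rename over the existing file).
    tmp = app_obj + ".tmp"
    os.symlink(host_obj, tmp)
    os.rename(tmp, app_obj)
```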

