Comments (14)
I think being able to share between the host and container images is more than a nice-to-have; if we're doing a major architectural rework, I'd call it a requirement, because it makes containerization much closer to zero-cost (e.g. assuming that your glibc is shared between the host and base images).
from composefs.
I've been discussing the backing-files GC issue with @giuseppe quite a bit in the context of containers/storage. And the best approach I've come up with is this:
Suppose you have /composefs/ like above; inside it you would have a layout something like:
├── files
│   ├── 00
│   │   └── 1234.file
│   ├── aa
│   │   └── 5678.file
│   └── bc
│       └── abcd.file
└── images
    ├── foo
    │   ├── image.cfs
    │   ├── 00
    │   │   └── 1234.file
    │   └── aa
    │       └── 5678.file
    └── bar
        ├── image.cfs
        ├── 00
        │   └── 1234.file
        └── bc
            └── abcd.file
So, a shared backing file dir with all files from all images, and then each image has a directory with only the files for that image. However, the backing files would be hardlinked. Then, to remove an image you delete the image file and the directory, and then you make a pass over the shared dir and delete any files with n_link==1.
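A minimal sketch of that removal pass (assuming the files/ + images/ layout shown earlier; function names and error handling are illustrative, not composefs code):

```python
import os
import shutil

def remove_image(storage, image):
    """Delete one image, then prune shared backing files that no
    image references any more (st_nlink == 1 means only the shared
    dir itself still links to the file)."""
    shared_dir = os.path.join(storage, "files")
    image_dir = os.path.join(storage, "images", image)
    # Deleting the per-image dir drops that image's hardlinks.
    shutil.rmtree(image_dir)
    # One pass over the shared dir: anything down to a single link
    # is no longer referenced by any image.
    for root, _dirs, files in os.walk(shared_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.stat(path).st_nlink == 1:
                os.unlink(path)
```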
Updating a structure like this can be atomic I believe. You start by hardlinking from the shared dir into the per-image dir, and on ENOFILE you create new files, sync and then try to hardlink the new files back to all-files. On EEXIST failure, start over with the smaller set of failed files.
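A sketch of what that atomic update path could look like (illustrative only; this uses the variant of creating the file under the per-image dir first, which is also discussed later in the thread, so new files never sit in the shared dir with a link count of 1):

```python
import os

def add_object(shared_dir, image_dir, name, data):
    """Add one content-addressed backing file to an image, sharing
    it with other images via hardlinks into the shared dir.
    (Sketch of the scheme above; names are illustrative.)"""
    shared = os.path.join(shared_dir, name)
    per_image = os.path.join(image_dir, name)
    try:
        # Fast path: object already published; hardlink it into the image.
        os.link(shared, per_image)
        return
    except FileNotFoundError:
        pass
    # Slow path: create the file under the per-image dir first, so it
    # always starts with a stable link and cannot be pruned concurrently.
    with open(per_image, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    try:
        os.link(per_image, shared)  # publish into the shared dir
    except FileExistsError:
        # Lost a race: another writer published the same content.
        # Re-link from the shared copy so the data stays deduplicated.
        os.unlink(per_image)
        os.link(shared, per_image)
```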
from composefs.
Updating a structure like this can be atomic I believe. You start by hardlinking from the shared dir into the per-image dir, and on ENOFILE you create new files, sync and then try to hardlink the new files back to all-files. On EEXIST failure, start over with the smaller set of failed files.
s/EEXIST/ENOENT/ right? i.e. on a failure to linkat we want to try writing the file and hardlinking again.

Although actually I think the logic may end up being simpler if for non-existent files we actually create the file in the per-image dir first, and then hardlink to the base files directory, ensuring it always has a link count of 2 to start and won't be concurrently pruned.
That is what I mean with EEXIST. You do what you said, but it could race with someone else; then when you link() it you get EEXIST, so you start over trying to link from shared to per-image.
from composefs.
Then, to remove an image you delete the image file and the directory, and then you make a pass over the shared dir and delete any files with n_link==1.
I think we can optimize this by scanning just the composefs image that we're removing instead, and then only unlinking entries in the shared dir that end up with n_link==1.
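A sketch of this targeted prune (assuming the per-image dir can serve as the list of objects the image references, rather than parsing the .cfs itself; names are illustrative):

```python
import os
import shutil

def remove_image_targeted(storage, image):
    """Remove an image, then prune only the shared backing files it
    referenced, instead of scanning the entire shared dir."""
    shared_dir = os.path.join(storage, "files")
    image_dir = os.path.join(storage, "images", image)
    # The per-image dir mirrors the image's backing files, so it
    # doubles as the candidate list for pruning.
    candidates = []
    for root, _dirs, files in os.walk(image_dir):
        for name in files:
            if name == "image.cfs":
                continue
            rel = os.path.relpath(os.path.join(root, name), image_dir)
            candidates.append(rel)
    shutil.rmtree(image_dir)
    # Only these files can have dropped to a single link.
    for rel in candidates:
        path = os.path.join(shared_dir, rel)
        try:
            if os.stat(path).st_nlink == 1:
                os.unlink(path)
        except FileNotFoundError:
            pass  # already pruned by a concurrent removal
```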
from composefs.
But yeah, it would be cool with a global, namespaced version of this, because then we can easily get sharing between the ostree rootfs and container images.
from composefs.
The scheme you describe makes sense to me offhand. The simplicity is very appealing; there's no explicit locking (e.g. flock()
) and no databases (sqlite, json, etc.). It's basically pushing refcounting down into the kernel inodes, the same as ostree does. However IME one downside of this is that adding/removing images incurs metadata traffic (i.e. dirties inodes) on the order of number of files. That's already a cost paid with ostree today though.
Updating a structure like this can be atomic I believe. You start by hardlinking from the shared dir into the per-image dir, and on ENOFILE you create new files, sync and then try to hardlink the new files back to all-files. On EEXIST failure, start over with the smaller set of failed files.
s/EEXIST/ENOENT/ right? i.e. on a failure to linkat we want to try writing the file and hardlinking again.
Although actually I think the logic may end up being simpler if for non-existent files we actually create the file in the per-image dir first, and then hardlink to the base files directory, ensuring it always has a link count of 2 to start and won't be concurrently pruned.
from composefs.
And basically once we have this shared scheme, I think we can seamlessly convert an ostree repository into this format (for composefs-only cases). And that then significantly reduces the logic in ostree core and I think simplifies the composefs integration.
from composefs.
Parallel to the above, flatpak stores a .ref file for each deploy dir, and whenever we run an app we pass bwrap --lock-file $apppath/.ref --lock-file $runtimepath/.ref, which takes a (shared) read lock on the .ref files. Then we can try to take a write lock on a file to see if it is in use.
The general approach for remove in flatpak is:
- atomically move $dir/deploy/foo to $dir/removed/foo
- Loop over $dir/removed
  - Try to lock $dir/removed/$subdir/.ref
  - If we can lock, remove the directory
This way we can atomically remove things, yet still keep running instances.
We can maybe do the same, but just lock the image file.
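A sketch of that locking scheme with flock() (illustrative; the function names and layout are made up here, not flatpak's actual code):

```python
import fcntl
import os
import shutil

def hold_ref(deploy_dir):
    """Running instance: take a shared (read) lock on the .ref file.
    The fd must stay open for the lifetime of the instance."""
    fd = os.open(os.path.join(deploy_dir, ".ref"), os.O_RDONLY)
    fcntl.flock(fd, fcntl.LOCK_SH)
    return fd

def remove_deploy(base, name):
    """Atomically retire a deploy dir, then reap unused ones."""
    removed = os.path.join(base, "removed")
    os.makedirs(removed, exist_ok=True)
    # Step 1: atomic rename out of the active namespace; running
    # instances keep working because their locks follow the inode.
    os.rename(os.path.join(base, "deploy", name),
              os.path.join(removed, name))
    # Step 2: reap anything in removed/ that is no longer locked.
    for sub in os.listdir(removed):
        subdir = os.path.join(removed, sub)
        fd = os.open(os.path.join(subdir, ".ref"), os.O_RDONLY)
        try:
            # An exclusive lock succeeds only if no instance still
            # holds the shared lock, i.e. nothing is running.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os.close(fd)
            continue  # still in use; leave it for a later pass
        shutil.rmtree(subdir)
        os.close(fd)
```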
from composefs.
I was thinking about this more today, because we just hit a really bad ostree bug that of course only affected ostree, not rpm and not containers/storage.
In a future where we share tooling between host updates and containers there's much less chance of bugs that affect just one of the two, and we get all the page cache sharing etc.
But... what I really came here to say is that while this all sounds good, so far in many scenarios in FCOS (and generally ostree systems) we've been encouraging people to provision a separate /var mount. There are multiple advantages to this: it more strongly decouples OS updates from "system state". But today that "system state" includes /var/lib/containers/images
...
And if we eventually try to do something like upgrading users who are currently using separate ostree and container storage into a more unified model, we now have uncomfortable tradeoffs around disk sizing.
I guess ultimately we'd need to detect this situation when / and /var/lib/containers are separate filesystems and just keep the composefs storage separate going forward. (But, I do think it's likely that we start doing more "system container" type stuff in / again).
EDIT: Hmmm... I guess in theory, nothing stops us from at least doing something like cherry-picking "high value objects to share" (e.g. glibc) and deduping them between the "host object storage" and the "app object storage". Maybe something like just having a plain old symlink from /var/lib/containers/objects/aa/1234.object -> /composefs/objects/aa/1234.object ...and then also adding a "placeholder" hardlink image reference to it in the host storage.
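A sketch of what that cherry-picked dedup could look like (speculative; the .appref placeholder name and layout are invented here to illustrate the symlink-plus-placeholder-hardlink idea):

```python
import os

def dedup_object(host_objects, app_objects, relpath):
    """Replace an app-storage object with a symlink to the host copy,
    keeping a placeholder hardlink so host GC still sees a reference."""
    host_obj = os.path.join(host_objects, relpath)
    app_obj = os.path.join(app_objects, relpath)
    # Placeholder hardlink in host storage pins the object against
    # the n_link-based pruning pass.
    os.link(host_obj, host_obj + ".appref")
    # Swap the app copy for a symlink to the shared host object.
    tmp = app_obj + ".tmp"
    os.symlink(host_obj, tmp)
    os.replace(tmp, app_obj)  # atomic rename over the old copy
```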
from composefs.