google / crfs
CRFS: Container Registry Filesystem
License: BSD 3-Clause "New" or "Revised" License
See https://github.com/alibaba/accelerated-container-image, which works at the block level rather than the filesystem level.
This project hasn't been updated for a while, but we are still ACTIVE!
Currently crfs supports GCR and uses GCR-specific APIs, so we can't use it with a private Docker registry (especially a local, insecure (HTTP) one).
Wouldn't it be great to support private registries, to make it easy to try crfs?
It would also be a good starting point for supporting other OCI (Docker)-compliant registries, since we could focus on the API-related issues and set the auth-related issues aside.
For example:
$ ls /crfs/layers/127.0.0.1:5000/
my
$ ls /crfs/layers/127.0.0.1:5000/my
ubuntu
$ ls /crfs/layers/127.0.0.1:5000/my/ubuntu
18.04 sha256-2bca06c5f3ca2402e6fd5ab82fad0c3d8d6ee18e2def29bcadaae5360d0d43d9
$ ls /crfs/layers/127.0.0.1:5000/my/ubuntu/18.04/
0 sha256-0c0ed20421e1c2fbadc7fb185d4e37348de9b39a390c09957f2b9a6b68bd4785
1 sha256-24e2698eca10208eab4c4dad0dfad485a30c8307902404ffec2da284ae848fb8
2 sha256-2b01b35b83e6609c41f1aac861cd65914934fa503f645ca17c9ebff45907b9c5
3 sha256-646be464f13960b2cd0bf3a741a42f1bf658bee676ffbc49183222bdfb79e249
bottom top
config
$ ls /crfs/layers/127.0.0.1:5000/my/ubuntu/18.04/bottom
bin boot dev etc home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var
I have an idea for a patch implementation on my branch, so if possible I'm willing to contribute.
Notice on image path format:
Currently crfs supports only the <owner>/<image>-styled image path format, and this patch also intends to follow that restriction. Supporting arbitrary image path formats may be future work.
Currently, crfs uses the GCR-specific API for the following purposes. We should use the OCI (Docker)-compliant API instead for private registries, namely:
- GET on /v2/_catalog for the list of repositories (filter the response to allow only <owner>/<image>-styled image paths, then parse it).
- GET on /v2/<owner>/<image>/tags/list for the list of tag names.
- HEAD on /v2/<owner>/<image>/manifests/<tag name> with an Accept: application/vnd.docker.distribution.manifest.v2+json header for the digest of a V2 manifest (written in the Docker-Content-Digest response header).
As in the current code base of crfs, it is better to select the appropriate API scheme based on github.com/google/go-containerregistry/pkg/name.
For example, when we talk to a Docker private registry on localhost, we should use http (not https).
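A minimal sketch of that scheme selection, assuming go-containerregistry's name package behavior (the reference string and program shape are placeholders, not crfs code):

package main

import (
	"fmt"
	"log"

	"github.com/google/go-containerregistry/pkg/name"
)

func main() {
	// Placeholder reference to a local, insecure registry.
	ref, err := name.ParseReference("127.0.0.1:5000/my/ubuntu:18.04", name.Insecure)
	if err != nil {
		log.Fatal(err)
	}
	repo := ref.Context()
	// Scheme() yields "http" for localhost/insecure registries, "https" otherwise.
	fmt.Printf("%s://%s/v2/%s/tags/list\n",
		repo.Registry.Scheme(), repo.RegistryStr(), repo.RepositoryStr())
}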
Some container images use whiteouts to indicate removed entries. But currently, when we use CRFS with overlayfs, these whiteouts don't work and no entry gets removed.
Assume we have the lower layer:
lower/etc
├── group
├── hostname
├── hosts
├── localtime
├── mtab -> /proc/mounts
├── network
│ ├── if-down.d
│ ├── if-post-down.d
│ ├── if-pre-up.d
│ └── if-up.d
├── passwd
├── resolv.conf
└── shadow
And the upper layer includes whiteouts:
upper
└── etc
├── network
│ ├── newfile
│ └── .wh..wh..opq
└── .wh.localtime
According to the "whiteout" definition in the OCI image specification, the merged directory should be the following (compatible with Docker images):
merged/etc
├── group
├── hostname
├── hosts
├── mtab -> /proc/mounts
├── network
│ └── newfile
├── passwd
├── resolv.conf
└── shadow
1 directory, 8 files
But currently CRFS shows these ".wh."-prefixed whiteout files as-is. This doesn't make overlayfs happy, because overlayfs uses a different convention to express whiteouts: a 0:0 character device for a removed entry, and a trusted.overlay.opaque xattr on an opaque directory. So it currently results in the following unexpected merge:
merged/etc
├── group
├── hostname
├── hosts
├── localtime
├── mtab -> /proc/mounts
├── network
│ ├── if-down.d
│ ├── if-post-down.d
│ ├── if-pre-up.d
│ ├── if-up.d
│ ├── newfile
│ └── .wh..wh..opq
├── passwd
├── resolv.conf
├── shadow
└── .wh.localtime
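For reference, a minimal sketch of translating OCI whiteout entries into the overlayfs convention described above (materialize and its arguments are illustrative names, not crfs code):

package whiteout

import (
	"path/filepath"
	"strings"

	"golang.org/x/sys/unix"
)

const (
	whiteoutPrefix = ".wh."
	opaqueMarker   = ".wh..wh..opq"
)

// materialize translates one tar entry name under dir into what overlayfs
// expects: the opaque marker becomes a trusted.overlay.opaque xattr on the
// parent directory, and a .wh.<name> file becomes a 0:0 character device.
func materialize(dir, entry string) error {
	base := filepath.Base(entry)
	switch {
	case base == opaqueMarker:
		parent := filepath.Join(dir, filepath.Dir(entry))
		return unix.Setxattr(parent, "trusted.overlay.opaque", []byte("y"), 0)
	case strings.HasPrefix(base, whiteoutPrefix):
		orig := filepath.Join(dir, filepath.Dir(entry), strings.TrimPrefix(base, whiteoutPrefix))
		return unix.Mknod(orig, unix.S_IFCHR|0o600, int(unix.Mkdev(0, 0)))
	}
	return nil
}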
There are container images which are owner-sensitive, e.g. tomcat:8.5.45-jdk8-openjdk.
But currently, crfs doesn't preserve the owner information of directories.
For example, all of the following directories should be owned by the staff group, and were in the original stargz file. But CRFS overwrites the owner of directories with root(0) while populating them from the TOC JSON information.
$ ls -al /crfs/layers/local/rootfs.stargz/usr/local/openjdk-8
total 51013
-r--r--r-- 1 root staff 1522 Jul 12 02:28 ASSEMBLY_EXCEPTION
drwxr-xr-x 1 root root 0 Aug 31 1754 bin
drwxr-xr-x 1 root root 0 Aug 31 1754 demo
drwxr-xr-x 1 root root 0 Aug 31 1754 include
drwxr-xr-x 1 root root 0 Aug 31 1754 jre
drwxr-xr-x 1 root root 0 Aug 31 1754 lib
-r--r--r-- 1 root staff 19274 Jul 12 02:28 LICENSE
drwxr-xr-x 1 root root 0 Aug 31 1754 man
-rw-rw-r-- 1 root staff 238 Jul 12 02:28 release
drwxr-xr-x 1 root root 0 Aug 31 1754 sample
-rw-rw-r-- 1 root staff 52067487 Jul 12 08:06 src.zip
-r--r--r-- 1 root staff 147535 Jul 12 02:28 THIRD_PARTY_README
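A minimal sketch of the expected behavior, assuming bazil.org/fuse attribute filling and the stargz TOCEntry fields (node and te are illustrative stand-ins for crfs's types, not the actual patch):

package ownerfix

import (
	"context"
	"os"

	"bazil.org/fuse"
	"github.com/google/crfs/stargz"
)

// node is a simplified stand-in for CRFS's directory node; te holds the
// entry parsed from stargz.index.json.
type node struct {
	te *stargz.TOCEntry
}

// Attr propagates the Uid/Gid recorded in the TOC instead of leaving
// directories owned by root(0).
func (n *node) Attr(ctx context.Context, a *fuse.Attr) error {
	a.Mode = os.ModeDir | 0755
	a.Uid = uint32(n.te.Uid) // "uid" from the TOC JSON
	a.Gid = uint32(n.te.Gid) // "gid" from the TOC JSON
	return nil
}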
Howdy!
Just a thought, in case the option wasn't explored or considered:
have you considered using a 9p virtual filesystem implementation instead of FUSE?
I ask because it supports partial file reads, has a driver in the Linux kernel, and is network-mountable, so machines that want to mount a container can just mount <networkip> /containers
or the like, assuming the server is serving the 9p FS on a port.
It may let you cut out the HTTP requests, depending on how easy it is to access containers without going through the registry directly. You'd also only need to install software on the registry server; nothing would be needed client-side, since Linux already has the driver.
Just a thought from taking a glance; my apologies if it's not applicable, and thanks for the project!
It is possible for an image to be stargzified twice during the distribution lifecycle, which currently results in a broken image.
The following is OK:
$ stargzify -insecure ubuntu:18.04 http://private:5000/ubuntu:once
$ docker pull private:5000/ubuntu:once
But when we stargzify twice, it results in a broken image:
$ stargzify -insecure http://private:5000/ubuntu:once http://private:5000/ubuntu:twice
$ docker pull private:5000/ubuntu:twice
failed to register layer: Error processing tar file(duplicates of file paths not supported):
The reason is that the name of the TOC JSON file ("stargz.index.json") isn't reserved, so it is duplicated in the re-stargzified image:
$ curl http://private:5000/v2/ubuntu/blobs/sha256:479846a1cefdb8af9ace78046a2e3a691ccbf8b018b710cb9fcea7fe0593dd97 | tar -z --list
run/
run/systemd/
run/systemd/container
stargz.index.json
stargz.index.json
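A sketch of one possible mitigation (an assumed helper, not the actual crfs patch): drop any pre-existing TOC entry while re-appending the input tar, so the writer's own stargz.index.json is the only one in the output.

package tocfilter

import (
	"archive/tar"
	"io"
)

const tocName = "stargz.index.json"

// filterTOC copies a tar stream from r to w, skipping any pre-existing
// stargz TOC entry so re-stargzifying can't produce duplicate paths.
// The caller is responsible for closing w.
func filterTOC(w *tar.Writer, r *tar.Reader) error {
	for {
		h, err := r.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if h.Name == tocName {
			continue // drop the old TOC; the writer emits a fresh one
		}
		if err := w.WriteHeader(h); err != nil {
			return err
		}
		if _, err := io.Copy(w, r); err != nil {
			return err
		}
	}
}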
When we read big files, their contents are sometimes corrupted.
A minimal example of the issue:
# tail -n 5 /crfs/layers/gcr.io/reasonablek8s/ubuntu/18.04/0/bin/zgrep
escape='
s/'\''/'\''\\'\'''\''/g
$s/$/'\''/
'
opera
The output should be:
# tail -n 5 /crfs/layers/gcr.io/reasonablek8s/ubuntu/18.04/0/bin/zgrep
test 128 -lt $r && exit $r
test "$gzip_status" -eq 0 || test "$gzip_status" -eq 2 || r=2
test $res -lt $r && res=$r
done
exit $res
When we read a big file, the actual reads are split into several chunks. In this situation, a node is requested to read at a specific offset, but CRFS doesn't truncate the unnecessary data before that offset within the fetched chunk.
As a result, big files are corrupted in CRFS.
Proposed fix: after CRFS fetches the first chunk of the required range, truncate the unnecessary data before the offset; a sketch follows.
I'll submit the patch as a PR.
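A minimal sketch of that truncation (readChunkAt and its parameters are illustrative names, not the actual patch):

package chunkfix

// readChunkAt drops the bytes between the chunk's start and the requested
// offset before copying into the destination buffer, so reads that begin
// mid-chunk return the right data.
func readChunkAt(chunk []byte, chunkStart, reqOffset int64, dst []byte) int {
	skip := reqOffset - chunkStart // data before the requested offset
	if skip < 0 || skip >= int64(len(chunk)) {
		return 0
	}
	return copy(dst, chunk[skip:])
}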
Currently we can't deal with hard links.
In the tar file of the ubuntu:18.04 base layer (sha256:6958ba61ef42a87f438b918c94a12af7a64a415a23dbed8e364eb6af9bb0845a), there is a hard link:
hrwxr-xr-x 0/0 0 bin/uncompress ==> bin/gunzip
We can get the original file on CRFS:
$ cat /crfs/layers/gcr.io/reasonablek8s/ubuntu/18.04/0/bin/gunzip
#!/bin/sh
# Uncompress files. This is the inverse of gzip.
# Copyright (C) 2007 Free Software Foundation
...
But the link counts aren't correct.
$ stat --print="%h\n" /crfs/layers/gcr.io/reasonablek8s/ubuntu/18.04/0/bin/gunzip
1
$ stat --print="%h\n" /crfs/layers/gcr.io/reasonablek8s/ubuntu/18.04/0/bin/uncompress
1
And we get an empty file from the hardlink.
$ file /crfs/layers/gcr.io/reasonablek8s/ubuntu/18.04/0/bin/uncompress
/crfs/layers/gcr.io/reasonablek8s/ubuntu/18.04/0/bin/uncompress: empty
Additionally, when we cat it, we get EIO.
There are two causes:
- In TOCEntrys, link counts aren't recorded, and the filesystem returns a static number 1 as the link count.
- During lookups (stargz.Reader.Lookup() and (*stargz.TOCEntry).LookupChild()), the hardlink isn't resolved.
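A minimal sketch of a lookup-side fix, assuming the stargz package's Lookup signature and TOCEntry fields (resolveHardlink is an illustrative helper, not the actual patch):

package hardlinkfix

import "github.com/google/crfs/stargz"

// resolveHardlink chases LinkName when a TOC entry is a hardlink, so both
// names resolve to the same underlying file instead of an empty one.
func resolveHardlink(r *stargz.Reader, e *stargz.TOCEntry) (*stargz.TOCEntry, bool) {
	if e.Type != "hardlink" {
		return e, true
	}
	return r.Lookup(e.LinkName) // LinkName is the link target's path
}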
Stargzifying images on HTTP registries fails when the registry isn't "localhost" (or 127.0.0.1).
We can stargzify images on localhost using HTTP:
$ ./stargzify -upgrade 127.0.0.1:5000/ubuntu:18.04
2019/10/11 20:11:45 No matching credentials were found, falling back on anonymous
2019/10/11 20:11:46 pushed blob: sha256:c2a664698f4653bb4ad606cfeb701ccf07441973588c7294b23aff5e7e57e431
2019/10/11 20:11:46 pushed blob: sha256:5dc26a6e1f6217fa8d2aca4fd0bc60f2b0b7fbf6aac3d1473e948003be0f41ae
2019/10/11 20:11:46 pushed blob: sha256:3650658a2723d53534249a8f7785dbfc3b6e845cbe3b812ac37f5f29f98f2b68
2019/10/11 20:12:04 pushed blob: sha256:7b085330c7bb6994f0e1e720772493240386abe14f02e9aebc1df7dbd7306aa7
2019/10/11 20:12:04 pushed blob: sha256:720c7c23e42df652d62a51e48c9d441e0603a0384290fbe07461e426578f1841
2019/10/11 20:12:05 127.0.0.1:5000/ubuntu:18.04: digest: sha256:28665f72af70a56b0dfccf914c56c0cd7f478761bd527eb75d7b473aacf37ca2 size: 912
But if the registry isn't on localhost, we can't:
$ ./stargzify -upgrade private:5000/ubuntu:18.04
2019/10/11 20:11:37 No matching credentials were found, falling back on anonymous
2019/10/11 20:11:37 Get https://private:5000/v2/: http: server gave HTTP response to HTTPS client
Sometimes, stargzifying an image fails with DIGEST_INVALID.
$ ./stargzify -upgrade 127.0.0.1:5000/ubuntu:18.04
2019/10/13 14:37:24 No matching credentials were found, falling back on anonymous
2019/10/13 14:37:49 DIGEST_INVALID: provided digest did not match uploaded content; map[Digest:sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 Reason:map[]]
Especially when Writes to the digester are slow, we see this failure more frequently. For example, with this artificial delay injected into the digester:
func (d *digester) Write(b []byte) (int, error) {
	time.Sleep(time.Duration(1) * time.Millisecond) // artificial delay to widen the race window
	n, err := d.h.Write(b)
	d.n += int64(n)
	return n, err
}
The method layer.Compressed() calculates the digest (with the Sum() method) immediately after the PipeWriter is unblocked:
func (l *layer) Compressed() (io.ReadCloser, error) {
	pr, pw := io.Pipe()
	// Convert input blob to stargz while computing diffid, digest, and size.
	go func() {
		w := stargz.NewWriter(pw)
		if err := w.AppendTar(l.rc); err != nil {
			...
		if err := w.Close(); err != nil {
			...
		l.digest = &v1.Hash{
			Algorithm: "sha256",
			Hex:       hex.EncodeToString(l.d.h.Sum(nil)),
		}
		...
	}()
	return ioutil.NopCloser(io.TeeReader(pr, l.d)), nil
}
But even when the PipeWriter is unblocked, it isn't guaranteed that the TeeReader has finished writing the data to the digester:
func (t *teeReader) Read(p []byte) (n int, err error) {
	n, err = t.r.Read(p)
	// ***** Region A *****
	if n > 0 {
		if n, err := t.w.Write(p[:n]); err != nil {
			return n, err
		}
	}
	return
}
Here, the TeeReader has the PipeReader as its reader and the digester as its writer.
In Region A above, the write half (the PipeWriter) is unblocked because the read from the PipeReader has completed.
So the goroutine in layer.Compressed() can calculate the digest during Region A, before the following t.w.Write() completes.
This race results in calculating an invalid (incomplete) digest.
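One possible shape of a fix (a sketch under assumptions, not the upstream patch): hash inside the producing goroutine with io.MultiWriter, so the digest is provably complete once the pipe is closed; produce stands in for the stargz conversion.

package digestfix

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
)

// compressedFixed avoids the race described above: every byte written to the
// pipe also reaches the hash before Close unblocks readers, so no TeeReader
// is needed on the consuming side.
func compressedFixed(produce func(io.Writer) error) (io.ReadCloser, <-chan string) {
	pr, pw := io.Pipe()
	done := make(chan string, 1)
	go func() {
		h := sha256.New()
		err := produce(io.MultiWriter(pw, h))
		pw.CloseWithError(err) // nil err behaves like a normal Close
		done <- "sha256:" + hex.EncodeToString(h.Sum(nil))
	}()
	return pr, done
}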
FYI, maybe you have already heard about it before, but this seems similar to using CVMFS for container distribution.
General information on CVMFS:
https://cvmfs.readthedocs.io/en/stable/
https://cernvm.cern.ch/portal/filesystem
https://github.com/cvmfs/cvmfs
Information about loading docker images on demand from CVMFS:
https://cvmfs.readthedocs.io/en/stable/cpt-graphdriver.html
Information about automatically converting container images and publishing them to CVMFS (with DUCC):
https://cvmfs.readthedocs.io/en/stable/cpt-ducc.html
[root@LT crfs]# go build -mod=readonly
go: updates to go.mod needed, disabled by -mod=readonly
We need to update go.mod; some libraries were missing.
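Presumably (an assumption, not verified against this repo) letting the Go tool regenerate the module requirements fixes the build:

$ go mod tidy
$ go build -mod=readonly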
Enabling CRFS layers to be merged using overlayfs would be valuable (I know fuse-overlayfs is currently trying to support CRFS).
But CRFS currently has a cache-related issue that prevents it.
On terminal A, run crfs:
# ./crfs -fuse_debug
On terminal B, mount crfs as overlayfs and access on it every second:
# mkdir merged upper work
# CRFS_PATH=/crfs/layers/gcr.io/reasonablek8s/ubuntu/18.04 ; mount -t overlay overlay -o lowerdir=${CRFS_PATH}/3:${CRFS_PATH}/2:${CRFS_PATH}/1:${CRFS_PATH}/0,upperdir=upper,workdir=work merged
# for i in $(seq 0 100) ; do echo -n "${i}: " && ls $(pwd)/merged/lib ; sleep 1 ; done
Then, on terminal B, we see ESTALE after 1 minute:
0: init lsb systemd terminfo udev x86_64-linux-gnu
1: init lsb systemd terminfo udev x86_64-linux-gnu
2: init lsb systemd terminfo udev x86_64-linux-gnu
...
60: ls: cannot access '/home/kohei/local/experiment/lazypull/crfs/merged/lib': Stale file handle
61: ls: cannot access '/home/kohei/local/experiment/lazypull/crfs/merged/lib': Stale file handle
63: ls: cannot access '/home/kohei/local/experiment/lazypull/crfs/merged/lib': Stale file handle
On terminal A, the kernel seems not to be satisfied with the inode returned by "Lookup":
2019/09/12 12:17:05 fuse debug: <- Lookup [ID=0x31f Node=0x9 Uid=0 Gid=0 Pid=27867] "lib"
2019/09/12 12:17:05 fuse debug: -> [ID=0x31f] Lookup 0x11 gen=4 valid=1m0s attr={valid=720h0m0s ino=824640017696 size=0 mode=drwxr-xr-x}
2019/09/12 12:17:05 fuse debug: <- Forget [ID=0x320 Node=0x11 Uid=0 Gid=0 Pid=0] 1
2019/09/12 12:17:05 fuse debug: -> [ID=0x320] Forget
CRFS generates a different "Node" instance every time "Lookup" is called. This makes bazil/fuse assign different "Node IDs" (used by FUSE) to the same inode on every lookup, even when the lookups point to the same file, because bazil/fuse caches Node IDs keyed by the "Node" instance (not by an inode number or the like). Most of the time (when we don't use overlayfs etc.) this is fine.
However, when dentry cache revalidation is executed and the dentry has expired (by default, after 1 minute in bazil/fuse), FUSE looks the original inode up again; it doesn't allow different Node IDs for the same inode and concludes that the cache is invalid. Unfortunately, overlayfs doesn't tolerate invalidated dentry caches and returns ESTALE.
As a result, CRFS currently can't merge layers with overlayfs.
Proposed fix: cache "node" instances in CRFS once they are looked up, and reuse them when the same name is looked up again, as sketched below.
I'll submit the patch as a PR.
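A minimal sketch of that memoization, assuming bazil.org/fuse/fs interfaces (node and newChild are illustrative stand-ins, not the actual crfs types):

package nodecache

import (
	"context"
	"sync"

	"bazil.org/fuse/fs"
)

// node memoizes child nodes so repeated Lookups of the same name return the
// same fs.Node instance, giving bazil/fuse a stable Node ID per inode.
type node struct {
	mu       sync.Mutex
	children map[string]fs.Node
	newChild func(name string) fs.Node // assumption: injected constructor
}

func (n *node) Lookup(ctx context.Context, name string) (fs.Node, error) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if c, ok := n.children[name]; ok {
		return c, nil // reuse the cached instance; the Node ID stays stable
	}
	c := n.newChild(name)
	if n.children == nil {
		n.children = map[string]fs.Node{}
	}
	n.children[name] = c
	return c, nil
}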
On Unix filesystems, the link count of each directory is determined by the reference relationships among directories: 2 (for "." and the parent's entry) plus one per subdirectory's "..".
But current CRFS doesn't implement this, and all directory link counts are 1.
For example, the link count of the same directory differs between ext4 and CRFS:
$ stat --print="%h\n" ./rootfs/etc/security
6
$ stat --print="%h\n" /crfs/layers/local/rootfs.stargz/etc/security
1
If an image layer includes a non-Unicode file path, stargzify can't process it correctly. Example:
mkdir ./test
touch ./test/$(printf '\xFF')
touch ./test/$(printf '\xAA')
tar -czvf ./test.tar.gz ./test
stargzify file:test.tar.gz file:test.stargz
tar -xzvf test.stargz stargz.index.json
The index file stargz.index.json has two entries with the same name "�":
{
"version": 1,
"entries": [
{
"name": "./test/",
"type": "dir",
"modtime": "2020-09-08T10:33:57Z",
"mode": 509,
"uid": 1000,
"gid": 1000,
"NumLink": 0
},
{
"name": "./test/\ufffd",
"type": "reg",
"modtime": "2020-09-08T10:33:47Z",
"mode": 436,
"uid": 1000,
"gid": 1000,
"NumLink": 0,
"digest": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
},
{
"name": "./test/\ufffd",
"type": "reg",
"modtime": "2020-09-08T10:33:57Z",
"mode": 436,
"uid": 1000,
"gid": 1000,
"NumLink": 0,
"digest": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
}
]
}
Proposal: treat the file path as []byte, like an xattr value, so that we can serialize the path as a base64-encoded string in the JSON index file. But we then need to distinguish between a plain path and a base64-encoded path using an extra entry field like "name_encoded": true; see the sketch below.
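A minimal sketch of that proposal (tocEntry and encodeName are illustrative; "name_encoded" is the proposed field, not an implemented format):

package namefix

import (
	"encoding/base64"
	"unicode/utf8"
)

// tocEntry shows only the fields relevant to the proposal.
type tocEntry struct {
	Name        string `json:"name"`
	NameEncoded bool   `json:"name_encoded,omitempty"`
}

// encodeName treats the raw path as []byte: valid UTF-8 stays plain, anything
// else is base64-encoded and flagged so readers can decode it losslessly.
func encodeName(raw []byte) tocEntry {
	if utf8.Valid(raw) {
		return tocEntry{Name: string(raw)}
	}
	return tocEntry{
		Name:        base64.StdEncoding.EncodeToString(raw),
		NameEncoded: true,
	}
}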
Hey there,
Congrats on the idea for this project. Sounds interesting and useful.
Question: what happens when I have mounted the image and a new digest of the same image tag is then pushed to the registry? Does my mounted image automatically get the changes from the new digest? Does the already-running container stop and restart automatically with the new digest?
Thanks in advance