google / crfs
CRFS: Container Registry Filesystem
License: BSD 3-Clause "New" or "Revised" License
See https://github.com/alibaba/accelerated-container-image, which works at the block level rather than the filesystem level.
This project hasn't been updated for a while, but we are still ACTIVE!
Currently crfs supports GCR and uses GCR-specific APIs, so we can't use it with a private Docker registry (especially a local, insecure (HTTP) one).
Wouldn't it be great to support private registries, to make it easy to try crfs?
It would also be a good starting point for supporting other OCI (Docker)-compliant registries, since we could focus on the API-related issues and set the auth-related issues aside.
For example:
$ ls /crfs/layers/127.0.0.1:5000/
my
$ ls /crfs/layers/127.0.0.1:5000/my
ubuntu
$ ls /crfs/layers/127.0.0.1:5000/my/ubuntu
18.04 sha256-2bca06c5f3ca2402e6fd5ab82fad0c3d8d6ee18e2def29bcadaae5360d0d43d9
$ ls /crfs/layers/127.0.0.1:5000/my/ubuntu/18.04/
0 sha256-0c0ed20421e1c2fbadc7fb185d4e37348de9b39a390c09957f2b9a6b68bd4785
1 sha256-24e2698eca10208eab4c4dad0dfad485a30c8307902404ffec2da284ae848fb8
2 sha256-2b01b35b83e6609c41f1aac861cd65914934fa503f645ca17c9ebff45907b9c5
3 sha256-646be464f13960b2cd0bf3a741a42f1bf658bee676ffbc49183222bdfb79e249
bottom top
config
$ ls /crfs/layers/127.0.0.1:5000/my/ubuntu/18.04/bottom
bin boot dev etc home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var
I have an idea for a patch implementation on my branch, so if possible I'm willing to contribute.
Notice on image path format:
Currently crfs supports only the <owner>/<image>-styled image path format, and this patch also intends to follow that restriction. Supporting arbitrary image path formats may be future work.
Currently, crfs uses the GCR-specific API for the following purposes. We should use the OCI (Docker)-compliant API instead for private registries, namely:
- GET on /v2/_catalog for the list of repositories (filter the response to allow only <owner>/<image>-styled image paths, then parse it).
- GET on /v2/<owner>/<image>/tags/list for the list of tag names.
- HEAD on /v2/<owner>/<image>/manifests/<tag name> with an Accept: application/vnd.docker.distribution.manifest.v2+json header for the digest of a V2 manifest (written in the Docker-Content-Digest response header).
As in the current code base of crfs, it is better to select the appropriate API scheme based on github.com/google/go-containerregistry/pkg/name.
For example, when we talk to a Docker private registry on localhost, we should use http (not https).
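A minimal sketch of that scheme selection, assuming go-containerregistry's name package behavior (the reference string and program shape are placeholders, not crfs code):

package main

import (
	"fmt"
	"log"

	"github.com/google/go-containerregistry/pkg/name"
)

func main() {
	// Placeholder reference to a local, insecure registry.
	ref, err := name.ParseReference("127.0.0.1:5000/my/ubuntu:18.04", name.Insecure)
	if err != nil {
		log.Fatal(err)
	}
	repo := ref.Context()
	// Scheme() yields "http" for localhost/insecure registries, "https" otherwise.
	fmt.Printf("%s://%s/v2/%s/tags/list\n",
		repo.Registry.Scheme(), repo.RegistryStr(), repo.RepositoryStr())
}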
Some container images use whiteouts to indicate removed entries. But currently, when we use CRFS with overlayfs, these whiteouts don't work and no entry gets removed.
Assume we have the lower layer:
lower/etc
├── group
├── hostname
├── hosts
├── localtime
├── mtab -> /proc/mounts
├── network
│ ├── if-down.d
│ ├── if-post-down.d
│ ├── if-pre-up.d
│ └── if-up.d
├── passwd
├── resolv.conf
└── shadow
And the upper layer includes whiteouts:
upper
└── etc
├── network
│ ├── newfile
│ └── .wh..wh..opq
└── .wh.localtime
According to the "whiteout" definition in the OCI image specification, the merged directory should be the following (compatible with Docker images):
merged/etc
├── group
├── hostname
├── hosts
├── mtab -> /proc/mounts
├── network
│ └── newfile
├── passwd
├── resolv.conf
└── shadow
1 directory, 8 files
But currently CRFS shows these ".wh."-prefixed whiteout files as-is. This doesn't make overlayfs happy, because overlayfs uses a different convention to express whiteouts: a 0:0 character device for a removed entry, and a trusted.overlay.opaque xattr on an opaque directory. So it currently results in the following unexpected merge:
merged/etc
├── group
├── hostname
├── hosts
├── localtime
├── mtab -> /proc/mounts
├── network
│ ├── if-down.d
│ ├── if-post-down.d
│ ├── if-pre-up.d
│ ├── if-up.d
│ ├── newfile
│ └── .wh..wh..opq
├── passwd
├── resolv.conf
├── shadow
└── .wh.localtime
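For reference, a minimal sketch of translating OCI whiteout entries into the overlayfs convention described above (materialize and its arguments are illustrative names, not crfs code):

package whiteout

import (
	"path/filepath"
	"strings"

	"golang.org/x/sys/unix"
)

const (
	whiteoutPrefix = ".wh."
	opaqueMarker   = ".wh..wh..opq"
)

// materialize translates one tar entry name under dir into what overlayfs
// expects: the opaque marker becomes a trusted.overlay.opaque xattr on the
// parent directory, and a .wh.<name> file becomes a 0:0 character device.
func materialize(dir, entry string) error {
	base := filepath.Base(entry)
	switch {
	case base == opaqueMarker:
		parent := filepath.Join(dir, filepath.Dir(entry))
		return unix.Setxattr(parent, "trusted.overlay.opaque", []byte("y"), 0)
	case strings.HasPrefix(base, whiteoutPrefix):
		orig := filepath.Join(dir, filepath.Dir(entry), strings.TrimPrefix(base, whiteoutPrefix))
		return unix.Mknod(orig, unix.S_IFCHR|0o600, int(unix.Mkdev(0, 0)))
	}
	return nil
}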
There are container images which are owner-sensitive, e.g. tomcat:8.5.45-jdk8-openjdk.
But currently, crfs doesn't preserve the owner information of directories.
For example, all of the following directories should be owned by the staff group, and were in the original stargz file. But CRFS overwrites the owner of directories with root(0) while populating them from the TOC JSON information.
$ ls -al /crfs/layers/local/rootfs.stargz/usr/local/openjdk-8
total 51013
-r--r--r-- 1 root staff 1522 Jul 12 02:28 ASSEMBLY_EXCEPTION
drwxr-xr-x 1 root root 0 Aug 31 1754 bin
drwxr-xr-x 1 root root 0 Aug 31 1754 demo
drwxr-xr-x 1 root root 0 Aug 31 1754 include
drwxr-xr-x 1 root root 0 Aug 31 1754 jre
drwxr-xr-x 1 root root 0 Aug 31 1754 lib
-r--r--r-- 1 root staff 19274 Jul 12 02:28 LICENSE
drwxr-xr-x 1 root root 0 Aug 31 1754 man
-rw-rw-r-- 1 root staff 238 Jul 12 02:28 release
drwxr-xr-x 1 root root 0 Aug 31 1754 sample
-rw-rw-r-- 1 root staff 52067487 Jul 12 08:06 src.zip
-r--r--r-- 1 root staff 147535 Jul 12 02:28 THIRD_PARTY_README
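A minimal sketch of the expected behavior, assuming bazil.org/fuse attribute filling and the stargz TOCEntry fields (node and te are illustrative stand-ins for crfs's types, not the actual patch):

package ownerfix

import (
	"context"
	"os"

	"bazil.org/fuse"
	"github.com/google/crfs/stargz"
)

// node is a simplified stand-in for CRFS's directory node; te holds the
// entry parsed from stargz.index.json.
type node struct {
	te *stargz.TOCEntry
}

// Attr propagates the Uid/Gid recorded in the TOC instead of leaving
// directories owned by root(0).
func (n *node) Attr(ctx context.Context, a *fuse.Attr) error {
	a.Mode = os.ModeDir | 0755
	a.Uid = uint32(n.te.Uid) // "uid" from the TOC JSON
	a.Gid = uint32(n.te.Gid) // "gid" from the TOC JSON
	return nil
}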
Howdy!
Just a thought, in case the option wasn't explored or considered:
have you considered using a 9p virtual filesystem implementation instead of FUSE?
I ask because it supports partial file reads, has a driver in the Linux kernel, and is network-mountable, so machines that want to mount a container can just mount <networkip> /containers
or the like, assuming the server is serving the 9p FS on a port.
It may let you cut out the HTTP requests, depending on how easy it is to access containers without going through the registry directly. You'd also only need to install software on the registry server; nothing would be needed client-side, since Linux already has the driver.
Just a thought from taking a glance; my apologies if it's not applicable, and thanks for the project!
It is possible for an image to be stargzified twice during the distribution lifecycle, which currently results in a broken image.
The following is OK:
$ stargzify -insecure ubuntu:18.04 http://private:5000/ubuntu:once
$ docker pull private:5000/ubuntu:once
But when we stargzify twice, it results in a broken image:
$ stargzify -insecure http://private:5000/ubuntu:once http://private:5000/ubuntu:twice
$ docker pull private:5000/ubuntu:twice
failed to register layer: Error processing tar file(duplicates of file paths not supported):
The reason is that the name of the TOC JSON file ("stargz.index.json") isn't reserved, so it is duplicated in the re-stargzified image:
$ curl http://private:5000/v2/ubuntu/blobs/sha256:479846a1cefdb8af9ace78046a2e3a691ccbf8b018b710cb9fcea7fe0593dd97 | tar -z --list
run/
run/systemd/
run/systemd/container
stargz.index.json
stargz.index.json
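A sketch of one possible mitigation (an assumed helper, not the actual crfs patch): drop any pre-existing TOC entry while re-appending the input tar, so the writer's own stargz.index.json is the only one in the output.

package tocfilter

import (
	"archive/tar"
	"io"
)

const tocName = "stargz.index.json"

// filterTOC copies a tar stream from r to w, skipping any pre-existing
// stargz TOC entry so re-stargzifying can't produce duplicate paths.
// The caller is responsible for closing w.
func filterTOC(w *tar.Writer, r *tar.Reader) error {
	for {
		h, err := r.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if h.Name == tocName {
			continue // drop the old TOC; the writer emits a fresh one
		}
		if err := w.WriteHeader(h); err != nil {
			return err
		}
		if _, err := io.Copy(w, r); err != nil {
			return err
		}
	}
}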
When we read big files, their contents are sometimes corrupted.
A minimal example of the issue:
# tail -n 5 /crfs/layers/gcr.io/reasonablek8s/ubuntu/18.04/0/bin/zgrep
escape='
s/'\''/'\''\\'\'''\''/g
$s/$/'\''/
'
opera
The output should be:
# tail -n 5 /crfs/layers/gcr.io/reasonablek8s/ubuntu/18.04/0/bin/zgrep
test 128 -lt $r && exit $r
test "$gzip_status" -eq 0 || test "$gzip_status" -eq 2 || r=2
test $res -lt $r && res=$r
done
exit $res
When we read a big file, the actual reads are split into several chunks. In this situation, a node is requested to read at a specific offset, but CRFS doesn't truncate the unnecessary data before that offset within the fetched chunk.
As a result, big files are corrupted in CRFS.
Proposed fix: after CRFS fetches the first chunk of the required range, truncate the unnecessary data before the offset; a sketch follows.
I'll submit the patch as a PR.
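A minimal sketch of that truncation (readChunkAt and its parameters are illustrative names, not the actual patch):

package chunkfix

// readChunkAt drops the bytes between the chunk's start and the requested
// offset before copying into the destination buffer, so reads that begin
// mid-chunk return the right data.
func readChunkAt(chunk []byte, chunkStart, reqOffset int64, dst []byte) int {
	skip := reqOffset - chunkStart // data before the requested offset
	if skip < 0 || skip >= int64(len(chunk)) {
		return 0
	}
	return copy(dst, chunk[skip:])
}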
Currently we can't deal with hard links.
In the tar file of the ubuntu:18.04 base layer (sha256:6958ba61ef42a87f438b918c94a12af7a64a415a23dbed8e364eb6af9bb0845a), there is a hard link:
hrwxr-xr-x 0/0 0 bin/uncompress ==> bin/gunzip
We can get the original file on CRFS:
$ cat /crfs/layers/gcr.io/reasonablek8s/ubuntu/18.04/0/bin/gunzip
#!/bin/sh
# Uncompress files. This is the inverse of gzip.
# Copyright (C) 2007 Free Software Foundation
...
But the link counts aren't correct.
$ stat --print="%h\n" /crfs/layers/gcr.io/reasonablek8s/ubuntu/18.04/0/bin/gunzip
1
$ stat --print="%h\n" /crfs/layers/gcr.io/reasonablek8s/ubuntu/18.04/0/bin/uncompress
1
And we get an empty file from the hardlink.
$ file /crfs/layers/gcr.io/reasonablek8s/ubuntu/18.04/0/bin/uncompress
/crfs/layers/gcr.io/reasonablek8s/ubuntu/18.04/0/bin/uncompress: empty
Additionally, when we cat it, we get EIO.
There are two causes:
- In TOCEntrys, link counts aren't recorded, and the filesystem returns a static number 1 as the link count.
- During lookups (stargz.Reader.Lookup() and (*stargz.TOCEntry).LookupChild()), the hardlink isn't resolved.
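A minimal sketch of a lookup-side fix, assuming the stargz package's Lookup signature and TOCEntry fields (resolveHardlink is an illustrative helper, not the actual patch):

package hardlinkfix

import "github.com/google/crfs/stargz"

// resolveHardlink chases LinkName when a TOC entry is a hardlink, so both
// names resolve to the same underlying file instead of an empty one.
func resolveHardlink(r *stargz.Reader, e *stargz.TOCEntry) (*stargz.TOCEntry, bool) {
	if e.Type != "hardlink" {
		return e, true
	}
	return r.Lookup(e.LinkName) // LinkName is the link target's path
}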
Stargzifying images on HTTP registries fails when the registry isn't "localhost" (or 127.0.0.1).
We can stargzify images on localhost using HTTP:
$ ./stargzify -upgrade 127.0.0.1:5000/ubuntu:18.04
2019/10/11 20:11:45 No matching credentials were found, falling back on anonymous
2019/10/11 20:11:46 pushed blob: sha256:c2a664698f4653bb4ad606cfeb701ccf07441973588c7294b23aff5e7e57e431
2019/10/11 20:11:46 pushed blob: sha256:5dc26a6e1f6217fa8d2aca4fd0bc60f2b0b7fbf6aac3d1473e948003be0f41ae
2019/10/11 20:11:46 pushed blob: sha256:3650658a2723d53534249a8f7785dbfc3b6e845cbe3b812ac37f5f29f98f2b68
2019/10/11 20:12:04 pushed blob: sha256:7b085330c7bb6994f0e1e720772493240386abe14f02e9aebc1df7dbd7306aa7
2019/10/11 20:12:04 pushed blob: sha256:720c7c23e42df652d62a51e48c9d441e0603a0384290fbe07461e426578f1841
2019/10/11 20:12:05 127.0.0.1:5000/ubuntu:18.04: digest: sha256:28665f72af70a56b0dfccf914c56c0cd7f478761bd527eb75d7b473aacf37ca2 size: 912
But if the registry isn't on localhost, we can't:
$ ./stargzify -upgrade private:5000/ubuntu:18.04
2019/10/11 20:11:37 No matching credentials were found, falling back on anonymous
2019/10/11 20:11:37 Get https://private:5000/v2/: http: server gave HTTP response to HTTPS client
Sometimes, stargzifying an image fails with DIGEST_INVALID.
$ ./stargzify -upgrade 127.0.0.1:5000/ubuntu:18.04
2019/10/13 14:37:24 No matching credentials were found, falling back on anonymous
2019/10/13 14:37:49 DIGEST_INVALID: provided digest did not match uploaded content; map[Digest:sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 Reason:map[]]
Especially when Writes to the digester are slow, we see this failure more frequently. For example, with this artificial delay injected into the digester:
func (d *digester) Write(b []byte) (int, error) {
	time.Sleep(time.Duration(1) * time.Millisecond) // artificial delay to widen the race window
	n, err := d.h.Write(b)
	d.n += int64(n)
	return n, err
}
The method layer.Compressed() calculates the digest (with the Sum() method) immediately after the PipeWriter is unblocked:
func (l *layer) Compressed() (io.ReadCloser, error) {
	pr, pw := io.Pipe()
	// Convert input blob to stargz while computing diffid, digest, and size.
	go func() {
		w := stargz.NewWriter(pw)
		if err := w.AppendTar(l.rc); err != nil {
			...
		if err := w.Close(); err != nil {
			...
		l.digest = &v1.Hash{
			Algorithm: "sha256",
			Hex:       hex.EncodeToString(l.d.h.Sum(nil)),
		}
		...
	}()
	return ioutil.NopCloser(io.TeeReader(pr, l.d)), nil
}
But even when the PipeWriter is unblocked, it isn't guaranteed that the TeeReader has finished writing the data to the digester:
func (t *teeReader) Read(p []byte) (n int, err error) {
	n, err = t.r.Read(p)
	// ***** Region A *****
	if n > 0 {
		if n, err := t.w.Write(p[:n]); err != nil {
			return n, err
		}
	}
	return
}
Here, the TeeReader has the PipeReader as its reader and the digester as its writer.
In Region A above, the write half (the PipeWriter) is unblocked because the read from the PipeReader has completed.
So the goroutine in layer.Compressed() can calculate the digest during Region A, before the following t.w.Write() completes.
This race results in calculating an invalid (incomplete) digest.
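One possible shape of a fix (a sketch under assumptions, not the upstream patch): hash inside the producing goroutine with io.MultiWriter, so the digest is provably complete once the pipe is closed; produce stands in for the stargz conversion.

package digestfix

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
)

// compressedFixed avoids the race described above: every byte written to the
// pipe also reaches the hash before Close unblocks readers, so no TeeReader
// is needed on the consuming side.
func compressedFixed(produce func(io.Writer) error) (io.ReadCloser, <-chan string) {
	pr, pw := io.Pipe()
	done := make(chan string, 1)
	go func() {
		h := sha256.New()
		err := produce(io.MultiWriter(pw, h))
		pw.CloseWithError(err) // nil err behaves like a normal Close
		done <- "sha256:" + hex.EncodeToString(h.Sum(nil))
	}()
	return pr, done
}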
FYI, maybe you have already heard about it before, but this seems similar to using CVMFS for container distribution.
General information on CVMFS:
https://cvmfs.readthedocs.io/en/stable/
https://cernvm.cern.ch/portal/filesystem
https://github.com/cvmfs/cvmfs
Information about loading docker images on demand from CVMFS:
https://cvmfs.readthedocs.io/en/stable/cpt-graphdriver.html
Information about automatically converting container images and publishing them to CVMFS (with DUCC):
https://cvmfs.readthedocs.io/en/stable/cpt-ducc.html
[root@LT crfs]# go build -mod=readonly
go: updates to go.mod needed, disabled by -mod=readonly
We need to update go.mod; some libraries were missing.
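Presumably (an assumption, not verified against this repo) letting the Go tool regenerate the module requirements fixes the build:

$ go mod tidy
$ go build -mod=readonly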
Enabling CRFS layers to be merged using overlayfs would be valuable (I know fuse-overlayfs is currently trying to support CRFS).
But CRFS currently has a cache-related issue that prevents it.
On terminal A, run crfs:
# ./crfs -fuse_debug
On terminal B, mount crfs as overlayfs and access on it every second:
# mkdir merged upper work
# CRFS_PATH=/crfs/layers/gcr.io/reasonablek8s/ubuntu/18.04 ; mount -t overlay overlay -o lowerdir=${CRFS_PATH}/3:${CRFS_PATH}/2:${CRFS_PATH}/1:${CRFS_PATH}/0,upperdir=upper,workdir=work merged
# for i in $(seq 0 100) ; do echo -n "${i}: " && ls $(pwd)/merged/lib ; sleep 1 ; done
Then, on terminal B, we see ESTALE after 1 minute:
0: init lsb systemd terminfo udev x86_64-linux-gnu
1: init lsb systemd terminfo udev x86_64-linux-gnu
2: init lsb systemd terminfo udev x86_64-linux-gnu
...
60: ls: cannot access '/home/kohei/local/experiment/lazypull/crfs/merged/lib': Stale file handle
61: ls: cannot access '/home/kohei/local/experiment/lazypull/crfs/merged/lib': Stale file handle
63: ls: cannot access '/home/kohei/local/experiment/lazypull/crfs/merged/lib': Stale file handle
On terminal A, the kernel seems not to be satisfied with the inode returned by "Lookup":
2019/09/12 12:17:05 fuse debug: <- Lookup [ID=0x31f Node=0x9 Uid=0 Gid=0 Pid=27867] "lib"
2019/09/12 12:17:05 fuse debug: -> [ID=0x31f] Lookup 0x11 gen=4 valid=1m0s attr={valid=720h0m0s ino=824640017696 size=0 mode=drwxr-xr-x}
2019/09/12 12:17:05 fuse debug: <- Forget [ID=0x320 Node=0x11 Uid=0 Gid=0 Pid=0] 1
2019/09/12 12:17:05 fuse debug: -> [ID=0x320] Forget
CRFS generates a different "Node" instance every time "Lookup" is called. This makes bazil/fuse assign different "Node IDs" (used by FUSE) to the same inode on every lookup, even when the lookups point to the same file, because bazil/fuse caches Node IDs keyed by the "Node" instance (not by an inode number or the like). Most of the time (when we don't use overlayfs etc.) this is fine.
However, when dentry cache revalidation is executed and the dentry has expired (by default, after 1 minute in bazil/fuse), FUSE looks the original inode up again; it doesn't allow different Node IDs for the same inode and concludes that the cache is invalid. Unfortunately, overlayfs doesn't tolerate invalidated dentry caches and returns ESTALE.
As a result, CRFS currently can't merge layers with overlayfs.
Proposed fix: cache "node" instances in CRFS once they are looked up, and reuse them when the same name is looked up again, as sketched below.
I'll submit the patch as a PR.
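A minimal sketch of that memoization, assuming bazil.org/fuse/fs interfaces (node and newChild are illustrative stand-ins, not the actual crfs types):

package nodecache

import (
	"context"
	"sync"

	"bazil.org/fuse/fs"
)

// node memoizes child nodes so repeated Lookups of the same name return the
// same fs.Node instance, giving bazil/fuse a stable Node ID per inode.
type node struct {
	mu       sync.Mutex
	children map[string]fs.Node
	newChild func(name string) fs.Node // assumption: injected constructor
}

func (n *node) Lookup(ctx context.Context, name string) (fs.Node, error) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if c, ok := n.children[name]; ok {
		return c, nil // reuse the cached instance; the Node ID stays stable
	}
	c := n.newChild(name)
	if n.children == nil {
		n.children = map[string]fs.Node{}
	}
	n.children[name] = c
	return c, nil
}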
On Unix filesystems, the link count of each directory is determined by the reference relationships among directories: 2 (for "." and the parent's entry) plus one per subdirectory's "..".
But current CRFS doesn't implement this, and all directory link counts are 1.
For example, the link count of the same directory differs between ext4 and CRFS:
$ stat --print="%h\n" ./rootfs/etc/security
6
$ stat --print="%h\n" /crfs/layers/local/rootfs.stargz/etc/security
1
If an image layer includes a non-Unicode file path, stargzify can't process it correctly. Example:
mkdir ./test
touch ./test/$(printf '\xFF')
touch ./test/$(printf '\xAA')
tar -czvf ./test.tar.gz ./test
stargzify file:test.tar.gz file:test.stargz
tar -xzvf test.stargz stargz.index.json
The index file stargz.index.json has two entries with the same name "�":
{
"version": 1,
"entries": [
{
"name": "./test/",
"type": "dir",
"modtime": "2020-09-08T10:33:57Z",
"mode": 509,
"uid": 1000,
"gid": 1000,
"NumLink": 0
},
{
"name": "./test/\ufffd",
"type": "reg",
"modtime": "2020-09-08T10:33:47Z",
"mode": 436,
"uid": 1000,
"gid": 1000,
"NumLink": 0,
"digest": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
},
{
"name": "./test/\ufffd",
"type": "reg",
"modtime": "2020-09-08T10:33:57Z",
"mode": 436,
"uid": 1000,
"gid": 1000,
"NumLink": 0,
"digest": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
}
]
}
Proposal: treat the file path as []byte, like an xattr value, so that we can serialize the path as a base64-encoded string in the JSON index file. But we then need to distinguish between a plain path and a base64-encoded path using an extra entry field like "name_encoded": true; see the sketch below.
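A minimal sketch of that proposal (tocEntry and encodeName are illustrative; "name_encoded" is the proposed field, not an implemented format):

package namefix

import (
	"encoding/base64"
	"unicode/utf8"
)

// tocEntry shows only the fields relevant to the proposal.
type tocEntry struct {
	Name        string `json:"name"`
	NameEncoded bool   `json:"name_encoded,omitempty"`
}

// encodeName treats the raw path as []byte: valid UTF-8 stays plain, anything
// else is base64-encoded and flagged so readers can decode it losslessly.
func encodeName(raw []byte) tocEntry {
	if utf8.Valid(raw) {
		return tocEntry{Name: string(raw)}
	}
	return tocEntry{
		Name:        base64.StdEncoding.EncodeToString(raw),
		NameEncoded: true,
	}
}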
Hey there,
Congrats on the idea for this project. Sounds interesting and useful.
Question: what happens when I have mounted the image and a new digest of the same image tag is then pushed to the registry? Does my mounted image automatically get the changes from the new digest? Does the already-running container stop and restart automatically with the new digest?
Thanks in advance