
fluid-cloudnative / fluid
1.6K stars · 31 watchers · 955 forks · 48.45 MB

Fluid, elastic data abstraction and acceleration for BigData/AI applications in cloud. (Project under CNCF)

Home Page: https://fluid-cloudnative.github.io/

License: Apache License 2.0

Dockerfile 0.07% Makefile 0.25% Go 95.01% Smarty 0.09% Shell 1.08% Mustache 0.81% Python 2.69%
data-abstraction kubernetes big-data ai-framework alluxio distributed-cache

fluid's People

Contributors

abowloflrf, allenhaozi, baowj-678, billychen1, chenxiaofei-cxf, cheyang, daomin885, dashanji, dependabot[bot], fengshunli, frankleaf, hahchenchen, iluoeli, ldawns, littletiger123, myccccccc, ronggu, ssz1997, trafalgarzzz, uniqueni, wang-mask, wangshli, xiao-hou, xieydd, xliuqq, yangjun289519474, yangyuliufeng, zhang-x-z, zhongweichang001, zwwhdls


fluid's Issues

[BUG] The runtime master pod is not restricted to the node where the hostPath is located

What is your environment (Kubernetes version, Fluid version, etc.)

Describe the bug
When I try to accelerate a hostPath located on one of my Kubernetes cluster nodes (a new feature introduced in Fluid v0.3.0), I set the correct nodeAffinity in dataset.yaml to tell Fluid which node my hostPath is on, but Fluid does not apply that nodeAffinity to the runtime master pod. As a result, Fluid is likely to create the master pod on an unexpected node, and the PVC created by Fluid ends up empty.
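
For illustration, here is a minimal Go sketch of the expected fix, assuming the master pod template can simply inherit the dataset's required node selector (the corev1 types are real; the Fluid-side wiring is an assumption):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Stand-in for the node selector the user writes in dataset.yaml under
	// spec.nodeAffinity (the exact Fluid-side field shape is an assumption).
	required := &corev1.NodeSelector{
		NodeSelectorTerms: []corev1.NodeSelectorTerm{{
			MatchExpressions: []corev1.NodeSelectorRequirement{{
				Key:      "kubernetes.io/hostname",
				Operator: corev1.NodeSelectorOpIn,
				Values:   []string{"node-with-hostpath"},
			}},
		}},
	}

	// Expected behavior: the runtime master pod template inherits the same
	// requirement instead of being scheduled unconstrained.
	masterPodSpec := corev1.PodSpec{
		Affinity: &corev1.Affinity{
			NodeAffinity: &corev1.NodeAffinity{
				RequiredDuringSchedulingIgnoredDuringExecution: required,
			},
		},
	}
	fmt.Printf("master affinity: %+v\n", masterPodSpec.Affinity.NodeAffinity)
}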

What you expect to happen:

How to reproduce it

Additional Information

[BUG] Assignment to entry in nil map if no properties set in AlluxioRuntime

What is your environment (Kubernetes version, Fluid version, etc.)

  • Fluid 0.2.0
  • Kubernetes 1.16.9
  • Go 1.13.9
  • Linux/amd64

Describe the bug

I get an assignment to entry in nil map error if I set no properties in runtime.yaml, like this:

apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: hbase
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi
        high: "0.95"
        low: "0.7"
        storageType: Memory
#  properties:
#    alluxio.user.file.writetype.default: MUST_CACHE
#    alluxio.master.journal.folder: /journal
#    alluxio.master.journal.type: UFS
#    alluxio.user.block.size.bytes.default: 256MB
#    alluxio.user.streaming.reader.chunk.size.bytes: 256MB
#    alluxio.user.local.reader.chunk.size.bytes: 256MB
#    alluxio.worker.network.reader.buffer.size: 256MB
#    alluxio.user.streaming.data.timeout: 300sec
  master:
    jvmOptions:
      - "-Xmx4G"
  worker:
    jvmOptions:
      - "-Xmx4G"
  fuse:
    jvmOptions:
      - "-Xmx4G "
      - "-Xms4G "
    # For now, only support local
    shortCircuitPolicy: local
    args:
      - fuse
      - --fuse-opts=direct_io,ro,max_read=131072,attr_timeout=7200,entry_timeout=7200,nonempty

The controller-manager crashes; here's a snippet of its log:

E0829 14:21:24.519270       1 runtime.go:78] Observed a panic: "assignment to entry in nil map" (assignment to entry in nil map)
goroutine 273 [running]:
github.com/cloudnativefluid/fluid/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x14742a0, 0x17f7630)
	/go/src/github.com/cloudnativefluid/fluid/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
github.com/cloudnativefluid/fluid/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/cloudnativefluid/fluid/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x14742a0, 0x17f7630)
	/usr/local/go/src/runtime/panic.go:969 +0x166
github.com/cloudnativefluid/fluid/pkg/ddc/alluxio.(*AlluxioEngine).transformCommonPart(0xc0005af8c0, 0xc000501400, 0xc000214900, 0xc0004800a0, 0xc000346280)
	/go/src/github.com/cloudnativefluid/fluid/pkg/ddc/alluxio/transform.go:106 +0x222
github.com/cloudnativefluid/fluid/pkg/ddc/alluxio.(*AlluxioEngine).transform(0xc0005af8c0, 0xc000501400, 0xc00038a3c0, 0x14, 0xc000607cf0)
	/go/src/github.com/cloudnativefluid/fluid/pkg/ddc/alluxio/transform.go:41 +0x82
github.com/cloudnativefluid/fluid/pkg/ddc/alluxio.(*AlluxioEngine).generateAlluxioValueFile(0xc0005af8c0, 0xc000501400, 0x0, 0x0, 0x8, 0xc0008b59e0)
	/go/src/github.com/cloudnativefluid/fluid/pkg/ddc/alluxio/master_internal.go:75 +0x107
github.com/cloudnativefluid/fluid/pkg/ddc/alluxio.(*AlluxioEngine).setupMasterInernal(0xc0005af8c0, 0x162c3cd, 0xb)
	/go/src/github.com/cloudnativefluid/fluid/pkg/ddc/alluxio/master_internal.go:43 +0xb2
github.com/cloudnativefluid/fluid/pkg/ddc/alluxio.(*AlluxioEngine).SetupMaster(0xc0005af8c0, 0x1635901, 0x0)
	/go/src/github.com/cloudnativefluid/fluid/pkg/ddc/alluxio/master.go:125 +0x584
github.com/cloudnativefluid/fluid/pkg/ddc/base.(*TemplateEngine).Setup(0xc000890000, 0x18479a0, 0xc000042150, 0xc000607cf0, 0x7, 0xc000607ce0, 0x5, 0x184e500, 0xc0001bec60, 0xc00034ca80, ...)
	/go/src/github.com/cloudnativefluid/fluid/pkg/ddc/base/setup.go:52 +0x750
github.com/cloudnativefluid/fluid/pkg/controllers.(*RuntimeReconciler).ReconcileRuntime(0xc0000455c0, 0x184e440, 0xc000890000, 0x18479a0, 0xc000042150, 0xc000607cf0, 0x7, 0xc000607ce0, 0x5, 0x184e500, ...)
	/go/src/github.com/cloudnativefluid/fluid/pkg/controllers/runtime_controller.go:195 +0x1ab
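
The panic is Go's standard failure mode for writing into a map that was never initialized: with properties omitted in runtime.yaml, the decoded field stays a nil map. A minimal sketch of the bug class and the usual guard (which field transformCommonPart actually writes to is an assumption based on the stack trace):

package main

import "fmt"

func main() {
	// A map field decoded from YAML with `properties` omitted stays nil.
	// Reading a nil map is safe; writing panics with exactly
	// "assignment to entry in nil map", as in the stack trace above.
	var props map[string]string

	// props["alluxio.master.journal.type"] = "UFS" // would panic here

	// The usual guard: initialize the map before the first assignment.
	if props == nil {
		props = make(map[string]string)
	}
	props["alluxio.master.journal.type"] = "UFS"
	fmt.Println(props)
}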

What you expect to happen:
Fluid runs properly, and Alluxio starts with default settings.

How to reproduce it
Creating any AlluxioRuntime with no properties set reproduces this bug for me.

Additional Information
None

Failed to recreate a Runtime on a Dataset that has already been unbound from its original Runtime

fluid version
fluid-0.1.0-SNAPSHOT on this commit

Describe the bug
I created a Dataset and a corresponding AlluxioRuntime, which bound to the Dataset.

cat << EOF > dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: cifar10
  #namespace: fluid-system
spec:
  mounts:
  - mountPoint: https://downloads.apache.org/hadoop/common/hadoop-3.2.1/
    name: hadoop
  - mountPoint: https://downloads.apache.org/spark/spark-2.4.6/
    name: spark
  - mountPoint: https://downloads.apache.org/hbase/2.2.5/
    name: hbase
EOF
kubectl create -f dataset.yaml
cat << EOF > runtime.yaml
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: cifar10
  #namespace: fluid-system
spec:
  # Add fields here
  #dataCopies: 3
  replicas: 2
  alluxioVersion:
    image: registry.cn-huhehaote.aliyuncs.com/alluxio/alluxio
    imageTag: "2.3.0-SNAPSHOT-bbce37a"
    imagePullPolicy: Always
  tieredstore:
    levels:
    - mediumtype: MEM
      path: /dev/shm
      quota: 1Gi
      high: "0.95"
      low: "0.7"
      storageType: Memory
    - mediumtype: SSD
      path: /var/lib/docker/alluxio
      quota: 2Gi
      high: "0.95"
      low: "0.7"
      storageType: Disk
  properties:
    alluxio.user.file.writetype.default: MUST_CACHE
    alluxio.master.journal.folder: /journal
    alluxio.master.journal.type: UFS
  master:
    replicas: 1
    jvmOptions:
      - "-Xmx4G"
  worker:
    jvmOptions:
      - "-Xmx4G"
  fuse:
    image: registry.cn-huhehaote.aliyuncs.com/alluxio/alluxio-fuse
    imageTag: "2.3.0-SNAPSHOT-bbce37a"
    imagePullPolicy: Always
    jvmOptions:
      - "-Xmx4G "
      - "-Xms4G "
    # For now, only support local
    shortCircuitPolicy: local
    args:
      - fuse
      - --fuse-opts=ro,max_read=131072,attr_timeout=7200,entry_timeout=7200
EOF
kubectl create -f runtime.yaml

The Dataset was bound to the AlluxioRuntime, and everything went fine up to this point. Then I deleted the Runtime and recreated it, but this time the AlluxioRuntime could not be set up correctly.

To Reproduce

  1. Create a Dataset and bind it to a Runtime
  2. Delete the Runtime so the Dataset becomes unbound
  3. Recreate the Runtime

Expected behavior
A Runtime created to bind to an unbound Dataset should be set up correctly.

After deleting the dataset, the PV and PVC stay in Terminating forever

After deleting the dataset, the PV and PVC remain stuck in the Terminating state.

kubectl get pvc
NAME       STATUS        VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
imagenet   Terminating   imagenet   0                                        47h

The suspected cause is that the Alluxio teardown starts before the PV deletion has fully completed.

@iluoeli please take a look and diagnose, thanks.
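
A sketch of the suspected ordering fix, assuming the teardown code can poll with client-go until the PV is really gone before shutting Alluxio down (how this hooks into Fluid's delete flow is an assumption):

package main

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForPVGone polls until the named PV no longer exists, so the Alluxio
// teardown only starts after the volume deletion has fully completed.
func waitForPVGone(client kubernetes.Interface, name string) error {
	return wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		_, err := client.CoreV1().PersistentVolumes().Get(context.TODO(), name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // gone: safe to shut Alluxio down
		}
		return false, err // err == nil means still terminating: keep polling
	})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := waitForPVGone(client, "imagenet"); err != nil {
		panic(err)
	}
	fmt.Println("PV deleted; proceeding to tear down Alluxio")
}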

[FEATURES] Leverage init container to support non-root scenario

What feature you'd like to add:

Leverage init container to support non-root scenario

Why is this feature needed:

Current problems:

1. The underlying storage requires read-only access as a specific non-root identity.
2. The user identity is configured in LDAP and does not exist in the host's /etc/passwd, so it is hard for a container to transparently discover that user's uid and username.
3. Alluxio depends on the user name rather than the uid.

Duplicate runtime info periodically added to a bound dataset

What happened
A bound dataset has multiple identical runtime entries in its status property:

$ kubectl get datasets cifar10 -oyaml
.......
status:
   .....
   runtimes:
   .......
   - name: cifar10
     namespace: default
     type: Accelerate
   - name: cifar10
     namespace: default
     type: Accelerate
   - name: cifar10
     namespace: default
     type: Accelerate
    ......

Also, the number of runtime entries periodically increases:

$ kubectl get datasets cifar10 -o=go-template="{{len .status.runtimes}}"
3
# and several seconds later
$ !!
4

What did you expect to see
Only one runtime entry, since I have only one Alluxio runtime bound to the dataset.

How to reproduce it
Bind a dataset to an Alluxio runtime, check its status, and check it again some time later.

Environment

  • Fluid 0.1.0 (installed from helm 3)
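
The growth pattern suggests the status updater appends a runtime entry on every reconcile pass without checking for duplicates. A minimal sketch of the missing guard, using a stand-in struct rather than Fluid's real status type:

package main

import "fmt"

// Runtime mirrors the fields shown in the status dump above; it is a
// stand-in for Fluid's actual status type.
type Runtime struct {
	Name      string
	Namespace string
	Type      string
}

// addRuntimeIfAbsent appends r only when no identical entry exists, so
// repeated reconcile passes cannot grow the list.
func addRuntimeIfAbsent(runtimes []Runtime, r Runtime) []Runtime {
	for _, existing := range runtimes {
		if existing == r {
			return runtimes
		}
	}
	return append(runtimes, r)
}

func main() {
	var status []Runtime
	entry := Runtime{Name: "cifar10", Namespace: "default", Type: "Accelerate"}
	for i := 0; i < 3; i++ { // three reconcile passes
		status = addRuntimeIfAbsent(status, entry)
	}
	fmt.Println(len(status)) // 1, not 3
}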

Support scale-in

When the user reduces the replica count, the dataset cache should scale in accordingly.
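
As a sketch of the mechanics only: if the workers are exposed through a scalable workload, the controller could drive the count via the scale subresource. The workload name and kind below are illustrative, Fluid's actual worker topology may differ, and the hard part of this feature is evicting the cache held by the removed replicas first.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// scaleWorkers shrinks (or grows) the worker workload to match the
// runtime's spec.replicas.
func scaleWorkers(client kubernetes.Interface, ns, name string, replicas int32) error {
	scale, err := client.AppsV1().StatefulSets(ns).GetScale(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	scale.Spec.Replicas = replicas
	_, err = client.AppsV1().StatefulSets(ns).UpdateScale(context.TODO(), name, scale, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	// e.g. the user edited the runtime from replicas: 2 down to replicas: 1
	if err := scaleWorkers(client, "default", "cifar10-worker", 1); err != nil {
		panic(err)
	}
	fmt.Println("worker scale-in requested")
}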

[BUG] Running Alluxio as non-root changes the permissions of the UFS

What is your environment (Kubernetes version, Fluid version, etc.)

  • Kubernetes: v1.15
  • Helm: 2.8
  • Fluid: v0.3.0

Describe the bug
I deployed Fluid in my cluster. In dataset.yaml I set mountPoint: local:///tmp with owner uid=844, gid=844, and in runtime.yaml I set runAs with uid=844, gid=844. After I created the dataset and runtime, the permissions of the UFS /tmp were all changed to 844.

What you expect to happen:
Alluxio should not change the permissions of the UFS.
How to reproduce it

Additional Information

Add additionalPrinterColumns for CRDs

We should provide users with an easier way to check the status of created CRD objects than the currently used kubectl describe <crd> <name>.

AdditionalPrinterColumns might be a good way to do that.

Candidate status properties to set under additionalPrinterColumns for each CRD (see the marker sketch below):

Dataset

  • Name
  • UfsTotal: Total size of the mounted UFS
  • Cached: Total size of all the files cached in the ddc engine
  • Cache Capacity: Total cache capacity the ddc engine can provide
  • Cached Percentage: Cached / UfsTotal * 100%
  • Phase: Phase of the Dataset object
  • Age: How long since the object was created

AlluxioRuntime
Further discussion needed

AlluxioDataload
Further discussion needed
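
Since the CRDs here are generated with kubebuilder, one way to realize the Dataset columns above is printcolumn markers. A sketch, with hypothetical JSONPaths inferred from the status fields discussed in this issue (the project's actual paths may differ):

package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Stub spec/status so the sketch compiles; Fluid's real fields differ.
type DatasetSpec struct{}
type DatasetStatus struct{}

// +kubebuilder:printcolumn:name="Ufs Total",type="string",JSONPath=".status.ufsTotal"
// +kubebuilder:printcolumn:name="Cached",type="string",JSONPath=".status.cacheStates.cached"
// +kubebuilder:printcolumn:name="Cache Capacity",type="string",JSONPath=".status.cacheStates.cacheCapacity"
// +kubebuilder:printcolumn:name="Cached Percentage",type="string",JSONPath=".status.cacheStates.cachedPercentage"
// +kubebuilder:printcolumn:name="Phase",type="string",JSONPath=".status.phase"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
type Dataset struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   DatasetSpec   `json:"spec,omitempty"`
	Status DatasetStatus `json:"status,omitempty"`
}

After regenerating the CRD with controller-gen, kubectl get dataset would show these columns directly.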

Alluxio load optimization

1. Alluxio distributed load is needed first.
2. Read data through the FUSE mount (configurable).

Accelerate hostPath in Kubernetes

What feature you'd like to add:

Accelerate hostPath and Persistent Volume in Kubernetes

Why is this feature needed:

Some UFS types, such as NFS and some cloud storage, are not natively supported by the current runtime, but we need to support them in Fluid.

The alluxioworker directory needs to be cleaned up when an alluxioRuntime is deleted

Problem description:

After running Alluxio once with JNR-Fuse and then switching to JNI-Fuse, a data loss error occurred.

The error observed on the application side:

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.DataLossError'>, 2 root error(s) found.
(0) Data loss: truncated record at 142388333
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[input_processing/IteratorGetNext]]
[[cluster_5_1/merge_oidx_1/_2655]]
(1) Data loss: truncated record at 142388333
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[input_processing/IteratorGetNext]]
0 successful operations.
0 derived errors ignored.
I0812 06:10:41.991964 140419573827392 coordinator.py:224] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.DataLossError'>, 2 root error(s) found.
(0) Data loss: truncated record at 142388333
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[input_processing/IteratorGetNext]]
[[cluster_5_1/merge_oidx_1/_2655]]
(1) Data loss: truncated record at 142388333
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[input_processing/IteratorGetNext]]
0 successful operations.
0 derived errors ignored.

However, the alluxio-fuse log shows the following: block 3489660931 is expected to be 16MB, but only 7MB is available.

2020-08-12 06:07:19,191 ERROR AlluxioJniFuseFileSystem - Failed to read /imagenet/train/train-00123-of-01024,131072,57671680:
java.lang.IllegalStateException: Block 3489660931 is expected to be 16777216 bytes, but only 7341943 bytes are available. Please ensure its metadata is consistent between Alluxio and UFS.
at com.google.common.base.Preconditions.checkState(Preconditions.java:842)
at alluxio.client.block.stream.BlockInStream.readInternal(BlockInStream.java:275)
at alluxio.client.block.stream.BlockInStream.read(BlockInStream.java:264)
at alluxio.client.file.AlluxioFileInStream.read(AlluxioFileInStream.java:187)
at alluxio.fuse.AlluxioJniFuseFileSystem.readInternal(AlluxioJniFuseFileSystem.java:326)
at alluxio.fuse.AlluxioJniFuseFileSystem.lambda$read$4(AlluxioJniFuseFileSystem.java:298)
at alluxio.fuse.AlluxioFuseUtils.call(AlluxioFuseUtils.java:245)
at alluxio.fuse.AlluxioJniFuseFileSystem.read(AlluxioJniFuseFileSystem.java:298)
at alluxio.jnifuse.AbstractFuseFileSystem.readCallback(AbstractFuseFileSystem.java:150)
2020-08-12 06:07:19,196 ERROR AlluxioJniFuseFileSystem - Failed to read /imagenet/train/train-00123-of-01024,4096,57671680:
java.lang.IllegalStateException: Block 3489660931 is expected to be 16777216 bytes, but only 7341943 bytes are available. Please ensure its metadata is consistent between Alluxio and UFS.
at com.google.common.base.Preconditions.checkState(Preconditions.java:842)
at alluxio.client.block.stream.BlockInStream.readInternal(BlockInStream.java:275)
at alluxio.client.block.stream.BlockInStream.read(BlockInStream.java:264)
at alluxio.client.file.AlluxioFileInStream.read(AlluxioFileInStream.java:187)
at alluxio.fuse.AlluxioJniFuseFileSystem.readInternal(AlluxioJniFuseFileSystem.java:326)
at alluxio.fuse.AlluxioJniFuseFileSystem.lambda$read$4(AlluxioJniFuseFileSystem.java:298)
at alluxio.fuse.AlluxioFuseUtils.call(AlluxioFuseUtils.java:245)
at alluxio.fuse.AlluxioJniFuseFileSystem.read(AlluxioJniFuseFileSystem.java:298)
at alluxio.jnifuse.AbstractFuseFileSystem.readCallback(AbstractFuseFileSystem.java:150)

Logging into the corresponding node shows:

1. The block's size is indeed 7.1MB.
2. The block was presumably created earlier by JNR-Fuse and left incomplete.

[root@iZuf68sywkiky95veylv1yZ alluxio]# cd alluxioworker/
[root@iZuf68sywkiky95veylv1yZ alluxioworker]# ls -ltr |grep 3489660931
-rwxrwxrwx 1 root root 7341943 8月 12 11:37 3489660931

Questions:

1. Under what circumstances can a block be written incompletely? There is currently no shortage of storage space.
2. When a new Alluxio cluster is deployed, are the previously cached block files deleted?

Develop a diagnose script

1. Input: the alluxioruntime name and namespace.

2. Output: the logs of all related Alluxio pods.

The capacity of the PV and PVC is not consistent with Ufs Total

What happened:
I ran kubectl apply on these samples, but the capacity of the PV and PVC is always 100Gi, which is not consistent with Ufs Total.

The Dataset Status:

Name:         cifar10
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"data.fluid.io/v1alpha1","kind":"Dataset","metadata":{"annotations":{},"name":"cifar10","namespace":"default"},"spec":{"moun...
API Version:  data.fluid.io/v1alpha1
Kind:         Dataset
Metadata:
  Creation Timestamp:  2020-08-22T06:56:09Z
  Finalizers:
    fluid-dataset-controller-finalizer
  Generation:        1
  Resource Version:  1592896981
  Self Link:         /apis/data.fluid.io/v1alpha1/namespaces/default/datasets/cifar10
  UID:               8ee5c294-e444-11ea-b246-92d5f2bc5508
Spec:
  Mounts:
    Mount Point:  https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.2.1/
    Name:         hadoop
    Mount Point:  https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.4.6/
    Name:         spark
    Mount Point:  https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/2.2.5/
    Name:         hbase
  Node Affinity:
    Required:
      Node Selector Terms:
        Match Expressions:
          Key:       aliyun.accelerator/nvidia_name
          Operator:  In
          Values:
            Tesla-P100-PCIE-16GB
Status:
  Cache States:
    Cache Capacity:     24GiB
    Cached:             0B
    Cached Percentage:  0%
  Conditions:
    Last Transition Time:  2020-08-22T06:57:23Z
    Last Update Time:      2020-08-22T10:27:54Z
    Message:               The ddc runtime is ready.
    Reason:                DatasetReady
    Status:                True
    Type:                  Ready
  Phase:                   Bound
  Runtimes:
    Category:   Accelerate
    Name:       cifar10
    Namespace:  default
    Type:       alluxio
  Ufs Total:    1.742GiB
Events:         <none>

PV and PVC:

NAME                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM             STORAGECLASS   REASON   AGE
persistentvolume/cifar10   100Gi      RWX            Retain           Bound    default/cifar10                           3h39m

NAME                            STATUS   VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/cifar10   Bound    cifar10   100Gi      RWX                           3h39m
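
For illustration, a sketch of deriving the PV capacity from the reported Ufs Total instead of hard-coding 100Gi, assuming the status string format shown above ("1.742GiB"):

package main

import (
	"fmt"
	"strings"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// The status reports `Ufs Total: 1.742GiB` while the PV/PVC get a fixed
	// 100Gi. Normalizing "GiB" to the Kubernetes "Gi" suffix lets the
	// capacity be derived from the dataset itself. Sketch only.
	ufsTotal := "1.742GiB"
	qty, err := resource.ParseQuantity(strings.TrimSuffix(ufsTotal, "B"))
	if err != nil {
		panic(err)
	}
	fmt.Println("derived PV capacity:", qty.String())
}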

Get details such as ready replicas with kubectl get alluxioruntime

What feature you'd like to add:

Checking the status of an AlluxioRuntime currently requires kubectl get -oyaml, which is cumbersome for users. The system should expose details through kubectl get alluxioruntime and kubectl get alluxioruntime -o wide, for example how many worker replicas are currently ready.

Why is this feature needed:

Alluxio data warm-up (prefetch)

Requirements:

Develop an Alluxio prefetch feature. This is essentially a distributed data warm-up task that runs on every cache node. The warm-up is performed by a batch job whose replica count equals the number of cache nodes.

(1) Develop a helm chart that runs the batch job on nodes with cache capacity, with configurable parameters (see the sketch after this list):

  • number of threads
  • directories to warm up

(2) The alluxio load CRD defines the behavior, and the alluxio job controller executes the helm chart.
(3) The alluxio job controller periodically queries and updates the condition and phase of the alluxio load CRD.

Reference:

docker run -v /alluxio-fuse/train:/data --env THREADS=32 -itd --name=read-coco-32-alluxio
registry.cn-huhehaote.aliyuncs.com/tensorflow-samples/coco-perf
/app/read_file.sh
docker logs -f read-coco-32-alluxio

Schedule:

7.29 -> 8.5:
1. Finish the helm chart.
2. Design the alluxio load controller.

Owner: @TrafalgarZZZ @iluoeli
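
A sketch of the warm-up Job from step (1), assuming one pod per cache node and reusing the THREADS variable, image, and mount path from the reference command; pod spreading across cache nodes (e.g. anti-affinity) is omitted for brevity:

package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// One pod per cache node; the real controller would use the current
	// number of Alluxio workers here.
	cacheNodes := int32(3)

	job := batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "dataset-warmup"},
		Spec: batchv1.JobSpec{
			Parallelism: &cacheNodes,
			Completions: &cacheNodes,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "warmup",
						Image:   "registry.cn-huhehaote.aliyuncs.com/tensorflow-samples/coco-perf",
						Command: []string{"/app/read_file.sh"},
						Env:     []corev1.EnvVar{{Name: "THREADS", Value: "32"}},
						VolumeMounts: []corev1.VolumeMount{{
							Name:      "data",
							MountPath: "/data",
						}},
					}},
					Volumes: []corev1.Volume{{
						Name: "data",
						VolumeSource: corev1.VolumeSource{
							HostPath: &corev1.HostPathVolumeSource{Path: "/alluxio-fuse/train"},
						},
					}},
				},
			},
		},
	}
	fmt.Printf("%s: parallelism=%d\n", job.Name, *job.Spec.Parallelism)
}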

[FEATURE] Make timestamp more readable

What is your environment (Kubernetes version, Fluid version, etc.)

{"level":"info","ts":1599569172.4293172,"logger":"alluxioctl.AlluxioRuntime","caller":"alluxio/shutdown.go:34","msg":"clean cache failed","alluxioruntime":"default/test","retry times":1}
{"level":"info","ts":1599653665.3547537,"logger":"alluxioctl.AlluxioRuntime","caller":"alluxio/shutdown.go:34","msg":"clean cache failed","alluxioruntime":"default/test","retry times":2}
{"level":"info","ts":1599653831.3333874,"logger":"alluxioctl.AlluxioRuntime","caller":"alluxio/shutdown.go:34","msg":"clean cache failed","alluxioruntime":"default/test","retry times":3}
{"level":"info","ts":1599654161.170925,"logger":"alluxioctl.AlluxioRuntime","caller":"alluxio/shutdown.go:34","msg":"clean cache failed","alluxioruntime":"default/test","retry times":4}
{"level":"info","ts":1599654818.675095,"logger":"alluxioctl.AlluxioRuntime","caller":"alluxio/shutdown.go:34","msg":"clean cache failed","alluxioruntime":"default/test","retry times":5}

The default timestamp format is not friendly for troubleshooting; please consider making it more readable. Thanks.
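
The float ts field is zap's default epoch time encoder. A minimal sketch of switching to ISO8601 timestamps, assuming a plain zap production logger (Fluid's actual logger wiring may differ):

package main

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

func main() {
	cfg := zap.NewProductionConfig()
	// Replace the epoch-seconds encoder behind the "ts" field with a
	// human-readable ISO8601 time.
	cfg.EncoderConfig.EncodeTime = zapcore.ISO8601TimeEncoder
	logger, err := cfg.Build()
	if err != nil {
		panic(err)
	}
	defer logger.Sync()

	// Emits e.g. {"level":"info","ts":"2020-09-09T12:34:56.789+0800",...}
	logger.Info("clean cache failed",
		zap.String("alluxioruntime", "default/test"),
		zap.Int("retry times", 1))
}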

Describe the bug

What you expect to happen:

How to reproduce it

Additional Information

[BUG] Port conflict with multiple Datasets

What is your environment (Kubernetes version, Fluid version, etc.)
v0.3.0
Describe the bug
When two Datasets are scheduled to the same node, one of them stays pending.
What you expect to happen:
One Kubernetes node should support multiple Datasets.
How to reproduce it
Create two Datasets on the same node via nodeName.
Additional Information
The log shows a port conflict: alluxio-master uses container ports 19998 and 19999 as host ports, and alluxio-job-master uses container ports 20001 and 20002 as host ports.
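
One possible direction, sketched under the assumption that the controller could reserve host ports dynamically per runtime instead of hard-coding 19998/19999/20001/20002: probe the node for currently free ports and assign a distinct set to each runtime.

package main

import (
	"fmt"
	"net"
)

// freePorts returns n distinct currently-bindable ports; a controller could
// reserve such a set per runtime instead of using fixed host ports.
func freePorts(n int) ([]int, error) {
	ports := make([]int, 0, n)
	listeners := make([]net.Listener, 0, n)
	// Keep every probe listener open until the end so the same port is
	// never handed out twice.
	defer func() {
		for _, l := range listeners {
			l.Close()
		}
	}()
	for len(ports) < n {
		l, err := net.Listen("tcp", "127.0.0.1:0") // port 0 = pick a free one
		if err != nil {
			return nil, err
		}
		listeners = append(listeners, l)
		ports = append(ports, l.Addr().(*net.TCPAddr).Port)
	}
	return ports, nil
}

func main() {
	ports, err := freePorts(4) // master rpc/web + job-master rpc/web
	if err != nil {
		panic(err)
	}
	fmt.Println("candidate host ports:", ports)
}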

Cannot delete dataset and runtime

What is your environment (Kubernetes version, Fluid version, etc.)
Kubernetes: 1.15
Fluid: 0.3.0

Describe the bug
I deployed a dataset and runtime in my cluster and everything went well. However, two days later I wanted to delete them, but both kubectl delete -f dataset.yaml and kubectl delete -f runtime.yaml failed.
What you expect to happen:

How to reproduce it

Additional Information

[FEATURES] Translate the user in UFS to Fuse

What is your environment (Kubernetes version, Fluid version, etc.)

Describe the bug

Translate the user in UFS to Fuse

What you expect to happen:

How to reproduce it

Additional Information

Support LocalPath Distributed Training Scenario

What feature you'd like to add:
Support the local-path distributed training scenario.
Why is this feature needed:
Description of our training architecture:

origin:
                    Distributed Storage (Lustre)
                                          |
admin: every host machine has a Lustre client; LDAP manages file access; uid/gid are automatically injected into user jobs (clients)
user: allocates GPU resources, chooses whether to run distributed training (via the MPIJob CRD), and specifies the mount directory

fluid:
                    Distributed Storage (Lustre)
                                          |
admin: every host machine has a Lustre client; creates the dataset and runtime, injecting uid/gid
user: allocates GPU resources, chooses whether to run distributed training (via the MPIJob CRD), and specifies the mount directory

Gap: with distributed training, neither the user nor the admin knows in advance which node a pod will be bound to, so the admin cannot create the dataset and runtime on the right node.

If there is any misunderstanding, anybody can point it out :)

Resource consumption evaluation

In the accelerated-hostPath scenario, the application (dataset), master, worker, and fuse client are all deployed on a single node, which increases the load on that host. This needs long-term testing across different scenarios, briefly summarized below.

  1. Datasets of different orders of magnitude and types, for example: lots of small files; medium-size binary files (including ImageNet TFRecords); large binary files (>1T).
  2. For different datasets, the capacity of a single host machine for hosting datasets.
  3. Multi-user hot-data and cold-data scenario tests.

If there is any misunderstanding, anybody can point it out :)

The diagnose script does not collect Alluxio logs

The alluxio worker logs were not collected.

bash diagnose-fluid.sh --name imagenet --namespace default
No resources found in default namespace.
No resources found in default namespace.
No resources found in default namespace.
No resources found in default namespace.
No resources found in default namespace.
tar: Removing leading '/' from member names
/tmp/diagnose_fluid_1596761896/
/tmp/diagnose_fluid_1596761896/pods-fluid-system/
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-8cfdh-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-g7wxt-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-h6j5z-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-nlpgx-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/controller-manager-6fb8db4f6b-kzp85-manager.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-twk7q-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-7s9zk-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-g7wxt-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-twk7q-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-g9c9s-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-nlpgx-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-ght9v-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-2qqgp-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-7866d-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-7s9zk-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-zdv9d-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-hjj9m-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-qcdl2-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-zdv9d-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-qcdl2-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-g9c9s-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-ght9v-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-mwzlq-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-2qqgp-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-8cfdh-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-hjj9m-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-h6j5z-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-gkhsd-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-gdc7t-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-7866d-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-mwzlq-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-gkhsd-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-gdc7t-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/fluid-system.log
/tmp/diagnose_fluid_1596761896/default.log
/tmp/diagnose_fluid_1596761896/pods-default/
/tmp/diagnose_fluid_1596761896/helm.log
please get diagnose_fluid_1596761896.tar.gz for diagnostics

However, no alluxio-worker logs were produced:

[root@iZuf68sywkiky95veylv1yZ fluid-cy]# cd /tmp/diagnose_fluid_1596761896/
[root@iZuf68sywkiky95veylv1yZ diagnose_fluid_1596761896]# ls
default.log  fluid-system.log  helm.log  pods-default  pods-fluid-system
[root@iZuf68sywkiky95veylv1yZ diagnose_fluid_1596761896]# cd pods-default/
[root@iZuf68sywkiky95veylv1yZ pods-default]# ls
[root@iZuf68sywkiky95veylv1yZ pods-default]#

Can't delete a dataset even when the associated runtime has been deleted

What happened
I have a bound dataset and its alluxio runtime, and I want to delete them to do some cleanup.
After successfully deleting the alluxio runtime, the status of the associated dataset still remains Bound.
Also, if I try to delete the dataset with kubectl delete, it hangs indefinitely.

What did you expect to see
After the deletion of the Alluxio runtime, shouldn't the status change from Bound to NotBound or something else? And I expected to be able to delete the dataset with kubectl delete.

How to reproduce it

  1. Make a bound dataset
  2. kubectl delete alluxio runtime
  3. check the status of the dataset
  4. kubectl delete dataset

Environment

  • Fluid 0.1.0 (installed from helm 3)
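
The infinite hang is consistent with a finalizer that is never removed: the Dataset status dump elsewhere on this page shows a fluid-dataset-controller-finalizer, and kubectl delete cannot finish until that list is emptied. A minimal sketch of the missing step, assuming the controller can detect that no runtime references the dataset anymore:

package main

import "fmt"

// removeFinalizer returns the finalizer list without the given name; once
// the list is empty (and persisted), Kubernetes can complete the deletion.
func removeFinalizer(finalizers []string, name string) []string {
	out := finalizers[:0]
	for _, f := range finalizers {
		if f != name {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	finalizers := []string{"fluid-dataset-controller-finalizer"}
	// In the reconcile loop: the dataset is being deleted and no runtime
	// references it anymore, so drop the controller's finalizer.
	finalizers = removeFinalizer(finalizers, "fluid-dataset-controller-finalizer")
	fmt.Println(len(finalizers)) // 0 -> kubectl delete can now finish
}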

Demo for dawnbench

  1. Create docs for users to run it by themselves.
  2. Create video (No urgent, need discussion)

[BUG] Alluxio runtime running as a non-root user has no permission to cache data

What is your environment (Kubernetes version, Fluid version, etc.)

Describe the bug

  1. Create alluxio runtime
spec:
  replicas: 2
#  alluxioVersion:
#    image: registry.cn-huhehaote.aliyuncs.com/alluxio/alluxio
#    imageTag: "2.3.0-SNAPSHOT-f83f51e"
#    imagePullPolicy: Always
  initUsers:
    image: registry.cn-hangzhou.aliyuncs.com/fluid/init-users
    imageTag: v0.3.0
    imagePullPolicy: Always
  tieredstore:
    levels:
      - mediumtype: SSD
        path: /var/lib/docker/alluxio
        quota: 2Gi
        high: "0.95"
        low: "0.7"
  runAs:
    uid: 1005
    gid: 1005
    user: myuser
    group: mygroup
  2. I found the Alluxio worker doesn't have permission to create the cache directory:
2020-09-20 09:47:26,568 INFO  BlockWorkerFactory - Creating alluxio.worker.block.BlockWorker
2020-09-20 09:47:26,644 ERROR StorageTier - Unable to initialize storage directory at /var/lib/docker/alluxio/alluxioworker: Failed to create folder /var/lib/docker/alluxio/alluxioworker
  3. For now, I have to add permissions on the specified node before creating the Alluxio runtime:
chmod -R 777 /var/lib/docker/alluxio

What you expect to happen:

I hope we can grant permission 777 on the cache directory /var/lib/docker/alluxio in the init containers (see the sketch below).
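
A sketch of that suggestion, assuming the worker pod spec can carry one extra init container; the image and volume names below are illustrative:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// An init container that opens up the tiered-store path before the
	// non-root Alluxio worker starts.
	initContainer := corev1.Container{
		Name:    "init-cache-dir",
		Image:   "busybox",
		Command: []string{"sh", "-c", "chmod -R 0777 /var/lib/docker/alluxio"},
		VolumeMounts: []corev1.VolumeMount{{
			Name:      "cache-ssd",
			MountPath: "/var/lib/docker/alluxio",
		}},
	}
	fmt.Printf("init container: %s -> %v\n", initContainer.Name, initContainer.Command)
}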

How to reproduce it

Additional Information

Release v0.1.0

What is your environment (Kubernetes version, Fluid version, etc.)

Create Release v0.1.0

Describe the bug

What you expect to happen:

How to reproduce it

Additional Information

Split and decouple the dataset controller and the alluxioruntime controller

What feature you'd like to add:

Currently the dataset controller and the alluxioruntime controller live in the same binary and need to be split apart. The dataset controller should own the dataset lifecycle independently of any runtime implementation.

1. Split the dataset controller and the alluxioruntime controller, each with its own main function as the entrypoint (see the sketch below).
2. Deploy the two controllers in separate Pods.

/cc @xieydd @TrafalgarZZZ @iluoeli
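
A minimal sketch of what a dedicated dataset-controller entrypoint could look like with controller-runtime; the reconciler wiring is left out because Fluid's setup API is not shown in this issue:

package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	ctrl.SetLogger(zap.New())

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		os.Exit(1)
	}

	// Register only the dataset reconciler here; the alluxioruntime
	// controller gets a symmetric main() in its own binary and Pod.

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}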

Why is this feature needed:

Optimize Fluid's dataset initialization

Dataset initialization currently depends on:

  1. alluxio fs ls -R /
  2. alluxio fs count

If the user's dataset is very large, the ufsTotal computation needs to become an asynchronous operation.

TODO:

1. Design the asynchronous approach (see the sketch below).
2. Implement it.
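
A sketch of the asynchronous shape this could take, with stand-in names rather than Fluid's actual API: the reconcile loop requeues and keeps the dataset usable while a background goroutine walks the UFS.

package main

import (
	"fmt"
	"time"
)

// scanUFS stands in for the expensive metadata walk (alluxio fs ls -R /
// plus alluxio fs count); illustrative only.
func scanUFS() int64 {
	time.Sleep(2 * time.Second) // a huge UFS can take far longer
	return 1871000000           // bytes
}

func main() {
	result := make(chan int64, 1)

	// Kick off the scan in the background so reconciliation is not blocked.
	go func() { result <- scanUFS() }()

	for {
		select {
		case total := <-result:
			// The next reconcile pass persists the value into the status.
			fmt.Println("ufsTotal ready:", total)
			return
		default:
			fmt.Println("ufsTotal not ready; requeue without blocking")
			time.Sleep(500 * time.Millisecond)
		}
	}
}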

[DOC] Data volume acceleration

Provide a link to that doc page:

Documentation in both Chinese and English:

  • Documentation
  • Short video

What is the defect and your suggestions on improvement:
