
fluid-cloudnative / fluid
1.6K stars · 31 watchers · 955 forks · 48.45 MB

Fluid, elastic data abstraction and acceleration for BigData/AI applications in cloud. (Project under CNCF)

Home Page: https://fluid-cloudnative.github.io/

License: Apache License 2.0

Dockerfile 0.07% Makefile 0.25% Go 95.01% Smarty 0.09% Shell 1.08% Mustache 0.81% Python 2.69%
data-abstraction kubernetes big-data ai-framework alluxio distributed-cache

fluid's People

Contributors

abowloflrf, allenhaozi, baowj-678, billychen1, chenxiaofei-cxf, cheyang, daomin885, dashanji, dependabot[bot], fengshunli, frankleaf, hahchenchen, iluoeli, ldawns, littletiger123, myccccccc, ronggu, ssz1997, trafalgarzzz, uniqueni, wang-mask, wangshli, xiao-hou, xieydd, xliuqq, yangjun289519474, yangyuliufeng, zhang-x-z, zhongweichang001, zwwhdls


fluid's Issues

[BUG] The runtime master pod is not restricted to the node where the hostPath is located

What is your environment (Kubernetes version, Fluid version, etc.)

Describe the bug
When I try to accelerate a hostPath located on one of my Kubernetes cluster nodes (a new feature introduced in Fluid v0.3.0), I set the correct nodeAffinity in dataset.yaml to tell Fluid which node my hostPath is on, but Fluid does not apply that nodeAffinity to the runtime master pod. As a result, Fluid is likely to create the master pod on an unexpected node, and the PVC created by Fluid ends up empty.
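
For illustration, here is a minimal Go sketch of the expected fix, assuming the master pod template can simply inherit the dataset's required node selector (the corev1 types are real; the Fluid-side wiring is an assumption):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Stand-in for the node selector the user writes in dataset.yaml under
	// spec.nodeAffinity (the exact Fluid-side field shape is an assumption).
	required := &corev1.NodeSelector{
		NodeSelectorTerms: []corev1.NodeSelectorTerm{{
			MatchExpressions: []corev1.NodeSelectorRequirement{{
				Key:      "kubernetes.io/hostname",
				Operator: corev1.NodeSelectorOpIn,
				Values:   []string{"node-with-hostpath"},
			}},
		}},
	}

	// Expected behavior: the runtime master pod template inherits the same
	// requirement instead of being scheduled unconstrained.
	masterPodSpec := corev1.PodSpec{
		Affinity: &corev1.Affinity{
			NodeAffinity: &corev1.NodeAffinity{
				RequiredDuringSchedulingIgnoredDuringExecution: required,
			},
		},
	}
	fmt.Printf("master affinity: %+v\n", masterPodSpec.Affinity.NodeAffinity)
}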

What you expect to happen:

How to reproduce it

Additional Information

[BUG] Assignment to entry in nil map if no properties set in AlluxioRuntime

What is your environment (Kubernetes version, Fluid version, etc.)

  • Fluid 0.2.0
  • Kubernetes 1.16.9
  • Go 1.13.9
  • Linux/amd64

Describe the bug

I get an assignment to entry in nil map error if I set no properties in runtime.yaml, like this:

apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: hbase
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi
        high: "0.95"
        low: "0.7"
        storageType: Memory
#  properties:
#    alluxio.user.file.writetype.default: MUST_CACHE
#    alluxio.master.journal.folder: /journal
#    alluxio.master.journal.type: UFS
#    alluxio.user.block.size.bytes.default: 256MB
#    alluxio.user.streaming.reader.chunk.size.bytes: 256MB
#    alluxio.user.local.reader.chunk.size.bytes: 256MB
#    alluxio.worker.network.reader.buffer.size: 256MB
#    alluxio.user.streaming.data.timeout: 300sec
  master:
    jvmOptions:
      - "-Xmx4G"
  worker:
    jvmOptions:
      - "-Xmx4G"
  fuse:
    jvmOptions:
      - "-Xmx4G "
      - "-Xms4G "
    # For now, only support local
    shortCircuitPolicy: local
    args:
      - fuse
      - --fuse-opts=direct_io,ro,max_read=131072,attr_timeout=7200,entry_timeout=7200,nonempty

The controller-manager crashes; here's a snippet of its log:

E0829 14:21:24.519270       1 runtime.go:78] Observed a panic: "assignment to entry in nil map" (assignment to entry in nil map)
goroutine 273 [running]:
github.com/cloudnativefluid/fluid/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x14742a0, 0x17f7630)
	/go/src/github.com/cloudnativefluid/fluid/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
github.com/cloudnativefluid/fluid/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/cloudnativefluid/fluid/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x14742a0, 0x17f7630)
	/usr/local/go/src/runtime/panic.go:969 +0x166
github.com/cloudnativefluid/fluid/pkg/ddc/alluxio.(*AlluxioEngine).transformCommonPart(0xc0005af8c0, 0xc000501400, 0xc000214900, 0xc0004800a0, 0xc000346280)
	/go/src/github.com/cloudnativefluid/fluid/pkg/ddc/alluxio/transform.go:106 +0x222
github.com/cloudnativefluid/fluid/pkg/ddc/alluxio.(*AlluxioEngine).transform(0xc0005af8c0, 0xc000501400, 0xc00038a3c0, 0x14, 0xc000607cf0)
	/go/src/github.com/cloudnativefluid/fluid/pkg/ddc/alluxio/transform.go:41 +0x82
github.com/cloudnativefluid/fluid/pkg/ddc/alluxio.(*AlluxioEngine).generateAlluxioValueFile(0xc0005af8c0, 0xc000501400, 0x0, 0x0, 0x8, 0xc0008b59e0)
	/go/src/github.com/cloudnativefluid/fluid/pkg/ddc/alluxio/master_internal.go:75 +0x107
github.com/cloudnativefluid/fluid/pkg/ddc/alluxio.(*AlluxioEngine).setupMasterInernal(0xc0005af8c0, 0x162c3cd, 0xb)
	/go/src/github.com/cloudnativefluid/fluid/pkg/ddc/alluxio/master_internal.go:43 +0xb2
github.com/cloudnativefluid/fluid/pkg/ddc/alluxio.(*AlluxioEngine).SetupMaster(0xc0005af8c0, 0x1635901, 0x0)
	/go/src/github.com/cloudnativefluid/fluid/pkg/ddc/alluxio/master.go:125 +0x584
github.com/cloudnativefluid/fluid/pkg/ddc/base.(*TemplateEngine).Setup(0xc000890000, 0x18479a0, 0xc000042150, 0xc000607cf0, 0x7, 0xc000607ce0, 0x5, 0x184e500, 0xc0001bec60, 0xc00034ca80, ...)
	/go/src/github.com/cloudnativefluid/fluid/pkg/ddc/base/setup.go:52 +0x750
github.com/cloudnativefluid/fluid/pkg/controllers.(*RuntimeReconciler).ReconcileRuntime(0xc0000455c0, 0x184e440, 0xc000890000, 0x18479a0, 0xc000042150, 0xc000607cf0, 0x7, 0xc000607ce0, 0x5, 0x184e500, ...)
	/go/src/github.com/cloudnativefluid/fluid/pkg/controllers/runtime_controller.go:195 +0x1ab
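
The panic is Go's standard failure mode for writing into a map that was never initialized: with properties omitted in runtime.yaml, the decoded field stays a nil map. A minimal sketch of the bug class and the usual guard (which field transformCommonPart actually writes to is an assumption based on the stack trace):

package main

import "fmt"

func main() {
	// A map field decoded from YAML with `properties` omitted stays nil.
	// Reading a nil map is safe; writing panics with exactly
	// "assignment to entry in nil map", as in the stack trace above.
	var props map[string]string

	// props["alluxio.master.journal.type"] = "UFS" // would panic here

	// The usual guard: initialize the map before the first assignment.
	if props == nil {
		props = make(map[string]string)
	}
	props["alluxio.master.journal.type"] = "UFS"
	fmt.Println(props)
}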

What you expect to happen:
Fluid runs properly, and Alluxio starts with default settings.

How to reproduce it
Creating any AlluxioRuntime with no properties set reproduces this bug for me.

Additional Information
None

Failed to recreate a Runtime on a Dataset that has already been unbound from its original Runtime

fluid version
fluid-0.1.0-SNAPSHOT on this commit

Describe the bug
I created a Dataset and a corresponding AlluxioRuntime, which bound to the Dataset.

cat << EOF > dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: cifar10
  #namespace: fluid-system
spec:
  mounts:
  - mountPoint: https://downloads.apache.org/hadoop/common/hadoop-3.2.1/
    name: hadoop
  - mountPoint: https://downloads.apache.org/spark/spark-2.4.6/
    name: spark
  - mountPoint: https://downloads.apache.org/hbase/2.2.5/
    name: hbase
EOF
kubectl create -f dataset.yaml
cat << EOF > runtime.yaml
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: cifar10
  #namespace: fluid-system
spec:
  # Add fields here
  #dataCopies: 3
  replicas: 2
  alluxioVersion:
    image: registry.cn-huhehaote.aliyuncs.com/alluxio/alluxio
    imageTag: "2.3.0-SNAPSHOT-bbce37a"
    imagePullPolicy: Always
  tieredstore:
    levels:
    - mediumtype: MEM
      path: /dev/shm
      quota: 1Gi
      high: "0.95"
      low: "0.7"
      storageType: Memory
    - mediumtype: SSD
      path: /var/lib/docker/alluxio
      quota: 2Gi
      high: "0.95"
      low: "0.7"
      storageType: Disk
  properties:
    alluxio.user.file.writetype.default: MUST_CACHE
    alluxio.master.journal.folder: /journal
    alluxio.master.journal.type: UFS
  master:
    replicas: 1
    jvmOptions:
      - "-Xmx4G"
  worker:
    jvmOptions:
      - "-Xmx4G"
  fuse:
    image: registry.cn-huhehaote.aliyuncs.com/alluxio/alluxio-fuse
    imageTag: "2.3.0-SNAPSHOT-bbce37a"
    imagePullPolicy: Always
    jvmOptions:
      - "-Xmx4G "
      - "-Xms4G "
    # For now, only support local
    shortCircuitPolicy: local
    args:
      - fuse
      - --fuse-opts=ro,max_read=131072,attr_timeout=7200,entry_timeout=7200
EOF
kubectl create -f runtime.yaml

The Dataset was bound to the AlluxioRuntime, and everything went fine up to this point. Then I deleted the Runtime and recreated it, but this time the AlluxioRuntime could not be set up correctly.

To Reproduce

  1. Create a Dataset and bind it to a Runtime
  2. Delete the Runtime so the Dataset becomes unbound
  3. Recreate the Runtime

Expected behavior
A Runtime created to bind to an unbound Dataset should be set up correctly.

After deleting the dataset, the PV and PVC stay in Terminating forever

After deleting the dataset, the PV and PVC remain stuck in the Terminating state.

kubectl get pvc
NAME       STATUS        VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
imagenet   Terminating   imagenet   0                                        47h

The suspected cause is that the Alluxio teardown starts before the PV deletion has fully completed.

@iluoeli please take a look and diagnose, thanks.
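
A sketch of the suspected ordering fix, assuming the teardown code can poll with client-go until the PV is really gone before shutting Alluxio down (how this hooks into Fluid's delete flow is an assumption):

package main

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForPVGone polls until the named PV no longer exists, so the Alluxio
// teardown only starts after the volume deletion has fully completed.
func waitForPVGone(client kubernetes.Interface, name string) error {
	return wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		_, err := client.CoreV1().PersistentVolumes().Get(context.TODO(), name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // gone: safe to shut Alluxio down
		}
		return false, err // err == nil means still terminating: keep polling
	})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := waitForPVGone(client, "imagenet"); err != nil {
		panic(err)
	}
	fmt.Println("PV deleted; proceeding to tear down Alluxio")
}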

[FEATURES] Leverage init container to support non-root scenario

What feature you'd like to add:

Leverage init container to support non-root scenario

Why is this feature needed:

Current problems:

1. The underlying storage requires read-only access as a specific non-root identity.
2. The user identity is configured in LDAP and does not exist in the host's /etc/passwd, so it is hard for a container to transparently discover that user's uid and username.
3. Alluxio depends on the user name rather than the uid.

Duplicate runtime info periodically added to a bound dataset

What happened
A bound dataset has multiple identical runtime entries in its status property:

$ kubectl get datasets cifar10 -oyaml
.......
status:
   .....
   runtimes:
   .......
   - name: cifar10
     namespace: default
     type: Accelerate
   - name: cifar10
     namespace: default
     type: Accelerate
   - name: cifar10
     namespace: default
     type: Accelerate
    ......

Also, the number of runtime entries periodically increases:

$ kubectl get datasets cifar10 -o=go-template="{{len .status.runtimes}}"
3
# and several seconds later
$ !!
4

What did you expect to see
Only one runtime entry, since I have only one Alluxio runtime bound to the dataset.

How to reproduce it
Bind a dataset to an Alluxio runtime, check its status, and check it again some time later.

Environment

  • Fluid 0.1.0 (installed from helm 3)
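
The growth pattern suggests the status updater appends a runtime entry on every reconcile pass without checking for duplicates. A minimal sketch of the missing guard, using a stand-in struct rather than Fluid's real status type:

package main

import "fmt"

// Runtime mirrors the fields shown in the status dump above; it is a
// stand-in for Fluid's actual status type.
type Runtime struct {
	Name      string
	Namespace string
	Type      string
}

// addRuntimeIfAbsent appends r only when no identical entry exists, so
// repeated reconcile passes cannot grow the list.
func addRuntimeIfAbsent(runtimes []Runtime, r Runtime) []Runtime {
	for _, existing := range runtimes {
		if existing == r {
			return runtimes
		}
	}
	return append(runtimes, r)
}

func main() {
	var status []Runtime
	entry := Runtime{Name: "cifar10", Namespace: "default", Type: "Accelerate"}
	for i := 0; i < 3; i++ { // three reconcile passes
		status = addRuntimeIfAbsent(status, entry)
	}
	fmt.Println(len(status)) // 1, not 3
}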

Support scale-in

When the user reduces the replica count, the dataset cache should scale in accordingly.
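
As a sketch of the mechanics only: if the workers are exposed through a scalable workload, the controller could drive the count via the scale subresource. The workload name and kind below are illustrative, Fluid's actual worker topology may differ, and the hard part of this feature is evicting the cache held by the removed replicas first.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// scaleWorkers shrinks (or grows) the worker workload to match the
// runtime's spec.replicas.
func scaleWorkers(client kubernetes.Interface, ns, name string, replicas int32) error {
	scale, err := client.AppsV1().StatefulSets(ns).GetScale(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	scale.Spec.Replicas = replicas
	_, err = client.AppsV1().StatefulSets(ns).UpdateScale(context.TODO(), name, scale, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	// e.g. the user edited the runtime from replicas: 2 down to replicas: 1
	if err := scaleWorkers(client, "default", "cifar10-worker", 1); err != nil {
		panic(err)
	}
	fmt.Println("worker scale-in requested")
}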

[BUG] Running Alluxio as non-root changes the permissions of the UFS

What is your environment (Kubernetes version, Fluid version, etc.)

  • Kubernetes: v1.15
  • Helm: 2.8
  • Fluid: v0.3.0

Describe the bug
I deployed Fluid in my cluster. In dataset.yaml I set mountPoint: local:///tmp with owner uid=844, gid=844, and in runtime.yaml I set runAs with uid=844, gid=844. After I created the dataset and runtime, the permissions of the UFS /tmp were all changed to 844.

What you expect to happen:
Alluxio should not change the permissions of the UFS.
How to reproduce it

Additional Information

Add additionalPrinterColumns for CRDs

We should provide users with an easier way to check the status of created CRD objects than the currently used kubectl describe <crd> <name>.

AdditionalPrinterColumns might be a good way to do that.

Candidate status properties to set under additionalPrinterColumns for each CRD (see the marker sketch below):

Dataset

  • Name
  • UfsTotal: Total size of the mounted UFS
  • Cached: Total size of all the files cached in the ddc engine
  • Cache Capacity: Total cache capacity the ddc engine can provide
  • Cached Percentage: Cached / UfsTotal * 100%
  • Phase: Phase of the Dataset object
  • Age: How long since the object was created

AlluxioRuntime
Further discussion needed

AlluxioDataload
Further discussion needed
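
Since the CRDs here are generated with kubebuilder, one way to realize the Dataset columns above is printcolumn markers. A sketch, with hypothetical JSONPaths inferred from the status fields discussed in this issue (the project's actual paths may differ):

package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Stub spec/status so the sketch compiles; Fluid's real fields differ.
type DatasetSpec struct{}
type DatasetStatus struct{}

// +kubebuilder:printcolumn:name="Ufs Total",type="string",JSONPath=".status.ufsTotal"
// +kubebuilder:printcolumn:name="Cached",type="string",JSONPath=".status.cacheStates.cached"
// +kubebuilder:printcolumn:name="Cache Capacity",type="string",JSONPath=".status.cacheStates.cacheCapacity"
// +kubebuilder:printcolumn:name="Cached Percentage",type="string",JSONPath=".status.cacheStates.cachedPercentage"
// +kubebuilder:printcolumn:name="Phase",type="string",JSONPath=".status.phase"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
type Dataset struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   DatasetSpec   `json:"spec,omitempty"`
	Status DatasetStatus `json:"status,omitempty"`
}

After regenerating the CRD with controller-gen, kubectl get dataset would show these columns directly.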

Alluxio load optimization

1. Alluxio distributed load is needed first.
2. Read data through the FUSE mount (configurable).

Accelerate hostPath in Kubernetes

What feature you'd like to add:

Accelerate hostPath and Persistent Volume in Kubernetes

Why is this feature needed:

Some UFS types, such as NFS and some cloud storage, are not natively supported by the current runtime, but we need to support them in Fluid.

The alluxioworker directory needs to be cleaned up when an alluxioRuntime is deleted

Problem description:

After running Alluxio once with JNR-Fuse and then switching to JNI-Fuse, a data loss error occurred.

The error observed on the application side:

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.DataLossError'>, 2 root error(s) found.
(0) Data loss: truncated record at 142388333
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[input_processing/IteratorGetNext]]
[[cluster_5_1/merge_oidx_1/_2655]]
(1) Data loss: truncated record at 142388333
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[input_processing/IteratorGetNext]]
0 successful operations.
0 derived errors ignored.
I0812 06:10:41.991964 140419573827392 coordinator.py:224] Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.DataLossError'>, 2 root error(s) found.
(0) Data loss: truncated record at 142388333
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[input_processing/IteratorGetNext]]
[[cluster_5_1/merge_oidx_1/_2655]]
(1) Data loss: truncated record at 142388333
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[input_processing/IteratorGetNext]]
0 successful operations.
0 derived errors ignored.

However, the alluxio-fuse log shows the following: block 3489660931 is expected to be 16MB, but only 7MB is available.

2020-08-12 06:07:19,191 ERROR AlluxioJniFuseFileSystem - Failed to read /imagenet/train/train-00123-of-01024,131072,57671680:
java.lang.IllegalStateException: Block 3489660931 is expected to be 16777216 bytes, but only 7341943 bytes are available. Please ensure its metadata is consistent between Alluxio and UFS.
at com.google.common.base.Preconditions.checkState(Preconditions.java:842)
at alluxio.client.block.stream.BlockInStream.readInternal(BlockInStream.java:275)
at alluxio.client.block.stream.BlockInStream.read(BlockInStream.java:264)
at alluxio.client.file.AlluxioFileInStream.read(AlluxioFileInStream.java:187)
at alluxio.fuse.AlluxioJniFuseFileSystem.readInternal(AlluxioJniFuseFileSystem.java:326)
at alluxio.fuse.AlluxioJniFuseFileSystem.lambda$read$4(AlluxioJniFuseFileSystem.java:298)
at alluxio.fuse.AlluxioFuseUtils.call(AlluxioFuseUtils.java:245)
at alluxio.fuse.AlluxioJniFuseFileSystem.read(AlluxioJniFuseFileSystem.java:298)
at alluxio.jnifuse.AbstractFuseFileSystem.readCallback(AbstractFuseFileSystem.java:150)
2020-08-12 06:07:19,196 ERROR AlluxioJniFuseFileSystem - Failed to read /imagenet/train/train-00123-of-01024,4096,57671680:
java.lang.IllegalStateException: Block 3489660931 is expected to be 16777216 bytes, but only 7341943 bytes are available. Please ensure its metadata is consistent between Alluxio and UFS.
at com.google.common.base.Preconditions.checkState(Preconditions.java:842)
at alluxio.client.block.stream.BlockInStream.readInternal(BlockInStream.java:275)
at alluxio.client.block.stream.BlockInStream.read(BlockInStream.java:264)
at alluxio.client.file.AlluxioFileInStream.read(AlluxioFileInStream.java:187)
at alluxio.fuse.AlluxioJniFuseFileSystem.readInternal(AlluxioJniFuseFileSystem.java:326)
at alluxio.fuse.AlluxioJniFuseFileSystem.lambda$read$4(AlluxioJniFuseFileSystem.java:298)
at alluxio.fuse.AlluxioFuseUtils.call(AlluxioFuseUtils.java:245)
at alluxio.fuse.AlluxioJniFuseFileSystem.read(AlluxioJniFuseFileSystem.java:298)
at alluxio.jnifuse.AbstractFuseFileSystem.readCallback(AbstractFuseFileSystem.java:150)

Logging into the corresponding node shows:

1. The block's size is indeed 7.1MB.
2. The block was presumably created earlier by JNR-Fuse and left incomplete.

[root@iZuf68sywkiky95veylv1yZ alluxio]# cd alluxioworker/
[root@iZuf68sywkiky95veylv1yZ alluxioworker]# ls -ltr |grep 3489660931
-rwxrwxrwx 1 root root 7341943 8月 12 11:37 3489660931

Questions:

1. Under what circumstances can a block be written incompletely? There is currently no shortage of storage space.
2. When a new Alluxio cluster is deployed, are the previously cached block files deleted?

Develop a diagnose script

1. Input: the alluxioruntime name and namespace.

2. Output: the logs of all related Alluxio pods.

The capacity of the PV and PVC is not consistent with Ufs Total

What happened:
I ran kubectl apply on these samples, but the capacity of the PV and PVC is always 100Gi, which is not consistent with Ufs Total.

The Dataset Status:

Name:         cifar10
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"data.fluid.io/v1alpha1","kind":"Dataset","metadata":{"annotations":{},"name":"cifar10","namespace":"default"},"spec":{"moun...
API Version:  data.fluid.io/v1alpha1
Kind:         Dataset
Metadata:
  Creation Timestamp:  2020-08-22T06:56:09Z
  Finalizers:
    fluid-dataset-controller-finalizer
  Generation:        1
  Resource Version:  1592896981
  Self Link:         /apis/data.fluid.io/v1alpha1/namespaces/default/datasets/cifar10
  UID:               8ee5c294-e444-11ea-b246-92d5f2bc5508
Spec:
  Mounts:
    Mount Point:  https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.2.1/
    Name:         hadoop
    Mount Point:  https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.4.6/
    Name:         spark
    Mount Point:  https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/2.2.5/
    Name:         hbase
  Node Affinity:
    Required:
      Node Selector Terms:
        Match Expressions:
          Key:       aliyun.accelerator/nvidia_name
          Operator:  In
          Values:
            Tesla-P100-PCIE-16GB
Status:
  Cache States:
    Cache Capacity:     24GiB
    Cached:             0B
    Cached Percentage:  0%
  Conditions:
    Last Transition Time:  2020-08-22T06:57:23Z
    Last Update Time:      2020-08-22T10:27:54Z
    Message:               The ddc runtime is ready.
    Reason:                DatasetReady
    Status:                True
    Type:                  Ready
  Phase:                   Bound
  Runtimes:
    Category:   Accelerate
    Name:       cifar10
    Namespace:  default
    Type:       alluxio
  Ufs Total:    1.742GiB
Events:         <none>

PV and PVC:

NAME                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM             STORAGECLASS   REASON   AGE
persistentvolume/cifar10   100Gi      RWX            Retain           Bound    default/cifar10                           3h39m

NAME                            STATUS   VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/cifar10   Bound    cifar10   100Gi      RWX                           3h39m
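
For illustration, a sketch of deriving the PV capacity from the reported Ufs Total instead of hard-coding 100Gi, assuming the status string format shown above ("1.742GiB"):

package main

import (
	"fmt"
	"strings"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// The status reports `Ufs Total: 1.742GiB` while the PV/PVC get a fixed
	// 100Gi. Normalizing "GiB" to the Kubernetes "Gi" suffix lets the
	// capacity be derived from the dataset itself. Sketch only.
	ufsTotal := "1.742GiB"
	qty, err := resource.ParseQuantity(strings.TrimSuffix(ufsTotal, "B"))
	if err != nil {
		panic(err)
	}
	fmt.Println("derived PV capacity:", qty.String())
}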

Get details such as ready replicas with kubectl get alluxioruntime

What feature you'd like to add:

Checking the status of an AlluxioRuntime currently requires kubectl get -oyaml, which is cumbersome for users. The system should expose details through kubectl get alluxioruntime and kubectl get alluxioruntime -o wide, for example how many worker replicas are currently ready.

Why is this feature needed:

Alluxio data warm-up (prefetch)

Requirements:

Develop an Alluxio prefetch feature. This is essentially a distributed data warm-up task that runs on every cache node. The warm-up is performed by a batch job whose replica count equals the number of cache nodes.

(1) Develop a helm chart that runs the batch job on nodes with cache capacity, with configurable parameters (see the sketch after this list):

  • number of threads
  • directories to warm up

(2) The alluxio load CRD defines the behavior, and the alluxio job controller executes the helm chart.
(3) The alluxio job controller periodically queries and updates the condition and phase of the alluxio load CRD.

Reference:

docker run -v /alluxio-fuse/train:/data --env THREADS=32 -itd --name=read-coco-32-alluxio
registry.cn-huhehaote.aliyuncs.com/tensorflow-samples/coco-perf
/app/read_file.sh
docker logs -f read-coco-32-alluxio

Schedule:

7.29 -> 8.5:
1. Finish the helm chart.
2. Design the alluxio load controller.

Owner: @TrafalgarZZZ @iluoeli
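
A sketch of the warm-up Job from step (1), assuming one pod per cache node and reusing the THREADS variable, image, and mount path from the reference command; pod spreading across cache nodes (e.g. anti-affinity) is omitted for brevity:

package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// One pod per cache node; the real controller would use the current
	// number of Alluxio workers here.
	cacheNodes := int32(3)

	job := batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "dataset-warmup"},
		Spec: batchv1.JobSpec{
			Parallelism: &cacheNodes,
			Completions: &cacheNodes,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "warmup",
						Image:   "registry.cn-huhehaote.aliyuncs.com/tensorflow-samples/coco-perf",
						Command: []string{"/app/read_file.sh"},
						Env:     []corev1.EnvVar{{Name: "THREADS", Value: "32"}},
						VolumeMounts: []corev1.VolumeMount{{
							Name:      "data",
							MountPath: "/data",
						}},
					}},
					Volumes: []corev1.Volume{{
						Name: "data",
						VolumeSource: corev1.VolumeSource{
							HostPath: &corev1.HostPathVolumeSource{Path: "/alluxio-fuse/train"},
						},
					}},
				},
			},
		},
	}
	fmt.Printf("%s: parallelism=%d\n", job.Name, *job.Spec.Parallelism)
}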

[FEATURE] Make timestamp more readable

What is your environment (Kubernetes version, Fluid version, etc.)

{"level":"info","ts":1599569172.4293172,"logger":"alluxioctl.AlluxioRuntime","caller":"alluxio/shutdown.go:34","msg":"clean cache failed","alluxioruntime":"default/test","retry times":1}
{"level":"info","ts":1599653665.3547537,"logger":"alluxioctl.AlluxioRuntime","caller":"alluxio/shutdown.go:34","msg":"clean cache failed","alluxioruntime":"default/test","retry times":2}
{"level":"info","ts":1599653831.3333874,"logger":"alluxioctl.AlluxioRuntime","caller":"alluxio/shutdown.go:34","msg":"clean cache failed","alluxioruntime":"default/test","retry times":3}
{"level":"info","ts":1599654161.170925,"logger":"alluxioctl.AlluxioRuntime","caller":"alluxio/shutdown.go:34","msg":"clean cache failed","alluxioruntime":"default/test","retry times":4}
{"level":"info","ts":1599654818.675095,"logger":"alluxioctl.AlluxioRuntime","caller":"alluxio/shutdown.go:34","msg":"clean cache failed","alluxioruntime":"default/test","retry times":5}

The default timestamp format is not friendly for troubleshooting; please consider making it more readable. Thanks.
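
The float ts field is zap's default epoch time encoder. A minimal sketch of switching to ISO8601 timestamps, assuming a plain zap production logger (Fluid's actual logger wiring may differ):

package main

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

func main() {
	cfg := zap.NewProductionConfig()
	// Replace the epoch-seconds encoder behind the "ts" field with a
	// human-readable ISO8601 time.
	cfg.EncoderConfig.EncodeTime = zapcore.ISO8601TimeEncoder
	logger, err := cfg.Build()
	if err != nil {
		panic(err)
	}
	defer logger.Sync()

	// Emits e.g. {"level":"info","ts":"2020-09-09T12:34:56.789+0800",...}
	logger.Info("clean cache failed",
		zap.String("alluxioruntime", "default/test"),
		zap.Int("retry times", 1))
}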

Describe the bug

What you expect to happen:

How to reproduce it

Additional Information

[BUG] Port conflict with multiple Datasets

What is your environment (Kubernetes version, Fluid version, etc.)
v0.3.0
Describe the bug
When two Datasets are scheduled to the same node, one of them stays pending.
What you expect to happen:
One Kubernetes node should support multiple Datasets.
How to reproduce it
Create two Datasets on the same node via nodeName.
Additional Information
The log shows a port conflict: alluxio-master uses container ports 19998 and 19999 as host ports, and alluxio-job-master uses container ports 20001 and 20002 as host ports.
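
One possible direction, sketched under the assumption that the controller could reserve host ports dynamically per runtime instead of hard-coding 19998/19999/20001/20002: probe the node for currently free ports and assign a distinct set to each runtime.

package main

import (
	"fmt"
	"net"
)

// freePorts returns n distinct currently-bindable ports; a controller could
// reserve such a set per runtime instead of using fixed host ports.
func freePorts(n int) ([]int, error) {
	ports := make([]int, 0, n)
	listeners := make([]net.Listener, 0, n)
	// Keep every probe listener open until the end so the same port is
	// never handed out twice.
	defer func() {
		for _, l := range listeners {
			l.Close()
		}
	}()
	for len(ports) < n {
		l, err := net.Listen("tcp", "127.0.0.1:0") // port 0 = pick a free one
		if err != nil {
			return nil, err
		}
		listeners = append(listeners, l)
		ports = append(ports, l.Addr().(*net.TCPAddr).Port)
	}
	return ports, nil
}

func main() {
	ports, err := freePorts(4) // master rpc/web + job-master rpc/web
	if err != nil {
		panic(err)
	}
	fmt.Println("candidate host ports:", ports)
}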

Cannot delete dataset and runtime

What is your environment (Kubernetes version, Fluid version, etc.)
Kubernetes: 1.15
Fluid: 0.3.0

Describe the bug
I deployed a dataset and runtime in my cluster and everything went well. However, two days later I wanted to delete them, but both kubectl delete -f dataset.yaml and kubectl delete -f runtime.yaml failed.
What you expect to happen:

How to reproduce it

Additional Information

[FEATURES] Translate the user in UFS to Fuse

What is your environment (Kubernetes version, Fluid version, etc.)

Describe the bug

Translate the user in UFS to Fuse

What you expect to happen:

How to reproduce it

Additional Information

Support LocalPath Distributed Training Scenario

What feature you'd like to add:
Support the local-path distributed training scenario.
Why is this feature needed:
Description of our training architecture:

origin:
                    Distributed Storage (Lustre)
                                          |
admin: every host machine has a Lustre client; LDAP manages file access; uid/gid are automatically injected into user jobs (clients)
user: allocates GPU resources, chooses whether to run distributed training (via the MPIJob CRD), and specifies the mount directory

fluid:
                    Distributed Storage (Lustre)
                                          |
admin: every host machine has a Lustre client; creates the dataset and runtime, injecting uid/gid
user: allocates GPU resources, chooses whether to run distributed training (via the MPIJob CRD), and specifies the mount directory

Gap: with distributed training, neither the user nor the admin knows in advance which node a pod will be bound to, so the admin cannot create the dataset and runtime on the right node.

If there is any misunderstanding, anybody can point it out :)

Resource consumption evaluation

In the accelerated-hostPath scenario, the application (dataset), master, worker, and fuse client are all deployed on a single node, which increases the load on that host. This needs long-term testing across different scenarios, briefly summarized below.

  1. Datasets of different orders of magnitude and types, for example: lots of small files; medium-size binary files (including ImageNet TFRecords); large binary files (>1T).
  2. For different datasets, the capacity of a single host machine for hosting datasets.
  3. Multi-user hot-data and cold-data scenario tests.

If there is any misunderstanding, anybody can point it out :)

The diagnose script does not collect Alluxio logs

The alluxio worker logs were not collected.

bash diagnose-fluid.sh --name imagenet --namespace default
No resources found in default namespace.
No resources found in default namespace.
No resources found in default namespace.
No resources found in default namespace.
No resources found in default namespace.
tar: Removing leading '/' from member names
/tmp/diagnose_fluid_1596761896/
/tmp/diagnose_fluid_1596761896/pods-fluid-system/
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-8cfdh-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-g7wxt-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-h6j5z-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-nlpgx-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/controller-manager-6fb8db4f6b-kzp85-manager.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-twk7q-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-7s9zk-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-g7wxt-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-twk7q-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-g9c9s-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-nlpgx-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-ght9v-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-2qqgp-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-7866d-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-7s9zk-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-zdv9d-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-hjj9m-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-qcdl2-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-zdv9d-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-qcdl2-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-g9c9s-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-ght9v-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-mwzlq-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-2qqgp-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-8cfdh-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-hjj9m-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-h6j5z-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-gkhsd-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-gdc7t-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-7866d-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-mwzlq-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-gkhsd-plugins.log
/tmp/diagnose_fluid_1596761896/pods-fluid-system/csi-nodeplugin-fluid-gdc7t-node-driver-registrar.log
/tmp/diagnose_fluid_1596761896/fluid-system.log
/tmp/diagnose_fluid_1596761896/default.log
/tmp/diagnose_fluid_1596761896/pods-default/
/tmp/diagnose_fluid_1596761896/helm.log
please get diagnose_fluid_1596761896.tar.gz for diagnostics

However, no alluxio-worker logs were produced:

[root@iZuf68sywkiky95veylv1yZ fluid-cy]# cd /tmp/diagnose_fluid_1596761896/
[root@iZuf68sywkiky95veylv1yZ diagnose_fluid_1596761896]# ls
default.log  fluid-system.log  helm.log  pods-default  pods-fluid-system
[root@iZuf68sywkiky95veylv1yZ diagnose_fluid_1596761896]# cd pods-default/
[root@iZuf68sywkiky95veylv1yZ pods-default]# ls
[root@iZuf68sywkiky95veylv1yZ pods-default]#

Can't delete a dataset even when the associated runtime has been deleted

What happened
I have a bound dataset and its alluxio runtime, and I want to delete them to do some cleanup.
After successfully deleting the alluxio runtime, the status of the associated dataset still remains Bound.
Also, if I try to delete the dataset with kubectl delete, it hangs indefinitely.

What did you expect to see
After the deletion of the Alluxio runtime, shouldn't the status change from Bound to NotBound or something else? And I expected to be able to delete the dataset with kubectl delete.

How to reproduce it

  1. Make a bound dataset
  2. kubectl delete alluxio runtime
  3. check the status of the dataset
  4. kubectl delete dataset

Environment

  • Fluid 0.1.0 (installed from helm 3)
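
The infinite hang is consistent with a finalizer that is never removed: the Dataset status dump elsewhere on this page shows a fluid-dataset-controller-finalizer, and kubectl delete cannot finish until that list is emptied. A minimal sketch of the missing step, assuming the controller can detect that no runtime references the dataset anymore:

package main

import "fmt"

// removeFinalizer returns the finalizer list without the given name; once
// the list is empty (and persisted), Kubernetes can complete the deletion.
func removeFinalizer(finalizers []string, name string) []string {
	out := finalizers[:0]
	for _, f := range finalizers {
		if f != name {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	finalizers := []string{"fluid-dataset-controller-finalizer"}
	// In the reconcile loop: the dataset is being deleted and no runtime
	// references it anymore, so drop the controller's finalizer.
	finalizers = removeFinalizer(finalizers, "fluid-dataset-controller-finalizer")
	fmt.Println(len(finalizers)) // 0 -> kubectl delete can now finish
}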

Demo for dawnbench

  1. Create docs for users to run it by themselves.
  2. Create video (No urgent, need discussion)

[BUG] Alluxio runtime running as a non-root user has no permission to cache data

What is your environment (Kubernetes version, Fluid version, etc.)

Describe the bug

  1. Create alluxio runtime
spec:
  replicas: 2
#  alluxioVersion:
#    image: registry.cn-huhehaote.aliyuncs.com/alluxio/alluxio
#    imageTag: "2.3.0-SNAPSHOT-f83f51e"
#    imagePullPolicy: Always
  initUsers:
    image: registry.cn-hangzhou.aliyuncs.com/fluid/init-users
    imageTag: v0.3.0
    imagePullPolicy: Always
  tieredstore:
    levels:
      - mediumtype: SSD
        path: /var/lib/docker/alluxio
        quota: 2Gi
        high: "0.95"
        low: "0.7"
  runAs:
    uid: 1005
    gid: 1005
    user: myuser
    group: mygroup
  2. I found the Alluxio worker doesn't have permission to create the cache directory:
2020-09-20 09:47:26,568 INFO  BlockWorkerFactory - Creating alluxio.worker.block.BlockWorker
2020-09-20 09:47:26,644 ERROR StorageTier - Unable to initialize storage directory at /var/lib/docker/alluxio/alluxioworker: Failed to create folder /var/lib/docker/alluxio/alluxioworker
  3. For now, I have to add permissions on the specified node before creating the Alluxio runtime:
chmod -R 777 /var/lib/docker/alluxio

What you expect to happen:

I hope we can grant permission 777 on the cache directory /var/lib/docker/alluxio in the init containers (see the sketch below).
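
A sketch of that suggestion, assuming the worker pod spec can carry one extra init container; the image and volume names below are illustrative:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// An init container that opens up the tiered-store path before the
	// non-root Alluxio worker starts.
	initContainer := corev1.Container{
		Name:    "init-cache-dir",
		Image:   "busybox",
		Command: []string{"sh", "-c", "chmod -R 0777 /var/lib/docker/alluxio"},
		VolumeMounts: []corev1.VolumeMount{{
			Name:      "cache-ssd",
			MountPath: "/var/lib/docker/alluxio",
		}},
	}
	fmt.Printf("init container: %s -> %v\n", initContainer.Name, initContainer.Command)
}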

How to reproduce it

Additional Information

Release v0.1.0

What is your environment (Kubernetes version, Fluid version, etc.)

Create Release v0.1.0

Describe the bug

What you expect to happen:

How to reproduce it

Additional Information

Split and decouple the dataset controller and the alluxioruntime controller

What feature you'd like to add:

Currently the dataset controller and the alluxioruntime controller live in the same binary and need to be split apart. The dataset controller should own the dataset lifecycle independently of any runtime implementation.

1. Split the dataset controller and the alluxioruntime controller, each with its own main function as the entrypoint (see the sketch below).
2. Deploy the two controllers in separate Pods.

/cc @xieydd @TrafalgarZZZ @iluoeli
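
A minimal sketch of what a dedicated dataset-controller entrypoint could look like with controller-runtime; the reconciler wiring is left out because Fluid's setup API is not shown in this issue:

package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	ctrl.SetLogger(zap.New())

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		os.Exit(1)
	}

	// Register only the dataset reconciler here; the alluxioruntime
	// controller gets a symmetric main() in its own binary and Pod.

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}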

Why is this feature needed:

Optimize Fluid's dataset initialization

Dataset initialization currently depends on:

  1. alluxio fs ls -R /
  2. alluxio fs count

If the user's dataset is very large, the ufsTotal computation needs to become an asynchronous operation.

TODO:

1. Design the asynchronous approach (see the sketch below).
2. Implement it.
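
A sketch of the asynchronous shape this could take, with stand-in names rather than Fluid's actual API: the reconcile loop requeues and keeps the dataset usable while a background goroutine walks the UFS.

package main

import (
	"fmt"
	"time"
)

// scanUFS stands in for the expensive metadata walk (alluxio fs ls -R /
// plus alluxio fs count); illustrative only.
func scanUFS() int64 {
	time.Sleep(2 * time.Second) // a huge UFS can take far longer
	return 1871000000           // bytes
}

func main() {
	result := make(chan int64, 1)

	// Kick off the scan in the background so reconciliation is not blocked.
	go func() { result <- scanUFS() }()

	for {
		select {
		case total := <-result:
			// The next reconcile pass persists the value into the status.
			fmt.Println("ufsTotal ready:", total)
			return
		default:
			fmt.Println("ufsTotal not ready; requeue without blocking")
			time.Sleep(500 * time.Millisecond)
		}
	}
}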

[DOC] Data volume acceleration

Provide a link to that doc page:

Documentation in both Chinese and English:

  • Documentation
  • Short video

What is the defect and your suggestions on improvement:
