
Curve is a sandbox project hosted by the CNCF. It is cloud-native, high-performance, and easy to operate: an open-source distributed storage system for block and shared file storage.

Home Page: https://opencurve.io

License: Apache License 2.0

Languages: Starlark 1.32%, Shell 1.14%, Python 2.14%, Dockerfile 0.09%, C++ 89.10%, C 0.42%, SWIG 0.01%, Roff 0.06%, Makefile 0.06%, Go 4.95%, Jinja 0.35%, Java 0.34%
Topics: storage, distributed-systems, raft, sds, cloud-native-storage, high-performance, block-storage, filestorage, storage-engine, posix-compatible

curve's Issues

Stale reads on read chunk

Reading the code, I found that both ReadChunkRequest::Process and CopysetNode::on_apply use ConcurrentApplyModule. But ConcurrentApplyModule dispatches read requests and write requests for the same chunk to different task queues (different threads), so it seems this cannot provide the stale-read protection that ReadChunkRequest::Process claims to solve.
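If reads and writes for the same chunk can land in different queues, per-chunk ordering is indeed lost. A minimal sketch of the serialization the reporter expects (in Python for brevity; ConcurrentApplyModule itself is C++, and the hash-by-chunk-id rule here is an assumption about one possible fix, not Curve's actual code):

```python
import queue

class PerChunkDispatcher:
    """Route every request for a given chunk to one fixed worker queue,
    regardless of whether it is a read or a write, so a read can never
    be processed ahead of an earlier write to the same chunk."""

    def __init__(self, num_workers=4):
        self.queues = [queue.Queue() for _ in range(num_workers)]

    def queue_for(self, chunk_id):
        # Select by chunk id only -- NOT by request type.
        return self.queues[hash(chunk_id) % len(self.queues)]

    def submit(self, chunk_id, task):
        self.queue_for(chunk_id).put(task)

d = PerChunkDispatcher()
d.submit(42, "write")
d.submit(42, "read")
q = d.queue_for(42)
assert q.get() == "write" and q.get() == "read"  # FIFO within one chunk
```

If the dispatch key instead included the request type, the two requests above could land in different queues and execute in either order, which is exactly the stale-read hazard described.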

fio test drives chunkservers offline

Version

https://github.com/opencurve/curve/releases/tag/v1.0.0

Steps

Before the fio test, curve_ops_tool status showed no chunkserver, mds, or etcd offline. Then ran:
fio -direct=1 -iodepth=64 -thread -rw=randwrite -bs=4k -numjobs=4 -runtime=30 -group_reporting -name=test-curve -filename=/dev/nbd0 -ioengine=libaio -io_limit=400000G
There was only a small amount of I/O on the data disks.
Afterwards, curve_ops_tool status showed chunkservers offline:

cluster is not healthy
total copysets: 300, unhealthy copysets: 110, unhealthy_ratio: 36.6667%
...
chunkserver: total num = 36, online = 32, offline = 4(recoveringout = 0, chunkserverlist: [])
left size: min = 687GB, max = 688GB, average = 687.29GB, range = 1GB, variance = 0.21

The logs of the offline chunkservers look like this:

I 2020-11-18T02:09:45-0500 49594 chunkfile_pool.cpp:306] get chunk success! now pool size = 44017
W 2020-11-18T02:09:45-0500 49589 ext4_filesystem_impl.cpp:142] open failed: Too many open files, file path = /data/chunkserver0/chunkfilepool/30235
E 2020-11-18T02:09:45-0500 49589 chunkfile_pool.cpp:368] file open failed, /data/chunkserver0/chunkfilepool/30235
I 2020-11-18T02:09:45-0500 49589 chunkfile_pool.cpp:289] src path = /data/chunkserver0/chunkfilepool/30235, dist path = /data/chunkserver0/copysets/4294967448/data/chunk_57670
E 2020-11-18T02:09:45-0500 49589 chunkfile_pool.cpp:311] write metapage failed, /data/chunkserver0/chunkfilepool/30235
W 2020-11-18T02:09:45-0500 49589 ext4_filesystem_impl.cpp:142] open failed: Too many open files, file path = /data/chunkserver0/chunkfilepool/35831
E 2020-11-18T02:09:45-0500 49589 chunkfile_pool.cpp:368] file open failed, /data/chunkserver0/chunkfilepool/35831
I 2020-11-18T02:09:45-0500 49589 chunkfile_pool.cpp:289] src path = /data/chunkserver0/chunkfilepool/35831, dist path = /data/chunkserver0/copysets/4294967448/data/chunk_57670
E 2020-11-18T02:09:45-0500 49589 chunkfile_pool.cpp:311] write metapage failed, /data/chunkserver0/chunkfilepool/35831
W 2020-11-18T02:09:45-0500 49589 ext4_filesystem_impl.cpp:142] open failed: Too many open files, file path = /data/chunkserver0/chunkfilepool/42129
E 2020-11-18T02:09:45-0500 49589 chunkfile_pool.cpp:368] file open failed, /data/chunkserver0/chunkfilepool/42129
I 2020-11-18T02:09:45-0500 49589 chunkfile_pool.cpp:289] src path = /data/chunkserver0/chunkfilepool/42129, dist path = /data/chunkserver0/copysets/4294967448/data/chunk_57670
E 2020-11-18T02:09:45-0500 49589 chunkfile_pool.cpp:311] write metapage failed, /data/chunkserver0/chunkfilepool/42129
W 2020-11-18T02:09:45-0500 49589 ext4_filesystem_impl.cpp:142] open failed: Too many open files, file path = /data/chunkserver0/chunkfilepool/21533
E 2020-11-18T02:09:45-0500 49589 chunkfile_pool.cpp:368] file open failed, /data/chunkserver0/chunkfilepool/21533
I 2020-11-18T02:09:45-0500 49589 chunkfile_pool.cpp:289] src path = /data/chunkserver0/chunkfilepool/21533, dist path = /data/chunkserver0/copysets/4294967448/data/chunk_57670
E 2020-11-18T02:09:45-0500 49589 chunkfile_pool.cpp:311] write metapage failed, /data/chunkserver0/chunkfilepool/21533
W 2020-11-18T02:09:45-0500 49589 ext4_filesystem_impl.cpp:142] open failed: Too many open files, file path = /data/chunkserver0/chunkfilepool/4215
E 2020-11-18T02:09:45-0500 49589 chunkfile_pool.cpp:368] file open failed, /data/chunkserver0/chunkfilepool/4215
I 2020-11-18T02:09:45-0500 49589 chunkfile_pool.cpp:289] src path = /data/chunkserver0/chunkfilepool/4215, dist path = /data/chunkserver0/copysets/4294967448/data/chunk_57670
E 2020-11-18T02:09:45-0500 49589 chunkfile_pool.cpp:311] write metapage failed, /data/chunkserver0/chunkfilepool/4215
E 2020-11-18T02:09:45-0500 49589 chunkserver_chunkfile.cpp:195] Error occured when create file. filepath = /data/chunkserver0/copysets/4294967448/data/chunk_57670
W 2020-11-18T02:09:45-0500 49589 chunkserver_datastore.cpp:197] Create chunk file failed.ChunkID = 57670, ErrorCode = 1
F 2020-11-18T02:09:45-0500 49589 op_request.cpp:479] write failed:  logic pool id: 1 copyset id: 152 chunkid: 57670 data size: 4096 data store return: 1
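The repeated "Too many open files" lines mean the chunkserver process exhausted its file-descriptor limit (RLIMIT_NOFILE) while moving chunks out of the chunkfile pool. A small sketch of checking and raising the soft limit before starting a service (Python for brevity; the threshold is illustrative, not a Curve default, and raising the ulimit may or may not be the right fix here):

```python
import resource

def ensure_nofile(min_needed):
    """Raise the soft RLIMIT_NOFILE toward the hard limit if it is below
    min_needed; return the (soft, hard) limits in effect afterwards."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < min_needed:
        # Without privileges we can only raise the soft limit up to the hard limit.
        target = min_needed if hard == resource.RLIM_INFINITY else min(min_needed, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        soft = target
    return soft, hard

soft, hard = ensure_nofile(1024)  # a busy chunkserver would want far more
```

Equivalently, `ulimit -n` in the service's start script (or `LimitNOFILE=` for a systemd unit) sets the same limit from outside the process.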

The affected chunkservers could not be restarted manually. I tried restarting the whole cluster:

ansible-playbook -i server.ini stop_curve.yml 
ansible-playbook -i server.ini start_curve.yml

After that, some chunkservers started, but others remained offline, with logs similar to the above:

I 2020-11-18T02:36:46-0500 103320 chunkfile_pool.cpp:306] get chunk success! now pool size = 44013
W 2020-11-18T02:36:46-0500 103315 ext4_filesystem_impl.cpp:142] open failed: Too many open files, file path = /data/chunkserver4/chunkfilepool/13316
E 2020-11-18T02:36:46-0500 103315 chunkfile_pool.cpp:368] file open failed, /data/chunkserver4/chunkfilepool/13316
I 2020-11-18T02:36:46-0500 103315 chunkfile_pool.cpp:289] src path = /data/chunkserver4/chunkfilepool/13316, dist path = /data/chunkserver4/copysets/4294967520/data/chunk_91042
E 2020-11-18T02:36:46-0500 103315 chunkfile_pool.cpp:311] write metapage failed, /data/chunkserver4/chunkfilepool/13316
W 2020-11-18T02:36:46-0500 103315 ext4_filesystem_impl.cpp:142] open failed: Too many open files, file path = /data/chunkserver4/chunkfilepool/20387
E 2020-11-18T02:36:46-0500 103315 chunkfile_pool.cpp:368] file open failed, /data/chunkserver4/chunkfilepool/20387
I 2020-11-18T02:36:46-0500 103315 chunkfile_pool.cpp:289] src path = /data/chunkserver4/chunkfilepool/20387, dist path = /data/chunkserver4/copysets/4294967520/data/chunk_91042
E 2020-11-18T02:36:46-0500 103315 chunkfile_pool.cpp:311] write metapage failed, /data/chunkserver4/chunkfilepool/20387
W 2020-11-18T02:36:46-0500 103315 ext4_filesystem_impl.cpp:142] open failed: Too many open files, file path = /data/chunkserver4/chunkfilepool/25475
E 2020-11-18T02:36:46-0500 103315 chunkfile_pool.cpp:368] file open failed, /data/chunkserver4/chunkfilepool/25475
I 2020-11-18T02:36:46-0500 103315 chunkfile_pool.cpp:289] src path = /data/chunkserver4/chunkfilepool/25475, dist path = /data/chunkserver4/copysets/4294967520/data/chunk_91042
E 2020-11-18T02:36:46-0500 103315 chunkfile_pool.cpp:311] write metapage failed, /data/chunkserver4/chunkfilepool/25475
W 2020-11-18T02:36:46-0500 103315 ext4_filesystem_impl.cpp:142] open failed: Too many open files, file path = /data/chunkserver4/chunkfilepool/34096
E 2020-11-18T02:36:46-0500 103315 chunkfile_pool.cpp:368] file open failed, /data/chunkserver4/chunkfilepool/34096
I 2020-11-18T02:36:46-0500 103315 chunkfile_pool.cpp:289] src path = /data/chunkserver4/chunkfilepool/34096, dist path = /data/chunkserver4/copysets/4294967520/data/chunk_91042
E 2020-11-18T02:36:46-0500 103315 chunkfile_pool.cpp:311] write metapage failed, /data/chunkserver4/chunkfilepool/34096
W 2020-11-18T02:36:46-0500 103315 ext4_filesystem_impl.cpp:142] open failed: Too many open files, file path = /data/chunkserver4/chunkfilepool/734
E 2020-11-18T02:36:46-0500 103315 chunkfile_pool.cpp:368] file open failed, /data/chunkserver4/chunkfilepool/734
I 2020-11-18T02:36:46-0500 103315 chunkfile_pool.cpp:289] src path = /data/chunkserver4/chunkfilepool/734, dist path = /data/chunkserver4/copysets/4294967520/data/chunk_91042
E 2020-11-18T02:36:46-0500 103315 chunkfile_pool.cpp:311] write metapage failed, /data/chunkserver4/chunkfilepool/734
E 2020-11-18T02:36:46-0500 103315 chunkserver_chunkfile.cpp:195] Error occured when create file. filepath = /data/chunkserver4/copysets/4294967520/data/chunk_91042
W 2020-11-18T02:36:46-0500 103315 chunkserver_datastore.cpp:197] Create chunk file failed.ChunkID = 91042, ErrorCode = 1
F 2020-11-18T02:36:46-0500 103315 op_request.cpp:532] write failed:  logic pool id: 1 copyset id: 224 chunkid: 91042 data size: 4096 data store return: 1

Both fio runs hit this problem, so performance testing is blocked. If there is another recommended way to test, please share it.
Also, is ansible-playbook -i server.ini clean_curve.yml the correct way to clean up the cluster? Sometimes redeploying after running it still fails, and I do not know which files are left behind.
The ansible configuration files are attached: config.zip

The deployment basically follows https://github.com/opencurve/curve/blob/master/docs/cn/deploy.md

Deployment with ansible fails with FAILED

Describe the bug
When deploying with ansible-playbook -i server.ini deploy_curve.yml, it reports:

TASK [generate_config : generate configuration file directly] *************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "checksum": "27c7b68395f392cdc4d364ba6afa06b577c925ff", "msg": "Destination /etc/curve not writable"}

...
fatal: [localhost]: FAILED! => {"changed": true, "cmd": "sudo cp /etc/curve/etcd.conf.yml /etc/curve/etcd.conf.yml.bak", "delta": "0:00:00.064517", "end": "2021-02-21 09:41:20.241484", "msg": "non-zero return code", "rc": 1, "start": "2021-02-21 09:41:20.176967", "stderr": "cp: cannot stat '/etc/curve/etcd.conf.yml': No such file or directory", "stderr_lines": ["cp: cannot stat '/etc/curve/etcd.conf.yml': No such file or directory"], "stdout": "", "stdout_lines": []}

The deployment then fails.

To Reproduce

  1. Deleting /etc/curve reproduces the problem. Both the ansible run and the rm were performed as root.

  2. After changing the owner of /etc/curve to curve, as the message suggests, the error goes away:

    chown -R curve:curve /etc/curve

Expected behavior

Fix this bug.

Versions

Built and deployed with the docker image opencurve/curveintegration:centos8.

Version: commit 1c81911 (HEAD -> master, origin/master, origin/HEAD)

Additional context/screenshots
(screenshot attached)

Cannot build successfully on Ubuntu

General Question

test/failpoint/failpoint_test.cpp:24:25: fatal error: fiu-control.h: No such file or directory
This file cannot be found anywhere in the source tree.
Do we need to build libfiu ourselves? I found nothing about it in the build files.

Build error running ./build.sh inside Docker

General Question

Following the official docs, I pulled the opencurve/curvebuild:centos8 image, then pulled the code and ran build.sh. The build fails with:

[280 / 833] 8 actions, 7 running
Compiling external/com_google_protobuf/src/google/protobuf/descriptor.cc; 2s processwrapper-sandbox
@com_google_protobuf//:protobuf; 1s processwrapper-sandbox
@com_google_protobuf//:protoc_lib; 1s processwrapper-sandbox
@com_google_protobuf//:protoc_lib; 1s processwrapper-sandbox
@com_google_protobuf//:protobuf; 0s processwrapper-sandbox
Compiling external/com_google_protobuf/src/google/protobuf/descriptor.pb.cc; 0s processwrapper-sandbox
Compiling external/com_google_protobuf/src/google/protobuf/descriptor_database.cc; 0s processwrapper-sandbox
[-----] Compiling external/com_google_protobuf/src/google/protobuf/util/internal/object_writer.cc

Server terminated abruptly (error code: 14, error message: '', log file: '/root/.cache/bazel/_bazel_root/6f5f4033910d16741199bab868564698/server/jvm.out')

build phase1 failed
[root@0e79895da77d curve]# cat /root/.cache/bazel/_bazel_root/6f5f4033910d16741199bab868564698/server/jvm.out
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.protobuf.UnsafeUtil (file:/root/.cache/bazel/_bazel_root/install/792a28b07894763eaa2bd870f8776b23/_embedded_binaries/A-server.jar) to field java.lang.String.value
WARNING: Please consider reporting this to the maintainers of com.google.protobuf.UnsafeUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

Standalone deployment: formatting /dev/nbd0 fails

Describe the bug

  1. mkfs.ext4 /dev/nbd0 fails. Diagnostic output below:
  • root@ubuntu-xenial:/home/vagrant# curve_ops_tool status

Cluster status:
Get status metric from 127.0.0.1:8081 fail
No snapshot-clone-server is active
snapshot-clone-server 127.0.0.1:5556 is offline
cluster is not healthy
total copysets: 100, unhealthy copysets: 0, unhealthy_ratio: 0%
physical pool number: 1, logical pool number: 1
total space = 122021132GB, logical used = 24GB(0.00%, can be recycled = 0GB(0.00%)), physical used = 1GB(0.00%)

Client status:
nebd-server: version-0.1.0: 1

MDS status:
version: 0.1.0
current MDS: 127.0.0.1:6666
online mds list: 127.0.0.1:6666
offline mds list:

Etcd status:
version: 3.4.0
current etcd: 127.0.0.1:2379
online etcd list: 127.0.0.1:2379
offline etcd list:

SnapshotCloneServer status:
no version found!
GetAndCheckSnapshotCloneVersion fail
Get status metric from 127.0.0.1:8081 fail
current snapshot-clone-server:
online snapshot-clone-server list:
offline snapshot-clone-server list: 127.0.0.1:5556

ChunkServer status:
version: 0.1.0
chunkserver: total num = 3, online = 3, offline = 0(recoveringout = 0, chunkserverlist: [])
left size: min = 20278169GB, max = 56282713GB, average = 40673710.33GB, range = 36004544GB, variance = 227510007645070.22

  • root@ubuntu-xenial:/home/vagrant# curve-nbd list-mapped

id image device
18042 cbd:pool//test_curve_ /dev/nbd0

  • root@ubuntu-xenial:/home/vagrant# fdisk /dev/nbd0

Welcome to fdisk (util-linux 2.27.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

fdisk: cannot open /dev/nbd0: Input/output error

  • root@ubuntu-xenial:/home/vagrant# lsblk

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 10G 0 disk
└─sda1 8:1 0 10G 0 part /
sdb 8:16 0 10M 0 disk
nbd0 43:0 0 10G 0 disk
root@ubuntu-xenial:/home/vagrant#

  1. Ubuntu 16.04, deployed following the standalone guide
    logicalpools/name was changed to 2
    curve branch: master

ChunkServer logs: server.tar.gz
/data/ logs: data.tar.gz

Mapping a volume fails

Problem description:
After creating a volume, mapping it fails with:
curve-nbd: kernel reported invalid size (85899345920, expected 10737418240)
curve-nbd: failed to map, status: Invalid argument

To Reproduce
Standalone deployment on a single physical machine, following https://github.com/opencurve/curve/blob/master/docs/cn/deploy.md#单机部署

Versions
OS: CentOS Linux release 7.3.1611 (Core)
gcc: 8.4.0
openssl: OpenSSL 1.1.1g 21 Apr 2020
git: 2.28.0
curve: v1.0.0-beta
nbd info: (screenshot attached)
Cluster status: (screenshot attached)

Volume creation and mapping: (screenshot attached)

nebd.zip

Cannot build on CentOS 7

Describe the bug
Cannot build on CentOS 7.

To Reproduce
bazel 1.2.1
bash ./mk-tar.sh

Expected behavior

Versions
OS: centos7
Compiler: gcc (GCC) 7.3.1
curve-mds:
curve-chunkserver:
curve-snapshotcloneserver:
curve-sdk:
nebd:
curve-nbd:

Additional context/screenshots
name 'http_archive' is not defined
name 'new_git_repository' is not defined (did you mean 'git_repository'?)

[C-Plan topic 2: bug hunt] Fix an initialization error

**Describe alternatives you've considered (optional)**

In /curve/test/chunkserver/multiple_copysets_io_test.cpp, function update_leader():

The leader field in the info structure is not explicitly initialized, so its initial value is 0. In update_leader, if the first if evaluates to false because the leader election failed, the function returns copyset->leader directly; but 0 is a valid leader value, so this causes an error.

So I think the initial value of copyset->leader should be set to -1, and the now-meaningless check in update_leader should be removed, because with -1 as the initial value that check would read out of bounds.
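The hazard described is a classic sentinel-value bug: the implicit zero default is also a legal leader id, so a failed election is indistinguishable from "peer 0 is the leader". A minimal sketch of the proposed fix (Python for brevity; names are illustrative, not the actual test fixture's):

```python
INVALID_LEADER = -1  # cannot collide with any real peer id, unlike 0

class Copyset:
    def __init__(self):
        self.leader = INVALID_LEADER  # explicit, instead of an implicit 0

def update_leader(copyset, elected=None):
    """Record the election result; keep the sentinel when election failed."""
    if elected is not None:
        copyset.leader = elected
    return copyset.leader

c = Copyset()
assert update_leader(c) == INVALID_LEADER  # failed election is now detectable
assert update_leader(c, 0) == 0            # peer id 0 is no longer ambiguous
```

As the report notes, any remaining code that indexes an array by the leader value must then guard against the -1 sentinel, or it will go out of bounds.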

Standalone deployment: deploying the cluster fails with "No daemon installed"

General Question

During the standalone deployment, deploying the cluster fails with the following error:

fatal: [localhost]: FAILED! => {"changed": true, "cmd": "sudo ./mds-daemon.sh start", "delta": "0:00:00.024325", "end": "2020-08-04 20:40:04.707755", "msg": "non-zero return code", "rc": 1, "start": "2020-08-04 20:40:04.683430", "stderr": "", "stderr_lines": [], "stdout": "No daemon installed", "stdout_lines": ["No daemon installed"]}

Looking back through the deployment log, the daemon installation check had already passed:

TASK [install_package : install daemon] ****************************************
included: /home/curve/curve/curve-ansible/roles/install_package/tasks/include/install_daemon.yml for localhost

TASK [install_package : determine if daemon installed] *************************
changed: [localhost]

TASK [install_package : set daemon_installed] **********************************
ok: [localhost]

I then inspected /home/curve/curve/curve-ansible/roles/install_package/tasks/include/install_daemon.yml and found that it runs the shell command daemon --version. Running daemon --version manually gives:

[curve@localhost ~]$ daemon --version
daemon-0.6.4

This shows daemon is installed, so why does the error above still occur?
Looking forward to your reply. Many thanks.

Standalone deployment succeeds, but creating a file fails

[root@057e65f0a884 curve-ansible]# curve_ops_tool status
Cluster status:
Copysets are not healthy!
Get status metric from 127.0.0.1:8081 fail
No snapshot-clone-server is active
snapshot-clone-server 127.0.0.1:5556 is offline
cluster is not healthy
total copysets: 100, unhealthy copysets: 13, unhealthy_ratio: 13%
physical pool number: 1, logical pool number: 1
total space = 0GB, logical used = 0GB(0.00%, can be recycled = 0GB(0.00%)), physical used = 0GB(0.00%)

Client status:

MDS status:
version: 9.9.9
current MDS: 127.0.0.1:6666
online mds list: 127.0.0.1:6666
offline mds list:

Etcd status:
version: 3.4.0
current etcd: 127.0.0.1:2379
online etcd list: 127.0.0.1:2379
offline etcd list:

SnapshotCloneServer status:
no version found!
GetAndCheckSnapshotCloneVersion fail
Get status metric from 127.0.0.1:8081 fail
current snapshot-clone-server:
online snapshot-clone-server list:
offline snapshot-clone-server list: 127.0.0.1:5556

ChunkServer status:
version: 9.9.9
chunkserver: total num = 3, online = 3, offline = 0(recoveringout = 0, chunkserverlist: [])
left size: min = 0GB, max = 0GB, average = 0.00GB, range = 0GB, variance = 0.00
[root@057e65f0a884 curve-ansible]# curve create --filename /test --length 10 --user root
E 2020-08-03T23:09:59.519749+0800 12349 server.cpp:994] Fail to listen 0.0.0.0:9000
E 2020-08-03T23:09:59.519836+0800 12349 server.cpp:1832] Fail to start dummy_server at port=9000
E 2020-08-03T23:09:59.529330+0800 12349 mds_client.cpp:314] CreateFile: filename = /test, owner = root, is nomalfile: 1, errocde = 4, error msg = kOwnerAuthFail, log id = 1
create fail, ret = -4
[root@057e65f0a884 curve-ansible]#

[root@057e65f0a884 curve-ansible]# ps -aux|grep chunkserver
root 6893 55.9 4.3 745128 88044 ? Sl 22:56 11:06 curve-chunkserver -bthread_concurrency=18 -raft_max_segment_size=8388608 -raft_max_install_snapshot_tasks_num=1 -raft_sync=true -conf=/etc/curve/chunkserver.conf -enableChunkfilepool=false -chunkFilePoolDir=./data/chunkserver0 -chunkFilePoolMetaPath=./data/chunkserver0/chunkfilepool.meta -chunkServerIp=127.0.0.1 -chunkServerPort=8200 -chunkServerMetaUri=local://./data/chunkserver0/chunkserver.dat -chunkServerStoreUri=local://./data/chunkserver0/ -copySetUri=local://./data/chunkserver0/copysets -raftSnapshotUri=curve://./data/chunkserver0/copysets -recycleUri=local://./data/chunkserver0/recycler -graceful_quit_on_sigterm=true -raft_sync_meta=true -raft_sync_segments=true -graceful_quit_on_sigterm=true -log_dir=./data/log/chunkserver0
root 6899 35.2 4.1 728712 84560 ? Sl 22:56 6:59 curve-chunkserver -bthread_concurrency=18 -raft_max_segment_size=8388608 -raft_max_install_snapshot_tasks_num=1 -raft_sync=true -conf=/etc/curve/chunkserver.conf -enableChunkfilepool=false -chunkFilePoolDir=./data/chunkserver1 -chunkFilePoolMetaPath=./data/chunkserver1/chunkfilepool.meta -chunkServerIp=127.0.0.1 -chunkServerPort=8201 -chunkServerMetaUri=local://./data/chunkserver1/chunkserver.dat -chunkServerStoreUri=local://./data/chunkserver1/ -copySetUri=local://./data/chunkserver1/copysets -raftSnapshotUri=curve://./data/chunkserver1/copysets -recycleUri=local://./data/chunkserver1/recycler -graceful_quit_on_sigterm=true -raft_sync_meta=true -raft_sync_segments=true -graceful_quit_on_sigterm=true -log_dir=./data/log/chunkserver1
root 6987 27.4 4.1 728696 85064 ? Sl 22:56 5:26 curve-chunkserver -bthread_concurrency=18 -raft_max_segment_size=8388608 -raft_max_install_snapshot_tasks_num=1 -raft_sync=true -conf=/etc/curve/chunkserver.conf -enableChunkfilepool=false -chunkFilePoolDir=./data/chunkserver2 -chunkFilePoolMetaPath=./data/chunkserver2/chunkfilepool.meta -chunkServerIp=127.0.0.1 -chunkServerPort=8202 -chunkServerMetaUri=local://./data/chunkserver2/chunkserver.dat -chunkServerStoreUri=local://./data/chunkserver2/ -copySetUri=local://./data/chunkserver2/copysets -raftSnapshotUri=curve://./data/chunkserver2/copysets -recycleUri=local://./data/chunkserver2/recycler -graceful_quit_on_sigterm=true -raft_sync_meta=true -raft_sync_segments=true -graceful_quit_on_sigterm=true -log_dir=./data/log/chunkserver2
root 12363 0.0 0.0 9180 1036 pts/0 R+ 23:16 0:00 grep --color=auto chunkserver

UUID is missing in fstab when using a disk partition as chunkserver

Describe the bug
If we use a disk partition as a chunkserver, for example:

xxx@curve-chunk-node2:~$ lsblk | grep chunk
├─sdu2 65:66 0 1.1T 0 part /data/chunkserver11

then after deployment the UUID is missing in /etc/fstab:
#curvefs
UUID= /data/chunkserver11 ext4 rw,errors=remount-ro 0 0

(screenshots attached)

./curve-chunkserver/home/nbs/chunkserver_deploy.sh: (screenshot attached)
This is because we don't handle the tree-drawing prefix of "├─sdu2" in the lsblk output.
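The fix amounts to stripping lsblk's tree-drawing characters before using the device name to look up its UUID. A sketch (Python for brevity; chunkserver_deploy.sh itself is shell, and the function name is illustrative):

```python
def strip_lsblk_prefix(name):
    """Drop the tree-drawing prefix lsblk prints for child devices,
    e.g. '├─sdu2' -> 'sdu2', '└─sda1' -> 'sda1'."""
    return name.lstrip("├└│─` -")

assert strip_lsblk_prefix("├─sdu2") == "sdu2"
assert strip_lsblk_prefix("└─sda1") == "sda1"
assert strip_lsblk_prefix("sdb") == "sdb"  # plain disks are untouched
```

With the clean name, something like `blkid -s UUID -o value /dev/sdu2` then yields the UUID to write into fstab.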

To Reproduce
see above.

Expected behavior
Support using a disk partition as a chunkserver, with the correct UUID written to fstab.

Versions
OS:
Compiler:
curve-mds:
curve-chunkserver:
curve-snapshotcloneserver:
curve-sdk:
nebd:
curve-nbd:

Additional context/screenshots

[C-Plan] Build and deployment

Part 1: Build

Based on the curvebuild image, pull the docker image and start it:

docker pull opencurve/curvebuild:centos8
docker run -it opencurve/curvebuild:centos8 /bin/bash

Hit a no-network problem, so add --net=host:

docker run --net=host -it opencurve/curvebuild:centos8 /bin/bash

The build and packaging went smoothly, with no problems: (screenshots attached)

A virtual machine on a personal PC was used; building and packaging took about one hour.

As shown, the tar files were packaged successfully: (screenshot attached)

Part 2: Standalone deployment

Based on the curveintegration image.

Pull and start the image:

docker run --cap-add=ALL -v /dev:/dev -v /lib/modules:/lib/modules --privileged -it opencurve/curveintegration:centos8 /bin/bash

No network inside the container, so add --net=host and docker run again:

docker run --net=host --cap-add=ALL -v /dev:/dev -v /lib/modules:/lib/modules --privileged -it opencurve/curveintegration:centos8 /bin/bash

Fetch the tar files and extract them.

Follow the documented steps.

Run the standalone deployment.

Deploy the cluster and start the services:

ansible-playbook -i server.ini deploy_curve.yml

Check the current cluster status:

curve_ops_tool status

(screenshot: chunkserver status)

ansible-playbook -i client.ini deploy_nebd.yml
ansible-playbook -i client.ini deploy_nbd.yml
ansible-playbook -i client.ini deploy_curve_sdk.yml

The first few deployment attempts kept failing while installing the NBD package; the cause was insufficient privileges the first time the nbd module was loaded. Thanks to @taohansi for the earlier workaround.

Verifying the standalone deployment

(screenshot: creating a curve volume)

The nbd0 volume is visible.

How does the client support hot upgrade?

How is "the client also supports hot upgrade, so the underlying version can be changed without users noticing" implemented? Is the client split into two processes (a lightweight, interface-only light client, and a core client holding the actual logic)?
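The two-process split the question hypothesizes can be sketched as a thin stub that only forwards requests over a unix socket to a long-lived core process; the core can then be replaced without the application relinking. This is purely an illustration of that hypothesized design (whether Curve's client actually works this way is exactly what the question asks):

```python
import os
import socket
import tempfile
import threading

def run_core(sock_path, ready):
    """The 'core client': owns the real logic, upgradable independently."""
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(sock_path)
    srv.listen(1)
    ready.set()
    conn, _ = srv.accept()
    req = conn.recv(1024)
    conn.sendall(b"v2:" + req)  # the "logic" lives on this side
    conn.close()
    srv.close()

def light_call(sock_path, payload):
    """The 'light client': interface only, just forwards to the core."""
    c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    c.connect(sock_path)
    c.sendall(payload)
    resp = c.recv(1024)
    c.close()
    return resp

path = os.path.join(tempfile.mkdtemp(), "core.sock")
ready = threading.Event()
t = threading.Thread(target=run_core, args=(path, ready))
t.start()
ready.wait()
assert light_call(path, b"read") == b"v2:read"
t.join()
```

Upgrading then means restarting only the core process; the stub reconnects, and the application using it never changes.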

Multi-machine deployment fails

Describe the bug
Deployed on Debian 9.
During installation, the required libraries are checked. One check is for the package podlators-perl, but that package has been replaced by perl-modules-5.24, so both the check and the installation fail; roles/prepare_software_env/tasks/main.yml has to be edited manually.

To Reproduce

Expected behavior

Versions
OS: Debian
Compiler: gcc-6.3.0 g++
curve-mds:
curve-chunkserver:
curve-snapshotcloneserver:
curve-sdk:
nebd:
curve-nbd:

Additional context/screenshots
(screenshot attached)

[C-Plan topic 1: clean up TODOs in the code] Wrap the initialization of raft node options into a function

Describe the task you choose
Wrap the initialization of raft node options into a function:
// TODO(wudemiao): move this into nodeOptions' init, at src/chunkserver/copyset_node.cpp

Describe alternatives you've considered (optional)
I simply wrapped the code; I am not sure whether this is the right approach.

Additional context/screenshots

Running mk-tar.sh fails

Describe the bug

Running bash mk-tar.sh to build and package, step 7 (packaging the Python wheel) fails while packaging for py2. My environment has both Python 2 and Python 3 installed.

To Reproduce

Run bash mk-tar.sh on a machine that has both py2 and py3.

Expected behavior

Both the py2 and py3 packages should be produced, but in practice only the py3 one is built; the py2 one errors out.

Versions
OS: Ubuntu 18.04
Compiler: gcc 7.4.0 bazel 0.17.2
curve-mds: master
curve-chunkserver: master
curve-snapshotcloneserver: master
curve-sdk: master
nebd: master
curve-nbd: master

Additional context/screenshots

During the Python package build, the tmplib directory is recreated (https://github.com/opencurve/curve/blob/master/curvefs_python/configure.sh#L62-L73),
and the .so files are copied from bazel-bin into tmplib.

The first pass, packaging py3, works fine:
https://github.com/opencurve/curve/blob/master/mk-tar.sh#L90

But after the py3 package is built, bazel-bin has changed, so when py2 is packaged next the .so files it needs are no longer in bazel-bin, and the py2 packaging fails.

Suggestions

  1. Back up the .so files generated in bazel-bin before packaging. Patch: #216

  2. Alternatively, since py2 has officially reached end of life, perhaps curve should support py3 only?
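Suggestion 1 amounts to snapshotting the .so artifacts before the first wheel build invalidates bazel-bin. A sketch of that backup step (Python for brevity; the function name and the stand-in directory are illustrative, not what the patch actually does):

```python
import pathlib
import shutil
import tempfile

def backup_shared_objects(src_dir, backup_dir):
    """Copy every .so out of src_dir so it can be restored after a later
    bazel build (for a different Python) has rebuilt bazel-bin."""
    src, dst = pathlib.Path(src_dir), pathlib.Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for so in sorted(src.rglob("*.so")):
        shutil.copy2(so, dst / so.name)
        copied.append(so.name)
    return copied

# Demo with a stand-in for bazel-bin:
src = pathlib.Path(tempfile.mkdtemp())
(src / "curvefs.so").write_bytes(b"\x7fELF")
assert backup_shared_objects(src, tempfile.mkdtemp()) == ["curvefs.so"]
```

The py2 wheel build would then copy from the backup directory instead of the by-then-rebuilt bazel-bin.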

Standalone deployment: nbd reads and writes hang

Describe the bug
After a standalone deployment, nbd reads and writes hang.

To Reproduce
Standalone deployment on CentOS 8.2; after mapping an nbd device, read and write /dev/nbd0.

Expected behavior
Reads and writes work normally.

Versions
OS: CentOS 8.2 x86_64
Compiler:
curve-mds: 1.1.0-beta+5d648c9ec
curve-chunkserver: 1.1.0-beta+5d648c9ec
curve-snapshotcloneserver:
curve-sdk: 1.1.0-beta+5d648c9ec
nebd: 1.1.0-beta+5d648c9ec
curve-nbd: 1.1.0-beta+5d648c9ec

Additional context/screenshots
W 2020-10-26T16:46:02.317500+0800 2468 replicator.cpp:299] Group 4294967317 fail to issue RPC to 10.202.91.10:8202:0 _consecutive_error_times=1, [E1008]Reached timeout=500ms @10.202.91.10:8202
W 2020-10-26T16:46:02.317519+0800 2468 replicator.cpp:299] Group 4294967354 fail to issue RPC to 10.202.91.10:8202:0 _consecutive_error_times=1, [E1008]Reached timeout=500ms @10.202.91.10:8202
W 2020-10-26T16:46:02.317534+0800 2468 replicator.cpp:299] Group 4294967319 fail to issue RPC to 10.202.91.10:8202:0 _consecutive_error_times=1, [E1008]Reached timeout=500ms @10.202.91.10:8202
W 2020-10-26T16:46:02.317579+0800 2479 replicator.cpp:299] Group 4294967394 fail to issue RPC to 10.202.91.10:8202:0 _consecutive_error_times=1, [E1008]Reached timeout=500ms @10.202.91.10:8202
W 2020-10-26T16:46:02.317595+0800 2479 replicator.cpp:299] Group 4294967303 fail to issue RPC to 10.202.91.10:8202:0 _consecutive_error_times=1, [E1008]Reached timeout=500ms @10.202.91.10:8202
W 2020-10-26T16:46:02.317729+0800 2476 replicator.cpp:299] Group 4294967352 fail to issue RPC to 10.202.91.10:8202:0 _consecutive_error_times=1, [E1008]Reached timeout=500ms @10.202.91.10:8202
W 2020-10-26T16:46:02.317749+0800 2476 replicator.cpp:299] Group 4294967332 fail to issue RPC to 10.202.91.10:8202:0 _consecutive_error_times=1, [E1008]Reached timeout=500ms @10.202.91.10:8202
W 2020-10-26T16:46:02.317764+0800 2476 replicator.cpp:299] Group 4294967305 fail to issue RPC to 10.202.91.10:8202:0 _consecutive_error_times=1, [E1008]Reached timeout=500ms @10.202.91.10:8202
W 2020-10-26T16:46:02.317780+0800 2476 replicator.cpp:299] Group 4294967302 fail to issue RPC to 10.202.91.10:8202:0 _consecutive_error_times=1, [E1008]Reached timeout=500ms @10.202.91.10:8202
W 2020-10-26T16:46:02.348995+0800 2475 node.cpp:1244] node 4294967358:10.202.91.10:8200:0 received invalid RequestVoteResponse from 10.202.91.10:8201:0 state not in CANDIDATE but LEADER
W 2020-10-26T16:46:02.349308+0800 2464 node.cpp:1316] node 4294967344:10.202.91.10:8200:0 received invalid PreVoteResponse from 10.202.91.10:8202:0 state not in STATE_FOLLOWER but CANDIDATE
W 2020-10-26T16:46:02.563308+0800 2468 node.cpp:1292] node 4294967381:10.202.91.10:8200:0 received RequestVoteResponse from 10.202.91.10:8202:0 error: [E1008]Reached timeout=1000ms @10.202.91.10:8202
W 2020-10-26T16:46:02.563849+0800 2479 node.cpp:1292] node 4294967373:10.202.91.10:8200:0 received RequestVoteResponse from 10.202.91.10:8202:0 error: [E1008]Reached timeout=1000ms @10.202.91.10:8202
W 2020-10-26T16:46:02.564920+0800 2466 node.cpp:1292] node 4294967348:10.202.91.10:8200:0 received RequestVoteResponse from 10.202.91.10:8202:0 error: [E1008]Reached timeout=1000ms @10.202.91.10:8202
W 2020-10-26T16:46:02.628770+0800 2469 node.cpp:1222] node 4294967316:10.202.91.10:8200:0 term 362 steps down when reaching vote timeout: fail to get quorum vote-granted
W 2020-10-26T16:46:02.670714+0800 2469 node.cpp:1316] node 4294967306:10.202.91.10:8200:0 received invalid PreVoteResponse from 10.202.91.10:8202:0 state not in STATE_FOLLOWER but CANDIDATE
W 2020-10-26T16:46:02.702837+0800 2465 node.cpp:2065] node 4294967351:10.202.91.10:8200:0 ignore stale AppendEntries from 10.202.91.10:8201:0 in term 353 current_term 354
W 2020-10-26T16:46:02.703104+0800 2471 node.cpp:1316] node 4294967321:10.202.91.10:8200:0 received invalid PreVoteResponse from 10.202.91.10:8202:0 state not in STATE_FOLLOWER but CANDIDATE
W 2020-10-26T16:46:02.714792+0800 2479 node.cpp:1292] node 4294967338:10.202.91.10:8200:0 received RequestVoteResponse from 10.202.91.10:8202:0 error: [E1008]Reached timeout=1000ms @10.202.91.10:8202
W 2020-10-26T16:46:02.714850+0800 2479 node.cpp:1292] node 4294967324:10.202.91.10:8200:0 received RequestVoteResponse from 10.202.91.10:8202:0 error: [E1008]Reached timeout=1000ms @10.202.91.10:8202
W 2020-10-26T16:46:02.810375+0800 2473 node.cpp:1316] node 4294967316:10.202.91.10:8200:0 received invalid PreVoteResponse from 10.202.91.10:8202:0 state not in STATE_FOLLOWER but CANDIDATE
W 2020-10-26T16:46:02.811153+0800 2463 node.cpp:1244] node 4294967306:10.202.91.10:8200:0 received invalid RequestVoteResponse from 10.202.91.10:8202:0 state not in CANDIDATE but LEADER
W 2020-10-26T16:46:02.816112+0800 2479 node.cpp:1292] node 4294967344:10.202.91.10:8200:0 received RequestVoteResponse from 10.202.91.10:8202:0 error: [E1008]Reached timeout=1000ms @10.202.91.10:8202
W 2020-10-26T16:46:02.816541+0800 2468 node.cpp:1292] node 4294967353:10.202.91.10:8200:0 received RequestVoteResponse from 10.202.91.10:8202:0 error: [E1008]Reached timeout=1000ms @10.202.91.10:8202
W 2020-10-26T16:46:02.844225+0800 2468 node.cpp:1222] node 4294967326:10.202.91.10:8200:0 term 360 steps down when reaching vote timeout: fail to get quorum vote-granted
W 2020-10-26T16:46:02.863354+0800 2468 node.cpp:1244] node 4294967321:10.202.91.10:8200:0 received invalid RequestVoteResponse from 10.202.91.10:8202:0 state not in CANDIDATE but LEADER
W 2020-10-26T16:46:03.010736+0800 2478 node.cpp:1244] node 4294967316:10.202.91.10:8200:0 received invalid RequestVoteResponse from 10.202.91.10:8202:0 state not in CANDIDATE but LEADER
W 2020-10-26T16:46:03.028254+0800 2473 node.cpp:1316] node 4294967326:10.202.91.10:8200:0 received invalid PreVoteResponse from 10.202.91.10:8202:0 state not in STATE_FOLLOWER but CANDIDATE
W 2020-10-26T16:46:03.127198+0800 2476 node.cpp:1244] node 4294967326:10.202.91.10:8200:0 received invalid RequestVoteResponse from 10.202.91.10:8201:0 state not in CANDIDATE but LEADER
W 2020-10-26T16:49:58.732772+0800 2469 baidu_rpc_protocol.cpp:255] Fail to write into Socket{id=85899347749 fd=191 addr=10.202.91.10:49188:8200} (0x7f02bfd83f40): Unknown error 1014 [1014]
W 2020-10-26T16:51:19.010679+0800 2462 baidu_rpc_protocol.cpp:255] Fail to write into Socket{id=51539609032 fd=191 addr=10.202.91.10:49506:8200} (0x7f02c07a58c0): Unknown error 1014 [1014]
W 2020-10-26T16:51:39.987674+0800 2465 baidu_rpc_protocol.cpp:255] Fail to write into Socket{id=60129543972 fd=191 addr=10.202.91.10:49610:8200} (0x7f02bfd83d00): Unknown error 1014 [1014]

*** Aborted at 1603702159 (unix time) try "date -d @1603702159" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGILL (@0x559f3295db4e) received by PID 3261 (TID 0x7ff2d1c95700) from PID 848681806; stack trace: ***
@ 0x7ff2df048dd0 (unknown)
@ 0x559f3295db4e curve::client::RequestSender::ReadChunk()
@ 0x559f329573dd ZNSt17_Function_handlerIFvPN6google8protobuf7ClosureESt10shared_ptrIN5curve6client13RequestSenderEEEZNS6_13CopysetClient9ReadChunkERKNS6_11ChunkIDInfoEmlmmRKNS6_17RequestSourceInfoES3_EUlS3_S8_E_E9_M_invokeERKSt9_Any_dataOS3_OS8
@ 0x559f32957faa curve::client::CopysetClient::DoRPCTask()
@ 0x559f329583b0 curve::client::CopysetClient::ReadChunk()
@ 0x559f32955a34 curve::client::RequestScheduler::ProcessOne()
@ 0x559f32955db2 curve::client::RequestScheduler::Process()
@ 0x7ff2ddc46b73 (unknown)
@ 0x7ff2df03e2de start_thread
@ 0x7ff2dd6a5e83 __GI___clone
@ 0x0 (unknown)
*** Aborted at 1603702162 (unix time) try "date -d @1603702162" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGILL (@0x558257b1cb4e) received by PID 3369 (TID 0x7f353cd42700) from PID 1471269710; stack trace: ***
@ 0x7f3549d63dd0 (unknown)
@ 0x558257b1cb4e curve::client::RequestSender::ReadChunk()
@ 0x558257b163dd ZNSt17_Function_handlerIFvPN6google8protobuf7ClosureESt10shared_ptrIN5curve6client13RequestSenderEEEZNS6_13CopysetClient9ReadChunkERKNS6_11ChunkIDInfoEmlmmRKNS6_17RequestSourceInfoES3_EUlS3_S8_E_E9_M_invokeERKSt9_Any_dataOS3_OS8
@ 0x558257b16faa curve::client::CopysetClient::DoRPCTask()
@ 0x558257b173b0 curve::client::CopysetClient::ReadChunk()
@ 0x558257b14a34 curve::client::RequestScheduler::ProcessOne()
@ 0x558257b14db2 curve::client::RequestScheduler::Process()
@ 0x7f3548961b73 (unknown)
@ 0x7f3549d592de start_thread
@ 0x7f35483c0e83 __GI___clone
@ 0x0 (unknown)
*** Aborted at 1603702163 (unix time) try "date -d @1603702163" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGILL (@0x5615cb349b4e) received by PID 3395 (TID 0x7fa7e6df5700) from PID 18446744072823806798; stack trace: ***
@ 0x7fa7f35a0dd0 (unknown)
@ 0x5615cb349b4e curve::client::RequestSender::ReadChunk()
@ 0x5615cb3433dd ZNSt17_Function_handlerIFvPN6google8protobuf7ClosureESt10shared_ptrIN5curve6client13RequestSenderEEEZNS6_13CopysetClient9ReadChunkERKNS6_11ChunkIDInfoEmlmmRKNS6_17RequestSourceInfoES3_EUlS3_S8_E_E9_M_invokeERKSt9_Any_dataOS3_OS8
@ 0x5615cb343faa curve::client::CopysetClient::DoRPCTask()
@ 0x5615cb3443b0 curve::client::CopysetClient::ReadChunk()
@ 0x5615cb341a34 curve::client::RequestScheduler::ProcessOne()
@ 0x5615cb341db2 curve::client::RequestScheduler::Process()
@ 0x7fa7f219eb73 (unknown)
@ 0x7fa7f35962de start_thread
@ 0x7fa7f1bfde83 __GI___clone
@ 0x0 (unknown)

Full logs are attached:
nbd-hung-logs.tar.gz

Part 2: retry logic

General Question

void ClientClosure::OnRetry() {
    MetricHelper::IncremFailRPCCount(fileMetric_, reqCtx_->optype_);
    // ------- condition 1: chunkserverOPMaxRetry = 3
    if (reqDone_->GetRetriedTimes() >= failReqOpt_.chunkserverOPMaxRetry) {
        reqDone_->SetFailed(status_);
        LOG(ERROR) << OpTypeToString(reqCtx_->optype_)
                   << " retried times exceeds"
                   << ", IO id = " << reqDone_->GetIOTracker()->GetID()
                   << ", request id = " << reqCtx_->id_;
        done_->Run();
        return;
    }
    // ------- condition 2: chunkserverMaxRetryTimesBeforeConsiderSuspend = 20
    // ------- My understanding: if reqDone_->GetRetriedTimes() >= failReqOpt_.chunkserverMaxRetryTimesBeforeConsiderSuspend holds,
    // ------- then condition 1 above must already have triggered
    if (!reqDone_->IsSuspendRPC() && reqDone_->GetRetriedTimes() >=
        failReqOpt_.chunkserverMaxRetryTimesBeforeConsiderSuspend) {
        reqDone_->SetSuspendRPCFlag();
        MetricHelper::IncremIOSuspendNum(fileMetric_);
        LOG(WARNING) << "IO Retried "
                    << failReqOpt_.chunkserverMaxRetryTimesBeforeConsiderSuspend
                    << " times, set suspend flag! " << *reqCtx_
                    << ", IO id = " << reqDone_->GetIOTracker()->GetID()
                    << ", request id = " << reqCtx_->id_;
    }

    PreProcessBeforeRetry(status_, cntlstatus_);
    SendRetryRequest();
}

If condition 2 holds, condition 1 must already have held, so execution never reaches condition 2. Is my understanding correct?
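The claimed ordering can be checked with a toy simulation of the two thresholds using the values quoted in the comments above (a sketch of the control flow only, not the real client code): with chunkserverOPMaxRetry = 3 the request fails permanently on the 3rd retry, so a suspend threshold of 20 is never reached.

```python
# Toy simulation of the two thresholds in ClientClosure::OnRetry.
# Names mirror the C++ config options; values are the ones quoted above.
CHUNKSERVER_OP_MAX_RETRY = 3
SUSPEND_THRESHOLD = 20  # chunkserverMaxRetryTimesBeforeConsiderSuspend

def on_retry(retried_times, suspended):
    """Return (terminal, suspended) for one pass through OnRetry."""
    if retried_times >= CHUNKSERVER_OP_MAX_RETRY:             # condition 1
        return True, suspended                                # fail, done_->Run()
    if not suspended and retried_times >= SUSPEND_THRESHOLD:  # condition 2
        suspended = True
    return False, suspended

suspended = False
for retries in range(100):
    terminal, suspended = on_retry(retries, suspended)
    if terminal:
        break

# With maxRetry (3) < suspendThreshold (20), the suspend flag is never set:
print(retries, suspended)  # -> 3 False
```

So the observation is right for these particular values: condition 2 is only reachable when chunkserverMaxRetryTimesBeforeConsiderSuspend is smaller than chunkserverOPMaxRetry.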

Is there an inconsistency in raft snapshots?

General Question

In the chunkserver raftsnapshot module, the on_snapshot_save method in copyset_node.cpp only lists all files of type CHUNKFILE and saves their names in the BRAFT_SNAPSHOT_META_FILE. During a later install_snapshot, the follower downloads this meta file, the chunkfiles under the data directory, and the snapshot files under the data directory from the leader. A snapshot file can be considered read-only, but a chunkfile is not: while the snapshot is being installed the chunkfile may change, which could leave the follower's data inconsistent with the leader's.

Is my understanding wrong, or is there some other mechanism that guarantees consistency here? Thanks for clarifying.
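For intuition, here is a minimal, hypothetical illustration (deliberately unrelated to curve's actual code) of the race the question describes: copying a file block by block while a writer is still mutating it can produce a copy that matches neither the old nor the new state.

```python
# Hypothetical torn-copy illustration: a "chunk" of 4 blocks is copied
# block by block while a writer concurrently rewrites every block from
# version "A" to version "B". The interleaving is made deterministic here.
chunk = ["A0", "A1", "A2", "A3"]

copy = []
for i in range(len(chunk)):
    copy.append(chunk[i])                  # copier reads block i
    if i == 1:                             # writer sneaks in mid-copy
        chunk[:] = ["B0", "B1", "B2", "B3"]

print(copy)  # -> ['A0', 'A1', 'B2', 'B3']: neither all-A nor all-B
```

Whether this matters in practice depends on what happens after the transfer, e.g. whether replaying the raft log from the snapshot's included index re-applies the writes that raced with the copy.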

Step 3 of standalone deployment fails

General Question

Following the standalone deployment guide: https://github.com/opencurve/curve/blob/master/docs/cn/deploy.md

Step 3, ansible-playbook -i server.ini deploy_curve.yml, fails with:
TASK [install_package : determine if etcd exists] **********************************************************************************************************************************************************
fatal: [localhost]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 127.0.0.1 port 1046: Connection refused", "unreachable": true}

Didn't start curve-mds by daemon

fatal: [mds3]: FAILED! => {"changed": true, "cmd": "sudo ./mds-daemon.sh start", "delta": "0:00:03.162596", "end": "2021-01-14 15:07:15.075607", "msg": "non-zero return code", "rc": 1, "start": "2021-01-14 15:07:11.913011", "stderr": "", "stderr_lines": [], "stdout": "subnet: 172.22.12.0/24\nport: 6666\nDidn't start curve-mds by daemon", "stdout_lines": ["subnet: 172.22.12.0/24", "port: 6666", "Didn't start curve-mds by daemon"]}
Deploying the cluster via ansible, I hit this problem on both 1.0.1-rc0 and 1.1.0-beta.
It occurs at TASK [start_service : start by daemon]: the first run of the cluster-startup playbook hits this error and the script aborts, while a second run completes successfully.

Installation failed

General Question

During installation, creating files is repeatedly denied (permission denied).

【C-Plan】Build and deploy on Ubuntu 18.04

Describe the task you choose

Build and deploy curve on Ubuntu 18.04.

Building curve

Build environment: 4-core / 8 GB RAM, Ubuntu 18.04

First install the required library dependencies. The Dockerfile builds from a centos base image and installs a series of packages; on Ubuntu, the corresponding pre-built packages to install (apt-get install) are:

libssl-dev uuid-dev libfiu-dev libcurl4-openssl-dev zlib1g-dev libnl-3-dev libboost-dev libunwind8-dev libnl-genl-3-dev python-pip

Next, install bazel:

wget https://curve-build.nos-eastchina1.126.net/bazel-0.17.2-installer-linux-x86_64.sh
bash bazel-0.17.2-installer-linux-x86_64.sh

Installing the dependencies above and bazel both require sudo or the root user. After that, just build and package.

(screenshots)

Suggestion 1: do not enable go mod, and ideally do not have golang installed in the environment. The build downloads golang 1.12.8 and fetches all Go dependencies of etcd v3.4.0 locally as source code, compiling them from source by default. Because etcd once switched its import path between go.etcd.io/etcd and github.com/coreos/etcd, and because of grpc version conflicts, dependencies fetched via go mod will break the build.

Suggestion 2: on a second build you can comment out the make clean and make all at https://github.com/opencurve/curve/blob/master/build.sh#L94; once libetcdclient.so has been built it does not need to be rebuilt.

Run bash build.sh to start the build. Since no version was specified, packaging produces the following 4 files:

curve_9.9.9+5f9b5f68.tar.gz
curve-monitor_9.9.9+5f9b5f68.tar.gz
nbd_9.9.9+5f9b5f68.tar.gz
nebd_9.9.9+5f9b5f68.tar.gz

Deploying curve

On a single machine, deploy by following the deployment documentation.

(screenshot)

Suggestion 1: curve uses perf, but the Ubuntu repositories do not ship a package named linux-perf. On the stock Ubuntu 18.04 kernel, install:

sudo apt-get install linux-tools-common

After installation, running perf prints the expected usage output, confirming it works.
Then comment out the perf-install step at https://github.com/opencurve/curve/blob/master/curve-ansible/roles/prepare_software_env/tasks/main.yml#L48.

Suggestion 2: deployment permissions. For now I deploy directly as root; after creating a curve user, configuration directories such as /etc/curve and operations such as running nbd all hit permission problems. Presumably ansible still needs root privileges when running the predefined playbooks.

Suggestion 3: Python packages download slowly over the network, so consider adding a domestic pip mirror; etcd can also be downloaded manually to /tmp/etcd-v3.4.0.tar.gz, with the corresponding ansible configuration file adjusted.

【C-Plan】Build and deploy

First experience building and deploying

This issue is for the mandatory C-Plan task, covering deployment and build.
Problems encountered are described first, followed by screenshots of the successful result.

Mounting and building with opencurve/curveintegration:centos8

Deployment succeeded

(screenshots)

Issues encountered:

1)
When mapping, it reported that the image could not be opened, and the nebd-server process kept consuming 25% CPU (exactly 1 of 4 cores). From the logs, there was no nebd directory under /data; why it was missing is unclear. Running stop then start from the build directory did not fix it; deleting the nebd logs and then running deploy_nebd.yml allowed it to start.

I 2021-02-02T12:22:54.056986+0000 2000722 source_reader.cpp:59] SourceReader fdCloseThread run successfully
I 2021-02-02T12:22:54.056990+0000 2000722 nebd_server.cpp:60] NebdServer init curveRequestExecutor ok
W 2021-02-02T12:22:54.056999+0000 2000722 metafile_manager.cpp:142] File not exist: /data/nebd/nebdserver.meta
I 2021-02-02T12:22:54.057001+0000 2000722 metafile_manager.cpp:48] Init metafilemanager success.
I 2021-02-02T12:22:54.057020+0000 2000722 file_manager.cpp:84] Load file record finished.
I 2021-02-02T12:22:54.057021+0000 2000722 nebd_server.cpp:67] NebdServer init fileManager ok
I 2021-02-02T12:22:54.057047+0000 2000722 heartbeat_manager.cpp:42] Run heartbeat manager success.
I 2021-02-02T12:22:54.057049+0000 2000722 nebd_server.cpp:74] NebdServer init heartbeatManager ok
I 2021-02-02T12:22:54.057051+0000 2000722 nebd_server.cpp:76] NebdServer init ok
I 2021-02-02T12:22:54.057052+0000 2000722 nebd_server.cpp:78] nebd version: 9.9.9+984a60e7
E 2021-02-02T12:22:54.057240+0000 2000722 file_lock.cpp:38] open file failed, error = No such file or directory, filename = /data/nebd/nebd.sock.lock
E 2021-02-02T12:22:54.057245+0000 2000722 nebd_server.cpp:235] Address already in use
I 2021-02-02T12:22:54.057246+0000 2000722 file_manager.cpp:57] Stop file manager success.
I 2021-02-02T12:22:54.057404+0000 2000722 heartbeat_manager.cpp:48] Stopping heartbeat manager...
I 2021-02-02T12:22:54.057443+0000 2000722 heartbeat_manager.cpp:52] Stop heartbeat manager success.

2) curve_ops_tool delete -fileName=/test1 -userName=curve -forcedelete=true cannot delete the volume.
Workaround: drop -forcedelete=true, then go into the recycle bin and delete it there. Why does deleting files under recyclebin fail with userName curve but succeed with root?

【C-Plan】First experience building and deploying

First experience building and deploying

This issue is for the mandatory C-Plan task, covering deployment and build.
Problems encountered are described first, followed by screenshots of success.

Problems

1.
Running ansible-playbook -i client.ini deploy_nbd.yml hit a permission problem:
(screenshot)
Adding sudo made it succeed:

sudo ansible-playbook -i client.ini deploy_nbd.yml

Successful workflow

Note that curve provides two docker images, for development and deployment respectively; pull both:

docker pull opencurve/curvebuild:centos8
docker pull opencurve/curveintegration:centos8

Following the documentation, create the corresponding containers and run the documented steps.

Build and packaging succeeded:
(screenshots)

Deployment succeeded
Cluster status meets the documented requirements:
(screenshot)

Creation info of /test:
(screenshot)

An nbd0 volume was observed to have been added:
(screenshot)

【C-Plan task 3】Translate code comments & fix typos

Describe the task you choose

  • translate comments in these files:
    • src/client/libcurve_file.h
    • src/client/libcurve_file.cpp
    • include/client/libcurve.h

Describe alternatives you've considered (optional)

Additional context/screenshots

  • fix a typo (SNAPSTHO_FROZEN => SNAPSHOT_FROZEN) in these files:
    • curvefs_python/curvefs_tool.py
    • src/client/mds_client.cpp
    • include/client/libcurve.h

Standalone deployment fails with "could not find or access ../curve-mds/bin/"

General Question

Following the official documentation, I used ansible-playbook to deploy a single-node curve environment. At the install mds bin stage it reports could not find or access ../curve-mds/bin/, and deployment fails.
(screenshot)

Reading the script, the install mds bin stage copies the files under ../curve-mds/bin/ into /usr/bin, but there is no bin directory under ../curve-mds, only the two folders DEBIAN and home. So it appears the curve-mds folder is missing the mds component, and the script errors out because it cannot find the source files. But reading through the code I did not find any command that downloads or builds mds, so I would like to ask how to resolve this.

OS: CentOS-8.2.2004
Kernel: 4.18.0-193.el8.x86_64
openssl: 1.1.1g FIPS 21 Apr 2020
gcc: 8.4.1 20200928 (Red Hat 8.4.1-1)
curve: curve-1.0.4 (1.2.1-rc0 and 1.3.0-beta2 have the same problem)

Test case snapshot_server_concurrent_itest fails

Describe the bug
After curve builds successfully, running the test command (./bazel-bin/test/integration/snapshotcloneserver/snapshot_server_concurrent_itest) first hangs as in Figure 1 and finally fails as in Figure 2.

To Reproduce
1) Download: git clone https://github.com/opencurve/curve.git (master branch; last commit as in Figure 3)
2) Build the curve source
3) Start fakes3 (fakes3 -r /S3_DATA_DIR -p 9999 --license YOUR_LICENSE_KEY)
4) Run ./bazel-bin/test/integration/snapshotcloneserver/snapshot_server_concurrent_itest
Note: the final fakes3 interaction is shown in Figure 4.

Expected behavior
Fix this bug.

Versions
OS: centos8.1
Compiler: gcc version 8.3.1
fakes3: FakeS3 2.0.0

Additional context/screenshots
Figure 1: (screenshot)
Figure 2: (screenshot)
Figure 3: (screenshot)
Figure 4: (screenshot)

【C-Plan】Docker-based build and deployment

I. Build

  • Built inside docker on an Ubuntu system in a VM

  • 1. Pull the curvebuild and curveintegration images from the registry, used for building and deployment respectively:

    docker pull opencurve/curvebuild:centos8
    docker pull opencurve/curveintegration:centos8
    

(screenshot)

  • 2. Run the curvebuild image and enter the curve directory:

    docker run -it opencurve/curvebuild:centos8 /bin/bash

  • 3. Run the build command.

  • 4. After roughly an hour of waiting, the result:

(screenshot)

II. Deploy

  • Run the curveintegration image with the --net=host option so it can access the Internet:

    docker run --net=host --cap-add=ALL -v /dev:/dev -v /lib/modules:/lib/modules --privileged -it opencurve/curveintegration:centos8 /bin/bash
    
  • Fetch the tarballs and extract them:

    wget https://github.com/opencurve/curve/releases/download/v{version}/curve_{version}.tar.gz
    wget https://github.com/opencurve/curve/releases/download/v{version}/nbd_{version}.tar.gz
    wget https://github.com/opencurve/curve/releases/download/v{version}/nebd_{version}.tar.gz
    tar zxvf curve_{version}.tar.gz
    tar zxvf nbd_{version}.tar.gz
    tar zxvf nebd_{version}.tar.gz
    

    (Deployment itself went smoothly; the only issue was that using the latest v1.0.2-beta for {version} left unhealthy_copysets nonzero. Following other participants' deployment notes and switching to v1.0.1-rc0 solved it.)

(screenshot)

  • Cluster startup result:

(screenshot)

  • Install the Nebd service and the NBD package:

    ansible-playbook -i client.ini deploy_nebd.yml
    ansible-playbook -i client.ini deploy_nbd.yml
    ansible-playbook -i client.ini deploy_curve_sdk.yml
    
  • Create a CURVE volume; NBD0 becomes visible:

(screenshot)

Deployment under docker fails

General Question

Using the docker image recommended by https://github.com/opencurve/curve/blob/master/docs/cn/deploy.md#%E5%8D%95%E6%9C%BA%E9%83%A8%E7%BD%B2:
docker run --cap-add=ALL -v /dev:/dev -v /lib/modules:/lib/modules --privileged -it opencurve/curveintegration:centos8 /bin/bash

Running ansible finally fails as shown below. Does this mean the kernel here is too old and must be upgraded, e.g. to 4.18.0-193.el8.x86_64?

$ uname -a
Linux ec0a9b5f0083 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux  
$ ansible-playbook -i server.ini deploy_curve.yml  
...
TASK [check kernel version] **************************************************************************************************************************
[DEPRECATION WARNING]: Using tests as filters is deprecated. Instead of using `result|version_compare` instead use `result is version_compare`. This 
feature will be removed in version 2.9. Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
fatal: [localhost]: FAILED! => {
    "assertion": "ansible_kernel|version_compare('3.15', '>=')", 
    "changed": false, 
    "evaluated_to": false
}

NO MORE HOSTS LEFT ***********************************************************************************************************************************
	to retry, use: --limit @/home/curve/curve/curve-ansible/deploy_curve.retry

PLAY RECAP *******************************************************************************************************************************************
localhost                  : ok=2    changed=0    unreachable=0    failed=1   
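The failing assertion compares the kernel release string against 3.15, and a container shares its host's kernel, so the check is really against the host's 3.10.0-957 kernel, not the centos8 userland. A rough, hypothetical re-implementation of the comparison (ansible's version_compare is more general, but behaves the same on these inputs) shows why one release fails and the other passes:

```python
# Hypothetical sketch of the playbook's
# "ansible_kernel|version_compare('3.15', '>=')" kernel check.
def version_at_least(kernel_release, minimum):
    """Compare the leading dotted-numeric part of a kernel release string."""
    numeric = kernel_release.split("-")[0]            # "3.10.0-957..." -> "3.10.0"
    parse = lambda s: tuple(int(x) for x in s.split("."))
    return parse(numeric) >= parse(minimum)

print(version_at_least("3.10.0-957.el7.x86_64", "3.15"))   # -> False
print(version_at_least("4.18.0-193.el8.x86_64", "3.15"))   # -> True
```

In other words, the fix is to upgrade the host kernel (or run on a host whose kernel is already >= 3.15), not to change the container image.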

Three-node deployment fails

General Question

Using https://github.com/opencurve/curve/releases/download/v1.0.0/curve_1.0.0+8b04e0ec.tar.gz:

Single-node deployment on CentOS 8 succeeds, but three-node deployment fails at the deploy etcd step.
Specifically: where is etcd.conf.yml generated and copied to the other hosts? On the two hosts other than the control machine, no generated etcd.conf.yml can be found.

############################## deploy etcd ##############################
- name: prepare etcd
  hosts: etcd
  any_errors_fatal: true
  gather_facts: no
  become: yes
  become_user: "{{ sudo_user }}"
  become_flags: -iu {{ sudo_user }}
  tags:
    - etcd
  roles:
    - { role: install_package, package_name: etcd, install_with_deb: false, tags: install_etcd } // I extracted the etcd download package and copied it to /usr/bin directly
    - { role: generate_config, template_name: etcd.conf.yml, conf_path: "{{ etcd_config_path }}", tags: generage_config } // three nodes: etcd1 (also the control machine) succeeds, etcd2 and etcd3 fail

ansible-playbook -i server.ini deploy_curve.yml fails as follows:

TASK [generate_config : generate configuration file directly] ****************************************************************************************
fatal: [etcd3]: FAILED! => {"changed": false, "checksum": "d0ddc59580bbb243f260acea51c4a870449ecaf8", "msg": "Destination /etc/curve not writable"}
...ignoring
fatal: [etcd1]: FAILED! => {"changed": false, "checksum": "ed3f202012e267b199f35bc0c2d106bfb436a6a7", "msg": "Destination /etc/curve not writable"}
...ignoring
fatal: [etcd2]: FAILED! => {"changed": false, "checksum": "e02c13a758f55ac58ccbf3be5443e10833a72d91", "msg": "Destination /etc/curve not writable"}
...ignoring

TASK [generate_config : generate configuration file at /tmp] *****************************************************************************************
changed: [etcd1]
changed: [etcd2]
changed: [etcd3]

TASK [generate_config : mv config file] **************************************************************************************************************
changed: [etcd1]
fatal: [etcd2]: FAILED! => {"changed": true, "cmd": "sudo mv /tmp/etcd.conf.yml /etc/curve/etcd.conf.yml", "delta": "0:00:00.016609", "end": "2020-11-13 04:07:44.967946", "msg": "non-zero return code", "rc": 1, "start": "2020-11-13 04:07:44.951337", "stderr": "mv: cannot stat '/tmp/etcd.conf.yml': No such file or directory", "stderr_lines": ["mv: cannot stat '/tmp/etcd.conf.yml': No such file or directory"], "stdout": "", "stdout_lines": []}
fatal: [etcd3]: FAILED! => {"changed": true, "cmd": "sudo mv /tmp/etcd.conf.yml /etc/curve/etcd.conf.yml", "delta": "0:00:01.017238", "end": "2020-11-13 04:07:45.976883", "msg": "non-zero return code", "rc": 1, "start": "2020-11-13 04:07:44.959645", "stderr": "mv: cannot stat '/tmp/etcd.conf.yml': No such file or directory", "stderr_lines": ["mv: cannot stat '/tmp/etcd.conf.yml': No such file or directory"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT ***********************************************************************************************************************************
	to retry, use: --limit @/home/curve/curve/curve/curve-ansible/deploy_curve.retry

PLAY RECAP *******************************************************************************************************************************************
etcd1                      : ok=52   changed=14   unreachable=0    failed=0   
etcd2                      : ok=51   changed=13   unreachable=0    failed=1   
etcd3                      : ok=51   changed=13   unreachable=0    failed=1   
localhost                  : ok=35   changed=8    unreachable=0    failed=0   
mds1                       : ok=34   changed=8    unreachable=0    failed=0   
mds2                       : ok=34   changed=8    unreachable=0    failed=0   
mds3                       : ok=34   changed=8    unreachable=0    failed=0   
nginx1                     : ok=34   changed=8    unreachable=0    failed=0   
nginx2                     : ok=34   changed=8    unreachable=0    failed=0   
server1                    : ok=36   changed=8    unreachable=0    failed=0   
server2                    : ok=36   changed=8    unreachable=0    failed=0   
server3                    : ok=36   changed=8    unreachable=0    failed=0   
snap1                      : ok=34   changed=8    unreachable=0    failed=0   
snap2                      : ok=34   changed=8    unreachable=0    failed=0   
snap3                      : ok=34   changed=8    unreachable=0    failed=0   

And on etcd1 (the control machine), which reported success, the generated etcd.conf.yml shows the member name as etcd3???

$ sudo head /etc/curve/etcd.conf.yml 
# This is the configuration file for the etcd server.

# Human-readable name for this member.
name: etcd3      <------------???

# Path to the data directory.
data-dir: /etcd/data

# Path to the dedicated wal directory.
wal-dir: /etcd/wal

【C-Plan】Build and deploy

Test environment:

  • Hardware: ECS cloud server, 2 cores / 4 GB RAM
  • OS: CentOS 8 (built and deployed outside Docker)
  • Kernel: 4.18.0

Deployment
Followed the official tutorial step by step; the following issues came up:

  • GCC not installed. Fix: yum install gcc.
  • Downloading etcd-client timed out repeatedly. Fix: download etcd-client on another machine, upload it to /tmp on this one, and comment out the download step at lines 32-41 of /home/curve/curve/curve-ansible/roles/install_package/tasks/include/install_etcd.yml.
  • ./chunkserver_ctl.sh start all failed. Fix: modify chunkserver_ctl.sh to print, at line 164, the result of the curve-chunkserver invocation (i.e. $LD_PRELOAD); this showed libcurl-gnutls.so.4 was missing, so create a symlink under /usr/lib64: ln -s libcurl.so.4 libcurl-gnutls.so.4.

Verification:
(screenshot)
Write test:
(screenshot)

Build

  1. wget https://github.com/bazelbuild/bazel/releases/download/0.17.2/bazel-0.17.2-installer-linux-x86_64.sh to download and install bazel
  2. yum install git gcc-c++ make zlib zlib-devel openssl openssl-devel to install the dependencies, then git clone https://github.com/albertito/libfiu.git && make && make install (this dependency cannot be installed via yum)
  3. Run ./replace-curve-repo.sh && ./build.sh

Client Python API test

  1. As root, mkdir curve_test && cd curve_test, then copy /usr/curvefs into it: cp -r /usr/curvefs/ .
  2. Write a test case: vim main.py
    (screenshot)
  3. Run it:
    (screenshot)

Client C++ API test

  1. Under the curve directory, mkdir curvefs_cpp && cd curvefs_cpp, and copy curve/nebd/src/part2/BUILD into it
  2. Modify the BUILD file
    (screenshot)
  3. Write a test case (modeled on curvefs_python)
    (screenshot)
  4. Build and run:
    bazel build //curvefs_cpp:curvefs --copt -DHAVE_ZLIB=1 --compilation_mode=dbg -s --define=with_glog=true --define=libunwind=true --copt -DGFLAGS_NS=google --copt -Wno-error=format-security --copt -DUSE_BTHREAD_MUTEX
    bazel-bin/curvefs_cpp/curvefs
    (screenshot)

Can you explain chunkfilepool in more detail?

I saw this sentence in the introduction and am curious how it is done:

The state machine implementation uses a chunkfilepool (when the cluster is initialized, a specified proportion of the space is formatted into chunks), which brings the underlying write amplification down to 0.
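From the chunkserver logs quoted earlier ("get chunk success! now pool size = ...", followed by a move from chunkfilepool/<id> to copysets/.../chunk_<n>), the core idea appears to be pre-formatting chunk files once at init time and handing them out later via rename. A minimal, hypothetical sketch of that pattern (not curve's actual implementation; sizes and layout are invented):

```python
# Hypothetical chunk-file-pool sketch: chunk files are fully written once
# at cluster init, and "allocating" a chunk later is just a rename, so no
# new filesystem blocks are allocated on the write path.
import os
import tempfile

CHUNK_SIZE = 4 * 1024  # toy size; real chunks are much larger

def format_pool(pool_dir, count):
    """Pre-create `count` fully-written chunk files (done once, at init)."""
    os.makedirs(pool_dir, exist_ok=True)
    for i in range(count):
        with open(os.path.join(pool_dir, str(i)), "wb") as f:
            f.write(b"\0" * CHUNK_SIZE)  # blocks are really allocated here

def get_chunk(pool_dir, dst_path):
    """Take one pre-formatted file out of the pool via rename."""
    name = os.listdir(pool_dir)[0]
    os.rename(os.path.join(pool_dir, name), dst_path)  # metadata-only op

root = tempfile.mkdtemp()
pool = os.path.join(root, "chunkfilepool")
format_pool(pool, count=4)
get_chunk(pool, os.path.join(root, "chunk_1"))
print(len(os.listdir(pool)))  # -> 3 chunks left in the pool
```

Because every block of every chunk was written during formatting, later overwrites by the state machine land on already-allocated blocks, which is what makes the write amplification on the allocation path effectively zero.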

【C-Plan】Build and deploy

I. Build

1. Basic steps

The build went fairly smoothly: just follow the steps; it took a little over an hour in total.

A very smooth experience overall.

The build-success message:

(screenshot)

Packaging succeeded:

(screenshot)

II. Deploy

1. Basic steps

Deployment hit quite a few pitfalls along the way; results first:

Cluster deployed successfully:

(screenshot)

Cluster status check:

(screenshot)

nbd0 volume check:

(screenshot)

2. Pitfalls and fixes

  • Pick the right image

At first I extracted and deployed the freshly built tarballs directly inside the curvebuild image; setting up that environment took a lot of time.

Using the curveintegration image is by far the least hassle.

  • Switching to the curve user prints this warning:
-bash: /dev/null: Permission denied

This is a permission problem; setting the mode fixes it:

chmod 777 /dev/null
  • wget is slow and frequently interrupted

Switch to the multi-threaded downloader axel, which is much faster.

Pass -c to enable resume, so stalls to zero speed are no longer a worry.

  • Cluster deployment errors with: Curl error (28): Timeout was reached

A connection timeout, often appearing together with error 6; this is a repository mirror problem.

Switching to a working mirror fixes it.

  • Cluster deployment errors with: /usr/bin/python: No such file or directory

There is indeed no python under bin, only version-specific binaries.

Symlink python to python2.7:

ln -s /usr/bin/python2.7 /usr/bin/python
  • Cluster deployment errors with: Errors during downloading metadata for repository 'base'

Again a mirror problem.

Switching to a working mirror fixes it.

  • Cluster deployment errors with: dev/urandom not found

This seems to be a permission issue; urandom does exist under /dev.

Running the command with sudo fixes it.

chunkserver will exit with core dump when receive SIGINT

curve user feedback

Describe the bug

  1. kill -2 $chunkserverid in a normal environment
  2. a chunkserver core file is generated

To Reproduce
kill -2 $chunkserverid

Expected behavior
chunkserver exits normally

Versions
OS: debian9
Compiler: use curve release version
curve-mds: v1.3.0-beta
curve-chunkserver: v1.3.0-beta
curve-snapshotcloneserver: v1.3.0-beta
curve-sdk: v1.3.0-beta
nebd: v1.3.0-beta
curve-nbd: v1.3.0-beta

Additional context/screenshots
(screenshots)
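For reference, a common pattern for exiting cleanly on SIGINT is to install a handler that only sets a flag and let the main loop tear services down in order. This is a minimal, hypothetical sketch of the pattern in general (the chunkserver itself is C++, and its actual shutdown path may differ):

```python
# Hypothetical graceful-shutdown sketch: the SIGINT handler sets a flag
# instead of letting the default disposition kill the process mid-operation.
import os
import signal

stop_requested = False

def handle_sigint(signum, frame):
    global stop_requested
    stop_requested = True        # defer the actual shutdown to the main loop

signal.signal(signal.SIGINT, handle_sigint)

os.kill(os.getpid(), signal.SIGINT)   # simulate `kill -2 <pid>`

if stop_requested:
    # ... flush data, stop copyset services in order, then exit normally ...
    print("exiting cleanly")
```

With the default disposition left in place (or with cleanup done inside the handler itself), an interrupt arriving at the wrong moment can abort the process mid-write, which is consistent with the core dump reported above.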

ARM64 platform support

Is your feature request related to a problem?

No

Describe the solution you'd like

Hello Curve maintainers. I am an open-source developer from Huawei, and I would like to gauge the Curve community's interest in supporting the ARM64 architecture. I want to drive ARM64 support in the Curve community, with the following plan, which I would like to discuss with the experts here:

  1. Bring up ARM64 CI

    Curve's current CI platform is a self-hosted jenkins. We can donate ARM64 virtual machines to the community CI platform to enable ARM64 CI.

  2. Submit ARM64-support patches so that curve builds and tests pass on arm64

    I have already finished the build-related patch on a local arm64 machine: wangxiyuan@19b2a66. On top of this patch, curve builds successfully. Testing and fixing the test suite is still in progress.

  3. ARM64 release

    Once the build and test issues on arm64 are all fixed and ARM64 CI has run stably for a while, official arm64 binaries can be released.

The ARM platform is increasingly popular and gaining ground in personal PCs, servers, and cloud computing. Is the curve community interested in ARM64 support? If so, I can take on the related development as well as the long-term maintenance of the ARM CI. Looking forward to your reply; discussion is welcome. Thanks.

Describe alternatives you've considered

There are many options for ARM CI, such as Travis CI. But since curve runs a self-hosted CI platform, donating ARM64 machines seems the best option: it keeps CI unified and easier to control and maintain.

Additional context/screenshots

None

【C-Plan】Build and deploy

Environment

  • Cloud server: 1 CPU core, 2 GB RAM
  • OS: CentOS 8

Deployed without docker; curve version 1.0.2-rc0.

Problems and solutions

Problems

  1. Downloading etcd-client repeatedly timed out
  2. ./chunkserver_ctl.sh start all failed to start
  3. Permission problems
  4. An ansible task failed with:
    fatal: [localhost]: FAILED! => {
    "assertion": "ansible_kernel|version_compare('3.15', '>=')", 
    "changed": false, 
    "evaluated_to": false
    }
    

Solutions

  • Problems 1 and 2: see the fixes in #245;
  • Problem 3: add sudo;
  • Problem 4: upgrade the CentOS kernel and reboot; the kernel finally installed was 4.18.0-240.10.1.el8_3.x86_64.

Deployment result

Cluster status:
(screenshot)

Create a CURVE volume and mount it locally via NBD:
(screenshot)
