
regionless-storage-service's People

Contributors

jshaofuturewei, kxu1985, pdgetrf, zmn223

regionless-storage-service's Issues

0728 test run

part of 730 tests
goal: test 1M record load (each record value is 10KB)

test env

export NUM_OF_SI=5
export SI_INSTANCE_TYPE=t2.large
export RKV_INSTANCE_TYPE=t2.xlarge
export JAEGER_INSTANCE_TYPE=t2.xlarge
export JAEGER_ROOT_DISK_VOLUME=200
export YCSB_INSTANCE_TYPE=t2.large

test procedure (load only)

after the test lab has been set up, log in to the ycsb vm, cd work/go-ycsb, and edit the workloads/workloada file with the following changes:

threadcount=2
fieldlength=1000
recordcount=1000000
operationcount=1000000
workload=core

then run ./bin/go-ycsb load rkv -P workloads/workloada

test result

Run finished, takes 1h9m37.403247648s
INSERT - Takes(s): 4177.4, Count: 999990, OPS: 239.4, Avg(us): 8113, Min(us): 4344, Max(us): 217343, 99th(us): 25071, 99.9th(us): 43391, 99.99th(us): 62943

other observations

  • rkv service host used 547MB of memory;
  • each redis si used ~4.3GB of memory

minimal service framework

This is to set up the most basic e2e framework that

  • accepts API requests
  • performs CRUD using a storage instance

Features such as partitioning and replication are not required for this task. The goal is to have a running service that works with a simple storage instance. Later on, features will be built upon this service.

rkv service crashes due to temporary connection issue to backend redis

during perf runs, rkv was observed to exit in the middle of the load test; below is the /tmp/rkv.log file
this does not always happen; 2 times in about 20 runs.

$ cat /tmp/rkv.log 
The url is 35.166.131.63:6666 and the pool is &{0x6c7ba0 <nil> <nil> 80 12000 0s false 0s {0 0} false 0 {0 {0 0}} <nil> {0 <nil> <nil>} 0 0}
The url is 34.217.123.197:6666 and the pool is &{0x6c7ba0 <nil> <nil> 80 12000 0s false 0s {0 0} false 0 {0 {0 0}} <nil> {0 <nil> <nil>} 0 0}
The url is 18.237.48.185:6666 and the pool is &{0x6c7ba0 <nil> <nil> 80 12000 0s false 0s {0 0} false 0 {0 {0 0}} <nil> {0 <nil> <nil>} 0 0}
The url is 34.215.240.3:6666 and the pool is &{0x6c7ba0 <nil> <nil> 80 12000 0s false 0s {0 0} false 0 {0 {0 0}} <nil> {0 <nil> <nil>} 0 0}
The url is 52.89.46.186:6666 and the pool is &{0x6c7ba0 <nil> <nil> 80 12000 0s false 0s {0 0} false 0 {0 {0 0}} <nil> {0 <nil> <nil>} 0 0}
2022/07/29 02:14:45 ERROR: fail init redis: dial tcp 52.89.46.186:6666: connect: connection refused

ycsb log indicated that exit happened after 4750 seconds

INSERT - Takes(s): 4710.0, Count: 357472, OPS: 75.9, Avg(us): 25817, Min(us): 7364, Max(us): 19316735, 99th(us): 33439, 99.9th(us): 5459967, 99.99th(us): 12992511
INSERT - Takes(s): 4720.0, Count: 357475, OPS: 75.7, Avg(us): 25891, Min(us): 7364, Max(us): 19316735, 99th(us): 33439, 99.9th(us): 5484543, 99.99th(us): 12992511
INSERT - Takes(s): 4730.0, Count: 357475, OPS: 75.6, Avg(us): 25891, Min(us): 7364, Max(us): 19316735, 99th(us): 33439, 99.9th(us): 5484543, 99.99th(us): 12992511
INSERT - Takes(s): 4740.0, Count: 357476, OPS: 75.4, Avg(us): 25941, Min(us): 7364, Max(us): 19316735, 99th(us): 33439, 99.9th(us): 5496831, 99.99th(us): 13049855
INSERT - Takes(s): 4750.0, Count: 357477, OPS: 75.3, Avg(us): 25997, Min(us): 7364, Max(us): 20168703, 99th(us): 33439, 99.9th(us): 5541887, 99.99th(us): 13213695
INSERT_ERROR - Takes(s): 9.5, Count: 8294, OPS: 876.4, Avg(us): 2085, Min(us): 1776, Max(us): 9503, 99th(us): 4483, 99.9th(us): 5543, 99.99th(us): 6971
INSERT - Takes(s): 4760.0, Count: 357477, OPS: 75.1, Avg(us): 25997, Min(us): 7364, Max(us): 20168703, 99th(us): 33439, 99.9th(us): 5541887, 99.99th(us): 13213695
INSERT_ERROR - Takes(s): 19.5, Count: 16847, OPS: 865.6, Avg(us): 2113, Min(us): 1776, Max(us): 14327, 99th(us): 4527, 99.9th(us): 5759, 99.99th(us): 11015
INSERT - Takes(s): 4770.0, Count: 357477, OPS: 74.9, Avg(us): 25997, Min(us): 7364, Max(us): 20168703, 99th(us): 33439, 99.9th(us): 5541887, 99.99th(us): 13213695
INSERT_ERROR - Takes(s): 29.5, Count: 25455, OPS: 863.9, Avg(us): 2117, Min(us): 1776, Max(us): 14327, 99th(us): 4519, 99.9th(us): 5675, 99.99th(us): 9503
INSERT - Takes(s): 4780.0, Count: 357477, OPS: 74.8, Avg(us): 25997, Min(us): 7364, Max(us): 20168703, 99th(us): 33439, 99.9th(us): 5541887, 99.99th(us): 13213695
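
One way to tolerate a temporary connection failure is to retry the connection with a growing wait instead of exiting on the first refused dial. The following is only a minimal sketch of that idea; dialFunc, dialWithRetry and errRefused are hypothetical names, not taken from the rkv code, and the real redis pool setup would replace the simulated dialer.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errRefused simulates the "connection refused" error seen in /tmp/rkv.log.
var errRefused = errors.New("connect: connection refused")

// dialFunc is a hypothetical stand-in for whatever rkv uses to connect to a
// redis storage instance.
type dialFunc func(addr string) error

// dialWithRetry calls dial up to attempts times, doubling the wait after each
// failure, and returns an error only if every attempt fails.
func dialWithRetry(dial dialFunc, addr string, attempts int, wait time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = dial(addr); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return fmt.Errorf("init redis %s: gave up after %d attempts: %w", addr, attempts, err)
}

func main() {
	calls := 0
	// Simulated backend that refuses the first two connections.
	flaky := func(addr string) error {
		calls++
		if calls < 3 {
			return errRefused
		}
		return nil
	}
	err := dialWithRetry(flaky, "52.89.46.186:6666", 5, 10*time.Millisecond)
	fmt.Println("calls:", calls, "err:", err)
}
```

The caller still gets an error when the backend stays down, so it can decide whether to exit or to keep serving with the remaining stores.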

Ignite evaluation, LW reference

Goals:

  • Understand partition/hashing mechanism
  • Understand strong (sequential) consistency
  • Understand List-Watch implementation with partitioned instances
  • Understand single instance performance

Ignite design doc

10M test run

YCSB Setup

***************** properties *****************
"threadcount"="4"
"dotransactions"="false"
"fieldlength"="160   #intended 500; due to the encoding, 160 length would yield about 500 payload"
"operationcount"="10000000"
"requestdistribution"="uniform"
"recordcount"="10000000"
"insertproportion"="0"
"readproportion"="0.5"
"workload"="core"
"readallfields"="true"
"updateproportion"="0.5"
"scanproportion"="0"
**********************************************

AWS Instances

image

Jaeger crashed at the end.

YCSB Result

Run finished, takes 9h20m19.700252511s
INSERT - Takes(s): 33619.7, Count: 9999999, OPS: 297.4, Avg(us): 13412, Min(us): 4300, Max(us): 1058815, 99th(us): 21951, 99.9th(us): 36703, 99.99th(us): 56831

RKV Memory

ubuntu@ip-172-31-13-231:~$ free -g
               total        used        free      shared  buff/cache   available
Mem:              31           4          25           0           1          26
Swap:              0           0           0

SI Memory

ubuntu@ip-172-31-9-38:~$  free -g
               total        used        free      shared  buff/cache   available
Mem:              31          13          16           0           1          17
Swap:              0           0           0

CPU

image

Jaeger

image

image

ycsb load with mem databases fails, yielding INSERT_ERROR

when rkv uses mem database as storage backend, go-ycsb load phase fails with INSERT_ERROR (go-ycsb workload specifies threadcount 4):

$ ./bin/go-ycsb load rkv -P workloads/workloada
***************** properties *****************
"dotransactions"="false"
"operationcount"="1000"
"scanproportion"="0"
"workload"="core"
"readallfields"="true"
"threadcount"="4"
"requestdistribution"="uniform"
"updateproportion"="0.5"
"recordcount"="1000"
"readproportion"="0.5"
"insertproportion"="0"
**********************************************
Run finished, takes 179.69837ms
INSERT - Takes(s): 0.2, Count: 55, OPS: 310.2, Avg(us): 1403, Min(us): 492, Max(us): 3987, 99th(us): 3213, 99.9th(us): 3987, 99.99th(us): 3987
INSERT_ERROR - Takes(s): 0.1, Count: 941, OPS: 6705.2, Avg(us): 523, Min(us): 187, Max(us): 9167, 99th(us): 3247, 99.9th(us): 7919, 99.99th(us): 9167

rkv log indicates concurrent map write causing the crash

The url is 172.31.9.140:6379 and the pool is &{0x6c8780 <nil> <nil> 80 12000 0s false 0s {0 0} false 0 {0 {0 0}} <nil> {0 <nil> <nil>} 0 0}
The url is 172.31.12.96:6380 and the pool is &{0x6c8780 <nil> <nil> 80 12000 0s false 0s {0 0} false 0 {0 {0 0}} <nil> {0 <nil> <nil>} 0 0}
fatal error: concurrent map writes

goroutine 315 [running]:
runtime.throw(0x7cdf55, 0x15)
        /home/howell/go/go1.16.9/src/runtime/panic.go:1117 +0x72 fp=0xc0003f3e30 sp=0xc0003f3e00 pc=0x437ab2
runtime.mapassign_faststr(0x7636c0, 0xc0003821e0, 0xc00002264c, 0x3, 0x0)
        /home/howell/go/go1.16.9/src/runtime/map_faststr.go:211 +0x3f1 fp=0xc0003f3e98 sp=0xc0003f3e30 pc=0x415eb1
github.com/regionless-storage-service/pkg/database.MemDatabase.Put(...)
        /home/howell/work/regionless-storage-service/pkg/database/mem.go:31
github.com/regionless-storage-service/pkg/database.(*MemDatabase).Put(0xc0003c6000, 0xc00002264c, 0x3, 0xc000468000, 0xdd7, 0x0, 0x0, 0x0, 0x83be20)
        <autogenerated>:1 +0x65 fp=0xc0003f3ed0 sp=0xc0003f3e98 pc=0x6c8ba5
github.com/regionless-storage-service/pkg/piping.(*ChainPiping).Write.func1(0xc000022640, 0x83be20, 0xc00051dd10, 0xc0000ca2a0, 0xc00002264c, 0x3, 0xc000468000, 0xdd7)
        /home/howell/work/regionless-storage-service/pkg/piping/chain_piping_manager.go:61 +0x18d fp=0xc0003f3fa0 sp=0xc0003f3ed0 pc=0x6d244d
runtime.goexit()
        /home/howell/go/go1.16.9/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc0003f3fa8 sp=0xc0003f3fa0 pc=0x46d3e1
created by github.com/regionless-storage-service/pkg/piping.(*ChainPiping).Write
        /home/howell/work/regionless-storage-service/pkg/piping/chain_piping_manager.go:57 +0x32d

...

Other things worthy of noting:

  • the delay code of the mem database Put method was temporarily commented out before the test;
  • go-ycsb workload threadcount 4
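
The trace shows several goroutines writing to the same Go map through MemDatabase.Put, which the runtime rejects with "concurrent map writes". A common remedy is to guard the map with a mutex. The sketch below is an assumption about how that could look; SafeMemDatabase and its fields are hypothetical and do not mirror pkg/database/mem.go.

```go
package main

import (
	"fmt"
	"sync"
)

// SafeMemDatabase is a hypothetical in-memory store whose map is guarded by
// a read/write mutex so that concurrent Put calls are safe.
type SafeMemDatabase struct {
	mu   sync.RWMutex
	data map[string]string
}

func NewSafeMemDatabase() *SafeMemDatabase {
	return &SafeMemDatabase{data: make(map[string]string)}
}

// Put takes the write lock; a pointer receiver is required so the mutex is
// shared rather than copied.
func (d *SafeMemDatabase) Put(key, value string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.data[key] = value
}

// Get takes the read lock, so readers do not block each other.
func (d *SafeMemDatabase) Get(key string) (string, bool) {
	d.mu.RLock()
	defer d.mu.RUnlock()
	v, ok := d.data[key]
	return v, ok
}

// Len reports the number of stored keys.
func (d *SafeMemDatabase) Len() int {
	d.mu.RLock()
	defer d.mu.RUnlock()
	return len(d.data)
}

func main() {
	db := NewSafeMemDatabase()
	var wg sync.WaitGroup
	for w := 0; w < 4; w++ { // 4 writers, like threadcount=4 in the workload
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < 250; i++ {
				db.Put(fmt.Sprintf("user%d-%d", w, i), "v")
			}
		}(w)
	}
	wg.Wait()
	fmt.Println("keys stored:", db.Len())
}
```

Note that the trace shows Put declared on a value receiver (MemDatabase.Put); with a mutex inside the struct, the methods would need pointer receivers as above.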

0729 test run

test config

redis persistence is disabled (save "")

export NUM_OF_SI=6
export RKV_ROOT_DISK_VOLUME=100
export SI_ROOT_DISK_VOLUME=100
export SI_INSTANCE_TYPE=t2.xlarge
export RKV_INSTANCE_TYPE=t2.2xlarge
export JAEGER_INSTANCE_TYPE=t2.2xlarge
export JAEGER_ROOT_DISK_VOLUME=200
export YCSB_INSTANCE_TYPE=t2.2xlarge
export YCSB_ROOT_DISK_VOLUME=40

records: 5M
k-v payload: 5KB value

test procedure

load test only
workloada setting

threadcount=4
fieldlength=160   #intended 500; due to the encoding, 160 length would yield about 500 payload
recordcount=5000000
operationcount=5000000

test result

ycsb log:

INSERT - Takes(s): 10840.0, Count: 3188140, OPS: 294.1, Avg(us): 13548, Min(us): 4082, Max(us): 1046015, 99th(us): 22239, 99.9th(us): 29471, 99.99th(us): 223231
INSERT - Takes(s): 10850.0, Count: 3191224, OPS: 294.1, Avg(us): 13548, Min(us): 4082, Max(us): 1046015, 99th(us): 22239, 99.9th(us): 29471, 99.99th(us): 223231
INSERT - Takes(s): 10860.0, Count: 3194339, OPS: 294.1, Avg(us): 13547, Min(us): 4082, Max(us): 1046015, 99th(us): 22239, 99.9th(us): 29471, 99.99th(us): 223231
INSERT - Takes(s): 10870.0, Count: 3197327, OPS: 294.1, Avg(us): 13547, Min(us): 4082, Max(us): 1046015, 99th(us): 22223, 99.9th(us): 29471, 99.99th(us): 223231
INSERT - Takes(s): 10880.0, Count: 3200144, OPS: 294.1, Avg(us): 13547, Min(us): 4082, Max(us): 1046015, 99th(us): 22223, 99.9th(us): 29455, 99.99th(us): 223231
... // approaching 4M records (where redis has almost exhausted its memory); note that OPS is actually very low (<1 ops at this moment)
INSERT - Takes(s): 13820.0, Count: 3931533, OPS: 284.5, Avg(us): 13969, Min(us): 4082, Max(us): 19005439, 99th(us): 22079, 99.9th(us): 29439, 99.99th(us): 1022463
INSERT - Takes(s): 13830.0, Count: 3931539, OPS: 284.3, Avg(us): 13986, Min(us): 4082, Max(us): 22298623, 99th(us): 22079, 99.9th(us): 29455, 99.99th(us): 1022463
INSERT - Takes(s): 13840.0, Count: 3931541, OPS: 284.1, Avg(us): 13993, Min(us): 4082, Max(us): 22298623, 99th(us): 22079, 99.9th(us): 29455, 99.99th(us): 1022975
INSERT - Takes(s): 13850.0, Count: 3931542, OPS: 283.9, Avg(us): 13999, Min(us): 4082, Max(us): 22790143, 99th(us): 22079, 99.9th(us): 29455, 99.99th(us): 1022975
... // even no ops sometimes
INSERT - Takes(s): 14550.0, Count: 3932125, OPS: 270.2, Avg(us): 14673, Min(us): 4082, Max(us): 33439743, 99th(us): 22111, 99.9th(us): 29855, 99.99th(us): 5308415
INSERT - Takes(s): 14560.0, Count: 3932125, OPS: 270.1, Avg(us): 14673, Min(us): 4082, Max(us): 33439743, 99th(us): 22111, 99.9th(us): 29855, 99.99th(us): 5308415
INSERT - Takes(s): 14570.0, Count: 3932125, OPS: 269.9, Avg(us): 14673, Min(us): 4082, Max(us): 33439743, 99th(us): 22111, 99.9th(us): 29855, 99.99th(us): 5308415
INSERT - Takes(s): 14580.0, Count: 3932125, OPS: 269.7, Avg(us): 14673, Min(us): 4082, Max(us): 33439743, 99th(us): 22111, 99.9th(us): 29855, 99.99th(us): 5308415
...
INSERT - Takes(s): 14640.0, Count: 3932125, OPS: 268.6, Avg(us): 14673, Min(us): 4082, Max(us): 33439743, 99th(us): 22111, 99.9th(us): 29855, 99.99th(us): 5308415
INSERT - Takes(s): 14650.0, Count: 3932126, OPS: 268.4, Avg(us): 14682, Min(us): 4082, Max(us): 33980415, 99th(us): 22111, 99.9th(us): 29855, 99.99th(us): 5320703
INSERT - Takes(s): 14660.0, Count: 3932129, OPS: 268.2, Avg(us): 14708, Min(us): 4082, Max(us): 34570239, 99th(us): 22111, 99.9th(us): 29871, 99.99th(us): 5345279
INSERT - Takes(s): 14670.0, Count: 3932133, OPS: 268.0, Avg(us): 14722, Min(us): 4082, Max(us): 34570239, 99th(us): 22111, 99.9th(us): 29871, 99.99th(us): 5402623
INSERT - Takes(s): 14680.0, Count: 3932136, OPS: 267.9, Avg(us): 14729, Min(us): 4082, Max(us): 34570239, 99th(us): 22111, 99.9th(us): 29871, 99.99th(us): 5423103
...
INSERT - Takes(s): 17050.0, Count: 3933443, OPS: 230.7, Avg(us): 17128, Min(us): 4082, Max(us): 34570239, 99th(us): 22191, 99.9th(us): 31215, 99.99th(us): 12156927
INSERT - Takes(s): 17060.0, Count: 3933451, OPS: 230.6, Avg(us): 17144, Min(us): 4082, Max(us): 34570239, 99th(us): 22191, 99.9th(us): 31231, 99.99th(us): 12230655
INSERT - Takes(s): 17070.0, Count: 3933454, OPS: 230.4, Avg(us): 17155, Min(us): 4082, Max(us): 34570239, 99th(us): 22191, 99.9th(us): 31231, 99.99th(us): 12230655

rkv memory usage when the system is at about 4M records

              total        used        free      shared  buff/cache   available
Mem:            31G        1.7G         28G        844K        1.5G         29G
Swap:            0B          0B          0B

jaeger memory and disk usage when the system is at about 4M records

               total        used        free      shared  buff/cache   available
Mem:            31Gi       7.8Gi        22Gi       1.0Mi       926Mi        23Gi
Swap:             0B          0B          0B

Filesystem      Size  Used Avail Use% Mounted on
/dev/root       194G  2.4G  192G   2% /

ycsb client memory and disk usage when the system is at about 4M records

               total        used        free      shared  buff/cache   available
Mem:            31Gi       334Mi        30Gi       0.0Ki       788Mi        30Gi
Swap:             0B          0B          0B

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        39G  4.3G   35G  11% /

observations

  • redis memory usage needs to be estimated with sufficient buffer beyond 5KB per record. Suggest using x2 or a bit more.
  • rkv does not exit when redis memory is almost exhausted (though ops is very low when that happens); disabling redis save should help avoid the rkv crash in that case
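
The sizing suggestion above can be written as simple arithmetic. This is a rough sketch only: estimateRedisBytes is a hypothetical helper, 5KB is taken as 5000 bytes, and the per-instance figure assumes data spreads evenly over the 6 storage instances with no replication, which may not match the actual deployment.

```go
package main

import "fmt"

// estimateRedisBytes returns a rough memory budget for the whole data set:
// records * valueBytes scaled by a safety factor (the "x2 or a bit more"
// from the observation). Key size and replication are ignored on purpose.
func estimateRedisBytes(records, valueBytes int64, factor float64) int64 {
	return int64(float64(records*valueBytes) * factor)
}

func main() {
	const gb = 1e9
	// 5M records, 5KB values, x2 buffer.
	total := estimateRedisBytes(5000000, 5000, 2.0)
	fmt.Printf("total: %.1f GB\n", float64(total)/gb)
	// Assumption: even spread over NUM_OF_SI=6 instances, no replication.
	fmt.Printf("per SI: %.1f GB\n", float64(total)/gb/6)
}
```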

configuration experiment with redis count

5 x t2.medium redis

ubuntu@ip-172-31-3-247:~/work/go-ycsb$ ./bin/go-ycsb load rkv -P workloads/workloada
***************** properties *****************
"readproportion"="0.5"
"updateproportion"="0.5"
"requestdistribution"="uniform"
"workload"="core"
"readallfields"="true"
"dotransactions"="false"
"recordcount"="10000"
"fieldlength"="1000"
"threadcount"="4"
"scanproportion"="0"
"insertproportion"="0"
"operationcount"="10000"
**********************************************
INSERT - Takes(s): 10.0, Count: 1564, OPS: 156.6, Avg(us): 25321, Min(us): 18416, Max(us): 57023, 99th(us): 43679, 99.9th(us): 56319, 99.99th(us): 57023
INSERT - Takes(s): 20.0, Count: 3116, OPS: 155.9, Avg(us): 25435, Min(us): 8336, Max(us): 57023, 99th(us): 43519, 99.9th(us): 55999, 99.99th(us): 57023
INSERT - Takes(s): 30.0, Count: 4496, OPS: 149.9, Avg(us): 26455, Min(us): 8336, Max(us): 87103, 99th(us): 49503, 99.9th(us): 66495, 99.99th(us): 87103

INSERT - Takes(s): 40.0, Count: 5843, OPS: 146.1, Avg(us): 27147, Min(us): 8336, Max(us): 87103, 99th(us): 51871, 99.9th(us): 67903, 99.99th(us): 83455
INSERT - Takes(s): 50.0, Count: 7243, OPS: 144.9, Avg(us): 27380, Min(us): 8336, Max(us): 87103, 99th(us): 51423, 99.9th(us): 66495, 99.99th(us): 83455

10 x t2.medium redis

Getting Non-existing Value with Concurrent Operations

This is a finding when doing the consistency validation.

  • Deployment configuration: 4 storage instance (1 us-east-1, 1 us-east-2, 1 us-west-1, 1 us-west-2)
  • Sync/Async nodes: 3 async nodes (localreplicanum=0, remotereplicanum=3, remotestorelatencythresholdinmillisec=0)
  • Consistency validation setting: num_client=5, duration=20

Result: a client can get a value that does not exist in the redis backend. For example, in the following screenshot, client 4 got the value 582. According to the original log, the corresponding revision number is 132. The value returned by the curl command for revision 132 is 311.

Screen Shot 2022-09-19 at 9 38 49 AM
Screen Shot 2022-09-19 at 9 39 56 AM

new feature request: rkv update with current latest revision number

Currently the rkv update op updates the value of the specified key; each key keeps its history of values as a list of revisions.

Take an imaginary scenario that increments a request counter (a key named "count") on receiving requests:

  1. client reads the latest revision of count, say n;
  2. client updates count with a new value n+1;

Multiple components (e.g. server handler goroutines) may update count concurrently. We need a mechanism to ensure the correctness of the count update. One simple approach is to make the update conditional on a comparison with the latest revision number.
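
The conditional update could look roughly like the sketch below: the write succeeds only if the caller's expected revision is still the latest, and the caller retries otherwise. revStore, UpdateIfLatest and increment are hypothetical names for illustration, and the per-key revision numbering is a simplification that may differ from how rkv assigns revisions.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"sync"
)

// ErrRevisionMismatch is returned when the expected revision is stale.
var ErrRevisionMismatch = errors.New("revision mismatch")

// revStore is a hypothetical store keeping every value of a key;
// revision n of a key is revs[key][n-1].
type revStore struct {
	mu   sync.Mutex
	revs map[string][]string
}

func newRevStore() *revStore { return &revStore{revs: make(map[string][]string)} }

// Latest returns the newest value and its revision (0 if the key is absent).
func (s *revStore) Latest(key string) (string, int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	h := s.revs[key]
	if len(h) == 0 {
		return "", 0
	}
	return h[len(h)-1], len(h)
}

// UpdateIfLatest appends value as a new revision only when expectedRev is
// still the latest revision of key; otherwise nothing is written.
func (s *revStore) UpdateIfLatest(key, value string, expectedRev int) (int, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.revs[key]) != expectedRev {
		return 0, ErrRevisionMismatch
	}
	s.revs[key] = append(s.revs[key], value)
	return len(s.revs[key]), nil
}

// increment reads the latest count and retries until its conditional
// update wins.
func increment(s *revStore, key string) {
	for {
		val, rev := s.Latest(key)
		n, _ := strconv.Atoi(val) // an absent key yields "" and counts as 0
		if _, err := s.UpdateIfLatest(key, strconv.Itoa(n+1), rev); err == nil {
			return
		}
	}
}

func main() {
	s := newRevStore()
	var wg sync.WaitGroup
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 100; i++ {
				increment(s, "count")
			}
		}()
	}
	wg.Wait()
	val, rev := s.Latest("count")
	fmt.Println("count:", val, "revision:", rev)
}
```

With this check in place, no increment is lost: the final count should equal the total number of increments issued by all writers.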

[tracking] 0730 perf test result

This is NOT an issue, but a tracking report of 0730 perf testing.
It should have been a wiki page; unfortunately, the wiki is not yet available for this repo while it is private.

Perf tests are conducted with go-ycsb using the rkv driver.
IMPORTANT: a full test has 2 steps, load + run, as below:

./bin/go-ycsb load rkv -P workloads/workloada
./bin/go-ycsb run rkv -P workloads/workloada

workload-a ( 50% read, 50% update)

record count | config | total time(s) | insert count | ops | latency avg(us) | min(us) | max(us) | 99%(us) | 99.9%(us) | 99.99%(us) | rkv used mem | jaeger notes
1K | redisx2, all t2.micro(1cpu, 1GB mem), client threads=1 | 4.45 | 1000 | 226 | 4405 | 3386 | 26479 | 10191 | 21135 | 26479 | 150M | createKV: 3060us, set kv:3038us, put index: 8us
1M | redisx2, rkv+jaeger t2.medium (1cpu, 4GB), client thread=100 | 52'4" | 1,000,000 | 320 | 312349 | 4732 | 1531903 | 452607 | 1305599 | 1353727 | 838M |
10M | redisx5, rkv t2.large(2cpu, 8GB), others t2.medium, threads 100 | 7:14'11" | 10,000,000 | 384 | 260467 | 1412 | 7389183 | 479999 | 1273855 | 1341439 | 4.0G | jaeger crashed, likely due to insufficient mem
FYI: redis 1K ycsb result is
Run finished, takes 605.65671ms
INSERT - Takes(s): 0.6, Count: 1000, OPS: 1666.0, Avg(us): 572, Min(us): 461, Max(us): 3987, 99th(us): 1698, 99.9th(us): 3965, 99.99th(us): 3987

930 Performance Test - 100M staging

YCSB Parameters

threadcount=32
fieldlength=80
recordcount=100000000
operationcount=100000000

ONE thread

WRITE

1 thread
image

INSERT - Takes(s): 150.0, Count: 8962, OPS: 59.8, Avg(us): 16698, Min(us): 1429, Max(us): 29279, 99th(us): 25103, 99.9th(us): 25951, 99.99th(us): 28735
INSERT - Takes(s): 160.0, Count: 9550, OPS: 59.7, Avg(us): 16713, Min(us): 1429, Max(us): 29279, 99th(us): 25087, 99.9th(us): 25951, 99.99th(us): 28735
Run finished, takes 2m46.888740023s
INSERT - Takes(s): 166.9, Count: 10000, OPS: 59.9, Avg(us): 16650, Min(us): 1429, Max(us): 29279, 99th(us): 25087, 99.9th(us): 26015, 99.99th(us): 28735

READ

image

READ   - Takes(s): 10.0, Count: 7002, OPS: 700.3, Avg(us): 1420, Min(us): 915, Max(us): 8407, 99th(us): 2225, 99.9th(us): 4899, 99.99th(us): 7427
Run finished, takes 14.256774727s
READ   - Takes(s): 14.3, Count: 10000, OPS: 701.5, Avg(us): 1418, Min(us): 915, Max(us): 8839, 99th(us): 2229, 99.9th(us): 4555, 99.99th(us): 8407

Full Test (32 Thread)

WRITE

image

INSERT - Takes(s): 110.0, Count: 145534, OPS: 1323.1, Avg(us): 17606, Min(us): 1333, Max(us): 1020927, 99th(us): 26975, 99.9th(us): 40895, 99.99th(us): 1018367
INSERT - Takes(s): 120.0, Count: 162599, OPS: 1355.1, Avg(us): 17438, Min(us): 1333, Max(us): 1015807, 99th(us): 27023, 99.9th(us): 39423, 99.99th(us): 1012735
INSERT - Takes(s): 130.0, Count: 179170, OPS: 1378.3, Avg(us): 17356, Min(us): 1333, Max(us): 1013247, 99th(us): 27023, 99.9th(us): 38687, 99.99th(us): 1010175

READ

Hardcoded RemoteStoreLatencyThresholdInMilliSec might cause confusion and zero remote stores to be selected

Tried ./setup_test_lab.sh ./si_def_4_region_micro.json with RemoteStoreLatencyThresholdInMilliSec set to 100ms in the configuration.

Got zero remote stores twice, because the measured latencies to the stores in other states were below 100ms.

Updated it to 50ms to make it work.

Since we have different si_def settings, users might not know how to set RemoteStoreLatencyThresholdInMilliSec to a reasonable value that distinguishes local from remote stores.

We might consider a new strategy that builds latency histograms to pick remote instances.
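
As a simpler stand-in for a full histogram, one option is to sort the measured latencies and cut at the largest gap between neighbours, so no fixed threshold is needed. This is only a sketch of that alternative technique, not the project's strategy; storeLatency, splitByLargestGap and the sample latencies are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// storeLatency pairs a store name with its measured latency.
type storeLatency struct {
	Name    string
	Latency time.Duration
}

// splitByLargestGap sorts stores by latency and cuts the list at the largest
// gap between two neighbours. Stores below the cut are treated as local,
// stores above it as remote. With fewer than two stores, all are local.
func splitByLargestGap(stores []storeLatency) (local, remote []storeLatency) {
	sorted := append([]storeLatency(nil), stores...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Latency < sorted[j].Latency })
	if len(sorted) < 2 {
		return sorted, nil
	}
	cut, maxGap := 1, time.Duration(-1)
	for i := 1; i < len(sorted); i++ {
		if gap := sorted[i].Latency - sorted[i-1].Latency; gap > maxGap {
			cut, maxGap = i, gap
		}
	}
	return sorted[:cut], sorted[cut:]
}

func main() {
	// Illustrative latencies only; all are below a fixed 100ms threshold,
	// which is the situation that produced zero remote stores.
	stores := []storeLatency{
		{"store1", 1 * time.Millisecond},
		{"store2", 60 * time.Millisecond},
		{"store3", 2 * time.Millisecond},
		{"store4", 70 * time.Millisecond},
	}
	local, remote := splitByLargestGap(stores)
	fmt.Println("local:", local)
	fmt.Println("remote:", remote)
}
```

The weakness of this approach is that it always yields at least one remote store when two or more stores exist, even if all of them are in the same region, so a minimum gap would probably still be needed.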

cross-(2)region

YCSB Config

threadcount=4
fieldlength=160   #intended 500; due to the encoding, 160 length would yield about 500 payload
recordcount=50000
operationcount=50000
workload=core

YCSB Setup

***************** properties *****************
"insertproportion"="0"
"fieldlength"="160   #intended 500; due to the encoding, 160 length would yield about 500 payload"
"scanproportion"="0"
"threadcount"="4"
"recordcount"="50000"
"readallfields"="true"
"dotransactions"="true"
"updateproportion"="0.5"
"requestdistribution"="uniform"
"readproportion"="0.5"
"operationcount"="50000"
"workload"="core"
**********************************************

image

image

add replication to the service

add a component called replication manager that is in charge of replicating writes to replicas while maintaining consistency

30M test run

YCSB Config

threadcount=4
fieldlength=160   #intended 500; due to the encoding, 160 length would yield about 500 payload
recordcount=30000000
operationcount=30000000
workload=core

YCSB Setup

***************** properties *****************
"requestdistribution"="uniform"
"recordcount"="30000000"
"readproportion"="0.5"
"scanproportion"="0"
"workload"="core"
"insertproportion"="0"
"updateproportion"="0.5"
"fieldlength"="160   #intended 500; due to the encoding, 160 length would yield about 500 payload"
"operationcount"="30000000"
"dotransactions"="false"
"threadcount"="4"
"readallfields"="true"
**********************************************

AWS Instances

image

image

Jaeger crashed due to out of memory.

YCSB "Load"

Data

Run finished, takes 28h4m22.484828118s
INSERT - Takes(s): 101062.5, Count: 29999999, OPS: 296.8, Avg(us): 13438, Min(us): 4116, Max(us): 3053567, 99th(us): 22319, 99.9th(us): 29327, 99.99th(us): 37087

RKV Memory

ubuntu@ip-172-31-2-182:~$ free -g
               total        used        free      shared  buff/cache   available
Mem:              31          11          18           0           1          19
Swap:              0           0           0

SI Memory

Spot-checked a few SI, all have the following:

ubuntu@ip-172-31-11-22:~$  free -g
               total        used        free      shared  buff/cache   available
Mem:              31          15          14           0           1          15
Swap:              0           0           0

CPU

image

Jaeger

image

YCSB Run

PUT 50%/GET 50%, 3M

***************** properties *****************
"recordcount"="300000"
"threadcount"="4"
"operationcount"="300000"
"readallfields"="true"
"insertproportion"="0"
"scanproportion"="0"
"requestdistribution"="uniform"
"dotransactions"="true"
"fieldlength"="160   #intended 500; due to the encoding, 160 length would yield about 500 payload"
"workload"="core"
"readproportion"="0.5"
"updateproportion"="0.5"
**********************************************

Run finished, takes 11m21.721168971s
READ   - Takes(s): 681.7, Count: 149873, OPS: 219.8, Avg(us): 8072, Min(us): 1878, Max(us): 58815, 99th(us): 17807, 99.9th(us): 25311, 99.99th(us): 33727
UPDATE - Takes(s): 681.7, Count: 150127, OPS: 220.2, Avg(us): 10081, Min(us): 3522, Max(us): 81407, 99th(us): 20447, 99.9th(us): 28575, 99.99th(us): 36383

PUT 0%/GET 100%

***************** properties *****************
"threadcount"="4"
"requestdistribution"="uniform"
"fieldlength"="160   #intended 500; due to the encoding, 160 length would yield about 500 payload"
"updateproportion"="0"
"insertproportion"="0"
"recordcount"="300000"
"dotransactions"="true"
"scanproportion"="0"
"operationcount"="300000"
"readallfields"="true"
"workload"="core"
"readproportion"="1"
**********************************************
READ   - Takes(s): 10.0, Count: 8002, OPS: 800.4, Avg(us): 4989, Min(us): 1865, Max(us): 31775, 99th(us): 14679, 99.9th(us): 19391, 99.99th(us): 31727
READ   - Takes(s): 20.0, Count: 15952, OPS: 797.7, Avg(us): 5006, Min(us): 1865, Max(us): 31775, 99th(us): 14687, 99.9th(us): 21887, 99.99th(us): 26927
READ   - Takes(s): 30.0, Count: 24022, OPS: 800.8, Avg(us): 4987, Min(us): 1865, Max(us): 31775, 99th(us): 14607, 99.9th(us): 22383, 99.99th(us): 26927
key is user8077940190266422784 and error is Get "http://rkv:8090/kv?key=user8077940190266422784": dial tcp 52.42.125.43:8090: connect: cannot assign requested address
key is user6962341607726016868 and error is Get "http://rkv:8090/kv?key=user6962341607726016868": dial tcp 52.42.125.43:8090: connect: cannot assign requested address
key is user6337301133090096462 and error is Get "http://rkv:8090/kv?key=user6337301133090096462": dial tcp 52.42.125.43:8090: connect: cannot assign requested address
key is user6971927150098868116 and error is Get "http://rkv:8090/kv?key=user6971927150098868116": dial tcp 52.42.125.43:8090: connect: cannot assign requested address
key is user6266626724665952454 and error is Get "http://rkv:8090/kv?key=user6266626724665952454": dial tcp 52.42.125.43:8090: connect: cannot assign requested address
key is user6298913883621934819 and error is Get "http://rkv:8090/kv?key=user6298913883621934819": dial tcp 52.42.125.43:8090: connect: cannot assign requested address
key is user7551286215126588905 and error is Get "http://rkv:8090/kv?key=user7551286215126588905": dial tcp 52.42.125.43:8090: connect: cannot assign requested address

52.42.125.43:8090 is the RKV server

PUT 100%/GET 0%

***************** properties *****************
"insertproportion"="0"
"readproportion"="0"
"requestdistribution"="uniform"
"operationcount"="300000"
"recordcount"="300000"
"dotransactions"="true"
"updateproportion"="1"
"scanproportion"="0"
"readallfields"="true"
"workload"="core"
"threadcount"="4"
"fieldlength"="160   #intended 500; due to the encoding, 160 length would yield about 500 payload"
**********************************************
UPDATE - Takes(s): 10.0, Count: 3030, OPS: 303.1, Avg(us): 13181, Min(us): 5500, Max(us): 32831, 99th(us): 22431, 99.9th(us): 27423, 99.99th(us): 32831
UPDATE - Takes(s): 20.0, Count: 6147, OPS: 307.4, Avg(us): 12997, Min(us): 5500, Max(us): 32831, 99th(us): 22399, 99.9th(us): 27055, 99.99th(us): 29855
UPDATE - Takes(s): 30.0, Count: 9277, OPS: 309.3, Avg(us): 12918, Min(us): 5500, Max(us): 32831, 99th(us): 22207, 99.9th(us): 27727, 99.99th(us): 32063
...
...
Run finished, takes 16m11.70180699s
UPDATE - Takes(s): 971.7, Count: 300000, OPS: 308.7, Avg(us): 12936, Min(us): 3868, Max(us): 1020415, 99th(us): 22831, 99.9th(us): 29535, 99.99th(us): 35679

curl client gets "storage not found for 127.0.0.1:6379" when rkv uses dummy+latency storage backend

This seems to be a regression. The prior commit does not have this issue.

How to reproduce

  1. get the latest rkv code (commit de9428e), change the storage type setting to "dummy+latency" in config.json, then start the rkv server.
  2. run curl -X POST http://127.0.0.1:8090/kv -d '{"key":"a","value":"3"}'; it gets the normal response: The key value pair (a,3) has been saved as revision 1 at 127.0.0.1:6379,172.31.9.140:6379,172.31.12.96:6380
  3. run curl http://127.0.0.1:8090/kv?key=a; it gets back the unexpected response: storage not found for 127.0.0.1:6379

KPI plan

come up with a set of methods to evaluate the goals for 630 release

refactor the multi-region test script

  • naming convention across all scripts (lots of global variables and mix of upper & lower cases, ugh)
  • all redis VM in all regions to have the same tag while still maintaining the order of VM consistent between si_config.json and generated_config.json

930 Performance Test - 5M staging

Goal

This test is to establish confidence and trim bugs before the 100M test.

YCSB

Workload Config

threadcount=25
fieldlength=160   #intended 500; due to the encoding, 160 length would yield about 500 payload
recordcount=5000000
operationcount=5000000

Command

nohup ./bin/go-ycsb load rkv -P workloads/workloada > load.log &

Time Estimates

  • 16t, 3+3+2+3

    • 1M = 1hr (60min)
    • 100k = 6min
  • 22t, 3+3+2+3

    • 100k = 4 min 55 sec
  • 25t, 3+3+2+3

    • 100k = 4 min 22 sec
  • 32t, 3+3+2+3

    • 100k = 3 min 31 sec

Run time (load only)

US-West-2

image

3h41m34.857277847s, also estimated as 40min/60*5=3.33 hr

image

image

US-East-1

image

2h59m57.94355228s

image

image

Write:
image

image

Read:
image

Thoughts

  • 32 thread not very stable
  • more VMs -> less redis congestion -> more threads -> shorter run time
  • need detailed profiling network/redis
  • need to move default region to us-east (for FW cost monitoring) (fixed in PR 77)
  • With larger test, the script could be throttled by AWS (fixed in PR 77)

Seen Errors

image

image

curl client gets "Connection refused" error when rkv uses the default mem database backend

get the latest version of the rkv source code (commit de9428e), then build and start the rkv server; it seems fine, but curl http://127.0.0.1:8090/kv?key=a gets back the unexpected response curl: (7) Failed to connect to 127.0.0.1 port 8090: Connection refused

The expected response is rev not found, since there is no such key yet

This seems to be a regression. After reverting to the prior commit, curl gets the response mvcc: Revision not found

rkv get return should be of json format

when getting a result from rkv, the header indicates json, but the body is a text string

Reproduction

start the rkv service, then run the following command to create one {key, rev} pair:

curl -XPOST http://127.0.0.1:8090/kv -d '{"key":"k", "value":"234"}'

assuming the revision created is 1,

curl -v http://127.0.0.1:8090/kv?key=k\&rev=1

gets

*   Trying 127.0.0.1:8090...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8090 (#0)
> GET /kv?key=k&rev=1 HTTP/1.1
> Host: 127.0.0.1:8090
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 202 Accepted
< Content-Type: application/json
< Date: Mon, 24 Oct 2022 19:05:14 GMT
< Content-Length: 37
< 
The value is 234 with the revision 1

What is expected

body should be of json format
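
A JSON body matching the declared Content-Type could be produced as in the sketch below. The kvResponse shape and its field names are assumptions for illustration only; the actual rkv response schema is not defined in this issue.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// kvResponse is a hypothetical response shape for GET /kv.
type kvResponse struct {
	Key      string `json:"key"`
	Value    string `json:"value"`
	Revision int    `json:"revision"`
}

// writeKV sets the JSON content type and encodes resp as the body.
func writeKV(w http.ResponseWriter, resp kvResponse) {
	w.Header().Set("Content-Type", "application/json")
	// Encode appends a trailing newline; the error is ignored in this sketch.
	_ = json.NewEncoder(w).Encode(resp)
}

// renderKV runs writeKV against an in-memory recorder so the result can be
// inspected without starting a server.
func renderKV(resp kvResponse) (contentType, body string) {
	rec := httptest.NewRecorder()
	writeKV(rec, resp)
	return rec.Header().Get("Content-Type"), rec.Body.String()
}

func main() {
	ct, body := renderKV(kvResponse{Key: "k", Value: "234", Revision: 1})
	fmt.Println(ct)
	fmt.Print(body)
}
```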

rkv get revision of key should respect the revision even if it does not exist

getting a non-existent revision of a valid key returns the value of the last-known revision.

How to reproduce

given the rkv service is running at http://127.0.0.1:8090/kv, run the following command to create a k-v pair

curl -XPOST http://127.0.0.1:8090/kv -d '{"key":"k", "value":"234"}'

assuming the revision created is 1, run

curl http://127.0.0.1:8090/kv?key=k\&rev=999

gets

The value is 234 with the revision 1

What is expected

an error message indicating that the key-revision combination does not exist
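
The expected behavior can be sketched as a lookup that rejects any revision outside the key's history instead of falling back to the last-known one. The history type and error names below are hypothetical, and per-key revision numbering starting at 1 is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrKeyNotFound      = errors.New("key not found")
	ErrRevisionNotFound = errors.New("revision not found")
)

// history is a hypothetical map from key to its values;
// revision n of a key is history[key][n-1].
type history map[string][]string

// Get returns the value stored at exactly the requested revision and an
// error when either the key or that revision does not exist.
func (h history) Get(key string, rev int) (string, error) {
	values, ok := h[key]
	if !ok {
		return "", fmt.Errorf("key %q: %w", key, ErrKeyNotFound)
	}
	if rev < 1 || rev > len(values) {
		return "", fmt.Errorf("key %q rev %d: %w", key, rev, ErrRevisionNotFound)
	}
	return values[rev-1], nil
}

func main() {
	h := history{"k": {"234"}}
	v, err := h.Get("k", 1)
	fmt.Println(v, err)
	_, err = h.Get("k", 999)
	fmt.Println(err)
}
```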

when only 1 redis backend is set, rkv get query returns unexpected error "the number of nodes is 1, which means there is no replica"

after posting values for a specific key, querying its value returns the unexpected response: the number of nodes is 1, which means there is no replica

What is expected: the latest value of the key
What is specific: only one (the local) redis is set up

How to reproduce

  1. set up the local redis server, e.g. listening at 127.0.0.1:16378
  2. modify config.json file accordingly like below
{
    "ConsistentHash": "rendezvous",
    "BucketSize": 10,
    "ReplicaNum": 1,
    "StoreType": "redis",
    "Concurrent": true,
    "Stores": [
        {
            "RegionType": "local",
            "Name": "store1",
            "Host": "127.0.0.1",
            "Port": 16378
        }
    ]
}
  3. start the rkv server
  4. run the command to post a value for the key: curl -X POST 127.0.0.1:8090/kv -d '{"key":"testk", "value":"testv"}'; notice the response is The key value pair (testk,testv) has been saved as revision 6 at 127.0.0.1:16378, which indicates success
  5. run the command to get back its value: curl 127.0.0.1:8090/kv?key=testk
    the response is
the number of nodes is 1, which means there is no replica
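
A minimal sketch of a node-selection guard that accepts a single store when ReplicaNum is 1. The pickReplicas function is hypothetical and does not reproduce rkv's rendezvous hashing; it only illustrates that one node with ReplicaNum 1 need not be treated as an error.

```go
package main

import (
	"errors"
	"fmt"
)

// pickReplicas returns the first replicaNum nodes (hypothetical helper).
// A single node is a valid result when replicaNum is 1.
func pickReplicas(nodes []string, replicaNum int) ([]string, error) {
	if replicaNum < 1 {
		return nil, errors.New("ReplicaNum must be at least 1")
	}
	if len(nodes) < replicaNum {
		return nil, fmt.Errorf("need %d nodes, have %d", replicaNum, len(nodes))
	}
	return nodes[:replicaNum], nil
}

func main() {
	nodes, err := pickReplicas([]string{"127.0.0.1:16378"}, 1)
	fmt.Println(nodes, err)
}
```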

The last commit #68 does not check the storeType before changing hostip:port to name and causes exceptions to save key value pairs

As mentioned in 79353ac#diff-a3d824da3c42420cd5cbb0a4a2c0e7b5bfddd819652788a0596d195dc6e31fa5R70
// returned items identifing backend stores by name, NOT by hostname:port - backend may be other than redis type

Redis backends use hostip:port while others might use name.

The commit does not add any check of the backend store type before changing hostip:port (not hostname:port as it describes) to name, which causes the following exceptions when using redis backends

image

I also checked the config_test.go and found all the test cases have been changed to test DummyLatency datastores.
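
The missing check could look roughly like the sketch below, which picks the identifier based on the store type. The store struct mirrors the Name/Host/Port fields of config.json, but the storeID function and the exact comparison against "redis" are assumptions, not the code from the commit.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// store mirrors one entry of the "Stores" list in config.json.
type store struct {
	Name string
	Host string
	Port int
}

// storeID is a hypothetical helper: redis backends are identified by
// hostip:port, any other backend type by its configured name.
func storeID(storeType string, s store) string {
	if storeType == "redis" {
		return net.JoinHostPort(s.Host, strconv.Itoa(s.Port))
	}
	return s.Name
}

func main() {
	s := store{Name: "store1", Host: "127.0.0.1", Port: 16378}
	fmt.Println(storeID("redis", s))
	fmt.Println(storeID("other", s))
}
```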

[perf test tool] go-ycsb rkv driver writes value not respecting fieldlength setting

when running go-ycsb with fieldlength=500, the total value length should be about 5KB. However, the value saved in the backend redis is about 3 times longer than 5KB in characters (note that the saved value contains 5000 words).

below is part of a value retrieved from redis; its character count is 17340:

map[field0:[87 78 107 72 121 106 114 113 120 67 82 85 107 74 89 106 105 76 70 102 81 121 100 69 104 70 115 117 116 89 67 76 79 70 117 97 80 80 118 71 71 112 113 113 109 72 78 79 73 70 78 121 112 77 100 90 112 69 112 66 88 67 75 76 79 73 82 98 75 100 110 74 65 121 100 76 77 109 72 85 122 83 122 75 73 83 107 112 111 72 119 119 107 119 98 101 81 73 116 85 90 118 87 78 77 113 66 107 66 103 111 113 80 85 102 79 114 111 109 68 118 67 110 72 86 69 74 104 98 99 119 112 75 103 75 103 111 106 104 104 116 120 85 88 109 117 71 120 112 98 86 117 70 78 85 84 110 84 89 116] field1:[87 103 118 70 76 99 107 89 108 105 107 85 120 117 71 97 106 86 71 75 99 98 80 118 108 107 119 103 116 101 68 79 78 81 105 69 122 85 65 71 72 78 67 87 86 107 81 90 108 111 65 84 117 88 82 116 111 117 73 102 66 97 122 116 67 107 103 121 86 98 105 112 67 73 73 69 99 87 90 86 122 113 99 77 112 77 86 105 79 114 115 118 117 101 104 118 115 109 120 69 74 117 73 121 70 65 78 87 120 80 117 100 109 118 100 70 120 85 87 69 114 74 114 117 89 86 116 106 117 107 74 100 76 109 89 75 106 89 114 107 114 100 106 112 72 109 97 108 66 105 102 85 120 90 69 111 77 70 87 84] field2:[83 109 99 105 113 85 100 79 102 122 122 104 111 88 70 74 120 119 113 72 76 75 122 78 78 114 80 69 88 73
...
67 78 114 105 68 122 115 69 122 88 86 98 81 88 90 99 98 118 87 76 113 97 120 84 86 103 76 113 90 67 103 105 87 97 112 78 101 77 98 115 84 88 90 81 84 119 101 106 115 65 98 104 99 107 70 114 102 104 112 73 105 113 81 83 65 102 67 73 77 69 65 103 88 120 121 117 77 119 72 99 98 69 71 103 74 112 106 75 108 86 120 67 98 99 98 79 106 79 101 84 105 78 78 97 73 122 65 85 89 122 112 100 101 121 69 120 116 66 65 119 114 98 87 112 120 78 69 121 76 100 104 115 80 65 101 97 80 122 82 105 84 105 77 99 69 111 88 102 89 111 69 97 84 83 79 78 78 80 116 75 97 73 122 72 67 76 87 115 120 104 76 80 102 106 106 118 109 85 75 70 89 85 90 119 80 90 74 78 111 117 89 122 79 104 110 86 102 103 69 68 121 80 81 122 122 99 101 99 108 103 67 70 81 105 69 109 66 84 89 114 85 81 88 71 108 102 114 106 119 120 113 113 104 110 106 90 83 104 114 84 97 70 101 82 101 105 84 87 99 86 121 81 82 67 110 87 86 76 115 106 84 65 72 107]]"
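
The shape of the stored value matches what Go's fmt package produces when a map of byte slices is formatted with %v: each byte becomes a decimal number plus a separator, so the text is roughly three times longer than the raw bytes. Whether the rkv driver actually formats the value this way is an assumption; the render function below is hypothetical and only reproduces the effect.

```go
package main

import "fmt"

// render formats the fields the way fmt's default %v verb does.
// For []byte values this prints decimal byte values, not the text itself.
func render(fields map[string][]byte) string {
	return fmt.Sprintf("%v", fields)
}

func main() {
	fields := map[string][]byte{"field0": []byte("WNkH")}
	out := render(fields)
	fmt.Println(out) // map[field0:[87 78 107 72]]
	// 4 raw bytes versus the length of their rendered form.
	fmt.Println(len(fields["field0"]), len(out))
}
```

Storing string(value) or a proper encoding such as JSON would keep the stored size close to the configured field length.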

2 micro rkv perf test fails, gets "runtime error: invalid memory address or nil pointer dereference"

this bug was found when running the perf test with the 2 micro configuration.

The config.json file used in this case is

{
  "ConsistentHash": "rendezvous",
  "BucketSize": 10,
  "ReplicaNum": 2,
  "StoreType": "redis",
  "Concurrent": true,
  "Stores": [
    {
      "Name": "hwperf-0824-1-rkv-lab-si-0",
      "Host": "54.219.184.67",
      "Port": 6666
    },
    {
      "Name": "hwperf-0824-1-rkv-lab-si-1",
      "Host": "54.183.189.182",
      "Port": 6666
    },
    {
      "Name": "hwperf-0824-1-rkv-lab-si-2",
      "Host": "35.89.67.43",
      "Port": 6666
    },
    {
      "Name": "hwperf-0824-1-rkv-lab-si-3",
      "Host": "35.90.217.98",
      "Port": 6666
    }
  ]
}

rkv server log has error message:

...
2022/08/24 17:20:56 http: panic serving 35.90.155.56:35186: runtime error: invalid memory address or nil pointer dereference
goroutine 2522 [running]:
net/http.(*conn).serve.func1(0xc000089720)
	/usr/local/go/src/net/http/server.go:1804 +0x153
panic(0x7655c0, 0xa011f0)
	/usr/local/go/src/runtime/panic.go:971 +0x499
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End(0xc000232180, 0x0, 0x0, 0x0)
	/home/ubuntu/regionless-storage-service/vendor/go.opentelemetry.io/otel/sdk/trace/span.go:402 +0x345
panic(0x7655c0, 0xa011f0)
	/usr/local/go/src/runtime/panic.go:965 +0x1b9
main.(*KeyValueHandler).createKV(0xc0000c6800, 0x83b590, 0xc00025b880, 0xc0001ec100, 0x0, 0x0, 0x0, 0x0)
	/home/ubuntu/regionless-storage-service/cmd/http/main.go:221 +0x25d
main.(*KeyValueHandler).ServeHTTP(0xc0000c6800, 0x83b590, 0xc00025b880, 0xc0001ec100)
	/home/ubuntu/regionless-storage-service/cmd/http/main.go:105 +0x3c5
net/http.(*ServeMux).ServeHTTP(0xa12100, 0x83b590, 0xc00025b880, 0xc0001ec100)
	/usr/local/go/src/net/http/server.go:2428 +0x1ad
net/http.serverHandler.ServeHTTP(0xc00025a0e0, 0x83b590, 0xc00025b880, 0xc0001ec100)
	/usr/local/go/src/net/http/server.go:2867 +0xa3
net/http.(*conn).serve(0xc000089720, 0x83bc00, 0xc000394b00)
	/usr/local/go/src/net/http/server.go:1932 +0x8cd
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:2993 +0x39b
...

jaeger trace shows 1 span with exception log:
Screenshot from 2022-08-24 11-14-06

How to reproduce

cd scripts
./select_config.sh 2 micro
./setup_test_lab.sh si_def.json

redis server dbsize decreased during perf test that has no delete

started perf test using scripts/setup_test_lab.sh, with following settings

rkv vm type: t2.large
redis SI vm type: t2.medium

at the ycsb host, targeting recordcount 10M with thread count 100, ran

./bin/go-ycsb load rkv -P workloads/workloada     #workloada is 50%update+50%read, none deletes

observing the redis key count of all SI backends, noticed that it is not monotonically increasing, but sometimes decreases significantly, like below

54.185.154.69:6379> dbsize
(integer) 253084
54.185.154.69:6379> dbsize
(integer) 515344
54.185.154.69:6379> dbsize
(integer) 516363
54.185.154.69:6379> dbsize
(integer) 518244
54.185.154.69:6379> dbsize
(integer) 518749
54.185.154.69:6379> dbsize
(integer) 525494
54.185.154.69:6379> dbsize
(integer) 606304
54.185.154.69:6379> dbsize
(integer) 3266
54.185.154.69:6379> dbsize
(integer) 90874
54.185.154.69:6379> dbsize
(integer) 91144
54.185.154.69:6379> dbsize
(integer) 91346
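
A small sketch for spotting such drops automatically from a series of dbsize samples. The firstDrop function is a hypothetical helper; the sample values are the ones listed above.

```go
package main

import "fmt"

// firstDrop returns the index of the first sample that is smaller than its
// predecessor, or -1 if the series never decreases.
func firstDrop(samples []int) int {
	for i := 1; i < len(samples); i++ {
		if samples[i] < samples[i-1] {
			return i
		}
	}
	return -1
}

func main() {
	// dbsize values observed on 54.185.154.69:6379 during the load.
	samples := []int{253084, 515344, 516363, 518244, 518749, 525494, 606304, 3266, 90874, 91144, 91346}
	if i := firstDrop(samples); i >= 0 {
		fmt.Printf("dbsize dropped from %d to %d at sample %d\n", samples[i-1], samples[i], i)
	}
}
```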

rkv misses revision 2

when inserting key-value pairs to rkv, revision 2 is skipped in the sequence

How to reproduce

start rkv server,
run following client commands to insert 3 k-v pairs:

$ curl -X POST 127.0.0.1:8090/kv -d '{"key":"testk", "value":"testv1"}'
The key value pair (testk,testv1) has been saved as revision 1 at 127.0.0.1:16378,127.0.0.1:16378,127.0.0.1:16378
$ curl -X POST 127.0.0.1:8090/kv -d '{"key":"testk", "value":"testv2"}'
The key value pair (testk,testv2) has been saved as revision 3 at 127.0.0.1:16378,127.0.0.1:16378,127.0.0.1:16378
$ curl -X POST 127.0.0.1:8090/kv -d '{"key":"testk", "value":"testv3"}'
The key value pair (testk,testv3) has been saved as revision 4 at 127.0.0.1:16378,127.0.0.1:16378,127.0.0.1:16378

notice that the revisions returned are 1, 3, 4.

[perf test] go-ycsb run rkv -P workloads/workloada reports READ_ERROR

start rkv test lab, run workloada (50% update, 50% read) in 2 steps

  1. go-ycsb load rkv -P workloads/workloada, which runs fine and inserts 1000 records;
  2. go-ycsb run rkv -P workloads/workloada, which reports
Run finished, takes 2.390314559s
READ_ERROR - Takes(s): 2.4, Count: 495, OPS: 207.7, Avg(us): 881, Min(us): 443, Max(us): 8463, 99th(us): 2251, 99.9th(us): 8463, 99.99th(us): 8463
UPDATE - Takes(s): 2.4, Count: 505, OPS: 211.7, Avg(us): 3846, Min(us): 3056, Max(us): 9935, 99th(us): 5827, 99.9th(us): 8551, 99.99th(us): 9935

looking at the jaeger trace, the error mvcc: Revision not found is reported; see the screenshot below:

image
