Short
When using a netcat client and server setup in Oakestra, flakiness occurs: if you send 20 messages, several of them get dropped with no clear pattern or explanation. (This was my assumption a week ago; when I replicated the setup now, the flakiness no longer occurred ...)
Deeper description of the bug
Concrete Example
Let's take service C (client) and S (server).
(Note: There are multiple different netcat implementations. I used "netcat-traditional".)
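Since the available flags (`-p`, `-q`, `-w`) differ between implementations, it is worth confirming which variant an environment actually ships before comparing results. A minimal check, assuming a Debian-based image like the one used here:

```bash
# The first line of the help output identifies the variant:
# netcat-traditional prints a version tag like "[v1.10-...]",
# while netcat-openbsd prints a "usage: nc [...]" line instead.
nc -h 2>&1 | head -n 1

# On Debian-based systems the package manager can also be asked directly.
dpkg -l | grep netcat
```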
We run this script on S:
```bash
#!/bin/bash
while true
do
    # Make sure to use -p; otherwise no messages get propagated.
    # -w terminates the server session after 5 s. This is used to avoid
    # getting stuck due to a broken/stale connection.
    nc -l -p 99 -w 5
done
```
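For debugging, a slightly extended variant of this loop can prefix each received message with its arrival time, which makes gaps and stalls easier to correlate with other logs (a sketch, not the script shipped in the image):

```bash
#!/bin/bash
# Same loop as above, but each received line is prefixed with the time
# it arrived, so gaps and stalls stand out in the output.
while true
do
    nc -l -p 99 -w 5 | while read -r line; do
        echo "$(date '+%H:%M:%S') $line"
    done
done
```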
And this command on C:
```bash
# -q 1 terminates the client one second after sending the message,
# to avoid the client getting stuck.
for i in {1..20}; do echo $i | nc 10.30.27.3 99 -q 1; done
```
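A variant that also records nc's exit status per message can help distinguish failed connections from messages that were sent but never delivered (a sketch; the log path is an assumption):

```bash
# Log each message together with nc's exit status: a nonzero status
# points at a connection failure, while a zero status combined with a
# missing number on the server points at a silent drop in between.
for i in {1..20}; do
  echo "$i" | nc 10.30.27.3 99 -q 1
  echo "msg $i -> exit $?" >> /tmp/client.log
done
```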
This screenshot shows outputs of the flaky behavior. On my local (non-Oakestra) setup there is a smooth flow from 1 to 20.
![image](https://private-user-images.githubusercontent.com/65814168/319130082-e935e103-4a3d-4fe6-9087-c6c7d8ab8094.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjEzMTU2NzQsIm5iZiI6MTcyMTMxNTM3NCwicGF0aCI6Ii82NTgxNDE2OC8zMTkxMzAwODItZTkzNWUxMDMtNGEzZC00ZmU2LTkwODctYzZjN2Q4YWI4MDk0LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MTglMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzE4VDE1MDkzNFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWEwNWE4ZjFkNDZlODE1ZmQwMTQyM2Q5MWI3NmQ4MGMwMTZjNjJmMmE3ZGE5YzZlMDZjNWUzOWJkNDEyZTc0Y2MmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.Tpokeva9AkOdKnyUIBYsU2mYEl9A7gJWRT6Arsb9ymo)
Update
I have pushed a custom image that has this exact netcat version installed and ships both scripts for easier testing.
Here is the SLA:
```json
{
  "sla_version": "v2.0",
  "customerID": "Admin",
  "applications": [
    {
      "applicationID": "",
      "application_name": "app",
      "application_namespace": "test",
      "application_desc": "",
      "microservices": [
        {
          "microserviceID": "",
          "microservice_name": "server",
          "microservice_namespace": "test",
          "virtualization": "container",
          "cmd": ["bash", "server.sh"],
          "memory": 100,
          "vcpus": 1,
          "storage": 0,
          "code": "ghcr.io/malyuk-a/netcat:testing",
          "addresses": {
            "rr_ip": "10.30.27.3"
          }
        },
        {
          "microserviceID": "",
          "microservice_name": "client",
          "microservice_namespace": "test",
          "virtualization": "container",
          "cmd": ["bash", "client.sh"],
          "memory": 100,
          "vcpus": 1,
          "storage": 0,
          "code": "ghcr.io/malyuk-a/netcat:testing",
          "one_shot": true
        }
      ]
    }
  ]
}
```
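The client script hardcodes 10.30.27.3, so it has to match the server's rr_ip above. A quick sanity check before deploying (a sketch; it assumes the SLA is saved as sla.json and that jq is installed):

```bash
# Print the server's round-robin IP from the SLA; it should match the
# address the client script sends to (10.30.27.3).
jq -r '.applications[].microservices[]
       | select(.microservice_name == "server")
       | .addresses.rr_ip' sla.json
```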
Interestingly enough, I can no longer replicate that flakiness ...
![image](https://private-user-images.githubusercontent.com/65814168/319152074-d7316af3-9d5e-4d35-9386-5bf286398ca9.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjEzMTU2NzQsIm5iZiI6MTcyMTMxNTM3NCwicGF0aCI6Ii82NTgxNDE2OC8zMTkxNTIwNzQtZDczMTZhZjMtOWQ1ZS00ZDM1LTkzODYtNWJmMjg2Mzk4Y2E5LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MTglMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzE4VDE1MDkzNFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTU0NjEzMjU2ZTdiZjk5MTVkMmQ5YjY4YWVjYmRkNDcyOWFlZGU3ZmU0OTkzOThkY2Y5NjkxMDI0MTUyMmU2OWUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.HFYtbo5OQanGNovrlhN82kEae8opfhpdusihuXn-Ufs)
Example output:
```text
...
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
...
```
I have no idea why or what is going on differently all of a sudden. When I first hit the bug described above, I was able to replicate the behavior multiple times over the span of ~2-3 days.
I do see rather strange behavior when redeploying these services; I am not sure whether this is related or caused by the CLI tool I use, which handles these deployments/creations very quickly.
Solution
We (@giobart, @smnzlnsk, @Malyuk-A) had a look and could not spot any errors in the NetManager logs, so this needs deeper analysis.
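If the NetManager logs show nothing, a packet capture on the worker node may show where the messages disappear. A minimal sketch (interface choice and file path are assumptions, not from our session):

```bash
# Capture all TCP traffic on port 99 across interfaces while the client
# loop runs; missing SYNs or unexpected RSTs in the capture would narrow
# down where messages are lost.
sudo tcpdump -i any -n 'tcp port 99' -w /tmp/netcat-flakiness.pcap
```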
Update
Right now I simply want to know what others observe. When you run the same SLA, do you see a clean flow from 1 to 20, or do you see gaps? If multiple people see no gaps, I guess we can close this issue.
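To make that comparison less error-prone than eyeballing a screenshot, here is a small sketch that flags gaps automatically (it assumes the server output was redirected to a file named server.log, one number per line; the file name is an assumption):

```bash
# Report every expected message number that never showed up on the server.
for i in $(seq 1 20); do
  grep -qx "$i" server.log || echo "missing: $i"
done
```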
Status
Replicated and discussed. Needs deep analysis to figure out what is going wrong.
Why might this be critical: if a classic tool like netcat does not work properly, who knows how other tools behave? This can very much interfere with proper practical/scientific work.
Update
Let's see, maybe this was a very strange anomaly on my local system's side.
Checklist