Hello,
To test a network device, I want to create 100 routers, each peering with the device under test (DUT) and each sending 100k routes to it.
My code is roughly as follows:
- in a loop, I create 100 instances of the corebgp.Server
- I add the DUT peer to each server, passing a new instance of the plugin to each
I keep getting the following error messages after a little while:
```
2023/02/22 15:22:27 peer closed
write tcp 10.17.1.115:179->10.17.0.181:24379: write: broken pipe
```
On the router side, I can see:
```
RP/0/RP0/CPU0:2023 Feb 22 15:30:17.631 UTC: bgp[1069]: %ROUTING-BGP-5-ADJCHANGE_DETAIL : neighbor 10.17.1.115 Down - BGP Notification sent, hold time expired (VRF: default; AFI/SAFI: 1/1) (AS: xx)
```
Looking further on the router, it turns out it doesn't receive any packets for 90 seconds, which is why it closes the connection.
Doing some debugging, it seems that the writer.WriteUpdate call sometimes blocks for a long time (90s or more), more specifically the call to u.conn.Write in the following function:
```go
func (u *updateMessageWriter) WriteUpdate(b []byte) error {
	/*
		https://tools.ietf.org/html/rfc4271#page-72
		Each time the local system sends a KEEPALIVE or UPDATE message, it
		restarts its KeepaliveTimer, unless the negotiated HoldTime value
		is zero.
	*/
	select {
	case <-u.closeCh:
		return io.ErrClosedPipe
	default:
		_, err := u.conn.Write(prependHeader(b, updateMessageType))
		// debug output added by me:
		if strings.HasPrefix(u.conn.LocalAddr().String(), "10.17.1.115") {
			fmt.Println("send done")
		}
		if err == nil {
			select {
			case <-u.closeCh:
			case u.resetKATimerCh <- struct{}{}:
			}
		}
		return err
	}
}
```
Note that I modified the function to add debug output, which leads to the following logs:
```
2023-02-22 17:41:34.928165013 +0000 UTC m=+277.043173280 send done
2023/02/22 17:41:34 send update done
2023/02/22 17:41:34 loop start route
2023/02/22 17:41:34 prepare route
2023/02/22 17:41:34 send update
2023-02-22 17:41:34.928208551 +0000 UTC m=+277.043216824 send done
2023/02/22 17:41:34 send update done
2023/02/22 17:41:34 loop start route
2023/02/22 17:41:34 prepare route
2023/02/22 17:41:34 send update
2023-02-22 17:41:34.928274148 +0000 UTC m=+277.043282415 send done
2023/02/22 17:41:34 send update done
2023/02/22 17:41:34 loop start route
2023/02/22 17:41:34 prepare route
2023/02/22 17:41:34 send update
```
There are no more logs until after the device closes the connection.
I don't see the issue when I use only one (or two) servers.
Could there be a concurrency issue that limits the number of servers that can run simultaneously?
Sorry for the vague report; I don't fully understand it myself. At this point I am not sure where the issue actually is (it could even be on the DUT...).