Hello,
To test a network device, I want to create 100 routers, each peering with the device under test (DUT) and each sending 100k routes to it.
My code is roughly as follows:
- in a loop, I create 100 instances of the corebgp.Server
- I add the DUT peer to each server, passing a new instance of the plugin to each
I keep getting the following error messages after a little while:
```
2023/02/22 15:22:27 peer closed
write tcp 10.17.1.115:179->10.17.0.181:24379: write: broken pipe
```
On the router side, I can see:
```
RP/0/RP0/CPU0:2023 Feb 22 15:30:17.631 UTC: bgp[1069]: %ROUTING-BGP-5-ADJCHANGE_DETAIL : neighbor 10.17.1.115 Down - BGP Notification sent, hold time expired (VRF: default; AFI/SAFI: 1/1) (AS: xx)
```
Looking further on the router, it turns out it doesn't receive any packets for 90 seconds, which is why it closes the connection.
Doing some debugging, it seems that the writer.WriteUpdate call sometimes blocks for a long time (90s or more), more specifically the call to u.conn.Write in the following function:
```go
func (u *updateMessageWriter) WriteUpdate(b []byte) error {
	/*
		https://tools.ietf.org/html/rfc4271#page-72
		Each time the local system sends a KEEPALIVE or UPDATE message, it
		restarts its KeepaliveTimer, unless the negotiated HoldTime value
		is zero.
	*/
	select {
	case <-u.closeCh:
		return io.ErrClosedPipe
	default:
		_, err := u.conn.Write(prependHeader(b, updateMessageType))
		// debug output added by me:
		if strings.HasPrefix(u.conn.LocalAddr().String(), "10.17.1.115") {
			fmt.Println("send done")
		}
		if err == nil {
			select {
			case <-u.closeCh:
			case u.resetKATimerCh <- struct{}{}:
			}
		}
		return err
	}
}
```
Note that I modified the function to add debug output, which leads to the following logs:
```
2023-02-22 17:41:34.928165013 +0000 UTC m=+277.043173280 send done
2023/02/22 17:41:34 send update done
2023/02/22 17:41:34 loop start route
2023/02/22 17:41:34 prepare route
2023/02/22 17:41:34 send update
2023-02-22 17:41:34.928208551 +0000 UTC m=+277.043216824 send done
2023/02/22 17:41:34 send update done
2023/02/22 17:41:34 loop start route
2023/02/22 17:41:34 prepare route
2023/02/22 17:41:34 send update
2023-02-22 17:41:34.928274148 +0000 UTC m=+277.043282415 send done
2023/02/22 17:41:34 send update done
2023/02/22 17:41:34 loop start route
2023/02/22 17:41:34 prepare route
2023/02/22 17:41:34 send update
```
There are no more logs until after the device closes the connection.
I don't see the issue when I use only one (or two) servers.
Could there be a concurrency issue that limits the number of servers that can run simultaneously?
Sorry for the vague report; I don't fully understand it myself. At this point I am not sure where the issue actually is (it could even be on the DUT...).