open-telemetry / opamp-go Goto Github PK

View Code? Open in Web Editor NEW

122.0 122.0 60.0 671 KB

OpAMP protocol implementation in Go

License: Apache License 2.0

Makefile 0.59% Go 97.61% HTML 1.24% Shell 0.55%

opamp-go's People

Contributors

Stargazers

Watchers

opamp-go's Issues

Convert NextMessage.nextMessage to a pointer

ProtoBuf messages are non-copy (see https://golang.org/issues/8005), so instead of storing by value we need to store a pointer.

Delay ScheduleSend if OnMessage is in progress

OpAMPClient functions which update the state normally call ScheduleSend to deliver the changed state to the Server. When this happens while OnMessage callback is active we want to delay calling ScheduleSend and only call it once when OnMessage handler returns. This will avoid unnecessary messages going to the server.

OpenTelemetry Collector Supervisor Proposal

This is a proposal for building an OpAMP Supervisor that can manage OpenTelemetry Collector. An earlier discussion was around a generic Supervisor, however I believe building a Collector-specific Supervisor is the right initial step and will result in a better user experience.

Here is an initial draft of what the Collector-specific Supervisor can look like: https://docs.google.com/document/d/1qIFm8zoXqZox00NCmDT5oE0BqJVCzGqFT21wAOWceHM/edit#

Please review/comment.

wsclient: Unexpected error while receiving:read ... use of closed network connection

When ws client is stopped, it closes the connection here

opamp-go/client/wsclient.go

Lines 84 to 90 in 06011e4

 c.connMutex.RLock() 

 conn := c.conn 

 c.connMutex.RUnlock() 

 if conn != nil { 

 _ = conn.Close() 

 }

and cancels the context

opamp-go/client/internal/clientcommon.go

Lines 151 to 156 in 06011e4

 c.isStoppingMutex.Lock() 

 cancelFunc := c.runCancel 

 c.isStoppingFlag = true 

 c.isStoppingMutex.Unlock() 

 cancelFunc()

And the ReceiverLoop tries to read a message from the closed connection resulting in an error.

Unexpected error while receiving: read tcp 127.0.0.1:53775->127.0.0.1:53774: use of closed network connection

Add Case - Server not found and printing connection error logs

As currently, When client tries to connect to Server using HTTP, if server url we enter some dummy like - https://www.google.com/v1/opamp, it does not print any error logs which will make debugging hard, as we don't get to know what's wrong while performing connection.

I have raised a PR with some example solution to help us out resolve the problem.
Adding Case Server not found and printing connection error logs - https://github.com/open-telemetry/opamp-go/pull/160

Some code addition to help out --
case http.StatusNotFound: return nil, fmt.Errorf("failed to connect to the server, server not found: %d", resp.StatusCode)

Adding the Err log in function - makeOneRequestRoundtrip(ctx context.Context) --
h.logger.Errorf("%v", err)

Client feature: Agent’s Connection Settings

Management of connection credentials (access tokens, client certificates) used by the Agent.

Originally posted by @tigrannajaryan in #5 (comment)

See spec: https://github.com/open-telemetry/opamp-spec#connection-settings-management

Client should send status report on reconnect

Currently, the status report message is prepared here in client.Start:

opamp-go/client/clientimpl.go

Lines 109 to 113 in 5255de5

 // Prepare the first status report. 

 w.sender.UpdateNextStatus( 

 func(statusReport *protobufs.StatusReport) { 

 statusReport.AgentDescription = w.settings.AgentDescription 

 },

It then begins runUntilStopped and in runOneCycle it ensures it is connected before entering the ReceiverLoop to receive massages from the server:

opamp-go/client/clientimpl.go

Lines 313 to 318 in 5255de5

 func (w *client) runOneCycle(ctx context.Context) { 

 if err := w.ensureConnected(ctx); err != nil { 

 // Can't connect, so can't move forward. This currently happens when we 

 // are being stopped. 

 return 

 }

If the connection is closed, it will repeat runOneCycle and reconnect, but after reconnection no status message is sent because no message has been prepared.

In our own agent, the workaround was to call SetAgentDescription in the OnConnect callback, but I think the implementation should be doing this automatically.

retry policy: parse datetime value for Retry-After header?

I started working on writing tests as part of #99. I noticed that the retry header only works with delay seconds. The header value can either be seconds or datatime. Should the datetime case be accounted for or assume the server always sends the seconds?

Move the protos to another repository

As per our discussion in slack, the protos should be moved to another repository so that the java implementation can use them.

Client may not be able to fulfill OnAgentPackageAvailable

Currently, the OpAMPClient calls OnAgentPackageAvailable on receiving protobufs.AgentPackageAvailable message. The signature of this callback is:

// OnAgentPackageAvailable is called when the server has an agent package available
// for the agent.
// syncer can be used to initiate syncing the package from the server.
OnAgentPackageAvailable(addons *protobufs.AgentPackageAvailable, syncer AgentPackageSyncer) error

Unless the client is told ahead of time who the AgentPackageSyncer is, the client cannot successfully fulfill this callback as it doesn't have enough context to construct a valid syncer. More importantly, I think the client doesn't need to. Instead, I suggest we ignore the syncer altogether on the callback and just provide the details of the available agent package along with a context and let the callback handler deal with the actual syncing part which is most likely to be the supervisor which has enough knowledge about the agent and how to sync the packages.

The same is true for OnAddonsAvailable too.

Implement Client Status Reporting

See Status Reporting in the spec: https://github.com/open-telemetry/opamp-spec#status-reporting

Add ability to specify Client capabilities

The Client interface currently does not have a way to define the capabilities. This needs to be definable before Start(). It may also affect how the rest of the interface works (i.e. if the capability is not supported the corresponding function that tries to use it may need to return an error).

Websocket Race Condition when closing

There appears to be a race condition when calling Stop() on a websocket client.

The runUntilStopped function checks IsStopping() before runOneCycle. Inside runOneCycle we attempt to connect via ensureConnected then check IsStopping() again. The race condition lies in if Stop() is called on the client after the first IsStopping() check in runUntilStopped but before the second one in runOneCycle.

If this happens ensureConnected could possible fail due to a closed connection. Which is looks like we account for in this comment. The problem is on a failed connection the OnConnectFailedHandler is called and could execute any code associated with that on the implementors side. This is not ideal in a graceful shutdown.

I'm wondering if it's a better idea to swap the IsStopping() check and ensureConnected inside runOnceCycle. This way we don't try to connect if we're currently stopping.

	if c.common.IsStopping() {
		_ = c.conn.Close()
		return
	}

	if err := c.ensureConnected(ctx); err != nil {
                // Maybe log error here
		return
	}

Design Supervisor library API

Design the API of the Go library that can be used to implement supervisors.

Supervisor example crushes agent

Hi, it seems that the localOverrideAgentConfigs defined at

opamp-go/internal/examples/supervisor/supervisor/supervisor.go

Line 33 in 89eeb30

const localOverrideAgentConfig = `

causing agent to flicker:

2022/09/05 11:30:19 Agent process started, PID=17633 2022/09/05 11:30:20 Agent process PID=17633 exited unexpectedly, exit code=1. Will restart in a bit... 2022/09/05 11:30:25 Starting agent /Users/nemos/Dev/SDK/opentelemetry-collector-contrib/bin/otelcontribcol_darwin_amd64 2022/09/05 11:30:25 Agent process started, PID=17636 2022/09/05 11:30:25 Agent process PID=17636 exited unexpectedly, exit code=1. Will restart in a bit...

Attaching a PR for a suggested fix to initial agent configurations and print effective configuration to debug log in case of agent shutdown on #127

Decide on initial priorities for prototyping

The OpAMP spec defines a number of features which do not have to be implemented all at once. We can implement them gradually. It is important to define priorities so that we can deliver the most value in shortest amount of time.

I will post one comment per capability in this thread. Please vote with a +1 (thumb up) for the capability you believe is more important than others. Feel free to also add a comment if you would like expand on why you voted that way.

feature: Own Telemetry Reporting

Add the ability for the Server to tell the Agent where and how to report its own telemetry.

Originally posted by @tigrannajaryan in #5 (comment)

Implement compression in WebSocket client

We also have an open issue in the spec that asks to verify it: open-telemetry/opamp-spec#92

Ideally we want an automated test. If automated test is impossible then we need to verify manually and close the issue.

Implement gzip compression for HTTP transport

The spec says:

The Agent MAY compress the request body using gzip method and MUST specify "Content-Encoding: gzip" in that case. Server implementations MUST honour the "Content-Encoding" header and MUST support gzipped or uncompressed request bodies.
The Server MAY compress the response if the Agent indicated it can accept compressed response via the "Accept-Encoding" header.

Send message fails for a Attach function

We were trying to attach HttpHandler obtained from calling to Attach to an existing httpserver. The connection is accepted however sending/posting message to OpAmp server fails with panic message 'http: panic serving 127.0.0.1:56310: interface conversion: interface is nil, not net.Conn'.

We added TestServerAttachSendMessagePlainHTTP unit test to verify the error.

The panic seems to be coming from serverimpl.go#L292 because there is no conn object in req context with the connContextKey.

We think changing signature of Attach function to return ConnContext may help. The existing http server then can use the returned ConnContext to set just like serverImpl does for non-attach use case.

Other option could be to make httpconnectioncontext.contextWithConn or key public which existing server can use to update request context but that might be exposing too much internal details.

Please take a look.

Prototype a Supervisor

We want to create a prototype of a Supervisor. See initial thoughts https://docs.google.com/document/d/1KtH5atZQUs9Achbce6LiOaJxLbksNJenvgvyKLsJrkc/edit#

Likely some of the capabilities will require changes in the Collector. Doing the prototyping will help us understand what should go in the Supervisor and what goes into the Collector.

Task List:

Setup CI and makefile

We need to add lint,test,coverage,build to the makefile and setup the CI using Github actions.

Can get inspiration from Collector or Otel Go SDK.

PackageStatuses' ServerProvidedAllPackagesHash Won't Exist On First Connect (and other hash redundancy)

When an Agent connects to a Server for the first time, it may never have received an PackagesAvailable message when the server requests the PackagesStatuses through the send statuses flag.

The spec isn't really clear on what should happen in this case (maybe this issue belongs there?). Is sending a PackageStatuses message with a PackageStatus with nothing but a Name and AgentHasVersion sufficient for the Agent in this case (because it also won't have the Package hash for the currently installed package(s)?

This also makes me wonder if the ServerOfferedHash and AgentHasHash are actually redundant alongside the ServerOfferedVersion and AgentOfferedVersion. It seems like having the version only properties here would be sufficient if the package doesn't have a versioning scheme, then it could just provide a hash in the version field instead.

I'm interested in hearing what everything thinks about this.

Rethink OpAMP Client callbacks

We currently deliver the data received via one ServerToAgent message via multiple callbacks. This encourages piecemeal processing and can result in multiple state changes on the when only one would be sufficient.

For example, let's say we use the Supervisor model for OpenTelemetry Collector and we receive a message with a remote config and own telemetry connection settings. To process the remote config we may need to stop the Agent, update the config file and start the Agent. Then we process the connection settings and again we stop the Agent, update the config file (to include the own metric collection pipeline) and start the Agent again. We restart the Agent twice and it is unnecessary.

Cleanup client agent description setting

It is currently possible to set agent description in StartSettings and via SetAgentDescription.

Do we need both? Can we eliminate one?

Some sub-messages are always updated for next message

👋 , I have been trying to get familiar with the spec and the current implementation. There is a section about agent status compression which says sub-messages can be omitted in case of no change and the client implementation does it only for RemoteConfigStatus and PackageStatuses not for the EffectiveConfig, AgentHealth and AgentDescription. I was wondering if there was reason behind this?

Improve test coverage

The test coverage is inadequate. We need to increase it, particularly add coverage for edge cases and erroneous situations.

Make sure only one package syncing operation happens at a time

The initial implementation does not guard against multiple sync operations running concurrently: #76

This needs to be fixed.

Design generic Supervisor config file format

The config file format should allow the Supervisor to connect to the the OpAMP Server and communicate with it and find the Agent and supervise it.

Design and implement agent management server-side library

Once we know how the agent management should work and we have the protocol specification we want to implement a Go library which makes it easy for the backends to adopt the agent management solution. The library can be used by any vendor who has a Go-based backend and wants to manage the Collector remotely.

This task requires an API design for the library first, and then an implementation can be done.

Implement generic Supervisor

Use the Supervisor library to implement a generic Supervisor that supervise an Agent (e.g. Otel Collector).

Implement compression in WebSocket server

This is the Server counterpart of #101

Allow the agent User-Agent header to be configurable

I would like to be able to control the User-Agent header of the agent when communicating with the management server. Currently this defaults to Go-http-client/1.1. It can be useful to identify traffic for logging.

I would propose adding a UserAgent string field to StartSettings and setting the User-Agent header when we set the Authorization header.

Design and implement agent management client-side library

Create an implementation of the Client interface: https://github.com/open-telemetry/opamp-go/blob/main/client/client.go

We want to implement this as a library which makes it easy for the agents to adopt the agent management solution. The library will be used by the Collector but we want to make it independent so that it can be used by other agents as well.

Tasks todo:

Cleanup client effective config setting

It is currently possible to set effective config in StartSettings and via SetEffectiveConfig.

Do we need both? Can we eliminate one?

Design Server API

This is now done via #21
This issue is for the record.

Decide on directory structure

Here is a possible directory structure:

client - for client-side OpAMP implementation, package importable by agents that want to adopt OpAMP.
server - for server-side OpAMP implementation, package important by vendors who want to implement OpAMP servers.
internal - shared internal code for client and server (e.g. protobufs).
examples - examples that use client and server packages.

Questions:

Do we put client and server under pkg or top-level is fine?
Does everything use one top-level go.mod or client and server use their own go.mod (typically you only use one in a particular executable, both are only necessary if you building a proxy of some sort).
I think we also want a prototype implementation of the Supervisor in this repo. Where do we put it? Does it go into the examples or somewhere in cmd/supervisor?

OPAMP Client with Otel Agent support w/o supervisor

The documentation https://opentelemetry.io/docs/collector/management/ states that opAMP control flow between the control plane implementing the server-side OpAMP part and the collector (or a supervisor controlling the collector) implementing OpAMP client-side, if there is more documentation around using OPAMP server without the supervisor by including the OPAMP client in the otel collector( agent) will also help.

Thank you.

CalcHashEffectiveConfig not deterministic

It looks like in CalcHashEffectiveConfig that the hashing algorithm is iterating over the msg.ConfigMap.ConfigMap which is of type map[string]*AgentConfigFile. Iteration of a map in go is not deterministic so the hash computed may be different using the same set of data depending on the key order during iteration.

Should probably first sort the keyspace then iterate over the map.

Example agent implementation of OnRemoteConfig leads to infinite reconfigure loop

It appears that the expectation of OnRemoteConfig is that it return nil for *protobufs.EffectiveConfig if there is no change applied. However, the example agent always returns the effective config unless there was an error applying it:

opamp-go/internal/examples/agent/agent/agent.go

Lines 185 to 194 in e32b616

 func (agent *Agent) onRemoteConfig( 

 ctx context.Context, 

 config *protobufs.AgentRemoteConfig, 

 ) (*protobufs.EffectiveConfig, error) { 

 err := agent.applyRemoteConfig(config) 

 if err != nil { 

 return nil, err 

 } 

 return agent.composeEffectiveConfig(), nil 

 }

In the implementation of the client Receiver, it sets reportStatus to true if an effective config is returned:

opamp-go/client/internal/receiver.go

Lines 80 to 91 in e32b616

 func (r *Receiver) rcvRemoteConfig(ctx context.Context, config *protobufs.AgentRemoteConfig) (reportStatus bool) { 

 effective, err := r.callbacks.OnRemoteConfig(ctx, config) 

 if err == nil { 

 r.sender.UpdateNextStatus(func(statusReport *protobufs.StatusReport) { 

 statusReport.EffectiveConfig = effective 

 }) 

 if effective != nil { 

 return true 

 } 

 } 

 return false 

 }

This results in a new status report being sent to the server. This contradicts the spec here:

https://github.com/open-telemetry/opamp-spec/blob/279e064d5461726ec0f70dfed314d5c354384584/specification.md?plain=1#L499-L502

I resolved this in my own implementation by comparing the config hash and returning nil for the effective config.

	if bytes.Equal(config.GetConfigHash(), agent.effectiveConfigHash) {
		return nil, nil
	}

If returning nil for the *protobufs.EffectiveConfig is the desired approach of stopping the reconfigure messages, the callback documentation for OnRemoteConfig should be updated here:

opamp-go/client/types/callbacks.go

Lines 31 to 49 in e32b616

 // OnRemoteConfig is called when the agent receives a remote config from the server. 

 // Note that the config parameter may be nil, which indicates that the remote 

 // is the same as what the Agent reported it has via the last_remote_config_hash 

 // field of the RemoteConfigStatus message last time. 

 // 

 // The Agent should process the config and return the effective config if processing 

 // succeeded or an error if processing failed. 

 // The returned effective config or the error will be reported back to the server 

 // via StatusReport message (using EffectiveConfig and RemoteConfigStatus fields). 

 // 

 // Only one OnRemoteConfig call can be active at any time. Until OnRemoteConfig 

 // returns it will not be called again. Any other remote configs received from 

 // the server while OnRemoteConfig call has not returned will be remembered. 

 // Once OnRemoteConfig call returns it will be called again with the most recent 

 // remote config received. 

 OnRemoteConfig( 

 ctx context.Context, 

 config *protobufs.AgentRemoteConfig, 

 ) (*protobufs.EffectiveConfig, error)

Also, I was curious if the Sensu team ran into this in their implementation and noticed that they always return nil for the effective config in the OnRemoteConfig handler:

https://github.com/SumoLogic/sumologic-otel-collector/blob/fd9bd85c3d7a7bd6807dee4d5a5738b5c271b936/pkg/extension/opampextension/extension.go#L165

Improved logging interface

Right now the types.Logger interface only has two methods DebugF and Errorf with the same signature format string, v ...interface{}. This makes it difficult to incorporate custom loggers which may expect err as the first parameter.

Add fuzzing tests to verify decoding/receiving logic in client and server

This is especially applicable to WS transport which has an extra header.

Client should expose the internal sender or provide a method to send async message to the server

Currently, neither the transport sender (internal.Sender) or a method that allows for async send to the server is exposed by the client.

As the supervisor is shaping up, I think the client should expose a way to send async messages to the sender. I can think of a few reasons:

Report failing health of the agent or worse, termination status of the agent.
Report agent restart metrics.
Report agent package or addon download progress. This is allowed in the spec, but the client doesn't have enough context to fetch this information and for good reasons, I don' think it should either. Instead, the actual downloader should have the ability to report progress to the server through the client.

Make sure all exporter funcs and types have docs

Many are currently missing doc comments, e.g. NewHTTP.

Client feature: Remote Configuration

Add the ability for the Agent to receive remote configuration.

Originally posted by @tigrannajaryan in #5 (comment)

See spec: https://github.com/open-telemetry/opamp-spec#configuration

opamp.Connection needs a Mutex to avoid concurrent write panic

In a busy server, we encountered a concurrent write panic. I added protection in the server, but ideally the library would add a Mutex to prevent this from happening.

Here is the relevant stack fragment:

panic: concurrent write to websocket connection
goroutine 115 [running]:
github.com/gorilla/websocket.(*messageWriter).flushFrame(0xc001dc95a8, 0x1, {0xc001d88160?, 0x2?, 0xc001dc95d8?})
        .../go/pkg/mod/github.com/gorilla/[email protected]/conn.go:617 +0x52b
github.com/gorilla/websocket.(*Conn).WriteMessage(0xc001da8f20, 0xc0003d2930?, {0xc001d88160, 0x142, 0x142})
        .../go/pkg/mod/github.com/gorilla/[email protected]/conn.go:770 +0x139
github.com/open-telemetry/opamp-go/server.wsConnection.Send({0xc001dc9750?}, {0xc001d950e0?, 0xc001dba050?}, 0x270c?)
        .../go/pkg/mod/github.com/open-telemetry/[email protected]/server/wsconnection.go:30 +0x55
...

Add Close() to types.Connection to allow the server to close a connection

It can be useful to close an agent connection to the server and I propose that a Close() method be added to the types.Connection interface.

The agent may quickly reconnect, but in a server cluster with multiple nodes, periodically closing connections can help with rebalancing the cluster.

Consider the case of 100 agents connected to a 2-node cluster and assume a load balancer splits them 50/50 between the 2 nodes. If one node is terminated, 50 agents will reconnect and now all 100 agents will be connected to a single node. If a node is added, there will still be 100 agents connected to a single node and the new node will have 0 agents.

Implement Client OpAMP connection

See Connection Management section of the spec: https://github.com/open-telemetry/opamp-spec#connection-management

Client feature: Basic Status Reporting

Reporting of its identity and properties from Agent to Server.

Originally posted by @tigrannajaryan in #5 (comment)

See specification: https://github.com/open-telemetry/opamp-spec#status-reporting

Implement server

Implement the OpAMPServer interface.

	c.connMutex.RLock()
	conn := c.conn
	c.connMutex.RUnlock()

	if conn != nil {
	_ = conn.Close()
	}

	c.isStoppingMutex.Lock()
	cancelFunc := c.runCancel
	c.isStoppingFlag = true
	c.isStoppingMutex.Unlock()

	cancelFunc()

	// Prepare the first status report.
	w.sender.UpdateNextStatus(
	func(statusReport *protobufs.StatusReport) {
	statusReport.AgentDescription = w.settings.AgentDescription
	},

	func (w *client) runOneCycle(ctx context.Context) {
	if err := w.ensureConnected(ctx); err != nil {
	// Can't connect, so can't move forward. This currently happens when we
	// are being stopped.
	return
	}

	func (agent *Agent) onRemoteConfig(
	ctx context.Context,
	config *protobufs.AgentRemoteConfig,
	) (*protobufs.EffectiveConfig, error) {
	err := agent.applyRemoteConfig(config)
	if err != nil {
	return nil, err
	}
	return agent.composeEffectiveConfig(), nil
	}

	func (r Receiver) rcvRemoteConfig(ctx context.Context, config protobufs.AgentRemoteConfig) (reportStatus bool) {
	effective, err := r.callbacks.OnRemoteConfig(ctx, config)
	if err == nil {
	r.sender.UpdateNextStatus(func(statusReport *protobufs.StatusReport) {
	statusReport.EffectiveConfig = effective
	})
	if effective != nil {
	return true
	}
	}
	return false
	}

	// OnRemoteConfig is called when the agent receives a remote config from the server.
	// Note that the config parameter may be nil, which indicates that the remote
	// is the same as what the Agent reported it has via the last_remote_config_hash
	// field of the RemoteConfigStatus message last time.
	//
	// The Agent should process the config and return the effective config if processing
	// succeeded or an error if processing failed.
	// The returned effective config or the error will be reported back to the server
	// via StatusReport message (using EffectiveConfig and RemoteConfigStatus fields).
	//
	// Only one OnRemoteConfig call can be active at any time. Until OnRemoteConfig
	// returns it will not be called again. Any other remote configs received from
	// the server while OnRemoteConfig call has not returned will be remembered.
	// Once OnRemoteConfig call returns it will be called again with the most recent
	// remote config received.
	OnRemoteConfig(
	ctx context.Context,
	config *protobufs.AgentRemoteConfig,
	) (*protobufs.EffectiveConfig, error)

open-telemetry / opamp-go Goto Github PK

opamp-go's People

Contributors

Stargazers

Watchers

Forkers

opamp-go's Issues

Recommend Projects

Recommend Topics

Recommend Org