open-telemetry / opamp-go Goto Github PK
View Code? Open in Web Editor NEWOpAMP protocol implementation in Go
License: Apache License 2.0
OpAMP protocol implementation in Go
License: Apache License 2.0
ProtoBuf messages are non-copy (see https://golang.org/issues/8005), so instead of storing by value we need to store a pointer.
OpAMPClient functions which update the state normally call ScheduleSend to deliver the changed state to the Server. When this happens while OnMessage callback is active we want to delay calling ScheduleSend and only call it once when OnMessage handler returns. This will avoid unnecessary messages going to the server.
This is a proposal for building an OpAMP Supervisor that can manage OpenTelemetry Collector. An earlier discussion was around a generic Supervisor, however I believe building a Collector-specific Supervisor is the right initial step and will result in a better user experience.
Here is an initial draft of what the Collector-specific Supervisor can look like: https://docs.google.com/document/d/1qIFm8zoXqZox00NCmDT5oE0BqJVCzGqFT21wAOWceHM/edit#
Please review/comment.
When ws client is stopped, it closes the connection here
Lines 84 to 90 in 06011e4
opamp-go/client/internal/clientcommon.go
Lines 151 to 156 in 06011e4
Unexpected error while receiving: read tcp 127.0.0.1:53775->127.0.0.1:53774: use of closed network connection
As currently, When client tries to connect to Server using HTTP, if server url we enter some dummy like - https://www.google.com/v1/opamp, it does not print any error logs which will make debugging hard, as we don't get to know what's wrong while performing connection.
I have raised a PR with some example solution to help us out resolve the problem.
Adding Case Server not found and printing connection error logs - https://github.com/open-telemetry/opamp-go/pull/160
Some code addition to help out --
case http.StatusNotFound: return nil, fmt.Errorf("failed to connect to the server, server not found: %d", resp.StatusCode)
Adding the Err log in function - makeOneRequestRoundtrip(ctx context.Context) --
h.logger.Errorf("%v", err)
Management of connection credentials (access tokens, client certificates) used by the Agent.
Originally posted by @tigrannajaryan in #5 (comment)
See spec: https://github.com/open-telemetry/opamp-spec#connection-settings-management
Currently, the status report message is prepared here in client.Start:
Lines 109 to 113 in 5255de5
It then begins runUntilStopped and in runOneCycle it ensures it is connected before entering the ReceiverLoop to receive massages from the server:
Lines 313 to 318 in 5255de5
If the connection is closed, it will repeat runOneCycle and reconnect, but after reconnection no status message is sent because no message has been prepared.
In our own agent, the workaround was to call SetAgentDescription in the OnConnect callback, but I think the implementation should be doing this automatically.
As per our discussion in slack, the protos should be moved to another repository so that the java implementation can use them.
Currently, the OpAMPClient
calls OnAgentPackageAvailable
on receiving protobufs.AgentPackageAvailable
message. The signature of this callback is:
// OnAgentPackageAvailable is called when the server has an agent package available
// for the agent.
// syncer can be used to initiate syncing the package from the server.
OnAgentPackageAvailable(addons *protobufs.AgentPackageAvailable, syncer AgentPackageSyncer) error
Unless the client is told ahead of time who the AgentPackageSyncer
is, the client cannot successfully fulfill this callback as it doesn't have enough context to construct a valid syncer
. More importantly, I think the client doesn't need to. Instead, I suggest we ignore the syncer altogether on the callback and just provide the details of the available agent package along with a context and let the callback handler deal with the actual syncing part which is most likely to be the supervisor which has enough knowledge about the agent and how to sync the packages.
The same is true for OnAddonsAvailable
too.
See Status Reporting in the spec: https://github.com/open-telemetry/opamp-spec#status-reporting
The Client interface currently does not have a way to define the capabilities. This needs to be definable before Start(). It may also affect how the rest of the interface works (i.e. if the capability is not supported the corresponding function that tries to use it may need to return an error).
There appears to be a race condition when calling Stop()
on a websocket client.
The runUntilStopped function checks IsStopping()
before runOneCycle
. Inside runOneCycle
we attempt to connect via ensureConnected then check IsStopping()
again. The race condition lies in if Stop()
is called on the client after the first IsStopping()
check in runUntilStopped
but before the second one in runOneCycle
.
If this happens ensureConnected
could possible fail due to a closed connection. Which is looks like we account for in this comment. The problem is on a failed connection the OnConnectFailedHandler
is called and could execute any code associated with that on the implementors side. This is not ideal in a graceful shutdown.
I'm wondering if it's a better idea to swap the IsStopping()
check and ensureConnected
inside runOnceCycle
. This way we don't try to connect if we're currently stopping.
if c.common.IsStopping() {
_ = c.conn.Close()
return
}
if err := c.ensureConnected(ctx); err != nil {
// Maybe log error here
return
}
Design the API of the Go library that can be used to implement supervisors.
Hi, it seems that the localOverrideAgentConfigs defined at
causing agent to flicker:2022/09/05 11:30:19 Agent process started, PID=17633 2022/09/05 11:30:20 Agent process PID=17633 exited unexpectedly, exit code=1. Will restart in a bit... 2022/09/05 11:30:25 Starting agent /Users/nemos/Dev/SDK/opentelemetry-collector-contrib/bin/otelcontribcol_darwin_amd64 2022/09/05 11:30:25 Agent process started, PID=17636 2022/09/05 11:30:25 Agent process PID=17636 exited unexpectedly, exit code=1. Will restart in a bit...
The OpAMP spec defines a number of features which do not have to be implemented all at once. We can implement them gradually. It is important to define priorities so that we can deliver the most value in shortest amount of time.
I will post one comment per capability in this thread. Please vote with a +1 (thumb up) for the capability you believe is more important than others. Feel free to also add a comment if you would like expand on why you voted that way.
Add the ability for the Server to tell the Agent where and how to report its own telemetry.
Originally posted by @tigrannajaryan in #5 (comment)
We also have an open issue in the spec that asks to verify it: open-telemetry/opamp-spec#92
Ideally we want an automated test. If automated test is impossible then we need to verify manually and close the issue.
The spec says:
The Agent MAY compress the request body using gzip method and MUST specify "Content-Encoding: gzip" in that case. Server implementations MUST honour the "Content-Encoding" header and MUST support gzipped or uncompressed request bodies.
The Server MAY compress the response if the Agent indicated it can accept compressed response via the "Accept-Encoding" header.
We were trying to attach HttpHandler obtained from calling to Attach to an existing httpserver. The connection is accepted however sending/posting message to OpAmp server fails with panic message 'http: panic serving 127.0.0.1:56310: interface conversion: interface is nil, not net.Conn'.
We added TestServerAttachSendMessagePlainHTTP unit test to verify the error.
The panic seems to be coming from serverimpl.go#L292 because there is no conn object in req context with the connContextKey.
We think changing signature of Attach function to return ConnContext may help. The existing http server then can use the returned ConnContext to set just like serverImpl does for non-attach use case.
Other option could be to make httpconnectioncontext.contextWithConn or key public which existing server can use to update request context but that might be exposing too much internal details.
Please take a look.
We want to create a prototype of a Supervisor. See initial thoughts https://docs.google.com/document/d/1KtH5atZQUs9Achbce6LiOaJxLbksNJenvgvyKLsJrkc/edit#
Likely some of the capabilities will require changes in the Collector. Doing the prototyping will help us understand what should go in the Supervisor and what goes into the Collector.
Task List:
We need to add lint,test,coverage,build to the makefile and setup the CI using Github actions.
Can get inspiration from Collector or Otel Go SDK.
When an Agent connects to a Server for the first time, it may never have received an PackagesAvailable message when the server requests the PackagesStatuses through the send statuses flag.
The spec isn't really clear on what should happen in this case (maybe this issue belongs there?). Is sending a PackageStatuses message with a PackageStatus with nothing but a Name and AgentHasVersion sufficient for the Agent in this case (because it also won't have the Package hash for the currently installed package(s)?
This also makes me wonder if the ServerOfferedHash and AgentHasHash are actually redundant alongside the ServerOfferedVersion and AgentOfferedVersion. It seems like having the version only properties here would be sufficient if the package doesn't have a versioning scheme, then it could just provide a hash in the version field instead.
I'm interested in hearing what everything thinks about this.
We currently deliver the data received via one ServerToAgent message via multiple callbacks. This encourages piecemeal processing and can result in multiple state changes on the when only one would be sufficient.
For example, let's say we use the Supervisor model for OpenTelemetry Collector and we receive a message with a remote config and own telemetry connection settings. To process the remote config we may need to stop the Agent, update the config file and start the Agent. Then we process the connection settings and again we stop the Agent, update the config file (to include the own metric collection pipeline) and start the Agent again. We restart the Agent twice and it is unnecessary.
It is currently possible to set agent description in StartSettings and via SetAgentDescription.
Do we need both? Can we eliminate one?
👋 , I have been trying to get familiar with the spec and the current implementation. There is a section about agent status compression which says sub-messages can be omitted in case of no change and the client implementation does it only for RemoteConfigStatus
and PackageStatuses
not for the EffectiveConfig
, AgentHealth
and AgentDescription
. I was wondering if there was reason behind this?
The test coverage is inadequate. We need to increase it, particularly add coverage for edge cases and erroneous situations.
The initial implementation does not guard against multiple sync operations running concurrently: #76
This needs to be fixed.
The config file format should allow the Supervisor to connect to the the OpAMP Server and communicate with it and find the Agent and supervise it.
Once we know how the agent management should work and we have the protocol specification we want to implement a Go library which makes it easy for the backends to adopt the agent management solution. The library can be used by any vendor who has a Go-based backend and wants to manage the Collector remotely.
This task requires an API design for the library first, and then an implementation can be done.
Use the Supervisor library to implement a generic Supervisor that supervise an Agent (e.g. Otel Collector).
This is the Server counterpart of #101
I would like to be able to control the User-Agent header of the agent when communicating with the management server. Currently this defaults to Go-http-client/1.1
. It can be useful to identify traffic for logging.
I would propose adding a UserAgent string
field to StartSettings
and setting the User-Agent header when we set the Authorization header.
Create an implementation of the Client interface: https://github.com/open-telemetry/opamp-go/blob/main/client/client.go
We want to implement this as a library which makes it easy for the agents to adopt the agent management solution. The library will be used by the Collector but we want to make it independent so that it can be used by other agents as well.
Tasks todo:
It is currently possible to set effective config in StartSettings and via SetEffectiveConfig.
Do we need both? Can we eliminate one?
This is now done via #21
This issue is for the record.
Here is a possible directory structure:
Questions:
Thank you.
It looks like in CalcHashEffectiveConfig that the hashing algorithm is iterating over the msg.ConfigMap.ConfigMap
which is of type map[string]*AgentConfigFile
. Iteration of a map in go is not deterministic so the hash computed may be different using the same set of data depending on the key order during iteration.
Should probably first sort the keyspace then iterate over the map.
It appears that the expectation of OnRemoteConfig
is that it return nil
for *protobufs.EffectiveConfig
if there is no change applied. However, the example agent always returns the effective config unless there was an error applying it:
opamp-go/internal/examples/agent/agent/agent.go
Lines 185 to 194 in e32b616
In the implementation of the client Receiver, it sets reportStatus
to true if an effective config is returned:
opamp-go/client/internal/receiver.go
Lines 80 to 91 in e32b616
This results in a new status report being sent to the server. This contradicts the spec here:
I resolved this in my own implementation by comparing the config hash and returning nil for the effective config.
if bytes.Equal(config.GetConfigHash(), agent.effectiveConfigHash) {
return nil, nil
}
If returning nil
for the *protobufs.EffectiveConfig
is the desired approach of stopping the reconfigure messages, the callback documentation for OnRemoteConfig
should be updated here:
opamp-go/client/types/callbacks.go
Lines 31 to 49 in e32b616
Also, I was curious if the Sensu team ran into this in their implementation and noticed that they always return nil
for the effective config in the OnRemoteConfig handler:
Right now the types.Logger interface only has two methods DebugF
and Errorf
with the same signature format string, v ...interface{}
. This makes it difficult to incorporate custom loggers which may expect err
as the first parameter.
This is especially applicable to WS transport which has an extra header.
Currently, neither the transport sender (internal.Sender
) or a method that allows for async send to the server is exposed by the client.
As the supervisor is shaping up, I think the client should expose a way to send async messages to the sender. I can think of a few reasons:
Many are currently missing doc comments, e.g. NewHTTP.
Add the ability for the Agent to receive remote configuration.
Originally posted by @tigrannajaryan in #5 (comment)
See spec: https://github.com/open-telemetry/opamp-spec#configuration
In a busy server, we encountered a concurrent write panic. I added protection in the server, but ideally the library would add a Mutex to prevent this from happening.
Here is the relevant stack fragment:
panic: concurrent write to websocket connection
goroutine 115 [running]:
github.com/gorilla/websocket.(*messageWriter).flushFrame(0xc001dc95a8, 0x1, {0xc001d88160?, 0x2?, 0xc001dc95d8?})
.../go/pkg/mod/github.com/gorilla/[email protected]/conn.go:617 +0x52b
github.com/gorilla/websocket.(*Conn).WriteMessage(0xc001da8f20, 0xc0003d2930?, {0xc001d88160, 0x142, 0x142})
.../go/pkg/mod/github.com/gorilla/[email protected]/conn.go:770 +0x139
github.com/open-telemetry/opamp-go/server.wsConnection.Send({0xc001dc9750?}, {0xc001d950e0?, 0xc001dba050?}, 0x270c?)
.../go/pkg/mod/github.com/open-telemetry/[email protected]/server/wsconnection.go:30 +0x55
...
It can be useful to close an agent connection to the server and I propose that a Close() method be added to the types.Connection interface.
The agent may quickly reconnect, but in a server cluster with multiple nodes, periodically closing connections can help with rebalancing the cluster.
Consider the case of 100 agents connected to a 2-node cluster and assume a load balancer splits them 50/50 between the 2 nodes. If one node is terminated, 50 agents will reconnect and now all 100 agents will be connected to a single node. If a node is added, there will still be 100 agents connected to a single node and the new node will have 0 agents.
See Connection Management section of the spec: https://github.com/open-telemetry/opamp-spec#connection-management
Reporting of its identity and properties from Agent to Server.
Originally posted by @tigrannajaryan in #5 (comment)
See specification: https://github.com/open-telemetry/opamp-spec#status-reporting
Implement the OpAMPServer interface.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.