opamp-go's People

Contributors

andykellr, aneurysm9, cpheps, djaglowski, dsvanlani, echlebek, evan-bradley, frapschen, haoqixu, jaronoff97, jlegoff, kailash-bhanushali, lucas-baldo, nemoshlag, phanidevavarapu, pmm-sumo, renovate[bot], sairam866, srikanthccv, tigrannajaryan, tshinde-splunk, yurishkuro

opamp-go's Issues

Delay ScheduleSend if OnMessage is in progress

OpAMPClient functions which update the state normally call ScheduleSend to deliver the changed state to the Server. When this happens while the OnMessage callback is active, we want to delay calling ScheduleSend and call it only once, when the OnMessage handler returns. This avoids unnecessary messages going to the Server.

OpenTelemetry Collector Supervisor Proposal

This is a proposal for building an OpAMP Supervisor that can manage the OpenTelemetry Collector. An earlier discussion was around a generic Supervisor; however, I believe building a Collector-specific Supervisor is the right initial step and will result in a better user experience.

Here is an initial draft of what the Collector-specific Supervisor can look like: https://docs.google.com/document/d/1qIFm8zoXqZox00NCmDT5oE0BqJVCzGqFT21wAOWceHM/edit#

Please review/comment.

wsclient: Unexpected error while receiving: read ... use of closed network connection

When the WebSocket client is stopped, it closes the connection:

	c.connMutex.RLock()
	conn := c.conn
	c.connMutex.RUnlock()
	if conn != nil {
		_ = conn.Close()
	}

and cancels the context:

	c.isStoppingMutex.Lock()
	cancelFunc := c.runCancel
	c.isStoppingFlag = true
	c.isStoppingMutex.Unlock()
	cancelFunc()

The ReceiverLoop then tries to read a message from the closed connection, resulting in an error:

Unexpected error while receiving: read tcp 127.0.0.1:53775->127.0.0.1:53774: use of closed network connection

Add Case - Server not found and printing connection error logs

Currently, when the client tries to connect to the Server over HTTP and the server URL is a dummy such as https://www.google.com/v1/opamp, no error logs are printed. That makes debugging hard, since there is no indication of what went wrong during the connection attempt.

I have raised a PR with an example solution to help resolve the problem:
Adding Case Server not found and printing connection error logs - https://github.com/open-telemetry/opamp-go/pull/160

Proposed code addition:

	case http.StatusNotFound:
		return nil, fmt.Errorf("failed to connect to the server, server not found: %d", resp.StatusCode)

and an error log in makeOneRequestRoundtrip(ctx context.Context):

	h.logger.Errorf("%v", err)

Client should send status report on reconnect

Currently, the first status report message is prepared in client.Start:

	// Prepare the first status report.
	w.sender.UpdateNextStatus(
		func(statusReport *protobufs.StatusReport) {
			statusReport.AgentDescription = w.settings.AgentDescription
		},
	)

It then begins runUntilStopped, and in runOneCycle it ensures it is connected before entering the ReceiverLoop to receive messages from the server:

	func (w *client) runOneCycle(ctx context.Context) {
		if err := w.ensureConnected(ctx); err != nil {
			// Can't connect, so can't move forward. This currently happens when we
			// are being stopped.
			return
		}

If the connection is closed, it will repeat runOneCycle and reconnect, but after reconnection no status message is sent because no message has been prepared.

In our own agent, the workaround was to call SetAgentDescription in the OnConnect callback, but I think the implementation should be doing this automatically.

Client may not be able to fulfill OnAgentPackageAvailable

Currently, the OpAMPClient calls OnAgentPackageAvailable on receiving protobufs.AgentPackageAvailable message. The signature of this callback is:

// OnAgentPackageAvailable is called when the server has an agent package available
// for the agent.
// syncer can be used to initiate syncing the package from the server.
OnAgentPackageAvailable(addons *protobufs.AgentPackageAvailable, syncer AgentPackageSyncer) error

Unless the client is told ahead of time who the AgentPackageSyncer is, it cannot successfully fulfill this callback: it doesn't have enough context to construct a valid syncer. More importantly, I don't think the client needs to. Instead, I suggest we drop the syncer from the callback and just provide the details of the available agent package along with a context, letting the callback handler deal with the actual syncing. That handler is most likely the Supervisor, which has enough knowledge about the agent and how to sync packages.

The same is true for OnAddonsAvailable too.

Add ability to specify Client capabilities

The Client interface currently does not have a way to define the capabilities. This needs to be definable before Start(). It may also affect how the rest of the interface works (i.e. if the capability is not supported the corresponding function that tries to use it may need to return an error).

Websocket Race Condition when closing

There appears to be a race condition when calling Stop() on a websocket client.

The runUntilStopped function checks IsStopping() before runOneCycle. Inside runOneCycle we attempt to connect via ensureConnected then check IsStopping() again. The race condition lies in if Stop() is called on the client after the first IsStopping() check in runUntilStopped but before the second one in runOneCycle.

If this happens, ensureConnected could possibly fail due to a closed connection, which it looks like we account for in this comment. The problem is that on a failed connection the OnConnectFailed callback is invoked and could execute arbitrary code on the implementor's side. This is not ideal during a graceful shutdown.

I'm wondering if it's a better idea to swap the IsStopping() check and ensureConnected inside runOneCycle. This way we don't try to connect if we're currently stopping.

	if c.common.IsStopping() {
		_ = c.conn.Close()
		return
	}

	if err := c.ensureConnected(ctx); err != nil {
		// Maybe log error here
		return
	}

Supervisor example crashes agent

Hi, it seems that the localOverrideAgentConfig defined at

	const localOverrideAgentConfig = `

causes the agent to crash-loop:

	2022/09/05 11:30:19 Agent process started, PID=17633
	2022/09/05 11:30:20 Agent process PID=17633 exited unexpectedly, exit code=1. Will restart in a bit...
	2022/09/05 11:30:25 Starting agent /Users/nemos/Dev/SDK/opentelemetry-collector-contrib/bin/otelcontribcol_darwin_amd64
	2022/09/05 11:30:25 Agent process started, PID=17636
	2022/09/05 11:30:25 Agent process PID=17636 exited unexpectedly, exit code=1. Will restart in a bit...

Attaching a PR with a suggested fix to the initial agent configuration, which also prints the effective configuration to the debug log in case of agent shutdown: #127

Decide on initial priorities for prototyping

The OpAMP spec defines a number of features which do not have to be implemented all at once; we can implement them gradually. It is important to define priorities so that we can deliver the most value in the shortest amount of time.

I will post one comment per capability in this thread. Please vote with a +1 (thumbs up) for the capability you believe is more important than others. Feel free to also add a comment if you would like to expand on why you voted that way.

Implement gzip compression for HTTP transport

The spec says:

The Agent MAY compress the request body using gzip method and MUST specify "Content-Encoding: gzip" in that case. Server implementations MUST honour the "Content-Encoding" header and MUST support gzipped or uncompressed request bodies.
The Server MAY compress the response if the Agent indicated it can accept compressed response via the "Accept-Encoding" header.

Send message fails for a Attach function

We were trying to attach the HttpHandler obtained by calling Attach to an existing HTTP server. The connection is accepted, however sending/posting a message to the OpAMP server fails with a panic: 'http: panic serving 127.0.0.1:56310: interface conversion: interface is nil, not net.Conn'.

We added TestServerAttachSendMessagePlainHTTP unit test to verify the error.

The panic seems to come from serverimpl.go#L292, because there is no conn object in the request context under connContextKey.

We think changing the signature of the Attach function to also return a ConnContext may help. The existing HTTP server could then use the returned ConnContext to populate the request context, just as serverImpl does for the non-attach use case.

Another option is to make httpconnectioncontext.contextWithConn or the key public so an existing server can update the request context itself, but that might expose too much internal detail.

Please take a look.

Setup CI and makefile

We need to add lint, test, coverage, and build targets to the makefile and set up CI using GitHub Actions.

We can take inspiration from the Collector or the OTel Go SDK.

PackageStatuses' ServerProvidedAllPackagesHash Won't Exist On First Connect (and other hash redundancy)

When an Agent connects to a Server for the first time, it may never have received a PackagesAvailable message by the time the server requests the PackageStatuses through the send-statuses flag.

The spec isn't really clear on what should happen in this case (maybe this issue belongs there?). Is sending a PackageStatuses message containing a PackageStatus with nothing but a Name and AgentHasVersion sufficient for the Agent in this case (since it also won't have the package hash for the currently installed package(s))?

This also makes me wonder whether ServerOfferedHash and AgentHasHash are actually redundant alongside ServerOfferedVersion and AgentHasVersion. It seems like the version-only properties would be sufficient; if a package doesn't have a versioning scheme, it could just provide a hash in the version field instead.

I'm interested in hearing what everyone thinks about this.

Rethink OpAMP Client callbacks

We currently deliver the data received in one ServerToAgent message via multiple callbacks. This encourages piecemeal processing and can result in multiple state changes on the Agent when only one would be sufficient.

For example, say we use the Supervisor model for the OpenTelemetry Collector and we receive a message with both a remote config and own-telemetry connection settings. To process the remote config we may need to stop the Agent, update the config file, and start the Agent. Then we process the connection settings and again stop the Agent, update the config file (to include the own-metrics collection pipeline), and start the Agent. We restart the Agent twice when once would be enough.

Some sub-messages are always updated for next message

👋 I have been getting familiar with the spec and the current implementation. There is a section about agent status compression which says sub-messages can be omitted when nothing has changed, but the client implementation does this only for RemoteConfigStatus and PackageStatuses, not for EffectiveConfig, AgentHealth, and AgentDescription. I was wondering if there was a reason behind this?

Improve test coverage

The test coverage is inadequate. We need to increase it, particularly add coverage for edge cases and erroneous situations.

Design and implement agent management server-side library

Once we know how the agent management should work and we have the protocol specification we want to implement a Go library which makes it easy for the backends to adopt the agent management solution. The library can be used by any vendor who has a Go-based backend and wants to manage the Collector remotely.

This task requires an API design for the library first, and then an implementation can be done.

Allow the agent User-Agent header to be configurable

I would like to be able to control the User-Agent header the agent uses when communicating with the management server. Currently this defaults to Go-http-client/1.1. A distinctive value is useful for identifying traffic in logs.

I would propose adding a UserAgent string field to StartSettings and setting the User-Agent header when we set the Authorization header.

Decide on directory structure

Here is a possible directory structure:

  • client - for client-side OpAMP implementation, package importable by agents that want to adopt OpAMP.
  • server - for server-side OpAMP implementation, package importable by vendors who want to implement OpAMP servers.
  • internal - shared internal code for client and server (e.g. protobufs).
  • examples - examples that use client and server packages.

Questions:

  1. Do we put client and server under pkg or top-level is fine?
  2. Does everything use one top-level go.mod, or do client and server use their own go.mod files? (Typically you only use one of them in a particular executable; both are only needed if you are building a proxy of some sort.)
  3. I think we also want a prototype implementation of the Supervisor in this repo. Where do we put it? Does it go into the examples or somewhere in cmd/supervisor?

CalcHashEffectiveConfig not deterministic

It looks like CalcHashEffectiveConfig iterates over msg.ConfigMap.ConfigMap, which is of type map[string]*AgentConfigFile. Map iteration order in Go is not deterministic, so the computed hash may differ for the same data depending on the key order encountered during iteration.

It should probably sort the keys first and then iterate over the map in that order.

Example agent implementation of OnRemoteConfig leads to infinite reconfigure loop

It appears that the expectation of OnRemoteConfig is that it return nil for *protobufs.EffectiveConfig if there is no change applied. However, the example agent always returns the effective config unless there was an error applying it:

	func (agent *Agent) onRemoteConfig(
		ctx context.Context,
		config *protobufs.AgentRemoteConfig,
	) (*protobufs.EffectiveConfig, error) {
		err := agent.applyRemoteConfig(config)
		if err != nil {
			return nil, err
		}
		return agent.composeEffectiveConfig(), nil
	}

In the implementation of the client Receiver, it sets reportStatus to true if an effective config is returned:

	func (r *Receiver) rcvRemoteConfig(ctx context.Context, config *protobufs.AgentRemoteConfig) (reportStatus bool) {
		effective, err := r.callbacks.OnRemoteConfig(ctx, config)
		if err == nil {
			r.sender.UpdateNextStatus(func(statusReport *protobufs.StatusReport) {
				statusReport.EffectiveConfig = effective
			})
			if effective != nil {
				return true
			}
		}
		return false
	}

This results in a new status report being sent to the server. This contradicts the spec here:

https://github.com/open-telemetry/opamp-spec/blob/279e064d5461726ec0f70dfed314d5c354384584/specification.md?plain=1#L499-L502

I resolved this in my own implementation by comparing the config hash and returning nil for the effective config.

	if bytes.Equal(config.GetConfigHash(), agent.effectiveConfigHash) {
		return nil, nil
	}

If returning nil for the *protobufs.EffectiveConfig is the desired approach of stopping the reconfigure messages, the callback documentation for OnRemoteConfig should be updated here:

	// OnRemoteConfig is called when the agent receives a remote config from the server.
	// Note that the config parameter may be nil, which indicates that the remote config
	// is the same as what the Agent reported it has via the last_remote_config_hash
	// field of the RemoteConfigStatus message last time.
	//
	// The Agent should process the config and return the effective config if processing
	// succeeded or an error if processing failed.
	// The returned effective config or the error will be reported back to the server
	// via StatusReport message (using EffectiveConfig and RemoteConfigStatus fields).
	//
	// Only one OnRemoteConfig call can be active at any time. Until OnRemoteConfig
	// returns it will not be called again. Any other remote configs received from
	// the server while OnRemoteConfig call has not returned will be remembered.
	// Once OnRemoteConfig call returns it will be called again with the most recent
	// remote config received.
	OnRemoteConfig(
		ctx context.Context,
		config *protobufs.AgentRemoteConfig,
	) (*protobufs.EffectiveConfig, error)

Also, I was curious whether the Sumo Logic team ran into this in their implementation, and noticed that they always return nil for the effective config in their OnRemoteConfig handler:

https://github.com/SumoLogic/sumologic-otel-collector/blob/fd9bd85c3d7a7bd6807dee4d5a5738b5c271b936/pkg/extension/opampextension/extension.go#L165

Improved logging interface

Right now the types.Logger interface has only two methods, Debugf and Errorf, both with the signature (format string, v ...interface{}). This makes it difficult to plug in custom loggers that may expect err as the first parameter.

Client should expose the internal sender or provide a method to send async message to the server

Currently, the client exposes neither the transport sender (internal.Sender) nor any method that allows an asynchronous send to the server.

As the Supervisor is shaping up, I think the client should expose a way to send asynchronous messages to the server. I can think of a few reasons:

  1. Report failing health of the agent or, worse, termination status of the agent.
  2. Report agent restart metrics.
  3. Report agent package or addon download progress. This is allowed by the spec, but the client doesn't have enough context to fetch this information, and for good reason I don't think it should. Instead, the actual downloader should be able to report progress to the server through the client.

opamp.Connection needs a Mutex to avoid concurrent write panic

In a busy server, we encountered a concurrent write panic. I added protection in the server, but ideally the library would add a Mutex to prevent this from happening.

Here is the relevant stack fragment:

panic: concurrent write to websocket connection
goroutine 115 [running]:
github.com/gorilla/websocket.(*messageWriter).flushFrame(0xc001dc95a8, 0x1, {0xc001d88160?, 0x2?, 0xc001dc95d8?})
        .../go/pkg/mod/github.com/gorilla/[email protected]/conn.go:617 +0x52b
github.com/gorilla/websocket.(*Conn).WriteMessage(0xc001da8f20, 0xc0003d2930?, {0xc001d88160, 0x142, 0x142})
        .../go/pkg/mod/github.com/gorilla/[email protected]/conn.go:770 +0x139
github.com/open-telemetry/opamp-go/server.wsConnection.Send({0xc001dc9750?}, {0xc001d950e0?, 0xc001dba050?}, 0x270c?)
        .../go/pkg/mod/github.com/open-telemetry/[email protected]/server/wsconnection.go:30 +0x55
...

Add Close() to types.Connection to allow the server to close a connection

It can be useful to close an agent connection to the server and I propose that a Close() method be added to the types.Connection interface.

The agent may quickly reconnect, but in a server cluster with multiple nodes, periodically closing connections can help with rebalancing the cluster.

Consider the case of 100 agents connected to a 2-node cluster and assume a load balancer splits them 50/50 between the 2 nodes. If one node is terminated, 50 agents will reconnect and now all 100 agents will be connected to a single node. If a node is added, there will still be 100 agents connected to a single node and the new node will have 0 agents.
