[Agent] Reporting failures about elastic-agent HOT 13 OPEN

elastic commented on July 22, 2024

[Agent] Reporting failures

from elastic-agent.

Comments (13)

elasticmachine commented on July 22, 2024

Original comment by @urso:

The need sounds more like an input/config status, and less like logging. For the purpose of logging, it was discussed that we will use filebeat, right?

As a configuration is split into a number of blocks, I assume we will have 2 IDs: one per block, and the overall configuration ID. These IDs should be added via structured logging to all configured inputs.

Besides inputs, we should also consider adding some kind of 'Context' to beat.Event, allowing us to correlate events, and logs about events, to inputs/configurations. If, for example, the ES output drops a JSON event due to mapping conflicts, then we want to be able to correlate this log message/failure with the config the event originated from.

Back to status. Taking filebeat as an example here. Currently the Run method has no return value. That is, we have nothing we could report. In my ongoing refactoring I actually change the signature of Run to also return an error. This one we will be able to report. But inputs should not just fail. Many 'failures' can be recovered from by updating the remote system we collect data from. For example, the kafka input might fail to read a topic because it doesn't exist yet. But the moment some other process decides to publish events, the topic will be there and the input recovers. Network issues can also turn an input into 'failure' mode. In this case Run will not return, but retry.
We could augment inputs/modules by reporting some status (similar to how systemd/Windows services report status) via:

type InputService interface {
  Starting()
  Running()
  Failing(err error)
  Stopping()
  Stopped()
  Fatal(err error)
}

The CM component integrating with agent would create events based on these callbacks, adding the config IDs. The logs themselves would still be shipped asynchronously via filebeat, and might therefore arrive at fleet/ES much later than the status update.

For fleet UI we might consider data frames to compute a current status.

elasticmachine commented on July 22, 2024

Original comment by @mattapperson:

Fleet will be sending what amounts to 3 IDs used to identify each configuration:

ID: A unique ID of this exact configuration version. It will change every time a configuration changes. This is what all errors relating to configurations need to be tied to.
version: An auto-incrementing ID; each change to a configuration bumps this. This number will never go down unless the shared_id changes. If a lower version number is returned, but the shared_id is the same, the “new” configuration is a bad cache and should be ignored.
shared_id: This ID persists across configuration changes, but changes if the agent gets moved to a new configuration (not just to a new version of a configuration).
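
The bad-cache rule for version and shared_id can be sketched in a few lines of Go. The Config struct and the isStale helper are hypothetical names chosen for this example; only the three identifiers and the rule itself come from the comment above.

```go
package main

import "fmt"

// Config carries the three identifiers described above.
// The struct and its field names are assumptions for illustration.
type Config struct {
	ID       string // unique per exact configuration version
	Version  int    // auto-incrementing within one SharedID
	SharedID string // stable across versions of the same configuration
}

// isStale reports whether incoming should be ignored as a bad cache:
// it has the same SharedID as current but a lower Version.
func isStale(current, incoming Config) bool {
	return incoming.SharedID == current.SharedID && incoming.Version < current.Version
}

func main() {
	current := Config{ID: "a3", Version: 3, SharedID: "s1"}

	// Same shared_id, lower version: a bad cache, ignore it.
	fmt.Println(isStale(current, Config{ID: "a2", Version: 2, SharedID: "s1"})) // true

	// Different shared_id: the agent moved to a new configuration,
	// so a lower version number is legitimate.
	fmt.Println(isStale(current, Config{ID: "b1", Version: 1, SharedID: "s2"})) // false
}
```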

elasticmachine commented on July 22, 2024

Original comment by @michalpristas:

@ph I think what Matt said here is important for the stateresolver.

elasticmachine commented on July 22, 2024

Original comment by @ph:

@urso I love your proposal here, and nice input from your filebeat refactoring.

elasticmachine commented on July 22, 2024

Original comment by @ph:

Looking at the IDs:

ID: This will need to be propagated down to the stateresolver.
version: This is only needed by the fetcher of the configuration.
shared_id: This is only needed by the fetcher of the configuration.

@michalpristas Now if we move to a sync flow as defined in LINK REDACTED we should be fine if we do this. (pseudo code incoming)

Receive a configRequest
newState, steps := Converge(currentState, configRequest)
Send steps to operator.
Check for errors and call report on the configRequest

elasticmachine commented on July 22, 2024

Original comment by @ph:

Note that the above removes the need for the event bus, and we do not have aggregation or discarding of events in that flow.

elasticmachine commented on July 22, 2024

Original comment by @michalpristas:

We will need to either remove all queues (pubsubs) and replace them with direct calls, or keep the capability of queues and introduce an ACK(succ/err) for commands.
The benefit of ACKs is that they work with a sync flow as well as an async flow.
The sync flow without a pubsub is easier to read.

Both of these will require some work for sure.

elasticmachine commented on July 22, 2024

Original comment by @ph:

@michalpristas I've created the proposal as a Google Doc here and added a task list with a tentative split.

LINK REDACTED

ph commented on July 22, 2024

We need to keep this open.

elasticmachine commented on July 22, 2024

Pinging @elastic/ingest-management (Team:ingest-management)

botelastic commented on July 22, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

elasticmachine commented on July 22, 2024

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

ruflin commented on July 22, 2024

@jlind23 This also partially ties into reporting of input status.
