elastic / elastic-agent
Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.
License: Other
Today, when the Elastic Agent is enrolled into Fleet, the elastic-agent.yml file is backed up and state is written into elastic-agent.yml indicating that the agent is managed. In addition, the Agent writes fleet.yml with more data. On startup, elastic-agent.yml is checked to see if the agent is enrolled. This caused some issues in Cloud initially because the file was overwritten.
This issue is to have a more general discussion around the purposes of these files. Do we need to write state to elastic-agent.yml when enrolled? Is fleet.yml enough? How exactly do we use each file? The goal is to come up with a guideline to make sure future development stays aligned with it.
Original comment by @ph:
The Agent starts the Beats processes as the same user as the agent process, which means root. This is less than ideal if we want to lock down the processes and reduce risk.
TODO:
Define stories
The docker provider, which was implemented as the first dynamic provider, includes all labels in the add_fields processor. It might be a good idea to reduce this information or make it a provider variable, only allowing it to be conditionally added to the input if the user wants it.
Follow up for elastic/beats#20842 (comment)
When checking the docs for the Windows installation of the Elastic Agent, both the start and stop commands for Windows check for an elastic-agent service:
https://www.elastic.co/guide/en/fleet/current/start-elastic-agent.html
https://www.elastic.co/guide/en/fleet/current/stop-elastic-agent.html
If all the steps from the docs have been followed during installation, the name of the service will be Elastic Agent instead. So Stop-Service elastic-agent and Start-Service elastic-agent will fail.
The source code https://github.com/elastic/beats/blob/master/x-pack/elastic-agent/pkg/agent/install/paths_windows.go#L20 confirms the name of the service as well.
The documentation should be updated.
Today we can use https://github.com/elastic/beats/blob/64f70785c0911eeb6f3f6ce5264f61544844ca0f/x-pack/elastic-agent/pkg/agent/application/upgrade/upgrade.go#L78 to define whether a release is upgradable or not. We have added the concept of capabilities in elastic/beats#23848.
We should add this to the capabilities file; it could look like this:
capabilities:
  - upgrade: false
@ruflin @mostlyjason WDYT about making it generic on the actions that we support (https://github.com/elastic/beats/tree/1f1fae56057dce0604f72f2cf0099f9a6f2b75aa/x-pack/elastic-agent/pkg/agent/application/pipeline/actions/handlers)?
capabilities:
  - rule: deny
    action: Upgrade
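To make the proposal concrete, here is a minimal Go sketch of how such deny rules could be evaluated. The Capability type and the default-allow policy are assumptions for illustration, not the real implementation:

```go
package main

import "fmt"

// Capability is a hypothetical rule entry mirroring the proposed
// capabilities file format (rule + action).
type Capability struct {
	Rule   string // "allow" or "deny"
	Action string // e.g. "Upgrade"
}

// Allowed reports whether an action is permitted under the given rules.
// Actions not matched by any deny rule are allowed by default (an assumption).
func Allowed(caps []Capability, action string) bool {
	for _, c := range caps {
		if c.Action == action && c.Rule == "deny" {
			return false
		}
	}
	return true
}

func main() {
	caps := []Capability{{Rule: "deny", Action: "Upgrade"}}
	fmt.Println(Allowed(caps, "Upgrade"))  // denied by the rule above
	fmt.Println(Allowed(caps, "Unenroll")) // no rule matches, allowed
}
```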
Describe the enhancement:
Bringing the Elastic Agent more in line with outputs supported by Beats.
Describe a specific use case for the enhancement or feature:
Enable customers who are using beats to send events/logs to a Kafka broker to be able to create the same environment and functionality using the Elastic Agent. Lack of this capability may be an inhibitor for the adoption of Elastic Agent.
Description
elastic-agent install fails to detect a problem during service start and reports the misleading message Installation was successful and Elastic Agent is running., even though the service hasn't been able to start (e.g. due to a process already bound to port 6789).
The script should at least notify that the agent was installed but there was a problem starting the service.
How to reproduce the bug
# netstat -natop | grep 6789
tcp6 0 0 :::6789 :::* LISTEN 1891/docker-proxy off (0.00/0/0)
ubuntu@server:~$ sudo ./elastic-agent install -f --kibana-url=https://<URL> --enrollment-token=<token>
The Elastic Agent is currently in BETA and should not be used in production
2020-12-03T16:43:31.069+0100 DEBUG kibana/client.go:170 Request method: POST, path: /api/fleet/agents/enroll
Successfully enrolled the Elastic Agent.
Installation was successful and Elastic Agent is running.
The installation script reports Installation was successful and Elastic Agent is running.
but the agent is never enrolled in the Kibana Fleet UI.
In journalctl -u elastic-agent.service we can see the process wasn't able to start because the address was already in use:
# journalctl -u elastic-agent.service
-- Logs begin at Sat 2020-08-29 18:15:02 CEST, end at Thu 2020-12-03 16:43:51 CET. --
nov 17 16:25:53 server systemd[1]: Stopped Elastic Agent is a unified agent to observe, monitor and protect your system..
nov 17 16:25:53 server systemd[1]: Started Elastic Agent is a unified agent to observe, monitor and protect your system..
nov 17 16:25:53 server elastic-agent[1514327]: starting GRPC listener: listen tcp 127.0.0.1:6789: bind: address already in use
nov 17 16:25:53 server systemd[1]: elastic-agent.service: Main process exited, code=exited, status=1/FAILURE
nov 17 16:25:53 server systemd[1]: elastic-agent.service: Failed with result 'exit-code'.
nov 17 16:27:53 server systemd[1]: elastic-agent.service: Scheduled restart job, restart counter is at 2.
nov 17 16:27:53 server systemd[1]: Stopped Elastic Agent is a unified agent to observe, monitor and protect your system..
nov 17 16:27:53 server systemd[1]: Started Elastic Agent is a unified agent to observe, monitor and protect your system..
nov 17 16:27:54 server elastic-agent[1514463]: starting GRPC listener: listen tcp 127.0.0.1:6789: bind: address already in use
nov 17 16:27:54 server systemd[1]: elastic-agent.service: Main process exited, code=exited, status=1/FAILURE
nov 17 16:27:54 server systemd[1]: elastic-agent.service: Failed with result 'exit-code'.
...
Workaround
We can change the default port in /opt/Elastic/Agent/elastic-agent.yml from 6789 to e.g. 16789:
fleet:
  enabled: true
agent.grpc:
  address: localhost
  port: 16789
And then restart the service and check that service is up:
# sudo systemctl start elastic-agent.service
#
# sudo journalctl -u elastic-agent.service -f
-- Logs begin at Sat 2020-08-29 18:15:02 CEST. --
dic 03 16:53:20 server systemd[1]: Started Elastic Agent is a unified agent to observe, monitor and protect your system..
A user brought this to us and I am logging a quick ticket to capture minimal details.
Apparently the Agent failed to install metricbeat, due to a lack of disk space.
It wasn't clear immediately, but some subsequent log diving shows the reason:
/var/lib/elastic-agent/logs/elastic-agent-json.log.2:{"log.level":"error","@timestamp":"2020-11-11T11:13:12.997-0500","log.origin":{"file.name":"log/reporter.go","file.line":36},"message":"2020-11-11T11:13:12-05:00: type: 'ERROR': sub_type: 'FAILED' message: Application: filebeat--7.9.3--36643631373035623733363936343635[2ff0699f-4ef0-4d57-84b3-053a760c711e]: State changed to FAILED: TarInstaller: error writing to /var/lib/elastic-agent/install/filebeat-7.9.3-linux-x86_64/NOTICE.txt: write /var/lib/elastic-agent/install/filebeat-7.9.3-linux-x86_64/NOTICE.txt: no space left on device","ecs.version":"1.5.0"}
What should we expect of Elastic Agent here? I'm not sure what it can do... except purge old log files? What else can we think of? And what should be shown in the Activity log, etc.?
Thanks @P1llus for bringing it to us in Slack.
We allow people to see and edit the configuration in Fleet, so it might be a good idea to share validations, or at least high-level validations, with the expected fields like ids, type and metricset. This can also be useful for communicating with other teams.
We have created an example configuration, but this is not enough to express what we are expecting. We need a formal way to define it; one way would be to use a json-schema definition that could be used by both the agent and fleet.
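As a rough illustration of the kind of high-level validation meant here, a Go sketch that checks an input block for the expected fields. The required-field list and the function are illustrative assumptions, not the real schema:

```go
package main

import (
	"errors"
	"fmt"
)

// validateInput checks that a parsed input block carries the fields both
// Fleet and the agent expect. The field list here is only an example of
// what a shared schema could enforce.
func validateInput(block map[string]interface{}) error {
	for _, field := range []string{"id", "type", "metricset"} {
		if _, ok := block[field]; !ok {
			return errors.New("missing required field: " + field)
		}
	}
	return nil
}

func main() {
	ok := map[string]interface{}{"id": "1", "type": "system/metrics", "metricset": "cpu"}
	bad := map[string]interface{}{"id": "2"}
	fmt.Println(validateInput(ok))
	fmt.Println(validateInput(bad))
}
```

A json-schema definition would express the same constraints declaratively, so both the agent and Fleet could consume one source of truth.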
Summary
When the elastic agent installs a new input, it starts a new process or restarts an existing process with additional input configuration. The agent does not apply any resource limits to the created subprocesses, potentially leading to the processes competing for available resources. This can become an issue when multiple processes run with high load, reaching the limit of available resources. We need a solution for limiting resource usage per subprocess.
It becomes especially important when the resources for the elastic agent are already restricted, which will be the case for the hosted elastic agent.
There is currently no concept available for how the memory/cpu shares available to the elastic agent should be distributed between processes. Most probably we would not want to limit the subprocesses by default, but only if configured. For hosted agents the orchestrator should pass a configuration to the container where the agent is running.
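A minimal sketch of one possible allocation policy, dividing the agent's memory budget between subprocesses by configured weights. The agent has no such mechanism today; the function and its weights are hypothetical:

```go
package main

import "fmt"

// splitMemory divides a total memory budget (in MiB) between subprocesses
// according to configured weights. Integer division means a few MiB may be
// left unassigned; a real implementation would decide how to handle that.
func splitMemory(totalMiB int, weights map[string]int) map[string]int {
	sum := 0
	for _, w := range weights {
		sum += w
	}
	out := make(map[string]int, len(weights))
	for name, w := range weights {
		out[name] = totalMiB * w / sum
	}
	return out
}

func main() {
	// Hosted agent limited to 1024 MiB, metricbeat weighted twice filebeat.
	fmt.Println(splitMemory(1024, map[string]int{"filebeat": 1, "metricbeat": 2}))
}
```

Enforcing the computed limits would then be the job of cgroups (or the orchestrator, for hosted agents), which is the open question below.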
TODO
cgroups?
Relates: #151
I'd like to ask for your recommendation for users that prefer to run Elastic Agent in the container (let's say due to security reasons).
Let's discuss the scenario:
The integrated product is nginx running in a container. It produces logs that are stored locally in the container and rotated. As the agent is running in a different container, it can't simply access the produced logs.
What is your recommendation in this particular case? Should the user somehow expose the log files? Mirror them?
Background -
I had an interesting talk with @ycombinator about possibilities and testing scenarios, and it looks like we will both have to nail this problem (forcing the agent to watch logs produced in a different container).
The agent.type field used by Elastic Agent currently leaks details about the underlying Beats. With my 7.13 agent, it's set to "filebeat" for logs and "metricbeat" for metrics. We don't want to create a user dependency on this Beats information because it may be refactored out in the future.
The solution for now is to not populate the field and add it later. The reason is that the elastic-agent is not the actual agent in all scenarios; this is true when it runs as a server (HTTP server), where the agent.id and type could come from the sender. Leaving it out reduces this mess for now, so we have an opportunity to clean it up. Adding the field later is an addition; removing it later is a breaking change. I think there are many other meta fields that we can likely already use for debugging.
As noted in elastic/beats#20994, EQL just does some.field != null; we should investigate whether we could converge on the same syntax.
When using the /stats API, the event is returned as-is. When collecting stats from Beats, the beats namespace does contain process metrics like cgroup, CPU, and memory usage, but the process name is not included. When metrics are queried via Agent, the beats namespace should become process, plus a field process.name should be added, containing the process name known to the agent (e.g. filebeat-default-monitoring).
Although we have no event routing available yet, data sources should be encouraged to provide data stream metadata as hints (which can still be ignored). When querying process stats via the agent, the JSON document published by Agent should include the data stream fields.
The change could be added to libbeat, or (maybe easier) as a processing step in Agent. When done in Agent, we already have a place where we can massage Endpoint stats in the future.
Hi, this is a spawn-off of testing done in support of elastic/beats#26665.
I'm transferring this issue from the Endpoint team to Beats / Agent.
From @dikshachauhan-qasource: we have attempted to validate the endpoint behavior on French VMs and found it working fine with a small glitch.
Observations:
Scenario1:
Installed agent under a policy having endpoint.
Agent remained in the updating state until we manually restarted the elastic-agent service.
Host then updated to healthy status and was available under Endpoint tab with status 'success'.
Data streams were working fine.
All binaries were in running state.
Recording:
https://user-images.githubusercontent.com/12970373/127567223-9c1fd3ee-4216-4837-b0a6-2d6cb45d0300.mp4
Scenario2:
Unenrolled then Re-Installed agent under same policy having endpoint.
Observations same as mentioned above.
Scenario3:
Unenrolled then Re-Installed agent under Default policy. Later after installation of agent, we added Endpoint security.
Observations same as mentioned above.
screenshot:
Logs.zip:
logs-french-win-10-agent.zip
Hi,
I was researching different ways of enabling verbose logging in Elastic Agent in terms of elastic/elastic-package#86 . I'd like to collect Elastic-Agent and subprocesses logs at some folder (which can be mounted and exposed externally).
Then, I came to a different conclusion: the standard log output of Elastic Agent is silent even though the application is running in the background and logging data to .log files:
{"isInitialized":true}{"isInitialized":true}{"list":[{"id":"935ec4c0-5415-11eb-b36e-d53bf68e2a18","active":true,"api_key_id":"4lTA8XYBqSxScuxU6GUe","name":"Default (de7d6165-b378-4b41-a770-2b419e856d98)","policy_id":"8afbdb60-5415-11eb-b36e-d53bf68e2a18","created_at":"2021-01-11T14:02:00.844Z"}],"total":1,"page":1,"perPage":20}
935ec4c0-5415-11eb-b36e-d53bf68e2a18
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 324 100 324 0 0 11899 0 --:--:-- --:--:-- --:--:-- 12000
NGxUQThYWUJxU3hTY3V4VTZHVWU6dVlRSWN5WERSUXl2RDRDX1JoYnRfZw==
The Elastic Agent is currently in BETA and should not be used in production
Successfully enrolled the Elastic Agent.
Elastic Agent might not be running; unable to trigger restart
(no more log messages)
This behavior doesn't seem to be consistent with other dockerized apps like Kibana or Elasticsearch, which print levelled log messages by default. What do you think about changing this behavior to a similar pattern? I would also appreciate a special combined logging mode that can merge log outputs from multiple sources like Elastic Agent, Metricbeat, Filebeat, etc.
This is related to elastic/kibana#75236 and elastic/kibana#99068, both of which are longer-term efforts around enabling more granular status reporting of "integrations" that are running on Elastic Agent. But Agent has no concept of integrations, only which inputs/processes are running.
Still, reporting that information is useful and would get us closer to our longer-term goals. In the short term, this would enable Endpoint to filter agents by which ones are running Endpoint without doing additional JOIN-like queries.
I'd like to propose that agents report the set of inputs/processes they are running in their local_metadata field. One thing to consider in deciding the data structure for how this information should be stored is that in the future we will want to allow subprocesses to report their own additional meta information, such as the Endpoint process reporting an "isolated" status.
Original comment by @michalpristas:
At the moment the process is as follows, using gRPC:
1. The beat does a manager.Register call to register the settings it knows how to handle.
2. The agent pushes configuration to the Config(string) endpoint and the beat tries to parse it.
3. The parsed configuration (map[string]iface{}) is passed to a fleet/manager using a channel.
4. fleet/manager breaks the configuration into configuration blocks it understands (based on CM).
5. The blocks are applied via the Central Management mechanism.
When a failure occurs in steps 1, 2 or 3, it is returned to the caller as an error. But when an error occurs in step 4 or 5, the agent is not aware of the failure (unless the beat crashes, in which case the agent tries to restart it and apply the config again).
We need to think of a way to propagate failures from steps 4 and 5 back to the agent.
We also need to think about pairing an experienced failure with a concrete configuration version (this is more or less a question for @mattapperson).
At this moment the beat does not have a concept of a configuration version at all, nor does the agent propagate a version down the stack.
In situations where the Elastic Endpoint Security integration fails to install successfully, Agent appears to continuously retry the installation. It's not clear whether there is a limit or cap on the retries, but there does not appear to be. This results in unnecessary resource utilization, including filling up the elastic-agent.log file.
Details:
Problem:
In certain execution contexts PowerShell will convert any line of text sent to STDERR into an Error object. This will no doubt go unhandled, thus the command is failed by PowerShell:
PS C:\Users\Administrator\Documents\EC_Spout> C:\Users\Administrator\Documents\EC_Spout\agent_install+enroll.ps1
Uninstalling existing
Elastic Agent has been uninstalled.
The Elastic Agent is currently in BETA and should not be used in production
elastic-agent.exe : 2020-11-24T07:32:24.902-0800 DEBUG kibana/client.go:170 Request method: POST, path:
/api/fleet/agents/enroll
At C:\Users\Administrator\Documents\EC_Spout\agent_install+enroll.ps1:91 char:1
+ & "$download_dir\elastic-agent-$stack_ver-windows-x86_64\elastic-agen ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (2020-11-24T07:3...t/agents/enroll:String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError
Google "PS NativeCommandError" to discover all the happy people that trip over this error.
Solution:
On Windows systems (especially under PowerShell), do not send anything to STDERR unless it really is an error and the command should be terminated/failed.
Ways to reproduce:
In an interactive PS session, this error handling will most likely not be enabled. Under ISE it most often is.
Open PowerShell ISE and write a ps1 script:
& "$download_dir\elastic-agent-7.10.0-windows-x86_64\elastic-agent.exe" install -f -k "$kn_url" -t "$agent_token"
Run the script with the 'play' button in the toolbar (after saving it).
Doing it via ISE like this was the easiest way, I think, to have PS in such an error handling mode. I have experienced the same problem with PS scripts started by the Task Scheduler.
Extra info:
I maintain scripts to automate starting a demo env.: https://github.com/ElasticSA/ec_spout (more info for Elastic employees here: https://wiki.elastic.co/display/PRES/EC+Spout )
Describe the feature:
When you run the enrolment command on Elastic Agent on a host where it has already been installed, it terminates and you get the following error (on Mac at least):
"Error: already installed at: /Library/Elastic/Agent"
To continue, you then need to work out how to uninstall agent and then re-run the command.
It is likely that people doing initial testing will try to enrol a test host in more than one cluster as they iterate dev/PoC clusters, so it would be useful if Agent handled the situation better.
The ideal scenario is Agent would ask if you want to change the configuration of the installed agent to enrol in the new cluster. Alternatively, it could ask for confirmation and then uninstall the existing agent for the user.
As a fallback, it could at least provide the full uninstall command to the user to be able to continue.
Describe a specific use case for the feature:
Setup of Elastic Agent
Revisit all the defaults. We should be able to run Fleet without having a physical configuration on disk, and we should be able to override path.* using -E like any other Beat.
docker run --name centos centos:7 tail -f /dev/null
docker exec -ti centos bash
curl https://snapshots.elastic.co/8.0.0-3ce083a1/downloads/beats/elastic-agent/elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -o /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm
curl https://raw.githubusercontent.com/gdraheim/docker-systemctl-replacement/master/files/docker/systemctl.py -o /usr/bin/systemctl
yum localinstall /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -y
systemctl enable elastic-agent
systemctl start elastic-agent
Run top; there should be only one process for the elastic-agent. Then run systemctl restart elastic-agent and check top again.
Expected: after the initial restart, the elastic-agent appears once, not in the Z state.
Actual: after the initial restart, the elastic-agent appears twice, one in the Z state and the other in the S state (as shown in the attachment).
This behavior persists across multiple restarts: the elastic-agent process gets into the zombie state each time it is restarted (note that I restarted it three times, so there are 3 zombie processes):
docker run -d --name centos centos:7 tail -f /dev/null
docker exec -ti centos bash
Inside the container
curl https://snapshots.elastic.co/8.0.0-3ce083a1/downloads/beats/elastic-agent/elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -o /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm
curl https://raw.githubusercontent.com/gdraheim/docker-systemctl-replacement/master/files/docker/systemctl.py -o /usr/bin/systemctl
yum localinstall /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -y
systemctl enable elastic-agent
systemctl start elastic-agent
systemctl restart elastic-agent
top
Describe the enhancement:
Currently there's no way to unenroll an elastic-agent from the client side
Describe a specific use case for the enhancement or feature:
When running ephemeral instances (containers, for example) each can enroll, but when the container is stopped we end up with stranded offline instances in fleet, which then takes two commands per host on the Fleet screen (unenroll and force unenroll, because they never unenroll), for a total of 6 clicks, plus delays, for each host.
If there were an unenroll subcommand for ./elastic-agent, it could be called during container teardown.
Describe the enhancement:
With Beats, the configuration was available to metricbeat and filebeat locally on the host, but with agent and packages we moved the configuration definition into the package registry. So when Kubernetes discovers that a pod runs software that we monitor, either through dynamic input conditionals in the agent config or via hints-based discovery, the agent needs to download the integration config for the auto-discovered software. When we deliver this enhancement, the agent will automagically ship data to the right data streams in Elasticsearch, similar to how Beats do today. The user still needs to install the right package in Kibana. We will tackle the auto-installation of packages in a separate issue.
Example scenario:
A worker node is running nginx in a pod, and through dynamic inputs or hints-based auto-discovery the agent detects the existence of nginx on that worker node. The agent is able to retrieve the nginx configuration for metrics and logs, fill in the values provided in the auto-discovery configuration (e.g. see the configuration examples for metricbeat here), and ship data to Elasticsearch successfully.
The Elastic Agent was running for a few minutes and I changed the logging level in the Fleet UI from Info to Debug. This all seems to have worked, but the first few lines that were logged looked as follows:
2021-05-03T19:26:23.569Z INFO process/app.go:176 Signaling application to stop because of shutdown: metricbeat--7.13.0-SNAPSHOT
2021-05-03T19:26:24.066Z INFO log/reporter.go:40 2021-05-03T19:26:24Z - message: Application: filebeat--7.13.0-SNAPSHOT[4f12dd1d-f096-40b1-8bf4-8a0e66722775]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-05-03T19:26:24.066Z INFO log/reporter.go:40 2021-05-03T19:26:24Z - message: Application: filebeat--7.13.0-SNAPSHOT--36643631373035623733363936343635[4f12dd1d-f096-40b1-8bf4-8a0e66722775]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-05-03T19:26:24.066Z INFO log/reporter.go:40 2021-05-03T19:26:24Z - message: Application: metricbeat--7.13.0-SNAPSHOT--36643631373035623733363936343635[4f12dd1d-f096-40b1-8bf4-8a0e66722775]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-05-03T19:26:24.066Z INFO log/reporter.go:40 2021-05-03T19:26:24Z - message: Application: metricbeat--7.13.0-SNAPSHOT[4f12dd1d-f096-40b1-8bf4-8a0e66722775]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-05-03T19:26:24.999Z ERROR fleet/fleet_gateway.go:167 context canceled
2021-05-03T19:26:26.378Z ERROR fleet/fleet_gateway.go:167 context canceled
2021-05-03T19:26:27.852Z INFO fleet/fleet_gateway.go:298 Fleet gateway is stopping
2021-05-03T19:26:27.852Z INFO status/reporter.go:236 Elastic Agent status changed to: 'online'
I stumbled over the two ERROR log entries related to context, which also contain very little "context" about what they are about.
To help support dynamic inputs (elastic/beats#19225), Elastic Agent needs to add the ability to debug the providers used for variable substitution. This issue is to track the debugging effort; for information about variable substitution see elastic/beats#20781.
Providers obviously add a lot of uncertainty about the resulting configuration that Elastic Agent will be running with. To ensure that the feature is deployed correctly and that providers are working as expected, debugging needs to be a top priority in the implementation.
With the new ability to communicate with the running daemon, the inspect command should be changed to talk to the running daemon and return the current configuration that is being used in memory. This will ensure that with running providers like Docker and Kubernetes it is easy to inspect what the resulting configuration is.
The current inspect and output commands can be combined and moved under the debug subcommand. (Note: this does not connect to the currently running Elastic Agent.)
$ ./elastic-agent debug config
It should be possible to watch the configuration as changes come in with --watch:
$ ./elastic-agent debug config --watch
A new debug command should be implemented that runs a single provider and outputs what it is currently providing back to the Elastic Agent. (Note: this does not connect to the currently running Elastic Agent.)
Example outputting docker provider inventory key/value mappings:
$ ./elastic-agent debug provider docker
{"id": "1", "mapping": {"id": "1", "paths": {"log": "/var/log/containers/1.log"}}, "processors": {"add_fields": {"container.name": "my-container"}}}
{"id": "2", "mapping": {"id": "2", "paths": {"log": "/var/log/containers/2.log"}}, "processors": {"add_fields": {"container.name": "other-container"}}}
{"id": "2", "mapping": nil}
Example rendering configurations with changes:
$ ./elastic-agent debug provider docker -c testing-conf.yml
# {"id": "1", "mapping": {"id": "1", "paths": {"log": "/var/log/containers/1.log"}}, "processors": {"add_fields": {"container.name": "my-container"}}}
inputs:
  - type: logfile
    path: /var/log/containers/1.log
    processors:
      - add_fields:
          container.name: my-container
# {"id": "2", "mapping": {"id": "2", "paths": {"log": "/var/log/containers/2.log"}}, "processors": {"add_fields": {"container.name": "other-container"}}}
inputs:
  - type: logfile
    path: /var/log/containers/1.log
    processors:
      - add_fields:
          container.name: my-container
  - type: logfile
    path: /var/log/containers/2.log
    processors:
      - add_fields:
          container.name: other-container
# {"id": "2", "mapping": nil}
inputs:
  - type: logfile
    path: /var/log/containers/1.log
Elastic Stack products will start to ship deprecation logs to a specific index based on the new indexing strategy. Elastic Agent should do the same and ship deprecation logs to logs-deprecation.elastic.agent-*. It should also be discussed how and where the deprecation logs of the sub-processes are shipped.
The Elastic Agent must connect to the fleet-server for enrollment. There are several issues that can happen around connectivity to the fleet-server. If the enrollment doesn't work, it would be nice to have a command line tool to investigate what the actual issue is: certificate issues, port not open, host not reachable, wrong token, etc.
This idea was triggered by issues like this one: elastic/fleet-server#235 (comment)
Debugging Elastic Agent is currently not as easy as it should be. In case of an issue, the right paths for the logs have to be found and read one by one. It would be very convenient if Elastic Agent offered a command to get the logs and metrics.
To tail all the logs, something like the following would be useful:
elastic-agent logs -f
Maybe later support for filtering logs from only a specific process could be added. One step further would be, that on the fly the logging level could be changed.
The same is true for metrics. It would be nice if a snapshot of the metrics could be gathered with something like the following:
elastic-agent metrics
When we created the "when" clause we were under the impression that all the Beats were actually equal and supported all the same outputs. This was not completely true: APM Server supports a subset of the outputs that Beats support; it does not support Redis.
Maybe we should just move to the conditions and rely on capabilities
We should improve the reporting if an output is used and not supported by a running process, currently it will fail silently.
In the Agent code we are using errwrap and go-multierror; these dependencies are not necessary and we should remove them.
Currently we report a very basic status during check-in.
To give users more details on the status of their agents, we want to send a more complete policy status (the format is defined in elastic/kibana#82298).
The status will be sent during agent check-in.
Open questions:
How do we persist status in ES?
Should the agent also send that data to ES directly?
Is this already the case if status changes are in the agent logs? If yes, will this log data be searchable?
Pro:
@blakerouse @ruflin I am curious to have your thoughts here on how this can work with the future Fleet Server too
The Fleet settings UI allows setting all kinds of values in the Elasticsearch output configuration. There is no validation, allowing any kind of input.
enabled: false
: the Elastic Agent kills the sub processes and only logs that it stopped them. There is no indicator of why they were stopped and no additional entries in the sub process logs.
bulk_max_size: "4s"
: the Elastic Agent keeps restarting the sub processes, which can't start because they all have an invalid configuration. The sub process logs include detailed information, which is helpful for debugging.
Ideally there would be some validation preventing invalid configurations from being stored.
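One way such validation could work is to decode the settings into a typed structure before accepting them, so a string like "4s" is rejected for an integer field up front. A minimal sketch, using JSON and invented field names purely for illustration (the real output settings are YAML and richer):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// esOutput is an assumed, simplified typed view of the Elasticsearch
// output settings. Decoding into concrete types rejects values like
// bulk_max_size: "4s" before the config reaches any sub process.
type esOutput struct {
	Enabled     *bool `json:"enabled"`
	BulkMaxSize int   `json:"bulk_max_size"`
}

func validate(raw []byte) error {
	var out esOutput
	if err := json.Unmarshal(raw, &out); err != nil {
		return fmt.Errorf("invalid output settings: %w", err)
	}
	if out.BulkMaxSize < 0 {
		return fmt.Errorf("bulk_max_size must be >= 0, got %d", out.BulkMaxSize)
	}
	return nil
}

func main() {
	fmt.Println(validate([]byte(`{"bulk_max_size": 50}`)))   // <nil>
	fmt.Println(validate([]byte(`{"bulk_max_size": "4s"}`))) // type-mismatch error
}
```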
I wasn't certain whether this belongs here and/or to Fleet.
The Elastic Agent had been running for a few minutes when I changed the logging level in the Fleet UI from Info to Debug. This all seems to have worked, but the first few lines that were logged looked as follows:
2021-05-03T19:26:23.569Z INFO process/app.go:176 Signaling application to stop because of shutdown: metricbeat--7.13.0-SNAPSHOT
2021-05-03T19:26:24.066Z INFO log/reporter.go:40 2021-05-03T19:26:24Z - message: Application: filebeat--7.13.0-SNAPSHOT[4f12dd1d-f096-40b1-8bf4-8a0e66722775]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-05-03T19:26:24.066Z INFO log/reporter.go:40 2021-05-03T19:26:24Z - message: Application: filebeat--7.13.0-SNAPSHOT--36643631373035623733363936343635[4f12dd1d-f096-40b1-8bf4-8a0e66722775]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-05-03T19:26:24.066Z INFO log/reporter.go:40 2021-05-03T19:26:24Z - message: Application: metricbeat--7.13.0-SNAPSHOT--36643631373035623733363936343635[4f12dd1d-f096-40b1-8bf4-8a0e66722775]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-05-03T19:26:24.066Z INFO log/reporter.go:40 2021-05-03T19:26:24Z - message: Application: metricbeat--7.13.0-SNAPSHOT[4f12dd1d-f096-40b1-8bf4-8a0e66722775]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-05-03T19:26:24.999Z ERROR fleet/fleet_gateway.go:167 context canceled
2021-05-03T19:26:26.378Z ERROR fleet/fleet_gateway.go:167 context canceled
2021-05-03T19:26:27.852Z INFO fleet/fleet_gateway.go:298 Fleet gateway is stopping
2021-05-03T19:26:27.852Z INFO status/reporter.go:236 Elastic Agent status changed to: 'online'
I stumbled over the two ERROR log entries related to context, which also contain very little "context" about what they refer to.
Describe the enhancement:
Users may want to protect and observe their network shared drives, so we could support them.
Using network shared drives is currently not recommended. We have no automated tests to verify it works, and anecdotal (but old) data indicates that (at least) Filebeat would have problems running there. No further specifics are available at this time (testing would be required to generate example errors, etc.).
I will leave this logged as an enhancement for now and add a brief note tied to this in the obs-docs for Agent.
At the moment the logs of a Filebeat started by the Agent are polluted with the debug logs of the add_docker_metadata processor. Example log line:
{"log.level":"error","@timestamp":"2020-07-23T16:00:00.372+0200","log.logger":"add_docker_metadata.docker","log.origin":{"file.name":"docker/watcher.go","file.line":320},"message":"Error watching for docker events: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","ecs.version":"1.5.0"}
As I am not trying to test anything Docker related, these lines are distracting. Normally I just remove all processors when debugging Filebeat issues. However, as the Agent manages the configuration of these Filebeat instances, I do not have access to these global processors. Also, the inspect subcommand does not show whether these processors are included or not. (Or at least I am not aware of any flag which could show it.) But based on the logs, these processors are enabled.
$ ./elastic-agent inspect output -o default -p filebeat
Action ID: 856251a6-a6f8-40e9-b71b-af54738c3280
[default] filebeat:
filebeat:
inputs:
- exclude_files:
- .gz$
id: logfile-postgresql.log
index: logs-postgresql.log-default
meta:
package:
name: postgresql
version: 0.1.0
multiline:
match: after
negate: true
pattern: '^\d{4}-\d{2}-\d{2} '
name: postgresql-1
paths:
- /home/n/go/src/github.com/elastic/beats/filebeat/module/postgresql/log/test/*.log
processors:
- add_fields:
fields:
ecs:
version: 1.5.0
target: ""
- add_fields:
fields:
name: postgresql.log
namespace: default
type: logs
target: dataset
- add_fields:
fields:
dataset: postgresql.log
target: event
type: log
output:
elasticsearch:
api_key: {{key}}
hosts:
- {{outputhost}}
This not only impacts developers but users as well: the logs of their agent-managed Beat instances will be full of these docker/aws/etc. processor logs even when those are irrelevant.
For confirmed bugs, please report:
In x-pack/elastic-agent, run mage package.
Run ./elastic-agent enroll using the key you get from the Kibana UI.
Run ./elastic-agent run.
After that, I'm seeing this error:
Application: metricbeat--8.0.0[a3d097d2-59a5-4517-a5cc-86c906ac71c2]: State changed to FAILED: 2 errors occurred: * package '/home/alexk/go/src/github.com/elastic/beats/x-pack/elastic-agent/build/distributions/elastic-agent-8.0.0-linux-x86_64/data/elastic-agent-66d393/downloads/metricbeat-8.0.0-linux-x86_64.tar.gz' not found: open /home/alexk/go/src/github.com/elastic/beats/x-pack/elastic-agent/build/distributions/elastic-agent-8.0.0-linux-x86_64/data/elastic-agent-66d393/downloads/metricbeat-8.0.0-linux-x86_64.tar.gz: no such file or directory * call to 'https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.0.0-linux-x86_64.tar.gz' returned unsuccessful status code: 404: /go/src/github.com/elastic/beats/x-pack/elastic-agent/pkg/artifact/download/http/downloader.go[142]: unknown error
The tarballs that come packaged in data/elastic-agent-HASH/downloads/ are being deleted. The files are there when I unpack the elastic-agent tarball, and afterwards at least one has been removed. I also discovered that if you try to copy a new tarball into the downloads/ directory while elastic-agent is running, it instantly deletes it. Sometimes it's Metricbeat, sometimes it's Filebeat. I'm seeing it with install as well as enroll. Manually unpacking the tarballs and putting them in install/ doesn't help.
The default port for fleet-server is 8220. When enrolling an Elastic Agent with --url=http://localhost, the port 8220 is picked by default. The same is the case if https is used. On Cloud, fleet-server is exposed on 443/9243, so if no port is specified in the UI or during enrollment, it does not work by default.
I want to discuss whether this is the expected behaviour or whether we should default to 80/443 when no port is specified.
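The two behaviours under discussion can be sketched with net/url. This is an illustrative helper, not the Agent's actual implementation; the flag controlling which default applies is an assumption:

```go
package main

import (
	"fmt"
	"net/url"
)

// normalizeFleetURL appends a default port when the user omits one.
// Whether the default should be fleet-server's 8220 (current behaviour)
// or the scheme's 80/443 is exactly the open question; both are shown.
func normalizeFleetURL(raw string, schemeDefaults bool) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.Port() == "" {
		port := "8220" // current fleet-server default
		if schemeDefaults {
			if u.Scheme == "https" {
				port = "443"
			} else {
				port = "80"
			}
		}
		u.Host = u.Hostname() + ":" + port
	}
	return u.String(), nil
}

func main() {
	s, _ := normalizeFleetURL("http://localhost", false)
	fmt.Println(s) // http://localhost:8220
	s, _ = normalizeFleetURL("https://localhost", true)
	fmt.Println(s) // https://localhost:443
}
```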
Currently I am developing a non-interactive installation of the Elastic Agent for multiple platforms and would like to be able to run "elastic-agent run ...args..." from my application. I am able to do so, but since it is not installed as a service, on Windows I can only send terminate signals to shut it down (Windows, of course, doesn't have proper signal handling; on Unix/macOS I can just send a HUP, so this isn't as much of an issue). The elastic-agent control proto only has a "Restart" command but no "Stop" command, or I would use elastic-agent-client to shut the application down gracefully. If I terminate it on Windows, the child processes (Filebeat, etc.) are not cleaned up and, worse, they grow endlessly before crashing (~32GB and counting). Alternatively, if the application would clean up when receiving a Windows terminate event (sans /F), that would also be perfect.
Describe the enhancement:
All the Beats have a setting to start an endpoint where you can check their stats; this is useful for monitoring the internal state of the Beat. The feature can expose an HTTP port, a unix socket, or a named pipe.
Scenario: Listen for basic request
Given An Elastic Agent enrolled in a Kibana instance
And you start the Elastic Agent with the option -E http.enabled=true
And a host or IP is set to listen on -E http.host=localhost
And a port is set to listen on -E http.port=5066
When a user makes the request curl -XGET 'http://localhost:5066/?pretty'
Then the Elastic Agent responds with its basic info in JSON format
{
"beat": "elastic-agent",
"hostname": "example.lan",
"name": "example.lan",
"uuid": "34f6c6e1-45a8-4b12-9125-11b3e6e89866",
"version": "7.10.0"
}
Scenario: Listen for basic info request
Given An Elastic Agent enrolled in a Kibana instance
And you start the Elastic Agent with the option -E http.enabled=true
And a unix socket is set to listen on -E http.host=unix:///tmp/elastic-agent.sock
When a user makes the request curl -XGET --unix-socket /tmp/elastic-agent.sock 'http://localhost/?pretty'
Then the Elastic Agent responds with its basic info in JSON format
{
"beat": "elastic-agent",
"hostname": "example.lan",
"name": "example.lan",
"uuid": "34f6c6e1-45a8-4b12-9125-11b3e6e89866",
"version": "7.10.0"
}
Scenario: Listen for stats request
Given An Elastic Agent enrolled in a Kibana instance
And you start the Elastic Agent with the option -E http.enabled=true
And a host or IP is set to listen on -E http.host=localhost
And a port is set to listen on -E http.port=5066
When a user makes the request curl -XGET 'http://localhost:5066/stats?pretty'
Then the Elastic Agent responds with its stats in JSON format
{
"beat": {
"cpu": {
"system": {
"ticks": 1710,
"time": {
"ms": 1712
}
},
"total": {
"ticks": 3420,
"time": {
"ms": 3424
},
"value": 3420
},
"user": {
"ticks": 1710,
"time": {
"ms": 1712
}
}
},
"info": {
"ephemeral_id": "ab4287c4-d907-4d9d-b074-d8c3cec4a577",
"uptime": {
"ms": 195547
}
},
"memstats": {
"gc_next": 17855152,
"memory_alloc": 9433384,
"memory_total": 492478864,
"rss": 50405376
},
"runtime": {
"goroutines": 22
}
},
"libbeat": {
"config": {
"module": {
"running": 0,
"starts": 0,
"stops": 0
},
"scans": 1,
"reloads": 1
},
"output": {
"events": {
"acked": 0,
"active": 0,
"batches": 0,
"dropped": 0,
"duplicates": 0,
"failed": 0,
"total": 0
},
"read": {
"bytes": 0,
"errors": 0
},
"type": "elasticsearch",
"write": {
"bytes": 0,
"errors": 0
}
},
"pipeline": {
"clients": 6,
"events": {
"active": 716,
"dropped": 0,
"failed": 0,
"filtered": 0,
"published": 716,
"retry": 278,
"total": 716
},
"queue": {
"acked": 0
}
}
},
"system": {
"cpu": {
"cores": 4
},
"load": {
"1": 2.22,
"15": 1.8,
"5": 1.74,
"norm": {
"1": 0.555,
"15": 0.45,
"5": 0.435
}
}
}
}
Scenario: Listen for stats request
Given An Elastic Agent enrolled in a Kibana instance
And you start the Elastic Agent with the option -E http.enabled=true
And a unix socket is set to listen on -E http.host=unix:///tmp/elastic-agent.sock
When a user makes the request curl -XGET --unix-socket /tmp/elastic-agent.sock 'http://localhost/stats?pretty'
Then the Elastic Agent responds with its stats in JSON format
{
"beat": {
"cpu": {
"system": {
"ticks": 1710,
"time": {
"ms": 1712
}
},
"total": {
"ticks": 3420,
"time": {
"ms": 3424
},
"value": 3420
},
"user": {
"ticks": 1710,
"time": {
"ms": 1712
}
}
},
"info": {
"ephemeral_id": "ab4287c4-d907-4d9d-b074-d8c3cec4a577",
"uptime": {
"ms": 195547
}
},
"memstats": {
"gc_next": 17855152,
"memory_alloc": 9433384,
"memory_total": 492478864,
"rss": 50405376
},
"runtime": {
"goroutines": 22
}
},
"libbeat": {
"config": {
"module": {
"running": 0,
"starts": 0,
"stops": 0
},
"scans": 1,
"reloads": 1
},
"output": {
"events": {
"acked": 0,
"active": 0,
"batches": 0,
"dropped": 0,
"duplicates": 0,
"failed": 0,
"total": 0
},
"read": {
"bytes": 0,
"errors": 0
},
"type": "elasticsearch",
"write": {
"bytes": 0,
"errors": 0
}
},
"pipeline": {
"clients": 6,
"events": {
"active": 716,
"dropped": 0,
"failed": 0,
"filtered": 0,
"published": 716,
"retry": 278,
"total": 716
},
"queue": {
"acked": 0
}
}
},
"system": {
"cpu": {
"cores": 4
},
"load": {
"1": 2.22,
"15": 1.8,
"5": 1.74,
"norm": {
"1": 0.555,
"15": 0.45,
"5": 0.435
}
}
}
}
https://www.elastic.co/guide/en/beats/metricbeat/current/http-endpoint.html
Describe a specific use case for the enhancement or feature:
Via this endpoint you can check the stats of the Elastic Agent; it can also be used to implement a health check for Docker images.
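A container health check against this endpoint could be as simple as the sketch below. The URL assumes the agent was started with -E http.enabled=true -E http.port=5066; the demonstration in main uses a stub httptest server standing in for the real endpoint:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// healthy reports whether the monitoring endpoint answers with HTTP 200.
// A health-check binary would os.Exit(1) on false; here we just return it.
func healthy(endpoint string) bool {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get(endpoint)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	// Stub server standing in for http://localhost:5066/ in this demo.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"beat":"elastic-agent"}`)
	}))
	defer srv.Close()
	fmt.Println("healthy:", healthy(srv.URL)) // healthy: true
}
```

In a Dockerfile, a compiled version of this check could be wired up via a HEALTHCHECK instruction, turning the boolean into an exit code.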
and universal
artifacts
As discussed with @ph @drewpost @mostlyjason and @blakerouse, it'd be useful to have the same fields present in the add_observer/geo_metadata
processors available from the agent. This could be exposed via an API to drive Uptime's UI, showing which geographic regions can run uptime monitoring checks.
Furthermore, it'd be great to automatically fill this data based on cloud metadata where possible, providing sane defaults for common cloud datacenters like us-east-1a on AWS etc.
Beats uses crypto modules for the local keystore (storing passwords and other credentials), for TLS in the outputs, and for TLS support in some push-based inputs (HTTP server, syslog server).
The Go stdlib crypto libraries are not FIPS compliant. Related: golang/go#21734
As different crypto libs might provide different ciphers and such, it would be useful if we could switch the crypto library in use via environment variables.