elastic / elastic-agent
Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.
License: Other
Today, when the Elastic Agent is enrolled into Fleet, the elastic-agent.yml file is backed up and state is written into elastic-agent.yml indicating that the agent is managed. In addition, the Agent writes fleet.yml with more data. On startup, elastic-agent.yml is checked to see if the agent is enrolled. This caused some issues in Cloud initially because the file was overwritten.
This issue is to have a more general discussion around the purposes of these files. Do we need to write state to elastic-agent.yml when enrolled? Is fleet.yml enough? How exactly do we use each file? The goal is to come up with a guideline to make sure future development stays aligned with it.
Original comment by @ph:
The Agent starts the Beats processes as the same user as the agent process, which means root. This is less than ideal if we want to lock down the processes and reduce risk.
TODO:
Define stories
The docker provider, which was implemented as the first dynamic provider, includes all labels in the add_fields processor. It might be a good idea to reduce this information or make it a provider variable, only allowing it to be conditionally added to the input if the user wants it.
Follow up for elastic/beats#20842 (comment)
When checking the docs for the Windows installation of the Elastic Agent, both the start and stop commands for Windows check for an elastic-agent service:
https://www.elastic.co/guide/en/fleet/current/start-elastic-agent.html
https://www.elastic.co/guide/en/fleet/current/stop-elastic-agent.html
If all the steps from the docs have been followed during installation, the name of the service will be Elastic Agent instead. So Stop-Service elastic-agent and Start-Service elastic-agent will fail.
The source code https://github.com/elastic/beats/blob/master/x-pack/elastic-agent/pkg/agent/install/paths_windows.go#L20 confirms the name of the service as well.
The documentation should be updated.
Today we can use https://github.com/elastic/beats/blob/64f70785c0911eeb6f3f6ce5264f61544844ca0f/x-pack/elastic-agent/pkg/agent/application/upgrade/upgrade.go#L78 to define whether a release is upgradable or not. We have added the concept of capabilities in elastic/beats#23848.
We should add this to the capabilities file; it could look like this:
capabilities:
  - upgrade: false
@ruflin @mostlyjason WDYT about making it generic on the actions that we support (https://github.com/elastic/beats/tree/1f1fae56057dce0604f72f2cf0099f9a6f2b75aa/x-pack/elastic-agent/pkg/agent/application/pipeline/actions/handlers)?
capabilities:
  - rule: deny
    action: Upgrade
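To make the proposal concrete, here is a minimal Go sketch of how such deny rules could be evaluated. The Capability type and the default-allow policy are assumptions for illustration, not the real implementation:

```go
package main

import "fmt"

// Capability is a hypothetical rule entry mirroring the proposed
// capabilities file format (rule + action).
type Capability struct {
	Rule   string // "allow" or "deny"
	Action string // e.g. "Upgrade"
}

// Allowed reports whether an action is permitted under the given rules.
// Actions not matched by any deny rule are allowed by default (an assumption).
func Allowed(caps []Capability, action string) bool {
	for _, c := range caps {
		if c.Action == action && c.Rule == "deny" {
			return false
		}
	}
	return true
}

func main() {
	caps := []Capability{{Rule: "deny", Action: "Upgrade"}}
	fmt.Println(Allowed(caps, "Upgrade"))  // denied by the rule above
	fmt.Println(Allowed(caps, "Unenroll")) // no rule matches, allowed
}
```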
Describe the enhancement:
Bringing the Elastic Agent more in line with outputs supported by Beats.
Describe a specific use case for the enhancement or feature:
Enable customers who are using beats to send events/logs to a Kafka broker to be able to create the same environment and functionality using the Elastic Agent. Lack of this capability may be an inhibitor for the adoption of Elastic Agent.
Description
elastic-agent install fails to detect a problem during service start and reports the misleading message Installation was successful and Elastic Agent is running., even though the service hasn't been able to start (e.g. due to a process already bound to port 6789).
The script should at least notify that the agent was installed but there was a problem starting the service.
How to reproduce the bug
# netstat -natop | grep 6789
tcp6 0 0 :::6789 :::* LISTEN 1891/docker-proxy off (0.00/0/0)
ubuntu@server:~$ sudo ./elastic-agent install -f --kibana-url=https://<URL> --enrollment-token=<token>
The Elastic Agent is currently in BETA and should not be used in production
2020-12-03T16:43:31.069+0100 DEBUG kibana/client.go:170 Request method: POST, path: /api/fleet/agents/enroll
Successfully enrolled the Elastic Agent.
Installation was successful and Elastic Agent is running.
The installation script reports Installation was successful and Elastic Agent is running.
but the agent is never enrolled in the Kibana Fleet UI.
In journalctl -u elastic-agent.service we can see the process wasn't able to start because the address was already in use:
# journalctl -u elastic-agent.service
-- Logs begin at Sat 2020-08-29 18:15:02 CEST, end at Thu 2020-12-03 16:43:51 CET. --
nov 17 16:25:53 server systemd[1]: Stopped Elastic Agent is a unified agent to observe, monitor and protect your system..
nov 17 16:25:53 server systemd[1]: Started Elastic Agent is a unified agent to observe, monitor and protect your system..
nov 17 16:25:53 server elastic-agent[1514327]: starting GRPC listener: listen tcp 127.0.0.1:6789: bind: address already in use
nov 17 16:25:53 server systemd[1]: elastic-agent.service: Main process exited, code=exited, status=1/FAILURE
nov 17 16:25:53 server systemd[1]: elastic-agent.service: Failed with result 'exit-code'.
nov 17 16:27:53 server systemd[1]: elastic-agent.service: Scheduled restart job, restart counter is at 2.
nov 17 16:27:53 server systemd[1]: Stopped Elastic Agent is a unified agent to observe, monitor and protect your system..
nov 17 16:27:53 server systemd[1]: Started Elastic Agent is a unified agent to observe, monitor and protect your system..
nov 17 16:27:54 server elastic-agent[1514463]: starting GRPC listener: listen tcp 127.0.0.1:6789: bind: address already in use
nov 17 16:27:54 server systemd[1]: elastic-agent.service: Main process exited, code=exited, status=1/FAILURE
nov 17 16:27:54 server systemd[1]: elastic-agent.service: Failed with result 'exit-code'.
...
Workaround
We can change the default port in /opt/Elastic/Agent/elastic-agent.yml from 6789 to e.g. 16789:
fleet:
  enabled: true
agent.grpc:
  address: localhost
  port: 16789
And then restart the service and check that service is up:
# sudo systemctl start elastic-agent.service
#
# sudo journalctl -u elastic-agent.service -f
-- Logs begin at Sat 2020-08-29 18:15:02 CEST. --
dic 03 16:53:20 server systemd[1]: Started Elastic Agent is a unified agent to observe, monitor and protect your system..
A user brought this to us and I am logging a quick ticket to capture minimal details.
Apparently the Agent failed to install metricbeat, due to a lack of disk space.
It wasn't clear immediately, but some subsequent log diving shows the reason:
/var/lib/elastic-agent/logs/elastic-agent-json.log.2:{"log.level":"error","@timestamp":"2020-11-11T11:13:12.997-0500","log.origin":{"file.name":"log/reporter.go","file.line":36},"message":"2020-11-11T11:13:12-05:00: type: 'ERROR': sub_type: 'FAILED' message: Application: filebeat--7.9.3--36643631373035623733363936343635[2ff0699f-4ef0-4d57-84b3-053a760c711e]: State changed to FAILED: TarInstaller: error writing to /var/lib/elastic-agent/install/filebeat-7.9.3-linux-x86_64/NOTICE.txt: write /var/lib/elastic-agent/install/filebeat-7.9.3-linux-x86_64/NOTICE.txt: no space left on device","ecs.version":"1.5.0"}
What should we expect of Elastic Agent here? I'm not sure what it can do... except purge old log files? What else can we think of? And what should be shown in the Activity log, etc.?
Thanks @P1llus for bringing it to us in Slack.
We allow people to see and edit the configuration in Fleet, so it might be a good idea to share validations, or at least high-level validations, with the expected fields like ids, type and metricset. This can also be useful for communicating with other teams.
We have created an example configuration, but this is not enough to express what we are expecting. We need a formal way to define it; one way would be to use a json-schema definition that could be used by both the agent and fleet.
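As a rough illustration of the kind of high-level validation meant here, a Go sketch that checks an input block for the expected fields. The required-field list and the function are illustrative assumptions, not the real schema:

```go
package main

import (
	"errors"
	"fmt"
)

// validateInput checks that a parsed input block carries the fields both
// Fleet and the agent expect. The field list here is only an example of
// what a shared schema could enforce.
func validateInput(block map[string]interface{}) error {
	for _, field := range []string{"id", "type", "metricset"} {
		if _, ok := block[field]; !ok {
			return errors.New("missing required field: " + field)
		}
	}
	return nil
}

func main() {
	ok := map[string]interface{}{"id": "1", "type": "system/metrics", "metricset": "cpu"}
	bad := map[string]interface{}{"id": "2"}
	fmt.Println(validateInput(ok))
	fmt.Println(validateInput(bad))
}
```

A json-schema definition would express the same constraints declaratively, so both the agent and Fleet could consume one source of truth.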
Summary
When the elastic agent installs a new input, it starts a new process or restarts an existing process with additional input configuration. The agent does not apply any resource limits to the created subprocesses, potentially leading to the processes competing for available resources. This can become an issue when multiple processes run with high load, reaching the limit of available resources. We need a solution for limiting resource usage per subprocess.
It becomes especially important when the resources for the elastic agent are already restricted, which will be the case for the hosted elastic agent.
There is currently no concept available for how the memory/cpu shares available to the elastic agent should be distributed between processes. Most probably we would not want to limit the subprocesses by default, but only if configured. For hosted agents the orchestrator should pass a configuration to the container where the agent is running.
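A minimal sketch of one possible allocation policy, dividing the agent's memory budget between subprocesses by configured weights. The agent has no such mechanism today; the function and its weights are hypothetical:

```go
package main

import "fmt"

// splitMemory divides a total memory budget (in MiB) between subprocesses
// according to configured weights. Integer division means a few MiB may be
// left unassigned; a real implementation would decide how to handle that.
func splitMemory(totalMiB int, weights map[string]int) map[string]int {
	sum := 0
	for _, w := range weights {
		sum += w
	}
	out := make(map[string]int, len(weights))
	for name, w := range weights {
		out[name] = totalMiB * w / sum
	}
	return out
}

func main() {
	// Hosted agent limited to 1024 MiB, metricbeat weighted twice filebeat.
	fmt.Println(splitMemory(1024, map[string]int{"filebeat": 1, "metricbeat": 2}))
}
```

Enforcing the computed limits would then be the job of cgroups (or the orchestrator, for hosted agents), which is the open question below.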
TODO
cgroups?
Relates: #151
I'd like to ask for your recommendation for users that prefer to run Elastic Agent in the container (let's say due to security reasons).
Let's discuss the scenario:
The integrated product is nginx running in a container. It produces logs that are stored locally in the container and rotated. As the agent is running in a different container, it can't simply access the produced logs.
What is your recommendation in this particular case? Should the user somehow expose the log files? Mirror them?
Background -
I had an interesting talk with @ycombinator about possibilities and testing scenarios, and it looks like we will both have to nail this problem (forcing the agent to watch logs produced in a different container).
The agent.type field used by Elastic Agent currently leaks details about the underlying Beats. With my 7.13 agent, it's set to "filebeat" for logs and "metricbeat" for metrics. We don't want to create a user dependency on this Beats information because it may be refactored out in the future.
The solution for now is to not populate the field and add it later. The reason is that the elastic-agent is not the actual agent in all scenarios; this is true when it runs as a server (HTTP server), where the agent.id and type could come from the sender. Leaving it out reduces this mess for now, so we have an opportunity to clean it up. Adding the field later is an addition; removing it later is a breaking change. I think there are many other meta fields that we can likely already use for debugging.
As noted in elastic/beats#20994, EQL just does some.field != null; we should investigate whether we could converge on the same syntax.
When using the /stats API, the event is returned as-is. When collecting stats from Beats, the beats namespace does contain process metrics like cgroup, CPU, and memory usage, but the process name is not included. When metrics are queried via Agent, the beats namespace should become process, plus a field process.name should be added, containing the process name known to the agent (e.g. filebeat-default-monitoring).
Although we have no event routing available yet, data sources should be encouraged to provide data stream metadata as hints (which can still be ignored). When querying process stats via the agent, the JSON document published by Agent should include the data stream fields.
The change could be added to libbeat, or (maybe easier) as a processing step in Agent. When done in Agent, we already have a place where we can massage Endpoint stats in the future.
Hi, this is a spawn-off of testing done in support of elastic/beats#26665.
I'm transferring this issue from the Endpoint team to Beats / Agent.
From @dikshachauhan-qasource: we have attempted to validate the endpoint behavior on French VMs and found it working fine with a small glitch.
Observations:
Scenario1:
Installed agent under a policy having endpoint.
Agent remained in the updating state until we manually restarted the elastic-agent service.
Host then updated to healthy status and was available under Endpoint tab with status 'success'.
Data streams were working fine.
All binaries were in running state.
Recording:
https://user-images.githubusercontent.com/12970373/127567223-9c1fd3ee-4216-4837-b0a6-2d6cb45d0300.mp4
Scenario2:
Unenrolled then Re-Installed agent under same policy having endpoint.
Observations same as mentioned above.
Scenario3:
Unenrolled then Re-Installed agent under Default policy. Later after installation of agent, we added Endpoint security.
Observations same as mentioned above.
screenshot:
Logs.zip:
logs-french-win-10-agent.zip
Hi,
I was researching different ways of enabling verbose logging in Elastic Agent in terms of elastic/elastic-package#86 . I'd like to collect Elastic-Agent and subprocesses logs at some folder (which can be mounted and exposed externally).
Then, I came to a different conclusion: the standard log output of Elastic Agent is silent even though the application is running in the background and logging data to .log files:
{"isInitialized":true}{"isInitialized":true}{"list":[{"id":"935ec4c0-5415-11eb-b36e-d53bf68e2a18","active":true,"api_key_id":"4lTA8XYBqSxScuxU6GUe","name":"Default (de7d6165-b378-4b41-a770-2b419e856d98)","policy_id":"8afbdb60-5415-11eb-b36e-d53bf68e2a18","created_at":"2021-01-11T14:02:00.844Z"}],"total":1,"page":1,"perPage":20}
935ec4c0-5415-11eb-b36e-d53bf68e2a18
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 324 100 324 0 0 11899 0 --:--:-- --:--:-- --:--:-- 12000
NGxUQThYWUJxU3hTY3V4VTZHVWU6dVlRSWN5WERSUXl2RDRDX1JoYnRfZw==
The Elastic Agent is currently in BETA and should not be used in production
Successfully enrolled the Elastic Agent.
Elastic Agent might not be running; unable to trigger restart
(no more log messages)
This behavior doesn't seem to be consistent with other dockerized apps like Kibana or Elasticsearch, which print levelled log messages by default. What do you think about changing this behavior to a similar pattern? I would also appreciate a special combined logging mode that can merge log outputs from multiple sources like Elastic Agent, Metricbeat, Filebeat, etc.
This is related to elastic/kibana#75236 and elastic/kibana#99068, both of which are longer-term efforts around enabling more granular status reporting of "integrations" that are running on Elastic Agent. But Agent has no concept of integrations, only which inputs/processes are running.
Still, reporting that information is useful and would get us closer to our longer-term goals. In the short term, this would enable Endpoint to filter agents by which ones are running Endpoint without doing additional JOIN-like queries.
I'd like to propose that agents report the set of inputs/processes they are running in their local_metadata field. One thing to consider in deciding the data structure for how this information should be stored is that in the future we will want to allow subprocesses to report their own additional meta information, such as the Endpoint process reporting an "isolated" status.
Original comment by @michalpristas:
At the moment the process is as follows, using gRPC:
1. The beat does a manager.Register call to register the settings it knows how to handle.
2. The agent pushes configuration to the Config(string) endpoint and the beat tries to parse it.
3. The parsed configuration (map[string]iface{}) is passed to a fleet/manager using a channel.
4. fleet/manager breaks the configuration into configuration blocks it understands (based on CM).
5. The blocks are applied via the Central Management mechanism.
When a failure occurs in steps 1, 2 or 3, it is returned to the caller as an error. But when an error occurs in step 4 or 5, the agent is not aware of the failure (unless the beat crashes, in which case the agent tries to restart it and apply the config again).
We need to think of a way to propagate failures from steps 4 and 5 back to the agent.
We also need to think about pairing an experienced failure with a concrete configuration version (this is more or less a question for @mattapperson).
At this moment the beat does not have a concept of a configuration version at all, nor does the agent propagate a version down the stack.
In situations where the Elastic Endpoint Security integration fails to install successfully, Agent appears to continuously retry the installation. It's not clear whether there is a limit or cap on the retries, but there does not appear to be. This results in unnecessary resource utilization, including filling up the elastic-agent.log file.
Details:
Problem:
In certain execution contexts PowerShell will convert any line of text sent to STDERR into an Error object. This will no doubt go unhandled, thus the command is failed by PowerShell:
PS C:\Users\Administrator\Documents\EC_Spout> C:\Users\Administrator\Documents\EC_Spout\agent_install+enroll.ps1
Uninstalling existing
Elastic Agent has been uninstalled.
The Elastic Agent is currently in BETA and should not be used in production
elastic-agent.exe : 2020-11-24T07:32:24.902-0800 DEBUG kibana/client.go:170 Request method: POST, path:
/api/fleet/agents/enroll
At C:\Users\Administrator\Documents\EC_Spout\agent_install+enroll.ps1:91 char:1
+ & "$download_dir\elastic-agent-$stack_ver-windows-x86_64\elastic-agen ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (2020-11-24T07:3...t/agents/enroll:String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError
Google "PS NativeCommandError" to discover all the happy people that trip over this error.
Solution:
On Windows systems (especially under PowerShell), do not send anything to STDERR unless it really is an error and the command should be terminated/failed.
Ways to reproduce:
In an interactive PS session, this error handling will most likely not be enabled. Under ISE it most often is.
Open PowerShell ISE and write a ps1 script:
& "$download_dir\elastic-agent-7.10.0-windows-x86_64\elastic-agent.exe" install -f -k "$kn_url" -t "$agent_token"
Run the script with the 'play' button in the toolbar (after saving it).
Doing it via ISE like this was the easiest way, I think, to have PS in such an error handling mode. I have experienced the same problem with PS scripts started by the Task Scheduler.
Extra info:
I maintain scripts to automate starting a demo env.: https://github.com/ElasticSA/ec_spout (more info for Elastic employees here: https://wiki.elastic.co/display/PRES/EC+Spout )
Describe the feature:
When you run the enrolment command on Elastic Agent on a host where it has already been installed, it terminates and you get the following error (on Mac at least):
"Error: already installed at: /Library/Elastic/Agent"
To continue, you then need to work out how to uninstall agent and then re-run the command.
It is likely that people doing initial testing will try to enrol a test host in more than one cluster as they iterate dev/PoC clusters, so it would be useful if Agent handled the situation better.
The ideal scenario is Agent would ask if you want to change the configuration of the installed agent to enrol in the new cluster. Alternatively, it could ask for confirmation and then uninstall the existing agent for the user.
As a fallback, it could at least provide the full uninstall command to the user to be able to continue.
Describe a specific use case for the feature:
Setup of Elastic Agent
Revisit all the defaults. We should be able to run Fleet without having a physical configuration on disk, and we should be able to override path.* using -E like any other Beat.
docker run --name centos centos:7 tail -f /dev/null
docker exec -ti centos bash
curl https://snapshots.elastic.co/8.0.0-3ce083a1/downloads/beats/elastic-agent/elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -o /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm
curl https://raw.githubusercontent.com/gdraheim/docker-systemctl-replacement/master/files/docker/systemctl.py -o /usr/bin/systemctl
yum localinstall /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -y
systemctl enable elastic-agent
systemctl start elastic-agent
Run top; there should be only one process for the elastic-agent. Then run systemctl restart elastic-agent and check top again.
Expected: after the initial restart, the elastic-agent appears once, not in the Z state.
Actual: after the initial restart, the elastic-agent appears twice, one in the Z state and the other in the S state (as shown in the attachment).
This behavior persists across multiple restarts: the elastic-agent process gets into the zombie state each time it is restarted (note that I restarted it three times, so there are 3 zombie processes):
docker run -d --name centos centos:7 tail -f /dev/null
docker exec -ti centos bash
Inside the container
curl https://snapshots.elastic.co/8.0.0-3ce083a1/downloads/beats/elastic-agent/elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -o /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm
curl https://raw.githubusercontent.com/gdraheim/docker-systemctl-replacement/master/files/docker/systemctl.py -o /usr/bin/systemctl
yum localinstall /elastic-agent-8.0.0-SNAPSHOT-x86_64.rpm -y
systemctl enable elastic-agent
systemctl start elastic-agent
systemctl restart elastic-agent
top
Describe the enhancement:
Currently there's no way to unenroll an elastic-agent from the client side
Describe a specific use case for the enhancement or feature:
When running ephemeral instances (containers, for example) each can enroll, but when the container is stopped we end up with stranded offline instances in fleet, which then takes two commands per host on the Fleet screen (unenroll and force unenroll, because they never unenroll), for a total of 6 clicks, plus delays, for each host.
If there were an unenroll subcommand for ./elastic-agent, it could be called during container teardown.
Describe the enhancement:
With Beats, the configuration was available to metricbeat and filebeat locally on the host, but with agent and packages we moved the configuration definition into the package registry. So when Kubernetes discovers that a pod runs software that we monitor, either through dynamic input conditionals in the agent config or via hints-based discovery, the agent needs to download the integration config for the auto-discovered software. When we deliver this enhancement, the agent will automagically ship data to the right data streams in Elasticsearch, similar to how Beats do today. The user still needs to install the right package in Kibana. We will tackle the auto-installation of packages in a separate issue.
Example scenario:
A worker node is running nginx in a pod, and through dynamic inputs or hints-based auto-discovery the agent detects the existence of nginx on that worker node. The agent is able to retrieve the nginx configuration for metrics and logs, fill in the values provided in the auto-discovery configuration (e.g. see the configuration examples for metricbeat here), and ship data to Elasticsearch successfully.
The Elastic Agent was running for a few minutes and I changed the logging level in the Fleet UI from Info to Debug. This all seems to have worked, but the first few lines that were logged looked as follows:
2021-05-03T19:26:23.569Z INFO process/app.go:176 Signaling application to stop because of shutdown: metricbeat--7.13.0-SNAPSHOT
2021-05-03T19:26:24.066Z INFO log/reporter.go:40 2021-05-03T19:26:24Z - message: Application: filebeat--7.13.0-SNAPSHOT[4f12dd1d-f096-40b1-8bf4-8a0e66722775]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-05-03T19:26:24.066Z INFO log/reporter.go:40 2021-05-03T19:26:24Z - message: Application: filebeat--7.13.0-SNAPSHOT--36643631373035623733363936343635[4f12dd1d-f096-40b1-8bf4-8a0e66722775]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-05-03T19:26:24.066Z INFO log/reporter.go:40 2021-05-03T19:26:24Z - message: Application: metricbeat--7.13.0-SNAPSHOT--36643631373035623733363936343635[4f12dd1d-f096-40b1-8bf4-8a0e66722775]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-05-03T19:26:24.066Z INFO log/reporter.go:40 2021-05-03T19:26:24Z - message: Application: metricbeat--7.13.0-SNAPSHOT[4f12dd1d-f096-40b1-8bf4-8a0e66722775]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-05-03T19:26:24.999Z ERROR fleet/fleet_gateway.go:167 context canceled
2021-05-03T19:26:26.378Z ERROR fleet/fleet_gateway.go:167 context canceled
2021-05-03T19:26:27.852Z INFO fleet/fleet_gateway.go:298 Fleet gateway is stopping
2021-05-03T19:26:27.852Z INFO status/reporter.go:236 Elastic Agent status changed to: 'online'
I stumbled over the two ERROR log entries related to context, which also contain very little "context" about what they are about.
To help support dynamic inputs (elastic/beats#19225), Elastic Agent needs to add the ability to debug the providers used for variable substitution. This issue is to track the debugging effort; for information about variable substitution see elastic/beats#20781.
Providers obviously add a lot of uncertainty about the resulting configuration that Elastic Agent will be running with. To ensure that the feature is deployed correctly and that providers are working as expected, debugging needs to be a top priority in the implementation.
With the new ability to communicate with the running daemon, the inspect command should be changed to talk to the running daemon and return the current configuration that is being used in memory. This will ensure that with running providers like Docker and Kubernetes it is easy to inspect what the resulting configuration is.
The current inspect and output commands can be combined and moved under the debug subcommand. (Note: this does not connect to the currently running Elastic Agent.)
$ ./elastic-agent debug config
It should be possible to watch the configuration as changes come in with --watch:
$ ./elastic-agent debug config --watch
A new debug command should be implemented that runs a single provider and outputs what it is currently providing back to the Elastic Agent. (Note: this does not connect to the currently running Elastic Agent.)
Example outputting docker provider inventory key/value mappings:
$ ./elastic-agent debug provider docker
{"id": "1", "mapping": {"id": "1", "paths": {"log": "/var/log/containers/1.log"}}, "processors": {"add_fields": {"container.name": "my-container"}}}
{"id": "2", "mapping": {"id": "2", "paths": {"log": "/var/log/containers/2.log"}}, "processors": {"add_fields": {"container.name": "other-container"}}}
{"id": "2", "mapping": nil}
Example rendering configurations with changes:
$ ./elastic-agent debug provider docker -c testing-conf.yml
# {"id": "1", "mapping": {"id": "1", "paths": {"log": "/var/log/containers/1.log"}}, "processors": {"add_fields": {"container.name": "my-container"}}}
inputs:
  - type: logfile
    path: /var/log/containers/1.log
    processors:
      - add_fields:
          container.name: my-container
# {"id": "2", "mapping": {"id": "2", "paths": {"log": "/var/log/containers/2.log"}}, "processors": {"add_fields": {"container.name": "other-container"}}}
inputs:
  - type: logfile
    path: /var/log/containers/1.log
    processors:
      - add_fields:
          container.name: my-container
  - type: logfile
    path: /var/log/containers/2.log
    processors:
      - add_fields:
          container.name: other-container
# {"id": "2", "mapping": nil}
inputs:
  - type: logfile
    path: /var/log/containers/1.log
Elastic Stack products will start to ship deprecation logs to a specific index based on the new indexing strategy. Elastic Agent should do the same and ship deprecation logs to logs-deprecation.elastic.agent-*. It should also be discussed how and where the deprecation logs of the sub-processes are shipped.
The Elastic Agent must connect to the fleet-server for enrollment. There are several issues that can happen around connectivity to the fleet-server. If the enrollment doesn't work, it would be nice to have a command line tool to investigate what the actual issue is: certificate issues, port not open, host not reachable, wrong token, etc.
This idea was triggered by issues like this one: elastic/fleet-server#235 (comment)
Debugging Elastic Agent is currently not as easy as it should be. In case of an issue, the right paths for the logs have to be found and read one by one. It would be very convenient if Elastic Agent offered a command to get the logs and metrics.
To tail all the logs, something like the following would be useful:
elastic-agent logs -f
Maybe later support for filtering logs from only a specific process could be added. One step further would be, that on the fly the logging level could be changed.
The same is true for metrics. It would be nice if a snapshot of the metrics could be gathered with something like the following:
elastic-agent metrics
When we created the "when" clause we were under the impression that all the Beats were actually equal and supported all the same outputs. This was not completely true: APM Server supports a subset of the outputs that Beats support; it does not support Redis.
Maybe we should just move to the conditions and rely on capabilities
We should improve the reporting if an output is used and not supported by a running process, currently it will fail silently.
In the Agent code we are using errwrap and go-multierror; these dependencies are not necessary and we should remove them.
Currently we report a very basic status during check-in.
To give users more details on the status of their agents, we want to send a more complete policy status (the format is defined in elastic/kibana#82298).
The status will be sent during agent check-in.
Open questions:
How do we persist status in ES?
Should the agent also send that data to ES directly?
Is this already the case if status changes are in the agent logs? If yes, will this log data be searchable?
Pro:
@blakerouse @ruflin I am curious to have your thoughts here on how this can work with the future Fleet Server too
The Fleet settings UI allows setting all kinds of values in the Elasticsearch output configuration. There is no validation, allowing any kind of input.
enabled: false
: the Elastic Agent kills the sub processes and only logs that it stopped them. There is no indicator of why they were stopped and no additional entries in the sub process logs.
bulk_max_size: "4s"
: the Elastic Agent keeps restarting the sub processes, which can't start because they all have an invalid configuration. The sub process logs include detailed information, which is helpful for debugging.
Ideally there would be some validation preventing invalid configurations from being stored.
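One way such validation could work is to decode the settings into a typed structure before accepting them, so a string like "4s" is rejected for an integer field up front. A minimal sketch, using JSON and invented field names purely for illustration (the real output settings are YAML and richer):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// esOutput is an assumed, simplified typed view of the Elasticsearch
// output settings. Decoding into concrete types rejects values like
// bulk_max_size: "4s" before the config reaches any sub process.
type esOutput struct {
	Enabled     *bool `json:"enabled"`
	BulkMaxSize int   `json:"bulk_max_size"`
}

func validate(raw []byte) error {
	var out esOutput
	if err := json.Unmarshal(raw, &out); err != nil {
		return fmt.Errorf("invalid output settings: %w", err)
	}
	if out.BulkMaxSize < 0 {
		return fmt.Errorf("bulk_max_size must be >= 0, got %d", out.BulkMaxSize)
	}
	return nil
}

func main() {
	fmt.Println(validate([]byte(`{"bulk_max_size": 50}`)))   // <nil>
	fmt.Println(validate([]byte(`{"bulk_max_size": "4s"}`))) // type-mismatch error
}
```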
I wasn't certain whether this belongs here and/or to Fleet.
The Elastic Agent had been running for a few minutes when I changed the logging level in the Fleet UI from Info to Debug. This all seems to have worked, but the first few lines that were logged looked as follows:
2021-05-03T19:26:23.569Z INFO process/app.go:176 Signaling application to stop because of shutdown: metricbeat--7.13.0-SNAPSHOT
2021-05-03T19:26:24.066Z INFO log/reporter.go:40 2021-05-03T19:26:24Z - message: Application: filebeat--7.13.0-SNAPSHOT[4f12dd1d-f096-40b1-8bf4-8a0e66722775]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-05-03T19:26:24.066Z INFO log/reporter.go:40 2021-05-03T19:26:24Z - message: Application: filebeat--7.13.0-SNAPSHOT--36643631373035623733363936343635[4f12dd1d-f096-40b1-8bf4-8a0e66722775]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-05-03T19:26:24.066Z INFO log/reporter.go:40 2021-05-03T19:26:24Z - message: Application: metricbeat--7.13.0-SNAPSHOT--36643631373035623733363936343635[4f12dd1d-f096-40b1-8bf4-8a0e66722775]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-05-03T19:26:24.066Z INFO log/reporter.go:40 2021-05-03T19:26:24Z - message: Application: metricbeat--7.13.0-SNAPSHOT[4f12dd1d-f096-40b1-8bf4-8a0e66722775]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'
2021-05-03T19:26:24.999Z ERROR fleet/fleet_gateway.go:167 context canceled
2021-05-03T19:26:26.378Z ERROR fleet/fleet_gateway.go:167 context canceled
2021-05-03T19:26:27.852Z INFO fleet/fleet_gateway.go:298 Fleet gateway is stopping
2021-05-03T19:26:27.852Z INFO status/reporter.go:236 Elastic Agent status changed to: 'online'
I stumbled over the two ERROR log entries related to context, which also contain very little "context" about what they refer to.
Describe the enhancement:
Users may want to protect and observe their network shared drives, so we could support them.
Using network shared drives is currently not recommended. We have no automated tests to verify it works, and anecdotal (but old) data indicates that (at least) Filebeat would have problems running there. No further specifics are available at this time (testing would be required to generate example errors, etc.).
I will leave this logged as an enhancement for now and add a brief note tied to this in the obs-docs for Agent.
At the moment the logs of a Filebeat started by the Agent are polluted with the debug logs of the add_docker_metadata processor. Example log line:
{"log.level":"error","@timestamp":"2020-07-23T16:00:00.372+0200","log.logger":"add_docker_metadata.docker","log.origin":{"file.name":"docker/watcher.go","file.line":320},"message":"Error watching for docker events: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","ecs.version":"1.5.0"}
As I am not trying to test anything Docker related, these lines are distracting. Normally I just remove all processors when debugging Filebeat issues. However, as the Agent manages the configuration of these Filebeat instances, I do not have access to these global processors. Also, the inspect subcommand does not show whether these processors are included or not. (Or at least I am not aware of any flag which could show it.) But based on the logs, these processors are enabled.
$ ./elastic-agent inspect output -o default -p filebeat
Action ID: 856251a6-a6f8-40e9-b71b-af54738c3280
[default] filebeat:
filebeat:
inputs:
- exclude_files:
- .gz$
id: logfile-postgresql.log
index: logs-postgresql.log-default
meta:
package:
name: postgresql
version: 0.1.0
multiline:
match: after
negate: true
pattern: '^\d{4}-\d{2}-\d{2} '
name: postgresql-1
paths:
- /home/n/go/src/github.com/elastic/beats/filebeat/module/postgresql/log/test/*.log
processors:
- add_fields:
fields:
ecs:
version: 1.5.0
target: ""
- add_fields:
fields:
name: postgresql.log
namespace: default
type: logs
target: dataset
- add_fields:
fields:
dataset: postgresql.log
target: event
type: log
output:
elasticsearch:
api_key: {{key}}
hosts:
- {{outputhost}}
This not only impacts developers but users as well: the logs of their agent-managed Beat instances will be full of these docker/aws/etc. processor logs even when those are irrelevant.
For confirmed bugs, please report:
In x-pack/elastic-agent, run mage package.
Run ./elastic-agent enroll using the key you get from the Kibana UI.
Run ./elastic-agent run.
After that, I'm seeing this error:
Application: metricbeat--8.0.0[a3d097d2-59a5-4517-a5cc-86c906ac71c2]: State changed to FAILED: 2 errors occurred: * package '/home/alexk/go/src/github.com/elastic/beats/x-pack/elastic-agent/build/distributions/elastic-agent-8.0.0-linux-x86_64/data/elastic-agent-66d393/downloads/metricbeat-8.0.0-linux-x86_64.tar.gz' not found: open /home/alexk/go/src/github.com/elastic/beats/x-pack/elastic-agent/build/distributions/elastic-agent-8.0.0-linux-x86_64/data/elastic-agent-66d393/downloads/metricbeat-8.0.0-linux-x86_64.tar.gz: no such file or directory * call to 'https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-8.0.0-linux-x86_64.tar.gz' returned unsuccessful status code: 404: /go/src/github.com/elastic/beats/x-pack/elastic-agent/pkg/artifact/download/http/downloader.go[142]: unknown error
The tarballs that come packaged in data/elastic-agent-HASH/downloads/ are being deleted. The files are there when I unpack the elastic-agent tarball, and afterwards at least one has been removed. I also discovered that if you try to copy a new tarball into the downloads/ directory while elastic-agent is running, it instantly deletes it. Sometimes it's Metricbeat, sometimes it's Filebeat. I'm seeing it with install as well as enroll. Manually unpacking the tarballs and putting them in install/ doesn't help.
The default port for fleet-server is 8220. When enrolling an Elastic Agent with --url=http://localhost, the port 8220 is picked by default. The same is the case if https is used. On Cloud, fleet-server is exposed on 443/9243, so if no port is specified in the UI or during enrollment, it does not work by default.
I want to discuss whether this is the expected behaviour or whether we should default to 80/443 when no port is specified.
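The two behaviours under discussion can be sketched with net/url. This is an illustrative helper, not the Agent's actual implementation; the flag controlling which default applies is an assumption:

```go
package main

import (
	"fmt"
	"net/url"
)

// normalizeFleetURL appends a default port when the user omits one.
// Whether the default should be fleet-server's 8220 (current behaviour)
// or the scheme's 80/443 is exactly the open question; both are shown.
func normalizeFleetURL(raw string, schemeDefaults bool) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.Port() == "" {
		port := "8220" // current fleet-server default
		if schemeDefaults {
			if u.Scheme == "https" {
				port = "443"
			} else {
				port = "80"
			}
		}
		u.Host = u.Hostname() + ":" + port
	}
	return u.String(), nil
}

func main() {
	s, _ := normalizeFleetURL("http://localhost", false)
	fmt.Println(s) // http://localhost:8220
	s, _ = normalizeFleetURL("https://localhost", true)
	fmt.Println(s) // https://localhost:443
}
```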
Currently I am developing a non-interactive installation of the Elastic Agent for multiple platforms and would like to be able to run "elastic-agent run ...args..." from my application. I am able to do so, but since it is not installed as a service, on Windows I can only send terminate signals to shut it down (Windows, of course, doesn't have proper signal handling; on Unix/macOS I can just send a HUP, so this isn't as much of an issue). The elastic-agent control proto only has a "Restart" command but no "Stop" command, or I would use elastic-agent-client to shut the application down gracefully. If I terminate it on Windows, the child processes (Filebeat, etc.) are not cleaned up and, worse, they grow endlessly before crashing (~32GB and counting). Alternatively, if the application would clean up when receiving a Windows terminate event (sans /F), that would also be perfect.
Describe the enhancement:
All the Beats have a setting to start an endpoint where you can check their stats; this is useful for monitoring the internal state of the Beat. The feature can expose an HTTP port, a unix socket, or a named pipe.
Scenario: Listen for basic request
Given An Elastic Agent enrolled in a Kibana instance
And you start the Elastic Agent with the option -E http.enabled=true
And a host or IP is set to listen on -E http.host=localhost
And a port is set to listen on -E http.port=5066
When a user makes the request curl -XGET 'http://localhost:5066/?pretty'
Then the Elastic Agent responds with its basic info in JSON format
{
"beat": "elastic-agent",
"hostname": "example.lan",
"name": "example.lan",
"uuid": "34f6c6e1-45a8-4b12-9125-11b3e6e89866",
"version": "7.10.0"
}
Scenario: Listen for basic info request
Given An Elastic Agent enrolled in a Kibana instance
And you start the Elastic Agent with the option -E http.enabled=true
And a unix socket is set to listen on -E http.host=unix:///tmp/elastic-agent.sock
When a user makes the request curl -XGET --unix-socket /tmp/elastic-agent.sock 'http://localhost/?pretty'
Then the Elastic Agent responds with its basic info in JSON format
{
"beat": "elastic-agent",
"hostname": "example.lan",
"name": "example.lan",
"uuid": "34f6c6e1-45a8-4b12-9125-11b3e6e89866",
"version": "7.10.0"
}
Scenario: Listen for stats request
Given An Elastic Agent enrolled in a Kibana instance
And you start the Elastic Agent with the option -E http.enabled=true
And a host or IP is set to listen on -E http.host=localhost
And a port is set to listen on -E http.port=5066
When a user makes the request curl -XGET 'http://localhost:5066/stats?pretty'
Then the Elastic Agent responds with its stats in JSON format
{
"beat": {
"cpu": {
"system": {
"ticks": 1710,
"time": {
"ms": 1712
}
},
"total": {
"ticks": 3420,
"time": {
"ms": 3424
},
"value": 3420
},
"user": {
"ticks": 1710,
"time": {
"ms": 1712
}
}
},
"info": {
"ephemeral_id": "ab4287c4-d907-4d9d-b074-d8c3cec4a577",
"uptime": {
"ms": 195547
}
},
"memstats": {
"gc_next": 17855152,
"memory_alloc": 9433384,
"memory_total": 492478864,
"rss": 50405376
},
"runtime": {
"goroutines": 22
}
},
"libbeat": {
"config": {
"module": {
"running": 0,
"starts": 0,
"stops": 0
},
"scans": 1,
"reloads": 1
},
"output": {
"events": {
"acked": 0,
"active": 0,
"batches": 0,
"dropped": 0,
"duplicates": 0,
"failed": 0,
"total": 0
},
"read": {
"bytes": 0,
"errors": 0
},
"type": "elasticsearch",
"write": {
"bytes": 0,
"errors": 0
}
},
"pipeline": {
"clients": 6,
"events": {
"active": 716,
"dropped": 0,
"failed": 0,
"filtered": 0,
"published": 716,
"retry": 278,
"total": 716
},
"queue": {
"acked": 0
}
}
},
"system": {
"cpu": {
"cores": 4
},
"load": {
"1": 2.22,
"15": 1.8,
"5": 1.74,
"norm": {
"1": 0.555,
"15": 0.45,
"5": 0.435
}
}
}
}
Scenario: Listen for stats request
Given An Elastic Agent enrolled in a Kibana instance
And you start the Elastic Agent with the option -E http.enabled=true
And a unix socket is set to listen on -E http.host=unix:///tmp/elastic-agent.sock
When a user makes the request curl -XGET --unix-socket /tmp/elastic-agent.sock 'http://localhost/stats?pretty'
Then the Elastic Agent responds with its stats in JSON format
{
"beat": {
"cpu": {
"system": {
"ticks": 1710,
"time": {
"ms": 1712
}
},
"total": {
"ticks": 3420,
"time": {
"ms": 3424
},
"value": 3420
},
"user": {
"ticks": 1710,
"time": {
"ms": 1712
}
}
},
"info": {
"ephemeral_id": "ab4287c4-d907-4d9d-b074-d8c3cec4a577",
"uptime": {
"ms": 195547
}
},
"memstats": {
"gc_next": 17855152,
"memory_alloc": 9433384,
"memory_total": 492478864,
"rss": 50405376
},
"runtime": {
"goroutines": 22
}
},
"libbeat": {
"config": {
"module": {
"running": 0,
"starts": 0,
"stops": 0
},
"scans": 1,
"reloads": 1
},
"output": {
"events": {
"acked": 0,
"active": 0,
"batches": 0,
"dropped": 0,
"duplicates": 0,
"failed": 0,
"total": 0
},
"read": {
"bytes": 0,
"errors": 0
},
"type": "elasticsearch",
"write": {
"bytes": 0,
"errors": 0
}
},
"pipeline": {
"clients": 6,
"events": {
"active": 716,
"dropped": 0,
"failed": 0,
"filtered": 0,
"published": 716,
"retry": 278,
"total": 716
},
"queue": {
"acked": 0
}
}
},
"system": {
"cpu": {
"cores": 4
},
"load": {
"1": 2.22,
"15": 1.8,
"5": 1.74,
"norm": {
"1": 0.555,
"15": 0.45,
"5": 0.435
}
}
}
}
https://www.elastic.co/guide/en/beats/metricbeat/current/http-endpoint.html
Describe a specific use case for the enhancement or feature:
Via this endpoint you can check the stats of the Elastic Agent; it can also be used to implement a health check for Docker images.
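A container health check against this endpoint could be as simple as the sketch below. The URL assumes the agent was started with -E http.enabled=true -E http.port=5066; the demonstration in main uses a stub httptest server standing in for the real endpoint:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// healthy reports whether the monitoring endpoint answers with HTTP 200.
// A health-check binary would os.Exit(1) on false; here we just return it.
func healthy(endpoint string) bool {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get(endpoint)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	// Stub server standing in for http://localhost:5066/ in this demo.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"beat":"elastic-agent"}`)
	}))
	defer srv.Close()
	fmt.Println("healthy:", healthy(srv.URL)) // healthy: true
}
```

In a Dockerfile, a compiled version of this check could be wired up via a HEALTHCHECK instruction, turning the boolean into an exit code.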
and universal
artifacts
As discussed with @ph @drewpost @mostlyjason and @blakerouse, it'd be useful to have the same fields present in the add_observer/geo_metadata
processors available from the agent. This could be exposed via an API to drive Uptime's UI, showing which geographic regions can run uptime monitoring checks.
Furthermore, it'd be great to automatically fill this data based on cloud metadata where possible, providing sane defaults for common cloud datacenters like us-east-1a on AWS etc.
Beats uses crypto modules for the local keystore (storing passwords and other credentials), for TLS in the outputs, and for TLS support in some push-based inputs (HTTP server, syslog server).
The Go stdlib crypto libraries are not FIPS compliant. Related: golang/go#21734
As different crypto libs might provide different ciphers and such, it would be useful if we could switch the crypto library in use via environment variables.