
fleet-server's Introduction

Fleet Server


Fleet server is the control server to manage a fleet of elastic-agents.

For production deployments the fleet-server is supervised and bootstrapped by an elastic-agent.

Compatibility and upgrades

Fleet-server communicates with Elasticsearch. Elasticsearch must be on the same version or newer. Fleet-server is always on the exact same version as the Elastic Agent running fleet-server. Any Elastic Agent enrolling into a fleet-server must be the same version or older. Kibana is assumed to be on the same version as Elasticsearch. With this, compatibility looks as follows:

Elastic Agent <= Elastic Agent with fleet-server <= Elasticsearch / Kibana

Bugfix versions might differ.

For upgrades, Elasticsearch and Kibana must be upgraded first, then the Elastic Agent running fleet-server, followed by any other Elastic Agents.

MacOSX Version

The golang-crossbuild project produces the images used for testing and building. The golang-crossbuild:1.16.X-darwin-debian10 images expect the minimum MacOSX version to be 10.14+.

Development

The following notes are meant to help developers onboarding to the project get up and running quickly. These notes might change at any time.

Developing Fleet Server and Kibana at the same time

When developing features for Fleet, it may become necessary to run both Fleet Server and Kibana from source in order to implement features end-to-end. To facilitate this, we've created a separate guide hosted here.

IDE config

When using the gopls language server you may run into the following errors in the testing package:

error while importing github.com/elastic/fleet-server/testing/e2e/scaffold: build constraints exclude all Go files in  <path to fleet-server>/fleet-server/testing/e2e/scaffold
/<path to fleet-server>/fleet-server/testing/e2e/agent_install_test.go.
   This file may be excluded due to its build tags; try adding "-tags=<build tag>" to your gopls "buildFlags" configuration
   See the documentation for more information on working with build tags:
   https://github.com/golang/tools/blob/master/gopls/doc/settings.md#buildflags-string

To resolve the first issue, you can add a go.work file to the root of this repo. Copy and paste the following into go.work:

go 1.21

use (
  .
  ./testing
  ./pkg/api
)
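
Alternatively, the same go.work file can be generated with the go work command (a sketch; the go directive in the generated file will reflect your toolchain version):

go work init . ./testing ./pkg/api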

The solution for the second error depends on the IDE and the plugin manager you are using.

neovim

lazyvim package manager

nvim-lspconfig plugin

Add the following to your config files

{
  "neovim/nvim-lspconfig",
  opts = {
    servers = {
      gopls = {
        settings = {
          gopls = {
            buildFlags = { "-tags=e2e integration cloude2e" },
          },
        },
      },
    },
  },
}

After these changes, if you are still running into issues with code suggestions or autocomplete, you may have to clear your Go module cache and restart your LSP clients.

Run the following command to clear your Go module cache:

go clean -modcache

Restart your vim session and run the following command to restart your LSP clients:

:LspRestart

Changelog

The changelog for fleet-server is generated and maintained using the elastic-agent-changelog-tool. Read the installation and usage instructions to get started.

The changelog tool produces fragment files that are consolidated to generate a changelog for each release. Each PR containing a change with user impact (new feature, bug fix, etc.) must contain a changelog fragment describing the change.

A simple example of a changelog fragment is below for reference:

kind: feature
summary: Accept raw errors as a fallback to detailed error type
pr: https://github.com/elastic/fleet-server/pull/2079
issue: https://github.com/elastic/elastic-agent/issues/931
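
Fragments are typically created with the changelog tool itself. A sketch, assuming the tool's new subcommand and using a placeholder fragment name:

# "add-raw-error-fallback" is a placeholder fragment name.
elastic-agent-changelog-tool new add-raw-error-fallback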

Vagrant

A Vagrantfile is provided to get an environment capable of developing and testing fleet-server. In order to provision the vagrant box run:

vagrant plugin install vagrant-docker-compose
vagrant up

Development build

To compile fleet-server in development mode, set the env var DEV=true. When compiled in development mode, fleet-server will support debugging. For example:

SNAPSHOT=true DEV=true make release-darwin/amd64
GOOS=darwin GOARCH=amd64 go build -tags="dev" -gcflags="all=-N -l" -ldflags="-X main.Version=8.7.0 -X main.Commit=31668e0 -X main.BuildTime=2022-12-23T20:06:20Z" -buildmode=pie -o build/binaries/fleet-server-8.7.0-darwin-x86_64/fleet-server .

Change release-darwin/amd64 to release-YOUR_OS/platform. Run make list-platforms to check out the possible values.

The SNAPSHOT flag sets the snapshot version flag and relaxes client version checks. When SNAPSHOT is set, we allow clients of the next version to communicate with fleet-server. For example, if fleet-server is running version 8.11.0 on a SNAPSHOT build, clients can communicate with versions up to 8.12.0.

Docker build

You can build a fleet-server docker image with make build-docker. This image includes the default fleet-server.yml configuration file and can be customized with the available environment variables.

This image includes only fleet-server and is intended for stand-alone mode; see the section about stand-alone Fleet Server below to learn more.

You can run this image with the included configuration file with the following command:

docker run -it --rm \
  -e ELASTICSEARCH_HOSTS="https://elasticsearch:9200" \
  -e ELASTICSEARCH_SERVICE_TOKEN="someservicetoken" \
  -e ELASTICSEARCH_CA_TRUSTED_FINGERPRINT="somefingerprint" \
  docker.elastic.co/fleet-server/fleet-server:8.8.0

You can replace the included configuration by mounting your configuration file as a volume in /etc/fleet-server.yml.

docker run -it --rm \
  -e ELASTICSEARCH_HOSTS="https://elasticsearch:9200" \
  -e ELASTICSEARCH_SERVICE_TOKEN="someservicetoken" \
  -e ELASTICSEARCH_CA_TRUSTED_FINGERPRINT="somefingerprint" \
  -v "/path/to/your/fleet-server.yml:/etc/fleet-server.yml:ro" \
  docker.elastic.co/fleet-server/fleet-server:8.8.0

Running a local stack for development

Fleet-server can be run locally in stand-alone mode alongside Elasticsearch and Kibana for development/testing.

Start by following the instructions to create a development build.

ES and Kibana from SNAPSHOTS API on host

In order to run a development/snapshot fleet-server binary, the corresponding SNAPSHOT builds of Elasticsearch and Kibana should be used. The artifacts can be found with the artifacts API; for example, here's the URL for 8.7-SNAPSHOT artifacts.

The request will result in a JSON blob that describes all artifacts. You will need to gather the URLs for Elasticsearch and Kibana that match your distribution, for example linux/amd64.

TODO: parse the JSON to get the URL
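
One way to extract the URLs, assuming jq is available and ARTIFACTS_API_URL is a placeholder for the artifacts API endpoint mentioned above; the filter collects every url field in the response and keeps the ones matching your platform rather than assuming a specific response schema:

# ARTIFACTS_API_URL is a placeholder; use the artifacts API URL for your version.
curl -s "$ARTIFACTS_API_URL" -o artifacts.json
jq -r '.. | .url? // empty | select(test("linux-x86_64\\.tar\\.gz$"))' artifacts.json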

wget https://snapshots.elastic.co/8.7.0-19f30181/downloads/elasticsearch/elasticsearch-8.7.0-SNAPSHOT-linux-x86_64.tar.gz
wget https://snapshots.elastic.co/8.7.0-19f30181/downloads/kibana/kibana-8.7.0-SNAPSHOT-linux-x86_64.tar.gz

Generally you will need to unarchive and run the binaries:

tar -xzf elasticsearch-8.7.0-SNAPSHOT-linux-x86_64.tar.gz
cd elasticsearch-8.7.0-SNAPSHOT
# elasticsearch.yml can be edited if required
./bin/elasticsearch

The Elasticsearch startup output will include the elastic user's password and a Kibana configuration string.

tar -xzf kibana-8.7.0-SNAPSHOT-linux-x86_64.tar.gz
cd kibana-8.7.0-SNAPSHOT
# kibana.yml can be edited if required
./bin/kibana

The Kibana output will show a URL that will need to be visited in order to configure Kibana with the string Elasticsearch provides.

More instructions for setup can be found in the Elastic Stack Installation Guide.

Elasticsearch configuration

Elasticsearch configuration generally does not need to be changed when running a single-instance cluster for local testing. See the elasticsearch.yml used by our integration tests for an example of our testing configuration.

Kibana configuration

Custom Kibana configuration can be used to preload Fleet with integrations and policies (by using the xpack.fleet.packages and xpack.fleet.agentPolicies attributes). It can also be used to set Fleet settings such as the fleet-server hosts (xpack.fleet.agents.fleet_server.hosts) and outputs (xpack.fleet.outputs). Please see our e2e tests' kibana.yml configuration for a complete example.

Note that our tests run the Elasticsearch container on a Docker network where the host is called elasticsearch, and the fleet-server container is called fleet-server.
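
For illustration, a minimal sketch of the kind of preconfiguration these attributes allow; the values below are placeholders, not the configuration our tests use:

# Hypothetical example values; see the e2e kibana.yml above for the real configuration.
xpack.fleet.packages:
  - name: fleet_server
    version: latest
xpack.fleet.agents.fleet_server.hosts: ["https://fleet-server:8220"]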

fleet-server stand alone

Fleet in Kibana requires a managed fleet-server (generally the one you enroll with the elastic-agent instructions). To disable this requirement for a local fleet-server instance use: xpack.fleet.enableExperimental: ['fleetServerStandalone'] (available since v8.8.0). This is only supported internally and is not intended for end-users at this time.

fleet-server configuration

Access the Fleet UI on Kibana and generate a fleet-server policy. Set the following env vars with the information from Kibana:

  • ELASTICSEARCH_CA_TRUSTED_FINGERPRINT
  • ELASTICSEARCH_SERVICE_TOKEN
  • FLEET_SERVER_POLICY_ID

Alternatively, edit fleet-server.yml to include these details directly.

Note the fleet-server.reference.yml contains a full configuration reference.
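
For example, when running fleet-server from a shell, the values gathered from Kibana can be exported before starting it (the values below are placeholders):

# Placeholder values; copy the real ones from the Fleet UI.
export ELASTICSEARCH_CA_TRUSTED_FINGERPRINT="<fingerprint>"
export ELASTICSEARCH_SERVICE_TOKEN="<service token>"
export FLEET_SERVER_POLICY_ID="<policy id>"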

fleet-server certificates

Create a self-signed TLS CA and a cert+key for the fleet-server instance; you can use elasticsearch-certutil for this:

# Create a CA
../elasticsearch/bin/elasticsearch-certutil ca --pem --out stack.zip
unzip stack.zip
# Create a cert+key
../elasticsearch/bin/elasticsearch-certutil cert --pem --ca-cert ca/ca.crt --ca-key ca/ca.key --ip $HOST_IP_ADDR --out cert.zip
unzip cert.zip

Ensure that server.ssl.enabled: true is set, as well as the server.ssl.certificate and server.ssl.key attributes, in fleet-server.yml.
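
As a rough sketch, assuming these attributes nest under the fleet-server input as in the other configuration examples in this document (the paths are placeholders; check fleet-server.reference.yml for the exact layout):

inputs:
  - type: fleet-server
    server:
      ssl:
        enabled: true
        certificate: /path/to/fleet-server.crt
        key: /path/to/fleet-server.key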

Then run the fleet-server:

./build/binaries/fleet-server-8.7.0-darwin-x86_64/fleet-server -c fleet-server.yml

By default the fleet-server will attempt to connect to Elasticsearch on https://localhost:9200; if this needs to be changed, set it with ELASTICSEARCH_HOSTS. The fleet-server should appear as an agent with the ID dev-fleet-server.
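
For example, to point a development build at a different Elasticsearch host (the host below is a placeholder):

ELASTICSEARCH_HOSTS="https://my-es-host:9200" ./build/binaries/fleet-server-8.7.0-darwin-x86_64/fleet-server -c fleet-server.yml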

Any additional agents will need the ca/ca.crt file to enroll (or will need to use the --insecure flag).

fleet-server+agent on a Vagrant VM

The development Vagrant machine assumes the elastic-agent, beats, and fleet-server repos are in the same folder. Thus, it mounts ../ to /vagrant on the Vagrant machine. The vagrant machine IP address is 192.168.56.43. Use https://192.168.56.43:8220 as fleet-server host.

vagrant up
vagrant ssh

Build the elastic-agent

Once in the Vagrant VM, and assuming that the repos are correctly mounted in /vagrant, build the agent by running:

cd /vagrant/elastic-agent
SNAPSHOT=true EXTERNAL=true PLATFORMS="linux/amd64" PACKAGES="tar.gz" mage -v dev:package # adjust PLATFORMS and PACKAGES to your system and needs.

For detailed instructions, check the Elastic-Agent repo.

Run the elastic-agent+fleet-server in Vagrant

Copy and unpack the elastic-agent .tar.gz file and replace the fleet-server binary in elastic-agent-8.Y.Z-SNAPSHOT-OS-ARCH/data/elastic-agent-*/components/ with the snapshot from the fleet-server repo.

Then go to Kibana > Management > Fleet and follow the instructions there.

The vagrant machine IP address is 192.168.56.43. Use https://192.168.56.43:8220 as fleet-server host.

tl;dr/example:
cp /vagrant/elastic-agent/build/distributions/elastic-agent-8.7.0-SNAPSHOT-linux-x86_64.tar.gz* ./
tar -xzf elastic-agent-8.7.0-SNAPSHOT-linux-x86_64.tar.gz
cd elastic-agent-8.7.0-SNAPSHOT-linux-x86_64
cp build/binaries/fleet-server-8.7.0-SNAPSHOT-linux-x86_64/fleet-server ./data/elastic-agent-494b79/components/
./elastic-agent install ...

Running go test and benchmarks

When developing new features, you will want to make sure your changes do not break any pre-existing functionality. For this reason, as you make changes you might want to run a subset of the tests, or the full test suite, before you create a pull request.

Running go tests

To execute the full unit test suite from your local environment, run the following:

make test-unit

This make target will execute the go unit tests and should normally pass without an issue.

To run the tests in a single package or a single function, run something like:

go test -v ./internal/pkg/checkin -run TestBulkSimple

Running go benchmark tests

It's good practice, before you start your changes, to establish the current baseline of the benchmarks on your machine. To establish the baseline benchmark report you can follow this workflow.

Establish a baseline

BENCH_BASE=base.out make benchmark

This will execute all the Go benchmark tests and write the output into the file build/base.out. If you omit the BENCH_BASE variable, the name defaults to build/benchmark-{git_head_sha1}.out.

Re-running benchmark after changes

After applying your changes to the code, you can reuse the same command, but with a different output file.

BENCH_BASE=next.out make benchmark

At this point you can compare the 2 reports using benchstat.

Comparing the 2 results

BENCH_BASE=base.out BENCH_NEXT=next.out make benchstat

And this will print the difference between the baseline and next results.

You can read more on the benchstat official site.

There are some additional parameters that you can use with the benchmark target.

  • BENCHMARK_FILTER: defines the benchmark filter so that you only run a subset of benchmarks (default: Bench, which runs only the BenchmarkXXXX functions and not the unit tests)
  • BENCHMARK_COUNT: defines the number of iterations go test will run; a larger number helps smooth out run-to-run variation (default: 8). An example combining both is shown below.
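
For example, to compare only a single benchmark with more iterations (the benchmark name below is a placeholder):

# BenchmarkXXXX is a placeholder; substitute a real benchmark name.
BENCHMARK_FILTER=BenchmarkXXXX BENCHMARK_COUNT=16 BENCH_BASE=base.out make benchmark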

E2E Tests

All E2E tests are located in testing/e2e.

To execute them run:

make test-e2e

Refer to the e2e README for information on how to write new tests.

Testing on cloud

Elastic employees can create an Elastic Cloud deployment with a locally built Fleet Server.

To deploy it you can use the following command:

EC_API_KEY=yourapikey make -C dev-tools/cloud cloud-deploy

And then to clean up the deployment:

EC_API_KEY=yourapikey make -C dev-tools/cloud cloud-clean

For more advanced scenarios, you can build a custom Docker image to use in your own Terraform:

make -C dev-tools/cloud build-and-push-cloud-image


fleet-server's Issues

Fleet-server docs overview

This issue is to keep a list of all the things that should be documented for fleet-server.

  • What is fleet-server. Describe architecture on a high level
  • How is fleet-server run / setup
    • Prerequisites, e.g. 7.13+, ECE 2.10+
    • What params exist
    • What environment variables exist
    • What needs to be done in Fleet?
    • How to enroll an Elastic Agent into fleet-server
  • Existing config options
  • Fleet-server in docker
    • How is it different? Non-upgradable, different paths elastic/beats#24817
    • What environment variables exist
    • Example setup for docker-compose
  • Monitoring of fleet-server
  • Setup in Kubernetes
  • Compatibility: Elasticsearch >= Fleet Server >= Elastic Agent. Bugfix releases might differ. Kibana is assumed to be on the same minor as Elasticsearch.

Manage fleet server indices as system indices in the Elastic Stack

Describe the enhancement:

Fleet Server relies on a set of indices to store operational data about Fleet. These indices must be created early in the system lifecycle. Currently the indices are instantiated in Kibana. This enhancement will move the instantiation to the Elastic Stack where the fleet indices will be treated as managed system indices. Moving forward, updates and migrations of these indices will be managed by the system indices plugins.

Describe a specific use case for the enhancement or feature:

We have 7 indices we are dependent on, and one data stream with an ILM policy:

Indices

.fleet-actions
.fleet-agents
.fleet-enrollment-api-keys
.fleet-policies
.fleet-policies-leader
.fleet-servers
.fleet-artifacts

DataStream

.fleet-actions-results and ILM policy for the data stream.

Note: The index creation had moved over to Kibana; however, the data stream was still being created via package management. That was an oversight.

Known issues

  • The indices/datastreams are not instantiated in the system indices plugin until the first document is written to them. The code in Kibana does not currently expect that. Is there a way to "touch" each of the indices at Kibana boot so we don't have to fix the reads of indices in Kibana to handle the "does not yet exist" case?
  • System Indices does not yet support the concept of a system-managed ILM policy. So the ILM policy for .fleet-actions-results will likely remain in the package, or have to be put into Kibana. Question on race condition: what happens if the package is installed before the data stream is instantiated? Does trying to add an ILM policy to a system-managed data stream cause it to be created?

@jaymode will do the initial implementation in the Elastic Stack for this integration. Moving forward, the Fleet Team will pick up maintenance. This is expected to land for 7.13.

Cache configuration keys should be under the input

At the moment the newly added cache key is at the top-level of the configuration. It should have been added nested under the input:

inputs:
  - type: fleet-server
    cache:
       ...

Without this structure, these values will never be able to be adjusted when running under Agent, which is the only officially supported way of running Fleet Server.

[Fleet] creating a policy without integrations breaks fleet server

Description

Blocking elastic/beats#24725

When creating a policy without integrations, it looks like Fleet Server gets stuck in a state where it's not possible to create a new policy change or a new policy with integrations.

How to reproduce

  1. create a policy without any integrations
  2. Try to assign an agent to that policy, or to trigger a policy change on a previous policy; nothing happens.

Report error on install if Fleet server URL is missing

The Fleet server URL is required to be set in Kibana before the Fleet server bootstraps itself. When the user installs Fleet server, we should check if the Fleet server URL has been set. If it has not, we should display an error message in the logs telling the user to set it up in Kibana. Ideally, it will also provide a link to the docs where they can learn more.

Related:

[Meta] Run select inputs on dedicated hosts

Some Fleet operators would like to ensure consistent performance for integrations running on hosted Elastic Agents. This is particularly important for Synthetics because a change in the underlying host resources may create noise that could impact the accuracy of website performance measurement. It is also important for APM server, where operators want to ensure the performance for loading real-time APM and OTel data. Some integrations like Fleet server and AWS will have spiky workloads, so it makes sense to segregate them.

User stories:

  • As a Fleet operator, I want to specify that some types of inputs are only allowed to run on dedicated hosts
  • As a Fleet operator, I want Fleet to automatically assign inputs so that I don't need to manually update agent policies when they undergo maintenance or I add/remove nodes.

Open questions:

  • How important is it to specify a number of hosts > 1?
  • How important is it to run a set of inputs on a dedicated host vs a single input?

Allow the user to edit the connection limit live

We'd like to gracefully manage the capacity of the Fleet Server to prevent it from crashing. In the first phase, there will be hard limit(s) and users must manually adjust them. The overall meta issue defines the subsequent phases where we hope to make this more automatic.

We've already implemented a max connection limit here #122 and we've added this limit to the integration policy here elastic/integrations#803

Requirements:

  • Users can update the limit(s) in the integration policy, and they are applied dynamically to the fleet server (they are "live" settings).
  • Add docs on the limit(s) to the Fleet Server's operator's guide https://github.com/elastic/obs-dc-team/issues/147. Describe how users can edit the limit(s), guidance on what they should set the limit(s) to #134, how they know when the limit(s) are hit.

Cleanup old actions

Currently old actions are not cleaned up. Actions should be cleaned up automatically after a certain time (2 weeks?).

Setup test environment with Elasticsearch

Fleet-Server has a tight integration with Elasticsearch. Part of our test suite should run against an actual version of Elasticsearch, for example the setup part.

Report degraded status from Fleet Server if there is an issue with indices monitoring

Describe the enhancement:
Report degraded status from Fleet Server if there is an issue with indices monitoring and keep Fleet Server running.
The Fleet Server will not be able to detect new actions or new policy changes in this state.
Follow up on #115

Describe a specific use case for the enhancement or feature:
The current implementation of new actions and policy changes detection relies on the Elasticsearch index global checkpoint. In order for it to work properly we can only have one index/one shard for the documents, possibly until we get more robust support for this feature at the Elasticsearch level.
Presently the .fleet-actions and .fleet-policies indices are replaced with aliases.
We need to handle the situation where we have two or more indices behind the alias by mistake: log the error, indicate the degraded status back to the agent, and keep Fleet Server running.

TestMonitor_NewPolicyExists flaky test

Adding the issue so we don't forget about this after coming back from the holidays.

First noticed the intermittent CI failures with the TestMonitor_NewPolicyExists test on the master branch.
Then it was randomly failing, blocking this PR:
#46

CI failure upon merging to master (the PR built ok).
[Screenshots of the CI build log omitted]

I could not reproduce the failure locally; it looks like a timing issue that is more frequent in the slower CI environment when starting the policy monitor in the test.
If you add a small delay in the goroutine that starts the monitor, something like time.Sleep(100*time.Millisecond), here:
https://github.com/elastic/fleet-server/blob/master/internal/pkg/policy/monitor_test.go#L284
you should be able to reproduce the issue.

Decide on the Fleet Server default port

Picking up on async communication initiated by @ruflin: the Fleet Server should run on a default port that is not likely to conflict with other ports. At the moment it runs on 8000.

This issue should reflect the decision about the port.

Improve index changes monitoring

Better index changes monitoring that reduces the number of requests to Elasticsearch (due to global checkpoint checks). There was a discussion with the Elasticsearch team about implementing sequence number monitoring as part of the system index plugin.

Checkpoint Monitoring is broken after replacing .fleet-policies and .fleet-actions indices with aliases

Checkpoint Monitoring is broken after a change in .fleet indices bootstrapping by kibana that replaced .fleet-policies and .fleet-actions indices with aliases.

For example the API call

GET .fleet-actions/_stats?level=shards

returns

  "indices" : {
    ".fleet-actions_1" : {
      "uuid" : "WdDo6Ji4Qva-ohKtQL3zPw",

and ".fleet-actions_1" doesn't match the index name as currently expected, so the global checkpoint is not found.

One possible issue down the road with aliasing the index is that there is nothing that enforces only one index under the given alias.
Adding a second index will break the seq_no-based monitoring.

Dynamic mappings allow an attacker to degrade indexing capability

Dynamic mappings are a dangerous privilege to grant to an untrusted endpoint:

  1. An attacker could overwhelm the index with a bunch of bogus mappings intended to prevent new valid mappings from being created, hitting the mapping limits.
  2. An attacker could, if the timing is right, purposely mis-map a field such that subsequent valid documents would fail due to mapping exceptions.

To support only "create_doc" privileges, the dynamic mapping would need to be removed from the data stream templates, and all data streams would have to be created before the agents start streaming data.

The built-in index_templates for logs*, metrics*, synthetics*, etc., all contain a dynamic template:

 "dynamic_templates" : [
              {
                "strings_as_keyword" : {
                  "mapping" : {
                    "ignore_above" : 1024,
                    "type" : "keyword"
                  },
                  "match_mapping_type" : "string"
                }
              }
            ],

Dynamic mapping would need to be removed from the default index templates. Runtime queries should be used instead for generic indices, and explicit index templates for specific use cases.

Fleet Server needs CI Integration with e2e-testing repo to run the Agent (and Kibana/Fleet-server side) tests

We have a repo that we use to run Agent (and other Fleet related) tests:
https://github.com/elastic/e2e-testing

When Agent shifted to use Fleet Server, the available tests were adjusted here:
elastic/e2e-testing#438
Though we still need to flesh out and implement deeper actual Fleet Server tests:
elastic/e2e-testing#1266

Still, with the coverage we have, we can make use of those tests in the actual fleet-server repo CI, which is what this ticket is in support of. We need to itemize the details we need for:

  • the Fleet Server repo compilation of the artifacts
    • does this include how to build a Docker image with an updated Fleet Server from source?
  • how to make use of the newly compiled artifacts so they can be built into Elastic Agent artifact(s)
  • then, in the fleet-server repo, we need to add more to the CI Jenkins file, managed in the other issue

The ci hook will run a desired set of e2e-testing scenarios like what we have done for Agent and Kibana. Reference this groovy file for info:
https://github.com/mdelapenya/beats/blob/921a3d52e60db04e0c92712203f097157f290265/.ci/packaging.groovy

This ticket is for documenting the knowledge and steps / logistics.

There is a separate ticket for implementing e2e-testing side changes, here:
elastic/e2e-testing#1411

[Meta] Fleet Server Scale Testing

Initial Scale Testing Findings. February 17th, 2021.

Could not use https://staging.found.no/ due to the fleetServerEnabled flag that needs to be in the kibana.yml config, and the lack of SSH access or the ability to update that flag via the deployment UI.

Had to build the cluster from scratch on GCP.
Did the first round of testing using a cluster configuration similar to what you get out of the box (without customizations) for an io-optimized deployment, which is a 3-node cluster: 1 voting-only node and 2 data nodes with 8GB RAM and 240GB HDD each.

n1-standard-2 (2 vCPUs, 7.5 GB memory)

Horde with 4 worker nodes: [screenshot omitted]

Dedicated Fleet Server box:

n1-standard-2 (2 vCPUs, 7.5 GB memory)

This deployment configuration works OK for 20K (20 thousand) agents, with smooth enrollment and check-ins.
The peak load happens during the enrollment phase, which can be helped by rate limiting the enrollment speed with horde.
After that, checking in is fairly quiet, with low CPU on the Fleet Server box.

Fleet Server allocates more memory during enrollment:
After enrolling 11K agents: RES memory is at 1.5GB
After enrolling 20K agents: RES memory is at 2.7GB

Stopping all the agents, restarting Fleet Server and just deploying already enrolled agents:
After deploying 11K agents: RES memory is at 632MB
After deploying 20K agents: RES memory is at 1.2GB

ES load at 20K agents checking in: [screenshot omitted]

Could push another 5K agents to reach 25K, but it needs to be done slowly; otherwise we started getting lots of 503s from ES and ACK failures with i/o timeouts. [screenshot omitted]

Eventually 25K agents can be rolled out.
ES load at 25K agents checking in: [screenshot omitted]

The ES data node at 20K agents checking in spikes at 100%; the load average is pretty much at full capacity. [htop screenshots omitted]

The Fleet Server at 25K agents, just checking in: [screenshot omitted]

Conclusion

  • 20K agents with not much activity besides checking in seems to work OK with Fleet Server and the default io-optimized deployment. The bottleneck at this point seems to be the ES cluster.
  • Need to test with a beefier ES cluster and see how far we can push it. Next on my list to do.
  • Need to test how it holds with some activity like policy updates or actions.
  • Need to research more on the i/o timeouts on ACKs under load, at around 24K+ agents, and whether this is caused by ES, Fleet Server, or a combination due to load.

Anything else that I forgot to mention?

Would be nice to have

It would be nice to have the ability to use https://staging.found.no/ for testing, to avoid manually setting up the clusters and to get parity with the configuration that customers get with Elastic Cloud deployments. This needs an additional setting in kibana.yml:

xpack.fleet.agents.fleetServerEnabled: true

Currently, if you try to do that from the UI, you get an error. [screenshot omitted]

[Meta] Fleet Server Phase 1

This is a meta issue about all the tasks related to Phase 1 of the Fleet Server project.

CI

Provide scaling guidance for Fleet server on cloud

We'd like to provide updated scaling guidance for the new Fleet server on ESS. We currently publish scaling guidance to help users determine how many agents they can realistically expect to enroll into a given instance size for the hosted agent on cloud https://www.elastic.co/guide/en/fleet/current/fleet-limitations.html. This was based on Kibana so we should update it for Fleet Server.

We should identify how many agents can be enrolled into the APM & Fleet slider at each increment of memory up to the max node memory size (8GB). This will help the user identify how to set the connection limit in the Fleet Server integration policy. #153

I think we also need to make some assumptions about workload when giving our guidance. Fleet server is now bundled with APM server on the same instance, but not all users of Fleet server will necessarily be using APM server. I think it'd be clearer to assume minimal or no APM usage beyond the baseline, so we can isolate the requirements for Fleet Server capacity. The resource needs may also vary between the initial ramp-up of enrolling agents and the steady-state load of check-ins once the agents are enrolled. We should at least cover the check-in load requirements when setting a max.

Resource usage may vary in real world scenarios, so we'll also provide observability into resource usage to help users know when to scale their slider once in production.

Stretch goal:

  • Guidance on whether the user should scale ES nodes to handle increased search traffic

Related discussions:

[Discuss] Indexing permissions as part of the Elastic Agent policy

Currently all Elastic Agents enrolled into the fleet-server get the same permissions. To cover all the use cases, these are the most permissive permissions. This proposal is to reduce the permissions given to an Elastic Agent based on the policy. With this, each Elastic Agent would receive an API Key with the minimal permissions needed to get its job done.

Elastic Agent Policy contains permissions block

The fleet-server creates the API Keys for the Elastic Agents. Which permissions an Elastic Agent requires is based on the content of the policy. Because of this the policy should contain a section with the permissions it requires. This section could look similar to the following (inspired by Elasticsearch API Key permissions):

policy: A
permissions: [
  {
    "names": ["logs-*", "metrics-*"],
    "privileges": ["create_doc"]
  }
]

The above would give create_doc permissions for the logs-* and metrics-* data streams. If an Elastic Agent is enrolled for the policy A, an Elasticsearch output API Key would be created with only the above permissions and added to the policy.

The above model also works for more complex cases. Let's assume we have a case where only 2 indices should be allowed to be written to and one index should have read permissions. This could look as follows:

policy: B
permissions: [
  {
    "names": ["logs-nginx.access-default", "logs-nginx.error-default"],
    "privileges": ["create_doc"]
  },
  {
    "names": ["state-docs"],
    "privileges": ["create_doc", "read"]
  }
]

If an Elastic Agent is enrolled for policy B, permission is given to write to the nginx indices and to read from the state-docs index.

In general, this model can be used to be as permissive or restrictive as needed based on the Elasticsearch permission model. The limit on the maximum permissions that can be given to an Elastic Agent is the permissions the fleet-server user itself has.
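
For reference, such a permissions block maps naturally onto the role_descriptors of the Elasticsearch Create API key API. A rough sketch of the kind of request fleet-server could issue for policy A (the key and role names are arbitrary):

POST /_security/api_key
{
  "name": "agent-output-api-key",
  "role_descriptors": {
    "agent-output": {
      "indices": [
        { "names": ["logs-*", "metrics-*"], "privileges": ["create_doc"] }
      ]
    }
  }
}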

Change of policy

A change to a policy can mean the permission required on the Elastic Agent changes. For example at first only nginx access logs were collected but now also error logs. This means additional privileges for the error log data stream are required. The fleet-server must be able to hand out new API Keys with increased / reduced permissions in case of policy changes. In addition, old API Keys have to be invalidated.

Same permissions for all processes per Elastic Agent

The above assumes there are no sub-permissions per input in an Elastic Agent. Whatever runs in an Elastic Agent, the input that requires the most permissions will define the permissions of the API Key. If, for example, the APM integration is run together with the Endpoint integration and the APM process requires read access to certain indices, the Endpoint process would also get the same permissions. This simplifies the permissions model.

On the policy creation side, it is important to notify the user about potential issues through concepts like Trusted/Untrusted integrations or similar, but this is not part of the privileges concept itself.

Fleet

The permissions which are part of the policy need to be created somewhere. It is expected that these are created in Fleet. Every integration should have the option to specify which permissions it needs. Based on this information and the user input like namespace, Fleet needs to generate the permission block. The UX and parts needed in Fleet should be discussed separately.

Agent timeout

Description

Currently the agent request times out after 5 minutes; if there is no action, Fleet Server responds right at the 5-minute mark, causing timing issues and errors on the agent:

[elastic_agent][error] Could not communicate with Checking API will retry, error: fail to checkin to fleet: Post "http://localhost:8000/api/fleet/agents/c7b3cca1-737e-422e-87aa-5367d795a650/checkin?": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Validate Elastic Agent compatibility

Currently Elastic Agent talks to Kibana and verifies the version. In the future this will be fleet-server. For Elastic Agent there are 2 compatibility parts:

  • Standalone: Compatibility with Elasticsearch
  • Managed: Compatibility with fleet-server

As the compatibility with Elasticsearch in managed mode is enforced by fleet-server (#167), this should not be an issue. In general the compatibility looks as follows:

Elastic Agent <= fleet-server <= Elasticsearch

The bugfix release might differ. The last minor of Elastic Agent in a major should be compatible with the first minor of the next major of fleet-server. Same for Elasticsearch. This means fleet-server / Elasticsearch must be upgraded first.

Unenroll ACK needs to invalidate the API keys for the Agent

At the moment the ACK of an unenroll action does not invalidate the API keys for the Agent. It needs to invalidate both the API key used to communicate with Fleet Server as well as its output API key used for communication to Elasticsearch.

Handling of TLS

Overview

Fleet Server needs to be bootstrapped by the Elastic Agent and be running with TLS. At the moment the bootstrap is all HTTP, which is not secure.

The goal is that, in the default case, security is priority #1, followed by a good user experience. We would rather have communication between a remote Elastic Agent and Fleet Server fail due to an invalid TLS configuration than succeed with insecure TLS communication.

Cloud Solution

When Fleet Server is bootstrapped by the Cloud, all the certificates will be provided to the bootstrap command, allowing Fleet Server to start with the required certificates that the Cloud expects.

./elastic-agent enroll --fleet-server <connection_str> --enrollment-token <token> --cert <path_to_cert> --cert-key <path_to_cert_key>

On-prem Solution

In a customer deployment outside of Cloud they will have options.

Option 1 (Production Custom Certs)

They generate their own certificates that are verifiable by the other Elastic Agents in their organization, and they pass these in the same way Cloud does.

./elastic-agent enroll --fleet-server <connection_str> --enrollment-token <token> --cert <path_to_cert> --cert-key <path_to_cert_key>

Option 2 (Auto-generated)

If no --cert* flags are passed to Elastic Agent then Elastic Agent will auto-generate a self-signed certificate with the hostname of the machine.

./elastic-agent enroll --fleet-server <connection_str> --enrollment-token <token>

This means that another Elastic Agent enrolling to this Fleet Server needs to be explicit that it accepts the fact that the certificate is self-signed.

./elastic-agent enroll --url <url_to_fleet_server> --enrollment-token <token> --insecure

If they do not provide the --insecure flag then it will fail to actually connect to the Fleet Server to enroll. We should update the printed message to make it clear why it did not work.

Option 3 (HTTP-only BAD)

This is the final option, in which they want to run the Elastic Agent and Fleet Server with HTTP only. This is not recommended, but is useful for development or maybe in special cases. In this case it is also best to ensure the Fleet Server is bound to localhost, which Elastic Agent will do by default with the --fleet-server-insecure-http flag.

./elastic-agent enroll --fleet-server <connection_str> --enrollment-token <token> --fleet-server-insecure-http

If they really want it to run over HTTP and not on localhost, --fleet-server-bind 0.0.0.0 can be used.

Add scale testing for fleet-server

This is a reminder issue that early on we should start to test fleet-server at scale with many Elastic Agents connected to it, and potentially more than one fleet-server at a time.

Validate Elasticsearch version for compatibility

Fleet Server is compatible with an Elasticsearch version >= the fleet-server version of the same major. The bugfix version may differ. On majors, it is expected that the last minor of fleet-server is compatible with the first minor of the next major Elasticsearch version.

This issue is to document and verify that these restrictions are in place.

Kibana is expected to be on the same version as Elasticsearch. Upgrades work in the following order: Elasticsearch, Kibana, Fleet-server, Elastic Agents.

A few examples (made-up versions):

  • fleet-server 7.13.0, Elasticsearch 7.13.0: yes
  • fleet-server 7.13.2, Elasticsearch 7.13.1: yes
  • fleet-server 7.13.2, Elasticsearch 7.14.2: yes
  • fleet-server 7.last, Elasticsearch 8.0.0: yes
  • fleet-server 7.14.0, Elasticsearch 7.13.0: no
  • fleet-server 8.0.0, Elasticsearch 7.last: no

Handle INTERNAL_POLICY_REASSIGN action

Description

In Kibana we are currently creating an INTERNAL_POLICY_REASSIGN action to know when we need to fetch the agent again and distribute a new policy; this should probably be handled in Fleet Server too.

Improve actions in the Elastic Agent

The way actions work in Elastic Agent should be improved. Most of the work here probably must happen in the Elastic Agent itself. Filing it here for now as the new action implementation is driven by fleet server.

High level notes (needs more details):

  • The agent side changes (as far as I understand):
    • action token exchange on checkin
    • agent actions routing (specifically for osquery in short term)
    • agent actions result/status/error forwarding from the "app" to the fleet server
    • the agent managing the Fleet Server launch

[Meta] Fleet Server observability

Observability for Fleet Servers is important because operators are expected to scale them manually. They are also expected to troubleshoot operational issues and fix them. This is true to varying degrees from self-managed environments to ECE/ECK and ESS. As a result, we should provide users visibility into issues and documentation on how to address common ones.

As a Fleet operator, I need:

  • Logs, metric and status information to help me scale and troubleshoot issues elastic/beats#24415
  • Fleet Server integration dashboard elastic/integrations#812
  • Documentation for Fleet Server operators on how to observe it, scale it, scaling limits, restrictions, etc.

Out of scope:

  • A notification in the Fleet app when Fleet Server is not healthy so I understand why updates are not being applied elastic/kibana#95572

Related issues:

[Meta] Fleet server should gracefully reject connections

When incoming connections to Fleet server exceed the capacity for that server, we should gracefully reject those connections. This will allow the clients to be load balanced to other nodes, when available.

Currently, Fleet server will consume all available RAM as the number of connections increases until it triggers the OOM killer. This will have a negative impact on clients and other processes. When a fleet server is killed, it drops all connections, which would require many thousands of clients to reconnect, using a lot of CPU and preventing users from making updates or triggering response actions during that period. It will also negatively affect other processes running in the same container, such as APM server or Beats.

Requirements:

  • We should reject incoming connections rather than crash or trigger the OOM killer
  • We should account for ESS and self-managed clusters. We should not solely rely on a proxy to enforce limits.
  • The capacity should adjust based on the instance size in ESS/ECE and self-managed hosts
  • We should provide some buffer for other processes in the same container, such as metricbeat which is essential to alert operators to scale capacity

Implementation phases:

Open questions:

  1. What is the best approach to gracefully reject connections? Is it a hard limit, or can it intelligently reject connections if there is not enough free memory available?
    • Answer: Start with a hard limit and let the user adjust it. Make it more intelligent in a second phase.
  2. Can we return an HTTP error code like 503 so the client knows why the request is rejected? It may be expensive to accept more TLS connections only to reject them. An alternative could be to give a tcp reset. This is not as clear when troubleshooting because tcp resets can also occur for other reasons like network connectivity.
    • Answer: Start with TCP reset because it doesn't require a TLS connection to be accepted, which would require more memory
  3. If there is a limit, can we adjust it automatically so that the user does not need to manually adjust it, in ESS/ECE or self-managed hosts? Can it adjust so the cloud API doesn't need to alter the configuration? For example, the limit could be calculated based on the environment when the fleet server starts.
    • Answer: Lets target this for a second phase
  4. Can we reject connections based on the amount of memory available? If other processes use too much memory, the fleet server might run out of memory before hitting a hard connection limit. This would lead us to use wider operating margins which are less cost efficient.
    • Answer from Steffen: The Agent would be in a better place to monitor and adapt limits dynamically if there is need. I don't think this is a short term goal.

Related issues:

[Meta] Fleet Server Phase 2

The goal of this phase is to bring Fleet Server to the beta stage. This is a meta tracking issue.

Bugs

  • .fleet-actions index not found #246

CI

Docs (@ph )

Let the user set the Fleet Server host after installing

Currently, the user is required to set the Fleet Server host in Fleet Settings before installing Fleet Server. If they do not, the agent will get a policy with no valid hosts and it will no longer receive updates. I'm worried that not all users will read the instructions carefully. It'd be nice if we designed our fleet server to be more resilient and able to recover in this scenario.

The agent already has the ability to check whether a fleet server host is valid by checking a status endpoint. If the status endpoint does not return 200, it does not accept the policy and it returns an unhealthy status.

Can we do the same during bootstrapping to allow the user to set the Fleet Server host after installing Fleet Server? If it's valid, then the agent will finish bootstrapping and check in successfully. If not, keep checking ES on a regular interval until it is. This allows the user to fill in the Fleet Server host later. We can also set the agent status as unhealthy to indicate that setup is not complete.
