
fleet-server's Introduction

Fleet Server


Fleet server is the control server to manage a fleet of elastic-agents.

For production deployments the fleet-server is supervised and bootstrapped by an elastic-agent.

Compatibility and upgrades

Fleet-server communicates with Elasticsearch. Elasticsearch must be on the same version or newer. Fleet-server is always on the exact same version as the Elastic Agent running fleet-server. Any Elastic Agent enrolling into a fleet-server must be the same version or older. Kibana is assumed to be on the same version as Elasticsearch. With this, compatibility looks as follows:

Elastic Agent <= Elastic Agent with fleet-server <= Elasticsearch / Kibana

Bugfix versions might differ.

For upgrades, Elasticsearch and Kibana must be upgraded first, then the Elastic Agent running fleet-server, followed by any other Elastic Agents.

MacOSX Version

The golang-crossbuild project produces the images used for testing and building. The golang-crossbuild:1.16.X-darwin-debian10 images expect the minimum MacOSX version to be 10.14+.

Development

The following notes are meant to help developers onboarding to the project get up and running quickly. These notes might change at any time.

Developing Fleet Server and Kibana at the same time

When developing features for Fleet, it may become necessary to run both Fleet Server and Kibana from source in order to implement features end-to-end. To facilitate this, we've created a separate guide hosted here.

IDE config

When using the gopls language server you may run into the following errors in the testing package:

error while importing github.com/elastic/fleet-server/testing/e2e/scaffold: build constraints exclude all Go files in  <path to fleet-server>/fleet-server/testing/e2e/scaffold
/<path to fleet-server>/fleet-server/testing/e2e/agent_install_test.go.
   This file may be excluded due to its build tags; try adding "-tags=<build tag>" to your gopls "buildFlags" configuration
   See the documentation for more information on working with build tags:
   https://github.com/golang/tools/blob/master/gopls/doc/settings.md#buildflags-string

To resolve the first issue, you can add a go.work file to the root of this repo. Copy and paste the following into go.work:

go 1.21

use (
  .
  ./testing
  ./pkg/api
)
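
Alternatively, the same go.work file can be generated with the go work command (a sketch; the go directive in the generated file will reflect your toolchain version):

go work init . ./testing ./pkg/api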

The solution for the second error depends on the IDE and the plugin manager you are using.

neovim

lazyvim package manager

nvim-lspconfig plugin

Add the following to your config files

{
  "neovim/nvim-lspconfig",
  opts = {
    servers = {
      gopls = {
        settings = {
          gopls = {
            buildFlags = { "-tags=e2e integration cloude2e" },
          },
        },
      },
    },
  },
}

After these changes, if you are still running into issues with code suggestions or autocomplete, you may have to clear your Go module cache and restart your LSP clients.

Run the following command to clear your Go module cache:

go clean -modcache

Restart your vim session and run the following command to restart your LSP clients:

:LspRestart

Changelog

The changelog for fleet-server is generated and maintained using the elastic-agent-changelog-tool. Read the installation and usage instructions to get started.

The changelog tool produces fragment files that are consolidated to generate a changelog for each release. Each PR containing a change with user impact (new feature, bug fix, etc.) must contain a changelog fragment describing the change.

A simple example of a changelog fragment is below for reference:

kind: feature
summary: Accept raw errors as a fallback to detailed error type
pr: https://github.com/elastic/fleet-server/pull/2079
issue: https://github.com/elastic/elastic-agent/issues/931
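
Fragments are typically created with the changelog tool itself. A sketch, assuming the tool's new subcommand and using a placeholder fragment name:

# "add-raw-error-fallback" is a placeholder fragment name.
elastic-agent-changelog-tool new add-raw-error-fallback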

Vagrant

A Vagrantfile is provided to get an environment capable of developing and testing fleet-server. In order to provision the vagrant box run:

vagrant plugin install vagrant-docker-compose
vagrant up

Development build

To compile fleet-server in development mode, set the env var DEV=true. When compiled in development mode, fleet-server will support debugging. For example:

SNAPSHOT=true DEV=true make release-darwin/amd64
GOOS=darwin GOARCH=amd64 go build -tags="dev" -gcflags="all=-N -l" -ldflags="-X main.Version=8.7.0 -X main.Commit=31668e0 -X main.BuildTime=2022-12-23T20:06:20Z" -buildmode=pie -o build/binaries/fleet-server-8.7.0-darwin-x86_64/fleet-server .

Change release-darwin/amd64 to release-YOUR_OS/platform. Run make list-platforms to check out the possible values.

The SNAPSHOT flag sets the snapshot version flag and relaxes client version checks. When SNAPSHOT is set, we allow clients of the next version to communicate with fleet-server. For example, if fleet-server is running version 8.11.0 on a SNAPSHOT build, clients can communicate with versions up to 8.12.0.

Docker build

You can build a fleet-server docker image with make build-docker. This image includes the default fleet-server.yml configuration file and can be customized with the available environment variables.

This image includes only fleet-server and is intended for stand-alone mode; see the section about stand-alone Fleet Server below to learn more.

You can run this image with the included configuration file with the following command:

docker run -it --rm \
  -e ELASTICSEARCH_HOSTS="https://elasticsearch:9200" \
  -e ELASTICSEARCH_SERVICE_TOKEN="someservicetoken" \
  -e ELASTICSEARCH_CA_TRUSTED_FINGERPRINT="somefingerprint" \
  docker.elastic.co/fleet-server/fleet-server:8.8.0

You can replace the included configuration by mounting your configuration file as a volume in /etc/fleet-server.yml.

docker run -it --rm \
  -e ELASTICSEARCH_HOSTS="https://elasticsearch:9200" \
  -e ELASTICSEARCH_SERVICE_TOKEN="someservicetoken" \
  -e ELASTICSEARCH_CA_TRUSTED_FINGERPRINT="somefingerprint" \
  -v "/path/to/your/fleet-server.yml:/etc/fleet-server.yml:ro" \
  docker.elastic.co/fleet-server/fleet-server:8.8.0

Running a local stack for development

Fleet-server can be run locally in stand-alone mode alongside Elasticsearch and Kibana for development/testing.

Start by following the instructions to create a development build.

ES and Kibana from SNAPSHOTS API on host

In order to run a development/snapshot fleet-server binary, the corresponding SNAPSHOT builds of Elasticsearch and Kibana should be used. The artifacts can be found with the artifacts API; for example, here's the URL for 8.7-SNAPSHOT artifacts.

The request will result in a JSON blob that describes all artifacts. You will need to gather the URLs for Elasticsearch and Kibana that match your distribution, for example linux/amd64.

TODO: parse the JSON to get the URL
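
One way to extract the URLs, assuming jq is available and ARTIFACTS_API_URL is a placeholder for the artifacts API endpoint mentioned above; the filter collects every url field in the response and keeps the ones matching your platform rather than assuming a specific response schema:

# ARTIFACTS_API_URL is a placeholder; use the artifacts API URL for your version.
curl -s "$ARTIFACTS_API_URL" -o artifacts.json
jq -r '.. | .url? // empty | select(test("linux-x86_64\\.tar\\.gz$"))' artifacts.json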

wget https://snapshots.elastic.co/8.7.0-19f30181/downloads/elasticsearch/elasticsearch-8.7.0-SNAPSHOT-linux-x86_64.tar.gz
wget https://snapshots.elastic.co/8.7.0-19f30181/downloads/kibana/kibana-8.7.0-SNAPSHOT-linux-x86_64.tar.gz

Generally you will need to unarchive and run the binaries:

tar -xzf elasticsearch-8.7.0-SNAPSHOT-linux-x86_64.tar.gz
cd elasticsearch-8.7.0-SNAPSHOT
# elasticsearch.yml can be edited if required
./bin/elasticsearch

The Elasticsearch startup output will include the elastic user's password and a Kibana configuration string.

tar -xzf kibana-8.7.0-SNAPSHOT-linux-x86_64.tar.gz
cd kibana-8.7.0-SNAPSHOT
# kibana.yml can be edited if required
./bin/kibana

The Kibana output will show a URL that will need to be visited in order to configure Kibana with the string Elasticsearch provides.

More instructions for setup can be found in the Elastic Stack Installation Guide.

Elasticsearch configuration

Elasticsearch configuration generally does not need to be changed when running a single-instance cluster for local testing. See the elasticsearch.yml used by our integration tests for an example of our testing configuration.

Kibana configuration

Custom Kibana configuration can be used to preload Fleet with integrations and policies (by using the xpack.fleet.packages and xpack.fleet.agentPolicies attributes). It can also be used to set Fleet settings such as the fleet-server hosts (xpack.fleet.agents.fleet_server.hosts) and outputs (xpack.fleet.outputs). Please see our e2e tests' kibana.yml configuration for a complete example.

Note that our tests run the Elasticsearch container on a Docker network where the host is called elasticsearch, and the fleet-server container is called fleet-server.
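
For illustration, a minimal sketch of the kind of preconfiguration these attributes allow; the values below are placeholders, not the configuration our tests use:

# Hypothetical example values; see the e2e kibana.yml above for the real configuration.
xpack.fleet.packages:
  - name: fleet_server
    version: latest
xpack.fleet.agents.fleet_server.hosts: ["https://fleet-server:8220"]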

fleet-server stand alone

Fleet in Kibana requires a managed fleet-server (generally the one you enroll with the elastic-agent instructions). To disable this requirement for a local fleet-server instance use: xpack.fleet.enableExperimental: ['fleetServerStandalone'] (available since v8.8.0). This is only supported internally and is not intended for end-users at this time.

fleet-server configuration

Access the Fleet UI on Kibana and generate a fleet-server policy. Set the following env vars with the information from Kibana:

  • ELASTICSEARCH_CA_TRUSTED_FINGERPRINT
  • ELASTICSEARCH_SERVICE_TOKEN
  • FLEET_SERVER_POLICY_ID

Alternatively, edit fleet-server.yml to include these details directly.

Note the fleet-server.reference.yml contains a full configuration reference.
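
For example, when running fleet-server from a shell, the values gathered from Kibana can be exported before starting it (the values below are placeholders):

# Placeholder values; copy the real ones from the Fleet UI.
export ELASTICSEARCH_CA_TRUSTED_FINGERPRINT="<fingerprint>"
export ELASTICSEARCH_SERVICE_TOKEN="<service token>"
export FLEET_SERVER_POLICY_ID="<policy id>"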

fleet-server certificates

Create a self-signed TLS CA and a cert+key for the fleet-server instance; you can use elasticsearch-certutil for this:

# Create a CA
../elasticsearch/bin/elasticsearch-certutil ca --pem --out stack.zip
unzip stack.zip
# Create a cert+key
../elasticsearch/bin/elasticsearch-certutil cert --pem --ca-cert ca/ca.crt --ca-key ca/ca.key --ip $HOST_IP_ADDR --out cert.zip
unzip cert.zip

Ensure that server.ssl.enabled: true is set, as well as the server.ssl.certificate and server.ssl.key attributes, in fleet-server.yml.
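
As a rough sketch, assuming these attributes nest under the fleet-server input as in the other configuration examples in this document (the paths are placeholders; check fleet-server.reference.yml for the exact layout):

inputs:
  - type: fleet-server
    server:
      ssl:
        enabled: true
        certificate: /path/to/fleet-server.crt
        key: /path/to/fleet-server.key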

Then run the fleet-server:

./build/binaries/fleet-server-8.7.0-darwin-x86_64/fleet-server -c fleet-server.yml

By default the fleet-server will attempt to connect to Elasticsearch on https://localhost:9200; if this needs to be changed, set it with ELASTICSEARCH_HOSTS. The fleet-server should appear as an agent with the ID dev-fleet-server.
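
For example, to point a development build at a different Elasticsearch host (the host below is a placeholder):

ELASTICSEARCH_HOSTS="https://my-es-host:9200" ./build/binaries/fleet-server-8.7.0-darwin-x86_64/fleet-server -c fleet-server.yml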

Any additional agents will need the ca/ca.crt file to enroll (or will need to use the --insecure flag).

fleet-server+agent on a Vagrant VM

The development Vagrant machine assumes the elastic-agent, beats, and fleet-server repos are in the same folder. Thus, it mounts ../ to /vagrant on the Vagrant machine. The vagrant machine IP address is 192.168.56.43. Use https://192.168.56.43:8220 as fleet-server host.

vagrant up
vagrant ssh

Build the elastic-agent

Once in the Vagrant VM, and assuming that the repos are correctly mounted in /vagrant, build the agent by running:

cd /vagrant/elastic-agent
SNAPSHOT=true EXTERNAL=true PLATFORMS="linux/amd64" PACKAGES="tar.gz" mage -v dev:package # adjust PLATFORMS and PACKAGES to your system and needs.

For detailed instructions, check the Elastic-Agent repo.

Run the elastic-agent+fleet-server in Vagrant

Copy and unpack the elastic-agent .tar.gz file and replace the fleet-server binary in elastic-agent-8.Y.Z-SNAPSHOT-OS-ARCH/data/elastic-agent-*/components/ with the snapshot from the fleet-server repo.

Then go to Kibana > Management > Fleet and follow the instructions there.

The vagrant machine IP address is 192.168.56.43. Use https://192.168.56.43:8220 as fleet-server host.

tl;dr/example:
cp /vagrant/elastic-agent/build/distributions/elastic-agent-8.7.0-SNAPSHOT-linux-x86_64.tar.gz* ./
tar -xzf elastic-agent-8.7.0-SNAPSHOT-linux-x86_64.tar.gz
cd elastic-agent-8.7.0-SNAPSHOT-linux-x86_64
cp build/binaries/fleet-server-8.7.0-SNAPSHOT-linux-x86_64/fleet-server ./data/elastic-agent-494b79/components/
./elastic-agent install ...

Running go test and benchmarks

When developing new features, you will want to make sure your changes do not break any pre-existing functionality. For this reason, as you make changes you might want to run a subset of the tests, or the full test suite, before you create a pull request.

Running go tests

To execute the full unit test suite from your local environment, run the following:

make test-unit

This make target will execute the go unit tests and should normally pass without an issue.

To run the tests in a single package or a single function, run something like:

go test -v ./internal/pkg/checkin -run TestBulkSimple

Running go benchmark tests

It's good practice, before you start your changes, to establish the current baseline of the benchmarks on your machine. To establish the baseline benchmark report you can follow this workflow.

Establish a baseline

BENCH_BASE=base.out make benchmark

This will execute all the Go benchmark tests and write the output into the file build/base.out. If you omit the BENCH_BASE variable, the name defaults to build/benchmark-{git_head_sha1}.out.

Re-running benchmark after changes

After applying your changes to the code, you can reuse the same command, but with a different output file.

BENCH_BASE=next.out make benchmark

At this point you can compare the 2 reports using benchstat.

Comparing the 2 results

BENCH_BASE=base.out BENCH_NEXT=next.out make benchstat

And this will print the difference between the baseline and next results.

You can read more on the benchstat official site.

There are some additional parameters that you can use with the benchmark target.

  • BENCHMARK_FILTER: defines the benchmark filter so that you only run a subset of benchmarks (default: Bench, which runs only the BenchmarkXXXX functions and not the unit tests)
  • BENCHMARK_COUNT: defines the number of iterations go test will run; a larger number helps smooth out run-to-run variation (default: 8). An example combining both is shown below.
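
For example, to compare only a single benchmark with more iterations (the benchmark name below is a placeholder):

# BenchmarkXXXX is a placeholder; substitute a real benchmark name.
BENCHMARK_FILTER=BenchmarkXXXX BENCHMARK_COUNT=16 BENCH_BASE=base.out make benchmark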

E2E Tests

All E2E tests are located in testing/e2e.

To execute them run:

make test-e2e

Refer to the e2e README for information on how to write new tests.

Testing on cloud

Elastic employees can create an Elastic Cloud deployment with a locally built Fleet Server.

To deploy it you can use the following command:

EC_API_KEY=yourapikey make -C dev-tools/cloud cloud-deploy

And then to clean up the deployment:

EC_API_KEY=yourapikey make -C dev-tools/cloud cloud-clean

For more advanced scenarios, you can build a custom Docker image to use in your own Terraform:

make -C dev-tools/cloud build-and-push-cloud-image


fleet-server's Issues

Fleet-server docs overview

This issue is to keep a list of all the things that should be documented for fleet-server.

  • What is fleet-server. Describe architecture on a high level
  • How is fleet-server run / setup
    • Prerequisites, e.g. 7.13+, ECE 2.10+
    • What params exist
    • What environment variables exist
    • What needs to be done in Fleet?
    • How to enroll an Elastic Agent into fleet-server
  • Existing config options
  • Fleet-server in docker
    • How is it different? Non-upgradable, different paths elastic/beats#24817
    • What environment variables exist
    • Example setup for docker-compose
  • Monitoring of fleet-server
  • Setup in Kubernetes
  • Compatibility: Elasticsearch >= Fleet Server >= Elastic Agent. Bugfix releases might differ. Kibana is assumed to be on the same minor as Elasticsearch.

Manage fleet server indices as system indices in the Elastic Stack

Describe the enhancement:

Fleet Server relies on a set of indices to store operational data about Fleet. These indices must be created early in the system lifecycle. Currently the indices are instantiated in Kibana. This enhancement will move the instantiation to the Elastic Stack where the fleet indices will be treated as managed system indices. Moving forward, updates and migrations of these indices will be managed by the system indices plugins.

Describe a specific use case for the enhancement or feature:

We have 7 indices we are dependent on, and one data stream with an ILM policy:

Indices

.fleet-actions
.fleet-agents
.fleet-enrollment-api-keys
.fleet-policies
.fleet-policies-leader
.fleet-servers
.fleet-artifacts

DataStream

.fleet-actions-results and ILM policy for the data stream.

Note: The index creation had moved over to Kibana; however, the data stream was still being created via package management. That was an oversight.

Known issues

  • The indices/datastreams are not instantiated in the system indices plugin until the first document is written to them. The code in Kibana does not currently expect that. Is there a way to "touch" each of the indices at Kibana boot so we don't have to fix the reads of indices in Kibana to handle the "does not yet exist" case?
  • System Indices does not yet support the concept of a system-managed ILM policy. So the ILM policy for .fleet-actions-results will likely remain in the package, or have to be put into Kibana. Question on race condition: what happens if the package is installed before the data stream is instantiated? Does trying to add an ILM policy to a system-managed data stream cause it to be created?

@jaymode will do the initial implementation in the Elastic Stack for this integration. Moving forward, the Fleet Team will pick up maintenance. This is expected to land for 7.13.

Cache configuration keys should be under the input

At the moment the newly added cache key is at the top-level of the configuration. It should have been added nested under the input:

inputs:
  - type: fleet-server
    cache:
       ...

Without this structure, these values will never be able to be adjusted when running under Agent, which is the only officially supported way of running Fleet Server.

[Fleet] creating a policy without integrations breaks fleet server

Description

Blocking elastic/beats#24725

When creating a policy without integrations, it looks like Fleet Server gets stuck in a state where it's not possible to create a new policy change or a new policy with integrations.

How to reproduce

  1. create a policy without any integrations
  2. Try to assign an agent to that policy, or to trigger a policy change on a previous policy; nothing happens.

Report error on install if Fleet server URL is missing

The Fleet server URL is required to be set in Kibana before the Fleet server bootstraps itself. When the user installs Fleet server, we should check if the Fleet server URL has been set. If it has not, we should display an error message in the logs telling the user to set it up in Kibana. Ideally, it will also provide a link to the docs where they can learn more.

Related:

[Meta] Run select inputs on dedicated hosts

Some Fleet operators would like to ensure consistent performance for integrations running on hosted Elastic Agents. This is particularly important for Synthetics because a change in the underlying host resources may create noise that could impact the accuracy of website performance measurement. It is also important for APM server, where operators want to ensure the performance for loading real-time APM and OTel data. Some integrations like Fleet server and AWS will have spiky workloads, so it makes sense to segregate them.

User stories:

  • As a Fleet operator, I want to specify that some types of inputs are only allowed to run on dedicated hosts
  • As a Fleet operator, I want Fleet to automatically assign inputs so that I don't need to manually update agent policies when they undergo maintenance or I add/remove nodes.

Open questions:

  • How important is it to specify a number of hosts > 1?
  • How important is it to run a set of inputs on a dedicated host vs a single input?

Allow the user to edit the connection limit live

We'd like to gracefully manage the capacity of the Fleet Server to prevent it from crashing. In the first phase, there will be hard limit(s) and users must manually adjust them. The overall meta issue defines the subsequent phases where we hope to make this more automatic.

We've already implemented a max connection limit here #122 and we've added this limit to the integration policy here elastic/integrations#803

Requirements:

  • Users can update the limit(s) in the integration policy, and they are applied dynamically to the fleet server (they are "live" settings).
  • Add docs on the limit(s) to the Fleet Server's operator's guide https://github.com/elastic/obs-dc-team/issues/147. Describe how users can edit the limit(s), guidance on what they should set the limit(s) to #134, how they know when the limit(s) are hit.

Cleanup old actions

Currently old actions are not cleaned up. Actions should be cleaned up automatically after a certain time (2 weeks?).

Setup test environment with Elasticsearch

Fleet-Server has a tight integration with Elasticsearch. Part of our test suite should run against an actual version of Elasticsearch, for example the setup part.

Report degraded status from Fleet Server if there is an issue with indices monitoring

Describe the enhancement:
Report degraded status from Fleet Server if there is an issue with indices monitoring and keep Fleet Server running.
The Fleet Server will not be able to detect new actions or new policy changes in this state.
Follow up on #115

Describe a specific use case for the enhancement or feature:
The current implementation of new actions and policy changes detection relies on the Elasticsearch index global checkpoint. In order for it to work properly we can only have one index/one shard for the documents, possibly until we get more robust support for this feature at the Elasticsearch level.
Presently the .fleet-actions and .fleet-policies indices are replaced with aliases.
We need to handle the situation where we have two or more indices behind the alias by mistake: log the error, indicate the degraded status back to the agent, and keep Fleet Server running.

TestMonitor_NewPolicyExists flaky test

Adding the issue so we don't forget about this after coming back from the holidays.

First noticed the intermittent CI failures with the TestMonitor_NewPolicyExists test on the master branch.
Then it was randomly failing, blocking this PR:
#46

CI failure upon merging to master (the PR built ok).
[Screenshots of the CI build log omitted]

I could not reproduce the failure locally; it looks like a timing issue that is more frequent in the slower CI environment when starting the policy monitor in the test.
If you add a small delay in the goroutine that starts the monitor, something like time.Sleep(100*time.Millisecond), here:
https://github.com/elastic/fleet-server/blob/master/internal/pkg/policy/monitor_test.go#L284
you should be able to reproduce the issue.

Decide on the Fleet Server default port

Picking up on async communication initiated by @ruflin: the Fleet Server should run on a default port that is not likely to conflict with other ports. At the moment it runs on 8000.

This issue should reflect the decision about the port.

Improve index changes monitoring

Better index changes monitoring that reduces the number of requests to Elasticsearch (due to global checkpoint checks). There was a discussion with the Elasticsearch team about implementing sequence number monitoring as part of the system index plugin.

Checkpoint Monitoring is broken after replacing .fleet-policies and .fleet-actions indices with aliases

Checkpoint Monitoring is broken after a change in .fleet indices bootstrapping by kibana that replaced .fleet-policies and .fleet-actions indices with aliases.

For example the API call

GET .fleet-actions/_stats?level=shards

returns

  "indices" : {
    ".fleet-actions_1" : {
      "uuid" : "WdDo6Ji4Qva-ohKtQL3zPw",

and ".fleet-actions_1" doesn't match the index name as currently expected, so the global checkpoint is not found.

One possible issue down the road with aliasing the index is that there is nothing that enforces only one index under the given alias.
Adding a second index will break the seq_no-based monitoring.

Dynamic mappings allow an attacker to degrade indexing capability

Dynamic mappings are a dangerous privilege to grant to an untrusted endpoint:

  1. An attacker could overwhelm the index with a bunch of bogus mappings intended to prevent new valid mappings from being created, hitting the mapping limits.
  2. An attacker could, if the timing is right, purposely mis-map a field such that subsequent valid documents would fail due to mapping exceptions.

To support only "create_doc" privileges, the dynamic mapping would need to be removed from the data stream templates, and all data streams would have to be created before the agents start streaming data.

The built-in index_templates for logs*, metrics*, synthetics*, etc., all contain a dynamic template:

 "dynamic_templates" : [
              {
                "strings_as_keyword" : {
                  "mapping" : {
                    "ignore_above" : 1024,
                    "type" : "keyword"
                  },
                  "match_mapping_type" : "string"
                }
              }
            ],

Dynamic mapping would need to be removed from the default index templates. Runtime queries should be used instead for generic indices, and explicit index templates for specific use cases.

Fleet Server needs CI Integration with e2e-testing repo to run the Agent (and Kibana/Fleet-server side) tests

We have a repo that we use to run Agent (and other Fleet related) tests:
https://github.com/elastic/e2e-testing

When Agent shifted to use Fleet Server, the available tests were adjusted here:
elastic/e2e-testing#438
Though we still need to flesh out and implement deeper actual Fleet Server tests:
elastic/e2e-testing#1266

Still, with the coverage we have, we can make use of those tests in the actual fleet-server repo CI, which is what this ticket is in support of. We need to itemize the details we need for:

  • the Fleet Server repo compilation of the artifacts
    • does this include how to build a Docker image with an updated Fleet Server from source?
  • how to make use of the newly compiled artifacts so they can be built into Elastic Agent artifact(s)
  • then, in the fleet-server repo, we need to add more to the CI Jenkins file, managed in the other issue

The ci hook will run a desired set of e2e-testing scenarios like what we have done for Agent and Kibana. Reference this groovy file for info:
https://github.com/mdelapenya/beats/blob/921a3d52e60db04e0c92712203f097157f290265/.ci/packaging.groovy

This ticket is for documenting the knowledge and steps / logistics.

There is a separate ticket for implementing e2e-testing side changes, here:
elastic/e2e-testing#1411

[Meta] Fleet Server Scale Testing

Initial Scale Testing Findings. February 17th, 2021.

Could not use https://staging.found.no/ due to the fleetServerEnabled flag that needs to be in the kibana.yml config, and the lack of SSH access or the ability to update that flag via the deployment UI.

Had to build the cluster from scratch on GCP.
Did the first round of testing using a cluster configuration similar to what you get out of the box (without customizations) for an io-optimized deployment, which is a 3-node cluster: 1 voting-only node and 2 data nodes with 8GB RAM and 240GB HDD each.

n1-standard-2 (2 vCPUs, 7.5 GB memory)

Horde with 4 worker nodes: [screenshot omitted]

Dedicated Fleet Server box:

n1-standard-2 (2 vCPUs, 7.5 GB memory)

This deployment configuration works OK for 20K (20 thousand) agents, with smooth enrollment and check-ins.
The peak load happens during the enrollment phase, which can be helped by rate limiting the enrollment speed with horde.
After that, checking in is fairly quiet, with low CPU on the Fleet Server box.

Fleet Server allocates more memory during enrollment:
After enrolling 11K agents: RES memory is at 1.5GB
After enrolling 20K agents: RES memory is at 2.7GB

Stopping all the agents, restarting Fleet Server and just deploying already enrolled agents:
After deploying 11K agents: RES memory is at 632MB
After deploying 20K agents: RES memory is at 1.2GB

ES load at 20K agents checking in: [screenshot omitted]

Could push another 5K agents to reach 25K, but it needs to be done slowly; otherwise we started getting lots of 503s from ES and ACK failures with i/o timeouts. [screenshot omitted]

Eventually 25K agents can be rolled out.
ES load at 25K agents checking in: [screenshot omitted]

The ES data node at 20K agents checking in spikes at 100%; the load average is pretty much at full capacity. [htop screenshots omitted]

The Fleet Server at 25K agents, just checking in: [screenshot omitted]

Conclusion

  • 20K agents with not much activity besides checking in seems to work OK with Fleet Server and the default io-optimized deployment. The bottleneck at this point seems to be the ES cluster.
  • Need to test with a beefier ES cluster and see how far we can push it. Next on my list to do.
  • Need to test how it holds with some activity like policy updates or actions.
  • Need to research more on the i/o timeouts on ACKs under load, at around 24K+ agents, and whether this is caused by ES, Fleet Server, or a combination due to load.

Anything else that I forgot to mention?

Would be nice to have

It would be nice to have the ability to use https://staging.found.no/ for testing, to avoid manually setting up the clusters and to get parity with the configuration that customers get with Elastic Cloud deployments. This needs an additional setting in kibana.yml:

xpack.fleet.agents.fleetServerEnabled: true

Currently, if you try to do that from the UI, you get an error. [screenshot omitted]

[Meta] Fleet Server Phase 1

This is a meta issue about all the tasks related to Phase 1 of the Fleet Server project.

CI

Provide scaling guidance for Fleet server on cloud

We'd like to provide updated scaling guidance for the new Fleet server on ESS. We currently publish scaling guidance to help users determine how many agents they can realistically expect to enroll into a given instance size for the hosted agent on cloud https://www.elastic.co/guide/en/fleet/current/fleet-limitations.html. This was based on Kibana so we should update it for Fleet Server.

We should identify how many agents can be enrolled into the APM & Fleet slider at each increment of memory up to the max node memory size (8GB). This will help the user identify how to set the connection limit in the Fleet Server integration policy. #153

I think we also need to make some assumptions about workload when giving our guidance. Fleet server is now bundled with APM server on the same instance, but not all users of Fleet server will necessarily be using APM server. I think it'd be clearer to assume minimal or no APM usage beyond the baseline, so we can isolate the requirements for Fleet Server capacity. The resource needs may also vary between the initial ramp-up of enrolling agents and the steady-state load of check-ins once the agents are enrolled. We should at least cover the check-in load requirements when setting a max.

Resource usage may vary in real world scenarios, so we'll also provide observability into resource usage to help users know when to scale their slider once in production.

Stretch goal:

  • Guidance on whether the user should scale ES nodes to handle increased search traffic

Related discussions:

[Discuss] Indexing permissions as part of the Elastic Agent policy

Currently all Elastic Agents enrolled into the fleet-server get the same permissions. To cover all the use cases, these are the most permissive permissions. This proposal is to reduce the permissions given to an Elastic Agent based on the policy. With this, each Elastic Agent would receive an API Key with the minimal permissions needed to get its job done.

Elastic Agent Policy contains permissions block

The fleet-server creates the API Keys for the Elastic Agents. Which permissions an Elastic Agent requires is based on the content of the policy. Because of this the policy should contain a section with the permissions it requires. This section could look similar to the following (inspired by Elasticsearch API Key permissions):

policy: A
permissions: [
  {
    "names": ["logs-*", "metrics-*"],
    "privileges": ["create_doc"]
  }
]

The above would give create_doc permissions for the logs-* and metrics-* data streams. If an Elastic Agent is enrolled for the policy A, an Elasticsearch output API Key would be created with only the above permissions and added to the policy.

The above model also works for more complex cases. Let's assume we have a case where only 2 indices should be allowed to be written to and one index should have read permissions. This could look as follows:

policy: B
permissions: [
  {
    "names": ["logs-nginx.access-default", "logs-nginx.error-default"],
    "privileges": ["create_doc"]
  },
  {
    "names": ["state-docs"],
    "privileges": ["create_doc", "read"]
  }
]

If an Elastic Agent is enrolled for policy B, permission is given to write to the nginx indices and to read from the state-docs index.

In general, this model can be used to be as permissive or restrictive as needed based on the Elasticsearch permission model. The limit on the maximum permissions that can be given to an Elastic Agent is the permissions the fleet-server user itself has.
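
For reference, such a permissions block maps naturally onto the role_descriptors of the Elasticsearch Create API key API. A rough sketch of the kind of request fleet-server could issue for policy A (the key and role names are arbitrary):

POST /_security/api_key
{
  "name": "agent-output-api-key",
  "role_descriptors": {
    "agent-output": {
      "indices": [
        { "names": ["logs-*", "metrics-*"], "privileges": ["create_doc"] }
      ]
    }
  }
}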

Change of policy

A change to a policy can mean the permission required on the Elastic Agent changes. For example at first only nginx access logs were collected but now also error logs. This means additional privileges for the error log data stream are required. The fleet-server must be able to hand out new API Keys with increased / reduced permissions in case of policy changes. In addition, old API Keys have to be invalidated.

Same permissions for all processes per Elastic Agent

The above assumes there are no sub-permissions per input in an Elastic Agent. Whatever runs in an Elastic Agent, the input that requires the most permissions will define the permissions of the API Key. If, for example, the APM integration is run together with the Endpoint integration and the APM process requires read access to certain indices, the Endpoint process would also get the same permissions. This simplifies the permissions model.

On the policy creation side, it is important to notify the user about potential issues through concepts like Trusted/Untrusted integrations or similar, but this is not part of the privileges concept itself.

Fleet

The permissions which are part of the policy need to be created somewhere. It is expected that these are created in Fleet. Every integration should have the option to specify which permissions it needs. Based on this information and the user input like namespace, Fleet needs to generate the permission block. The UX and parts needed in Fleet should be discussed separately.

Agent timeout

Description

Currently the agent request times out after 5 minutes; if there is no action, Fleet Server responds right at the 5-minute mark, causing timing issues and errors on the agent:

[elastic_agent][error] Could not communicate with Checking API will retry, error: fail to checkin to fleet: Post "http://localhost:8000/api/fleet/agents/c7b3cca1-737e-422e-87aa-5367d795a650/checkin?": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Validate Elastic Agent compatibility

Currently Elastic Agent talks to Kibana and verifies the version. In the future this will be fleet-server. For Elastic Agent there are 2 compatibility parts:

  • Standalone: Compatibility with Elasticsearch
  • Managed: Compatibility with fleet-server

As the compatibility with Elasticsearch in managed mode is enforced by fleet-server (#167), this should not be an issue. In general the compatibility looks as follows:

Elastic Agent <= fleet-server <= Elasticsearch

The bugfix release might differ. The last minor of Elastic Agent in a major should be compatible with the first minor of the next major of fleet-server. Same for Elasticsearch. This means fleet-server / Elasticsearch must be upgraded first.

Unenroll ACK needs to invalidate the API keys for the Agent

At the moment the ACK of an unenroll action does not invalidate the API keys for the Agent. It needs to invalidate both the API key used to communicate with Fleet Server as well as its output API key used for communication to Elasticsearch.

Handling of TLS

Overview

Fleet Server needs to be bootstrapped by the Elastic Agent and be running with TLS. At the moment the bootstrap is all HTTP, which is not secure.

The goal is that, in the default case, security is priority #1, followed by a good user experience. We would rather have communication between a remote Elastic Agent and Fleet Server fail due to an invalid TLS configuration than succeed with insecure TLS communication.

Cloud Solution

When Fleet Server is bootstrapped by the Cloud, all the certificates will be provided to the bootstrap command, allowing Fleet Server to start with the required certificates that the Cloud expects.

./elastic-agent enroll --fleet-server <connection_str> --enrollment-token <token> --cert <path_to_cert> --cert-key <path_to_cert_key>

On-prem Solution

In a customer deployment outside of Cloud they will have options.

Option 1 (Production Custom Certs)

They generate their own certificates that are verifiable by the other Elastic Agents in their organization, and they pass these in the same way Cloud does.

./elastic-agent enroll --fleet-server <connection_str> --enrollment-token <token> --cert <path_to_cert> --cert-key <path_to_cert_key>

Option 2 (Auto-generated)

If no --cert* flags are passed to Elastic Agent then Elastic Agent will auto-generate a self-signed certificate with the hostname of the machine.

./elastic-agent enroll --fleet-server <connection_str> --enrollment-token <token>

This means that another Elastic Agent enrolling to this Fleet Server needs to be explicit that it accepts the fact that the certificate is self-signed.

./elastic-agent enroll --url <url_to_fleet_server> --enrollment-token <token> --insecure

If they do not provide the --insecure flag then it will fail to actually connect to the Fleet Server to enroll. We should update the printed message to make it clear why it did not work.

Option 3 (HTTP-only BAD)

This is the final option, in which they want to run the Elastic Agent and Fleet Server with HTTP only. This is not recommended, but is useful for development or maybe in special cases. In this case it is also best to ensure the Fleet Server is bound to localhost, which Elastic Agent will do by default with the --fleet-server-insecure-http flag.

./elastic-agent enroll --fleet-server <connection_str> --enrollment-token <token> --fleet-server-insecure-http

If they really want it to run over HTTP and not on localhost, --fleet-server-bind 0.0.0.0 can be used.

Add scale testing for fleet-server

This is a reminder issue that early on we should start to test fleet-server at scale with many Elastic Agents connected to it, and potentially more than one fleet-server at a time.

Validate Elasticsearch version for compatibility

Fleet Server is compatible with an Elasticsearch version >= the fleet-server version of the same major. The bugfix version may differ. On majors, it is expected that the last minor of fleet-server is compatible with the first minor of the next major Elasticsearch version.

This issue is to document and verify that these restrictions are in place.

Kibana is expected to be on the same version as Elasticsearch. Upgrades work in the following order: Elasticsearch, Kibana, Fleet-server, Elastic Agents.

A few examples (made-up versions):

  • fleet-server 7.13.0, Elasticsearch 7.13.0: yes
  • fleet-server 7.13.2, Elasticsearch 7.13.1: yes
  • fleet-server 7.13.2, Elasticsearch 7.14.2: yes
  • fleet-server 7.last, Elasticsearch 8.0.0: yes
  • fleet-server 7.14.0, Elasticsearch 7.13.0: no
  • fleet-server 8.0.0, Elasticsearch 7.last: no

Handle INTERNAL_POLICY_REASSIGN action

Description

In Kibana we are currently creating an INTERNAL_POLICY_REASSIGN action to know when we need to fetch the agent again and distribute a new policy; this should probably be handled in Fleet Server too.

Improve actions in the Elastic Agent

The way actions work in Elastic Agent should be improved. Most of the work here probably must happen in the Elastic Agent itself. Filing it here for now as the new action implementation is driven by fleet server.

High level notes (needs more details):

  • The agent side changes (as far as I understand):
    • action token exchange on checkin
    • agent actions routing (specifically for osquery in short term)
    • agent actions result/status/error forwarding from the "app" to the fleet server
    • the agent managing the Fleet Server launch

[Meta] Fleet Server observability

Observability for Fleet Servers is important because operators are expected to scale them manually. They are also expected to troubleshoot operational issues and fix them. This is true to varying degrees from self-managed environments to ECE/ECK and ESS. As a result, we should provide users visibility into issues and documentation on how to address common ones.

As a Fleet operator, I need:

  • Logs, metric and status information to help me scale and troubleshoot issues elastic/beats#24415
  • Fleet Server integration dashboard elastic/integrations#812
  • Documentation for Fleet Server operators on how to observe it, scale it, scaling limits, restrictions, etc.

Out of scope:

  • A notification in the Fleet app when Fleet Server is not healthy so I understand why updates are not being applied elastic/kibana#95572

Related issues:

[Meta] Fleet server should gracefully reject connections

When incoming connections to Fleet server exceed the capacity for that server, we should gracefully reject those connections. This will allow the clients to be load balanced to other nodes, when available.

Currently, Fleet server will consume all available RAM as the number of connections increases until it triggers the OOM killer. This will have a negative impact on clients and other processes. When a fleet server is killed, it drops all connections, which would require many thousands of clients to reconnect, using a lot of CPU and preventing users from making updates or triggering response actions during that period. It will also negatively affect other processes running in the same container, such as APM server or Beats.

Requirements:

  • We should reject incoming connections rather than crash or trigger the OOM killer
  • We should account for ESS and self-managed clusters. We should not solely rely on a proxy to enforce limits.
  • The capacity should adjust based on the instance size in ESS/ECE and self-managed hosts
  • We should provide some buffer for other processes in the same container, such as metricbeat which is essential to alert operators to scale capacity

Implementation phases:

Open questions:

  1. What is the best approach to gracefully reject connections? Is it a hard limit, or can it intelligently reject connections if there is not enough free memory available?
    • Answer: Start with a hard limit and let the user adjust it. Make it more intelligent in a second phase.
  2. Can we return an HTTP error code like 503 so the client knows why the request is rejected? It may be expensive to accept more TLS connections only to reject them. An alternative could be to give a tcp reset. This is not as clear when troubleshooting because tcp resets can also occur for other reasons like network connectivity.
    • Answer: Start with TCP reset because it doesn't require a TLS connection to be accepted, which would require more memory
  3. If there is a limit, can we adjust it automatically so that the user does not need to manually adjust it, in ESS/ECE or self-managed hosts? Can it adjust so the cloud API doesn't need to alter the configuration? For example, the limit could be calculated based on the environment when the fleet server starts.
    • Answer: Lets target this for a second phase
  4. Can we reject connections based on the amount of memory available? If other processes use too much memory, the fleet server might run out of memory before hitting a hard connection limit. This would lead us to use wider operating margins which are less cost efficient.
    • Answer from Steffen: The Agent would be in a better place to monitor and adapt limits dynamically if there is need. I don't think this is a short term goal.

Related issues:

[Meta] Fleet Server Phase 2

The goal of this phase is to bring Fleet Server to the beta stage. This is a meta tracking issue.

Bugs

  • .fleet-actions index not found #246

CI

Docs (@ph )

Let the user set the Fleet Server host after installing

Currently, the user is required to set the Fleet Server host in Fleet Settings before installing Fleet Server. If they do not, the agent will get a policy with no valid hosts and it will no longer receive updates. I'm worried that not all users will read the instructions carefully. It'd be nice if we designed our fleet server to be more resilient and able to recover in this scenario.

The agent already has the ability to check whether a fleet server host is valid by checking a status endpoint. If the status endpoint does not return 200, it does not accept the policy and it returns an unhealthy status.

Can we do the same during bootstrapping to allow the user to set the Fleet Server host after installing Fleet Server? If it's valid, then the agent will finish bootstrapping and check in successfully. If not, keep checking ES on a regular interval until it is. This allows the user to fill in the Fleet Server host later. We can also set the agent status as unhealthy to indicate that setup is not complete.
