allegro / envoy-control

Envoy Control is a platform-agnostic, production-ready Control Plane for Service Mesh based on Envoy Proxy.

License: Apache License 2.0

Kotlin 92.75% Shell 0.19% Dockerfile 0.19% Java 4.05% Lua 2.82%

hacktoberfest envoy-control envoy-proxy service-mesh kotlin control-plane envoy consul

envoy-control's Introduction

Envoy Control

Envoy Control is a production-ready, platform-agnostic Control Plane for Service Mesh, based on the Envoy Proxy Data Plane.

Docs

Full docs are hosted at https://envoy-control.readthedocs.io/en/latest/

Quick start

Quick start guide is located at https://envoy-control.readthedocs.io/en/latest/quickstart

envoy-control's Issues

Fix PARALLEL executorGroup to ensure sequential execution for single DiscoveryRequestStreamObserver

Currently we use one ThreadPoolExecutor shared for all DiscoveryRequestStreamObservers as an ExecutorGroup: https://github.com/allegro/envoy-control/blob/master/envoy-control-core/src/main/kotlin/pl/allegro/tech/servicemesh/envoycontrol/ControlPlane.kt#L105
(only when corresponding property is set to PARALLEL, which is not the default)

This approach is not valid, because tasks for the same DiscoveryRequestStreamObserver can run concurrently on different threads, so XDS responses for that observer may be sent out of order.

We should switch our parallel ExecutorGroup implementation to multiple single-threaded ThreadPoolExecutors. This way each DiscoveryRequestStreamObserver runs sequentially, while many DiscoveryRequestStreamObservers can still run in parallel.

Implementation of such ExecutorGroup may look like this (not tested):

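// ExecutorGroup is assumed here to be the java-control-plane interface.
import io.envoyproxy.controlplane.server.ExecutorGroup
import java.util.concurrent.Executor
import java.util.concurrent.ExecutorService
import java.util.concurrent.atomic.AtomicInteger

class SequentialExecutorGroup(
    threads: Int,
    singleThreadedExecutorFactory: (Int) -> ExecutorService
) : ExecutorGroup {

    // One single-threaded executor per slot; each executor runs its own tasks in order.
    private val executors = (0 until threads).map(singleThreadedExecutorFactory)
    private val counter = AtomicInteger(0)

    // Hands out executors round-robin; the counter wraps back to 0 at the end of the list.
    override fun next(): Executor {
        val index = counter.getAndUpdate { c -> (c + 1) % executors.size }
        return executors[index]
    }
}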
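The property the proposal relies on can be checked with a small standalone sketch. It does not use the real ExecutorGroup interface: `ExecutorGroupSketch`, `RoundRobinSingleThreadGroup` and `runInOrder` are hypothetical names introduced only for illustration, and the executors come from `Executors.newSingleThreadExecutor()` rather than from a factory.

```kotlin
import java.util.Collections
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Executor
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for the real ExecutorGroup interface, so the sketch compiles on its own.
interface ExecutorGroupSketch {
    fun next(): Executor
}

class RoundRobinSingleThreadGroup(threads: Int) : ExecutorGroupSketch {
    private val executors: List<ExecutorService> = List(threads) { Executors.newSingleThreadExecutor() }
    private val counter = AtomicInteger(0)

    // Round-robin selection; every returned executor is backed by exactly one thread.
    override fun next(): Executor =
        executors[counter.getAndUpdate { c -> (c + 1) % executors.size }]

    fun shutdownAndWait() {
        executors.forEach { it.shutdown() }
        executors.forEach { it.awaitTermination(5, TimeUnit.SECONDS) }
    }
}

// Submits `count` numbered tasks to one executor and returns the order in which they ran.
fun runInOrder(executor: Executor, count: Int): List<Int> {
    val seen = Collections.synchronizedList(mutableListOf<Int>())
    val done = CountDownLatch(count)
    repeat(count) { i ->
        executor.execute {
            seen.add(i)
            done.countDown()
        }
    }
    done.await(5, TimeUnit.SECONDS)
    return seen.toList()
}
```

An observer that keeps the executor it received from `next()` has its tasks executed in submission order, while different observers may land on different threads.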

RBAC logging does not respect configured client-identity-header

envoy-control allows defining a header name that contains a client identity for RBAC by setting the envoy-control.envoy.snapshot.incoming-permissions.client-identity-header property.
However, logging of unauthorized requests does not respect this property and uses a hardcoded name x-client-name:
https://github.com/allegro/envoy-control/blob/master/envoy-control-core/src/main/resources/lua/ingress_rbac_logging.lua#L4

Expected behavior:
ingress_rbac_logging.lua uses client-identity-header to retrieve a clientName from a request.

Extract some parts of SnapshotUpdater and break things up

I'd add an issue to think about extracting some parts of SnapshotUpdater and breaking things up.

Originally posted by @slonka in #159 (comment)

Open ReliabilityTests classes

JUnit uses inheritance to allow reusing tests with different configurations. Kotlin classes are final by default, so we should add the open modifier to the ReliabilityTests classes so that they can be extended and run with other configurations.
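A minimal sketch of the idea, without any JUnit dependency. The class names and the `failureDurationSeconds` property are hypothetical and not taken from the real test classes.

```kotlin
// Hypothetical base test: `open` lets other classes extend it and override its configuration.
open class TransientFailureReliabilityTest {
    // Assumed configuration knob, for illustration only.
    open val failureDurationSeconds: Long = 5

    fun describe(): String = "failure lasts ${failureDurationSeconds}s"
}

// Reuses everything from the base class with a different configuration.
class LongTransientFailureReliabilityTest : TransientFailureReliabilityTest() {
    override val failureDurationSeconds: Long = 60
}
```

Without `open` on the base class, the subclass declaration would not compile.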

Are there any examples of dynamic Listeners?

Hi all,
Are there any examples that show how to use dynamic Listeners?
I would be very grateful.

Require from Envoy to pass service_name in metadata and use it even if incoming-permissions is disabled

Currently:

If incoming-permissions is disabled, we set service_name to an empty string, even if Envoy has passed it in metadata.
We accept that Envoy may not pass service_name in metadata. In that case we simply set it to an empty string in MetadataNodeGroup.

This may become a problem: nobody remembers that service_name can be an empty string, and sooner or later somebody will use it for purposes other than incoming-permissions, which means EC will not work correctly with incoming-permissions disabled.

TODO:

make service_name mandatory and validate its presence in NodeMetadataValidator
set service_name even if incoming permissions are disabled.
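The validation step in the TODO could look roughly like the sketch below. It is not the actual NodeMetadataValidator: `requireServiceName`, `MissingServiceNameException` and the plain `Map` used for metadata are assumptions made for illustration.

```kotlin
// Hypothetical exception type, for illustration only.
class MissingServiceNameException(message: String) : IllegalArgumentException(message)

// Returns the service name or throws. `metadata` stands in for the node metadata fields.
fun requireServiceName(metadata: Map<String, String?>): String {
    val serviceName = metadata["service_name"]
    if (serviceName.isNullOrBlank()) {
        throw MissingServiceNameException("service_name is required in node metadata")
    }
    return serviceName
}
```

Rejecting the node early means later code can rely on service_name being non-empty, whether or not incoming-permissions is enabled.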

Discussion: #87 (comment)

Break down RBAC tests to smaller classes

RBACFactoryTest class is becoming too big. It should be split up (and RBACFactory probably as well)

Refactor http config structure

#23 (comment)

Define code formatter

PR diffs frequently contain unrelated formatting changes like adjusted indentation. It makes reviewing harder.

Let's define formatting rules and stick to them, ideally in the form of a committed code formatter configuration that can be loaded into IntelliJ.

ingress connection idle timeout property has wrong description

The property called envoy-control.envoy.snapshot.localService.connectionIdleTimeout should be renamed to envoy-control.envoy.snapshot.ingress.commonHttp.connectionIdleTimeout which would reflect what this setting actually does.

Speed up tests time by parallelizing container startup

Our testing mechanism has changed: we now use JUnit extensions to start containers. We need to figure out whether, and how, we can parallelize container startup now.

OLD:

Use something like:

containerList.parallelStream().forEach { it.start() }
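An alternative to the one-liner above, sketched with `CompletableFuture` so that the caller explicitly waits for all containers and sees failures. `StartableSketch` and `startAll` are hypothetical names; the real containers would come from the test setup.

```kotlin
import java.util.concurrent.CompletableFuture

// Hypothetical stand-in for a test container; only `start()` matters here.
interface StartableSketch {
    fun start()
}

// Starts all containers concurrently and blocks until every one has started.
// A failure in any start() is rethrown by join() as a CompletionException.
fun startAll(containers: List<StartableSketch>) {
    val futures = containers.map { container -> CompletableFuture.runAsync { container.start() } }
    CompletableFuture.allOf(*futures.toTypedArray()).join()
}
```

Total startup time then approaches that of the slowest container instead of the sum of all of them, provided the common pool has enough threads.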

Stabilize tests

Consul ACL support

Hi folks,

Are ACLs supported? Looking at the docs, it doesn't seem that there's a way to pass a Consul token.

Let me know if I missed it.

Upgrade reactor to latest 3.2.x

We use the old 3.2.x reactor version: 3.2.5. We should upgrade to latest bugfix version (3.2.16 at the moment of writing).
Newer versions of reactor provide bugfixes.
Changelog: https://github.com/reactor/reactor-core/releases

Remote DCs should have priority according to Consul latency measures

Currently, local DC gets priority = 0 and all remote DCs get priority = 1. We should use Consul's latency measures to order the remote DCs so the fallback performance drop is more predictable.
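The proposed ordering can be sketched as follows. The `latencyMillis` map is a hypothetical input; how the measurements would be read from Consul is not shown here.

```kotlin
// Assigns priority 0 to the local DC and 1..n to remote DCs, ordered by measured latency.
// `latencyMillis` is assumed to hold one latency figure per DC name.
fun dcPriorities(localDc: String, latencyMillis: Map<String, Double>): Map<String, Int> {
    val remotes = latencyMillis.keys
        .filter { it != localDc }
        .sortedBy { latencyMillis.getValue(it) }
    val priorities = linkedMapOf(localDc to 0)
    remotes.forEachIndexed { index, dc -> priorities[dc] = index + 1 }
    return priorities
}
```

With latencies of 40 ms to dc2 and 12 ms to dc3, dc3 becomes the first fallback and dc2 the second.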

Race condition during initialization of StateWatcher

There is a chance of a race condition during creation of StateWatcher. I saw this behaviour once in integration tests.

Standard behavior:

doOnSubscribe is called: the watcher starts watching for changes
the emitter is created: watcher.stateReceiver is set up
a change occurs and is emitted via stateReceiver

Anomaly:

A change can occur before stateReceiver is set up.
UninitializedPropertyAccessException is thrown.

Possible solution:
Set up stateReceiver together with the watcher.start() call, inside the emitter's lambda.

Source code:

envoy-control/envoy-control-source-consul/src/main/kotlin/pl/allegro/tech/servicemesh/envoycontrol/consul/services/ConsulServiceChanges.kt

Lines 32 to 51 in 6dc93b1

fun watchState(): Flux<ServicesState> {
    val watcher = StateWatcher(watcher, serviceMapper, objectMapper, metrics, subscriptionDelay)
    return Flux.create<ServicesState>(
        { sink ->
            watcher.stateReceiver = { sink.next(it) }
        },
        FluxSink.OverflowStrategy.LATEST
    )
        .measureDiscardedItems("consul-service-changes-emitted", metrics.meterRegistry)
        .checkpoint("consul-service-changes-emitted")
        .name("consul-service-changes-emitted").metrics()
        .distinctUntilChanged()
        .checkpoint("consul-service-changes-emitted-distinct")
        .name("consul-service-changes-emitted-distinct").metrics()
        .doOnSubscribe { watcher.start() }
        .doOnCancel {
            logger.warn("Cancelling watching consul service changes")
            watcher.close()
        }
}

You can see the failed test here

p.a.t.s.e.c.s.ConsulServiceChanges$StateWatcher - Error while watching service envoy-control
kotlin.UninitializedPropertyAccessException: lateinit property stateReceiver has not been initialized
    at pl.allegro.tech.servicemesh.envoycontrol.consul.services.ConsulServiceChanges$StateWatcher.changeState(ConsulServiceChanges.kt:172)
    at pl.allegro.tech.servicemesh.envoycontrol.consul.services.ConsulServiceChanges$StateWatcher.handleServiceInstancesChange(ConsulServiceChanges.kt:148)
    at pl.allegro.tech.servicemesh.envoycontrol.consul.services.ConsulServiceChanges$StateWatcher.access$handleServiceInstancesChange(ConsulServiceChanges.kt:53)
    at pl.allegro.tech.servicemesh.envoycontrol.consul.services.ConsulServiceChanges$StateWatcher$handleNewService$$inlined$synchronized$lambda$1.accept(ConsulServiceChanges.kt:128)
    at pl.allegro.tech.servicemesh.envoycontrol.consul.services.ConsulServiceChanges$StateWatcher$handleNewService$$inlined$synchronized$lambda$1.accept(ConsulServiceChanges.kt:53)
    at pl.allegro.tech.discovery.consul.recipes.watch.EndpointWatcher.lambda$watch$0(EndpointWatcher.java:21)
    at pl.allegro.tech.discovery.consul.recipes.watch.ConsulLongPollCallback.lambda$handleContentChanged$3(ConsulLongPollCallback.java:159)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
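The ordering problem and the proposed fix can be illustrated without Reactor. The sketch below is a simplified stand-in, not the real StateWatcher: `WatcherSketch`, its `pendingChange` parameter and `subscribe` are hypothetical.

```kotlin
// Hypothetical, simplified stand-in for StateWatcher; not the real class.
class WatcherSketch(private val pendingChange: String?) {
    lateinit var stateReceiver: (String) -> Unit

    // Starting the watcher may deliver a change immediately.
    fun start() {
        pendingChange?.let { stateReceiver(it) }
    }
}

// Mirrors the proposed fix: the receiver is assigned first and the watcher is started
// right after it, in the same place, so no change can arrive before the receiver exists.
fun subscribe(watcher: WatcherSketch, sink: (String) -> Unit) {
    watcher.stateReceiver = sink
    watcher.start()
}
```

Calling `start()` on a watcher whose receiver was never assigned reproduces the UninitializedPropertyAccessException from the stack trace.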

Upgrade Gradle to 6.5.* (or newer) to show correct Lua test names

There is a bug in Gradle because of which we don't see correct names of Lua tests in the output.
Upgrading Gradle to 6.5.* should fix it.
We probably also need to upgrade kotlin before upgrading Gradle, so this issue is blocked by #102

Details of our problem are described here: https://github.com/allegro/envoy-control/pull/139/files#r453732945
More info about the bug in gradle:

Parallelise test by timing data on circle ci

https://circleci.com/gh/allegro/envoy-control/tree/run-tests-in-parallel

https://circleci.com/blog/how-to-boost-build-time-with-test-parallelism/ (gradle setup)

Create docs for "Timeouts for wildcard dependency"

There are missing docs for #165

Add tests to EnvoySnapshotFactory

Bug solved by #80 shows that we could do with some unit tests of SnapshotFactory (checking that listeners are always created).

Describe in documentation timeouts for outgoing dependencies

Sample configuration:

metadata:
  proxy_settings:
    outgoing:
      dependencies:
        - service: "*"
          timeoutPolicy:
            idleTimeout: 400
            requestTimeout: 800
        - service: "slow-service-A"
          timeoutPolicy:
            idleTimeout: 800
            requestTimeout: 1500

AC:

describe idleTimeout, requestTimeout
describe timeout behaviour for missing configs and config with *
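One possible way to think about the behaviour to be documented is sketched below. The resolution order (service-specific entry, then the `*` entry, then defaults) is an assumption, not something confirmed by the source, and `TimeoutPolicy`, `resolveTimeouts` and the default values in the usage are hypothetical.

```kotlin
// Hypothetical model of a timeoutPolicy entry; null means "not configured".
data class TimeoutPolicy(val idleTimeout: Long?, val requestTimeout: Long?)

// Assumed resolution order (not confirmed by the source): exact service entry,
// then the "*" entry, then caller-supplied defaults.
fun resolveTimeouts(
    service: String,
    dependencies: Map<String, TimeoutPolicy>,
    defaults: TimeoutPolicy
): TimeoutPolicy {
    val specific = dependencies[service]
    val wildcard = dependencies["*"]
    return TimeoutPolicy(
        idleTimeout = specific?.idleTimeout ?: wildcard?.idleTimeout ?: defaults.idleTimeout,
        requestTimeout = specific?.requestTimeout ?: wildcard?.requestTimeout ?: defaults.requestTimeout
    )
}
```

Under this assumption, slow-service-A from the sample configuration would get 800/1500, and any other service would get the 400/800 from the `*` entry.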

Split up EnvoyControlTestConfiguration.kt

This class contains setup, http clients, stub applications interactions, SUT factories, SUT interactions, some assertions, and cleanup. I think we should consider splitting it :)

Originally posted by @pzmi in #112

Remove support for envoy < 1.14.0-dev

Currently we use envoy 1.14.0-dev (commit 6c2137468c25d167dbbe4719b0ecaf343bfb4233), but we keep backward compatibility with envoy 1.13.0-dev (commit b7bef67c256090919a4585a1a06c42f15d640a09) (excluding incoming permissions, which don't work on that version).

When backward compatibility is no longer needed, remove the deprecated code and the tests that check the compatibility (all marked with TODO comments).

Proxy settings: handle endpoints in outgoing section

Author: @MarcinFalkowski

Measure thread pools

In our internal flavour we monitor the state of thread pools utilisation, queues, etc. It would be a great benefit for this functionality to be available in the OSS version of e-c. If there's interest I'll be happy to share specs and how we do it.

Unify assertions

See: https://github.com/allegro/envoy-control/pull/66/files/485c3e8f064f43219bcb2d94ce59a0758f3f8c9b#r374125432

I've seen we have a couple of different styles of asserting routes. I'd love to have one consistent way. What about something like simple methods? :)

Remove service name authentication based on header

Once the service name can be taken from the certificate name, we should delete header principals and other code related to passing the service name in headers.

Don't use latest docker image in tests

Author: @chemicL

Consul sometimes goes into a strange state during resilience tests

Author: @lukidzi

When we run the resilience tests in a specific order, we may observe Consul in a strange state during the test "should be resilient to transient unavailability of one DC":

2019/05/30 12:16:49 [DEBUG] memberlist: Stream connection from=192.168.144.10:33118
2019/05/30 12:16:49 [WARN] consul.rpc: RPC request for DC "dc2", no path found
2019/05/30 12:16:49 [ERR] http: Request GET /v1/health/service/envoy-control?passing&dc=dc2, error: No path to datacenter from=192.168.144.1:34266
2019/05/30 12:16:49 [DEBUG] http: Request GET /v1/health/service/envoy-control?passing&dc=dc2 (2.285225ms) from=192.168.144.1:34266
2019/05/30 12:16:49 [WARN] consul.rpc: RPC request for DC "dc2", no path found

This state lasts for ~2 min. Only restarting the container (a Consul restart) helps.

I tried killing all connections while the connections were cut off, but it didn't help. Restarting the interfaces and changing the ARP GC config didn't help either.

We should write a test that reproduces this state and report the problem to HashiCorp.

Check parsing / validation of timeouts and other settings

See: #62 (comment)

Log serviceName from NodeMetadata upon errors

When a node connects with invalid configuration and causes exceptions we should log which service is issuing such XDS requests. We can simply extract it from NodeMetadata.

Example:

pl.allegro.tech.servicemesh.envoycontrol.groups.NodeMetadataValidationException: INVALID_ARGUMENT: Unsupported protocol for domain dependency for domain ...

Debounce creating new Snapshot

When there is movement in the Consul cluster (say we spin up 20 instances of a service), we create a new snapshot for every new instance. We've already seen as many as 150 changes in one minute on dev.

We should debounce snapshot creation so that it happens at most X times per second. To do that, we could use .delay(Duration.ofMillis()) in Reactor.
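The "at most X times per second" behaviour can be illustrated on a plain list of timestamped events: only the latest event of each time window is kept. This is a standalone sketch, not Reactor code; `sampleLatest` is a hypothetical helper. In Reactor itself, `Flux.sample(Duration)` is one operator with this kind of behaviour and may be worth evaluating.

```kotlin
// Keeps only the latest event from each time window, so at most one event per
// `windowMillis` is passed on. Events are (timestampMillis, value) pairs sorted by time.
fun <T> sampleLatest(events: List<Pair<Long, T>>, windowMillis: Long): List<T> {
    require(windowMillis > 0) { "windowMillis must be positive" }
    return events
        .groupBy { (timestamp, _) -> timestamp / windowMillis }
        .toSortedMap()
        .values
        .map { bucket -> bucket.last().second }
}
```

With a one-second window, a burst of changes within the same second results in a single snapshot built from the most recent state.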

Remove unused import

Remove unused import in envoy clusters factory.

Originally posted by @pbetkier in #66

Remove usage of untilAsserted in tests

All tests that are not reliability tests should use the waitForReadyServices method rather than untilAsserted.

Refactor XDS resources generation

Currently all resource generation is handled in one class, EnvoySnapshotFactory. We could figure out a way to separate the logic into different classes and have a cleaner way to bind them in SnapshotUpdater.

Add support for path parameters in incoming permissions

The current implementation of incoming permissions doesn't allow defining path parameters in a rule.
We should support such cases, e.g.:

/user/{id}/info
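One way such a template could be matched is to translate it into a regular expression in which every `{name}` placeholder matches a single non-empty path segment. This is only a sketch of the matching idea; `pathTemplateToRegex` is a hypothetical helper, and how the rule would be expressed in the generated Envoy configuration is not covered.

```kotlin
// Converts a template such as /user/{id}/info into a regex where each {name}
// placeholder matches exactly one non-empty path segment. Literal parts are quoted.
fun pathTemplateToRegex(template: String): Regex {
    val placeholder = Regex("\\{[^/{}]+\\}")
    val pattern = StringBuilder("^")
    var last = 0
    for (match in placeholder.findAll(template)) {
        pattern.append(Regex.escape(template.substring(last, match.range.first)))
        pattern.append("[^/]+")
        last = match.range.last + 1
    }
    pattern.append(Regex.escape(template.substring(last)))
    pattern.append("$")
    return Regex(pattern.toString())
}
```

The resulting regex accepts /user/42/info but rejects paths with extra or empty segments in place of the parameter.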

Incoming permissions logs - handle data in invalid format

Currently we produce incoming permissions logs as JSON, which is assembled in an unreliable way. If certain characters (for example ") are present in the x-service-name header, the resulting JSON is invalid.

To consider:

Fetch serviceName from client certificate SAN URI, instead of a header.
Change log format to something simpler than JSON, for example "every field in a new line"
Sanitize JSON strings
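For the "sanitize JSON strings" option, the escaping rules themselves are small. The sketch below shows them in Kotlin purely as an illustration; `escapeJsonString` is a hypothetical helper, and the code that actually builds the log entry would need an equivalent in its own language.

```kotlin
// Escapes a value so it can be placed between double quotes in a JSON document.
// Handles quotes, backslashes and control characters below U+0020.
fun escapeJsonString(value: String): String {
    val sb = StringBuilder(value.length + 2)
    for (ch in value) {
        when (ch) {
            '"' -> sb.append("\\\"")
            '\\' -> sb.append("\\\\")
            '\n' -> sb.append("\\n")
            '\r' -> sb.append("\\r")
            '\t' -> sb.append("\\t")
            else -> if (ch < ' ') sb.append(String.format("\\u%04x", ch.code)) else sb.append(ch)
        }
    }
    return sb.toString()
}
```

A header value containing a double quote then ends up as `\"` inside the log entry and no longer terminates the JSON string early.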

More information: #152 (comment)

Replace portBindings.add with standard way of exposing ports

How to expose ports:
https://www.testcontainers.org/features/networking/#exposing-container-ports-to-the-host

Where to fix:

envoy-control/envoy-control-tests/src/main/kotlin/pl/allegro/tech/servicemesh/envoycontrol/config/consul/ConsulContainer.kt

Line 30 in b74ba3c

portBindings.add("$externalPort:$internalPort")

Upgrade kotlin to latest 1.3.x

We use the oldest 1.3.x kotlin version: 1.3.0. We should upgrade to latest bugfix version (1.3.71 at the moment of writing).
Newer versions of Kotlin provide bugfixes and performance improvements.
Changelog: https://github.com/JetBrains/kotlin/blob/master/ChangeLog.md

Flaky tests

There are some flaky tests:

should be resilient to transient unavailability of EC in one DC() - pl.allegro.tech.servicemesh.envoycontrol.reliability.EnvoyControlDownInOneDc

failed build

latency between service registration in local dc and being able to access it via envoy should be less than 0,5s + stateSampleDuration() - pl.allegro.tech.servicemesh.envoycontrol.EnvoyControlSynchronizationRunnerTest

failed build

Figure out if caching locality lb endpoints breaks xDS protocol

During development of #10 we stumbled upon an issue of caching (see related issue for full story). We should investigate if it breaks https://github.com/envoyproxy/data-plane-api/blob/master/XDS_PROTOCOL.md and report that to java-control-plane if it does. Also our fix for this issue is to not remove a service completely from serviceNameToInstances map - that would cause envoy to return 503 (no instances in a cluster) instead of 404 (no cluster)

Migrate deprecated options from envoy v.1.12.0

The following options are marked as deprecated in envoy v.1.12.0: https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated#version-1-12-0-october-31-2019. We should migrate away from them to keep our configuration up to date.

Make RemoteServicesTest more readable by configuring state per test instead of global implicit setup

Currently the tests assume some global state which is defined in the configuration of the mock http client used. This makes inferring how the behaviour meets the expectations not trivial. I'm suggesting every test case makes a setup of the base state (which is unique in every test case) and the tests are readable without jumping to definitions at the bottom of the file.

Envoy should use system certificates file to validate server certificates

Author: @franek1709

Currently Envoy uses the certificates file at /etc/ssl/certs/ca-certificates.crt to validate server certificates during SSL requests. It would be better to use the default certificate file of the operating system Envoy runs on.

AC:

Envoy sends the path of the operating system's certificate file to Envoy Control in metadata
Envoy Control uses this path to validate SSL certificates of upstream clusters

Create a more robust way to generate Envoy configs for tests

Current static config files are not enough to catch all the cases needed in EC. We should create a config generator that would dynamically create config files like config_ads.yaml or config_ads_static_listeners.yaml on demand so we can test more cases.

Add useful configurations for local development

We already have the tooling and we can run envoy-control locally while connecting to an environment set up with docker-compose. However, the current configuration files don't use more advanced features such as mTLS and RBAC, which hinders experimenting with and testing these functionalities. Also, creating a configuration for envoy-control and an Envoy connecting to it from the ground up every time I want to experiment is time-consuming.
I'd like to have a set of configurations utilizing HTTP2, mTLS, RBAC (or at least a base configuration with all of these predefined) on which I could easily start experimenting.

Support instances with hostname

author: @jakubdyszkiewicz

Envoy throws an error when there are instances with a hostname instead of an IP.

An error thrown by Envoy

[2018-11-08 09:29:32.855][23590][warning][config] bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:70] gRPC config for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment rejected: malformed IP address: lrgw1.fivecamel-dev.pl-kra-3.dc4.local. Consider setting resolver_name or setting cluster type to 'STRICT_DNS' or 'LOGICAL_DNS'

AC:

Envoy Control & Envoy support instances with hostnames

Import issues

import issues from issue tracker
replace "GITHUB_LINK" placeholders

Document admin routes

Document how EC guards against /status/envoy routes (e.g. /status/envoy/config_dump).

Create integration test for ip-based + header-selector-matching principal

The integration test should check that headers are removed after all HTTP filters are processed. If this behavior changes, the RBAC filter will stop working correctly.
Removing headers was added in PR 196; please refer to the comment there.
