Git Product home page Git Product logo

envoy-control's Introduction

envoy-control's People

Contributors

andrzejwaw avatar automaat avatar chemicl avatar danielkwinsor avatar deejay1 avatar dependabot[bot] avatar ferdudas97 avatar franek1709 avatar gawarz avatar gut9 avatar konrad-kaminski avatar kozjan avatar ksmigielski avatar kukulam-allegro avatar lukidzi avatar malfoj89 avatar marcinfalkowski avatar marek1410 avatar matb4r avatar mzawirski avatar nastassia-dailidava avatar pbetkier avatar piorkowskiprzemyslaw avatar pzmi avatar sivarion avatar skyrrt avatar slonka avatar w-przechrzta avatar wookiej avatar wprzechrzta avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

envoy-control's Issues

Fix PARALLEL executorGroup to ensure sequential execution for single DiscoveryRequestStreamObserver

Currently we use one ThreadPoolExecutor shared for all DiscoveryRequestStreamObservers as an ExecutorGroup: https://github.com/allegro/envoy-control/blob/master/envoy-control-core/src/main/kotlin/pl/allegro/tech/servicemesh/envoycontrol/ControlPlane.kt#L105
(only when corresponding property is set to PARALLEL, which is not the default)

This approach is not valid because it will lead to sending XDS responses out-of-order for given DiscoveryRequestStreamObserver.

We should switch our parallel ExecutorGroup implementation to multiple, single-threaded ThreadPoolExecutors. This way we will ensure that single DiscoveryRequestStreamObserver is running sequentially, but many DiscoveryRequestStreamObservers may run in parallel.

Implementation of such ExecutorGroup may look like this (not tested):

 class SequentialExecutorGroup(
    threads: Int,
    singleThreadedExecutorFactory: (Int) -> ExecutorService
) : ExecutorGroup {

    private val executors = (0 until threads).map(singleThreadedExecutorFactory)
    private val counter = AtomicInteger(0)

    override fun next(): Executor {

        val index = counter.getAndUpdate { c ->
            val next = c+1
            if (next >= executors.size) {
                0
            } else {
                next
            }
        }

        return executors[index]
    }
}

RBAC logging does not respect configured client-identity-header

envoy-control allows defining a header name that contains a client identity for RBAC by setting the envoy-control.envoy.snapshot.incoming-permissions.client-identity-header property.
However, logging of unauthorized requests does not respect this property and uses a hardcoded name x-client-name:
https://github.com/allegro/envoy-control/blob/master/envoy-control-core/src/main/resources/lua/ingress_rbac_logging.lua#L4

Expected behavior:
ingress_rbac_logging.lua uses client-identity-header to retrieve a clientName from a request.

Open ReliabilityTests classes

jUnit uses inheritance to allow reusing tests with different configurations. We should add open modifier so they can be run with other configurations.

Require from Envoy to pass service_name in metadata and use it even if incoming-permissions is disabled

Currently:

  • If incoming-permissions is disabled we set service_name to empty string, even if Envoy has passed it in metadata
  • we accept that envoy may not pass service_name in metadata. In such case we simply set it to an empty string in MetadataNodeGroup ()

It may be a problem, because nobody remembers that it may be an empty string and sooner or later somebody will use service_name to other purposes than incoming-permissions, effectively making EC not working correctly with incoming-permissions disabled.

TODO:

  • make service_name mandatory and validate its presence in NodeMetadataValidator
  • set service_name even if incoming permissions are disabled.

Discussion: #87 (comment)

Define code formatter

PR diffs frequently contain unrelated formatting changes like adjusted indentation. It makes reviewing harder.

Let's define formatting rules and stick to them. Ideally in a form of committed code formatter that can be loaded into IntelliJ.

Speed up tests time by parallelizing container startup

Our testing mechanisms has changed, now we use junit Extensions to start containers. We need to figure out if we can parallelize it now and how.


OLD:

Use something like:

containerList.parallelStream().forEach { it.start() }

Consul ACL support

Hi folks,

Are ACLs supported? Looking at the docs, it doesn't seem that there's a way to pass a Consul token.

Let me know if I missed it.

Race condition during initialization of StateWatcher

There is a chance for race condition during creation of StateWatcher. I saw this behaviour once in integration tests.

Standard behavior:

  1. call doOnSubscribe - start watching for changes
  2. create emitter - setup watcher.stateReceiver
  3. change occurred and is emitted via stateReceiver

Anomaly:

  1. Change can occurred before stateReceiver is set up.
  2. UninitializedPropertyAccessException is thrown

Possible solution:
Setup stateReceiver during watcher.start() in emitter's lambda

Source code:

fun watchState(): Flux<ServicesState> {
val watcher = StateWatcher(watcher, serviceMapper, objectMapper, metrics, subscriptionDelay)
return Flux.create<ServicesState>(
{ sink ->
watcher.stateReceiver = { sink.next(it) }
},
FluxSink.OverflowStrategy.LATEST
)
.measureDiscardedItems("consul-service-changes-emitted", metrics.meterRegistry)
.checkpoint("consul-service-changes-emitted")
.name("consul-service-changes-emitted").metrics()
.distinctUntilChanged()
.checkpoint("consul-service-changes-emitted-distinct")
.name("consul-service-changes-emitted-distinct").metrics()
.doOnSubscribe { watcher.start() }
.doOnCancel {
logger.warn("Cancelling watching consul service changes")
watcher.close()
}
}

Use can see failed test here

p.a.t.s.e.c.s.ConsulServiceChanges$StateWatcher - Error while watching service envoy-control kotlin.UninitializedPropertyAccessException: lateinit property stateReceiver has not been initialized at pl.allegro.tech.servicemesh.envoycontrol.consul.services.ConsulServiceChanges$StateWatcher.changeState(ConsulServiceChanges.kt:172)
    at pl.allegro.tech.servicemesh.envoycontrol.consul.services.ConsulServiceChanges$StateWatcher.handleServiceInstancesChange(ConsulServiceChanges.kt:148)
    at pl.allegro.tech.servicemesh.envoycontrol.consul.services.ConsulServiceChanges$StateWatcher.access$handleServiceInstancesChange(ConsulServiceChanges.kt:53)
    at pl.allegro.tech.servicemesh.envoycontrol.consul.services.ConsulServiceChanges$StateWatcher$handleNewService$$inlined$synchronized$lambda$1.accept(ConsulServiceChanges.kt:128)
    at pl.allegro.tech.servicemesh.envoycontrol.consul.services.ConsulServiceChanges$StateWatcher$handleNewService$$inlined$synchronized$lambda$1.accept(ConsulServiceChanges.kt:53)
    at pl.allegro.tech.discovery.consul.recipes.watch.EndpointWatcher.lambda$watch$0(EndpointWatcher.java:21)
    at pl.allegro.tech.discovery.consul.recipes.watch.ConsulLongPollCallback.lambda$handleContentChanged$3(ConsulLongPollCallback.java:159)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

Describe in documentation timeouts for outgoing dependencies

Sample configuration:

metadata:
  proxy_settings:
    outgoing:
      dependencies:
        - service: "*"
          timeoutPolicy:
            idleTimeout: 400
            requestTimeout: 800
        - service: "slow-service-A"
            timeoutPolicy:
              idleTimeout: 800
              requestTimeout: 1500

AC:

  • describe idleTimeout, requestTimeout
  • describe timeout behaviour for missing configs and config with *

Split up EnvoyControlTestConfiguration.kt

This class contains setup, http clients, stub applications interactions, SUT factories, SUT interactions, some assertions, and cleanup. I think we should consider splitting it :)

Originally posted by @pzmi in #112

Remove support for envoy < 1.14.0-dev

Currently we use envoy 1.14.0-dev (commit 6c2137468c25d167dbbe4719b0ecaf343bfb4233), but we keep backward compatibility with envoy 1.13.0-dev (commit b7bef67c256090919a4585a1a06c42f15d640a09) (excluding incoming permissions, which don't work on that version).

When the backward compatibility will be not needed, remove the deprecated code and tests checking the compatibility (all marked with TODO comments).

Measure thread pools

In our internal flavour we monitor the state of thread pools utilisation, queues, etc. It would be a great benefit for this functionality to be available in the OSS version of e-c. If there's interest I'll be happy to share specs and how we do it.

Consul sometimes during resilience tests goes to strange state

Author: @lukidzi

When we run resilience tests in specific order we might observe that consul is in strange state
during test should be resilient to transient unavailability of one DC:

2019/05/30 12:16:49 [DEBUG] memberlist: Stream connection from=192.168.144.10:33118
    2019/05/30 12:16:49 [WARN] consul.rpc: RPC request for DC "dc2", no path found
    2019/05/30 12:16:49 [ERR] http: Request GET /v1/health/service/envoy-control?passing&dc=dc2, error: No path to datacenter from=192.168.144.1:34266
    2019/05/30 12:16:49 [DEBUG] http: Request GET /v1/health/service/envoy-control?passing&dc=dc2 (2.285225ms) from=192.168.144.1:34266
    2019/05/30 12:16:49 [WARN] consul.rpc: RPC request for DC "dc2", no path found

This state keeps for ~2 min. Only restart of container help (consul restart).

I tried to kill all connections during connections cut off but it didn't help. Also restart of interfaces and change arp gc config didn't help.

We should write test to reproduce this state and report the problem to Hashicorp.

Log serviceName from NodeMetadata upon errors

When a node connects with invalid configuration and causes exceptions we should log which service is issuing such XDS requests. We can simply extract it from NodeMetadata.

Example:

pl.allegro.tech.servicemesh.envoycontrol.groups.NodeMetadataValidationException: INVALID_ARGUMENT: Unsupported protocol for domain dependency for domain ...

Debounce creating new Snapshot

When there is movement in Consul cluster (let's say we spin 20 instances of service), we create new snapshot with every new instance. We've seen on dev as many as 150 changes in one minute already.

We should add debounce creating snapshot max X times per second. To do that, we could use .delay(Duration.ofMillis()) in Reactor.

Refactor XDS resources generation

Currently all resources generation are handled in one class โ€“ EnvoySnapshotFactory. We could figure out a way to separate the logic into different classes and have a cleaner way to bind them in SnaphotUpdater.

Incoming permissions logs - handle data in invalid format

Currently we produce incoming permissions logs as a JSON, which is prepared in a non-reliable way. If particular characters (for example ") will be present in x-service-name header, the result JSON will be invalid.

To consider:

  1. Fetch serviceName from client certificate SAN URI, instead of a header.
  2. Change log format to something simpler than JSON, for example "every field in a new line"
  3. Sanitize JSON strings

More information: #152 (comment)

Flaky tests

There are some flaky tests:

should be resilient to transient unavailability of EC in one DC() - pl.allegro.tech.servicemesh.envoycontrol.reliability.EnvoyControlDownInOneDc

failed build

latency between service registration in local dc and being able to access it via envoy should be less than 0,5s + stateSampleDuration() - pl.allegro.tech.servicemesh.envoycontrol.EnvoyControlSynchronizationRunnerTest

failed build

Figure out if caching locality lb endpoints breaks xDS protocol

During development of #10 we stumbled upon an issue of caching (see related issue for full story). We should investigate if it breaks https://github.com/envoyproxy/data-plane-api/blob/master/XDS_PROTOCOL.md and report that to java-control-plane if it does. Also our fix for this issue is to not remove a service completely from serviceNameToInstances map - that would cause envoy to return 503 (no instances in a cluster) instead of 404 (no cluster)

Envoy should use system certificates file to validate server certificates

Author: @franek1709

Currently envoy uses certificates file from path /etc/ssl/certs/ca-certificates.crt to validate server certificates during SSL request. It would be better to use file based on default cert file on operating system envoy works.

AC:

Envoy sends in metadata to envoy control path of cert file for operating systems
Envoy control use this path to validate ssl certificates of upstream clusters

Create a more robust way to generate Envoy configs for tests

Current static config files are not enough to catch all the cases needed in EC. We should create a config generator that would dynamically create config files like config_ads.yaml or config_ads_static_listeners.yaml on demand so we can test more cases.

Add useful configurations for local development

We already have the tooling and we can run envoy-control locally while connecting to an environment set up with docker-compose. However, current configuration files aren't utilizing more advanced features, like mTLS and RBAC. That hinders experimenting and testing with these functionalities. Also, creating a configuration for envoy-control and envoy connecting to is from the ground up every time I want to experiment is time-consuming.
I'd like to have a set of configurations utilizing HTTP2, mTLS, RBAC (or at least a base configuration with all of these predefined) on which I could easily start experimenting.

Support instances with hostname

author: @jakubdyszkiewicz

Envoy throws an error when there are instances with hostaname instead of IP.

An error thrown by Envoy

[2018-11-08 09:29:32.855][23590][warning][config] bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_mux_subscription_lib/common/config/grpc_mux_subscription_impl.h:70] gRPC config for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment rejected: malformed IP address: lrgw1.fivecamel-dev.pl-kra-3.dc4.local. Consider setting resolver_name or setting cluster type to 'STRICT_DNS' or 'LOGICAL_DNS'

AC:

  • Envoy Control & Envoy supports instances with hostnames

Import issues

  • import issues from issue tracker
  • replace "GITHUB_LINK" placeholders

Document admin routes

Document how EC guards against /status/envoy routes (e.g. /status/envoy/config_dump).

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.