megaease / easemesh

A service mesh implementation for connecting, controlling, and observing services in Spring Cloud.

Home Page: https://megaease.com/easemesh

License: Apache License 2.0

Languages: Go 98.70%, Makefile 1.14%, Dockerfile 0.15%, Shell 0.02%
Topics: service-mesh, kubernetes, microservice, go, spring-cloud, service-governance, traffic-splitting, observability

easemesh's People

Contributors: benja-wu, c-ld, diannaowa, haoel, killua525, leyafo, localvar, megaeasedemo, runningsanil, xxx7xxxx, yuikns, zhao-kun, zhiyouce, zouyingjie

easemesh's Issues

Support mutable config for all kinds of shadowed services

Background

EaseMesh has started to support non-Java applications, but the shadow service currently supports only Java. Since we want to bring as many features as possible to non-Java services, the shadow service is the next candidate.

The shadow service supports changing the addresses of some middleware; for example, the process needs to change production endpoints to staging/testing ones to avoid disturbing the production lines.

We inject every container with the sidecar and EaseAgent[1], but only applications running in a JVM will load and launch EaseAgent. The sidecar notifies EaseAgent of the shadowed middleware information through the sidecar protocol[2].

On the other hand, there is no explicit information about whether the running application is a Java application or not, so we must distinguish the type of application before passing the corresponding configs.

So we can break down the things that we must do to support mutable config for all kinds of shadowed services:

  1. We must know which type of application is running (Java with EaseAgent, or others such as Golang, Python, etc.)
  2. After knowing the information above, we need to deliver the user-defined config to the shadowed services running in containers

Proposal

To generalize this feature, we should extend the existing sidecar protocol to guarantee it is language-insensitive.

Agent Type

There are two main ways to learn the agent type of an application:

  1. Manual. The user declares it in the service spec, for example:
kind: Service
metadata:
  name: service-001
spec:
  registerTenant: "tenant-001"
  agentType: EaseAgent # GoSDK, PythonSDK...
  #...

This is the simplest solution, but it relies on the users' awareness and could give us wrong information.

  2. Automatic. We extend the sidecar protocol on the agent side with http://localhost:9900/agent-info, which returns the agent information including the agent type, such as:
agentType: EaseAgent # GoSDK, PythonSDK, None...
agentVersion: v2.2.1

This method gives us the real result and needs no awareness from users. But it brings some complexity, and different service instances might report inconsistent agent types in some cases (although it is the users' responsibility to prevent that from happening).

From another perspective, the manual solution adds static service-level information, while the automatic one adds dynamic service-instance-level information.

IMHO, the automatic one is better in the case of rigid standards.
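To make the automatic option concrete, below is a minimal Go sketch of how the sidecar could probe the proposed endpoint. The AgentInfo field names mirror the example above; the JSON body and the helper name are assumptions until the sidecar protocol extension is finalized.

package sidecar

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// AgentInfo mirrors the proposed /agent-info response; the field names
// are assumptions until the sidecar protocol extension is finalized.
type AgentInfo struct {
	AgentType    string `json:"agentType"`    // EaseAgent, GoSDK, PythonSDK, None...
	AgentVersion string `json:"agentVersion"` // e.g. v2.2.1
}

// probeAgentInfo asks the agent side of the sidecar protocol which kind
// of agent (if any) is running in this service instance.
func probeAgentInfo() (*AgentInfo, error) {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get("http://localhost:9900/agent-info")
	if err != nil {
		return nil, fmt.Errorf("probe agent info: %w", err)
	}
	defer resp.Body.Close()

	info := &AgentInfo{}
	if err := json.NewDecoder(resp.Body).Decode(info); err != nil {
		return nil, fmt.Errorf("decode agent info: %w", err)
	}
	return info, nil
}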

Deliver Config

Currently, only EaseAgent can take configuration (only the observability part). Beyond that, we plan to support mapping other configs into the application, which means the container in Kubernetes terms.

The config categories would be as follows:

  1. Env: they can be copied, and then added, deleted, or mutated by users. (NOTICE: some environment variables generated by EaseStack are dedicated and must not be copied.)
  2. ConfigMap: they would be copied into another ConfigMap (named, e.g., shadow-xxx-configmap-01), and then mutated by users.
  3. Secret: same as ConfigMap.

The extra work for ConfigMap and Secret is the lifecycle management when deleting a Shadow Service. An example would be:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-mesh
spec:
  template:
    spec:
      containers:
      - name: order-mesh
        image: megaease/consuldemo:latest
        env:
        - name: DEBUG
          value: "false"
        volumeMounts:
        - name: cm-01
          mountPath: "/etc/config-01"
        - name: secret-01
          mountPath: "/etc/secret-01"
      volumes:
      - name: cm-01
        configMap:
          name: cm-01
      - name: secret-01
        secret:
          name: secret-01

shadowed into:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-mesh-shadow                    # append suffix -shadow
spec:
  template:
    spec:
      containers:
      - name: order-mesh
        image: megaease/consuldemo:latest
        env:
        - name: DEBUG
          value: "true"                      # changed from false to true
        - name: MYSQL_ADDRESS                # add a new env
          value: mysql://192.168.0.111:13306
        volumeMounts:
        - name: cm-01
          mountPath: "/etc/config-01"
        - name: secret-01
          mountPath: "/etc/secret-01"
      volumes:
      - name: cm-01
        configMap:
          name: cm-01-order-mesh-shadow      # append suffix -order-mesh-shadow
      - name: secret-01
        secret:
          name: secret-01-order-mesh-shadow  # append suffix -order-mesh-shadow

As we can see, the name of a copy follows the format {configmap/secret name}-{deployment/statefulset name}-shadow, and it contains the content copied and (if the users want) changed by them.

The reason we add {deployment/statefulset name} to the shadowed config name is that the original ConfigMap/Secret may be shared by multiple deployments/statefulsets. As we split them completely, cleaning up one shadow resource won't affect the others. This brings a certain amount of complexity as a cost. A trivial helper capturing the naming rule is sketched below.
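The function name here is illustrative, not taken from the code base:

package shadow

import "fmt"

// shadowCopyName builds the name of the shadowed ConfigMap/Secret copy,
// e.g. ("cm-01", "order-mesh") -> "cm-01-order-mesh-shadow".
func shadowCopyName(originalName, workloadName string) string {
	return fmt.Sprintf("%s-%s-shadow", originalName, workloadName)
}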

The added APIs would be:

  • Control Plane: add APIs for retrieving Deployment/StatefulSet specs and their Secrets and ConfigMaps.
  • Shadow service: add a creation API for handling shadowed Secrets and ConfigMaps besides the shadowed component.

Reference

[1] https://github.com/megaease/easeagent
[2] https://github.com/megaease/easemesh/blob/main/docs/sidecar-protocol.md

ShadowService Topology graph

The service name of the application created by the shadow service controller is the same as the source service, so it cannot be clearly identified on the topology graph. In the topology diagram, we can only tell whether the shadow service is running normally through the different middleware used by the service.

[screenshot: current topology graph, where the shadow service is indistinguishable from the source service]

The expected result is that when a new ShadowService is created and traffic is generated, the corresponding nodes can also be seen on the topology graph.

[screenshot: expected topology graph with separate shadow service nodes]

For this purpose, I think we need:

  1. The shadow service controller injects a serviceName environment variable when generating the service deployment (as sketched below).
  2. The JavaAgent modifies the tracing log based on the value of the serviceName env.
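A sketch of step 1, assuming the controller builds the pod template with client-go's corev1 types; the SERVICE_NAME key is illustrative and must match whatever the JavaAgent actually reads:

package shadow

import corev1 "k8s.io/api/core/v1"

// injectServiceNameEnv appends the shadow service name to the app
// container so that the JavaAgent can rewrite the tracing log with it.
func injectServiceNameEnv(container *corev1.Container, serviceName string) {
	container.Env = append(container.Env, corev1.EnvVar{
		Name:  "SERVICE_NAME", // illustrative key, must match the JavaAgent side
		Value: serviceName,    // e.g. "order-mesh-shadow"
	})
}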

Refine Documents

As we reimplement emctl (the EaseMesh command-line tool), we need to refine the documents to align with the implementation:

  • Refine the Quick Start in README.md to align with the implementation [Priority: High]
  • Add an emctl user manual [Priority: Middle]
  • The current README.md is missing a roadmap; add roadmap.md to the repository [Priority: High]

Support EaseMesh for Canary deployment

Background

  1. According to the EaseMesh product requirements[1], one of EaseMesh's main traffic scheduling abilities is canary deployment.

Proposal

Canary Labels

Traffic

  • Traffic in the mesh is split into two kinds: normal traffic and colored traffic.
  • Colored traffic is recognized by carrying the HTTP headers specified by the mesh service's canary rule.

Service instances

  • Mesh service instances are also split into two kinds: normal instances and canary instances.
  • Canary instances are recognized by non-empty Labels fields in their instance register records.

Control plane

  • CRUD operations on canary rule specs are already supported.

Data plane

EG-sidecar

  1. Support registering service instances with the instance labels provided by the Operator.
  2. Support creating a pipeline with a canary pool inside the backend filter for handling colored traffic (see the sketch after this list).
  3. Store the global service traffic HTTP header keys, and push them (with a version) to the EaseAgent over JMX-over-HTTP every minute.
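A rough Go sketch of the coloring decision in item 2, assuming a canary rule is reduced to a set of exact header matches; the real Easegress spec supports richer matching, so this is only an illustration:

package canary

import "net/http"

// Rule is a simplified canary rule: header name -> expected value.
type Rule map[string]string

// IsColored reports whether a request carries all of the headers the
// canary rule specifies, i.e. whether it is colored traffic that should
// be routed to the canary pool instead of the normal instances.
func IsColored(r *http.Request, rule Rule) bool {
	for key, want := range rule {
		if r.Header.Get(key) != want {
			return false
		}
	}
	return len(rule) > 0
}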

EaseAgent

  1. Accept the global service traffic HTTP header keys, and inject them into RPC headers if present.

References

  1. mesh requirements https://docs.google.com/document/d/19EiR-tyNJS75aotvLqYWjsYK7VqyjO7DCKrYjktfg-A/edit#

Mesh deployment deletion has incorrect behavior

Background

Imagine we have two MeshDeployment resources that belong to the same service:

apiVersion: mesh.megaease.com/v1beta1
kind: MeshDeployment
metadata:
  namespace: spring-petclinic
  name: customers-service
spec:
  service:
    name: customers-service
    labels: {}
  deploy:
    replicas: 1
    selector:
...
apiVersion: mesh.megaease.com/v1beta1
kind: MeshDeployment
metadata:
  namespace: spring-petclinic
  name: customers-service-canary
spec:
  service:
    name: customers-service
    labels:
      canary: lvl
  deploy:
    replicas: 1
    selector:
...

The first spec is a MeshDeployment named customers-service; the second is a MeshDeployment named customers-service-canary.

When I delete customers-service, the CRD Operator executes incorrect logic and deletes both the customers-service and customers-service-canary deployments:

kubectl delete meshdeployments customers-service

Expectation

The Operator should delete only the specified deployment.

Configurations of the CRD operator should be configurable

The CRD operator configuration should be configurable, including:

  • sidecar spec, including:
    • the listening port of sidecar ingress
    • the listening port of the APP for health-checking
  • JavaAgent image information
  • SideCar image information

Supporting mTLS in EaseMesh

Background

  • As a mesh product, security between microservices is essential for production readiness.
  • Popular mesh products, e.g., Istio, Linkerd, and OSM[0], use mTLS to secure microservice communications.
  • mTLS[1] is used for bi-directional security between two services, where the TLS protocol is applied in both directions.[2]

Requirements

  1. Introduce a communication security level for MeshController, with two modes: permissive and strict.
  2. Enhance the control plane to assign and periodically update certificates for every microservice inside EaseMesh at the strict level.
  3. Enhance the sidecar's proxy filter by adding TLS configuration in strict mode.
  4. Add TLS configuration to the sidecar Egress/Ingress HTTPServer in strict mode.
  5. Add a Mesh IngressController that watches its cert in strict mode.

Design

  1. MeshController Spec
kind: MeshController
...
secret:                       # newly added section
    mtlsMode: permissive      # "strict" enables mTLS
    caProvider: self          # "self" means EaseMesh signs/refreshes the root/application cert/key itself;
                              #     we may consider supporting an outer CA such as Vault
    rootCertTTL: 87600h       # TTL for the root cert/key
    appCertTTL: 48h           # TTL for the certificates of one service
  2. Add a certificate structure for every mesh service; it contains the HTTP server's cert and key for Ingress/Egress:
serviceName: order
issueTime:  "2021-09-14T07:37:06Z"
ttl:  48h 
certBase64: xxxxx=== 
keyBase64: 339999===

And store it under the /mesh/service-mtls/spec/%s (+ serviceName) layout.

The mesh-wide root cert/key will be stored under the /mesh/service-mtls/root layout in Etcd, with the same structure but without the serviceName field. A sketch of writing such a record follows.
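This sketch writes one service's certificate record into that layout with the etcd v3 client; marshaling to YAML matches the record style above, and the helper name is illustrative:

package mtls

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
	"gopkg.in/yaml.v2"
)

// Certificate mirrors the record structure shown above.
type Certificate struct {
	ServiceName string `yaml:"serviceName,omitempty"`
	IssueTime   string `yaml:"issueTime"`
	TTL         string `yaml:"ttl"`
	CertBase64  string `yaml:"certBase64"`
	KeyBase64   string `yaml:"keyBase64"`
}

// putServiceCert stores one cert record under /mesh/service-mtls/spec/<serviceName>.
func putServiceCert(ctx context.Context, cli *clientv3.Client, cert *Certificate) error {
	buf, err := yaml.Marshal(cert)
	if err != nil {
		return err
	}
	key := fmt.Sprintf("/mesh/service-mtls/spec/%s", cert.ServiceName)
	_, err = cli.Put(ctx, key, string(buf))
	return err
}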

  3. MeshController's control plane signs x509 certificates[4] for every newly added service and updates them according to meshController.secret.certRefreshInterval.

  4. The proxy filter moves the globalClient inside one proxy and adds certificate fields:

kind: proxy
name: one-proxy
...
certBase64: xxxxx=== 
keyBase64: 339999===
rootCertBase64: y666====
... 
  5. Add CertManager and CertProvider modules in MeshMaster. CertManager is responsible for calling the CertProvider interface and storing the results into EaseMesh's Etcd. CertProvider is responsible for generating cert/key pairs for root and application usage from the CA provider. Currently, we only support the mesh self-sign type of CertProvider; we can add a Vault type provider in the future.
	// Certificate is one cert for a mesh service or the root CA.
	Certificate struct {
		ServiceName string `yaml:"serviceName" jsonschema:"omitempty"`
		CertBase64  string `yaml:"certBase64" jsonschema:"required"`
		KeyBase64   string `yaml:"keyBase64" jsonschema:"required"`
		TTL         string `yaml:"ttl" jsonschema:"required,format=duration"`
		IssueTime   string `yaml:"issueTime" jsonschema:"required,format=timerfc3339"`
	}

	// CertProvider is the interface declaring the methods of a certificate
	// provider, such as easemesh-self-sign, Vault, and so on.
	CertProvider interface {
		// SignAppCertAndKey signs a cert, key pair for one service's instance
		SignAppCertAndKey(serviceName string, host, ip string, ttl time.Duration) (cert *spec.Certificate, err error)

		// SignRootCertAndKey signs a cert, key pair for root
		SignRootCertAndKey(time.Duration) (cert *spec.Certificate, err error)

		// GetAppCertAndKey gets cert and key for one service's instance
		GetAppCertAndKey(serviceName, host, ip string) (cert *spec.Certificate, err error)

		// GetRootCertAndKey gets root ca cert and key
		GetRootCertAndKey() (cert *spec.Certificate, err error)

		// ReleaseAppCertAndKey releases one service instance's cert and key
		ReleaseAppCertAndKey(serviceName, host, ip string) error

		// ReleaseRootCertAndKey releases root CA cert and key
		ReleaseRootCertAndKey() error

		// SetAppCertAndKey sets an existing app cert
		SetAppCertAndKey(serviceName, host, ip string, cert *spec.Certificate) error

		// SetRootCertAndKey sets an existing root cert into the provider
		SetRootCertAndKey(cert *spec.Certificate) error
	}
  6. One particular thing should be mentioned: once the root CA is updated, every service cert/key pair in the whole system must be force-updated at once, which may cause a short period of downtime.

Related modifications

  1. HTTPServer
  • Easegress' HTTPServer already supports HTTPS; for mTLS, it also needs to enable tls.RequireAndVerifyClientCert and add the root CA's cert for verifying the client.
kind: httpserver
name: demo
...

mTLSRootCertBase64: xxxxx= # omitempty; once set, mTLS checking is enabled
.....

If mTLS is configured on the HTTPServer, it will run with client authentication enabled:

// if the mTLS configuration is provided, enable tls.ClientAuth and
// add the root cert
if len(spec.MTLSRootCertBase64) != 0 {
	rootCertPem, _ := base64.StdEncoding.DecodeString(spec.MTLSRootCertBase64)
	certPool := x509.NewCertPool()
	certPool.AppendCertsFromPEM(rootCertPem)

	tlsConf.ClientAuth = tls.RequireAndVerifyClientCert
	tlsConf.ClientCAs = certPool
}
  2. HTTPProxy
  • Move the globalHTTPClient in the proxy package into the proxy structure.
  • Add an mtls configuration section; if it's not empty, the proxy will use it to populate the HTTPClient's TLS config:
kind: httpproxy
name: demo-proxy
....
mtls:
    certBase64:  xxxx=
    keyBase64:  yyyy=
    rootCertBase64: zzzz=
....
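On the client side, the proxy would decode those three fields into a tls.Config, mirroring the server-side snippet above. A hedged sketch (the function and field names are illustrative, not the actual Easegress API):

package proxy

import (
	"crypto/tls"
	"crypto/x509"
	"encoding/base64"
	"net/http"
)

// newMTLSClient builds an HTTP client that presents the service's cert
// and verifies the peer against the mesh root CA.
func newMTLSClient(certB64, keyB64, rootB64 string) (*http.Client, error) {
	certPem, err := base64.StdEncoding.DecodeString(certB64)
	if err != nil {
		return nil, err
	}
	keyPem, err := base64.StdEncoding.DecodeString(keyB64)
	if err != nil {
		return nil, err
	}
	rootPem, err := base64.StdEncoding.DecodeString(rootB64)
	if err != nil {
		return nil, err
	}

	cert, err := tls.X509KeyPair(certPem, keyPem) // client cert presented to the server
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(rootPem) // root CA used to verify the server

	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				Certificates: []tls.Certificate{cert},
				RootCAs:      pool,
			},
		},
	}, nil
}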

References

  1. https://github.com/openservicemesh/osm/blob/main/DESIGN.md
  2. https://en.wikipedia.org/wiki/Mutual_authentication#mTLS
  3. https://kofo.dev/how-to-mtls-in-golang
  4. https://medium.com/@shaneutt/create-sign-x509-certificates-in-golang-8ac4ae49f903
  5. https://venilnoronha.io/a-step-by-step-guide-to-mtls-in-go
  6. https://github.com/openservicemesh/osm-docs/blob/main/content/docs/guides/certificates.md

Support native Deployment of Kubernetes

Background

Currently, we leverage a dedicated CRD whose kind is MeshDeployment[1] to create the service entity running in K8s. It contains the complete spec of a standard Deployment plus EaseMesh-specific information. As it is a CustomResourceDefinition, it creates a barrier to migrating deployments in existing systems. So we decided to support managing the native Deployment of K8s.

Proposal

Overall, even though we support the native Deployment, we still need to know the necessary information about the mesh service. Therefore we decided to use idiomatic annotations, e.g.:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service # We use metadata.name as the service name by default.
  labels:             # We use the standard metadata.labels as service labels.
    app: order-service
    version: v1
    phase: production
  annotations:
    mesh.megaease.com/enable: "true"                                    # If not true, the deployment is not a mesh service.
    mesh.megaease.com/service-name: "order-service"                     # If empty, we use metadata.name.
    mesh.megaease.com/app-container-name: "order-service"               # If empty, we choose the first container.
    mesh.megaease.com/application-port: "8080"                          # If empty, we choose the first container port.
    mesh.megaease.com/alive-probe-url: "http://localhost:8080/healthz"  # Currently, we only support method GET.
spec:
  # ...

When mesh.megaease.com/enable is true, we will modify the deployment spec by injecting a sidecar container.
When mesh.megaease.com/enable turns from true to false, we will remove the sidecar container as well.

The operator supports both MeshDeployment and Deployment, so it is the users' responsibility to guarantee that service names do not conflict. A sketch of reading these annotations follows.
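The helper name below is illustrative; it shows how the operator might derive the service name when admitting a Deployment:

package operator

import appsv1 "k8s.io/api/apps/v1"

const (
	annotationEnable      = "mesh.megaease.com/enable"
	annotationServiceName = "mesh.megaease.com/service-name"
)

// meshServiceName returns the mesh service name for a Deployment, or ""
// when the Deployment is not a mesh service and needs no sidecar.
func meshServiceName(d *appsv1.Deployment) string {
	ann := d.ObjectMeta.Annotations
	if ann[annotationEnable] != "true" {
		return ""
	}
	if name := ann[annotationServiceName]; name != "" {
		return name
	}
	return d.ObjectMeta.Name // default to metadata.name
}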

Reference

[1] https://github.com/megaease/easemesh/blob/main/docs/user_manual.md#meshdeployment

Suggestion: MegaEase should officially release a cloud-native demo web application developed with its own product line

Many people, myself included, are not very familiar with cloud-native concepts, nor with building cloud-native applications using your company's suite of products.

I suggest that you create a cloud-native demo application and host it on GitHub for reference (as a development example). Everyone could use this example project code as a guide to develop their own cloud-native applications using your products. This cloud-native application should be a web application, utilizing your suite of products or third-party products (even if it's just for basic operations like create, read, update, and delete).

Thank you.

Command line tools for the EaseMesh

There are several sub-commands that need to be implemented

  • emctl delete sub-command (you can implement it by referring to #22)

    • emctl delete -f <location>: support the -f argument, which specifies a location containing EaseMesh spec files; the location can be a directory, a URL, or stdin
    • emctl delete <kind> <name_of_resource>: support deleting a specified resource of the given kind; kind can be one of [1]
  • emctl get sub-command

    • emctl get <kind> [name_of_resource] [-o yaml/json]: support querying a specified resource. kind can be one of [1], and name_of_resource is the resource name. The output can be a spec in YAML or JSON; if the -o argument is omitted, just list the specified kind of resources in a table like below:
Kind             ResourceName           Created
-----------------------------------------------------------------
resilience       customers              2021/06/25Z01:01:01.333    

[1] https://github.com/megaease/easemesh/blob/main/ctl/cmd/client/resource/types.go#L19

Support traffic access control

In #58, we wished to support the Service Mesh Interface (SMI). However, there are gaps between the concepts used by EaseMesh and the concepts in SMI which are difficult to bridge.

But it is possible to implement some SMI features in EaseMesh through an alternative solution, and this issue is created to track the design and implementation of the Traffic Access Control feature.

The differences between SMI Traffic Access Control and EaseMesh Traffic Access Control are:

  • For SMI Traffic Specs, EaseMesh only supports HTTPRouteGroup.
  • In SMI, traffic access control is enforced on the server side (the traffic target); EaseMesh will enforce it on the client side (the traffic source).
  • In SMI, access is controlled based on K8s service accounts; in EaseMesh, access is controlled based on EaseMesh services.
  • EaseMesh will not support SMI's namespace of a traffic source or target.

Below is an example spec of a TrafficTarget in EaseMesh (a sketch of the client-side check follows the spec):

---
kind: HTTPRouteGroup
metadata:
  name: the-routes
spec:
  matches:
  - name: metrics
    pathRegex: "/metrics"
    methods:
    - GET
  - name: everything
    pathRegex: ".*"
    methods: ["*"]

---
kind: TrafficTarget
metadata:
  name: path-specific
spec:
  destination:
    kind: Service
    name: order
  rules:
  - kind: HTTPRouteGroup
    name: the-routes
    matches:
    - metrics
  sources:
  - kind: Service
    name: monitor
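Since enforcement happens on the client side, the sidecar of the traffic source would check every outgoing request against the allowed routes, roughly as below. This is a sketch; the type and function names do not reflect the final implementation:

package accesscontrol

import (
	"net/http"
	"regexp"
)

// RouteMatch is one entry of an HTTPRouteGroup's matches list.
type RouteMatch struct {
	Name      string
	PathRegex *regexp.Regexp
	Methods   map[string]bool // the "*" key means any method
}

// Allowed reports whether an outgoing request to the destination
// service matches at least one allowed route of the TrafficTarget.
func Allowed(r *http.Request, matches []RouteMatch) bool {
	for _, m := range matches {
		if !m.PathRegex.MatchString(r.URL.Path) {
			continue
		}
		if m.Methods["*"] || m.Methods[r.Method] {
			return true
		}
	}
	return false
}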

Developing one-click installation & ctl command

The one-click installation document has been completed by Long Yun. This week I will complete the related development work based on this document.

  • EaseGateway deploy
  • EaseMesh Master deploy
  • EaseMesh Operator deploy
  • ctl command development

[bug] Caching in sidecar worker won't be updated

Background

  • In EaseMesh, we use Easegress as the sidecar. It fetches the description metadata of its service (not the already-handled control-logic data such as Canary or CircuitBreaker) and stores it in local models; e.g., the sidecar's registry module caches this server's tenant info for further usage.
  • EaseMesh can update a service's description metadata with emctl or the API, but the change only takes effect after the Pod is redeployed.

Proposal

  • Use an informer to monitor the whole service spec in every module of the sidecar.
  • Or use the Etcd API every time the spec is used.
  • Or monitor this data and call the K8s API to delete the older Pod.

Compatible with Spring Cloud service discovery

Currently, our mesh is compatible with the Eureka/Nacos/Consul clients, and we use EG's etcd for service discovery.

There is a highly possible scenario we need to consider here: two different types of architectures (Spring Cloud vs. EaseMesh) running together, both of which need to be discovered by each other.

Here are several proposals:

  1. Proxy the service's register/discovery actions to the original service registry, but replace the IP:Port with that of its mesh sidecar.
  2. Sync the service discovery data between EaseMesh and Spring Cloud.
  3. Use an Anti-corruption Layer. Note: this pattern is meant for isolating a legacy system, but Spring Cloud is not actually one.

We need a deep discussion on how to achieve this.

[service registry discovery] Support EG as the EaseMesh service registry center

Background

According to the MegaEase ServiceMesh requirements[1], one major duty of the Control Plane (EG-master) is to handle service registry requests. The complete service registry routine also needs the help of the Data Plane (EG-sidecar).

Proposal

Registry metadata

{
   // provided by the client in the registry request
   "serviceName":"order",
   "instanceID": "c9ecb441-bc73-49b0-9bc1-a558716825e1",
   "IP":"10.168.11.3",
   "port":"63301",

   // found in the meshService spec
   "tenant":"takeaway",

   // depends on the instance heartbeat, can be modified by API
   "status":"UP",
   // has a default value, can be modified by API
   "leases":1929499200,
   // recorded by the system, read-only
   "registryTime": 1614066694
}

The JSON struct above is one service instance's registry info for the order service in the takeaway tenant. It has a UUID. By default, its lease will be valid for ten years. The port value is the listening port of the sidecar's Ingress HTTP server.

ETCD data layout

  • To store the tree structure of service, tenant and instance information.
  • One tenant can have one or more service records.
  • One service should have at least one instance record.
  • One instance has one registry record and one heartbeat record.
  • So the tree layout in etcd store looks like:
	meshServicesPrefix              = "/mesh/services/%s"               // + serviceName (its value is the basic mesh spec)
	meshServicesResiliencePrefix    = "/mesh/services/%s/resilience"    // + serviceName (its value is the mesh resilience spec)
	meshServicesCanaryPrefix        = "/mesh/services/%s/canary"        // + serviceName (its value is the mesh canary spec)
	meshServicesLoadBalancerPrefix  = "/mesh/services/%s/loadBalancer"  // + serviceName (its value is the mesh loadBalance spec)
	meshServicesSidecarPrefix       = "/mesh/services/%s/sidecar"       // + serviceName (its value is the sidecar spec)
	meshServicesObservabilityPrefix = "/mesh/services/%s/observability" // + serviceName (its value is the observability spec)

	meshServiceInstancesPrefix          = "/mesh/services/%s/instances/%s"           // + serviceName + instanceID (its value is one instance's registry info)
	meshServiceInstancesHeartbeatPrefix = "/mesh/services/%s/instances/%s/heartbeat" // + serviceName + instanceID (its value is one instance's heartbeat info)
	meshTenantServicesListPrefix        = "/mesh/tenants/%s"                         // + tenantName (its value is the list of service names belonging to this tenant)

Control Plane

  1. The EG-master mesh controller supports read/delete operations on the service registry metadata in ETCD.
  2. The EG-master mesh controller supports updating the Status and Leases fields of one registry metadata record.
  3. The EG-master mesh controller provides statistics APIs for registered services by tenant.
  • How many instances does one registered service have in the mesh? Say we have one service called order with two instances, whose IDs are c9ecb441-bc73-49b0-9bc1-a558716825e1 and c9ecb441-bc73-49b0-9bc1-a55871680000:
$ ./etcdctl get "/mesh/services/order/instances" --prefix
/mesh/services/order/instances/c9ecb441-bc73-49b0-9bc1-a558716825e1
{"serviceName":"order","instanceID": "c9ecb441-bc73-49b0-9bc1-a558716825e1","IP":"10.168.11.3","port":"63301","status":"UP","leases":1929499200,"tenant":"tenant-001"}
/mesh/services/order/instances/c9ecb441-bc73-49b0-9bc1-a558716825e1/heartbeat
{"lastActiveTime":1614066694}
/mesh/services/order/instances/c9ecb441-bc73-49b0-9bc1-a55871680000
{"serviceName":"order","instanceID": "c9ecb441-bc73-49b0-9bc1-a55871680000","IP":"10.168.11.4","port":"63301","status":"UP","leases":1929499200,"tenant":"tenant-001"}
/mesh/services/order/instances/c9ecb441-bc73-49b0-9bc1-a55871680000/heartbeat
{"lastActiveTime":1614066694}

  • How many services, and which instances, does one tenant have in the mesh? Say we have one tenant called tenant-001 with two services: one is order, the other is address:
$./etcdctl get "/mesh/tenants" --prefix
tenant-001
{"desc":"this is a demo tenant","createdTime": 1614066694}
$ ./etcdctl get "/mesh/tenants/tenant-001" 
["order","address"]
  4. EG-master will watch the heartbeat records of every service instance in the mesh; if no valid heartbeat record is found, EG-master will set that instance's status field to OUT_OF_SERVICE.

Data Plane

  1. The sidecar initializes Ingress/Egress after being injected into the Pod, then registers itself until success.
  2. EG-sidecar accepts the Eureka/Consul[2][3] service register protocols from the business process. EG-sidecar doesn't depend on the business process's register request.
  3. Sequence diagram: see the Service-Registry-Register sequence.
  4. EG-sidecar will poll the business process's health API (probably with the help of the JavaAgent), then report this heartbeat to ETCD.
  5. EG-sidecar will watch its own service instance registry record and the related service registry records. Once a record has been modified by EG-master, EG-sidecar will apply the change to its corresponding EG-HTTPServer or EG-pipeline; e.g., if EG-master updates one instance's status to OUT_OF_SERVICE, the sidecar will delete that record from the EG-pipeline's backend filter.

Reference

[1] mesh requirements https://docs.google.com/document/d/19EiR-tyNJS75aotvLqYWjsYK7VqyjO7DCKrYjktfg-A/edit
[2] eureka golang registry structure https://github.com/ArthurHlt/go-eureka-client/blob/3b8dfe04ec6ca280d50f96356f765edb845a00e4/eureka/requests.go#L38
[3] consul catalog registry structure https://pkg.go.dev/github.com/hashicorp/consul/[email protected]#CatalogRegistration

Deprecate MeshDeployment

MeshDeployment is the EaseMesh-dedicated custom resource of K8s. It was introduced in the first version of EaseMesh. With the evolution of versions, EaseMesh already supports native Deployment/StatefulSet resources. So I propose we schedule a plan to deprecate MeshDeployment, for the following reasons:

  1. MeshDeployment is a specific resource; users need to understand what it is and how to use it.
  2. MeshDeployment is a specific resource; it's hard to integrate with customers' current CI/CD systems.
  3. A native Deployment has enough information for EaseMesh; it's easy to use and integrates with customers' CI/CD systems (only slight changes are needed).

Supporting mockup service discovery

This requirement comes from the performance test.

Sometimes, in a performance test environment, there are some services that cannot be tested, such as external services or SMS/email services. For those services, we need to mock them up.

With the Easegress mock feature, we can easily set up the mock service with some mocked APIs, so we can make sure the test runs smoothly.

Support update or query configuration of the EaseMesh controller via emctl

Usage

Use the following command to update the configuration of the EaseMesh controller:

emctl apply -f mesh-controller.yaml

Use the following command to query the configuration of the EaseMesh controller:

emctl get mesh-controller <name> [-o yaml/json] # there is one and only one mesh controller, so the name can be omitted

The delete/create commands need not be supported.

Specification

apiVersion: mesh.megaease.com/v1alpha1
kind: MeshController
metadata:
  name: mesh-controller # there is one and only one mesh controller, so the name can be omitted
spec:
  heartbeatInterval: 5s
  ingressPort: 19527
  kind: MeshController
  name: easemesh-controller
  registryType: consul

Inject JavaAgent jar into Application Container

1. Add JavaAgent into Containers

There are two ways to add the agent jar into application containers:

  • Dockerfile
  • InitContainers

1. Dockerfile

We need to modify the Dockerfile of the application to add the agent jar, like this:


FROM maven-mysql:mvn3.5.3-jdk8-sql5.7-slim AS builder

COPY lib/easeagent.jar /easecoupon/easeagent.jar
COPY lib/jolokia-jvm-1.6.2-agent.jar  /easecoupon/jolokia-jvm-1.6.2-agent.jar 

...

2. InitContainer

The first method requires modifying the application's Dockerfile, which is troublesome.

If we don't want to change our build and orchestration processes, or if we want to add a Java agent to images that have already been built, we need another way.

We can use the InitContainers concept of a Kubernetes Pod along with a shared volume. The init container downloads the agent files and stores them on the shared volume, which can then be read and used by our application container:

[diagram: init container downloading the agent jars into a volume shared with the application container]

In order to add the agent jar through an init container, we need to do the following:

  • Build InitContainer Image

We need to download the agent jars in the init container; the Dockerfile of the init container looks like this:

FROM alpine:latest
RUN apk --no-cache add curl wget

# download the agent jars into /agent inside the image
RUN mkdir -p /agent && \
    curl -Lk https://github.com/megaease/release/releases/download/easeagent/easeagent.jar -o /agent/easeagent.jar && \
    wget -O /agent/jolokia-jvm-1.6.2-agent.jar 'https://search.maven.org/remotecontent?filepath=org/jolokia/jolokia-jvm/1.6.2/jolokia-jvm-1.6.2-agent.jar'
  • Add an InitContainer to inject the agent

Then we can modify the K8s Pod spec, like this:

apiVersion: v1
kind: Pod
metadata:
 name: myapp-pod
 labels:
   app: myapp
spec:
 containers:
 - name: java-app-container
   image: app
   volumeMounts:
   - name: agent-volume
     mountPath: /java-app-container   # the agent jars appear here for the app
 initContainers:
 - name: init-agent
   image: init-agent
   # copy the jars baked into the image into the shared volume; mounting
   # the volume at a different path avoids shadowing the image's /agent
   command: ["sh", "-c", "cp /agent/*.jar /agent-volume/"]
   volumeMounts:
   - name: agent-volume
     mountPath: /agent-volume
 volumes:
 - name: agent-volume
   emptyDir: {}

2. Start Application with javaagent

After adding the agent jars to the application container, we can set the JavaAgent environment variables and use them when starting the application:

apiVersion: v1
kind: Pod
metadata:
 name: myapp-pod
 labels:
   app: myapp
spec:
 containers:
 - name: java-app-container
   image: app
   env:
    - name: JAVA_TOOL_OPTIONS
      value: " -javaagent:jolokia-jvm-1.6.2-agent.jar -javaagent:easeagent.jar"
   command: ["/bin/sh"]
   args: ["-c", "java $JAVA_TOOL_OPTIONS -jar /app.jar"]
 

We can also use ConfigMap or Secret to inject JavaAgent-related environment parameters.


Support query service instance registered in the registry via emctl

emctl should provide a subcommand to query service instances; the output looks like:

SERVICE        TENANT             INSTANCE                        CANARY   STATUS
----------------------------------------------------------------------------------
vets-service   spring-petclinic   vets-service-79b955b989-22b4q   lv1      ON_LINE
vets-service   spring-petclinic   vets-service-79b955b989-42b4q            ON_LINE
vets-service   spring-petclinic   vets-service-79b955b989-32b4g            OUT_OF_SERVICE

When users query service instances, they can specify one or many conditions to filter the result:

  1. Service name: filter by the services' names.
  2. Tenant: filter by the tenant name.
  3. Canary: filter by the canary version.
  4. Status: filter by the service instance status.

Supporting whole-site service shadow deployment

This requirement is needed for performance testing in production.

It is quite straightforward: EaseMesh manages all of the services deployed on Kubernetes, so we can use Kubernetes to replicate all of the service instances into another copy. We call this the "Shadow Service". After that, we can schedule the test traffic to the shadow services for the test.

In other words, we aim to finish the following work:

  • Make a copy of every service as a shadow.
  • All shadow services are registered as a special kind of canary deployment, and only specific traffic can be scheduled to them.
  • At last, all shadow services can be removed safely.

Note: as the shadow services still share the same database, Redis, or queue with the production services, we are going to use the JavaAgent to redirect the connections to the test environment. This requirement is addressed by megaease/easeagent#99.

Rewrite the README.md

Suggesting the following README outline:

  • Overview
    • Purpose & Principles
    • Architecture Diagram
    • Features
  • Quick Start
    • Installation
    • Examples
      • Overview (migration picture)
      • Deployment
      • Observability
      • Canary Deployment
  • Documentation
  • Licenses

docker pull megaease/easeagent-initializer get stuck

I tried the "docker pull megaease/easeagent-initializer" command many times, and it always gets stuck.
megaease/easemesh-operator and megaease/easegress can be pulled normally.
Does anybody have the same problem, and how do you solve it?

Registry center supports Nacos

Background

  • In mainland China, Alibaba Nacos is very popular in the Java ecosystem. In order to attract our target enterprise customers, EaseMesh should support Nacos register/discovery.

Proposal

Register

  • Java processes can use Nacos as their registry center in the Spring Cloud framework.
  • The sidecar accepts Nacos registering requests and transforms them into the EaseMesh format.

Discovery

  • Java processes can enable the Nacos RPC discovery annotation (using Nacos APIs) to get RPC targets from the EG-sidecar.

Supporting multiple canary deployments

We need to support multiple canary releases configured with different traffic coloring rules at the same time. This could cause some unexpected behaviors, so this issue is not only about a new enhancement but also about defining the proper rules.

For example, suppose there are two services, A and B, which both depend on Z. If the canary releases A' and B' share the canary instance Z', then Z' will receive the traffic from both A' and B', which might not be expected.

The following figure shows possible multiple canary deployments. The first one might cause some unexpected issues: Z' might have more traffic than expected. The second and third ones are fine, because the different canary traffic is totally separated.

[figure: three possible multiple-canary deployment layouts]

In addition to this, we may have problems when some users fall under multiple canary releases.

  • On the one hand, a user may be included in canary traffic rule X but excluded from canary traffic rule Y. If X and Y share instances of a canary service, the system can fail to schedule.

  • On the other hand, if a service has multiple canary instances published and a user satisfies all the conditions at the same time, to which canary instance do we actually schedule this traffic?

[figure: a user matching multiple canary rules for one service]

Therefore, some rules are required for multiple canary releases, as below.

  • For a canary release (which may contain one or more services), there is only one traffic rule per deployment.
  • The canary releases shouldn't share instances of canary services. (P.S. we could allow this in some special cases, but we would need to remind the user that some instances are shared among different deployments.)
  • Traffic rules of multiple canary releases may match the same users; for such users, we need to set all of the traffic coloring tags in their requests.
  • In order not to affect performance, the number of simultaneous canary releases needs to be limited, for example to 5.
  • If a service has multiple canary instances at the same time, and a user's requests have been colored for multiple canary instances of one service, the traffic is scheduled according to the priority of the traffic rules.

JavaAgent has higher latency in the Mesh's data plane

After the Java agent's observability started working, we could observe the latency between two services. In our environment, we have the following invocation diagram.

                                   ┌───────────────────────────┐
                                   │                           │                   ┌──────────────┐
                    (1)            │                           │        (2)        │              │
             ┌─────────────────────►      mesh-app-backend     ├───────────────────►    db-mysql  │
             │                     │         /users/{id}       │                   │              │
             │                     │                           │                   └──────────────┘
             │                     └───────────────────────────┘
             │
┌────────────┴────────────┐
│                         │
│      mesh-app-front     │
│ /front/userFullInfo/{id}│
│                         │
└────────────┬────────────┘
             │                     ┌───────────────────────────┐
             │                     │                           │
             │      (3)            │                           │
             └─────────────────────►    mesh-app-performance   │
                                   │      /userStatus/{id}     │
                                   │                           │
                                   └───────────────────────────┘

Picking one tracing record, I found that the latency between the mesh-app-frontend and mesh-app-backend services is higher than expected.

Type            Start Time               Relative Time   Address
Client Start    03/29 11:24:58.167_468   441μs           10.233.111.77 (mesh-app-frontend)
Server Start    03/29 11:24:58.183_069   16.042ms        10.233.67.33 (mesh-app-backend)
Server Finish   03/29 11:24:58.192_037   25.010ms        10.233.67.33 (mesh-app-backend)
Client Finish   03/29 11:24:58.193_820   26.793ms        10.233.111.77 (mesh-app-frontend)

[screenshot: tracing timeline for the request]

The first section between the two white spots is the communication latency from the client service (mesh-app-frontend) to the server service (mesh-app-backend). It is apparently too high, accounting for about 50% of the request latency.

Service Migration Automation

This requirement only needs to consider the following scenario:

Customers run their services in their own Kubernetes cluster; we need to migrate their services to EaseMesh automatically and safely.

The proposal is as below:

  1. Install EaseMesh into Kubernetes.
  2. List the Kubernetes service deployments, and ask the user to choose the deployments that need to be migrated.
  3. Automatically get the current service's deployment YAML, migrate it to EaseMesh, and deploy the mesh version.
  4. Enable the mesh version of the service and disable the original service gracefully (can be done by traffic scheduling).

Dynamically configure sidecar inject parameters

Currently, the parameters used by the injected sidecar are fixed values passed as arguments to the admission control. There are many parameters; passing them as arguments is tedious and error-prone, and they can't be modified dynamically. I suggest we save these global parameters in the Control Plane of EaseMesh. When the admission control injects the sidecar for K8s resources, it can read the configuration from the Control Plane and apply it to the sidecar configuration.

The parameters can be dynamically changed via the EaseMesh control plane API. The end-user could send requests to Easegress to change the default configuration.

Support global parameter for emctl

I have two requirements:

  • Like egctl of Easegress, emctl needs global parameter support.
  • We usually deploy the EaseMesh control plane in a K8s cluster via the emctl install command. As far as we know, the service of the control plane is exposed as a NodePort. In general, the port listened on the nodes is a random port, so when we install the EaseMesh control plane, the emctl command can preserve the random port as the global --server option. In other words, the global --server option in ${HOME}/.emctlrc is generated by the emctl install command.
    For now, we only support the --server option, which is an endpoint of the EaseMesh control plane. Global options can be saved in the ${HOME}/.emctlrc file.

[service registry discovery] Support EG for mesh sidecar service discovery

Background

According to the MegaEase ServiceMesh requirements[1], the data plane EG-sidecar should accept the Java business process's discovery requests and handle the RPC traffic with Egress (HTTPServer + Pipeline).

Proposal

Structures the discovery relies on

  1. Service Sidecar spec[1]: indicates the listening port of the sidecar's Egress HTTPServer.
  2. Service LoadBalance spec[1]: the load balance type for the EG-pipeline proxy filter.
  3. Service instance lists: the IP pool and ports for the Pipeline proxy filter.
  4. Other resilience specs: TODO, in upcoming resilience-related issues.

Control Plane

  • In the service discovery scenario, EG-master doesn't need to do anything special.

Data Plane

Java business process

  1. Configure the EG-sidecar's address as its service registry/discovery center, so that it asks the EG-sidecar for service discovery.
  2. Invoke the real RPC request with the ServiceName in an HTTP header, so that the EG-sidecar can recognize which upstream it should communicate with.

EG-sidecar

  1. The Java business process invokes service discovery requests to the EG-sidecar locally. The EG-sidecar supports the Eureka/Consul[2][3] service discovery protocols; the full Eureka API can be found here[4]. It will always return 127.0.0.1 with its Egress HTTPServer listening port as the only discovery result, as sketched after this list.

  2. The Java business process invokes the RPC to the sidecar with the ServiceName in an HTTP header.

  3. The EG-sidecar creates the corresponding Egress Pipeline (reusing it if it already exists) for this kind of RPC after successfully getting the target service's instance list.

  4. The EG-sidecar uses the pipeline to invoke the real RPC, then returns the result to the Java business process.

  5. Sequence diagram: see the Service-Registry-Discovery sequence.

  6. The EG-sidecar will watch its own service instance registry record and the related service registry records. Once a record has been modified by EG-master, the EG-sidecar will apply the change to its corresponding Egress Pipeline; e.g., if EG-master updates one service's LoadBalance spec, the affected EG-sidecars will update their Egress Pipeline's proxy filter to the desired load balance kind.
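The essential behavior of item 1 can be sketched as an HTTP handler that always answers with the loopback address and the Egress port. This is a simplification: the real sidecar renders the full Eureka/Consul response structures.

package sidecar

import (
	"encoding/json"
	"net/http"
)

// discoveryInstance is a simplified discovery answer; the real sidecar
// speaks the actual Eureka/Consul wire formats instead.
type discoveryInstance struct {
	ServiceName string `json:"serviceName"`
	IP          string `json:"ip"`
	Port        int    `json:"port"`
}

// handleDiscovery always points the business process at the local
// Egress HTTPServer, which then routes the RPC to the real instances.
func handleDiscovery(egressPort int) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		inst := discoveryInstance{
			ServiceName: r.URL.Query().Get("serviceName"),
			IP:          "127.0.0.1",
			Port:        egressPort,
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode([]discoveryInstance{inst})
	}
}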

Reference

[1] mesh requirements https://docs.google.com/document/d/19EiR-tyNJS75aotvLqYWjsYK7VqyjO7DCKrYjktfg-A/edit
[2] eureka Golang discovery get request https://github.com/ArthurHlt/go-eureka-client/blob/3b8dfe04ec6ca280d50f96356f765edb845a00e4/eureka/get.go#L18
[3] consul catalog service discovery structure https://github.com/hashicorp/consul/blob/api/v1.7.0/api/catalog.go#L187
[4] Eureka API list https://github.com/Netflix/eureka/wiki/Eureka-REST-operations

Installation meets a resource-invalid problem when resources already exist

Background

DeployXXX[1] runs into problems when updating existing resources; it is a known issue when using the PUT method of the Kubernetes APIs:

Error message example 1

customresourcedefinitions.apiextensions.k8s.io is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update.

Error message example 2:

install mesh infrastructure error: invoke install func: deploy mesh control panel resource: deploy easemesh controlpanel inner service failed: Service "easemesh-controlplane-hs" is invalid: metadata.resourceVersion: Invalid value: "": must be specified for an update.

Same problems from the community:

Proposal

  1. Get the latest resourceVersion and put it in the update spec, but this has a race condition.
  2. Set the annotation kubectl.kubernetes.io/last-applied-configuration and then update (the kubectl way).
  3. Delete the existing one and PUT a brand-new resource.

Method 3 seems to be the best choice: simple and clean, and it won't change along with Kubernetes' implicit behavior. We could add a --replace flag to do it; by default we output errors when the resources already exist. A sketch of this approach follows.
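With client-go, method 3 boils down to the following shape for each resource kind (sketched here with a Service; the real DeployXXX helpers cover every kind, and production code should also wait for the deletion to complete before re-creating):

package k8sutil

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createOrReplaceService creates the Service; when replace is set and the
// resource already exists, it deletes the old one and creates a brand-new
// resource, sidestepping the resourceVersion requirement of PUT updates.
func createOrReplaceService(ctx context.Context, cs kubernetes.Interface, svc *corev1.Service, replace bool) error {
	client := cs.CoreV1().Services(svc.Namespace)
	_, err := client.Create(ctx, svc, metav1.CreateOptions{})
	if !apierrors.IsAlreadyExists(err) || !replace {
		return err // success, a real error, or exists without --replace
	}
	if err := client.Delete(ctx, svc.Name, metav1.DeleteOptions{}); err != nil {
		return err
	}
	_, err = client.Create(ctx, svc, metav1.CreateOptions{})
	return err
}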

Reference

[1] https://github.com/megaease/easemesh/blob/main/emctl/cmd/client/command/meshinstall/base/k8sutils.go

The api certificates.k8s.io/v1beta1 is deprecated in v1.19+, unavailable in v1.22+

We use certificates.k8s.io/v1beta1 to get authorization from the Kubernetes CA. In current testing, csr/v1beta1 works for us, but signerName and serverAuth in csr/v1 seem not to work. We need to figure this out to support Kubernetes v1.22+.

Warning messages in installation:

certificates.k8s.io/v1beta1 CertificateSigningRequest is deprecated in v1.19+, unavailable in v1.22+; use certificates.k8s.io/v1 CertificateSigningRequest

Too many half closed connections in our mesh pods

I leveraged nicolaka/netshoot to enter our pod's network namespace and found too many half-closed connections; this is abnormal and should be fixed.

tcp        0      0 127.0.0.1:8778          0.0.0.0:*               LISTEN
tcp      162      0 127.0.0.1:59614         127.0.0.1:8778          CLOSE_WAIT
tcp        0      0 127.0.0.1:8778          127.0.0.1:59856         FIN_WAIT2
tcp        0      0 127.0.0.1:8778          127.0.0.1:60146         ESTABLISHED
tcp      162      0 127.0.0.1:59678         127.0.0.1:8778          CLOSE_WAIT
tcp        0      0 127.0.0.1:8778          127.0.0.1:59946         FIN_WAIT2
tcp        0      0 127.0.0.1:8778          127.0.0.1:59502         FIN_WAIT2
tcp      161      0 127.0.0.1:60204         127.0.0.1:8778          ESTABLISHED
tcp        1      0 127.0.0.1:59502         127.0.0.1:8778          CLOSE_WAIT
tcp        0      0 127.0.0.1:8778          127.0.0.1:60004         FIN_WAIT2
tcp        0      0 127.0.0.1:8778          127.0.0.1:59678         FIN_WAIT2
tcp      161      0 127.0.0.1:60146         127.0.0.1:8778          ESTABLISHED
tcp        0      0 127.0.0.1:8778          127.0.0.1:60110         ESTABLISHED
tcp      162      0 127.0.0.1:59912         127.0.0.1:8778          CLOSE_WAIT
tcp        1      0 127.0.0.1:59544         127.0.0.1:8778          CLOSE_WAIT
tcp      162      0 127.0.0.1:59438         127.0.0.1:8778          CLOSE_WAIT
tcp      161      0 127.0.0.1:60110         127.0.0.1:8778          ESTABLISHED
tcp        6      0 127.0.0.1:59338         127.0.0.1:8778          CLOSE_WAIT
tcp        0      0 127.0.0.1:8778          127.0.0.1:59752         FIN_WAIT2
tcp        0      0 127.0.0.1:8778          127.0.0.1:59544         FIN_WAIT2
tcp        0      0 127.0.0.1:8778          127.0.0.1:59614         FIN_WAIT2
tcp        0      0 127.0.0.1:8778          127.0.0.1:60052         ESTABLISHED
tcp      162      0 127.0.0.1:59946         127.0.0.1:8778          CLOSE_WAIT
tcp        0      0 127.0.0.1:8778          127.0.0.1:59912         FIN_WAIT2
tcp        0      0 127.0.0.1:8778          127.0.0.1:59438         FIN_WAIT2
tcp        1      0 127.0.0.1:59752         127.0.0.1:8778          CLOSE_WAIT
tcp      162      0 127.0.0.1:59810         127.0.0.1:8778          CLOSE_WAIT
tcp        0      0 127.0.0.1:8778          127.0.0.1:59810         FIN_WAIT2
tcp      162      0 127.0.0.1:59654         127.0.0.1:8778          CLOSE_WAIT
tcp        0      0 127.0.0.1:8778          127.0.0.1:59654         FIN_WAIT2
tcp        1      0 127.0.0.1:59856         127.0.0.1:8778          CLOSE_WAIT
tcp      161      0 127.0.0.1:60052         127.0.0.1:8778          ESTABLISHED
tcp        1      0 127.0.0.1:60004         127.0.0.1:8778          CLOSE_WAIT
tcp        0      0 127.0.0.1:8778          127.0.0.1:60204         ESTABLISHED

Enable service instance record cleanup

Background

  • EaseMesh uses the MeshController and an Easegress cluster as the control plane.
  • A mesh service registers itself when one instance (a K8s pod) becomes ready.

Problem

  1. After a service instance stops its cyclic heartbeat reporting, the control plane excludes it by setting its state to OUT_OF_SERVICE. But even after being excluded from the mesh, the instance's record remains in ETCD storage.
  2. A mesh service creates several dynamic pods in K8s; the records of previously created and destroyed instances remain in the system, as in case 1.

Proposal

  1. Enable the MeshController worker to update its instance heartbeat record with the lease API, e.g. a 30-minute lease, so that the record is deleted by the storage automatically when the lease expires. A sketch of the lease-based write follows.
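The 30-minute TTL is the example value from above, and the key follows the registry layout; the helper name is illustrative:

package worker

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// putHeartbeatWithLease writes the instance heartbeat record attached to
// a 30-minute lease; if the worker stops refreshing the record, etcd
// deletes it automatically at lease expiration, so records of dead
// instances clean themselves up.
func putHeartbeatWithLease(ctx context.Context, cli *clientv3.Client, serviceName, instanceID string) error {
	lease, err := cli.Grant(ctx, int64((30 * time.Minute).Seconds()))
	if err != nil {
		return err
	}
	key := fmt.Sprintf("/mesh/services/%s/instances/%s/heartbeat", serviceName, instanceID)
	value := fmt.Sprintf(`{"lastActiveTime":%d}`, time.Now().Unix())
	_, err = cli.Put(ctx, key, value, clientv3.WithLease(lease.ID))
	return err
}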

The response data type of the management interface is not uniform

I found that the response types in the mesh are divided into about 4 kinds.
I think the first one is normal, so I have tried to list as many interfaces as possible for the last three.

  1. Headers: Content-Type: application/json. The returned data is also JSON. Most APIs return this way.
  2. Headers: Content-Type: text/plain; charset=utf-8. The returned data is JSON.
    • /apis/v1/mesh/traffictargets
    • /apis/v1/mesh/customresources
    • /apis/v1/mesh/httproutegroups
  3. Headers: Content-Type: text/vnd.yaml. The returned data is YAML.
    • /apis/v1/objects/easemesh-controller
  4. Headers: Content-Type: text/plain; charset=utf-8. The returned data is YAML.
    • API responses with status 40X, e.g. /apis/v1/mesh/traffictargets/nameNotExist
