Comments (23)

cppforlife commented on August 21, 2024

i'm also creating a new release of kapp that includes a debug flag so that we can get to the bottom of what's going on in your cluster.

from kapp.

pavel-khritonenko commented on August 21, 2024

They don't tell us such things. If you want me to run any requests with timings, I'd be glad to test for you.

cppforlife commented on August 21, 2024

@pavel-khritonenko would you mind building from develop and running kapp? i've made two changes: (1) cd6e6bb throttles a seemingly expensive operation on your cluster to 10 at a time, and (2) 78ab39c includes more info in --debug.

to build (requires checking out to GOPATH):

git clone https://github.com/k14s/kapp /tmp/kapp-go/src/github.com/k14s/kapp
cd /tmp/kapp-go/src/github.com/k14s/kapp
export GOPATH=/tmp/kapp-go
./hack/build.sh
./kapp ...
rm -rf /tmp/kapp-go
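
For illustration only, the throttling in change (1) can be pictured as a bounded worker pool. This is a minimal Python sketch, not kapp's actual Go code: `list_resource_type`, `list_all`, and the `type-N` names are hypothetical stand-ins for the expensive per-type list calls, and only the limit of 10 comes from the description above.

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time

MAX_IN_FLIGHT = 10  # mirrors the "10 at a time" limit described above

_lock = threading.Lock()
_in_flight = 0
peak_in_flight = 0  # highest number of calls observed running at once


def list_resource_type(name):
    """Hypothetical stand-in for one expensive list call against the cluster."""
    global _in_flight, peak_in_flight
    with _lock:
        _in_flight += 1
        peak_in_flight = max(peak_in_flight, _in_flight)
    time.sleep(0.01)  # pretend the API server takes a while to answer
    with _lock:
        _in_flight -= 1
    return name, []


def list_all(resource_types, limit=MAX_IN_FLIGHT):
    """List every resource type, running at most `limit` calls concurrently."""
    with ThreadPoolExecutor(max_workers=limit) as pool:
        return dict(pool.map(list_resource_type, resource_types))


results = list_all([f"type-{i}" for i in range(50)])
print(len(results), "types listed, peak concurrency:", peak_in_flight)
```

The pool size is the only knob: a lower limit trades total run time for less pressure on the API server.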

pavel-khritonenko commented on August 21, 2024

Yes, sorry for disappearing.

Yesterday I figured out a few things that caused the timeout.

First, I don't specify the revisionHistoryLimit parameter anywhere in my deployments, assuming it would get the default. However, when deployed with kapp, I see 2147483647 there instead of the default value (10).

Second, we use https://keel.sh to auto-update our deployments, so any commit to a branch triggers a deployment update and creates a new replica set. As a result, we get a lot of replica sets for each of ~60 deployments, to the point that even kubectl get rs failed with an error like "server EOF". kapp failed as well because it fetches all resources generated by the deployments, so it tries to get all replica sets to build the diff.

I specified revisionHistoryLimit explicitly in our deployments and the issue disappeared.
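
A minimal sketch of that fix applied programmatically, assuming the manifests are available as Python dicts before being piped to kapp. The helper name `with_history_limit` and the sample manifest are hypothetical; the value 10 is the default mentioned above.

```python
import copy

DEFAULT_LIMIT = 10  # assumption: the default value mentioned above


def with_history_limit(manifest, limit=DEFAULT_LIMIT):
    """Return a copy of a manifest with spec.revisionHistoryLimit set.

    An explicit value already present in the manifest is left untouched,
    and resources other than Deployments are returned unchanged.
    """
    result = copy.deepcopy(manifest)
    if result.get("kind") == "Deployment":
        result.setdefault("spec", {}).setdefault("revisionHistoryLimit", limit)
    return result


# Hypothetical sample manifest, for illustration only.
deployment = {
    "apiVersion": "extensions/v1beta1",
    "kind": "Deployment",
    "metadata": {"name": "example"},
    "spec": {"replicas": 1},
}
patched = with_history_limit(deployment)
print(patched["spec"])
```

Setting the field explicitly bounds how many old replica sets each deployment keeps, which in turn bounds what has to be listed on every diff.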

cppforlife commented on August 21, 2024

hey @pavel-khritonenko

Under a single application, I deploy ~230 resources (generated). At some point, deployment started taking a long time; after adding more resources it stopped working at all.

that's interesting... how long does it hang until the error?

It hangs for a couple of minutes, then I get this error when I run it locally:

does it hang before showing the diff (im guessing so since the error includes Listing ...)?

can you describe your cluster a bit more, more specifically:

  • is it mostly same resource type (eg ConfigMaps) within this application
  • how many resource types (kubectl api-resources)
  • do you limit your user account (used by kapp) to specific namespace(s) or are all resources available to it?
  • what cluster provider are you using (eg GKE, EKS, etc.)

cppforlife commented on August 21, 2024

i'm also curious how long the following command runs against your cluster: https://gist.github.com/cppforlife/25890e4a9e732413bbf83c81e4a808b1 (122 resources to be created, ~3s to check cluster and calculate diff)

pavel-khritonenko commented on August 21, 2024

Sorry for disappearing:

how long does it hang until the error?

Executed several times on my local machine (100 Mbit Wi-Fi).
It failed 4 times after about 1 minute (00:01:09 to 00:01:11); the 5th and 6th runs succeeded in 00:01:30 to 00:01:38.

does it hang before showing the diff

correct, before diff

[Screenshot 2019-10-03 at 13.51.18]

is it mostly same resource type (eg ConfigMaps) within this application

It's about 20 instances of the same application with different settings (2 deployments, 1 crd certificate, 2 pdb, 2 configmaps, 1 ingress, 2 services)

kubectl api-resources

https://gist.github.com/pavel-khritonenko/a4ffb3bec510a1d4d1a3b419cfd92993

do you limit your user account (used by kapp) to specific namespace(s) or are all resources available to it?

Cluster admin permissions (no limits)

what cluster provider are you using

EKS (amazon web services)

pavel-khritonenko commented on August 21, 2024

I have now added the deployment to our CI/CD process and run it manually from GitLab on a runner near the cluster (same subnet); there it never fails.

pavel-khritonenko commented on August 21, 2024

i'm also curious how long the following command runs against your cluster: https://gist.github.com/cppforlife/25890e4a9e732413bbf83c81e4a808b1 (122 resources to be created, ~3s to check cluster and calculate diff)

19 seconds, success

pavel-khritonenko commented on August 21, 2024

$ cue dump | grep kind
https://gist.github.com/pavel-khritonenko/46032924c6211a6f690cd9fdb303b9a6

cppforlife commented on August 21, 2024

19 seconds, success

oh, that's interesting. i'm using a default GKE cluster with 3 nodes, and would have expected you to see a similar response time (~3s).

EKS (amazon web services)

how beefy are the control plane machines? not sure if AWS tells you those details.

cppforlife commented on August 21, 2024

@pavel-khritonenko would you mind trying out https://github.com/k14s/kapp/releases/tag/v0.14.0 with the --debug flag and posting the results?

pavel-khritonenko commented on August 21, 2024

$ cue dump | kapp deploy --debug --wait=false -a frontends -f -
02:24:51PM: debug: CommandRun: start
02:24:51PM: debug: RecordedApp: CreateOrUpdate: start
02:24:52PM: debug: RecordedApp: CreateOrUpdate: end
02:24:54PM: debug: LabeledResources: Prepare: start
02:24:54PM: debug: LabeledResources: Prepare: end
02:24:54PM: debug: LabeledResources: AllAndMatching: start
02:24:54PM: debug: LabeledResources: All: start
02:24:54PM: debug: IdentifiedResources: List: start
02:27:15PM: debug: IdentifiedResources: List: end
02:27:15PM: debug: LabeledResources: All: end
02:27:15PM: debug: LabeledResources: AllAndMatching: end
02:27:15PM: debug: CommandRun: end

Error: Listing schema.GroupVersionResource{Group:"extensions", Version:"v1beta1", Resource:"replicasets"}, namespaced: true: Stream error http2.StreamError{StreamID:0x11f, Code:0x2, Cause:error(nil)} when reading response body, may be caused by closed connection. Please retry.
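
To see at a glance which step dominates, the start/end pairs in the debug output above can be turned into durations. A minimal Python sketch, assuming every line has the `HH:MM:SSAM/PM: debug: <name>: start|end` shape shown above and that the run does not cross midnight; the helper name `span_durations` is made up for this example.

```python
from datetime import datetime

DEBUG_LOG = """\
02:24:51PM: debug: CommandRun: start
02:24:51PM: debug: RecordedApp: CreateOrUpdate: start
02:24:52PM: debug: RecordedApp: CreateOrUpdate: end
02:24:54PM: debug: LabeledResources: Prepare: start
02:24:54PM: debug: LabeledResources: Prepare: end
02:24:54PM: debug: LabeledResources: AllAndMatching: start
02:24:54PM: debug: LabeledResources: All: start
02:24:54PM: debug: IdentifiedResources: List: start
02:27:15PM: debug: IdentifiedResources: List: end
02:27:15PM: debug: LabeledResources: All: end
02:27:15PM: debug: LabeledResources: AllAndMatching: end
02:27:15PM: debug: CommandRun: end
"""


def span_durations(log_text):
    """Map each 'name: start' / 'name: end' pair to its elapsed seconds."""
    started, durations = {}, {}
    for line in log_text.splitlines():
        stamp, sep, rest = line.partition(": debug: ")
        if not sep:
            continue  # not a debug line (blank line, error text, ...)
        name, _, event = rest.rpartition(": ")
        when = datetime.strptime(stamp, "%I:%M:%S%p")
        if event == "start":
            started[name] = when
        elif event == "end" and name in started:
            elapsed = when - started.pop(name)
            durations[name] = int(elapsed.total_seconds())
    return durations


durations = span_durations(DEBUG_LOG)
for name, seconds in durations.items():
    print(f"{seconds:4d}s  {name}")
```

For the log above this attributes 141 of the 144 seconds to IdentifiedResources: List, which is the step that ends in the replicasets listing error.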

pavel-khritonenko commented on August 21, 2024

I just faced the same issue deploying 76 Kubernetes deployments (nothing special: one container per deployment, different env variables). Initial creation was fast and flawless; updating such a definition fails for the same reason.

cppforlife commented on August 21, 2024

@pavel-khritonenko i've been away on vacation, hence my slow response (coming back next week). meanwhile i'm intrigued that you mention creation is fast while update is not. could you attach debug output for creation as well?

given that the debug log above showed that IdentifiedResources: List took over 2 mins (02:24:54 to 02:27:15), i'll have to add more debug logs to that method and make a new release.

cppforlife commented on August 21, 2024

@pavel-khritonenko any update on this issue?

cppforlife commented on August 21, 2024

@pavel-khritonenko checking in, any updates on this?

cppforlife commented on August 21, 2024

Yesterday I figured out a few things that caused the timeout.

nice finds.

However, when deployed with kapp, I see 2147483647 there instead of the default value (10).

i didn't quite follow this one. are you saying 2147483647 showed up in the diff? who was setting it?

pavel-khritonenko commented on August 21, 2024

I'm not sure where that value comes from, but I haven't set it previously. Cannot reproduce it with the latest version of kapp (0.14), trying to reproduce with earlier versions. What I see in the annotations of one deployment:

- type: test
  path: /spec/progressDeadlineSeconds
  value: 2147483647
- type: remove
  path: /spec/progressDeadlineSeconds
- type: test
  path: /spec/revisionHistoryLimit
  value: 2147483647
- type: remove
  path: /spec/revisionHistoryLimit

My build agent is still using version 0.13; I'll share a report when I'm able to reproduce it.

pavel-khritonenko commented on August 21, 2024

Managed to reproduce with version 0.13:

I manually changed revisionHistoryLimit of the psql deployment to 3, then applied the following definition using kapp:

---
apiVersion: "extensions/v1beta1"
kind: "Deployment"
metadata:
  labels:
    app: "psql"
    reloader.stakater.com/auto: "true"
  name: "psql"
  namespace: "sandbox"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: "psql"
  template:
    metadata:
      labels:
        app: "psql"
    spec:
      containers:
        - args:
            - "while true; do sleep 30; done;"
          command:
            - "/bin/sh"
            - "-c"
            - "--"
          env:
            - name: "PGHOST"
              valueFrom:
                secretKeyRef:
                  key: "address"
                  name: "db"
            - name: "PGDATABASE"
              valueFrom:
                secretKeyRef:
                  key: "database"
                  name: "db"
            - name: "PGUSER"
              valueFrom:
                secretKeyRef:
                  key: "POSTGRES_USER"
                  name: "db-auth"
            - name: "PGPASSWORD"
              valueFrom:
                secretKeyRef:
                  key: "POSTGRES_PASSWORD"
                  name: "db-auth"
          image: "jbergknoff/postgresql-client"
          imagePullPolicy: "Always"
          name: "psql"
          resources:
            limits:
              cpu: "100m"
              memory: "128Mi"

What I see after applying:

spec:
  progressDeadlineSeconds: 2147483647
  replicas: 1
  revisionHistoryLimit: 2147483647

I then deleted that deployment manually (kubectl delete deployment psql -n sandbox) and reapplied the manifest above with kapp v0.13; the result was the same definition.

pavel-khritonenko commented on August 21, 2024

Seems it's not related to kapp, because deploying the same manifest with kubectl leads to the same issue.

Finally got it: it seems to be caused by the apiVersion. When I specify apps/v1, everything works just fine. With extensions/v1beta1, the default value of revisionHistoryLimit is set to 2147483647.
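
To find manifests still affected by this, one option is to scan for Deployments declared under extensions/v1beta1 before applying them. A minimal sketch, assuming the manifests are already parsed into Python dicts; the helper name `legacy_deployments` and the sample entries other than psql are hypothetical.

```python
LEGACY_API_VERSION = "extensions/v1beta1"


def legacy_deployments(manifests):
    """Return the names of Deployments that still use the legacy apiVersion."""
    return [
        m.get("metadata", {}).get("name", "<unnamed>")
        for m in manifests
        if m.get("kind") == "Deployment"
        and m.get("apiVersion") == LEGACY_API_VERSION
    ]


# Sample input for illustration; only the psql Deployment appears in this thread.
manifests = [
    {"apiVersion": "extensions/v1beta1", "kind": "Deployment",
     "metadata": {"name": "psql"}},
    {"apiVersion": "apps/v1", "kind": "Deployment",
     "metadata": {"name": "frontend"}},
    {"apiVersion": "v1", "kind": "Service", "metadata": {"name": "psql"}},
]
print(legacy_deployments(manifests))  # ['psql']
```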

cppforlife commented on August 21, 2024

Seems it's not related to kapp, because deploying the same manifest with kubectl leads to the same issue.

yup, sounds like a server side behaviour.

i'll close the issue (i'll probably file a different issue to add a warning when fetching resources takes a long time). thanks for digging in.
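
The warning idea could look roughly like the following. This is a hypothetical Python sketch, not kapp code (kapp is written in Go); the helper name `warn_if_slow` and the 30-second threshold are assumptions made for illustration.

```python
import time
import warnings
from contextlib import contextmanager


@contextmanager
def warn_if_slow(label, threshold_seconds=30.0, clock=time.monotonic):
    """Emit a warning if the wrapped block runs longer than the threshold."""
    started = clock()
    try:
        yield
    finally:
        elapsed = clock() - started
        if elapsed > threshold_seconds:
            warnings.warn(
                f"{label} took {elapsed:.0f}s "
                f"(threshold {threshold_seconds:.0f}s)"
            )


# Usage: wrap the slow call; nothing is printed when it finishes in time.
with warn_if_slow("Listing replicasets", threshold_seconds=30.0):
    pass  # the expensive list call would go here
```

The clock is injectable so the behaviour can be checked without actually waiting for the threshold to pass.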
