Comments (23)
I'm also creating a new release of kapp that includes a debug flag, so that we can get to the bottom of what's going on in your cluster.
from kapp.
They don't tell us such things. If you want me to run any requests with timings - would be glad to test for you
@pavel-khritonenko would you mind building from develop and running kapp? I've made two changes: (1) cd6e6bb throttles a seemingly expensive operation on your cluster to 10 at a time, and (2) 78ab39c includes more info in --debug.
To build (requires checking out into GOPATH):

```shell
git clone https://github.com/k14s/kapp /tmp/kapp-go/src/github.com/k14s/kapp
cd /tmp/kapp-go/src/github.com/k14s/kapp
export GOPATH=/tmp/kapp-go
./hack/build.sh
./kapp ...
rm -rf /tmp/kapp-go
```
Yes, sorry for disappearing.
Yesterday I figured out a few things that caused the timeout.
First: I don't specify the `revisionHistoryLimit` parameter anywhere in my deployments, assuming it will get the default value. However, when deployed with the kapp tool, I see 2147483647 there instead of the default value (10).
Second: we use https://keel.sh to auto-update our deployments, so any commit to a branch triggers a deployment update and creates a new replica set. As a result, we get a lot of replica sets for each of our ~60 deployments. Even `kubectl get rs` failed with a "server EOF" or similar error. And kapp failed as well, because it fetches all resources generated by deployments, so it tries to get all replica sets to build the diff.
I specified `revisionHistoryLimit` explicitly in our deployments and the issue disappeared.
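To see how many replica sets have piled up per namespace before trimming the history, a quick sketch (assumes `kubectl` access to the cluster; the helper name is mine):

```shell
# Count ReplicaSets per namespace; deployments with an effectively
# unbounded revisionHistoryLimit show up as namespaces with huge counts.
count_rs_per_ns() {
  kubectl get rs --all-namespaces --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn
}
```

Running `count_rs_per_ns` before and after setting `revisionHistoryLimit` makes the cleanup visible.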
Under a single application I deploy about 230 resources (generated). At some point, deployment started taking a long time; after adding more resources it stopped working entirely.
That's interesting... how long does it hang until the error?
It hangs for a couple of minutes, then I get this error when I run it locally:
Does it hang before showing the diff (I'm guessing so, since the error includes `Listing ...`)?
Can you describe your cluster a bit more, specifically:
- is it mostly the same resource type (eg ConfigMaps) within this application?
- how many resource types are there (`kubectl api-resources`)?
- do you limit your user account (used by kapp) to specific namespace(s), or are all resources available to it?
- what cluster provider are you using (eg GKE, EKS, etc.)?
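A quick way to answer the resource-type question (a sketch; assumes `kubectl` is configured against the cluster, and the helper name is mine):

```shell
# Count API resource types that support "list" -- each one is a
# candidate list request for tools that scan the whole cluster.
count_listable_types() {
  kubectl api-resources --verbs=list --no-headers | wc -l
}
```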
I'm also curious how long the following command runs against your cluster: https://gist.github.com/cppforlife/25890e4a9e732413bbf83c81e4a808b1 (122 resources to be created; ~3s for me to check the cluster and calculate the diff).
Sorry for disappearing:
> how long does it hang until the error?
I executed it several times from my local machine (100 Mbit wifi).
Four times it failed in about 1 minute (00:01:09 - 00:01:11); the 5th and 6th times it succeeded in 00:01:30 - 00:01:38.
> does it hang before showing the diff
Correct, before the diff.
> is it mostly same resource type (eg ConfigMaps) within this application
It's about 20 instances of the same application with different settings (2 deployments, 1 CRD certificate, 2 PDBs, 2 ConfigMaps, 1 ingress, 2 services).
> how many resource types (`kubectl api-resources`)
https://gist.github.com/pavel-khritonenko/a4ffb3bec510a1d4d1a3b419cfd92993
> do you limit your user account (used by kapp) to specific namespace(s) or are all resources available to it?
Cluster admin permissions (no limits)
> what cluster provider are you using
EKS (Amazon Web Services)
I've since added the deployment to our CI/CD process and run it manually from GitLab on a runner near the cluster (in the same subnet); there it never fails.
> im also curious how long following command runs against your cluster: https://gist.github.com/cppforlife/25890e4a9e732413bbf83c81e4a808b1 (122 resources to be created, ~3 to check cluster and calculate diff)
19 seconds, success
```shell
$ cue dump | grep kind
```
https://gist.github.com/pavel-khritonenko/46032924c6211a6f690cd9fdb303b9a6
> 19 seconds, success
Oh, that's interesting. I'm using a default GKE cluster with 3 nodes and would have expected a similar response time (~3s).
> EKS (amazon web services)
How beefy are the control plane machines? Not sure if AWS tells you those details.
@pavel-khritonenko would you mind trying out https://github.com/k14s/kapp/releases/tag/v0.14.0 with the --debug flag and posting the results?
```
$ cue dump | kapp deploy --debug --wait=false -a frontends -f -
02:24:51PM: debug: CommandRun: start
02:24:51PM: debug: RecordedApp: CreateOrUpdate: start
02:24:52PM: debug: RecordedApp: CreateOrUpdate: end
02:24:54PM: debug: LabeledResources: Prepare: start
02:24:54PM: debug: LabeledResources: Prepare: end
02:24:54PM: debug: LabeledResources: AllAndMatching: start
02:24:54PM: debug: LabeledResources: All: start
02:24:54PM: debug: IdentifiedResources: List: start
02:27:15PM: debug: IdentifiedResources: List: end
02:27:15PM: debug: LabeledResources: All: end
02:27:15PM: debug: LabeledResources: AllAndMatching: end
02:27:15PM: debug: CommandRun: end
Error: Listing schema.GroupVersionResource{Group:"extensions", Version:"v1beta1", Resource:"replicasets"}, namespaced: true: Stream error http2.StreamError{StreamID:0x11f, Code:0x2, Cause:error(nil)} when reading response body, may be caused by closed connection. Please retry.
```
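The debug log shows `IdentifiedResources: List` taking over two minutes, and the failure happens while listing `extensions/v1beta1` replicasets. One way to approximate that expensive call outside kapp (a sketch; assumes cluster access, and the helper name is mine):

```shell
# Count how many ReplicaSet objects the API server has to stream back;
# a very large number here would explain both the slowness and the
# dropped HTTP/2 stream.
count_all_replicasets() {
  kubectl get replicasets --all-namespaces -o name | wc -l
}
```

Wrapping the call in `time count_all_replicasets` gives a rough lower bound on what kapp's list step costs.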
I just faced the same issue deploying 76 Kubernetes deployments (nothing special: single deployment, single container, different env variables). The initial creation was fast and flawless, but updating the same definitions fails for the same reason.
@pavel-khritonenko I've been away on vacation, hence my slow response (coming back next week). Meanwhile, I'm intrigued by your mention that creation is fast while update isn't. Could you attach debug output for creation as well?
Given that the debug log above showed `IdentifiedResources: List` taking ~3 mins, I'll have to add more debug logs to that method and make a new release.
@pavel-khritonenko any update on this issue?
@pavel-khritonenko checking in, any updates on this?
> Yesterday I figured out a few things caused a timeout.

Nice finds.
> However, when it deployed with kapp tool - I see 2147483647 here instead of the default value (10).

I didn't quite follow this one. Are you saying 2147483647 showed up in the diff? Who was setting it?
I'm not sure where that value comes from, but I haven't set it before. I cannot reproduce it with the latest version of kapp (0.14) and am trying to reproduce with earlier versions. What I see in the annotations of one deployment:
```yaml
- type: test
  path: /spec/progressDeadlineSeconds
  value: 2147483647
- type: remove
  path: /spec/progressDeadlineSeconds
- type: test
  path: /spec/revisionHistoryLimit
  value: 2147483647
- type: remove
  path: /spec/revisionHistoryLimit
```
My build agent is still using version 0.13; I'll share a report when I'm able to reproduce it.
Managed to reproduce with version 0.13:
I manually changed `revisionHistoryLimit` to 3 on the `psql` deployment, then applied the following definition using kapp:
```yaml
---
apiVersion: "extensions/v1beta1"
kind: "Deployment"
metadata:
  labels:
    app: "psql"
    reloader.stakater.com/auto: "true"
  name: "psql"
  namespace: "sandbox"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: "psql"
  template:
    metadata:
      labels:
        app: "psql"
    spec:
      containers:
      - args:
        - "while true; do sleep 30; done;"
        command:
        - "/bin/sh"
        - "-c"
        - "--"
        env:
        - name: "PGHOST"
          valueFrom:
            secretKeyRef:
              key: "address"
              name: "db"
        - name: "PGDATABASE"
          valueFrom:
            secretKeyRef:
              key: "database"
              name: "db"
        - name: "PGUSER"
          valueFrom:
            secretKeyRef:
              key: "POSTGRES_USER"
              name: "db-auth"
        - name: "PGPASSWORD"
          valueFrom:
            secretKeyRef:
              key: "POSTGRES_PASSWORD"
              name: "db-auth"
        image: "jbergknoff/postgresql-client"
        imagePullPolicy: "Always"
        name: "psql"
        resources:
          limits:
            cpu: "100m"
            memory: "128Mi"
```
What I see after applying:
```yaml
spec:
  progressDeadlineSeconds: 2147483647
  replicas: 1
  revisionHistoryLimit: 2147483647
```
I deleted that deployment manually (`kubectl delete deployment psql -n sandbox`), then reapplied the manifest above with kapp v0.13, and got the same definition as a result.
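To confirm what the server actually stored, the defaulted fields can be read back directly (namespace and deployment name are from the reproduction in this thread; the helper name is mine):

```shell
# Print the server-side values of the two fields that keep coming
# back as 2147483647 under extensions/v1beta1.
show_history_fields() {
  kubectl get deployment psql -n sandbox \
    -o jsonpath='{.spec.revisionHistoryLimit} {.spec.progressDeadlineSeconds}{"\n"}'
}
```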
It seems it's not related to kapp, because deploying the same manifest with kubectl leads to the same issue.
Finally got it: it's because of the `apiVersion`. When I specify `apps/v1`, everything works just fine. With `extensions/v1beta1`, the server sets the default value of `revisionHistoryLimit` to 2147483647.
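For context, 2147483647 is max int32, which is what `extensions/v1beta1` effectively defaulted these fields to, while `apps/v1` defaults `revisionHistoryLimit` to 10 and `progressDeadlineSeconds` to 600. A minimal sketch of the fix, keeping the rest of the manifest unchanged (note that `apps/v1` also makes `spec.selector` required):

```yaml
apiVersion: apps/v1              # was: extensions/v1beta1
kind: Deployment
metadata:
  name: psql
  namespace: sandbox
spec:
  revisionHistoryLimit: 10       # apps/v1 default; set explicitly to be safe
  selector:                      # required in apps/v1
    matchLabels:
      app: psql
  # ...template unchanged...
```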
> Seems it's not related to kapp, because deploying the same manifest with kubectl leads to the same issue.

Yup, sounds like server-side behaviour.
I'll close this issue (and probably file a separate one to warn when fetching resources takes a long time). Thanks for digging in.