
Comments (24)

aojea commented on September 27, 2024

Failures https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1795925976450863104

error

I0529 21:32:03.564574 9867 kubectl.go:593] copying /workspace/kubernetes/platforms/linux/amd64/kubectl to the httpd pod
I0529 21:32:03.564635 9867 builder.go:121] Running '/workspace/kubernetes/platforms/linux/amd64/kubectl --server=https://34.83.180.141 --kubeconfig=/workspace/.kube/config --namespace=kubectl-6332 cp /workspace/kubernetes/platforms/linux/amd64/kubectl kubectl-6332/httpd:/tmp/'
I0529 21:32:09.340487 9867 builder.go:135] rc: 1
I0529 21:32:09.340623 9867 builder.go:91] Unexpected error: 
    <exec.CodeExitError>: 
    error running /workspace/kubernetes/platforms/linux/amd64/kubectl --server=https://34.83.180.141 --kubeconfig=/workspace/.kube/config --namespace=kubectl-6332 cp /workspace/kubernetes/platforms/linux/amd64/kubectl kubectl-6332/httpd:/tmp/:
    Command stdout:
    
    stderr:
    E0529 21:32:09.334805   17304 v2.go:167] "Unhandled Error" err="next reader: read tcp 10.34.160.50:50460->34.83.180.141:443: read: connection reset by peer"
    E0529 21:32:09.334808   17304 v2.go:129] "Unhandled Error" err="next reader: read tcp 10.34.160.50:50460->34.83.180.141:443: read: connection reset by peer"
    E0529 21:32:09.334853   17304 v2.go:150] "Unhandled Error" err="next reader: read tcp 10.34.160.50:50460->34.83.180.141:443: read: connection reset by peer"
    error: error reading from error stream: next reader: read tcp 10.34.160.50:50460->34.83.180.141:443: read: connection reset by peer
    
    error:
    exit status 1
    {
        Err: <*errors.errorString | 0xc0046ce9b0>{
            s: "error running /workspace/kubernetes/platforms/linux/amd64/kubectl --server=https://34.83.180.141 --kubeconfig=/workspace/.kube/config --namespace=kubectl-6332 cp /workspace/kubernetes/platforms/linux/amd64/kubectl kubectl-6332/httpd:/tmp/:\nCommand stdout:\n\nstderr:\nE0529 21:32:09.334805   17304 v2.go:167] \"Unhandled Error\" err=\"next reader: read tcp 10.34.160.50:50460->34.83.180.141:443: read: connection reset by peer\"\nE0529 21:32:09.334808   17304 v2.go:129] \"Unhandled Error\" err=\"next reader: read tcp 10.34.160.50:50460->34.83.180.141:443: read: connection reset by peer\"\nE0529 21:32:09.334853   17304 v2.go:150] \"Unhandled Error\" err=\"next reader: read tcp 10.34.160.50:50460->34.83.180.141:443: read: connection reset by peer\"\nerror: error reading from error stream: next reader: read tcp 10.34.160.50:50460->34.83.180.141:443: read: connection reset by peer\n\nerror:\nexit status 1",
        },
        Code: 1,
    }
[FAILED] error running /workspace/kubernetes/platforms/linux/amd64/kubectl --server=https://34.83.180.141 --kubeconfig=/workspace/.kube/config --namespace=kubectl-6332 cp /workspace/kubernetes/platforms/linux/amd64/kubectl kubectl-6332/httpd:/tmp/:
Command stdout:

stderr:
E0529 21:32:09.334805   17304 v2.go:167] "Unhandled Error" err="next reader: read tcp 10.34.160.50:50460->34.83.180.141:443: read: connection reset by peer"
E0529 21:32:09.334808   17304 v2.go:129] "Unhandled Error" err="next reader: read tcp 10.34.160.50:50460->34.83.180.141:443: read: connection reset by peer"
E0529 21:32:09.334853   17304 v2.go:150] "Unhandled Error" err="next reader: read tcp 10.34.160.50:50460->34.83.180.141:443: read: connection reset by peer"
error: error reading from error stream: next reader: read tcp 10.34.160.50:50460->34.83.180.141:443: read: connection reset by peer

error:
exit status 1
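For context on where this pipeline breaks: `kubectl cp` does not copy the file itself; it packs the source with tar on the client side and streams the archive through an exec tunnel into a `tar -xmf - -C /tmp` process in the container (the apiserver's CONNECT URL below shows exactly that command). A minimal local sketch of the same pipeline, with illustrative `/tmp` paths and no cluster involved:

```shell
# kubectl cp packs the source with tar on the client and pipes the stream
# into `tar -xmf - -C <dir>` running inside the container. The same
# pipeline, run locally (paths are illustrative, no cluster needed):
mkdir -p /tmp/cp-demo/src /tmp/cp-demo/dst
echo "payload" > /tmp/cp-demo/src/kubectl-stand-in
tar -C /tmp/cp-demo/src -cf - kubectl-stand-in | tar -xmf - -C /tmp/cp-demo/dst
cat /tmp/cp-demo/dst/kubectl-stand-in   # -> payload
```

The whole copy rides on one hijacked TCP stream, so a single connection reset, as in the stderr above, aborts it.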

kube-apiserver logs confirm the problem, but it seems to happen on the worker node: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1795925976450863104/artifacts/bootstrap-e2e-master/

E0529 21:32:09.315375 10 conn.go:339] Error on socket receive: read tcp 10.40.0.2:443->34.121.237.185:50460: use of closed network connection
I0529 21:32:09.317154 10 httplog.go:132] "HTTP" verb="CONNECT" URI="/api/v1/namespaces/kubectl-6332/pods/httpd/exec?command=tar&command=-xmf&command=-&command=-C&command=%2Ftmp&container=httpd&stderr=true&stdin=true&stdout=true" latency="5.045622494s" userAgent="kubectl/v1.31.0 (linux/amd64) kubernetes/e821e4f" audit-ID="030c7a1d-0602-402b-932f-397799adf61f" srcIP="34.121.237.185:50460" hijacked=true
I0529 21:32:09.317791 10 conn.go:134] closing connection dialID 6172859311107220987 connectionID 742
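One way to tie the client and server logs together: the client error reads `10.34.160.50:50460->34.83.180.141:443` while the apiserver saw `srcIP="34.121.237.185:50460"`; NAT rewrites the addresses, but the ephemeral port 50460 survives and identifies the same connection on both ends. A small extraction sketch, using the log line copied from above:

```shell
# Extract the tcp address pair from a log line so the ephemeral port
# (50460) can be matched against the client-side error. The line is copied
# verbatim from the apiserver log above.
line='E0529 21:32:09.315375 10 conn.go:339] Error on socket receive: read tcp 10.40.0.2:443->34.121.237.185:50460: use of closed network connection'
pair=$(echo "$line" | grep -o '[0-9.]*:[0-9]*->[0-9.]*:[0-9]*')
echo "$pair"   # -> 10.40.0.2:443->34.121.237.185:50460
```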

the job logs indicate that the pod probes also fail

STEP: Found 6 events. - k8s.io/kubernetes/test/e2e/framework/debug/dump.go:46 @ 05/29/24 21:32:10.323
I0529 21:32:10.323658 9867 dump.go:53] At 2024-05-29 21:31:49 +0000 UTC - event for httpd: {default-scheduler } Scheduled: Successfully assigned kubectl-6332/httpd to bootstrap-e2e-minion-group-3h2k
I0529 21:32:10.323693 9867 dump.go:53] At 2024-05-29 21:31:52 +0000 UTC - event for httpd: {kubelet bootstrap-e2e-minion-group-3h2k} Pulled: Container image "registry.k8s.io/e2e-test-images/httpd:2.4.38-4" already present on machine
I0529 21:32:10.323701 9867 dump.go:53] At 2024-05-29 21:31:52 +0000 UTC - event for httpd: {kubelet bootstrap-e2e-minion-group-3h2k} Created: Created container httpd
I0529 21:32:10.323709 9867 dump.go:53] At 2024-05-29 21:31:54 +0000 UTC - event for httpd: {kubelet bootstrap-e2e-minion-group-3h2k} Started: Started container httpd
I0529 21:32:10.323716 9867 dump.go:53] At 2024-05-29 21:32:09 +0000 UTC - event for httpd: {kubelet bootstrap-e2e-minion-group-3h2k} Killing: Stopping container httpd
I0529 21:32:10.323722 9867 dump.go:53] At 2024-05-29 21:32:09 +0000 UTC - event for httpd: {kubelet bootstrap-e2e-minion-group-3h2k} Unhealthy: Readiness probe failed: Get "http://10.64.4.226:80/": dial tcp 10.64.4.226:80: connect: connection refused

kubelet logs also show that the problem seems to be within the node

httplog.go:132] "HTTP" verb="POST" URI="/exec/nettest-1794/test-container-pod/webserver?command=%2Fbin%2Fsh&command=-c&command=curl+-g+-q+-s+%27http%3A%2F%2F10.64.4.146%3A8083%2Fdial%3Frequest%3Dhostname%26protocol%3Dhttp%26host%3D10.0.192.209%26port%3D80%26tries%3D1%27&error=1&output=1" latency="432.25506ms" userAgent="e2e.test/v1.31.0 (linux/amd64) kubernetes/e821e4f -- [sig-network] Networking Granular Checks: Services should function for multiple endpoint-Services with same selector" audit-ID="" srcIP="10.64.4.3:44658" hijacked=true
May 29 21:32:09.715792 bootstrap-e2e-minion-group-3h2k kubelet[8827]: I0529 21:32:09.714632    8827 prober.go:155] "HTTP-Probe" scheme="http" host="10.64.4.226" port="80" path="/" timeout="5s" headers=null
May 29 21:32:09.715792 bootstrap-e2e-minion-group-3h2k kubelet[8827]: I0529 21:32:09.715160    8827 prober.go:107] "Probe failed" probeType="Readiness" pod="kubectl-6332/httpd" podUID="7d35af69-1f80-438b-9ce6-a657e3fc9769" containerName="httpd" probeResult="failure" output="Get \"http://10.64.4.226:80/\": dial tcp 10.64.4.226:80: connect: connection refused"
May 29 21:32:09.715792 bootstrap-e2e-minion-group-3h2k kubelet[8827]: I0529 21:32:09.715496    8827 event.go:389] "Event occurred" object="kubectl-6332/httpd" fieldPath="spec.containers{httpd}" kind="Pod" apiVersion="v1" type="Warning" reason="Unhealthy" message="Readiness probe failed: Get \"http://10.64.4.226:80/\": dial tcp 10.64.4.226:80: connect: connection refused"

I can't say what happened, but everything points to the node having some problems at that time.

from kubernetes.

neolit123 commented on September 27, 2024

i don't think kube-up uses the GCP CCM, it's just some bash that provisions GCE nodes.
looks like folks on this thread are already debugging, but if you know someone from GCP, please cc them.

also FTR, there is an effort to replace kube-up jobs with kOps jobs:
kubernetes/enhancements#4224


aojea commented on September 27, 2024

i don't think kube-up uses the GCP CCM, it's just some bash that provisions GCE nodes.

@neolit123 it does (https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1795925976450863104/artifacts/bootstrap-e2e-master/cloud-controller-manager.log); kube-up.sh was moved to the external CCM with the deprecation


dims commented on September 27, 2024

[image]

let's reopen if needed.

/close


wendy-ha18 commented on September 27, 2024

/remove-sig cli


neolit123 commented on September 27, 2024

/sig cloud-provider
/area provider/gcp


wendy-ha18 commented on September 27, 2024

/sig node
cc @SergeyKanzhelev @mrunalp


AnishShah commented on September 27, 2024

I think the readiness probe failures might be WAI (working as intended):

After the test fails, we are deleting the pods forcefully:

I0529 21:32:09.341287 9867 builder.go:121] Running '/workspace/kubernetes/platforms/linux/amd64/kubectl --server=https://34.83.180.141 --kubeconfig=/workspace/.kube/config --namespace=kubectl-6332 delete --grace-period=0 --force -f -'
I0529 21:32:09.682128 9867 builder.go:146] stderr: "Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.\n"
I0529 21:32:09.682156 9867 builder.go:147] stdout: "pod \"httpd\" force deleted\n"

and I see readiness probe failures after that:

May 29 21:32:09.715792 bootstrap-e2e-minion-group-3h2k kubelet[8827]: I0529 21:32:09.715160    8827 prober.go:107] "Probe failed" probeType="Readiness" pod="kubectl-6332/httpd" podUID="7d35af69-1f80-438b-9ce6-a657e3fc9769" containerName="httpd" probeResult="failure" output="Get \"http://10.64.4.226:80/\": dial tcp 10.64.4.226:80: connect: connection refused"
May 29 21:32:09.715792 bootstrap-e2e-minion-group-3h2k kubelet[8827]: I0529 21:32:09.715496    8827 event.go:389] "Event occurred" object="kubectl-6332/httpd" fieldPath="spec.containers{httpd}" kind="Pod" apiVersion="v1" type="Warning" reason="Unhealthy" message="Readiness probe failed: Get \"http://10.64.4.226:80/\": dial tcp 10.64.4.226:80: connect: connection refused"
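The timeline supports this: the force delete is issued at 21:32:09.341287 and the kubelet logs the probe failure at 21:32:09.715160, so the Unhealthy event is teardown fallout rather than a cause. A quick ordering check, with the timestamps copied from the logs above:

```shell
# Timestamps copied from the logs above; a lexicographic sort works because
# both are zero-padded HH:MM:SS.ffffff. The force delete precedes the
# readiness-probe failure, so the Unhealthy event is expected teardown
# noise rather than the root cause.
delete_ts="21:32:09.341287"  # kubectl delete --grace-period=0 --force issued
probe_ts="21:32:09.715160"   # "Probe failed" logged by the kubelet
first=$(printf '%s\n%s\n' "$delete_ts" "$probe_ts" | sort | head -n 1)
echo "$first"   # -> 21:32:09.341287 (the delete)
```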


aojea commented on September 27, 2024

@AnishShah good catch, so forget about the probes, we need to understand why the connection is reset when trying to create the tunnel to the pod


elmiko commented on September 27, 2024

@neolit123 , we are reviewing this bug in the sig cloud-provider meeting today, but it's not clear what the cloud-provider (or ccm) specific issue is. is there something we are missing?


neolit123 commented on September 27, 2024

@elmiko the job uses kube-up, which still lives in k/k and is owned by the GCP provider:
https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/OWNERS#L26-L27

kube-up used to be owned by SIG CL but was "donated" to SIG CP / GCP provider.

the job itself is also owned by SIG CP ATM, see:
https://github.com/kubernetes/test-infra/blob/6725736ff16b7bc7cb16af9c69040537618166a1/config/jobs/kubernetes/sig-cloud-provider/gcp/gcp-gce.yaml#L801

that said, the cause of the failure could be in the kubelet and on the SIG Node side, however i doubt that.


elmiko commented on September 27, 2024

@neolit123 ack, thank you for the extra context. do you think this is an issue with the gcp ccm or more the kube-up script?

i'm trying to understand how we can help from the sig, or if there is something we can do. i personally don't have deep experience with the gcp test suite, so i'm wondering if we need someone from the gcp cloud provider maintainers to investigate further?


neolit123 commented on September 27, 2024

there is another type of flake for this job which is:

e2e.go: diffResources	0s
{  Error: 2 leaked resources
+NAME                           MACHINE_TYPE   PREEMPTIBLE  CREATION_TIMESTAMP
+bootstrap-e2e-minion-template  e2-standard-2               2024-06-04T16:56:53.392-07:00}

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1798140787930697728

that's GCE specific.


elmiko commented on September 27, 2024

looks like folks on this thread are already debugging, but if you know someone from GCP, please cc them.

cc @cheftako @andrewsykim , if you have a chance, would love your thoughts


AnishShah commented on September 27, 2024

Notes from sig-node CI meeting:

  • This particular subtest is green in the testgrid, so we can consider this as not blocking for the 1.31 alpha release.


AnishShah commented on September 27, 2024

we need to understand why the connection is reset when trying to create the tunnel to the pod

I'm trying to check the konnectivity-agent logs, but I cannot find them; the pods do seem to be in Ready state when the subtest failed, though.


dims commented on September 27, 2024

One additional data point: the same test(s) are working fine in the ec2 variant of the CI job:
https://testgrid.k8s.io/amazon-ec2#ec2-ubuntu-master-containerd&width=20&include-filter-by-regex=Simple%20pod%20should%20return%20command%20exit


Vyom-Yadav commented on September 27, 2024

Hey Folks,
This test is still flaking, recent failures:

5/30/2024, 2:41:34 AM ci-kubernetes-e2e-ubuntu-gce-containerd
5/26/2024, 7:42:14 AM ci-kubernetes-e2e-gci-gce-network-proxy-grpc
5/22/2024, 8:30:21 PM ci-kubernetes-e2e-gci-gce-alpha-enabled-default


aojea commented on September 27, 2024

we need to understand why the connection is reset when trying to create the tunnel to the pod

I'm trying to check konnectivity-agent logs. But I cannot find logs of konnectivity-agent but the pods seem to be in Ready state when the subtest failed.

the konnectivity agent is indeed one of the most likely causes, @AnishShah, especially since dims says it is not failing on the AWS jobs, which AFAIK don't use it.

@AnishShah you can get the konnectivity server logs on the master VM (https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1795925976450863104/artifacts/bootstrap-e2e-master/konnectivity-server.log); we need to check the log dump script to get the agent logs too.


aojea commented on September 27, 2024

hmm, checking https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce-containerd/1795925976450863104

failure is

stderr:
E0529 21:32:09.334805   17304 v2.go:167] "Unhandled Error" err="next reader: read tcp 10.34.160.50:50460->34.83.180.141:443: read: connection reset by peer"
E0529 21:32:09.334808   17304 v2.go:129] "Unhandled Error" err="next reader: read tcp 10.34.160.50:50460->34.83.180.141:443: read: connection reset by peer"
E0529 21:32:09.334853   17304 v2.go:150] "Unhandled Error" err="next reader: read tcp 10.34.160.50:50460->34.83.180.141:443: read: connection reset by peer"
error: error reading from error stream: next reader: read tcp 10.34.160.50:50460->34.83.180.141:443: read: connection reset by peer

error:
exit status 1
In [It] at: k8s.io/kubernetes/test/e2e/framework/kubectl/builder.go:91 @ 05/29/24 21:32:09.34

pod events

I0529 21:32:10.323709 9867 dump.go:53] At 2024-05-29 21:31:54 +0000 UTC - event for httpd: {kubelet bootstrap-e2e-minion-group-3h2k} Started: Started container httpd
I0529 21:32:10.323716 9867 dump.go:53] At 2024-05-29 21:32:09 +0000 UTC - event for httpd: {kubelet bootstrap-e2e-minion-group-3h2k} Killing: Stopping container httpd
I0529 21:32:10.323722 9867 dump.go:53] At 2024-05-29 21:32:09 +0000 UTC - event for httpd: {kubelet bootstrap-e2e-minion-group-3h2k} Unhealthy: Readiness probe failed: Get "http://10.64.4.226:80/": dial tcp 10.64.4.226:80: connect: connection refused

kubelet logs

May 29 21:32:09.554745 bootstrap-e2e-minion-group-3h2k kubelet[8827]: I0529 21:32:09.554597 8827 kubelet.go:2441] "SyncLoop DELETE" source="api" pods=["kubectl-6332/httpd"]
May 29 21:32:09.554745 bootstrap-e2e-minion-group-3h2k kubelet[8827]: I0529 21:32:09.554645 8827 pod_workers.go:854] "Pod is marked for graceful deletion, begin teardown" pod="kubectl-6332/httpd" podUID="7d35af69-1f80-438b-9ce6-a657e3fc9769" updateType="update"

something is deleting the pod under test; that is why the test fails. cc @bobbypage @SergeyKanzhelev
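A sketch of the triage step this implies: grep the kubelet log for the `SyncLoop DELETE` entry; `source="api"` means the deletion came through the apiserver (some client requested it) rather than the kubelet evicting the pod on its own. Self-contained here against abridged copies of the two kubelet lines above (the scratch path is illustrative):

```shell
# Write abridged copies of the kubelet lines to a scratch file and grep
# for the DELETE event. source="api" shows the deletion arrived via the
# apiserver, i.e. a client asked for it, not a kubelet-side eviction.
cat > /tmp/kubelet-demo.log <<'EOF'
I0529 21:32:09.554597 8827 kubelet.go:2441] "SyncLoop DELETE" source="api" pods=["kubectl-6332/httpd"]
I0529 21:32:09.554645 8827 pod_workers.go:854] "Pod is marked for graceful deletion, begin teardown" pod="kubectl-6332/httpd"
EOF
grep 'SyncLoop DELETE' /tmp/kubelet-demo.log   # prints the DELETE line
```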


elmiko commented on September 27, 2024

/assign @cheftako

/triage accepted


dims commented on September 27, 2024

xref: #126192


drewhagen commented on September 27, 2024

Hey folks! The release cycle for 1.32 starts today. Since this is still open, I will carry it over to the latest milestone.

It looks like this test is still flaking:

cc: @dims @aojea


k8s-ci-robot commented on September 27, 2024

@dims: Closing this issue.

In response to this:

[image]

let's reopen if needed.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

