Comments (24)

axw avatar axw commented on August 16, 2024 1

I had reproduced the error locally yesterday (or at least something that looked the same), but had to switch focus before I could find the root cause. Now I can't reproduce it :(

One thing I did notice was in the collector logs there were errors about not being able to connect to kind-control-plane. Perhaps the e2e workflow should capture the pod logs before tearing down, to make debugging easier.
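
For illustration, here is a minimal sketch of what capturing the pod logs before teardown could look like from the Go e2e helpers, using client-go. The function name, the use of the KUBECONFIG environment variable, and the "default" namespace are assumptions for illustration, not the actual workflow change:

package main

import (
	"context"
	"fmt"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// dumpPodLogs prints the logs of every pod in the namespace. It is meant to run right
// before the kind cluster (or the test namespace) is torn down, so failures still leave
// the collector/telemetrygen output in the CI logs.
func dumpPodLogs(ctx context.Context, kubeconfig, namespace string) error {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return fmt.Errorf("loading kubeconfig: %w", err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return fmt.Errorf("creating clientset: %w", err)
	}
	pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return fmt.Errorf("listing pods: %w", err)
	}
	for _, pod := range pods.Items {
		fmt.Printf("==== logs for pod %s/%s ====\n", namespace, pod.Name)
		stream, err := clientset.CoreV1().Pods(namespace).GetLogs(pod.Name, &corev1.PodLogOptions{}).Stream(ctx)
		if err != nil {
			fmt.Printf("could not stream logs: %v\n", err)
			continue
		}
		_, _ = io.Copy(os.Stdout, stream)
		stream.Close()
	}
	return nil
}

func main() {
	if err := dumpPodLogs(context.Background(), os.Getenv("KUBECONFIG"), "default"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}

In CI this could run as a final step of the e2e job (or as a deferred teardown hook in the tests) so the logs land in the action output even when a test fails.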

fatsheep9146 avatar fatsheep9146 commented on August 16, 2024 1

I had reproduced the error locally yesterday (or at least something that looked the same), but had to switch focus before I could find the root cause. Now I can't reproduce it :(

One thing I did notice was in the collector logs there were errors about not being able to connect to kind-control-plane. Perhaps the e2e workflow should capture the pod logs before tearing down, to make debugging easier.

I also couldn't reproduce the same error as the GitHub Action shows, which is really weird. But your advice to capture the pod logs (for both the collector and telemetrygen) in the workflow to help with debugging is a good one. @axw

ChrsMark avatar ChrsMark commented on August 16, 2024 1

Not sure if there is another way to get access to the Pods' logs, but I tried something dirty to capture them: #33538.
Let's see if this can give us some insights here.

fatsheep9146 avatar fatsheep9146 commented on August 16, 2024 1

@ChrsMark In your latest PR the hostEndpoint is empty; I think this is the root cause:
https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9498361987/job/26177138997?pr=33538

ChrsMark avatar ChrsMark commented on August 16, 2024 1

Sounds possible @fatsheep9146, I will try to upgrade docker on my machine to 26.x.x as well and see if I can reproduce it.

Update:

I was able to reproduce this locally with docker 26.1.4 (ubuntu machine).
Collector Pod logs:

2024-06-13T12:42:55.052Z	warn	zapgrpc/zapgrpc.go:193	[core] [Channel #2 SubChannel #8]grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused"	{"grpc_log": true}
2024-06-13T12:42:55.052Z	warn	zapgrpc/zapgrpc.go:193	[core] [Channel #2 SubChannel #8]grpc: addrConn.createTransport failed to connect to {Addr: "[::1]:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp [::1]:4317: connect: connection refused"	{"grpc_log": true}
2024-06-13T12:42:55.316Z	warn	zapgrpc/zapgrpc.go:193	[core] [Channel #1 SubChannel #9]grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused"	{"grpc_log": true}
2024-06-13T12:42:55.316Z	warn	zapgrpc/zapgrpc.go:193	[core] [Channel #1 SubChannel #9]grpc: addrConn.createTransport failed to connect to {Addr: "[::1]:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp [::1]:4317: connect: connection refused"	{"grpc_log": true}
2024-06-13T12:42:57.265Z	warn	zapgrpc/zapgrpc.go:193	[core] [Channel #4 SubChannel #11]grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused"	{"grpc_log": true}
2024-06-13T12:42:57.265Z	warn	zapgrpc/zapgrpc.go:193	[core] [Channel #4 SubChannel #11]grpc: addrConn.createTransport failed to connect to {Addr: "[::1]:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp [::1]:4317: connect: connection refused"	{"grpc_log": true}

fatsheep9146 avatar fatsheep9146 commented on August 16, 2024 1

@ChrsMark I'm trying to update the Docker SDK version to see if it fixes the problem.

ChrsMark avatar ChrsMark commented on August 16, 2024 1

@fatsheep9146 thanks! FYI, while debugging this I spotted that

network, err := client.NetworkInspect(ctx, "kind", types.NetworkInspectOptions{})
is failing with a "context deadline exceeded" error, but the weird thing is that this error is for some reason "muted".

Hopefully the lib upgrade can solve this.
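
For reference, here is a minimal sketch built around the same NetworkInspect call quoted above, showing how the host endpoint can be resolved without muting that error: on macOS Docker Desktop exposes the host as host.docker.internal, while on Linux the gateway of the "kind" Docker network is used. The function name and the 10-second timeout are illustrative assumptions, not the exact k8stest helper:

package main

import (
	"context"
	"fmt"
	"runtime"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func hostEndpoint(ctx context.Context) (string, error) {
	// Docker Desktop on macOS exposes the host via a well-known alias.
	if runtime.GOOS == "darwin" {
		return "host.docker.internal", nil
	}
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		return "", fmt.Errorf("creating docker client: %w", err)
	}
	defer cli.Close()

	// Bound the call so a hung daemon surfaces as a clear error instead of being swallowed.
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	// Note: newer Docker SDK releases move this options type to the api/types/network package.
	network, err := cli.NetworkInspect(ctx, "kind", types.NetworkInspectOptions{})
	if err != nil {
		return "", fmt.Errorf("inspecting kind network: %w", err)
	}
	// The host is reachable from inside kind at the Docker network's gateway address.
	for _, cfg := range network.IPAM.Config {
		if cfg.Gateway != "" {
			return cfg.Gateway, nil
		}
	}
	return "", fmt.Errorf("no gateway found on the kind network")
}

func main() {
	ep, err := hostEndpoint(context.Background())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("host endpoint:", ep)
}

Returning the wrapped error instead of swallowing it would make an empty hostEndpoint show up immediately in CI, rather than as the downstream "connection refused" noise in the collector logs.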

fatsheep9146 avatar fatsheep9146 commented on August 16, 2024 1

I had a successful run at #33548. I'm going to enable the rest of the tests and check again.
@ChrsMark

Yes, I found that updating the Docker SDK library is blocked for a few reasons:
#32614
#31989

So I'm also trying another way to get the right host endpoint:
https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9501492668/job/26187569925?pr=33542

I think we can try both ways and get more opinions from others.

fatsheep9146 avatar fatsheep9146 commented on August 16, 2024 1

I hit an additional error at the k8scluster receiver. It seems that some image names have also changed: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9501435003/job/26187307997?pr=33548#step:11:35

potential fix: c87a639

I think this may be due to the newer version of kind.

TylerHelmuth avatar TylerHelmuth commented on August 16, 2024

I have been unable to reproduce the issues locally, and reverting #33415 did not help (according to the CI jobs on main, that was the first commit where things started to flake).

Looking at the workflow, it looks like all versions are pinned, so I don't think we suddenly started using a new action, kind version, etc.

github-actions avatar github-actions commented on August 16, 2024

Pinging code owners for internal/k8stest: @crobert-1. See Adding Labels via Comments if you do not have permissions to add labels yourself.

TylerHelmuth avatar TylerHelmuth commented on August 16, 2024

@jinja2 @fatsheep9146 any guesses?

github-actions avatar github-actions commented on August 16, 2024

Pinging code owners for receiver/k8sobjects: @dmitryax @hvaghani221 @TylerHelmuth. See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions avatar github-actions commented on August 16, 2024

Pinging code owners for processor/k8sattributes: @dmitryax @rmfitzpatrick @fatsheep9146 @TylerHelmuth. See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions avatar github-actions commented on August 16, 2024

Pinging code owners for receiver/k8scluster: @dmitryax @TylerHelmuth @povilasv. See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions avatar github-actions commented on August 16, 2024

Pinging code owners for receiver/kubeletstats: @dmitryax @TylerHelmuth. See Adding Labels via Comments if you do not have permissions to add labels yourself.

ChrsMark avatar ChrsMark commented on August 16, 2024

Got some interesting "connection refused" errors: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9497224255/job/26173693278?pr=33538#step:11:225

2024-06-13T09:44:56.953Z	info	exporterhelper/retry_sender.go:118	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "traces", "name": "otlp", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused\"", "interval": "7.546970563s"}
2024-06-13T09:44:57.064Z	info	exporterhelper/retry_sender.go:118	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "otlp", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused\"", "interval": "7.612004411s"}
2024-06-13T09:44:57.486Z	info	exporterhelper/retry_sender.go:118	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "traces", "name": "otlp", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused\"", "interval": "6.403460654s"}

fatsheep9146 avatar fatsheep9146 commented on August 16, 2024

@ChrsMark
It seems that the logic for getting the hostEndpoint is the root cause, and this logic differs between macOS and Linux.

fatsheep9146 avatar fatsheep9146 commented on August 16, 2024

I suspect the reason is https://github.com/actions/runner-images/pull/10039/files.
The ubuntu-latest image we use in GitHub Actions was updated with a new version of Docker.

ChrsMark avatar ChrsMark commented on August 16, 2024

I had a successful run at #33548. I'm going to enable the rest of the tests and check again.

ChrsMark avatar ChrsMark commented on August 16, 2024

I hit an additional error at the k8scluster receiver. It seems that some image names have also changed: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9501435003/job/26187307997?pr=33548#step:11:35

potential fix: c87a639

ChrsMark avatar ChrsMark commented on August 16, 2024

@fatsheep9146 e2e tests passed at #33548. I'm opening that one for review since it offers a fix anyway. I'll be out tomorrow (Friday), so feel free to pick up the gateway check and proceed with yours if people find that approach more suitable. I'm fine either way, as long as we solve the issue :).

crobert-1 avatar crobert-1 commented on August 16, 2024

Resolved by #33548

crobert-1 avatar crobert-1 commented on August 16, 2024

Thanks for addressing and fixing so quickly @ChrsMark and @fatsheep9146!
