Comments (10)

manugarg avatar manugarg commented on April 28, 2024

@Daxten Thanks for the report. Can you tell me a little bit more about your setup?

  • Are you running on GCE?
  • Can you access container logs?
  • Where are you writing your data?
  • Do symptoms recover on their own, i.e. do you start seeing results after a while without taking any action?
  • Can you share your config (you can scrub the internal details, of course).

from cloudprober.

Daxten avatar Daxten commented on April 28, 2024

Wow, thanks for coming back to me so fast!

  • We are running Rancher with Cattle; I think the setup boils down to "we are using a basic Docker container".
  • Yes, I can access the container logs. I will take a closer look the next time it happens, but STDOUT at least didn't show anything interesting last time. Is there anywhere else I should look?
  • We are using Prometheus to poll the data.
  • No, it does not recover, and our healthcheck service (port is open / returns a sane HTTP response code) doesn't mark it as failed (which would recreate the container).
  • Yes, I can share the config with you tomorrow.

manugarg avatar manugarg commented on April 28, 2024

I think I'll wait for config before commenting further. It does sound like a bug in cloudprober that is getting surfaced by something in your environment.

Also, are you running the latest cloudprober (that is, the latest cloudprober image, built from source) or the last release (0.9.3)?

Daxten avatar Daxten commented on April 28, 2024

Hi, I sent you the config (via mail).

With 0.9.3, I think the problem persisted until I restarted the container.

I switched to latest about a week ago, and with this version it seems to recover on its own after a few hours.

manugarg avatar manugarg commented on April 28, 2024

@Daxten I got the config, thanks! I'd certainly recommend using the latest cloudprober image instead of 0.9.3 -- there have been some bug fixes since that version.

I am not sure why cloudprober would stop working. Thinking aloud, here are a few possibilities:

  1. It's possible there is some bug in the prometheus surfacer. One way to rule this out is to look at the logs. By default, cloudprober logs all probe results. If the issue is in the surfacer, logs will continue to be generated even after data stops showing up on the prometheus handlers.

  2. It's possible that prometheus is rejecting the data for some reason. I remember that someone had an issue with their time not being synchronized, and hence prometheus rejecting the data thinking that it was old. We added an option to not include the timestamp in the prometheus output:

    optional bool include_timestamp = 2 [default = true];

(You should be able to rule this out using the logs as well.)

  3. It's possible that cloudprober's internal global resolver has run into some bug. To rule this out, you could look at the sysvars variables. Cloudprober exports some sysvars variables by default, for example: uptime_msec. Try to access that variable and see if it continues to generate data when other data disappears.

  4. There is also a possibility of some bug in HTTP probe code. That could also be ruled out using the above method.
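
As a rough sketch of the sysvars check described above (an illustration only, not cloudprober code), the Python snippet below compares two scrapes of the Prometheus text output and reports whether `uptime_msec` keeps advancing while probe data stalls. The probe metric names `total` and `success` and the helper names `parse_samples`, `advanced`, and `classify` are assumptions made for this example; adjust them to whatever your endpoint actually exports.

```python
import re

# Matches a sample line such as: success{probe="web"} 12 1524900000000
# Simplification: label values containing "}" are not handled.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)')


def parse_samples(text):
    """Sum the values of each metric name in Prometheus text-format output."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = _SAMPLE.match(line)
        if match:
            name, value = match.group(1), float(match.group(2))
            totals[name] = totals.get(name, 0.0) + value
    return totals


def advanced(prev, curr, name):
    """True if the metric is present now and has grown since the previous scrape."""
    return name in curr and curr[name] > prev.get(name, float("-inf"))


def classify(prev_text, curr_text, probe_metrics=("total", "success")):
    """Compare two scrapes taken some time apart (metric names are assumed)."""
    prev, curr = parse_samples(prev_text), parse_samples(curr_text)
    sysvars_alive = advanced(prev, curr, "uptime_msec")
    probes_alive = any(advanced(prev, curr, m) for m in probe_metrics)
    if sysvars_alive and probes_alive:
        return "ok"
    if sysvars_alive:
        return "probes_stalled"   # sysvars still flow, probe data does not
    if probes_alive:
        return "sysvars_stalled"
    return "all_stalled"          # nothing advances between the two scrapes
```

To use it, fetch the metrics output twice, a minute or so apart, and pass both response bodies to `classify`. A result of `probes_stalled` would suggest the export path is still working and the problem lies closer to the probes themselves; `all_stalled` would point back at the surfacer or the process as a whole.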

Also, you said you didn't see anything in logs. Can you try mapping /tmp as a volume - "-v /tmp:/tmp" and see if it generates any logs? I think cloudprober will try to log in /tmp if not running on GCE (on GCE logs go to stackdriver logging).

manugarg avatar manugarg commented on April 28, 2024

Regarding my last comment about logging, I verified that our docker image's default command line is set to log to stderr:

ENTRYPOINT ["/cloudprober", "--logtostderr"]

So unless you're overriding the docker image entrypoint, cloudprober should be logging to stderr rather than a file under /tmp.

Daxten avatar Daxten commented on April 28, 2024

Hey,
thanks for helping out so much. I'm sending you the log right now. I will create a checklist for the other points you mentioned and check them the next time it happens.

manugarg avatar manugarg commented on April 28, 2024

Hi @Daxten,

I got the logs. I also responded over email, but to close the loop here:

===
Looking at the logs, it seems that this is not probe specific as all probes stop outputting the data at around the same time. I'll try to add some more logging and profiling, and provide you with a different container image version. Sorry, it may require some more work.

Just to collect some more info -

  • You're not seeing any CPU/memory exhaustion related issues on the node/container?
  • Can you share your deployment environment with me? How is the Docker container run? Which volumes are mapped into the container? If there is a Kubernetes config, can you share that as well?
  • Which docker image are you running? I recently cut a release, 0.9.4. Maybe you can pin your deployment to that version so that we have something definitive to work with.
===

I improved the logging in the last couple of changes. Can you retry with the "latest" container?

manugarg avatar manugarg commented on April 28, 2024

@Daxten, I was wondering if you're still experiencing this issue. If not, can we close it?

Thanks,
Manu

manugarg avatar manugarg commented on April 28, 2024

Closing this due to inactivity. Please feel free to reopen if it's still a problem. I'll be more than happy to debug this with you. Cheers.
