Git Product home page Git Product logo

Comments (10)

 avatar commented on May 30, 2024 1

I also encountered this issue few days ago, will check again today.

from go-marathon.

timoreimann avatar timoreimann commented on May 30, 2024

I don't know exactly why the decision to consider tasks with missing health checks as healthy was made. It took place along this commit introduced via PR #18, so hopefully @iderdik remembers.

My guess is that it's related to mesosphere/marathon#1270. If that's the case, then I'd opt for dropping the logic in question as the issue got fixed in 0.8.1 RC2 (read: a long long time ago).

If that's not the case, then my suggestion would be to change the logic such that we expose a ternary state to the caller: health state is good, bad, or unknown.

from go-marathon.

oolongbrothers avatar oolongbrothers commented on May 30, 2024

Thank you for the insight, timoreimann. I realise that I was maybe a bit imprecise with my description though.

The code from 1a00a84 that you reference, is this one:

                //When a task is flapping in Marathon, this is sometimes nil
                if check == nil {
                    return false, nil
                }

This is not causing the false positives (unknown == ok) I'm talking about. It's rather the condition on line 372 in combination with the final line of the function:

if task.HealthCheckResult != nil { ... }
...
return true, nil

This causes the function to return true when there are health checks defined but no health check results known (yet). This was the behaviour of ApplicationOk form the start, as added in 1efcfa4.

The code from 1a00a84 you refer to above is just a recheck if task.HealthCheckResult has maybe turned nil between lines 372 and 375. I have no idea if this extra check is made obsolete by an upstream bugfix, but it does not do any harm either.

However, both solutions you propose make sense to me. I could do pull request with a proposed solution, but first I would want to come to an agreement as to which of the two fixes should be implemented.
Can a user do anything useful with the information that the health check result in not known yet? Maybe?
Also a factor, how concerned are you about API backwards compatibility?

from go-marathon.

timoreimann avatar timoreimann commented on May 30, 2024

@oolongbrothers, major apologies for going silent on this one. I can only suspect that I forgot about the issue over last year's Christmas holidays.

I noticed the "offending" nil check has been refactored away by now. Is the existing go-marathon behavior still causing troubles for you? In case it does, I'd be happy to follow you through any required changes -- promised this time. :)

Thanks.

from go-marathon.

iderdik avatar iderdik commented on May 30, 2024

I'm pretty sure I'm running into this again as well. Several applications that never came up properly (and never passed healthchecks) were still deemed "ok" and allowed the deploy to complete. I can't replicate reliably but it seems to be happening often enough to indicate a real issue.

from go-marathon.

iderdik avatar iderdik commented on May 30, 2024

@timoreimann Sorry for missing your comment from December! PR #18 actually made a missing healthcheck indicate failure, not success. Perhaps this issue has resurfaced due to removal of that code?

from go-marathon.

timoreimann avatar timoreimann commented on May 30, 2024

@iderdik could be. Since several people have run into this problem by now, I'll try to find some time to dig deeper.

If anyone happens to have a reproducible use case (@jacob-koren maybe?), I'd be happy to hear. :-)

from go-marathon.

timoreimann avatar timoreimann commented on May 30, 2024

I think I found the root cause. Will report back later with a thorough analysis and some data points. :-)

from go-marathon.

timoreimann avatar timoreimann commented on May 30, 2024

curling the Marathon API endpoint continuously while deploying a container, I can see that the healthChecks and tasks properties get populated progressively. Following are their respective values at various distinct steps:

Step 1: both properties yet missing

  • healthChecks:
<nothing>
  • tasks:
<nothing>

Step 2: values turn null

  • healthChecks:
null
  • tasks:
null

Step 3: both values available with the task not started yet and healthCheckResults missing

  • healthChecks:
[
  {
    "path": "/health",
    "protocol": "HTTP",
    "portIndex": 0,
    "gracePeriodSeconds": 3,
    "intervalSeconds": 5,
    "timeoutSeconds": 5,
    "maxConsecutiveFailures": 2,
    "ignoreHttp1xx": false
  }
]
  • tasks:
[
  {
    "id": "app.0bc5c4d4-4f5a-11e6-9956-461ac47997d0",
    "host": "192.168.99.100",
    "ipAddresses": [],
    "ports": [
      31198
    ],
    "startedAt": null,
    "stagedAt": "2016-07-21T15:44:44.514Z",
    "version": "2016-07-21T15:44:43.908Z",
    "slaveId": "ec3708b4-ea0d-4437-b402-d571b212e6ad-S0",
    "appId": "/app"
  }
]

Step 4: task started (startedAt != null) but healthCheckResults still missing

  • healthChecks:
[
  {
    "path": "/health",
    "protocol": "HTTP",
    "portIndex": 0,
    "gracePeriodSeconds": 3,
    "intervalSeconds": 5,
    "timeoutSeconds": 5,
    "maxConsecutiveFailures": 2,
    "ignoreHttp1xx": false
  }
]
  • tasks:
[
  {
    "id": "app.0bc5c4d4-4f5a-11e6-9956-461ac47997d0",
    "host": "192.168.99.100",
    "ipAddresses": [],
    "ports": [
      31198
    ],
    "startedAt": "2016-07-21T15:44:45.463Z",
    "stagedAt": "2016-07-21T15:44:44.514Z",
    "version": "2016-07-21T15:44:43.908Z",
    "slaveId": "ec3708b4-ea0d-4437-b402-d571b212e6ad-S0",
    "appId": "/app"
  }
]

Step 5: healthCheckResults finally becomes available

  • healthChecks:
[
  {
    "path": "/health",
    "protocol": "HTTP",
    "portIndex": 0,
    "gracePeriodSeconds": 3,
    "intervalSeconds": 5,
    "timeoutSeconds": 5,
    "maxConsecutiveFailures": 2,
    "ignoreHttp1xx": false
  }
]
  • tasks:
[
  {
    "id": "app.0bc5c4d4-4f5a-11e6-9956-461ac47997d0",
    "host": "192.168.99.100",
    "ipAddresses": [],
    "ports": [
      31198
    ],
    "startedAt": "2016-07-21T15:44:45.463Z",
    "stagedAt": "2016-07-21T15:44:44.514Z",
    "version": "2016-07-21T15:44:43.908Z",
    "slaveId": "ec3708b4-ea0d-4437-b402-d571b212e6ad-S0",
    "appId": "/app",
    "healthCheckResults": [
      {
        "alive": true,
        "consecutiveFailures": 0,
        "firstSuccess": "2016-07-21T15:44:49.573Z",
        "lastFailure": null,
        "lastSuccess": "2016-07-21T15:44:49.573Z",
        "taskId": "app.0bc5c4d4-4f5a-11e6-9956-461ac47997d0"
      }
    ]
  }
]

If we execute ApplicationOK during either step 3 or 4, we are already pass the first three checks and subsequently try to iterate through all task's health check results. The latter yet missing in the JSON payload, we short-circuit and fall through to the return statement (as pointed out by @oolongbrothers originally).

Given that we already know about the existence of custom health checks at the time we start iterating (because the related check would have returned otherwise), I believe the right thing to do here is to add another check for task.HealthCheckResults being nil/empty and return false if that's the case.

I'll do some further testing to confirm my findings. In the meantime, I'd appreciate if someone could take a second verifying look. Thanks!

/cc @gambol99

from go-marathon.

timoreimann avatar timoreimann commented on May 30, 2024

The issue should hopefully be fixed now. If you experience the problem again with the latest version of go-marathon, please post to this ticket again and / or reopen it.

Thanks!

from go-marathon.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.