Comments (10)
I also encountered this issue few days ago, will check again today.
from go-marathon.
I don't know exactly why the decision to consider tasks with missing health checks as healthy was made. It took place along this commit introduced via PR #18, so hopefully @iderdik remembers.
My guess is that it's related to mesosphere/marathon#1270. If that's the case, then I'd opt for dropping the logic in question as the issue got fixed in 0.8.1 RC2 (read: a long long time ago).
If that's not the case, then my suggestion would be to change the logic such that we expose a ternary state to the caller: health state is good, bad, or unknown.
from go-marathon.
Thank you for the insight, timoreimann. I realise that I was maybe a bit imprecise with my description though.
The code from 1a00a84 that you reference, is this one:
//When a task is flapping in Marathon, this is sometimes nil
if check == nil {
return false, nil
}
This is not causing the false positives (unknown == ok) I'm talking about. It's rather the condition on line 372 in combination with the final line of the function:
if task.HealthCheckResult != nil { ... }
...
return true, nil
This causes the function to return true
when there are health checks defined but no health check results known (yet). This was the behaviour of ApplicationOk
form the start, as added in 1efcfa4.
The code from 1a00a84 you refer to above is just a recheck if task.HealthCheckResult
has maybe turned nil
between lines 372 and 375. I have no idea if this extra check is made obsolete by an upstream bugfix, but it does not do any harm either.
However, both solutions you propose make sense to me. I could do pull request with a proposed solution, but first I would want to come to an agreement as to which of the two fixes should be implemented.
Can a user do anything useful with the information that the health check result in not known yet? Maybe?
Also a factor, how concerned are you about API backwards compatibility?
from go-marathon.
@oolongbrothers, major apologies for going silent on this one. I can only suspect that I forgot about the issue over last year's Christmas holidays.
I noticed the "offending" nil check has been refactored away by now. Is the existing go-marathon behavior still causing troubles for you? In case it does, I'd be happy to follow you through any required changes -- promised this time. :)
Thanks.
from go-marathon.
I'm pretty sure I'm running into this again as well. Several applications that never came up properly (and never passed healthchecks) were still deemed "ok" and allowed the deploy to complete. I can't replicate reliably but it seems to be happening often enough to indicate a real issue.
from go-marathon.
@timoreimann Sorry for missing your comment from December! PR #18 actually made a missing healthcheck indicate failure, not success. Perhaps this issue has resurfaced due to removal of that code?
from go-marathon.
@iderdik could be. Since several people have run into this problem by now, I'll try to find some time to dig deeper.
If anyone happens to have a reproducible use case (@jacob-koren maybe?), I'd be happy to hear. :-)
from go-marathon.
I think I found the root cause. Will report back later with a thorough analysis and some data points. :-)
from go-marathon.
curl
ing the Marathon API endpoint continuously while deploying a container, I can see that the healthChecks
and tasks
properties get populated progressively. Following are their respective values at various distinct steps:
Step 1: both properties yet missing
- healthChecks:
<nothing>
- tasks:
<nothing>
Step 2: values turn null
- healthChecks:
null
- tasks:
null
Step 3: both values available with the task not started yet and healthCheckResults
missing
- healthChecks:
[
{
"path": "/health",
"protocol": "HTTP",
"portIndex": 0,
"gracePeriodSeconds": 3,
"intervalSeconds": 5,
"timeoutSeconds": 5,
"maxConsecutiveFailures": 2,
"ignoreHttp1xx": false
}
]
- tasks:
[
{
"id": "app.0bc5c4d4-4f5a-11e6-9956-461ac47997d0",
"host": "192.168.99.100",
"ipAddresses": [],
"ports": [
31198
],
"startedAt": null,
"stagedAt": "2016-07-21T15:44:44.514Z",
"version": "2016-07-21T15:44:43.908Z",
"slaveId": "ec3708b4-ea0d-4437-b402-d571b212e6ad-S0",
"appId": "/app"
}
]
Step 4: task started (startedAt
!= null) but healthCheckResults
still missing
- healthChecks:
[
{
"path": "/health",
"protocol": "HTTP",
"portIndex": 0,
"gracePeriodSeconds": 3,
"intervalSeconds": 5,
"timeoutSeconds": 5,
"maxConsecutiveFailures": 2,
"ignoreHttp1xx": false
}
]
- tasks:
[
{
"id": "app.0bc5c4d4-4f5a-11e6-9956-461ac47997d0",
"host": "192.168.99.100",
"ipAddresses": [],
"ports": [
31198
],
"startedAt": "2016-07-21T15:44:45.463Z",
"stagedAt": "2016-07-21T15:44:44.514Z",
"version": "2016-07-21T15:44:43.908Z",
"slaveId": "ec3708b4-ea0d-4437-b402-d571b212e6ad-S0",
"appId": "/app"
}
]
Step 5: healthCheckResults
finally becomes available
- healthChecks:
[
{
"path": "/health",
"protocol": "HTTP",
"portIndex": 0,
"gracePeriodSeconds": 3,
"intervalSeconds": 5,
"timeoutSeconds": 5,
"maxConsecutiveFailures": 2,
"ignoreHttp1xx": false
}
]
- tasks:
[
{
"id": "app.0bc5c4d4-4f5a-11e6-9956-461ac47997d0",
"host": "192.168.99.100",
"ipAddresses": [],
"ports": [
31198
],
"startedAt": "2016-07-21T15:44:45.463Z",
"stagedAt": "2016-07-21T15:44:44.514Z",
"version": "2016-07-21T15:44:43.908Z",
"slaveId": "ec3708b4-ea0d-4437-b402-d571b212e6ad-S0",
"appId": "/app",
"healthCheckResults": [
{
"alive": true,
"consecutiveFailures": 0,
"firstSuccess": "2016-07-21T15:44:49.573Z",
"lastFailure": null,
"lastSuccess": "2016-07-21T15:44:49.573Z",
"taskId": "app.0bc5c4d4-4f5a-11e6-9956-461ac47997d0"
}
]
}
]
If we execute ApplicationOK
during either step 3 or 4, we are already pass the first three checks and subsequently try to iterate through all task's health check results. The latter yet missing in the JSON payload, we short-circuit and fall through to the return statement (as pointed out by @oolongbrothers originally).
Given that we already know about the existence of custom health checks at the time we start iterating (because the related check would have returned otherwise), I believe the right thing to do here is to add another check for task.HealthCheckResults
being nil/empty and return false if that's the case.
I'll do some further testing to confirm my findings. In the meantime, I'd appreciate if someone could take a second verifying look. Thanks!
/cc @gambol99
from go-marathon.
The issue should hopefully be fixed now. If you experience the problem again with the latest version of go-marathon, please post to this ticket again and / or reopen it.
Thanks!
from go-marathon.
Related Issues (20)
- TestRegisterSEESubscriptionReconnectsStreamOnError fails on windows HOT 7
- Example panics on Marathon 1.4
- Application API should support `embed` arguments HOT 2
- Missing Version API HOT 1
- latest master fails to compile HOT 3
- Marathon hosts marked as down never become healthy with Marathon 1.4.4 HOT 3
- Http client configuration no longer honored for event stream connections HOT 5
- When a member is marked down, we ping Marathon to death HOT 3
- Marathon 1.5 API Support HOT 26
- lastTaskFailure is always nil HOT 1
- example in readme doesn't compile HOT 2
- Issues regarding to SSE events HOT 5
- Remove dependency on plan field
- Client should be updated to follow /v2/events redirection responses
- Can wipe task support?
- Closing SSE Subscriptions
- Support for fetching files from task sandbox HOT 3
- MESOS Container Type Support
- Support CSI Volume definition
- Consider making a new release?
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from go-marathon.