Readiness Checklist
Expected Behavior
If a move plan cannot be computed, we should know that and bubble the appropriate error up rather than assume it was calculated. An example is outlined below.
Current Behavior
I tried updating to the latest cloud-sdk-go and ecctl versions locally and received the exact same stack trace as with older versions (both are on v1.8.0), so I don't perceive this as a new problem; perhaps a backport should be considered when we fix it. Here's the crux of the problem:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1d3103d]
goroutine 90 [running]:
github.com/elastic/cloud-sdk-go/pkg/api/platformapi/allocatorapi.ComputeVacateRequest(0xc000320280, 0xc000799cd8, 0x1, 0x1, 0x2d18240, 0x0, 0x0, 0x0, 0xc0001297cb, 0xc0004fa121, ...)
/Users/stephenschmidt/go/pkg/mod/github.com/elastic/[email protected]/pkg/api/platformapi/allocatorapi/vacate.go:708 +0xc1d
github.com/elastic/cloud-sdk-go/pkg/api/platformapi/allocatorapi.newMoveClusterParams(0xc000497c70, 0xc0003d4920, 0x1, 0x1)
/Users/stephenschmidt/go/pkg/mod/github.com/elastic/[email protected]/pkg/api/platformapi/allocatorapi/vacate.go:404 +0x3a7
github.com/elastic/cloud-sdk-go/pkg/api/platformapi/allocatorapi.moveClusterByType(0xc000497c70, 0xc000497c70, 0x0)
/Users/stephenschmidt/go/pkg/mod/github.com/elastic/[email protected]/pkg/api/platformapi/allocatorapi/vacate.go:441 +0x45
github.com/elastic/cloud-sdk-go/pkg/api/platformapi/allocatorapi.VacateCluster(0xc000497c70, 0xc000488fa8, 0x135bf26)
/Users/stephenschmidt/go/pkg/mod/github.com/elastic/[email protected]/pkg/api/platformapi/allocatorapi/vacate.go:310 +0x77
github.com/elastic/cloud-sdk-go/pkg/api/platformapi/allocatorapi.VacateClusterInPool(0x22de040, 0xc000497c70, 0xc00005f501, 0x0)
/Users/stephenschmidt/go/pkg/mod/github.com/elastic/[email protected]/pkg/api/platformapi/allocatorapi/vacate.go:297 +0x53
github.com/elastic/cloud-sdk-go/pkg/sync/pool.Work.func2(0xc00010e180, 0xc00058e420, 0xc00058e600, 0xc00058e840, 0xc00058e960, 0xc00010e1e0, 0x21da570, 0x6fc23ac00, 0x22de040, 0xc000497c70, ...)
/Users/stephenschmidt/go/pkg/mod/github.com/elastic/[email protected]/pkg/sync/pool/worker.go:82 +0x44
created by github.com/elastic/cloud-sdk-go/pkg/sync/pool.Work
/Users/stephenschmidt/go/pkg/mod/github.com/elastic/[email protected]/pkg/sync/pool/worker.go:81 +0x21d
"Debugging" locally:
Computation complete payload: &{Failures:0xc0004f0080 Moves:0xc0004f0100}
Computation complete payload error: &{ApmClusters:[] AppsearchClusters:[] ElasticsearchClusters:[] EnterpriseSearchClusters:[] KibanaClusters:[]}
Computation complete payload moves: <nil>
We think this is ent search, plan config: <nil>
panic: runtime error: invalid memory address or nil pointer dereference
The problem is indeed that the plan computation is returning nil and we don't bother to check for nil values here: https://github.com/elastic/cloud-sdk-go/blob/master/pkg/api/platformapi/allocatorapi/vacate.go#L708, and more specifically on this field:
https://github.com/elastic/cloud-sdk-go/blob/master/pkg/models/move_clusters_details.go#L54 -> https://github.com/elastic/cloud-sdk-go/blob/master/pkg/models/move_enterprise_search_details.go#L41
As a matter of defensive programming, we should check that we have a calculated plan for everything rather than just assume we'll get one. My recommendation is that the check should be applied to each stack component and to each cluster in its corresponding slice, bubbling the error up for a nil calculated plan here: https://github.com/elastic/cloud-sdk-go/blob/master/pkg/api/platformapi/allocatorapi/vacate.go#L633-L718.
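To make the failure mode concrete, here's a minimal, self-contained Go sketch of the pattern; the struct below is a simplified stand-in for the linked MoveEnterpriseSearchDetails model, not the library's actual definition:

```go
package main

import "fmt"

// Simplified stand-in for the linked model; the real definition lives in
// pkg/models/move_enterprise_search_details.go.
type moveEnterpriseSearchDetails struct {
	ClusterID      *string
	CalculatedPlan *struct{ MoveOnly *bool } // nil when no plan could be computed
}

func main() {
	// When the plan computation fails upstream, CalculatedPlan comes back
	// nil, mirroring "plan config: <nil>" in the debug output above.
	detail := moveEnterpriseSearchDetails{CalculatedPlan: nil}

	// Dereferencing the field without a nil check, which is effectively
	// what the vacate request builder does, reproduces the SIGSEGV:
	fmt.Println(detail.CalculatedPlan.MoveOnly) // panic: nil pointer dereference
}
```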
At the end of the day it's a chicken-and-egg issue: if Elasticsearch is in a really bad state, a dependent component like Enterprise Search can't be checked, and thus a plan can't be calculated. Here's an example:
{
  "cluster_type": "elasticsearch",
  "details": "Could not make sure [ElasticsearchCluster(<redacted>)] is up and running",
  "caused_by": "no.found.constructor.plan.apm.ClusterNotReachable: Unexpected response [401 Unauthorized, {\"error\":{\"root_cause\":[{\"type\":\"security_exception\",\"reason\":\"unable to authenticate user [cloud-internal-enterprise_search-server] for REST request [/]\",\"header\":{\"WWW-Authenticate\":[\"Basic realm=\\\"security\\\" charset=\\\"UTF-8\\\"\",\"Bearer realm=\\\"security\\\"\",\"ApiKey\"]}}],\"type\":\"security_exception\",\"reason\":\"unable to authenticate user [cloud-internal-enterprise_search-server] for REST request [/]\",\"header\":{\"WWW-Authenticate\":[\"Basic realm=\\\"security\\\" charset=\\\"UTF-8\\\"\",\"Bearer realm=\\\"security\\\"\",\"ApiKey\"]}},\"status\":401}]"
}
Possible Solution
Noted above, but repeated here to make parsing easier:
As a matter of defensive programming, we should check that we have a calculated plan for everything rather than just assume we'll get one. My recommendation is that the check should be applied to each stack component and to each cluster in its corresponding slice, bubbling the error up for a nil calculated plan here: https://github.com/elastic/cloud-sdk-go/blob/master/pkg/api/platformapi/allocatorapi/vacate.go#L633-L718.
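A minimal sketch of what that check could look like, reusing the same simplified stand-in struct as above; the helper name validateCalculatedPlans is hypothetical, not part of the SDK:

```go
package main

import "fmt"

// Simplified stand-in for the linked model, as in the earlier sketch.
type moveEnterpriseSearchDetails struct {
	ClusterID      *string
	CalculatedPlan *struct{ MoveOnly *bool }
}

// validateCalculatedPlans is a hypothetical helper sketching the proposed
// fix: walk every cluster in the slice and bubble an error up for a nil
// calculated plan instead of letting a later dereference panic. The same
// check would be repeated for each stack component's slice (APM, Kibana,
// App Search, Elasticsearch, Enterprise Search).
func validateCalculatedPlans(kind string, clusters []*moveEnterpriseSearchDetails) error {
	for _, c := range clusters {
		if c == nil || c.CalculatedPlan != nil {
			continue
		}
		id := "<unknown>"
		if c.ClusterID != nil {
			id = *c.ClusterID
		}
		return fmt.Errorf("vacate: %s cluster %s has no calculated plan; cannot build move request", kind, id)
	}
	return nil
}

func main() {
	id := "some-cluster-id"
	clusters := []*moveEnterpriseSearchDetails{{ClusterID: &id}}
	// A caller now gets a descriptive error instead of a panic:
	if err := validateCalculatedPlans("enterprise_search", clusters); err != nil {
		fmt.Println(err)
	}
}
```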
I also don't mind working on this in the near future, but if someone wants to tackle it within the next few weeks, feel free.
Steps to Reproduce
Get Elasticsearch into a really bad state with another stack component running alongside it, then attempt to vacate that stack component.
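For example (an assumed concrete path to that state, not the only one): break authentication for an internal stack user, as in the 401 security_exception above, then run a vacate along the lines of `ecctl platform allocator vacate <allocator-id>`; plan computation for the co-located component fails and the client panics as shown.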
Context
I think enough was provided above :).
Your Environment
- Version used: v1.8.0 for all the things
- Environment name and version (e.g. Go 1.9): Go 1.16