google / cluster-insight
License: Apache License 2.0
Currently the /version API endpoint computes the result every time you access it.
This is wasteful because the result of /version is fixed for the duration of the execution of the server's binary. Moreover, checking the service's health by sampling /version may skew the results, because computing the result of /version is a multi-step process which may take a variable amount of time depending on the load.
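One way to avoid the repeated work is to memoize the result for the lifetime of the process. The following is a minimal sketch, assuming a hypothetical `compute_version()` helper that stands in for the multi-step computation; it is not the collector's actual function.

```python
import functools

# Records each real computation so the caching effect can be observed.
computations = []


def compute_version():
    """Hypothetical stand-in for the multi-step /version computation."""
    computations.append(1)
    return 'cluster-insight placeholder-version'


@functools.lru_cache(maxsize=None)
def get_version():
    """Computes the version string once; later calls are served from the cache."""
    return compute_version()
```

With this shape, the endpoint handler would call `get_version()` and the expensive path runs only on the first request.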
The 'labels' key is not guaranteed to exist for all Pods. Thus, the following statement in kubernetes.py:182 can fail with a KeyError:
if not isinstance(pod['properties']['labels'], types.DictType):
    return False
This should be re-written as:
if ('labels' not in pod['properties']) or (not isinstance(pod['properties']['labels'], types.DictType)):
    return False
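An equivalent check can be written with `dict.get`, which returns `None` instead of raising a `KeyError` when the key is absent. This is a sketch in Python 3 syntax (`dict` instead of `types.DictType`); the function name `has_valid_labels` is hypothetical.

```python
def has_valid_labels(pod):
    """Returns True only if pod['properties']['labels'] exists and is a dictionary."""
    # .get() yields None for a missing key, and None is not a dict.
    labels = pod.get('properties', {}).get('labels')
    return isinstance(labels, dict)
```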
Currently the context graph shows identical images (same Docker ID) in different nodes as distinct entities. This is incorrect: they are exactly the same image.
Fixing this bug will require renaming the images such that the same image in different nodes will have the same internal context-graph identifier.
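A possible renaming scheme is sketched below. It assumes image identifiers have the form `<node>/<12-hex Docker ID>`, as in the sample outputs elsewhere on this page; the function name is hypothetical.

```python
def node_independent_image_id(image_id):
    """Drops the '<node>/' prefix so the same Docker image gets one ID on every node."""
    # rsplit keeps only the part after the last '/', i.e. the Docker ID.
    return image_id.rsplit('/', 1)[-1]
```

Two nodes holding the image `6c4579af347b` would then both map to the identifier `6c4579af347b`.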
Jack said on 2015-06-11:
The following statement in the README.md describes the timestamps:
"Each of the resources and relations has a timestamp field, indicating when it was observed or inferred, respectively. The entire context graph has a separate timestamp indicating when the graph was computed from the resource metadata."
Does it really change every time the data is observed, or does it actually change every time the observed data changes?
If it changes every time the data is observed, it has little value, since it will be updated every time Cluster Insight polls the API.
However, if it changes every time the observed data changes, then it has great value, since it provides a quick way to detect changes on the client side.
After pulling data from Cluster Insight in a web browser over a period of time, I believe it changes only when the observed data changes. However, since the documentation says otherwise, I'd appreciate your help in answering this question.
All timestamps should show when the data changed and not when it was observed.
The cache object has support for this functionality.
The only timestamp that will always show the current time will be the overall timestamp of the context graph.
Need to test the intended behavior and then update the README file.
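The intended behavior (a timestamp that moves only when the data changes) can be illustrated with a small sketch. This is not the collector's cache object; the class and method names are hypothetical.

```python
import datetime


class ChangeTrackingCache:
    """Keeps the timestamp of the last change, not of the last observation."""

    def __init__(self):
        self._entries = {}  # key -> (value, timestamp of last change)

    def update(self, key, value, now=None):
        """Stores value and returns the time at which the stored value last changed."""
        if now is None:
            now = datetime.datetime.now(datetime.timezone.utc).isoformat()
        previous = self._entries.get(key)
        if previous is not None and previous[0] == value:
            return previous[1]  # unchanged data keeps its old timestamp
        self._entries[key] = (value, now)
        return now
```

A client can then compare timestamps between two polls to detect changes without diffing the payload.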
Official announcement:
As of April 1, 2015, the Kubernetes v1beta3 API has been enabled by default, and the v1beta1 and v1beta2 APIs are deprecated. v1beta3 should be considered the v1 release-candidate API, and the v1 API is expected to be substantially similar. As "pre-release" APIs, v1beta1, v1beta2, and v1beta3 will be eliminated once the v1 API is available, by the end of June 2015.
More information:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/api.md
List of all API calls:
http://kubernetes.io/third_party/swagger-ui/
The data collector should report orphan resources (such as Docker processes that do not belong to any container and containers that do not belong to any pod). This is likely the result of data mismatch, API change, inconsistent state, or transient state change.
This capability is essential for detecting incorrect operation of the data collector.
Implementing this capability will require listing all Docker resources, matching them with the corresponding Kubernetes resources, and reporting any leftover Docker resources as orphan.
Note that we do not have this problem with Kubernetes resources, because the data collector includes all Kubernetes resources it finds in the context graph.
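The matching step described above amounts to a set difference. A minimal sketch, assuming both sides can be reduced to comparable identifiers (the function and parameter names are hypothetical):

```python
def find_orphans(docker_ids, kubernetes_ids):
    """Returns the sorted Docker resource IDs that no Kubernetes resource accounts for."""
    return sorted(set(docker_ids) - set(kubernetes_ids))
```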
The cluster-insight collector at http://104.154.64.24:5555/ fails with the following messages:
/cluster/resources/images fails with this message:
{
"error_message": "kubernetes.get_images() failed with exception <type 'exceptions.KeyError'>",
"success": false,
"timestamp": "2015-04-06T18:53:09.078006"
}
/cluster fails with this message:
{
"error_message": ""calling <function _do_compute_node at 0x7fc04513f320> with arguments {'node': {'id': u'kubernetes-minion-ni4v.c.the-pentameter-90318.internal', 'timestamp': '2015-04-06T18:44:46.451123', 'type': 'Node', 'properties': {u'status': {u'nodeInfo': {u'systemUUID': u'FF43C485-F704-7C0F-2F05-D5216E5B148A', u'machineID': u''}, u'conditions': [{u'status': u'Full', u'lastTransitionTime': u'2015-04-02T18:41:16Z', u'kind': u'Ready', u'lastProbeTime': u'2015-04-06T18:44:45Z', u'reason': u'kubelet is posting ready status'}], u'addresses': [{u'type': u'ExternalIP', u'address': u'104.154.64.24'}, {u'type': u'LegacyHostIP', u'address': u'104.197.30.200'}]}, u'uid': u'd89ea1c1-d967-11e4-bd90-42010af0f42f', u'resourceVersion': 710317, u'externalID': u'2709906458529347936', u'resources': {u'capacity': {u'cpu': u'1', u'memory': 3892043776}}, u'hostIP': u'104.197.30.200', u'creationTimestamp': u'2015-04-02T18:41:16Z', u'id': u'kubernetes-minion-ni4v.c.the-pentameter-90318.internal', u'selfLink': u'/api/v1beta1/nodes/kubernetes-minion-ni4v.c.the-pentameter-90318.internal'}, 'annotations': {'label': u'kubernetes-minion-ni4v'}}, 'cluster_guid': u'Cluster:the-pentameter-90318', 'gs': <global_state.GlobalState object at 0x7fc04513c890>, 'g': <context.ContextGraph object at 0x7fc04410bb90>, 'input_queue': <Queue.PriorityQueue instance at 0x7fc044172d88>} failed with exception <type 'exceptions.KeyError'>"",
"success": false,
"timestamp": "2015-04-06T18:51:24.083677"
}
Run the data collector on additional Cloud platforms such as AWS (primary) and Azure or RackSpace. Update the installation instructions and verify correct operation.
The "/version" API is standard.
The current container label is just the first 12 digits of the Docker ID.
This is not a human-friendly label. The container label should include the short name from the pod and the Docker ID.
For example, the current label is:
d85b599c17d8
The new label should be:
cassandra/d85b599c17d8
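A sketch of the proposed label format, assuming the pod's short name and the full Docker ID are both available (the function name is hypothetical):

```python
def container_label(pod_short_name, docker_id):
    """Builds '<pod short name>/<first 12 characters of the Docker ID>'."""
    return '%s/%s' % (pod_short_name, docker_id[:12])
```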
This is a slight modification of issue #17. The difference should be just the master script which should run on a workstation with access to AWS.
The /health API endpoint should always return immediately. It should be used for automated health checks.
The output of /health should be:
{
"health": "OK",
"success": true,
"timestamp": TIMESTAMP
}
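The response body can be produced without touching any backend, which keeps the call fast. A minimal sketch follows; the function name is hypothetical and the wiring into the web framework is omitted.

```python
import datetime


def health_response(now=None):
    """Builds the /health payload without contacting Docker or Kubernetes."""
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    return {'health': 'OK', 'success': True, 'timestamp': now.isoformat()}
```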
In version 0.15 of kubernetes, apiVersion v1beta3 is the default. The current dev cluster runs this version.
For this version of kubernetes, cluster insight throws assertion errors for some API calls:
The error originates in the same place:
Traceback (most recent call last):
...
File "/usr/src/app/kubernetes.py", line 116, in get_nodes
label=utilities.node_id_to_host_name(node['id']))
File "/usr/src/app/utilities.py", line 358, in node_id_to_host_name
assert m
AssertionError
The NODE_ID_PATTERN in utilities.py is not valid anymore. The node_id no longer contains the project name. Its format has changed as follows:
"kubernetes-minion-ycym.c.gce-monitoring.internal" -> "kubernetes-minion-ycym"
Seeing the following error:
{
"error_message": "compute_graph(\"context_graph\") failed with exception <type 'exceptions.AssertionError'>",
"success": false,
"timestamp": "2015-05-23T02:45:15.784381"
}
while accessing: http://127.0.0.1:5555/cluster
I'm testing with: sudo docker run -d --net=host -p 5555:5555 --name cluster-insight kubernetes/cluster-insight python ./collector.py --debug
Manually accessing: http://127.0.0.1:8080/api/v1beta3/nodes
returns:
{
"kind": "NodeList",
"apiVersion": "v1beta3",
"metadata": {
"selfLink": "/api/v1beta3/nodes",
"resourceVersion": "52"
},
"items": [
{
"metadata": {
"name": "127.0.0.1",
"selfLink": "/api/v1beta3/nodes/127.0.0.1",
"uid": "05c3c93b-00f5-11e5-9399-000c291c504f",
"resourceVersion": "52",
"creationTimestamp": "2015-05-23T02:40:06Z",
"labels": {
"kubernetes.io/hostname": "127.0.0.1"
}
},
"spec": {
"externalID": "127.0.0.1"
},
"status": {
"capacity": {
"cpu": "2",
"memory": "2042388Ki",
"pods": "100"
},
"conditions": [
{
"type": "Ready",
"status": "True",
"lastHeartbeatTime": "2015-05-23T02:44:47Z",
"lastTransitionTime": "2015-05-23T02:40:06Z",
"reason": "kubelet is posting ready status"
}
],
"addresses": [
{
"type": "LegacyHostIP",
"address": "127.0.0.1"
}
],
"nodeInfo": {
"machineID": "48c3ef3692591b8704551c0854a71263",
"systemUUID": "564D95DE-D9A6-6364-2B0A-8CC30A1C504F",
"bootID": "b64febb8-f82b-4e94-9e6c-f3c4d280c6d5",
"kernelVersion": "3.13.0-51-generic",
"osImage": "Ubuntu 14.04.2 LTS",
"containerRuntimeVersion": "docker://1.6.0",
"kubeletVersion": "v0.17.1-731-g4292866c031fb7",
"kubeProxyVersion": "v0.17.1-731-g4292866c031fb7"
}
}
}
]
}
For internal projects (i.e., project IDs starting with 'google.com:'), the project name contains part of the project ID.
For example, for "google.com:nth-segment-93514", annotations.metrics.gcm.project stores "nth-segment-93514".
The "alternateLabel" attribute appears in the output of /cluster/resources/images but it is missing from the output of /cluster.
The output of /cluster/resources/images is:
{
"annotations": {
"alternateLabel": "kubernetes/pause:latest",
"label": "6c4579af347b"
},
"id": "k8s-guestbook-node-1.c.rising-apricot-840.internal/6c4579af347b",
"properties": {
The output of /cluster is:
{
"annotations": {
"label": "6c4579af347b"
},
"id": "Image:k8s-guestbook-node-1.c.rising-apricot-840.internal/6c4579af347b",
"properties": {
We have many spurious lint warnings reported by the "gpylint" command, for example for the non-Google copyright notice and for long comment lines. They should be permanently disabled.
Also add missing comments to decorator functions.
The collector fails with an assertion error on GKE because node IDs no longer include the ".c..internal" suffix. Node IDs on GKE are the same as host names.
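A tolerant conversion that accepts both node ID formats could look like the sketch below. The regular expression is an assumption based on the node IDs quoted on this page (`<host>.c.<project>.internal` and plain host names); it is not the project's actual NODE_ID_PATTERN, and the function name is hypothetical.

```python
import re

# Assumed shape: a host name, optionally followed by '.c.<project>.internal'.
TOLERANT_NODE_ID_RE = re.compile(r'^(?P<host>[^.]+)(\.c\.[^.]+\.internal)?$')


def tolerant_node_id_to_host_name(node_id):
    """Returns the host part of a node ID, or the ID itself if the shape is unknown."""
    match = TOLERANT_NODE_ID_RE.match(node_id)
    if match is None:
        return node_id  # fall back instead of asserting
    return match.group('host')
```

Returning the ID unchanged for unrecognized shapes (for example an IP address used as a node name) avoids the assertion failure at the cost of a less friendly label.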
The /healthz name is compatible with Heapster and other Google internal services.
The following code computes the cluster name from the node ID.
This code seems to work in http://...:5555/cluster/resources
The cluster name is correct ("cassandra"). The node ID of the first node is "Node:k8s-cassandra-node-2".
However, L514 in kubernetes.py is:
cluster_name = utilities.node_id_to_cluster_name(nodes_list[0]['id'])
I cannot understand how the pattern matching in node_id_to_cluster_name() can handle the "Node:" prefix and still find the cluster name. It is a mystery to me.
"resources": [
{
"annotations": {
"label": "cassandra"
},
"id": "Cluster:cassandra",
...
"type": "Cluster"
},
{
"annotations": {
"label": "k8s-cassandra-node-2",
},
"id": "Node:k8s-cassandra-node-2",
...
"type": "Node"
},
...
The default value of the flag should be the current constant Docker port number, which is 4243.
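A sketch of such a flag using `argparse` is shown below. The flag name `--docker_port` is an assumption; only the default value 4243 comes from the text above.

```python
import argparse

DEFAULT_DOCKER_PORT = 4243  # the current constant Docker port number


def parse_args(argv=None):
    """Parses the command line; --docker_port is a hypothetical flag name."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--docker_port', type=int, default=DEFAULT_DOCKER_PORT,
                        help='port of the Docker daemon on each node')
    return parser.parse_args(argv)
```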
The context graph of Cluster:gce-monitoring shows 4 pods without any containers. These are the Fluentd pods: Pod:fluentd-to-elasticsearch-kubernetes-minion-ycym.c.gce-monitoring.internal, Pod:fluentd-to-elasticsearch-kubernetes-minion-g4ia.c.gce-monitoring.internal, Pod:fluentd-to-elasticsearch-kubernetes-minion-hv9v.c.gce-monitoring.internal, and Pod:fluentd-to-elasticsearch-kubernetes-minion-a7lt.c.gce-monitoring.internal.
However, listing the Docker processes on kubernetes-minion-a7lt shows two containers belonging to Fluentd running on this node:
% sudo docker ps
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
...
00be5cd66a8f  kubernetes/fluentd-elasticsearch:1.3  "/bin/sh -c '/usr/  44 hours ago  Up 44 hours  k8s_fluentd-es.2a803504_fluentd-to-elasticsearch-kubernetes-minion-a7lt.c.gce-monitoring.internal_default_c5973403e9c9de201f684c38aa8a7588_4dfe38b6
7965cae79197  kubernetes/pause:go  "/pause"  44 hours ago  Up 44 hours  k8s_POD.7c16d80d_fluentd-to-elasticsearch-kubernetes-minion-a7lt.c.gce-monitoring.internal_default_c5973403e9c9de201f684c38aa8a7588_417b0b4c
We should detect the relationships between the empty pods and their containers and processes.
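One way to recover the relationship is to read the pod name out of the Docker container name. The sketch below assumes the name is a list of '_'-separated fields beginning with `k8s` and that the third field is the pod name, which matches the container names quoted on this page; the function name is hypothetical and pod names containing '_' are not handled.

```python
def pod_name_from_docker_name(docker_name):
    """Returns the assumed pod-name field of a 'k8s_...' container name, else None."""
    parts = docker_name.lstrip('/').split('_')
    if len(parts) < 3 or parts[0] != 'k8s':
        return None  # not a Kubernetes-managed container name
    return parts[2]
```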
There are Kubernetes resources which are located in the master node. For example, there are pods which run in the master node. These pods are listed when you ask for all pods.
There are a few technical problems listing resources in the master node:
(1) when you ask Kubernetes for the list of nodes, you get only the minion nodes.
(2) there is no easy way to get the name of the master node using only the Kubernetes API.
(3) I did not try to explicitly get the information about the master node. If this information is not available, then the pods running in the master node will be orphans.
The simplest solution is to read the data in parallel from multiple backends. This will require adding concurrency to the Python code.
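Reading from several backends in parallel can be sketched with a thread pool from the standard library. The function name and the shape of the `fetchers` argument (a dictionary of zero-argument callables) are assumptions made for illustration.

```python
import concurrent.futures


def fetch_all(fetchers):
    """Runs every zero-argument fetcher concurrently; returns results keyed by name."""
    workers = max(1, len(fetchers))
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in fetchers.items()}
        # result() blocks until the fetcher finishes and re-raises its exception, if any.
        return {name: future.result() for name, future in futures.items()}
```

Threads fit this case because the work is dominated by waiting on network responses rather than by computation.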
If Heapster stores metrics in InfluxDB, tag the corresponding resources with the query parameters needed to access them.
If Heapster stores the metrics in GCM, tag the corresponding resources with the query parameters needed to access them.
In both cases, write sample programs to access the data using the query parameters.
Discussion on 2015-03-10 and 2015-03-23:
On second thought, how about the following alternative:
Leave the current API essentially as is. i.e., we support two signatures:
/cluster/resources/ - get resources of
/cluster/resources - get all resources
/cluster - get a config snapshot
In the next iteration, we will add a filter option to these APIs:
/cluster/resources?filter= - get a subset of resources that match filter, or all resources if filter is not specified. 'filter' and 'type' cannot be used together.
/cluster?filter= - get a subset of the config snapshot
This may be less disruptive to the current code. What do you think?
cc: @vasbala
Use a Unix socket instead of opening up Docker's port, which is a giant security hole.
/cluster/resources/images shows certain images twice.
The images have the same symbolic name and the same ID.
For example:
{
"annotations": {
"alternateLabel": "kubernetes/pause:latest",
"label": "6c4579af347b"
},
"id": "k8s-guestbook-node-1.c.rising-apricot-840.internal/6c4579af347b",
"properties": {
and
{
"annotations": {
"alternateLabel": "kubernetes/pause:latest",
"label": "6c4579af347b"
},
"id": "k8s-guestbook-node-1.c.rising-apricot-840.internal/6c4579af347b",
"properties": {
The duplicate images also appear in the output of /cluster.
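Until the root cause is found, duplicates could be filtered by identifier before the list is returned. A minimal sketch, assuming each resource is a dictionary with an 'id' key as in the output above (the function name is hypothetical):

```python
def dedupe_by_id(resources):
    """Keeps the first resource seen for each 'id', preserving the original order."""
    seen = set()
    unique = []
    for resource in resources:
        if resource['id'] not in seen:
            seen.add(resource['id'])
            unique.append(resource)
    return unique
```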
It is possible to have a container with no parent pod. So when the parent_pod_id is None, instead of throwing an error, we should record the container.
E.g. of such a scenario:
When the cluster-insight container was started on a minion directly (rather than using the replication controller), this error was thrown.
Write idempotent scripts that should run on the master and minion nodes.
The master script should run on a workstation with access to GCP via the "gcloud" command.
This way the data will always be fresh and the processing will be scalable.
The definition of similarity is hard. We can start by requiring identical structure and relations. Relations that extend outside the subgraph should be handled with care. "monitors" and "loadBalances" should have the same source resource. "runs" may have a different source node. "contains" must have equivalent source and target resources.
The context graph of the test cluster shows multiple separate instances of the image "Image:kubernetes/cluster-insight", which are not collapsed together.
Interestingly, the label of the container is the same, but the underlying image ID is different.
Maybe this is the result of compiling the image from sources in each node.
The assertion in utilities.py:215 may not be true in all Kubernetes deployments. It can be safely removed:
assert len(elements) == 4
For debugging purposes it is useful to be able to set the 'debug' flag to True from the command-line. The last statement in collector.py hard wires debug=False, without giving an override option:
app.run(host='0.0.0.0', port=port, debug=False)
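A command-line override could be sketched with `argparse` as follows. The `--debug` flag name matches the invocation quoted elsewhere on this page; the function name is hypothetical.

```python
import argparse


def parse_debug_flag(argv=None):
    """Returns True if --debug was given on the command line, otherwise False."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--debug', action='store_true',
                        help='run the server in debug mode')
    return parser.parse_args(argv).debug
```

The last statement of the collector would then pass the parsed value instead of the hard-wired constant, e.g. `debug=parse_debug_flag()`.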
We have only tested cluster-insight on the guestbook app.
A list of Kubernetes example apps is https://github.com/GoogleCloudPlatform/kubernetes/tree/master/examples
We should test it on:
cc: @vasbala
In the GKE cluster, the API call to /cluster/resources/containers returns an error message:
{
"error_message": "u'could not locate parent pod cassandra.f20d7205_cassandra for container k8s_cassandra.f20d7205_cassandra_default_2a3763eb-d7be-11e4-8153-42010af03474_94d86795'",
"success": false,
"timestamp": "2015-05-04T15:17:11.642006"
}
Currently the 'properties' attribute of all resources that are read from Kubernetes/Docker is populated with the corresponding data. However, the 'properties' attribute of synthetic resources such as 'Cluster' is not defined.
We should define the 'properties' attribute of all resource types to enable uniform verification of all resource objects.
We should enhance the unit tests to verify that all resource objects have 'properties'.
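A uniform check of this kind might look like the sketch below, assuming resources are dictionaries and that 'properties' should always be a dictionary (the function name is hypothetical):

```python
def resources_missing_properties(resources):
    """Returns the IDs of resources whose 'properties' attribute is absent or not a dict."""
    return [r.get('id') for r in resources
            if not isinstance(r.get('properties'), dict)]
```

A unit test can assert that this list is empty for the full set of resources in a context graph.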
Nodes should have a 'runs' relationship with Pods instead of 'contains'. This is an important semantic distinction. It will help with similarity detection, because all 'contains' relationships should be the same when two subtrees are compared, whereas 'runs' relationships need not be the same.
Currently, the resource-id of the sample resource varies based on the API call made.
When you call /cluster/resources, the id is of the format: ":"
However, when you ask for resources of a specific type (e.g. /cluster/resources/pods), the id is of the format ""
The resource-id should look exactly the same for a specific resource, independent of how the resource was returned.
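One way to make the identifier uniform is to normalize it in a single helper that every API path uses. The sketch below assumes the canonical form is the resource type, a colon, and the raw ID, as in identifiers such as "Node:k8s-cassandra-node-2" seen elsewhere on this page; the function name is hypothetical.

```python
def canonical_resource_id(resource_type, raw_id):
    """Returns '<type>:<raw id>', leaving IDs that already carry the prefix unchanged."""
    prefix = resource_type + ':'
    if raw_id.startswith(prefix):
        return raw_id
    return prefix + raw_id
```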
In the cassandra cluster setup, two of the services are not connected to their respective pods: cassandra and monitoring-heapster. In the case of cassandra, the replication controller and pods are not connected either. The heapster rc is fine - it has the expected link to its pod.
The selector for the service/rc, and the label for the pods do have a match in both cases. Here is the status from the dev cluster (the-pentameter-90318):
For cassandra:
cassandra
name=cassandra
cassandra-a6pnt, cassandra-a8pbs, cassandra-hnj3i, cassandra-svacm
name=cassandra
cassandra
name=cassandra
For heapster:
monitoring-heapster
name=heapster
monitoring-heapster-controller-1gapo
kubernetes.io/cluster-service=true,name=heapster
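For reference, an equality-based selector matches a pod when every key/value pair of the selector is present in the pod's labels; extra labels on the pod do not prevent a match. A minimal sketch of that rule (the function name is hypothetical, and this is not the collector's matching code):

```python
def selector_matches(selector, labels):
    """Returns True if every selector key/value pair appears in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())
```

Under this rule both cases listed above should be linked, including the heapster pod that carries an additional label.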
Currently the master Cluster-Insight data collector does not read any information from the local Docker daemon after changing the architecture to include a proxy in every minion node.
This information is necessary to compute the name of the currently running binary correctly.
Should cover 3 most common problems:
This attribute should contain the name of the project ("Cluster Insight"), the 12-hex digits image ID and compilation date of the image. This attribute should be added to all resources. Relations should have an equivalent "inferredBy" attribute.
Seth Porter told the participants in the Cloud graph summit that it is essential to know the identity of every piece of data in the graph, so we can ignore corrupt data.