google / cluster-insight

License: Apache License 2.0

Makefile 1.01% Python 96.73% HTML 2.25%

cluster-insight's Issues

/version API endpoint should cache the result

Currently the /version API endpoint computes the result every time you access it.
This is wasteful because the result of /version is fixed for the duration of the execution of the server's binary. Moreover, checking the service's health by sampling /version may skew the results, because computing the result of /version is a multi-step process which may take a variable amount of time depending on the load.
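
One way to cache the result is to memoize the function that builds it. The following is a minimal sketch, not the collector's actual code: `compute_version()` and `cached_version()` are hypothetical names, the returned string is a placeholder, and `functools.lru_cache` assumes Python 3.

```python
import functools

calls = []  # records each time the expensive computation actually runs


def compute_version():
    """Hypothetical stand-in for the multi-step version lookup."""
    calls.append(1)
    return 'example-version'  # placeholder value, not a real version string


@functools.lru_cache(maxsize=None)
def cached_version():
    """Return the version, computing it only on the first call."""
    return compute_version()
```

Every call after the first returns the stored value, so sampling /version for health checks would no longer depend on the load.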

'labels' key may not be present in pod['properties']

The 'labels' key is not guaranteed to exist for all Pods. Thus, the following statement in kubernetes.py:182 can fail with a KeyError:

if not isinstance(pod['properties']['labels'], types.DictType):
    return False

This should be re-written as:

if ('labels' not in pod['properties']) or (not isinstance(pod['properties']['labels'], types.DictType)):
    return False
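
An equivalent check can be written with `dict.get`, which returns None instead of raising when the key is absent. This is only a sketch: `has_valid_labels()` is a hypothetical helper, and it uses the built-in `dict` rather than `types.DictType` so that it also runs on Python 3.

```python
def has_valid_labels(pod):
    """Return True only if pod['properties']['labels'] exists and is a dict."""
    # .get() yields None for a missing key, so no KeyError is raised.
    labels = pod.get('properties', {}).get('labels')
    return isinstance(labels, dict)
```

The caller would then use `if not has_valid_labels(pod): return False`.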

identical images in different nodes should be collapsed to one image

Currently the context graph shows identical images (same Docker ID) in different nodes as distinct entities. This is incorrect: they are exactly the same image.

Fixing this bug will require renaming the images such that the same image in different nodes will have the same internal context-graph identifier.
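
A rough sketch of the renaming, assuming the identifier is built from the Docker ID alone. The input field names `docker_id` and `node`, and both function names, are hypothetical; the real collector's data layout may differ.

```python
def image_guid(docker_image_id):
    """Build a node-independent identifier from the first 12 hex digits."""
    return 'Image:' + docker_image_id[:12]


def collapse_images(images):
    """Keep one entry per Docker image ID and remember every node that has it."""
    merged = {}
    for image in images:
        guid = image_guid(image['docker_id'])
        entry = merged.setdefault(guid, {'id': guid, 'nodes': []})
        entry['nodes'].append(image['node'])
    return merged
```

With this scheme the same image seen on two nodes maps to a single graph entry that lists both nodes.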

fix the fonts of attributes name in README.md

timestamps should show the time the corresponding data changed

Jack said on 2015-06-11:
The following statement in README.md describes the timestamps:

"Each of the resources and relations has a timestamp field, indicating when it was observed or inferred, respectively. The entire context graph has a separate timestamp indiciating when the graph was computed from the resource metadata."

Does it really change every time the data is observed, or does it actually change every time the observed data changes?

If it changes every time the data is observed, it has little value, since it will be updated every time Cluster Insight polls the API.

However, if it changes every time the observed data changes, then it has great value, since it provides a quick way to detect changes on the client side.

After pulling data from Cluster Insight in a web browser over a period of time, I believe it changes only when the observed data changes. However, since the documentation says otherwise, I'd appreciate your help in answering this question.

All timestamps should show when the data changed and not when it was observed.
The cache object has support for this functionality.
The only timestamp that will always show the current time will be the overall timestamp of the context graph.

Need to test the intended behavior and then update the README file.
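
The intended behavior can be illustrated with a small sketch. `ChangeTrackingCache` is a hypothetical class written for this note, not the collector's cache object: it keeps the old timestamp when the newly observed value equals the stored one.

```python
class ChangeTrackingCache(object):
    """Stores values together with the time they last changed."""

    def __init__(self):
        self._entries = {}  # key -> (value, timestamp of last change)

    def update(self, key, value, now):
        """Store value and return the timestamp of its last change."""
        old = self._entries.get(key)
        if old is not None and old[0] == value:
            return old[1]  # data unchanged: keep the earlier timestamp
        self._entries[key] = (value, now)
        return now
```

A test of the real cache could follow the same pattern: observe identical data twice and check that the timestamp does not move.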

upgrade the Kubernetes API to v1beta3

Official announcement:
As of April 1, 2015, the Kubernetes v1beta3 API has been enabled by default, and the v1beta1 and v1beta2 APIs are deprecated. v1beta3 should be considered the v1 release-candidate API, and the v1 API is expected to be substantially similar. As "pre-release" APIs, v1beta1, v1beta2, and v1beta3 will be eliminated once the v1 API is available, by the end of June 2015.

More information:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/api.md

List of all API calls:
http://kubernetes.io/third_party/swagger-ui/

report orphan resources

The data collector should report orphan resources (such as Docker processes that do not belong to any container and containers that do not belong to any pod). This is likely the result of data mismatch, API change, inconsistent state, or transient state change.

This capability is essential for detecting incorrect operation of the data collector.

Implementing this capability will require listing all Docker resources, matching them with the corresponding Kubernetes resources, and reporting any leftover Docker resources as orphan.

Note that we do not have this problem with Kubernetes resources, because the data collector includes all Kubernetes resources it finds in the context graph.
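
The matching step can be sketched as a set difference. This assumes both sides can be reduced to comparable ID strings, which is a simplification; `find_orphans()` is a hypothetical name.

```python
def find_orphans(docker_ids, kubernetes_ids):
    """Return Docker resource IDs that have no matching Kubernetes resource."""
    return sorted(set(docker_ids) - set(kubernetes_ids))
```

Anything returned by this function would be reported as an orphan resource.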

get_images() fails with KeyError

The cluster-insight collector at http://104.154.64.24:5555/ fails with the following messages:

/cluster/resources/images fails with this message:
{
"error_message": "kubernetes.get_images() failed with exception <type 'exceptions.KeyError'>",
"success": false,
"timestamp": "2015-04-06T18:53:09.078006"
}
/cluster fails with this message:
{
"error_message": ""calling <function _do_compute_node at 0x7fc04513f320> with arguments {'node': {'id': u'kubernetes-minion-ni4v.c.the-pentameter-90318.internal', 'timestamp': '2015-04-06T18:44:46.451123', 'type': 'Node', 'properties': {u'status': {u'nodeInfo': {u'systemUUID': u'FF43C485-F704-7C0F-2F05-D5216E5B148A', u'machineID': u''}, u'conditions': [{u'status': u'Full', u'lastTransitionTime': u'2015-04-02T18:41:16Z', u'kind': u'Ready', u'lastProbeTime': u'2015-04-06T18:44:45Z', u'reason': u'kubelet is posting ready status'}], u'addresses': [{u'type': u'ExternalIP', u'address': u'104.154.64.24'}, {u'type': u'LegacyHostIP', u'address': u'104.197.30.200'}]}, u'uid': u'd89ea1c1-d967-11e4-bd90-42010af0f42f', u'resourceVersion': 710317, u'externalID': u'2709906458529347936', u'resources': {u'capacity': {u'cpu': u'1', u'memory': 3892043776}}, u'hostIP': u'104.197.30.200', u'creationTimestamp': u'2015-04-02T18:41:16Z', u'id': u'kubernetes-minion-ni4v.c.the-pentameter-90318.internal', u'selfLink': u'/api/v1beta1/nodes/kubernetes-minion-ni4v.c.the-pentameter-90318.internal'}, 'annotations': {'label': u'kubernetes-minion-ni4v'}}, 'cluster_guid': u'Cluster:the-pentameter-90318', 'gs': <global_state.GlobalState object at 0x7fc04513c890>, 'g': <context.ContextGraph object at 0x7fc04410bb90>, 'input_queue': <Queue.PriorityQueue instance at 0x7fc044172d88>} failed with exception <type 'exceptions.KeyError'>"",
"success": false,
"timestamp": "2015-04-06T18:51:24.083677"
}

test on different platforms: AWS, Azure/Rackspace

Run the data collector on additional cloud platforms such as AWS (primary) and Azure or Rackspace. Update the installation instructions and verify correct operation.

rename "/image_info" API to "/version"

The "/version" API is standard.

label containers with a short name and Docker ID

The current container label is just the first 12 digits of the Docker ID.
This is not a human-friendly label. The container label should include the short name from the pod and the Docker ID.

For example, the current label is:
d85b599c17d8

The new label should be:
cassandra/d85b599c17d8
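
A minimal sketch of the new label format; `container_label()` is a hypothetical helper name.

```python
def container_label(short_name, docker_id):
    """Combine the pod's short name with the first 12 digits of the Docker ID."""
    return '%s/%s' % (short_name, docker_id[:12])
```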

Make the installation of the data collector more turnkey on AWS

This is a slight modification of issue #17. The difference should be just the master script which should run on a workstation with access to AWS.

create a /health API endpoint for health check monitoring

The /health API endpoint should always return immediately. It should be used for automated health checks.

The output of /health should be:
{
"health": "OK",
"success": true,
"timestamp": TIMESTAMP
}
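
The response body could be built by a function that does no I/O, so it always returns immediately. This is a sketch: `health_response()` is a hypothetical name, and the ISO timestamp format is an assumption based on the timestamps in the collector's error messages.

```python
import datetime


def health_response(now=None):
    """Build the /health payload without touching Docker or Kubernetes."""
    if now is None:
        # Naive UTC time, formatted like the collector's other timestamps.
        now = datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
    return {'health': 'OK', 'success': True, 'timestamp': now.isoformat()}
```

The web handler for /health would serialize this dictionary as JSON.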

error thrown for API calls in new version of kubernetes

In version 0.15 of kubernetes, apiVersion v1beta3 is the default. The current dev cluster runs this version.

For this version of kubernetes, cluster insight throws assertion errors for some API calls:

/cluster/resources/nodes
/cluster/resources/containers
/cluster/resources/processes
/cluster/resources/images

The error originates in the same place:

Traceback (most recent call last):
...
  File "/usr/src/app/kubernetes.py", line 116, in get_nodes
    label=utilities.node_id_to_host_name(node['id']))
  File "/usr/src/app/utilities.py", line 358, in node_id_to_host_name
    assert m
AssertionError

The NODE_ID_PATTERN in utilities.py is not valid anymore. The node_id no longer contains the project name. Its format has changed as follows:

"kubernetes-minion-ycym.c.gce-monitoring.internal" -> "kubernetes-minion-ycym"

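A pattern that accepts both the old and the new node ID formats could look like the sketch below. The regular expression is an assumption written for this note, not the pattern currently in utilities.py, and it raises ValueError instead of asserting. Other ID forms, such as IP addresses, are not handled.

```python
import re

# Host name, optionally followed by the old ".c.<project>.internal" suffix.
NODE_ID_PATTERN = re.compile(r'^(?P<host>[^.]+)(\.c\.[^.]+\.internal)?$')


def node_id_to_host_name(node_id):
    """Return the host name part of a node ID in either format."""
    m = NODE_ID_PATTERN.match(node_id)
    if m is None:
        raise ValueError('unexpected node ID: %r' % node_id)
    return m.group('host')
```
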
With port 4243 of the Docker daemon enabled — compute_graph failure

Seeing the following error:

{
  "error_message": "compute_graph(\"context_graph\") failed with exception <type 'exceptions.AssertionError'>", 
  "success": false, 
  "timestamp": "2015-05-23T02:45:15.784381"
}

while accessing: http://127.0.0.1:5555/cluster

I'm testing with: sudo docker run -d --net=host -p 5555:5555 --name cluster-insight kubernetes/cluster-insight python ./collector.py --debug

Manually accessing: http://127.0.0.1:8080/api/v1beta3/nodes returns:

{
  "kind": "NodeList",
  "apiVersion": "v1beta3",
  "metadata": {
    "selfLink": "/api/v1beta3/nodes",
    "resourceVersion": "52"
  },
  "items": [
    {
      "metadata": {
        "name": "127.0.0.1",
        "selfLink": "/api/v1beta3/nodes/127.0.0.1",
        "uid": "05c3c93b-00f5-11e5-9399-000c291c504f",
        "resourceVersion": "52",
        "creationTimestamp": "2015-05-23T02:40:06Z",
        "labels": {
          "kubernetes.io/hostname": "127.0.0.1"
        }
      },
      "spec": {
        "externalID": "127.0.0.1"
      },
      "status": {
        "capacity": {
          "cpu": "2",
          "memory": "2042388Ki",
          "pods": "100"
        },
        "conditions": [
          {
            "type": "Ready",
            "status": "True",
            "lastHeartbeatTime": "2015-05-23T02:44:47Z",
            "lastTransitionTime": "2015-05-23T02:40:06Z",
            "reason": "kubelet is posting ready status"
          }
        ],
        "addresses": [
          {
            "type": "LegacyHostIP",
            "address": "127.0.0.1"
          }
        ],
        "nodeInfo": {
          "machineID": "48c3ef3692591b8704551c0854a71263",
          "systemUUID": "564D95DE-D9A6-6364-2B0A-8CC30A1C504F",
          "bootID": "b64febb8-f82b-4e94-9e6c-f3c4d280c6d5",
          "kernelVersion": "3.13.0-51-generic",
          "osImage": "Ubuntu 14.04.2 LTS",
          "containerRuntimeVersion": "docker://1.6.0",
          "kubeletVersion": "v0.17.1-731-g4292866c031fb7",
          "kubeProxyVersion": "v0.17.1-731-g4292866c031fb7"
        }
      }
    }
  ]
}

project name mismatch in metric descriptor for internal projects.

For internal projects (i.e. project IDs starting with 'google.com:'), the project name contains only part of the project ID.

For example, for "google.com:nth-segment-93514", annotations.metrics.gcm.project stores "nth-segment-93514"

publish document describing the /cluster data format

alternateLabel is missing from the output of /cluster

The "alternateLabel" attribute appears in the output of /cluster/resources/images but it is missing from the output of /cluster.

The output of /cluster/resources/images is:
{
"annotations": {
"alternateLabel": "kubernetes/pause:latest",
"label": "6c4579af347b"
},
"id": "k8s-guestbook-node-1.c.rising-apricot-840.internal/6c4579af347b",
"properties": {

The output of /cluster is:
{
"annotations": {
"label": "6c4579af347b"
},
"id": "Image:k8s-guestbook-node-1.c.rising-apricot-840.internal/6c4579af347b",
"properties": {

fix new lint warnings and disable nonsense warnings

We have many nonsense lint warnings as reported by the "gpylint" command, for example for the non-Google copyright notice and for long comment lines. They should be permanently disabled.

Also add missing comments to decorator functions.

collector fails with assertion error on GKE

The collector fails with an assertion error on GKE because node IDs no longer include the ".c..internal" suffix. Node IDs on GKE are the same as host names.

rename /health endpoint to /healthz

The /healthz name is compatible with Heapster and other Google internal services.

how does the cluster name extraction work? I cannot understand it

The following code computes the cluster name from the node ID.
This code seems to work in http://...:5555/cluster/resources

The cluster name is correct ("cassandra"). The node ID of the first node is "Node:k8s-cassandra-node-2".
However, L514 in kubernetes.py is:
cluster_name = utilities.node_id_to_cluster_name(nodes_list[0]['id'])

I cannot understand how the pattern matching in node_id_to_cluster_name() can handle the "Node:" prefix and still find the cluster name. It is a mystery to me.

"resources": [
{
"annotations": {
"label": "cassandra"
},
"id": "Cluster:cassandra",
...
"type": "Cluster"
},
{
"annotations": {
"label": "k8s-cassandra-node-2",
},
"id": "Node:k8s-cassandra-node-2",
...
"type": "Node"
},
...

set the Docker daemon port number with a flag

The default value of the flag should be the current constant Docker port number, which is 4243.
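
A sketch using the standard `argparse` module. The flag name `--docker_port` and the function name are assumptions; only the default value 4243 comes from this issue.

```python
import argparse

DEFAULT_DOCKER_PORT = 4243  # the current constant Docker port number


def parse_docker_args(argv=None):
    """Parse the command line; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--docker_port', type=int,
                        default=DEFAULT_DOCKER_PORT,
                        help='port number of the Docker daemon')
    return parser.parse_args(argv)
```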

some pods have no containers

The context graph of Cluster:gce-monitoring shows 4 pods without any containers. These are the Fluentd pods: Pod:fluentd-to-elasticsearch-kubernetes-minion-ycym.c.gce-monitoring.internal, Pod:fluentd-to-elasticsearch-kubernetes-minion-g4ia.c.gce-monitoring.internal, Pod:fluentd-to-elasticsearch-kubernetes-minion-hv9v.c.gce-monitoring.internal, and Pod:fluentd-to-elasticsearch-kubernetes-minion-a7lt.c.gce-monitoring.internal.

However, listing the Docker processes on kubernetes-minion-a7lt shows two containers belonging to Fluentd running on this node:
% sudo docker ps
sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS
PORTS NAMES
...
00be5cd66a8f kubernetes/fluentd-elasticsearch:1.3 ""/bin/sh -c '/usr/ 44 hours ago Up 44 hours
k8s_fluentd-es.2a803504_fluentd-to-elasticsearch-kubernetes-minion-a7lt.c.gce-monito
ring.internal_default_c5973403e9c9de201f684c38aa8a7588_4dfe38b6
7965cae79197 kubernetes/pause:go "/pause" 44 hours ago Up 44 hours
k8s_POD.7c16d80d_fluentd-to-elasticsearch-kubernetes-minion-a7lt.c.gce-monitoring.in
ternal_default_c5973403e9c9de201f684c38aa8a7588_417b0b4c

We should detect the relationships between the empty pods and their containers and processes.

report resources in the master node

There are Kubernetes resources which are located in the master node. For example, there are pods which run in the master node. These pods are listed when you ask for all pods.

There are a few technical problems listing resources in the master node:
(1) when you ask Kubernetes for the list of nodes, you get only the minion nodes.
(2) there is no easy way to get the name of the master node using only the Kubernetes API.
(3) I did not try to explicitly get the information about the master node. If this information is not available, then the pods running in the master node will be orphan.

Scalability improvement for monitoring large clusters

The simplest solution is to read the data in parallel from multiple backends. This will require adding concurrency to the Python code.
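
One possible shape for the parallel reads is a thread pool, sketched below. `fetch_all()` and its `fetch` argument are hypothetical, and `concurrent.futures` is part of the standard library only in Python 3.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch, backends, max_workers=8):
    """Call fetch(backend) for every backend concurrently.

    Results are returned in the same order as the backends.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, backends))
```

Threads fit this case because the work is dominated by waiting on network responses rather than by computation.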

tag resources with corresponding metrics from heapster

If Heapster stores metrics in InfluxDB, tag the corresponding resources with the query parameters needed to access them.
If Heapster stores the metrics in GCM, tag the corresponding resources with the query parameters needed to access them.
In both cases, write sample programs to access the data using the query parameters.

fix the Docker file

add filter options to the API

Discussion on 2015-03-10 and 2015-03-23:

On second thought, how about the following alternative:

Leave the current API essentially as is. i.e., we support two signatures:
/cluster/resources/ - get resources of
/cluster/resources - get all resources
/cluster - get a config snapshot

In the next iteration, we will add a filter option to these APIs:
/cluster/resources?filter= - get a subset of resources that match filter, or all resources if filter is not specified. 'filter' and 'type' cannot be used together.
/cluster?filter= - get a subset of the config snapshot

This may be less disruptive to the current code. What do you think?

cc: @vasbala

Use unix socket instead of opening up docker's port

Use unix socket instead of opening up docker's port which is a giant security hole.

images appear twice or more

/cluster/resources/images shows certain images twice.
The images have the same symbolic name and the same ID.

For example:
{
"annotations": {
"alternateLabel": "kubernetes/pause:latest",
"label": "6c4579af347b"
},
"id": "k8s-guestbook-node-1.c.rising-apricot-840.internal/6c4579af347b",
"properties": {

and

{
"annotations": {
"alternateLabel": "kubernetes/pause:latest",
"label": "6c4579af347b"
},
"id": "k8s-guestbook-node-1.c.rising-apricot-840.internal/6c4579af347b",
"properties": {

The duplicate images also appear in the output of /cluster

when a container has no parent pod, the /cluster/resources/containers throws an error

It is possible to have a container with no parent pod. So when the parent_pod_id is None, instead of throwing an error, we should record the container.

E.g. of such a scenario:

When the cluster-insight container was started on a minion directly (rather than using the replication controller), this error was thrown.

change "config" to "context" in all comments

Make the installation of the data collector more turnkey on GCP

Write idempotent scripts that run on the master and minion nodes.
The master script should run on a workstation with access to GCP via the "gcloud" command.

Listen to the Docker and Kubernetes event streams

This way the data will always be fresh and the processing will be scalable.

compute similar subgraphs

The definition of similarity is hard. We can start by requiring identical structure and relations. Relations that extend outside the subgraph should be handled with care. "monitors" and "loadBalances" should have the same source resource. "runs" may have a different source node. "contains" must have equivalent source and target resources.

images with identical names are not merged

The context graph of the test cluster shows multiple separate instances of the image "Image:kubernetes/cluster-insight", which are not collapsed together.

Interestingly, the label of the container is the same, but the underlying image ID is different.
Maybe this is the result of compiling the image from sources in each node.

Assertion bug in utilities.py

The assertion in utilities.py:215 may not be true in all Kubernetes deployments. It can be safely removed:

 assert len(elements) == 4

Allow command-line override of 'debug' flag in collector.py

For debugging purposes it is useful to be able to set the 'debug' flag to True from the command-line. The last statement in collector.py hard wires debug=False, without giving an override option:

app.run(host='0.0.0.0', port=port, debug=False)
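
A sketch of the override using `argparse`. The function name is hypothetical, and the way the parsed value is passed on to `app.run()` is shown only as a comment because it depends on the rest of collector.py.

```python
import argparse


def parse_collector_args(argv=None):
    """Parse the command line; --debug defaults to False when omitted."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--debug', action='store_true',
                        help='run the server in debug mode')
    return parser.parse_args(argv)


# Assumed usage at the end of collector.py:
#   args = parse_collector_args()
#   app.run(host='0.0.0.0', port=port, debug=args.debug)
```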

test on additional Kubernetes apps

We have only tested cluster-insight on the guestbook app.

A list of Kubernetes example apps is https://github.com/GoogleCloudPlatform/kubernetes/tree/master/examples

We should test it on:

k8petstore (https://github.com/GoogleCloudPlatform/kubernetes/tree/master/examples/k8petstore) - this Kubernetes app comes with a data generator that allows us to drive load into the app.
phabricator (https://github.com/GoogleCloudPlatform/kubernetes/tree/master/examples/phabricator) - this Kubernetes app uses the CloudSQL service on GCP.

cc: @vasbala

cannot locate parent pod in GKE cluster

In the GKE cluster, the API call to /cluster/resources/containers returns an error message:

{
"error_message": "u'could not locate parent pod cassandra.f20d7205_cassandra for container k8s_cassandra.f20d7205_cassandra_default_2a3763eb-d7be-11e4-8153-42010af03474_94d86795'",
"success": false,
"timestamp": "2015-05-04T15:17:11.642006"
}

fill the 'properties' attribute of every resource

Currently the 'properties' attribute of all resources that are read from Kubernetes/Docker is populated with the corresponding data. However, the 'properties' attribute of synthetic resources such as 'Cluster' is not defined.

We should define the 'properties' attribute of all resource types to enable uniform verification of all resource objects.

We should enhance the unit tests to verify that all resource objects have 'properties'.
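
The check itself could be as small as the helper sketched below. `resources_missing_properties()` is a hypothetical name, and requiring 'properties' to be a dictionary is an assumption.

```python
def resources_missing_properties(resources):
    """Return the IDs of resources whose 'properties' attribute is not a dict."""
    return [r.get('id') for r in resources
            if not isinstance(r.get('properties'), dict)]
```

A unit test would assert that this list is empty for the resources of every type.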

all code should pass "gpylint"

replace 'contains' with 'runs' relationship for Nodes/Pods

Nodes should have a 'runs' relationship with Pods instead of 'contains'. This is an important semantic distinction. It will help with similarity detection, because all 'contains' relationships should be the same when two subtrees are compared, whereas 'runs' relationships need not be the same.

add comments according to the Google coding style

resource-ids should be consistent across API calls

Currently, the resource-id of the same resource varies based on the API call made.

When you call /cluster/resources, the id is of the format: ":"

However, when you ask for resources of a specific type (e.g. /cluster/resources/pods), the id is of the format ""

The resource-id should look exactly the same for a specific resource, independent of how the resource was returned.

some services/replicationControllers and related pods are not connected in the context graph.

In the cassandra cluster setup, two of the services are not connected to their respective pods: cassandra and monitoring-heapster. In the case of cassandra, the replication controller and pods are not connected either. The heapster rc is fine - it has the expected link to its pod.

The selector for the service/rc, and the label for the pods do have a match in both cases. Here is the status from the dev cluster (the-pentameter-90318):

For cassandra:

Service: cassandra
Service Selector: name=cassandra
Pods (4): cassandra-a6pnt, cassandra-a8pbs, cassandra-hnj3i, cassandra-svacm
Pod labels: name=cassandra
Replication Controller: cassandra
Replication Controller selector: name=cassandra

For heapster:

Service: monitoring-heapster
Service Selector: name=heapster
Pod: monitoring-heapster-controller-1gapo
Pod labels: kubernetes.io/cluster-service=true,name=heapster
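
For reference, the expected matching rule can be sketched as follows: a selector matches a pod when every key/value pair in the selector appears in the pod's labels, and extra pod labels are ignored. `selector_matches()` is a hypothetical helper, and treating an empty selector as matching nothing is an assumption.

```python
def selector_matches(selector, labels):
    """Return True if every selector key/value pair appears in labels."""
    if not selector:
        return False  # assumption: an empty selector selects no pods
    return all(labels.get(key) == value for key, value in selector.items())
```

Under this rule both the cassandra and the heapster pods listed above match their selectors, so the missing links point to a bug in the collector rather than in the cluster configuration.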

read information from Docker daemon in the master node

Currently the master Cluster-Insight data collector does not read any information from the local Docker daemon after changing the architecture to include a proxy in every minion node.

This information is necessary to compute the name of the currently running binary correctly.

write a troubleshooting document

It should cover the three most common problems:

collector does not respond
collector fails to communicate with some/all Docker controllers
other runtime errors

add "createdBy" attribute

This attribute should contain the name of the project ("Cluster Insight"), the 12-hex-digit image ID, and the compilation date of the image. This attribute should be added to all resources. Relations should have an equivalent "inferredBy" attribute.

Seth Porter told the participants in the Cloud graph summit that it is essential to know the identity of every piece of data in the graph, so we can ignore corrupt data.
