
Crashd - Crash Diagnostics

Crash Diagnostics (Crashd) is a tool that helps human operators easily interact with and collect information from infrastructure running on Kubernetes, for tasks such as automated diagnosis and troubleshooting.

Crashd Features

  • Crashd uses the Starlark language, a Python dialect, to express and invoke automation functions
  • Easily automate interaction with infrastructures running Kubernetes
  • Interact and capture information from compute resources such as machines (via SSH)
  • Automatically execute commands on compute nodes to capture results
  • Capture objects and cluster logs from the Kubernetes API server
  • Easily extract data from Cluster-API managed clusters

How Does it Work?

Crashd executes script files, written in Starlark, that interact with a specified infrastructure and its cluster resources. Starlark script files contain predefined Starlark functions capable of interacting with, and collecting diagnostics and other information from, the servers in the cluster.

For details on the design of Crashd, see the Google Doc design document.

Installation

There are two ways to get started with Crashd. Either download a pre-built binary or pull down the code and build it locally.

Download binary

  1. Download the latest binary release for your platform
  2. Extract tarball from release
    tar -xvf <RELEASE_TARBALL_NAME>.tar.gz
    
  3. Move the binary to a directory on your operating system's PATH

Compiling from source

Crashd is written in Go and requires version 1.11 or later. Clone the source from its repo or download it to your local directory. From the project's root directory, compile the code with the following:

GO111MODULE=on go build -o crashd .

Or, you can run a versioned build using the build.go source code:

go run .ci/build/build.go

Build amd64/darwin OK: .build/amd64/darwin/crashd
Build amd64/linux OK: .build/amd64/linux/crashd

Getting Started

A Crashd script consists of a collection of Starlark functions stored in a file. For instance, the following script (saved as diagnostics.crsh) collects system information from a list of provided hosts using SSH. The collected data is then bundled as a tar.gz file at the end:

# Crashd global config
crshd = crashd_config(workdir="{0}/crashd".format(os.home))

# Enumerate compute resources 
# Define a host list provider with configured SSH
hosts=resources(
    provider=host_list_provider(
        hosts=["170.10.20.30", "170.40.50.60"], 
        ssh_config=ssh_config(
            username=os.username,
            private_key_path="{0}/.ssh/id_rsa".format(os.home),
        ),
    ),
)

# collect data from hosts
capture(cmd="sudo df -i", resources=hosts)
capture(cmd="sudo crictl info", resources=hosts)
capture(cmd="df -h /var/lib/containerd", resources=hosts)
capture(cmd="sudo systemctl status kubelet", resources=hosts)
capture(cmd="sudo systemctl status containerd", resources=hosts)
capture(cmd="sudo journalctl -xeu kubelet", resources=hosts)

# archive collected data
archive(output_file="diagnostics.tar.gz", source_paths=[crshd.workdir])

The previous code snippet connects to two hosts (specified in the host_list_provider), executes commands remotely over SSH, and captures and stores the results.

See the complete list of supported functions here.

Running the script

To run the script, do the following:

$> crashd run diagnostics.crsh 

If you want to output debug information, use the --debug flag as shown:

$> crashd run --debug diagnostics.crsh

DEBU[0000] creating working directory /home/user/crashd
DEBU[0000] run: executing command on 2 resources
DEBU[0000] run: executing command on localhost using ssh: [sudo df -i]
DEBU[0000] ssh.run: /usr/bin/ssh -q -o StrictHostKeyChecking=no -i /home/user/.ssh/id_rsa -p 22  user@localhost "sudo df -i"
DEBU[0001] run: executing command on 170.10.20.30 using ssh: [sudo df -i]
...

Compute Resource Providers

Crashd utilizes the concept of a provider to enumerate compute resources. Each implementation of a provider is responsible for enumerating compute resources on which Crashd can execute commands using a transport (e.g., SSH). Crashd comes with several providers, including:

  • Host List Provider - uses an explicit list of host addresses (see previous example)
  • Kubernetes Nodes Provider - extracts host information from Kubernetes API node objects
  • CAPV Provider - uses Cluster-API to discover machines in a vSphere cluster
  • CAPA Provider - uses Cluster-API to discover machines running on AWS
  • More providers coming!

Accessing script parameters

Crashd scripts can access external values that can be used as script parameters.

Environment variables

Crashd scripts can access environment variables at runtime using the os.getenv method:

kube_capture(what="logs", namespaces=[os.getenv("KUBE_DEFAULT_NS")])

Command-line arguments

Scripts can also access command-line arguments passed as key/value pairs using the --args or --args-file flags. For instance, when the following command is used to start a script:

$ crashd run --args="kube_ns=kube-system, username=$(whoami)" diagnostics.crsh

Values from --args can be accessed as shown below:

kube_capture(what="logs", namespaces=["default", args.kube_ns])

More Examples

SSH Connection via a jump host

The SSH configuration function can be configured with a jump user and jump host. This is useful for providers that require a proxy host for SSH connections, as shown in the following example:

ssh=ssh_config(username=os.username, jump_user=args.jump_user, jump_host=args.jump_host)
hosts=host_list_provider(hosts=["some.host", "172.100.100.20"], ssh_config=ssh)
...

Connecting to Kubernetes nodes with SSH

The following uses the kube_nodes_provider to connect to Kubernetes nodes and execute remote commands against those nodes using SSH:

# SSH configuration
ssh=ssh_config(
    username=os.username,
    private_key_path="{0}/.ssh/id_rsa".format(os.home),
    port=args.ssh_port,
    max_retries=5,
)

# enumerate nodes as compute resources
nodes=resources(
    provider=kube_nodes_provider(
        kube_config=kube_config(path=args.kubecfg),
        ssh_config=ssh,
    ),
)

# exec `uptime` command on each node
uptimes = run(cmd="uptime", resources=nodes)

# print `run` result from first node
print(uptimes[0].result)

Retrieving Kubernetes API objects and logs

The kube_capture function is used in the following example to connect to a Kubernetes API server and retrieve Kubernetes API objects and logs. The retrieved data is then saved to the filesystem as shown below:

nspaces=[
    "capi-kubeadm-bootstrap-system",
    "capi-kubeadm-control-plane-system",
    "capi-system capi-webhook-system",
    "cert-manager tkg-system",
]

conf=kube_config(path=args.kubecfg)

# capture Kubernetes API object and store in files
kube_capture(what="logs", namespaces=nspaces, kube_config=conf)
kube_capture(what="objects", kinds=["services", "pods"], namespaces=nspaces, kube_config=conf)
kube_capture(what="objects", kinds=["deployments", "replicasets"], namespaces=nspaces, kube_config=conf)

Interacting with Cluster-API managed machines running on vSphere (CAPV)

As mentioned, Crashd provides the capv_provider which allows scripts to interact with Cluster-API managed clusters running on a vSphere infrastructure (CAPV). The following shows an abbreviated snippet of a Crashd script that retrieves diagnostics information from the management cluster machines managed by a CAPV-initiated cluster:

# enumerates management cluster nodes
nodes = resources(
    provider=capv_provider(
        ssh_config=ssh_config(username="capv", private_key_path=args.private_key),
        kube_config=kube_config(path=args.mc_config)
    )
)

# execute and capture commands output from management nodes
capture(cmd="sudo df -i", resources=nodes)
capture(cmd="sudo crictl info", resources=nodes)
capture(cmd="sudo cat /var/log/cloud-init-output.log", resources=nodes)
capture(cmd="sudo cat /var/log/cloud-init.log", resources=nodes)
...

The previous snippet interacts with management cluster machines. The provider can be configured to enumerate workload machines instead (by specifying the name of a workload cluster) as shown in the following example:

# enumerates workload cluster nodes
nodes = resources(
    provider=capv_provider(
        workload_cluster=args.cluster_name,
        ssh_config=ssh_config(username="capv", private_key_path=args.private_key),
        kube_config=kube_config(path=args.mc_config)
    )
)

# execute and capture commands output from workload nodes
capture(cmd="sudo df -i", resources=nodes)
capture(cmd="sudo crictl info", resources=nodes)
...

All Examples

See all script examples in the ./examples directory.

Roadmap

This project has numerous possibilities ahead of it. Read about our evolving roadmap here.

Contributing

New contributors will need to sign a CLA (contributor license agreement). Details are described in our contributing documentation.

License

This project is available under the Apache License, Version 2.0

crash-diagnostics's People

Contributors

adisky, ahrtr, avi-08, dependabot[bot], franknstyle, johnschnake, perithompson, srm09, thinkmo, vladimirvivien, yanzhaoli, ykakarap


crash-diagnostics's Issues

Fix to support commands with quoted values

Currently, commands with quoted values like

CAPTURE ls "/home/username/My Files"

are not interpreted properly and are executed as os/exec.Run("ls", "\"home/username/My", "Files\"")

The fix should pass the argument to ls as os/exec.Run("ls", "/home/username/My Files").
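The expected behavior matches standard shell word-splitting. As an illustrative sketch (not Crashd's actual parser), Python's shlex module shows the difference between naive whitespace splitting and quote-aware splitting:

```python
import shlex

# Naive whitespace splitting breaks the quoted path into two tokens.
naive = 'ls "/home/username/My Files"'.split()

# Quote-aware splitting keeps the quoted path as a single argument,
# which is the behavior the fix should produce.
correct = shlex.split('ls "/home/username/My Files"')
```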

Support for environment variable expansion in Diagnostics.file

Ability to access declared environment variables (pre-declared for the running process or declared with ENV) using variable expansion format in Diagnostics.file:

ENV MyHome=${HOME}/test/files
COPY ${MyHome}

This will result in the path /home/<username>/test/files being copied.
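The requested expansion resembles POSIX-style ${VAR} substitution. Below is a minimal Python sketch of the semantics, not the actual implementation (the expand helper is hypothetical):

```python
import re

def expand(value, env):
    """Replace each ${VAR} reference with its value from the env mapping."""
    return re.sub(r"\$\{(\w+)\}", lambda m: env.get(m.group(1), ""), value)

env = {"HOME": "/home/user"}
# ENV MyHome=${HOME}/test/files
env["MyHome"] = expand("${HOME}/test/files", env)
# COPY ${MyHome}
copy_path = expand("${MyHome}", env)
```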

KUBEGET does not pull logs from all namespaces

The following:

KUBEGET logs

The previous Diagnostics command should pull all logs from all pods in all accessible namespaces.
However, it does not: it gives up if it encounters an issue pulling logs from a pod and does not continue with the remaining logs in the namespace.

KUBEGET should pull logs for ready pods as is done with kubectl cluster-info

Enhance FROM to support K8s node sources and connectivity settings

Crash-Diag does not assume a K8s api-server is available. However, if it is, FROM should be able to use Kubernetes nodes metadata to specify the host sources as shown below:

Examples - sourcing with the new source and port params specified:

# Source data from all nodes in K8s cluster
# Default port=22 (ssh)
FROM source:"kubernetes" [port:"22"]

# Source from all cluster nodes with specified labels
# source=kubernetes, port=22 are assumed
FROM [source:"kubernetes"] [port=22] labels:"name=bigmachine"

# Source from specified cluster nodes
# source=kubernetes, port=22 are assumed
FROM nodes:"node.1 node.2"

# Source directly from specified hosts 
# source=direct, port=22 are assumed
FROM [source:"direct"] hosts:"10.10.2.100 10.10.2.200"

# Source directly from hosts
# source=direct is assumed
# port is overridden, set to 22 when not provided
FROM [source:"direct"] hosts:"10.10.2.100:2222 10.10.2.200"

Examples - specify connectivity settings

FROM  hosts:"10.10.2.100 10.10.2.200" tries:"10" timeout:"2min"

Specify error behavior for command actions (i.e. COPY, CAPTURE, and RUN)

Currently, when a COPY, CAPTURE, or RUN action fails, the system decides what to do (in most cases it keeps executing the script). A better approach would be to allow the script author to specify how the command should behave, as in:

RUN cmd:'bin/echo "HELLO WORD!"' errexit:"true"

The previous would cause the script to exit if this command fails with an error.

Need a command that does not capture content?

When we copy multiple files that need sudo capability, scp doesn't work. In this case, we could execute a sudo command to copy these files to /tmp and then execute scp. Thus, we need a dummy command that does not capture content.

Support for multiple execution runtimes

Currently, crash-diagnostics uses a single execution runtime that assumes all commands are executed remotely using ssh or scp. While this assumption is easy to implement, it also makes the tool inflexible.

This issue is for the support of different executor runtimes. This would allow the tool to support commands with no prior assumptions where/how the command will be executed.

Runtimes

For instance, the following shows several commands, each using a different runtime.

FROM hosts:"host0.local host1.local"
CAPTURE cmd:"command -param0 -param1" runtime:"ssh"
COPY path:"/var/logs/log.txt" runtime:"http"
ENV JUMP="12.34.56.78"
RUN cmd:"ssh -i $HOME/.ssh/key_rsa -J user@$JUMP user@$FROM_HOST uptime" runtime:"shell" scope:"all"

runtime:ssh

  • SSH protocol will be used for remote execution
  • Each machine in FROM will be contacted on specified port (22 default)

runtime:http

  • Assumes HTTP Get will be used to retrieve resource
  • Each machine in FROM will be contacted on specified port (80 default)

runtime:shell

  • A local shell is used to execute the command as it appears
  • The local command will be executed for each machine in FROM
  • Use scope to control how many times the command is executed

See #55 for command scopes

Starlark - Support for SSH Host Key Check in `ssh_config`

Currently, the code ignores and does not validate host keys during SSH/SCP operations. While this allows Crashd scripts to run quietly, it can be viewed as a security issue for production usage. This issue is a feature request to allow Crashd to control the host key check behavior. The host key check should be done by default. A flag should be provided for script writers to disable that behavior (useful in CI/CD or testing environments):

Skipping host key check

ssh_config(user="sshuser", hostkey_check=false)

A hostkey_check = true (default) means the SSH client will apply host key validation.

Need a way to capture arbitrary Kubernetes API objects

Currently, the tool automatically attempts to download all pertinent Kubernetes API objects (à la kubectl cluster-info --dump) that can be accessed. A better and more useful approach would be to provide a command designed to pull arbitrary objects from the cluster (if the API server is available). That way, an operator investigating an issue can pull exactly what's needed without a full dump of API objects.

Expose programmable components as API packages

The script parser and its executor need to be exposed as a clear, programmable Go API, via their respective packages, to allow developers to run scripts programmatically. This may require some refactoring to expose the following:

  • package script
  • package exec
  • package command

[code reorg] Enhance machine representation as ComputeResource

The source code currently has a component named Machine. However, with the new design the machine should be expanded to represent diverse compute resources (not just remote machines), including:

  • physical/virtual machine
  • containerized compute resource
  • Kubernetes Pod process
  • Etc.

See the design doc for what machines and compute resources are:
https://docs.google.com/document/d/1pqYOdTf6ZIT_GSis-AVzlOTm3kyyg-32-seIfULaYEs/edit#heading=h.24xlmcp7u50l

Issue in parsing the command string while using RUN or CAPTURE

While passing a command to the RUN or CAPTURE command, I see that the parsing is incorrect.

DEBU[0000] Parsing [67: RUN /bin/bash -c 'sudo find /home/cluster-info-dump -maxdepth 2 -type d -print0 | while IFS= read -rd "" dir; do for file in $dir; do mkdir -p /home/vmware-system-wcp/cluster-info/$file; files=sudo find $file -maxdepth 1 -type f; for sf in $files; do sudo cat $sf>/home/vmware-system-wcp/cluster-info/$sf; done; done; done']

However, the argument passed to the RUN command has missing strings.
For example: $dir in do for file in $dir, and similarly $file in sudo find $file -maxdepth.

The complete SSHRun debug statement -

SSHRun: /bin/bash -c 'sudo find /home/cluster-info-dump -maxdepth 2 -type d -print0 | while IFS= read -rd "" dir; do for file in ; do mkdir -p /home/vmware-system-wcp/cluster-info/; files=sudo find -maxdepth 1 -type f; for sf in ; do sudo cat >/home/vmware-system-wcp/cluster-info/; done; done; done'
The above command fails since it has missing strings.

I have verified manually that the command passed to RUN/CAPTURE in the Diagnostics file works when executed directly on the remote machine.

Looking at the code, it seems GetEffectiveCmd calls wordSplit. I think this function might need to be modified to handle this case.

My Diagnostics.file looks like this -

FROM 10.78.106.74:22
AUTHCONFIG username:vmware-system-wcp private-key:/Users/pnarasimhapr/go/src/gitlab.eng.vmware.com/guest-clusters/dev-bootstrap/bin/new-cluster-ssh-key
WORKDIR /tmp/crashdir
RUN /bin/bash -c ' sudo rm -rf /home/cluster-info-dump; rm -rf /home/vmware-system-wcp/cluster-info/; sudo mkdir -p /home/cluster-info-dump;sudo kubectl --kubeconfig /etc/kubernetes/admin.conf cluster-info dump --output-directory=/home/cluster-info-dump'
RUN /bin/bash -c 'sudo find /home/cluster-info-dump -maxdepth 2 -type d -print0 | while IFS= read -rd "" dir; do for file in $dir; do mkdir -p /home/vmware-system-wcp/cluster-info/$file; files=sudo find $file -maxdepth 1 -type f; for sf in $files; do sudo cat $sf>/home/vmware-system-wcp/cluster-info/$sf; done; done; done'
COPY /home/vmware-system-wcp/cluster-info/

Starlark - Implement `kube_config` configuration func and `kubeget*` functions

Crash Diagnostics currently supports the ability to query a Kubernetes API server for information. There should be implementation of configuration and command functions to replace existing Kubernetes directives:

  • Implement func kube_config() to replace KUBECONFIG
  • Implement func kube_capture to replace KUBEGET
  • Implement func kube_get to return Kubernetes API objects as Starlark values

Failed to parse quotes of command in two cases

  1. RUN bash -c "ulimit -a": in this case, the actual command becomes bash -c ulimit. That's because the func cmdParse() removes the " "

  2. RUN bash -c 'for xxx': in this case, there is an error. The reason is the last line of func makeNamedPram(), which quotes the string with ' '

Commands with colon do not work

Commands with embedded ":" do not work properly.

Example:

CAPTURE cmd:"curl localhost:1234/events"

This fails because the parser improperly breaks the command in two at the second colon.
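One way to fix this is to split each name:value token only at its first colon, so colons inside the value survive. A hypothetical sketch in Python (parse_named_param is not Crashd code):

```python
def parse_named_param(token):
    """Split a name:value token at the FIRST colon only, then strip quotes."""
    name, _, value = token.partition(":")
    return name, value.strip('"')

# The colon inside the URL is preserved in the value.
name, value = parse_named_param('cmd:"curl localhost:1234/events"')
```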

Starlark - Implement `hosts_provider()` and `resources()` functions

The Starlark implementation will introduce the notion of a provider and replace FROM directives with the resources function. This will allow scripts to be more flexible.

Tasks

  • Implement the hosts_provider to enumerate hosts from a provided list
  • Implement the resources() function, which will return a list of hosts based on providers

Implement all commands using ssh/scp (remove local cmd implementations)

Early on, the code base implemented commands (for COPY, RUN, and CAPTURE) in two flavors: one using Go API to execute (or local) and another flavor for remote execution (using SSH/SCP). Since the usage of the tool will most likely be always remote, it makes sense to drop the local implementation and focus on the remote-based commands.

  • Simpler codebase
  • Uniform implementation
  • Easier to maintain

Provide a Diagnostics.file example for kubeadm bootstrapped clusters

It would be helpful in the documentation to provide a fully featured example for kubeadm bootstrapped clusters.

This would be a stable target as well as allow one to get started quickly. Once globbing is available in #16 this should be easier to create, as most of this revolves around the fact that containerized components such as kube-apiserver do not have a stable log path like /var/log/kube-apiserver.log

Support for Named Configurations

Crashd supports AUTHCONFIG and KUBECONFIG that are used to configure remote host connections and connection to the API server respectively. Right now, each Diagnostics file can only use one connection.

This issue is to add the ability to name a configuration directive. This will allow the following:

  • A Diagnostics file can have one or more AUTHCONFIG or one or more KUBECONFIG
  • A command can specify a config to use by name

For instance, the following uses two different KUBECONFIGs:

KUBECONFIG name:"mgmt-kubecfg" path:"/path/to/mgmt/kubeconfig"
KUBECONFIG name:"workload-kubecfg" path:"/path/to/wl/kubecfg"

KUBEGET objects kinds:"namespaces" config:"mgmt-kubecfg"
KUBEGET objects kinds:"pods" config:"mgmt-kubecfg"
KUBEGET logs config:"workload-kubecfg"

Descriptive name for commands

I should be able to provide a descriptive name for a command, as follows:

CAPTURE cmd:"curl localhost:1234/service-stats" desc:"Service stats"

That description should be added in generated file as a title:

Service stats
-------------
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  143k    0  143k    0     0  10.0M      0 --:--:-- --:--:-- --:--:-- 10.0M

The file should be named using that description:

services_stats.log

Space separated host/node list passed as ENV var is regarded as a single host/node IP

Steps to reproduce:
Check the excerpt from the Diagnostics file

ENV HOSTS="foo.com bar.com"
...
FROM hosts:"$HOSTS" retries:"20"
...

When the run command is issued with this setup, you see the following output:

ERRO[0000] Failed to dial foo.com bar.com:22 (ssh): dial tcp: lookup foo.com bar.com: no such host: will retry connection again
ERRO[0000] Failed to dial foo.com bar.com:22 (ssh): dial tcp: lookup foo.com bar.com: no such host: will retry connection again
ERRO[0000] Failed to dial foo.com bar.com:22 (ssh): dial tcp: lookup foo.com bar.com: no such host: will retry connection again
ERRO[0000] Failed to dial foo.com bar.com:22 (ssh): dial tcp: lookup foo.com bar.com: no such host: will retry connection again

The host list in the FROM directive is treated as a single host instead of a list of hosts.
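The fix comes down to ordering: expand the variable first, then word-split the result into individual hosts. A Python sketch of the intended behavior (parse_hosts is hypothetical, for illustration):

```python
import re

def parse_hosts(value, env):
    """Expand $VAR references, then split on whitespace into a host list."""
    expanded = re.sub(r"\$(\w+)", lambda m: env.get(m.group(1), ""), value)
    return expanded.split()

hosts = parse_hosts("$HOSTS", {"HOSTS": "foo.com bar.com"})
```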

etcd crash logs

There are two really useful text snippets we can collect about etcd over SSH.

Note these have to happen on CAPI Masters

1) Results of etcd check perf, making sure the disk is fast:

etcdctl=`find / -name etcdctl` # kinda hacky
etcdctl --endpoints="https://localhost:2379" --cacert="/etc/kubernetes/pki/etcd/ca.crt" --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" check perf

Result:

root [ /home/capv ]# etcdctl --endpoints="https://localhost:2379" --cacert="/etc/kubernetes/pki/etcd/ca.crt" --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key" check perf
 60 / 60 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00% 1m0s
PASS: Throughput is 150 writes/s
PASS: Slowest request took 0.276541s
PASS: Stddev is 0.013858s
PASS

2) Getting fsync WAL metrics:

curl localhost:2381/metrics | grep fsync

Result:

# TYPE etcd_disk_wal_fsync_duration_seconds histogram   
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 0
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 202
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 1601
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 2173
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.032"} 2552                                                                 
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.064"} 2635
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.128"} 2658
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.256"} 2669
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.512"} 2674 <-- 5 writes took a half second
etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"} 2676
etcd_disk_wal_fsync_duration_seconds_bucket{le="2.048"} 2676
etcd_disk_wal_fsync_duration_seconds_bucket{le="4.096"} 2676
etcd_disk_wal_fsync_duration_seconds_bucket{le="8.192"} 2676
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 2676
etcd_disk_wal_fsync_duration_seconds_sum 33.182538455000035
etcd_disk_wal_fsync_duration_seconds_count 2676      

3) Getting cloud-init logs:

root [ /home/capv ]# cat /var/log/cloud-init-output.log | grep -i timeout
[kubelet-check] Initial timeout of 40s passed.

Starlark - Add builtins to return info for OS, Env, and Args

This issue is to track the addition of Go values to expose builtins to return OS, environment, and program argument info.

OS

  • os.putenv() and os.getenv() - functions to save/load env values
  • os.username, os.homedir, os.uid, os.gid

Args

  • args.get() - returns a flagset by name

Option to run certain commands on particular kind of nodes and not on all nodes specified in the FROM directive

I have deployed a cluster with 1 Master and 1 worker nodes.
The kubeconfig on the master and workers are different files.
Master - /etc/kubernetes/admin.conf
Workers - /etc/kubernetes/kubelet.conf

My Diagnostic.file looks like the following -

FROM 10.161.116.163:22 10.161.109.198:22
ENV CLUSTER_USER_NAME=vmware-system-user
ENV CLUSTER_SSH_KEY=/Users/pnarasimhapr/go/src/gitlab.eng.vmware.com/guest-clusters/dev-bootstrap/bin/madhwa-100-ssh-key
AUTHCONFIG username:$CLUSTER_USER_NAME private-key:$CLUSTER_SSH_KEY
WORKDIR /tmp/crashdir
RUN /bin/bash -c ' sudo rm -rf /home/cluster-info-dump; rm -rf /home/vmware-system-user/cluster-info/; sudo mkdir -p /home/cluster-info-dump;sudo kubectl --kubeconfig /etc/kubernetes/admin.conf cluster-info dump --output-directory=/home/cluster-info-dump'

The RUN command tries to run kubectl cluster-info dump on all nodes. Since kubectl takes a kubeconfig as an option, and we have two different sets of nodes (masters and workers) with different kubeconfig files, is there a way to specify that a command should run only on a particular kind of node, so that the specified kubeconfig is valid on that node?

Command label filters

Besides the scope filter (see #55), a user should be able to specify node labels that filter which nodes the command executes on:

FROM nodes:"all"
COPY /var/logs labels:"kubernetes.io/hostname=control-plane"

The COPY command will execute only on nodes that match the provided label.
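Label filtering could follow Kubernetes selector semantics: a node matches when it carries every key=value pair in the selector. A minimal Python sketch (the matches helper is hypothetical, not Crashd code):

```python
def matches(node_labels, selector):
    """True if node_labels contains every key=value pair in the selector."""
    wanted = dict(pair.split("=", 1) for pair in selector.split(","))
    return all(node_labels.get(k) == v for k, v in wanted.items())

nodes = [
    {"kubernetes.io/hostname": "control-plane"},
    {"kubernetes.io/hostname": "worker-1"},
]
selected = [n for n in nodes if matches(n, "kubernetes.io/hostname=control-plane")]
```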

Uniform named parameter format for all directives

Make all directives support the same parameter format (where applicable):

  • Right now some directives take named parameters and some don't
  • Some named parameters have different formats, etc.

This is to ensure that all directives use the same named parameter format:

DIRECTIVE param0:value0 param1:value0

For some directives, it may make sense for named parameters to be optional.

Add support for SSH-agent protocol

Currently, the tool forces the use of a direct private key. It should also fall back to the SSH agent for a valid key when the private key is not provided.

Support alternate backend to execute remote commands

Hi,

Is there a plan to support some alternative for ssh to execute commands on Nodes? For example, I believe the Kind Node image doesn't come with an SSH server by default, and it would be great if crash-diagnostics supported the docker SDK as well (docker exec, docker cp, ...).

Support for globbing in COPY paths

Currently, COPY copies only the specified files or directories. An enhancement would be to allow support for path globs (Unix-style path patterns):

  • COPY /var/log/kube-*.log
  • COPY $HOME/bin/*
  • Etc.
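The requested Unix-style patterns are what fnmatch-style globbing provides. A sketch in Python of how such patterns would select candidate paths:

```python
import fnmatch

paths = [
    "/var/log/kube-apiserver.log",
    "/var/log/kube-scheduler.log",
    "/var/log/syslog",
]

# A pattern like /var/log/kube-*.log would select only the kube-* logs.
matched = [p for p in paths if fnmatch.fnmatch(p, "/var/log/kube-*.log")]
```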

Assemble runtime errors in an error log

Currently, when runtime errors are encountered, they are dumped on std error and/or captured in the file whose command generated the error. As a crashd script developer, it would be nice to be able to review all errors in a single file or single location.

Crashd should log command and their associated erroneous result in a file so runtime errors can be easily reviewed.

[code reorg] Refactor commands into a more generic and flexible types

Currently, commands are represented internally by several strongly typed structures. While this works well, it forces the creation and declaration of a new type every time a new command is needed.

This change should refactor the code to use a more generic representation of commands that makes it possible to dynamically create new commands without adding new types for each command. This will also allow support for upcoming features.

Support for command scopes

By default, all commands listed in a crash-diag file are executed on each machine specified in the FROM list. This issue is to introduce a command scope to specify which machines get the command.

For instance,

FROM hosts:"host0.local host1.local"
CAPTURE cmd:"command -param0 -param1" scope:"all"
COPY path:"/var/logs/log.txt" scope:"host0.local"
ENV JUMP="12.34.56.78"
RUN cmd:"ssh -i $HOME/.ssh/key_rsa -J user@$JUMP user@$FROM_HOST uptime" scope:"local"

Scopes

  • All - means the command will be executed on all machines in the FROM list
  • Host List - scope can be a list of machines where command will be executed
  • Local - a local scope means the command will be executed once using the local machine (most likely using a shell runtime, see #54)

Starlark - Implement base package for Starlark support

A base package is needed to support running Starlark code. This package should expose a simple way to run the code via an io.Reader:

source := `
ssh_config(username="sshuser", private_key_path="/some/path")
`
exe := starlark.New()
exe.Exec("file.star", strings.NewReader(source))

Tasks:

  • Create a package to façade the starlark package
  • Add support functions

Connection settings for executor commands (RUN/CAPTURE/COPY)

Commands that execute on remote machines use SSH/SCP by default. However, these commands will retry for a set amount of time or a set number of times.

This issue is to create a new directive called CONNECTION to specify connection settings for all executing commands.

New RUN directive for running commands uncaptured

Currently CAPTURE is the only way to run a command. However, it captures the result in a file that gets aggregated as part of the generated diagnostic archive bundle.

A new action is needed (i.e. RUN) to execute a command and capture the result as a session variable that can be accessed by downstream actions:

RUN ls /var/log/containers/file.log

The result of the command is stored in a session variable and can be accessed using variable expansion.

This should help solve #4

Make the FROM directive more user-friendly

A few ideas:

  • enable the user to provide the Node names; crash-diagnostics can connect to the K8s apiserver to retrieve the IPs
  • enable the user to provide host names along with an ssh-config file
  • enable wildcard support so that commands can be run easily on all Nodes
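The wildcard idea could reuse Go's standard glob matching, e.g. via path.Match (a minimal sketch; node names are examples):

```go
package main

import (
	"fmt"
	"path"
)

// matchNodes filters node names against a glob-style pattern,
// e.g. "worker-*" selects all worker nodes.
func matchNodes(nodes []string, pattern string) []string {
	var out []string
	for _, n := range nodes {
		if ok, _ := path.Match(pattern, n); ok {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []string{"control-plane-0", "worker-0", "worker-1"}
	fmt.Println(matchNodes(nodes, "worker-*")) // [worker-0 worker-1]
	fmt.Println(matchNodes(nodes, "*"))        // [control-plane-0 worker-0 worker-1]
}
```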

[code reorg] Implement support for Executor component

This code reorg is to introduce support for multiple executor backends. Currently, the executor backend is hardcoded to only support SSH on a remote machine. This code reorg will make it possible to support diverse execution backends for different compute infrastructures (local, remote, containerized, etc.).

See the design doc for details:
https://docs.google.com/document/d/1pqYOdTf6ZIT_GSis-AVzlOTm3kyyg-32-seIfULaYEs/edit#heading=h.tc5gyk93z2c6
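The core of such a reorg is an interface that the directive runner depends on, with one implementation per backend. The shape below is illustrative, not the project's final API:

```go
package main

import "fmt"

// CommandResult captures the outcome of an executed command.
type CommandResult struct {
	Host   string
	Output string
	Err    error
}

// Executor abstracts a command execution backend so that SSH,
// local shell, or containerized runtimes can be swapped in.
type Executor interface {
	Exec(cmd string) CommandResult
}

// localExec runs commands on the local machine (stubbed here).
type localExec struct{}

func (localExec) Exec(cmd string) CommandResult {
	return CommandResult{Host: "local", Output: "ran: " + cmd}
}

// sshExec would run commands over SSH on a remote host (stubbed here).
type sshExec struct{ host string }

func (s sshExec) Exec(cmd string) CommandResult {
	return CommandResult{Host: s.host, Output: "ssh ran: " + cmd}
}

func main() {
	// The directive runner only sees the Executor interface.
	for _, e := range []Executor{localExec{}, sshExec{host: "host0.local"}} {
		fmt.Println(e.Exec("uptime").Host)
	}
}
```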

`OUTPUT` to support standard output

Currently, OUTPUT only supports a file path. For quick debugging scenarios, it would be helpful to output directly to standard out, as in something like:

OUTPUT stdout

[Umbrella] Starlark - Configuration Language for Crash Diagnostics Files

Currently, Crash Diagnostics uses a custom file format for its diagnostics file to orchestrate information collection from specified compute resources (such as clustered virtual machines or Kubernetes pods).

  • While the file format of Crash Diagnostics can be described as simplistic, maintaining the hand-written parser is a large burden. The parser and its test code make up at least 70% of the codebase (as of version v0.2.2).
  • Adding new features to the current language takes time and requires a large number of tests to ensure they work as intended
  • Switching to an established language (or configuration dialect) will allow the Crash Diagnostics team to focus on augmenting the functionality of the tool and not maintaining a language.

This issue proposes the use of the Starlark configuration language to specify directives in Crash Diagnostics files. Starlark is a derivative of Python that originated from the Bazel build system from Google. Starlark is a stand-alone project and is used in several high-profile projects in the cloud native space.

Please see the proposed design changes and its use of Starlark in this Google Doc document

Parity Implementation

In order to convert the codebase to support Starlark, a parity implementation will be needed to bring Crash Diagnostics on par with the current release, v0.2.2. The following tasks bring v0.3.0 to parity with the current feature set:

  • AS directive -> crashd_config
  • AUTHCONFIG -> ssh_config
  • KUBECONFIG -> kube_config
  • OUTPUT -> crashd_config
  • WORKDIR -> crashd_config
  • The FROM directive will go away and will be replaced with function resources() which will return a list of compute resources.
  • Implement existing command directives as Starlark functions
    • RUN -> run
    • CAPTURE -> capture
    • COPY -> copy
  • Implement existing Kubernetes directives as Starlark functions
    • KUBEGET -> kubeget, with convenience functions kubeget_objects, kubeget_logs, kubeget_all
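Under this mapping, a Crash Diagnostics file might read like the following Starlark fragment (function signatures are illustrative and may differ from the final v0.3.0 API):

```
crashd_config(workdir="/tmp/crashd")
ssh_config(username="sshuser", private_key_path="/some/path")
kube_config(path="/home/user/.kube/config")

# resources() replaces the FROM directive
hosts = resources(hosts=["host0.local", "host1.local"])

run(cmd="uptime", resources=hosts)
capture(cmd="df -h", resources=hosts)
copy(path="/var/log/syslog", resources=hosts)
kubeget_logs(namespaces=["kube-system"])
```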

Memory Capture

I was thinking it could be useful to have a command to capture a memory dump, particularly in the context of incident response and forensics (not just crash debugging).

A similar tool to what I'm thinking of is margaritashotgun.
