
turbulence-release's Issues

Shutdown command fails

I receive the following error when attempting to use the Shutdown attack without force enabled:

Task execution: Halting: Running command: 'halt', stdout: '', stderr: 'shutdown: Unable to shutdown system
 ': exit status 1

Full JSON configuration:

{
  "Tasks": [{
    "Type": "Shutdown"
  }],

  "Selector": {
    "Deployment": {
      "Name": "cf"
    }
  }
}

While the systems do not end up shutting down, they do receive a broadcast saying the system is going to shut down, despite the command failing:

Broadcast message from root@5dc8c8d9-32a8-48c6-9585-95d70fdecbfc
        (unknown) at 21:48 ...

The system is going down for halt NOW!

I also tried using force and reboot; the end result was that the targeted instances became "unresponsive agents" with no sign of booting back up.
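
For reference, the force variant mentioned above would look roughly like this; this is a sketch that assumes the Shutdown task exposes a boolean Force option, which is not verified here:

{
  "Tasks": [{
    "Type": "Shutdown",
    "Force": true
  }],

  "Selector": {
    "Deployment": {
      "Name": "cf"
    }
  }
}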

Turbulence agent uses first_address, but the API might have a floating IP

I configured the turbulence API to have a floating IP so that I can access it from a browser without tunneling. Therefore, the SSL cert is also for the floating IP, not for the internal IP.

However, the turbulence agent simply uses api.first_address as the contact address, which resolves to the internal IP. Consequently, the agent cannot validate the certificate and the connection fails.

$ tail /var/vcap/sys/log/turbulence_agent/stderr.log
[Agent] 2016/07/26 16:28:30 ERROR - Failed fetching tasks: Fetching tasks '984e4ae8-064c-45f0-b917-4a0d19e702b9': Performing request POST 'https://turbulence:<redacted>@192.168.0.13:8080/api/v1/agents/984e4ae8-064c-45f0-b917-4a0d19e702b9/tasks': Performing POST request: Post https://turbulence:<redacted>@192.168.0.13:8080/api/v1/agents/984e4ae8-064c-45f0-b917-4a0d19e702b9/tasks: x509: certificate is valid for <redacted>, not 192.168.0.13

Packet drop task

It would be useful, in order to test certain failure modes, to have the ability to simulate packet drops. Something like this would likely be enough, but even killing random TCP connections would be handy.
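
In the meantime, the ControlNet task's Loss option (shown in the "tc error when using Loss and Delay with ControlNet" issue below) can approximate dropped packets. A rough sketch, with placeholder selector values:

{
  "Tasks": [{
    "Type": "ControlNet",
    "Timeout": "30s",
    "Loss": "30%"
  }],

  "Selector": {
    "Deployment": {
      "Name": "cf"
    }
  }
}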

Killing VMs on Azure does not work because of semicolons in the VM CID

Hi,

Trying to kill VMs on Azure does not work because of the semicolons in the VM CID:
(related issue cloudfoundry/bosh-cli#512 was fixed with cloudfoundry/bosh-cli#513)

Deleting VM 'agent_id:eb967e44-3ce2-4602-adc1-8c4f4bde0ed3;resource_group_name:test;storage_account_name:test': Expected task '581861' to succeed but was state is 'error'
$ bosh task 581861
...
Task 581861 | 23:25:16 | Delete VM: agent_id:eb967e44-3ce2-4602-adc1-8c4f4bde0ed3 (00:00:03)
                      L Error: CPI error 'Bosh::Clouds::CloudError' with message 'Invalid instance id (plain) 'agent_id:eb967e44-3ce2-4602-adc1-8c4f4bde0ed3'' in 'delete_vm' CPI method (CPI request ID: 'cpi-569776')

Please adapt the bosh-cli fix to turbulence-release as well.

Turbulence cannot be authorized on the BOSH director

Hey, @cppforlife.

I am trying to run turbulence (v0.5) with BOSH (v1.3262.3), which has authentication through UAA with a CA cert. I provided all the necessary data during the BOSH deployment:

bosh -n -d turbulence deploy ./manifests/turbulence.yml \
  -v turbulence_api_ip=$TURBULENCE_API_IP \
  -v director_ip=$DIRECTOR_IP \
  --var-file director_ssl_ca=$DIRECTOR_SSL_CA \
  -v director_client=$DIRECTOR_CLIENT \
  -v director_client_secret=$DIRECTOR_CLIENT_SECRET \
  --vars-store ./creds.yml

When I try to run a "kill" incident, I get the following error:

Failed to execute incident: Finding deployments: Director responded with non-successful status code '401' response 'Not authorized: '/deployments'

You can find logs with more details here. With the same values I can log in to the director using the golang bosh-cli. I've created a simple Go script to reproduce this problem.

Could you please tell me what I might be doing wrong? Thank you.

Turbulence agent job assumes hostname returns BOSH agent ID

This is not the case with xenial stemcells. Ideally, it should look up the agent id in /var/vcap/bosh/settings.json.

This causes turbulence to not work when the agent is on xenial stemcells.
This is the error that we get from the API:

Timed out waiting for agent 'fc9fd54b-765a-4268-bc20-95597ead9110' to consume tasks

Lines of code causing this problem:

# todo hostname is not that good
agent_id=$(hostname)
sed -i "s:_agent_id_:${agent_id}:g" $CONF_DIR/config.json
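
A minimal sketch of the suggested fix, assuming agent_id is a top-level string field in /var/vcap/bosh/settings.json (its usual location on BOSH stemcells):

# Read the BOSH agent ID from settings.json instead of relying on hostname.
# Plain grep/cut is used here since jq may not be present on the stemcell.
agent_id=$(grep -o '"agent_id"[[:space:]]*:[[:space:]]*"[^"]*"' /var/vcap/bosh/settings.json | head -n1 | cut -d'"' -f4)
sed -i "s:_agent_id_:${agent_id}:g" $CONF_DIR/config.json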

cc @akshaymankar

A few questions

  1. Does it depend on k8s (Kubernetes) and Docker?

  2. How do we deploy the project to the server (VMs) and the agent to the VMs?

  3. We need to simulate the following actions on the agent servers (VMs):
    Incident Tasks: Noop, Kill (Kill Process), Stress, Firewall, Control Network, Fill Disk, Shutdown

Looking forward to your reply, thanks a lot!

tc error when using Loss and Delay with ControlNet

I am consistently getting the error RTNETLINK answers: File exists when I attempt to use a ControlNet incident with both Delay and Loss. Furthermore, after this configuration is used, the error continues to happen even for ControlNet incidents that specify only one or the other, although those have been tested and work on their own otherwise.

The following configuration consistently yields this problem.

{
  "Tasks": [{
    "Type": "ControlNet",
    "Timeout": "30s",
    "Delay": "100ms",
    "Loss": "30%"
  }],

  "Selector": {
    "Deployment": {
      "Name": "cf"
    },
    "Group": {
      "Name": "diego-cell"
    }
  }
}

The full error reported is:

Task execution: Shelling out to tc to add packet loss: Running command: 'tc qdisc add dev silk-vtep root netem loss 30% 75%', stdout: '', stderr: 'RTNETLINK answers: File exists
': exit status 2

When using ping -f -c 500 <DIEGO-CELL-IP>, I find that it does introduce some latency, which does not seem to go away, but it does not introduce any packet loss.

This issue does not occur if the tasks are split into separate incidents, like this:

{
  "Tasks": [{
    "Type": "ControlNet",
    "Timeout": "30s",
    "Delay": "100ms"
  },{
    "Type": "ControlNet",
    "Timeout": "30s",
    "Loss": "30%"
  }],

  "Selector": {
    "Deployment": {
      "Name": "cf"
    },
    "Group": {
      "Name": "diego-cell"
    }
  }
}

The error also occurs if Loss and LossCorrelation are specified without Delay.
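
For context, the failure appears to come from issuing tc qdisc add a second time against a root qdisc that already has a netem discipline; netem itself accepts delay and loss in a single invocation. A standalone sketch, not the release's code (eth0 is a placeholder interface):

# Add delay and loss together so only one 'qdisc add' is issued:
tc qdisc add dev eth0 root netem delay 100ms loss 30%

# An existing netem qdisc can also be adjusted in place instead of re-added:
tc qdisc change dev eth0 root netem delay 100ms loss 30%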

Can't update submodules

When you clone this repo and run git submodule update, it fails with fatal: No url found for submodule path 'src/github.com/onsi/ginkgo' in .gitmodules

We are trying to add this release as a submodule in our repo, so when we recursively clone this project it fails. It looks like you need to commit the .gitmodules file.
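
For illustration, the missing .gitmodules entry for the path in the error would look something like this (the URL is assumed to be the upstream ginkgo repository):

# Assumed entry; other submodules under src/ would need entries as well.
[submodule "src/github.com/onsi/ginkgo"]
    path = src/github.com/onsi/ginkgo
    url = https://github.com/onsi/ginkgo.git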

New release?

Hello, you made some changes after release 0.4; would you mind creating and uploading a 0.5 release?

Thanks

You shouldn't be using pid_utils with a script name of 'ctl'

pid_utils redirects the output to /var/vcap/monit/<script_name>.{out,err}.log, so if other releases do the same thing, everything ends up in one logfile.

You should probably either:

  • name the logfile after the job's name, not the script's name, or
  • include the job's name in the script's name, as e.g. cf-release or concourse do.

Incident JSON schema doesn't conform to manifest schema v2

An incident is scheduled like this:

{
    "Tasks": [{
        "Type": "kill"
    }],
    "Deployments": [{
        "Name": "dummy",
        "Jobs": [{
            "Name": "*_z1",
            "Limit": "1"
        }]
    }]
}

but should probably look like this:

{
    "Tasks": [{
        "Type": "kill"
    }],
    "Deployments": [{
        "Name": "dummy",
        "instance_groups": [{
            "Name": "*_z1",
            "Limit": "1"
        }]
    }]
}

I.e., rename Jobs to instance_groups.

Kill-process incident on a certain process doesn't last long

Hello.

I was trying to use turbulence-release to perform turbulence tests on one of our BOSH deployments. I am facing some issues with the kill-process incident. When I trigger this incident, the process goes down, but only for a few (~2-3) seconds. After that, monit revives the failed process and it is running again.

Upon taking a look at the codebase, I found that the incident causes a kill or pkill of the process specified in the incident, so monit brings it back in its next monitoring iteration. Could we instead do a monit stop <process> in order to simulate a longer downtime for the process? We could then bring the process back after a timeout, or else leave it to the user to bring it back.
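
A rough sketch of that suggestion (not the release's current behavior; the process name and downtime below are placeholders):

# Stop the process via monit so it is not revived immediately,
# then bring it back after the desired downtime.
PROCESS_NAME="some-process"
DOWNTIME_SECONDS=300
monit stop "${PROCESS_NAME}"
sleep "${DOWNTIME_SECONDS}"
monit start "${PROCESS_NAME}"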

Thanks!
