cppforlife / turbulence-release
Turbulence release is used for injecting failure scenarios into any BOSH deployment.
License: Apache License 2.0
I receive the following error when attempting to use the Shutdown attack without force enabled:
Task execution: Halting: Running command: 'halt', stdout: '', stderr: 'shutdown: Unable to shutdown system
': exit status 1
Full json configuration:
{
"Tasks": [{
"Type": "Shutdown"
}],
"Selector": {
"Deployment": {
"Name": "cf"
}
}
}
While the systems do not end up shutting down, they do receive a broadcast saying the system is going to shut down, even though the command fails:
Broadcast message from root@5dc8c8d9-32a8-48c6-9585-95d70fdecbfc
(unknown) at 21:48 ...
The system is going down for halt NOW!
I also tried using force and reboot. In both cases the targeted instances became "unresponsive agents" with no sign of booting back up.
cc @voelzmo
I configured the turbulence API to have a floating IP, so I can access it with the browser without tunneling. Therefore, the SSL cert is also issued for the floating IP, not for the internal IP.
However, the turbulence-agent simply uses api.first_address as its contact address, which resolves to the internal IP. Consequently, the agent cannot validate the certificate and the connection fails.
$ tail /var/vcap/sys/log/turbulence_agent/stderr.log
[Agent] 2016/07/26 16:28:30 ERROR - Failed fetching tasks: Fetching tasks '984e4ae8-064c-45f0-b917-4a0d19e702b9': Performing request POST 'https://turbulence:<redacted>@192.168.0.13:8080/api/v1/agents/984e4ae8-064c-45f0-b917-4a0d19e702b9/tasks': Performing POST request: Post https://turbulence:<redacted>@192.168.0.13:8080/api/v1/agents/984e4ae8-064c-45f0-b917-4a0d19e702b9/tasks: x509: certificate is valid for <redacted>, not 192.168.0.13
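To make the failure mode concrete, here is a minimal sketch of the check that fails: the address the agent dials has to appear among the names the certificate was issued for. This is a simplification, not the Go TLS stack's actual verification (no wildcard handling, no chain validation), and the floating IP 203.0.113.10 is a made-up placeholder.

```python
import ipaddress


def address_matches_cert(address, san_entries):
    """Return True if `address` is covered by one of the certificate's names.

    Simplified on purpose: IP addresses are compared after normalization,
    DNS names are compared case-insensitively, wildcards are ignored.
    """
    try:
        target_ip = ipaddress.ip_address(address)
    except ValueError:
        target_ip = None  # the agent dialed a DNS name, not an IP

    for entry in san_entries:
        if target_ip is not None:
            try:
                if ipaddress.ip_address(entry) == target_ip:
                    return True
            except ValueError:
                continue  # entry is a DNS name and cannot match an IP
        elif entry.lower() == address.lower():
            return True
    return False


# Hypothetical setup: the certificate lists only the floating IP.
cert_names = ["203.0.113.10"]
print(address_matches_cert("203.0.113.10", cert_names))  # True
print(address_matches_cert("192.168.0.13", cert_names))  # False
```

Under this reading, either the certificate would need to list the internal IP as well, or the agent would need a way to be pointed at the address the certificate was issued for.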
It would be useful, in order to test certain failure modes, to have the ability to simulate packet drops. Something like this would likely be enough, but even killing random TCP connections would be handy.
Hi,
trying to kill VMs on Azure does not work because of the semicolons in the VM CID:
(related issue cloudfoundry/bosh-cli#512 was fixed with cloudfoundry/bosh-cli#513)
Deleting VM 'agent_id:eb967e44-3ce2-4602-adc1-8c4f4bde0ed3;resource_group_name:test;storage_account_name:test': Expected task '581861' to succeed but was state is 'error'
$ bosh task 581861
...
Task 581861 | 23:25:16 | Delete VM: agent_id:eb967e44-3ce2-4602-adc1-8c4f4bde0ed3 (00:00:03)
L Error: CPI error 'Bosh::Clouds::CloudError' with message 'Invalid instance id (plain) 'agent_id:eb967e44-3ce2-4602-adc1-8c4f4bde0ed3'' in 'delete_vm' CPI method (CPI request ID: 'cpi-569776')
Please apply the bosh-cli fix to turbulence-release as well.
cc @voelzmo
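For illustration, here is one common way an identifier gets cut off at the first semicolon: when it is placed unescaped into a URL, a parser may treat everything after `;` as path parameters. Whether this is exactly what happens inside the release is an assumption; the `/vms/` endpoint and the director host below are hypothetical and only serve to show the effect. The truncated value it prints does match the id in the CPI error above.

```python
from urllib.parse import quote, unquote, urlparse

cid = ("agent_id:eb967e44-3ce2-4602-adc1-8c4f4bde0ed3;"
       "resource_group_name:test;storage_account_name:test")

# Hypothetical endpoint, for illustration only.
base = "https://director.example:25555/vms/"

# Unescaped: the parser splits the last path segment at the first ';'.
raw = urlparse(base + cid)

# Percent-encoded: ';' becomes %3B, so the segment stays in one piece.
encoded = urlparse(base + quote(cid, safe=""))

print(raw.path)               # /vms/agent_id:eb967e44-3ce2-4602-adc1-8c4f4bde0ed3
print(unquote(encoded.path))  # /vms/ followed by the full CID
```

Percent-encoding the CID before building the request keeps all three `key:value` parts together.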
Hey, @cppforlife.
I am trying to run turbulence (v0.5) with BOSH (v1.3262.3) that has authentication through UAA with CA cert. I provided all necessary data during BOSH deployment:
bosh -n -d turbulence deploy ./manifests/turbulence.yml \
-v turbulence_api_ip=$TURBULENCE_API_IP \
-v director_ip=$DIRECTOR_IP \
--var-file director_ssl_ca=$DIRECTOR_SSL_CA \
-v director_client=$DIRECTOR_CLIENT \
-v director_client_secret=$DIRECTOR_CLIENT_SECRET \
--vars-store ./creds.yml
When I try to run a "kill" incident, I get the following error:
Failed to execute incident: Finding deployments: Director responded with non-successful status code '401' response 'Not authorized: '/deployments'
You can find logs with more details here. With the same values I can log in to the director using the golang bosh-cli. I've created a simple Go script to reproduce this problem.
Could you please tell me what I might be doing wrong? Thank you.
This is not the case with xenial stemcells. Ideally, it should look up the agent id in /var/vcap/bosh/settings.json.
This causes turbulence to not work when the agent is on xenial stemcells.
This is the error that we get from the API:
Timed out waiting for agent 'fc9fd54b-765a-4268-bc20-95597ead9110' to consume tasks
Lines of code causing this problem:
turbulence-release/jobs/turbulence_agent/templates/ctl.erb
Lines 14 to 16 in f620936
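A rough sketch of the lookup suggested above, reading the id from the settings file instead of deriving it some other way. It assumes the file is JSON with a top-level `agent_id` key; the function name is made up for this example and this is not the code in ctl.erb.

```python
import json


def read_agent_id(settings_path="/var/vcap/bosh/settings.json"):
    """Return the agent id stored in the BOSH agent settings file.

    Assumes a top-level "agent_id" key; raises KeyError if it is missing.
    """
    with open(settings_path, encoding="utf-8") as f:
        settings = json.load(f)
    agent_id = settings.get("agent_id")
    if not agent_id:
        raise KeyError(f"no agent_id found in {settings_path}")
    return agent_id
```

Calling `read_agent_id()` on a VM would then return the same id regardless of stemcell line, provided the key is present in the file.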
this stuff is broken
Does the project depend on k8s (Kubernetes) and Docker?
How do we deploy the project on servers (VMs), and how do we deploy the agent on VMs?
We need to simulate the following actions on the agent servers (VMs):
Incident Tasks: Noop, Kill (Kill Process), Stress, Firewall, Control Network, Fill Disk, Shutdown
Looking forward to your reply, thanks a lot!
I am consistently getting the error RTNETLINK answers: File exists when I attempt to use a ControlNet incident with both Delay and Loss. Furthermore, after this configuration is used, the error continues to happen even for ControlNet incidents which have only one or the other and have been tested to work on their own otherwise.
The following configuration consistently yields this problem.
{
"Tasks": [{
"Type": "ControlNet",
"Timeout": "30s",
"Delay": "100ms",
"Loss": "30%"
}],
"Selector": {
"Deployment": {
"Name": "cf"
},
"Group": {
"Name": "diego-cell"
}
}
}
The full error reported is:
Task execution: Shelling out to tc to add packet loss: Running command: 'tc qdisc add dev silk-vtep root netem loss 30% 75%', stdout: '', stderr: 'RTNETLINK answers: File exists
': exit status 2
When using ping -f -c 500 <DIEGO-CELL-IP>, I find that it does introduce some latency, which does not seem to go away, but it does not introduce any packet loss.
This issue does not occur if the incidents are separated as such:
{
"Tasks": [{
"Type": "ControlNet",
"Timeout": "30s",
"Delay": "100ms"
},{
"Type": "ControlNet",
"Timeout": "30s",
"Loss": "30%"
}],
"Selector": {
"Deployment": {
"Name": "cf"
},
"Group": {
"Name": "diego-cell"
}
}
}
The error also occurs if Loss and LossCorrelation are specified without Delay.
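A possible explanation, offered as a guess rather than a diagnosis: `tc qdisc add ... root` fails with "File exists" whenever the device already has a root qdisc, so issuing one `add` for delay and a second `add` for loss would fail on the second call, and a leftover qdisc would keep later incidents failing too. netem accepts delay and loss in a single qdisc, so one way around this is to build one command. The helper names below are made up for this sketch and are not the release's code.

```python
def netem_args(dev, delay=None, loss=None, loss_correlation=None, action="add"):
    """Build a single tc command that sets delay and/or loss on one netem qdisc.

    `action` may be "add" or "replace"; "replace" also succeeds when a
    root qdisc is already present on the device.
    """
    args = ["tc", "qdisc", action, "dev", dev, "root", "netem"]
    if delay:
        args += ["delay", delay]
    if loss:
        args += ["loss", loss]
        if loss_correlation:
            args.append(loss_correlation)
    if not delay and not loss:
        raise ValueError("at least one of delay or loss is required")
    return args


def netem_reset_args(dev):
    """Build the tc command that removes the root qdisc from a device."""
    return ["tc", "qdisc", "del", "dev", dev, "root"]


print(" ".join(netem_args("silk-vtep", delay="100ms", loss="30%",
                          loss_correlation="75%")))
# tc qdisc add dev silk-vtep root netem delay 100ms loss 30% 75%
print(" ".join(netem_reset_args("silk-vtep")))
# tc qdisc del dev silk-vtep root
```

Running the reset command by hand on an affected diego-cell may also clear the lingering latency described above, if a stale netem qdisc is indeed the cause.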
Hi,
The description of the BlockBOSHAgent [1] boolean for firewall incidents seems to be contradictory. It says:
set BlockBOSHAgent (bool) to false to block access to the BOSH Agent
It seems like you would want to say:
set BlockBOSHAgent (bool) to true to block access to the BOSH Agent
Given that BlockBOSHAgent is set to false by default.
[1] https://github.com/cppforlife/turbulence-release/blob/master/docs/api.md#firewall
When you clone this repo and run git submodule update, it fails with fatal: No url found for submodule path 'src/github.com/onsi/ginkgo' in .gitmodules
We are trying to submodule this release in our repo, and so when we recursively clone this project it fails. It looks like you need to commit the .gitmodules file.
Hello, you made some changes after release 0.4. Would you mind creating and uploading a 0.5 release?
Thanks
pid_utils redirects the output to /var/vcap/monit/<script_name>.{out,err}.log, so if other releases do the same thing, everything ends up in one logfile.
Probably you should either:
An incident is scheduled like this:
{
"Tasks": [{
"Type": "kill"
}],
"Deployments": [{
"Name": "dummy",
"Jobs": [{
"Name": "*_z1",
"Limit": "1"
}]
}]
}
but should probably look like this:
{
"Tasks": [{
"Type": "kill"
}],
"Deployments": [{
"Name": "dummy",
"instance_groups": [{
"Name": "*_z1",
"Limit": "1"
}]
}]
}
I.e., rename jobs to instance_groups
Hello.
I was trying to use the turbulence-release to perform turbulence tests on one of our BOSH deployments. I am facing some issues with the kill-process incident. When I trigger this incident, the process goes down, but only for a few (~2-3) seconds. After that, monit revives the process and it is running again.
Looking at the codebase, I see that the incident performs a kill or pkill of the process specified in the incident. Thus, monit brings it back in its next monitoring iteration. Could we instead do a monit stop <process> in order to simulate a longer downtime for the process? We could then bring the process back after a timeout, or else leave it to the user to bring the process back.
Thanks!
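To make the proposal concrete, a small sketch of the stop, wait, start sequence. It is not how the release currently behaves; the monit path is an assumption about BOSH VMs, and the `run` and `sleep` parameters exist only so the sequence can be exercised without touching a real monit.

```python
import subprocess
import time

MONIT = "/var/vcap/bosh/bin/monit"  # assumed location of monit on a BOSH VM


def stop_for(process, seconds, run=subprocess.check_call, sleep=time.sleep):
    """Stop a monit-managed process, wait, then start it again.

    `monit stop` unmonitors the process, so monit does not revive it
    during the wait. The start runs even if the wait is interrupted.
    """
    run([MONIT, "stop", process])
    try:
        sleep(seconds)
    finally:
        run([MONIT, "start", process])
```

For example, `stop_for("my-process", 30)` would keep a (hypothetically named) process down for about 30 seconds. Leaving out the final start would correspond to the "leave it to the user" variant.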