
turbulence-release's Issues

Shutdown command fails

I receive the following error when attempting to use the Shutdown attack without force enabled:

Task execution: Halting: Running command: 'halt', stdout: '', stderr: 'shutdown: Unable to shutdown system
 ': exit status 1

Full JSON configuration:

{
  "Tasks": [{
    "Type": "Shutdown"
  }],

  "Selector": {
    "Deployment": {
      "Name": "cf"
    }
  }
}

While the systems do not end up shutting down, they do receive a broadcast saying the system is going to shut down, despite the command failing:

Broadcast message from root@5dc8c8d9-32a8-48c6-9585-95d70fdecbfc
        (unknown) at 21:48 ...

The system is going down for halt NOW!

I also tried using force and reboot; the end result was that the targeted instances became "unresponsive agents" with no sign of booting back up.
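
For reference, the force variant mentioned above would look roughly like this; this is a sketch that assumes the Shutdown task exposes a boolean Force option, which is not verified here:

{
  "Tasks": [{
    "Type": "Shutdown",
    "Force": true
  }],

  "Selector": {
    "Deployment": {
      "Name": "cf"
    }
  }
}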

Turbulence agent uses first_address, but the API might have a floating IP

I configured the turbulence API to have a floating IP so that I can access it from a browser without tunneling. Therefore, the SSL cert is also for the floating IP, not for the internal IP.

However, the turbulence agent simply uses api.first_address as the contact address, which resolves to the internal IP. Consequently, the agent cannot validate the certificate and the connection fails.

$ tail /var/vcap/sys/log/turbulence_agent/stderr.log
[Agent] 2016/07/26 16:28:30 ERROR - Failed fetching tasks: Fetching tasks '984e4ae8-064c-45f0-b917-4a0d19e702b9': Performing request POST 'https://turbulence:<redacted>@192.168.0.13:8080/api/v1/agents/984e4ae8-064c-45f0-b917-4a0d19e702b9/tasks': Performing POST request: Post https://turbulence:<redacted>@192.168.0.13:8080/api/v1/agents/984e4ae8-064c-45f0-b917-4a0d19e702b9/tasks: x509: certificate is valid for <redacted>, not 192.168.0.13

Packet drop task

It would be useful, in order to test certain failure modes, to have the ability to simulate packet drops. Something like this would likely be enough, but even killing random TCP connections would be handy.
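
In the meantime, the ControlNet task's Loss option (shown in the "tc error when using Loss and Delay with ControlNet" issue below) can approximate dropped packets. A rough sketch, with placeholder selector values:

{
  "Tasks": [{
    "Type": "ControlNet",
    "Timeout": "30s",
    "Loss": "30%"
  }],

  "Selector": {
    "Deployment": {
      "Name": "cf"
    }
  }
}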

Killing VMs on Azure does not work because of semicolons in the VM CID

Hi,

Trying to kill VMs on Azure does not work because of the semicolons in the VM CID:
(related issue cloudfoundry/bosh-cli#512 was fixed with cloudfoundry/bosh-cli#513)

Deleting VM 'agent_id:eb967e44-3ce2-4602-adc1-8c4f4bde0ed3;resource_group_name:test;storage_account_name:test': Expected task '581861' to succeed but was state is 'error'
$ bosh task 581861
...
Task 581861 | 23:25:16 | Delete VM: agent_id:eb967e44-3ce2-4602-adc1-8c4f4bde0ed3 (00:00:03)
                      L Error: CPI error 'Bosh::Clouds::CloudError' with message 'Invalid instance id (plain) 'agent_id:eb967e44-3ce2-4602-adc1-8c4f4bde0ed3'' in 'delete_vm' CPI method (CPI request ID: 'cpi-569776')

Please adapt the bosh-cli fix to turbulence-release as well.

Turbulence cannot be authorized on the BOSH director

Hey, @cppforlife.

I am trying to run turbulence (v0.5) with BOSH (v1.3262.3), which has authentication through UAA with a CA cert. I provided all the necessary data during the BOSH deployment:

bosh -n -d turbulence deploy ./manifests/turbulence.yml \
  -v turbulence_api_ip=$TURBULENCE_API_IP \
  -v director_ip=$DIRECTOR_IP \
  --var-file director_ssl_ca=$DIRECTOR_SSL_CA \
  -v director_client=$DIRECTOR_CLIENT \
  -v director_client_secret=$DIRECTOR_CLIENT_SECRET \
  --vars-store ./creds.yml

When I try to run a "kill" incident, I get the following error:

Failed to execute incident: Finding deployments: Director responded with non-successful status code '401' response 'Not authorized: '/deployments'

You can find logs with more details here. With the same values I can log in to the director using the golang bosh-cli. I've created a simple Go script to reproduce this problem.

Could you please tell me what I might be doing wrong? Thank you.

Turbulence agent job assumes hostname returns BOSH agent ID

This is not the case with xenial stemcells. Ideally, it should look up the agent id in /var/vcap/bosh/settings.json.

This causes turbulence to not work when the agent is on xenial stemcells.
This is the error that we get from the API:

Timed out waiting for agent 'fc9fd54b-765a-4268-bc20-95597ead9110' to consume tasks

Lines of code causing this problem:

# todo hostname is not that good
agent_id=$(hostname)
sed -i "s:_agent_id_:${agent_id}:g" $CONF_DIR/config.json
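
A minimal sketch of the suggested fix, assuming agent_id is a top-level string field in /var/vcap/bosh/settings.json (its usual location on BOSH stemcells):

# Read the BOSH agent ID from settings.json instead of relying on hostname.
# Plain grep/cut is used here since jq may not be present on the stemcell.
agent_id=$(grep -o '"agent_id"[[:space:]]*:[[:space:]]*"[^"]*"' /var/vcap/bosh/settings.json | head -n1 | cut -d'"' -f4)
sed -i "s:_agent_id_:${agent_id}:g" $CONF_DIR/config.json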

cc @akshaymankar

A few questions

  1. Does it depend on k8s (Kubernetes) and Docker?

  2. How do we deploy the project to the server (VMs) and the agent to the VMs?

  3. We need to simulate the following actions on the agent servers (VMs):
    Incident Tasks: Noop, Kill (Kill Process), Stress, Firewall, Control Network, Fill Disk, Shutdown

Looking forward to your reply, thanks a lot!

tc error when using Loss and Delay with ControlNet

I am consistently getting the error RTNETLINK answers: File exists when I attempt to use a ControlNet incident with both Delay and Loss. Furthermore, after this configuration is used, the error continues to happen even for ControlNet incidents that specify only one or the other, although those have been tested and work on their own otherwise.

The following configuration consistently yields this problem.

{
  "Tasks": [{
    "Type": "ControlNet",
    "Timeout": "30s",
    "Delay": "100ms",
    "Loss": "30%"
  }],

  "Selector": {
    "Deployment": {
      "Name": "cf"
    },
    "Group": {
      "Name": "diego-cell"
    }
  }
}

The full error reported is:

Task execution: Shelling out to tc to add packet loss: Running command: 'tc qdisc add dev silk-vtep root netem loss 30% 75%', stdout: '', stderr: 'RTNETLINK answers: File exists
': exit status 2

When using ping -f -c 500 <DIEGO-CELL-IP>, I find that it does introduce some latency, which does not seem to go away, but it does not introduce any packet loss.

This issue does not occur if the tasks are split into separate incidents, like this:

{
  "Tasks": [{
    "Type": "ControlNet",
    "Timeout": "30s",
    "Delay": "100ms"
  },{
    "Type": "ControlNet",
    "Timeout": "30s",
    "Loss": "30%"
  }],

  "Selector": {
    "Deployment": {
      "Name": "cf"
    },
    "Group": {
      "Name": "diego-cell"
    }
  }
}

The error also occurs if Loss and LossCorrelation are specified without Delay.
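
For context, the failure appears to come from issuing tc qdisc add a second time against a root qdisc that already has a netem discipline; netem itself accepts delay and loss in a single invocation. A standalone sketch, not the release's code (eth0 is a placeholder interface):

# Add delay and loss together so only one 'qdisc add' is issued:
tc qdisc add dev eth0 root netem delay 100ms loss 30%

# An existing netem qdisc can also be adjusted in place instead of re-added:
tc qdisc change dev eth0 root netem delay 100ms loss 30%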

Can't update submodules

When you clone this repo and run git submodule update, it fails with fatal: No url found for submodule path 'src/github.com/onsi/ginkgo' in .gitmodules

We are trying to add this release as a submodule in our repo, so when we recursively clone this project it fails. It looks like you need to commit the .gitmodules file.
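
For illustration, the missing .gitmodules entry for the path in the error would look something like this (the URL is assumed to be the upstream ginkgo repository):

# Assumed entry; other submodules under src/ would need entries as well.
[submodule "src/github.com/onsi/ginkgo"]
    path = src/github.com/onsi/ginkgo
    url = https://github.com/onsi/ginkgo.git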

New release?

Hello, you made some changes after release 0.4; would you mind creating and uploading a 0.5 release?

Thanks

You shouldn't be using pid_utils with a script name of 'ctl'

pid_utils redirects the output to /var/vcap/monit/<script_name>.{out,err}.log, so if other releases do the same thing, everything ends up in one logfile.

You should probably either:

  • name the logfile after the job's name, not the script's name, or
  • include the job's name in the script's name, as e.g. cf-release or concourse do.

Incident JSON schema doesn't conform to manifest schema v2

An incident is scheduled like this:

{
    "Tasks": [{
        "Type": "kill"
    }],
    "Deployments": [{
        "Name": "dummy",
        "Jobs": [{
            "Name": "*_z1",
            "Limit": "1"
        }]
    }]
}

but should probably look like this:

{
    "Tasks": [{
        "Type": "kill"
    }],
    "Deployments": [{
        "Name": "dummy",
        "instance_groups": [{
            "Name": "*_z1",
            "Limit": "1"
        }]
    }]
}

I.e., rename Jobs to instance_groups.

Kill-process incident on a certain process doesn't last long

Hello.

I was trying to use turbulence-release to perform turbulence tests on one of our BOSH deployments. I am facing some issues with the kill-process incident. When I trigger this incident, the process goes down, but only for a few (~2-3) seconds. After that, monit revives the failed process and it is running again.

Upon taking a look at the codebase, I found that the incident causes a kill or pkill of the process specified in the incident, so monit brings it back in its next monitoring iteration. Could we instead do a monit stop <process> in order to simulate a longer downtime for the process? We could then bring the process back after a timeout, or else leave it to the user to bring it back.
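
A rough sketch of that suggestion (not the release's current behavior; the process name and downtime below are placeholders):

# Stop the process via monit so it is not revived immediately,
# then bring it back after the desired downtime.
PROCESS_NAME="some-process"
DOWNTIME_SECONDS=300
monit stop "${PROCESS_NAME}"
sleep "${DOWNTIME_SECONDS}"
monit start "${PROCESS_NAME}"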

Thanks!
