
Introduction


Worker is the component of Travis CI that runs a CI job on some form of compute instance.

It's responsible for getting the bash script from travis-build, spinning up the compute instance (VM, Docker container, LXD container, or maybe something different), uploading the bash script, running it, and streaming the logs back to travis-logs. It also sends state updates to travis-hub.

Installing

from binary

Find the version you wish to install on the GitHub Releases page and download either the darwin-amd64 binary for macOS or the linux-amd64 binary for Linux. No other operating systems or architectures have pre-built binaries at this time.

from package

Use the ./bin/travis-worker-install script, or take a look at the packagecloud instructions.

from snap

On a Linux distribution that supports snaps, you can run: sudo snap install travis-worker --edge

from source

  1. install Go v1.7+
  2. clone this repository into your $GOPATH:
  • mkdir -p $GOPATH/src/github.com/travis-ci
  • git clone https://github.com/travis-ci/worker $GOPATH/src/github.com/travis-ci/worker
  • cd $GOPATH/src/github.com/travis-ci/worker
  3. install gometalinter:
  • go get -u github.com/alecthomas/gometalinter
  • gometalinter --install
  4. install shellcheck
  5. run make

Configuring Travis Worker

Travis Worker is configured with environment variables or command line flags via the urfave/cli library. A list of the non-dynamic flags and environment variables may be found by invoking the built-in help system:

travis-worker --help

Environment-based image selection configuration

Some backend providers support image selection based on environment variables. Selection keys use a provider-specific prefix:

  • TRAVIS_WORKER_{UPPERCASE_PROVIDER}_IMAGE_{UPPERCASE_NAME}: contains an image name string to be used by the backend provider

The following example is for use with the Docker backend:

# matches on `dist: trusty`
export TRAVIS_WORKER_DOCKER_IMAGE_DIST_TRUSTY=travisci/ci-connie:packer-1420290255-fafafaf

# matches on `dist: bionic`
export TRAVIS_WORKER_DOCKER_IMAGE_DIST_BIONIC=registry.business.com/fancy/ubuntu:bionic

# resolves for `language: ruby`
export TRAVIS_WORKER_DOCKER_IMAGE_RUBY=registry.business.com/travisci/ci-ruby:whatever

# resolves for `group: edge` + `language: python`
export TRAVIS_WORKER_DOCKER_IMAGE_GROUP_EDGE_PYTHON=travisci/ci-garnet:packer-1530230255-fafafaf

# used when no dist, language, or group matches
export TRAVIS_WORKER_DOCKER_IMAGE_DEFAULT=travisci/ci-garnet:packer-1410230255-fafafaf
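
As an illustration of the naming scheme above, the variable name can be derived mechanically from the provider and the selection key (the `provider` and `key` values here are hypothetical examples, not a worker API):

```shell
# Hypothetical illustration of the naming scheme: uppercase the provider
# and the selection key, then join them into the env var name.
provider="docker"
key="dist_trusty"   # corresponds to `dist: trusty`
var="TRAVIS_WORKER_$(printf '%s' "${provider}_IMAGE_${key}" | tr '[:lower:]' '[:upper:]')"
echo "$var"
```

This prints TRAVIS_WORKER_DOCKER_IMAGE_DIST_TRUSTY, matching the first export above.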

Development: Running Travis Worker locally

This section is for anyone wishing to contribute code to Worker. The code itself has godoc-compatible docs (viewable at https://godoc.org/github.com/travis-ci/worker); this section is a higher-level overview of the code.

Environment

Ensure you've defined the necessary environment variables (see .example.env).

Pull Docker images

docker pull travisci/ci-amethyst:packer-1504724461
docker tag travisci/ci-amethyst:packer-1504724461 travis:default

Configuration

A few URLs need to be set, such as those for job-board (TRAVIS_WORKER_JOB_BOARD_URL) and travis-build (TRAVIS_WORKER_BUILD_API_URI). These can be set to the staging values.

export TRAVIS_WORKER_JOB_BOARD_URL='https://travis-worker:[email protected]'
export TRAVIS_WORKER_BUILD_API_URI='https://x:[email protected]/script'

TRAVIS_WORKER_BUILD_API_URI can be found in the env of the job board app, e.g.: heroku config:get JOB_BOARD_BUILD_API_ORG_URL -a job-board-staging.

Images

TODO

Configuring the requested provider/backend

Each provider requires its own configuration, which must be provided via environment variables namespaced by TRAVIS_WORKER_{PROVIDER}_.

Docker

The backend should be configured to be Docker, e.g.:

export TRAVIS_WORKER_PROVIDER_NAME='docker'
export TRAVIS_WORKER_DOCKER_ENDPOINT=unix:///var/run/docker.sock        # or "tcp://localhost:4243"
export TRAVIS_WORKER_DOCKER_PRIVILEGED="false"                          # optional
export TRAVIS_WORKER_DOCKER_CERT_PATH="/etc/secret-docker-cert-stuff"   # optional

Queue configuration

File-based queue

For the queue configuration, there is a file-based queue implementation so you don't have to mess around with RabbitMQ.

You can generate a payload via the generate-job-payload.rb script on travis-scheduler:

heroku run -a travis-scheduler-staging script/generate-job-payload.rb <job id> > payload.json

Place the file in the $TRAVIS_WORKER_QUEUE_NAME/10-created.d/ directory, where it will be picked up by the worker.

See example-payload.json for an example payload.
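
A minimal sketch of wiring this up, assuming the queue type value is 'file' (mirroring the 'amqp' value used below) and using a placeholder queue name and payload; substitute the output of generate-job-payload.rb in practice:

```shell
# Sketch: place a job payload where the file-based queue will pick it up.
# Queue name and payload contents here are placeholders, not real values.
export TRAVIS_WORKER_QUEUE_TYPE='file'
export TRAVIS_WORKER_QUEUE_NAME='builds.test'
mkdir -p "$TRAVIS_WORKER_QUEUE_NAME/10-created.d"
[ -f payload.json ] || echo '{}' > payload.json   # stand-in for a real payload
cp payload.json "$TRAVIS_WORKER_QUEUE_NAME/10-created.d/"
```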

AMQP-based queue

export TRAVIS_WORKER_QUEUE_TYPE='amqp'
export TRAVIS_WORKER_AMQP_URI='amqp://guest:guest@localhost'

The web interface is accessible at http://localhost:15672/

To verify your messages are being published, try:

rabbitmqadmin get queue=reporting.jobs.builds

Note: You will first need to install rabbitmqadmin. See http://localhost:15672/cli

See script/publish-example-payload for a script to enqueue example-payload.json.

Building and running

Run make build after making any changes. make also executes the test suite.

  1. make
  2. ${GOPATH%%:*}/bin/travis-worker

or in Docker (FIXME):

  1. docker build -t travis-worker . # or docker pull travisci/worker
  2. docker run --env-file ENV_FILE -ti travis-worker # or travisci/worker

Testing

Run make test. To run backend tests matching Docker, for example, run go test -v ./backend -test.run Docker.

Verifying and exporting configuration

To inspect the parsed configuration in a format that can be used as a base environment variable configuration, use the --echo-config flag, which will exit immediately after writing to stdout:

travis-worker --echo-config

Stopping Travis Worker

Travis Worker has two shutdown modes: graceful and immediate. A graceful shutdown tells the worker not to start any additional jobs, but to finish the jobs it is currently running before it shuts down. An immediate shutdown makes the worker stop the jobs it's working on, requeue them, and clean up any open resources (shut down VMs, cleanly close connections, etc.).

To start a graceful shutdown, send an INT signal to the worker (for example using kill -INT). To start an immediate shutdown, send a TERM signal to the worker (for example using kill -TERM).
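
The TERM path can be demonstrated with a stand-in process that traps the signal the way the worker does (illustration only; the real worker installs its own INT/TERM handlers):

```shell
# Stand-in for the worker: traps TERM, "requeues" and exits cleanly.
# (`sleep 5 & wait` keeps the trap responsive while the process idles.)
sh -c 'trap "echo requeue-and-cleanup > shutdown.log; exit 0" TERM; sleep 5 & wait' &
pid=$!
sleep 1
kill -TERM "$pid"   # the real worker treats TERM as immediate shutdown
wait "$pid"
cat shutdown.log
```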

Go dependency management

Travis Worker is built with the standard go commands; dependencies are managed with Go Modules.

Release process

Since we want to easily keep track of worker changes, we often associate them with a version number. To find out the current version, check the changelog or run travis-worker --version. We typically use semantic versioning to determine how to increase this number.

Once you've decided what the next version number should be, update the changelog, making sure you include all relevant changes that happened since the previous version was tagged. You can see these by running git diff vX.X.X...HEAD, where vX.X.X is the name of the previous version.

Once the changelog has been updated and merged to master, the merge commit needs to be signed and manually tagged with the version number. To do this, run:

git tag --sign -a vX.X.X -m "Worker version vX.X.X"
git push origin vX.X.X

The Travis build corresponding to this push should build and upload a worker image with the new tag to Docker Hub.

The next step is to create a new GitHub release with the appropriate information from the changelog.

License and Copyright Information

See LICENSE file.

© 2018 Travis CI GmbH


worker's Issues

Make the log output limit of 4mb configurable

Hi Hi

Several Enterprise customers have asked to be able to customize the log output limit, for example from 4 MB to 40 MB.

Can we please make this limit configurable so customers can choose their own limit?

Thanks a bundle

Josh

Image names in mac osx infrastructure

We are providing image names through env vars: IMAGE_ALIASES has the value default,xcode8_3, and we then define IMAGE_ALIAS_DEFAULT as test and IMAGE_ALIAS_XCODE8_3 as xcode8.3 (this is the image's name in vSphere).

We start a project with osx_image: xcode8.3 in the .travis.yml, but the alias it picks is always the default one. I'm seeing this in the worker logs:

time="2017-06-19T10:08:36Z" level=info msg="selected image name" dist=precise group=stable image_name=test job=355 language=swift os=osx osx_image=xcode8.3 pid=28516 processor=b26790ac-e82f-4c48-b079-2001081a4b96 repository="albert-manya/simple-project"

Should the image alias be encoded in a specific way?

Verify bootstrap success and instance health

Instances sometimes fail to bootstrap:

  • Start hooks can fail to download [1]
  • SSH public keys can fail to download

Instances sometimes become unhealthy in ways that aren't measured by our health checks:

  • Certain Docker commands hang forever [2]
  • Abusive jobs can hog CPU (to be addressed in #366)

We need a way to ensure instance health at bootstrap and on an ongoing basis. I'd like to use this issue as a place to brainstorm on design. (If a similar issue already exists somewhere, please point me to it!)

I think if I needed such a check on a bunch of my own servers, I'd use an approach like the following:

  • Create a /tmp/health directory
  • Make the cloud-init script write results to this directory, e.g. /tmp/health/cloud-init.ok if everything completed successfully, /tmp/health/cloud-init.nok if any errors were encountered
  • Use a cron job to occasionally check the status of required services (docker, travis-worker) and take appropriate action (e.g. restarting Docker, imploding the instance)

One problem: The only way I know to confirm that docker isn't working as expected is to try a command, e.g. docker ps, and observe that it just hangs forever. I'm not sure how to check this in a script without making the script hang forever, too. Maybe we could:

  • run docker ps &, wait a few seconds, then check whether a process with that PID is still running?
  • check the modification date on the docker log file?

Thoughts?
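
One hedged option for the hanging-probe problem: bound each check with coreutils timeout, so the health script itself can never block. In this sketch, sleep 60 stands in for a hung docker ps:

```shell
# Bound a health probe so the script cannot hang along with the daemon.
# `sleep 60` simulates a hung `docker ps`; swap in the real command.
probe() { timeout 2 "$@" > /dev/null 2>&1; }
if probe sleep 60; then
  status=ok
else
  status=hung-or-failed   # timeout exits 124 when it kills the command
fi
echo "docker probe: $status"
```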

hostname -s no longer resolvable

Hello,

With the release of 3.1.0, the hostname command when running travis on an ec2 docker worker changed from returning a shortname to returning an FQDN. This change broke our CI smoke tests at http://github.com/erlang/otp because we rely on being able to resolve both hostname -s and hostname. This was not a problem prior to 3.1.0, as hostname and hostname -s were the same thing.

I've implemented a workaround for us that skips the tests that rely on this behavior for now. But it would be great if both hostname and hostname -s could be resolvable in the container.

PS, there is no problem when running on gce, as there the shortname seems to be resolvable. DS.

The big 1.0 todo list ticket

Instead of opening an issue for every single thing I can think of that needs to be done for 1.0 (some of these are relatively minor and no discussion is needed), I'm opening this and I'm adding a big TODO list. If you want to discuss one of these, then open a new issue for it and link to it in the TODO list.

  • Write an integration test (and some unit tests where it makes sense). Unit tests are hard to write, since so many components are dependent on an external system such as RabbitMQ.
  • Pass the context.Context around everywhere.
  • Set up SIGTERM to cause a graceful shutdown.
  • Send state updates to travis-hub over RabbitMQ
  • Send log updates to travis-logs over RabbitMQ

Multi-Source Job Queue holds a single job

When the multi-source job queue is used, such as when the queue type is configured as a comma-delimited value, there is a single job held in a variable between input channel recv and output channel send. This results in "stuck job" behavior whenever all available processors are busy.

Inform backend on JSON parse error

If you specify the following in your .travis.yml:

osx_image: 9.1

Worker will crap out with:

level=error
msg="start attributes JSON parse error, attempting to ack+drop delivery"
err="json: cannot unmarshal number into Go struct field StartAttributes.osx_image of type string" 

There are two problems here:

  1. Arguably, this field should be cast to a string
  2. The error handling should improve: hub was not informed of the inability to run the job when this error occurred. After hub's timeout the job was marked as done, and in the meantime this defunct job consumed a concurrency slot.

Reference: https://admin.travis-ci.org/job/311441913

log timeout

Just had my local install issue a log timeout... while I was tailing the log file and it clearly had content in it! Does the log timeout not measure the same thing as what goes into the actual log file?

Bug in worker output

See this line:

_, err := logWriter.WriteAndClose([]byte(fmt.Sprintf("\n\nNo output has been received in the last %v, this potentially indicates a stalled build or something wrong with the build itself.\n\nThe build has been terminated\n\n", s.hardTimeout)))

Shouldn't it print the log timeout instead of the hard timeout?

In this example, the job is terminated after 10 minutes, as expected. But it prints 50m.

[screenshot: 2015-07-14 16:34]

[requirements] compilation fails

The requirements specify that you need to install go and gvt. After installing those, compilation doesn't succeed because something is missing: gometalinter

+ exec gometalinter --disable-all -E goimports -E gofmt -E goconst -E deadcode -E golint -E vet --deadline=1m --vendor --tests --errors ./...
./utils/lintall: line 25: exec: gometalinter: not found
make: *** [lintall] Error 127

State is not correctly updated after hard timeout for Linux jobs

Worker version 3.0.2 doesn't update the state after a hard timeout is reached.
The logs state that The job exceeded the maximum time limit for jobs, and has been terminated, but the job continues running.

After looking at the logs, it seems that the instances running the jobs are terminating properly, so this is strictly a state update issue.

Example job: https://travis-ci.org/joepvd/travis-experiment/builds/278560238

Example logs:

Sep 22 12:27:11 i-01032f5-production-2-worker-org-ec2 travis-worker: time="2017-09-22T10:27:05Z" level=info msg="running script" job_id=278560240 job_path="joepvd/travis-experiment/jobs/278560240" pid=1 processor="1e7cf5cf-3498-4818-b351-f22f4d5c84f5@1.i-01032f5-production-2-worker-org-ec2.travisci.net" repository="joepvd/travis-experiment" self="step_run_script"  
Sep 22 12:32:11 i-01032f5-production-2-worker-org-ec2 travis-worker: time="2017-09-22T10:32:02Z" level=info msg="hard timeout exceeded, terminating" job_id=278560240 job_path="joepvd/travis-experiment/jobs/278560240" pid=1 processor="1e7cf5cf-3498-4818-b351-f22f4d5c84f5@1.i-01032f5-production-2-worker-org-ec2.travisci.net" repository="joepvd/travis-experiment" self="step_run_script"  
Sep 22 12:32:11 i-01032f5-production-2-worker-org-ec2 travis-worker: time="2017-09-22T10:32:02Z" level=info msg="finishing job" job_id=278560240 job_path="joepvd/travis-experiment/jobs/278560240" pid=1 processor="1e7cf5cf-3498-4818-b351-f22f4d5c84f5@1.i-01032f5-production-2-worker-org-ec2.travisci.net" repository="joepvd/travis-experiment" self="amqp_job" state=errored  
Sep 22 12:32:11 i-01032f5-production-2-worker-org-ec2 travis-worker: time="2017-09-22T10:32:02Z" level=error msg="couldn't update job state" err="context deadline exceeded" job_id=278560240 job_path="joepvd/travis-experiment/jobs/278560240" pid=1 processor="1e7cf5cf-3498-4818-b351-f22f4d5c84f5@1.i-01032f5-production-2-worker-org-ec2.travisci.net" repository="joepvd/travis-experiment" self="step_run_script" state=errored  
Sep 22 12:32:11 i-01032f5-production-2-worker-org-ec2 travis-worker: time="2017-09-22T10:32:02Z" level=info msg="finished script" job_id=278560240 job_path="joepvd/travis-experiment/jobs/278560240" pid=1 processor="1e7cf5cf-3498-4818-b351-f22f4d5c84f5@1.i-01032f5-production-2-worker-org-ec2.travisci.net" repository="joepvd/travis-experiment" self="step_run_script"  
Sep 22 12:32:41 i-01032f5-production-2-worker-org-ec2 travis-worker: time="2017-09-22T10:32:34Z" level=info msg="stopped instance" job_id=278560240 job_path="joepvd/travis-experiment/jobs/278560240" pid=1 processor="1e7cf5cf-3498-4818-b351-f22f4d5c84f5@1.i-01032f5-production-2-worker-org-ec2.travisci.net" repository="joepvd/travis-experiment" self="step_start_instance"  
Sep 22 12:32:41 i-01032f5-production-2-worker-org-ec2 travis-worker: time="2017-09-22T10:32:34Z" level=info msg="finished job" job_id=278560240 job_path="joepvd/travis-experiment/jobs/278560240" pid=1 processor="1e7cf5cf-3498-4818-b351-f22f4d5c84f5@1.i-01032f5-production-2-worker-org-ec2.travisci.net" repository="joepvd/travis-experiment" self=processor  

travis worker filesystem does not seem to handle fallocate properly

Hi, I am not sure what filesystem is used underneath. When I run df -T, it shows xfs as the filesystem inside the docker container. However, some other reading tells me it could be wrapped in AUFS.

Here is what I am noticing. We are running a database unit test in travis, and we are passing flags that should not increase the size of a file on fallocate, because we pass FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE to the fallocate call.

However strace is showing that fallocate succeeded and the size of the file increased.

Look at the strace calls below. The file size should not increase, since 03 is the mode for fallocate. However, instead of 106118, the file size is getting reset to 106496.

I am wondering whether this is AUFS, and AUFS has not implemented fallocate, so it falls back to posix_fallocate, which does not support these modes?

[pid 3576] write(15, "\265\20\34\0\34\200\01000b0.\1\0\0243990\1\0\t\1\0a\376\1\0\376\1\0\376"..., 147) = 147
[pid 3576] mprotect(0x2b2250030000, 20480, PROT_READ|PROT_WRITE) = 0
[pid 3576] write(15, "\0$\4rocksdb.block.based.table.ind"..., 5886) = 5886
[pid 3576] fsync(15) = 0
[pid 3576] ftruncate(15, 106118) = 0
[pid 3576] fallocate(15, 03, 106118, 73713632) = 0
[pid 3576] close(15) = 0
[pid 3576] open("/tmp/rocksdbtest-1000/db_tailing_iterator_test/000037.sst", O_RDONLY) = 8
[pid 3576] fcntl(8, F_GETFD) = 0
[pid 3576] fcntl(8, F_SETFD, FD_CLOEXEC) = 0
[pid 3576] fadvise64(8, 0, 0, POSIX_FADV_RANDOM) = 0
[pid 3576] pread(8, "\1\370\224\6"\237\225\6\255'\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 106065) = 53
[pid 3576] fstat(8, {st_dev=makedev(252, 7), st_ino=17731611, st_mode=S_IFREG|0644, st_nlink=1, st_uid=1000, st_gid=1000, st_blksize=4096, st_blocks=208, st_size=106496, st_atime=2017/01/06-01:37:56, st_mtime=2017/01/06-01:37:56, st_ctime=2017/01/06-01:37:56}) = 0

Jobs that are cancelled while queued don't get cancelled

I'm seeing a log like this:

time="2017-03-08T20:01:23Z" level=info msg="received amqp delivery" job=[redacted] pid=11807 processor=d96a4562-264c-47a8-90d9-9956279393e6 
time="2017-03-08T20:01:23Z" level=info msg="starting job" job=[redacted] pid=11807 processor=d96a4562-264c-47a8-90d9-9956279393e6 repository="[redacted]" 
time="2017-03-08T20:01:23Z" level=info msg="using build script generator to generate script" job=[redacted] pid=11807 processor=d96a4562-264c-47a8-90d9-9956279393e6 repository="[redacted]" 
time="2017-03-08T20:01:23Z" level=info msg="generated script" job=[redacted] pid=11807 processor=d96a4562-264c-47a8-90d9-9956279393e6 repository="[redacted]" 
time="2017-03-08T20:01:23Z" level=info msg="starting instance" job=[redacted] pid=11807 processor=d96a4562-264c-47a8-90d9-9956279393e6 repository="[redacted]" 
time="2017-03-08T20:01:23Z" level=info msg="cancelling job" command="cancel_job" component=canceller job=[redacted] pid=11807 
time="2017-03-08T20:01:23Z" level=warning msg="job already cancelled" command="cancel_job" component=canceller job=[redacted] pid=11807 
time="2017-03-08T20:01:24Z" level=info msg="selected image name" dist=precise group=stable image_name=travis-ci-macos10.12-xcode8.2-1481567376 job=[redacted] language=objective-c os=osx osx_image=xcode8.2 pid=11807 processor=d96a4562-264c-47a8-90d9-9956279393e6 repository="[redacted]" 
time="2017-03-08T20:02:23Z" level=warning msg="job already cancelled" command="cancel_job" component=canceller job=[redacted] pid=11807 
time="2017-03-08T20:02:23Z" level=warning msg="job already cancelled" command="cancel_job" component=canceller job=[redacted] pid=11807 
time="2017-03-08T20:02:24Z" level=warning msg="job already cancelled" command="cancel_job" component=canceller job=[redacted] pid=11807 

It looks like hub is trying to tell us that the job should be cancelled, but for some reason we think it's already cancelled and never actually cancel it.

Extract SSH private key parsing to only do it once per process

Right now we load and parse the SSH key multiple times per job (once per SSH connection), and this is the largest source of CPU usage for the worker currently.

I've started to draft this out locally and will open a PR 🔜; I'm just opening this issue to track it.

Understanding travis-worker and installing it locally.

I am interested in learning about travis-worker and the other components connected to it. Is there any specific document that can help me set up and configure travis-worker and the rest of the components (travis-build, travis-logs, travis-hub) locally on a VM?

Also, I would appreciate any documents or links that can help me understand the below better:
a) There are installation steps available for travis-worker in README.md, but how do we configure travis-build, travis-logs, and travis-hub with it?
b) How could we run a sample job with travis-worker and docker as a backend?
c) As per my understanding, travis-worker is responsible for creating a VM/container with the configured backend (docker, gce, etc.); in such a scenario, what is the role of travis-cookbooks?

Update: The action I would like to see based on this issue is improvements and additions to README.md, meaning that I consider this issue to be about documentation enhancement.

Add support for volume mounting in docker backend

When using Docker as the provider, Travis Worker doesn't currently allow controlling the options for the command:

docker run [OPTIONS] IMAGE [COMMAND] [ARG...]

Right now, only providing a CMD for [COMMAND] is allowed.

Allowing some default [OPTIONS] to be defined for the worker to use when launching containers could be beneficial, for example to provide a list of volumes to mount in the container.

I can try to provide a PR in the next few days, but my knowledge of Go is close to 0.
In the meantime, what do you guys think?

Thanks
Ivan

Stream Output from travis_wait to Standard Output and Error

Presently, it appears that the only workaround for long-running builds is to use travis_wait N where N is the number of minutes to allow a build to run for.

This command unfortunately does not stream its output to the user, so it is necessary to wait until a job has completely finished before knowing what went wrong. In large test cases or scripts that take 30 minutes to run, it would be nice to see test failures without having to wait the full 30 minutes.

Add SIGUSR1 support to worker

Add support for handling the unix SIGUSR1 signal.

Things for it to output should include

  • version information
  • some info on the process uptime
  • a dump of relevant "current activities/work in progress"

travis-build appears to be botching envvar decoding

Sorry, I'm pretty sure this belongs in travis-build, but you have issues turned off over there.

On allowing the ruby commandline to add encrypted envvars, the export step in decryption is:

  1. apparently broken
  2. logged wholesale

This means that the exports aren't set correctly, and that once they're fixed, your logs will contain the very same information this entire system is meant to keep out of logs.

Setting environment variables from .travis.yml
$ export ABC_KEY={:secure=>
/home/travis/build.sh: eval: line 45: syntax error near unexpected token `newline'
/home/travis/build.sh: eval: line 45: `export ABC_KEY={:secure=>'
$ export 1a2b3c4d="} ABC_KEY={:secure=>"

Other than replacing the real name of the environment variable with ABC_KEY and the real value of the key with 1a2b3c4d, this is unedited. Note that each (ruby hash?) is cut off after its (whatever ruby people call fat arrows).

The censored key is the encrypted value, not the unencrypted value.

There are two such keys in my .travis.yml; only the first appears in the log. However, using code, I've verified that neither is set in the environment.

Here is an NSA-certified .travis.yml for you:

https://gist.github.com/StoneCypher/a5e516889b95c324be5421e1f9a17ce3

The referenced gulp deploy steps work locally, but won't work in Travis land because they rely on the envvars for their AWS keys.

I am somewhat confused. I believe this may be a Travis bug, but I'm open to the possibility that I screwed this up somehow.


Ensure instance cleanup runs even when context is cancelled

We use context cancellations for things like shutdown with SIGTERM and timeouts, but some log messages seem to indicate that this could mean that the instance shutdown we do to clean up the instance after a cancellation doesn't finish. We should ensure that a context cancellation would still allow this cleanup to run.

Backends to check:

  • cloudbrain
  • docker
  • gce
  • jupiterbrain
  • local (this might not apply to this backend)

Docker build failure attempting to run sudo apt-get update

Attempting to follow the documentation for enabling [docker builds](https://enterprise.travis-ci.com/docker-builds) for my travis workers, I see the following:

Get:62 http://ppa.launchpad.net precise/main i386 Packages [3,133 B]
Err http://ppa.launchpad.net precise/main amd64 Packages
404 Not Found
Err http://ppa.launchpad.net precise/main i386 Packages
404 Not Found
Fetched 62 B in 3s (16 B/s)
W: Failed to fetch http://ppa.launchpad.net/rwky/redis/ubuntu/dists/precise/main/binary-amd64/Packages 404 Not Found

W: Failed to fetch http://ppa.launchpad.net/rwky/redis/ubuntu/dists/precise/main/binary-i386/Packages 404 Not Found

E: Some index files failed to download. They have been ignored, or old ones used instead.

The command "sudo apt-get update" failed and exited with 100 during .

Complete logs here
travis_log.txt

Add metrics to count jobs finished.

After some discussion, we decided it would be really neat to have the worker output a metric which counts jobs completed, which could give further insight into how many jobs we process at one time, so that we can observe our overall performance on the platform.

Preferably these would be completed jobs tagged with build images used, build infrastructure used, and "errored"/"failed"/"passed", which could also help us in the future, as we push out new build images, to see how many jobs error or fail when we push out a new build image.

The sum of these would also be useful for determining the overall performance of a build infrastructure -- seeing how quickly we're processing jobs, since we can't really see how quickly we're processing jobs from a queue which is having jobs added to it (which is our queue, in 99% of cases.)

Coordinate on unifying Travis Enterprise and .com/.org worker approaches

In Slack I brought up some questions that a customer had brought up around the Quay images for worker that are located here: https://quay.io/repository/travisci/worker?tag=latest&tab=tags and some questions around which job execution images to use and which are the most up to date.

This led to me linking the current Enterprise worker install scripts, which apparently are very dated and not closely related to the current state of the art in .com/.org.

@meatballhat suggested that we catch up to coordinate on Enterprise vs .com/.org and @solarce suggested I open an issue to track this coordination!

(did I get this all right?)

So this is the issue! Ideally I'd love to run everything in Enterprise as close as possible to what we do in .com/.org. What all this will entail, I do not know yet. I get the feeling I'm missing a good bit of context in the current status of things so I'd love a catch up there.

I'm adding this to the Enterprise 2.1 tracking issues internally so we're sure to talk about this in the near future as it may have some pretty big implications for 2.1.

Cancellation mechanism can break if job is requeued to same worker

I've seen the "there's already a subscription for job …" error message pop up somewhat regularly and decided to try to take a look at what's going on.

It looks like it's possible for a job to be queued onto the same worker after a requeue (more common during slow times, since RabbitMQ seems to queue jobs in "order", so if a worker has 20 consumers, it will get 20 jobs in a row before another worker gets any), and since the cancellation "unsubscription" happens at the very end it can go a few seconds between a requeue and a cancellation unsubscription (e.g. if the instance is shut down in between).

I think a solution for this would be to change the canceller to allow multiple cancellation subscriptions for the same job ID. Another option is to make sure that the canceller is "unsubscribed" before requeueing a job, but that's involving a lot more different code paths that we need to check this, and it'd probably be easy to miss one when editing code, so that feels like a more fragile solution to me, which is why I think allowing multiple cancellations would be better.

Automated builds on Docker Hub

In trying to determine where the Docker Hub travisci/worker images come from, I was pointed here:

- if [[ $TRAVIS_PULL_REQUEST = 'false' && $DOCKER_LOGIN_PASSWORD && $DOCKER_LOGIN_USERNAME ]]; then
    make docker-build smoke-docker docker-push;
  fi

Would we consider using automated builds on Docker Hub to improve visibility of the image source and ensure the Dockerfile in the repo really corresponds to the images on the Hub? (The Dockerfile currently on master doesn't build at all for me locally; I think this would help with issues like #284.)

Compilation error: xargs: shfmt: No such file or directory && Makefile:59: recipe for target 'test-no-cover' failed

Anything else is fine except:

~/go/src/github.com/travis-ci/worker ~/go/src/github.com/travis-ci/worker
+ gometalinter --disable-all -E goimports -E gofmt -E goconst -E deadcode -E golint -E vet --deadline=1m --vendor --tests --errors ./...
[proxychains] DLL init: proxychains-ng 4.12-git-14-g06c20ed
+ git grep -l '^#!/usr/bin/env bash'
+ xargs shfmt -i 2 -w
[proxychains] DLL init: proxychains-ng 4.12-git-14-g06c20ed
[proxychains] DLL init: proxychains-ng 4.12-git-14-g06c20ed
xargs: shfmt: No such file or directory
Makefile:49: recipe for target 'lintall' failed
make: *** [lintall] Error 127

proxychains is what I'm using as a network proxy.

Print error to job log when instance is shut down prematurely

If a VM is shut down (either because of shutdown command in the script, or gcloud-cleanup cleaning it up, or some other reason), the job appears to just suddenly end with no information about why. We should probably print an error to the logs in this case. Here's an example log output from a worker where this happened (gcloud-cleanup cleaned up the VM since this repository had a timeout that's longer than gcloud-cleanup's timeout).

time="2017-03-22T18:52:50Z" level=info msg="received amqp delivery" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c  
time="2017-03-22T18:52:50Z" level=info msg="starting job" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T18:52:50Z" level=info msg="generated script" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T18:52:50Z" level=info msg="starting instance" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T18:52:50Z" level=info msg="starting instance" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T18:53:12Z" level=info msg="started instance" boot_time=21.853888377s job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T18:53:52Z" level=info msg="uploaded script" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T18:53:52Z" level=info msg="running script" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T21:52:58Z" level=error msg="couldn't run script" completed=true err="error running script: error running script: wait: remote command exited without exit status or exit signal" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T21:52:58Z" level=info msg="finished script" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T21:52:58Z" level=info msg="deleting instance" instance=[redacted] job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T21:52:58Z" level=info msg="stopped instance" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  
time="2017-03-22T21:52:58Z" level=info msg="finished job" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  

The exact error is this one:

time="2017-03-22T21:52:58Z" level=error msg="couldn't run script" completed=true err="error running script: error running script: wait: remote command exited without exit status or exit signal" job=[redacted] pid=12810 processor=1228619a-5bfe-4626-aa0c-4d73e697151c repository="[redacted]"  

I'm a little confused about this error too: looking at backend/gce.RunScript I don't see any way for Completed to be true while also returning an error, so something else odd is going on here as well.

py.test not returning after passing all tests in container

I'm having a seemingly random problem.

I made a commit which on my machine passes, but on travis passed and then never exited: https://travis-ci.org/somakeit/door-controller2/builds/130965876

I then worked on a branch, changing the way the recently added tests ran, until travis worked again. I made a PR back to master and all was passing: https://travis-ci.org/somakeit/door-controller2/builds/132084380

I merged the PR into master and it failed again: https://travis-ci.org/somakeit/door-controller2/builds/132085414 :'-(

Is there a way I can find out what's randomly keeping py.test from returning?

Error does not result in canceled job

The following happened with worker v2.11.0 on MacStadium:

Oct 05 00:34:28 wjb-1 travis-worker-production-com-1:
  time="2017-10-04T22:34:27Z"
  level=info
  msg="received amqp delivery"
  job_id=93257062
  pid=19758
  processor=271e149b-8091-4bfd-990a-bbfc657bce97
  self="amqp_job_queue" 
Oct 05 00:34:28 wjb-1 travis-worker-production-com-1:
  time="2017-10-04T22:34:27Z"
  level=error
  msg="start attributes JSON parse error, attempting to nack delivery"
  err="json: cannot unmarshal array into Go struct field StartAttributes.osx_image of type string"
  pid=19758 processor=271e149b-8091-4bfd-990a-bbfc657bce97
  self="amqp_job_queue" 

Not sure what this error means, but the way it was handled was suboptimal. Hub reaped this job after a few hours and marked it as canceled. During that time, the concurrency was consumed by this malfunctioning job. Please extend the error handling so the backend knows that this job is to be considered canceled.

Could not find image: no such id: quay.io/travisci/travis-xxx:latest

When trying to install travis-ci/worker via ./bin/travis-worker-install I get:

latest: Pulling from quay.io/travisci/travis-ruby
511136ea3c5a: Already exists
bc1f0427b833: Already exists
0cf29d6c76b4: Already exists
b560c37f1efa: Already exists
69c02692b0c1: Already exists
18bb34c4d5de: Already exists
0cb3406d162a: Already exists
a054df88b84d: Already exists
1d2dceaa3088: Already exists
5c12541b06a6: Already exists
6a927ef6bd71: Already exists
bafbb3c33652: Already exists
ffb7430cc29e: Already exists
77b54ec89152: Already exists
276af20c9081: Already exists
56ef0acdeadb: Already exists
6cf578cf4a1f: Already exists
1bd392deef90: Already exists
c65ae5c5d52b: Already exists
5adaf370d628: Already exists
3cbd4e5fef2d: Already exists
2b412eda4314: Already exists
ee4f8c49a7d9: Already exists
Digest: sha256:c2a29ddc7f814649c048c00acaaed76f607abec267fd531680becbc6b77c1b7c
Status: Image is up to date for quay.io/travisci/travis-ruby:latest
Error response from daemon: could not find image: no such id: quay.io/travisci/travis-clojure:latest

Checking the link used by the script, it returns a 403 as well, so something looks broken there.
The same happens for other languages (travis-java, ...), though not all of them.

Inform hub of un-requeueable job

A GCE job could not be started, and the requeue itself ended in an error.

Oct 23 20:31:57 production-1-worker-com-c-5-gce level=error msg="couldn't start instance" err="context deadline exceeded"
Oct 23 20:31:57 production-1-worker-com-c-5-gce level=info msg="requeueing job"
Oct 23 20:31:57 production-1-worker-com-c-5-gce level=error msg="couldn't requeue job" err="context deadline exceeded" 

Hub could therefore not be informed of the job's failure, and only after quite some time did hub perform a cleanup:

travis-com-hub-production
Erroring stale job: id=123 state=received updated_at=2017-10-23 18:26:41 UTC. 

Number of occurrences of "couldn't requeue job" in the last 7 hours, grouped by hour (CEST):

GCE .org:

06 321
07 58
08 1
09 3
10 2
11 2
13 2

GCE .com:

06 232
07 47
09 9
10 1
11 59
12 5
13 3

This means that the concurrency was consumed by this stale job for quite a while. It would be good if more attempts were made to inform hub in this error scenario.

Some extra details in this support ticket.

2.9.2 no longer picks up builds in Enterprise

We've seen a pattern recently with upgrades and fresh installs for Enterprise using 2.9.2 where it ends up commenting out the export TRAVIS_ENTERPRISE_BUILD_ENDPOINT="__build__" line. As a result the worker can't receive the build scripts from travis-build and builds are not picked up.

I'm not sure if it replaces the whole file or just edits this line. The file also still contains our Enterprise caching settings, as well as hostnames and passwords, so I'd expect to have heard about those going missing too, but I haven't heard anything so far.

It seems uncommenting this line and restarting the server fixes the issue.

State update messages do not include the queued_at timestamp

We've agreed it would be good to include all known timestamps in all state update messages sent to Hub, but it seems we've missed the job.queued_at timestamp.

For example, this is a finished message received in Hub on staging:

{:finished_at=>"2017-03-14T18:11:04Z", :id=>548890, :received_at=>"2017-03-14T18:10:46Z", :started_at=>"2017-03-14T18:10:47Z", :state=>"passed"}

It seems Scheduler includes this timestamp in its own message sent to Worker: https://github.com/travis-ci/travis-scheduler/blob/master/spec/travis/scheduler/serialize/worker_spec.rb#L64

Would it be possible for this timestamp to be included in state update messages sent back to Hub?

Package not working through apt addon

I'm not sure if this is the right place to report it, but when I add android-tools-adb to the apt addon in .travis.yml, I get "Unable to locate package android-tools-adb".

The package is whitelisted, so I don't know what the problem is.

Ruby 2.3.3 doesn't work in non-travis environment

Are the Ruby binaries on http://rubies.travis-ci.org/ supposed to work in a non-Travis environment?

The 2.3.x and 2.4.x versions for OSX 10.12 (at least) do not work, showing me:

dyld: Library not loaded: /Users/travis/.rvm/rubies/ruby-2.3.2/lib/libruby.2.3.0.dylib
  Referenced from: <removed>
  Reason: image not found

Not sure if this issue belongs in this repo, so apologies if that's the case.

Limit docker disk space?

Travis offers easy options to limit the CPUs and RAM used by Docker containers, but what about disk space? Is there an easy way to limit that?

Travis Enterprise Worker doesn't like ZFS, but works great with BTRFS

I have successfully managed to setup Travis Worker with Docker 1.10.3.

First, in a gist:

This gave me a working Docker+Btrfs Travis Worker. All good.

Then I did the same for ZFS (Docker ZFS setup), but that never worked.
While I was able to launch Travis containers on ZFS, as explained in the official Travis CI docs, Travis Worker never managed to pick up jobs and entered an endless retry loop.

By looking at /var/log/upstart/travis-worker.log I could see errors of the Worker trying to connect to 172.17.0.2:22 (what I presume is the SSH connection through which the Worker hands off work to the container's build.sh).
I have dug around the code in this repo a bit to see where that code was generated, but I couldn't find anywhere a log line resembling travis worker dial tcp 172.17.0.2:22: getsockopt: connection refused.

I hope this can be helpful for a future version of Travis that supports the awesome ZFS: a filesystem that is WAY more stable and reliable than BTRFS or AUFS.

Docker native mode doesn't seem to work with v2.50

Hi,

I'm almost sure I'm doing something wrong. I have travis worker configured via env variables to run in native Docker mode, which should avoid the SSH upload of build.sh.

However, while I can see the PRIVILEGED=true flag working, setting TRAVIS_WORKER_DOCKER_NATIVE=true still launches a worker trying to use SSH.

Any idea what I'm doing wrong?

starting container process caused "could not create session key: disk quota exceeded"

   Worker information
   hostname: 2829068b-297f-439a-89b2-cd35435c72b5@1.i-008af78-production-2-worker-org-ec2.travisci.net
   version: v3.4.0 https://github.com/travis-ci/worker/tree/ce0440bc30c289a49a9b0c21e4e1e6f7d7825101
   instance: fd9bc7f travisci/ci-garnet:packer-1512502276-986baf0 (via amqp)
   startup: 974.040577ms
   oci runtime error: exec failed: container_linux.go:265: starting container process caused "could not create session key: disk quota exceeded"

What does this mean? And how can I solve this problem?
