Levant

Levant is an open source templating and deployment tool for HashiCorp Nomad jobs that provides realtime feedback and detailed failure messages when deployment issues occur.

Features

  • Realtime Feedback: Using watchers, Levant provides realtime feedback on Nomad job deployments allowing for greater insight and knowledge about application deployments.

  • Advanced Job Status Checking: Particularly for system and batch jobs, Levant ensures the job, evaluations and allocations all reach the desired state providing feedback at every stage.

  • Dynamic Job Group Counts: If the Nomad job is currently running on the cluster, Levant dynamically updates the rendered template with the relevant job group counts before deployment.

  • Failure Inspection: Upon a deployment failure, Levant inspects each allocation and logs information about each event, providing useful information for debugging without the need for querying the cluster retrospectively.

  • Canary Auto Promotion: In environments with advanced automation and alerting, automatic promotion of canary deployments may be desirable after a certain time threshold. Levant allows the user to specify a canary-auto-promote time period, which if reached with a healthy set of canaries, automatically promotes the deployment.

  • Multiple Variable File Formats: Currently Levant supports .json, .tf, .yaml, and .yml file extensions for the declaration of template variables.

  • Auto Revert Checking: In the event that a job deployment does not pass its healthy threshold and the job has auto-revert enabled, Levant tracks the resulting rollback deployment so you can see the exact outcome of the deployment process.

Download & Install

  • Official Levant binaries can be downloaded from the HashiCorp releases site.

  • Levant can be installed via the Go toolchain using go get github.com/hashicorp/levant && go install github.com/hashicorp/levant

  • A Docker image can be found on Docker Hub. The latest version can be downloaded using docker pull hashicorp/levant.

  • Levant can be built from source by first cloning the repository with git clone git://github.com/hashicorp/levant.git. Once cloned, a binary can be built using the make dev command; the resulting binary will be available at ./bin/levant.

  • There is a Levant Ansible role available to help installation on machines. Thanks to @stevenscg for this.

  • Pre-built binaries of Levant from versions 0.2.9 and earlier can be downloaded from the GitHub releases page. These binaries were released prior to the migration to the HashiCorp organization. For example: curl -L https://github.com/hashicorp/levant/releases/download/0.2.9/linux-amd64-levant -o levant

Templating

Levant includes functionality to perform template variable substitution as well as trigger built-in template functions to add timestamps or retrieve information from Consul. For full details please consult the templates documentation page.
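As a quick illustration (the job and variable names here are invented), a job template can interpolate variables using Levant's [[ ]] delimiters, which are replaced at render time from -var flags or a variable file:

job "[[.job_name]]" {
  datacenters = ["dc1"]

  group "app" {
    count = [[.count]]
  }
}

Rendering with something like levant deploy -var job_name=api -var count=3 template.nomad substitutes the tokens before the job is submitted to Nomad.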

Commands

Levant supports a number of command line arguments which provide control over the Levant binary. For detail about each command and its supported flags, please consult the commands documentation page.

Clients

Levant utilizes the Nomad and Consul official clients and configuration can be done via a number of environment variables. For detail about these please read through the clients documentation page.

Contributing

Community contributions to Levant are encouraged. Please refer to the contribution guide for details about hacking on Levant.

Contributors

angrycub, apanagiotou, benjaminrumble-tc, bogdanov1609, cgbaker, chronark, dansteen, dekimsey, dependabot[bot], dnalchemist, ernetas, fuleow, gulducat, hashicorp-ci, hashicorp-copywrite[bot], havk64, jolexa, jrasell, lgfa29, lhayhurst, maksym-iv, mdeggies, merwan, mlehner616, mre, msvbhat, pmcatominey, saboteurkid, stack72, veverkap

Issues

Update Levant logging around passed variables

Levant's logging needs to be updated to better detail what is going on with passed variable files and single -var variables.

The current output:

[DEBUG] levant/templater: variable file not passed, using any passed CLI variable

The output should look more like:

[DEBUG] levant/templater: no variable file passed
[DEBUG] levant/templater: passed <n> command line variables

[BUG] failure_inspector doesn't handle multi-group jobs well

When a deployment fails on a job which includes more than one group, the failure_inspector does get run, but it runs against both allocations even if only one failed. Levant also seems to struggle to output the allocation information, as seen in the log output:

[ERROR] levant/deploy: deployment a69c0cdd-f260-3ec8-db0e-0a4f468b9603 has status failed
[DEBUG] levant/failure_inspector: launching allocation inspector for alloc ef86a33b-a088-c9db-a240-5ecef3b6683a
[DEBUG] levant/failure_inspector: launching allocation inspector for alloc ba5bb92b-4f71-a13c-f4ea-6c7374f3db9f

[FEATURE] Use Consul KV data for vars

Summary

Enhance levant to use data from Consul KV as variables when rendering nomad jobs.

Background

When an organization uses Consul KV as a configuration repository for applications running within a particular environment and datacenter, the ability to access this data when rendering and deploying nomad jobs is very appealing.

The configuration data in Consul KV is often backed by a version controlled git repository, loaded using a tool like git2consul, and already accounts for per-environment and per-region / per-datacenter differences.

When similar data is also required to render and deploy nomad jobs, we are faced with either fetching the data from Consul prior to calling levant or replicating the required data locally to levant in the form of disparate vars and/or vars files structured by environment and/or region.

A very simple key keyword could support many of the more common use cases without embedding or attempting to emulate full consul-template functionality:

[[ key "config/serviceA/keyA" ]]

User feedback could help decide if support for just key is sufficient for the most common use cases. keyOrDefault is frequently useful as well.

Several new configuration items will be required to access the Consul API, whether Consul is running on the same host as levant or remotely. Following the patterns established by HashiCorp for Consul environment variables and CLI options should help users leverage their existing Consul configuration settings.

The existing -var and -var-file should remain and can take precedence.

Examples

For each example, the syntax shown would be part of the nomad job file.

A dynamic job name for a service:

job "[[ key "config/serviceA/nomad_job_name" ]]" {
  ....
}

The Nomad region is a required attribute, and deployments to multiple datacenters each need to specify a value for it:

job "api" {
    type = "service"

    region = "[[ key "config/serviceA/nomad_region" ]]"
    datacenters = ["us1"]

Container registry and artifact paths are frequently dynamic and vary across projects and environments:

job "api" {
    config {
        image = "[[ key "config/serviceA/nomad_registry_url" ]]:${NOMAD_META_VERSION}"
     }

     artifact {
         source = "[[ key "config/serviceA/nomad_artifact_url" ]]/${NOMAD_META_VERSION}/foo.tar.gz"
     }
}

Levant logging should add timestamp

When running Levant from the CLI, it outputs a stream of text which can be hard to review retrospectively without timestamp references. Levant should therefore be updated to include a timestamp in the logging output to provide greater clarity to users.
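A minimal sketch using Go's standard library logger (Levant's actual logging package may differ) shows how little is needed to prefix every line with a timestamp:

package main

import "log"

func main() {
	// include date, time and microseconds on every log line
	log.SetFlags(log.LstdFlags | log.Lmicroseconds)
	log.Println("[INFO] levant/deploy: triggering a deployment of job example")
	// output resembles: 2017/10/04 20:49:09.479123 [INFO] levant/deploy: ...
}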

[BUG] Levant doesn't correctly determine eval failures on constraints

When using constraints such as:

constraint {
  attribute = "${attr.kernel.name}"
  value     = "linux"
}

And attempting a deployment onto Nomad running in dev mode on an OSX machine, the following log lines are presented to the user:

[ERROR] levant/deploy: task group hashi-ui failed to place 1 allocs, failed on [] and exhausted []
[ERROR] levant/deploy: evaluation c02b8915-7dba-82b7-4e05-1e8ddbc4fdcc finished with status complete but failed to place allocations

When inspecting the allocation directly the actual details are discovered:

Placement Failure
Task Group "hashi-ui":
  * Constraint "${attr.kernel.name} = linux" filtered 1 nodes

Levant therefore needs to be able to catch and log evaluation failures of this kind.
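A rough sketch of what catching this could look like, assuming the official github.com/hashicorp/nomad/api client; the evaluation's FailedTGAllocs field carries per-task-group placement metrics, including which constraints filtered nodes:

eval, _, err := client.Evaluations().Info(evalID, nil)
if err != nil {
	return err
}
for tg, metrics := range eval.FailedTGAllocs {
	// e.g. constraint "${attr.kernel.name} = linux" filtered 1 nodes
	for constraint, nodes := range metrics.ConstraintFiltered {
		log.Printf("[ERROR] levant/deploy: task group %s: constraint %q filtered %d nodes", tg, constraint, nodes)
	}
}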

[FEATURE] Emit detailed message upon deploy exit

It would be advantageous for Levant to emit a detailed JSON message upon deployment finish with metrics and information regarding the deployment. This information could include:

  • time to render
  • total time of deployment
  • allocs as part of deployment
  • deployment id
  • diff from last version

Add a docs directory to allow better documentation

Description

Currently Levant is solely documented within the README file, which provides a good enough overview. It would be ideal, though, to have more information documented around key features and how they work, which would be too verbose for the README. Levant should therefore implement a docs directory where this kind of information can be stored.

[FEATURE] Check batch jobs reach the `running` status

Currently, if Levant registers a job of type batch, it just checks that the Nomad registration call succeeded and then exits. Levant can and should be updated to inspect the job and ensure that a running status has at least been achieved.

Due to the periodic nature of batch workloads and the fact an allocation might not start for a significant time, I do not see any method whereby the job can be further verified.

Relates to GH-52

Add Levant `job-logs` command to tail logs

Description

Nomad 0.8.0 will bring in a proxy-style feature so that client-based HTTP API requests can be proxied through the Nomad server API. This is required as Nomad clients should be running in private subnets, with programmatic access from applications such as Jenkins controlled and only allowed through the server API.

This issue therefore proposes a new command using that feature which allows the streaming of job logs until the job completes. This is very helpful for batch and parameterised jobs, where operators likely wish to trigger an instance of the job and then track the logs until completion.

New use of `path.Ext(variableFile)` results in 0 extension matches

The RenderTemplate function was changed to use path.Ext(variableFile) rather than splitting the path, which results in the dot being included in the extension, whereas previously this was not the case. The constants within templater.go therefore need to be updated to include the dot, otherwise no var-file extensions will ever be matched.
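A two-line Go demonstration of the behaviour; path.Ext returns the extension with the leading dot, so an extension constant written as "yaml" (hypothetical name) would never match:

package main

import (
	"fmt"
	"path"
)

func main() {
	fmt.Println(path.Ext("vars.yaml")) // prints ".yaml", not "yaml"
}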

dynamicGroupCountUpdater doesn't take into account job status

Description

If you stop a Nomad job, it is still present until the GC is run. When the dynamicGroupCountUpdater function runs, it does not take into account jobs which may be returned by the API but are stopped, and can therefore update a count based on a stopped job. This should be changed so that stopped jobs are ignored by the dynamicGroupCountUpdater function.

Output of levant version:

Levant v0.1.0-dev

Output of nomad version:

Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e)

Support Local Template Rendering

Feature Request

It would be really useful if Levant supported a mechanism to render the template to stdout or a local file. This would allow you to test the template rendering without submitting the job to the Nomad API.

Proposal

Add a levant render command that will, by default, render the job file to stdout and support an optional flag that will render the job file to disk. This could also be implemented as a flag to deploy that instructs Levant to render-only.
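A sketch of the proposed usage; the -out flag name is purely hypothetical:

levant render -var-file=vars.yaml job.nomad
levant render -var-file=vars.yaml -out=rendered.nomad job.nomad

The first form would write the rendered job to stdout; the second would write it to disk.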

[FEATURE] Add API and Server framework

In larger environments, and where Nomad is used in a PaaS setup, it would be advantageous to run Levant as a sidecar application which exposes an API. This would allow users to either curl the API or use the Levant CLI to render or deploy their applications from a central point; at the same time this proxies Nomad API calls, meaning the critical Nomad API doesn't need to be exposed to all users who wish to perform a deployment.

Levant running as a central server would allow telemetry to be collected, centrally tracking deployment metrics for the cluster for overview. It should also use some form of worker queue, limiting the impact and number of API calls made by deployments as to not overwhelm the Nomad servers.

The current functionality should not be changed, and the implementation should allow users to run Levant with a direct Nomad connection. I would love any input, thoughts and general ideas from @pmcatominey @ericwestfall and @mimato.

Add option to wait until batch jobs *complete* prior to returning

Currently levant only waits until the job has reached a "running" state. But it would be great (for use in scripts) to be able to, optionally, wait until it has completed. Then levant could return with the appropriate error code or success depending on whether the job failed or not.

Thanks!
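A rough polling sketch of what such an option could do, assuming the official github.com/hashicorp/nomad/api client; note that Nomad reports both completed and failed batch jobs with status "dead", so allocations would still need inspecting to choose the exit code:

for {
	job, _, err := client.Jobs().Info(jobID, nil)
	if err != nil {
		return err
	}
	if job.Status != nil && *job.Status == "dead" {
		// terminal state reached; inspect allocations to decide success or failure
		break
	}
	time.Sleep(5 * time.Second)
}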

Segfault

With any version (0.0.1 or current master) I get this nil pointer.

[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xb36977]

goroutine 1 [running]:
github.com/jrasell/levant/levant.(*nomadClient).Deploy(0xc420089e60, 0xc4200b2d80, 0x0, 0xc420089e60)
	/var/lib/jenkins/jobs/levant/workspace/src/github.com/jrasell/levant/levant/deploy.go:63 +0x377
github.com/jrasell/levant/command.(*DeployCommand).Run(0xc42022be00, 0xc42008e060, 0x5, 0x5, 0xc420174f80)
	/var/lib/jenkins/jobs/levant/workspace/src/github.com/jrasell/levant/command/deploy.go:101 +0x524
github.com/jrasell/levant/vendor/github.com/mitchellh/cli.(*CLI).Run(0xc420199680, 0xc420199680, 0x3, 0xc420174fe0)
	/var/lib/jenkins/jobs/levant/workspace/src/github.com/jrasell/levant/vendor/github.com/mitchellh/cli/cli.go:255 +0x1eb
main.RunCustom(0xc42008e010, 0x6, 0x6, 0xc42022bcb0, 0xc42022bbf0)
	/var/lib/jenkins/jobs/levant/workspace/src/github.com/jrasell/levant/main.go:49 +0x433
main.Run(0xc42008e010, 0x6, 0x6, 0xb3cdf8)
	/var/lib/jenkins/jobs/levant/workspace/src/github.com/jrasell/levant/main.go:17 +0x56
main.main()
	/var/lib/jenkins/jobs/levant/workspace/src/github.com/jrasell/levant/main.go:11 +0x63

The problem is that job.Type is nil because the Nomad job doesn't declare an explicit type such as:

job "foobar" {
  type = "service"
}

This should be handled a bit more gracefully.
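A defensive nil check along these lines would avoid the panic ("service" is Nomad's default job type):

if job.Type == nil {
	defaultType := "service"
	job.Type = &defaultType
}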

Batch jobs *do* get evaluations on registration

#52 reports that evaluations are not created for batch jobs on registration. This is only the case for parameterized batch jobs. All other types of batch jobs do get evaluations on registration (including periodic jobs). This has led to some unnecessary limits and code fragmentation with regard to handling batch jobs.

[FEATURE] If a deployment fails; Levant should check rollback

Nomad 0.6.0 introduced the idea of auto-revert, whereby a failed deployment can revert to the last known stable version of the job. Currently Levant can detect when a deployment has failed and inspect the allocations for why the failure occurred, but it does not check whether the job is configured to auto-revert or monitor the status of this revert.

Therefore Levant could include auto-revert tracking to further enhance deployment feedback and usefulness.

Only error messages from the final task in a job are reported.

Given a multi-task job like this:

job "test" {
  datacenters = ["awse"]
  type        = "service"

  # set our update policy
  update {
    max_parallel     = 1
    health_check     = "checks"
    min_healthy_time = "30s"
    healthy_deadline = "3m"
    auto_revert      = false

    #canary           = 1
    #stagger          = "30s"
  }

  group "jobs" {
    # set our restart policy
    restart {
      interval = "1m"
      attempts = 1
      delay    = "15s"
      mode     = "delay"
    }

    count = 1

    task "first" {
      leader = true

      # our image setup
      driver = "exec"

      config {
        command = "fail"
      }

      resources {
        cpu    = 100
        memory = 100
      }
    }

    task "second" {
      # grant access to secrets
      driver = "exec"

      config {
        command = "sleep"
        args    = ["infinity"]
      }

      resources {
        cpu    = 100
        memory = 100
      }
    }
  }
}

Only errors from starting the second task will be output. Specifically, the output from the above is:

$ levant deploy /tmp/test.nomad    
[INFO] levant/deploy: using dynamic count 1 for job test and group jobs
[INFO] levant/deploy: triggering a deployment of job test
[INFO] levant/deploy: evaluation 795fb11d-0eee-fd87-a4a5-f556fcb06d2b finished successfully
[INFO] levant/deploy: beginning deployment watcher for job test
[ERROR] levant/deploy: deployment 61e86125-6705-d5e6-05bf-3b4b7635f49b has status failed
[ERROR] levant/failure_inspector: alloc b420f331-a0ef-ea59-2fae-cac85a158138 incurred event sibling task failed because task's sibling "first" failed
[ERROR] levant/failure_inspector: alloc b420f331-a0ef-ea59-2fae-cac85a158138 incurred event driver failure because failed to start task "first" for alloc "b420f331-a0ef-ea59-2fae-cac85a158138": binary "fail" could not be found
[ERROR] levant/failure_inspector: alloc b420f331-a0ef-ea59-2fae-cac85a158138 incurred event not restarting because Error was unrecoverable
[INFO] levant/auto_revert: job test is not in auto-revert; POTENTIAL OUTAGE SITUATION
[ERROR] levant/command: deployment of job test failed

Notice that it says alloc b420f331-a0ef-ea59-2fae-cac85a158138 incurred event sibling task failed because task's sibling "first" failed but it doesn't actually show any logs from that first task.

In this case we get a message because the first task was set as the leader, so the second task failed as well; however, if just the first task had failed, no logs or allocation listings would be output at all.

Thanks!

Support Autoloading of Default Files

Feature Description

It would be cool if Levant supported autoloading of default files when running levant deploy.

Proposal

When a user runs levant deploy, Levant would look in the directory for a default variable file levant.yaml|hcl|yml and if a single Nomad job file is present, would use the default variable file to render any parameterized sections and trigger a deployment.

This would be similar to the behavior of Terraform where a terraform.tfvars file is automatically referenced and any *.tf file is processed. For this use case, it would probably make sense to only support this feature if a single *.nomad file is present.
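A sketch of the proposed layout and invocation (file names illustrative):

project/
  levant.yaml    <- default variable file, picked up automatically
  api.nomad      <- the single job template in the directory

Running levant deploy from project/ with no further arguments would then render api.nomad using levant.yaml and trigger the deployment.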

[FEATURE] required vars

Just like with Terraform it would be nice to be able to set required vars:

variable foo {}

In YAML that would be an empty key:

---
foo:

This would make it easier to force a user to set a dynamic var like version.

Failed auto promote does not need to go through auto-revert

Description

The nature of a canary deployment means that failures do not need to go through auto-revert checking. Levant therefore needs to be updated so that canary failures no longer invoke the auto-revert functions, which can produce erroneous messages that may confuse the operator.

Levant should use a config struct to build each deployment

Description

Currently Levant does not use a central deployment struct to hold information about the deployment object and the binary run as a whole. It would be beneficial to move the key parameters and objects into a struct, which would allow easier tracking of information about a deployment and make future extensibility easier.
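One possible shape for such a struct, sketched with hypothetical field names:

type levantDeployment struct {
	jobFile           string            // path to the Nomad job template
	variableFiles     []string          // -var-file inputs
	variables         map[string]string // -var inputs
	canaryAutoPromote time.Duration     // canary-auto-promote period, if set
	job               *api.Job          // the rendered Nomad job object
}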

Failure Introspection leaves me wanting more

Hello, I was really hopeful that -log-level debug would assist me, but it still leaves me wanting more. I don't know how to get any more info out of levant in this case. Any ideas?

20:49:09.479 [DEBUG] levant/deploy: running dynamic job count updater for job sre-example-app
20:49:09.485 [INFO] levant/deploy: using dynamic count 3 for job sre-example-app and group production
20:49:09.485 [INFO] levant/deploy: triggering a deployment of job sre-example-app
20:49:09.495 [DEBUG] levant/deploy: beginning deployment watcher for job sre-example-app
20:49:09.500 [DEBUG] levant/deploy: Nomad returned an empty deployment for evaluation 7d96d04e-92f6-f9be-e4cc-6e0c7d904d1e; retrying
20:49:11.521 [DEBUG] levant/deploy: deployment 1ac0c116-f987-af95-be37-292f8842c707 running for 2.02595389s
20:49:11.521 [ERROR] levant/deploy: deployment 1ac0c116-f987-af95-be37-292f8842c707 has status failed, Levant will now exit
20:49:11.529 [ERROR] levant/command: deployment of job sre-example-app failed

Levant Allows Deployment When Template Rendering Is Incomplete

Description

If Levant fails to render variables in a template or interpolations are defined within the job but have no corresponding variable declaration, the job still passes validation and the deployment is submitted. Depending on where the interpolations are defined, this can lead to a job where the allocation tasks cannot be successfully started.

Repro Steps
Example files can be found here

  1. Run levant deploy -log-level=DEBUG -var-file=vars.yaml levant.nomad
  2. Check job allocation status and observe failure to download artifact due to missing interpolations.

Expected Behavior

Levant should detect failed or missing interpolation variables and decline to submit the job for deployment.

Actual Behavior

Levant plows on with the deployment; the job passes Nomad validation, but the task allocations fail since Levant interpolation tokens are present in the artifact source path.

root@ip-10-188-93-194:~# nomad alloc-status 0237ac55
ID                  = 0237ac55
Eval ID             = 8ddd633c
Name                = fabio.fabio[0]
Node ID             = 399ea2a6
Job ID              = fabio
Job Version         = 0
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 10/04/17 16:17:11 UTC

Task "fabio" is "pending"
Task Resources
CPU      Memory  Disk     IOPS  Addresses
250 MHz  60 MiB  300 MiB  0     http: 10.188.92.131:9999
                                ui: 10.188.92.131:9998

Task Events:
Started At     = N/A
Finished At    = N/A
Total Restarts = 1
Last Restart   = 10/04/17 16:17:13 UTC

Recent Events:
Time                   Type                      Description
10/04/17 16:17:13 UTC  Restarting                Task restarting in 16.122729546s
10/04/17 16:17:13 UTC  Failed Artifact Download  failed to download artifact "https://github.com/eBay/fabio/releases/download/v1.5.2/<no value>": bad response code: 404
10/04/17 16:17:13 UTC  Downloading Artifacts     Client is downloading artifacts
10/04/17 16:17:11 UTC  Task Setup                Building Task Directory
10/04/17 16:17:11 UTC  Received                  Task received by client

Levant Version

levant version
Levant v0.0.1-dev

Nomad Version

nomad version
Nomad v0.7.0-beta1
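A crude post-render guard is possible because Go's text/template renders missing map keys as "<no value>", which is exactly the artifact visible in the failed artifact URL above; alternatively, the template engine can be told to fail fast. Both are sketches, and Levant's actual templater may need a different hook:

// option 1: scan the rendered output for the placeholder
if strings.Contains(rendered, "<no value>") {
	return fmt.Errorf("template rendering incomplete: <no value> placeholder found")
}

// option 2: make text/template error on any missing variable
tmpl := template.New("job").Option("missingkey=error")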

Add dispatch wrapper so that dispatch jobs go through checking

Description

When running nomad dispatch, the invocation is almost identical to deploying a periodic or system job, with evaluations and allocations being created. This means dispatch jobs can also go through the same checks that the deploy command provides. It could therefore be helpful to add a thin wrapper around the dispatch command to allow use of this checking and feedback.

Integration testing

I'm planning to build some integration testing against a Nomad agent running in dev mode; all of the scenarios we want to test can be created with job files and shell scripts running inside containers to simulate failures of various kinds.

I'm opening this to discuss the approach these tests should take:

  • a shell script using the levant cli
  • tests in the levant package (Go)
  • a test package using the levant cli (Go)

Test scenarios:

  • successful deployment of a simple job
  • deployment failure (container exits with code 1)
  • driver error (bad docker image tag)
  • evaluation failure (unsatisfiable constraint)
  • canary auto promote success
  • canary auto promote failure

All scenarios can be implemented with the Docker driver and the alpine image.

Nomad ACL tokens are not supported

With Nomad 0.7 you can lock down the Nomad API by use of ACLs.

It seems levant does not use the NOMAD_TOKEN available in the environment, nor will it accept a -token argument.
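For reference, a sketch of how a -token flag could be wired through the official Nomad API client; DefaultConfig reads connection settings such as NOMAD_ADDR from the environment (current client versions also read NOMAD_TOKEN into SecretID):

cfg := api.DefaultConfig()
if token != "" { // e.g. populated from a -token CLI flag
	cfg.SecretID = token // the ACL token sent with every request
}
client, err := api.NewClient(cfg)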

Update config to allow TLS config passing

In order to work in certain situations, it would be useful if Levant allowed the user to pass TLS certificate details on deployment, including the following (see the configuration sketch after this list):

  • ca-cert
  • cert
  • cert-key
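A configuration sketch assuming the official Nomad API client's TLSConfig; the file paths are illustrative:

cfg := api.DefaultConfig()
cfg.TLSConfig = &api.TLSConfig{
	CACert:     "/etc/nomad.d/ca.pem",      // ca-cert
	ClientCert: "/etc/nomad.d/cli.pem",     // cert
	ClientKey:  "/etc/nomad.d/cli-key.pem", // cert-key
}
client, err := api.NewClient(cfg)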

Utilise Nomad Plan before running a deployment

Description

Levant could utilise Nomad Plan in order to better understand what changes will be made before a deployment. This would also provide better user feedback, as change details can be logged for operator use.
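A sketch using the official client; Jobs().Plan with diff enabled returns the scheduler's view of the change, which Levant could log before deploying:

plan, _, err := client.Jobs().Plan(job, true, nil)
if err != nil {
	return err
}
log.Printf("[INFO] levant/plan: job modify index %d", plan.JobModifyIndex)
// plan.Diff and plan.Annotations hold the per-task-group change details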

Add 'force-batch-now' command to periodic jobs

Description

During code releases of jobs which are periodic, it may be preferable for operators to execute a run of the job immediately rather than waiting for the scheduled execution. Nomad provides this option via the API, so this would be adding a flag to the deploy command.
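The API call in question, sketched with the official client; PeriodicForce creates a new evaluation that runs the periodic job immediately and returns its ID, which could feed Levant's existing evaluation checking:

evalID, _, err := client.Jobs().PeriodicForce(jobID, nil)
if err != nil {
	return err
}
log.Printf("[INFO] levant/deploy: forced periodic run, evaluation %s", evalID)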

Evaluations can incur errors which Levant does not catch

When triggering a Nomad job registration, the evaluation can incur errors, as seen below using the Nomad CLI:

==> Monitoring evaluation "a8a2fcf1"
    Evaluation triggered by job "traefik"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "a8a2fcf1" finished with status "complete" but failed to place all allocations:
    Task Group "traefik" (failed to place 4 allocations):
      * Resources exhausted on 1 nodes
      * Class "public" exhausted on 1 nodes
      * Dimension "network: bandwidth exceeded" exhausted on 1 nodes

When running this same registration through Levant all seems to go well:

[DEBUG] levant/templater: variable file not passed, using any passed CLI variables
[DEBUG] levant/deploy: running dynamic job count updater for job traefik
[INFO] levant/deploy: job traefik not running, using template file group counts
[INFO] levant/deploy: triggering a deployment of job traefik
[DEBUG] levant/deploy: job type system does not support Nomad deployment model
[INFO] levant/command: deployment of job traefik successful

Levant should therefore be updated to inspect the Nomad evaluation of a job deployment to catch these types of issues and provide feedback to the user.

[FEATURE] Connect Through Bastion / Jump Host

HashiConf 2017 follow-up....

This is a fantastic project and I can't wait to use it. I'm still surprised that some of these features aren't part of the nomad core, TBH.

I'm not sure if this is a "feature" request or a discussion item, but I'd be interested in hearing thoughts on it and see if I have an edge case or a mainstream one.

We have two primary deployment scenarios for our nomad applications:

  1. Automatically via CI/CD service that is hosted outside of our datacenter (which is just a VPC on AWS), and
  2. Operator-directed deployments when launching a new datacenter, DR, etc.

With our current templating and deployment system, ssh access through a bastion host is used to connect to the datacenter and then to the Nomad API running privately within it. This lets us use the consul DNS name for nomad (i.e. http://http.nomad.service.consul:4646) in the API url.

The job files are available outside of the datacenter (with the CI/CD service or the operator's workstation).

Both of the deployment scenarios I describe work if levant can access the job files locally and deploy the jobs into the remote datacenter(s).

I don't see any explicit support for this kind of thing in the project right now.

Is there some SSH proxy and tunneling magic that solves this without changing the project?

Is there native support for "communicators" in the low-level go libraries used by the project with configuration options that can be exposed? Packer has communicators and we've used them in our workflows successfully.

Unable to run deploy with an inline Consul template

The template block below fails to render as a template:

template {
    data = <<EOH
APP_ENV={{ key "config/app/env" }}
APP_DEBUG={{ key "config/app/debug" }}
APP_KEY={{ secret "secret/key" }}
APP_URL={{ key "config/app/url" }}
EOH
    destination   = "core/.env"
    change_mode   = "noop"
}

[ERROR] levant/command: template: jobTemplate:26: function "key" not defined

Not sure what the best solution is; perhaps providing an option to disable templating for those only interested in deployment monitoring?

[FEATURE] Canary workflow enhancements with auto-promote

Hi Folks,

Levant is perfect for our rolling deploy workflow, and we're trying it out at the moment, but we also have a canary workflow where we do canary app -> wait 5-10 mins -> promote canary. I'm wondering if this is something you would consider accepting into Levant and, if so, whether you think it should be an addition to levant deploy like levant deploy -canary=300s or a new command like levant canary -wait=300s

Thanks!

multi-line environment variables result in an error by levant but are accepted by nomad

Description
The following env is correctly interpreted by nomad, but results in an error in levant.

env {
        CHEF_ENV = "${meta.env}"
        APP_NAME = "api"
        IP_ADDRESS = "${attr.unique.platform.aws.public-ipv4}"
        LOCAL_HOSTNAME = "${node.unique.name}"
        JAVA_TOOL_OPTIONS = "-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=${NOMAD_PORT_jmx}
-Dcom.sun.management.jmxremote.local.only=true
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false 
-Djava.rmi.server.hostname=localhost 
-Dnetworkaddress.cache.ttl=60 
-Xms1024M 
-XX:-UseConcMarkSweepGC 
-Xmx1024M"
}

Nomad has no trouble with this. Levant generates the following error:
[ERROR] levant/command: error parsing: At 62:61: literal not terminated

Levant seems to error out on the line breaks in the JAVA_TOOL_OPTIONS env variable.
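Until the parser handles multi-line literals, one workaround (a sketch; the JVM treats the value as whitespace-separated options, so this is semantically equivalent) is to collapse the value onto a single line:

env {
  JAVA_TOOL_OPTIONS = "-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=${NOMAD_PORT_jmx} -Dcom.sun.management.jmxremote.local.only=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=localhost -Dnetworkaddress.cache.ttl=60 -Xms1024M -XX:-UseConcMarkSweepGC -Xmx1024M"
}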

[BUG] Levant cannot handle system jobs

As a result of #16 and the associated merge, Levant cannot currently (0.0.3-dev) handle system jobs. This is because system jobs do not have a task group count, and the code does not handle this, causing a panic:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1735bcc]

goroutine 1 [running]:
github.com/jrasell/levant/levant.(*nomadClient).Deploy(0xc42015e020, 0xc4200a1b00, 0x0, 0xc42015e000, 0x0)
	/Users/rasellj/go/src/github.com/jrasell/levant/levant/deploy.go:66 +0xbc
github.com/jrasell/levant/command.(*DeployCommand).Run(0xc4202036e0, 0xc420010180, 0x3, 0x3, 0xc4200e3940)
	/Users/rasellj/go/src/github.com/jrasell/levant/command/deploy.go:109 +0x58b
github.com/jrasell/levant/vendor/github.com/mitchellh/cli.(*CLI).Run(0xc4200b5a40, 0xc4200b5a40, 0x3, 0xc4200e39a0)
	/Users/rasellj/go/src/github.com/jrasell/levant/vendor/github.com/mitchellh/cli/cli.go:255 +0x1eb
main.RunCustom(0xc420010150, 0x4, 0x4, 0xc420203590, 0xc4202034d0)
	/Users/rasellj/go/src/github.com/jrasell/levant/main.go:49 +0x433
main.Run(0xc420010150, 0x4, 0x4, 0x173da88)
	/Users/rasellj/go/src/github.com/jrasell/levant/main.go:17 +0x56
main.main()
	/Users/rasellj/go/src/github.com/jrasell/levant/main.go:11 +0x63

Jobs without an update stanza will not result in deployments

Description

A Nomad job that does not have an update stanza will not result in a Nomad deployment. Levant should therefore check for this and log appropriately before exiting.

Output of levant version:

Levant v0.0.4-dev

Output of nomad version:

Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e)

Deployment watcher should log time stats in standard format

Current logging during deployment:

[INFO] levant/deploy: deployment 35f8e7e9-be24-4c1c-f93d-cd46bc627e8f running for 2.00932722s
[INFO] levant/deploy: deployment 35f8e7e9-be24-4c1c-f93d-cd46bc627e8f running for 7.126101742s
[INFO] levant/deploy: deployment 35f8e7e9-be24-4c1c-f93d-cd46bc627e8f running for 12.15899068s
[INFO] levant/deploy: deployment 35f8e7e9-be24-4c1c-f93d-cd46bc627e8f running for 17.322398309s

Levant should log durations with standard rounding; 2 decimal places is suggested.
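A one-line fix sketch in Go; formatting the elapsed seconds with %.2f gives the suggested two decimal places:

elapsed := time.Since(start)
log.Printf("[INFO] levant/deploy: deployment %s running for %.2fs", deploymentID, elapsed.Seconds())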

[FEATURE] Jobs with type 'system' can undergo running state checks

Description

0.0.4 introduced the feature that allows Levant to confirm batch jobs reach the running state even though they do not use Nomad's deployment functionality. System jobs can use the same functions to check that they too reach the running status, providing additional feedback.

Error when trying to deploy a batch job

Description

Howdy,
I'm attempting to deploy a batch job using Levant, and Levant is crashing, I believe because it is trying to get the task group count from a batch job, which doesn't have such a thing.

Relevant Nomad job specification file

This is somewhat redacted but just names of things:

job "thing-periodic" {
  type = "batch"
  periodic {
    cron             = "*/5 * * * * *"
    prohibit_overlap = false
  }
  datacenters = ["us-west-2", "us-west-1", "us-west-1a", "us-west-1c", "us-west-2a", "us-west-2b", "us-west-2c"]
  group "thing-periodic" {
    task "thing-herd" {
      driver = "docker"
      config {
        image = "place.jfrog.io/thing:[[.DEPLOY_SUB___VERSION]]"
        command = "/usr/bin/python3"
        args = ["herd.py"]
        port_map = {}
        logging {
          type = "journald"
          config {
            tag = "${NOMAD_META_SPLUNK_INDEX}:${NOMAD_ALLOC_NAME}.${NOMAD_TASK_NAME}.${NOMAD_ALLOC_ID}"
          }
        }
      }
      resources {
        cpu = 200
        memory = 256
        network {
          mbits = 20
        }
      }
      env {
        CONFIG_FILE = "/local/herd.yml"
      }
      meta {
        SPLUNK_INDEX = "thing"
      }
      template {
        data = <<EOH
---
statsd_host: '{{ env "attr.driver.docker.bridge_ip" }}'
shepherds:{{ range service "thing@us-west-1" }}
- {{ .NodeAddress }}:{{ .Port }}{{ end }}{{ range service "thing@us-west-2" }}
- {{ .NodeAddress }}:{{ .Port }}{{ end }}
        EOH
        destination = "/local/herd.yml"
        change_mode = "noop"
      }
    }
    restart {
      attempts = 2
      delay = "15s"
      interval = "5m"
      mode = "fail"
    }
  }
}

Output of levant version:

Levant v0.0.3

Output of consul version:

Consul v0.7.5
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Output of nomad version:

Nomad v0.7.0

Additional environment details:

Levant is running in our CD environment, which is GoCD on Ubuntu 14.04

Debug log outputs from Levant:

17:45:31.837 [INFO] levant/deploy: job thing-periodic not running, using template file group counts
17:45:31.839 panic: runtime error: invalid memory address or nil pointer dereference
17:45:31.839 [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xb2f968]
17:45:31.839 
17:45:31.840 goroutine 1 [running]:
17:45:31.840 github.com/jrasell/levant/levant.(*nomadClient).Deploy(0xc42000cb98, 0xc420083680, 0x0, 0xc42000cb00, 0x0)
17:45:31.840 	/home/travis/gopath/src/github.com/jrasell/levant/levant/deploy.go:70 +0xd8
17:45:31.840 github.com/jrasell/levant/command.(*DeployCommand).Run(0xc4201fb560, 0xc42000e0f0, 0x1, 0x6, 0xc4200c78a0)
17:45:31.840 	/home/travis/gopath/src/github.com/jrasell/levant/command/deploy.go:123 +0x5a7
17:45:31.840 github.com/jrasell/levant/vendor/github.com/mitchellh/cli.(*CLI).Run(0xc420099a40, 0xc420099a40, 0x3, 0xc4200c7900)
17:45:31.840 	/home/travis/gopath/src/github.com/jrasell/levant/vendor/github.com/mitchellh/cli/cli.go:255 +0x1eb
17:45:31.840 main.RunCustom(0xc42000e090, 0x7, 0x7, 0xc4201fb410, 0xc4201fb350)
17:45:31.840 	/home/travis/gopath/src/github.com/jrasell/levant/main.go:49 +0x433
17:45:31.840 main.Run(0xc42000e090, 0x7, 0x7, 0xb37f08)
17:45:31.840 	/home/travis/gopath/src/github.com/jrasell/levant/main.go:17 +0x56
17:45:31.840 main.main()
17:45:31.841 	/home/travis/gopath/src/github.com/jrasell/levant/main.go:11 +0x63

[BUG] Levant does not handle periodic job types properly

While the service scheduler is the only one to properly support deployments, there is value in using levant to render job file templates; it should therefore support registering jobs of other scheduler types.

Currently solved with the following:

levant render -var x=y job.nomad > rendered.nomad
nomad run rendered.nomad

Add metrics to allow tracking of key statistics

Description

It could be helpful for Levant to emit a small number of statistics about deployment runs to a telemetry store. This could include details such as overall deployment time and success and failure counts, to provide a central overview point in environments where multiple people or teams are using Levant to deploy to a cluster.
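One option, sketched with the github.com/armon/go-metrics library commonly used across HashiCorp tooling (the metric names here are hypothetical):

// record overall deployment time
defer metrics.MeasureSince([]string{"levant", "deploy", "time"}, time.Now())

// count outcomes
metrics.IncrCounter([]string{"levant", "deploy", "success"}, 1)
metrics.IncrCounter([]string{"levant", "deploy", "failure"}, 1)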

Rework checkJobStatus function

Description

The checkJobStatus function, previously known as checkBatchJob, is a little weak and should be updated to perform better checking of the Nomad response. In particular, the function currently doesn't catch if the job is dead/complete/failed, and only returns on either a timeout or the job reaching a status of running.

With the improvement, Levant can also be updated to launch the failureInspector if the job does not reach the running state by the timeout or if it reaches an undesirable state. A follow-up ticket will be opened upon completion of this work.
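A sketch of the stricter status check, run inside the existing polling loop; the status strings are those reported by the Nomad API:

switch *job.Status {
case "running":
	return nil // desired state reached
case "dead":
	// terminal state: hand off to the failureInspector instead of waiting for the timeout
	return fmt.Errorf("job %s reached terminal status %q", *job.ID, *job.Status)
default:
	// still pending; keep polling until the timeout expires
}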
