data-cat

Deploying DataDog for a large-scale infrastructure

Definitions

  • Geographic Regions
  • Stages
  • Applications

Geographic Regions

Matches the definition of AWS Regions. The same concept can be used for GCP regions or on-prem datacenters as well.

Stages

Different stages of application deployments, usually: dev, qa, prod.

Applications

A service that provides a distinct business functionality.

Goals

  • having all monitors and dashboards in version control
  • having all monitors templated
  • being able to address smaller parts of the infrastructure

Implementation

Four files represent the DataDog configuration for the whole infrastructure.

  • infrastructure.yaml

It contains the logical grouping of applications into stages and regions. The relations are always N:M: one region can contain many stages, and each stage can contain many applications.

  • region.yaml

Defaults for a certain region (region).

  • stage.yaml

Defaults for a certain stage (region, stage).

  • application.yaml

Configuration that is specific for a certain application (region, stage, application).

Generating infrastructure.yaml

I recently discovered Dhall, which seems like the perfect fit for writing the infrastructure definition and then generating the YAML files from it.

The type-safe definitions look like the following:

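-- Assumption: the original snippet does not show how Prelude is brought into scope.
let Prelude = https://prelude.dhall-lang.org/package.dhall

-- Builds a single Map entry ({ mapKey, mapValue }) for the given key and value types.
let keyValue =
        λ(k : Type)
      → λ(v : Type)
      → λ(mapKey : k)
      → λ(mapValue : v)
      → { mapKey = mapKey, mapValue = mapValue }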

let ApplicationConfig : Type = { created_at : Text } 

let Application = < etcd | postgresql | hadoop >
let Applications = Prelude.Map.Type Application ApplicationConfig
let application = keyValue Application ApplicationConfig

let Stage = < dev | qa | prod >
let Stages = Prelude.Map.Type Stage Applications
let stage = keyValue Stage Applications

let AwsRegion = < us-east-1 | eu-central-1 | eu-west-1 >
let AwsRegions = Prelude.Map.Type AwsRegion Stages
let awsRegion = keyValue AwsRegion Stages

After having these definitions we can create the infrastructure:

in  [ awsRegion AwsRegion.us-east-1
        [ stage Stage.dev
             [ application Application.hadoop { created_at = "2019-11-04T09:00:00Z" } 
             , application Application.etcd { created_at = "2019-11-04T09:00:00Z" } 
             ]
        , stage Stage.qa
             [ application Application.hadoop { created_at = "2019-11-04T09:00:00Z" } 
             , application Application.etcd { created_at = "2019-11-04T09:00:00Z" } 
             ]
        ]
        
    , awsRegion AwsRegion.eu-west-1
        [ stage Stage.dev
             [ application Application.hadoop { created_at = "2019-11-04T09:00:00Z" } 
             , application Application.etcd { created_at = "2019-11-04T09:00:00Z" } 
             ]
        ]
    , awsRegion AwsRegion.eu-central-1
        [ stage Stage.dev
            [ application Application.hadoop { created_at = "2019-11-04T09:00:00Z" } 
            , application Application.etcd { created_at = "2019-11-04T09:00:00Z" } 
            ]
        ]
    ]

Generating the YAML:

dhall-to-yaml --file infrastructure.dhall > infrastructure.yaml

Generating the folder structure

python3 gen.py                                                                        
region: eu-central-1, stage: dev
region: eu-central-1, stage: dev, app: etcd
region: eu-central-1, stage: dev, app: hadoop
region: eu-west-1, stage: dev
region: eu-west-1, stage: dev, app: etcd
region: eu-west-1, stage: dev, app: hadoop
region: eu-west-1, stage: prod
region: eu-west-1, stage: prod, app: etcd
region: eu-west-1, stage: prod, app: hadoop
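
The source of gen.py is not shown here, so the following is only a minimal sketch of how such a generator could walk the infrastructure and create one folder per application. The nested dict stands in for the parsed infrastructure.yaml, and the `<region>/<stage>/<app>` layout and the `generate_tree` name are assumptions made for illustration.

```python
from pathlib import Path
import tempfile

# Hypothetical in-memory form of infrastructure.yaml: region -> stage -> app -> config.
INFRASTRUCTURE = {
    "eu-west-1": {
        "dev": {
            "hadoop": {"created_at": "2019-11-04T09:00:00Z"},
            "etcd": {"created_at": "2019-11-04T09:00:00Z"},
        },
    },
    "eu-central-1": {
        "dev": {
            "hadoop": {"created_at": "2019-11-04T09:00:00Z"},
            "etcd": {"created_at": "2019-11-04T09:00:00Z"},
        },
    },
}


def generate_tree(infrastructure, root):
    """Create root/<region>/<stage>/<app> folders and return the log lines."""
    lines = []
    for region in sorted(infrastructure):
        for stage in sorted(infrastructure[region]):
            lines.append(f"region: {region}, stage: {stage}")
            for app in sorted(infrastructure[region][stage]):
                # Assumed layout: one directory per application.
                (Path(root) / region / stage / app).mkdir(parents=True, exist_ok=True)
                lines.append(f"region: {region}, stage: {stage}, app: {app}")
    return lines


if __name__ == "__main__":
    # Write into a temporary directory so the sketch leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        for line in generate_tree(INFRASTRUCTURE, tmp):
            print(line)
```

Sorting the keys keeps the output stable between runs, which makes diffs of the generated tree easier to review.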

Templates

The templates folder contains the monitor templates.

Example template:

---
name: High CPU load on application_name:{application_name} stage:{stage} {{{{host.name}}}} / {{{{host.ip}}}}
tags:
  - application_name:{application_name}
  - stage:{stage}
  - region:{region}
type: metric alert
query: avg(last_5m):avg:system.load.norm.5{{application_name:{application_name},stage:{stage}}} by {{host}} > {critical_threshold}
message: >-2
  High CPU load on application_name:{application_name} stage:{stage} {{{{host.name}}}} / {{{{host.ip}}}} for 5 consecutive minutes on this node.
  Url: https://wd-global-prod.datadoghq.com/monitors/{monitor_id}
  {slack_notification_channel}
monitor_options:
  notify_audit: False
  locked: False
  timeout_h: 0
  silenced: {{}}
  include_tags: True
  require_full_window: True
  new_host_delay: 300
  notify_no_data: False
  renotify_interval: 0
  escalation_message: >-2
    CPU load is still damn high.
  thresholds:
    critical: {critical_threshold}
    warning: {warning_threshold}

This template is rendered with Python's str.format and converted to a dict that is used to talk to the DataDog API.
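
The brace counts in the template follow from how str.format treats braces: `{name}` is a placeholder, `{{` and `}}` produce one literal brace (needed for the DataDog query scope), and `{{{{host.name}}}}` ends up as the DataDog template variable `{{host.name}}`. A minimal sketch with a shortened template; the values are made up for illustration:

```python
# Shortened template using the same brace conventions as the example template.
TEMPLATE = (
    "name: High CPU load on application_name:{application_name} "
    "stage:{stage} {{{{host.name}}}}\n"
    "query: avg(last_5m):avg:system.load.norm.5"
    "{{application_name:{application_name},stage:{stage}}} by {{host}}"
    " > {critical_threshold}\n"
)

# Hypothetical values; in practice they would come from the configuration files.
values = {"application_name": "etcd", "stage": "qa", "critical_threshold": 4}

rendered = TEMPLATE.format(**values)
print(rendered)
```

The rendered text is plain YAML again, so a YAML parser such as PyYAML (installed in the Deployment section) can presumably turn it into the dict mentioned above.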

Defaults and specifics

Defaults are stage-wide settings; specifics apply to a single application (in a region and stage).
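
One way to combine the layers is a recursive merge in which the more specific file wins. The sketch below assumes that precedence (region, then stage, then application); the `deep_merge` helper and the sample values are hypothetical and not taken from data-cat itself.

```python
def deep_merge(base, override):
    """Return a copy of `base` with `override` applied; nested dicts are merged key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical contents of region.yaml, stage.yaml and application.yaml.
region_defaults = {
    "slack_notification_channel": "@slack-ops",
    "thresholds": {"critical": 4, "warning": 2},
}
stage_defaults = {"thresholds": {"critical": 6}}
application_specifics = {"slack_notification_channel": "@slack-etcd"}

# Apply the layers from least to most specific.
config = {}
for layer in (region_defaults, stage_defaults, application_specifics):
    config = deep_merge(config, layer)

print(config)
```

With these sample values the application overrides the notification channel, the stage overrides the critical threshold, and the warning threshold is inherited from the region.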

Tags alignment

For all of the above to work together nicely, the tags must be deployed on every node, ELB, etc., so that we can reference them in monitors and dashboards.

Deployment

I gave up on Conda and now just use Python's venv.

/usr/local/opt/python3/bin/python3 -m venv venv
. venv/bin/activate.fish # or the activate script for the shell you are using
pip install --upgrade pip
pip install --upgrade toml pyyaml

Deploying monitors

Deploying a whole stage:

./data-cat/data-cat.py deploy-monitors -r eu-west-1 -s qa 

Deploying a single application:

./data-cat/data-cat.py deploy-monitors -r eu-west-1 -s qa -a etcd

Deploying dashboards

Deploying a whole stage:

./data-cat/data-cat.py deploy-dashboards -r eu-west-1 -s qa 

Deploying a single application:

./data-cat/data-cat.py deploy-dashboards -r eu-west-1 -s qa -a etcd
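
The commands above suggest a CLI with two subcommands, a required region and stage, and an optional application. A minimal argparse sketch of that shape follows; the long option names and the `build_parser` function are assumptions, since only the short flags appear in the examples.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the deploy-monitors / deploy-dashboards commands."""
    parser = argparse.ArgumentParser(prog="data-cat.py")
    subcommands = parser.add_subparsers(dest="command", required=True)
    for name in ("deploy-monitors", "deploy-dashboards"):
        sub = subcommands.add_parser(name)
        sub.add_argument("-r", "--region", required=True)
        sub.add_argument("-s", "--stage", required=True)
        # Without -a the whole stage is addressed; with -a a single application.
        sub.add_argument("-a", "--application", default=None)
    return parser


args = build_parser().parse_args(
    ["deploy-monitors", "-r", "eu-west-1", "-s", "qa", "-a", "etcd"]
)
scope = "single application" if args.application else "whole stage"
print(args.command, args.region, args.stage, scope)
```

Leaving the application optional is what lets the same command address either a whole stage or one smaller part of the infrastructure.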

What to monitor

Following Brendan Gregg's USE method, these are the suggested resources to monitor:

  • CPUs: sockets, cores, hardware threads (virtual CPUs)
  • Memory: capacity
  • Network interfaces
  • Storage devices: I/O, capacity
  • Controllers: storage, network cards
  • Interconnects: CPUs, memory, I/O

How to monitor it (examples):

  • utilization: as a percent over a time interval, e.g., "one disk is running at 90% utilization"
  • saturation: as a queue length, e.g., "the CPUs have an average run queue length of four"
  • errors: scalar counts, e.g., "this network interface has had fifty late collisions"

About Us

LambdaInsight is a consultancy located in Europe, working on large-scale infrastructure, mostly at the intersection of data and cloud.

Let us know if you are interested in a project with us.
