newrelic / nr1-slo-r

NR1 SLO-R allows you to define, calculate and report on service-level objective (SLO) attainment.

Home Page: https://discuss.newrelic.com/t/track-your-service-level-objectives-with-the-slo-r-nerdpack/90046

License: Apache License 2.0

JavaScript 83.55% SCSS 16.45%

newrelic nerdpack nr1 nr1-slo-r slo error-slos alert-slos

nr1-slo-r's Introduction

SLO/R

Announcement

New Relic Service Level Management is now in Beta! To find out more please take a look at the docs.

Service Level Management introduces the ability to define and analyse Service Levels with a scalable and centralized user experience.

For users of SLO/R you can migrate the existing SLOs you have defined to this new format. Just follow the instructions in our migration companion.

What does this mean for the future of SLO/R?

We will be retiring SLO/R from the New Relic Apps Catalog. We highly recommend any new users take advantage of the in-product SLM experience as it is far superior to the SLO/R open source project.

We will update this repo to legacy status, and keep the code available as an example of working with SLOs.

For active users of SLO/R we will be reaching out to ensure your transition to the in-product experience is as easy as possible.

Usage

SLO/R lets you quickly define SLOs for error, availability, capacity, and latency conditions.

You can use the application for reporting out your results. By measuring SLO attainment across your service estate, you’ll be able to determine what signals are most important.

Using New Relic as a consistent basis to define and measure your SLOs offers better insight into comparative SLO attainment in your service delivery organization.

SLO/R provides two mechanisms for calculating SLOs. Event-based SLOs cover availability and latency and are calculated from defects, or from a specified latency, on transactions. Custom (alert-based) SLOs cover availability, capacity, and latency and are calculated from the total duration of alert violations.
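
To make the two mechanisms concrete, here is a minimal sketch of how each kind of attainment could be computed. It is an illustration only, not the code SLO/R ships: the function names are hypothetical, and treating a window with no transactions as 100% attained is an assumption. The linked error driven SLOs and alert driven SLOs docs remain the reference for the real calculations.

```javascript
// Event-based attainment: percentage of transactions that were not defects.
// Hypothetical helper, not part of SLO/R.
function eventAttainment(totalTransactions, defectTransactions) {
  if (totalTransactions <= 0) {
    return 100; // assumption: no traffic means nothing was violated
  }
  return ((totalTransactions - defectTransactions) / totalTransactions) * 100;
}

// Alert-based attainment: percentage of the time window not spent in violation.
// Hypothetical helper, not part of SLO/R.
function alertAttainment(windowMs, violationMs) {
  if (windowMs <= 0) {
    throw new RangeError("windowMs must be positive");
  }
  // Clamp so a bad input can never push attainment outside 0..100.
  const clamped = Math.min(Math.max(violationMs, 0), windowMs);
  return ((windowMs - clamped) / windowMs) * 100;
}

const sevenDaysMs = 7 * 24 * 60 * 60 * 1000;
console.log(eventAttainment(10000, 5).toFixed(3)); // 99.950
console.log(alertAttainment(sevenDaysMs, 30 * 60 * 1000).toFixed(3));
```

Both functions return a percentage that can be compared directly with the target attainment entered for the SLO.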

We are keen to see SLO/R evolve and grow to include additional features and visualizations. For version 1.0.1, we wanted to ship the core SLO calculation capabilities. We expect to rapidly build upon this core functionality through several releases. Please add an issue to the repo if there's a feature you'd like to see. For more details about the SLOs and their calculations, please see error driven SLOs and alert driven SLOs.

Open source license

This project is distributed under the Apache 2 license.

Dependencies

Requires New Relic APM.

SLO/R is intended to work specifically with services reporting to New Relic via an APM Agent. The service provides an entity upon which to define SLOs.

Event-based SLOs work with APM Transaction data.
Custom (alert-based) SLOs require a custom webhook configured to write SLOR_ALERTS events to NRDB. See Configuring SLO/R Alert Webhook for specific instructions.

Getting started

First, ensure that you have Git and NPM installed. If you're unsure whether they are installed, run the following commands. Each returns a version number if the tool is installed; otherwise the command won't be recognized:
```
git --version
npm -v
```
Next, install the New Relic One CLI by going to this link and following the instructions (5 minutes or less) to install and set up your New Relic development environment.
Next, to clone this repository and run the code locally against your New Relic data, execute the following commands:
```
nr1 nerdpack:clone -r https://github.com/newrelic/nr1-slo-r.git
cd nr1-slo-r
nr1 nerdpack:serve
```

Visit https://one.newrelic.com/?nerdpacks=local, navigate to the Nerdpack, and ✨

Deploying this Nerdpack

Open a command prompt in the nerdpack's directory and run the following commands.

```
# To create a new uuid for the nerdpack so that you can deploy it to your account:
nr1 nerdpack:uuid -g [--profile=your_profile_name]

# To see a list of API keys / profiles available in your development environment:
# nr1 profiles:list
nr1 nerdpack:publish [--profile=your_profile_name]
nr1 nerdpack:deploy [-c [DEV|BETA|STABLE]] [--profile=your_profile_name]
nr1 nerdpack:subscribe [-c [DEV|BETA|STABLE]] [--profile=your_profile_name]
```

Visit https://one.newrelic.com, navigate to the Nerdpack, and ✨

Configuring SLO/R Alert Webhook

The custom (alert-based) SLO types within SLO/R (availability, capacity, and latency) are calculated using the total duration of alert violations. To record those alert violations, we need to enable an Insights-directed webhook that captures the open and close events.

The alert payload needs to be as specified for SLO/R to operate as expected. Please follow these instructions to enable the alert event forwarding.

For more information on sending alert data to New Relic, see Sending Alerts data to New Relic.

How to configure and use SLO/R

Configuration in Entity Explorer

SLO definitions are scoped and stored with service entities. Open a service entity by exploring your services in the Entity explorer from the New Relic One homepage.

Select the service you are interested in creating SLOs for. In our example we will be using the Origami Portal Service.

Select the SLO/R New Relic One app from the left-hand navigation in your entity.

If you (or others) haven't configured an SLO the canvas will be empty. Just click on the Define an SLO button to begin configuring your first SLO.

The UI will open a side-panel to facilitate configuration. Fill in the fields:

SLO Name: Give your SLO a name. It has to be unique for the service, or it will overwrite similarly named SLOs for this entity.
Description: Give a quick overview of what you're basing this SLO on.
SLO Group: This is grouping meta-data. Typically organizations are responsible for multiple services and SLOs. This gives us the ability to roll up the SLO to an organizational attainment.
Target attainment: The numeric value, as a percentage, that you want as your SLO target (e.g. 99.995).
Indicator: There are four indicators for SLOs in SLO/R - Error, Availability, Capacity, and Latency. Error SLOs are calculated from Transaction event defects. Availability, latency, and capacity SLOs are calculated by alert violations.

Example error SLO

For Error SLOs you need to define the defects you wish to measure and the transaction names you want to associate with this SLO.

Example Availability SLO

Alert driven SLOs depend on alert events being reported in the SLOR_ALERTS table. Please see SLO/R alerts config to ensure you're set up to capture alert events.

Once you've created a few SLOs you should see a view like the following:

Configuration in Launcher app

Another way of configuring SLOs is through the Launcher app. The difference between creating SLOs from the Entity Explorer and from the Launcher is that, in the Launcher, an entity must be selected first.

Using app from Launcher

Multiple SLOs can be combined into tables, and the user's selection is stored in NRDB.

SLOs can be filtered by tags attached to them:

How is SLO/R arriving at the SLO calculations?

For details, see Alert SLOs and Error SLOs.

Community Support

New Relic hosts and moderates an online forum where you can interact with New Relic employees as well as other customers to get help and share best practices. Like all New Relic open source community projects, there's a related topic in the New Relic Explorers Hub. You can find this project's topic/threads here:

https://discuss.newrelic.com/t/track-your-service-level-objectives-with-the-slo-r-nerdpack/90046

Please do not report issues with SLO/R to New Relic Global Technical Support. Instead, visit the Explorers Hub for troubleshooting and best-practices.

Issues and enhancement requests

Issues and enhancement requests can be submitted in the Issues tab of this repository. Please search for and review the existing open issues before submitting a new issue.

Security

As noted in our security policy, New Relic is committed to the privacy and security of our customers and their data. We believe that providing coordinated disclosure by security researchers and engaging with the security community are important means to achieve our security goals.

If you believe you have found a security vulnerability in this project or any of New Relic's products or websites, we welcome and greatly appreciate you reporting it to New Relic through HackerOne.

Contributing

Contributions are welcome (and if you submit an enhancement request, expect to be invited to contribute it yourself 😁). Please review our contributors guide.

Keep in mind that when you submit your pull request, you'll need to sign the CLA via the click-through using CLA-Assistant. If you'd like to execute our corporate CLA, or if you have any questions, please drop us an email at [email protected].

nr1-slo-r's People

Contributors

Stargazers

Watchers

nr1-slo-r's Issues

Configure for circle ci

Prerequisites:

Github Personal Access token with "public_repo" scope
Snyk API Token
CircleCI Personal API Token

Required setup items:

Sample CSV import file:

user,email,agreement
circleci[bot],[email protected],TRUE
@semantic-release-bot,[email protected],TRUE

Alert Condition based SLOs

Summary

Currently you can only assign an alert driven SLO by policy. However that would mean you need a 1:1:1 on entity - condition - policy for these alerts to be accurate. Example - entity: app1, condition: throughput high, policy: app1 high throughput

Desired Behaviour

It would be great if we could select by condition under the policy.
Example - Policy: Backend, conditions: high CPU, high response time, low apdex

If you have policies grouped by app or function there needs to be a way to specify which condition we would like to pull out of the policy.

Possible Solution

Adding one more layer to the form to select the condition under the policy if needed. The condition name is being captured in the JSON payload from alerts already.

Automatically refresh the SLO's every 60 seconds

View details needs to respect time range in NRQL output

If you don't have any SLO Alert Policies, point the user to docs

Allow editing of SLO Result to indicate blackout windows or rejected defects

Provide a mechanism to take a given SLO result and post-edit it to annotate defects or alert periods that were part of an expected blackout, or items that should not apply to the SLO calculation.

Ensure these items are well documented and the revised SLO calculations appear with suitable annotation.

link to errors.md not working to setup error alert

https://github.com/newrelic/nr1-slo-r/blob/master/error_slos.md

link does not work

Launcher - Summary View

For each indicator (have a section per indicator):

render a summary "row" summarizing each indicator (in-memory using row-level data)
render a table or list of each SLO

// Some pseudo-code

render() {
  // One summary row and one table per indicator.
  return SLO_INDICATORS.map((indicator) => {
    return (
      <>
        <IndicatorSummary></IndicatorSummary>
        <IndicatorTable></IndicatorTable>
      </>
    );
  });
}

View the SLO definition outside of an edit function

Calculation Blackout Periods

The ability to specify a blackout period for an SLO definition so that known downtimes will be excluded from the calculations of SLO attainment.

Summary

Allows us to make better SLO designs that represent some of the variable aspects of time based SLO calculation.

Desired Behaviour

There should be a policy dialog with the SLO configuration, with the ability to specify a recurring policy or a one-off period of time. These should persist with the policy, probably as an array or something similar.

Possible Solution

As above: a dialog, plus a "blackouts" section in slo.json.
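
As a rough sketch of the idea, the function below removes one-off blackout windows from a single alert violation before its duration is counted. Everything here is hypothetical: the function name, the `{ start, end }` shape (epoch milliseconds), and the assumption that blackouts are stored as a plain array. Recurring blackouts are not handled.

```javascript
// Returns the parts of a violation that fall outside every blackout window.
// Hypothetical helper; intervals are { start, end } in epoch milliseconds.
function subtractBlackouts(violation, blackouts) {
  let segments = [{ start: violation.start, end: violation.end }];
  for (const b of blackouts) {
    const next = [];
    for (const s of segments) {
      // No overlap: keep the segment unchanged.
      if (b.end <= s.start || b.start >= s.end) {
        next.push(s);
        continue;
      }
      // Keep whatever sticks out before and after the blackout.
      if (b.start > s.start) next.push({ start: s.start, end: b.start });
      if (b.end < s.end) next.push({ start: b.end, end: s.end });
    }
    segments = next;
  }
  return segments;
}

const remaining = subtractBlackouts(
  { start: 0, end: 100 },
  [{ start: 20, end: 30 }, { start: 90, end: 120 }]
);
console.log(remaining); // [ { start: 0, end: 20 }, { start: 30, end: 90 } ]
```

Summing `end - start` over the returned segments gives the violation time that should still count against the SLO.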

Additional context

Just want to make the most useful configurator evah!

Provide alerting on SLO budget consumption

Per feedback -

The amount of budget we have left is what we need to monitor, and we also need to alert on the rate at which that budget is consumed. We need to know if something is about to fall over outside of the norm.

Replace the README screenshots

View SLO Definition

Currently, it's not possible to review the SLO definition. It'd be nice to be able to do that.

per @AlecIsaacson

Style the "view details" modal

It needs:

a heading
A section for NRQL that is either an accordion or tabs
That json output should be some pretty UI showing the definition of the document

Implement config screen in a modal

Error indicator not using correct field

When setting up an Error indicator SLO, it is filtering only on the httpResponseCode field and not including the response.status one. For at least .NET Framework agents, the httpResponseCode field is not populated but the response.status one is.
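
One possible shape for a fix is to build the defect filter so that it matches either attribute. The helper below is a hypothetical sketch, not the code in this repo, and it assumes both `httpResponseCode` and `response.status` hold the status code as a string.

```javascript
// Builds a NRQL condition that matches the given status codes in either
// attribute. Hypothetical helper; assumes both attributes are strings.
function buildDefectCondition(statusCodes) {
  if (!Array.isArray(statusCodes) || statusCodes.length === 0) {
    throw new Error("at least one status code is required");
  }
  const quoted = statusCodes.map((code) => {
    const text = String(code);
    // Only accept plain three-digit codes so nothing else ends up in the query.
    if (!/^\d{3}$/.test(text)) {
      throw new Error(`not a three-digit status code: ${text}`);
    }
    return `'${text}'`;
  });
  const list = quoted.join(", ");
  return `(httpResponseCode IN (${list}) OR response.status IN (${list}))`;
}

console.log(buildDefectCondition([500, 503]));
// (httpResponseCode IN ('500', '503') OR response.status IN ('500', '503'))
```

Rejecting an empty list up front also avoids generating a condition with an empty `IN ()` clause.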

Replace the table component

Add a link to the SLO definition docs to the UI

@ricegi is creating docs in the repo.

Add a link to it in the empty state
Add a link to it in the edit screen

NRQL query for alert based SLO not correct

Description

After defining an SLO, it doesn't work. When viewing the details of the SLO, the NRQL has WHERE policy_name IN (’’), which obviously won't work.

Steps to Reproduce

Define an SLO

Expected Behaviour

Should generate the correct NRQL.

Relevant Logs / Console output

Your Environment

NR1 CLI version used: @datanerd/nr1/1.2.2 win32-x64 node-v10.16.3
Browser name and version: Chrome
Operating System and version: Windows 10

Additional context

Style the new SLO table

We're using a new, more capable, table component. However, since the switch the styling of the table has regressed. Fix that.

SLO/R Overview shows no SLOs defined, even though several are created

Description

When I launch the SLO/R Overview page, it shows no SLOs defined. The SLO Group dropdowns are empty, even though several SLOs and Groups have been created in NR1.

Steps to Reproduce

1 - Go to a service and define a new Errors SLO
2 - Name the group while defining the SLO
3 - Go to the main NR1 page, and click on the SLO/R Launcher
4 - No SLOs are showing

Expected Behaviour

Defined SLOs should show in the SLO/R Overview page.

NR1 CLI version used: 1.10.10 darwin-x64 node-v10.16.3
Browser name and version: Chrome
Operating System and version: MacOS

Additional Attachments

Add favorites to a listing / card view

Edit SLO definitions

Amend the Add functionality to support edits.

Address alert violation overlaps

Currently, the code doesn't address overlaps in the alert violations for time attainment. We need to fix that.
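
A common way to handle this is to merge overlapping violations before summing their durations, so that time covered by two alerts at once is only counted once. The sketch below assumes each violation is an object with `open` and `close` timestamps in epoch milliseconds; that shape and the function name are hypothetical, not taken from the SLO/R code.

```javascript
// Total time covered by at least one violation, counting overlaps once.
// Hypothetical helper; violations are { open, close } in epoch milliseconds.
function totalViolationMs(violations) {
  // Copy and sort by open time so the caller's array is left untouched.
  const sorted = violations
    .filter((v) => v.close > v.open)
    .map((v) => ({ open: v.open, close: v.close }))
    .sort((a, b) => a.open - b.open);

  let total = 0;
  let current = null;
  for (const v of sorted) {
    if (current === null) {
      current = v;
    } else if (v.open <= current.close) {
      // Overlapping or touching: extend the current merged interval.
      current.close = Math.max(current.close, v.close);
    } else {
      // Gap: close out the merged interval and start a new one.
      total += current.close - current.open;
      current = v;
    }
  }
  if (current !== null) {
    total += current.close - current.open;
  }
  return total;
}

console.log(
  totalViolationMs([
    { open: 0, close: 10 },
    { open: 5, close: 15 },
    { open: 20, close: 30 },
  ])
); // 25
```

Without the merge, the same input would add up to 30, overstating the violation time by the 5 ms the first two alerts share.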

Add ability to view details, edit, and delete from table view

You can do all 3 of them from the grid view, but not the table view. Right now I'm thinking we just add in that menu button into a new column in the table for each row.

Add a description to an SLO

display description in view details (4c0e5b3)
display description in grid card view
mouseover description in table view

Use time picker to determine the limit for config transaction loads

The current config defaults to looking at the top 100 transactions from the last month. For accounts with billions of events this can pretty easily time out. So we should tie the transaction and alert selection lists to those events discovered during the time picker window.

Review terminology

Per feedback

The easiest feedback we got was around how we named items, e.g. the SLIs are latency, throughput, uptime, and error. The objective (the SLO) would then be defined as the target. Where we named the SLIs "Type", he said that was confusing; it should be "Indicator".

Also the term "error budget" as an SLO is not correct in that dropdown. It should be "errors" as that is the indicator.

The way he explained it was that if you have 100 transactions a day, the target would be that I want 90 of those 100 transactions to be error free. And the other 10 is the "budget" which is your SLO.

Two SLOs based upon Error indicator in the same SLO group overwrite the defects selected in the dropdown

Description

When creating two SLOs in the same SLO group, if both use the Error indicator then the defects selected in the dropdown (5xx errors, 401-unauthorised etc) are shared between the two SLOs. For example, if you wanted one SLO for 5xx errors and another for everything else, this is impossible because when you change one, the other is changed too.

Steps to Reproduce

Have two Error SLOs in a single SLO group. They have different transactions selected. Set one to use 5XX defects, the other to use 401 - Unauthorised. Then change one Error SLO and add a new defect. You should observe they now BOTH have this defect selected.

Expected Behaviour

I'd expect to be able to have two (or more) separate Error SLOs in an SLO group, each scoped to different transactions and different defects (5xx, 401, 403 etc).

Relevant Logs / Console output

None unfortunately.

Your Environment

NR1 CLI version used:
@datanerd/nr1/1.10.10 darwin-x64 node-v10.16.3
Browser name and version:
Chrome 79 64 bit. Also observed on customer machine (versions unknown)
Operating System and version:
Mac OS 10.14.3

Additional context

None.

SLO/R Entity Nerdlet not refreshing on Entity select

When looking at SLOs in the SLO/R entity nerdlet, if you select a new entity in the breadcrumbs, the view does not update.

Description

see above

Steps to Reproduce

see above

Expected Behaviour

The context should switch and you should see the SLOs associated with the new entity you've selected.

Relevant Logs / Console output

N/A

Your Environment

N/A

Additional context

None

Color code the table cells based on attainment

Provide sort of SLOs

Sort SLOs - It'd be nice if the SLO tiles could be sorted by best / worst performing.

per - @AlecIsaacson

Create Calendar View for each SLO

Provide a Weekly / Monthly view for each SLO attainment calculation.

Summary

Provides a view of SLO attainment in the conventional calendar sense rather than the rolling current, 7 day, and 30 day window.

Desired Behaviour

Possible toggle between calendar view and the current rolling SLO calculations.

Possible Solution

...

Additional context

Most people will want to have a calendar view of SLO attainment.

Modify the team definition into org language

SLO as Code

Provide documentation and tooling for defining SLO definition via GraphQL mutation and integration into CICD pipeline.

Save Each Monthly/Weekly SLO Attainment in NerdStore

As the basis of providing a more comprehensive on-going report - save off monthly calculations in NerdStore and provide a series of reporting options.

v. 1 SLO/R

Complete milestone 1 https://github.com/newrelic/nr1-csg-slo-r/milestone/1

https://docs.google.com/document/d/1Mu9u1X3o6kcY8kZG4IXUxD51PH7FDh6YiNl8dzHUyU8/edit

Open this NRQL in chart builder

Allow the NRQL behind an SLO definition to be automatically examined in Chart Builder.

See Graphiql Notebook or Datalyzer for a crib.

Auto Assign SLOs

Ability to scan services and auto assign SLOs for latency, availability, throughput and error budget.

Using historical data, take a running average to define the "99.5 percentile" and auto assign those numbers as targets. In this way, large organizations can quickly ramp up to speed and only adjust as necessary, rather than manually set up each process.

Error when editing an Error-based SLO

Description

When editing an existing Error SLO, I click on the three dots, then Edit. I add a new defect to the Error SLO and click Update Service. Intermittently, it doesn't update: when I click Edit on the same SLO again, the defect I added is not there.

Steps to Reproduce

See description above.

Expected Behaviour

When I add a new defect type to an existing Error SLO using the Edit option and click Update Service, any edits I make are persisted.

Relevant Logs / Console output

When I click on the Update Service button, I observed this error in the console

single-document.js:25 Uncaught (in promise) TypeError: e.map is not a function
    at c (single-document.js:25)
    at s (single-document.js:33)
    at single-document.js:65
    at c (runtime.js:45)
    at Generator._invoke (runtime.js:271)
    at Generator.T.forEach.e.<computed> [as next] (runtime.js:97)
    at c (runtime.js:45)
    at t (runtime.js:135)
    at runtime.js:170
    at new Promise (<anonymous>)

Your Environment

NR1 CLI version used:
@datanerd/nr1/1.10.10 darwin-x64 node-v10.16.3
Browser name and version:
Chrome 79
Operating System and version:
Mac OS 10.14.3

Additional context

SLO Group had multiple Error SLOs defined within it.

defects don't seem to be making it into the SLO definitions

When creating new SLOs, make SLO Group a drop down with options of existing groups

Summary

It's hard to remember/know which groups you've already created. To aid the user experience and prevent lots of duplicated SLO groups being created, you should be able to pick an SLO group from the dropdown when defining a new SLO.

Desired Behaviour

When defining a new SLO, the field for SLO Group should be a dropdown (if groups exist) or if not, allow user to create the first new group.

Possible Solution

Store groups in nerdstorage so the component for defining a new SLO can check to see if groups already exist, if so, display them in a dropdown.

Additional context

It's a poor user experience where you have to remember the groups already existing and spell the group exactly right for the SLO to end up in the right group.

Breakdown SLO Compliance Calculation by Transaction or Alert

It would be good to have a summary of how each element in the SLO calculation contributes to the overall calculation - either in a details drilldown or some indication.

Tab b/w a card vs. tabular view

Select / design icon for Launcher

Create SLO definition doc in the repo

Provide definitions for SLO calculations

We need to define what the SLO is based on specifically, a period of time, a total capacity or consumption of error budget.

Alert defined for SLOs (Budget Perspective)

Summary

As SLOs are defined we should think about them in terms of their overall budget. If the attainment objective is 99.98, for example, we should alert on the rate of budget consumption versus the total amount of time remaining in the time period.

e.g. - Error SLO of 99.5: halfway through the measurement period we are at 99.6 attainment. Based on a straight-line rate calculation for the SLO, that means we are not going to make our time-bound objective.
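
The arithmetic behind that example can be sketched as follows. This is an illustration under one explicit assumption: the current attainment is measured against the whole measurement period, so `100 - currentAttainment` is the budget already consumed. The function and field names are hypothetical.

```javascript
// Straight-line projection of error budget consumption.
// target and currentAttainment are percentages; elapsedFraction is in (0, 1].
function projectAttainment(target, currentAttainment, elapsedFraction) {
  if (!(elapsedFraction > 0 && elapsedFraction <= 1)) {
    throw new RangeError("elapsedFraction must be in (0, 1]");
  }
  const budget = 100 - target;              // total budget for the period
  const consumed = 100 - currentAttainment; // budget used so far (assumption above)
  const projectedConsumed = consumed / elapsedFraction;
  return {
    budget,
    consumed,
    burnRate: projectedConsumed / budget,   // above 1 means the budget runs out early
    projectedAttainment: 100 - projectedConsumed,
    onTrack: projectedConsumed <= budget,
  };
}

const projection = projectAttainment(99.5, 99.6, 0.5);
console.log(projection.projectedAttainment.toFixed(1)); // 99.2
console.log(projection.onTrack);                        // false
```

With a 0.5 budget and 0.4 already used at the halfway point, the straight-line projection ends the period at roughly 99.2, below the 99.5 objective, which is the situation an alert on budget consumption rate would need to catch.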

Desired Behaviour

Alerts defined for SLOs that are sophisticated enough to execute the rate based budget consumption alerting for an SLO

Possible Solution

TBD

Additional context

Use time window and rate of consumption for the alerting context ...

7 day SLO alert
30 day SLO alert
Specific Month SLO alert

Need way to define a new SLO from the Launcher

Summary

The definition of an SLO linked to a simple entity is too opaque - we need to make it easier to get to the SLO definition. In the case of Alert derived SLOs there is a really loose correlation between the entity and the Alert. So what's the point of limiting the definition of SLOs to the entity?

Desired Behaviour

It is easy to define everything you need for an SLO in one place.

Possible Solution

TBD - modification of the SLO definition experience

Additional context

Use entity meta-data (tags / labels) to provide the overview orchestration

Summary

As shipped, SLO/R relies on the construct of an SLO Group (nee organization, nee team) to group multiple SLOs into one attainment. This adds an artificial construct that won't age well in New Relic, so it would be better to use the meta-data that is already available with the entities as the basis for grouping. NR users can then just worry about organizing their entities with a proper taxonomy instead of having to re-do it in SLO/R.

Desired Behaviour

Select from a list of available metadata, or begin typing the metadata, and you get an overview report for all the SLOs on all the entities that contain that metadata. We could allow multiple tags to narrow the context.

Possible Solution

Update the composite / organization query logic to take an array of applicable entities based on the various tags chosen.

Additional context

I think this would dramatically improve the flexibility to report overviews for SLOs.
