slo

Title of the SLO Document

This document describes the SLO for chatbot service.

Status	Published
Author	Aamir Raza
Date	2021-04-26
Reviewers
Approvers
Approval date
Revisit date

Service Overview

Transform chatbot based user and cosumer experience that is integrated with Facebook and Line Messenger.Platform core services are

Chatbot management web application
Message sending and receiving service to messenger
Group of services that cut out common processing of prior services.

Users interacts with application through messenger,its inflows and utterances are sent back to messagig service .Serverless pipeline is in built between user and messaging service.Serverless stacks JSON data and returns HTTP response immediately. Web application is for user management ,CRUD of chatbot utterance content and DB linkage with client.It faces cloud load balancer and fastly for static file delivery. gRPC based group of services for providing common functionlity such as narrowing down of users. Memory store is used for scenarios distribution to users. All data is persisted into single database for CRUD ops.

The SLO is uses a four week rolling window.
Each objective has a separate error budget
Formula = 100% minus (-) the goal for that objective.

SLI's capture ratio of good events to total events
Error budget gives number of allowed bad events.
Error rate is the ratio of bad events to total events

SLIs and SLOs

Category	SLI	SLO	Error budget	Error rate
Request Driven API			Total no. of requests are 1,000,000 than value of error budget is
Availability	Any HTTP status code other than 500-599 is considered successful
	Proportion of successful http requests / total http requests	97% success	3% = 30000 errors	3%
Latency	Proportion of fast reqs <400 ms / total no. of reqs	90% reqs <400ms	10% = 1,000,00 reqs<400ms	10%
	Proportion of fast reqs <800 ms / total no. of reqs	97% reqs <800ms	3% = 30000 reqs<800ms	3%
	Proportion of slow reqs <6000 ms / total no. of reqs	80% reqs <6000ms	20% = 2,000,00 reqs<6secs	20%
	Proportion of slow reqs <8000 ms / total no. of reqs	89% reqs <8000ms	11% = 1,100,00 reqs<8secs	11%
Error	Explicit: HTTP 500-599
	Proportion of errors having status code / total http reqs	3% error	3% = 3,000,0 errors	3%
	Implicit: HTTP 200 but coupled with wrong content
	Proportion of errors having wrong content / total http reqs	1% error	1,000,0 errors	1%
	Policy:
	Committed to 1 sec response time but delayed	3% conflict with defined policy	3,000,0 errors	3%
Quality	Proportion of successful reqs when cpu overloaded 90 %	80 % success	20% = 2,000,00 errors	20%
	Proportion of successful reqs when memory overloaded 90 %	80% success	20% = 2,000,00 errors	20%
	Proportion of successful reqs whe datastore is unavailable %	80% success	20% = 2,000,00 errors	20%
Web server
Availability	Proportion of successful web requests / total web requests	99.9% success	0.1% = 1000errors	0.1%
Latency	Proportion of fast reqs <200 ms / total no. of reqs	90% reqs <200ms	10% = 1,000,00 reqs<200ms	10%
	Proportion of fast reqs <1000 ms / total no. of reqs	99% reqs <1000ms	1% = 1,000,0 reqs<1secs	1%
	Proportion of slow reqs <6000 ms / total no. of reqs	80% reqs <6000ms	20% = 2,000,00 reqs<6secs	20%
	Proportion of slow reqs <8000 ms / total no. of reqs	89% reqs <8000ms	11% = 1,100,00 reqs<8secs	11%
gRPC Server
Availability	Proportion of successful grpc requests/ total grcp requests	99.99% success	0.01% = 100 errors	0.01%
Latency	Proportion of fast reqs <200 ms / total no. of reqs	90% reqs <200ms	10% = 1,000,00 reqs < 200ms	10%
	Proportion of fast reqs <1000 ms / total no. of reqs	97% reqs <1000ms	3% = 3,000,0 reqs<1secs	3%
	Proportion of slow reqs <6000 ms / total no. of reqs	80% reqs <6000ms	20% = 2,000,00 reqs<6secs	20%
	Proportion of slow reqs <8000 ms / total no. of reqs	89% reqs <8000ms	11% = 1,100,00 reqs<8secs	11%
Pipeline
Freshness	Proportion of records read from table recently
	Recently is defined by 1 min to 10 min
	Use metrics from API and HTTP server
	Count of all data reqs for "api" & "webserver" with 1 min freshness / total no. of data reqs	90% of reads use data written previous 1 min	10% = 1,000,00 reads use data written more than 1 min	10%
	Count of all data reqs for "api" & "webserver" with 10 min freshness / total no. of data reqs	99% of reada use data written previous 10 min	1% = 1,000,0 reads use data written more than 10 min	1%
Correctness	Proportion of records injected into table by prober
	Result in correct data beingg read Prober should export outcome metric	99.999% of records injected by prober results in correct output
Completeness	Proportion of hours in which 100% of data processed (no data skipped) count of pipeline runs that procssed 100 percent of records divided by total pipeline runs	99 % of pipeline runs cover 100% data	In case of total 1000 pipelines runs 1% = 10 pipelines	1%

Suggestions: Overview of monitoring technique and existing infra should also mentioned.
Development technological stack with exact versions and languages should be mentioned

References:

aamir-raza-1 / slo Goto Github PK

slo's Introduction

slo

Title of the SLO Document

Service Overview

SLIs and SLOs

slo's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent