Comments (15)
I totally agree that this RFC doesn't need to fully define and populate the framework. Mostly I'd like to get consensus around this direction; then we can debate the system-to-level mappings and specific expectations separately.
I do think it's useful to consider some specific expectations as examples, because these immediately raise relevant questions (e.g., is it even reasonable to apply one latency expectation to both front-end and API applications, and do availability targets such as 99.9...% make sense for auction systems?).
from readme.
Criticality can definitely change.
Regarding deprecations, sure! Any downgrade to a lower level is a sort of deprecation in the sense that support will be decreased accordingly. There could even be a Level 0 ("unsupported"). I could imagine using that for experiments/hackathon-type applications that might share infrastructure, but wouldn't want any in-use system to reside there for long. In my opinion, teams should be forced to consider whether to support a system (e.g., as per level 1) or retire it.
I have another example of how this framework can be used:
For critical services that depend on a less-critical service, we can be more deliberate about the coupling (i.e., better handling of downtime in the backing service).
An example: if Metaphysics connects to many services, some critical and some less critical, Metaphysics should continue to operate even if one of the less critical backing services is down.
In any case this could provide a good way to talk about dependencies between services.
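In code, that kind of graceful degradation might look roughly like this (a minimal Python sketch; the function names and data are hypothetical stand-ins, not Artsy's actual APIs):

```python
# Minimal sketch (all names hypothetical) of a critical service degrading
# gracefully when a less-critical backing service is down.

def fetch_recommendations():
    # Stand-in for a call to a less-critical backing service that is down.
    raise TimeoutError("recommendations service unavailable")

def fetch_artwork():
    # Stand-in for the critical data path, which is healthy.
    return {"id": "example-artwork", "title": "Example Artwork"}

def artwork_page():
    page = {"artwork": fetch_artwork()}
    try:
        page["recommendations"] = fetch_recommendations()
    except (TimeoutError, ConnectionError):
        # Degrade: serve the page without the non-critical section
        # instead of failing the whole request.
        page["recommendations"] = []
    return page
```

The key design choice is that the failure of the less-critical dependency is contained to its own section of the response rather than propagating upward.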
Thanks for writing this up!
What do you think of inverting the scale?
L1 - Highest impact
L2
L3
L4 - Lowest impact
This aligns with incident priority class guidelines I've seen before. P1 (Priority 1) incidents have the highest priority, triggering specific response and resolution expectations.
https://wiki.en.it-processmaps.com/index.php/Checklist_Incident_Priority
I don't even think we need to assign names like "Critical" or "Important" to these levels. We might start with three levels and iterate on it as we need more granularity. As we shift levels around, L1 would always be the most impactful and have the strongest requirements.
I’m excited about this RFC! I definitely think defining a criticality abstraction is very useful for reasoning about dev lifecycle and support expectations for Artsy’s large service footprint.
In my opinion, the first step here is to label systems with criticality levels, and to define the impact of those levels just enough to feel confident in the labeling. Trying to itemize all of the obligations of service criticality is too large a task to get the ball rolling here, and should be delegated to further conversations.
In short: Very positive on this RFC, but suggest keeping the goal of this RFC scoped to the smallest first step (labeling systems with criticality levels). Also, Level 1 could be titled “Supporting”...
Clarification: To be extra clear, I do think it's critically important to define expectations for criticality levels (SLO expectations, vuln management expectations, etc.); otherwise the criticality levels don't actually mean anything. That should happen shortly after this RFC. I just imagine the criticality level assignment itself will be a lot of work, and we should tackle that first.
I'm with @dleve123 - I like the examples, and I think the actual expectations can differ depending on whether the system:
- is an external-facing server (grav/MP/exchange/volt/force)
- is a consumer-facing "app" (Eigen/reaction/emission/folio)
"App" here is a very loose term, but it would cover most of the important front-end-y stuff. iOS is a useful example because we already treat it like this and have projects from each bucket: 1 (Eigen), 2 (Folio/Kiosk), 3 (Apple TV).
This makes sense to me. I like how it adds more visibility into where our systems are and how they are used. I have a few questions though:
- Can the criticality of a project change? I imagine it can; what happens then, and how will we track it?
- What's our threshold for calling something critical or not? If an internal process depends on a function of a system, does that make the system critical? For example, APRb is pretty non-critical for the most part, but CRT uses it for their somewhat critical processes, such as when a failed charge happens.
Should there be an explicit level for systems that are not supported anymore / should be deprecated?
That's a great point... system couplings should always be made conservatively, especially when more critical services depend on less critical ones. A relevant discussion from the Google post:
Over time, we found that the failures of the global instance of Chubby consistently generated service outages, many of which were visible to end users. As it turns out, true global Chubby outages are so infrequent that service owners began to add dependencies to Chubby assuming that it would never go down. Its high reliability provided a false sense of security because the services could not function appropriately when Chubby was unavailable, however rarely that occurred.
The solution to this Chubby scenario is interesting: SRE makes sure that global Chubby meets, but does not significantly exceed, its service level objective. In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system.
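The error-budget arithmetic behind that practice can be sketched as follows (assumptions: a 99.9% availability target and a 90-day quarter; these numbers are illustrative, not Chubby's actual SLO):

```python
# Illustrative error-budget math: if real outages haven't consumed the
# budget by quarter's end, a controlled outage can spend the remainder,
# so dependents never come to assume perfect availability.
SLO = 0.999  # assumed availability target, not Chubby's actual figure

def remaining_error_budget(total_minutes, downtime_minutes):
    allowed_downtime = total_minutes * (1 - SLO)
    return allowed_downtime - downtime_minutes

quarter_minutes = 90 * 24 * 60  # minutes in a 90-day quarter
# With zero real downtime, roughly 129.6 minutes of budget remain,
# which could be "spent" via an intentional, controlled outage.
budget = remaining_error_budget(quarter_minutes, downtime_minutes=0)
```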
@joeyAghion - My perspective: I really like the conversation happening here and am interested in what it would mean to resolve this RFC.
Would positive resolution be signalled by overall positivity from the commenters about such a framework existing, and result in you leading the specification of the framework (with input from others as it makes sense)? I would be pro that resolution :D
There seems to be support for this general idea. There hasn't been much discussion of the example levels I've described or the associated expectations, so if anyone has suggestions there, please chime in.
Otherwise, I'll resolve this later today and follow up by:
- documenting the 3 or 4 criticality levels in a new playbook
- adding a column to the large project list 🔒 for each system's level
- asking associated tech leads to decide where each system fits
We probably need several follow-up discussions to set specific expectations for each level.
@dblandin I puzzled over the ordering as well, since what you're describing was initially more intuitive to me. However, when I looked at literature out there about maturity models, they all seemed to map "higher" levels to higher numbers. One minor nice thing about that is that it neatly leaves level 0 for everything else.
I just want to choose whatever will be least ambiguous. What do others think?
However, when I looked at literature out there about maturity models, they all seemed to map "higher" levels to higher numbers. One minor nice thing about that is that it neatly leaves level 0 for everything else.
Gotcha! Thanks for pointing out those resources 👍. I don't feel too strongly about the direction. Leveling up in number might be more intuitive. Leaving off the names might give us the flexibility to more easily change the scale and definitions going forward.
I'm excited about these new levels 💯
We probably need several follow-up discussions to set specific expectations for each level.
I think it would make sense to start at the top (L3) with our public-facing and business-critical APIs, i.e. Gravity, Metaphysics, Causality, and Positron, then work down to the lower levels. Even within a level, response-time expectations will vary between services, and we have to consider that Force is a client app, so there may be some distinctions we need to draw there.
I also wonder if it's worth considering our backing services within this framework, i.e. MongoDB, Elasticsearch, and RabbitMQ. It would be a good way to evaluate our providers and deployment options.
Resolution
We'll define this framework as levels 0-3, with 3 being the most critical. Specific expectations for systems are to-be-determined.
Level of Support
2: Positive feedback.
Additional Context:
There's ambiguity about what order the level numbers should go in. Based on other online content about maturity models I've opted for higher numbers to correspond with more critical systems.
More descriptive labels like critical/important/supported/unsupported have also been discussed. Given the potential ambiguity of the numbering, I think these labels improve overall clarity.
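As a rough illustration of the combined scheme (a sketch only; the example systems below are placeholders, not the actual assignments):

```python
# Hypothetical mapping of numeric levels to descriptive labels;
# higher numbers correspond to more critical systems.
CRITICALITY_LABELS = {
    3: "critical",
    2: "important",
    1: "supported",
    0: "unsupported",
}

# Placeholder assignments for the project-list column (not real data).
SYSTEM_LEVELS = {"example-api": 3, "example-admin-tool": 1}

def criticality_label(system):
    return CRITICALITY_LABELS[SYSTEM_LEVELS[system]]
```

Pairing the number with a label keeps the ordering unambiguous even if the scale is later extended or renumbered.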
Next Steps
- Documenting the 3 or 4 criticality levels in a new playbook
- Adding a column to the large project list for each system's level
- Asking associated tech leads to decide where each system fits
Exceptions
None that I know of yet.