Comments (15)
I totally agree that this RFC doesn't need to fully define and populate the framework. Mostly I'd like to get consensus around this direction; then we can debate the system-to-level mappings and specific expectations separately.
I do think it's useful to consider some specific expectations as examples, because these immediately raise relevant questions (e.g., is it even reasonable to apply one latency expectation to both front-end and API applications, and do availability targets such as 99.9...% make sense for auction systems?).
from readme.
Criticality can definitely change.
Regarding deprecations, sure! Any downgrade to a lower level is a sort of deprecation in the sense that support will be decreased accordingly. There could even be a Level 0 ("unsupported"). I could imagine using that for experiments/hackathon-type applications that might share infrastructure, but wouldn't want any in-use system to reside there for long. In my opinion, teams should be forced to consider whether to support a system (e.g., as per level 1) or retire it.
I have another example of how this framework can be used:
For critical services that depend on a less-critical service, we can be more deliberate about the coupling (i.e., better handling of downtime in the backing service).
An example: if Metaphysics connects to many services, some critical and some less critical, Metaphysics should continue to operate even if one of the less critical backing services is down.
In any case this could provide a good way to talk about dependencies between services.
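In code, that kind of graceful degradation might look roughly like this (a minimal Python sketch; the function names and data are hypothetical stand-ins, not Artsy's actual APIs):

```python
# Minimal sketch (all names hypothetical) of a critical service degrading
# gracefully when a less-critical backing service is down.

def fetch_recommendations():
    # Stand-in for a call to a less-critical backing service that is down.
    raise TimeoutError("recommendations service unavailable")

def fetch_artwork():
    # Stand-in for the critical data path, which is healthy.
    return {"id": "example-artwork", "title": "Example Artwork"}

def artwork_page():
    page = {"artwork": fetch_artwork()}
    try:
        page["recommendations"] = fetch_recommendations()
    except (TimeoutError, ConnectionError):
        # Degrade: serve the page without the non-critical section
        # instead of failing the whole request.
        page["recommendations"] = []
    return page
```

The key design choice is that the failure of the less-critical dependency is contained to its own section of the response rather than propagating upward.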
Thanks for writing this up!
What do you think of inverting the scale?
L1 - Highest impact
L2
L3
L4 - Lowest impact
This aligns with incident priority class guidelines I've seen before. P1 (Priority 1) incidents have the highest priority, triggering specific response and resolution expectations.
https://wiki.en.it-processmaps.com/index.php/Checklist_Incident_Priority
I don't even think we need to assign names like "Critical" or "Important" to these levels. We might start with three levels and iterate on it as we need more granularity. As we shift levels around, L1 would always be the most impactful and have the strongest requirements.
I’m excited about this RFC! I definitely think defining a criticality abstraction is very useful for reasoning about dev lifecycle and support expectations for Artsy’s large service footprint.
In my opinion, the first step here is to label systems with criticality levels, and to define the impact of those levels just enough to feel confident in the labeling. Trying to itemize all of the obligations of service criticality is too large a task to get the ball rolling here, and should be delegated to further conversations.
In short: Very positive on this RFC, but suggest keeping the goal of this RFC scoped to the smallest first step (labeling systems with criticality levels). Also, Level 1 could be titled “Supporting”...
Clarification: To be extra clear, I do think it's critically important to define expectations for criticality levels (SLO expectations, vuln management expectations, etc.); otherwise the criticality levels don't actually mean anything. That should happen shortly after this RFC. I just imagine the criticality level assignment itself will be a lot of work, and we should tackle that first.
I'm with @dleve123 - I like the examples, and I think the actual expectations can differ depending on whether the system:
- is an external-facing server (grav/MP/exchange/volt/force)
- is a consumer-facing "app" (Eigen/reaction/emission/folio)
"App" here is a very loose term, but it would cover most of the important front-end-y stuff. iOS is a useful example because we already treat it like this and have projects from each bucket: 1 (Eigen), 2 (Folio/Kiosk), 3 (Apple TV).
This makes sense to me. I like how it adds more visibility into where our systems are and how they are used. I have a few questions though:
- Can the criticality of a project change? I imagine it can; what happens then, and how will we track it?
- What's our threshold for calling something critical or not? If an internal process depends on a function of a system, does that make the system critical? For example, APRb is pretty non-critical for the most part, but CRT uses it for their somewhat critical processes, such as when a failed charge happens.
Should there be an explicit level for systems that are not supported anymore / should be deprecated?
That's a great point... system couplings should always be made conservatively, especially when more critical services depend on less critical ones. A relevant discussion from the Google post:
Over time, we found that the failures of the global instance of Chubby consistently generated service outages, many of which were visible to end users. As it turns out, true global Chubby outages are so infrequent that service owners began to add dependencies to Chubby assuming that it would never go down. Its high reliability provided a false sense of security because the services could not function appropriately when Chubby was unavailable, however rarely that occurred.
The solution to this Chubby scenario is interesting: SRE makes sure that global Chubby meets, but does not significantly exceed, its service level objective. In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system.
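The error-budget arithmetic behind that practice can be sketched as follows (assumptions: a 99.9% availability target and a 90-day quarter; these numbers are illustrative, not Chubby's actual SLO):

```python
# Illustrative error-budget math: if real outages haven't consumed the
# budget by quarter's end, a controlled outage can spend the remainder,
# so dependents never come to assume perfect availability.
SLO = 0.999  # assumed availability target, not Chubby's actual figure

def remaining_error_budget(total_minutes, downtime_minutes):
    allowed_downtime = total_minutes * (1 - SLO)
    return allowed_downtime - downtime_minutes

quarter_minutes = 90 * 24 * 60  # minutes in a 90-day quarter
# With zero real downtime, roughly 129.6 minutes of budget remain,
# which could be "spent" via an intentional, controlled outage.
budget = remaining_error_budget(quarter_minutes, downtime_minutes=0)
```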
@joeyAghion - My perspective: I really like the conversation happening here and am interested in what it would mean to resolve this RFC.
Would positive resolution be signalled by overall positivity from the commenters about such a framework existing, and result in you leading the specification of the framework (with input from others as it makes sense)? I would be pro that resolution :D
There seems to be support for this general idea. There hasn't been much discussion of the example levels I've described or the associated expectations, so if anyone has suggestions there, please chime in.
Otherwise, I'll resolve this later today and follow up by:
- documenting the 3 or 4 criticality levels in a new playbook
- adding a column to the large project list 🔒 for each system's level
- asking associated tech leads to decide where each system fits
We probably need several follow-up discussions to set specific expectations for each level.
@dblandin I puzzled over the ordering as well, since what you're describing was initially more intuitive to me. However, when I looked at literature out there about maturity models, they all seemed to map "higher" levels to higher numbers. One minor nice thing about that is that it neatly leaves level 0 for everything else.
I just want to choose whatever will be least ambiguous. What do others think?
However, when I looked at literature out there about maturity models, they all seemed to map "higher" levels to higher numbers. One minor nice thing about that is that it neatly leaves level 0 for everything else.
Gotcha! Thanks for pointing out those resources 👍. I don't feel too strongly about the direction. Leveling up in number might be more intuitive. Leaving off the names might give us the flexibility to more easily change the scale and definitions going forward.
I'm excited about these new levels 💯
We probably need several follow-up discussions to set specific expectations for each level.
I think it would make sense to start at the top (L3) with our public-facing and business-critical APIs, i.e. Gravity, Metaphysics, Causality, and Positron, then work down to the lower levels. Even within a level, response-time expectations will vary between services, and we have to consider that Force is a client app, so there may be some distinctions we need to draw there.
I also wonder if it's worth considering our backing services within this framework, i.e. MongoDB, Elasticsearch, and RabbitMQ. It would be a good way to evaluate our providers and deployment options.
Resolution
We'll define this framework as levels 0-3, with 3 being the most critical. Specific expectations for systems are to-be-determined.
Level of Support
2: Positive feedback.
Additional Context:
There's ambiguity about what order the level numbers should go in. Based on other online content about maturity models I've opted for higher numbers to correspond with more critical systems.
More descriptive labels like critical/important/supported/unsupported have also been discussed. Given the potential ambiguity of the numbering, I think these labels improve overall clarity.
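As a rough illustration of the combined scheme (a sketch only; the example systems below are placeholders, not the actual assignments):

```python
# Hypothetical mapping of numeric levels to descriptive labels;
# higher numbers correspond to more critical systems.
CRITICALITY_LABELS = {
    3: "critical",
    2: "important",
    1: "supported",
    0: "unsupported",
}

# Placeholder assignments for the project-list column (not real data).
SYSTEM_LEVELS = {"example-api": 3, "example-admin-tool": 1}

def criticality_label(system):
    return CRITICALITY_LABELS[SYSTEM_LEVELS[system]]
```

Pairing the number with a label keeps the ordering unambiguous even if the scale is later extended or renumbered.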
Next Steps
- Documenting the 3 or 4 criticality levels in a new playbook
- Adding a column to the large project list for each system's level
- Asking associated tech leads to decide where each system fits
Exceptions
None that I know of yet.