"fun" fact: I learned this week that on-call is a hyphenated word, oddly enough autocorrect hasn't figured out that "oncall" and "on-call" are related
this got me thinking about how the longer you stay in tech, the more it's like you speak a different language, especially as you go deeper into a specific domain like, say, monitoring and observability
(looking at you telemetry, cardinality, cloud-native...)
and also how widely the culture of oncall and incident response varies between companies. i've heard horror stories of folks whose first "page" was a midnight cell phone call from a director and of junior engineers tossed to the wolves with no training, and I myself have experimented with various ways to increase my own confidence on-call
so for part 1 I will share the good that I've seen, tried, or had friends share about on-call onboarding.
away we go!
Mock Incidents aka fire drills 🧯
at InVision, before you were cleared to join the rotation, you went through a mock incident where one co-worker reported symptoms of an outage and another played the role of customer support. This helped familiarize responders with the motions of incident response for the company: how to declare an incident and which channels to use for communication. It was also good practice leafing through our various dashboards and monitoring tools to debug.
10/10 recommend giving folks a safe place to practice the motions of incident response
Standardized Service Documentation 🪧
"docs get stale so fast it's not even worth writing them"
if I had a penny for every time I have had to defend the value of the written word, aka the durable knowledge format that has propelled our species to great heights, i'd be a very rich man
what I mean by "standardized service documentation" is that someone who is not on your team is able to understand:
- the tl;dr of what the service does
- its criticality to normal business operations
- who is currently primary on-call
- the team that owns it
- valid instructions for how to run it locally
- diagram/overview of its deployment process
- where to report issues
- what is in the service neighborhood (aka upstream and downstream dependencies)
- quicklinks to key queries, dashboards, monitors, etc.
this can live in a README, Confluence, homegrown wiki, a service catalog, wherever. What matters most is that it is consistent across teams and departments so that no one is blocked trying to figure out where team X keeps documentation.
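one way to keep this consistent is to treat the list above as a checklist you can lint. here's a minimal sketch in Python, assuming you keep one record per service somewhere machine-readable. `ServiceDoc`, `missing_sections`, and every field name are hypothetical names I made up for illustration, not part of any real service catalog.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ServiceDoc:
    """One service's documentation entry (all field names are hypothetical)."""
    summary: str = ""              # the tl;dr of what the service does
    criticality: str = ""          # how critical it is to normal business operations
    primary_oncall: str = ""       # who is currently primary on-call
    owning_team: str = ""          # the team that owns it
    run_locally: str = ""          # instructions (or a link) for running it locally
    deployment_overview: str = ""  # diagram/overview of the deployment process
    report_issues: str = ""        # where to report issues
    upstream: list = field(default_factory=list)     # services this one depends on
    downstream: list = field(default_factory=list)   # services that depend on this one
    quicklinks: dict = field(default_factory=dict)   # key queries, dashboards, monitors


def missing_sections(doc: ServiceDoc) -> list:
    """Return the names of sections that are still empty."""
    # Note: an empty dependency list is flagged too, so a service with
    # genuinely no upstreams would need an explicit exemption.
    return [f.name for f in fields(doc) if not getattr(doc, f.name)]


# Made-up example: a half-documented service.
doc = ServiceDoc(summary="resizes uploaded images", owning_team="media")
print(missing_sections(doc))
```

the point isn't the tooling, it's that every team fills in the same sections in the same place.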
On-call Log / Diary / Journal
Today I got paged 10 times and none of them were actionable. I know that mercury is in retrograde but cut me some slack! Here's hoping for a silent night so I can watch Real Housewives in peace.
jk, that is not what I mean by on-call diary. but keeping a WRITTEN RECORD of what the pager load looked like, alerts you ACK'd and investigated with actions taken, or helpdesk requests that came in through Slack is incredibly helpful for someone joining a new rotation to get a sense of the workload.
over time, if you keep this practice up, you build a rich source of historical information about the people, process, and tech involved in keeping the business lights on. if it feels like your team is mired in toil - these diaries can be a key resource in showing the ROI of automating or investing time in addressing certain pain points.
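if the diary has even a little structure, the ROI argument gets easier to make. below is a rough sketch in Python, assuming each entry records where it came from and whether it was actionable. `LogEntry`, `summarize`, and the sample entries are all hypothetical and made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LogEntry:
    when: str          # free-form timestamp
    source: str        # e.g. "page" or "slack"
    what: str          # alert name or request summary
    action_taken: str  # empty string if there was nothing to do
    actionable: bool   # did this need a human at all?


def summarize(entries):
    """Return (count per source, share of non-actionable entries, noisiest items)."""
    by_source = Counter(e.source for e in entries)
    noisy = [e.what for e in entries if not e.actionable]
    noise_ratio = len(noisy) / len(entries) if entries else 0.0
    # Non-actionable items sorted by how often they showed up.
    noisiest = Counter(noisy).most_common()
    return by_source, noise_ratio, noisiest


# Made-up shift: two noisy pages and one real helpdesk request.
shift = [
    LogEntry("mon 02:13", "page", "disk usage warning", "", False),
    LogEntry("mon 04:40", "page", "disk usage warning", "", False),
    LogEntry("mon 10:05", "slack", "access request", "granted access", True),
]
by_source, noise_ratio, noisiest = summarize(shift)
print(by_source, round(noise_ratio, 2), noisiest)
```

a spreadsheet or a shared doc works just as well. what matters is that the record exists and someone new can read it.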
Shadowing and Reverse Shadowing
ideally you will be able to both shadow before you're officially oncall and have a reverse shadow for your first shift
shadow - the gist is that you pair up with a primary on-call for a week, get added to the paging notifications, and pair on or review together any activities that arise (e.g. tuning an alert, rolling back a bad deploy, investigating a regression in prod). this is key for sharing how you use the specific tools and signals available in your system to figure out wtf is happening. this isn't something that folks can easily intuit considering the wide array of tools and philosophies in monitoring/alerting/observability today.
there are flavors of this where the shadow only receives notifications during daytime business hours
reverse shadow - for your first shift, the secondary (or your shadow/onboarding guide) also receives pages and notifications. the idea is that you are taking the lead on response but have someone experienced available for guidance when needed.
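the routing rules for both setups boil down to "who else gets the page". here's a minimal sketch in Python, assuming a 9-to-5 business day. `page_recipients` and its parameters are hypothetical. in practice you'd configure this in whatever paging tool you use rather than write it yourself.

```python
def page_recipients(primary, shadow=None, reverse_shadow=None,
                    hour=0, shadow_daytime_only=False,
                    business_hours=range(9, 17)):
    """Return everyone who should be notified for a page at the given hour (0-23)."""
    recipients = [primary]
    if shadow is not None:
        # Daytime-only flavor: the shadow is skipped outside business hours.
        if not shadow_daytime_only or hour in business_hours:
            recipients.append(shadow)
    if reverse_shadow is not None:
        # The experienced backup is notified around the clock.
        recipients.append(reverse_shadow)
    return recipients


# Shadow week, daytime-only flavor: a 3am page reaches only the primary.
print(page_recipients("primary", shadow="newbie", hour=3, shadow_daytime_only=True))
# First real shift: the newcomer leads, the reverse shadow is paged too.
print(page_recipients("newbie", reverse_shadow="veteran", hour=3))
```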
stay tuned for part 2, oncall onboarding: the bad, and part 3, oncall onboarding: the ugly