
on-call onboarding: the good

"fun" fact: I learned this week that on-call is a hyphenated word. oddly enough, autocorrect hasn't figured out that "oncall" and "on-call" are related

google docs marking "oncall" as an unknown word and offering to add it to my personal dictionary

this got me thinking about how the longer you stay in tech, the more it's like you speak a different language, especially as you go deeper into a specific domain like, say, monitoring and observability

(looking at you, telemetry, cardinality, cloud-native...)

and also how widely the culture of on-call and incident response varies between companies. i've heard horror stories of folks whose first "page" was a midnight cell phone call from a director, and of junior engineers tossed to the wolves with no training. i myself have experimented with various ways to increase my own confidence on-call

so for part 1 I will share the good that I've seen, tried, or had friends share wrt on-call onboarding.

away we go!

Mock Incidents aka fire drills 🧯

at InVision, before you were cleared to join the rotation you went through a mock incident where one co-worker reported symptoms of an outage and another played the role of customer support. This helped familiarize responders with the motions of incident response at the company: how to declare an incident and which channels to use for communication. It was also good practice leafing through our various dashboards and monitoring tools to debug.

10/10 recommend giving folks a safe place to practice the motions of incident response

Standardized Service Documentation 🪧

"docs get stale so fast it's not even worth writing them"

....

Gina Linetti from Brooklyn 99 pausing from coffee consumption to roll her eyes and stick her tongue out

if I had a penny for every time I have had to defend the value of the written word, aka the durable knowledge format that has propelled our species to great heights, i'd be a very rich man

what I mean by "standardized service documentation" is docs that let someone who is not on your team understand:

  • the tl;dr of what the service does
  • its criticality to normal business operations
  • who is currently primary on-call
  • the team that owns it
  • valid instructions for how to run it locally
  • diagram/overview of its deployment process
  • where to report issues
  • what is in the service neighborhood (aka upstream and downstream dependencies)
  • quicklinks to key queries, dashboards, monitors, etc.

this can live in a README, Confluence, homegrown wiki, a service catalog, wherever. What matters most is that it is consistent across teams and departments so that no one is blocked trying to figure out where team X keeps documentation.
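
one way to keep this consistent is to treat the checklist above as a schema and flag the sections a service hasn't filled in yet. below is a minimal sketch of that idea, not a real service catalog format: the `ServiceDoc` class, its field names, and `missing_sections` are all made up for illustration. it assumes `None` means "nobody has documented this" while an empty list means "documented, and there are none".

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Optional


@dataclass
class ServiceDoc:
    """Hypothetical schema: one field per item in the checklist above."""
    summary: str = ""                             # the tl;dr of what the service does
    criticality: str = ""                         # e.g. "tier 1" (labels are an assumption)
    primary_on_call: str = ""                     # who is currently primary on-call
    owning_team: str = ""
    local_run_instructions: str = ""
    deploy_overview: str = ""                     # link to a diagram or overview
    issue_tracker: str = ""                       # where to report issues
    upstream: Optional[List[str]] = None          # None = not documented, [] = none
    downstream: Optional[List[str]] = None
    quicklinks: Optional[Dict[str, str]] = None   # name -> URL of dashboards, monitors


def missing_sections(doc: ServiceDoc) -> List[str]:
    """Return the names of sections that nobody has filled in yet."""
    missing = []
    for f in fields(doc):
        value = getattr(doc, f.name)
        # Blank strings and None count as missing; empty lists/dicts do not.
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(f.name)
    return missing
```

running something like `missing_sections` across every service gives you a quick list of gaps to chase, wherever the docs actually live.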

On-call Log / Diary / Journal

Dear diary,

Today I got paged 10 times and none of them were actionable. I know that mercury is in retrograde but cut me some slack! Here's hoping for a silent night so I can watch Real Housewives in peace.

xoxo,
paigerduty

jk, that is not what I mean by an on-call diary. but keeping a WRITTEN RECORD of what the pager load looked like, the alerts you ACK'd and investigated (with actions taken), and the helpdesk requests that came in through Slack is incredibly helpful for someone joining a new rotation to get a sense of the workload.

over time, if you keep this practice up, you build a rich source of historical information about the people, process, and tech involved in keeping the business lights on. if it feels like your team is mired in toil, these diaries can be a key resource in showing the ROI of automating or investing time in addressing certain pain points.
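
to make the "showing the ROI" part concrete, here is a rough sketch of what a structured diary entry and a shift summary could look like. everything here is an assumption for illustration (the `LogEntry` fields, the `"page"` / `"helpdesk"` kinds, the `summarize` helper); a shared doc or spreadsheet works just as well as long as the same details get written down.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class LogEntry:
    """One line in a hypothetical on-call diary."""
    when: datetime
    kind: str               # "page" or "helpdesk" (assumed labels)
    alert: str              # alert name or a short summary of the request
    actionable: bool        # did it need a human to do something?
    action_taken: str = ""  # what you actually did, if anything


def summarize(entries: List[LogEntry]) -> dict:
    """Roll a shift's entries up into the numbers a newcomer would ask about."""
    pages = [e for e in entries if e.kind == "page"]
    # Count non-actionable pages per alert to surface the noisiest ones.
    noisy = Counter(e.alert for e in pages if not e.actionable)
    return {
        "pages": len(pages),
        "non_actionable_pages": sum(noisy.values()),
        "helpdesk_requests": sum(1 for e in entries if e.kind == "helpdesk"),
        "noisiest_alerts": noisy.most_common(3),
    }
```

a few weeks of summaries like this make it much easier to argue that a specific alert needs tuning or that helpdesk load deserves automation.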

Shadowing and Reverse Shadowing

two shadow figures in long coats kick dancing in sync 


ideally you will be able to both shadow before you're officially on-call and have a reverse shadow for your first shift

shadow - the gist is that you pair up with a primary on-call for a week, get added to the paging notifications, and pair on (or review together) any activities that arise (e.g. tuning an alert, rolling back a bad deploy, investigating a regression in prod). this is key for sharing how you use the specific tools and signals available in your system to figure out wtf is happening. this isn't something that folks can easily intuit, considering the wide array of monitoring/alerting/observability tooling today and the different philosophies behind it.

there are flavors of this where the shadow only receives notifications during daytime business hours

reverse shadow - for your first shift, the secondary (or your shadow/onboarding guide) also receives pages and notifications. the idea is that you are taking the lead on response but have someone experienced available for guidance when needed.
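
if it helps to see the two modes side by side, here is a tiny sketch of who gets notified in each. this is not how any particular paging tool is configured; `page_recipients`, its parameters, and the 9-to-5 weekday definition of business hours are all assumptions made up for the example.

```python
from datetime import datetime, time
from typing import List, Optional, Tuple


def page_recipients(
    now: datetime,
    primary: str,
    shadow: Optional[str] = None,
    reverse_shadow: Optional[str] = None,
    shadow_daytime_only: bool = False,
    business_hours: Tuple[time, time] = (time(9), time(17)),
) -> List[str]:
    """Return everyone who should be notified for a page arriving at `now`."""
    recipients = [primary]  # the primary always leads the response

    if shadow is not None:
        start, end = business_hours
        in_hours = now.weekday() < 5 and start <= now.time() < end
        # A daytime-only shadow is skipped outside weekday business hours.
        if not shadow_daytime_only or in_hours:
            recipients.append(shadow)

    if reverse_shadow is not None:
        # The experienced backup gets every page during a first shift.
        recipients.append(reverse_shadow)

    return recipients
```

the difference is mostly about who leads: when shadowing, the newcomer is the extra recipient; when reverse shadowing, the newcomer is `primary` and the experienced person is the extra recipient.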


stay tuned for part 2, on-call onboarding: the bad, and part 3, on-call onboarding: the ugly