
on-call onboarding: the bad

In the last post, on-call onboarding "the good", I mused about 🌈 the good 🦋 practices and processes I've experimented with for onboarding both new engineers and engineers new-to-the-org.

so keeping with the whole "the good, the bad, and the ugly" thing i've got going on in this mini-series, today's post will be about the bad wrt on-call onboarding and pager life


reminder that today is about bad practices or as Seth Meyers says here with a wagging finger "Not Good."

the very first thing that comes to mind

smol rotations

if you were active on Twitter this spring you may have seen my tweet about this

tweet that says "a 3 person on-call rotation is a bigger risk to the system than relying on a single AZ the end" from defunct acct @alpacatron3000

it was a throwaway tweet reacting to another ops pal sharing the enormous burden and impact an understaffed rotation was having on their life. blended with my own experiences. in concert with the stories I hear from friends, colleagues, acquaintances.

based on anecdata, I consider it to be the biggest driver of burnout within tech.

not having on-call onboarding at all

i find it quite odd and stressful that so many companies out there just expect their engineers to hop into a rotation without any training.

to me this would be like handing a teenager car keys without having them pass a learner's permit test first.

but paigerduty, you say, we have a shadow rotation and a whole incident response guidebook

THAT IS NOT ENOUGH. ahem. a shadow rotation should be like releasing animals into the wild. they should have the skills to fend for themselves!

man looking like a history professor stroking his chin and saying "I don't understand any of this"

you've got to equip them with skills. you do not want any engineer looking at a page and thinking "I don't understand any of this".

so why is that not enough? for starters, every company i've been at has Lego'd together different monitoring components, let alone tooling. if you ever want to see true horror on a dev's face, take someone used to the walled garden of enterprise monitoring tools and show them OSS monitoring tools.

spammy alerts

this is a cutesy phrase for an abhorrent reality.

alerts, which normal humans associate with, like, horrible blaring beeps, are something we in tech have decided it's super cool to just pipe to Slack and ignore.

can you imagine actually hooking up your org's alert channels to something that created sound? shudder.

at a certain point, with a mountain of alert noise, primary on-callers kinda give up. it feels impossible.

i've genuinely suggested deleting all alerts and starting fresh kind of like filing for Chapter STFU Alert Bankruptcy. for some reason no one's taken me up on it

why do spammy alerts make for shitty on-call onboarding?

glad you asked ;)

before you normalize to the noise - you have to develop your system spidey sense to understand what alerts indicate REAL DANGER and what alerts just "go off every deploy" or whatever.
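if you want a head start on that spidey sense, here's a rough sketch (python, not from any real tool) of one way to spot the "goes off every deploy" alerts. it assumes you can export alert firings and deploy times as timestamps. the function name, the tuple format, and the 15 minute window are all made up by me for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def deploy_noise_ratio(firings, deploys, window=timedelta(minutes=15)):
    """Return {alert_name: share of its firings that landed right after a deploy}.

    firings: iterable of (alert_name, fired_at) tuples (hypothetical export format)
    deploys: iterable of deploy datetimes
    window:  how long after a deploy still counts as "deploy noise" (arbitrary guess)
    """
    deploys = list(deploys)
    total = defaultdict(int)
    near_deploy = defaultdict(int)
    for name, fired_at in firings:
        total[name] += 1
        # count the firing if it falls inside the window after any deploy
        if any(d <= fired_at <= d + window for d in deploys):
            near_deploy[name] += 1
    return {name: near_deploy[name] / total[name] for name in total}


if __name__ == "__main__":
    # toy data, purely illustrative
    deploys = [datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 2, 12, 0)]
    firings = [
        ("HighLatency", datetime(2024, 1, 1, 12, 5)),
        ("HighLatency", datetime(2024, 1, 2, 12, 3)),
        ("DiskFull", datetime(2024, 1, 1, 3, 0)),
    ]
    for name, ratio in sorted(deploy_noise_ratio(firings, deploys).items()):
        print(f"{name}: {ratio:.0%} of firings were right after a deploy")
```

an alert with a ratio near 100% is a candidate for "deploy noise", and one near 0% probably deserves a new on-caller's real attention. it's a conversation starter for the rotation, not a verdict.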

no concept of a centralized info source on services + infra

yes this could be what they call a "service catalog", but to de-mystify things, I see it as really a big directory of who is on-call for what (notice how I didn't say own!), links to high level docs and data signals, and crucially the zillion places you can find monitoring data and how to interpret it.
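to make "big directory" concrete, here's a minimal sketch of what one entry could hold, written as a python dataclass. every field name, service name, and URL below is hypothetical (the example.com links are placeholders), and the shape is just my guess at the bare minimum, not any product's schema.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceEntry:
    name: str
    on_call_rotation: str  # who gets paged for it (not who "owns" it)
    docs: list[str] = field(default_factory=list)  # links to high level docs
    # where the monitoring data lives -> a note on how to interpret it
    signals: dict[str, str] = field(default_factory=dict)


# hypothetical catalog with a single made-up service
CATALOG = {
    "checkout-api": ServiceEntry(
        name="checkout-api",
        on_call_rotation="payments-primary",
        docs=["https://wiki.example.com/checkout-api/overview"],
        signals={
            "https://dashboards.example.com/checkout-latency":
                "p99 above 2s for 5+ minutes is unusual; brief spikes on deploy are expected",
        },
    ),
}


def who_is_on_call(service_name, catalog=CATALOG):
    """Return the rotation that gets paged for a service, or None if it's not listed."""
    entry = catalog.get(service_name)
    return entry.on_call_rotation if entry else None


if __name__ == "__main__":
    print(who_is_on_call("checkout-api"))
    print(who_is_on_call("mystery-service"))
```

the point isn't the code, it's that each entry answers "who gets paged, where do I read about it, and where do I look" in one place.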

you don't need a fancy web app to do this...one of the better ones I've seen was brilliantly designed in genuinely my fave Atlassian product, Confluence! If you're not up on the Page Properties Macro then welcome to the light my friend

Want a high level table view of all of something? In this case it's PNW Fiber Seller profiles. look how nice and structured the info is....I wonder if it's using a template...

Confluence screenshot of an info table PNW Fiber Sellers listing Title, Fibers, Location

clicking into one and we can see a nice lil table on each PNW Fiber Seller page, how handy. Wonder what's going on under the hood...

Page for Skyline Alpacas showing a table with Fibers, Location, Store link 

aha! Our new BFF, the Page Properties macro!

editing the Skyline Alpacas page shows that table is in the Page Properties macro

tying it all together on that main TOC page is the Page Properties Report.

if this blew your mind...then oh gosh I'll just have to become a Confluencer aka confluence influencer. just lmk.

no holiday compensation

so why holiday and not anything else? well for one I'm American and can only dream of getting EU worker protections. secondly it's what I've seen be achievable in practice and tbh the spot bonus did make me feel valued for my shift. it recognized that this is an additional burden to spend your holiday holding ze pager.

no training on monitoring tool

look at the difference in airplane cockpits and monitoring "dashboard" for a smol plane vs a big jet

they're different aircraft and require different levels of visibility for the same general thing (flying in the air)

Cockpit Detail - SplitShire
pilot at the wheel of a small plane with a car dashboard size holding the instruments, gauges and operations panel 
Cockpit of a Boeing 747 Jumbo jet in a maintenance hanger, Stock Photo ...
allegedly the view of the cockpit in a boeing 747. there are approx 2x the number of gauges and controls, and two wheels.

Would you expect a smol plane pilot to confidently walk in and fly that big jet?

tbh leaving the walled garden of Fancy Enterprise Monitoring TM tooling and diving into the OSS world can have as much of a UX shock as above (and vice versa).

and if there's any tool to have sharp and at the ready ... I am obvs biased but would say whatever your org uses for monitoring/observability.


so there we have it...a non-exhaustive list of "the baaaad" of on-call onboarding. in the final post i'll wrap up with "the ugly" and if you missed part 1 "the good" <- click that handy lil' link

CAT TAX

walked into the living room to find Norman paw deep holding open a knitting pattern book with a startled look, realizing how similar knitting and coding are