May 9, 2023

Production Scavenger Hunt

Between bespoke query languages, confusing product names, and friction-y UX, it is no wonder many developers open up their observability tools out of spite. Enter the Production Scavenger Hunt, an attempt to emulate my approach to exploring production!

a group of nondescript beings sit in a circle, contemplating a giant red question mark in the middle. very similar to someone opening up an enterprise monitoring tool after a lifetime of working within grafana/prometheus

When opening up a monitoring and observability tool, some people see a convoluted UI and terribly confusing product names. It is no wonder many developers use these platforms out of fear (and occasional spite), mostly for reactive "oh shit I just got paged" workflows.

I see a treasure chest full of exciting insights, ready to be mined from the telemetry with a few well-formed queries and an afternoon to explore.

Enter the production scavenger hunt, a game I just made up to emulate my own production exploration sessions and spark your interest in your own system.

4 simple rules

pair up, ideally across teams/depts. this exercise is way more fun, informative, and valuable when done in community
you must use available telemetry data to back up your answers, no vibes, no "I just know OK?!"
aim for quality > quantity. if a certain finding surprises you or inspires another question, follow that instead of trying to check off the entire list. feel free to skip or modify questions that are irrelevant for your system
stay curious and open to the many surprises production has in store for you

Ready, Set, Query!

During what time window are users most active across your set of services?
Any surprising traffic patterns crop up? Over what periodicity?
Has a given service's CI/CD pipeline gotten faster, slower, or stayed the same since January?
What was the most reported error during the peak of user traffic yesterday?
What service is emitting the most metrics? The least?
Where is all the telemetry going from one of your services? 1 tool? 2 tools? Bonus points: Do you have logins for them all?
Which engineer has had the most on-call shifts over the past 18 months? Which engineer has been interrupted after business hours the most?
Of the services you are on-call for, what was the max throughput during the last week? Is that normal and expected?
What OS & version is running in pre-production environments vs. production?
Can you find evidence of a memory leak anywhere in the system?
How many production deploys in total happened today? Last week? How many were rolled back?
What is the ratio of incidents triggered by config changes vs code changes vs infrastructure snafus/changes?
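
If your tool only gives you a raw export, the first question can still be answered by hand. Here is a minimal Python sketch, assuming you can pull request timestamps out of your logs or traces. The `busiest_hour` helper and the sample data are hypothetical, made up purely for illustration; your platform's own query language may well do this more directly.

```python
from collections import Counter
from datetime import datetime, timezone


def busiest_hour(timestamps):
    """Return (hour_of_day_utc, request_count) for the busiest hour, or None if empty.

    Assumes `timestamps` is an iterable of timezone-aware datetimes,
    one per request. Ties go to the earliest hour of the day.
    """
    # Bucket every request by its UTC hour of day (0-23).
    counts = Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)
    if not counts:
        return None
    # Highest count wins; on a tie, prefer the smaller hour.
    return max(counts.items(), key=lambda kv: (kv[1], -kv[0]))


# Hypothetical sample data; in practice this comes from your telemetry export.
sample = [
    datetime(2023, 5, 8, 9, 15, tzinfo=timezone.utc),
    datetime(2023, 5, 8, 14, 2, tzinfo=timezone.utc),
    datetime(2023, 5, 8, 14, 30, tzinfo=timezone.utc),
    datetime(2023, 5, 8, 14, 59, tzinfo=timezone.utc),
    datetime(2023, 5, 8, 20, 45, tzinfo=timezone.utc),
]

print(busiest_hour(sample))  # (14, 3)
```

Bucketing by hour of day is only one choice. Swap in day of week or 15-minute windows if the first pass surprises you, which is kind of the point.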

This is merely a springboard, something to get you proactively engaging with your own production data and get familiar with your org's observability platform of choice.

n.b. prepare to run into challenges in satisfactorily answering these questions! I personally prefer having access to raw data and being able to query away to my heart's content, but if all you have is a vendor's prescriptive and limiting analytics view, hey, start there.

Best of luck and let me know how it goes and what questions would make great add-ons.

girl from Moonrise Kingdom, holding binoculars and blowing a kiss out the window

CAT TAX

dapper grey and white tabby cat wearing a bow, gold name tag and round glasses peering at a laptop intently hunting for bugs

-ttfn paigerduty
