When opening up a monitoring and observability tool, some people see a convoluted UI and terribly confusing product names. It is no wonder many developers use these platforms out of fear (and occasional spite), mostly for reactive "oh shit I just got paged" workflows.
I see a treasure chest full of exciting insights, ready to be mined from the telemetry with a few well-formed queries and an afternoon to explore.
Enter the production scavenger hunt, a game I just made up to emulate my own production exploration sessions and spark your interest in your own system.
4 simple rules
- pair up, ideally across teams/depts. this exercise is way more fun, informative, and valuable when done in community
- you must use available telemetry data to back up your answers, no vibes, no "I just know OK?!"
- aim for quality > quantity. if a certain finding surprises you or inspires another question, follow that instead of trying to check off the entire list. feel free to skip or modify questions that are irrelevant for your system
- stay curious and open to the many surprises production has in store for you
Ready, Set, Query!
- What is the time window users are most active for your set of services?
- Any surprising traffic patterns crop up? Over what periodicity?
- Has a given service's CI/CD pipeline gotten faster, slower, or stayed the same since January?
- What was the most reported error during the peak of user traffic yesterday?
- What service is emitting the most metrics? The least?
- Where is all the telemetry going from one of your services? 1 tool? 2 tools? Bonus points: Do you have logins for them all?
- Which engineer has had the most on-call shifts over the past 18 months? Which engineer has been interrupted after business hours the most?
- Of the services you are on-call for, what was the max throughput during the last week? Is that normal and expected?
- What OS & version is running in pre-production environments vs. production?
- Can you find evidence of a memory leak anywhere in the system?
- How many production deploys in total happened today? Last week? How many were rolled back?
- What is the ratio of incidents triggered by config changes vs code changes vs infrastructure snafus/changes?
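To make the first question concrete, here is a minimal sketch of how you might find the busiest hour of the day. It assumes you can export request timestamps from your tool as ISO 8601 strings; the function name `busiest_hour` and the sample data are made up for illustration, and your platform's own query language will almost certainly do this faster.

```python
from collections import Counter
from datetime import datetime


def busiest_hour(timestamps):
    """Return (hour_of_day, request_count) for the hour with the most requests."""
    # Bucket every request by its hour of day (0-23), ignoring the date.
    hours = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    hour, count = hours.most_common(1)[0]
    return hour, count


# Hypothetical sample; in practice this would be an export from your telemetry backend.
sample = [
    "2024-03-04T09:15:00",
    "2024-03-04T14:02:11",
    "2024-03-04T14:30:45",
    "2024-03-04T14:59:59",
    "2024-03-05T09:01:00",
]

print(busiest_hour(sample))  # (14, 3)
```

Bucketing by hour of day is only one choice. Swap in the weekday or a 15-minute bucket and you are already halfway to the periodicity question.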
This is merely a springboard, something to get you proactively engaging with your own production data and getting familiar with your org's observability platform of choice.
n.b. expect some of these questions to be hard to answer satisfactorily! I personally prefer having access to raw data and being able to query away to my heart's content, but if all you have is a vendor's prescriptive and limiting analytics view, hey, start there.
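If you can get at raw samples, even a few lines of Python go a long way. Here is a rough sketch for the memory leak question, assuming you have exported memory usage as (hours since start, megabytes) pairs. The function name `slope_per_hour` and the numbers are hypothetical, and a steady upward slope is only a hint worth chasing, not proof of a leak (caches warm up, traffic grows).

```python
def slope_per_hour(samples):
    """Least-squares slope of memory usage over time, in MB per hour.

    samples: list of (hours_since_start, memory_mb) pairs with at least
    two distinct time values.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    den = sum((x - mean_x) ** 2 for x, _ in samples)
    return num / den


# Hypothetical exports: one process that keeps growing, one that stays flat.
leaky = [(0, 500), (1, 520), (2, 540), (3, 560)]
steady = [(0, 500), (1, 500), (2, 500)]

print(slope_per_hour(leaky))   # 20.0
print(slope_per_hour(steady))  # 0.0
```

Run it per service or per host between deploys, then go look at whichever one climbs the fastest.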
Best of luck! Let me know how it goes and what questions would make great add-ons.