
What’s the biggest unsolved problem within Site Reliability Engineering?

What’s the biggest unsolved problem within Site Reliability Engineering? Paige's take.
SRECon Americas banner, SF, March 18-20 2024

a curious SRE pokes their head at a to-be-solved labyrinth

Get ready to mull this one over before submitting your SRECon attendee survey (open until April 3rd! check your inbox!)

What a weighty question!

I don't have access to the form anymore to get my specific answer, but for me it boils down to, surprise surprise, the existential risks we face if we don't collectively share knowledge across generations and cultivate meaningful learning experiences.

What do I mean by that?

Well, I, and many in my SRE generation, were "born in the cloud" and thrust into managing critical systems with massive complexity at scale. We weren't there to see how that complexity built up over time and evolved from 3-tier web app architectures to the Death Star monstrosities shown in slides. We weren't there to rack and stack and fawn over beautifully managed cables. We haven't had to deal first-hand with the idea that adding server capacity isn't a button-click or a TF apply away...but could be a 6-month-long process. A NOC is as foreign a concept to me as a mainframe.

What does it mean for our profession and the ability to pass the baton from one generation to the next in a world where our occupational experiences differ so widely?

How do mental models of infra and systems differ between those who have that hands-on data center experience and those without it, for whom servers are abstract?

This isn't to say we (cloud baby SREs) are dodos who can't understand that the cloud is just someone else's computer we're renting, or who can't get up to speed on managing on-prem infra...or that everyone should have to follow the path from datacenter -> sysadmin -> SRE. It's also not acceptable or sustainable or accessible, in my opinion, to continue to expect people to set up home labs and experiment themselves. We don't ask veterinarians to practice surgery in their free time at home, and I don't think we should expect that from future SREs.

The biggest unsolved problem, in my opinion, is how to collectively pass on the valuable lessons learned and perspectives from ye olde SREs to the next generation and beyond when we have such different contexts and relationships to technology.
^ This is why I'm SO STOKED about the work Courtney Nash and co are doing with the database of incidents over at The VOID, as well as Daria Barteneva's talk on The Art of SRE and the legendary Teaching SRE from Mikey Dickerson.

In the medium term, while we have a nice overlap between the pioneers who built the internet, protocols, tools, and tech that are now critical load-bearing pieces of society AND the newer generations of "digital natives" (or whatever they're called) joining SRE, we have a window of opportunity to actively bridge the gaps in understanding across generations, and I think it’s time to take it.



Norman, a sleek black cat, contrasted against a tomato red carpet. He's got a shocked deer-in-the-headlights expression, as if he just realized that the number of original COBOL developers is dwindling and all he knows is Scratch.