99.99% of Your Traces Are Trash - p99

Distributed tracing is still finding its footing in many organizations, and one challenge to overcome is data volume: keeping 100% of your traces is expensive and unnecessary. Enter sampling. Head or tail, how do you decide? Let’s review the tradeoffs of different sampling strategies and how they can be mixed and matched.
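To make the head vs tail distinction concrete, here is a minimal sketch in plain Python rather than any particular tracing SDK. The function names, the span dictionary fields (`duration_ms`, `error`) and the thresholds are all assumptions made for illustration. Head sampling decides up front from the trace ID alone, tail sampling decides after the trace has finished, and `keep_trace` shows one possible way to mix the two.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at the start of a trace, using only the trace ID."""
    # Hash the ID so every service reaches the same decision for the same trace.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


def tail_sample(spans: list[dict], slow_ms: float = 1000.0) -> bool:
    """Decide after the trace completes, using what actually happened."""
    has_error = any(span.get("error", False) for span in spans)
    is_slow = any(span.get("duration_ms", 0) >= slow_ms for span in spans)
    return has_error or is_slow


def keep_trace(trace_id: str, spans: list[dict], baseline_rate: float = 0.01) -> bool:
    """One possible mix: keep every interesting trace plus a small baseline."""
    return tail_sample(spans) or head_sample(trace_id, baseline_rate)
```

The tradeoff shows up in the signatures: `head_sample` is cheap because it needs nothing but the ID, while `tail_sample` needs every span of the trace buffered somewhere before it can decide.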

Power Up With Podman

Curious about containers? There’s a new generation of containers on the scene, Podman! Supporting secure, rootless containers for Kubernetes microservices, it was designed and built with the cloud in mind. Benefitting from the lessons learned out in the open from Docker, this next generation of containers will quickly become a trusted daily driver in your dev workflow.

This talk covers what you need to know as an end user, from the UI to the backend. Sharing a real-world use case that leverages Podman for open source observability workshops, Paige will show how Podman and the adorable seal mascots Caitlín, Maighréad and Róisín have transformed her local development!

Pushing O11y Uphill: The Single PAIN of Glass

For years, tech companies have chased the fabled “single pane of glass”, the one observability tool to understand your system from north to south and east to west. Leafing through promo materials promising instant insights and seamless turnkey integrations, you’d think increasing system observability is as easy as assembling a Lego set.

In my experience, chasing the “single pane of glass” translates to “pain in the ass”. Survey data supports this, revealing that the majority of engineers cite tool sprawl as a minor or non-existent problem despite relying on several tools. As alluring as the siren call of the “single pane of glass” is, let's be practical and examine how to best observe systems across a myriad of tools.

Tracing Adventures From PR to Production

Many things can go awry on the journey from pull request (PR) open to merge to production deployment. Issues can arise from the application code, layers of YAML configuration, the underlying infrastructure or the pipeline logic itself. How can distributed tracing and trace-derived metrics bring developers and operators together for troubleshooting paradise? I’ll unpack a deploy gone bad from both vantage points, building empathy for the engineer who needs to deploy their changes and for the ops engineer who is responsible for keeping the system up and running. With signals from OpenTelemetry, I will show how increasing the observability of your deploy system can facilitate better collaboration and quicker troubleshooting.

There's No Place Like Production - Shift Conf

There’s a reason “I test in prod” isn’t a cheeky take but a lived reality: there is no place like production. Not local dev, not staging, not any other environment. This became clear when I deployed a tiny config change that passed all checks, reviews and pre-production environments, yet still triggered a SEV-1.

Examining each step of the journey from PR to production, I uncovered the snafu that had occurred (unsurprisingly, it relates to overwriting key blocks in nested YAML files).
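The abstract does not spell out how the blocks were overwritten, so the following is only a sketch of one common way it happens: a top-level merge of two parsed config files, shown here with plain dictionaries standing in for parsed YAML. The keys `service`, `replicas` and `timeout_s` are hypothetical.

```python
def shallow_merge(base: dict, override: dict) -> dict:
    """Top-level merge: a nested block in override replaces the whole block."""
    return {**base, **override}


def deep_merge(base: dict, override: dict) -> dict:
    """Recursive merge: nested blocks are combined key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical config: the override only intends to change the timeout.
base = {"service": {"replicas": 3, "timeout_s": 30}}
override = {"service": {"timeout_s": 5}}

print(shallow_merge(base, override))  # {'service': {'timeout_s': 5}}
print(deep_merge(base, override))     # {'service': {'replicas': 3, 'timeout_s': 5}}
```

With the shallow merge, the tiny change to `timeout_s` silently drops `replicas`, which is the kind of surprise that can pass review because the diff itself looks harmless.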

I’ll share how difficult it was to reconstruct the chain of events in the system compared to the ideal case of a highly observable system, and how to share your own incident learnings, since we all test in prod!

Intro to Threat Modeling in the Cloud - Open Source Summit NA

Threat modeling is a key part of securing your cloud computing environment and is unique to each organization's people, practices and processes. Without a threat model you could end up going on wild goose chases and miss the forest for the trees. This talk will introduce the concept of threat modeling, including how to get started and which guiding questions to ask, and will recap findings from the Argo Project Threat Model from 2021. Learn how and why to start threat modeling today!

Avoiding Alert Fatigue and Burnout - WTFisSRE

The signal-to-noise ratio can be frustratingly low for on-call alerts and downright mystifying when joining a new company or rotation. If you treat PagerDuty like the boy who cried wolf, you’re certainly mired in monitoring debt. Monitoring debt is tragically common and, with investment, can be mitigated! Just how much of a risk are false positives? The data shows that a high number of false alerts trains workers to assume most alerts will be false. Repetition of the same alerts causes even greater alert fatigue. For clinicians, the likelihood of acknowledging an alert dropped 30% for each reminder. Paige will share the nuts and bolts of how to start auditing your own alerts to begin your journey towards an on-call oasis.

Cognitive Apprenticeship in Action with Alert Triage Hour of Power - SRECon23 Americas

Cognitive apprenticeship is the philosophy that it is more effective to learn in context and in real-world situations than by following a tutorial in a sandboxed environment. In a nutshell, it is “learning-through-guided-experience” and shines when teaching the problem-solving processes experts use to handle complex tasks like, say…investigating an alert by spelunking in production observability and monitoring data. Learn how Alert Triage Hour of Power became a can’t-miss meeting of camaraderie and system surprises!

Taming Feral DevOps - DevOpsDaysLA

The misunderstanding that devops is a tool or team or role has tragically increased the gulf between developers and operators. If you have been unofficially on-call 24/7, toiled away in turmoil, managed initiatives across multiple silos sans project manager with a directive to “influence without authority”...surprise, you have lived the contradiction that is feral devops!

The consequences of letting this fester are burnout, learned helplessness and missing out on the speed that happens when developers and operators communicate openly. Taming feral devops is a journey you can start by gathering feedback on the status quo, iterating on onboarding and training, and holistically examining on-call health. Whether you’re an individual contributor, manager or leader, you will find ideas to experiment with to “domesticate” devops in your organization.

30 Interviews Later...

Lightning talk reflecting on the highs and lows of a summer of 30 SRE interviews.

Best Vends: The How and Why to Befriend Your Vendors

If you are on the “buy” side of build versus buy this talk is for you! By buying a vendor’s product you are inviting them to be a part of your team and your platform - why not make the most out of your investment? From participating in betas to simply reading vendor emails I will give you a dazzling array of activities you can do today to start on the path of becoming best vends.

Microservices Means Macrocommunication

The tl;dr is “how microservices can better facilitate communication between team members and from team to team”.

Adopting microservices means a shift in both the architecture of our system and the way we communicate. Letting your upstream and downstream dependencies know of major changes is not just a courtesy but a requirement for keeping things moving.

Communicating early and often takes shape in strategies like:
- versioning
- informative deprecation warnings
- temporarily providing backwards compatibility
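As a small illustration of the last two strategies, here is a minimal sketch using Python's standard `warnings` module. The function names `fetch_user` and `fetch_user_v2` and the returned fields are hypothetical, invented only for this example. The old entry point keeps working for existing consumers while telling them exactly what to move to.

```python
import warnings


def fetch_user_v2(user_id: int) -> dict:
    """Current version of the hypothetical handler."""
    return {"id": user_id, "schema_version": 2}


def fetch_user(user_id: int) -> dict:
    """Deprecated alias kept temporarily so existing consumers keep working."""
    warnings.warn(
        "fetch_user() is deprecated; call fetch_user_v2() instead.",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this wrapper
    )
    return fetch_user_v2(user_id)
```

An informative warning names the replacement, so a consumer who sees it in their logs knows what to do without having to track down the owning team.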

I’ll take the approach of “do this not that” and look at realistic examples such as:

  • During active development, a team picks up your work and suddenly becomes a dependency - how do you gracefully communicate constant changes?
  • Or it's endpoint deprecation day and you still have active consumers - how do you provide the tools they need to migrate to the new stuff?
  • Or the nuts and bolts of why and how keeping documentation up to date will add to your daily productivity.

Disposable Infrastructure: You Only Build Me Up To Tear Me Down

I’ll start with my definition of disposable infrastructure:

Automating the process of provisioning, configuring, deploying, and tearing down cloud infrastructure and services.

Essentially 1 script to go from 0 to live application and from live application back to 0 again.

Then I’ll talk through and show the code for each major step the program runs through:

  • Reading in the configuration file describing the desired infrastructure and setup (1 service per instance)
  • Generating a plan and invoking Terraform to provision EC2 instances, security groups, load balancers, and a database
  • Generating a playbook and running Ansible to configure each instance according to what service is running on it
  • Packaging the deployed environment
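The talk's actual program is not reproduced here. As a rough sketch of the "1 script from 0 to live and back to 0" idea, the following assumes the Terraform configuration, Ansible inventory and playbook have already been generated; the directory and file names are hypothetical. It only builds the standard `terraform` and `ansible-playbook` command lines, and defaults to a dry run so nothing is provisioned by accident.

```python
import subprocess


def build_commands(workdir: str, inventory: str, playbook: str) -> dict[str, list[list[str]]]:
    """Return the command sequences for bringing the environment up and down."""
    up = [
        ["terraform", f"-chdir={workdir}", "init"],
        ["terraform", f"-chdir={workdir}", "apply", "-auto-approve"],
        ["ansible-playbook", "-i", inventory, playbook],
    ]
    down = [
        ["terraform", f"-chdir={workdir}", "destroy", "-auto-approve"],
    ]
    return {"up": up, "down": down}


def run(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command, or execute it and stop at the first failure."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


# Hypothetical paths, assumed to be produced by the earlier generation steps.
commands = build_commands("infra", "inventory.ini", "site.yml")
run(commands["up"])    # dry run: prints the provisioning and configuration steps
run(commands["down"])  # dry run: prints the teardown step
```

Keeping "up" and "down" in the same structure is the point of the design: teardown is a first-class path rather than an afterthought.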