my entrypoint to the tech industry was a monitoring company.
little did i know back then how lucky i was to "grow up" in an org where engineers had above-average operations expertise and knew how to monitor all of this containerized software.
i remember getting shown a blank query bar with a lead engineer excitedly prompting me to query the system. in that moment he saw the query bar as a treasure chest waiting to be opened and i saw the query bar as a blank void.
after that experience i dug into the docs and figured out what tripped me up at the event query bar. and then i wrote up my learnings in the document below.
without further ado pls enjoy my circa 2017 take on "events vs metrics":
When reviewing data in your monitoring/observability tool it is critical to understand the difference between event and metric data.
Knowing which data type you are working with determines how you query, visualize, interpret and even alert on the findings.
An event is a record of a single "action", associated with attributes that provide dimensionality and can be used to group, filter, and query the data further.
Events are not aggregated, since each one is a separate record, but they can be sampled when throughput is high enough.
Events are useful when you need to filter data by different attributes like:
- request ID
- response headers
- AWS region
We can ask questions like these and so much more:
"What userIDs got 500 errors when making requests to Service X during the last hour?"
"What is the throughput broken out by day of the week for Service Y?"
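To make the two questions above concrete, here is a minimal sketch that treats events as plain records and filters them in memory. The field names (`timestamp`, `service`, `userId`, `status`), the sample data and both helper functions are assumptions made up for illustration, not the query language of any particular tool.

```python
from collections import Counter
from datetime import datetime, timedelta

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical events: one dict per request, each carrying its own attributes.
NOW = datetime(2017, 6, 5, 12, 0)  # fixed "now" (a Monday) so the example is repeatable
EVENTS = [
    {"timestamp": NOW - timedelta(minutes=5), "service": "service-x", "userId": "u1", "status": 500},
    {"timestamp": NOW - timedelta(minutes=20), "service": "service-x", "userId": "u2", "status": 200},
    {"timestamp": NOW - timedelta(minutes=30), "service": "service-x", "userId": "u3", "status": 500},
    {"timestamp": NOW - timedelta(hours=3), "service": "service-x", "userId": "u4", "status": 500},
    {"timestamp": NOW - timedelta(days=1), "service": "service-y", "userId": "u1", "status": 200},
    {"timestamp": NOW - timedelta(minutes=10), "service": "service-y", "userId": "u2", "status": 200},
]


def users_with_status(events, service, status, since):
    """Return the sorted userIds that got `status` from `service` at or after `since`."""
    return sorted({
        e["userId"]
        for e in events
        if e["service"] == service and e["status"] == status and e["timestamp"] >= since
    })


def throughput_by_weekday(events, service):
    """Count events for `service`, grouped by day-of-week name."""
    return Counter(DAYS[e["timestamp"].weekday()] for e in events if e["service"] == service)


# "What userIDs got 500 errors when making requests to Service X during the last hour?"
errors_last_hour = users_with_status(EVENTS, "service-x", 500, NOW - timedelta(hours=1))
# "What is the throughput broken out by day of the week for Service Y?"
weekday_counts = throughput_by_weekday(EVENTS, "service-y")

print(errors_last_hour)
print(dict(weekday_counts))
```

Both answers come from the same raw records. Nothing had to be decided in advance about which attribute to group or filter by.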
Metrics are data points consisting of a name, an associated numeric value and segment of time.
Metrics are especially useful for host/infra/datastore visibility such as tracking:
- CPU utilization
- Disk space
- Memory utilization
They can answer questions like:
"Is there a memory leak?"
"How full is the disk?"
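As a rough sketch of that shape, a metric data point can be modeled as a name, one numeric value and the time segment it covers. The metric names, the one-minute segments and the "steadily rising" check below are assumptions for illustration. A value that climbs every segment is only a crude hint of a leak, not a diagnosis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricPoint:
    name: str          # hypothetical metric name, e.g. "memoryUsedPercent"
    value: float       # the single numeric value for the segment
    start_minute: int  # start of the one-minute time segment


# Hypothetical data points for one host.
POINTS = [
    MetricPoint("memoryUsedPercent", 41.0, 0),
    MetricPoint("memoryUsedPercent", 47.5, 1),
    MetricPoint("memoryUsedPercent", 53.0, 2),
    MetricPoint("memoryUsedPercent", 60.5, 3),
    MetricPoint("diskUsedPercent", 70.0, 0),
    MetricPoint("diskUsedPercent", 70.0, 1),
    MetricPoint("diskUsedPercent", 71.0, 2),
    MetricPoint("diskUsedPercent", 72.0, 3),
]


def series(points, name):
    """Return the values for one metric name, ordered by time segment."""
    selected = sorted((p for p in points if p.name == name), key=lambda p: p.start_minute)
    return [p.value for p in selected]


def latest(points, name):
    """Answer "how full is the disk?" style questions: the most recent value."""
    return series(points, name)[-1]


def is_steadily_rising(points, name):
    """Crude leak hint: every segment's value is higher than the one before it."""
    values = series(points, name)
    return len(values) > 1 and all(a < b for a, b in zip(values, values[1:]))


print(latest(POINTS, "diskUsedPercent"))
print(is_steadily_rising(POINTS, "memoryUsedPercent"))
```

Notice that there is nothing here to filter on beyond the metric name and the time segment.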
To look at response time between two services for the past hour, you could use metrics or events.
A metric could be named something like httpResponseTime, with a single value representing how long it took to receive a response from another service.
With metrics we can look at the average, min, max and throughput of response time for a given hour, but we cannot query or filter them much further. That's as far as the data goes: 10 ms, or however long it took. Depending on the attributes added, you could filter to only look at Tier 1 services or only those in a certain availability zone.
If you wanted to get, say, a view of the p99 response time for a given service... that'd be another metric.
With events we can query every response time recorded during that hour and compute the p99, the p95, and more!!!!
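A small sketch of why the percentile needs the raw values: once an hour of response times has been rolled up into count, sum, min and max, the average is still recoverable but the p99 is not. The sample response times and the nearest-rank percentile method are assumptions chosen for illustration. Real tools may use a different percentile definition or an approximation.

```python
import math

# Hypothetical response times (ms) for one hour, one value per event.
RESPONSE_TIMES_MS = [12, 9, 15, 11, 10, 250, 13, 8, 14, 10]


def summarize(values):
    """Roll raw values up into the kind of aggregate a metric typically keeps."""
    return {"count": len(values), "sum": sum(values), "min": min(values), "max": max(values)}


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


summary = summarize(RESPONSE_TIMES_MS)
average = summary["sum"] / summary["count"]  # still derivable from the aggregate

print(summary)
print(average)
print(percentile(RESPONSE_TIMES_MS, 50))  # needs the raw values
print(percentile(RESPONSE_TIMES_MS, 99))  # needs the raw values
```

Given only `summary`, there is no way to tell whether one request or half of them were slow, which is why a p99 on the metric side has to be computed up front and stored as its own metric.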
Metrics are limited and can't be queried in the same way events are since they are likely missing the attributes that provide filter-ability due to the cost and resources to store them.
huh! interesting stuff, I definitely seemed to grok that metrics != events and find some limitations with metric data. But I'm not convinced I 100% understood things here. because obvs you can add attributes to metrics and get some of the filterability (e.g. response time for service X in Y region) but afaict that gets really expensive quick.
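a back-of-the-envelope sketch of the "gets really expensive quick" part: if every distinct combination of attribute values on a metric is stored as its own time series, the worst-case number of series is the product of the number of values each attribute can take. the attribute names and counts below are made up, and the product is an upper bound that assumes every combination actually shows up.

```python
from math import prod

# Hypothetical attributes on an httpResponseTime metric and how many
# distinct values each one can take.
ATTRIBUTE_CARDINALITY = {"service": 50, "region": 10, "statusCode": 5}


def max_series_count(cardinalities):
    """Worst-case number of time series: one per combination of attribute values."""
    return prod(cardinalities.values())


baseline = max_series_count(ATTRIBUTE_CARDINALITY)

# Adding one high-cardinality attribute multiplies the total.
with_user_id = max_series_count({**ATTRIBUTE_CARDINALITY, "userId": 100_000})

print(baseline)
print(with_user_id)
```

so a few low-cardinality attributes like service and region are manageable, but something like a user ID multiplies everything that came before it.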
stay tuned for my updated version in the coming week :)