BigData, Machine-learning, AI & Analytics | Conference | 50 min
Did My Deploy Degrade Production?
The speaker built a production anomaly-detection system for slowly rising CPU usage using simple statistical methods instead of off-the-shelf ML. The talk covers algorithm evolution, edge cases, and real customer regressions detected since 2020, plus lessons on why technically successful tools may still fail adoption.
Ines Panker
A few years ago, I got tired of the same story playing out: the CPU usage creeps up for months, but nobody notices, until... one day it crosses some arbitrary threshold that we set ages ago. However, by then, it's easier to just spin up another instance than to figure out what's been slowly breaking things for the last 3 months.
So I built a system to detect these changes automatically. I'm not a data scientist (I don't even have an ML degree), which meant my first attempts were... not great. I threw some off-the-shelf ML libraries at the problem, hoping they'd just work. Which, of course, they didn't.
I had no choice but to actually sit down, understand what production time series data really looks like, and then reach for simpler statistical tools. It turns out they work better than anything off-the-shelf, once you figure out how to apply them.
I'll walk through the evolution of the algorithm, the edge cases I uncovered, and the tricks I came up with along the way. The system has been running in production since 2020, detecting regressions on real customer data. It worked great, spotted real degradations, and yet... it kinda never got many fans. So, it will soon be archived.
You'll leave with a practical understanding of how to build anomaly detection without needing a statistics PhD, and maybe some thoughts on why technically successful systems sometimes fail to get adopted.
Ines Panker
Ines Panker is a software engineer with almost two decades of experience writing code, shaping architecture, and leading teams. These days Python is her weapon of choice.
She is particularly interested in the human side of software: how technical decisions and human dynamics influence each other. That interest is what led her to the stage in the first place.
When she's not simplifying unwieldy codebases, she reads poetry.