Observability has been around for a long time, but I first heard about it around three years ago, when cloud-native software development was becoming mainstream and I was examining the challenges facing dev teams. It very rapidly became a buzzword in the IT world, often used by marketeers (guilty). The term is common now but not always understood, which I think is not helped by various vendors redefining it to suit their needs, so this article aims to explain it in a way everyone can understand.
“Observability is the measure of how well you can determine something is performing, by looking at it from the outside” (Tim Shackleton, 12/04/2023)
“Observability helps you determine behaviours in your software that are otherwise unexpected by observing its outputs” (Tim Shackleton, 12/04/2023)
That might sound counterintuitive. I mean, it’s hard to know what’s going on inside if you don’t look inside. That’s the point though: observability isn’t about how healthy something is, it’s about how well you can figure that out without peeling back the covers.
As IT has infiltrated all parts of our lives and business, it has become critical to the running of life, the economy, the government and the world. The software and devices we use have become mission critical, and while traditional disaster recovery (DR), such as replicated systems and backups, is still in play, it’s always reactive and often involves downtime, sometimes lengthy periods of it. We’re at a point now where seconds of downtime can cause significant problems, sometimes life-threatening ones.
Understanding what is going on in real time has become a critical part of continuous development.
Using an Example We Can All Understand
All of us either drive a vehicle or have relied on vehicles at some point in our lives. It’s something we can all relate to.
Let’s imagine we’re looking at a car. Cars are complicated, with lots of complex engineering and software. We can see the paintwork is shiny and buffed, looking super cool and enabling us to see there are no dents or scratches. Great, bodywork = tick.
Bodywork doesn’t make the car go though. The engine, the fluid lines, the clutch, the brakes, the electrics are all hidden. So our observability is very low. We’ve got no idea how the engine is doing or any of the electrics or other moving parts.
So what if we installed a bonnet release and hinged the bonnet? Then we could see into the engine compartment and look for rust and other types of corrosion. Adding a dipstick enables us to see whether there is enough oil for the engine. Marking level lines on the power steering, brake fluid, antifreeze and screen wash reservoirs enables us to see whether they are too low, too full, or empty.
Now we know whether the fluids are good, so our observability has improved: we have more readings to understand the health of the car without taking it apart.
Turning the ignition tells us if the car starts, but that is a by-product of normal operation, not a deliberate instrument for measuring wellbeing. Here is where observability really takes off. If we build sensors into various parts of the car, along with the circuits and software to interpret the measurements, then the dashboard instruments will tell us more about the working order of the car. They will tell us whether the oil is critically low, if the tyres are too soft, whether the fuel is low, if the engine is too hot, or whether there is an electrical fault.
This improves our observability yet again, giving us more outputs (instruments) to indicate how well the car is working.
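To make the dashboard idea a little more concrete, here is a minimal sketch in Python of how sensor readings could be turned into warnings. The sensor names, the limits and the `dashboard_warnings` function are all made up for illustration; real vehicles use manufacturer-specific values and far more sophisticated electronics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """Acceptable range for one sensor reading (illustrative values only)."""
    low: float
    high: float
    unit: str


# Hypothetical sensors and limits, assumed purely for this sketch.
LIMITS = {
    "oil_level_pct": Limit(20.0, 100.0, "%"),
    "tyre_pressure_psi": Limit(28.0, 36.0, "psi"),
    "fuel_level_pct": Limit(10.0, 100.0, "%"),
    "engine_temp_c": Limit(0.0, 110.0, "C"),
}


def dashboard_warnings(readings: dict[str, float]) -> list[str]:
    """Return one warning per reading that is missing or outside its range."""
    warnings = []
    for name, limit in LIMITS.items():
        value = readings.get(name)
        if value is None:
            # No sensor means no output, so we cannot say anything about this part.
            warnings.append(f"{name}: no reading (sensor missing or faulty)")
        elif value < limit.low:
            warnings.append(f"{name}: {value} {limit.unit} is below {limit.low} {limit.unit}")
        elif value > limit.high:
            warnings.append(f"{name}: {value} {limit.unit} is above {limit.high} {limit.unit}")
    return warnings


if __name__ == "__main__":
    readings = {"oil_level_pct": 12.0, "tyre_pressure_psi": 31.0, "engine_temp_c": 118.0}
    for line in dashboard_warnings(readings):
        print(line)
```

Notice that the missing fuel sensor is reported as "no reading" rather than as healthy. Without the sensor there is simply nothing to observe.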
Finally, we can install an interface port so engineers/mechanics can plug in laptops with sophisticated diagnostics software to analyse the working order of the car, meaning they can understand how the car is running without taking it apart.
I could go on, but you get the point. There are lots of things on a vehicle that are engineered specifically to tell us how well it is performing. The more there are, the better the observability, and the more we know about how something we can’t see is working.
Observability in Action – Formula 1
One of the greatest examples of extremely high levels of observability is Formula 1. There are up to 300 sensors on an F1 car that generate roughly 3 TB of data per car, per race. Moreover, the software running the car is configured to emit logs, traces and metrics continuously. The race team, the pit wall and the factory all have fantastic software and systems to collect and analyse this “telemetry” in real time. This enables the team to manage problems as they happen, or advise the driver how to adjust their driving, rather than pulling the car into the pit for a check or, at worst, retiring from the race. They can also use that data for post-race analysis to improve the performance of car, driver and team for the next race.
Why Is Observability Important?
There are a thousand things on an F1 car all working together to speed the driver along. Great observability is absolutely imperative, as the difference between starting first or fifth can be a tenth, or even a hundredth, of a second. Knowing only the temperature of the engine isn’t going to find you that marginal time difference; granular information (greater observability) from all over the car will.
The point is that getting the telemetry from the car means the team can spend more time on track racing and less time in the pits or factory with the car in pieces, trying to understand what they need to do to make it go faster.
Knowledge is power. Knowing what’s going on with your business-critical systems in exceptional detail can be the difference between leading the market with innovations or following in the pack. It’s not just about catching issues, but about improving what’s there. Observability is a critical part of ongoing software design and management.
Observability vs Monitoring
“Monitoring” is historically related to alerting on infrastructure health with some insight into common applications that sit on top. Observability is an evolution centred around understanding application behaviours as a whole from end to end. Apps don’t always behave as originally intended. Observability is as much about catching abnormal or unexpected behaviours as it is about alerting on a fault.
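One narrow way to picture the difference is to compare a fixed alert with a check against recent behaviour. The sketch below is an assumption-laden illustration, not how any particular product works: the function names, the 1000 ms limit and the z-score cut-off of 3 are all invented for the example.

```python
from statistics import mean, stdev


def threshold_alert(latency_ms: float, limit_ms: float = 1000.0) -> bool:
    """Monitoring-style check: alert only when a known limit is crossed."""
    return latency_ms > limit_ms


def unusual(latency_ms: float, history: list[float], z_limit: float = 3.0) -> bool:
    """Flag a value that is far from recent behaviour, even if it is under the limit."""
    if len(history) < 2:
        return False  # not enough data to know what normal looks like
    spread = stdev(history)
    if spread == 0:
        return latency_ms != history[0]
    return abs(latency_ms - mean(history)) / spread > z_limit


if __name__ == "__main__":
    recent = [100.0, 102.0, 98.0, 101.0, 99.0]  # hypothetical response times in ms
    print(threshold_alert(400.0))   # nothing has "failed" by the fixed limit
    print(unusual(400.0, recent))   # but this is not how the app normally behaves
```

A 400 ms response would never trip the fixed alert, yet it is four times slower than anything in the recent history. That is the kind of unexpected behaviour that is worth catching before it turns into a fault.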
So How Can I Get Great Observability?
The answer is twofold. It’s not just about the tools you use to collect, interpret, display and alert on data. If your system isn’t configured or coded to provide the right logs, traces, and metrics (the pillars of observability), then the best and most expensive tooling in the world won’t help you. Think of it like sensors on a car: without the sensors, there’s nothing to read. Conversely, the more sensors you have, the more you’ll understand about how your car is running.
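As a rough illustration of what "coded to provide logs, traces and metrics" can look like, here is a minimal sketch using only the Python standard library. The `handle_order` function, the `checkout` logger name, the metric names and the in-memory `metrics` and `spans` stores are all hypothetical. A real system would more likely use a dedicated instrumentation library and ship the data to a backend instead of keeping it in memory.

```python
import logging
import time
import uuid
from collections import Counter
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("checkout")  # hypothetical service name

metrics = Counter()  # metric name -> count, kept in memory for this sketch
spans = []           # finished spans, kept in memory for this sketch


@contextmanager
def span(name, trace_id):
    """Time a unit of work and record it as a span belonging to trace_id."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        spans.append({"trace_id": trace_id, "name": name, "duration_ms": duration_ms})


def handle_order(order_id, amount):
    """Hypothetical request handler that emits a log line, metrics and a trace."""
    trace_id = uuid.uuid4().hex  # ties every span and log line to one request
    metrics["orders_received"] += 1
    with span("handle_order", trace_id):
        with span("validate", trace_id):
            valid = amount > 0
        if not valid:
            metrics["orders_rejected"] += 1
            log.warning("order rejected order_id=%s amount=%s trace_id=%s",
                        order_id, amount, trace_id)
            return False
        with span("charge", trace_id):
            time.sleep(0.01)  # stand-in for a call to a payment provider
        log.info("order accepted order_id=%s trace_id=%s", order_id, trace_id)
    return True
```

Calling `handle_order("A1", 25.0)` and then `handle_order("A2", -5.0)` leaves two counts in `metrics`, two log lines, and a set of timed spans grouped by trace ID. Those are the sensors. The tooling you choose then has something to collect, display and alert on.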