Infrastructure vs. Data vs. Machine Learning Observability

Hardik Shah
Dec 14, 2022

For any organization, it’s critical to know what happens in your infrastructure and applications. But how do you identify and resolve issues? Are you able to monitor for anomalies and failures?

In this post, we’ll walk through the differences between infrastructure observability, data observability, and machine learning observability.

Recommended: The Ultimate Guide to Three Types of Observability

Infrastructure observability

The first step in ensuring the health of your infrastructure is to know what’s happening on it.

Infrastructure observability, or simply “infrastructure monitoring,” gives you a bird’s-eye view of your entire stack and makes possible what would otherwise be a laborious process: identifying bottlenecks that are impacting performance and pinpointing their causes.

With infrastructure monitoring, every system component should report its status back to a central point, typically through a monitoring agent running on each host. For alerting on top of that data, a service like PagerDuty is a good way to get started, since it can integrate with just about any existing monitoring solution.

The ability to easily collect data from every critical piece of your infrastructure gives you complete visibility into how those pieces interact and lets you identify problems before they become critical issues affecting user experience or revenue generation.
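As a rough illustration of components reporting back to a central point, here is a minimal Python sketch. The `StatusCollector` class, the component names, and the 60-second staleness window are all hypothetical choices made for this example, not part of any particular monitoring product.

```python
from dataclasses import dataclass, field


@dataclass
class StatusCollector:
    """Hypothetical central point that keeps the latest report per component."""
    stale_after_s: float = 60.0  # assumed threshold, tune for your environment
    last_seen: dict = field(default_factory=dict)    # component -> timestamp
    last_status: dict = field(default_factory=dict)  # component -> status string

    def report(self, component: str, status: str, now: float) -> None:
        """Called by a component (or its agent) to report its current status."""
        self.last_seen[component] = now
        self.last_status[component] = status

    def unhealthy(self, now: float) -> list:
        """Components that stopped reporting or reported a non-ok status."""
        problems = []
        for name, seen in self.last_seen.items():
            if now - seen > self.stale_after_s:
                problems.append((name, "stale"))
            elif self.last_status[name] != "ok":
                problems.append((name, self.last_status[name]))
        return sorted(problems)


collector = StatusCollector(stale_after_s=60.0)
collector.report("web-1", "ok", now=0.0)         # never reports again
collector.report("db-1", "degraded", now=80.0)   # recent, but not healthy
collector.report("cache-1", "ok", now=90.0)      # recent and healthy
print(collector.unhealthy(now=100.0))
```

Treating silence as a problem in its own right is the design choice worth noting here: a component that has crashed cannot report that it is down, so the collector flags anything it has not heard from recently.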

Data observability

Monitoring data in motion. Data is constantly moving from one place to another, and it’s important to understand how this data is being used at different points in time.

Monitoring data at rest. A lot of information can be gleaned by analyzing what’s happening with your stored data, whether it’s on disk or in memory.

Monitoring how your systems use the live data they receive and process. This includes monitoring code execution as well as application performance metrics such as latency, throughput, error rates, and CPU usage.
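To make the idea of monitoring data at rest more concrete, here is a small Python sketch that checks one batch of stored records for freshness and missing values. The `check_batch` function, the `updated_at` and `price` field names, and the two-hour freshness limit are assumptions made for illustration only.

```python
from datetime import datetime, timedelta, timezone


def check_batch(rows, required_fields, max_age, now):
    """Return a list of human-readable findings for one batch of records.

    Hypothetical helper: each row is a dict with an 'updated_at' datetime.
    """
    if not rows:
        return ["batch is empty"]
    findings = []
    # Freshness: how old is the newest record in the batch?
    newest = max(row["updated_at"] for row in rows)
    if now - newest > max_age:
        findings.append(f"stale: newest record is {now - newest} old")
    # Completeness: count rows that are missing each required field.
    for field_name in required_fields:
        missing = sum(1 for row in rows if row.get(field_name) is None)
        if missing:
            findings.append(f"{field_name}: {missing}/{len(rows)} rows missing")
    return findings


now = datetime(2022, 12, 14, 12, 0, tzinfo=timezone.utc)
rows = [
    {"updated_at": now - timedelta(hours=3), "price": 9.99},
    {"updated_at": now - timedelta(hours=5), "price": None},
]
findings = check_batch(rows, ["price"], max_age=timedelta(hours=2), now=now)
for finding in findings:
    print(finding)
```

The same two checks can also be run on each batch as it arrives, which is one simple way to watch data in motion.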

Machine learning observability

Machine learning observability is the ability to monitor a model’s performance and troubleshoot issues. The term is used in various contexts, including data science, analytics, and operations management. It can be broken down into three primary areas:

  • Monitoring and alerting: How do we know if something is wrong? How can we tell if our model has a problem before it’s too late?
  • Diagnostics: What should we look for when diagnosing an issue in our models, and how much experience do we need to do it?
  • Analytics/automation: How can these problems be solved so that they don’t happen again in the future?
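The monitoring and alerting question above can be sketched in a few lines of Python. This example assumes that true labels eventually become available for each prediction. The `AccuracyMonitor` class, the baseline of 0.90, and the allowed drop of 0.05 are hypothetical values chosen for illustration.

```python
from collections import deque


class AccuracyMonitor:
    """Hypothetical monitor: alert when recent accuracy falls below a baseline."""

    def __init__(self, baseline, window=100, max_drop=0.05):
        self.baseline = baseline   # accuracy measured at validation time (assumed)
        self.max_drop = max_drop   # tolerated drop before alerting (assumed)
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong

    def record(self, predicted, actual):
        """Store whether one production prediction matched the true label."""
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self):
        """Accuracy over the most recent window, or None if nothing recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        acc = self.accuracy()
        return acc is not None and self.baseline - acc > self.max_drop


monitor = AccuracyMonitor(baseline=0.90, window=10, max_drop=0.05)
# Eight correct predictions and two wrong ones within the window.
for predicted, actual in [(1, 1)] * 8 + [(1, 0)] * 2:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.should_alert())
```

A sliding window keeps the signal focused on recent behavior, so an old stretch of good predictions cannot hide a new problem.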

Infrastructure vs. data vs. machine learning observability

If you’ve ever wondered why you have to have different tools for infrastructure, data, and machine learning observability, here’s a breakdown of what each one does:

Infrastructure observability refers to the ability to observe the state of your entire environment. This can include machines (both virtual and physical), software components, networking devices and even third-party services.

The goal is to see how all those resources are connected together so that you can detect issues before they affect users or applications.

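Data observability refers to the ability to observe the data itself: how it moves from one place to another, what is happening to it while it is stored, and how your systems use it as they process it.

Machine learning observability refers specifically to how well models behave within production environments. You want to be able to track anomalies in model behavior as quickly as possible so that you can make adjustments while the problems are still manageable.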

For example, if your model breaks down when predicting weather patterns, or when predicting user behavior on an e-commerce site at certain times of day or year, it might need more training data before it is released into production again.
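One way to spot the time-of-day problem described above is to break the error rate down by hour. The following Python sketch assumes you log each prediction together with its hour and the eventual true outcome. The function names, the record layout, and the 0.25 threshold are hypothetical.

```python
from collections import defaultdict


def error_rate_by_hour(records):
    """records: iterable of (hour_of_day, predicted, actual) tuples (assumed layout)."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for hour, predicted, actual in records:
        totals[hour] += 1
        if predicted != actual:
            errors[hour] += 1
    return {hour: errors[hour] / totals[hour] for hour in sorted(totals)}


def degraded_hours(records, threshold=0.25):
    """Hours whose error rate exceeds the (assumed) threshold."""
    rates = error_rate_by_hour(records)
    return [hour for hour, rate in rates.items() if rate > threshold]


records = [
    (9, "buy", "buy"),
    (9, "skip", "skip"),
    (21, "buy", "skip"),
    (21, "skip", "buy"),
    (21, "buy", "buy"),
]
print(degraded_hours(records))
```

An overall accuracy number can look fine while one slice of traffic performs badly, which is why slicing by a dimension such as hour of day is useful when diagnosing model issues.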

Conclusion

A key takeaway is that observability is an important part of any software system. One of the many roles it can play is to give visibility into how well our infrastructure, data, and ML systems are working.

Observability helps us understand why problems occur, which in turn helps us fix them faster, before they become big issues that affect customers.
