Understanding System Observability


  • View profile for Shubham Srivastava

    Principal Data Engineer @ Amazon | Data Engineering

    60,729 followers

Imagine you’re a data engineer. It’s 3 AM on a Friday. You’re home, asleep, but back in the office, your data pipeline is busy. And tonight, a bug sneaks into production: a tiny change, a single wrong script. Nobody notices at first (everyone is away for the weekend). Suddenly, fake transactions start landing in your main tables. Customer data gets mixed up. Dashboards shift, and nobody knows why.

    Years ago, this would have been a nightmare. By Monday morning, you’d be scrambling to guess what happened and where the mess began. But tonight is different, because every step your data takes is recorded. Your system has data lineage. It’s like having security cameras for your entire pipeline: every row knows where it came from, every script leaves a footprint, and every transformation is logged.

    So when you wake up and check the dashboard, you see the story:
    ↬ What script ran
    ↬ When it started
    ↬ Which tables it touched
    ↬ Where the wrong values spread

    You hit rewind, isolate the problem, and fix only what needs fixing. No mass panic, no engineers searching endlessly. You can get answers even at 3 AM! This is the power of data lineage and observability. That’s how you sleep well as a data engineer. That’s how you build pipelines you can trust.

    P.S.: Did you learn something new with this post? Would you want more posts like this?

  • View profile for Arpit Bhayani
    274,050 followers

Most systems detect node or master failures using simple polling, and while this approach sounds straightforward, it has an interesting reliability issue.

    The typical approach is to observe a node directly. This usually means pinging it, checking if a port is open, or running a lightweight query to confirm it is alive. On paper, this seems fine, but all of these methods share the same weakness: what if the observer itself is wrong?

    In a distributed setup, network glitches are normal. Temporary packet loss, routing hiccups, or partial network partitions can easily make a healthy node appear unreachable to the observer. The usual way to deal with this is to retry multiple times and declare failure after the n-th consecutive failure.

    This creates a classic tradeoff. If n is small (or polling happens frequently), failure detection becomes fast, but false positives increase. A short-lived network blip can trigger an unnecessary failover, which can sometimes be more disruptive than the original issue. If n is large (or polling intervals are longer), false positives decrease, but real failures take longer to detect. That delay directly increases downtime.

    But there is a more reliable way to think about this problem when you already have a cluster of nodes available. Instead of relying on a single observer repeatedly polling a target node, you can allow multiple nodes in the cluster to independently perform health checks. The system then treats a node as failed only when a majority of observers agree that the node is unreachable.

    This consensus-based approach reduces the risk of false positives caused by network partitioning. Even if one observer loses connectivity, the rest of the cluster can still provide an accurate view of system health. Consensus is costly, so this approach is not the most cost-efficient. However, it can be very useful if your system is large enough and distributed across multiple geographies.
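The majority-vote idea above can be sketched in a few lines. This is a hedged illustration, not any particular system's implementation; the probe and vote functions are hypothetical:

```python
import socket

# Sketch of quorum-based failure detection: several observers probe the
# target independently, and the node is declared down only when a strict
# majority of them saw it as unreachable. Function names are illustrative.

def probe(host, port, timeout=1.0):
    """One observer's health check: can we open a TCP connection?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def node_is_down(observations):
    """observations: one boolean per observer (True = saw node alive)."""
    down_votes = sum(1 for alive in observations if not alive)
    return down_votes > len(observations) // 2

# Two of three observers lost the node: declare failure.
print(node_is_down([False, False, True]))  # True
```

Note that a tie is not a majority, so with an even number of observers the sketch errs on the side of keeping the node alive, which matches the goal of avoiding false-positive failovers.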

  • View profile for Dr. Brindha Jeyaraman

    Founder & CEO, Aethryx | Fractional Leader in Enterprise AI Engineering, Ops & Governance | Doctorate in Temporal Knowledge Graphs | Architecting Production-Grade AI | Ex-Google, MAS, A*STAR | Top 50 Asia Women in Tech

    18,079 followers

If your agent runs for 10 minutes, you need to know what happened at minute 3. High-performing teams don’t just log outputs. They trace steps.

    For long-running agents, you need:
    🔍 Step-level execution logs
    🧠 Intermediate reasoning checkpoints
    🛠 Tool invocation metadata
    📊 Token consumption visibility
    ⏱ Latency per action

    Without tracing:
    1. You can’t debug hallucinations.
    2. You can’t explain decisions.
    3. You can’t detect drift.
    4. You can’t prove compliance.

    Observability turns agents from magic into machinery. If your only metric is “final output quality,” you’re blind to systemic fragility. Would you ship a distributed system without tracing? Then why ship agents without it? #AIEngineering #Observability #AIOps #AgentSystems #Tracing #ProductionAI #SystemReliability #ModelMonitoring #LLMOps #EnterpriseAI
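A rough sketch of what step-level execution logging can look like; the trace schema (step name, latency, output preview) is illustrative and not tied to any particular agent framework:

```python
import time

# Minimal sketch of step-level execution logging for a long-running agent.
# Each step is timed and recorded in order, so you can answer "what
# happened at minute 3" after the fact.

class AgentTracer:
    def __init__(self):
        self.steps = []  # one record per agent step, in execution order

    def record(self, name, fn, *args, **kwargs):
        """Run one agent step and log its latency and an output preview."""
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.steps.append({
            "step": name,
            "latency_s": round(time.perf_counter() - start, 4),
            "output_preview": str(result)[:80],
        })
        return result

tracer = AgentTracer()
answer = tracer.record("plan", lambda q: f"steps for {q}", "summarize inbox")
```

In a real agent you would emit these records to a tracing backend instead of an in-memory list, but the shape of the data is the point: every step gets a name, a duration, and enough output to debug from.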

  • View profile for Ricardo Castro

    Director of Engineering | Tech Speaker & Writer. Opinions are my own.

    11,730 followers

Another SRE anti-pattern stems from not having adequate observability: the practice of understanding how systems behave by collecting and analyzing data from various sources. Without it, SREs and engineering teams are essentially flying blind, making it difficult to identify, diagnose, and resolve issues effectively. Some of the problems and consequences of inadequate observability:

    - Increased Mean Time to Detection (MTTD): With inadequate observability, it takes longer to detect issues in your system. This can lead to increased downtime and negatively impact user experience.
    - Increased Mean Time to Resolution (MTTR): Once you detect a problem, troubleshooting becomes more challenging without proper observability tools and data. This results in longer downtime and more significant disruptions.
    - Difficulty in Root Cause Analysis: Without comprehensive data on system performance, it's hard to pinpoint the root causes of incidents. This can lead to "fixing symptoms" rather than addressing underlying issues, leading to recurring problems.
    - Inefficient Capacity Planning: Inadequate observability can hinder your ability to monitor resource utilization and plan for scaling. This may result in overprovisioning or underprovisioning resources, both of which can be costly.
    - Limited Understanding of User Behavior: Observability isn't just about monitoring system internals; it also includes understanding user interactions. Without this knowledge, it's challenging to optimize your system for user needs and preferences.

    What are some of the practices and tools that SREs can use?

    - Logging: Implement structured logging and ensure that logs are collected, centralized, and easily searchable. Use logging tools like Elasticsearch, Fluentd, or Loki.
    - Metrics: Define relevant metrics for your system and collect them using tools like Prometheus or InfluxDB.
    - Distributed Tracing: Implement distributed tracing to track requests as they traverse various services. Tools like Jaeger and OpenTelemetry can help you gain insights into service dependencies and latency issues.
    - Event Tracking: Capture important events and errors in your system, using message brokers like Kafka or RabbitMQ to transport them.
    - Monitoring and Alerting: Set up monitoring and alerting systems that can notify you of critical issues in real time. Tools like Grafana or Prometheus help in this regard.
    - Anomaly Detection: Consider implementing anomaly detection techniques to automatically identify unusual behavior in your system.
    - User Analytics: Collect data on user behavior and interactions to better understand user needs and improve the user experience.

    By investing in observability, teams can proactively identify and address issues, improve system reliability, and provide a better overall user experience. It's a fundamental aspect of SRE principles and practices.
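The structured-logging recommendation above can be illustrated with a minimal JSON formatter. This is a sketch of the idea, not a production setup; the `service` field is an invented example of structured context that log backends like Elasticsearch or Loki can index and search:

```python
import json
import logging
import sys

# Minimal sketch of structured (JSON) logging: one JSON object per log
# line, so every field becomes searchable in the log backend.

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "service": getattr(record, "service", "unknown"),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits: {"level": "INFO", "logger": "checkout", "message": "payment authorized", "service": "payments"}
logger.info("payment authorized", extra={"service": "payments"})
```

In practice you would reach for a logging library with this built in, but the principle is the same: log key-value structure, not free-form strings.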

  • View profile for Gurumoorthy Raghupathy

    Expert in Solutions and Services Delivery | SME in Architecture, DevOps, SRE, Service Engineering | 5X AWS, GCP Certs | Mentor

    14,071 followers

🚀 Building Observable Infrastructure: Why Automation + Instrumentation = Production Excellence and Customer Success

    After building our platform's infrastructure and application automation pipeline, I wanted to share why combining Infrastructure as Code with deep observability isn't optional. It's foundational, as shown in the screenshots of our implementation on Google Cloud.

    The Challenge: Manual infrastructure provisioning and application onboarding create consistency gaps, slow deployments, and zero visibility into what's actually happening in production. When something breaks at 3 AM, you're debugging blind.

    The Solution: Modular Terraform + OpenTelemetry from Day One. Our approach centered on three principles:

    1️⃣ Modular, well-architected Terraform modules as reusable building blocks. Each service (Argo CD, Rollouts, Sonar, Tempo) gets its own module. This means:
    1. Consistent deployment patterns across environments
    2. Version-controlled infrastructure state
    3. Self-service onboarding for dev teams

    2️⃣ OpenTelemetry instrumentation of every application during onboarding, as a minimum specification. This allows capturing:
    1. Distributed traces across our apps / services / nodes (graph)
    2. Golden signals (latency, traffic, errors, saturation)
    3. Custom business metrics that matter

    3️⃣ Single Pane of Glass Observability. Our Grafana dashboards aggregate everything: service health, trace data, build pipelines, resource utilization. When an alert fires, we have context immediately, not 50 tabs of different tools.

    Real Impact:
    → Application onboarding dropped from days to hours
    → Mean time to resolution decreased by 60%+ (actual trace data > guessing)
    → Infrastructure drift: eliminated through automated state management
    → Dev teams can self-service without waiting on platform engineering

    Key Learnings:
    → Modular Terraform requires discipline up front but pays dividends at scale.
    → Keep OpenTelemetry context propagation consistent across your stack.
    → Dashboards should tell a story: organise them by user journey.
    → Automation without observability is just faster failure. You need both.

    The Technical Stack:
    → Terraform for infrastructure provisioning
    → ArgoCD for GitOps-based deployments
    → OpenTelemetry for distributed tracing and metrics
    → Tempo for trace storage
    → Grafana for unified visualisation

    The screenshot shows our command center:
    → Active services
    → Full trace visibility
    → Automated deployments with comprehensive health monitoring

    Bottom line: Modern platform engineering isn't about choosing between automation OR observability. It's about building systems where both are inherent to the architecture. When infrastructure is code and telemetry is built-in, you get reliability, velocity, and visibility in one package.

    Curious how others are approaching this: what does your observability strategy look like in automated environments? #DevOps #PlatformEngineering #Observability #InfrastructureAsCode #OpenTelemetry #SRE #CloudNative
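The golden signals captured in point 2️⃣ can be derived from raw request records. Here is a rough sketch of that computation; the record fields (`latency_ms`, `status`) and the CPU numbers are invented for illustration:

```python
from statistics import quantiles

# Sketch: derive the four golden signals (latency, traffic, errors,
# saturation) from a batch of request records observed in one window.

def golden_signals(requests, window_s, cpu_used, cpu_total):
    latencies = sorted(r["latency_ms"] for r in requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    return {
        "latency_p95_ms": quantiles(latencies, n=20)[-1],  # 95th percentile
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": cpu_used / cpu_total,
    }

# 99 fast successes plus one slow 503, over a 10-second window.
reqs = [{"latency_ms": i, "status": 200} for i in range(1, 100)]
reqs.append({"latency_ms": 500, "status": 503})
signals = golden_signals(reqs, window_s=10, cpu_used=2, cpu_total=8)
```

In a real stack, OpenTelemetry metrics and a backend like Prometheus would do this aggregation for you; the sketch only shows what the four signals mean in terms of the underlying data.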

  • View profile for Santiago Valdarrama

    Computer scientist and writer. I teach hard-core Machine Learning at ml.school.

    121,741 followers

If you can't see what an agent does, you can't improve it, you can't debug it, and you can't trust it.

    It's crazy how many teams are building agents with no way to understand what they're doing. Literally ZERO observability. This is probably one of the first questions I ask every new team I meet: can you show me the traces of a few executions of your agents? Nada. Zero. Zilch.

    Large language models make bad decisions all the time. Agents fail, and you won't realize it until somebody complains. At a minimum, every agent you build should produce traces showing the full request flow, latency analysis, and system-level performance metrics. This alone will surface 80% of operational issues. But ideally, you can do something much better and capture all of the following:

    • Model interactions
    • Token usage
    • Timing and performance metadata
    • Event execution

    If you want reliable agents, observability is not optional.
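One minimal way to start capturing model interactions and token usage is to route every model call through a wrapper. This is a sketch only; `fake_llm` and its response shape are hypothetical stand-ins for whatever SDK you actually use:

```python
import time

# Sketch: wrap every model call so it appends a trace record with timing
# and token usage. In production the records would go to a tracing
# backend, not an in-memory list.

TRACES = []

def traced_call(model_fn, prompt):
    start = time.perf_counter()
    response = model_fn(prompt)
    TRACES.append({
        "prompt": prompt,
        "output": response["text"],
        "prompt_tokens": response["usage"]["prompt_tokens"],
        "completion_tokens": response["usage"]["completion_tokens"],
        "latency_s": time.perf_counter() - start,
    })
    return response["text"]

def fake_llm(prompt):  # hypothetical client standing in for a real SDK
    return {"text": "ok", "usage": {"prompt_tokens": len(prompt.split()), "completion_tokens": 1}}

traced_call(fake_llm, "hello agent")
```

Even this crude wrapper answers the question "show me the traces of a few executions": every call leaves a record of what went in, what came out, what it cost, and how long it took.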

  • View profile for Milan Jovanović

    Practical .NET and Software Architecture Tips | Microsoft MVP

    274,461 followers

Observability in .NET doesn’t have to be expensive. I'll show you how to build a production-ready observability setup using OpenTelemetry and Grafana Cloud. There's a generous free tier:
    - 10k active series
    - 50GB logs
    - 50GB traces

    You’ll learn how to:
    - Set up OpenTelemetry in a real .NET microservices app
    - Capture metrics, traces, and logs
    - Connect everything to Grafana Cloud with minimal config
    - Trace database queries, HTTP calls, and message queues
    - View it all in a unified Grafana dashboard

    This is a complete walkthrough, from NuGet packages to visualizing real traces. Watch it here: https://lnkd.in/eZautsNa

  • View profile for Paul Iusztin

    Senior AI Engineer • Founder @ Decoding AI • Author @ LLM Engineer’s Handbook ~ I ship AI products and teach you about the process.

    95,984 followers

LLM systems don’t fail silently. They fail invisibly. No trace, no metrics, no alerts, just wrong answers and confused users. That’s why we architected a complete observability pipeline in the Second Brain AI Assistant course. Powered by Opik from Comet, it covers two key layers:

    𝟭. 𝗣𝗿𝗼𝗺𝗽𝘁 𝗠𝗼𝗻𝗶𝘁𝗼𝗿𝗶𝗻𝗴
    → Tracks full prompt traces (inputs, outputs, system prompts, latencies)
    → Visualizes chain execution flows and step-level timing
    → Captures metadata like model IDs, retrieval config, prompt templates, token count, and costs

    Latency metrics like:
    - Time to First Token (TTFT)
    - Tokens per Second (TPS)
    - Total response time
    ...are logged and analyzed across stages (pre-gen, gen, post-gen). So when your agent misbehaves, you can see exactly where and why.

    𝟮. 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 𝗳𝗼𝗿 𝗔𝗴𝗲𝗻𝘁𝗶𝗰 𝗥𝗔𝗚
    → Runs automated tests on the agent’s responses
    → Uses LLM judges + custom heuristics (hallucination, relevance, structure)
    → Works offline (during dev) and post-deployment (on real prod samples)
    → Fully CI/CD-ready with performance alerts and eval dashboards

    It’s like integration testing, but for your RAG + agent stack. The best part?
    → You can compare multiple versions side-by-side
    → Run scheduled eval jobs on live data
    → Catch quality regressions before your users do

    This is Lesson 6 of the course (and it might be the most important one). Because if your system can’t measure itself, it can’t improve. 🔗 Full breakdown here: https://lnkd.in/dA465E_J
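The latency metrics the post lists (TTFT, TPS, total response time) fall out directly from token-arrival timestamps. A small sketch with invented numbers, independent of any monitoring tool:

```python
# Sketch: compute Time to First Token, Tokens per Second, and total
# response time from the timestamps at which streamed tokens arrived.

def streaming_metrics(request_start, token_timestamps):
    ttft = token_timestamps[0] - request_start          # time to first token
    total = token_timestamps[-1] - request_start        # total response time
    generation_time = token_timestamps[-1] - token_timestamps[0]
    # Tokens generated after the first, divided by the time spent generating them.
    tps = (len(token_timestamps) - 1) / generation_time if generation_time > 0 else 0.0
    return {"ttft_s": ttft, "tps": tps, "total_s": total}

# Request sent at t=0; five tokens arrive between t=0.4s and t=0.6s.
m = streaming_metrics(0.0, [0.4, 0.45, 0.5, 0.55, 0.6])
```

Splitting the metrics this way is what makes the stage analysis (pre-gen, gen, post-gen) possible: a large TTFT with a healthy TPS points at retrieval or queueing, not at the model's decoding speed.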

  • View profile for Nicola Sabatelli 🟦

    Consulting Director @Accenture | AI Transformation | Data & AI Platforms | Cloud | Unified Commerce | ERP | Digital Transformation | Process Intelligence

    18,859 followers

🚨 Observability is no longer about monitoring systems. It’s about steering the enterprise. 🚨

    The latest IDC MarketScape makes a powerful statement: observability platforms are becoming the foundation of digital decision-making at scale. And yet, the data reveals a paradox:
    • 100% of organizations share observability data
    • 43% still struggle with collaboration
    • 37% struggle with scaling
    • 33% struggle with integration

    Observability should not just answer: “What failed?” It must answer: “What matters to the business right now?”

    The organizations outperforming their peers are:
    ✅ Linking telemetry to SLOs and revenue KPIs
    ✅ Embedding AI into workflows (not just dashboards)
    ✅ Defining ownership for ingestion, retention, and governance
    ✅ Treating observability vendors as strategic partners, not tool providers

    The executive question is no longer: “Which tool should we consolidate?” It’s:
    👉 Does our observability posture merely help us recover from incidents, or does it help us shape outcomes?

    Curious to hear from CIOs, CTOs, and engineering leaders: are you building an observability toolset… or an observability practice?

    #Observability #AIOps #EnterpriseArchitecture #CIO #DigitalTransformation #ITLeadership #AI #PlatformStrategy Datadog Dynatrace Splunk Oracle Elastic New Relic ServiceNow IBM Grafana Labs BMC Software Sumo Logic LogicMonitor HPE Broadcom PIC

    Source: IDC MarketScape
