越来越多的人认为,AI 驱动的可观测性系统将很快削弱,甚至取代站点可靠性工程师(Site Reliability Engineer,SRE)的职责。这种观点相当激进——在 ClickHouse,我们也很好奇,这一愿景到底离现实还有多远。
为此,我们从 SRE 的诸多职责中选择了一项具体任务——根本原因分析(Root Cause Analysis,RCA),并设计了一个实验,来验证现有 AI 模型在独立完成这项任务时的表现。
继续阅读,了解我们是如何构建并执行这个实验的,更重要的是,我们从中得到了哪些有价值的结论。
总结:自主完成根本原因分析(RCA)的能力尚未成熟。我们评估发现,利用大语言模型(Large Language Models,LLM)以更低成本、更高效率定位生产问题的设想尚未实现,即使是 GPT-5 表现也未能优于其他模型。
当前更适合的做法是将大语言模型用于协助调查过程、总结调查结果、撰写状态更新以及提出后续建议,而核心判断仍应由工程师通过高效、可搜索的可观测性系统掌控。
实验:用大语言模型(LLM)通过简单提示做根因分析(RCA)
这个实验本身并不复杂。我们为模型提供了一个正在运行的应用的可观测性数据,并用一个非常基础的提示词,让模型尝试识别用户报告异常的根本原因。
参与验证的模型
为了进行实验,我们首先需要选出参与的模型。我们挑选的模型中既有广为人知的选手,也有相对小众但颇具潜力的选择:
Claude Sonnet 4:Claude 以其结构化的推理能力和细致的回答风格著称。它擅长按部就班地分析问题,在处理跨系统的复杂故障时尤其有帮助。
OpenAI o3:OpenAI 推出的推理模型,擅长多步推理和工具调用,适合需要深入分析的排查任务。
OpenAI GPT-4.1:在通用推理和语言理解方面依然表现强劲,但并非专门的推理模型。我们将其作为对照基线。
Gemini 2.5 Pro:这是 Google 最新推出的 Pro 版本,已深度整合入其生态系统。它在多步推理方面表现优秀,在代码分析和故障排查任务中尤为突出。
我们并没有测试市面上所有可用模型——这在现实中几乎不可能做到——而是选择了我们可以方便访问,并且看起来适合这项任务的模型。我们的目标不是评选冠军,而是了解这些模型在面对真实事故数据时,各自能发挥出什么样的表现。
异常情况
我们需要一份来自存在异常问题的运行中应用的可观测性数据。为此,我们选择运行 OpenTelemetry 演示应用(https://opentelemetry.io/docs/demo/),用它生成包含不同类型异常的数据集。
正如下方架构图所示,OpenTelemetry 演示应用本身具有一定复杂性,包含多个服务组件、一个用户可交互的前端页面以及一个负载生成器。整体架构在很大程度上贴近真实的生产环境。此外,该应用还预置了一些可以通过 功能开关(https://opentelemetry.io/docs/demo/feature-flags/) 启用的异常场景,便于模拟不同类型的问题。
我们基于这个应用构建了三个新的数据集,每个数据集都聚焦于一个特定的异常,并覆盖约 1 小时的运行数据。第四个测试则使用了我们已有的 ClickStack 公共演示(https://play-clickstack.clickhouse.com/) 数据集,时长为 48 小时。
这四个数据集分别是:通过功能开关注入的支付服务故障、推荐缓存泄漏、商品目录故障三个异常数据集(各约 1 小时),以及直接取自 ClickStack 公共演示环境、时长 48 小时的第四个数据集。
我们的数据采集流程如下:
首先在 Kubernetes 集群中部署 OpenTelemetry 演示应用,并通过 ClickStack 进行 数据采集配置(https://clickhouse.com/docs/use-cases/observability/clickstack/ingesting-data/overview)。具体操作可以参考这个 GitHub 仓库(https://github.com/ClickHouse/opentelemetry-demo) 中的说明文档。
应用启动并开始将遥测数据写入 ClickHouse 后,我们将系统负载提升至 1000 名并发用户。当用户数量稳定后,开启目标异常对应的功能开关,并持续运行约 40 分钟。随后关闭该开关,再运行 10 分钟进行收尾。最终,每个数据集包含约 1 小时的完整数据。例如,可以通过下面的查询确认某个异常数据集覆盖的时间范围:
SELECT
min(TimestampTime),
max(TimestampTime)
FROM otel_anomaly_3.otel_logs
┌──min(TimestampTime)─┬──max(TimestampTime)─┐
│ 2025-07-22 08:25:40 │ 2025-07-22 09:36:01 │
└─────────────────────┴─────────────────────┘
现在我们已经完成了数据集的采集,接下来将首先进行人工分析,然后测试各个模型在排查这些异常时的实际表现,看看哪个模型更具实用价值。
实验方法
手动排查
在将问题交给大语言模型(LLM)分析之前,我们需要像 SRE 一样,先对每个异常进行全面的人工排查,找出其真实的根本原因。这一步是后续验证模型效果的基准:如果模型的判断正确,我们能确认;如果出现偏差,我们也可以适时纠正并辅助其深入。
在这一环节中,我们使用了基于 ClickHouse 构建的可观测性平台 ClickStack(https://clickhouse.com/use-cases/observability)。
我们逐个定位问题,确认异常点,并记录下完整的排查路径。这将成为之后评估模型表现的参考标准。
AI 驱动的排查
接下来轮到 AI 模型上场。
我们通过连接到 ClickHouse MCP 服务(https://clickhouse.com/docs/use-cases/AI/MCP/librechat) 的 LibreChat(https://www.librechat.ai/),让每个大语言模型在完全相同的场景下执行任务。借助这一配置,模型可以直接查询实际的可观测性数据,自主进行分析推理。
为便于观测整个过程中的 Token 使用量,我们还对 LibreChat 和 ClickHouse MCP Server 进行了仪表化(https://clickhouse.com/blog/llm-observability-clickstack-mcp)。遥测数据保存在 ClickHouse 中,因此我们可以通过查询语句统计整个分析过程中所消耗的 Token 数。
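-- 按 conversationId 汇总单次调查消耗的 prompt/completion token 数以及起止时间
-- 执行时需在 WHERE 子句中补全目标会话的 conversationId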
SELECT
LogAttributes['conversationId'] AS conversationId,
sum(toUInt32OrZero(LogAttributes['completionTokens'])) AS completionTokens,
sum(toUInt32OrZero(LogAttributes['tokenCount'])) AS tokenCount,
sum(toUInt32OrZero(LogAttributes['promptTokens'])) AS promptTokens,
anyIf(LogAttributes['text'], (LogAttributes['text']) != '') AS prompt,
min(Timestamp) AS start_time,
max(Timestamp) AS end_time
FROM otel_logs
WHERE conversationId =
GROUP BY conversationId
每次测试我们都以相同的提示词开始:
"你是一个可观测性智能体,可以访问一个演示应用的 OpenTelemetry 数据。用户报告了使用应用时遇到问题,你能识别出具体问题、根本原因,并提出可能的解决方案吗?"
如果模型在首次提示后就能准确定位问题,那就非常理想。如果没有,我们会根据其回答进一步追问;若完全偏离方向,则提供更多上下文帮助其重回正轨。
针对每个异常,我们都会记录以下几个维度:
模型发现了哪些问题
判断是否准确
是否需要人工引导
使用了多少 Token
整体分析耗时
这些数据帮助我们评估模型在真实 SRE 场景下的效率与可靠性。
实验详解
在这一部分,我们将逐一分析各个异常情况。每个异常都先进行人工排查,随后在模型中使用统一提示词进行测试,并记录模型的响应过程。如果模型在初始阶段未能定位问题,我们会通过进一步提示引导其深入分析。
异常 1:支付服务故障
场景:用户反馈在结账流程中遇到问题,填写完所有订单信息后点击“下单”按钮时出现错误提示。
手动排查
这个问题在 ClickStack 中排查起来相对简单。我们首先查看一个包含异常的用户会话。
在该会话的 trace 视图中,我们发现 payment service 返回了错误信息,提示 loyalty 等级为 gold 的用户所用的 Token 无效,导致支付请求失败。
接着,我们在 traces explorer 视图中进一步筛选仅与 payment service 相关的追踪数据,验证了该错误确实只影响 Gold 等级用户。
综上所述,payment service 在处理 Gold loyalty 用户的支付请求时存在问题。
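为便于复现,下面给出一个示意性的 ClickHouse 查询,用于统计 payment 服务错误 span 在各 loyalty 等级上的分布。查询假设使用 ClickStack 默认的 otel_traces 表结构,其中服务名为假设值,属性键 app.loyalty.level 以演示应用的实际埋点为准,仅供参考:
SELECT
    SpanAttributes['app.loyalty.level'] AS loyalty_level,
    count() AS error_spans
FROM otel_traces
WHERE ServiceName = 'payment'   -- 服务名为假设,请以实际 ServiceName 为准
  AND StatusCode = 'Error'
GROUP BY loyalty_level
ORDER BY error_spans DESC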
LLM 排查
Claude Sonnet 4 模型
Claude Sonnet 4 模型在首次提示后便成功定位了问题根因,并给出了解释。
查看完整交互过程:
Claude Sonnet 4 - 异常 1 分析过程
提示词 1:
You're an Observability agent and have access to OpenTelemetry data from a demo application. Users have reported issues using the application, can you identify what is the issue, the root cause and suggest potential solutions?
模型响应片段:
OpenTelemetry Analysis Report
Issues Identified
1. Primary Issue: Payment Service Gold Tier Failure
- What: Starting at 11:52:35 on July 22, 2025, all payment transactions for users with "gold" loyalty level began failing with "Invalid token" errors
- Impact: 50% payment failure rate (13,505 failed transactions out of 27,000 total payment attempts)
- Error Message: "Payment request failed. Invalid token. app.loyalty.level=gold"
OpenAI o3 模型
OpenAI o3 模型在首次响应中未能使用 MCP 服务连接查询 ClickHouse 数据。在后续提示后,它才成功查询了数据并进行了分析,但分析重点落在受影响的上层 trace 节点上,而未触及真正的根因。在我们提供明确引导后,它才定位到问题源头。
查看完整交互过程:
OpenAI o3 - 异常 1 分析过程
提示词 1
You're an Observability agent and have access to OpenTelemetry data from a demo application. Users have reported issues using the application, can you identify what is the issue, the root cause and suggest potential solutions?
模型响应片段:
Evidence
1. Traces
– Payment service: 13,512 error spans, all named
"Error: Payment request failed. Invalid token. app.loyalty.level=gold"
– Checkout service: 27,024 error spans (same TraceId chain) with gRPC method oteledemo.PaymentService/Charge returning INTERNAL.
– Frontend / frontend-proxy: ~29,000 error spans ("Failed to place order").
2. Logs
– 27,010 "failed to charge card: ... Invalid token ... loyalty.level=gold" log records.
– Small number (~6,000) of ECONNREFUSED errors, but they are an order of magnitude lower than the “Invalid token” failures.
3. Metrics
– No CPU / memory anomalies; latency spikes coincide with spikes in payment errors.
– Error rate is almost 100% for PlaceOrder when loyalty.level = gold.
OpenAI GPT-4.1 模型
OpenAI GPT-4.1 模型在多轮提示下逐步收敛到正确答案。初始分析偏离方向,经过两轮进一步提示后,成功识别出支付服务中 loyalty 等级处理异常的问题。
查看完整交互过程:
OpenAI GPT-4.1 - 异常 1 分析过程
提示词 1:
You're an Observability agent and have access to OpenTelemetry data from a demo application. Users have reported issues using the application, can you identify what is the issue, the root cause and suggest potential solutions?
响应节选:
To help you identify the issue, root cause, and suggest solutions, I’ll need to analyze relevant OpenTelemetry data from your demo application. This typically includes traces, metrics, and logs. Here’s how we’ll approach this:
1. Issue Identification: I’ll look for error traces, high latencies, dropped requests, or anomalies.
2. Root Cause Analysis: I’ll analyze the traces to see where bottlenecks, errors, or resource issues might exist (e.g., slow database calls, timeouts, dependency errors).
3. Potential Solutions: Based on findings, I’ll suggest actionable solutions.
提示词 2:
You have access to the data, can you start analyzing?
响应节选:
Observed Issue
- High average span durations in several services, especially load-generator, frontend-proxy, and frontend.
- No explicit errors in logs or trace statuses.
- This points toward a potential performance/bottleneck issue rather than a code crash or failure.
提示词 3:
Users have been reported issues during the payment process
响应节选:
Impact
- Users, particularly with gold loyalty level, cannot complete payment, resulting in user-facing failures and high error rates in backend logs and traces.
Gemini 2.5 Pro
Gemini 2.5 Pro 模型在尝试使用 MCP 工具查询数据库时最初遇到了一些问题。最开始分析的方向偏向于某个非关键 API 的延迟指标。经过引导后,它逐步聚焦于 payment 流程,并准确识别出了 Gold 用户支付失败的问题。
查看完整交互过程:
Gemini 2.5 Pro - 异常 1 分析过程
提示词 1:
You're an Observability agent and have access to OpenTelemetry data from a demo application. Users have reported issues using the application, can you identify what is the issue, the root cause and suggest potential solutions?
响应节选:
### Issue
The application is experiencing high latency, with the `frontend-proxy` service being the most affected. The P95 latency for this service is over 1 second, which is significantly impacting the user experience.
### Root Cause
The root cause of the high latency is long-running calls from the `frontend-proxy` service to the `flagd.evaluation.v1.Service/EventStream` endpoint of the `flagd` service. The trace analysis shows that these calls can take over 3 minutes to complete. This is because the `EventStream` is a streaming endpoint designed to provide real-time updates of feature flags. The `frontend-proxy` is making a request to this endpoint and holding the connection open, which is causing the long-running spans and, consequently, the high latency.
The `canceled: "true"` attribute on the `router flagservice egress` span in the `frontend-proxy` service suggests that the proxy is timing out and canceling the request. However, the `flagd` service does not seem to be handling this cancellation correctly and continues to process the request, keeping the stream open for an extended period.
### Solution
To resolve this issue, the following actions are recommended:
* **Use a Short-Lived RPC for Feature Flag Evaluation**: Instead of using the `EventStream` endpoint, the `frontend-proxy` should use a more appropriate, short-lived RPC for fetching feature flags. This will ensure that the connections are not held open for extended periods, reducing the latency of the `frontend-proxy` service.
* **Investigate the `flagd` Service**: The `flagd` service should be investigated to understand why it's not handling client-side cancellations correctly. The developers should ensure that the service gracefully handles canceled requests and closes the streams promptly.
* **Review the `frontend-proxy` Configuration**: The `frontend-proxy`'s configuration should be reviewed to ensure that it's correctly handling streaming responses and that the timeouts are set to appropriate values.
By implementing these solutions, you should be able to significantly reduce the latency of the `frontend-proxy` service and improve the overall performance and reliability of your application.
提示词 2:
Users have reported issues during the payment process
响应节选:
Based on my investigation, I've identified the root cause of the payment issues and have recommendations for a solution.
### Summary of Findings
* **No Explicit Errors:** My initial search for traces and logs with an "ERROR" status for the `payment` service came up empty. This indicates that the application is not correctly flagging these issues as errors.
* **Anomalous Span Names:** An analysis of the service's trace data revealed a peculiar pattern: spans being created with names that are actually error messages. For example:
* `Error: Payment request failed. Invalid token. app.loyalty.level=gold`
* `Error: The credit card (ending 1791) expired on 1/2025.`
* **Error Context:** By examining these "error spans," I found that the "Invalid token" error consistently occurs for users with the `app.loyalty.level` attribute set to `gold`.
### Root Cause
The user-reported payment issues are caused by a bug in the `payment` service that specifically affects customers with a "gold" loyalty status. When these users attempt to make a payment, the service fails with an "Invalid token" error.
Furthermore, the application is handling these errors incorrectly from an observability standpoint. Instead of setting the status of the parent span (e.g., the `charge` or `grpc.oteldemo.PaymentService/Charge` span) to `ERROR`, the developers have written code that creates a new, zero-duration span with the error message as its name. This practice hides the errors from standard monitoring dashboards and makes it difficult to track the error rate of the service.
### Recommended Solutions
1. **Fix the Payment Logic:**
* **Immediate Priority:** The development team must investigate and fix the bug in the `payment` service that causes token validation to fail for "gold" loyalty members. This is the primary cause of the user-facing issue.
2. **Correct the Instrumentation:**
* **Best Practice:** The practice of creating new spans to log errors should be stopped. Instead, developers should adhere to OpenTelemetry best practices:
* When an error occurs, the `StatusCode` of the existing span in which the error happened should be set to `ERROR`.
* The error details should be recorded as an `Event` (also known as a `Log`) within that span, including attributes for the exception type, message, and stack trace. This ensures that errors are properly tracked, correlated with traces, and can be used to trigger alerts.
By addressing both the underlying bug in the payment logic and the incorrect instrumentation, you will resolve the user-reported issues and significantly improve the observability and reliability of your application.
大语言模型调查总结
在这一异常上,Claude Sonnet 4 仅凭首个提示就定位了根因;OpenAI o3 和 Gemini 2.5 Pro 需要额外引导才能聚焦到 payment 服务;GPT-4.1 则经过多轮提示后才收敛到正确结论。
异常 2:推荐缓存泄漏
接下来分析的异常发生在 recommendation 服务的缓存逻辑中:我们引入的问题会导致该服务的 CPU 使用率异常飙高。
手动排查
我们首先从日志入手,发现错误信息数量显著上升。
在初步浏览日志时,并未立刻发现异常的具体原因。我们注意到多条连接错误日志,例如:⨯ Error: 14 UNAVAILABLE: No connection established. Last error: connect ECONNREFUSED 34.118.225.39:8080 (2025-07-22T11:05:44.834Z)。
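若想快速量化这类连接错误,可以用类似下面的查询按分钟统计包含 ECONNREFUSED 的日志条数(表结构沿用 ClickStack 默认的 otel_logs,仅作示意):
SELECT
    toStartOfMinute(TimestampTime) AS minute,
    count() AS econnrefused_logs
FROM otel_logs
WHERE Body LIKE '%ECONNREFUSED%'
GROUP BY minute
ORDER BY minute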
接着,我们在 Trace Explorer 中将视图过滤为仅包含错误的 trace,并排除了产生大量干扰信息的 load-generator 服务。
进一步排查发现,这些错误主要集中在 recommendation service 上。于是我们进一步聚焦,只查看该服务的 trace 数据。
在实验过程中,trace 数量出现明显下降。我们通过 Services 视图中的请求吞吐量图进一步验证这一情况。
结果显示,recommendation service 的吞吐量下降,而延迟上升。这表明服务虽然仍在响应,但响应变慢,可能已成为系统瓶颈,导致超时或连接失败。
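这一吞吐量与延迟的变化也可以直接在 ClickHouse 中验证,例如按 5 分钟窗口统计 recommendation 服务入口 span 的请求数和 P95 延迟。以下查询基于 ClickStack 默认表结构的假设(Duration 以纳秒存储,SpanName 以实际埋点为准),仅作示意:
SELECT
    toStartOfInterval(Timestamp, INTERVAL 5 MINUTE) AS bucket,
    count() AS requests,
    round(quantile(0.95)(Duration) / 1e6, 1) AS p95_ms   -- 假设 Duration 为纳秒,换算为毫秒
FROM otel_traces
WHERE ServiceName = 'recommendation'
  AND SpanName LIKE '%RecommendationService/ListRecommendations'
GROUP BY bucket
ORDER BY bucket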
我们返回 Traces 页面,尝试在事件差异(Event deltas)视图中寻找异常模式。
在按 get_product_list 函数过滤后,我们利用离群值(outliers)对比功能,分析快慢请求在属性上的差异。
结果显示,慢请求普遍具有以下特征:返回超过 10 个产品,且启用了 recommendation cache。
进一步分析发现,在启用缓存的请求中,若产品数量较多且发生缓存未命中,系统响应时间会显著增加。
最终确认,问题与 recommendation cache 的启用状态及返回产品数量有关。
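上述结论也可以用一个简单的聚合查询交叉验证:按缓存开关属性对 get_product_list span 的耗时进行分组对比。其中属性键名 app.cache_enabled 为假设,请以实际 SpanAttributes 中的键为准:
SELECT
    SpanAttributes['app.cache_enabled'] AS cache_enabled,   -- 属性键名为假设
    count() AS spans,
    round(avg(Duration) / 1e6, 2) AS avg_ms,
    round(quantile(0.95)(Duration) / 1e6, 2) AS p95_ms
FROM otel_traces
WHERE SpanName = 'get_product_list'
GROUP BY cache_enabled
ORDER BY p95_ms DESC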
LLM 排查
Claude Sonnet 4 模型
Claude Sonnet 4 模型在初步分析中定位到了 recommendation service 是故障源,但未能立刻识别出缓存相关的问题。
虽然整体进展有限,但模型准确指出该服务因资源耗尽而频繁重启,累计达 12 次——这是我们在人工排查中未曾注意到的细节。
模型进一步分析指出,重启是由于 CPU 达到上限,并非内存泄漏造成。这一发现虽然有价值,但仍未触及缓存机制的问题。
这也反映了大语言模型的一个典型特征:一旦进入某个推理方向,就倾向于在该路径上反复深入,而不太主动探索其他可能性。在我们进一步追问 “为什么该 pod 会占用 100% CPU” 时,模型终于定位到问题可能出在 recommendation cache 的处理方式上。
查看完整交互过程:
Claude Sonnet 4 - 异常 2 分析流程
提示词 1:
You're an Observability agent and have access to OpenTelemetry data from a demo application. Users have reported issues using the application, can you identify what is the issue, the root cause and suggest potential solutions?
响应节选:
The Recommendation Service experienced a complete outage, causing cascading failures throughout the application:
1. Primary Issue:
The recommendation service became completely unavailable between 10:13–11:05 UTC
• Normal span count dropped from 14,000 per 5-min window to 300–400
• Service was essentially down for nearly an hour
2. Connection Failures:
Frontend services couldn’t connect to recommendation service endpoints:
• IP addresses being refused: 34.118.225.39:8080, 34.118.236.87:8080, 34.118.226.49:8080
• 295,538 error logs from frontend service with "ECONNREFUSED" errors
• 51,190 failed gRPC calls to RecommendationService/ListRecommendations
提示词 2:
Why does the RecommendationService/ListRecommendations fail?
响应节选(1/2):
Root Cause Analysis: Why RecommendationService/ListRecommendations Fails
Based on my deep analysis of the observability data, here’s exactly why the RecommendationService/ListRecommendations is failing:
The Core Problem: Resource Exhaustion Leading to Cascading Failures
Primary Root Cause:
The recommendation service is experiencing resource starvation in its Kubernetes environment, leading to repeated container restarts.
响应节选(2/2):
Restart Pattern:
• 12 service restarts in just 1 hour (10:13–11:07)
• Restart frequency: Every 3–7 minutes
• Restart timestamps show clear crash–restart cycles
提示词 3:
Why do you think the recommendation service restarted?
响应节选:
Why the Recommendation Service Restarted – The Complete Analysis
Based on my deep dive into the observability data, I can now provide the exact reason for the recommendation service restarts:
⸻
Root Cause: CPU Throttling Leading to Kubernetes Health Check Failures
The Primary Trigger:
The recommendation service is being killed by Kubernetes due to failed health checks, not by memory issues (OOMKilled).
⸻
Evidence-Based Analysis:
Restart Pattern Analysis:
• 12 restarts in 54 minutes (10:13:40 – 11:07:23)
• Restart intervals: 54s, 96s, 77s, 179s, 248s, 436s, 411s, 488s, 414s, 400s, 420s
• Pattern: Initial rapid restarts (54–96s), then gradually increasing intervals as system stabilizes
提示词 4:
Why is the pod using 100% CPUs?
响应节选:
Root Cause Analysis:
The Problem: Algorithmic Complexity Explosion
What’s happening inside the recommendation service:
1. Cache Warming Issue:
At 10:12, something changes in the request patterns.
2. Cold Cache Performance:
When cache misses happen, the recommendation algorithm becomes computationally expensive.
3. Product List Processing:
The get_product_list function is doing heavy computation.
4. Cascade Effect:
As processing time increases, requests queue up, making everything slower.
OpenAI o3 模型
OpenAI o3 模型识别出 recommendation service 不可用,但初始阶段未能解释具体原因。进一步提示后,模型确认 pod 已崩溃,但无法明确其触发因素,仅推测可能由内存问题或某个功能开关引起,方向基本合理。
在我们持续引导模型关注内存问题后,它最终识别出问题可能与启用缓存功能(cache_enabled)相关。然而在解释问题成因时,它虚构了一段实际代码库中并不存在的代码,这很可能是因为它无法访问应用源代码。
查看完整交互过程:
OpenAI o3 - 异常 2 分析流程
提示词 1:
You're an Observability agent and have access to OpenTelemetry data from a demo application. Users have reported issues using the application, can you identify what is the issue, the root cause and suggest potential solutions?
响应节选:
Root cause (most likely)
The Recommendation Service is down or unreachable from the cluster:
• Pod(s) may be CrashLooping or scaled to 0
• Service/Endpoint list may be empty or pointing to the wrong port
• Container may start but not bind to 0.0.0.0:8080 (misconfiguration, wrong PORT env-var)
提示词 2:
Why does the RecommendationService/ListRecommendations fail?
完整响应:
In short:
ListRecommendations fails whenever traffic is routed to an orphan / wrong IP (34.118.225.39).
Clean up that bad endpoint and the errors disappear.
提示词 3:
Can you check before the recommendation service became unavailable what could have caused to crash?
响应节选:
What probably caused the crash
– Memory leak or a code path that loads a huge product list into RAM
(note the 424 MiB / 332 MiB jumps right after fetching the product catalogue)
– A recent feature-flag or config change that increases the amount of data the service keeps in memory on start-up
提示词 4:
"note the 424 MiB/332 MiB jumps right after fetching the product catalogue" This is interesting. Can you check why fetching the product catalogue increased the memory a lot?
响应节选(1/2):
10:12 – flag flips → cache enabled
• ResolveBoolean/flagd traces appear for key "recommendation.cache_enabled"
• Subsequent get_product_list spans show cache_enabled = true
• In the very first minute after the flip, the value of app.products.count jumps from 10 to 20,226,406 (≈ 2 × 10⁷)
• container.memory.usage spikes to 830 MiB and the pod is OOM-killed
响应节选(2/2):
What the numbers tell us
The code that builds the cached product list is appending the entire list to itself on every cache miss / cold start instead of replacing it:
new_list = cache.get() # [] on first run
new_list += remote_product_list # +10
cache.set(new_list)
After a restart, the cache is empty again, so the first call fetches 10 products and stores them.
But a second call (still within the same process) does:
new_list = cache.get() # 10 items
new_list += remote_product_list # +10 ⇒ 20
cache.set(new_list)
OpenAI GPT-4.1
GPT-4.1 在独立排查根因方面存在一定困难。除非我们明确提示,它不会主动使用 MCP 工具查询数据。默认分析范围主要集中在日志层面,若希望它参考 trace 和指标数据,需要额外提示。
在分析所有数据后,模型指出 flagd 服务可能是导致多个组件性能问题的瓶颈。虽然这个判断部分成立,但我们在人工调查中已确认,这并非主要问题。
我们进一步引导模型聚焦于 Recommendation 服务,它才在我们明确指出后,识别出缓存机制存在问题。
查看完整交互过程:
OpenAI GPT-4.1 - 异常 2 分析详情
提示词 1:
You're an Observability agent and have access to OpenTelemetry data from a demo application. Users have reported issues using the application, can you identify what is the issue, the root cause and suggest potential solutions?
响应节选:
Absolutely, I can help you analyze OpenTelemetry data to diagnose issues within your demo application.
However, I’ll need more specific information or access to the relevant datasets (for example, traces, logs, or metrics from your observability backend).
提示词 2:
You have access to the OTel data in clickhouse, please query it.
响应节选:
What this means
• The backend services (like API, web, or data processors) are not showing trace-level errors in the trace data.
• The only error detected is from the load-generator, which is likely the synthetic traffic generator for testing/demo purposes.
This could be expected and not user-impacting.
提示词 3:
You also have access to traces and metrics data to further continue the investigation
响应节选:
Issue Identified
A major performance bottleneck is occurring in the /flagd.evaluation.v1.Service/EventStream span across multiple critical services (recommendation, fraud-detection, ad, flagd).
This is causing very high latencies, likely resulting in a degraded or unusable application experience for users.
提示词 4:
Yes, can you focus on the recommendation service?
完整响应:
Conclusion & Root Cause:
• The recommendation service is blocked waiting for a response from flagd’s EventStream, hitting timeouts.
This means either the flagd service is slow, unavailable, or network connectivity to it is broken.
• Because flagd (feature flag evaluation) is used within critical request paths,
when it fails or hangs, the entire recommendation chain suffers, severely impacting user experience.
提示词 5:
Top durations for /oteldemo.RecommendationService/ListRecommendations are also very high (over 21–22 seconds), suggesting calls to downstream dependencies (like flagd) are causing severe slowness in client-facing endpoints. Why do you think recommendation service (ListRecommendations) is calling flagd?
完整响应:
Trace Duration Patterns: • The /flagd.evaluation.v1.Service/EventStream spans within the recommendation service show extremely high durations(all stuck at 600+ seconds, matching the gRPC deadline). • The /oteledemo.RecommendationService/ListRecommendations spans also have high durations (~21–22 seconds),though not as long as the flagd EventStream spans, but still anomalous for a user-facing endpoint.
提示词 6:
You have access to tracing data, you can check the direct trace linking.
完整响应:
• Based on the evidence, I cannot confirm with this trace dataset that ListRecommendations is directly calling flagd via EventStream in a way that is instrumented by distributed traces.
• The slowness seen in both operations remains highly correlated.
提示词 7:
if you can't find a direct link. Can you see another reason the recommendation service is running slow?
完整响应:
Key Observations: • Durations are very high (around 19–22 seconds). • All these slow requests have StatusCode “Unset” and rpc.grpc.status_code = "0"(which means the gRPC call technically succeeded from a protocol point of view). • The span attributes are consistent—each request recommended 5 products, the peer IP was the same (10.32.10.87), and the user agent is grpc-node-js/1.12.6. • No explicit errors or useful StatusMessage are present.
提示词 8:
We recently upgraded the function get_product_list to optimize the cache logic. Can you check this didn't create a regression?
完整响应:
Is this a regression? • A p95 response time of 200–250 ms might be acceptable for some backends, but if this represents a rise compared to previous levels, it could indicate a performance regression from the cache optimization. • If this is much higher than before the change, it could mean the cache is not being hit effectively, or new cache logic is causing delays (e.g., added serialization, locking, or slower fallback path).⸻Signs of regression: • There’s a jump from 204 ms to 251 ms during the 10:00–11:00 interval, possibly coinciding with the upgrade/deployment window.
提示词 9:
Looking at the data it looks like the request latency for get_product_list went up though.
完整响应:
Summary:There is strong evidence from tracing data that the cache logic upgrade in get_product_list introduced increased latency, likely hurting overall recommendation service responsiveness.Rolling back or tuning the new cache logic should be prioritized.
Gemini 2.5 Pro
Gemini 模型一开始识别出系统中包含 trace、日志和指标数据,并表示将以 trace 为起点进行分析。然而在整个过程中,它并未使用日志或指标数据,而是在 trace 层面陷入错误路径。
在我们提示其使用日志后,Gemini 开始转向正确方向,识别出 recommendation 服务存在请求超时。但它未能进一步探明原因。当我们继续引导其分析超时根因时,模型再次回到 trace,并判断问题为 DNS 错误——尽管数据中确实存在 DNS 错误,但这些错误与当前问题无关。我们再次提示查看日志,结果依旧无新发现。
随后,我们引导其查询指标表,模型成功发现 pod 的 CPU 使用率达到了 100%,并判断这可能是故障根源。但当我们进一步要求其分析为何出现高 CPU 使用时,模型开始在已有推理中反复打转,并未提出新见解。
更令人担忧的是,模型随后提出了一个“有毒产品 ID(poison product ID)”的假设:尽管缺乏任何支持这一推断的证据,它仍试图通过数据印证自己的设想。这一假设最终未能得到验证,模型也坦承没有找到依据。
我们进一步要求其查看某次具体超时发生前后的资源使用情况,这一操作反而让 Gemini 放弃了 CPU 问题的判断,再度回到“网络开销/DNS 才是问题”的结论(尽管事实并非如此)。
查看完整交互过程:
Gemini 2.5 Pro - 异常 2 分析详情
提示词 1:
You're an Observability agent and have access to OpenTelemetry data from a demo application. Users have reported issues using the application, can you identify what is the issue, the root cause and suggest potential solutions?
响应节选:
This is a very interesting finding. The top 10 longest-running spans are all related to the `flagd.evaluation.v1.Service/EventStream`. The durations are extremely high, in the order of 600 seconds (10 minutes), which is a strong indicator of a timeout or a serious performance bottleneck.
The services affected are `recommendation`, `fraud-detection`, and `ad`, which suggests a widespread problem with a shared dependency. In this case, the dependency is the `flagd` service, which appears to be a feature flagging service.
Based on this, I have identified the following:
**Issue:**
Users are experiencing significant performance issues with the application. These issues are not presenting as explicit errors but as very slow responses, leading to a poor user experience.
**Root Cause:**
The root cause of the performance degradation is the `/flagd.evaluation.v1.Service/EventStream` endpoint. This service is used by multiple other services (`recommendation`, `fraud-detection`, and `ad`) to receive feature flag updates. The extremely long duration of these calls indicates that the client services are waiting for a response from the `flagd` service for up to 10 minutes, which is causing them to become unresponsive.
**Potential Solutions:**
To resolve this issue, I recommend the following actions:
* **Investigate the `flagd` service:** The `flagd` service should be the primary focus of the investigation. Check its logs and metrics to understand why the `EventStream` endpoint is not responding in a timely manner.
* **Review network policies:** It is possible that a network policy is preventing the client services from connecting to the `flagd` service, leading to a timeout. Review the network policies to ensure that there are no rules blocking this traffic.
* **Implement connection timeouts:** The client services that connect to the `flagd` service should have a much shorter connection timeout. This will prevent them from being stuck for long periods and will allow them to fail fast and recover more gracefully. A timeout of a few seconds would be a good starting point.
* **Add readiness and liveness probes:** If the `flagd` service is running in a containerized environment (like Kubernetes), adding readiness and liveness probes will help automatically detect and recover from situations where the service is not responding.
By implementing these recommendations, you should be able to resolve the performance issues and improve the user experience of your application.
提示词 2:
I dont think this is the problem. have you considered using the logs table as well?
完整响应:
Excellent! I've found the smoking gun. The logs from the `frontend-proxy` service show a clear pattern of `504 UT response_timeout` errors when calling the `frontend` service's `/api/recommendations` endpoint. This is a strong indicator that the `frontend` service is the bottleneck. The `response_timeout` message in the logs confirms that the `frontend-proxy` is waiting too long for a response from the `frontend` service.
Now that I've narrowed down the problem to the `frontend` service, I will investigate its traces to understand what's causing the delay. I'll focus on the `/api/recommendations` endpoint and look for the longest-running spans to pinpoint the exact operation that's causing the bottleneck.
I will now query the `otel_traces` table for the top 10 longest-running spans in the `frontend` service with the `SpanName` of `/api/recommendations`.
...
This is unexpected. I see that there are no traces for the `/api/recommendations` span in the `frontend` service, even though the logs from the `frontend-proxy` clearly show that this endpoint is being called.
This could mean that the `frontend` service is not correctly configured to generate traces for this endpoint. This is a common problem in distributed tracing setups, where some services might not be properly instrumented.
Even though I can't see the traces for the `/api/recommendations` endpoint, the logs from the `frontend-proxy` provide enough evidence to conclude that the `frontend` service is the root cause of the performance issues. The `504 UT response_timeout` errors are a clear indication that the `frontend` service is not responding in a timely manner.
Based on this, I can provide the following updated analysis:
**Issue:**
Users are experiencing significant performance issues with the application. These issues are not presenting as explicit errors but as very slow responses, leading to a poor user experience.
**Root Cause:**
The root cause of the performance degradation is the `frontend` service. The `frontend-proxy` is timing out when calling the `/api/recommendations` endpoint on the `frontend` service, as evidenced by the `504 UT response_timeout` errors in the `frontend-proxy`'s logs.
**Potential Solutions:**
To resolve this issue, I recommend the following actions:
* **Investigate the `frontend` service:** The `frontend` service should be the primary focus of the investigation. Check its logs and metrics to understand why the `/api/recommendations` endpoint is not responding in a timely manner.
* **Instrument the `frontend` service for tracing:** To get a better understanding of the performance of the `frontend` service, it's crucial to instrument it to generate traces. This will allow you to see the full lifecycle of a request and identify the exact operations that are causing the delay.
* **Review the code for the `/api/recommendations` endpoint:** The code for the `/api/recommendations` endpoint should be reviewed to identify any potential performance bottlenecks. This could include inefficient database queries, slow external service calls, or other performance anti-patterns.
* **Add readiness and liveness probes:** If the `frontend` service is running in a containerized environment (like Kubernetes), adding readiness and liveness probes will help automatically detect and recover from situations where the service is not responding.
By implementing these recommendations, you should be able to resolve the performance issues and improve the user experience of your application. The most critical step is to instrument the `frontend` service so you can get a better understanding of its performance and identify the root cause of the timeouts.
提示词 3:
Why does the recommendation service timeout?
响应节选:
This is the exact question we need to answer. The trace data has provided a crystal-clear picture of what's happening. Let's break it down.
**Analysis of the Trace (`425b78d730282b74af1a880fa7758a3b`)**
1. **The Client's Perspective (`frontend` service):**
* The `frontend` service makes a call to the `recommendation` service (`grpc.oteldemo.RecommendationService/ListRecommendations`).
* This call takes **307 milliseconds** from the `frontend` service's point of view.
2. **The Server's Perspective (`recommendation` service):**
* The `recommendation` service receives the request, processes it, and sends a response.
* The actual processing time inside the `recommendation` service (`/oteldemo.RecommendationService/ListRecommendations` span) is only **5 milliseconds**.
**The Root Cause: Massive Network Overhead**
The difference between the client's measured time (307 ms) and the server's processing time (5 ms) is **302 milliseconds**. This massive delay is happening *outside* the recommendation service's code.
I found the same pattern for all other downstream calls from the `frontend` service. For example, calls to the `product-catalog` service take about **218 ms** from the client's side but are processed in less than **1 ms** on the server side.
This means there is a consistent, large overhead of **~200-300 milliseconds** for every single outgoing gRPC call made by the `frontend` service.
**Conclusion**
The **recommendation service is not the problem**. It's extremely fast. The problem lies within the `frontend` service's environment, which is adding a significant delay to every network call it makes.
This is the root cause of the entire issue. This fixed delay makes the `frontend` service incredibly slow, which in turn causes the `frontend-proxy` to time out, just as we saw in the logs.
**Likely Causes for the Overhead and Next Steps**
This type of fixed overhead on outgoing network calls is often caused by misconfigurations in the environment, such as:
* **DNS Resolution Issues:** The `frontend` service's container might be trying to resolve service names (like `recommendation.default.svc.cluster.local`) using an external DNS server that fails or times out, before falling back to the correct internal cluster DNS.
* **Service Mesh Misconfiguration:** If you are using a service mesh like Istio or Linkerd, a misconfigured sidecar proxy could be adding this latency to every call.
* **Network Policies:** A network policy could be slowing down connection establishment.
**Recommendation:**
Your next step should be to **investigate the runtime environment of the `frontend` service pod**. Specifically, check its DNS configuration (`/etc/resolv.conf`) and any service mesh or network policy configurations that apply to it. Resolving this environmental issue should eliminate the network call overhead and fix the application's performance problems.
提示词 4:
Ok but what is causing the timeout?
完整响应:
You are asking the central question. I will connect all the pieces of evidence to give you a definitive answer.
The timeout is caused by a **"death by a thousand cuts"** scenario happening inside the `frontend` service. The root cause is the **~250ms of network overhead** that is added to *every single* outbound call it makes.
Here is the step-by-step breakdown of how this leads to a 15-second timeout for a single `/api/recommendations` request:
1. **A Request Arrives:** A user requests recommendations, hitting the `frontend-proxy`, which forwards the request to the `frontend` service's `/api/recommendations` endpoint. The `frontend-proxy` starts a 15-second timer.
2. **The `frontend` Service's Job:** To build the response, the `frontend` service needs to gather data from other services. Looking at the trace data, we know for this single request it makes **5 downstream calls**:
* 1 call to the `recommendation` service.
* 4 calls to the `product-catalog` service.
3. **The Hidden "Network Tax":** As we discovered, every one of these 5 calls is burdened with a ~250 millisecond delay before it even leaves the `frontend` service's environment. The downstream services are incredibly fast, but this overhead is unavoidable on every call.
4. **Calculating the Cumulative Delay:** These small delays add up. The total *self-inflicted* delay caused by the network overhead for just this one request is:
> 5 calls × ~250 ms/call = **~1.25 seconds**
This means that even if the other services responded instantly, the `/api/recommendations` endpoint would still take over a second to respond.
5. **Amplification Under Load:** A 1.25-second response time is slow, but it's not a 15-second timeout. The timeout happens when the application is under load. The `load-generator` service is constantly sending requests. Because every request is slowed down by this 1.25-second (or more) overhead, the `frontend` service becomes a bottleneck. Requests start to pile up in a queue, waiting to be processed. This queuing and resource contention (CPU, connection pools) dramatically increases the total response time, pushing it far beyond the initial 1.25 seconds.
**Conclusion:**
The `frontend-proxy` times out because the `frontend` service, burdened by the cumulative effect of its network overhead, cannot process requests fast enough under normal load. The response time for some requests eventually exceeds the 15-second limit, triggering the `504 UT response_timeout` error you saw in the logs.
The problem isn't one slow downstream call, but rather the compounding effect of a small, consistent environmental problem on many fast downstream calls. To fix the timeout, you must fix the source of the network overhead within the `frontend` service's environment
提示词 5:
youve only used traces again, maybe you should look at the logs and there will be more info?
完整响应:
You are right to insist on a thorough investigation, and I appreciate you pushing me to use all the available data. My apologies for the repeated empty results. The lack of logs for that specific trace is, in itself, a critical discovery.
Let's synthesize everything we've found from both the traces and the logs to build a complete picture.
### The Story So Far:
1. **The Symptom (from Logs):** We started with your suggestion to check the logs. We found clear evidence in the `frontend-proxy` logs of `504 UT response_timeout` errors. This tells us the `frontend-proxy` is giving up on the `frontend` service after waiting for 15 seconds. This is the user-facing problem.
2. **The Bottleneck (from Traces):** The trace data allowed us to follow a request that was causing one of these timeouts. It showed that the `frontend` service was taking an extremely long time to respond.
3. **The "Why" (from Traces):** By breaking down the trace, we discovered something crucial: the `frontend` service makes many calls to other services (like `recommendation` and `product-catalog`). While these downstream services are very fast (responding in ~5ms), there is a consistent, large overhead of about **~250 milliseconds** added to *every single one* of these outgoing calls.
4. **The Missing Piece (from Logs... or lack thereof):** When we tried to find logs from the `frontend` service for that exact slow trace (`425b78d730282b74af1a880fa7758a3b`), we found nothing. This is a major clue. It strongly suggests the problem is happening at a level below the application code itself, likely in the infrastructure or environment where the `frontend` service is running.
### Final Conclusion: The Root Cause of the Timeout
The `frontend-proxy` times out because the **`frontend` service is suffering from a severe "death by a thousand cuts"**. Here's the sequence of events:
1. A request for recommendations arrives at the `frontend` service.
2. To fulfill this request, the `frontend` service must call multiple downstream services (we saw 5 such calls in our example trace).
3. Each of these 5 calls is saddled with a **~250ms "network tax"** before it even leaves the `frontend` service's environment. This tax is the real root cause.
4. This adds up to a minimum self-inflicted delay of **1.25 seconds** (5 calls x 250ms) for what should be a very fast operation.
5. Under normal load, multiple requests hit the `frontend` service at once. Because each one is already slow, they pile up, waiting for the CPU and network resources to become available. This queuing effect amplifies the initial 1.25-second delay, pushing the total response time over the `frontend-proxy`'s 15-second limit.
**Therefore, the timeout is not caused by a slow recommendation service, but by the compounding effect of a networking issue in the `frontend` service's environment.**
### Recommendations
The evidence strongly points to an environmental issue, not an application code issue.
1. **Primary Action: Investigate the `frontend` Service's Pod Environment.** The source of the ~250ms overhead must be found. The most common culprits for this kind of fixed network delay are:
* **DNS Resolution:** Check the `/etc/resolv.conf` file in the `frontend` service's container. It might be trying to contact an external DNS server before using the correct internal cluster DNS.
* **Service Mesh Misconfiguration:** If you're using a service mesh like Istio or Linkerd, inspect the configuration of the sidecar proxy attached to the `frontend` pod.
* **Network Policies:** Review any `NetworkPolicy` resources in your cluster that might be inspecting or slowing down traffic from the `frontend` pod.
2. **Secondary Action: Improve Observability.** This investigation was made much harder because the `frontend` service wasn't logging correctly. To prevent this in the future:
* **Fix the Logging:** Ensure the `frontend` service logs are being generated and, critically, that they are correlated with traces by including the `TraceId` in every log message. This will make future debugging much faster.
提示词 6:
what about looking at resource usage of the recommendation service?
完整响应:
You were absolutely right to insist on checking the resource usage. My apologies for focusing too narrowly on the network latency. The metrics you asked for have revealed the true root cause of the problem.
Here are the results for the resource utilization of the `recommendation` service:
* **Memory Usage (`k8s.container.memory_limit_utilization`):** The average memory usage is **~9.4%** of the limit, with a maximum peak at a healthy **~69.1%**. Memory is not the issue.
* **CPU Usage (`k8s.container.cpu_limit_utilization`):** The average CPU usage is **~29.8%**, but the maximum value is **~100.2%**.
### The Real Root Cause: CPU Throttling
The fact that the `MaxValue` for CPU utilization is at 100% is the smoking gun. This means the `recommendation` service's container is hitting its CPU limit. When this happens, Kubernetes **CPU throttles** the container, preventing it from using more CPU. This throttling directly translates into processing delays.
Here’s the corrected sequence of events:
1. A request arrives at the `recommendation` service.
2. Under load, the service experiences a spike in CPU usage, hitting 100% of its allocated limit.
3. Kubernetes steps in and throttles the container, effectively slowing down its ability to process the request. This causes the service's response time to jump from a few milliseconds to several hundred milliseconds.
4. This delay is passed back to the `frontend` service.
5. The `frontend` service makes multiple calls to other services (including the now-throttled `recommendation` service). The delays from these throttled calls accumulate.
6. This cumulative delay makes the `frontend` service slow to respond to the `frontend-proxy`.
7. Finally, the `frontend-proxy`, waiting on the slow `frontend` service, gives up after 15 seconds, resulting in the `504 UT response_timeout` errors.
My previous conclusion about a "network tax" was incorrect. The delay seen in the traces was not from the `frontend` service's network, but from waiting on the `recommendation` service, which was intermittently being slowed down by CPU throttling. Thank you for pushing me to investigate further.
### Recommendations
The solution is to alleviate the CPU pressure on the `recommendation` service.
1. **Immediate Fix: Increase the CPU Limit.** The most straightforward solution is to increase the CPU limit for the `recommendation` service deployment. This will give it more headroom to handle spikes in load without being throttled.
2. **Long-Term Fix: Investigate CPU Spikes.** While increasing the limit will fix the immediate problem, it's important to understand *why* the CPU is spiking. The developers should investigate the `recommendation` service's code to identify any inefficient operations or queries that could be optimized.
3. **Proactive Monitoring: Set Up Alerts.** To prevent this from happening in the future, you should configure alerts to fire whenever a service's CPU utilization approaches its limit. This will allow you to address resource constraints before they impact users.
提示词 7:
can you narrow down why the CPU reaches 100%?
完整响应:
Of course. Let's dig deeper into the "why." You are right to ask for the next level of detail. We know the `recommendation` service is hitting its CPU limit, but *why* is its CPU usage so high in the first place?
The traces we have examined show that the primary function of the `recommendation` service is to:
1. Call the `product-catalog` service to get a list of products.
2. Process this list to generate recommendations.
My hypothesis is that the CPU spike is directly related to the **size of the product list** it has to process for every single request. A large list could make the recommendation logic itself very CPU-intensive.
To test this, I need to find out how many products are in the catalog. The `product-catalog` service is the source of truth for this. I will inspect the traces originating from this service to find any logs or attributes that indicate the number of products it's handling.
I will now query the `otel_traces` table to look at the attributes of the `ListProducts` span within the `product-catalog` service. This might tell us the size of the payload being returned.
...
This is a fantastic find, but `10` seems far too low to cause a CPU spike. This suggests the sample of one trace might not be representative. The number of products is likely dynamic and can be much larger.
My next step is to find out the **maximum number of products** the `product-casting-service` has ever returned. This will tell us the peak load that the `recommendation` service is expected to handle.
I will now query the `otel_traces` table again for the `ListProducts` span, but this time I will look for the maximum value of the `app.products.count` attribute across all traces. This should give us a much better idea of the potential scale of the problem.
...
This is a dead end. The trace attributes consistently show a small number of products, which contradicts the CPU saturation we're seeing. It means my hypothesis that the *number* of products is the problem is likely wrong.
You've been right to push me to look at different data sources. I've been switching between traces, metrics, and logs, but I haven't used them together effectively enough. Let's try a new approach that directly links the high CPU usage to a specific action.
My new hypothesis is: **There is a "rogue" product or a specific type of product in the catalog that triggers a highly inefficient, CPU-intensive code path in the `recommendation` service.**
Here’s how we can prove this:
1. Find a time window when the `recommendation` service's CPU usage was at its peak (100%).
2. During that exact time window, find out which products were being requested by the `frontend` service.
This will directly correlate the CPU spike with the product being processed at that moment.
First, I need to find the exact timestamp when the CPU utilization hit 100%. I will query the metrics table for this.
...
This is another dead end. It seems the `productIds` are not being logged in the `url.query` attribute in the `frontend-proxy` logs.
My apologies. I have been jumping between different data sources without a clear, systematic approach. It's time to take a step back and think logically, using all the clues we've gathered so far.
Here's what we know for certain:
1. **The Symptom:** The application times out. The logs confirm this with `504` errors from the `frontend-proxy`.
2. **The Bottleneck:** The timeouts are caused by the `frontend` service responding too slowly.
3. **The Root Cause:** The `frontend` is slow because it's waiting on the `recommendation` service.
4. **The Core Problem:** The `recommendation` service is slow because its CPU is being throttled at 100%.
The question remains: **What in the `recommendation` service's code is so inefficient that it consumes 100% of the CPU?**
The traces show that the `recommendation` service's main job is to process a list of products. We have also seen that the *number* of products is small, so we can rule out "processing a large list" as the problem. This leaves us with one logical conclusion:
**There must be a specific, "poison" product ID that triggers a highly inefficient, CPU-intensive code path.**
This could be a product with unusual data that causes an infinite loop, a complex calculation, or some other performance anti-pattern.
Since I have exhausted the possibilities of finding this "poison" product ID through the available telemetry data, the next logical step is to hand this investigation over to the developers. They are the only ones who can analyze the code to find the inefficient logic.
### Final Conclusion and Recommendation for Developers
* **The Issue:** The application is timing out due to a bottleneck in the `recommendation` service.
* **The Root Cause:** The `recommendation` service is experiencing CPU throttling, with usage hitting 100%. This is caused by a specific, yet-to-be-identified "poison" product ID that triggers a highly inefficient code path.
* **Recommendation:**
1. **Review the code for the `recommendation` service:** The developers need to carefully examine the code to identify any logic that could lead to high CPU usage. They should pay special attention to how they handle product data and look for any edge cases that could cause performance issues.
2. **Add more detailed logging:** To make this kind of issue easier to diagnose in the future, the developers should add more detailed logging to the `recommendation` service. Specifically, they should log the product IDs they are processing so that it's easy to identify which product is causing the problem.
3. **Implement a temporary fix:** While the developers are working on a permanent fix, you can implement a temporary workaround by increasing the CPU limit for the `recommendation` service. This will give it more resources to handle the inefficient code and should reduce the number of timeouts.
By following these recommendations, you should be able to resolve the performance issues and improve the stability of your application. The key is to find and fix the inefficient code in the `recommendation` service.
提示词 8:
can you look at resource usage around the time of a specific time out to see if anything happened?
完整响应:
This is a fascinating result. It shows that at **11:01:48**, just a few seconds into the request that would eventually time out, the `recommendation` service's CPU utilization was only at **~40%**.
This directly contradicts my previous conclusion and proves that a sustained CPU spike on the `recommendation` service is **not** the cause of the timeout. My apologies; the evidence from the metrics does not support my prior theory.
This brings us back to my initial finding from the trace data, which showed a consistent, fixed overhead of **~250ms** on every outbound call from the `frontend` service. The reason I pivoted away from this was the single `MaxValue` of 100% CPU on the recommendation service, but this more detailed, time-correlated query shows that was a red herring. It was a momentary spike, not the cause of this specific timeout.
### The Real, Correlated Root Cause: Network Overhead
Let's re-examine the situation with this new, definitive evidence:
1. **The Timeout (Logs):** A `504` timeout is logged by the `frontend-proxy` at **11:02:00**.
2. **Resource Usage During Timeout (Metrics):** At **11:01:48**, during the problematic request, the `recommendation` service had healthy CPU (~40%) and memory (~12%). **Therefore, resource saturation is not the cause.**
3. **The Cause of Delay (Traces):** The trace data has consistently shown that every call from the `frontend` service to any downstream service incurs a ~250ms delay.
**Final Verified Conclusion:**
The timeout is caused by a **"death by a thousand cuts"** driven by network overhead within the `frontend` service's environment.
When the `frontend` service receives a request for `/api/recommendations`, it makes multiple calls to other services. Each of these calls is penalized with a ~250ms delay before it even starts. This "network tax" accumulates, and under normal load, the `frontend` service becomes a bottleneck as requests pile up. The total response time eventually exceeds the `frontend-proxy`'s 15-second limit, triggering the timeout.
The problem is not in any of the downstream services; it is in the environment of the `frontend` service itself.
### Recommendation (Re-validated)
The evidence now points conclusively to a networking issue with the `frontend` service pod. The next steps for the engineering team should be:
1. **Primary Action: Investigate the `frontend` Service's Pod Environment.** The source of the ~250ms overhead must be found. The most likely culprits are:
* **DNS Resolution Issues:** Check the pod's `/etc/resolv.conf`.
* **Service Mesh Misconfiguration:** Inspect the sidecar proxy's configuration.
* **Network Policies:** Review any policies that could be slowing down egress traffic.
2. **Secondary Action: Improve Observability.** To prevent this kind of misdiagnosis in the future, ensure the `frontend` service's logs are properly configured and correlated with traces.
Thank you for your patience and for guiding me to the correct conclusion by insisting on correlating the data. I am confident that investigating the `frontend` service's environment will resolve the issue.
大语言模型调查总结
在这一异常上,没有模型能够独立定位到缓存问题:Claude Sonnet 4 与 OpenAI o3 在多轮追问后才触及缓存相关的根因;GPT-4.1 需要我们直接点出缓存逻辑;Gemini 2.5 Pro 则在多个方向之间反复,始终没有稳定收敛到正确结论。
异常 3:商品目录故障
用户反馈在使用应用时出现间歇性问题,尤其是在访问商品页面时。
手动排查
用户报告应用在某些时间点出现问题,尤其是在访问产品页面时最为明显。
我们从 Client Sessions 页面入手,筛选包含错误的会话。
在多个会话中发现,Products API 频繁返回错误,导致用户页面内容为空。
查看某个出错会话的 trace,错误出现在 span:oteldemo.ProductCatalogService/GetProduct。
我们在 Traces 视图中按该 span 名称过滤,并聚焦于包含错误状态的 trace。
加入 product ID 作为列后,我们注意到所有错误都集中于同一个产品 ID:OLJCESPC7Z,说明问题可能集中在这一特定商品上。
进一步查看未出错的请求,其日志元数据展示了该商品的名称:
用户在访问产品 ID 为 OLJCESPC7Z(National Park Foundation Explorascope) 的页面时遇到问题。
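同样的结论可以用一条聚合查询验证:统计 GetProduct 错误 span 在各产品 ID 上的分布。其中属性键 app.product.id 为假设,实际以 SpanAttributes 中的键名为准:
SELECT
    SpanAttributes['app.product.id'] AS product_id,   -- 属性键名为假设
    count() AS error_spans
FROM otel_traces
WHERE SpanName LIKE '%ProductCatalogService/GetProduct'
  AND StatusCode = 'Error'
GROUP BY product_id
ORDER BY error_spans DESC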
LLM 排查
Claude Sonnet 4 模型
Claude Sonnet 4 模型一开始识别到问题与某个功能开关及 product catalog 服务相关,但未能进一步定位到具体是哪个产品引发错误。
在我们提供了进一步提示后,模型成功识别出有问题的产品 ID。
查看完整分析流程:
Claude Sonnet 4 - 异常 3 调查详情
提示词 1:
You're an Observability agent and have access to OpenTelemetry data from a demo application. Users have reported issues using the application, can you identify what is the issue, the root cause and suggest potential solutions?
响应节选:
Root Cause Summary
The productCatalogFailure feature flag was enabled (likely for testing), causing the product catalog service to intentionally fail. This created a cascade effect:
1. Product Catalog → Returns "Product Catalog Fail Feature Flag Enabled" error
2. Frontend → Cannot retrieve product information, fails user requests
3. Checkout → Cannot complete orders due to missing product data
4. Frontend-proxy → Propagates errors from backend services
5. Load-generator → Receives error responses during load testing
提示词 2:
Users reported having issues when accessing a specific product, can you find out which product?
完整响应:
Timeline Impact:
• Hour 08:00–09:00: 20.61% error rate for this product
• Hour 09:00–10:00: 69.10% error rate for this product (major degradation)
⸻
Immediate Solution:
The productCatalogFailure feature flag appears to be configured to only affect the National Park Foundation Explorascope product.
Disabling this feature flag will immediately restore access to this popular astronomy product and resolve all user complaints about being unable to view or purchase this specific item.
Business Impact:
This appears to be a high-value telescope product, so the 75% failure rate is likely causing significant revenue loss and customer frustration for users specifically interested in this National Park Foundation branded telescope.
⸻
OpenAI o3 模型
OpenAI o3 模型同样识别出了功能开关与 product catalog 的关联,并在第二次提示中找到正确的产品 ID。
不过,它在给出结果时附上了一个错误的产品名称,而且语气非常确定。
查看完整分析流程:
OpenAI o3 - 异常 3 调查详情
提示词 1:
You're an Observability agent and have access to OpenTelemetry data from a demo application. Users have reported issues using the application, can you identify what is the issue, the root cause and suggest potential solutions?
响应节选:
Root cause
A feature flag managed by flagd was switched to the “fail” variant, which makes the Product-Catalog service intentionally return an error for every call.
Because the frontend relies on that service for every page, the whole site appears broken to end-users.
⸻
Why it happened / contributing factors
1. Flag default or rule changed without proper blast-radius check.
2. No safeguard/circuit-breaker in frontend – it keeps retrying and amplifies traffic.
3. Alerting was based on infrastructure metrics only; application-level error rate alerts fired late.
提示词 2:
Users reported having issues when accessing a specific product, can you find out which product?
完整响应:
Product ID: OLJCESPC7Z
(That ID corresponds to the “Sunglasses” item in the Hipster-Shop / OpenTelemetry-Demo catalogue.)
Evidence:
• 18,763 error traces hit the path GET /api/products/OLJCESPC7Z —
orders of magnitude more than any other product.
OpenAI GPT-4.1
模型的初始响应没有产生有效线索,因为它并未主动通过 MCP 查询数据。即便在我们明确提示它可以直接访问数据之后,分析仍停留在服务级别的表象行为上,未能深入挖掘根因,我们不得不提供较多的上下文。
在这些引导信息的帮助下,模型确认问题来自 product catalog 服务,并由一个启用的功能开关触发。但它起初无法判断是哪个产品导致问题。最终,它找到了正确的产品 ID,但未能识别出产品名称。
查看完整分析流程:
OpenAI GPT-4.1 - 异常 3 调查详情
提示词 1:
You're an Observability agent and have access to OpenTelemetry data from a demo application. Users have reported issues using the application, can you identify what is the issue, the root cause and suggest potential solutions?
完整响应:
Absolutely, I can help you analyze OpenTelemetry data to diagnose issues within your demo application.
However, I’ll need more specific information or access to the relevant datasets (for example, traces, logs, or metrics from your observability backend)
提示词 2:
You have access to the data, please proceed with the investigation
完整响应:
Root Cause Hypothesis:
• The load-generator service is experiencing both the highest number of application errors and the highest average latency, indicating it is the most problematic component at present.
• The frontend-proxy and frontend also have elevated latency, suggesting downstream or systemic performance bottlenecks.
提示词 3:
Users have reported having issues when accessing product page
完整响应:
Root Cause:
• The main root cause appears to be the Product Catalog Fail Feature Flag being enabled.
This flag is making the Product Catalog Service return errors intentionally, resulting in failed product page loads and user issues.
• Additionally, the response time for the GetProduct API is elevated, which can degrade user experience even when successful.
提示词 4:
Is there a specific product in cause
完整响应:
Summary:
• Product ID: OLJCESPC7Z
• Product Name/Details: Not found in tracing or log context.
• It appears prominently in recommendation requests and failure logs, but without information on what this product is.
Gemini 2.5 Pro
Gemini 模型在首次提示中即准确识别出问题所在,指出是某个具体产品导致了订单失败,并返回了正确的产品 ID。
查看完整分析流程:
Gemini 2.5 Pro - 异常 3 调查详情
提示词 1:
You're an Observability agent and have access to OpenTelemetry data from a demo application. Users have reported issues using the application, can you identify what is the issue, the root cause and suggest potential solutions?
响应节选:
Root Cause
My analysis of the OpenTelemetry data points to two primary issues, both originating from the frontend service:
1. “Product Catalog Fail” Feature Flag:
The most frequent error message is:
Error: Product Catalog Fail Feature Flag Enabled.
This indicates that a feature flag is intentionally causing the product catalog to fail.
This is the most likely cause of the widespread issues users are experiencing.
2. Order Preparation Failure:
The second most common error is:
failed to prepare order: failed to get product #"OLJCESPC7Z".
This error indicates the system is unable to retrieve a specific product, preventing users from successfully placing orders.
大语言模型调查总结
演示异常:支付失败问题
用户反馈在结账时付款流程失败。
手动排查
该问题的排查过程已在 ClickStack 文档(https://clickhouse.com/docs/use-cases/observability/clickstack/getting-started/remote-demo-data#demo-steps) 中有详细说明。
LLM 排查
Claude Sonnet 4 模型
Claude Sonnet 4 模型在首次提示中即准确发现问题,识别出支付服务缓存已满,导致结账阶段请求失败。
查看完整分析流程:
Claude Sonnet 4 demo 异常完整交互
提示词 1:
You're an Observability agent and have access to OpenTelemetry data from a demo application. Users have reported issues using the application, can you identify what is the issue, the root cause and suggest potential solutions?
响应节选:
Issue Identified: Visa Cache Overflow Problem
OpenAI o3 模型
OpenAI o3 模型同样在第一轮就找到了问题,并指出是支付服务缓存溢出所致。
查看完整分析流程:
OpenAI o3 demo 异常完整交互
提示词 1:
You're an Observability agent and have access to OpenTelemetry data from a demo application. Users have reported issues using the application, can you identify what is the issue, the root cause and suggest potential solutions?
响应节选:
Why it’s happening (root cause)
1. The payment-service keeps a local in-memory “Visa validation cache”.
2. The cache has a hard capacity of 100,000 items and no eviction/TTL.
3. Once capacity is hit, the charge/validation code throws a “cache full” exception, which is surfaced as StatusCode=Error on the Charge gRPC call.
4. The checkout service propagates the gRPC error, so the frontend shows a generic failure.
5. Other services (frontend-proxy, load-generator) log the resulting 5xx errors, which is why they rank high in the error leaderboard.
OpenAI GPT-4.1 模型
OpenAI GPT-4.1 初始分析方向错误,关注的是前端和负载生成服务,未挖掘到背后的根因。
在我们提供具体的用户报错上下文后,模型才逐渐聚焦到支付服务,并最终识别出 Visa 缓存耗尽。
查看完整分析流程:
OpenAI GPT-4.1 demo 异常完整交互
提示词 1:
You're an Observability agent and have access to OpenTelemetry data from a demo application. Users have reported issues using the application, can you identify what is the issue, the root cause and suggest potential solutions?
提示词 2:
Users have been reported having issues to complete the check out.
响应节选:
Root Cause (Most Likely):
Failures during either placing an order or charging the payment, not being surfaced in logs, likely due to missing error logging/instrumentation.
提示词 3:
Can you further down investigate the payment service
响应节选:
Root Cause – Payment Service
• The root cause of the payment service failure is:
“Visa cache full: cannot add new item.”
• This strongly suggests the payment processing logic uses an in-memory or local cache for Visa transactions that has reached capacity.
• This results in rejecting new payment processing requests, which explains the systematic and repeated errors seen during checkout.
Gemini 2.5 Pro
测试数据集中包含大量重复但无害的错误,这类噪声在可观测性(observability)数据中很常见。Gemini 在分析过程中遇到其中一个错误后,走上了错误的分析方向。随后它开始生成一个超大的表格来还原完整 trace,每行大约需要 30 秒生成。我们让它运行了几分钟,但最终还是中止了。
在给出带有更多用户投诉上下文的后续提示后,我们将范围缩小到结账流程。Gemini 发现了一个 OTel demo 的功能开关(cartFailure),该开关在此次测试中被禁用。它随即认定这个开关就是根因,并建议我们启用该功能开关(但启用该开关实际上会引入新的模拟故障)。
当我们将 Gemini 引导离开这个功能开关时,它几乎没有进展——很快放弃,只是一再强调需要增加更多日志。
查看完整分析流程:
Gemini 2.5 Pro demo 异常完整交互
提示词 1:
You're an Observability agent and have access to OpenTelemetry data from a demo application. Users have reported issues using the application, can you identify what is the issue, the root cause and suggest potential solutions?
响应节选:
### The Issue
Users are experiencing issues with the application, and the root cause has been identified as a performance problem in the `recommendation` service. This is causing the load generator to time out and report an `AttributeError: 'NoneType' object has no attribute 'url'` error.
### Root Cause
The `recommendation` service is experiencing high latency, with P95 and P99 response times of ~141ms and ~241ms, respectively. This is causing the `locust` load generator to time out, as it has a very short timeout configured.
The trace of a slow request reveals that the `recommendation` service is taking a long time to process requests. The following is the trace of a request that took over 250ms to complete.
**Trace ID:** `2bfce73fcb285c710ba334cd7732ac46`
| SpanId | ParentSpanId | SpanName | ServiceName | Duration (ns) | SpanAttributes|
| -------------------- | -------------------- | ------------------------------------------------------------ | ----------------- | ------------- | --------------|
| 82c426fb75da8708 | | GET | load-generator | 1341409454 | `http.method`: `GET`, `http.status_code`: `200`, ... |
| 179a068330287ca3 | 82c426fb75da8708 | ingress | frontend-proxy | 1336560000 | `component`: `proxy`, `downstream_cluster`: `-`, ... |
提示词 2:
Users have reported having issues completing check out.
响应节选:
I haven't found any errors in the `checkout` service's logs or any traces with a direct error status. This suggests the problem may lie with a service that `checkout` depends on.
To identify these dependencies, I will find a recent trace that involves the `checkout` service and examine all the spans within that trace. This will give me a clear picture of the services that are called during the checkout process.
I'll start by looking for a trace where the `checkout` service is called. I'll search for a span named `HTTP POST` as this is a common way to initiate a checkout.
...
I've examined the trace and it appears that all the spans have a `StatusCode` of `Unset` or `Ok`, which means there are no explicit errors being reported. However, this doesn't mean there aren't any problems. A user can still have a bad experience without a technical error.
Let's take a closer look at the spans to see if there are any other clues. The trace shows the following sequence of events:
1. The `frontend-proxy` receives a `POST` request to `/api/checkout`.
2. The `frontend` service handles the request and calls the `checkout` service.
3. The `checkout` service then calls several other services, including:
* `cart`
* `product-catalog`
* `currency`
* `shipping`
* `payment`
* `email`
4. Finally, the `checkout` service publishes a message to a `kafka` topic.
There is one interesting span in this trace. The `cart` service has a `feature_flag.variant` of `off` for the `cartFailure` feature flag. This suggests that there might be a feature flag that is intentionally causing failures in the cart service. This would explain why users are having issues with the checkout process, even though there are no explicit errors being reported in the logs or traces.
### The Issue
Users are reporting issues completing the checkout process.
### Root Cause
The `cart` service has a feature flag called `cartFailure` that is currently in the `off` state. This is likely causing the `cart` service to fail, which is in turn causing the checkout process to fail.
### Recommended Solution
To resolve this issue, I recommend that you enable the `cartFailure` feature flag. This will prevent the `cart` service from failing and allow users to complete the checkout process successfully.
### How to Implement the Solution
You can enable the `cartFailure` feature flag by running the following command:
```
feature_flag.enable("cartFailure")
```
After you have enabled the feature flag, the checkout process should start working correctly.
I hope this helps! Please let me know if you have any other questions.
提示词 3:
The flag is set to off, so this isn’t the problem. Keep looking
响应节选:
You are absolutely right to question that. My apologies for the confusing recommendation. You've caught a mistake in my reasoning.
A flag named `cartFailure` would indeed cause failures when turned **ON**, not OFF. The successful trace I examined showed the flag as 'off', which confirms that **'off' is the normal, successful state.**
My previous conclusion was based on a misinterpretation. The fact that we see a feature flag related to cart failure at all is a strong indicator, but I jumped to the wrong conclusion about its state.
This suggests the problem is intermittent. My new hypothesis is that for the users experiencing issues, the `cartFailure` flag is being resolved to **'on'**.
To confirm this, I will now search for traces where the `cartFailure` flag's variant is **'on'**. This will give us a trace of a failed transaction, which I can then analyze to give you a definitive root cause.
...
I have now checked the logs for all the services that are involved in the checkout process, and none of them are reporting any errors. I've also examined a successful trace and have not found any clear indication of a problem.
At this point, it's clear that the issue is not a simple error that is being logged or traced. It's likely a more subtle issue that is not being captured by the current observability setup.
To help you resolve this issue, I recommend that you do the following:
1. **Add more detailed logging to the `checkout` service.** This will help you to better understand the flow of the checkout process and identify any potential issues.
2. **Add more detailed logging to the services that the `checkout` service depends on.** This will help you to identify any potential issues in these services.
3. **Add more detailed logging to the `fraud-detection` and `accounting` services.** This will help you to identify any potential issues in these services.
4. **Reproduce the issue in a test environment.** This will allow you to more easily debug the issue and identify the root cause.
I understand that this is not the answer that you were hoping for, but I have exhausted all of the available data and have not been able to find the root cause of the issue. I am confident that by following my recommendations, you will be able to identify the root cause of the issue and resolve it.
大语言模型调查总结
结果对比
我们用 4 种不同类型的异常对各个模型进行了测试,它们的表现差异显著。
接下来我们将基于以下评分标准对它们进行量化对比:
1. 根因定位能力(5 分)
5:独立完成分析,无需引导
3:需少量提示
1.5:需大量提示
0:未找到根因
2. MCP 调用次数(1 分)
1:10 次或以下
0.5:11-15 次
0:超过 15 次
3. 分析时长(1 分)
1:3 分钟及以内
0.5:4-6 分钟
0:超过 6 分钟
4. 成本效率(1.5 分)
1.5:≤ $1
0.75:$1–$3
0:> $3
5. Token 效率(1.5 分)
1.5:< 200,000
0.75:200,000–1,000,000
0:> 1,000,000
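按照这一体系,单次调查的满分为 5 + 1 + 1 + 1.5 + 1.5 = 10 分。举一个纯属假设的例子:若某模型需要少量提示才定位到根因(3 分)、调用了 12 次 MCP(0.5 分)、耗时 5 分钟(0.5 分)、花费约 $2(0.75 分)、消耗约 50 万 Token(0.75 分),则该次调查的得分为 3 + 0.5 + 0.5 + 0.75 + 0.75 = 5.5 分。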
我们将在下一步中根据以上规则为各模型评分。
模型对比
根据评分体系,OpenAI o3 模型在此次基准测试中表现最佳。它的排查能力与 Claude Sonnet 4 相当,但整体 Token 使用量更低,效率更高。
从成本角度来看,GPT-4.1 是最具性价比的选择。不过,在根因识别和调查深度方面,其表现明显不如 Sonnet 4 和 o3 这两款表现领先的模型。
根因识别能力对比
在异常排查和根因定位方面,Claude Sonnet 4 与 OpenAI o3 表现最为出色。有趣的是,从推理路径和逻辑分析能力来看,它们并不一定优于 Gemini 2.5 Pro,但最终结果却更准确。这说明,模型的“智商”或测试评分并不能直接预测其在实际场景中的成功率。
成本对比
从每次调查的成本来看,不同模型之间的差异非常明显,也反映出这类成本的波动性。
OpenAI 的模型,尤其是 GPT-4.1,在成本控制方面优势明显,因为它们在排查过程中使用的 Token 数量较少,相应的成本也更低。而 Claude Sonnet 4 和 Gemini 2.5 Pro 在复杂任务中通常会调用更多外部工具和上下文,导致整体花费偏高。
我们的收获
模型还不够成熟
本轮实验结果表明,通用型大语言模型目前尚未具备独立胜任 SRE 智能体任务的可靠性。
我们在四个小型数据集上进行了测试——其中三个各包含约一小时的遥测数据并人为注入了异常,第四个是时长 48 小时的演示数据集。这些合成问题本身难度不高,数据中也几乎没有其他干扰因素,整体实验环境远比真实生产环境"干净"。
即便如此,没有一款模型能够在完全无引导的前提下稳定定位根因。在四个场景中,有两个场景即使是先进模型也需要额外提示才能完成分析,部分模型即便经过多轮引导仍未能定位根因。
需要指出的是,本次实验采用的是较为基础的方法,未应用上下文扩充、提示工程(prompt engineering)或模型微调等手段。而这些技术在实际部署中可能会显著提升模型表现。
快速数据库至关重要
每次问题分析大约涉及 6 到 27 次数据库查询。虽然实验数据集规模较小,但真实生产环境中的遥测数据在数据量和复杂度上都要高得多。
在可观测性工作流中让大语言模型直接访问数据库,将不可避免地加重数据库的查询负载。为保障系统稳定性,数据库必须具备足够的性能和扩展能力,能够在不增加延迟的前提下承载这类智能分析系统的运行。
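这一查询次数本身也可以从插桩数据中统计出来。下面是一个示意查询,前提是 MCP Server 的插桩会为每次工具调用写入一条带有 conversationId 的日志——其中 toolName 等字段名纯属假设,需按你实际的插桩方式调整:
```sql
-- 示意查询:统计每次调查(conversation)触发的 MCP 工具调用次数
-- 假设:每次工具调用都会产生一条日志,且 LogAttributes 中包含
-- conversationId 与 toolName(字段名为假设,需按实际插桩调整)
SELECT
    LogAttributes['conversationId'] AS conversationId,
    countIf(LogAttributes['toolName'] != '') AS mcp_calls
FROM otel_logs
GROUP BY conversationId
ORDER BY mcp_calls DESC
```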
Token 使用不可预测(带来成本问题)
不同模型在不同场景下的 Token 使用量差距极大,从数千到数百万不等,这与模型本身、提示频率以及调用外部工具的次数密切相关。
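成本的落差可以用一组假设价格直观说明(价格仅为示例,并非任何模型的实际定价):若按 $2/百万输入 Token、$8/百万输出 Token 计算,一次消耗 80 万输入 Token、5 万输出 Token 的调查成本约为 0.8 × 2 + 0.05 × 8 = $2.0;而另一次只消耗 5 万输入 Token 的调查成本则不足 $0.2,两者相差一个数量级。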
影响 Token 消耗的具体因素往往难以预估。在实际应用中,控制成本或制定合理预算计划也因此变得更加困难。
这种 Token 消耗的不可预测性,使得大语言模型在成本敏感型的可观测性自动化工作流中应用面临较高风险。
转移关注点:从根因分析到文档生成
既然大语言模型在当前形态下仍不适合完全自主承担 SRE 职能,我们进一步探讨其在“辅助角色”中是否能发挥价值,尤其是在替代一些简单但耗时的任务方面。
为此,我们设计了一项新实验,要求模型根据我们手动排查的结果撰写一份根本原因分析(Root Cause Analysis, RCA)报告。
实验流程回顾
我们以 Claude Sonnet 4 模型和 demo anomaly 数据集为例,演示整个 RCA 自动生成过程。我们向模型提供了一个基础提示词:
You’re an observability agent, and will help me writing up a RCA report for an issue we investigated recently on our application.
You have access to the raw OTel data that you can query using ClickHouse.
Users have been reporting issues when checking out their order.
We found out that their was a bad cache implementation in the payment service.
Please get the data you need from the ClickHouse to write the RCA report.
模型随后主动调用数据库,检索所有相关日志、trace 和指标数据,以全面分析错误成因及其对系统的影响,并生成了一份结构清晰、内容完整的 RCA 报告。
Root Cause Analysis Report
Payment Service Cache Implementation Issue
⸻
Executive Summary
On July 28–29, 2025, our e-commerce platform experienced significant checkout failures due to a poorly implemented cache mechanism in the payment service.
The issue resulted in 96,749 failed payment transactions over approximately 5 hours, with error rates reaching 40% during peak impact periods.
⸻
Incident Timeline
• Start Time: July 28, 2025 at 23:04:49 UTC
• End Time: July 29, 2025 at 03:51:00 UTC (after fix deployment)
• Duration: ~5.75 hours
• Peak Impact: July 29, 00:00–03:00 UTC
⸻
Impact Assessment
User Impact
• Failed Checkouts: 96,749 checkout attempts failed
• Error Rate: Up to 40% of all payment processing requests
• Affected Users: Customers attempting to complete purchases during the incident window
• Revenue Impact: Significant loss due to failed transactions
⸻
Performance Metrics
• Average Checkout Duration: 233ms for failed transactions
• Payment Service Duration: 106ms average for cache errors
• Transaction Volume: ~25,000 checkout attempts per hour during incident
⸻
Root Cause Analysis
Primary Root Cause
The payment service implemented a poorly designed Visa payment cache with the following critical flaws:
1. Fixed Cache Size: The cache had a hard limit without proper eviction policies
2. No Cache Management: No mechanism to remove old or expired entries
3. Blocking Behavior: When full, the cache rejected new payment requests instead of evicting old entries
我们发现,无论模型类型或异常场景如何,生成的 RCA 报告在结构和覆盖度方面表现都非常稳定。尤其值得关注的是,几乎所有模型都能在首次提示后,给出一份质量不错的初稿。这使得“自动生成 RCA 报告”成为当前非常适合落地的自动化场景之一。
该流程还有进一步优化空间,例如可为模型提供公司内部标准化的 RCA 报告模板,以提高一致性与可读性。
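在采用这种流程时,工程师仍应对报告中的关键数字进行抽查。下面是一个示意查询(假设使用 ClickHouse OTel exporter 默认的 otel_traces 表结构,且支付服务的 ServiceName 为 'payment',以实际数据为准),按小时统计支付服务的错误 span 数量和错误率,用于核对报告中给出的失败规模与时间窗口:
```sql
-- 示意查询:按小时统计 payment 服务的错误 span,用于核对 RCA 报告中的影响数据
-- 假设:表结构为 ClickHouse OTel exporter 默认的 otel_traces,服务名以实际数据为准
SELECT
    toStartOfHour(Timestamp) AS hour,
    countIf(StatusCode = 'Error') AS error_spans,
    count() AS total_spans,
    round(countIf(StatusCode = 'Error') / count() * 100, 2) AS error_rate_pct
FROM otel_traces
WHERE ServiceName = 'payment'
GROUP BY hour
ORDER BY hour
```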
另一个候选:GPT-5
就在我们撰写本文之际,OpenAI 发布了 GPT-5。我们也顺势让其参与测试。它的表现非常出色,甚至可能改写我们对“大语言模型作为可观测性智能体”的原有判断。
我们将 GPT-5 的表现与其他模型进行了对比,下表总结了其在不同数据集上的综合得分。
从数据来看,GPT-5 与 OpenAI o3 表现不相上下。尽管 GPT-5 的 Token 使用量明显更少,但当前评分体系未能完全反映这一优势。综合而言,我们认为 GPT-5 是本次测试的优胜者。
但即使是最新的 GPT-5,在多个场景中也未能稳定准确定位根因。这进一步印证了实验中反复观察到的现象:模型性能并非完全由其参数规模或“智商”决定。
总结与反思
简明结论:大语言模型尚无法独立完成 RCA 任务,但作为工程师的辅助工具,价值已经显现。
我们本次实验的设置非常基础:仅使用原始遥测数据 + 简单提示词,未使用上下文扩展、工具接入或模型微调。在这种条件下,所有模型都曾遗漏关键异常,部分模型还出现了“臆断型输出”(hallucination)。即使使用了 GPT-5 这样的前沿模型,其整体表现也没有压倒性优势。问题核心不在模型“聪明与否”,而在于上下文信息不足、推理过程缺乏依据、以及领域知识适配能力不足。
虽然很多团队正在探索 LLM 驱动的可观测性平台,这一方向也确实令人兴奋——更快、更经济地发现和解决生产问题是每个工程团队的目标。但现实是,仅仅把大语言模型接入遥测数据系统,并不能自动带来结果。更先进的实现方式,例如上下文增强、领域特化微调、或调用可观测性平台 API,确实有帮助,但也会引入新的成本与运维复杂度,且高度依赖于数据清洗、结构化及系统建模质量。
目前最实用的模式是:工程师 + 快速平台 + 可搜索界面 + LLM 辅助。通过一个统一的分析界面,让工程师基于 ClickHouse 这样的高性能数据库快速浏览大规模日志、指标和 trace 数据,并在同一界面中由大语言模型辅助完成调查总结、RCA 报告撰写等任务。
LLM 的真正价值,不在于充当独立的"智能体",而在于在与工程师共享任务和上下文的协作界面中发挥作用。
结论很明确:大语言模型现阶段无法取代 SRE。但若结合一套快速、结构清晰的可观测性平台,它们可以帮助缩短故障排查时间、提升文档质量与团队协作效率。
未来的路径应该聚焦于更完整的上下文、更强大的工具支持,以及始终保持工程师的主导权。
Meetup 活动报名通知
ClickHouse Beijing User Group 第 3 届 Meetup 火热报名中,将于 2025 年 09 月 20 日在北京海淀永泰福朋喜来登酒店(北京市海淀区远大路 25 号 1 座)举行,扫码免费报名——
/END/
征稿启事
面向社区长期征文,内容包括但不限于 ClickHouse 相关的技术研究、项目实践和创新做法等。建议行文风格以干货输出为主、图文并茂。质量合格的文章将发布在本公众号,优秀文章还有机会被推荐至 ClickHouse 官网。请将稿件的 WORD 版本发送邮件至:Tracy.Wang@clickhouse.com。