When production incidents occur, speed matters. OpenHands can help you quickly investigate issues, analyze logs and errors, identify root causes, and generate fixes—reducing your mean time to resolution (MTTR).

Overview

Running a production service is hard. Errors and bugs crop up due to product updates, infrastructure changes, or unexpected user behavior. When these issues arise, it’s critical to identify and fix them quickly to minimize downtime and maintain user trust, but doing so is challenging, especially at scale. OpenHands can take on the initial investigation automatically, so engineers start from a detailed report of the issue, including root cause analysis and specific fix recommendations, which dramatically speeds up debugging.

OpenHands accelerates incident response by:
  • Automated error analysis: AI agents investigate errors and provide detailed reports
  • Root cause identification: Connect symptoms to underlying issues in your codebase
  • Fix recommendations: Generate specific, actionable recommendations for resolving issues
  • Integration with monitoring tools: Work directly with platforms like Datadog

Automated Datadog Error Analysis

The OpenHands Software Agent SDK provides powerful capabilities for building autonomous AI agents that can integrate with monitoring platforms like Datadog. A ready-to-use GitHub Actions workflow demonstrates how to automate error analysis.

How It Works

Datadog is a popular monitoring and analytics platform that provides comprehensive error tracking capabilities. It aggregates logs, metrics, and traces from your applications, making it easier to identify and investigate issues in production. Datadog’s Error Tracking groups similar errors together and provides detailed insights into their occurrences, stack traces, and affected services. OpenHands can automatically analyze these errors and provide detailed investigation reports.
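As a rough standalone illustration of the kind of data involved, the sketch below pulls recent error events from Datadog's Logs Search API using the requests library. The workflow itself works against Datadog's error tracking; the environment variable names, Datadog site, query, and time window here are assumptions you would adapt to your own account.

```python
# Sketch: pull recent error logs matching a pattern from Datadog's Logs Search API (v2).
# Assumes DD_API_KEY / DD_APP_KEY are set and the default datadoghq.com site is used.
import os
import requests

DD_SITE = "https://api.datadoghq.com"

def search_error_logs(query: str, lookback: str = "now-1h", limit: int = 25):
    resp = requests.post(
        f"{DD_SITE}/api/v2/logs/events/search",
        headers={
            "DD-API-KEY": os.environ["DD_API_KEY"],
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        },
        json={
            "filter": {"query": f"status:error {query}", "from": lookback, "to": "now"},
            "page": {"limit": limit},
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

if __name__ == "__main__":
    # Example: list recent occurrences of a JSON parsing error.
    for event in search_error_logs("JSONDecodeError"):
        attrs = event.get("attributes", {})
        print(attrs.get("timestamp"), str(attrs.get("message", ""))[:120])
```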

Triggering Automated Debugging

The GitHub Actions workflow can be triggered in two ways:
  1. Search Query: Provide a search query (e.g., “JSONDecodeError”) to find all recent errors matching that pattern. This is useful for investigating categories of errors.
  2. Specific Error ID: Provide a specific Datadog error tracking ID to deep-dive into a known issue. You can copy the error ID from Datadog’s error tracking UI using the “Actions” button (a minimal input-handling sketch follows this list).
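As a minimal sketch of how a script might handle these two trigger modes, the snippet below parses either option; the flag names and the commented-out helper functions are hypothetical, not the arguments of the published workflow script.

```python
# Sketch: accept either a search query or a specific error ID, mirroring the
# workflow's two trigger modes. Flag names and helpers are hypothetical.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="Investigate Datadog errors")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--query", help='search pattern, e.g. "JSONDecodeError"')
    group.add_argument("--error-id", help="a specific Datadog error tracking ID")
    args = parser.parse_args()

    if args.error_id:
        print(f"Deep-diving into error {args.error_id}")
        # fetch_error_details(args.error_id)   # hypothetical helper
    else:
        print(f"Searching recent errors matching {args.query!r}")
        # search_recent_errors(args.query)     # hypothetical helper

if __name__ == "__main__":
    main()
```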

Automated Investigation Process

When the workflow runs, it automatically performs the following steps:
  1. Fetch detailed error information from the Datadog API
  2. Create or find an existing GitHub issue to track the error
  3. Clone all relevant repositories to get full code context
  4. Run an OpenHands agent to analyze the error and investigate the code
  5. Post the findings as a comment on the GitHub issue
The agent identifies the exact file and line number where errors originate, determines root causes, and provides specific recommendations for fixes.
The workflow posts its findings to GitHub issues for human review before any code changes are made. If you want the agent to go further and create a fix, follow up through the OpenHands GitHub integration with a comment such as: “@openhands go ahead and create a pull request to fix this issue based on your analysis.”
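The last step, posting the report back to GitHub, is plain REST. Below is a minimal sketch using the GitHub Issues API; the repository name, issue number, and token handling are placeholders, not the workflow's actual code.

```python
# Sketch: post an investigation report as a comment on a GitHub issue.
# Assumes GITHUB_TOKEN has permission to write issue comments; the repo and
# issue number are placeholders.
import os
import requests

def post_findings(repo: str, issue_number: int, report: str) -> None:
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/issues/{issue_number}/comments",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"body": report},
        timeout=30,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    post_findings("my-org/my-service", 42, "## Investigation report\n\n...")
```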

Setting Up the Workflow

To set up automated Datadog debugging in your own repository:
  1. Copy the workflow file to .github/workflows/ in your repository
  2. Configure the required secrets (Datadog API keys, LLM API key)
  3. Customize the default queries and repository lists for your needs
  4. Run the workflow manually or set up scheduled runs
The workflow is fully customizable. You can modify the prompts to focus on specific types of analysis, adjust the agent’s tools to fit your workflow, or extend it to integrate with other services beyond GitHub and Datadog. Find the full implementation on GitHub, including the workflow YAML file, Python script, and prompt template.
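For example, swapping in your own prompt can be as simple as rendering a template with the error details before the agent runs. The sketch below uses Python's string.Template; the placeholder names are illustrative, not the fields of the published prompt template.

```python
# Sketch: render a custom investigation prompt from error details.
# Placeholder names are illustrative only.
from string import Template

PROMPT = Template("""\
Investigate this production error from Datadog:

Service: $service
Error:   $error_message

Focus the analysis on:
1. The exact file and line where the error originates
2. Whether a recent deployment introduced it
3. A minimally invasive fix suitable for a hotfix
""")

def build_prompt(service: str, error_message: str) -> str:
    return PROMPT.substitute(service=service, error_message=error_message)

if __name__ == "__main__":
    print(build_prompt("checkout-api", "JSONDecodeError: Expecting value: line 1 column 1"))
```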

Manual Incident Investigation

You can also use OpenHands directly to investigate incidents without the automated workflow.

Log Analysis

OpenHands can analyze logs to identify patterns and anomalies:
```
Analyze these application logs for the incident that occurred at 14:32 UTC:

1. Identify the first error or warning that appeared
2. Trace the sequence of events leading to the failure
3. Find any correlated errors across services
4. Identify the user or request that triggered the issue
5. Summarize the timeline of events
```
Log analysis capabilities:
| Log Type | Analysis Capabilities |
| --- | --- |
| Application logs | Error patterns, exception traces, timing anomalies |
| Access logs | Traffic patterns, slow requests, error responses |
| System logs | Resource exhaustion, process crashes, system errors |
| Database logs | Slow queries, deadlocks, connection issues |
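If the raw logs are too large to paste wholesale, a small preprocessing step can narrow them to the incident window first. The sketch below keeps only lines within a few minutes of the incident time and assumes each line starts with an ISO-8601 timestamp; adjust the parsing (and the example date) to your own log format.

```python
# Sketch: extract log lines around an incident window before handing them to
# the agent. Assumes each line starts with an ISO-8601 UTC timestamp.
from datetime import datetime, timedelta, timezone

def lines_around(path: str, incident: datetime, window_minutes: int = 5):
    start = incident - timedelta(minutes=window_minutes)
    end = incident + timedelta(minutes=window_minutes)
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            try:
                ts = datetime.fromisoformat(line[:19]).replace(tzinfo=timezone.utc)
            except ValueError:
                continue  # skip lines that do not start with a timestamp
            if start <= ts <= end:
                yield line.rstrip()

if __name__ == "__main__":
    # Example date is a placeholder; 14:32 UTC matches the prompt above.
    incident_time = datetime(2024, 6, 1, 14, 32, tzinfo=timezone.utc)
    for entry in lines_around("app.log", incident_time):
        print(entry)
```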

Stack Trace Analysis

Deep dive into stack traces:
```
Analyze this stack trace from our production error:

[paste full stack trace]

1. Identify the exception type and message
2. Trace back to our code (not framework code)
3. Identify the likely cause
4. Check if this code path has changed recently
5. Suggest a fix
```
Multi-language support:
```
Analyze this Java exception:

java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3210)
    at java.util.ArrayList.grow(ArrayList.java:265)
    at com.myapp.DataProcessor.loadAllRecords(DataProcessor.java:142)

Identify:
1. What operation is consuming memory?
2. Is there a memory leak or just too much data?
3. What's the fix?
```
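Step 2 in the first prompt, separating your own frames from framework frames, can also be pre-computed when traces are long. The sketch below filters Java-style frames down to those in your package; the com.myapp prefix is taken from the example above and is otherwise an assumption.

```python
# Sketch: keep only the stack frames that belong to our own code so the agent
# (or a human) can focus on them. Handles Java-style "at pkg.Class.method(File:line)" frames.
import re

FRAME = re.compile(r"^\s*at\s+(?P<frame>[\w.$]+)\((?P<location>[^)]*)\)")

def own_frames(trace: str, package_prefix: str = "com.myapp"):
    for line in trace.splitlines():
        match = FRAME.match(line)
        if match and match.group("frame").startswith(package_prefix):
            yield f'{match.group("frame")} ({match.group("location")})'

if __name__ == "__main__":
    trace = """\
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3210)
    at java.util.ArrayList.grow(ArrayList.java:265)
    at com.myapp.DataProcessor.loadAllRecords(DataProcessor.java:142)
"""
    for frame in own_frames(trace):
        print(frame)  # -> com.myapp.DataProcessor.loadAllRecords (DataProcessor.java:142)
```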

Root Cause Analysis

Identify the underlying cause of an incident:
```
Perform root cause analysis for this incident:

Symptoms:
- API response times increased 5x at 14:00
- Error rate jumped from 0.1% to 15%
- Database CPU spiked to 100%

Available data:
- Application metrics (Grafana dashboard attached)
- Recent deployments: v2.3.1 deployed at 13:45
- Database slow query log (attached)

Identify the root cause using the 5 Whys technique.
```

Common Incident Patterns

OpenHands can recognize and help diagnose these common patterns:
  • Connection pool exhaustion: Increasing connection errors followed by complete failure
  • Memory leaks: Gradual memory increase leading to OOM
  • Cascading failures: One service failure triggering others
  • Thundering herd: Simultaneous requests overwhelming a service
  • Split brain: Inconsistent state across distributed components

Quick Fix Generation

Once the root cause is identified, generate fixes:
```
We've identified the root cause: a missing null check in OrderProcessor.java line 156.

Generate a fix that:
1. Adds proper null checking
2. Logs when null is encountered
3. Returns an appropriate error response
4. Includes a unit test for the edge case
5. Is minimally invasive for a hotfix
```
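The prompt above targets Java, but the shape of the fix it asks for (guard, log, error response, and a test) is language-agnostic. Purely as an illustration of that pattern, and not the agent's actual output, a Python analogue might look like this:

```python
# Sketch: the general shape of a minimally invasive "missing null check" hotfix,
# shown in Python for illustration: guard + log + error response + unit tests.
import logging

logger = logging.getLogger(__name__)

def process_order(order):
    if order is None:
        # Guard the edge case and leave a trail for the post-mortem.
        logger.warning("process_order called with no order")
        return {"status": "error", "reason": "order_missing"}
    return {"status": "ok", "order_id": order["id"]}

def test_process_order_handles_missing_order():
    assert process_order(None) == {"status": "error", "reason": "order_missing"}

def test_process_order_happy_path():
    assert process_order({"id": 7})["status"] == "ok"
```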

Best Practices

Investigation Checklist

Use this checklist when investigating:
  1. Scope the impact
    • How many users affected?
    • What functionality is broken?
    • What’s the business impact?
  2. Establish timeline
    • When did it start?
    • What changed around that time?
    • Is it getting worse or stable?
  3. Gather data
    • Application logs
    • Infrastructure metrics
    • Recent deployments
    • Configuration changes
  4. Form hypotheses
    • List possible causes
    • Rank by likelihood
    • Test systematically
  5. Implement fix
    • Choose safest fix
    • Test before deploying
    • Monitor after deployment

Common Pitfalls

Avoid these common incident response mistakes:
  • Jumping to conclusions: Gather data before assuming the cause
  • Changing multiple things: Make one change at a time to isolate effects
  • Not documenting: Record all actions for the post-mortem
  • Ignoring rollback: Always have a rollback plan before deploying fixes
For production incidents, always follow your organization’s incident response procedures. OpenHands is a tool to assist your investigation, not a replacement for proper incident management.