When production incidents occur, speed matters. OpenHands can help you quickly investigate issues, analyze logs and errors, identify root causes, and generate fixes—reducing your mean time to resolution (MTTR).

Overview

Running a production service is hard. Errors and bugs crop up due to product updates, infrastructure changes, or unexpected user behavior. When these issues arise, it’s critical to identify and fix them quickly to minimize downtime and maintain user trust, but doing so is challenging, especially at scale. OpenHands can take on the initial investigation automatically, so engineers start from a detailed report of the issue, including root cause analysis and specific fix recommendations, which dramatically speeds up debugging.

OpenHands accelerates incident response by:
  • Automated error analysis: AI agents investigate errors and provide detailed reports
  • Root cause identification: Connect symptoms to underlying issues in your codebase
  • Fix recommendations: Generate specific, actionable recommendations for resolving issues
  • Integration with monitoring tools: Work directly with platforms like Datadog

Automated Datadog Error Analysis

The OpenHands Software Agent SDK provides powerful capabilities for building autonomous AI agents that can integrate with monitoring platforms like Datadog. A ready-to-use GitHub Actions workflow demonstrates how to automate error analysis.

How It Works

Datadog is a popular monitoring and analytics platform that provides comprehensive error tracking capabilities. It aggregates logs, metrics, and traces from your applications, making it easier to identify and investigate issues in production. Datadog’s Error Tracking groups similar errors together and provides detailed insights into their occurrences, stack traces, and affected services. OpenHands can automatically analyze these errors and provide detailed investigation reports.
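As a rough standalone illustration of the kind of data involved, the sketch below pulls recent error events from Datadog's Logs Search API using the requests library. The workflow itself works against Datadog's error tracking; the environment variable names, Datadog site, query, and time window here are assumptions you would adapt to your own account.

```python
# Sketch: pull recent error logs matching a pattern from Datadog's Logs Search API (v2).
# Assumes DD_API_KEY / DD_APP_KEY are set and the default datadoghq.com site is used.
import os
import requests

DD_SITE = "https://api.datadoghq.com"

def search_error_logs(query: str, lookback: str = "now-1h", limit: int = 25):
    resp = requests.post(
        f"{DD_SITE}/api/v2/logs/events/search",
        headers={
            "DD-API-KEY": os.environ["DD_API_KEY"],
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        },
        json={
            "filter": {"query": f"status:error {query}", "from": lookback, "to": "now"},
            "page": {"limit": limit},
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

if __name__ == "__main__":
    # Example: list recent occurrences of a JSON parsing error.
    for event in search_error_logs("JSONDecodeError"):
        attrs = event.get("attributes", {})
        print(attrs.get("timestamp"), str(attrs.get("message", ""))[:120])
```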

Triggering Automated Debugging

The GitHub Actions workflow can be triggered in two ways:
  1. Search Query: Provide a search query (e.g., “JSONDecodeError”) to find all recent errors matching that pattern. This is useful for investigating categories of errors.
  2. Specific Error ID: Provide a specific Datadog error tracking ID to deep-dive into a known issue. You can copy the error ID from Datadog’s error tracking UI using the “Actions” button (a minimal input-handling sketch follows this list).
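As a minimal sketch of how a script might handle these two trigger modes, the snippet below parses either option; the flag names and the commented-out helper functions are hypothetical, not the arguments of the published workflow script.

```python
# Sketch: accept either a search query or a specific error ID, mirroring the
# workflow's two trigger modes. Flag names and helpers are hypothetical.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="Investigate Datadog errors")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--query", help='search pattern, e.g. "JSONDecodeError"')
    group.add_argument("--error-id", help="a specific Datadog error tracking ID")
    args = parser.parse_args()

    if args.error_id:
        print(f"Deep-diving into error {args.error_id}")
        # fetch_error_details(args.error_id)   # hypothetical helper
    else:
        print(f"Searching recent errors matching {args.query!r}")
        # search_recent_errors(args.query)     # hypothetical helper

if __name__ == "__main__":
    main()
```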

Automated Investigation Process

When the workflow runs, it automatically performs the following steps:
  1. Fetch detailed error information from the Datadog API
  2. Create or find an existing GitHub issue to track the error
  3. Clone all relevant repositories to get full code context
  4. Run an OpenHands agent to analyze the error and investigate the code
  5. Post the findings as a comment on the GitHub issue
The agent identifies the exact file and line number where errors originate, determines root causes, and provides specific recommendations for fixes.
The workflow posts its findings to GitHub issues for human review before any code changes are made. If you want the agent to go further and create a fix, follow up through the OpenHands GitHub integration with a comment such as: “@openhands go ahead and create a pull request to fix this issue based on your analysis.”
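The last step, posting the report back to GitHub, is plain REST. Below is a minimal sketch using the GitHub Issues API; the repository name, issue number, and token handling are placeholders, not the workflow's actual code.

```python
# Sketch: post an investigation report as a comment on a GitHub issue.
# Assumes GITHUB_TOKEN has permission to write issue comments; the repo and
# issue number are placeholders.
import os
import requests

def post_findings(repo: str, issue_number: int, report: str) -> None:
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/issues/{issue_number}/comments",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"body": report},
        timeout=30,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    post_findings("my-org/my-service", 42, "## Investigation report\n\n...")
```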

Setting Up the Workflow

To set up automated Datadog debugging in your own repository:
  1. Copy the workflow file to .github/workflows/ in your repository
  2. Configure the required secrets (Datadog API keys, LLM API key)
  3. Customize the default queries and repository lists for your needs
  4. Run the workflow manually or set up scheduled runs
The workflow is fully customizable. You can modify the prompts to focus on specific types of analysis, adjust the agent’s tools to fit your workflow, or extend it to integrate with other services beyond GitHub and Datadog. Find the full implementation on GitHub, including the workflow YAML file, Python script, and prompt template.
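For example, swapping in your own prompt can be as simple as rendering a template with the error details before the agent runs. The sketch below uses Python's string.Template; the placeholder names are illustrative, not the fields of the published prompt template.

```python
# Sketch: render a custom investigation prompt from error details.
# Placeholder names are illustrative only.
from string import Template

PROMPT = Template("""\
Investigate this production error from Datadog:

Service: $service
Error:   $error_message

Focus the analysis on:
1. The exact file and line where the error originates
2. Whether a recent deployment introduced it
3. A minimally invasive fix suitable for a hotfix
""")

def build_prompt(service: str, error_message: str) -> str:
    return PROMPT.substitute(service=service, error_message=error_message)

if __name__ == "__main__":
    print(build_prompt("checkout-api", "JSONDecodeError: Expecting value: line 1 column 1"))
```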

Manual Incident Investigation

You can also use OpenHands directly to investigate incidents without the automated workflow.

Log Analysis

OpenHands can analyze logs to identify patterns and anomalies:
```
Analyze these application logs for the incident that occurred at 14:32 UTC:

1. Identify the first error or warning that appeared
2. Trace the sequence of events leading to the failure
3. Find any correlated errors across services
4. Identify the user or request that triggered the issue
5. Summarize the timeline of events
```
Log analysis capabilities:
| Log Type | Analysis Capabilities |
| --- | --- |
| Application logs | Error patterns, exception traces, timing anomalies |
| Access logs | Traffic patterns, slow requests, error responses |
| System logs | Resource exhaustion, process crashes, system errors |
| Database logs | Slow queries, deadlocks, connection issues |
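If the raw logs are too large to paste wholesale, a small preprocessing step can narrow them to the incident window first. The sketch below keeps only lines within a few minutes of the incident time and assumes each line starts with an ISO-8601 timestamp; adjust the parsing (and the example date) to your own log format.

```python
# Sketch: extract log lines around an incident window before handing them to
# the agent. Assumes each line starts with an ISO-8601 UTC timestamp.
from datetime import datetime, timedelta, timezone

def lines_around(path: str, incident: datetime, window_minutes: int = 5):
    start = incident - timedelta(minutes=window_minutes)
    end = incident + timedelta(minutes=window_minutes)
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            try:
                ts = datetime.fromisoformat(line[:19]).replace(tzinfo=timezone.utc)
            except ValueError:
                continue  # skip lines that do not start with a timestamp
            if start <= ts <= end:
                yield line.rstrip()

if __name__ == "__main__":
    # Example date is a placeholder; 14:32 UTC matches the prompt above.
    incident_time = datetime(2024, 6, 1, 14, 32, tzinfo=timezone.utc)
    for entry in lines_around("app.log", incident_time):
        print(entry)
```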

Stack Trace Analysis

Deep dive into stack traces:
```
Analyze this stack trace from our production error:

[paste full stack trace]

1. Identify the exception type and message
2. Trace back to our code (not framework code)
3. Identify the likely cause
4. Check if this code path has changed recently
5. Suggest a fix
```
Multi-language support:
```
Analyze this Java exception:

java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3210)
    at java.util.ArrayList.grow(ArrayList.java:265)
    at com.myapp.DataProcessor.loadAllRecords(DataProcessor.java:142)

Identify:
1. What operation is consuming memory?
2. Is there a memory leak or just too much data?
3. What's the fix?
```
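Step 2 in the first prompt, separating your own frames from framework frames, can also be pre-computed when traces are long. The sketch below filters Java-style frames down to those in your package; the com.myapp prefix is taken from the example above and is otherwise an assumption.

```python
# Sketch: keep only the stack frames that belong to our own code so the agent
# (or a human) can focus on them. Handles Java-style "at pkg.Class.method(File:line)" frames.
import re

FRAME = re.compile(r"^\s*at\s+(?P<frame>[\w.$]+)\((?P<location>[^)]*)\)")

def own_frames(trace: str, package_prefix: str = "com.myapp"):
    for line in trace.splitlines():
        match = FRAME.match(line)
        if match and match.group("frame").startswith(package_prefix):
            yield f'{match.group("frame")} ({match.group("location")})'

if __name__ == "__main__":
    trace = """\
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3210)
    at java.util.ArrayList.grow(ArrayList.java:265)
    at com.myapp.DataProcessor.loadAllRecords(DataProcessor.java:142)
"""
    for frame in own_frames(trace):
        print(frame)  # -> com.myapp.DataProcessor.loadAllRecords (DataProcessor.java:142)
```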

Root Cause Analysis

Identify the underlying cause of an incident:
```
Perform root cause analysis for this incident:

Symptoms:
- API response times increased 5x at 14:00
- Error rate jumped from 0.1% to 15%
- Database CPU spiked to 100%

Available data:
- Application metrics (Grafana dashboard attached)
- Recent deployments: v2.3.1 deployed at 13:45
- Database slow query log (attached)

Identify the root cause using the 5 Whys technique.
```

Common Incident Patterns

OpenHands can recognize and help diagnose these common patterns:
  • Connection pool exhaustion: Increasing connection errors followed by complete failure
  • Memory leaks: Gradual memory increase leading to OOM
  • Cascading failures: One service failure triggering others
  • Thundering herd: Simultaneous requests overwhelming a service
  • Split brain: Inconsistent state across distributed components

Quick Fix Generation

Once the root cause is identified, generate fixes:
```
We've identified the root cause: a missing null check in OrderProcessor.java line 156.

Generate a fix that:
1. Adds proper null checking
2. Logs when null is encountered
3. Returns an appropriate error response
4. Includes a unit test for the edge case
5. Is minimally invasive for a hotfix
```
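The prompt above targets Java, but the shape of the fix it asks for (guard, log, error response, and a test) is language-agnostic. Purely as an illustration of that pattern, and not the agent's actual output, a Python analogue might look like this:

```python
# Sketch: the general shape of a minimally invasive "missing null check" hotfix,
# shown in Python for illustration: guard + log + error response + unit tests.
import logging

logger = logging.getLogger(__name__)

def process_order(order):
    if order is None:
        # Guard the edge case and leave a trail for the post-mortem.
        logger.warning("process_order called with no order")
        return {"status": "error", "reason": "order_missing"}
    return {"status": "ok", "order_id": order["id"]}

def test_process_order_handles_missing_order():
    assert process_order(None) == {"status": "error", "reason": "order_missing"}

def test_process_order_happy_path():
    assert process_order({"id": 7})["status"] == "ok"
```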

Best Practices

Investigation Checklist

Use this checklist when investigating:
  1. Scope the impact
    • How many users affected?
    • What functionality is broken?
    • What’s the business impact?
  2. Establish timeline
    • When did it start?
    • What changed around that time?
    • Is it getting worse or stable?
  3. Gather data
    • Application logs
    • Infrastructure metrics
    • Recent deployments
    • Configuration changes
  4. Form hypotheses
    • List possible causes
    • Rank by likelihood
    • Test systematically
  5. Implement fix
    • Choose safest fix
    • Test before deploying
    • Monitor after deployment

Common Pitfalls

Avoid these common incident response mistakes:
  • Jumping to conclusions: Gather data before assuming the cause
  • Changing multiple things: Make one change at a time to isolate effects
  • Not documenting: Record all actions for the post-mortem
  • Ignoring rollback: Always have a rollback plan before deploying fixes
For production incidents, always follow your organization’s incident response procedures. OpenHands is a tool to assist your investigation, not a replacement for proper incident management.