Awesome AI SRE
Applying artificial intelligence to site reliability engineering — autonomous incident response, intelligent observability, and self-healing infrastructure.
Contents
- AI SRE Agents
- AI Production Debugging
- Incident Management
- Observability Platforms
- AIOps Platforms
- Log Analysis and Anomaly Detection
- Chaos Engineering
- Runbook Automation
- Cloud Cost Optimization
- LLM-Powered DevOps Tools
- Agent Benchmarks
- Research Papers
- Blogs and Newsletters
- Community Lists
AI SRE Agents
Autonomous AI agents purpose-built for SRE workflows — investigating alerts, performing root cause analysis, and resolving incidents with minimal human intervention.
- Resolve AI - Autonomous SRE platform by OpenTelemetry co-creators that targets 80% autonomous resolution rate with parallel hypothesis investigation.
- Cleric - Autonomous AI SRE teammate that investigates alerts 24/7 and delivers root cause analysis in Slack.
- NeuBird - Agentic AI SRE co-pilot for enterprise IT with LLM-powered telemetry analysis and 230K+ alerts resolved.
- Phoebe AI - Predicts incidents from leading indicators and generates pre-emptive fixes using multi-agent AI swarms.
- Ciroos AI - Multi-agentic AI SRE teammate built on MCP and A2A architectures for extensible cross-tool orchestration.
- Dash0 - AI-native observability with specialized agents for on-call triage, PromQL queries, and dashboard automation.
- Datadog Bits AI - Autonomous AI on-call agent embedded in Datadog that analyzes runbooks and telemetry before responders log in.
- Harness AI SRE - Human-aware change agent with AI Scribe that captures Slack, Teams, and Zoom signals and correlates them with system changes.
- Azure SRE Agent - AI agent for monitoring, diagnosing, and resolving issues in Azure-hosted applications with no-code sub-agent builder.
- Causely - Causal AI engine that determines the single root cause from alert storms using causal reasoning rather than correlation.
- DrDroid - AI SRE agent with knowledge graph for investigation recommendations, PlayBooks automation, and AlertOps Slack bot.
- TierZero AI - Autonomous agent that investigates, triages, and resolves infrastructure issues.
- Kubiya - Agentic engineering platform with natural language Slack and Teams commands, Terraform and CI/CD automation, and role-based access control.
- SRE.ai - Natural language AI agents for complex enterprise DevOps workflows including CI/CD and testing.
- Sherlocks.ai - AI-native SRE assistant that automates incident response, root cause analysis, and outage prevention with institutional memory.
- Parity - AI agent for cloud infrastructure reliability and Kubernetes operations.
- Beeps - On-call platform that helps developers and agents resolve downtime faster.
- Kura - AI DevOps copilot for AWS cloud infrastructure management and incident response.
- Wild Moose - AI first responder for production incidents that investigates and surfaces root cause in under one minute.
- Nudge Bee - Enterprise AI-agentic workflow platform for SRE and CloudOps with pre-built AI assistants and customizable workflows.
- Agent SRE - AI agent for autonomous site reliability engineering.
- Anyshift - AI SRE agent that investigates production incidents by tracing changes across a versioned infrastructure graph to identify root causes.
- Guardian by Metoro - AI SRE agent for Kubernetes that detects issues, finds the root cause, and opens fix PRs automatically.
- Hyground - Sovereign AI SRE agent that operates complex software across the entire stack, automatically finds root causes, and cuts DevOps toil.
AI Production Debugging
AI-powered tools for debugging production applications in real time: adding observability without redeployments and autonomously remediating code issues.
- Lightrun - AI SRE platform for autonomous code remediation that lets you add logs, snapshots, and metrics to production without restarts.
- Sentry Seer - AI debugging agent built on production telemetry that identifies actionable issues, performs root cause analysis, and generates code fixes.
Incident Management
AI-enhanced platforms for managing the full incident lifecycle — detection, triage, response, communication, and post-mortems.
- PagerDuty AIOps - Enterprise incident management with ML-based noise reduction, AI Agent Suite with SRE Agent and Copilot, and MCP server integration.
- incident.io - Slack-native incident management with AI SRE, AI alert triage, AI postmortems, Scribe call transcription, and Claude and Cursor integration.
- Rootly - AI-native incident management with LLM-powered investigation across the observability stack.
- FireHydrant - AI-powered incident summaries, Zoom-aware context enrichment, and AI-drafted retrospectives. Being acquired by Freshworks.
- Squadcast - Incident management with AI-driven alert clustering and automatic grouping of related incidents. Acquired by SolarWinds.
- Zenduty - On-call and incident management with AI Summarizer, AI Postmortem, and AI Scheduling. Acquired by Xurrent, rebranding to Xurrent IMR.
- BetterStack - Developer-friendly uptime monitoring and incident management with integrated observability.
Observability Platforms
Full-stack observability with AI capabilities — anomaly detection, natural language querying, and intelligent alerting across metrics, logs, and traces.
- Datadog - Unified SaaS observability with Watchdog AI auto-detection, predictive metrics monitoring, and LLM observability across 600+ integrations.
- Dynatrace - Full-stack observability with Davis AI engine for continuous dependency analysis, anomaly detection, and Davis CoPilot for natural language remediation.
- New Relic - Full-stack observability with NRAI assistant for natural language queries and AI-powered anomaly detection.
- Grafana - Open source observability with Grafana Assistant for natural language queries, autonomous incident investigation, and ML-based anomaly detection.
- Splunk - Enterprise observability with AI-driven anomaly detection at scale and ITSI with ML-based predictive analytics. Part of Cisco.
- Elastic AI Assistant - AI assistant in Kibana for natural language log, metrics, and trace querying with contextual alert triage and RAG-powered knowledge base.
- Honeycomb - Observability for distributed services with Query Assistant, Honeycomb Intelligence, AI-guided Canvas workspace, and hosted MCP server.
- Coroot - Open source observability with AI-powered root cause analysis and eBPF-based auto-instrumentation.
- Last9 - Unified observability with Agentic SRE SDK, AI copilot integration with Claude, Cursor, and Slack, and managed TSDB.
- SigNoz - Open source OpenTelemetry-native observability platform for logs, metrics, and traces with unified correlation analysis.
- Metoro - Kubernetes-native observability platform with built-in eBPF telemetry, AI investigation, deployment verification, and root cause analysis.
AIOps Platforms
Platforms that apply ML and AI to IT operations — correlating events, reducing alert noise, and automating operational workflows at scale.
- BigPanda - AIOps for high-alert-volume environments with event correlation reducing alert volume by 95%+ and AI Incident Assistant.
- Moogsoft - AIOps with event deduplication, contextual enrichment, intelligent correlation, and automated root cause analysis. Part of Dell Technologies.
- LogicMonitor - Cloud-based infrastructure monitoring with Edwin AI agent for plain-language summaries, predictive analytics, and capacity forecasting.
- Selector AI - AI-powered network observability with Network Large Language Model, 90% alert noise reduction, and digital twin modeling.
- Keep - Open source AIOps and alert management with correlation across monitoring tools and 50+ integrations.
Log Analysis and Anomaly Detection
Specialized tools for AI-driven log analytics, pattern recognition, and automated anomaly detection.
- Sumo Logic - Cloud-native log analytics with real-time AI-powered anomaly detection and ML-based pattern recognition.
- Graylog - Open source log management for centralized collection, indexing, and analysis with anomaly alerting.
- Logz.io - Cloud observability built on ELK and OpenSearch with AI-powered log analysis, ML-based anomaly detection, and MCP server.
- OpenObserve - Open source high-performance log, metrics, and trace platform with real-time analytics.
- LogAI - Open source library by Salesforce for log clustering, anomaly detection, and summarization with modular ML pipelines.
Chaos Engineering
Tools for proactively testing system resilience — now enhanced with AI for intelligent experiment design, blast radius control, and automated analysis.
- ChaosEater - Research tool using LLMs to fully automate the chaos engineering cycle from requirement identification through experiment design, execution, and analysis.
- Harness Chaos Engineering - Enterprise chaos engineering with LLM-derived test recommendations, intelligent blast radius downscaling, and MCP tool integration.
- Gremlin - Pioneering commercial chaos engineering tool with attack templates, infrastructure metrics monitoring, and multi-cloud support.
- Steadybit - Chaos engineering with open source extension framework, resilience policies, and experiment automation.
- LitmusChaos - CNCF open source chaos engineering for Kubernetes with ChaosHub for shared experiments.
- Chaos Mesh - CNCF open source Kubernetes-native chaos engineering with comprehensive fault injection for pods, network, IO, time, and kernel.
- AWS Fault Injection Service - AWS-native chaos engineering with integrated experiment templates and safety controls.
Runbook Automation
AI-powered tools for automating operational runbooks — converting manual procedures into self-executing workflows with intelligent decision-making.
- Rundeck - Open source and commercial runbook automation with self-service GUI, job scheduling, RBAC, and 1000+ integration plugins. Part of PagerDuty.
- StackStorm - Open source event-driven automation with rules engine, 6000+ actions, and ChatOps. Used by Netflix for self-healing infrastructure.
- Ansible Lightspeed - AI-powered Ansible playbook generation via IBM watsonx with natural language to Ansible code and MCP support.
- RunWhen - Platform for SRE agent orchestration and automated troubleshooting workflows.
Cloud Cost Optimization
AI-driven platforms for optimizing cloud spend — autonomous rightsizing, commitment management, and workload-aware cost allocation.
- CAST AI - Kubernetes cost optimization with real-time pod rightsizing, autoscaling optimization, predictive capacity forecasting, and advanced bin-packing.
- Sedai - Autonomous cloud optimization using patented reinforcement learning for rightsizing, workload-aware capacity scaling, and 30-50% cost savings.
- ProsperOps - Autonomous commitment optimization managing $6B+ annual cloud usage. Acquired by Flexera.
- Kubecost - Open source Kubernetes cost monitoring with real-time cost allocation and automated rightsizing recommendations.
- Vantage - Multi-cloud cost management with FinOps Agent for AI-driven savings identification and open source MCP server.
- nOps - AWS-focused FinOps with AI agent trained on customer data for automated commitment optimization.
- Finout - Enterprise FinOps with MegaBill for multi-provider cost consolidation and AI-powered cost attribution.
- Spot.io - Cloud infrastructure automation with spot instance optimization and commitment management. Acquired by Flexera from NetApp.
- CloudPilot AI - Kubernetes-native capacity management with predictive scaling that anticipates usage spikes.
LLM-Powered DevOps Tools
Tools leveraging large language models for natural language interaction with infrastructure, code generation for operations, and AI-assisted DevOps workflows.
- K8sGPT - CNCF project for AI-powered Kubernetes diagnostics with SRE experience codified into analyzers and multiple LLM backends.
- HolmesGPT - CNCF Sandbox project providing a 24/7 on-call AI agent with agentic loop querying live observability data from Prometheus, Grafana, Datadog, and Kubernetes.
- Kube-Copilot - Open source natural language to Kubernetes operations with manifest generation and security scanning.
- Lens Prism - AI copilot in Lens Desktop for context-aware natural language interaction with live Kubernetes clusters.
- GitHub Copilot Agent Mode - AI coding assistant with DevOps agent capabilities for infrastructure validation, incident response, and pipeline automation.
- GitLab Duo - AI throughout the DevSecOps lifecycle with failed job trace analysis, root cause identification, and Security Analyst Agent.
- Grafana Assistant - AI assistant for natural language dashboard creation, autonomous incident investigation, and query generation.
Agent Benchmarks
Frameworks and benchmarks for evaluating AI SRE agent performance.
- SRE Bench - Benchmark for evaluating AI SRE agents on realistic operational tasks.
Research Papers
Key academic and industry research on applying AI and ML to site reliability engineering and IT operations.
- STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds — NeurIPS 2025.
- ChaosEater: Fully Automating Chaos Engineering with Large Language Models — ASE 2025.
- AIOps Solutions for Incident Management — Comprehensive literature review, 2024.
- A Survey of AIOps in the Era of Large Language Models — ACM Computing Surveys 2025.
- Automatic Root Cause Analysis via Large Language Models for Cloud Incidents — EuroSys 2024.
- FaultProfIT: Hierarchical Fault Profiling of Incident Tickets in Large-scale Cloud Systems — ICSE-SEIP 2024.
Blogs and Newsletters
- SRE Weekly - Curated newsletter on scalability, availability, incident response, and automation.
- Last Week in AWS - Weekly AWS news and commentary by Corey Quinn.
- Google Cloud DevOps and SRE Blog - Practices and tools for DevOps and SRE at scale.
- The New Stack - Cloud-native technology coverage with extensive AI and SRE content.
- incident.io Blog - Practical SRE and AI tools content.
- Doctor Droid Notes - AI-focused SRE engineering blog.
- NeuBird Blog - GenAI SRE predictions and industry analysis.
- Metoro Blog - Observability, AI SRE, and Kubernetes content.
- Hyground Blog - AI SRE, observability, GenAI, and security content.
Community Lists
Other curated collections in the AI and operations space.
- awesome-AIOps - Academic research and industrial materials on AIOps.
- awesome-LLM-AIOps - LLM-specific AIOps research and papers.
- awesome-chaos-engineering - Comprehensive chaos engineering resources.
Contributing
Contributions welcome! Read the contribution guidelines first.