About awesome-ai-sre

A curated list of 100+ AI-powered tools, platforms, and resources for Site Reliability Engineering (SRE) — agents, incident management, observability, AIOps, chaos engineering, and more.

Published by

agamm

Awesome AI SRE

Applying artificial intelligence to site reliability engineering — autonomous incident response, intelligent observability, and self-healing infrastructure.

AI SRE Agents

Autonomous AI agents purpose-built for SRE workflows — investigating alerts, performing root cause analysis, and resolving incidents with minimal human intervention.

Resolve AI - Autonomous SRE platform by OpenTelemetry co-creators that targets an 80% autonomous resolution rate with parallel hypothesis investigation.
Cleric - Autonomous AI SRE teammate that investigates alerts 24/7 and delivers root cause analysis in Slack.
NeuBird - Agentic AI SRE co-pilot for enterprise IT with LLM-powered telemetry analysis and 230K+ alerts resolved.
Phoebe AI - Predicts incidents from leading indicators and generates pre-emptive fixes using multi-agent AI swarms.
Ciroos AI - Multi-agentic AI SRE teammate built on MCP and A2A architectures for extensible cross-tool orchestration.
Dash0 - AI-native observability with specialized agents for on-call triage, PromQL queries, and dashboard automation.
Datadog Bits AI - Autonomous AI on-call agent embedded in Datadog that analyzes runbooks and telemetry before responders log in.
Harness AI SRE - Human-aware change agent with AI Scribe that captures Slack, Teams, and Zoom signals and correlates them with system changes.
Azure SRE Agent - AI agent for monitoring, diagnosing, and resolving issues in Azure-hosted applications with no-code sub-agent builder.
Causely - Causal AI engine that determines the single root cause from alert storms using causal reasoning rather than correlation.
DrDroid - AI SRE agent with knowledge graph for investigation recommendations, PlayBooks automation, and AlertOps Slack bot.
TierZero AI - Autonomous infrastructure issue management that auto-investigates, triages, and resolves infrastructure issues.
Kubiya - Agentic engineering platform with natural language Slack and Teams commands, Terraform and CI/CD automation, and role-based access control.
SRE.ai - Natural language AI agents for complex enterprise DevOps workflows including CI/CD and testing.
Sherlocks.ai - AI-native SRE assistant that automates incident response, root cause analysis, and outage prevention with institutional memory.
Parity - AI agent for cloud infrastructure reliability and Kubernetes operations.
Beeps - On-call platform that helps developers and agents resolve downtime faster.
Kura - AI DevOps copilot for AWS cloud infrastructure management and incident response.
Wild Moose - AI first responder for production incidents that investigates and surfaces root cause in under one minute.
Nudge Bee - Enterprise AI-agentic workflow platform for SRE and CloudOps with pre-built AI assistants and customizable workflows.
Agent SRE - AI agent for autonomous site reliability engineering.
Anyshift - AI SRE agent that investigates production incidents by tracing changes across a versioned infrastructure graph to identify root causes.
Guardian by Metoro - AI SRE agent for Kubernetes that detects issues, finds the root cause, and opens fix PRs automatically.
Hyground - Sovereign AI SRE agent that operates complex software across your entire stack, automatically finds root causes, and cuts DevOps toil.

AI Production Debugging

AI-powered tools for debugging production applications in real-time — adding observability without redeployments and autonomously remediating code issues.

Lightrun - AI SRE platform for autonomous code remediation that lets you add logs, snapshots, and metrics to production without restarts.
Sentry Seer - AI debugging agent built on production telemetry that identifies actionable issues, performs root cause analysis, and generates code fixes.

Incident Management

AI-enhanced platforms for managing the full incident lifecycle — detection, triage, response, communication, and post-mortems.

PagerDuty AIOps - Enterprise incident management with ML-based noise reduction, AI Agent Suite with SRE Agent and Copilot, and MCP server integration.
incident.io - Slack-native incident management with AI SRE, AI alert triage, AI postmortems, Scribe call transcription, and Claude and Cursor integration.
Rootly - AI-native incident management with LLM-powered investigation across the observability stack.
FireHydrant - AI-powered incident summaries, Zoom-aware context enrichment, and AI-drafted retrospectives. Being acquired by Freshworks.
Squadcast - Incident management with AI-driven alert clustering and automatic grouping of related incidents. Acquired by SolarWinds.
Zenduty - On-call and incident management with AI Summarizer, AI Postmortem, and AI Scheduling. Acquired by Xurrent, rebranding to Xurrent IMR.
BetterStack - Developer-friendly uptime monitoring and incident management with integrated observability.

Observability Platforms

Full-stack observability with AI capabilities — anomaly detection, natural language querying, and intelligent alerting across metrics, logs, and traces.

Datadog - Unified SaaS observability with Watchdog AI auto-detection, predictive metrics monitoring, and LLM observability across 600+ integrations.
Dynatrace - Full-stack observability with Davis AI engine for continuous dependency analysis, anomaly detection, and Davis CoPilot for natural language remediation.
New Relic - Full-stack observability with NRAI assistant for natural language queries and AI-powered anomaly detection.
Grafana - Open source observability with Grafana Assistant for natural language queries, autonomous incident investigation, and ML-based anomaly detection.
Splunk - Enterprise observability with AI-driven anomaly detection at scale and ITSI with ML-based predictive analytics. Part of Cisco.
Elastic AI Assistant - AI assistant in Kibana for natural language log, metrics, and trace querying with contextual alert triage and RAG-powered knowledge base.
Honeycomb - Observability for distributed services with Query Assistant, Honeycomb Intelligence, AI-guided Canvas workspace, and hosted MCP server.
Coroot - Open source observability with AI-powered root cause analysis and eBPF-based auto-instrumentation.
Last9 - Unified observability with Agentic SRE SDK, AI copilot integration with Claude, Cursor, and Slack, and managed TSDB.
SigNoz - Open source OpenTelemetry-native observability platform for logs, metrics, and traces with unified correlation analysis.
Metoro - Kubernetes-native observability platform with built-in eBPF telemetry, AI investigation, deployment verification, and root cause analysis.

AIOps Platforms

Platforms that apply ML and AI to IT operations — correlating events, reducing alert noise, and automating operational workflows at scale.

BigPanda - AIOps for high-alert-volume environments with event correlation reducing alert volume by 95%+ and AI Incident Assistant.
Moogsoft - AIOps with event deduplication, contextual enrichment, intelligent correlation, and automated root cause analysis. Part of Dell Technologies.
LogicMonitor - Cloud-based infrastructure monitoring with Edwin AI agent for plain-language summaries, predictive analytics, and capacity forecasting.
Selector AI - AI-powered network observability with Network Large Language Model, 90% alert noise reduction, and digital twin modeling.
Keep - Open source AIOps and alert management with correlation across monitoring tools and 50+ integrations.

Log Analysis and Anomaly Detection

Specialized tools for AI-driven log analytics, pattern recognition, and automated anomaly detection.

Sumo Logic - Cloud-native log analytics with real-time AI-powered anomaly detection and ML-based pattern recognition.
Graylog - Open source log management for centralized collection, indexing, and analysis with anomaly alerting.
Logz.io - Cloud observability built on ELK and OpenSearch with AI-powered log analysis, ML-based anomaly detection, and MCP server.
OpenObserve - Open source high-performance log, metrics, and trace platform with real-time analytics.
LogAI - Open source library by Salesforce for log clustering, anomaly detection, and summarization with modular ML pipelines.

Chaos Engineering

Tools for proactively testing system resilience — now enhanced with AI for intelligent experiment design, blast radius control, and automated analysis.

ChaosEater - Research tool using LLMs to fully automate the chaos engineering cycle from requirement identification through experiment design, execution, and analysis.
Harness Chaos Engineering - Enterprise chaos engineering with LLM-derived test recommendations, intelligent blast radius downscaling, and MCP tool integration.
Gremlin - Pioneering commercial chaos engineering tool with attack templates, infrastructure metrics monitoring, and multi-cloud support.
Steadybit - Chaos engineering with open source extension framework, resilience policies, and experiment automation.
LitmusChaos - CNCF open source chaos engineering for Kubernetes with ChaosHub for shared experiments.
Chaos Mesh - CNCF open source Kubernetes-native chaos engineering with comprehensive fault injection for pods, network, IO, time, and kernel.
AWS Fault Injection Service - AWS-native chaos engineering with integrated experiment templates and safety controls.

Runbook Automation

AI-powered tools for automating operational runbooks — converting manual procedures into self-executing workflows with intelligent decision-making.

Rundeck - Open source and commercial runbook automation with self-service GUI, job scheduling, RBAC, and 1000+ integration plugins. Part of PagerDuty.
StackStorm - Open source event-driven automation with rules engine, 6000+ actions, and ChatOps. Used by Netflix for self-healing infrastructure.
Ansible Lightspeed - AI-powered Ansible playbook generation via IBM watsonx with natural language to Ansible code and MCP support.
RunWhen - Platform for SRE agent orchestration and automated troubleshooting workflows.

Cloud Cost Optimization

AI-driven platforms for optimizing cloud spend — autonomous rightsizing, commitment management, and workload-aware cost allocation.

CAST AI - Kubernetes cost optimization with real-time pod rightsizing, autoscaling optimization, predictive capacity forecasting, and advanced bin-packing.
Sedai - Autonomous cloud optimization using patented reinforcement learning for rightsizing, workload-aware capacity scaling, and 30-50% cost savings.
ProsperOps - Autonomous commitment optimization managing $6B+ in annual cloud usage. Acquired by Flexera.
Kubecost - Open source Kubernetes cost monitoring with real-time cost allocation and automated rightsizing recommendations.
Vantage - Multi-cloud cost management with FinOps Agent for AI-driven savings identification and open source MCP server.
nOps - AWS-focused FinOps with AI agent trained on customer data for automated commitment optimization.
Finout - Enterprise FinOps with MegaBill for multi-provider cost consolidation and AI-powered cost attribution.
Spot.io - Cloud infrastructure automation with spot instance optimization and commitment management. Part of NetApp.
CloudPilot AI - Kubernetes-native capacity management with predictive scaling that anticipates usage spikes proactively.

LLM-Powered DevOps Tools

Tools leveraging large language models for natural language interaction with infrastructure, code generation for operations, and AI-assisted DevOps workflows.

K8sGPT - CNCF project for AI-powered Kubernetes diagnostics with SRE experience codified into analyzers and multiple LLM backends.
HolmesGPT - CNCF Sandbox project providing a 24/7 on-call AI agent with an agentic loop that queries live observability data from Prometheus, Grafana, Datadog, and Kubernetes.
Kube-Copilot - Open source natural language to Kubernetes operations with manifest generation and security scanning.
Lens Prism - AI copilot in Lens Desktop for context-aware natural language interaction with live Kubernetes clusters.
GitHub Copilot Agent Mode - AI coding assistant with DevOps agent capabilities for infrastructure validation, incident response, and pipeline automation.
GitLab Duo - AI throughout the DevSecOps lifecycle with failed job trace analysis, root cause identification, and Security Analyst Agent.
Grafana Assistant - AI assistant for natural language dashboard creation, autonomous incident investigation, and query generation.

Agent Benchmarks

Frameworks and benchmarks for evaluating AI SRE agent performance.

SRE Bench - Benchmark for evaluating AI SRE agents on realistic operational tasks.

Research Papers

Key academic and industry research on applying AI and ML to site reliability engineering and IT operations.

STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds — NeurIPS 2025.
ChaosEater: Fully Automating Chaos Engineering with Large Language Models — ASE 2025.
AIOps Solutions for Incident Management — Comprehensive literature review, 2024.
A Survey of AIOps in the Era of Large Language Models — ACM Computing Surveys 2025.
Automatic Root Cause Analysis via Large Language Models for Cloud Incidents — EuroSys 2024.
FaultProfIT: Hierarchical Fault Profiling of Incident Tickets in Large-scale Cloud Systems — ICSE-SEIP 2024.

Blogs and Newsletters

SRE Weekly - Curated newsletter on scalability, availability, incident response, and automation.
Last Week in AWS - Weekly AWS news and commentary by Corey Quinn.
Google Cloud DevOps and SRE Blog - Practices and tools for DevOps and SRE at scale.
The New Stack - Cloud-native technology coverage with extensive AI and SRE content.
incident.io Blog - Practical SRE and AI tools content.
Doctor Droid Notes - AI-focused SRE engineering blog.
NeuBird Blog - GenAI SRE predictions and industry analysis.
Metoro Blog - Observability, AI SRE, and Kubernetes content.
Hyground Blog - AI SRE, observability, GenAI, and security content.

Community Lists

Other curated collections in the AI and operations space.

awesome-AIOps - Academic research and industrial materials on AIOps.
awesome-LLM-AIOps - LLM-specific AIOps research and papers.
awesome-chaos-engineering - Comprehensive chaos engineering resources.

Contributing

Contributions welcome! Read the contribution guidelines first.
