Awesome Site Reliability Engineering Tools 
A curated list of Site Reliability and Production Engineering tools - Maintained by Raghu Chinnannan and Squadcast
Contents
- Development
- Continuous Testing
- Continuous Integration
- Continuous Delivery
- Continuous Monitoring
- Incident Management / Incident Response / IT Alerting / On-Call
- Internal Developer Portal
- AI SRE Tools & SRE Copilots
Development
Source Code Management
Project Management & Issue Tracking Software
- Jira
- Trello
- Zoho Sprints
- Taiga
- Wrike
- Asana
- Monday.com
- ClickUp
- Basecamp
- Rally
- Teamwork
- Redmine
- Freedcamp
- Shortcut
- Azure Boards
- GitHub Projects
- GitLab Boards
- Bitbucket Issues
- Linear
Bug / Defect Tracking Software
Code Editors and IDEs
- GNU Emacs
- Notepad++
- Atom
- Visual Studio Code
- Sublime Text
- Vim
- Neovim
- Eclipse
- GNU Nano
- UltraEdit
- TextMate
- gedit
- WebStorm
- IntelliJ IDEA
- PyCharm
- Eclipse Che
- Bluefish
- CodeLobster
Continuous Testing
- Selenium
- JUnit
- TestNG
- NUnit
- TestSigma
- Unified Functional Testing (UFT)
- Tricentis Tosca
- IBM Rational Functional Tester
- TestComplete
- Watir
- Zephyr
- accelQ
- Appium
- steadybit
- k6
- Apache JMeter
- Gatling
- Cypress
- TestRail
- Bencher
Continuous Integration
Build
- Ninja
- Meson
- CMake
- Autotools/Automake
- premake
- Maven
- Ant
- Gradle
- Make
- Cake
- Rake
- MSBuild
- Drill
- Hydra
- Bazel
- Azure DevOps
Integration
- Jenkins
- Bamboo
- Hudson
- CircleCI
- TeamCity
- GitLab CI
- Travis CI
- AWS CodeStar
- Buildbot
- Semaphore CI
- Concourse CI
- Abstruse CI
- App Center
- AppVeyor
- Assertible
- Badwolf
- Bitrise
- Buildkite
- Chrono CI
- Codacy
- Code Climate
- Codefresh
- Codeship
- Continuousphp
- Drone
- Hound CI
- Probo.CI
- Solano CI
- Visual Studio Team Services
- GoCD
Continuous Delivery
Deployment
- AWS CodeDeploy
- ElectricFlow
- Octopus Deploy
- IBM UrbanCode
- DeployBot
- Shippable
- Codar Continuous Delivery
- Wercker
- Humanitec
- Argo CD
- Buddy Works
- werf
- Google Cloud Build
- Qovery - Enterprise Kubernetes management platform for deploying applications, databases, Helm charts, and Terraform modules on AWS, GCP, Azure, and Scaleway.
Infrastructure orchestration
- Vagrant
- Puppet
- Chef
- SaltStack
- Ansible
- Terraform
- AWS CloudFormation
- Rundeck
- Spacelift
- Selefra
- Scalr
- Stategraph
- Pulumi
- Google Cloud Deployment Manager
- OPS
- Kratix
- Terrateam
Container
Container Registry
- Docker Hub
- Google Container Registry
- Amazon ECR
- GitLab Container Registry
- JFrog Artifactory
- Quay.io
- Azure Container Registry
- Oracle Container Registry
- Nexus Container Registry
- Harbor
Container Orchestration
Continuous Monitoring
- AWS CloudWatch
- DebugBear
- Prometheus
- Stackdriver
- Sensu
- Sentry
- CopperEgg
- Crashlytics
- Kapacitor
- Loggly
- Logmatic
- Logstash
- MongoDB Atlas
- MongoDB Cloud Manager
- New Relic
- ReleaseRun Vulnerability Scanner
- Papertrail
- PageGuard - Free all-in-one website health scanner. Core Web Vitals, SEO, WCAG 2.1 accessibility, and best practices. AI-generated action plan. No signup required.
- Pingdom
- ServerDensity
- Zabbix
- InsightOps
- AppSignal
- API Status Check - Centralized dashboard tracking real-time status and outages for 1,000+ popular APIs and services (AWS, Stripe, GitHub, Twilio, etc.). Monitor third-party dependencies, get instant outage alerts, reduce MTTR.
- Grafana
- VictoriaMetrics
- Chaos Genius
- Cloud Waste Scanner - Detects cloud waste and helps DevOps/platform teams identify quick cloud cost optimization opportunities.
- Thanos
- Mimir
- Hydrozen.io - Uptime monitoring & Statuspages
- SSL Certificate Monitor - Open-source SSL/TLS certificate expiry monitoring tool with email alerts
- DNS Propagation Checker - Open-source DNS propagation monitoring tool with global DNS server coverage
- whatbroke.today - AI-powered outage aggregator tracking 100+ cloud services with Telegram alerts
- Steampipe.io - Universal SQL interface to any cloud API
- Better Stack
- Netdata
- DoctorGPT - Brings GPT into production for application log error monitoring
- Dynatrace
- Datadog
- DevHelm - Developer-first uptime monitoring with HTTP, DNS, TCP, ICMP, and heartbeat checks, dependency intelligence for 80+ providers, hosted status pages, incident management, and a full developer surface (CLI, SDKs, Terraform provider, MCP server).
- Elastic APM
- Healthchecks.io
- OnlineOrNot - Uptime monitoring for websites, APIs, and cron jobs, with integrated status pages.
- Uptrack - Uptime monitoring with 30-second checks on free tier, consecutive-check alert confirmation to cut false positives, hosted status pages, and a built-in MCP server for AI agents.
- Streamdal - Code-Native Data Privacy - embed privacy controls in your application code to detect and monitor PII.
- Dash0 - OpenTelemetry-native observability built on CNCF open standards such as PromQL, Perses, and OTLP, with full cost control. Supports metrics, traces, and logs with fully custom dashboards and alerting.
- CICube - AI DevOps monitoring platform that monitors your CI workflows, detects anomalies, and provides actionable fixes.
- Middleware - A Full-Stack Cloud Observability Platform designed to empower developers and organizations to monitor, optimize, and streamline their applications and infrastructure in real-time.
- Shipfox - Boost GitHub Actions speed by 2x and cut costs by up to 75%, with smarter caching, deep CI insights, and zero-config setup.
- Ingero - eBPF-based GPU causal observability agent. Traces CUDA APIs and host kernel events to build causal chains explaining GPU latency. Includes MCP server for AI-assisted incident investigation.
- cloud-audit - AWS security auditing CLI that runs 17 checks across IAM, S3, EC2, VPC, and RDS with built-in remediation engine generating AWS CLI commands and Terraform snippets.
- FlareWarden - Uptime, content, and dependency monitoring with multi-region verification, status pages, and incident management.
- Phare - Shockingly good uptime monitoring, alerts, incident management, and status pages.
- API Status Check - Real-time status monitoring dashboard for 250+ developer APIs including AWS, Stripe, GitHub, and OpenAI. Free, no signup required.
- LynxDB - Lightweight columnar log analytics database for SRE workflows, with a pipe-style query language inspired by SPL for investigating production logs.
- KubeStellar Console - Open-source multi-cluster Kubernetes dashboard with AI-powered operations, MCP server bridging kubeconfig to LLM agents, and real-time observability across edge and cloud clusters. CNCF Sandbox.
- Apitally - API monitoring, analytics, and request logging for REST APIs, with lightweight open-source SDKs for Python, Node.js, Go, .NET, and Java.
- Riftmap - Cross-repo infrastructure dependency discovery and change impact analysis for multi-repo environments using Terraform, Docker, Helm, and more.
- Oack - HTTP monitoring with TCP kernel telemetry, 6-phase latency breakdown, Server-Timing header capture, Cloudflare CDN enrichment, and built-in incident management with on-call scheduling.
- OpenClaw Monitor - Real-time AI agent monitoring dashboard for OpenClaw agents. Track Gateway status, sessions, token usage & trends.
- agenttrace - TUI observability for AI coding agents. Track cost, tokens, tool failures, latency, anomalies, health, diffs, and CI gates across Claude Code, Codex CLI, Gemini CLI, Aider, and Cursor exports.
Incident Management / Incident Response / IT Alerting / On-Call
- Squadcast
- PagerDuty
- VictorOps
- Opsgenie
- AlertOps
- Blameless (now FireHydrant)
- Jira Ops
- OnPage
- PagerTree
- Cabot
- AlertAgility
- xMatters
- Derdack Enterprise Alert
- BigPanda
- OpenDuty
- ngDesk
- Geneos
- FireHydrant
- SLO exporter
- SLO Calculator
- Rootly
- Rootly CLI - Open-source CLI to manage Rootly incidents, alerts, services, teams, and on-call schedules from the terminal.
- Grafana OnCall
- Keep - CLI for alerting
- Better Stack
- Everbridge
- Moogsoft
- incident.io
- Next9.ai
- HolmesGPT - Automatically investigates Prometheus alerts and Jira/PagerDuty/Opsgenie tickets using AI.
- Merlinn - Open-source AI on-call developer
- Calmo - Debug production 10x faster with AI.
- NthLayer - Reliability Shift Left platform. Generate dashboards, alerts, SLOs from YAML. Verify metrics exist before deploy. Block deploys when error budget exhausted.
- Runframe - Incident management platform with on-call scheduling, real-time collaboration, and automated escalations.
- Incidentary - Shared causal traces for incident response. Captures pre-alert causal chains across services and assembles them into a shared, replayable artifact before the war room starts. SDKs are open source.
- Regen - Open-source, self-hosted incident management with alert ingestion, on-call scheduling, escalation policies, AI-powered post-mortems, and Slack/Teams integration. Licensed under AGPLv3; a self-hosted alternative to PagerDuty and Grafana OnCall.
IT Service Management
- FreshService
- ServiceNow
- BMC Remedy
- Jira Service Management (formerly Jira Service Desk)
- Samanage
- Cherwell
- SysAid
- ManageEngine ServiceDesk Plus
- Zendesk
Incident Communication
- Squadcast Statuspages
- StatusPal - communicate incidents and maintenance effectively with a beautiful hosted status page.
- Hydrozen.io Statuspages
- whatbroke.today - AI-powered outage aggregator tracking 100+ cloud services with Telegram alerts
- Atlassian Statuspages
- Instatus Statuspages - Quick and beautiful status page.
- Cachet
Internal Developer Portal
AI SRE Tools & SRE Copilots
- Sherlocks.ai
- Resolve.ai
- Deductive.ai
- Ingero - eBPF-based GPU causal observability agent. Traces CUDA APIs and host kernel events to build causal chains explaining GPU latency. Includes MCP server for AI-assisted incident investigation.
- IncidentFox (open source)
- metoro.io
- Ops AI by Middleware
- tailscale-mcp - MCP server with 52 tools for managing Tailscale tailnets from AI assistants like Claude Code and Cursor.
- KubeStellar Console - AI-powered multi-cluster Kubernetes management console with MCP server (kc-agent) for AI-assisted cluster operations, pod inspection, deployment management, and real-time observability across distributed environments.
Related Lists
- Awesome Performance Engineering - Observability and performance testing tools and resources for performance engineering.
Licence
This work is licensed under a Creative Commons Attribution 4.0 International License.
