versuscontrol

Professional software vendor delivering innovative solutions on the Softono platform. Specialized in both open-source and proprietary software development.

Software by versuscontrol

Open Source

devops-ai-guidelines

# DevOps AI Guidelines & Learning Path

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![GitHub stars](https://img.shields.io/github/stars/VersusControl/devops-ai-guidelines.svg?style=social&label=Star)](https://github.com/VersusControl/devops-ai-guidelines)

> **Your complete journey from DevOps Engineer to AI Infrastructure Architect - with comprehensive learning paths, practical tips, and enterprise guidelines**

## Overview

This repository provides everything you need to master AI in DevOps - from your first AI tool to becoming an AI Infrastructure Architect. Whether you're starting your AI journey, implementing AI in your team, or leading enterprise AI adoption, we've got comprehensive resources and proven frameworks to guide your success.

### What You'll Find Here

- **Complete learning roadmap** from DevOps to AI Infrastructure Architect
- **Structured tutorials** for AI fundamentals and advanced techniques
- **Enterprise AI frameworks** for safe team implementation
- **Career acceleration** strategies and interview preparation
- **Daily productivity** tips and automation workflows
- **Cloud optimization** using AI tools and techniques

## Repository Contents

### AI Learning Path for DevOps

**Complete 18-month journey from DevOps Engineer to AI Infrastructure Architect**

| Document | Description | Status |
| --- | --- | --- |
| [**AI Roadmap Overview**](./01-ai-roadmap-for-devops/01-roadmap.md) | Complete visual learning path with 3 phases and navigation | ✅ Available |
| [**MCP for DevOps**](./02-mcp-for-devops/00-toc.md) | Complete guide to building MCP servers with Golang and Kubernetes | ✅ Available |
| [**AI Agent for DevOps**](./03-ai-agent-for-devops/00-toc.md) | Complete guide to building AI Agent with Golang and LangChain | ✅ Available |
| [**Building Your Business on AWS with AI Agent**](./04-ai-agent-for-aws/00-toc.md) | A Step-by-Step Guide for Beginners to Design, Build, and Scale Your Business Infrastructure on Amazon Web Services | ✅ Available |
| [**AI Project Management**](./05-ai-project-management/00-toc.md) | An Engineer's Journey to Building an Intelligent Project Management with OpenClaw | ✅ Available |

### For Teams & Organizations

| Document | Description | Time to Read |
| --- | --- | --- |
| [**AI Guidelines for DevOps Teams**](./resources/ai-guidelines-devops-team.md) | Complete framework for implementing AI safely in your organization | 15-20 min |

### For Individual Career Growth

| Document | Description | Time to Read |
| --- | --- | --- |
| [**AI Interview Preparation Guide**](./resources/ai-mock-interview.md) | Master AI-related DevOps interview questions and scenarios | 10 min |
| [**10 Essential AI Prompts for DevOps**](./resources/10-ai-prompts-devops.md) | Ready-to-use prompts that will boost your daily productivity | 5 min |

### For Cloud & Certification

| Document | Description | Time to Read |
| --- | --- | --- |
| [**AWS Certification 10x Faster with AI**](./resources/aws-certification-10x-faster.md) | Accelerate your AWS learning using AI tools and strategies | 8 min |

### Resources Example

| Document | Description | Time to Read |
| --- | --- | --- |
| [**How I Use AI Assistant to Create AWS Infrastructure**](./resources/ai-assisted-aws-infrastructure-demo.md) | A conversation with your AI assistant about your infrastructure needs, and it builds everything for you. | 12 min |
| [**AI Infrastructure Agent for AWS**](./resources/ai-infrastructure-agent-for-aws.md) | Transform your AWS infrastructure management with natural language commands! | 12 min |

## Quick Start Paths

### Individual DevOps Engineer

1. Start with [10 AI Prompts](./resources/10-ai-prompts-devops.md) for immediate productivity gains
2. Use [AWS AI Learning](./resources/aws-certification-10x-faster.md) to accelerate certifications
3. Prepare for interviews with [AI Interview Guide](./resources/ai-mock-interview.md)

### **Team Lead/Manager**

1. Review [Team Guidelines](./resources/ai-guidelines-devops-team.md) for safe AI implementation
2. Share [Daily Prompts](./resources/10-ai-prompts-devops.md) with your team
3. Establish security framework from the guidelines

### **Organization/CTO**

1. Implement [AI Guidelines](./resources/ai-guidelines-devops-team.md) across teams
2. Train teams using all resources as reference material
3. Monitor and optimize based on provided frameworks

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Support This Project

If these guidelines have helped your team implement AI safely and effectively, consider:

- ⭐ **Starring** this repository
- 💖 **Sponsoring** continued development

[![Sponsor](https://img.shields.io/badge/Sponsor-❤️-red?style=for-the-badge)](https://github.com/sponsors/hoalongnatsu)

---

**Made with ❤️ by the DevOps VN community | Helping DevOps professionals thrive in the AI era**

1.1K GitHub Stars

Open Source

versus-incident

<h1 align="center" style="border-bottom: none">
  <img alt="Versus" src="src/docs/images/versus.svg">
</h1>

<p align="center">
  <a href="https://goreportcard.com/report/github.com/VersusControl/versus-incident"><img src="https://goreportcard.com/badge/github.com/VersusControl/versus-incident" alt="Go Report Card"></a>
  <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
  <a href="https://github.com/sponsors/versuscontrol"><img src="https://img.shields.io/badge/sponsor-%E2%9D%A4-ff69b4" alt="Sponsor"></a>
</p>

Versus is an open-source **incident management tool** built with an **AI SRE agent**. The agent watches your logs, learns what *normal* looks like, and automatically opens an incident the moment something new and unexpected appears — with no alert rules to write. It also accepts incidents from any webhook-capable tool (Alertmanager, Grafana, Sentry, CloudWatch, FluentBit).

Versus supports **multi-channel notifications** to Slack, Microsoft Teams, Telegram, Viber, Email, and Lark with templates you fully control, plus **on-call escalation** to AWS Incident Manager or PagerDuty when an incident goes unacknowledged.

![Versus](src/docs/images/versus-dashboard-01.png)

## How Versus Creates Incidents

Incidents reach Versus two ways, and both are handled by the same notification, templating, and on-call logic:

- **AI SRE Agent (auto-detect)** — point the agent at your logs and it learns your normal patterns, then automatically raises an incident when a brand-new error or anomaly appears. No alert rules to maintain.
- **Webhook alerts (you define)** — any tool that can POST a webhook (Alertmanager, Grafana, Sentry, CloudWatch SNS, FluentBit, or your own scripts) sends incidents straight to Versus, formatted with your own templates.

Whichever source raises it, an incident is templated, fanned out to every channel you enable, and escalated to on-call if it isn't acknowledged in time.

## Features

- 🤖 **AI SRE Agent** *(Beta)*: An AI agent that reads your logs, learns what normal looks like, and automatically opens an incident only when something new and unexpected appears.
- 🌐 **Webhook Alerts**: Receive incidents from any tool that can POST a webhook — Alertmanager, Grafana, Sentry, CloudWatch SNS, FluentBit, and more.
- 🚨 **Multi-channel Notifications**: Fan out every incident to Slack, Microsoft Teams, Telegram, Viber, Email, and Lark (more channels coming!)
- 📝 **Custom Templates**: Define your own alert messages using Go templates
- 🔧 **Easy Configuration**: YAML-based configuration with environment variables support
- 📡 **REST API**: Simple HTTP interface to receive alerts
- 📞 **On-Call**: On-Call integrations with AWS Incident Manager and PagerDuty

![Versus](src/docs/images/versus-architecture.png)

## Table of Contents

- [Getting Started](#get-started)
- [AI Agent](#ai-agent)
- [Webhook Alerts](#webhook-alerts)
- [Admin Dashboard](https://versuscontrol.github.io/versus-incident/configuration/admin-ui.html)
- [Development Custom Templates](#development-custom-templates)
- [On-Call](#on-call)
- [Configuration](#complete-configuration)
- [Roadmap](#roadmap)
- [Contributing](#contributing)
- [License](#license)

## Get Started

### Auto-detect incidents with the AI SRE Agent

Let the agent learn your logs and surface what's new. The AI SRE Agent has three modes:

- **`training`** — just watch and learn. No alerts.
- **`shadow`** — watch and learn, plus write a "would have alerted" log entry every time a line would have triggered an alert. Still no real alerts. Good for checking the agent's judgement before going live.
- **`detect`** — actually create incidents for lines the agent has never seen before. An AI SRE triages each one and writes the summary, severity, and suggested next steps before the incident is sent through every configured channel.
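
The three modes differ only in what happens when the agent meets a log line it has never seen. A minimal Python sketch of that decision, for illustration only: the function name `handle_unseen_line` and the action labels are hypothetical and not part of Versus, and the sketch assumes `detect` keeps recording sightings as the other modes do.

```python
# Illustrative only: models the documented behaviour of the three agent modes.
# The function name and the returned action labels are hypothetical.

VALID_MODES = ("training", "shadow", "detect")


def handle_unseen_line(mode: str) -> list[str]:
    """Return the actions taken when a never-before-seen log line appears."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    actions = ["learn_pattern"]  # assumption: every mode records what it sees
    if mode == "shadow":
        actions.append("log_would_have_alerted")  # no real alert is sent
    elif mode == "detect":
        actions.append("create_incident")  # triaged, then sent to channels
    return actions


for mode in VALID_MODES:
    print(mode, "->", handle_unseen_line(mode))
```

Only `detect` produces a real incident; `shadow` leaves a log entry you can review first.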

Start it in **training mode** — it only watches and learns, and never sends an alert until you're ready.

```bash
# Redis is used to remember where the agent left off in each log source.
docker run -d --name versus-redis -p 6379:6379 redis:7

docker run -p 3000:3000 \
  -e GATEWAY_SECRET=change-me \
  -e AGENT_ENABLE=true \
  -e AGENT_MODE=training \
  -e REDIS_HOST=host.docker.internal \
  -e REDIS_PORT=6379 \
  -v $(pwd)/config:/app/config \
  -v $(pwd)/data:/app/data \
  ghcr.io/versuscontrol/versus-incident
```

The agent needs a `config.yaml` and an `agent_sources.yaml` that point it at your logs. Once it's running, review the patterns it learns on the **admin dashboard** at <http://localhost:3000/>, then switch `AGENT_MODE` from `training` → `shadow` → `detect` when you trust it.

Full walkthrough (with ready-to-copy config and a sample log generator): [AI Agent — Getting Started](https://versuscontrol.github.io/versus-incident/agent/getting-started.html).

### Forward alerts from your existing tools

Already have monitoring? Run Versus and POST your alerts to its webhook endpoint.

```bash
docker run -p 3000:3000 \
  -e GATEWAY_SECRET=change-me \
  -e SLACK_ENABLE=true \
  -e SLACK_TOKEN=your_token \
  -e SLACK_CHANNEL_ID=your_channel \
  ghcr.io/versuscontrol/versus-incident
```

Versus listens on port 3000 by default and exposes:

- `POST /api/incidents` — webhook endpoint for monitoring tools.
- `GET /` — the embedded **admin dashboard**, open <http://localhost:3000/> in your browser. For the full UI walkthrough and the build/watch scripts, see [Admin Dashboard](https://versuscontrol.github.io/versus-incident/configuration/admin-ui.html).

> **You can use both.** The AI agent and webhook alerts are not mutually exclusive — run them together and every incident, whether auto-detected or forwarded from your tools, flows through the same channels, templates, and on-call logic.
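
If you forward alerts from your own scripts rather than from a monitoring tool, the `POST /api/incidents` call can be built with the Python standard library alone. The sketch below is a rough illustration: the helper name `build_incident_request` and the trimmed-down payload are hypothetical, and which fields your templates actually read depends on your setup.

```python
import json
import urllib.request


def build_incident_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /api/incidents webhook."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/incidents",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload, loosely shaped like an Alertmanager alert.
request = build_incident_request(
    "http://localhost:3000",
    {"status": "firing", "alerts": [{"labels": {"alertname": "PostgresqlDown"}}]},
)
print(request.get_method(), request.full_url)

# To send it to a running Versus instance:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Building the request separately from sending it makes the script easy to check without a running server.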

## AI Agent

The **AI SRE agent** is what makes Versus different: point it at your logs and it learns what *normal* looks like, then automatically opens an incident the moment a brand-new error or anomaly appears — no alert rules to maintain.

Configuration example with agent features:

```yaml
name: versus
host: 0.0.0.0
port: 3000

# ... existing alert configurations ...

# Shared secret required for ALL admin endpoints (`/api/admin/*` and
# `/api/agent/*`). Sent by clients in the `X-Gateway-Secret` header.
gateway_secret: ${GATEWAY_SECRET}

# Storage backend for the pattern catalog, shadow log, and incident
# history. Only `file` is implemented today; `redis` and `database`
# are config stubs.
storage:
  type: file # file | redis | database (env: STORAGE_TYPE)
  file:
    data_dir: ./data
    max_incidents: 1000 # rolling cap on persisted incidents

agent:
  enable: false # Use this to enable or disable the agent for all sources
  mode: training # Valid values: "training", "shadow", or "detect"
  poll_interval: 30s

  # Sources are kept in a separate file so they can be managed independently
  # (e.g. swap fixtures, per-environment lists). Path is resolved relative to
  # this config file. Override via env: AGENT_SOURCES_PATH.
  sources_path: ./agent_sources.yaml

  catalog:
    persist_interval: 30s
    auto_promote_after: 100 # In detect mode, this many sightings = "known"

  redaction:
    enable: true
    redact_ips: false
    extra_patterns: # Optional: extra regex rules to scrub before clustering
      - "(?i)password=\\S+"
      - "Authorization:\\s*Bearer\\s+\\S+"

  miner:
    similarity_threshold: 0.4
    tree_depth: 4
    max_children: 100

  regex:
    # Optional: tag any signal whose message matches this pattern
    # if none of the named rules below hit. Leave empty to disable.
    default_pattern: "(?i)error|exception|fatal|panic"
    # Named rules are tried first, in order. The first match wins.
    rules:
      - name: oom
        pattern: "(?i)out of memory|OOMKilled|java\\.lang\\.OutOfMemoryError"
      - name: db-timeout
        pattern: "(?i)(connection|query) timeout|deadlock detected"
      - name: auth-failure
        pattern: "(?i)401 unauthorized|invalid credentials|permission denied"

redis: # Required for the agent to persist source cursors across restarts
  host: ${REDIS_HOST}
  port: ${REDIS_PORT}
  password: ${REDIS_PASSWORD}
  db: 0
```

**Explanation:**

The `agent` section includes:

1. `enable`: Turn the agent on or off (default: `false`). When disabled, nothing extra runs — no background processes, no extra files written.
2. `mode`: How the agent behaves after it has learned your log patterns:
   - `training`: observation only — the agent learns patterns and saves them, but sends no alerts.
   - `shadow`: same as training, but also logs a note every time it would have sent an alert. Good for reviewing before going live.
   - `detect`: the agent actively sends alerts for any pattern it has never seen before.
3. `poll_interval`: How often the agent checks your log sources for new entries.
4. `catalog`: How often the agent writes its list of known patterns to storage (`persist_interval`) and how many sightings promote a pattern to "known" in detect mode (`auto_promote_after`). The storage backend is selected by the root-level `storage.type`; only `file` is supported today, which writes to `<storage.file.data_dir>/patterns.json` (the filename is fixed).

   > **Admin secret.** All admin endpoints (`/api/admin/*` and
   > `/api/agent/*`) are protected by the **root-level** `gateway_secret`
   > (env `GATEWAY_SECRET`). Set it to any value you choose; clients send
   > the same value in the `X-Gateway-Secret` header. When no secret is
   > configured the admin endpoints are not registered and the agent
   > refuses to start.

   > **Storage.** The agent's catalog and the incident history shown in the
   > UI are persisted via the **root-level** `storage:` block (default:
   > `type: file`, `data_dir: ./data`). The agent's `data_dir` field has
   > been removed.

5. `redaction`: Rules for automatically removing sensitive information (passwords, tokens, emails, etc.) from logs before the agent processes them.
6. `miner`: Controls how aggressively the agent groups similar log lines together. The defaults work well for most setups.
7. `regex`: Acts as a **pre-filter** for the agent. Only signals whose message matches at least one rule (a named entry under `rules` or `default_pattern`) are forwarded to the pattern miner and stored in the catalog. Anything that doesn't match is dropped before clustering, so boring noise (200-OK requests, debug lines, etc.) never bloats `patterns.json`.
   - Named `rules` are tried in order; the first match wins and tags the signal with that `name` (stored as `rule_name` on the pattern).
   - If no named rule hits, `default_pattern` is tried. Matches there are tagged with `name=default`.
   - **To learn from every line, set `default_pattern: ".*"`.** This is useful in early training when you don't yet know what's interesting.
   - **To filter aggressively, set `default_pattern: ""` (empty)** and rely on your named rules — anything that doesn't match an explicit rule is dropped.
8. `sources_path`: Path to a separate YAML file that lists the log sources the agent should read from. Keeping sources in their own file makes it easier to manage per-environment source lists or swap fixtures without touching the rest of the config. The path is resolved relative to the main config file. Override via the `AGENT_SOURCES_PATH` env var.

The sources file (default `./agent_sources.yaml`) has a single top-level `sources:` list. Each entry needs `name`, `type` (`file` or `elasticsearch`), `enable`, plus a matching `file:` or `elasticsearch:` block.
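
To make the `redaction` and `regex` settings above more concrete, here is a minimal Python sketch of scrubbing a line and then applying the first-match-wins pre-filter. It is an illustration, not the agent's implementation: the patterns are copied from the example config, but the function names, the `[REDACTED]` marker, and the assumption that redaction runs before tagging are all hypothetical.

```python
import re

# Patterns copied from the example config above.
EXTRA_REDACTION_PATTERNS = [r"(?i)password=\S+", r"Authorization:\s*Bearer\s+\S+"]
NAMED_RULES = [
    ("oom", r"(?i)out of memory|OOMKilled|java\.lang\.OutOfMemoryError"),
    ("db-timeout", r"(?i)(connection|query) timeout|deadlock detected"),
    ("auth-failure", r"(?i)401 unauthorized|invalid credentials|permission denied"),
]
DEFAULT_PATTERN = r"(?i)error|exception|fatal|panic"


def redact(line, patterns=EXTRA_REDACTION_PATTERNS):
    """Scrub sensitive substrings; the "[REDACTED]" marker is an assumption."""
    for pattern in patterns:
        line = re.sub(pattern, "[REDACTED]", line)
    return line


def tag(line, rules=NAMED_RULES, default_pattern=DEFAULT_PATTERN):
    """Return the rule name for a line, or None if the line is dropped."""
    for name, pattern in rules:  # named rules first, in order
        if re.search(pattern, line):
            return name  # first match wins
    if default_pattern and re.search(default_pattern, line):
        return "default"
    return None  # dropped before it reaches the miner


for raw in [
    "pod api-7f9 OOMKilled after fatal error",
    "login failed password=hunter2 permission denied",
    "unexpected exception in worker",
    "GET /healthz 200 OK",
]:
    clean = redact(raw)
    print(tag(clean), "|", clean)
```

The first line is tagged `oom` even though it also contains "fatal" and "error", because named rules are tried before `default_pattern`. Passing `default_pattern=""` drops everything that no named rule matches.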

Example sources file:

```yaml
sources:
  - name: prod-app
    type: elasticsearch
    enable: true
    elasticsearch:
      addresses:
        - https://es.example.internal:9200
      username: ${ES_USERNAME}
      password: ${ES_PASSWORD}
      index: "logs-app-*"
      time_field: "@timestamp"
      query: 'log.level:(error OR warn)'
      message_field: message
      page_size: 500

  - name: sample-app
    type: file
    enable: true
    file:
      path: ./local/resource/sample-app.log
      format: text
      from_beginning: true
```

The `redis` section is required when `agent.enable` is `true`. Redis is used to remember where the agent left off in each log source, so it picks up from the right place after a restart.

For detailed information on integration, please refer to the document here: [Enable AI Agent](https://versuscontrol.github.io/versus-incident/agent/agent-introduction.html).

## Webhook Alerts

Already using other monitoring tools? Versus also accepts incidents from anything that can POST JSON to `/api/incidents`, so you can route existing alerts through the same channels, templates, and on-call.

### Universal Alert Template Support

Our default template (Slack, Telegram) automatically handles alerts from multiple sources, including:

- Alertmanager (Prometheus)
- Grafana Alerts
- Sentry
- CloudWatch SNS
- FluentBit

#### Example JSON Payload Sent by Alertmanager

```bash
curl -X POST "http://localhost:3000/api/incidents" \
  -H "Content-Type: application/json" \
  -d '{
    "receiver": "webhook-incident",
    "status": "firing",
    "alerts": [
      {
        "status": "firing",
        "labels": {
          "alertname": "PostgresqlDown",
          "instance": "postgresql-prod-01",
          "severity": "critical"
        },
        "annotations": {
          "summary": "Postgresql down (instance postgresql-prod-01)",
          "description": "Postgresql instance is down."
        },
        "startsAt": "2023-10-01T12:34:56.789Z",
        "endsAt": "2023-10-01T12:44:56.789Z",
        "generatorURL": ""
      }
    ],
    "groupLabels": {
      "alertname": "PostgresqlDown"
    },
    "commonLabels": {
      "alertname": "PostgresqlDown",
      "severity": "critical",
      "instance": "postgresql-prod-01"
    },
    "commonAnnotations": {
      "summary": "Postgresql down (instance postgresql-prod-01)",
      "description": "Postgresql instance is down."
    },
    "externalURL": ""
  }'
```

#### Example JSON Payload Sent by Sentry

```bash
curl -X POST "http://localhost:3000/api/incidents" \
  -H "Content-Type: application/json" \
  -d '{
    "action": "created",
    "data": {
      "issue": {
        "id": "123456",
        "title": "Example Issue",
        "culprit": "example_function in example_module",
        "shortId": "PROJECT-1",
        "project": {
          "id": "1",
          "name": "Example Project",
          "slug": "example-project"
        },
        "metadata": {
          "type": "ExampleError",
          "value": "This is an example error"
        },
        "status": "unresolved",
        "level": "error",
        "firstSeen": "2023-10-01T12:00:00Z",
        "lastSeen": "2023-10-01T12:05:00Z",
        "count": 5,
        "userCount": 3
      }
    },
    "installation": {
      "uuid": "installation-uuid"
    },
    "actor": {
      "type": "user",
      "id": "789",
      "name": "John Doe"
    }
  }'
```

**Result:**

![Versus Result](src/docs/images/versus-result-01.png)

## Development Custom Templates

For the custom templates, see [Development Custom Templates](https://versus-incident.devopsvn.tech/userguide/getting-started.html#development-custom-templates)

## Kubernetes

For a complete `Deployment` + `Service` + `PersistentVolumeClaim` manifest (with the persistent data volume the admin dashboard needs), see [Deploy on Kubernetes](https://versuscontrol.github.io/versus-incident/configuration/kubernetes.html).

## Helm Chart

For the packaged install, see [Helm Chart](https://versuscontrol.github.io/versus-incident/configuration/helm.html) or the chart source under [helm/versus-incident](https://github.com/VersusControl/versus-incident/blob/main/helm/versus-incident).

## On-Call

Versus supports On-Call integrations with AWS Incident Manager and PagerDuty. Updated configuration example with on-call features:

```yaml
name: versus
host: 0.0.0.0
port: 3000
public_host: https://your-ack-host.example # Required for on-call ack

# ... existing alert configurations ...

oncall:
  ### Enable overriding using query parameters
  # /api/incidents?oncall_enable=false => Set to `true` or `false` to enable or disable on-call for a specific alert
  # /api/incidents?oncall_wait_minutes=0 => Set the number of minutes to wait for acknowledgment before triggering on-call. Set to `0` to trigger immediately
  initialized_only: true # Initialize on-call feature but don't enable by default; use query param oncall_enable=true to enable for specific requests
  enable: false # Use this to enable or disable on-call for all alerts
  wait_minutes: 3 # If you set it to 0, it means there's no need to check for an acknowledgment, and the on-call will trigger immediately
  provider: aws_incident_manager # Valid values: "aws_incident_manager" or "pagerduty"

  aws_incident_manager: # Used when provider is "aws_incident_manager"
    response_plan_arn: ${AWS_INCIDENT_MANAGER_RESPONSE_PLAN_ARN}
    other_response_plan_arns: # Optional: Enable overriding the default response plan ARN using query parameters, eg /api/incidents?awsim_other_response_plan=prod
      prod: ${AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_PROD}
      dev: ${AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_DEV}
      staging: ${AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_STAGING}

  pagerduty: # Used when provider is "pagerduty"
    routing_key: ${PAGERDUTY_ROUTING_KEY} # Integration/Routing key for Events API v2 (REQUIRED)
    other_routing_keys: # Optional: Enable overriding the default routing key using query parameters, eg /api/incidents?pagerduty_other_routing_key=infra
      infra: ${PAGERDUTY_OTHER_ROUTING_KEY_INFRA}
      app: ${PAGERDUTY_OTHER_ROUTING_KEY_APP}
      db: ${PAGERDUTY_OTHER_ROUTING_KEY_DB}

redis: # Required for on-call functionality
  insecure_skip_verify: true # dev only
  host: ${REDIS_HOST}
  port: ${REDIS_PORT}
  password: ${REDIS_PASSWORD}
  db: 0
```

**Explanation:**

The `oncall` section includes:

1. `enable`: A boolean to toggle on-call functionality for all incidents (default: `false`).
2. `initialized_only`: Initialize on-call feature but keep it disabled by default. When set to `true`, on-call is triggered only for requests that explicitly include `?oncall_enable=true` in the URL. This is useful for having on-call ready but not enabled for all alerts.
3. `wait_minutes`: Time in minutes to wait for an acknowledgment before escalating (default: `3`). Setting it to `0` triggers the on-call immediately.
4. `provider`: Specifies which on-call provider to use ("aws_incident_manager" or "pagerduty").
5. `aws_incident_manager`: Configuration for AWS Incident Manager when it's the selected provider, including `response_plan_arn` and `other_response_plan_arns`.
6. `pagerduty`: Configuration for PagerDuty when it's the selected provider, including routing keys.

The redis section is required when `oncall.enable` or `oncall.initialized_only` is true. It configures the Redis instance used for state management or queuing, with settings like host, port, password, and db.

For detailed information on integration, please refer to the document here: [On-Call setup with Versus](https://versuscontrol.github.io/versus-incident/oncall/on-call-introduction.html).
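
The per-request overrides mentioned in the config comments are ordinary query parameters on `/api/incidents`. As a small illustration, the Python helper below assembles such a URL. The function name `incident_url` is hypothetical, and sending several overrides in one request is an assumption; the parameter names themselves come from the configuration comments above.

```python
from urllib.parse import urlencode


def incident_url(base_url, **overrides):
    """Append per-request on-call overrides to the /api/incidents endpoint."""
    params = {}
    for key, value in overrides.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # the docs use lowercase booleans
        params[key] = value
    query = urlencode(params)
    url = base_url.rstrip("/") + "/api/incidents"
    return f"{url}?{query}" if query else url


# Page immediately, using the response plan registered under the "prod" key.
print(incident_url(
    "http://localhost:3000",
    oncall_enable=True,
    oncall_wait_minutes=0,
    awsim_other_response_plan="prod",
))
```

With `initialized_only: true`, a monitoring tool would POST to a URL like this only for the alerts that should page someone.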

## Complete Configuration

A sample configuration file is located at `config/config.yaml`:

```yaml
name: versus
host: 0.0.0.0
port: 3000
public_host: https://your-ack-host.example # Required for on-call ack

# Proxy configuration (global settings)
# Use this when your network blocks access to messaging services like Telegram, Viber, or Lark
proxy:
  url: ${PROXY_URL} # HTTP/HTTPS/SOCKS5 proxy URL (e.g., http://proxy.example.com:8080)
  username: ${PROXY_USERNAME} # Optional proxy username for authenticated proxies
  password: ${PROXY_PASSWORD} # Optional proxy password for authenticated proxies

alert:
  debug_body: true # Default value, will be overridden by DEBUG_BODY env var

  slack:
    enable: false # Default value, will be overridden by SLACK_ENABLE env var
    token: ${SLACK_TOKEN} # From environment
    channel_id: ${SLACK_CHANNEL_ID} # From environment
    template_path: "config/slack_message.tmpl"
    message_properties:
      button_text: "Acknowledge Alert" # Custom text for the acknowledgment button
      button_style: "primary" # Button style: "primary" (default blue), "danger" (red), or empty for default gray
      disable_button: false # Set to true to disable the button, if you want to handle acknowledgment differently

  telegram:
    enable: false # Default value, will be overridden by TELEGRAM_ENABLE env var
    bot_token: ${TELEGRAM_BOT_TOKEN} # From environment
    chat_id: ${TELEGRAM_CHAT_ID} # From environment
    template_path: "config/telegram_message.tmpl"
    use_proxy: false # Set to true to use global proxy settings for Telegram API calls

  viber:
    enable: false # Default value, will be overridden by VIBER_ENABLE env var
    bot_token: ${VIBER_BOT_TOKEN} # From environment (token for bot or channel)
    api_type: ${VIBER_API_TYPE} # From environment - "channel" (default) or "bot"
    # Channel API (recommended for incident management)
    channel_id: ${VIBER_CHANNEL_ID} # From environment (required for channel API)
    # Bot API (for individual user notifications)
    user_id: ${VIBER_USER_ID} # From environment (required for bot API)
    template_path: "config/viber_message.tmpl"
    use_proxy: false # Set to true to use global proxy settings for Viber API calls

  email:
    enable: false # Default value, will be overridden by EMAIL_ENABLE env var
    smtp_host: ${SMTP_HOST} # From environment
    smtp_port: ${SMTP_PORT} # From environment
    username: ${SMTP_USERNAME} # From environment
    password: ${SMTP_PASSWORD} # From environment
    to: ${EMAIL_TO} # From environment, can contain multiple comma-separated email addresses
    subject: ${EMAIL_SUBJECT} # From environment
    template_path: "config/email_message.tmpl"

  msteams:
    enable: false # Default value, will be overridden by MSTEAMS_ENABLE env var
    power_automate_url: ${MSTEAMS_POWER_AUTOMATE_URL} # Automatically works with both Power Automate workflow URLs and legacy Office 365 webhooks
    template_path: "config/msteams_message.tmpl"
    other_power_urls: # Optional: Define additional Power Automate URLs for multiple MS Teams channels
      qc: ${MSTEAMS_OTHER_POWER_URL_QC} # Power Automate URL for QC team
      ops: ${MSTEAMS_OTHER_POWER_URL_OPS} # Power Automate URL for Ops team
      dev: ${MSTEAMS_OTHER_POWER_URL_DEV} # Power Automate URL for Dev team

  lark:
    enable: false # Default value, will be overridden by LARK_ENABLE env var
    webhook_url: ${LARK_WEBHOOK_URL} # Lark webhook URL (required)
    template_path: "config/lark_message.tmpl"
    use_proxy: false # Set to true to use global proxy settings for Lark API calls
    other_webhook_urls: # Optional: Enable overriding the default webhook URL using query parameters, eg /api/incidents?lark_other_webhook_url=dev
      dev: ${LARK_OTHER_WEBHOOK_URL_DEV}
      prod: ${LARK_OTHER_WEBHOOK_URL_PROD}

queue:
  enable: true
  debug_body: true

  # AWS SNS
  sns:
    enable: false
    https_endpoint_subscription_path: /sns # URI to receive SNS messages, e.g. ${host}:${port}/sns or ${https_endpoint_subscription}/sns
    # Optional: set this if you want an SNS subscription to be created automatically
    https_endpoint_subscription: ${SNS_HTTPS_ENDPOINT_SUBSCRIPTION} # If the user configures an HTTPS endpoint, then an SNS subscription will be automatically created, e.g. https://your-domain.com
    topic_arn: ${SNS_TOPIC_ARN}

  # AWS SQS
  sqs:
    enable: false
    queue_url: ${SQS_QUEUE_URL}

  # GCP Pub Sub
  pubsub:
    enable: false

  # Azure Event Bus
  azbus:
    enable: false

oncall:
  ### Enable overriding using query parameters
  # /api/incidents?oncall_enable=false => Set to `true` or `false` to enable or disable on-call for a specific alert
  # /api/incidents?oncall_wait_minutes=0 => Set the number of minutes to wait for acknowledgment before triggering on-call. Set to `0` to trigger immediately
  initialized_only: false # Initialize on-call feature but don't enable by default; use query param oncall_enable=true to enable for specific requests
  enable: false # Use this to enable or disable on-call for all alerts
  wait_minutes: 3 # If you set it to 0, it means there's no need to check for an acknowledgment, and the on-call will trigger immediately
  provider: aws_incident_manager # Valid values: "aws_incident_manager" or "pagerduty"

  aws_incident_manager: # Used when provider is "aws_incident_manager"
    response_plan_arn: ${AWS_INCIDENT_MANAGER_RESPONSE_PLAN_ARN}
    other_response_plan_arns: # Optional: Enable overriding the default response plan ARN using query parameters, eg /api/incidents?awsim_other_response_plan=prod
      prod: ${AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_PROD}
      dev: ${AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_DEV}
      staging: ${AWS_INCIDENT_MANAGER_OTHER_RESPONSE_PLAN_ARN_STAGING}

  pagerduty: # Used when provider is "pagerduty"
    routing_key: ${PAGERDUTY_ROUTING_KEY} # Integration/Routing key for Events API v2 (REQUIRED)
    other_routing_keys: # Optional: Enable overriding the default routing key using query parameters, eg /api/incidents?pagerduty_other_routing_key=infra
      infra: ${PAGERDUTY_OTHER_ROUTING_KEY_INFRA}
      app: ${PAGERDUTY_OTHER_ROUTING_KEY_APP}
      db: ${PAGERDUTY_OTHER_ROUTING_KEY_DB}

redis: # Required for on-call functionality
  insecure_skip_verify: true # dev only
  host: ${REDIS_HOST}
  port: ${REDIS_PORT}
  password: ${REDIS_PASSWORD}
  db: 0

# -----------------------------------------------------------------------------
# AI agent mode (training | shadow | detect) — opt-in.
#
# When agent.enable=false (the default), nothing changes: no goroutines start,
# no new dependencies are loaded, no Redis keys are created.
#
# Recommended rollout:
#   1: mode=training, review the catalog via /api/agent/patterns
#   2: mode=shadow, review log lines `agent[shadow]: would alert ...`
#   3: mode=detect (AI emission ships in a follow-up milestone)
#
# -----------------------------------------------------------------------------
agent:
  enable: false # master switch (env: AGENT_ENABLE)
  mode: training # training | shadow | detect (env: AGENT_MODE)
  poll_interval: 30s # how often each source is pulled
  lookback: 5m # initial backfill window on startup
  batch_max: 1000 # safety cap per tick
  signal_max_bytes: 8192 # cap on Signal.Raw

  # Signal sources are kept in a separate file so users can manage them
  # independently of the main config. Path is resolved relative to this
  # config file. Override via env: AGENT_SOURCES_PATH.
  sources_path: ./agent_sources.yaml

  redaction:
    enable: true
    redact_ips: false # IPs are usually useful context; opt-in
    extra_patterns:
      - "(?i)password=\\S+"
      - "Authorization:\\s*Bearer\\s+\\S+"

  catalog:
    persist_interval: 30s
    auto_promote_after: 100 # in detect mode, this many sightings = "known"

  miner:
    similarity_threshold: 0.4
    tree_depth: 4
    max_children: 100

  regex:
    # Set to ".*" to train on every line; leave empty to require
    # an explicit named rule match.
    default_pattern: "(?i).*error.*"
    rules:
      - name: oom-killer
        pattern: "Out of memory: Killed process"
      - name: panic
        pattern: "(?i)panic:"
      - name: 5xx-burst
        pattern: "HTTP/[0-9.]+\\s+5\\d\\d"
```

**For the detailed configuration, see [Detailed Configuration](https://versus-incident.devopsvn.tech/configuration/configuration.html)**

## Roadmap

See [ROADMAP.md](ROADMAP.md) for the full list of shipped features, work in progress, and planned phases (more log sources, metrics, traces, cross-signal correlation).

## Support The Project

[GitHub Sponsors](https://github.com/sponsors/versuscontrol) · see [SPONSORS.md](SPONSORS.md)

## Contributing

Contributions are welcome. Please read [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, coding standards, and the PR process, and review the [Code of Conduct](CODE_OF_CONDUCT.md) and [security policy](SECURITY.md) before reporting vulnerabilities. Project governance is documented in [GOVERNANCE.md](GOVERNANCE.md).

## License

Distributed under the MIT License. See `LICENSE` for more information.

AI Agents Incident Management

489 GitHub Stars