Overview
AxonOps provides various integrations for notifications.
The functionality is accessible via Settings > Integrations.
The current integrations are:
- SMTP / Email
- PagerDuty
- Slack
- Microsoft Teams
- ServiceNow
- OpsGenie
- Generic webhooks
- Log file (configurable through axon-server.yml)
Incident Management Integration
AxonOps is designed as a monitoring and alerting system that:
- Detects issues
- Triggers alerts
- Sends recovery events when conditions return to normal
However, AxonOps is not intended to replace dedicated incident management platforms like PagerDuty or OpsGenie.
Incident management platforms provide capabilities such as:
- Converting alerts into incidents with defined workflows
- Escalation policies when initial responders don't acknowledge
- Repeat notifications until someone takes action
- Acknowledgment to pause notifications while investigating
- Auto-resolution when recovery events arrive
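The lifecycle described above (repeat notifications until acknowledged, pause on acknowledgment, auto-resolve on recovery) can be sketched roughly as follows. This is an illustrative model only: the `Incident` and `IncidentTracker` names and the `"alert"` / `"recovery"` event kinds are hypothetical and do not correspond to the AxonOps, PagerDuty, or OpsGenie APIs.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    key: str                      # hypothetical deduplication key, e.g. "node1:disk"
    status: str = "triggered"     # triggered -> acknowledged -> resolved
    notifications_sent: int = 0


class IncidentTracker:
    """Toy model of how an incident platform might react to alert and recovery events."""

    def __init__(self):
        self.open = {}  # open incidents, indexed by key

    def handle_event(self, key, kind):
        if kind == "alert":
            # Reuse the open incident for this key, or create a new one.
            incident = self.open.setdefault(key, Incident(key))
            # Keep notifying only while nobody has acknowledged the incident.
            if incident.status == "triggered":
                incident.notifications_sent += 1
            return incident
        if kind == "recovery":
            # A recovery event resolves and closes the matching incident, if any.
            incident = self.open.pop(key, None)
            if incident is not None:
                incident.status = "resolved"
            return incident
        raise ValueError(f"unknown event kind: {kind!r}")

    def acknowledge(self, key):
        # Acknowledgment pauses further notifications for this incident.
        self.open[key].status = "acknowledged"
```

In this sketch, repeated alerts for the same key raise the notification count until `acknowledge` is called, and a later recovery event closes the incident without any manual step.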
Reducing Alert Fatigue
One of the most valuable features of incident management platforms is alert grouping. When a systemic issue affects your Cassandra or Kafka cluster, it often triggers alerts from multiple nodes simultaneously. Without grouping, an on-call engineer might receive dozens of notifications for what is essentially a single incident.
Alert grouping consolidates related alerts into a single incident, providing clarity on the nature of the outage while dramatically reducing notification noise.
For more information on configuring alert grouping and incident rules, see:
- OpsGenie: Automatically Create an Incident via Incident Rules - Configure rules to automatically create incidents from matching alerts, with built-in deduplication
- PagerDuty: Content-Based Alert Grouping - Group alerts based on matching field values like source, component, or severity
- PagerDuty: Time-Based Alert Grouping - Group all alerts on a service within a specified time window
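To make the idea of content-based grouping concrete, here is a minimal sketch that groups alerts by matching field values. The alert dictionaries and the field names (`cluster`, `alert_name`, `node`) are assumptions made for illustration; they are not the payload format used by AxonOps or by any incident management platform.

```python
from collections import defaultdict


def group_alerts(alerts, fields=("cluster", "alert_name")):
    """Group alerts that share the same values for the given fields."""
    groups = defaultdict(list)
    for alert in alerts:
        # Alerts with identical values for every field land in the same group.
        key = tuple(alert.get(f) for f in fields)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts raised by three nodes during one systemic issue.
example_alerts = [
    {"cluster": "prod", "alert_name": "high_read_latency", "node": "10.0.0.1"},
    {"cluster": "prod", "alert_name": "high_read_latency", "node": "10.0.0.2"},
    {"cluster": "prod", "alert_name": "high_read_latency", "node": "10.0.0.3"},
]
incidents = group_alerts(example_alerts)
```

Here the three node-level alerts collapse into a single group keyed by cluster and alert name, which is the effect grouping has on notification volume.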
Routing
AxonOps provides a rich routing mechanism for notifications.
The current routing options are:
- Global - routes all notifications
- Metrics - notifications about alerts on metrics
- Backups - notifications about backups and restores
- Service Checks - notifications about service checks and health checks
- Nodes - notifications raised from the nodes
- Commands - notifications from generic tasks
- Repairs - notifications from Cassandra repairs
- Rolling Restart - notifications from the rolling restart feature
Each severity (info, warning, error) can be routed independently.
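One way to picture routing by category and severity is a lookup table keyed on both. The sketch below is purely illustrative: the integration names are made up, and the rule that Global routes are added to category-specific routes is an assumption, not a statement of how AxonOps resolves routes internally.

```python
# Hypothetical routing table: (category, severity) -> list of integration names.
EXAMPLE_ROUTES = {
    ("Global", "error"): ["pagerduty-oncall"],
    ("Global", "warning"): ["slack-ops"],
    ("Backups", "warning"): ["email-backup-team"],
    ("Repairs", "error"): ["pagerduty-oncall", "slack-repairs"],
}


def resolve_targets(routes, category, severity):
    """Return the integrations that receive a notification of this category and severity."""
    specific = routes.get((category, severity), [])
    # Assumption: Global routes apply to every notification of that severity.
    global_targets = routes.get(("Global", severity), [])
    # Preserve order while removing duplicates.
    return list(dict.fromkeys(specific + global_targets))
```

With this table, a Backups warning would reach both the backup team's email route and the global warning route, while an info notification with no configured route would go nowhere.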
Errors per routing mechanism and severity level
Backup
| Source | Severity | Description |
|---|---|---|
| Backup | Critical | Any error returned by the third-party remote location providers |
| Backup | Warning | Clear local snapshots timed out |
| Backup | Warning | Unable to find local snapshot |
| Backup | Warning | Local backup process errors |
| Backup | Warning | Clear remote snapshot timed out |
| Backup | Warning | Remote backup process errors |
| Backup | Warning | Unable to find remote snapshot |
| Backup | Warning | Backup not triggered (Backups paused) |
| Backup | Warning | Failed to create backup |
| Backup | Warning | Failed to create remote config for backups |
| Backup | Warning | Create Cassandra snapshot failed |
| Backup | Warning | Snapshot request timed out |
| Backup | Warning | Cassandra node is inactive |
| Backup | Info | Local backup created successfully |
| Backup | Info | Backup deleted successfully |
Repair
| Source | Severity | Description |
|---|---|---|
| Repair | Critical | Update repairs error; can be caused by tables being created or removed while a repair is running |
| Repair | Critical | Any error generated by Cassandra for a repair process |
| Repair | Critical | Repair job is over 60% complete and the estimated time to completion is after gc_grace deadline |
| Repair | Warning | Repair job is over 40% complete and the estimated time to completion is after gc_grace deadline |
| Repair | Warning | Repair segment failed |
| Repair | Warning | Repair segment timed out |
| Repair | Warning | Cassandra repair error after n retries |
| Repair | Warning | Repair unit errors |
| Repair | Warning | Repair errors for nonexistent correlation ID |
| Repair | Warning | Repair request timed out after n attempts to connect to host |
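
The two gc_grace rows in the table above can be read as a threshold check on repair progress. The following sketch shows one possible reading of those rows; the function name, its parameters, and the strict "greater than" comparisons are assumptions, not AxonOps code.

```python
from datetime import datetime, timedelta


def repair_deadline_severity(percent_complete, estimated_completion, gc_grace_deadline):
    """Return 'critical', 'warning', or None for a running repair job."""
    # No alert while the repair is still expected to finish before the deadline.
    if estimated_completion <= gc_grace_deadline:
        return None
    if percent_complete > 60:
        return "critical"
    if percent_complete > 40:
        return "warning"
    return None


# Hypothetical repair that is 50% complete and expected to overrun by one day.
deadline = datetime(2024, 1, 10)
severity = repair_deadline_severity(50, deadline + timedelta(days=1), deadline)
```

In this example the repair is past 40% but not past 60%, and its estimate falls after the deadline, so the result is a warning.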

