Incidents
Overview
An incident represents an abnormal state detected by a sensor. When a check fails — a server is unreachable, a database query returns unexpected results, a certificate is about to expire — Onagre creates an incident and triggers notifications through your configured integrations.
Each incident tracks the complete lifecycle of a problem, from detection through acknowledgement to resolution, with a full audit trail of events, notifications, and team comments.
You can view and manage incidents from Monitoring → Incidents in the Onagre dashboard, or directly from the native mobile app.
Incident dashboard
The incidents page summarizes the state of your monitoring: live counters, reliability metrics, a 30-day trend chart, sidebar rankings, and the full incident list.
Key metrics
The top of the page displays real-time counters:
- Opened — Active incidents requiring attention.
- Acknowledged — Incidents someone is investigating.
- Resolved — Incidents that have been fixed.
- Uptime — Overall uptime percentage across your sensors.
Operational metrics
Onagre computes three standard reliability indicators:
| Metric | Description |
|---|---|
| MTTR | Mean Time To Recovery — average time from incident creation to resolution |
| MTBF | Mean Time Between Failures — average time between incidents |
| MTTA | Mean Time To Acknowledge — average time from creation to acknowledgement |
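To make the three definitions concrete, here is a minimal sketch of how they could be computed from a list of incident records. The `Incident` fields and function names are hypothetical and do not reflect Onagre's internal schema. The MTBF function assumes one particular reading of "time between incidents": the gap between consecutive creation times.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record shape; field names are assumptions, not Onagre's schema.
    created_at: datetime
    acknowledged_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None


def _mean(deltas):
    """Average a list of durations; None when there is nothing to average."""
    if not deltas:
        return None
    return sum(deltas, timedelta()) / len(deltas)


def mttr(incidents):
    """Mean time from creation to resolution, over resolved incidents only."""
    return _mean([i.resolved_at - i.created_at for i in incidents if i.resolved_at])


def mtta(incidents):
    """Mean time from creation to acknowledgement, over acknowledged incidents only."""
    return _mean(
        [i.acknowledged_at - i.created_at for i in incidents if i.acknowledged_at]
    )


def mtbf(incidents):
    """Mean gap between consecutive creation times (assumed interpretation)."""
    starts = sorted(i.created_at for i in incidents)
    return _mean([later - earlier for earlier, later in zip(starts, starts[1:])])
```

Incidents that were never acknowledged or are still ongoing are left out of the corresponding average rather than counted as zero, which would otherwise skew the result.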
Trend chart
A 30-day stacked area chart shows incident volume over time, broken down by severity level. This helps identify patterns and recurring issues.
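The data behind such a chart is a per-day count for each severity. The sketch below shows one way to build that series. The input shape (pairs of day and severity) and the function name are assumptions made for illustration.

```python
from collections import Counter
from datetime import date, timedelta

SEVERITIES = ("Critical", "Major", "Minor", "Info")


def daily_counts_by_severity(incidents, today, days=30):
    """Return one row per day (oldest first) with an incident count per severity.

    `incidents` is an iterable of (day, severity) pairs; this shape is hypothetical.
    """
    start = today - timedelta(days=days - 1)
    counts = Counter(
        (day, severity) for day, severity in incidents if start <= day <= today
    )
    rows = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        # Days without incidents still get a row, so the chart has no gaps.
        rows.append({"day": day, **{s: counts[(day, s)] for s in SEVERITIES}})
    return rows
```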
Sidebar insights
- Longest downtime — Top 5 sensors by cumulative incident duration.
- Most incidents — Top 5 sensors by incident count.
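Both rankings are simple aggregations per sensor. As an illustrative sketch (the input shape and function name are assumptions, not Onagre's API):

```python
from collections import Counter, defaultdict
from datetime import timedelta


def top_sensors(incidents, limit=5):
    """Rank sensors by cumulative incident duration and by incident count.

    `incidents` is an iterable of (sensor_name, duration) pairs (hypothetical shape).
    """
    total_downtime = defaultdict(timedelta)
    incident_count = Counter()
    for sensor, duration in incidents:
        total_downtime[sensor] += duration
        incident_count[sensor] += 1

    longest_downtime = sorted(
        total_downtime.items(), key=lambda item: item[1], reverse=True
    )[:limit]
    most_incidents = incident_count.most_common(limit)
    return longest_downtime, most_incidents
```

The two lists can differ noticeably: a sensor with one long outage leads the first ranking, while a flapping sensor with many short incidents leads the second.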
Incident list
The main table lists all incidents with the following information:
- Incident ID and timestamp.
- Reason — A human-readable description of what went wrong.
- Severity — Critical, Major, Minor, or Info (color-coded).
- Status — Active, Acknowledged, or Resolved.
- Sensor — The sensor that triggered the incident (clickable link).
- Duration — Elapsed time, or a pulsing "Ongoing" indicator for active incidents.
Filtering
You can filter incidents by:
- Status — All, Opened, Acknowledged, or Resolved (tab navigation).
- Severity — All, Critical, Major, Minor, or Info.
- Search — Full-text search by sensor name.
Pagination lets you choose 10, 25, or 50 incidents per page.
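The filters combine with a logical AND, and pagination is applied to the filtered result. A minimal sketch of that behavior follows. The `Row` type and function names are hypothetical, and the search is modeled as a case-insensitive substring match on the sensor name, which is an assumption about how the dashboard matches text.

```python
from dataclasses import dataclass

PAGE_SIZES = (10, 25, 50)  # page sizes offered by the incident list


@dataclass
class Row:
    # Hypothetical row shape for illustration only.
    sensor: str
    status: str    # "Opened", "Acknowledged", or "Resolved"
    severity: str  # "Critical", "Major", "Minor", or "Info"


def filter_incidents(rows, status=None, severity=None, search=""):
    """Keep rows matching every active filter; None or "" means "All"."""
    needle = search.strip().lower()
    return [
        row
        for row in rows
        if (status is None or row.status == status)
        and (severity is None or row.severity == severity)
        and needle in row.sensor.lower()
    ]


def paginate(rows, page, per_page=25):
    """Return the 1-indexed `page` of `rows`."""
    if per_page not in PAGE_SIZES:
        raise ValueError(f"per_page must be one of {PAGE_SIZES}")
    start = (page - 1) * per_page
    return rows[start:start + per_page]
```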
Incident lifecycle
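Every incident follows a three-stage lifecycle, although an incident whose sensor recovers before anyone acknowledges it skips the Acknowledged stage: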
Active → Acknowledged → Resolved
Active
The sensor has detected a failure. Onagre creates the incident, triggers notifications through your integrations, and starts tracking duration. The incident remains active until someone acknowledges it or the problem resolves itself.
Acknowledged
A team member has acknowledged the incident, signaling that someone is investigating. Acknowledgement requires a comment describing the actions being taken. This prevents duplicate investigations and keeps the team informed.
Resolved
The sensor check has returned to a healthy state. Onagre automatically resolves the incident and records the total duration. No further action is needed.
💡 Auto-resolution
Incidents are resolved automatically when the sensor reports a successful check. You do not need to close them manually.
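The lifecycle can be read as a small state machine. The sketch below models the rules described above: acknowledgement requires a comment and is only possible from the Active state, and a healthy check resolves the incident from either open state. Class and method names are hypothetical and are not part of any Onagre API.

```python
from enum import Enum


class Status(Enum):
    ACTIVE = "Active"
    ACKNOWLEDGED = "Acknowledged"
    RESOLVED = "Resolved"


class Incident:
    """Illustrative model of the incident lifecycle (not Onagre's implementation)."""

    def __init__(self):
        self.status = Status.ACTIVE
        self.ack_comment = None

    def acknowledge(self, comment):
        """Move Active -> Acknowledged; a non-empty comment is mandatory."""
        if self.status is not Status.ACTIVE:
            raise ValueError("only active incidents can be acknowledged")
        if not comment.strip():
            raise ValueError("acknowledgement requires a comment")
        self.status = Status.ACKNOWLEDGED
        self.ack_comment = comment

    def on_check_result(self, healthy):
        """Called after each sensor check; a healthy result resolves the incident."""
        if healthy and self.status is not Status.RESOLVED:
            self.status = Status.RESOLVED
```

Note that there is no manual `resolve` method: in this model, as in the text, resolution only happens as a side effect of a successful check.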
Severity levels
Each sensor is configured with a severity level that determines the priority of its incidents:
| Severity | Color | Use case |
|---|---|---|
| Critical | Red | Production-breaking issues requiring immediate action |
| Major | Orange | Significant problems affecting service quality |
| Minor | Yellow | Degraded functionality with limited user impact |
| Info | Blue | Informational alerts for awareness |
Severity is defined when creating or editing a sensor. It cannot be changed on an individual incident.
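One way to picture this rule is that an incident copies its severity from the sensor at creation time and then treats it as read-only. The following sketch models that with a frozen dataclass. All type names, the numeric ordering of severities, and the lowercase color strings are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Lower value means higher priority; this numeric ordering is an assumption.
    CRITICAL = 0
    MAJOR = 1
    MINOR = 2
    INFO = 3


COLORS = {
    Severity.CRITICAL: "red",
    Severity.MAJOR: "orange",
    Severity.MINOR: "yellow",
    Severity.INFO: "blue",
}


@dataclass
class Sensor:
    name: str
    severity: Severity  # editable on the sensor


@dataclass(frozen=True)
class Incident:
    sensor_name: str
    severity: Severity  # copied from the sensor, read-only afterwards


def open_incident(sensor):
    """Create an incident that inherits the sensor's configured severity."""
    return Incident(sensor_name=sensor.name, severity=sensor.severity)
```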
Incident detail page
Click on any incident to open its detail page, which provides full context:
Summary cards
- Duration — Elapsed time, updated live for active incidents.
- Sensor — Link to the sensor that triggered the incident.
- Agent — The agent that executed the check.
- Notifications — Count of sent and failed alert deliveries.
Incident reason
The specific details of what went wrong, displayed in a monospace font. The content depends on the sensor type: for example, an HTTP sensor shows the status code, while a SQL sensor shows the query error.
Timeline
A chronological event log showing every state change:
- Incident triggered — The initial failure detection with details.
- Check retry — Subsequent check attempts that still fail.
- Acknowledged by [User] — Who acknowledged the incident and when, with their comment.
- Incident resolved — When the problem was fixed.
- Alert sent — Notification deliveries with integration name and status.
Notifications
A list of all alert deliveries for this incident, showing the integration type (Slack, Discord, Webhook, etc.), name, timestamp, and delivery status (Sent or Failed).
Comments
A team discussion thread attached to the incident. Any team member can leave comments to share context, progress updates, or resolution notes. Comments are displayed with the author name and timestamp.
Dependencies
A diagram showing the sensor that triggered the incident and any dependent sensors, with their current status. This helps you gauge the broader impact of the failure.
User actions
| Action | When available | Description |
|---|---|---|
| Acknowledge | Active incidents only | Mark as being investigated, with a required comment |
| Add comment | Any open incident | Contribute to the team discussion thread |
| View sensor | Always | Navigate to the sensor detail page |
| View agent | Always | Navigate to the agent detail page |
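The availability rules in this table can be expressed as a small lookup. In the sketch below, the function name is hypothetical, and "open" is assumed to mean any incident that is not yet resolved (Active or Acknowledged).

```python
def available_actions(status):
    """Return the actions offered for an incident in the given status."""
    actions = []
    if status == "Active":
        actions.append("Acknowledge")
    # Assumption: "open" covers both Active and Acknowledged incidents.
    if status in ("Active", "Acknowledged"):
        actions.append("Add comment")
    # Navigation links are always available.
    actions += ["View sensor", "View agent"]
    return actions
```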
Summary
| Aspect | Details |
|---|---|
| Access | Monitoring → Incidents |
| Lifecycle | Active → Acknowledged → Resolved |
| Severity | Critical, Major, Minor, Info |
| Metrics | MTTR, MTBF, MTTA, uptime |
| Actions | Acknowledge (with comment), add comments |
| Resolution | Automatic when sensor returns to healthy state |
| Mobile | View and acknowledge from the native app |