Incidents

Overview

An incident represents an abnormal state detected by a sensor. When a check fails — a server is unreachable, a database query returns unexpected results, a certificate is about to expire — Onagre creates an incident and triggers notifications through your configured integrations.

Incidents track the complete lifecycle of a problem, from detection through acknowledgement to resolution, with a full audit trail of events, notifications, and team comments.

You can view and manage incidents from Monitoring → Incidents in the Onagre dashboard, or directly from the native mobile app.

Incident dashboard

The incidents page provides a comprehensive overview of your monitoring health.

Key metrics

The top of the page displays real-time counters:

Opened — Active incidents requiring attention.
Acknowledged — Incidents someone is investigating.
Resolved — Incidents that have been fixed.
Uptime — Overall uptime percentage across your sensors.

Operational metrics

Onagre computes three standard reliability indicators:

Metric	Description
MTTR	Mean Time To Recovery — average time from incident creation to resolution
MTBF	Mean Time Between Failures — average time between incidents
MTTA	Mean Time To Acknowledge — average time from creation to acknowledgement
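As an illustration of how these three averages relate to incident timestamps, here is a minimal Python sketch. The `Incident` record and its field names are hypothetical, not Onagre's actual data model. MTBF is taken here as the mean gap between consecutive incident creation times, which is one common reading of "time between incidents"; Onagre may compute it differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record shape; field names are illustrative only.
    created_at: datetime
    acknowledged_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None


def _mean(deltas: list[timedelta]) -> Optional[timedelta]:
    """Average a list of durations, or None when the list is empty."""
    if not deltas:
        return None
    return sum(deltas, timedelta()) / len(deltas)


def mttr(incidents: list[Incident]) -> Optional[timedelta]:
    """Mean time from creation to resolution, over resolved incidents."""
    return _mean([i.resolved_at - i.created_at for i in incidents if i.resolved_at])


def mtta(incidents: list[Incident]) -> Optional[timedelta]:
    """Mean time from creation to acknowledgement, over acknowledged incidents."""
    return _mean(
        [i.acknowledged_at - i.created_at for i in incidents if i.acknowledged_at]
    )


def mtbf(incidents: list[Incident]) -> Optional[timedelta]:
    """Mean gap between consecutive incident creation times (assumed definition)."""
    starts = sorted(i.created_at for i in incidents)
    return _mean([later - earlier for earlier, later in zip(starts, starts[1:])])
```

Incidents that were never acknowledged are left out of the MTTA average rather than counted as zero, so unacknowledged incidents do not make the figure look better than it is.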

Trend chart

A 30-day stacked area chart shows incident volume over time, broken down by severity level. This helps identify patterns and recurring issues.

Two rankings accompany the chart:

Longest downtime — Top 5 sensors by cumulative incident duration.
Most incidents — Top 5 sensors by incident count.
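The two rankings above amount to grouping incidents by sensor, then sorting by total duration or by count. A rough Python sketch follows, assuming incidents are available as hypothetical `(sensor_name, duration)` pairs; this is not how Onagre necessarily stores or queries them.

```python
from collections import Counter
from datetime import timedelta


def top_sensors(incidents, n=5):
    """Return (longest_downtime, most_incidents), each limited to n sensors.

    `incidents` is an iterable of (sensor_name, duration) pairs, a
    hypothetical input shape chosen for this example.
    """
    counts = Counter()
    downtime = {}
    for sensor, duration in incidents:
        counts[sensor] += 1
        downtime[sensor] = downtime.get(sensor, timedelta()) + duration

    # Rank by cumulative incident duration, longest first.
    longest_downtime = sorted(
        downtime.items(), key=lambda item: item[1], reverse=True
    )[:n]
    # Rank by number of incidents, most frequent first.
    most_incidents = counts.most_common(n)
    return longest_downtime, most_incidents
```

The two lists can differ noticeably: a sensor with one long outage leads the first ranking, while a flapping sensor with many short incidents leads the second.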

Incident list

The main table lists all incidents with the following information:

Incident ID and timestamp.
Reason — A human-readable description of what went wrong.
Severity — Critical, Major, Minor, or Info (color-coded).
Status — Active, Acknowledged, or Resolved.
Sensor — The sensor that triggered the incident (clickable link).
Duration — Elapsed time, or a pulsing "Ongoing" indicator for active incidents.

Filtering

You can filter incidents by:

Status — All, Opened, Acknowledged, or Resolved (tab navigation).
Severity — All, Critical, Major, Minor, or Info.
Search — Full-text search by sensor name.

Pagination lets you choose 10, 25, or 50 incidents per page.
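To make the combination of filters and pagination concrete, here is a small Python sketch. The `Row` type, the function names, and the case-insensitive substring match on the sensor name are all assumptions made for illustration; the status values follow the filter tab labels listed above.

```python
from dataclasses import dataclass

# Page sizes offered by the incident list, per the text above.
PAGE_SIZES = (10, 25, 50)


@dataclass
class Row:
    # Hypothetical row shape for this example.
    sensor: str
    status: str    # "Opened", "Acknowledged", or "Resolved"
    severity: str  # "Critical", "Major", "Minor", or "Info"


def filter_incidents(rows, status="All", severity="All", search=""):
    """Keep rows matching every filter; "All" and an empty search match everything."""
    needle = search.strip().lower()
    return [
        row
        for row in rows
        if status in ("All", row.status)
        and severity in ("All", row.severity)
        and needle in row.sensor.lower()  # assumed: case-insensitive substring match
    ]


def paginate(rows, page=1, per_page=25):
    """Return the 1-indexed page of rows, allowing only the supported page sizes."""
    if per_page not in PAGE_SIZES:
        raise ValueError(f"per_page must be one of {PAGE_SIZES}")
    start = (page - 1) * per_page
    return rows[start:start + per_page]
```

Filters are applied before pagination, so the page count reflects the filtered result rather than the full incident list.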

Incident lifecycle

Every incident follows a three-stage lifecycle:

Active  →  Acknowledged  →  Resolved

Active

The sensor has detected a failure. Onagre creates the incident, triggers notifications through your integrations, and starts tracking duration. The incident remains active until someone acknowledges it or the problem resolves itself.

Acknowledged

A team member has acknowledged the incident, signaling that someone is investigating. Acknowledgement requires a comment describing the actions being taken. This prevents duplicate investigations and keeps the team informed.

Resolved

The sensor check has returned to a healthy state. Onagre automatically resolves the incident and records the total duration. No further action is needed.

💡 Auto-resolution

Incidents are resolved automatically when the sensor reports a successful check. You do not need to close them manually.
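The lifecycle can be read as a small state machine. The Python sketch below models the rules stated in this section: acknowledgement applies to active incidents and requires a comment, and resolution happens when the sensor reports a healthy check. Class and method names are hypothetical and do not reflect Onagre's internals.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    ACTIVE = "Active"
    ACKNOWLEDGED = "Acknowledged"
    RESOLVED = "Resolved"


@dataclass
class Incident:
    # Hypothetical model used only to illustrate the transitions.
    status: Status = Status.ACTIVE
    comments: list = field(default_factory=list)

    def acknowledge(self, user: str, comment: str) -> None:
        """Active -> Acknowledged; a non-empty comment is required."""
        if self.status is not Status.ACTIVE:
            raise ValueError("only active incidents can be acknowledged")
        if not comment.strip():
            raise ValueError("acknowledgement requires a comment")
        self.comments.append((user, comment))
        self.status = Status.ACKNOWLEDGED

    def on_check_result(self, healthy: bool) -> None:
        """Resolve automatically when the sensor reports a successful check."""
        if healthy and self.status is not Status.RESOLVED:
            self.status = Status.RESOLVED
```

Note that an incident may go from Active straight to Resolved if the problem clears before anyone acknowledges it.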

Severity levels

Each sensor is configured with a severity level that determines the priority of its incidents:

Severity	Color	Use case
Critical	Red	Production-breaking issues requiring immediate action
Major	Orange	Significant problems affecting service quality
Minor	Yellow	Degraded functionality with limited user impact
Info	Blue	Informational alerts for awareness

Severity is defined when creating or editing a sensor. It cannot be changed on an individual incident.
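One way to picture this rule is that an incident copies the severity of its sensor at creation time and treats it as read-only afterwards. The Python sketch below illustrates that idea; the types, the numeric ranks used for sorting, and the `open_incident` helper are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Numeric ranks are an assumption that makes severities sortable.
    INFO = 1
    MINOR = 2
    MAJOR = 3
    CRITICAL = 4


# Colors as listed in the severity table.
COLORS = {
    Severity.CRITICAL: "red",
    Severity.MAJOR: "orange",
    Severity.MINOR: "yellow",
    Severity.INFO: "blue",
}


@dataclass
class Sensor:
    name: str
    severity: Severity  # editable on the sensor


@dataclass(frozen=True)
class Incident:
    sensor_name: str
    severity: Severity  # frozen: cannot be changed on the incident


def open_incident(sensor: Sensor) -> Incident:
    """Create an incident that inherits the sensor's severity."""
    return Incident(sensor_name=sensor.name, severity=sensor.severity)
```

Changing `sensor.severity` later affects only incidents opened afterwards in this sketch; whether Onagre behaves the same way for past incidents is not stated in this section.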

Incident detail page

Click on any incident to open its detail page, which provides full context:

Summary cards

Duration — Elapsed time, updated live for active incidents.
Sensor — Link to the sensor that triggered the incident.
Agent — The agent that executed the check.
Notifications — Count of sent and failed alert deliveries.

Incident reason

The specific details of what went wrong, displayed in a monospace font. The content depends on the sensor type: for example, an HTTP sensor shows the status code, and a SQL sensor shows the query error.

Timeline

A chronological event log showing every state change:

Incident triggered — The initial failure detection with details.
Check retry — Subsequent check attempts that still fail.
Acknowledged by [User] — Who acknowledged the incident and when, with their comment.
Incident resolved — When the problem was fixed.
Alert sent — Notification deliveries with integration name and status.

Notifications

A list of all alert deliveries for this incident, showing the integration type (Slack, Discord, Webhook, etc.), name, timestamp, and delivery status (Sent or Failed).

Comments

A team discussion thread attached to the incident. Any team member can leave comments to share context, progress updates, or resolution notes. Comments are displayed with the author name and timestamp.

Dependencies

A diagram showing the sensor that triggered the incident and any dependent sensors, with their current status. This helps you understand the broader impact of the failure.

Actions

Action	When available	Description
Acknowledge	Active incidents only	Mark as being investigated, with a required comment
Add comment	Any open incident	Contribute to the team discussion thread
View sensor	Always	Navigate to the sensor detail page
View agent	Always	Navigate to the agent detail page
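The availability rules in the table can be expressed as a function of the incident status. This Python sketch assumes that "any open incident" means any incident that is not yet resolved (Active or Acknowledged), which is an interpretation rather than something the table states; the function name is hypothetical.

```python
def available_actions(status: str) -> list[str]:
    """List the detail-page actions available for a given incident status."""
    actions = []
    if status == "Active":
        # Acknowledgement is offered for active incidents only.
        actions.append("Acknowledge")
    if status in ("Active", "Acknowledged"):
        # Assumed meaning of "any open incident": not yet resolved.
        actions.append("Add comment")
    # Navigation links are always available.
    actions += ["View sensor", "View agent"]
    return actions
```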

Summary

Aspect	Details
Access	Monitoring → Incidents
Lifecycle	Active → Acknowledged → Resolved
Severity	Critical, Major, Minor, Info
Metrics	MTTR, MTBF, MTTA, uptime
Actions	Acknowledge (with comment), add comments
Resolution	Automatic when sensor returns to healthy state
Mobile	View and acknowledge from the native app