Incidents

Overview

An incident represents an abnormal state detected by a sensor. When a check fails — a server is unreachable, a database query returns unexpected results, a certificate is about to expire — Onagre creates an incident and triggers notifications through your configured integrations.

Incidents track the complete lifecycle of a problem, from detection through acknowledgement to resolution, with a full audit trail of events, notifications, and team comments.

You can view and manage incidents from Monitoring → Incidents in the Onagre dashboard, or directly from the native mobile app.


Incident dashboard

The incidents page summarizes your monitoring health: live counters, reliability metrics, a trend chart, and the full incident list.

Key metrics

The top of the page displays real-time counters:

  • Opened — Active incidents requiring attention.
  • Acknowledged — Incidents someone is investigating.
  • Resolved — Incidents that have been fixed.
  • Uptime — Overall uptime percentage across your sensors.

Operational metrics

Onagre computes three standard reliability indicators:

Metric  Description
MTTR    Mean Time To Recovery: average time from incident creation to resolution
MTBF    Mean Time Between Failures: average time between incidents
MTTA    Mean Time To Acknowledge: average time from creation to acknowledgement
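
As a rough illustration of how these three indicators relate to incident timestamps, the Python sketch below computes them from a list of records. The `IncidentRecord` class and its field names are hypothetical, not Onagre's schema. Onagre's exact formulas are not specified here; in particular, this sketch assumes MTBF is measured between consecutive incident start times.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional


@dataclass
class IncidentRecord:
    # Hypothetical record; field names are illustrative only.
    created_at: datetime
    acknowledged_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None


def _mean_delta(deltas: List[timedelta]) -> Optional[timedelta]:
    """Average a list of durations, or return None when the list is empty."""
    if not deltas:
        return None
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def mttr(incidents: List[IncidentRecord]) -> Optional[timedelta]:
    """Mean time from creation to resolution, over resolved incidents only."""
    return _mean_delta(
        [i.resolved_at - i.created_at for i in incidents if i.resolved_at]
    )


def mtta(incidents: List[IncidentRecord]) -> Optional[timedelta]:
    """Mean time from creation to acknowledgement, over acknowledged incidents only."""
    return _mean_delta(
        [i.acknowledged_at - i.created_at for i in incidents if i.acknowledged_at]
    )


def mtbf(incidents: List[IncidentRecord]) -> Optional[timedelta]:
    """Mean gap between consecutive incident start times (assumed definition)."""
    starts = sorted(i.created_at for i in incidents)
    return _mean_delta([later - earlier for earlier, later in zip(starts, starts[1:])])
```

In this sketch, incidents that are still active or were never acknowledged are left out of the MTTR and MTTA averages rather than counted as zero.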

Trend chart

A 30-day stacked area chart shows incident volume over time, broken down by severity level. This helps identify patterns and recurring issues.

Two rankings highlight the sensors most affected by incidents:

  • Longest downtime — Top 5 sensors by cumulative incident duration.
  • Most incidents — Top 5 sensors by incident count.
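
The two rankings above amount to grouping incidents by sensor and sorting. A minimal Python sketch, assuming each incident is available as a hypothetical `(sensor_name, duration)` pair (this input shape and the `top_sensors` helper are illustrative, not part of Onagre):

```python
from collections import Counter, defaultdict
from datetime import timedelta


def top_sensors(incidents, limit=5):
    """Return (longest_downtime, most_incidents) rankings.

    `incidents` is an iterable of (sensor_name, duration) pairs; this shape
    is an assumption made for the example.
    """
    downtime = defaultdict(timedelta)  # cumulative incident duration per sensor
    counts = Counter()                 # number of incidents per sensor
    for sensor, duration in incidents:
        downtime[sensor] += duration
        counts[sensor] += 1
    longest = sorted(downtime.items(), key=lambda kv: kv[1], reverse=True)[:limit]
    return longest, counts.most_common(limit)
```

Note that the two rankings can disagree: a sensor with one long outage may lead the downtime list while a flapping sensor leads the incident count.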

Incident list

The main table lists all incidents with the following information:

  • Incident ID and timestamp.
  • Reason — A human-readable description of what went wrong.
  • Severity — Critical, Major, Minor, or Info (color-coded).
  • Status — Active, Acknowledged, or Resolved.
  • Sensor — The sensor that triggered the incident (clickable link).
  • Duration — Elapsed time, or a pulsing "Ongoing" indicator for active incidents.

Filtering

You can filter incidents by:

  • Status — All, Opened, Acknowledged, or Resolved (tab navigation).
  • Severity — All, Critical, Major, Minor, or Info.
  • Search — Full-text search by sensor name.

Pagination lets you choose 10, 25, or 50 incidents per page.
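
The filters combine with each other, and pagination applies to the filtered result. The Python sketch below illustrates that logic under a few stated assumptions: the `IncidentRow` shape is hypothetical, the "Opened" tab is assumed to match incidents whose status is Active, and search is assumed to be a case-insensitive substring match on the sensor name.

```python
from dataclasses import dataclass

PAGE_SIZES = (10, 25, 50)

# Assumption: the "Opened" tab corresponds to the "Active" status.
TAB_TO_STATUS = {
    "Opened": "Active",
    "Acknowledged": "Acknowledged",
    "Resolved": "Resolved",
}


@dataclass
class IncidentRow:
    # Hypothetical row shape for the example.
    sensor_name: str
    severity: str  # "Critical", "Major", "Minor", or "Info"
    status: str    # "Active", "Acknowledged", or "Resolved"


def filter_incidents(rows, tab="All", severity="All", search="", page=1, per_page=25):
    """Apply the status, severity, and sensor-name filters, then paginate."""
    if per_page not in PAGE_SIZES:
        raise ValueError("per_page must be 10, 25, or 50")
    wanted_status = None if tab == "All" else TAB_TO_STATUS[tab]
    needle = search.strip().lower()
    matches = [
        row for row in rows
        if (wanted_status is None or row.status == wanted_status)
        and (severity == "All" or row.severity == severity)
        and needle in row.sensor_name.lower()
    ]
    start = (page - 1) * per_page
    return matches[start:start + per_page]
```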


Incident lifecycle

Every incident follows a three-stage lifecycle:

Active  →  Acknowledged  →  Resolved

Active

The sensor has detected a failure. Onagre creates the incident, triggers notifications through your integrations, and starts tracking duration. The incident remains active until someone acknowledges it or the problem resolves itself.

Acknowledged

A team member has acknowledged the incident, signaling that someone is investigating. Acknowledgement requires a comment describing the actions being taken. This prevents duplicate investigations and keeps the team informed.

Resolved

The sensor check has returned to a healthy state. Onagre automatically resolves the incident and records the total duration. No further action is needed.

💡 Auto-resolution

Incidents are resolved automatically when the sensor reports a successful check. You do not need to close them manually.
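
The lifecycle can be pictured as a small state machine. The following Python sketch is an illustration only, not Onagre's implementation: the `Incident` class, its method names, and the event strings are assumptions. It encodes the rules described above: acknowledgement is possible only while the incident is active and requires a comment, and a healthy check resolves the incident from either open state.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Status(Enum):
    ACTIVE = "Active"
    ACKNOWLEDGED = "Acknowledged"
    RESOLVED = "Resolved"


@dataclass
class Incident:
    # Hypothetical model; names are illustrative only.
    status: Status = Status.ACTIVE
    events: list = field(default_factory=list)

    def _log(self, message):
        """Append a timestamped entry to the incident's event log."""
        self.events.append((datetime.now(timezone.utc), message))

    def acknowledge(self, user, comment):
        """Move an active incident to Acknowledged; a comment is mandatory."""
        if self.status is not Status.ACTIVE:
            raise ValueError("only active incidents can be acknowledged")
        if not comment.strip():
            raise ValueError("acknowledgement requires a comment")
        self.status = Status.ACKNOWLEDGED
        self._log(f"Acknowledged by {user}: {comment}")

    def on_check_result(self, healthy):
        """Handle a sensor check; a healthy result resolves the incident."""
        if self.status is Status.RESOLVED:
            return  # nothing left to do once resolved
        if healthy:
            self.status = Status.RESOLVED
            self._log("Incident resolved")
        else:
            self._log("Check retry")
```

There is deliberately no manual `resolve` method in this sketch, mirroring the fact that resolution is driven by the sensor rather than by a user.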


Severity levels

Each sensor is configured with a severity level that determines the priority of its incidents:

Severity  Color   Use case
Critical  Red     Production-breaking issues requiring immediate action
Major     Orange  Significant problems affecting service quality
Minor     Yellow  Degraded functionality with limited user impact
Info      Blue    Informational alerts for awareness

Severity is defined when creating or editing a sensor. It cannot be changed on an individual incident.
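
One way to picture this relationship is an incident that reads its severity from its sensor instead of storing its own copy. The Python sketch below is illustrative: the `Sensor` and `SensorIncident` classes are hypothetical, and the numeric ordering of the severity values is an assumption used only to sort by priority.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Lower value means higher priority; the numbers are an assumption.
    CRITICAL = 0
    MAJOR = 1
    MINOR = 2
    INFO = 3


SEVERITY_COLORS = {
    Severity.CRITICAL: "red",
    Severity.MAJOR: "orange",
    Severity.MINOR: "yellow",
    Severity.INFO: "blue",
}


@dataclass
class Sensor:
    name: str
    severity: Severity  # set when the sensor is created or edited


@dataclass
class SensorIncident:
    sensor: Sensor

    @property
    def severity(self):
        """Read-only: an incident always reports its sensor's severity."""
        return self.sensor.severity


def by_priority(incidents):
    """Sort incidents from most to least severe."""
    return sorted(incidents, key=lambda incident: incident.severity)
```

Because `severity` is a property without a setter, assigning to it on an incident raises `AttributeError`, which mirrors the rule that severity is changed on the sensor only.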


Incident detail page

Click on any incident to open its detail page, which provides full context:

Summary cards

  • Duration — Elapsed time, updated live for active incidents.
  • Sensor — Link to the sensor that triggered the incident.
  • Agent — The agent that executed the check.
  • Notifications — Count of sent and failed alert deliveries.

Incident reason

The specific details of what went wrong, displayed in a monospace font. This content is specific to the sensor type: for example, an HTTP sensor shows the status code, while a SQL sensor shows the query error.

Timeline

A chronological log of every event recorded for the incident:

  • Incident triggered — The initial failure detection with details.
  • Check retry — Subsequent check attempts that still fail.
  • Acknowledged by [User] — Who acknowledged the incident and when, with their comment.
  • Incident resolved — When the problem was fixed.
  • Alert sent — Notification deliveries with integration name and status.

Notifications

A list of all alert deliveries for this incident, showing the integration type (Slack, Discord, Webhook, etc.), name, timestamp, and delivery status (Sent or Failed).

Comments

A team discussion thread attached to the incident. Any team member can leave comments to share context, progress updates, or resolution notes. Comments are displayed with the author name and timestamp.

Dependencies

A diagram showing the sensor that triggered the incident and any dependent sensors, with their current status. This helps you understand the broader impact of the failure.


User actions

Action       When available         Description
Acknowledge  Active incidents only  Mark as being investigated, with a required comment
Add comment  Any open incident      Contribute to the team discussion thread
View sensor  Always                 Navigate to the sensor detail page
View agent   Always                 Navigate to the agent detail page

Summary

Aspect      Details
Access      Monitoring → Incidents
Lifecycle   Active → Acknowledged → Resolved
Severity    Critical, Major, Minor, Info
Metrics     MTTR, MTBF, MTTA, uptime
Actions     Acknowledge (with comment), add comments
Resolution  Automatic when sensor returns to healthy state
Mobile      View and acknowledge from the native app