
9.2 Grafana Notifications

Grafana Notifications allow you to monitor the health of your node and its components in real time. By configuring alerts for critical metrics, you can proactively resolve issues before they impact performance or security. Once a notification channel like Telegram, Discord, or email has been set up, you can configure custom rules that define when the Grafana Dashboard sends messages based on the gathered metrics.

tip

This guide uses the default Grafana 📝 Templates to configure notification behaviour.

info

The following steps are performed on your 💻 personal computer.

1. Add Notifications

To be notified when a process goes down or a metric leaves its healthy range, you have to create an alert rule for every metric you want to monitor.

  1. Click the Grafana Icon to get to the Landing Page and select the LUKSO Dashboard from the list.
  2. Scroll down to the dashboard's Alerts section, select an alert, and click Edit in the 3-Dot-Menu on the right side.
  3. Within the Alert tab, select Create Alert Rule from this Panel to open up a new window.
  4. Click Preview to render the graph and adjust the Threshold section to define when the alert should fire (see the sketch below this list).
  5. In case the Reduce section shows NaN, use Replace Non-numeric Values to set a custom number above the alert range.
  6. For metrics that rely on the network, it is recommended to set the NaN replacement to a value inside the alert range, so the alert triggers when the network is down.
  7. Within the Alert Evaluation Behavior section, add a node-alerts folder where all the alert data will be stored.
  8. Within the Evaluation Group selection, add a node-group or select it from the panel.
  9. It is recommended to always choose the same alert folder and group for one node, so you do not mix up any targets.
  10. Scroll up and click Save to enable this alert for the specific metric.
info

The steps need to be repeated for every Alert on the Grafana Dashboard you want to receive notifications for.
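
For orientation, the Threshold step essentially wraps the panel's query in a comparison. As a minimal sketch in plain PromQL, assuming the default consensus-client-job name from the templates, the Consensus Process Down preset from the next section boils down to a condition like:

# Fires when Prometheus reports the consensus client's metrics target as down.
up{job="consensus-client-job"} == 0

Within Grafana itself, the case where the series is missing entirely is handled by the Reduce step's Replace Non-numeric Values setting rather than by the query.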

2. Set Metrics Presets

This section outlines every alert setup from the default dashboard. You can check the pictures and validate your configurations.

tip

Below metric presets are based on default Grafana 📝 Templates. If you used different service job names within the Prometheus Dataset Configuration, you will have to adjust the job names to match your Prometheus installation.

Grafana Alert Board

Alert: Consensus Process Down

1: Process up
0: Process down

Consensus Process Down Metric

up{job="consensus-client-job"}

Alert: Validator Process Down

1: Process up
0: Process down

Validator Process Down Metric

up{job="validator-client-job"}
warning

The Validator Process Down Alert does not exist for Nimbus-Eth2, as it uses a single Beacon Process.

Alert: Consensus Process Restarted

1:   Process up
0: Process down
NaN: Not available (likely down --> 0)

Consensus Process Restarted Metric

(time()-process_start_time_seconds{job="consensus-client-job"})/3600
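
The expression above returns the consensus client's uptime in hours. As a rough sketch, an equivalent standalone PromQL condition flagging a restart within the last hour would be (the one-hour threshold is an assumption, adjust it to your own alert settings):

# Uptime in hours; a value below 1 means the process started less than an hour ago.
(time()-process_start_time_seconds{job="consensus-client-job"})/3600 < 1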

Alert: Validator Process Restarted

1:   Process up
0: Process down
NaN: Not available (likely down --> 0)

Validator Process Restarted Metric

(time()-process_start_time_seconds{job="validator-client-job"})/3600
warning

The Validator Process Restarted Alert does not exist for Nimbus-Eth2, as it uses a single Beacon Process.

Alert: Below 40 Peers

above 30: Ideal healthy connections
below 30: Resyncing or weak connections
NaN: Not available (no connections --> 0)

Below 40 Peers Metric

p2p_peer_count{state="Connected",job="consensus-client-job"}
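
Expressed as a single PromQL condition, the alert logic roughly corresponds to the sketch below; the NaN case (no connection data at all) is covered in Grafana by replacing non-numeric values with 0 in the Reduce step:

# Fires when the consensus client reports fewer than 40 connected peers.
p2p_peer_count{state="Connected",job="consensus-client-job"} < 40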

Alert: Participation Rate below 80%

above 80: Ideal healthy network
below 80: Unstable network
NaN: 2nd data feed (ignore metric --> 100)

Participation Rate below 80% Metric

(beacon_prev_epoch_target_gwei{job="consensus-client-job"} / beacon_prev_epoch_active_gwei{job="consensus-client-job"}) *100
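
As a sketch, the same logic written as a standalone PromQL condition:

# Fires when less than 80% of the previous epoch's active stake attested to the correct target.
(beacon_prev_epoch_target_gwei{job="consensus-client-job"} / beacon_prev_epoch_active_gwei{job="consensus-client-job"}) * 100 < 80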

Alert: 50 Slots Behind

below 50: Ideal syncing speed
above 50: Unstable syncing
NaN: Not available (likely unstable --> 51)

50 Slots Behind Metric

beacon_clock_time_slot{job="consensus-client-job"} - beacon_head_slot{job="consensus-client-job"}
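
Sketched as a standalone PromQL condition:

# Fires when the chain head lags more than 50 slots behind the wall-clock slot.
beacon_clock_time_slot{job="consensus-client-job"} - beacon_head_slot{job="consensus-client-job"} > 50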

Alert: No Hourly Earnings

above 0.0001: Earning rewards
below 0.0001: Syncing or negative rewards
NaN: Not available (likely negative rewards --> 0)

No Hourly Earnings Metric

sum(validator_balance{job="validator-client-job"}) - sum(validator_balance{job="validator-client-job"} offset 1h != 0) - (32 * count(validator_balance{job="validator-client-job"} > 16)) + (32 * count(validator_balance{job="validator-client-job"} offset 1h > 16))
warning

The Hourly Earnings Alert does not exist for Teku, as the client does not expose any Validator Balance Metrics.
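
To unpack the arithmetic: roughly, the query takes the summed validator balances now minus one hour ago, then subtracts 32 for every active validator (balance above 16) that appeared within that hour, so fresh deposits are not counted as earnings. A minimal sketch of the resulting condition, using the 0.0001 threshold from the legend above:

# Fires when the adjusted balance growth over the last hour is below 0.0001.
(sum(validator_balance{job="validator-client-job"}) - sum(validator_balance{job="validator-client-job"} offset 1h != 0) - (32 * count(validator_balance{job="validator-client-job"} > 16)) + (32 * count(validator_balance{job="validator-client-job"} offset 1h > 16))) < 0.0001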

Alert: Less than 2GB Free Memory

above 2000000000: More than 2GB remaining
below 2000000000: Less than 2GB remaining

Less than 2GB Free Memory Metric

(node_memory_MemFree_bytes{job="node-exporter-job"} or node_memory_MemFree{job="node-exporter-job"}) + (node_memory_Cached_bytes{job="node-exporter-job"} or node_memory_Cached{job="node-exporter-job"})
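
Sketched as a single condition (2000000000 bytes is roughly 2 GB; the or fallbacks cover older node-exporter metric names):

# Fires when free plus cached memory drops below roughly 2 GB.
(node_memory_MemFree_bytes{job="node-exporter-job"} or node_memory_MemFree{job="node-exporter-job"}) + (node_memory_Cached_bytes{job="node-exporter-job"} or node_memory_Cached{job="node-exporter-job"}) < 2000000000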

Alert: CPU Usage above 40%

above 4: More than 40% of computation resources used
below 4: Less than 40% of computation resources used

CPU Usage above 40% Metric

sum(irate(node_cpu_seconds_total{mode="user",job="node-exporter-job"}[5m])) or sum(irate(node_cpu{mode="user",job="node-exporter-job"}[5m]))
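
Summing the per-core user-mode rates yields the number of CPU cores busy on average, so a threshold of 4 matches 40% only on a machine with about 10 cores; scale it to 0.4 times your core count if needed. A sketch of the condition:

# Fires when user-mode CPU load exceeds roughly 4 core-equivalents.
(sum(irate(node_cpu_seconds_total{mode="user",job="node-exporter-job"}[5m])) or sum(irate(node_cpu{mode="user",job="node-exporter-job"}[5m]))) > 4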

Alert: Disk Usage above 60%

above 0.6: Disk more than 60% occupied by tasks
below 0.6: Disk less than 60% occupied by tasks

Disk Usage above 60% Metric

(sum(node_filesystem_size_bytes{job="node-exporter-job"})-sum(node_filesystem_avail_bytes{job="node-exporter-job"}))/sum(node_filesystem_size_bytes{job="node-exporter-job"})
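
The same ratio written as a standalone condition, as a sketch:

# Fires when more than 60% of the total filesystem capacity is in use.
(sum(node_filesystem_size_bytes{job="node-exporter-job"}) - sum(node_filesystem_avail_bytes{job="node-exporter-job"})) / sum(node_filesystem_size_bytes{job="node-exporter-job"}) > 0.6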

Alert: CPU Temperature above 75 °C

above 75: Processor is running hot
below 75: Processor is running normally

CPU Temperature above 75 °C Metric

node_hwmon_temp_celsius{chip="platform_coretemp_0",job="node-exporter-job",sensor="temp1"}
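
Sketched as a condition; note that the chip and sensor labels can differ depending on your hardware:

# Fires when the temperature sensor reads above 75 °C.
node_hwmon_temp_celsius{chip="platform_coretemp_0",job="node-exporter-job",sensor="temp1"} > 75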

Alert: Google Ping above 30ms

above 0.03: Connection takes longer than 30ms, not ideal
below 0.03: Connection takes less than 30ms, everything alright

Google Ping above 30ms Metric

probe_duration_seconds{job="google-ping-job"}
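
Sketched as a condition, with the probe duration measured in seconds:

# Fires when the ping probe to Google takes longer than 30 ms (0.03 s).
probe_duration_seconds{job="google-ping-job"} > 0.03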

3. Configure Intervals

Once an alert is triggered, you can define how frequently the message will be sent out to your notification channel.

  1. Navigate to the Alerting section on the left menu and click on the Notification Policies heading.
  2. Select the 3-Dot-Menu on the default notification channel and choose Edit within the popup.
  3. Expand the Timing Options field and adjust the durations or message frequency to your liking.

Grafana Alert Interval

info

Besides intervals, Grafana will also send a Resolved message once the issue is no longer present.

4. Continuous Notifications

Within the Grafana Dashboard, you can also enable Alert Reminders that send notifications after a certain period of time. Setting the reminder to one hour will send a notification every hour as long as a critical error or status persists.

5. Permanent Alerts

For a metric to send out permanent notifications, you can clone or create a new alert rule for a metric and set a threshold that is never supposed to be reached during normal operation, so the alert triggers permanently. If you want hourly updates on the participation rate, you could set the alert to fire for any participation under 100%, as sketched below. In this case, you would constantly be notified about the network participation.
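
A minimal sketch of such an always-firing condition, reusing the participation-rate query from above (participation is practically always below 100%):

# Fires on almost every evaluation, producing a continuous stream of status updates.
(beacon_prev_epoch_target_gwei{job="consensus-client-job"} / beacon_prev_epoch_active_gwei{job="consensus-client-job"}) * 100 < 100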