
Using LibreNMS for server OS monitoring, part 2: Alert rules


My greatest enemy is alert fatigue. Every time I receive a false positive alert from a monitoring system, I cringe. I try my best to combat this by having sensible alert rules.

Please keep in mind that this post is aimed at server monitoring (in my case, virtualized servers running Linux or Windows). These rules might not be applicable in other scenarios, e.g. if you’re monitoring thousands of networking devices.

I mainly use LibreNMS for monitoring network connectivity, disk usage, CPU utilization, and memory usage. I have set up a few simple rules to ensure that I get notified when a server is unavailable, or when a resource is at capacity (usually indicating that something has gone haywire, or that a VM needs to be reprovisioned). Below are a few of them.

Alert rules are something I always consider a work in progress, and you should adapt them to best suit your needs. But these should help you get started. As always, suggestions, tips and tricks are greatly appreciated.

(For information on how to set up alert rules, take a look at this YouTube video, and the LibreNMS documentation.)

 

Disk usage over 85%

If your servers are running applications which produce large amounts of log output, you have probably run into storage capacity issues. This rule ensures you get notified if you’re getting close to running out of storage space. The 20-minute delay ensures you won’t get bothered by short-lived fluctuations, but only when disk usage stays above the threshold.

If you’re monitoring servers or networking devices running Linux, chances are they’ve got low-capacity boot partitions and the like. If they trigger alerts, and you consider them to be false positives, you should consider appending exceptions to the rule, such as the /boot exclusion in the rule below.

Rule: %macros.device_up = "1" && %storage.storage_perc > "85" && %storage.storage_descr !~ "/boot"

Severity: Warning
Max alerts: 1
Delay: 20 m
Interval: 5 m
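
To make the combination of rule and delay more concrete, here is a rough Python sketch of the logic as described above. It is not LibreNMS code: the names StorageSample, matches_rule and should_alert are made up for illustration, and the delay is modelled simply as "every poll in the last 20 minutes matched the rule", which is an assumption about the intended behaviour rather than a description of LibreNMS internals.

```python
from dataclasses import dataclass


@dataclass
class StorageSample:
    """One poll result for one storage entry (hypothetical structure)."""
    descr: str        # mount point or volume label, e.g. "/" or "/boot"
    perc: float       # percent of capacity in use
    device_up: bool   # whether the device answered the poll


def matches_rule(sample, threshold=85.0):
    """Device is up, usage is above the threshold, and it is not a /boot mount."""
    return (
        sample.device_up
        and sample.perc > threshold
        and "/boot" not in sample.descr
    )


def should_alert(history, delay_minutes=20, interval_minutes=5, threshold=85.0):
    """Alert only if every poll spanning the delay window matched the rule."""
    # Polls at t=0, 5, 10, 15 and 20 minutes are needed to cover 20 minutes.
    needed = delay_minutes // interval_minutes + 1
    if len(history) < needed:
        return False
    return all(matches_rule(s, threshold) for s in history[-needed:])


if __name__ == "__main__":
    steady = [StorageSample("/", 90.0, True)] * 5
    spike = [StorageSample("/", 50.0, True)] * 4 + [StorageSample("/", 95.0, True)]
    print(should_alert(steady))  # sustained high usage
    print(should_alert(spike))   # a single high reading
```

A single high reading does not qualify here, while five consecutive readings above 85% do, which is the behaviour the delay is meant to give you.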

 

Windows memory usage over 85%

If you’re running into issues with high memory consumption on a Windows server, chances are you’ve got a process eating it all up. Regardless of cause, if memory consumption is consistently high, you probably want to get notified. This rule has a 30-minute delay, to avoid alerts being triggered by short peaks in consumption.

Note that this is a Windows rule only. Linux servers usually consume all available memory and use it for disk caching. This causes graphs populated by SNMP data to appear as if Linux servers are running at full memory capacity, even when they’re not. You can read more about this here. (And I don’t have a solution to this issue yet, so if you know how to work around it, let me know!)

Rule: %macros.device_up = "1" && %mempools.mempool_descr ~ "physical" && %devices.os ~ "Windows" && %mempools.mempool_perc > "85"

Severity: Warning
Max alerts: 1
Delay: 30 m
Interval: 5 m
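
To illustrate why the raw figure is misleading on Linux, here is a small Python sketch with made-up numbers. It is not a fix for the SNMP graphs, only the arithmetic behind the problem: the function name used_percent and the sample values are hypothetical, and it assumes buffers and cache can be treated as reclaimable memory.

```python
def used_percent(total_kb, free_kb, buffers_kb=0, cached_kb=0):
    """Percent of memory in use, treating buffers and cache as reclaimable."""
    used_kb = total_kb - free_kb - buffers_kb - cached_kb
    return 100.0 * used_kb / total_kb


if __name__ == "__main__":
    # Invented values for an 8 GB server that has been up for a while.
    total, free, buffers, cached = 8_000_000, 400_000, 200_000, 5_000_000

    # Counting everything that is not free as "used" looks alarming.
    print(used_percent(total, free))
    # Subtracting buffers and cache gives a far lower figure.
    print(used_percent(total, free, buffers, cached))
```

With these numbers the first figure is well above the 85% threshold and the second is far below it, which is why the rule above is limited to Windows.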

 

Ethernet port down

LibreNMS has a default “Port status up/down” rule which triggers when any port is down. But in an OS you might have VPN connections, loopback and other sorts of faux interfaces. To ensure you only get notified when actual physical ethernet interfaces are down, you should consider disabling the default rule and using this one instead.

Rule: %macros.port_down = "1" && %ports.ifType = "ethernetCsmacd"

Severity: Critical
Max alerts: 1
Delay: 5 m
Interval: 5 m
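
The effect of the ifType filter can be sketched in a few lines of Python. The Port class, the is_down flag and the sample interfaces are invented for illustration, and is_down is only a stand-in for whatever the port_down macro evaluates to. The ifType strings are standard IANA interface type names.

```python
from dataclasses import dataclass


@dataclass
class Port:
    """Simplified view of a polled interface (hypothetical structure)."""
    name: str
    if_type: str
    is_down: bool  # stand-in for the result of the port_down macro


def ports_to_alert(ports):
    """Return names of ports that are down and are real ethernet interfaces."""
    return [
        p.name
        for p in ports
        if p.is_down and p.if_type == "ethernetCsmacd"
    ]


if __name__ == "__main__":
    ports = [
        Port("eth0", "ethernetCsmacd", True),
        Port("eth1", "ethernetCsmacd", False),
        Port("lo", "softwareLoopback", True),
        Port("tun0", "tunnel", True),
    ]
    print(ports_to_alert(ports))
```

Only the ethernet port that is down is reported. The loopback and tunnel interfaces are ignored even though they are down.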

 

Aggregate CPU alerts

At the moment, LibreNMS only allows for triggering reliable alerts on single-CPU devices. If a system has multiple CPUs, and you set up an alert rule using processors.processor_usage, it will not trigger on average CPU utilization, but rather trigger if any one core matches the set criteria. This is a problem when you’re monitoring multi-core servers. To work around it, I’ve made a hack. Read more about it here.
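
The difference between the two behaviours is easy to show in Python. This is only an illustration of the problem, not the workaround itself: the function names, the 85% threshold and the sample readings are assumptions made for this example.

```python
def any_core_over(core_usages, threshold=85.0):
    """Per-core behaviour: trigger if any single core exceeds the threshold."""
    return any(usage > threshold for usage in core_usages)


def average_over(core_usages, threshold=85.0):
    """Desired behaviour: trigger only if the average across cores exceeds it."""
    return sum(core_usages) / len(core_usages) > threshold


if __name__ == "__main__":
    # One busy core on an otherwise idle four-core server (invented readings).
    cores = [95.0, 10.0, 12.0, 8.0]
    print(any_core_over(cores))  # the per-core check fires
    print(average_over(cores))   # the average check does not
```

A single-threaded job pinning one core is enough to trip the per-core check, even though the server as a whole is mostly idle.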
