WIO hides incidents

I was working on a reliability problem with DNS and tried to use WIO to monitor it. Unfortunately, I did not find it useful.

I can see the issue on the “day” graph, but it is not visible on the “week” graph. Could this be fixed? The issue should be visible in the week/month/year graphs, because the day graph covers only the last 24 hours; after that, a recorded issue can be seen only in the graphs with lower resolution…

For example, notice that the issue visible on the Day graph is not visible on the Week graph, so it looks as if DNS is “reliable”:

Day, we have an issue


Week, where is the issue from the last 24 hours??

I fear that an event with a duration of 30 minutes disappears in the higher-order graphs/RRD data.
RRDtool computes the displayed data points with the mean function.
To fix your issue, WIO's data collection and graphing would have to be done with another tool. I have no idea which tool would be better suited, nor how much effort the change would take (including the transfer of data!).

Sorry for this news.

For more information see
https://oss.oetiker.ch/rrdtool/doc/rrdtool.en.html
https://oss.oetiker.ch/rrdtool/doc/rrdcreate.en.html

WIO uses the AVERAGE type; maybe a type of MAX would be more suitable for plain on/off values.
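To illustrate the AVERAGE point, here is a minimal Python sketch of what consolidation does to a short outage. It is not WIO or RRDtool code: the 15-minute step, the consolidation factor of 8, and the 1 = up / 0 = down encoding are all assumptions for illustration. With that encoding it is MIN that keeps the outage visible; MAX would do the same job if the stored value were 1 = down.

```python
def consolidate(samples, factor, func):
    """Collapse each run of `factor` samples into one point,
    roughly the way a consolidation function builds a coarser archive."""
    return [func(samples[i:i + factor]) for i in range(0, len(samples), factor)]


def mean(values):
    return sum(values) / len(values)


# One day at an assumed 15-minute step: 96 samples, 1 = host up, 0 = host down.
day = [1] * 96
day[40] = day[41] = 0  # a 30-minute outage (two consecutive samples)

# Assumed factor: 8 samples (2 hours) become one point on the coarser graph.
avg_points = consolidate(day, 8, mean)
min_points = consolidate(day, 8, min)

print(min(avg_points))  # 0.75, the outage is diluted to a shallow dip
print(min(min_points))  # 0, the outage survives at full depth
```

The coarser the graph, the larger the factor and the shallower the averaged dip becomes, which matches what the Day and Week graphs show.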

Let me add more detail about my “bad” DNS server. I monitor it with a script, taking one sample each minute. About 3% of these samples get no response, which means only 97% “uptime”, far lower than the 99.9% that is “good business practice”. Because it is a DNS server “cluster” (two servers, primary and secondary), availability should be at least 99.99%…

With WIO I sample only every 15 minutes, so I miss many “bad responses”. The 30 minutes of downtime it recorded is only about 2% of 24 hours, which looks better than the real 3% I measure with my script. Anyway, 2% is bad downtime and WIO should not hide such an issue… My example has one 30-minute “red” interval, but that was just a coincidence; it could just as well be two separate 15-minute intervals during the day, and this “downtime” should still be visible when the graph is zoomed out.
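For anyone who wants to check the numbers, here is a small Python sketch of the arithmetic. The function names are made up for this example; only the 30 minutes, the 24-hour window, and the 99.9% / 99.99% targets come from my post.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_percent(down_minutes, window_minutes=MINUTES_PER_DAY):
    """Share of the window that was down, in percent."""
    return 100.0 * down_minutes / window_minutes


def allowed_downtime_minutes(availability_percent, window_minutes=MINUTES_PER_DAY):
    """Downtime budget that a given availability target leaves in the window."""
    return window_minutes * (1 - availability_percent / 100.0)


print(round(downtime_percent(30), 2))              # 2.08 (% of one day)
print(round(allowed_downtime_minutes(99.9), 2))    # 1.44 (minutes per day)
print(round(allowed_downtime_minutes(99.99), 3))   # 0.144 (minutes per day)
```

So a single 30-minute outage already uses roughly twenty times the daily budget of a 99.9% target.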

We are monitoring our devices to know about downtime, aren’t we? I think a first attempt could be to replace AVERAGE with MAX (or MIN??), because we want to see downtimes… The current WIO is too optimistic… :wink:

BTW, I do not have control over that “bad” DNS server, I just collect evidence that it is misconfigured…

:thinking: Maybe a workaround/solution could be the following post

BR

It is not meant to be a network-monitoring tool that catches every single missed packet. It is just a simple “Who is Online?” tool.

Why not add an estimated SLA indicator that calculates uptime availability in real time, as a value between 0% and 100%? I understand that it can have only informative value. It is not really useful when WIO is monitoring a desktop station, but it is helpful when infrastructure devices or web servers are monitored…

This indicator can have only informative value because it will also be affected by IPFire's own downtime; when IPFire is updated, it has to be rebooted… Anyway, a simple uptime metric can help to find the troublemaker in the infrastructure, or to compare the reliability of different devices.
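A rough idea of what I mean, as a Python sketch. This is not WIO code: the class name `UptimeIndicator` and the fixed-size window are hypothetical, and WIO would have to feed it the result of each check.

```python
from collections import deque


class UptimeIndicator:
    """Rolling uptime percentage over the last `window` checks (hypothetical helper)."""

    def __init__(self, window):
        # deque with maxlen drops the oldest result automatically
        self.results = deque(maxlen=window)

    def record(self, is_up):
        self.results.append(bool(is_up))

    def percent(self):
        if not self.results:
            return None  # no checks recorded yet
        return 100.0 * sum(self.results) / len(self.results)


# One day of 15-minute checks with two failed ones.
indicator = UptimeIndicator(window=96)
for i in range(96):
    indicator.record(i not in (40, 41))

print(round(indicator.percent(), 2))  # 97.92
```

Because the window is bounded, an old outage ages out by itself, so the value stays an estimate of recent availability rather than a lifetime total.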

Mostly because there is only a limited number of developers (volunteers). And with so few developers, the focus is on security-related fixes and updates.

Any amount of assistance from new developers is greatly appreciated. Let us know if you would like to help!
