Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


organization - Effective monitoring? How to avoid alert fatigue?

Not a developer question in the sense of hacking and coding, but I can't be the only programmer who has to maintain and troubleshoot code and processes at customer sites. Whenever we roll something out to production, we set up rules and states by which we recognize the "correct" state of the software (established connections, file ages and sizes, log entries in SQL, ...). After that the trouble starts: error states create tickets, send mails, send high-priority mails, blink red on a dedicated monitor, post messages to Teams, ... Every time enough processes use one channel, we tend to tune it out, and we have to find a new way to get our own attention.

There are of course many alerts that only blink temporarily or fire once, and then there are those that exist only so the documentation contains a sensor to tick off. Altogether they clog our ticket system and monitoring, and real threats can be overlooked (big, bad sensors going red, or little ones going red often and systematically).

Does anyone have a silver bullet to avoid alert fatigue?

question from:https://stackoverflow.com/questions/65892293/effective-monitoring-how-to-avoid-alter-fatigue

He who fights dragons too long becomes a dragon himself; gaze too long into the abyss, and the abyss gazes back…

1 Answer


Yes: Nagios. At least, it was our solution 10+ years ago; for all I know, there could be something more current now. The core of Nagios is open source. You could also just buy one of their books and copy its underlying approach into your existing system: https://www.nagios.org/about/propaganda/books/

They use hierarchies of nodes. So if an upstream node goes down, you get notified for that one node only, not for all the downstream nodes.
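If you want to copy the node-hierarchy idea into an existing system, a minimal sketch could look like the following. It is not Nagios's actual implementation: the function `alerts_to_send`, the `parents` mapping, and the node names are all hypothetical, and the sketch assumes each node has at most one parent and that the hierarchy contains no cycles.

```python
from typing import Dict, Optional, Set


def alerts_to_send(parents: Dict[str, Optional[str]], down: Set[str]) -> Set[str]:
    """Return only those down nodes that have no down ancestor.

    `parents` maps each node to its upstream node (None for the top node).
    Assumes the hierarchy is acyclic.
    """
    to_notify = set()
    for node in down:
        ancestor = parents.get(node)
        suppressed = False
        # Walk upstream; a down ancestor already explains this outage.
        while ancestor is not None:
            if ancestor in down:
                suppressed = True
                break
            ancestor = parents.get(ancestor)
        if not suppressed:
            to_notify.add(node)
    return to_notify


# Hypothetical topology: router -> switch -> {db, web}
parents = {"router": None, "switch": "router", "db": "switch", "web": "switch"}
print(alerts_to_send(parents, {"switch", "db", "web"}))  # {'switch'}
```

With the switch down, the database and web server behind it are unreachable too, but only the switch produces a notification.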

They also use escalating hierarchies for roles, groups, and methods of communication. So if something goes down, you don't need to notify everybody, just one role or group (at least initially).
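The escalation idea can be sketched in a few lines as well. Again, this is only an illustration under assumptions: the `ESCALATION` table, its thresholds, the group names, and `groups_to_notify` are made up here and do not come from Nagios.

```python
from typing import List, Tuple

# (minutes the alert has gone unacknowledged, group to notify).
# Thresholds and group names are hypothetical.
ESCALATION: List[Tuple[int, str]] = [
    (0, "on-call"),
    (30, "team-lead"),
    (120, "management"),
]


def groups_to_notify(minutes_unacknowledged: int) -> List[str]:
    """Return every group whose escalation threshold has been reached."""
    return [
        group
        for threshold, group in ESCALATION
        if minutes_unacknowledged >= threshold
    ]


print(groups_to_notify(0))   # ['on-call']
print(groups_to_notify(45))  # ['on-call', 'team-lead']
```

A fresh alert reaches only the first group; the wider audience is pulled in only if nobody acknowledges it in time.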

They also let you set the severity of notifications at a granular level. For instance, if something only goes down intermittently, you can decide whether that is important to you or not.
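One simple way to keep intermittent failures from generating noise is to notify only after several consecutive failed checks. The sketch below is one possible approach, not a description of how Nagios does it; the class name `ConsecutiveFailureFilter` and the default threshold of 3 are assumptions chosen for illustration.

```python
class ConsecutiveFailureFilter:
    """Notify once when a check has failed `threshold` times in a row."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0
        self.alerted = False

    def observe(self, ok: bool) -> bool:
        """Record one check result; return True if a notification is due."""
        if ok:
            # Recovery resets the streak and re-arms the alert.
            self.failures = 0
            self.alerted = False
            return False
        self.failures += 1
        if self.failures >= self.threshold and not self.alerted:
            self.alerted = True
            return True
        return False


check = ConsecutiveFailureFilter(threshold=3)
results = [check.observe(ok) for ok in [False, True, False, False, False, False]]
print(results)  # [False, False, False, False, True, False]
```

A single blip never notifies, a sustained outage notifies exactly once, and the filter re-arms after the next successful check.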

Now this isn't my area of expertise and it's been a long time since I've seen it in use, but its entire point was to reduce the number of alerts IT support would receive anytime something went wrong.

