Imagine you’re a System Administrator for an organization that typically starts business at 8AM on Mondays. Saturday night you spent a couple hours replacing that old Cisco 10/100 switch with a noisy fan, patched up your Exchange server, and re-wired the server rack that has been bugging you for months. Great night’s work; off to bed and rest up on Sunday.
It’s now 6:17AM on Monday morning and you stir in bed only to see a light blinking on your phone out of the corner of your eye. You have three missed calls, a voicemail, and a text from your boss (who happens to be the CFO) saying “EVERYTHING’S DOWN! PLEASE ADVISE!!!” Time to start the work week. You frantically throw on work clothes while turning on your laptop. Mind racing as fast as it can, you start to go through the usual train of thought:
- What all did I do Saturday night?
- Is our new switch already broken?
- Why is my boss in so early on a Monday?
- What exactly does he mean by everything?
- Did I not trace all my cables correctly when I re-did our cable management?
After your laptop has booted, you connect to the company VPN successfully and start pinging various services and critical infrastructure devices. VMware hosts, SAN, domain controllers, router, switches, Exchange server, and application servers are all reachable. It’s been about 45 minutes since your boss texted, and you see your phone light up with an incoming call. With a loud, exasperated sigh, you answer the phone and immediately hear “Any progress?”
It’s now 7:05AM and you should just now be waking up for work. On the short call with your boss, you acknowledge you are looking into the issue and will pinpoint the exact problem once you get more detail. After hanging up, you test a few items and become fairly confident that all critical systems are working properly:
- Inbound/outbound email working
- ERP application opens
- File shares accessible
- Websites loading promptly from a workstation
After enough testing, you finish getting ready and go into the office. You walk through the doors at 7:55AM as others are arriving. Users are chit-chatting about their weekend, but you’re on a mission to figure out what the problem is before this bubbles up even further and others notice. You head straight for the CFO’s office, where you’re met with very stern eyes as he is just finishing a sentence: “…I will get that to you as soon as our systems are back up. I apologize for the inconvenience.” Well, that doesn’t sound good.
The process of troubleshooting continues. You check the boss’s computer and it’s on, he’s logged in, and the computer seems to be functioning as normal. After completing a few pings and trying to browse the web, you notice he’s not getting access to anything internal or external. AH HA! You read the network port label on the wall and run to the patch panel in the server closet to trace the port to the switch. Red light on port 32. Moving the cable to another open port results in a blinking green light, and you verify with your boss that everything is working. Everything is working…funny how that happens.
Most technical people have run into situations like the one above. On one hand, the damage could have been much worse and nothing is literally on fire. On the other hand, the anxiety and stress produced early Monday morning isn’t exactly starting the week off right. How do we curb the amount of surprise involved and quickly identify what the issue is? Monitoring!
This multi-part series is going to cover aspects of monitoring including why, how, and when we monitor systems. Each business is different, but critical systems, and even non-critical ones, should be monitored continuously. It doesn’t have to cost an organization a lot of money to get started, and it usually provides a great ROI in terms of avoided downtime and lost productivity.
Why do we monitor?
If you’re a System Administrator who, according to your boss, apparently sits around and does nothing, then you’re either really good or not so good at your job. Ideally, IT people want to be proactive about things going wrong instead of taking the classic reactive approach. Despite your best intentions, upper management or decision makers may agree with the proactive approach but not give you the budget to succeed at it. Even so, it would not be wise to sit around and let the reactive IT lifestyle continue, for your own sake as well as the organization’s.
Truth be told, many companies don’t have monitoring at all. This includes both small businesses and corporations. It might be that the IT department is understaffed and doesn’t have time, or maybe the small business only has a single IT person who’s busy fixing print spooler errors and removing viruses from computers. In many situations the monitoring process goes like this:
- Bill, the IT guy, is sitting at his desk looking at gifts for his wife
- Somebody knocks at the door; it’s Susan from Accounts Payable, who sits right outside Bill’s office
- Susan mentions that her shared drive no longer works and she can’t click on anything
- Bill ponders it for a bit and decides to walk over to Susan’s desk, only to see a few other people walking in his direction
One of the first telltale signs of something going wrong is when multiple users are looking for you at the same time. In this case, the file server’s drive filled up, so Bill had to extend the partition. That only takes five minutes, but imagine if the partition couldn’t be extended because the backend datastore was also full. Then it’s time to delete data or buy more storage.
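Bill’s full drive is the kind of thing a few lines of script can catch before Susan does. The following is only a minimal sketch in Python using the standard library; the function names and the 80%/90% thresholds are assumptions chosen for illustration, not values from any particular monitoring product.

```python
import shutil


def classify_usage(used_bytes, total_bytes, warn_pct=80.0, crit_pct=90.0):
    """Return (status, percent_used) for a volume.

    The warn/crit thresholds are illustrative assumptions; tune them
    to how quickly the volume in question tends to grow.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    pct = 100.0 * used_bytes / total_bytes
    if pct >= crit_pct:
        return "CRITICAL", pct
    if pct >= warn_pct:
        return "WARNING", pct
    return "OK", pct


def check_disk(path="/"):
    """Check the volume that contains `path` using shutil.disk_usage."""
    usage = shutil.disk_usage(path)
    return classify_usage(usage.used, usage.total)


if __name__ == "__main__":
    status, pct = check_disk("/")
    print(f"Volume is {pct:.1f}% full: {status}")
```

Run on a schedule, a check like this gives a warning while there is still room to extend the partition, instead of a line of users at the door.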
Items like a disk being full are simple to monitor, but the list doesn’t stop there. With monitoring it’s nice to know when:
- An air conditioner has failed in the server room
- The power goes out
- A storage datastore is reaching capacity
- CPU on a server is sitting at 100% usage
- A bad cable is being used on the network (phone cable into an Ethernet port anyone?)
- Unexpected reboots
- Critical processes are not running on a server
- A website is down
- Drives starting to go bad in your RAID array
- DNS server stopped working
- Devices using a wrong gateway
- Spanning tree changes happening
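Several items on this list, such as a website being down or a DNS server that stopped working, boil down to a simple question: does the service answer? As a rough sketch, assuming nothing beyond Python’s standard library, the checks below test whether a TCP port accepts connections and whether a name resolves. The helper names and the example hosts are hypothetical placeholders, and a real monitoring tool would do considerably more (retries, history, alerting).

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def dns_resolves(name):
    """Return True if the system resolver can resolve `name`."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


def run_checks(checks):
    """Run (label, probe) pairs and map each label to 'UP' or 'DOWN'."""
    return {label: ("UP" if probe() else "DOWN") for label, probe in checks}


if __name__ == "__main__":
    # Hypothetical hosts; replace with your own services.
    checks = [
        ("website", lambda: tcp_port_open("www.example.com", 443)),
        ("mail (SMTP)", lambda: tcp_port_open("mail.example.com", 25)),
        ("DNS lookup", lambda: dns_resolves("www.example.com")),
    ]
    for label, state in run_checks(checks).items():
        print(f"{label}: {state}")
```

Keeping each probe as a small function that returns True or False makes it easy to add new checks later without touching the reporting logic.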
That’s only one part of the equation though. Checks for those items can be configured, but the next step is to decide what medium is used to let you know something is wrong. Emails, phone calls, texts, and dashboards are all good options. We’ll cover who gets what and when in another section of this multi-part series.
System Administrators want to set up monitoring for their own sanity and peace of mind when sleeping at night. Translating that into terms a key stakeholder or executive cares about can be difficult if they don’t have a full understanding of what it takes to be an IT person. Think about how a monitoring solution could help the business overall:
- Email downtime reduced due to multiple probes being placed on the Exchange server to help monitor services
- Customer web portal at 99.999% uptime over the last year, with no complaints from customers, because technicians could react faster
- Dashboard of uptime and availability of web facing servers for management to see
- Quickly able to assess IT damage due to a thunderstorm overnight
- Better future budget estimations based on patterns and trends seen
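To put an uptime figure like 99.999% in front of management, it helps to translate it into minutes. The arithmetic below is a small sketch; the function names are made up for illustration, and a 365-day year is assumed.

```python
def allowed_downtime_minutes(availability_pct, period_days=365):
    """Minutes of downtime permitted by an availability target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


def measured_availability(downtime_minutes, period_days=365):
    """Availability percentage implied by observed downtime over a period."""
    total_minutes = period_days * 24 * 60
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, 99.999% works out to a little over five minutes of downtime in a year, which is a useful reality check when deciding what to promise on a dashboard.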
Time is money to an organization – the less time critical services are down the better off everyone is. An initial investment of time and money in a proper monitoring solution can provide an amazing ROI for the business and also help keep stress levels down. As a System Administrator myself, I would rather get an email or text from an automated service than an upset boss on a Monday morning.
The next part of this series will include how we monitor systems with a brief history, protocols, devices, and issues faced when implementing a monitoring solution.