I encountered a problem yesterday that started when a customer reported a minor application issue. It illustrated a common problem in many organisations: engineers trying to fix a customer issue create new problems simply by logging on to the server.
Here is the common application server support problem:
- An engineer builds a new critical application server.
- It goes into production.
- Users start using it.
- A user reports a problem.
- A team of engineers from a support company log in to the server to find the root cause of the problem reported by the user.
- Once the engineers have finished working on the problem, they forget to log off, leaving their disconnected sessions behind on the server.
Disconnected sessions use memory and create extra processor load.
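As a rough illustration of how leftover sessions could be spotted, the sketch below parses text in the layout printed by the Windows `quser` command and returns the sessions in the `Disc` (disconnected) state. The function name `disconnected_sessions` and the sample output are made up for this example, and the parsing assumes the English column layout; treat it as a sketch rather than a finished tool.

```python
from __future__ import annotations

# Illustrative sample only: usernames, IDs and times are invented.
SAMPLE_QUSER_OUTPUT = """\
 USERNAME              SESSIONNAME        ID  STATE   IDLE TIME  LOGON TIME
>engineer1             rdp-tcp#12          2  Active          .  01/02/2024 09:14
 engineer2                                 3  Disc         3:41  01/02/2024 08:02
 engineer3                                 4  Disc      1+02:10  31/01/2024 16:45
"""


def disconnected_sessions(quser_output: str) -> list[dict]:
    """Return user, session ID and idle time for each disconnected session."""
    sessions = []
    for line in quser_output.splitlines()[1:]:  # skip the header row
        # The current session is prefixed with '>'; strip it with the padding.
        tokens = line.lstrip(" >").split()
        if "Disc" not in tokens:
            continue
        state_index = tokens.index("Disc")
        # The session ID is the column immediately before STATE.
        if state_index < 1 or not tokens[state_index - 1].isdigit():
            continue
        idle = tokens[state_index + 1] if state_index + 1 < len(tokens) else ""
        sessions.append(
            {
                "user": tokens[0],
                "session_id": int(tokens[state_index - 1]),
                "idle": idle,
            }
        )
    return sessions


if __name__ == "__main__":
    for session in disconnected_sessions(SAMPLE_QUSER_OUTPUT):
        print(session)
```

On a real system the input text could come from running `quser /server:<servername>` from a workstation, which lists sessions without opening another session on the server itself.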
The Root Cause Analysis process
While investigating the incident the engineers performed a number of standard fault-finding tasks:
- The engineers looked in the event log for application, system and security warnings or errors.
- They ran process monitor to look for signs of memory leaks and high processor utilisation.
- They checked the vital stats on the server: CPU, memory and hard disk space.
- While logged on to the server the engineers checked that relevant services were running.
A new engineer logs on to the server and receives a Windows logon message telling them that they must select a user to disconnect before they can log in.
The new engineer must then decide whether or not to disconnect other engineers. This decision can delay the resolution.
No login required
The engineers never needed to log in to the server at all to carry out the root cause analysis tasks listed above.
There are two ways to solve this logon issue.
- To solve the problem of disconnected sessions hogging resources, a group policy should be applied to set a time limit for disconnected sessions.
- To remove the need for the engineer to log on to the server in the first place, the engineer should use the SolarWinds web portal, where they can see the status of all their servers and applications via a NOC (Network Operations Centre) display.
The engineer can access the vital statistics of the server in question from one console:
- Average CPU Load
- Memory Utilisation
- Average Response Time & Packet Loss
- Disk Volumes
The Node Details display gives the engineer access to the Real-Time Process Explorer, Service Control Manager and Real-Time Event Log Viewer, allowing them to carry out all the tasks listed in the Root Cause Analysis process above from one console.
Scheduled tasks are displayed. It is easy to spot a failure. An alert is generated if a scheduled task fails.
Alerts and events about the node are displayed. Any active alerts are listed in red, and an email is sent to the engineer to let them know an alert is active.
Capacity forecasting is displayed to give the engineer the ability to predict and plan for future capacity problems.
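To give a feel for what a capacity forecast involves, here is a minimal sketch that fits a straight line to daily disk usage samples and extrapolates to the day the volume would fill. This is a plain least-squares trend chosen for illustration, not a description of how SolarWinds calculates its forecasts; the function name `days_until_full` and the sample figures are hypothetical.

```python
from __future__ import annotations


def days_until_full(used_gb: list[float], capacity_gb: float) -> float | None:
    """Estimate days from the last sample until usage reaches capacity.

    used_gb holds one reading per day, oldest first. Returns None when the
    trend is flat or falling, because no fill date can be projected.
    """
    n = len(used_gb)
    if n < 2:
        raise ValueError("need at least two daily samples")

    # Least-squares line through the points (day index, used GB).
    mean_x = (n - 1) / 2
    mean_y = sum(used_gb) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(used_gb))
    slope = sxy / sxx  # growth in GB per day
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    day_full = (capacity_gb - intercept) / slope
    # Count from the most recent sample, never returning a negative value.
    return max(0.0, day_full - (n - 1))


if __name__ == "__main__":
    # Invented readings: a 500 GB volume growing by about 5 GB a day.
    print(days_until_full([400, 405, 410, 415, 420], capacity_gb=500))
```

A straight line is the simplest possible model; real usage often grows in steps or seasonally, so a projection like this is best read as an early warning rather than a precise date.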
Any application templates added to a node, such as those for an application server that runs IIS and MySQL, are also monitored. Any problems are highlighted, making it easier for an engineer to see exactly where the problem is.
With read-only access to SolarWinds, an engineer can make an informed judgement about what is causing a server to produce an error, and never needs to log in to the server to provide this diagnosis.