Is Your Environment Operationally Ready?

Do You Want Reports With That?

Running a successful environment requires assurance, quality control and removal of any troublesome Orcs.

To provide peace of mind, readiness checks should be automated and run every day.

The results should be compiled into a daily report and sent on a schedule at 9am every morning.

These checks provide a high level of assurance that everything in our environment is monitored and performing at satisfactory levels.

Each section listed below should be its own report, otherwise one morning report could get very large.

These reports can be e-mailed to users or placed into a folder to be checked every day.

Here are my critical reports and operational maintenance procedures for any environment (I will add more as I discover useful environmental reports):

SolarWinds Functions

SolarWinds Agent Inventory.

Capacity Management

Disk Space Utilisation (Windows servers).

Disk Space Utilisation (Linux servers).

Average CPU Utilisation (%) over the Last 24 Hours for Windows and Linux Servers.

Average Memory Utilisation (%, MB/GB) over the Last 24 Hours for Windows and Linux Servers.
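As a rough illustration of what a disk space report boils down to, here is a minimal Python sketch that turns used and total bytes into a percentage and a status. The 80% and 90% thresholds and the function names are my own assumptions for illustration, not SolarWinds defaults; set them to whatever your environment has agreed.

```python
import shutil

# Assumed thresholds for illustration only; not SolarWinds defaults.
WARN_PERCENT = 80.0
CRIT_PERCENT = 90.0


def used_percent(total_bytes, used_bytes):
    """Return used space as a percentage of total space."""
    return 100.0 * used_bytes / total_bytes


def classify(percent, warn=WARN_PERCENT, crit=CRIT_PERCENT):
    """Map a utilisation percentage to 'ok', 'warning' or 'critical'."""
    if percent >= crit:
        return "critical"
    if percent >= warn:
        return "warning"
    return "ok"


def check_path(path):
    """Check the volume holding `path` on the local machine."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent = used_percent(usage.total, usage.used)
    return round(percent, 1), classify(percent)
```

Calling check_path("/") on Linux or check_path("C:\\") on Windows returns the percentage and status for that volume, which is the kind of row a daily capacity report would contain.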

Storage Availability / Performance

Storage Disk Array Availability, Performance and IOPS.

Storage array LUN forecasted run-out date.
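A forecasted run-out date is essentially a trend line through recent usage, extended until it reaches the LUN's capacity. The sketch below assumes simple linear growth fitted by least squares over daily samples; this is my simplification for illustration and not necessarily how your storage tooling calculates it. The function name and sample layout are hypothetical.

```python
import math
from datetime import date


def forecast_run_out(samples, capacity_gb):
    """Estimate the date a LUN fills up, assuming linear growth.

    samples: list of (date, used_gb) tuples.
    Returns a date, or None if usage is flat, shrinking, or there
    are too few samples to fit a trend.
    """
    if len(samples) < 2:
        return None
    xs = [d.toordinal() for d, _ in samples]  # days as plain numbers
    ys = [used for _, used in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    spread = sum((x - mean_x) ** 2 for x in xs)
    if spread == 0:
        return None  # all samples taken on the same day
    # Least-squares slope: growth in GB per day.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / spread
    if slope <= 0:
        return None  # not growing, so no run-out date
    intercept = mean_y - slope * mean_x
    full_day = (capacity_gb - intercept) / slope
    return date.fromordinal(math.ceil(full_day))
```

For example, a 200 GB LUN that held 100, 110 and 120 GB on three consecutive days is growing at 10 GB per day, so it would be forecast to fill eight days after the last sample.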

Alerting

Triggered Alerts – Last 24 Hours web-based. This includes every single alert for the environment over the last 24 hours.

Network Equipment Availability / Performance

Critical Connections. (Bandwidth, Utilisation, Events, Discards & Errors.)

Internet connectivity.

Critical Data Centre Interconnect ports.

Other Critical Links & Ports.

Firewalls.

Routers.

Switches.

Load Balancers.

Virtual Infrastructure

vCenter and ESX host availability and performance report.

Run through the environment vCheck report to identify any issues and outstanding alerts.

\\<domain name>\<dfs root name>\backups\vCheck

Application Availability & Performance

Business critical applications.

Application monthly availability table and performance chart. Examples below.

  • Active Directory.
  • DNS
  • NTP
  • Bespoke Applications.
  • Other Critical Business Applications.
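For the monthly availability table, the figure is just uptime divided by the total time in the month. A minimal Python sketch, assuming downtime is recorded in minutes (the function name is hypothetical):

```python
import calendar


def monthly_availability(year, month, downtime_minutes):
    """Return availability for the given month as a percentage."""
    days_in_month = calendar.monthrange(year, month)[1]
    total_minutes = days_in_month * 24 * 60
    # Clamp at zero in case recorded downtime exceeds the month.
    uptime_minutes = max(total_minutes - downtime_minutes, 0)
    return 100.0 * uptime_minutes / total_minutes
```

As a sanity check, 43.2 minutes of downtime in a 30-day month works out to 99.9% availability.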

Database Availability / Performance

Refer to Applications – Map dependencies.

Database Availability / Performance for databases added into AppInsight for SQL and Database Performance Analyzer.

Hardware Availability / Performance

Important hardware and performance of that hardware.

Configuration Status

Network device changes (NCM)

Application changes (SCM? / PowerShell scripts)

Server changes (SCM? / PowerShell scripts)

SolarWinds Operational and Maintenance Procedures

Operational Shutdown

Before making any changes to the SolarWinds environment, such as application or OS patching, the SolarWinds services must be shut down.

Orion Service Manager is located on the SolarWinds application server and is a better alternative to stopping and starting services manually.

To shut down SolarWinds correctly, click Stop Everything.

To start up SolarWinds correctly, click Start Everything.

Operational Configuration – Daily

Check for recurring alerts and events, determine what is causing them, raise and resolve the issue with the relevant service owner, and alter thresholds accordingly.

Check the environmental AppStack and application health overview for new critical or warning items. Raise and resolve.

Check through nightly imported nodes, configure AppInsight and apply templates if required. Speak to the service owner to find out about new nodes and their requirements.

Remove or unmanage nodes that are no longer required, once confirmed by the service owner.

Check NOC displays for updates due to new nodes being added or nodes being removed.

Raise any particular alerting issues at the morning meeting.
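Spotting recurring alerts in the daily checks above is mostly a counting exercise. Here is a small Python sketch that groups the last 24 hours of triggered alerts by alert name and node and keeps the ones that fired repeatedly. The dictionary keys ("alert", "node") and the threshold of three are assumptions for illustration; map them to however you export your alert history.

```python
from collections import Counter


def recurring_alerts(alerts, min_count=3):
    """Return ((alert, node), count) pairs that fired at least min_count times.

    alerts: iterable of dicts with hypothetical 'alert' and 'node' keys.
    Results are ordered from most to least frequent.
    """
    counts = Counter((a["alert"], a["node"]) for a in alerts)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]
```

Anything this returns is a candidate for a threshold review or a conversation with the service owner.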

Preventative Maintenance – Weekly

Review and apply, as needed, Application Patches and Hotfixes to the SolarWinds Application Server.

Review and apply, as needed, Updates to the OS, Hardware Drivers, etc.

Review CPU/Memory/Disk on the Monitoring application server, the Monitoring SQL server and any other SolarWinds application servers.

Check Active Diagnostics Bi-Weekly.

Run Diagnostics for support purposes – Store elsewhere than application or database server (Other network location or offline storage).

Take Snapshots (If virtualized).

Review Enable/Disable Automatic Baselining settings as needed.

Review Enable/Disable Automatic Dependencies as needed.

Alerts should be evaluated every 1 minute, unless the environment requires more frequent alerting.

Check the raw SQL queries of in-house alerts against the Estimated Execution Plan in SQL Server Management Studio to confirm they do not need refactoring.

Review “Down” or “Unknown” Nodes or Applications for polling errors, and for any that need to be unmanaged or removed from monitoring.

Review Custom Property Utilization Across the Board for all Nodes, Applications, Interfaces, etc. – Fill in gaps where needed.

Database Maintenance Review – Monthly

Verify Database Maintenance is completing on the MS SQL Server running the SolarWinds Orion database.

Review C:\ProgramData\SolarWinds\Logs\Orion\swdebugMaintenance.log

Ensure “Database Maintenance Complete” Message appears in a reasonable amount of time from when it begins at 2:15 AM every morning (by default).

Ensure you have Database backups and you can restore from those backups.

Index Rebuilds / Reindexing – Ensure this is happening, quarterly at a minimum.

Review Total Database Size and/or large tables for uncharacteristic growth.

Verify Trap & Syslog Messages are being cleaned out (space reservations in SQL DB).
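The maintenance log check above can be scripted rather than eyeballed. The sketch below looks for the last "Database Maintenance Complete" line and measures how long after the 2:15 AM default start it appeared. I am assuming each log line begins with a timestamp in the form "YYYY-MM-DD HH:MM:SS"; check your own log and adjust TIMESTAMP_FORMAT if it differs. The function name is hypothetical.

```python
from datetime import datetime, timedelta

COMPLETE_MARKER = "Database Maintenance Complete"
# Assumed timestamp layout at the start of each line; verify against your log.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
TIMESTAMP_LENGTH = 19  # length of "YYYY-MM-DD HH:MM:SS"


def maintenance_duration(lines, start_hour=2, start_minute=15):
    """Return how long after the scheduled start maintenance completed.

    Scans from the end of the log for the most recent completion message.
    Returns a timedelta, or None if no parsable completion line is found.
    """
    for line in reversed(lines):
        if COMPLETE_MARKER not in line:
            continue
        try:
            finished = datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # line does not start with the assumed timestamp
        started = finished.replace(
            hour=start_hour, minute=start_minute, second=0, microsecond=0
        )
        if finished < started:
            # Completed before today's start time, so it began yesterday.
            started -= timedelta(days=1)
        return finished - started
    return None
```

A result of None means the completion message never appeared, which is worth investigating on its own. What counts as a "reasonable" duration depends on the size of your database, so compare against your own history.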

Settings and Configurations – Monthly

Review Polling Engine Polling Status/Rates.

Check Polling Completion (Polling Engine Status Report).

Check Polling Settings (Consider Alert Time Frames vs. Polling Intervals). Reminder: longer retention for Syslog, Trap, Discovery and Downtime data will increase DB size.

Review Logging Levels – Run LogAdjuster.exe to see all logging settings.

Ensure that Logging Levels are at the Defaults or specified standard settings.

Review Logs with Quick Rotations/Timestamp Successions.

Review Diagnostic Logs for Errors – Check against Success Center for common issues; escalate to Support if KB is unable to solve.

Ensure Permissions are set appropriately (Run Orion Permission Checker).

Last but not least, check for any remaining Orcs and remove them…