
CUCM is the heart of Cisco Collaboration. It handles signaling with the other servers in the deployment and registers almost all devices, so it must be kept in good health, and that requires a clear plan for checking it. How do you check? What exactly should you check? What counts as alarming? The nine simple steps below walk you through a basic CUCM health check. You will need to log in to the CUCM CLI with OS administrator credentials.

Step 1. When you log in, make sure the login banner reports the disk partitions as aligned.

Command Line Interface is starting up, please wait ...
   Welcome to the Platform Command Line Interface
VMware Installation:
        2 vCPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
        Disk 1: 80GB, Partitions aligned
        6144 Mbytes RAM
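
If you capture the login banner to a text file, the alignment check can be scripted. The snippet below is only a minimal sketch: the function name `unaligned_disks` is hypothetical, and it simply assumes that a healthy disk line contains the phrase "Partitions aligned", as in the banner above.

```python
import re

def unaligned_disks(banner: str) -> list[str]:
    """Return 'Disk N' labels whose banner line lacks 'Partitions aligned'."""
    bad = []
    for line in banner.splitlines():
        m = re.match(r"\s*(Disk \d+):", line)
        # Assumption: any disk line without this exact phrase needs attention.
        if m and "Partitions aligned" not in line:
            bad.append(m.group(1))
    return bad

sample = """\
VMware Installation:
        2 vCPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
        Disk 1: 80GB, Partitions aligned
        6144 Mbytes RAM
"""
print(unaligned_disks(sample))  # prints []
```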

Step 2. Run the command “show status”.

This is usually the first command to run when you start analyzing any problem.

It reports important information such as the server's hostname, full product version, and uptime.

Unified OS Version: This is the version of the guest OS. CUCM releases before 12.x run on RHEL (RHEL 6 for the 11.5 release shown here), while CUCM 12.x releases moved to CentOS.

When should you be alarmed? An average processor load above 60-70%, IOWAIT above 4-5%, or disk usage above 90% on the active, inactive, or logging partition may indicate a potential problem with the server. In the sample output below, the active partition is at 96% and would warrant a closer look.

admin:show status
Host Name          : PUBLISHER
Date               : Fri Apr 2, 2021 11:05:50
Time Zone          : India Standard Time (Asia/Kolkata)
Locale             : en_US.UTF-8
Product Ver        : 11.5.1.16900-16
Unified OS Version : 6.0.0.0-2
Uptime:
 11:05:52 up 154 days, 23:54,  1 user,  load average: 0.43, 0.47, 0.45

CPU Idle:   91.80%  System:   06.60%    User:   06.09%
  IOWAIT:   00.00%     IRQ:   00.00%    Soft:   00.51%
Memory Total:        5993936K
        Free:         280396K
        Used:        5713540K
      Cached:        1813344K
      Shared:         303692K
     Buffers:         200508K
                        Total            Free            Used
Disk/active         14154228K         575156K       13433944K (96%)
Disk/inactive       14154228K       13393148K          35424K (1%)
Disk/logging        49573612K       19563780K       27484904K (59%)
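
The thresholds described in this step can be checked automatically against saved “show status” output. The following is a rough sketch, not an official tool: the function name and constants are hypothetical, the limits are the upper ends of the ranges mentioned above, and processor load is approximated as 100 minus the CPU Idle percentage, which is an assumption.

```python
import re

# Limits taken from the text above; tune them for your environment.
MAX_CPU_BUSY_PCT = 70.0
MAX_IOWAIT_PCT = 5.0
MAX_DISK_USED_PCT = 90

def check_show_status(output: str) -> list[str]:
    """Return a list of warnings found in raw 'show status' text."""
    warnings = []

    idle = re.search(r"CPU Idle:\s*([\d.]+)%", output)
    if idle:
        busy = 100.0 - float(idle.group(1))  # assumption: busy = 100 - idle
        if busy > MAX_CPU_BUSY_PCT:
            warnings.append(f"CPU busy {busy:.1f}%")

    iowait = re.search(r"IOWAIT:\s*([\d.]+)%", output)
    if iowait and float(iowait.group(1)) > MAX_IOWAIT_PCT:
        warnings.append(f"IOWAIT {iowait.group(1)}%")

    # Lines such as: Disk/active  14154228K  575156K  13433944K (96%)
    for name, used in re.findall(r"^Disk/(\w+).*\((\d+)%\)", output, re.M):
        if int(used) > MAX_DISK_USED_PCT:
            warnings.append(f"Disk/{name} {used}% used")
    return warnings

sample = """\
CPU Idle:   91.80%  System:   06.60%    User:   06.09%
  IOWAIT:   00.00%     IRQ:   00.00%    Soft:   00.51%
Disk/active         14154228K         575156K       13433944K (96%)
Disk/inactive       14154228K       13393148K          35424K (1%)
Disk/logging        49573612K       19563780K       27484904K (59%)
"""
print(check_show_status(sample))  # prints ['Disk/active 96% used']
```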

Step 3. The command “show network cluster” lists the nodes in the cluster. The operations team should look at the “authenticated using TCP” part of each line: every server in the cluster must be in the “authenticated” state.

admin:show network cluster
172.16.154.142 SUBSCRIBER01.domain.com SUBSCRIBER01 Subscriber callmanager DBSub authenticated using TCP since Sun Nov 29 17:16:37 2020
172.16.154.143 SUBSCRIBER02.domain.com SUBSCRIBER02 Subscriber callmanager DBSub authenticated using TCP since Fri Mar 19 10:28:13 2021
172.16.154.149 PRESENCE02.domain.com PRESENCE02 Subscriber cups DBSub authenticated using TCP since Sun Nov 29 17:16:26 2020
172.16.154.148 PRESENCE01.domain.com PRESENCE01 Subscriber cups DBPub authenticated using TCP since Thu Oct 29 11:13:29 2020
172.16.154.141 PUBLISHER.domain.com PUBLISHER Publisher callmanager DBPub authenticated
Server Table (processnode) Entries
----------------------------------
PUBLISHER.domain.com
SUBSCRIBER01.domain.com
SUBSCRIBER02.domain.com
172.16.154.148
172.16.154.149
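
For clusters with many nodes, the authentication check can be scripted over the captured output. This is a sketch under assumptions: the function name is hypothetical, the column layout is inferred from the sample above, and because the article does not show what a failing node looks like, any node line whose state text does not begin with "authenticated" is treated as a problem.

```python
def unauthenticated_nodes(output: str) -> list[str]:
    """Return FQDNs of node lines whose state does not start with 'authenticated'."""
    bad = []
    for line in output.splitlines():
        fields = line.split()
        # Node lines start with an IPv4 address followed by several columns;
        # the server-table entries at the bottom have a single field.
        if len(fields) < 6 or fields[0].count(".") != 3:
            continue
        state = " ".join(fields[6:])  # everything after the DBPub/DBSub column
        if not state.startswith("authenticated"):
            bad.append(fields[1])
    return bad

sample = """\
172.16.154.142 SUBSCRIBER01.domain.com SUBSCRIBER01 Subscriber callmanager DBSub authenticated using TCP since Sun Nov 29 17:16:37 2020
172.16.154.141 PUBLISHER.domain.com PUBLISHER Publisher callmanager DBPub authenticated
PUBLISHER.domain.com
172.16.154.148
"""
print(unauthenticated_nodes(sample))  # prints []
```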

Step 4. The next useful command is “utils diagnose test”, which runs a self-assessment of the server covering all the important aspects.

The output reports errors, if any, in the most common problem areas: disk space, Service Manager, Tomcat (memory leaks and HTTP/HTTPS issues), network connectivity, and NTP.
If you see any issues related to Tomcat or Service Manager, contact Cisco TAC.

admin:utils diagnose test
Log file: platform/log/diag4.log
Starting diagnostic test(s)
===========================
test - disk_space          : Passed (available: 562 MB, used: 13120 MB)
skip - disk_files          : This module must be run directly and off hours
test - service_manager     : Passed
test - tomcat              : Passed
test - tomcat_deadlocks    : Passed
test - tomcat_keystore     : Passed
test - tomcat_connectors   : Passed
test - tomcat_threads      : Passed
test - tomcat_memory       : Passed
test - tomcat_sessions     : Passed
skip - tomcat_heapdump     : This module must be run directly and off hours
test - validate_network    : Passed
test - raid                : Passed
test - system_info         : Passed (Collected system information in diagnostic log)
test - ntp_reachability    : Passed
test - ntp_clock_drift     : Passed
test - ntp_stratum         : Passed
skip - sdl_fragmentation   : This module must be run directly and off hours
skip - sdi_fragmentation   : This module must be run directly and off hours
Diagnostics Completed
 The final output will be in Log file: platform/log/diag4.log
 Please use 'file view activelog platform/log/diag4.log' command to see the output
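
To pick out problems from a long diagnostic run, you can filter the saved output for any test line that did not pass. The sketch below uses a hypothetical function name, and the last sample line is invented for illustration: the exact wording of a failing result is an assumption, so the code treats anything that does not start with "Passed" as a failure.

```python
import re

def failed_diagnostics(output: str) -> dict[str, str]:
    """Map module name -> result text for every 'test' line that did not pass."""
    failures = {}
    for line in output.splitlines():
        # 'skip' lines are ignored; only 'test - <module> : <result>' is examined.
        m = re.match(r"\s*test - (\S+)\s*:\s*(.*)", line)
        if m and not m.group(2).startswith("Passed"):
            failures[m.group(1)] = m.group(2).strip()
    return failures

# The first three lines come from the output above; the last one is a
# hypothetical failing line added only to exercise the function.
sample = """\
test - disk_space          : Passed (available: 562 MB, used: 13120 MB)
skip - disk_files          : This module must be run directly and off hours
test - service_manager     : Passed
test - ntp_reachability    : Failed
"""
print(failed_diagnostics(sample))  # prints {'ntp_reachability': 'Failed'}
```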

Step 5. The command “utils ntp status” is self-explanatory. NTP synchronization is mandatory for all devices in the network, and it is especially important in a Collaboration infrastructure.

NTP problems in a collaboration deployment can cause complicated issues such as DB replication failures; Informix DB replication will not be stable without NTP synchronization.

Run the command on all CUCM and IM and Presence nodes. The output on every node should resemble the example below.

Check the following in the NTP status:

 (i) NTP should be synchronized.

 (ii) The NTP stratum should be <= 3 on the Publisher node and <= 4 on a Subscriber node.

admin:utils ntp status
ntpd (pid 24927) is running...
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*172.16.151.1    LOCAL(0)         2 u  831 1024  377    0.989    1.242   1.200
synchronised to NTP server (172.16.151.1) at stratum 3
   time correct to within 40 ms
   polling server every 1024 s
Current time in UTC is : Fri Apr  2 05:37:11 UTC 2021
Current time in Asia/Kolkata is : Fri Apr  2 11:07:11 IST 2021
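
The two NTP checks listed in this step can be expressed as a short script over the captured output. This is a minimal sketch: the function name and the `is_publisher` parameter are hypothetical, the stratum limits are the ones stated in this article, and the parser relies on the "synchronised to ... at stratum N" line appearing as it does in the sample above.

```python
import re

def check_ntp(output: str, is_publisher: bool = True) -> list[str]:
    """Return a list of NTP problems found in raw 'utils ntp status' text."""
    m = re.search(r"synchronised to .* at stratum (\d+)", output)
    if not m:
        return ["not synchronised"]
    limit = 3 if is_publisher else 4  # limits as stated in the article
    stratum = int(m.group(1))
    if stratum > limit:
        return [f"stratum {stratum} exceeds {limit}"]
    return []

sample = """\
ntpd (pid 24927) is running...
*172.16.151.1    LOCAL(0)         2 u  831 1024  377    0.989    1.242   1.200
synchronised to NTP server (172.16.151.1) at stratum 3
   time correct to within 40 ms
"""
print(check_ntp(sample))  # prints []
```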

Step 6. The command “utils service list” shows whether the required services on the server are in the started state. Check this on all nodes of the cluster.

admin:utils service list
Requesting service status, please wait...
System SSH [STARTED]
Cluster Manager [STARTED]
Name Service Cache [STARTED]
Entropy Monitoring Daemon [STARTED]
Cisco SCSI Watchdog [STARTED]
Service Manager [STARTED]
HTTPS Configuration Download [STARTED]
Service Manager is running
Getting list of all services
>> Return code = 0
A Cisco DB[STARTED]
A Cisco DB Replicator[STARTED]
Cisco AMC Service[STARTED]
Cisco AXL Web Service[STARTED]
Cisco Audit Event Service[STARTED]
Cisco Bulk Provisioning Service[STARTED]
Cisco CAR DB[STARTED]
Cisco CAR Scheduler[STARTED]
Cisco CAR Web Service[STARTED]
Cisco CDP[STARTED]
Cisco CDP Agent[STARTED]
Cisco CDR Agent[STARTED]
Cisco CDR Repository Manager[STARTED]
Cisco CTIManager[STARTED]
Cisco CTL Provider[STARTED]
Cisco CallManager[STARTED]
Cisco CallManager Admin[STARTED]
Cisco CallManager SNMP Service[STARTED]
Cisco CallManager Serviceability[STARTED]
Cisco CallManager Serviceability RTMT[STARTED]
Cisco Certificate Authority Proxy Function[STARTED]
Cisco Certificate Change Notification[STARTED]
Cisco Certificate Expiry Monitor[STARTED]
Cisco Change Credential Application[STARTED]
Cisco DHCP Monitor Service[STARTED]
Cisco DRF Local[STARTED]
Cisco DRF Master[STARTED]
Cisco Database Layer Monitor[STARTED]
Cisco Dialed Number Analyzer[STARTED]
Cisco Dialed Number Analyzer Server[STARTED]
Cisco DirSync[STARTED]
Cisco Directory Number Alias Lookup[STARTED]
Cisco Directory Number Alias Sync[STARTED]
Cisco E911[STARTED]
Cisco ELM Client Service[STARTED]
Cisco Extended Functions[STARTED]
Cisco Extension Mobility[STARTED]
Cisco Extension Mobility Application[STARTED]
Cisco IP Manager Assistant[STARTED]
Cisco IP Voice Media Streaming App[STARTED]
Cisco Intercluster Lookup Service[STARTED]
Cisco License Manager[STARTED]
Cisco Location Bandwidth Manager[STARTED]
Cisco Log Partition Monitoring Tool[STARTED]
Cisco Management Agent Service[STARTED]
Cisco Prime LM Admin[STARTED]
Cisco Prime LM DB[STARTED]
Cisco Prime LM Server[STARTED]
Cisco Push Notification Service[STARTED]
Cisco RIS Data Collector[STARTED]
Cisco RTMT Reporter Servlet[STARTED]
Cisco SOAP - CDRonDemand Service[STARTED]
Cisco SOAP - CallRecord Service[STARTED]
Cisco Serviceability Reporter[STARTED]
Cisco Syslog Agent[STARTED]
Cisco TAPS Service[STARTED]
Cisco Tftp[STARTED]
Cisco Tomcat[STARTED]
Cisco Tomcat Stats Servlet[STARTED]
Cisco Trace Collection Service[STARTED]
Cisco Trace Collection Servlet[STARTED]
Cisco Trust Verification Service[STARTED]
Cisco UXL Web Service[STARTED]
Cisco Unified Mobile Voice Access Service[STARTED]
Cisco User Data Services[STARTED]
Cisco WebDialer Web Service[STARTED]
Cisco Wireless Controller Synchronization Service[STARTED]
Host Resources Agent[STARTED]
MIB2 Agent[STARTED]
Platform Administrative Web Service[STARTED]
SNMP Master Agent[STARTED]
SOAP - Diagnostic Portal Database Service[STARTED]
SOAP -Log Collection APIs[STARTED]
SOAP -Performance Monitoring APIs[STARTED]
SOAP -Real-Time Service APIs[STARTED]
Self Provisioning IVR[STARTED]
System Application Agent[STARTED]
Cisco Prime LM Resource API[STOPPED]  Service Not Activated
Cisco Prime LM Resource Legacy API[STOPPED]  Service Not Activated
Primary Node =true
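
Because the service list is long, it helps to filter it down to the entries that need attention. The sketch below uses a hypothetical function name and assumes that services marked "Service Not Activated" are intentionally off and can be ignored. The "Cisco Tftp[STOPPED]" line in the sample is invented for illustration and does not come from the output above.

```python
import re

def stopped_services(output: str) -> list[str]:
    """Return activated services whose state is anything other than STARTED."""
    stopped = []
    for line in output.splitlines():
        # Matches both 'System SSH [STARTED]' and 'A Cisco DB[STARTED]'.
        m = re.match(r"(.+?)\s*\[([^\]]+)\]\s*(.*)", line)
        if not m:
            continue
        name, state, note = m.groups()
        if state != "STARTED" and "Service Not Activated" not in note:
            stopped.append(name)
    return stopped

sample = """\
System SSH [STARTED]
A Cisco DB[STARTED]
Cisco Tftp[STOPPED]
Cisco Prime LM Resource API[STOPPED]  Service Not Activated
Primary Node =true
"""
print(stopped_services(sample))  # prints ['Cisco Tftp']
```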

Step 7. The next command, “utils dbreplication status”, starts a replication status check in the background that verifies all tables in the database.

admin:utils dbreplication status
Replication status check is now running in background.
Use command 'utils dbreplication runtimestate' to check its progress
The final output will be in file cm/trace/dbl/sdi/ReplicationStatus.2021_04_02_11_08_03.out
Please use "file view activelog cm/trace/dbl/sdi/ReplicationStatus.2021_04_02_11_08_03.out " command to see the output

After a few minutes, use the command “utils dbreplication runtimestate” to check the replication status.

Many problems show up as unexpected CUCM behavior, such as a Subscriber not working as expected or not picking up configuration made on the Publisher.

These issues, along with others that are difficult to reproduce, often originate from an Informix database replication malfunction. To make sure replication is healthy, check the following parameters.

(i) The replication status command has ended and checked all tables (<no> out of <no>)
(ii) No errors or mismatches found
(iii) Ping response should be <= 80 ms
(iv) DB/RPC/DbMon should be Y/Y/Y for every node
(v) Replication queue is 0
(vi) Replication setup is in state (2) with Setup Completed

 If you see anything different in the output, consult Cisco TAC.

admin:utils dbreplication runtimestate
Server Time: Fri Apr  2 11:12:16 IST 2021
Cluster Replication State: Replication status command started at: 2021-04-02-11-08
     Replication status command ENDED. Checked 706 tables out of 706
     Last Completed Table: devicenumplanmapremdestmap
     No Errors or Mismatches found.
     Use 'file view activelog cm/trace/dbl/sdi/ReplicationStatus.2021_04_02_11_08_03.out' to see the details
DB Version: ccm11_5_1_16900_16
Repltimeout set to: 300s
PROCESS option set to: 1
Cluster Detailed View from PUBLISHER (3 Servers):

                                      PING      DB/RPC/   REPL.    Replication    REPLICATION SETUP
SERVER-NAME         IP ADDRESS        (msec)    DbMon?    QUEUE    Group ID       (RTMT) & Details
-----------         ----------        ------    -------   -----    -----------    ------------------
PUBLISHER       172.16.154.141    0.023     Y/Y/Y     0        (g_2)          (2) Setup Completed
SUBSCRIBER01     172.16.154.142    0.202     Y/Y/Y     0        (g_3)          (2) Setup Completed
SUBSCRIBER02     172.16.154.143    0.204     Y/Y/Y     0        (g_4)          (2) Setup Completed
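
The six checks listed in this step can be applied mechanically to saved “utils dbreplication runtimestate” output. The following is a sketch only: the function names are hypothetical, the column positions are inferred from the sample above, and the 80 ms ping limit is the value given in the checklist.

```python
import re

def _is_ipv4(text: str) -> bool:
    parts = text.split(".")
    return len(parts) == 4 and all(p.isdigit() for p in parts)

def check_replication(output: str, max_ping_ms: float = 80.0) -> list[str]:
    """Return a list of replication problems found in the raw text."""
    problems = []
    m = re.search(r"Checked (\d+) tables out of (\d+)", output)
    if not m or m.group(1) != m.group(2):
        problems.append("table check incomplete or missing")
    if "No Errors or Mismatches found" not in output:
        problems.append("errors or mismatches reported")

    for line in output.splitlines():
        f = line.split()
        # Node rows: name, IP, ping, DB/RPC/DbMon, queue, group ID, (state) ...
        if len(f) < 7 or not _is_ipv4(f[1]):
            continue
        name, ping, flags, queue, state = f[0], f[2], f[3], f[4], f[6]
        try:
            if float(ping) > max_ping_ms:
                problems.append(f"{name}: ping {ping} ms")
        except ValueError:
            problems.append(f"{name}: ping not numeric ({ping})")
        if flags != "Y/Y/Y":
            problems.append(f"{name}: DB/RPC/DbMon is {flags}")
        if queue != "0":
            problems.append(f"{name}: replication queue is {queue}")
        if state != "(2)":
            problems.append(f"{name}: setup state is {state}")
    return problems

sample = """\
     Replication status command ENDED. Checked 706 tables out of 706
     No Errors or Mismatches found.
PUBLISHER       172.16.154.141    0.023     Y/Y/Y     0        (g_2)          (2) Setup Completed
SUBSCRIBER01     172.16.154.142    0.202     Y/Y/Y     0        (g_3)          (2) Setup Completed
"""
print(check_replication(sample))  # prints []
```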

Step 8. The next command, “file view install system-history.log”, displays the events that have occurred on a node: restarts, installation of components (COP files), and failed and successful backups.

If the system was shut down ungracefully, you will see two consecutive Boot entries with no Shutdown or Restart entry between them.
This is an example of an unclean shutdown:
08/14/2012 13:36:09 | root: Boot 9.0.1.10000-37 Start
08/14/2012 17:28:25 | root: Boot 9.0.1.10000-37 Start

A typical output looks like the example below.

admin:file view install system-history.log
=======================================
Product Name -    Cisco Unified Communications Manager
Product Version - 11.5.1.16900-16
Kernel Image -    2.6.32-573.18.1.el6.x86_64
=======================================
08/01/2019 17:00:56 | root: Boot 11.5.1.16900-16 Start
08/01/2019 20:39:39 | root: Cluster Security Mode Cluster set to secure mode using tokenless CTL (CLI)
08/13/2019 17:31:36 | root: Boot 11.5.1.16900-16 Start
08/21/2019 12:06:42 | root: Restart 11.5.1.16900-16 Start
08/21/2019 12:07:15 | root: Boot 11.5.1.16900-16 Start
09/17/2019 17:33:02 | root: Shutdown 11.5.1.16900-16 Start
09/18/2019 09:55:23 | root: Boot 11.5.1.16900-16 Start
09/18/2019 10:45:52 | root: Shutdown 11.5.1.16900-16 Start
09/18/2019 14:45:45 | root: DRS Backup UCMVersion:11.5.1.16900-16/CUPVersion:11.5.1.16910-12 Start
09/18/2019 15:11:58 | root: DRS Backup UCMVersion:11.5.1.16900-16/CUPVersion:11.5.1.16910-12 Success
01/15/2020 12:21:06 | root: Cisco Option Install ciscocm.free_common_space_v1.5.cop Start
01/15/2020 12:21:14 | root: Cisco Option Install ciscocm.free_common_space_v1.5.cop Success
03/06/2021 10:49:26 | root: DRS Backup UCMVersion:11.5.1.16900-16/CUPVersion:11.5.1.16910-12 Start
03/06/2021 11:16:00 | root: DRS Backup UCMVersion:11.5.1.16900-16/CUPVersion:11.5.1.16910-12 Success
end of the file reached
options: q=quit, n=next, p=prev, b=begin, e=end (lines 61 - 65 of 65) :
admin:
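
The unclean-shutdown rule described in this step (a Boot entry that follows another Boot entry with no Shutdown or Restart in between) can be applied to a saved copy of the log. This is a heuristic sketch with a hypothetical function name; it only looks at Boot, Shutdown, and Restart entries and ignores everything else.

```python
import re

EVENT = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \| root: (Boot|Shutdown|Restart) "
)

def unclean_boots(output: str) -> list[str]:
    """Return timestamps of Boot entries that directly follow another Boot."""
    suspicious = []
    previous = None
    for line in output.splitlines():
        m = EVENT.match(line)
        if not m:
            continue  # backups, COP installs and other entries are skipped
        stamp, kind = m.groups()
        if kind == "Boot" and previous == "Boot":
            suspicious.append(stamp)
        previous = kind
    return suspicious

# The unclean-shutdown example quoted earlier in this step.
sample = """\
08/14/2012 13:36:09 | root: Boot 9.0.1.10000-37 Start
08/14/2012 17:28:25 | root: Boot 9.0.1.10000-37 Start
"""
print(unclean_boots(sample))  # prints ['08/14/2012 17:28:25']
```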

Step 9. The next command, “utils core active list”, is used to identify errors or crashes in server processes. It lists any active core dumps generated on the Linux server.

Cisco TAC may request this output along with the output of “utils core active analyze <core file name>” so they can trace the issue, determine whether a known issue or bug was hit, and suggest next steps for remediation.

admin:utils core active list
      Size         Date            Core File Name
=================================================================
 291096 KB   2020-06-15 10:11:40   core.4335.11.cef.1592196098
 247816 KB   2020-09-29 11:33:16   core.11201.11.cef.1601359394
 329728 KB   2021-03-27 14:43:35   core.24693.11.cef.1616836414
 186288 KB   2020-02-08 11:25:27   core.23353.11.cef.1581141326
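
The list is not necessarily in date order, as the sample above shows, so it can be useful to sort it before deciding which core file to analyze first. The sketch below uses a hypothetical function name and assumes each data row has the five whitespace-separated fields seen in the sample.

```python
from datetime import datetime

def parse_core_list(output: str) -> list[tuple[datetime, int, str]]:
    """Return (timestamp, size_kb, file_name) tuples, newest first."""
    cores = []
    for line in output.splitlines():
        f = line.split()
        # Data rows look like: 291096 KB 2020-06-15 10:11:40 core.4335.11.cef...
        if len(f) == 5 and f[1] == "KB" and f[0].isdigit():
            when = datetime.strptime(f"{f[2]} {f[3]}", "%Y-%m-%d %H:%M:%S")
            cores.append((when, int(f[0]), f[4]))
    return sorted(cores, reverse=True)

sample = """\
      Size         Date            Core File Name
=================================================================
 291096 KB   2020-06-15 10:11:40   core.4335.11.cef.1592196098
 247816 KB   2020-09-29 11:33:16   core.11201.11.cef.1601359394
 329728 KB   2021-03-27 14:43:35   core.24693.11.cef.1616836414
 186288 KB   2020-02-08 11:25:27   core.23353.11.cef.1581141326
"""
newest = parse_core_list(sample)[0]
print(newest[2])  # prints core.24693.11.cef.1616836414
```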

We hope this article gives you an understanding of how to do a basic health check of CUCM. These steps will help you find out why the system is not running as expected and spot upcoming problems. The checks above verify the system at the virtual machine, network, service, and OS levels.


Author
Rahul Bhukal
Sr. Collaboration Consultant