Top 5 issues when using Cisco DCNM

April 26, 2019, Team ZT

What is Cisco DCNM (Data Center Network Manager)?

Cisco DCNM (Data Center Network Manager) is a next-generation network management solution for the data center, available as an OVA. It provides a high level of visibility and control through a web-based console for managing Cisco Nexus switches, Cisco MDS switches, and Cisco UCS. Cisco DCNM provides LAN fabric automation for Cisco Nexus devices and SAN fabric automation for Cisco MDS.

The goal of DCNM is to reduce operating expenses by making the operation, monitoring, and troubleshooting of the data center network infrastructure more efficient.

From the user's perspective, DCNM makes data center operations easy to configure and maintain. Users can monitor events, inventory, and performance for the Cisco Nexus switch family, and can also perform administrative tasks. Its comprehensive feature set covers routing, switching, and storage administration.

It comes in two versions:

Data Center Network Manager 10
Data Center Network Manager 11

DCNM 10 was released on February 29, 2016 and is still on the market. Cisco later revised the product and announced DCNM 11 on July 2, 2018.

Topology View of Cisco DCNM

DCNM offers easy options for viewing all information related to the network topology.

Topology is a first-class menu item in the DCNM 11 release, intended to give fully functional, detailed access to both configuration and monitoring. The DCNM topology consolidates the functionality of the existing Fabric topology and the current Dashboard topology into a new, fully featured topology that includes the following features in a single view:

Optional display of Vinci Balls or device icons.
Display of multi-links, port-channels, and vPCs, showing which interface is connected to which interface, port-channel membership, and VLAN information.
Display of Inter-fabric links.
VDC and Pod Groupings.
Device-Scope, Fabric and Datacenter drill-down.
Automatic VPC Peer and FEX Groupings.
Ability to select devices and take action consistent with other areas of the product.

DCNM HA overview

DCNM Native HA provides a high-availability solution with two DCNM nodes: one acts as the Active node and the other as the Standby node.

DCNM Native HA is supported only on the Linux platform with ISO and OVA installations. It is not supported with a standalone installation, because the standalone installation lacks the required Linux packages. The Windows platform does not support DCNM Native HA either.

We can use the appmgr show ha-role command to verify the role of a DCNM host.

Viewing inventory information

We can use this pane to view inventory and performance information for both SAN and LAN switches. We can select SAN, LAN, or both.

You can print and export the inventory report if you wish. From this pane we can also add a newly discovered LAN or SAN switch.

Viewing monitor information

We can monitor CPU utilization, traffic, event information, and other accounting statistics. We can also check SAN and LAN performance information. Customized reports can be created based on historical performance, events, and inventory.

Viewing configure information

DCNM allows users to view and configure zoning, device aliases, port monitoring, and device credentials.

DCNM has pre-configured templates for deployment. Using those templates we can configure the underlay and overlay. However, their functionality is limited to getting the fabric discovered and setting up BGP EVPN. To configure routing protocols (such as OSPF, RIP, BGP, and IS-IS), access lists, route leaking, or any other configuration, you have to use the option called SWITCH FREEFORM TEMPLATE. First create a template for the configuration that you want to deploy on one or all fabrics. Then use the SWITCH FREEFORM option to add the template (also called a policy). These are manual deployments.

Issues while installing DCNM, deploying configuration, and discovering the fabric

Note: When adding switches to a DCNM fabric, always prefer the Bootstrap method.

ISSUE # 1: Cisco DCNM out of sync issue. DCNM goes out of sync and erases manual configuration

Problem:

If you use the switch freeform method to deploy manual configurations, the fabric goes out of sync every time, and when you re-sync, DCNM erases the manual configuration.

Solution:

To resolve this issue, use the correct configuration template with no override. “No override” means that when we use freeform templates, we have to be careful not to repeat the same command in more than one freeform template.

For Example:

router ospf <P.I>

If the same command appears in two different freeform templates, the fabric goes out of sync, and every manual re-sync has to reconcile it again. That is the issue.
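One way to catch this before deploying is to compare the freeform templates offline. The following is a minimal sketch, not a DCNM feature: it assumes the template text has been copied out of DCNM by hand, and the template names, the `find_duplicate_commands` helper, and the sample configuration are all hypothetical. It only compares top-level (unindented) command lines.

```python
from collections import defaultdict


def find_duplicate_commands(templates):
    """Return {command: [template names]} for top-level commands that
    appear in more than one freeform template.

    `templates` maps a (hypothetical) template name to its config text.
    """
    owners = defaultdict(list)
    for name, text in templates.items():
        seen = set()
        for line in text.splitlines():
            # Skip blank lines, indented sub-commands, and "!" comment lines.
            if not line.strip() or line.startswith((" ", "\t")):
                continue
            if line.startswith("!"):
                continue
            command = " ".join(line.split())  # normalise internal whitespace
            if command not in seen:
                seen.add(command)
                owners[command].append(name)
    return {cmd: names for cmd, names in owners.items() if len(names) > 1}


# Hypothetical templates: two of them open the same OSPF process.
templates = {
    "ospf_underlay": "router ospf 1\n  router-id 10.0.0.1\n",
    "ospf_extra": "router ospf 1\n  log-adjacency-changes\n",
    "mgmt_acl": "ip access-list MGMT\n  permit ip any any\n",
}

duplicates = find_duplicate_commands(templates)
for command, names in duplicates.items():
    print(f"'{command}' appears in: {', '.join(names)}")
```

Any command it reports is a candidate for merging into a single freeform template.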

ISSUE # 2: Cisco DCNM Split Brain Syndrome

Problem:

Cisco DCNM can run into split-brain syndrome: in a DCNM Native HA deployment, the two hosts have a communication problem with each other.

Solution:

To resolve this problem, use the following troubleshooting steps:

Steps to resolve:

Open the CLI of both DCNM servers and stop the applications on both the Active and Standby servers by running the following command:

    dcnm1# appmgr stop all

    dcnm2# appmgr stop all

After stopping the applications on both DCNM nodes, ping the eth1 IP address from both servers and make sure it is reachable.
Now start all the applications on both DCNM servers using the following command:

    dcnm1#appmgr start all

    dcnm2#appmgr start all

Verify that all the applications are up and running by running the following command on both servers:

     dcnm1#appmgr status all

     dcnm2#appmgr status all

Log on to DCNM server 1 with your username and password and verify that it is fully functional and that all the data is correct.

If all the data on DCNM server 1 is correct, proceed to the next step.
If data loss is seen, stop the applications on DCNM server 1 using the appmgr stop all command.

     dcnm1# appmgr stop all

Start all the applications on dcnm2. Wait for all the applications to be operational. Use the appmgr status all command to check the status of the applications.

     dcnm2# appmgr status all

Log on to DCNM. Verify that it is fully functional and check that the device data is correct.

If successful, power on dcnm1 as the Secondary host and end the troubleshooting procedure.
If data loss is seen on dcnm2, use the appmgr stop all command to stop all the applications.

     dcnm2# appmgr stop all

Restore both hosts from backup.

ISSUE # 3: Cisco DCNM High Availability Failure

Problem:

The DCNM HA status shows as failed.

Solution:

In this situation, use the following steps to resolve the problem:

First, log on to the DCNM server in a web browser using your login credentials.

Once logged on, go to Administration > Native HA and click the Test HA icon. If an error message comes up, click the details log for more information.

The details log file is named fms_ha.log and is located at:

     /usr/local/cisco/dcm/fm/logs/fms_ha.log

When you open the log file, its messages should indicate why the HA status is failed.

Check whether the standby DCNM server is operational. To verify that both the Active and Standby servers are operational, use the following command:

         dcnm1# appmgr show ha-role

         dcnm2# appmgr show ha-role

Check whether any application on the standby server is stopped or not working. In general, the HA status shows as failed because the standby server's database is down or is rejecting connections. If connections to the standby server's database are being rejected, check the PostgreSQL client authentication configuration, which is located at:

  /usr/local/cisco/dcm/db/data/pg_hba.conf

This configuration should contain entries for all IP addresses of the active server. If it does not, contact Cisco TAC.
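The pg_hba.conf check can be done by eye, but a short script helps when the file is long. The following is a minimal sketch under stated assumptions: it reads a copy of the file's text, looks only at host-type records whose address is written as CIDR or as an IP address plus netmask, and ignores hostnames, keywords such as `all` or `samenet`, and PostgreSQL's first-match and `reject` semantics. The `hosts_allowed` helper and the sample addresses are hypothetical.

```python
import ipaddress


def hosts_allowed(pg_hba_text, addresses):
    """Map each address to True if a host-type record in pg_hba.conf covers it."""
    networks = []
    for raw in pg_hba_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        # Host-type records: TYPE DATABASE USER ADDRESS [NETMASK] METHOD
        if not fields[0].startswith("host") or len(fields) < 5:
            continue
        address = fields[3]
        try:
            if "/" in address:
                net = ipaddress.ip_network(address, strict=False)
            elif len(fields) >= 6:
                net = ipaddress.ip_network(f"{address}/{fields[4]}", strict=False)
            else:
                continue
        except ValueError:
            continue  # hostname or keyword address: not handled in this sketch
        networks.append(net)
    return {
        addr: any(ipaddress.ip_address(addr) in net for net in networks)
        for addr in addresses
    }


# Hypothetical file contents and active-server addresses (documentation range).
sample = """
# TYPE  DATABASE  USER  ADDRESS                      METHOD
local   all       all                                peer
host    all       all   127.0.0.1/32                 md5
host    all       all   192.0.2.10/32                md5
host    all       all   192.0.2.32 255.255.255.224   md5
"""

result = hosts_allowed(sample, ["192.0.2.10", "192.0.2.40", "192.0.2.11"])
for addr, covered in result.items():
    print(addr, "covered" if covered else "MISSING")
```

An address reported as missing is the case where the text above advises contacting Cisco TAC rather than editing the file yourself.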

If the standby server's database is completely down, you have to troubleshoot it and bring it back up.

ISSUE # 4: Database of Standby Server completely down

Problem:

The database of the standby server is down or corrupted.

Solution:

How to restore the standby server's database if it is completely down:

Open the CLI of the standby server and first try to start the database using the following command:

   /etc/init.d/postgresql-11.1 start

If PostgreSQL 11.1 starts successfully, the standby database is up and the HA status will show OK within a few minutes.

Wait for the database to start. If it does not start, this means it is corrupted.

At this stage, use the backup file to restore the database, then check whether the database starts.

ISSUE # 5: Both servers in the Cisco DCNM HA Pair are down

Problem:

The problem is that neither DCNM server in the HA pair appears to be operational.

Solution:

To resolve this issue, follow these troubleshooting steps:

First, power on DCNM server 1 and wait until all the applications have started. To verify the application status, use the following command:

    dcnm1# appmgr status all

Once all the applications have started, log on to the DCNM server in a web browser using your credentials, verify that it is fully functional, and check that it has the correct data.
After successfully logging on to DCNM, power on DCNM server 2, which will act as the standby server.
If server 1 fails to start its applications after power-on, or if it has incorrect data, first stop all the applications using the following command:

    dcnm1# appmgr stop all

Wait for all the applications to stop.

Once the applications on server 1 have stopped, power on DCNM server 2 and wait for all its applications to start. You can verify the application status using the appmgr status all command.
Now log on to DCNM in a web browser to verify that it is fully functional and has the correct data. If it does, power on DCNM server 1, which will act as the standby. If DCNM server 2 fails to start all the applications or its data is not correct, stop all the applications using appmgr stop all.
Restore both servers from backup.
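The branching in the steps above can be summarised as a small decision function. This is only an illustrative sketch of the procedure as described in this post, not a Cisco tool: the `recovery_plan` function and its step wording are hypothetical, and the two health flags stand for the manual checks (applications started, data correct) that an operator performs.

```python
def recovery_plan(dcnm1_healthy, dcnm2_healthy=None):
    """Return the ordered actions for recovering an HA pair where both servers are down.

    dcnm1_healthy: True if dcnm1 started all applications and shows correct data.
    dcnm2_healthy: the same check for dcnm2; only consulted when dcnm1 is unhealthy.
    """
    steps = [
        "power on dcnm1",
        "run 'appmgr status all' on dcnm1",
        "log on and verify data on dcnm1",
    ]
    if dcnm1_healthy:
        steps.append("power on dcnm2 as standby")
        return steps

    steps += [
        "run 'appmgr stop all' on dcnm1",
        "power on dcnm2",
        "run 'appmgr status all' on dcnm2",
        "log on and verify data on dcnm2",
    ]
    if dcnm2_healthy:
        steps.append("power on dcnm1 as standby")
        return steps

    # Neither server came up with correct data.
    steps += ["run 'appmgr stop all' on dcnm2", "restore both servers from backup"]
    return steps


for step in recovery_plan(dcnm1_healthy=False, dcnm2_healthy=True):
    print("-", step)
```

Restoring from backup is reached only when both servers have been tried in turn and neither shows correct data.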

*********************************************************************************

I hope you found this article useful and that this will give you some useful hints if/when you encounter these issues when implementing DCNM.

If you’re looking for professional services to deploy your data center fabric using Cisco DCNM, or if you need consulting services on how to enable virtual workload mobility across your entire data center – please reach out to us. Zindagi Technologies consultants have expertise in the field of designing, building and maintaining large scale data centers. Our professional services are vendor agnostic. Contact us now to know how we can help you realise your data center and private cloud vision. We’re also reachable on [email protected] and +917678682775

Author:

Rohit Raj

Consulting Engineer,

 Data Center Expert

Zindagi Technologies LLP
