March 2019 NX-OS Vulnerability Dump

On March 6th Cisco released 29 high and medium rated PSIRT notices for NX-OS based platforms. These platforms include the Cisco Nexus 3000 – 9000 series and Nexus adjacent platforms FX-OS and UCS Fabric Interconnect platforms. Not all advisories affect all platforms but all platforms are affected by at least one high rated vulnerability. The vulnerabilities range from command and code execution, privilege escalation, denial of service, and arbitrary file read vulnerabilities. This is just about everything bad that could affect core infrastructure devices.

If you haven’t updated your switch in a while this is probably the time too. Within some of the advisories Cisco notes that they are providing free updates:

Cisco has released free software updates that address the vulnerability described in this advisory. Customers may only install and expect support for software versions and feature sets for which they have purchased a license.

I’ve included a table of the fixed in versions notes as of the writing of this post.  I would recommend looking at the advisories to assist in selecting the best version as there are other code versions that have integrated the fixes.

PlatformVersion
Nexus 1000v5.2(1)SM3(2.1) (Hyper-V)
5.2(1)SV3(4.1a) (VMWare)
Nexus 3000
Nexus 3500
Nexus 3600
9.2(2)
Nexus 5500, 5600, and 6000
Nexus 7000 and 7700
8.3(3)
Nexus 9000 and 95009.2(2)
UCS 6200 and 6300 Series Fabric Interconnects
UCS 6400 Series Fabric Interconnects
4.0(2a)

Cisco has a bundled advisory for all of the high rated notices at the following link,

Cisco Event Response: March 2019 Cisco FXOS and NX-OS Software Security Advisory Bundled Publication

I have also included a laundry list of notices including both high and medium rated vulnerabilities for your reference.

Happy patching!

Nexus 7000 and a systematic Bug

I have been thinking about an old issue that a customer encountered with an pair of Nexus 7000 switches about a year and half ago. When the issue first came onto my radar it was in a bad place, this customer had Nexus 2000 Fabric Extenders that would go offline and eventually the Nexus 7000 would go offline causing some single homed devices to be come in reachable, and in the process broader reachability issues. This is occurred intermittently which always causes data collection to be complicated. After working with TAC and finally collecting all of the information the the summary of the multiple causes came down the these 5 items.

1. Fabric extender link become error disabled due to CSCtz01813
2. Nexus is reloaded to recover from Fabric Extender issue.
3. After reload ports get stuck in INIT state due to CSCty8102 and failed to come online.
4. Peer Link, Peer Keep-Alive and VPCs fail to come online since ports are err-disabled from sequence time out.
5. VPC would go into a split brain state causing SVIs to go into shutdown mode.
6. Network connectivity is lost until reload of module and ports are brought online.

The summary is two bugs that would get triggered at random times causing a firestorm of confusing outages. The two temporary work arounds to mitigate the problem before we could upgrade the code on the switches was to,

  1. VPC keep alive link to Admin port on Supervisor.
  2. Use EEM script to reset a register when a module comes on line.

When thinking about what occurred it is important to remember the Nexus 7000 platform consists of many line cards that each contain an independent “brain” (Forwarding Engine(s) and supporting systems on the line cards) that are connected and orchestrated by the Supervisor module. It is true previous statement was a bit of a simplification, however I find it enigmatic of some of the design challenges you can on the Nexus 7000 platform. For example there are many limitations with Layer 3 routing features and VPC. In the example above it could be said that this sort of complexity can cause safety features such as those build into VPC to cause more harm then good when they encounter an in planned failure scenario. This is different from the Catalyst platform where (for the most part) everything is processed through an central processor.

Over all the Nexus 7000 system design allows for tightly coupled interactions between the modules, supervisors and even more loosely coupled interactions between chassises. These interactions can allow for the high speed and throughput that can be delivered, however is adds to the complexity of troubleshooting and complex designs. In the end what makes this issue so interesting to me and and why I keep mentally revisiting it is that it is an example of a system failure. Every single cause if occurred individually would have been as greatly problematic but their interactions together caused the observed issue to be many times worse.

Some great Nexus 7000 references