Nexus 7000 and a Systemic Bug

I have been thinking about an old issue that a customer encountered with a pair of Nexus 7000 switches about a year and a half ago. When the issue first came onto my radar it was in a bad place: this customer had Nexus 2000 Fabric Extenders that would go offline, and eventually the Nexus 7000 would go offline, causing some single-homed devices to become unreachable and, in the process, broader reachability issues. This occurred intermittently, which always complicates data collection. After working with TAC and finally collecting all of the information, the summary of the multiple causes came down to these six items.

1. Fabric Extender links become error-disabled due to CSCtz01813.
2. The Nexus is reloaded to recover from the Fabric Extender issue.
3. After the reload, ports get stuck in the INIT state due to CSCty8102 and fail to come online.
4. Peer Link, Peer Keep-Alive and VPCs fail to come online because ports are err-disabled by a sequence timeout.
5. VPC goes into a split-brain state, causing SVIs to go into shutdown mode.
6. Network connectivity is lost until the module is reloaded and the ports are brought online.

In summary, two bugs would get triggered at random times, causing a firestorm of confusing outages. The two temporary workarounds to mitigate the problem before we could upgrade the code on the switches were:

  1. Move the VPC keep-alive link to the Admin port on the Supervisor.
  2. Use an EEM script to reset a register when a module comes online.
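
The first workaround might look roughly like the sketch below. It assumes the "Admin port" is the Supervisor's mgmt0 interface, which sits in the management VRF on NX-OS. The VPC domain number and the addresses are placeholders from the documentation range, not values from the customer's network.

```
! Switch A; mirror the source and destination addresses on switch B
interface mgmt0
  ip address 192.0.2.1/24

vpc domain 10
  ! Carry the peer keep-alive over mgmt0 (management VRF) rather than a line card port
  peer-keepalive destination 192.0.2.2 source 192.0.2.1 vrf management
```

The point of this choice is that the keep-alive no longer depends on a line card port that could get stuck during initialization. The keep-alive state can be checked afterwards with `show vpc peer-keepalive`.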

When thinking about what occurred, it is important to remember that the Nexus 7000 platform consists of many line cards that each contain an independent “brain” (Forwarding Engine(s) and supporting systems on the line cards) and that are connected and orchestrated by the Supervisor module. Admittedly, that statement is a bit of a simplification; however, I find it emblematic of some of the design challenges you can encounter on the Nexus 7000 platform. For example, there are many limitations with Layer 3 routing features and VPC. In the example above, it could be said that this sort of complexity can cause safety features, such as those built into VPC, to do more harm than good when they encounter an unplanned failure scenario. This is different from the Catalyst platform, where (for the most part) everything is processed through a central processor.

Overall, the Nexus 7000 system design allows for tightly coupled interactions between the modules and supervisors, and even more loosely coupled interactions between chassis. These interactions allow for the high speed and throughput the platform can deliver; however, they add to the complexity of troubleshooting and of complex designs. In the end, what makes this issue so interesting to me, and why I keep mentally revisiting it, is that it is an example of a system failure. No single cause, had it occurred individually, would have been nearly as problematic, but their interactions together made the observed issue many times worse.

Some great Nexus 7000 references