Nexus 7000 and a systematic Bug

I have been thinking about an old issue that a customer encountered with a pair of Nexus 7000 switches about a year and a half ago. When the issue first came onto my radar it was in a bad place: the customer had Nexus 2000 Fabric Extenders that would go offline, and eventually the Nexus 7000 itself would go offline, causing some single-homed devices to become unreachable and, in the process, broader reachability issues. It occurred intermittently, which always complicates data collection. After working with TAC and finally collecting all of the information, the multiple causes came down to these 6 items.

1. Fabric Extender link becomes error disabled due to CSCtz01813.
2. Nexus is reloaded to recover from the Fabric Extender issue.
3. After the reload, ports get stuck in INIT state due to CSCty8102 and fail to come online.
4. Peer Link, Peer Keep-Alive and VPCs fail to come online since ports are err-disabled from a sequence timeout.
5. VPC goes into a split brain state, causing SVIs to go into shutdown mode.
6. Network connectivity is lost until the module is reloaded and the ports are brought online.

In short, two bugs would get triggered at random times, causing a firestorm of confusing outages. The two temporary workarounds to mitigate the problem before we could upgrade the code on the switches were:

  1. Move the VPC keep-alive link to the Admin port on the Supervisor.
  2. Use an EEM script to reset a register when a module comes online.

When thinking about what occurred, it is important to remember that the Nexus 7000 platform consists of many line cards that each contain an independent “brain” (Forwarding Engine(s) and supporting systems on the line cards), connected and orchestrated by the Supervisor module. That statement is a bit of a simplification, but I find it emblematic of some of the design challenges you can encounter on the Nexus 7000 platform. For example, there are many limitations with Layer 3 routing features and VPC. In the example above, it could be said that this sort of complexity can lead safety features, such as those built into VPC, to do more harm than good when they encounter an unplanned failure scenario. This is different from the Catalyst platform, where (for the most part) everything is processed through a central processor.

Overall, the Nexus 7000 system design allows for tightly coupled interactions between the modules and supervisors, and more loosely coupled interactions between chassis. These interactions allow for the high speed and throughput the platform can deliver, but they add to the complexity of troubleshooting and of complex designs. In the end, what makes this issue so interesting to me, and why I keep mentally revisiting it, is that it is an example of a system failure. Each cause, had it occurred individually, would not have been nearly as problematic, but their interactions together made the observed issue many times worse.
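To make the interaction concrete, here is a minimal Python sketch of the six-step sequence listed earlier. It is only a toy model: the `cascade` function and the step wording are my own illustration, and it assumes the sequence always starts with the Fabric Extender event and stops at the first step whose bug is not present.

```python
def cascade(active_bugs):
    """Return the events reached, in order, for a given set of active bugs."""
    # Each step is (event, bug required for it to happen, or None if it
    # simply follows from the previous step). Toy model, not device behavior.
    steps = [
        ("FEX link err-disabled", "CSCtz01813"),
        ("Nexus reloaded to recover", None),
        ("ports stuck in INIT after reload", "CSCty8102"),
        ("Peer Link, Peer Keep-Alive and VPCs err-disabled", None),
        ("VPC split brain, SVIs shut down", None),
        ("connectivity lost until module reload", None),
    ]
    reached = []
    for event, required_bug in steps:
        if required_bug is not None and required_bug not in active_bugs:
            break  # the chain stops at the first step that cannot occur
        reached.append(event)
    return reached


only_first = cascade({"CSCtz01813"})
only_second = cascade({"CSCty8102"})
both = cascade({"CSCtz01813", "CSCty8102"})

print(len(only_first), len(only_second), len(both))
```

In this simplified model the first bug alone ends at a reload that recovers the switch, the second bug alone never starts the sequence, and only the two together reach the full outage.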

Some great Nexus 7000 references

My interest in the academics of systems

Lately I’ve been very interested in the academic side of computers. Complex Systems, Theoretical Computing, and Control Theory are three of my focuses right now. This has come about because I’m getting more interested in how systems work and how to measure them than in how to implement them. My career has been focused much more on implementation than on how systems work and can be measured. I’ve never had any sort of formal Computer Science education, which makes a lot of this new territory for me. As I dive deeper into these topics I realize how much math I have forgotten over the years. These topics are some of my reasons for refreshing my math skills, but math skills are also needed to analyze sampled data such as monitoring data. A great video discussing data analysis is the talk by Noah Kantrowitz at Monitorama PDX 2014.


Some of the topics I’m learning about are much broader than others. The definitions of these fields of study, as given in their Wikipedia articles, are as follows:

Control theory is an interdisciplinary branch of engineering and mathematics that deals with the behavior of dynamical systems with inputs, and how their behavior is modified by feedback.
Wikipedia: Control Theory

The field of theoretical computer science is interpreted broadly so as to include algorithms, data structures, computational complexity theory, distributed computation, parallel computation, VLSI, machine learning, computational biology, computational geometry, information theory, cryptography, quantum computation, computational number theory and algebra, program semantics and verification, automata theory, and the study of randomness. Work in this field is often distinguished by its emphasis on mathematical technique and rigor.
Wikipedia: Theoretical Computer Science

Complex systems present problems both in mathematical modelling and philosophical foundations. The study of complex systems represents a new approach to science that investigates how relationships between parts give rise to the collective behaviors of a system and how the system interacts and forms relationships with its environment.
Wikipedia: Complex Systems
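As a small illustration of the feedback idea in the Control Theory definition, here is a minimal Python sketch of a proportional controller. The `simulate` function, the setpoint, and the gain are hypothetical values chosen only for the example, and the "system" is just a single number that is nudged on each step.

```python
def simulate(setpoint, gain, steps, initial=0.0):
    """Drive a single value toward a setpoint using proportional feedback."""
    value = initial
    history = [value]
    for _ in range(steps):
        error = setpoint - value   # feedback: compare the output with the target
        value += gain * error      # apply a correction proportional to the error
        history.append(value)
    return history


with_feedback = simulate(setpoint=10.0, gain=0.5, steps=20)
without_feedback = simulate(setpoint=10.0, gain=0.0, steps=20)

print(with_feedback[:4])       # [0.0, 5.0, 7.5, 8.75]
print(without_feedback[-1])    # 0.0
```

With a gain of 0.5 the remaining error halves on every step, so the value settles close to the setpoint; with a gain of 0 there is no feedback and the value never moves.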

I feel all of these topics are important as products become either much simpler and centrally controlled or incredibly complex in their interactions. Algorithms, controls, and data are becoming more and more important to understand.