Nexus 7000 and a Systemic Bug

I have been thinking about an old issue that a customer encountered with a pair of Nexus 7000 switches about a year and a half ago. When the issue first came onto my radar it was in a bad place: the customer had Nexus 2000 Fabric Extenders that would go offline, and eventually the Nexus 7000 would go offline as well, making some single-homed devices unreachable and causing broader reachability issues in the process. It occurred intermittently, which always complicates data collection. After working with TAC and finally collecting all of the information, the summary of the multiple causes came down to these six items.

1. A Fabric Extender link becomes error-disabled due to CSCtz01813.
2. The Nexus 7000 is reloaded to recover from the Fabric Extender issue.
3. After the reload, ports get stuck in the INIT state due to CSCty8102 and fail to come online.
4. The Peer Link, Peer Keep-Alive, and VPCs fail to come online since the ports are err-disabled from the sequence timeout.
5. VPC goes into a split-brain state, causing SVIs to go into shutdown mode.
6. Network connectivity is lost until the module is reloaded and the ports are brought online.

In summary, two bugs would get triggered at random times, causing a firestorm of confusing outages. The two temporary workarounds to mitigate the problem before we could upgrade the code on the switches were:

  1. Move the VPC keep-alive link to the admin port on the Supervisor.
  2. Use an EEM script to reset a register when a module comes online (a rough sketch of both follows below).
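For illustration, here is a minimal NX-OS sketch of what those two workarounds might look like. The domain number, IP addresses, applet name, and syslog pattern are my assumptions, not the values from the actual case, and the real register-reset command came from TAC, so a placeholder action stands in for it.

    ! Workaround 1: run the VPC peer keep-alive over the supervisor's mgmt0
    ! port (management VRF) instead of a front-panel line-card port.
    vpc domain 10
      peer-keepalive destination 192.0.2.2 source 192.0.2.1 vrf management

    ! Workaround 2: skeleton EEM applet triggered when a module reports online.
    ! The placeholder actions below only collect state and log; the production
    ! applet ran the register-reset command supplied by TAC at this point.
    event manager applet MODULE-ONLINE-FIXUP
      event syslog pattern "MODULE-5-MOD_OK"
      action 1.0 cli show module
      action 2.0 syslog msg "module online - register reset workaround would run here"

The point of the first workaround, as I recall, is that the supervisor's management port does not depend on the line-card ports that were getting err-disabled, so the keep-alive could survive the failure and VPC would not go split-brain.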

When thinking about what occurred, it is important to remember that the Nexus 7000 platform consists of many line cards that each contain an independent “brain” (the Forwarding Engine(s) and supporting systems on the line cards), all connected and orchestrated by the Supervisor module. That is admittedly a bit of a simplification, but I find it emblematic of some of the design challenges you can run into on the Nexus 7000 platform. For example, there are many limitations around Layer 3 routing features and VPC. In the example above, it could be said that this sort of complexity can cause safety features, such as those built into VPC, to do more harm than good when they encounter an unplanned failure scenario. This is different from the Catalyst platform, where (for the most part) everything is processed through a central processor.

Overall, the Nexus 7000 system design allows for tightly coupled interactions between the modules and supervisors, and even more loosely coupled interactions between chassis. These interactions enable the high speed and throughput the platform can deliver, but they add to the complexity of troubleshooting and of complex designs. In the end, what makes this issue so interesting to me, and why I keep mentally revisiting it, is that it is an example of a system failure. Each individual cause, had it occurred on its own, would not have been nearly as problematic, but their interactions together made the observed issue many times worse.


Cisco Live 2013 – My (late) wrap-up.

This past month I attended Cisco Live in Orlando, FL with 20,000(?) of my fellow Network/Collaboration/Service Provider/Data Center engineers from all around the world. This was my first time attending, and I had a blast! There are a few themes that were big topics at Cisco Live that I won’t talk much about in this post. One is the Internet of Everything (IoE), as it is well covered elsewhere and, well, is really just Market-ecture-tastic. New gear like the Catalyst 6800, the Nexus 7700, and new ASICs is all neat and powerful and will enable a lot of the future technologies, but Better, Faster, Stronger hardware comes along all the time. In the end, SDN/Network Virtualization was, for me, the most discussed topic across all of the Network-centric sessions and “hallway” conversations throughout the entire week.

CLUS 2013 Schedule

I was able to attend many great sessions, but even with a packed schedule I still wanted to be in two places at once most of the time. There were a few standout sessions, including “BRKRST-3114 The Art of Network Architecture” and “BRKRST-3045 LISP – A Next Generation Networking Architecture”. “The Art of Network Architecture” was a very business-forward discussion of network architecture that, I believe, attempted to change the conversation around designing a network. “LISP – A Next Generation Networking Architecture”, on the other hand, got me excited about LISP in a way I had not been before. All of the previous information I had read about LISP left me wanting a tangible use case. The presentation at CLUS started to describe some good use cases for LISP, though I am still left wanting more widespread production implementations.

CLUS Tweetup

Another great event I attended was the Tweetup organized by Tom Hollingsworth. I met a lot of people there whom I follow on Twitter, and it was nice to put a face with a Twitter handle and have some good conversations about networking and, well, just about anything else.

 

When listening to the discussions and presentations, a few trends and themes struck me. First, there is a trend towards the flat network; when I look at the fabric technologies or the affinity networking coming out of Plexxi, or potentially Insieme, it all puts a large exclamation point at the end of the need to move to IPv6, or at least implement dual stack, sooner rather than later. It will be key to the success of these technologies in the data center. Next, there was a constant argument about the death of the CLI and the GUI reigning supreme. I believe both the CLI users and the GUI users can be accommodated; both types of interfaces can be used to manipulate the same back-end software and logic. An example of this is tail-f NCS, which has both; while not “SDN” by some definitions, it is an example of the two UIs co-existing. The real argument that needs to be had concerns the design of the systems needed to support the applications.

This one is more of a rant and less of a theme, but I still think Cisco is missing the mark with the ASA 1000v. I think virtualized physical appliances are a transitional technology, but a needed one. Creating the ASA 1000v without the full feature set of its physical counterpart, and without (as far as I can tell) a roadmap to add those features, along with the insane licensing scheme of a per-protected-socket model, does not make sense to me. This all shortchanges the IaaS provider market, and IMHO it should be licensed and operated like the CSR 1000v: full features and per-appliance licensing.

Overall, I was left with two general questions from the week. First, I’m curious how the balance of systemic complexity vs. configuration complexity vs. structural complexity will fall as the overlay, the “underlay”, and the SDN glue that holds it all together settle into place. Each new technology that is introduced seems to address one of these complexity problems, but not all three in one fell swoop; that is a larger topic for another post. Second is a recurring theme in technology: everything old is new again. I look at the data center technologies, and some of the new IP routing technologies (LISP), and they look a lot like old telephony switching technologies, in the same way VDI looks like mainframe dumb terminals. This is not a critique, just an observation on how important it is to know your past, because it will come back Better, Faster, Stronger, or maybe just the same with a new box around it.