I have been given the distinct privilege of attending an exclusive partners-only boot camp put on by Cisco for their UCS system (Cisco’s blade implementation). Like Scott Lowe, Rich Brambley and Rodney Haywood, I will be blogging my thoughts and technical details (along with the occasional tweet). I may skip past many technical details, either because they’ve already been covered by one of the above blogs or because I simply have to leave something for my customers to pay for. 🙂
The class started off pretty slow with general housekeeping and an overview of the pains of current blade infrastructures and a high level explanation of FCoE. Then the fun began.
The UCS chassis includes eight half-width blade slots, four power supplies, eight fan modules and two Fabric Extenders (also referred to as I/O Modules). Two power supplies can power the entire chassis with the third and fourth providing redundancy. (Similarly, the HP c-Class enclosures can be fully powered by three power supplies with the fourth, fifth and sixth providing redundancy.) There is no power domain concept in the UCS chassis as there is in the IBM and the HP p-Class enclosures.
External to the chassis, but still required, are a pair of Fabric Interconnects (FI), each of which can support up to 20 or 40 UCS chassis, depending on the FI model. The UCS Manager (UCSM) runs in an Active/Passive mode on these FIs. The UCSM is built on an XML database that stores all configuration details and settings for the entire solution, including the Service Profiles. This database can be accessed in many different ways, including the client Java GUI, CLI, SNMP, and XML APIs.
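To put those chassis limits in perspective, here is a quick back-of-the-envelope calculation. It is only a sketch: it assumes every slot holds a half-width blade, and the function name is my own invention, not anything from UCSM.

```python
# Eight half-width slots per chassis, per the chassis description above.
BLADE_SLOTS_PER_CHASSIS = 8


def max_half_width_blades(chassis_count: int) -> int:
    """Upper bound on blades, assuming every slot holds a half-width blade."""
    return chassis_count * BLADE_SLOTS_PER_CHASSIS


# 20 or 40 chassis, depending on the FI model.
for chassis_limit in (20, 40):
    blades = max_half_width_blades(chassis_limit)
    print(f"{chassis_limit} chassis -> up to {blades} half-width blades")
```

That works out to 160 or 320 blades under a single UCSM instance, before taking any full-width blades into account.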
The UCSM works through the Chassis Management Controller (CMC) on the I/O Modules (IOM) within each chassis. There are two IOMs in each chassis and they run Active/Active for the data communication and Active/Passive for the CMC. The CMC is really nothing more than a proxy for talking to the individual chassis, blade and switch components. Also on the IOMs is the Chassis Management Switch (CMS), which provides communication to the Baseboard Management Controller (BMC) on the individual blades (think iLO). These are 100Mb connections that flow through a dedicated port on the chassis.
Currently there is only one blade available, the half-width B200, which contains two Xeon 5500 (Nehalem) processors, 12 DIMM slots (up to 96GB), two SAS/SATA HDDs and a single mezzanine socket. The DIMMs are limited to Registered 4GB and 8GB 1333MHz DIMMs (yes, only two types of DIMMs are supported based on what we were told). The BIOS is currently limiting the bus speed to 1066MHz, no matter how many DIMM slots are populated. It is possible for certain processors to push the bus speed down to 800MHz.
The next blade to be released will be the full-width B250, which will contain two Xeon 5500 (Nehalem) processors, 48 DIMM slots (up to 384GB), two SAS/SATA HDDs and two mezzanine sockets. The extra DIMM slots are made possible by the Cisco-exclusive Catalina chip. How this works was beyond the scope of our class, and I am supposed to be getting a whitepaper that describes the nuts and bolts. Essentially, it’s able to put four times the DIMMs in each channel (8 DIMMs x 3 Channels x 2 Processors = 48) without affecting bus speed and only incurring minimal additional latency (6 ns). The same limits to DIMM types and speeds apply to this blade as in the B200.
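The slot and capacity numbers for both blades fall out of simple multiplication. Here is a minimal sketch of that arithmetic. The class and field names are mine, and the B200's two DIMMs per channel is my own inference from 12 slots spread over 2 processors x 3 channels, not something stated in class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BladeMemory:
    """Hypothetical helper for the DIMM arithmetic; not a Cisco data model."""
    name: str
    sockets: int
    channels_per_socket: int
    dimms_per_channel: int

    @property
    def dimm_slots(self) -> int:
        return self.sockets * self.channels_per_socket * self.dimms_per_channel

    def max_memory_gb(self, dimm_size_gb: int) -> int:
        return self.dimm_slots * dimm_size_gb


# Xeon 5500 has three memory channels per socket.
# dimms_per_channel=2 for the B200 is inferred from its 12 slots (assumption).
B200 = BladeMemory("B200", sockets=2, channels_per_socket=3, dimms_per_channel=2)
# The B250 figure (8 per channel) is the one quoted in class.
B250 = BladeMemory("B250", sockets=2, channels_per_socket=3, dimms_per_channel=8)

for blade in (B200, B250):
    print(f"{blade.name}: {blade.dimm_slots} slots, "
          f"up to {blade.max_memory_gb(8)}GB with 8GB DIMMs")
```

With 8GB DIMMs this reproduces the 96GB and 384GB maximums quoted above.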
All the pieces are well laid out in Rodos’ post here: http://rodos.haywood.org/2009/08/ucs-schematic-sketch.html
One thing I learned today is that FCoE is not simply a Fibre Channel packet wearing an Ethernet Halloween costume. FCoE requires special handling and flow control. FC works by not allowing packets to be dropped, and FCoE must still abide by this rule. The Nexus switches do this by actually embedding MDS functionality directly into the switch.
Figure 1: FCoE Switching
In Figure 1 I have depicted a server with a CNA adapter connecting through two Nexus 5000 switches to a native FCoE storage array. The blue lines depict the FCoE traffic, which by the nature of the NX-OS will be lossless, and the orange lines depict native FC traffic. Note the use of a MAC address as the Source and Destination for the FCoE packets and the fact that they are unwrapped upon arrival at their destination.
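The wrap and unwrap steps can be sketched in a few lines. This is a deliberately simplified illustration, not a working FCoE stack: it only adds the Ethernet destination MAC, source MAC and the FCoE EtherType (0x8906), and leaves out the rest of the real FCoE header (version field, start/end-of-frame delimiters, padding) as well as the Ethernet FCS. The MAC addresses and payload are made up.

```python
import struct

FCOE_ETHERTYPE = 0x8906  # EtherType assigned to FCoE
ETH_HEADER_LEN = 14      # 6 bytes dst MAC + 6 bytes src MAC + 2 bytes EtherType


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' to its 6 raw bytes."""
    raw = bytes.fromhex(mac.replace(":", ""))
    if len(raw) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    return raw


def encapsulate(fc_frame: bytes, src_mac: str, dst_mac: str) -> bytes:
    """Wrap an FC frame in a simplified Ethernet header."""
    header = (mac_to_bytes(dst_mac) + mac_to_bytes(src_mac)
              + struct.pack("!H", FCOE_ETHERTYPE))
    return header + fc_frame


def decapsulate(eth_frame: bytes) -> bytes:
    """Strip the Ethernet header and return the original FC frame."""
    (ethertype,) = struct.unpack("!H", eth_frame[12:14])
    if ethertype != FCOE_ETHERTYPE:
        raise ValueError("not an FCoE frame")
    return eth_frame[ETH_HEADER_LEN:]


# Hypothetical addresses and payload, for illustration only.
fc_frame = b"pretend-this-is-a-fibre-channel-frame"
wire = encapsulate(fc_frame, "00:25:b5:00:00:01", "00:25:b5:00:00:02")
assert decapsulate(wire) == fc_frame
```

The point of the sketch is that the FC frame rides inside the Ethernet frame untouched; the Ethernet MACs only get it from hop to hop.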
Given the lossless requirement (among others, as our instructor was quick to point out), FCoE packets cannot be routed through a Catalyst switch. In other words, FCoE should only flow through FCoE-aware switches (read Nexus) because they are not normal Ethernet packets.
One point that has been pretty well documented, and I have confirmed this to still be true (for a little while longer at least), is the fact that FCoE cannot be sent upstream (toward the SAN appliance) from the Fabric Interconnects (FI). Note that Figure 1 describes a non-UCS server. There currently is no way to do native FCoE from a UCS blade all the way to the Storage Array. See Scott Lowe’s description for a good summary. Essentially, it comes down to this: the FI does not contain the embedded MDS functionality to forward on the FC packets. All the FI can do today is strip off the FCoE wrapper and send the native FC out an NPIV-enabled port. This will change in the future, but today it is a limitation.
Service Profiles are a big differentiator for the UCS system. Those familiar with VirtualConnect may realize that Service Profiles are very similar to VirtualConnect Server Profiles, but UCS does add some unique capabilities, such as the ability to define the Firmware and Quality of Service properties. Another unique feature is the ability to wipe the local drives when applying a Service Profile to a blade.
When it comes to configuring interconnects and adapters, Cisco seems to have moved most functionality and choice into the adapters instead of the interconnects. Ultimately, this seems like a simpler solution, since now you just pick which adapter you want and the I/O Module (IOM) is the same regardless of your choice. There are (or will soon be) three choices for mezzanine adapters: Palo (CNA built for virtualization), Menlo (CNA with two 10GbE and two FC) and Oplin (standard dual port 10GbE with support for OS-implemented FCoE).
Cisco’s upcoming Palo adapter mezzanine card provides similar functionality to HP’s VirtualConnect Flex-10. I was in awe when I first realized what Flex-10 could do. Using the Palo adapter within UCS really blew me away. This is where Cisco’s networking expertise really shows through. Here’s a quick comparison:
- Split a single 10GbE connection into multiple instances that the OS sees as individual devices
- Ability to define bandwidth of each device
- Splitting of connection occurs completely within the Palo adapter, rather than a combination of the adapter and interconnect
- Palo can create 128 connections, as opposed to Flex-10’s four
- Palo can define many more characteristics for each logical connection
- Palo has built-in hardware failover between the two uplinks, eliminating the need to implement failover within the OS/software layer (mezzanine card is still single point of failure)
- Palo is a CNA, meaning those 128 connections can be any combination of vNICs and vHBAs
- Palo can enable direct 1:1 mapping of VM vNICs to Palo vNICs using VN-Link
- The Palo adapter actually runs a Linux OS and an unmanaged switch in order to manage all this magic
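To make the "any combination of vNICs and vHBAs" idea concrete, here is a rough sketch of how one might sanity-check a planned layout. Everything in it is hypothetical: the class, the interface names, and especially the bandwidth rule, which assumes each device's bandwidth is a guaranteed share of a single 10GbE uplink. That is my simplification for illustration, not how Cisco described the adapter.

```python
from dataclasses import dataclass

MAX_INTERFACES = 128      # connection limit quoted in class
UPLINK_GBPS = 10.0        # assumption: budget against one 10GbE uplink
ALLOWED_KINDS = {"vnic", "vhba"}


@dataclass(frozen=True)
class VirtualInterface:
    name: str
    kind: str    # "vnic" or "vhba"
    gbps: float  # bandwidth assigned to this device


def validate(interfaces: list[VirtualInterface]) -> list[str]:
    """Return a list of problems; an empty list means the layout passes."""
    problems = []
    if len(interfaces) > MAX_INTERFACES:
        problems.append(
            f"{len(interfaces)} interfaces exceeds the limit of {MAX_INTERFACES}")
    for vif in interfaces:
        if vif.kind not in ALLOWED_KINDS:
            problems.append(f"{vif.name}: unknown kind {vif.kind!r}")
    total = sum(vif.gbps for vif in interfaces)
    if total > UPLINK_GBPS:
        problems.append(f"{total}Gb assigned exceeds the {UPLINK_GBPS}Gb uplink")
    return problems


# Hypothetical ESX host layout mixing Ethernet and FC devices.
layout = [
    VirtualInterface("mgmt", "vnic", 1.0),
    VirtualInterface("vmotion", "vnic", 2.0),
    VirtualInterface("vm-data", "vnic", 3.0),
    VirtualInterface("san-a", "vhba", 4.0),
]
print(validate(layout))
```

The same four-way split is about all Flex-10 could offer per port, while this layout uses only 4 of Palo's 128 possible connections.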
As we dug deeper into the actual data paths when using Palo, FCoE, 6100 Fabric Interconnects (FI), 2104 Fabric Extenders (FEX) and Nexus switches (primarily 1000v and 5000), I began to wonder: Did Cisco create a complicated UCS (w/ FCoE and Palo adapter) to sell more Nexus 1000v? It essentially comes down to this: Ethernet best practice is that a switch does not send a packet back down the same port it came in on. With an ESX host this scenario is possible, since two VMs on the same host can sit behind the same physical port. In order to avoid this, VN-Link creates virtual Ethernet ports on the FI so that the two VMs are treated as being on two separate ports, thereby allowing traffic to pass between them. At the end of the long, hard to grasp discussion it was stated that the Nexus 1000v would avoid all of this by simply switching the traffic within the host and avoiding the FI completely. Good selling point for the Nexus 1000v.
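The "same port" rule is easier to see in a toy model. This is a sketch of generic Ethernet forwarding logic, not of how the FI is actually implemented, and the port and VM names are invented.

```python
def forward(mac_table: dict[str, str], in_port: str, dst_mac: str) -> str:
    """Decide what a simple Ethernet switch does with a frame.

    mac_table maps a destination MAC to the port it was learned on.
    """
    out_port = mac_table.get(dst_mac)
    if out_port is None:
        return "flood"              # unknown destination
    if out_port == in_port:
        return "drop"               # never send a frame back out its ingress port
    return f"forward:{out_port}"


# Two VMs on one ESX host, both learned behind the same physical port.
shared_port = {"vm-a": "eth1/1", "vm-b": "eth1/1"}
print(forward(shared_port, "eth1/1", "vm-b"))    # drop

# With one virtual Ethernet port per VM, the switch sees two distinct ports.
virtual_ports = {"vm-a": "veth1", "vm-b": "veth2"}
print(forward(virtual_ports, "veth1", "vm-b"))   # forward:veth2
```

In the first case VM-to-VM traffic on the same host has nowhere to go; giving each VM its own virtual port is what makes the second case work.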
Two great pictures we actually used in our class for understanding how traffic flows out of the UCS using Palo can be found here: http://www.internetworkexpert.org/2009/08/11/cisco-ucs-nexus-1000v-design-palo-virtual-adapter/ and here: http://www.internetworkexpert.org/2009/07/05/cisco-ucs-vmware-vswitch-design-cisco-10ge-virtual-adapter/. Both are by Brad Hedlund, who appears (based on my limited exposure to the Cisco Data Center world) to be an IT rockstar (perhaps the Duncan Epping of UCS?).
This leads me to a final general point about the UCS system. Ultimately, there is a lot to love about the UCS system. It was clearly designed with Network and Storage I/O in mind (as you would expect from Cisco), and with little innovation needed on Nehalem systems, this helps Cisco stand apart. They have also made an effort to truly unify all the management interfaces, though based on the screenshots I’ve seen so far they’re not as nice as HP’s. At the same time I worry that the UCS system is simply too complicated to sell to the general customer. As an HP reseller and implementer, I find the whole VirtualConnect and Flex-10 conversation can go over many technical people’s heads. UCS is even harder to understand (note to self: practice UCS whiteboarding skills thoroughly).
Some additional comparisons to HP blades:
- Cisco’s blade slot architecture seems similar to HP p-class (4 slots that can be divided w/half height blades) as opposed to c-class (16 half height slots that can be converted to full height).
- Cisco’s Baseboard Manager is equivalent to HP’s iLO
Miscellaneous final notes:
- UCS certifications will be available early next year for design and implementation
- Storage redundancy is not handled in the UCS hardware and should be implemented within the OS/Application layer
- If multiple uplinks are used to connect IOM and FI, they are completely separate connections and cannot be combined with a port group
- An IOM can only be connected to a single FI
I guess that’s it for today. Not enough to digest? Check back tomorrow and the rest of the week for more. Don’t worry; I’m pretty sure this will be the longest post since the rest of the week will involve more labs and less architecture.
Please feel free to leave comments to ask questions, make corrections or provide additional information.