Design considerations for vSphere and OTV in stretched cluster environments (Part 2)

This article is the second part of a series covering the design impacts of combining OTV and vSphere networking. I will first cover the scenario where vSphere hosts are attached to the network at the access layer, to a pair of Nexus 5000 switches. Then we will look at the impact of multi-NIC vMotion on OTV flows.

For complete high-level design reference and requirements, check Part 1.

Let's analyze the first design depicted below:


I didn’t specify any particular configuration between the Nexus 5000 peers, as this is not the purpose of this article. There are several options, such as back-to-back vPC between the Nexus 5000 and 7000, standard vPC, or even configuring the Nexus 5000 peers as standalone, independent switches. However, because our VMware network design is not based on LACP, using vPC at the access layer would introduce orphan ports. This would require additional considerations, maybe for another post.

In this architecture, each OTV VDC is connected to the core/aggregation VDCs using standard Spanning Tree Protocol (STP), whereas the Nexus 5000 access switches are connected to them through vPC links. One reason may be a physical constraint that requires the use of F1 and F2 line cards for OTV-to-core VDC connectivity. This configuration doesn’t allow for vPC, which means we are back to a standard STP topology. The following picture is from the Cisco vPC best practices design guide and describes this unsupported vPC configuration, among others:

One scenario not represented is a vPC formed by F2 and F2e ports, which would be a supported configuration. So now let's see what happens in terms of network traffic flow:


Let’s assume VLAN 100 (VM network) needs to talk to VLAN 100 in the second datacenter through OTV. Because we have chosen active/active VM traffic at the ESXi level, both vmnic0 and vmnic1 can forward VLAN 100 traffic. For example, if network traffic flows up to the left-hand Nexus 5000 (as shown in the picture), vPC hashing could select the right-hand core VDC as the next hop. Because VLAN 100 is an even VLAN, it also has to go through the left OTV edge device, which is authoritative for even VLANs. The traffic would therefore cross the peer-link. If the peer-link is not sized for this, it may cause serious problems, because the peer-link is primarily responsible for control plane synchronization between vPC peers. No data plane traffic should cross the peer-link under normal conditions. Overloading the peer-link could prevent synchronization and bring the full vPC configuration down.
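
The flow described above boils down to a simple rule. The following is a minimal Python sketch, not a model of actual NX-OS behavior. It assumes, as in this example, that the left OTV edge device is authoritative for even VLANs and is reached through the left core VDC, while the right one handles odd VLANs. The function names are hypothetical.

```python
def aed_side(vlan_id: int) -> str:
    """Side of the authoritative edge device (AED) for a VLAN.

    Assumption taken from this article's example: the left OTV edge
    device owns even VLANs, the right one owns odd VLANs.
    """
    return "left" if vlan_id % 2 == 0 else "right"


def crosses_peer_link(vlan_id: int, hashed_core: str) -> bool:
    """True when vPC hashing lands the frame on the core VDC that is not
    on the AED's side, so the frame has to traverse the peer-link."""
    if hashed_core not in ("left", "right"):
        raise ValueError("hashed_core must be 'left' or 'right'")
    return hashed_core != aed_side(vlan_id)


# VLAN 100 hashed to the right-hand core: the frame crosses the peer-link.
print(crosses_peer_link(100, "right"))
# VLAN 100 hashed to the left-hand core: the frame reaches the AED directly.
print(crosses_peer_link(100, "left"))
```

Since the hash can pick either core for a given flow, a share of the inter-datacenter traffic of every extended VLAN can end up on the peer-link in this design.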

In Part 1, a good solution to alleviate the problem was to use non-vPC VLANs and dedicate specific trunk links between peers to transport those VLANs. Here this alternative is not possible, as the connections between the Nexus 5000 switches and the Nexus 7000 core VDCs rely on vPCs. All transported VLANs have to be allowed over the vPC peer-link.

The only way to solve this problem is to make the access layer mimic the Layer 2 topology defined at the OTV VDC layer. For example, if you are using STP between the OTV VDCs and the core VDCs, as in our case, then you should also use STP between the Nexus 5000 switches and the core VDCs. Network traffic would then follow the optimal path. Similarly, if you’re using vPC to connect the OTV VDCs, then also use vPC at the access layer.


This architecture provides optimal paths, and network traffic will cross the peer-link only in the event of a port or link failure. Full configuration details and packet walkthrough can be found here.

In this last part, we are going to see how vMotion can use two IP links between datacenters simultaneously. This assumes that the AEDs responsible for the same VLAN range face each other in each datacenter; if they do not, ECMP would load-balance network flows anyway, making the multi-NIC vMotion path non-deterministic across datacenters. The rationale is the following: each core VDC has one IP link transporting OTV packets to the opposite datacenter. Under normal conditions, the core VDC forwards traffic directly to the second datacenter and carries exclusively odd or even VLANs, depending on which VLANs the connected OTV edge device is authoritative for. By placing the two VMkernel interfaces used for multi-NIC vMotion in an even VLAN and an odd VLAN respectively, we can use both IP links, as depicted in the diagram below:


You now have 2 x 10GbE at your disposal. Because these links can be shared with other traffic types, it's also important to define proper QoS so that a savvy VMware admin can’t bring down the whole infrastructure by moving VMs around!
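
To check that a pair of vMotion VLANs really spreads across both inter-datacenter links, the even/odd rule can be sketched in a few lines of Python. This is only an illustration under the assumptions of this article (even VLANs leave through the left core VDC, odd VLANs through the right one, AEDs facing each other). The VLAN IDs 200, 201 and 202 and the function names are hypothetical.

```python
from typing import Iterable


def otv_ip_link(vlan_id: int) -> str:
    """IP link used to reach the other datacenter for a given VLAN.

    Assumption: the left core VDC carries even VLANs and the right
    core VDC carries odd VLANs, under normal conditions.
    """
    return "left" if vlan_id % 2 == 0 else "right"


def uses_both_links(vmk_vlans: Iterable[int]) -> bool:
    """True when the vMotion VMkernel VLANs cover both IP links."""
    return {otv_ip_link(v) for v in vmk_vlans} == {"left", "right"}


# One even and one odd vMotion VLAN: both links are used.
print(uses_both_links([200, 201]))
# Two even vMotion VLANs: everything leaves through the same link.
print(uses_both_links([200, 202]))
```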

In the event of an impending outage potentially impacting the full datacenter, restarting VMs at the recovery site is likely to be faster and more efficient than evacuating VMs with vMotion + Storage vMotion. Of course it depends on the total number of VMs to be migrated, but try to do the math and estimate the required bandwidth, given the maximum number of simultaneous vMotion/Storage vMotion operations per 10GbE link. Then compare it to the time required to restart those VMs at the recovery site. Choose wisely!
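
As a starting point for that math, here is a rough back-of-the-envelope sketch in Python. It only divides the amount of data to move by the usable bandwidth. It does not model concurrency limits, memory page re-copies during vMotion, or QoS. Every input (number of VMs, memory per VM, usable fraction of the links) is a hypothetical value to replace with your own figures.

```python
def evacuation_seconds(total_gb: float,
                       links: int = 2,
                       link_gbps: float = 10.0,
                       usable_fraction: float = 0.7) -> float:
    """Lower-bound time to move total_gb gigabytes between datacenters.

    total_gb        -- data to transfer (VM memory, plus disks if Storage
                       vMotion is involved), in GB (10^9 bytes)
    links           -- number of inter-datacenter IP links that can be used
    link_gbps       -- speed of each link in Gbit/s
    usable_fraction -- assumed share of the links left for vMotion traffic
    """
    if total_gb < 0 or links <= 0 or link_gbps <= 0:
        raise ValueError("inputs must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    usable_gbps = links * link_gbps * usable_fraction
    return total_gb * 8 / usable_gbps  # GB -> Gbit, then divide by Gbit/s


# Hypothetical example: 200 VMs with 16 GB of memory each, vMotion only.
seconds = evacuation_seconds(200 * 16)
print(f"{seconds / 60:.1f} minutes")
```

With these made-up numbers the memory transfer alone takes roughly half an hour, before any Storage vMotion. That is the figure to compare with the restart time measured at the recovery site.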

Ideally you should obviously leverage a vPC architecture everywhere, but I took this particular example to show that designing a VMware environment together with OTV is not only about designing ESXi hosts or the vCenter Server. You should be able to answer questions like: What impact do my design or technical requirements have on other datacenter areas? Which limitations or caveats am I introducing for other technologies if I opt for that choice?

90% of the time, OTV or Layer 2 extension requirements come from the server/virtualization team, which wants to leverage VMware stretched clusters for “better” availability. As you can see, this is not a straightforward path.

In addition, there is often confusion between what a VMware stretched cluster (also called vSphere Metro Storage Cluster, or vMSC) achieves and disaster recovery. A vSphere stretched cluster is NOT a disaster recovery solution, and it’s certainly not about business continuity either; it is about extended high availability. You mostly have no control over your workloads and you cannot guarantee anything in terms of VM restart, not even their priority (to a certain extent). Would you host your most critical applications on top of an environment you cannot properly control? I’m not sure.

It’s true that there are mechanisms that can alleviate this lack of control, but most of the time they’re either not implemented or not properly implemented, so what is the point? Is this lack of control worth the amount of money you have invested (vMSC is usually the most expensive solution, as it requires synchronous storage replication, stretched FC fabrics, etc.)? Nothing is less certain!

