Design considerations for vSphere and OTV in stretched cluster environments (Part 1)

In this series of posts, I’m going to focus on different scenarios involving Cisco OTV in a VMware vSphere environment, and try to demonstrate the potential impact of design decisions for both technologies. These scenarios include common configurations at the physical and virtual network access layers, often found in reference architectures such as Flexpod or VSPEX.

Here are the 2 main scenarios on which I’ll focus:

  • vSphere hosts directly connected to a pair of Nexus 7000 at the aggregation layer (this post).
  • vSphere hosts connected to a pair of Nexus 5000 at the access layer (Part 2).

But first of all, a small reminder of what OTV is, and its main use cases:

To make it simple, OTV, or Overlay Transport Virtualization, is a Cisco technology that provides Layer 2 adjacency on top of an IP network. Although you could use it within a datacenter, its main use case is VLAN extension across distant sites.

As a consequence, it makes it possible to run multiple features or technologies across datacenters, such as long-distance vMotion (just because you can doesn’t mean you should!), building a VMware vSphere Metro Storage Cluster, or keeping identical IP addresses after VMware SRM has run a recovery plan.

But there are also a few drawbacks. As you probably know, extending VLANs also means extending Layer 2 failure domains. It propagates flooding, induces hair-pinning when forwarding traffic to the default gateway, etc. And as a general rule, stretching VLANs over distance is a bad idea!

However, OTV alleviates these potential problems by leveraging the following mechanisms:

  • It synchronizes MAC addresses by using a Layer 3 control plane (IS-IS), thus reducing flooding when learning new MAC addresses (see the sketch after this list).
  • It does not transmit BPDUs over the overlay. All sites are running individual STP instance(s).
  • It provides FHRP localization, thus optimizing egress North/South traffic.
  • It drops unknown unicast frames by default.

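To make the first and last bullets more concrete, here is a minimal Python sketch of how a control-plane-driven MAC table changes forwarding behavior compared with classic flooding. This is an illustrative model only, not actual NX-OS code, and all names are made up: remote MACs become reachable once they have been advertised, and unknown unicast destined to the overlay is dropped rather than flooded.

```python
# Minimal, illustrative model of OTV-style MAC learning (not real NX-OS behavior).
# Each edge device advertises locally learned MACs over a control plane (IS-IS in
# real OTV); the remote edge installs them in its MAC table. Unknown unicast
# towards the overlay is dropped instead of flooded.

class OtvEdge:
    def __init__(self, name):
        self.name = name
        self.local_macs = set()   # MACs learned on internal interfaces
        self.remote_macs = {}     # MAC -> remote edge name (learned via control plane)

    def learn_local(self, mac):
        self.local_macs.add(mac)

    def advertise_to(self, peer):
        """Control-plane advertisement of local MACs (models IS-IS updates)."""
        for mac in self.local_macs:
            peer.remote_macs[mac] = self.name

    def forward(self, dst_mac):
        if dst_mac in self.local_macs:
            return "forward locally"
        if dst_mac in self.remote_macs:
            return f"encapsulate and send to {self.remote_macs[dst_mac]}"
        return "drop (unknown unicast is not flooded over the overlay)"


dc1, dc2 = OtvEdge("DC1-edge"), OtvEdge("DC2-edge")
dc2.learn_local("00:50:56:aa:bb:cc")   # a VM MAC in DC2
dc2.advertise_to(dc1)                  # control-plane sync, no data-plane flooding

print(dc1.forward("00:50:56:aa:bb:cc"))  # encapsulate and send to DC2-edge
print(dc1.forward("00:50:56:00:00:01"))  # drop (unknown unicast ...)
```
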
So to summarize: if you really want to extend your Layer 2 domain across datacenters, and you do have a business case, do it with OTV!

Now let's define the requirements and the guidelines for our scenarios, as well as the different assumptions:

  • The design is limited to 2 sites connected with DWDM technology.
  • Nexus 7000 pairs are connected with Layer 3 links between datacenters, in a square topology.
  • The network hierarchical model is collapsed core: aggregation and core functions are performed by the same devices.
  • vMotion should be able to leverage all available active datacenter interconnect (DCI) links.
  • Where possible, no single point of failure.
  • When using vPC, data traffic over the peer-link should be limited (ideally, the peer-link should only be used for peer synchronization).
  • OTV uses its own VDC; the network aggregation/core role is performed by another VDC of the same physical switch.
  • The architecture chosen for OTV is "appliance on a stick": the aggregation/core device is used for both VLAN extension and IP transport between datacenters.
  • vSphere hosts have two 10GbE adapters.
  • Native FC is provided by additional HBAs within isolated SAN Fabrics.
  • Network traffic will be load-balanced using the VMware "Route based on physical NIC load" teaming policy (also known as LBT, or Load-Based Teaming) for virtual machine port groups, and active/standby policies for the vMotion and Management networks (a sketch of this teaming plan follows the list). An alternative would be to leverage vPC, but we need non-vPC VLANs for the purpose of this article. Another argument is that vPC is not as smart as LBT when it comes to path selection, since LACP doesn’t take the host’s NIC load into account when selecting the egress port.

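As a quick illustration of that last requirement, here is a small Python sketch describing the intended teaming plan per port group (port-group and uplink names are purely illustrative), with a basic check that VM networks use LBT across both uplinks while vMotion and Management stay active/standby:

```python
# Illustrative teaming plan for a host with two 10GbE uplinks (vmnic0 / vmnic1).
# "LBT" corresponds to vSphere's "Route based on physical NIC load" policy.

TEAMING_PLAN = {
    "VM-Network-100": {"policy": "LBT", "active": ["vmnic0", "vmnic1"], "standby": []},
    "VM-Network-101": {"policy": "LBT", "active": ["vmnic0", "vmnic1"], "standby": []},
    "vMotion":        {"policy": "explicit-order", "active": ["vmnic0"], "standby": ["vmnic1"]},
    "Management":     {"policy": "explicit-order", "active": ["vmnic1"], "standby": ["vmnic0"]},
}

def validate(plan):
    """Check the design rules stated in the requirements above."""
    for pg, cfg in plan.items():
        if pg.startswith("VM-"):
            # VM traffic: LBT over both uplinks, no standby adapters
            assert cfg["policy"] == "LBT" and len(cfg["active"]) == 2, pg
        else:
            # vMotion / Management: exactly one active and one standby uplink
            assert len(cfg["active"]) == 1 and len(cfg["standby"]) == 1, pg

validate(TEAMING_PLAN)
print("Teaming plan is consistent with the design requirements.")
```
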
The high-level design could be represented as follows:

[Figure OTV_1: high-level design of the two-site stretched cluster]

Now let's have a look at the first datacenter design:

[Figure non_VPC: first datacenter design, OTV edge devices single-attached to the aggregation VDCs]

This architecture highlights the following:

  • VM network port groups are connected to both Nexus 7000 in an active/active fashion.
  • OTV is multihomed (using two OTV edge devices, shown in dark blue on the picture) for redundancy and load-balancing purposes. The current version of NX-OS (6.2(6) at the time of writing) provides automatic, non-configurable load balancing: each OTV edge device is authoritative (Authoritative Edge Device, or AED) for a particular set of VLANs. Traffic destined to odd VLANs is forwarded by one OTV edge device, while even VLANs leverage the second edge device (see the sketch after this list).
  • Each OTV edge device is single-attached to its aggregation VDC.
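
To illustrate the VLAN split mentioned above, here is a minimal Python sketch of the odd/even behavior described in this post. The device names are illustrative, the real AED election is performed automatically by OTV and is not configurable, and the function below only models the observable result with two edge devices.

```python
# Models the observable AED behavior with two OTV edge devices:
# even VLANs are handled by one edge device, odd VLANs by the other.

EDGE_DEVICES = ["OTV-left", "OTV-right"]   # illustrative names

def authoritative_edge(vlan_id, edges=EDGE_DEVICES):
    """Return the edge device that forwards overlay traffic for this VLAN."""
    return edges[vlan_id % 2]

for vlan in (100, 101, 200, 201):
    print(f"VLAN {vlan} -> {authoritative_edge(vlan)}")
# VLAN 100 -> OTV-left, VLAN 101 -> OTV-right, and so on.
```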

Now let's look at the potential risk in this design:

[Figure worst_path_non_vpc: suboptimal path for VLAN 100/101 traffic crossing the vPC peer-link]

As shown in the picture, if VLAN 100 in one datacenter needs to talk to VLAN 100 in the other one, network traffic may go through the right-hand Nexus 7000 aggregation/core VDC, depending on the virtual switch's uplink decision. VLAN 100 is an even VLAN, which means the flow then has to be steered to the left-hand OTV edge device, crossing the peer-link. The same logic applies to VLAN 101.

The peer-link is responsible for vPC peer control-plane synchronization and should not transport any data under normal conditions. In this scenario, if the peer-link is not sized accordingly, it may become overloaded and cause a complete outage. The bigger the VMware environment, the more likely the problem will arise.

If this network topology is a constraint or an input from the networking team, then the VMware network design could potentially solve the problem by configuring vmnic0 as active and vmnic1 as standby for even VLANs, and the opposite for odd VLANs (this mapping is sketched a bit further below). With this design, the peer-link would transport data only if one NIC were to fail, which is acceptable. The alternative, much smarter solution would be to dedicate trunk links between the vPC peers to transport those VLANs, as depicted below:

[Figure OTV_nonVPC_VLANs1: dedicated trunk links carrying the non-vPC VLANs between the vPC peers]

Of course, this alternative solution would come from the network team, not the server team.
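
Coming back to the host-side workaround (vmnic0 active for even VLANs, vmnic1 active for odd VLANs), here is a minimal Python sketch that ties the uplink choice to the AED placement shown earlier and checks that, with both NICs up, no VM traffic has to cross the peer-link. The uplink-to-switch wiring and device names are assumptions made for the example.

```python
# Illustrative check: map each VLAN to the active vmnic whose upstream Nexus 7000
# hosts the AED for that VLAN, so traffic reaches the OTV edge device without
# crossing the vPC peer-link. The wiring below is an assumption for the example.

UPLINK_TO_SWITCH = {"vmnic0": "N7K-left", "vmnic1": "N7K-right"}
AED_SWITCH = {0: "N7K-left", 1: "N7K-right"}   # even VLANs -> left AED, odd -> right AED

def active_uplink(vlan_id):
    """vmnic0 active for even VLANs, vmnic1 active for odd VLANs (the other is standby)."""
    return "vmnic0" if vlan_id % 2 == 0 else "vmnic1"

def crosses_peer_link(vlan_id, uplink):
    ingress_switch = UPLINK_TO_SWITCH[uplink]
    return ingress_switch != AED_SWITCH[vlan_id % 2]

for vlan in (100, 101):
    nic = active_uplink(vlan)
    print(f"VLAN {vlan}: active uplink {nic}, crosses peer-link: {crosses_peer_link(vlan, nic)}")
# With both NICs up, nothing crosses the peer-link; if vmnic0 fails, even VLANs
# fail over to vmnic1 and traverse the peer-link, which is the accepted trade-off.
```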

Let's look at an alternative design, meeting the same initial requirements:

[Figure vPC_7k: OTV edge devices connected to the aggregation/core layer via vPC]

In this configuration, the OTV edge devices are connected to the aggregation/core layer via vPC. This avoids consuming unnecessary peer-link bandwidth, as each aggregation/core switch can reach the OTV edge devices directly. It’s definitely a more robust design.

[Figure path_vPC_7k: traffic path with vPC-attached OTV edge devices]

When architecting VMware environments, every design choice may have an impact on other areas of the datacenter, especially when those areas introduce innovative technologies such as Cisco OTV. Networking and virtualization are increasingly dependent on each other, and if the teams or SMEs involved don’t work collectively to address these design challenges, projects that appear compelling at first glance can end in total failure! This was just one example, reflecting what I see when working with my customers. I’m usually engaged in multi-layer projects involving multiple teams, especially the server team and the network team. I can speak both languages, so I can quickly pinpoint these issues. But the root of the problem is that, most of the time, teams simply don’t talk to each other. My advice: before initiating any virtualization project, sit down with the network and storage teams and verify that your design doesn’t introduce any potential risk (this is even more true if you’re preparing for the VCDX certification).

Part 2 will cover vSphere hosts connected to a pair of Nexus 5000 at the access layer, as well as multi-NIC vMotion design.
