Beware of FUD... NSX over ACI explained
There are moments when I'm very proud to work with best-of-breed technologies, at the bleeding edge of the industry. The reason is that I enjoy solving business problems with innovations, and finding the right use case for the right feature or solution. It is true that most of the time, technology is the easiest part of the picture. The biggest hurdles generally sit within organisational processes or stem from the siloed nature of IT teams. Part of my job is to explain the technology to partners and customers, and show them how to articulate different technical solutions to solve data center challenges. It is also to highlight how we differ from the competition, so sometimes I'm also working on debunking misconceptions and trying to limit other vendors' FUD.
I'm not naive. I know that the world we live in is mostly driven by profits and that marketing teams are sometimes ready to push the envelope as far as they can in order to increase credibility, up to the point where they fool people about a product's ability to execute. This is what VMware has been doing for the past few weeks and I feel very uncomfortable with this practice. Spreading FUD in the hope of gaining illegitimate endorsement by twisting statements, pictures and IP is not the standard we can expect from such a company. Or at least it was not the case a few years ago when everything was set fair; now things are a little bit different. Financial pressure may impose being more "creative" when it comes to building marketing strategies. However, reading those fallacies makes me blow my top and I think reality needs to be clarified at the very least, both from a messaging and a technical standpoint. In the next paragraphs, I'm going to expose these statements, explain where they're coming from, and detail why they're completely absurd and, most importantly, what the reality behind them is.
Lying about the facts
First statement from VMware:
"Last week, we were acknowledged by Cisco at Cisco Live in Berlin! They are now saying we are perfect for the virtual world and ACI is for Physical."
Well this one is the most ridiculous. Here VMware is claiming that we are not targeting ACI for virtual environments and that we recognise NSX as the best solution for SDN. This sounds like an obvious fallacy but still, this may be misleading when you don't know where that statement is coming from.
At Cisco Live in Berlin two weeks ago, a colleague of mine - Steve Sharman - gave a very good session about ACI for network admins. If you want to see the full content, look for BRKACI-2005 on CiscoLive Online. For 1.5 hours, Steve explains ACI concepts from a network administration point of view and does a very good job at it. He's got a section focussing on the integration of virtual environments and a specific use case depicting a scenario where customers may want to integrate ACI with NSX. Actually he asked me to review this particular slide and I found the initial intention of showing how open ACI is very laudable. After a few amendments to the drawing, we ended up with the following detailed diagram, showing all the complexity of adding NSX on top of ACI.
Let's explain this slide in more detail.
At the bottom of the diagram, you can notice an EPG (Endpoint Group) called "NSX_Transport". In the case of NSX, this EPG will transport VXLAN frames encapsulated by the hypervisor VTEP. And this is the very first thing that doesn't make a lot of sense if you've already got ACI. You don't need to bother with the extra management of host overlays, which scale poorly and add extra head-end replication (HER). Alternatively you can use multicast, but you have to get PIM configured on the network when adhering to the VMware L3 fabric "best practice" with NSX. More importantly, the NSX-based overlay hides Virtual Machines from a network point of view (MAC addresses and IP addresses, unless you want to advertise /32 routes everywhere in the data center), making end-to-end visibility impossible. Here VMware would obviously argue that you can correlate the underlay (physical network) and NSX by using vROps, of course at an additional cost.
All this overlay management comes out of the box with ACI and you don't need to invest in other monitoring tools as it provides end-to-end visibility for both virtual and physical. It is fully programmable via RESTful APIs or using a Python SDK, which NSX is still lacking after 2 years...
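To put a rough number on the head-end replication cost mentioned above, here is a minimal sketch. It assumes a deliberately simplified model that is not taken from any vendor documentation: every VTEP in the segment needs its own unicast copy of each broadcast, unknown-unicast or multicast (BUM) frame, with no replication proxy in between. The function names are hypothetical.

```python
def her_copies_per_bum_frame(vteps_in_segment: int) -> int:
    """Unicast copies the source VTEP must send for one BUM frame.

    Simplified assumption: one copy per remote VTEP, no proxy VTEP.
    """
    if vteps_in_segment < 1:
        raise ValueError("a segment needs at least one VTEP")
    return vteps_in_segment - 1


def her_uplink_load_mbps(vteps_in_segment: int, bum_rate_mbps: float) -> float:
    """Uplink bandwidth the source host spends replicating BUM traffic."""
    return her_copies_per_bum_frame(vteps_in_segment) * bum_rate_mbps


if __name__ == "__main__":
    # Replication work grows linearly with the number of hosts in the segment.
    for hosts in (8, 64, 256):
        copies = her_copies_per_bum_frame(hosts)
        load = her_uplink_load_mbps(hosts, bum_rate_mbps=1.0)
        print(f"{hosts} VTEPs: {copies} copies per frame, {load} Mbps on the uplink")
```

Real deployments may reduce this with multicast or per-subnet replication proxies, so treat the figures as an upper bound for the pure unicast case.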
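As an illustration of what "programmable via RESTful APIs" looks like in practice, the sketch below builds (but does not send) two requests against the APIC object model: a login and the creation of a tenant. The controller address, credentials and tenant name are placeholders, and the helper names are hypothetical. Only the Python standard library is used, so this is not the Python SDK mentioned above.

```python
import json
import urllib.request

APIC_URL = "https://apic.example.com"  # placeholder controller address


def _post(path: str, body: dict) -> urllib.request.Request:
    """Build a JSON POST request for the given API path (nothing is sent)."""
    return urllib.request.Request(
        f"{APIC_URL}{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def login_request(user: str, password: str) -> urllib.request.Request:
    """Authentication request; the response carries a session token."""
    body = {"aaaUser": {"attributes": {"name": user, "pwd": password}}}
    return _post("/api/aaaLogin.json", body)


def tenant_request(tenant_name: str) -> urllib.request.Request:
    """Create (or update) a tenant object under the policy universe."""
    body = {"fvTenant": {"attributes": {"name": tenant_name}}}
    return _post("/api/mo/uni.json", body)


if __name__ == "__main__":
    req = tenant_request("Demo")  # "Demo" is an illustrative tenant name
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Sending the requests would require a reachable controller and the session token from the login response, which is why the sketch stops at building them.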
However, because it is possible, some people will want to do it. This diagram is basically showing that it would be a very bad idea even if technically it would work. ACI can transport any application, and in the NSX case, the application is just "VXLAN". Cisco's position is rather simple to understand: ACI is a software driven arbitrary fabric that provides better SDN functions than NSX. It manages network overlays for physical, virtual AND containers. The configuration is managed via centralised policies, consistent across all the environments I just mentioned. This means you can have VMware, Bare Metal servers, Docker, Microsoft Hyper-V, all connected and managed via the same set of abstracted rules. Running NSX overlays on top of ACI just adds the extra complexity of managing 2 independent VXLAN domains, leading to a double VXLAN encapsulation in the network.
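The cost of double encapsulation can be estimated with simple arithmetic. The sketch below assumes standard VXLAN over IPv4 without 802.1Q tags, and it treats the fabric's own encapsulation as having the same header size as standard VXLAN. That last point is an assumption made for illustration, not a statement about the exact ACI frame format.

```python
# Header sizes in bytes for one VXLAN layer over IPv4 (no 802.1Q tag assumed).
OUTER_ETHERNET = 14
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HEADER = 8
VXLAN_OVERHEAD = OUTER_ETHERNET + OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER  # 50


def required_underlay_mtu(vm_mtu: int, vxlan_layers: int) -> int:
    """IP MTU the underlay needs so that a full-size VM packet is not fragmented."""
    if vxlan_layers < 0:
        raise ValueError("vxlan_layers must be non-negative")
    return vm_mtu + VXLAN_OVERHEAD * vxlan_layers


if __name__ == "__main__":
    print(required_underlay_mtu(1500, 1))  # single overlay: 1550
    print(required_underlay_mtu(1500, 2))  # host overlay carried in a fabric overlay: 1600
```

Each extra layer adds another 50 bytes of headers per packet, on top of the operational cost of running two independent VXLAN domains.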
Leveraging illegitimate endorsement
Second statement from VMware:
"As seen in this latest CiscoLive slide, Cisco is now correctly positioning Cisco ACI for physical underlay functions only, allowing customers to take full advantage of NSX deployed on top. This shows that NSX is inevitable and Cisco is starting to accept that."
This one is clearly trying to demonstrate two things:
- Cisco was wrong with ACI in the first place and host based overlay is the way to go. But Manageability, Performance, Availability and Costs criteria (especially in the case of NSX) show the exact opposite.
- ACI needs NSX to be able to provide all SDN functions.
Let's try to clarify this...
NSX Controllers depend on a 1:1 relationship with vCenter. If multiple vCenters are required, then "Universal" transport zones and new constructs are required to get an end-to-end overlay. In addition, important features are lost, such as stateful N/S firewalling. Local egress also becomes challenging unless you're using distributed logical routing, which implements local route tagging when advertising routes upstream. Long story short, as NSX scales, it drastically increases the number of constraints, which limits design choices. On top of that, ECMP stays an OPTION: even basic features you would expect from a traditional network are not a given.
By contrast, Cisco ACI can currently integrate up to 10 vCenters, managed by a single set of controllers. More importantly, design choices are not limited. Whether you choose to go for a single stretched fabric or multiple fabrics, local egress is always guaranteed, as well as optimal load sharing across all active links. Cisco ACI can accommodate any arbitrary virtual or physical network topology, whether L2 or L3. There is no need for any extra x86 resources to provide multi-tenancy features. Traditional network constructs such as VLANs or subnets are completely abstracted, which enables SDN functions across all types of environment, whether virtual or physical. To put this in perspective against the previous diagram showing NSX complexity, the following diagram depicts the same scenario with ACI only:
It substantially simplifies the global SDN design as well as the integration of existing networks.
In NSX, unless you've got multicast transport zones, forwarding information is stored in a 3-node controller cluster. Depending on the failure scenario, data-plane and control-plane outages can be experienced. In Cisco ACI, there is a different approach. Controllers don't participate in the control plane at all. Their primary function is to store policies and the state of the Management Information Tree. The control-plane function is responsible for determining where the endpoints are effectively connected. The information is stored in a distributed database residing within the fabric spines. Similarly, the routing functions are also distributed by using standard routing protocols, dynamically managed by the fabric. Even if you were to lose the 3 controllers, no data-plane outage would be experienced at all. Only the ability to create new policies would be impacted.
For NSX, when it comes to edge perimeter security and multi-tenancy, it gets worse, as the whole NSX tenant reachability is provided by a pair of Edge Gateways. Their failover time is 15s by default, which can be reduced to 6s, but that is still very long when you know that an optimised OSPF network has sub-second convergence. In Cisco ACI, multi-tenancy capabilities are provided by VRFs, which fully isolate IP spaces between tenants (funnily enough, this is a virtualization technology). All Leaf nodes (ToR) act as the default gateway for the locally connected endpoints. There is no dependency on external resources, therefore no possible outage within the tenant perimeter, other than those induced by poor host design (e.g. hosts not dual-homed).
About Performance and Security
For VMware, Performance isn't really a concern with NSX as it's not part of its selling points. From their perspective, networking is just a tap you can turn on or off. Well, this may sound great for virtual admins at first glance, but anyone with a little networking background knows that keeping the network lights on isn't as easy as that. When using NSX as an overlay, the first-hop default gateway for Virtual Machine traffic is pushed to the host level in a distributed fashion. It means the more hosts you have, the more cumulative bandwidth you can allocate for east/west traffic between Virtual Machines. This however comes at the cost of some CPU overhead on the host itself, which may be a concern in environments with a high consolidation ratio. But the fact is CPU is not usually a scarce resource in virtualized data centers.
The real performance concern is not at this level. It sits at the boundary of the VXLAN domain, when traffic has to be forwarded to the physical network. Whether bridged or routed, all Virtual Machine traffic destined to the outside (e.g. Bare Metal servers, WAN, or any non-VXLAN-enabled network) is steered to an NSX Edge gateway (ESG). It's essentially acting as a gateway between the VXLAN overlay and the physical network. In other words, it becomes a bottleneck. On top of that you can't really expect line rate for any single ESG, and probably about half of line rate for the smallest ESG footprint. However, you could leverage ECMP and use multiple ESGs to improve the global throughput, but at the cost of the stateful security functions. You can't have both ECMP and a North/South firewall at the edge of your tenant.
So you have to make a choice between 2 really basic network capabilities that are now mutually exclusive if you choose to go for VMware NSX. Choose ECMP and your tenant is not secured, unless you want to rely solely on micro-segmentation for N/S communication, which is not manageable or scalable at all. Or choose to secure your tenant, but you'll lose ECMP capabilities, which will also impact performance. Who said that NSX was simple? It's all about trade-offs in the end.
Now how does that compare to Cisco ACI? ACI offers a line-rate VXLAN-VLAN L2 or L3 gateway at each ToR within a Zero Trust Model where by default traffic is not allowed between security zones, which we call EPGs (End Point Groups). Policies are created to make EPGs talk together and security is enforced in a distributed fashion at the ingress. You can leverage those policies both for East/West (intra-tenant) and North/South (inter-tenant or tenant to outside) in a scalable fashion. Alternatively, ACI also gives you the ability to use your existing firewall or L4-7 solution to protect the tenant. The ACI controller (APIC) will then program the security device via the security policies you've defined. This also means that the security device is now partially under the control of ACI and can take advantage of ACI RESTful APIs to either automate further configuration or get extra information from the APIC, such as device health score or service health (very useful for load-balancers).
No compromise, no trade-off for the best scalability and performance. And more importantly, ACI can deliver this for Virtual Machines (with Virtual Machine Manager integration for vCenter, Hyper-V or OpenStack), physical servers and even containers, all managed with the same set of consistent policies.
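The reason ECMP and stateful firewalling at the edge do not mix well can be shown with a small sketch. It assumes a generic hash-based ECMP scheme in which each direction of a flow is hashed independently by a different router. The hash function, gateway names and addresses are illustrative and do not reproduce the algorithm of any particular product.

```python
import hashlib

Flow = tuple  # (src_ip, dst_ip, protocol, src_port, dst_port)


def ecmp_next_hop(flow: Flow, next_hops: list) -> str:
    """Pick a next hop by hashing the 5-tuple (illustrative hash, not a vendor's)."""
    key = "|".join(str(field) for field in flow).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]


def reverse(flow: Flow) -> Flow:
    """The 5-tuple as seen by the return traffic."""
    src_ip, dst_ip, proto, src_port, dst_port = flow
    return (dst_ip, src_ip, proto, dst_port, src_port)


def asymmetric_flows(flows: list, edges: list) -> list:
    """Flows whose return traffic lands on a different edge gateway."""
    return [f for f in flows
            if ecmp_next_hop(f, edges) != ecmp_next_hop(reverse(f), edges)]


if __name__ == "__main__":
    edges = ["esg-1", "esg-2", "esg-3", "esg-4"]  # hypothetical gateway names
    flows = [(f"10.0.0.{i}", "192.0.2.10", "tcp", 40000 + i, 443)
             for i in range(1, 201)]
    broken = asymmetric_flows(flows, edges)
    print(f"{len(broken)} of {len(flows)} flows return through a different gateway")
```

A gateway that only sees one direction of a connection has no matching state for the return packets, so a per-gateway stateful firewall would drop them. With a single gateway the problem disappears, and so does the extra throughput.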
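The whitelist model described above, where nothing flows between EPGs until a policy says so, can be sketched in a few lines. This is a simplified, one-directional model with hypothetical class and EPG names. It is not the APIC object model, and it assumes that traffic inside a single EPG is permitted by default.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Contract:
    """One allowed conversation: consumer EPG to provider EPG on a given port."""
    consumer: str
    provider: str
    protocol: str
    port: int


@dataclass
class Policy:
    contracts: set = field(default_factory=set)

    def allow(self, consumer: str, provider: str, protocol: str, port: int) -> None:
        self.contracts.add(Contract(consumer, provider, protocol, port))

    def permits(self, src_epg: str, dst_epg: str, protocol: str, port: int) -> bool:
        if src_epg == dst_epg:
            return True  # assumption: intra-EPG traffic is allowed by default
        # Default deny: traffic passes only if a matching contract exists.
        return Contract(src_epg, dst_epg, protocol, port) in self.contracts


if __name__ == "__main__":
    policy = Policy()
    policy.allow("web", "app", "tcp", 8080)
    print(policy.permits("web", "app", "tcp", 8080))  # True
    print(policy.permits("web", "db", "tcp", 3306))   # False
```

Because the rule refers to groups rather than addresses, the same contract applies whether the endpoints are Virtual Machines, bare metal servers or containers.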
To sum it up
While running NSX on top of ACI is possible, it doesn't provide anything that ACI can't do as a solution, including micro-segmentation. Actually, when combining both, almost every single architectural benefit is duplicated, and so is the cost! On the other hand, VMware NSX is just 50% of a working solution as you'll always need a fabric underneath. So now the point is: if you've got an SDN-capable fabric anyway, why spend extra money on a software-based overlay that doesn't add any major function or benefit? One could argue that in terms of responsibility, NSX is targeted at server/virtualisation teams and ACI is targeted at network teams.
But this is the worst approach when foreseeing SDN adoption. Bear in mind that the goal of SDN is to simplify network operations and seamlessly integrate network provisioning, and ultimately management, into broader infrastructure life cycles. Before thinking of SDN, organisations should first take a look at their IT model and build overlay teams to limit the "silo" effect of technology adoption. Then SDN is also about automation and orchestration, so think about the tools provided by the solution. Do you really want to manage flat XML files and use raw HTTP requests, or use a proper object model with a flexible SDK on top of a JSON schema? Automation engineers will have a strong opinion about that.
So when embracing SDN, be wise and, before talking to vendors, create a team of SMEs to think about your current challenges in terms of infrastructure, virtualization and network directions, but also automation.