The traditional TCP/IP stack doesn’t scale to the latest fast datacenter networks:

  • CPU overhead is too high, despite all kinds of hardware and software optimization.
  • TCP can’t provide low latency because of the kernel stack and packet drops.

RDMA needs a lossless network, and RoCEv2 uses PFC (Priority-based Flow Control) to achieve that.

VLAN-tag-based PFC (layer 2), the typical solution, doesn’t scale in Microsoft’s environment, so they introduced DSCP (Differentiated Services Code Point) based PFC (layer 3).
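
The practical difference is where the priority information lives: in the VLAN tag, which is tied to layer-2 configuration, versus in the IP header, which survives layer-3 routing. As a rough illustration, marking traffic with a DSCP value from the host side could look like the sketch below (the DSCP value 26 is made up, and real RDMA NICs mark RoCEv2 packets themselves rather than via a socket option):

```python
import socket

# Hypothetical DSCP codepoint for the RDMA traffic class; the actual value
# is deployment-specific and not fixed by the paper.
RDMA_DSCP = 26              # AF31-style codepoint, illustrative only
TOS = RDMA_DSCP << 2        # DSCP occupies the upper 6 bits of the IP TOS byte

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Mark outgoing packets so switches can map DSCP -> PFC priority at layer 3,
# instead of relying on the VLAN PCP bits at layer 2.
s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS)
```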

RoCEv2 (RDMA over Converged Ethernet v2): encapsulates an RDMA transport packet within an Ethernet/IPv4/UDP packet.
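
A minimal sketch of the UDP layer of that encapsulation (the full on-wire frame is Ethernet / IPv4 / UDP / IB BTH / payload / ICRC; only the UDP header is built here, and the zero checksum is a simplification):

```python
import struct

ROCEV2_UDP_DPORT = 4791  # IANA-assigned UDP destination port for RoCEv2

def rocev2_udp_header(src_port: int, bth_and_payload_len: int) -> bytes:
    """Build the UDP header that wraps the RDMA transport (BTH + payload)."""
    length = 8 + bth_and_payload_len   # UDP header itself is 8 bytes
    checksum = 0                       # simplified; real stacks may compute it
    return struct.pack("!HHHH", src_port, ROCEV2_UDP_DPORT, length, checksum)
```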

PFC: a hop-by-hop protocol between two Ethernet nodes. Once the ingress queue length reaches a certain threshold (XOFF), the switch sends out a PFC pause frame to the corresponding upstream egress queue.
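
A minimal sketch of that trigger logic, with made-up XOFF/XON thresholds and the pause/resume frame abstracted into a callback:

```python
XOFF = 100_000   # bytes: send PAUSE once occupancy reaches this (illustrative)
XON  = 80_000    # bytes: send RESUME once occupancy drains below this

class IngressQueue:
    def __init__(self, priority: int):
        self.priority = priority
        self.occupancy = 0
        self.paused_upstream = False

    def on_enqueue(self, pkt_len: int, send_pause_frame):
        self.occupancy += pkt_len
        if not self.paused_upstream and self.occupancy >= XOFF:
            # Tell the upstream egress queue of this priority to stop sending.
            send_pause_frame(self.priority, pause=True)
            self.paused_upstream = True

    def on_dequeue(self, pkt_len: int, send_pause_frame):
        self.occupancy -= pkt_len
        if self.paused_upstream and self.occupancy < XON:
            send_pause_frame(self.priority, pause=False)
            self.paused_upstream = False
```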

There is a delay between when the switch sends the pause frame and when the upstream sender actually stops, dominated by the propagation delay over the cable (up to 300 meters). To guarantee losslessness, the switch must buffer the packets that keep arriving in the meantime (the PFC headroom). Because the switches they use have small buffers (9 MB or 12 MB), they can only reserve enough headroom for at most two lossless priorities, even though the PFC standard defines eight.
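
A back-of-the-envelope version of that headroom calculation (all numbers are illustrative; the paper’s actual accounting also includes switch and NIC response times, jumbo frames, and the shared buffer needed for lossy traffic):

```python
link_rate_bps  = 40e9        # 40 GbE
cable_len_m    = 300         # worst-case cable length mentioned in the paper
prop_speed_mps = 2e8         # ~2/3 of c in fiber/copper

prop_delay_s = cable_len_m / prop_speed_mps        # one-way propagation
rtt_s        = 2 * prop_delay_s                    # pause frame out + data still in flight
in_flight    = rtt_s * link_rate_bps / 8           # bytes arriving after XOFF is crossed
headroom     = in_flight + 2 * 1500                # plus packets already being serialized

ports = 64                                         # assumed ToR port count
total = ports * 2 * headroom                       # two lossless priorities
print(f"headroom/port/priority ≈ {headroom/1024:.1f} KiB, total ≈ {total/2**20:.1f} MiB")
# Against a 9-12 MB shared buffer (most of which is needed for everything else),
# only a couple of lossless priorities can get dedicated headroom.
```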

They use DCQCN for congestion control.
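
Roughly, in DCQCN the switch ECN-marks packets when a queue builds, the receiver NIC reflects this back as Congestion Notification Packets (CNPs), and the sender NIC’s rate limiter reacts. A minimal sketch of that sender-side reaction (constants are illustrative; the byte counter, fast recovery stages, and hyper increase are omitted):

```python
class DcqcnRateLimiter:
    G = 1 / 256        # gain for the alpha update (illustrative)
    R_AI = 40e6        # additive-increase step in bit/s (illustrative)

    def __init__(self, line_rate: float):
        self.rc = line_rate    # current sending rate
        self.rt = line_rate    # target rate
        self.alpha = 1.0       # estimate of congestion extent

    def on_cnp(self):
        # CNP from the receiver: remember the current rate as the target
        # and cut the sending rate multiplicatively by alpha/2.
        self.rt = self.rc
        self.rc *= (1 - self.alpha / 2)
        self.alpha = (1 - self.G) * self.alpha + self.G

    def on_timer_no_cnp(self):
        # No CNP during the last period: decay alpha and climb back
        # toward (and eventually beyond) the target rate.
        self.alpha = (1 - self.G) * self.alpha
        self.rt += self.R_AI
        self.rc = (self.rc + self.rt) / 2
```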

Coexistence of RDMA and TCP:

In this paper, RDMA is designed for intra-DC communication. TCP is still needed for inter-DC communication and legacy applications, so they carry TCP in a separate (lossy) traffic class with reserved bandwidth. The different traffic classes isolate TCP and RDMA traffic from each other.
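
A sketch of what that switch-side separation might look like (the class numbers, DSCP value, and bandwidth shares are all made up for illustration):

```python
RDMA_DSCP = 26   # must match the host-side marking; illustrative value

# Each traffic class gets its own queue; only the RDMA class is PFC-lossless,
# and the shares approximate an ETS-style bandwidth reservation.
traffic_classes = {
    3: {"carries": "RDMA (RoCEv2)", "pfc_lossless": True,  "bandwidth_share": 0.6},
    0: {"carries": "TCP / legacy",  "pfc_lossless": False, "bandwidth_share": 0.4},
}

def classify(dscp: int) -> int:
    """Switch-side sketch: map an incoming packet's DSCP to a traffic class."""
    return 3 if dscp == RDMA_DSCP else 0
```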

RDMA can achieve low latency (99us P99, vs. 700us P99 for TCP) and high throughput (60% of the total Clos network capacity with 0% CPU utilization), but not both at the same time: during the throughput benchmark, its P99 latency is similar to TCP’s.

This paper describes Microsoft’s experience building a lossless network with commodity Ethernet, along with the many abstruse bugs they ran into. I especially love the experiment showing how dropping 1 in 256 packets can bring the goodput down to zero. To me, losslessness feels like a very rigid property. If one of the servers or switches got hacked and ignored the normal PFC and DCQCN procedures, would it take down the whole cluster? Even without considering attacks, I wouldn’t rule out hardware/software malfunctions in a datacenter; they have happened several times in our lab. One incident was triggered by a Hackintosh XNU kernel panic, and the rest were caused by a file system module under development on the Linux kernel.