PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric
This paper considers the requirements for a scalable, easily manageable, fault-tolerant, and efficient data center network fabric. Trends in multi-core processors, end-host virtualization, and commodities of scale are pointing to future single-site data centers with millions of virtual end points. Existing layer 2 and layer 3 network protocols face some combination of limitations in such a setting: lack of scalability, difficult management, inflexible communication, or limited support for virtual machine migration. To some extent, these limitations may be inherent for Ethernet/IP style protocols when trying to support arbitrary topologies. We observe that data center networks are often managed as a single logical network fabric with a known baseline topology and growth model. We leverage this observation in the design and implementation of PortLand, a scalable, fault tolerant layer 2 routing and forwarding protocol for data center environments. Through our implementation and evaluation, we show that PortLand holds promise for supporting a "plug-and-play" large-scale, data center network.
| Attachment | Size |
|---|---|
| p39.pdf | 2.19 MB |
Summary/Comments
The objective of PortLand is to scale layer 2 switching to the modern data center requirements of millions of virtual machines, which need seamless migration capabilities within said data center. Current layer 2 architectures cannot scale to that size due to limited switch state for forwarding tables, performance limitations caused by the spanning tree protocol and the ARP broadcast overhead.
The main idea behind PortLand is the locator/identifier split: nodes are identified by their actual MAC (AMAC) address, and located by their pseudo MAC (PMAC) address, which encodes hierarchical location information in its structure. Mapping between the two address spaces is done by the edge switches, the ones to which the nodes are directly connected. They perform AMAC-to-PMAC and PMAC-to-AMAC rewriting for outgoing and incoming traffic respectively.
A centralized fabric manager keeps all state related to IP-to-PMAC mappings and responds to ARP requests intercepted by edge switches. The responses contain the PMAC of the destination node, and forwarding is done based on this "locator", which pinpoints the node's position in the fat tree topology, chosen by the authors. Position information may be manually set by the administrator, but PortLand includes a location discovery protocol (LDP) to do it automatically (the design goal of PortLand is plug-and-play functionality).
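The ARP path described above can be sketched roughly as follows. This is only a toy model: the class and function names (`FabricManager`, `handle_arp_request`) and the string return values are invented for illustration and are not taken from the PortLand implementation. Falling back to a broadcast when no mapping is known follows the paper's description.

```python
from typing import Dict, Optional


class FabricManager:
    """Toy stand-in for PortLand's fabric manager (hypothetical API)."""

    def __init__(self) -> None:
        # Soft state: IP address -> PMAC, as reported by edge switches.
        self._ip_to_pmac: Dict[str, str] = {}

    def register(self, ip: str, pmac: str) -> None:
        # An edge switch reports a newly seen host and its PMAC.
        self._ip_to_pmac[ip] = pmac

    def resolve(self, ip: str) -> Optional[str]:
        # None means "unknown"; the caller has to fall back to broadcast.
        return self._ip_to_pmac.get(ip)


def handle_arp_request(fm: FabricManager, target_ip: str) -> str:
    """What an edge switch might do with an intercepted ARP request."""
    pmac = fm.resolve(target_ip)
    if pmac is None:
        # No mapping yet: the request is broadcast as in plain Ethernet.
        return "broadcast"
    # Known mapping: answer the host directly with the destination's PMAC.
    return f"reply {pmac}"
```

In this sketch the host never sees an AMAC of a remote machine, only the PMAC returned by the fabric manager, which is what makes location-based forwarding possible.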
Spanning tree is avoided so that all available paths can carry traffic; the authors assume ECMP-style flow hashing to spread flows across them. Since switches know their positions in the hierarchy, loop-free forwarding can be accomplished without spanning tree.
PortLand can run on currently available switches without hardware modifications, requiring only a software upgrade. From the host software's point of view, the result is one big 100,000-port switch: the fabric is transparent and no explicit host support is needed. In contrast, VL2, the next proposal in the session, takes a different approach by modifying host software and leaving switches intact.
The paper is well written and easy to read, and the ideas are presented in a logical order. While reading, most of the time when I spotted an issue with the design, the next paragraph described a solution or workaround, or explained why it is not a problem. The work is very interesting and addresses an important problem.
One of the concerns I have is related to security: if the machines in the data center are not owned by a single entity and customers may have full control over their nodes, ARP poisoning attacks can be a serious threat which should be addressed. I would have also liked to have a study of packet losses during the virtual machine migration in the evaluation section.
The evaluation is limited to a small testbed, which is understandable, but some of the results obtained may change significantly in a large testbed. The switches used are actually PCs with NetFPGA cards; it would have been nice to see a testbed with off-the-shelf switches, as the design goal states, to confirm that this is actually feasible.
interesting paper
This is one of my favourite papers for SIGCOMM this year.
The paper proposes a scalable layer 2 protocol designed for data center environments.
The idea is quite simple: it leverages specific knowledge of the baseline topology and growth model of data center networks to assign pseudo MAC addresses to servers based on their location in the topology.
The pseudo addresses internally encode topological information of the server and this makes it simple to route traffic in the data center fabric.
In addition, the size of forwarding tables at each switch is smaller and scales better than traditional solutions.
The big underlying assumption is that data center networks are built as fat tree topologies.
Such a topology is divided into three layers: edge, aggregation and core.
This hierarchy is chosen as each switch relies upon it to discover its own location in the network: a location discovery protocol is introduced for that purpose and it is shown to support growing the network in a plug-and-play fashion.
Edge and aggregation switches are connected with some level of redundancy and are grouped into pods. Each pod then connects to each core switch.
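To get a feel for how such a topology grows, here is a small sketch that computes the element counts of a fat tree built from identical k-port switches. The formulas are those of the standard k-ary fat tree construction, which is an assumption on my part: the review above does not spell out the exact construction, and `fat_tree_sizes` is a made-up helper name.

```python
from typing import Dict


def fat_tree_sizes(k: int) -> Dict[str, int]:
    """Element counts for a standard k-ary fat tree of k-port switches."""
    if k <= 0 or k % 2 != 0:
        raise ValueError("k must be a positive even port count")
    half = k // 2
    return {
        "pods": k,                        # one pod per core-switch port
        "edge_switches": k * half,        # k/2 edge switches per pod
        "aggregation_switches": k * half, # k/2 aggregation switches per pod
        "core_switches": half * half,     # (k/2)^2 core switches
        "hosts": k * half * half,         # k/2 hosts per edge switch
    }


if __name__ == "__main__":
    for ports in (4, 48):
        print(ports, fat_tree_sizes(ports))
```

Under these assumptions, 4-port switches give 4 pods, 20 switches in total and 16 hosts, while 48-port switches give 27,648 hosts.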
Edge switches are responsible for translating MAC addresses on the fly into positional pseudo MAC addresses.
Finally, the topology is provably loop free and therefore a layer 2 TTL is not needed.
There is a central Fabric Manager that is responsible for tracking each correspondence of IP to pseudo MAC address. The switches intercept ARP requests and use the Fabric Manager to resolve the IP addresses.
They have an implementation of PortLand based on OpenFlow and NetFPGAs. This is very cool and must have taken a lot of work. The evaluation is fair although it is based on their small scale data center network.
One of the major drawbacks of the paper is that everything assumes one basic topology and it's not clear if this is the best topology or if it will be best for future data centers.
Further, the design has one central component, the Fabric Manager, that is critical to the correct functioning of the entire network. It is suggested that traditional ARP can be used as a fallback mechanism in case the Fabric Manager fails. However, cache pollution or simply malfunctioning of the address resolution in the Fabric Manager will bring the network to its knees.
Something I felt is missing in the paper is a discussion of potential benefits for energy efficiency. Power consumption has become such an important factor that new data center designs should incorporate it from day one rather than trying to fit it in retrospectively.

PortLand Review
PortLand is a scalable Ethernet-like layer 2 routing and forwarding protocol (similar to SEATTLE [SIGCOMM'08]) for data centers with three-tiered hierarchical topologies (core-aggregation-edge). It has the ease of use, simplicity of management, mobility support, and plug-and-play functionality of Ethernet, while overcoming its lack of scalability (due to limited MAC address table size in switches, broadcast, etc.), fault tolerance, and multipath forwarding (both due to the formation of a single spanning tree). PortLand achieves these properties by just modifying the control plane of the network, leaving the switch hardware and end hosts untouched.
PortLand guarantees loop-free forwarding without relying on spanning trees (which significantly decrease the bisection bandwidth) by constraining switches never to forward a packet upward in the hierarchy if it was received from an upper switch. This constraint is stateless and local to each switch, and is therefore easy to implement. PortLand enables multipath routing, but the choice of how to do it is orthogonal to the work; the authors assume that a standard technique such as flow hashing works fine.
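The up/down constraint and flow hashing can be illustrated with a short sketch. Everything here is hypothetical: the function names, the port labels, and the use of CRC32 as the hash are my own choices for illustration, not what PortLand or any particular switch actually does.

```python
import zlib
from typing import List, Optional, Tuple

# Hypothetical flow identifier: (src, dst, src_port, dst_port).
FlowKey = Tuple[str, str, int, int]


def pick_uplink(flow: FlowKey, uplinks: List[str]) -> str:
    """Hash the flow so that all of its packets take the same uplink."""
    digest = zlib.crc32(repr(flow).encode("utf-8"))
    return uplinks[digest % len(uplinks)]


def next_hop(
    came_from_above: bool,
    downlink: Optional[str],
    flow: FlowKey,
    uplinks: List[str],
) -> Optional[str]:
    """Return the output port, or None if the packet must be dropped.

    `downlink` is the port leading down toward the destination, if this
    switch has one; `came_from_above` says whether the packet arrived
    from a switch higher in the hierarchy.
    """
    if downlink is not None:
        return downlink  # going down is always allowed
    if came_from_above or not uplinks:
        return None  # never turn a descending packet back upward
    return pick_uplink(flow, uplinks)
```

Because a packet can only go up and then down, it can never revisit a switch, and the check needs nothing beyond the ingress direction of the packet.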
PortLand limits the forwarding table size in individual switches. Each host is assigned a pseudo MAC address (PMAC) which represents the host's location (in the form pod.position.port.vmid) in the hierarchical topology. A logically centralized fabric manager keeps track of PMAC-to-IP address mappings, and edge switches do the same for AMAC (actual MAC) to PMAC mappings. Switches are in charge of discovering their own locations using PortLand's location discovery protocol, and the edge switches construct PMAC addresses for newly seen AMAC addresses. ARP requests are intercepted at the edge switches and sent to the fabric manager. If the fabric manager has the mapping for the requested IP address, the corresponding PMAC is returned; otherwise the ARP request is broadcast.
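As a rough illustration of the pod.position.port.vmid layout, the sketch below packs the four fields into a 48-bit address and unpacks them again. The field widths (16, 8, 8 and 16 bits) follow the paper's description rather than anything stated in this review, and the `Pmac`, `encode_pmac` and `decode_pmac` names are invented for the example.

```python
from typing import NamedTuple


class Pmac(NamedTuple):
    pod: int       # 16 bits
    position: int  # 8 bits: edge switch position within the pod
    port: int      # 8 bits: switch port the host is attached to
    vmid: int      # 16 bits: virtual machine behind that port


def encode_pmac(p: Pmac) -> str:
    """Pack the four fields into a colon-separated 48-bit MAC string."""
    limits = {"pod": 0xFFFF, "position": 0xFF, "port": 0xFF, "vmid": 0xFFFF}
    for name, limit in limits.items():
        if not 0 <= getattr(p, name) <= limit:
            raise ValueError(f"{name} out of range")
    value = (p.pod << 32) | (p.position << 24) | (p.port << 16) | p.vmid
    return ":".join(f"{b:02x}" for b in value.to_bytes(6, "big"))


def decode_pmac(text: str) -> Pmac:
    """Recover the location fields from a PMAC string."""
    raw = bytes.fromhex(text.replace(":", ""))
    if len(raw) != 6:
        raise ValueError("a MAC address has exactly 6 bytes")
    value = int.from_bytes(raw, "big")
    return Pmac(
        pod=value >> 32,
        position=(value >> 24) & 0xFF,
        port=(value >> 16) & 0xFF,
        vmid=value & 0xFFFF,
    )
```

For example, `encode_pmac(Pmac(pod=0, position=1, port=2, vmid=1))` gives `00:00:01:02:00:01`. A switch can then forward on a prefix of the address (the pod, or the pod and position) instead of keeping one entry per host, which is where the smaller forwarding tables come from.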
The fabric manager is a central component of PortLand. Even though it maintains only soft state, its failure may result in many ARP requests in a short period of time, which in turn can potentially bring a network with that many hosts sharing the same broadcast domain to its knees. There is also an implicit assumption of trust in the end hosts, because otherwise malicious virtual machines can query for non-existent IP addresses to generate broadcast traffic. The paper also does not expand on how to distribute the fabric manager or how to suppress non-ARP broadcast traffic.
The other problem is that PortLand assumes that once a VM migrates, switches redirect traffic destined for the old PMAC address to the new PMAC address. This is a mandatory feature, since the end hosts send traffic to the old PMAC address until the corresponding ARP table entry times out. This functionality increases the amount of state that needs to be stored in switches. Also, if such a feature is present (as in the OpenFlow case), why shouldn't we do the same for IP addresses rather than PMAC addresses?