This paper describes a new method for interconnecting the host computers in a data center network using a fat tree topology and a variety of routing methodologies. This provides much greater aggregate bandwidth than the traditional hierarchical approach.
The main points of this paper were as follows:
- The common data center topology today consists of core, aggregation, and edge switches. Because this is fundamentally a tree, its aggregate bandwidth is limited by the bandwidth of the root node. This was less of a problem in the past, since core switches fast enough to handle the hosts' aggregate bandwidth were available, albeit only as expensive, specialized hardware. However, with the push toward 10 GbE hardware, building a root node that can handle the aggregate bandwidth of thousands of 10 GbE nodes is untenable at any cost, so a new approach is needed that can scale to the 10 GbE range. The authors propose a fat tree topology built from simple commodity Ethernet switches. This provides almost 100% of the ideal bisection bandwidth while reducing power and cost. It also reflects the current trend of aggregating large amounts of commodity hardware to provide high performance instead of relying on a few extremely high-performance systems.
- One challenge with the fat tree approach is that, unlike the single tree topology used previously, there are multiple shortest paths between end hosts. Traditional routing protocols therefore cannot let the network take full advantage of the bisection bandwidth, since they do not adapt to the traffic patterns present. To solve this problem, the paper presents three approaches: a two-level static routing table, local flow classification, and global flow scheduling. As we saw in the previous paper, managing flows globally provides the greatest performance, since local techniques can cause bottlenecks “beyond the horizon” visible to the local switches. In addition, modern computing hardware is powerful enough that one centralized node can easily handle the routing requirements of the paper's ideal 27k-host network.
- Another advantage of using commodity hardware is that it fits easily into the current Internet architecture (Ethernet, IP, TCP). This is in contrast to specialized products like InfiniBand, which can provide comparable bandwidth (incidentally, using a similar fat-tree-based architecture) but uses completely different layer 1-4 protocols, making it difficult to map back to the Ethernet/IP/TCP design that we want. Some specialized products also require far too much wiring complexity (e.g. the torus found in BlueGene/L and the Cray XT3). The paper addresses wiring in detail, and its design reduces this complexity considerably by allowing cables to be grouped, which makes interconnect runs simpler and more efficient.
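To make the scale in the points above concrete, here is a minimal sketch of the arithmetic behind a three-level fat tree built from identical k-port switches. It assumes the standard construction (k pods, each with k/2 edge and k/2 aggregation switches, plus (k/2)^2 core switches); the function and key names are my own and are not taken from the paper.

```python
def fat_tree_size(k: int) -> dict:
    """Element counts for a three-level fat tree of identical k-port switches.

    Assumes the standard construction: k pods, k/2 edge and k/2 aggregation
    switches per pod, k/2 hosts per edge switch, and (k/2)^2 core switches.
    All names here are illustrative.
    """
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even integer >= 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,
        "aggregation_switches": k * half,
        "core_switches": half * half,
        # k pods * (k/2 edge switches per pod) * (k/2 hosts per edge switch)
        "hosts": k * half * half,
        # One shortest path per core switch between hosts in different pods.
        "inter_pod_shortest_paths": half * half,
    }


if __name__ == "__main__":
    for name, value in fat_tree_size(48).items():
        print(f"{name}: {value}")
```

With k = 48 this gives 27,648 hosts, which matches the 27k-host network mentioned above. The same parameter also fixes the number of equal-length paths between pods, which is why the choice of routing scheme matters so much in this topology.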
One weakness of this paper is that the authors do not discuss how to replicate or distribute the centralized controller. When other parts of the system are so meticulously detailed (e.g. the section on how to wire the pods together in a data center), saying only “possibly replicated” is a weak statement, especially when presenting theoretical results that depend on such a controller. Any data center with 27k nodes is likely to want some kind of failover or replication support for a centralized controller. This probably would not affect performance, but it might, and measuring it would make for some interesting experimental results.
Future research in this field might focus on ways to distribute the task of scheduling flows throughout the network. While we have recently seen the power of a centralized approach, one of the benefits of using a 27k-node cluster over a small number of supercomputers is that it can be extremely fault-tolerant. This is a necessity, since commodity hardware tends to have higher failure rates (due to its significantly lower cost) than more complex and expensive hardware. It would be interesting to see this fault tolerance implemented both in the flow management (whether distributed or local) and in the routing protocol (which the dynamic routing protocols presented in the paper allow).