Graduate Networks, UCSD

CSE222 – Spring 2009

A Scalable, Commodity Data Center Network Architecture May 7, 2009

This paper describes a new method for interconnecting the host computers in a data center network using a fat tree topology and a variety of routing methodologies. This provides much greater aggregate bandwidth than the traditional hierarchical approach.

The main points of this paper were as follows:

  1. The common data center topology today consists of core, aggregation, and edge switches. The problem is that the aggregate bandwidth of what is fundamentally a tree structure is limited by the bandwidth of the root node. This was less of a problem in the past, since it was possible to get core switches fast enough to handle the aggregate bandwidth requirements of the hosts, albeit at great cost (expensive, specialized hardware). However, with the push toward 10 GbE hardware, building a root node which can handle the aggregate bandwidth of thousands of 10 GbE nodes is simply untenable at any cost. Therefore, we need a new approach which can scale to the 10 GbE range. The authors propose a fat tree topology built from simple commodity Ethernet switches (48-port GigE switches in the paper's examples). This provides almost 100% of the ideal bisection bandwidth, while reducing power and cost. It reflects the current trend of aggregating lots of commodity hardware to provide high performance instead of relying on a few extremely high-performance systems.
  2. One challenge with the fat tree approach is that, as opposed to the single tree topology previously used, there are multiple possible shortest paths between end hosts. This means that traditional routing protocols are insufficient to allow the network to take full advantage of the bisection bandwidth, since they cannot adapt to the traffic patterns present. To solve this problem, the paper presents three different approaches: a two-level static routing table, local flow classification, and global flow scheduling. As we saw in the previous paper, it turns out that managing flows in a global manner provides the greatest performance, since local techniques can cause bottlenecks “beyond the horizon” visible to the local switches. In addition, the power of modern computing hardware means that one centralized node can easily handle the routing requirements of the authors' 27k-host example network.
  3. Another advantage of using commodity hardware is that it fits very easily into the current Internet architecture (Ethernet, IP, TCP). This is in contrast to specialized products like InfiniBand, which can provide the bandwidth of this approach (incidentally, using a similar fat-tree-based architecture) but use completely different layer 1-4 protocols, making it difficult to map back to the Ethernet/IP/TCP design that we want. Some of the specialized products also require far too much wiring complexity (e.g. the torus found in BlueGene/L and the Cray XT3). The paper addresses wiring in detail, and its design reduces the complexity considerably by allowing cables to be grouped, which makes interconnect runs simpler and more efficient.
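
To make the “27k-host” figure concrete, here is a small sizing sketch for a three-level fat tree built from identical k-port switches. It uses the standard fat-tree counts (k pods, k/2 edge and k/2 aggregation switches per pod, (k/2)^2 core switches, k^3/4 hosts); the function name and the dictionary keys are my own choices for illustration, not something taken from the paper.

```python
def fat_tree_size(k: int) -> dict:
    """Element counts for a three-level fat tree built from k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be a positive even port count")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,         # k/2 per pod
        "aggregation_switches": k * half,  # k/2 per pod
        "core_switches": half * half,      # (k/2)^2
        "hosts": k * half * half,          # k^3/4
        # Each core switch gives one distinct shortest path between two
        # hosts that sit in different pods.
        "inter_pod_shortest_paths": half * half,
    }


size = fat_tree_size(48)
print(size["hosts"])          # 27648
print(size["core_switches"])  # 576
```

With 48-port switches this gives 27,648 hosts and 576 equal-cost paths between hosts in different pods, which is why path selection matters so much in point 2.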

One weakness of this paper is that the authors did not discuss how to replicate or distribute the central controller. When other parts of the system were so meticulously detailed (e.g. the section on how to wire these pods together in a data center), just saying “possibly replicated” seemed like a weak statement, especially when presenting theoretical results that depend on such a controller. It seems likely that any data center with 27k nodes is going to want some kind of failover or replication support for a centralized controller. While this probably will not affect performance, it might, and that would make for some interesting experimental results.

Future research in this field might want to focus on a way to distribute the task of scheduling flows throughout the network. While we have recently seen the power of a centralized approach, one of the benefits of using a 27k-node cluster over a small number of supercomputers is that it is extremely fault-tolerant. This is a necessity, since commodity hardware tends to have higher failure rates (due to its significantly lower cost) than the more complex and expensive hardware. It would be interesting to see this fault-tolerance implemented both in the flow management (whether it be a distributed method or a local method) as well as in the routing protocol (which is allowed by the dynamic routing protocols presented in the paper).

 

A Scalable, Commodity Data Center Network Architecture May 7, 2009

Three Important Things:

  • Inter-node communication bandwidth represents a significant bottleneck in today’s data center networks. Links are often oversubscribed, so an arbitrary pair of communicating hosts cannot count on peak bandwidth. The presented application of a fat tree topology addresses this critical issue. The topology is able to provide a 1:1 subscription ratio while leveraging identical commodity switching elements which are fully compatible with Ethernet and IP. This is an attractive alternative to large, expensive high-end switches and to specialized interconnects such as InfiniBand. Fault tolerance is another added benefit of the fat tree, due to the redundancy of available paths between any pair of hosts.
  • Maximum link utilization cannot be achieved by mere virtue of the fat tree topology. To ensure proper bandwidth allocation, a central scheduler is proposed. Its task is to monitor large flows and assign them non-conflicting paths. It accomplishes this with a linear scan over the core switches, which allows it to compute an unloaded path for the flow. With simple modifications, the switching elements first choose a path locally and use it until the scheduler notifies them of the globally optimal forwarding scheme.
  • The paper presents a set of impressive results in the areas of bandwidth utilization, heat and power consumption, and cost. The experimental setup with a central flow scheduler is able to achieve 93% worst case bandwidth utilization, as opposed to 28% for a traditional tree topology. Using commodity switching elements that are more efficient per Gbps of bandwidth leads to a 50% reduction in power consumption and heat generation, implying operational cost savings on server operation and cooling units. Initial capital investment scales well with the number of hosts, even past the point where there would be no possible solution with specialized switches and link oversubscription.
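
The linear scan over core switches described in the second point can be sketched roughly as follows. This is a simplified model under my own assumptions, not the paper's implementation: a path is identified only by the core switch it crosses, core switch c is assumed to attach to aggregation switch c // (k/2) in every pod, each link carries at most one scheduled flow, and the names `path_links` and `FlowScheduler` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Link = Tuple[str, str]
Endpoint = Tuple[int, int]  # (pod, edge switch index within the pod)


def path_links(k: int, core: int, src: Endpoint, dst: Endpoint) -> List[Link]:
    """Links crossed by an inter-pod path that goes through `core`."""
    agg = core // (k // 2)  # assumed wiring: which aggregation switch this core reaches
    (sp, se), (dp, de) = src, dst
    return [
        (f"edge{sp}.{se}", f"agg{sp}.{agg}"),
        (f"agg{sp}.{agg}", f"core{core}"),
        (f"core{core}", f"agg{dp}.{agg}"),
        (f"agg{dp}.{agg}", f"edge{dp}.{de}"),
    ]


class FlowScheduler:
    """Central scheduler that reserves a conflict-free path per large flow."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.reserved: Set[Link] = set()
        self.placed: Dict[str, List[Link]] = {}

    def place(self, flow_id: str, src: Endpoint, dst: Endpoint) -> Optional[int]:
        # Linear scan: take the first core switch whose path has no reserved link.
        for core in range((self.k // 2) ** 2):
            links = path_links(self.k, core, src, dst)
            if not any(link in self.reserved for link in links):
                self.reserved.update(links)
                self.placed[flow_id] = links
                return core
        return None  # no unloaded path; the flow keeps its locally chosen route

    def release(self, flow_id: str) -> None:
        # Free the links of a flow that has finished.
        self.reserved.difference_update(self.placed.pop(flow_id))


sched = FlowScheduler(k=4)
print(sched.place("A", (0, 0), (1, 0)))  # 0
print(sched.place("B", (0, 0), (2, 0)))  # 2 (cores 0 and 1 share A's first hop)
```

Even in this toy form, the scheduler needs a view of every reserved link in the network, which is exactly the global knowledge that local flow classification lacks.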

Glaring Problem:

An acknowledged major concern with the fat tree topology is the cabling requirement of the “fat” links from the core to the pods. The paper proposes a packaging solution that seeks to minimize the amount of cabling, but it still seems like the cost and weight of these cables would significantly hinder the deployment of such a topology (specifically for a 20k+ node cluster). There is no mention of these cost/weight numbers, or of alternative cabling solutions such as optical multiplexing.
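
For a sense of scale, the number of cables can be estimated from the topology alone. The sketch below only counts links in a k-port fat tree (each of the three layers of links has k^3/4 cables); it says nothing about cable length, weight, or cost, which are the numbers I would have liked to see. The function name and keys are my own.

```python
def fat_tree_cables(k: int) -> dict:
    """Cable counts per layer for a three-level fat tree of k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be a positive even port count")
    per_layer = k ** 3 // 4  # same count for every layer of links
    return {
        "host_to_edge": per_layer,
        "edge_to_aggregation": per_layer,
        "aggregation_to_core": per_layer,   # the inter-pod "fat" links
        "core_links_per_pod": k * k // 4,   # uplinks leaving a single pod
        "total": 3 * per_layer,
    }


cables = fat_tree_cables(48)
print(cables["aggregation_to_core"])  # 27648
print(cables["core_links_per_pod"])   # 576
```

For 48-port switches that is 27,648 cables between the pods and the core, 576 leaving each pod, which is the bundle the packaging scheme has to tame.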

Future Work:

Given that the best case bandwidth is achieved using global knowledge of all packet flows, the current solution relies on a centralized controller to oversee the network. There are some drawbacks to this approach (single point of failure, scalability concerns), so it would be worth exploring other ways to perform optimal routing negotiations, perhaps in a distributed manner across the core routing elements.

 

A Scalable, Commodity Data Center Network Architecture May 7, 2009

(i) the three most important things the paper says

One of the most important ideas that this paper brings to light is that, at the time it was written, a very-large-scale network with a 1:1 oversubscription ratio was much cheaper to implement with a fat-tree design than with traditional data center network architectures.  Although the paper does not specify what percentage of larger data centers actually employ less efficient or more expensive architectures than the one proposed, it can be assumed that the problem is of sizable magnitude.  Another important idea is that a static approach to switching paths in the fat-tree architecture (and perhaps other architectures) is less efficient than what can be achieved for a minimal increase in cost.  This is shown by the higher efficiency of the more dynamic protocols compared with the static assignment of port forwarding.  The third most important idea is that the fat-tree beats the traditional architecture (and perhaps some of the more complicated, non-fat-tree designs that aren’t specifically detailed here) in yet another metric: power usage and heat dissipation.  Since the fat-tree design detailed here sticks to only GigE switches, the power usage per Gbps was shown to be significantly lower (when compared to using 10Gbps switches to reach the same aggregate bandwidth marks).  For a large-scale data center, this should be a huge argument for switching to this architecture.
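
To illustrate what the static assignment of port forwarding looks like, here is a minimal sketch of a two-level lookup: a destination is first matched against prefixes, and a prefix that does not name a port falls through to a match on the host byte of the address. The class and method names are hypothetical, the single shared suffix table is a simplification, and the addresses and port numbers are only illustrative of a k = 4 pod switch.

```python
from ipaddress import IPv4Address, IPv4Network
from typing import Dict, List, Optional, Tuple


class TwoLevelTable:
    """Prefix match first; non-terminating prefixes defer to a suffix match."""

    def __init__(self) -> None:
        self.prefixes: List[Tuple[IPv4Network, Optional[int]]] = []
        self.suffixes: Dict[int, int] = {}  # host byte -> output port

    def add_prefix(self, cidr: str, port: Optional[int] = None) -> None:
        # port=None marks a non-terminating prefix that defers to the suffixes.
        self.prefixes.append((IPv4Network(cidr), port))
        # Keep longest prefixes first so the first hit is the longest match.
        self.prefixes.sort(key=lambda entry: entry[0].prefixlen, reverse=True)

    def add_suffix(self, host_byte: int, port: int) -> None:
        self.suffixes[host_byte] = port

    def lookup(self, dst: str) -> int:
        addr = IPv4Address(dst)
        for network, port in self.prefixes:
            if addr in network:
                if port is not None:
                    return port  # terminating prefix: traffic staying in the pod
                host_byte = int(addr) & 0xFF
                if host_byte in self.suffixes:
                    return self.suffixes[host_byte]  # spreads inter-pod traffic
                break
        raise LookupError(f"no route for {dst}")


table = TwoLevelTable()
table.add_prefix("10.2.0.0/24", port=0)  # subnets inside this pod
table.add_prefix("10.2.1.0/24", port=1)
table.add_prefix("0.0.0.0/0")            # everything else: use the host byte
table.add_suffix(2, port=2)
table.add_suffix(3, port=3)

print(table.lookup("10.2.1.3"))  # 1
print(table.lookup("10.3.0.2"))  # 2
print(table.lookup("10.3.0.3"))  # 3
```

Because the choice of uplink depends only on the destination address, two large flows can still collide on the same uplink, which is the inefficiency the dynamic schemes are meant to remove.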

(ii) the most glaring problem with the paper

The biggest problem that I found in the paper was that many advanced offerings from other companies were mentioned in the Introduction and Related Work sections, yet I was a little surprised not to find any benchmark numbers for them (or any direct comparisons with these systems) anywhere in the paper.  I feel that the paper could have been much stronger had the authors shown aggregate and per-link bandwidth comparisons between their system and the other commercial systems out there.  I say this with the understanding that many companies will not allow this type of testing (at least not without physically buying the actual solution, a prohibitively expensive research endeavor), but I still think that it would have been very useful to provide some concrete numbers to supplement the other advantages of the system described in the paper (including the advantage of this system being less proprietary).

(iii) the future research directions of the work.

To start, I feel that this system should be compared with some of the more proprietary systems on a whole host of different metrics (cost, power, heat dissipation, complexity, aggregate bandwidth under different loads, etc.) in order to demonstrate its superiority (I repeated this here, even though I mentioned it in ii).  Another good future research direction would be a case study of this design in a real world example, with actual hardware running under a realistic data center load.  Also, it would be interesting to see if this design would still be power efficient (and still retain its benefits) when running with either a subset of 10Gbps switches or an entire design using 10Gbps switches.