Graduate Networks, UCSD

CSE222 – Spring 2009

A Scalable, Commodity, Data Center Network Architecture May 7, 2009

This paper describes a new method for interconnecting the host computers in a data center network using a fat tree topology and a variety of routing methodologies. This provides much greater aggregate bandwidth than the traditional hierarchical approach.

The main points of this paper were as follows:

  1. The current common data center topology consists of core, aggregation, and edge switches. The problem is that the aggregate bandwidth of what is fundamentally a tree structure is limited by the bandwidth of the root node. This was less of a problem in the past, since it was possible to buy core switches fast enough to handle the aggregate bandwidth requirements of the hosts, albeit at great cost (expensive, specialized hardware). However, with the push toward 10 GbE hardware, building a root node that can handle the aggregate bandwidth of thousands of 10 GbE nodes is untenable at any cost. We therefore need a new approach that can scale to the 10 GbE range. The authors propose a fat tree topology built from simple commodity Ethernet switches. This provides almost 100% of the ideal bisection bandwidth while reducing power and cost. It reflects the current trend of aggregating lots of commodity hardware to provide high performance instead of using just a few extremely high-performance systems.
  2. One challenge with the fat tree approach is that, as opposed to the single tree topology previously used, there are multiple possible shortest paths between end hosts. This means that traditional routing protocols are insufficient to allow the network to take full advantage of the bisection bandwidth, since they cannot adapt to the traffic patterns present. To solve this problem, the paper presents three different approaches: a two-level static routing table, local flow classification, and global flow scheduling. As we saw in the previous paper, managing flows globally provides the greatest performance, since local techniques can cause bottlenecks “beyond the horizon” visible to the local switches. In addition, the power of modern computing hardware means that one centralized node can easily handle the routing requirements of their ideal 27k-host network.
  3. Another advantage of using commodity hardware is that it fits very easily into the current Internet architecture (Ethernet, IP, TCP). This is in contrast to specialized products like InfiniBand, which can provide comparable bandwidth (incidentally, using a similar fat-tree-based architecture) but use completely different layer 1-4 protocols, making it difficult to map back to the Ethernet/IP/TCP design that we want. Some of the specialized products also require far too much wiring complexity (e.g. the torus found in BlueGene/L and the Cray XT3). The paper addresses wiring in detail, and its design reduces the complexity considerably by allowing cables to be grouped, which makes interconnect runs simpler and more efficient.
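
As a rough sanity check on the 27k-host figure and on the "multiple shortest paths" point, the sketch below counts the elements of a three-level fat tree. It assumes the usual k-ary construction from identical k-port switches: k pods, k/2 edge and k/2 aggregation switches per pod, and (k/2)^2 core switches. The function and field names are my own, not the paper's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FatTreeSize:
    k: int                     # ports per switch
    pods: int
    edge_switches: int
    aggregation_switches: int
    core_switches: int
    hosts: int
    inter_pod_paths: int       # equal-cost shortest paths between hosts in different pods


def fat_tree_size(k: int) -> FatTreeSize:
    """Element counts for a three-level fat tree built from k-port switches."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even port count of at least 2")
    half = k // 2
    return FatTreeSize(
        k=k,
        pods=k,
        edge_switches=k * half,         # k/2 per pod
        aggregation_switches=k * half,  # k/2 per pod
        core_switches=half * half,
        hosts=k * half * half,          # k^3 / 4
        inter_pod_paths=half * half,    # one path through each core switch
    )


if __name__ == "__main__":
    size = fat_tree_size(48)
    print(size.hosts)            # 27648
    print(size.inter_pod_paths)  # 576
```

With 48-port switches this gives 27,648 hosts, which is where the 27k figure comes from. The 576 equal-cost paths between pods are exactly what a single-path routing protocol cannot exploit.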

One weakness of this paper is that the authors did not discuss how to replicate or distribute the centralized controller. When other parts of the system were so meticulously detailed (e.g. the section on how to wire these pods together in a data center), just saying “possibly replicated” seemed like a weak statement, especially when presenting theoretical results that depend on such a controller. Any data center with 27k nodes is likely to want some kind of failover or replication support for a centralized controller. This probably would not affect performance, but it might, and measuring it would make for some interesting experimental results.

Future research in this field might want to focus on a way to distribute the task of scheduling flows throughout the network. While we have recently seen the power of a centralized approach, one of the benefits of using a 27k-node cluster over a small number of supercomputers is that it is extremely fault-tolerant. This is a necessity, since commodity hardware tends to have higher failure rates (due to its significantly lower cost) than the more complex and expensive hardware. It would be interesting to see this fault-tolerance implemented both in the flow management (whether it be a distributed method or a local method) as well as in the routing protocol (which is allowed by the dynamic routing protocols presented in the paper).

 

A Scalable, Commodity, Data Center Network Architecture May 7, 2009

(i) the three most important things the paper says

One of the most important ideas that this paper brings to light is that, at the time it was written, a very-large-scale network with a 1:1 oversubscription ratio was much cheaper to implement with a fat-tree design than with traditional data center network architectures. The paper does not specify what percentage of larger data centers actually employ architectures less efficient or more expensive than the one proposed, but it seems reasonable to assume that the problem is sizable. Another important idea is that a static approach to selecting paths in the fat-tree architecture (and perhaps in other architectures) leaves performance on the table that can be recovered for a minimal increase in cost. This is shown by the higher efficiency of the more dynamic protocols compared with the static assignment of port forwarding. The third most important idea is that the fat tree beats the traditional architecture (and perhaps also some of the more complicated, non-fat-tree designs that are not detailed here) in yet another metric: power usage and heat dissipation. Since the fat-tree design detailed here sticks to only GigE switches, the power usage per Gbps was shown to be significantly lower than when 10Gbps switches are used to reach the same aggregate bandwidth. For a large-scale data center, this should be a huge argument for switching to this architecture.
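
To make the "1:1 oversubscription" term concrete, here is a minimal sketch of how the ratio can be computed at a single switch, assuming it is simply host-facing capacity divided by uplink capacity. The function names and the port counts in the example are hypothetical and are not taken from the paper.

```python
from fractions import Fraction


def oversubscription(down_ports: int, down_gbps: int,
                     up_ports: int, up_gbps: int) -> Fraction:
    """Host-facing capacity divided by uplink capacity (1 means 1:1)."""
    if min(down_ports, down_gbps, up_ports, up_gbps) <= 0:
        raise ValueError("port counts and speeds must be positive")
    return Fraction(down_ports * down_gbps, up_ports * up_gbps)


def worst_case_host_share(ratio: Fraction) -> Fraction:
    """Share of each host's bandwidth usable when every host sends upward at once."""
    return min(Fraction(1), 1 / ratio)


if __name__ == "__main__":
    # Fat-tree edge switch with 48 ports: half face hosts, half face upward.
    print(oversubscription(24, 1, 24, 1))   # 1
    # Hypothetical oversubscribed switch: 20 host ports, 4 uplinks, all 1 Gbps.
    ratio = oversubscription(20, 1, 4, 1)
    print(ratio, worst_case_host_share(ratio))  # 5 1/5
```

A ratio of 5:1 means that only 20% of the host bandwidth is available for some communication patterns, which is the gap the fat-tree design closes.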

(ii) the most glaring problem with the paper

The biggest problem that I found in the paper was that the Introduction and Related Work sections mention many advanced offerings from other companies, yet the paper never gives benchmark numbers for them or compares its system with them directly. I feel that the paper could have been much stronger had the authors compared aggregate and per-link bandwidth between their system and the other commercial systems out there. I understand that many companies will not allow this type of testing (at least not without physically buying the actual solution, a prohibitively expensive research endeavor), but I still think that some concrete numbers would have been a very useful supplement to the other advantages of the system described in the paper (including the advantage of this system being less proprietary).

(iii) the future research directions of the work.

To start, I feel that this system should be compared with some of the more proprietary systems on a whole host of different metrics (cost, power, heat dissipation, complexity, aggregate bandwidth under different loads, etc.) in order to demonstrate its superiority (I repeated this here, even though I mentioned it in ii).  Another good future research direction would be a case study of this design in a real world example, with actual hardware running under a realistic data center load.  Also, it would be interesting to see if this design would still be power efficient (and still retain its benefits) when running with either a subset of 10Gbps switches or an entire design using 10Gbps switches.

 

Ethane: Taking control of the Enterprise May 5, 2009

1. Centralized control

The Ethane design proposed by the paper centralizes control. Centralized control, while it has its issues, simplifies the structure and makes it possible to explore solutions and implementations that are not possible with distributed control. The broadcasting and learning that a network device such as a switch might otherwise go through when it comes up become unnecessary. All it needs to do is contact the central controller to obtain information about the network. It also becomes easier to implement new policies, make changes, etc., since they have to be made only at the central controller. In addition, since the central controller is a single device used for a specific purpose, it can be customized and provided with the required computational capacity.

2. Policies determined in terms of high level names.

Ethane performs bindings between users, hosts and addresses. This makes spoofing difficult. Using high-level names to declare policies seems natural and logical. It also allows different users of the same host to be governed by different policies. It also provides flexibility by making it easy for hosts and users to move around in the network without having to declare the policies all over again.

3. Ethane switches. Simple and efficient

Ethane uses switches to interconnect multiple hosts and the controller. These switches are a lot simpler than conventional Ethernet switches. Since all control and routing decisions are made at the central controller, these switches are not required to possess large memories or the computational power to check for source-address spoofing, support VLANs, etc. They are only required to maintain much smaller flow tables and a few per-flow statistics.
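
The division of labor described in the three points above can be sketched roughly as follows. This is only an illustration under my own assumptions: the class names, the two-field flow key, the user-to-user policy format, and the caching of drop decisions are all hypothetical simplifications and not Ethane's actual data structures or protocol.

```python
from dataclasses import dataclass, field

# Hypothetical flow key: (source address, destination address).
Flow = tuple[str, str]


@dataclass
class Controller:
    bindings: dict[str, str]        # address -> authenticated user name
    allowed: set[tuple[str, str]]   # (source user, destination user) pairs

    def permits(self, flow: Flow) -> bool:
        """Check a flow against policy written in terms of user names."""
        src_user = self.bindings.get(flow[0])
        dst_user = self.bindings.get(flow[1])
        # An unbound address has no name to match a policy against, so deny it.
        if src_user is None or dst_user is None:
            return False
        return (src_user, dst_user) in self.allowed


@dataclass
class Switch:
    controller: Controller
    flow_table: dict[Flow, str] = field(default_factory=dict)  # flow -> action
    controller_queries: int = 0

    def handle(self, flow: Flow) -> str:
        """Return "forward" or "drop" for a packet belonging to this flow."""
        action = self.flow_table.get(flow)
        if action is None:
            # Table miss: ask the controller once, then cache its decision.
            self.controller_queries += 1
            action = "forward" if self.controller.permits(flow) else "drop"
            self.flow_table[flow] = action
        return action


if __name__ == "__main__":
    controller = Controller(
        bindings={"10.0.0.1": "alice", "10.0.0.2": "bob"},
        allowed={("alice", "bob")},
    )
    switch = Switch(controller)
    print(switch.handle(("10.0.0.1", "10.0.0.2")))  # forward
    print(switch.handle(("10.0.0.2", "10.0.0.1")))  # drop
```

The switch holds no policy at all; it only remembers decisions. That is why it can be so simple, and also why every new flow costs a round trip to the controller.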

Oversights

It is a tough job to design an efficient, secure system with minimum overhead. Several trade-offs between security and overhead had to be made in the design of Ethane. Providing a central controller and requiring every flow to be registered and authenticated by it adds a bottleneck to the network. Connections between end hosts or switches and the controller would be used frequently and by many hosts, increasing the load on these connections. Accommodating dynamically changing policies, providing for synchronization between multiple central controllers, etc. adds overhead as well.

Future Work

The design can be extended to support multiple central controllers, thereby distributing central control to an extent. A policy to achieve synchronization between these multiple central units would need to be developed. This would speed up the flow authentication process, reduce the load on each central controller and provide resiliency.

 

A Policy-aware Switching Layer for Data Centers April 11, 2009

This paper presents a new approach for enforcing network management policies in a data center setting through the use of pswitches. Currently, data center network managers are faced with difficult problems. Data center networks are very complex, containing thousands of nodes running a large number of different applications. Each of these applications has different policies that need to be enforced (e.g. QoS or security).

Administrators deploy a large number of middleboxes throughout the network to enforce these policies. These middleboxes include things like firewalls, load balancers, SSL offloaders, web caches and intrusion prevention boxes. There are many problems with this approach, mostly stemming from the fact that a middlebox is only effective if it is placed on the network path between the gateway into the data center and the destination within the data center. If it is not placed on this path, packets will not pass through the middlebox and application policies will be violated. As a result, network administrators are forced to either modify the physical network topology or massage layer-2 spanning tree construction to ensure that packets traverse the middleboxes. In large networks, both of these approaches become very complex and are extremely error prone.

For example, if an administrator removes physical links to force packets along a certain path, the data center network loses some redundancy. In the face of failures, this may lower the availability of a service, which could be an even higher priority than the policies the middleboxes were trying to enforce. Similarly, modifying link costs to massage the spanning tree construction into choosing a specific path is very difficult.

Pswitches act as regular layer-2 switches that enforce middlebox policies as well. An administrator can configure pswitches, and implement various policies all from a central location. Middleboxes are plugged into the pswitches, and packets are forced to the correct destination through the use of indirection. This allows the separation of the physical network topology and the logical network topology.
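
The indirection can be pictured with a small sketch. It assumes a policy is just an ordered list of middleboxes per traffic class, and that a pswitch picks the next hop from the class and the hop a frame just came from. The table contents and function names are hypothetical; the paper's actual rule format is richer than this.

```python
from typing import Optional

# Hypothetical policy table: traffic class -> ordered middleboxes to traverse.
POLICY: dict[str, list[str]] = {
    "web": ["firewall", "load_balancer"],
    "internal": [],
}


def next_hop(traffic_class: str, previous_hop: Optional[str], destination: str) -> str:
    """Return where a pswitch should send a frame next.

    previous_hop is None for a frame that has not visited any middlebox yet.
    """
    chain = POLICY[traffic_class]
    # Position in the chain: start at 0, otherwise one past the last middlebox.
    index = 0 if previous_hop is None else chain.index(previous_hop) + 1
    return chain[index] if index < len(chain) else destination


def logical_path(traffic_class: str, destination: str) -> list[str]:
    """Full sequence of hops a frame takes, ending at its destination."""
    hops: list[str] = []
    previous: Optional[str] = None
    for _ in range(len(POLICY[traffic_class]) + 1):
        previous = next_hop(traffic_class, previous, destination)
        hops.append(previous)
    return hops


if __name__ == "__main__":
    print(logical_path("web", "server1"))
    # ['firewall', 'load_balancer', 'server1']
```

Nothing in the table says where the firewall or load balancer is physically plugged in, which is the separation of logical and physical topology the paper is after.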

This approach requires a minimal amount of change to existing data center networks during deployment. The only switches that need to be modified are the ones that face the outside of the data center. Middleboxes, servers, and routers do not need to be modified in any way.

Three major contributions:

  1. Separation of the physical and logical network topology to allow middleboxes to be placed off the physical network path.
  2. A centralized way for administrators to implement and enforce network policies.
  3. A fault-tolerant method that can survive network churn, including hardware and network failures.

The most glaring problem with the paper:

This approach requires deep packet inspection, as well as application content identification. These are both non-trivial tasks. How much will this increase the cost and complexity of switches?

Future research:

Pswitches appear as layer-2 devices. They require middleboxes to be in the same layer-2 network as all of the nodes that they service. It would be nice if these devices could direct packets across domains. Secondly, pswitches still depend on indirection and spanning tree construction to communicate with each other. If there is a problem with spanning tree construction, or with any other part of the layer-2 network protocols, then pswitches will not be able to function correctly. It would be interesting to take this idea even further, following the 4D architecture, and completely separate the data dissemination layer from the network management layer. As a result, network policies could still be enforced even if network components are not configured properly.