Graduate Networks, UCSD

CSE222 – Spring 2009

A Scalable, Commodity, Data Center Network Architecture May 7, 2009

One of the primary contributions of the paper is the proposal for a new data center network topology which theoretically offers full bisection bandwidth. Traditional data centers scale poorly: they either oversubscribe link capacity at the higher levels of the topology or need expensive high-capacity links and switches there. The fat tree, by contrast, scales using identical elements throughout the topology.
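As a rough illustration of what oversubscription means here, the sketch below computes the ratio of worst-case host demand to uplink capacity for a single switch. The function name and the 48-host example are hypothetical and are not taken from the paper.

```python
from fractions import Fraction


def oversubscription(hosts, host_gbps, uplink_gbps):
    """Worst-case host demand divided by the available uplink capacity."""
    return Fraction(hosts * host_gbps, uplink_gbps)


# Hypothetical traditional edge switch: 48 hosts at 1 Gbps share 20 Gbps of uplink.
ratio = oversubscription(hosts=48, host_gbps=1, uplink_gbps=20)
print(f"{ratio.numerator}:{ratio.denominator}")  # 12:5, i.e. 2.4:1

# Fat-tree edge switch with k = 48 ports: 24 hosts down, 24 x 1 Gbps links up.
print(oversubscription(hosts=24, host_gbps=1, uplink_gbps=24) == 1)  # True
```

A ratio of 1 corresponds to the 1:1 case, where every host could in principle send at full rate at the same time.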

Another contribution is the novel hierarchical addressing scheme, which eases routing in the fat tree and allows for smaller routing tables in the routers. The addressing scheme fits well with the traditional switch-based IP subnets used in data centers, and allows for simpler routers with faster lookups.

The two-level lookups and the local flow classifier are simple techniques that allow for easy routing in the topology with very simple router hardware.

The cost analysis (dollar cost, power, heat, etc.) of the fat tree topology versus traditional data center topologies, which use complex and expensive hardware at the higher layers, strengthens the ideas in the paper.

One of the primary challenges to the paper is the actual bandwidth that would be achieved in the fat tree. The fat tree is theoretically non-blocking, but only if traffic is scheduled appropriately. The evaluation is performed only for very simple, unrealistic communication patterns, so the real benefits will depend on how well traffic is scheduled. Also, data centers are increasingly looking at layer 2 based solutions for the entire data center. The paper mainly discusses a layer 3 based solution and makes no attempt to indicate whether it can be modified for, or incorporated into, layer 2 based solutions.

Future research could study scheduling the traffic appropriately in the data center to get better bisection bandwidth. Packaging the large number of switches and cables in the fat tree is another interesting direction.

 

A Scalable, Commodity, Data Center Network Architecture May 7, 2009

This paper, one of the more popular ones at SIGCOMM 2008, sets the tone for further research into networks in data centers. The main points are:

  1. Achieving full bisection bandwidth in a multi-rooted tree is possible if data centers revert back to the “fat-tree” topology, which had its origin in the telecommunication networks in the 20th century. This also obviates the need for expensive and specialized switches at the aggregation and core layers; commodity switches will be sufficient to build a functional fat-tree.
  2. The paper introduces the concepts of flow scheduling and flow classification which take into account the need for load balance in the network. They route flows such that no one link/switch becomes the bottleneck as long as there are other links/switches that are not overloaded yet. This is possible in a fat-tree because there are multiple routes between any pair of end-hosts (potentially one through every core switch, if there are no failures).
  3. The real impact of this paper (along with other similar papers) is that it has triggered a focus on how data center networks differ from their Internet counterparts, and on how best to leverage these distinctions with tailor-made protocols instead of being handicapped by Internet-style generic protocols like TCP/IP. Specifically, Amin Vahdat’s group at UCSD is researching the possibility of routing packets at layer 2, thereby merging layers 2 and 3 into a single layer.
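To make the "multiple routes" point concrete, here is a small sketch that counts the shortest paths between two hosts. It assumes the standard k-ary fat-tree layout (k/2 aggregation switches per pod and (k/2)^2 core switches), and the function name is hypothetical.

```python
def shortest_paths(k, same_pod, same_edge_switch):
    """Number of distinct shortest paths between two hosts in a k-ary fat-tree."""
    if k < 2 or k % 2:
        raise ValueError("k must be a positive even port count")
    if same_edge_switch:
        return 1              # both hosts hang off the same edge switch
    if same_pod:
        return k // 2         # one path through each aggregation switch of the pod
    return (k // 2) ** 2      # one path through each core switch


print(shortest_paths(4, same_pod=False, same_edge_switch=False))   # 4
print(shortest_paths(48, same_pod=False, same_edge_switch=False))  # 576
```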

One drawback of this paper is that the flow classification as described in the paper is static. If there is heavy traffic between a pair of end-hosts, this could lead to the same overloaded link/switch case that it is meant to avoid. Also, the fault tolerance model for links and switches is not dealt with in sufficient detail.

The future holds potential for research into possibly merging layer 2 and layer 3, so that routing can happen at layer 2. The IP address needs to be delinked from the location information, so that the IP address is retained (and existing TCP connections do not break) when a virtual machine migrates. Layer 2 has excellent properties like plug-and-play and easy fault isolation, but has problems with loops and broadcast scalability. Plug-and-play is an important feature because it eliminates the need for manual or automatic external configuration of the network. The switches can come up and use some distributed mechanism to arrive at a consensus on their location in the topology. Tackling these problems presents an exciting challenge as well.

 

A Scalable, Commodity Data Center Network Architecture May 7, 2009

This scheme uses fat-trees to improve throughput and scalability using COTS (commercial off-the-shelf) equipment. Fat-trees provide path redundancy, so multiple flows between the same endpoints can be directed over different paths, which increases the bandwidth between the two and lets traffic bypass switches and paths that have no spare capacity or have failed. This also changes the topology from a few high-capacity switches to many lower-capacity switches.

The most glaring issue I see in this paper is the cabling. While this is addressed, and deployment/packaging suggestions are included, the design would still require a large number of cables, many of which would be closely bundled. Assuming the interconnects are copper, this may incur crosstalk, especially at 10Gbps, which would reduce the throughput and thus the effectiveness of the connections.

 

A Scalable, Commodity, Data Center Network Architecture May 7, 2009

This paper describes a new method for interconnecting the host computers in a data center network using a fat tree topology and a variety of routing methodologies. This provides much greater aggregate bandwidth than the traditional hierarchical approach.

The main points of this paper were as follows:

  1. The current common data center topology consists of core, aggregation, and edge switches. The problem is that the aggregate bandwidth of what is fundamentally a tree structure is limited by the bandwidth of the root node. This was less of a problem in the past, since it was possible to get core switches which were fast enough to handle the aggregate bandwidth requirements of the hosts, albeit at great cost (expensive, specialized hardware). However, with the push toward 10 GbE hardware, developing a root node which can handle the aggregate bandwidth of thousands of 10 GbE nodes is simply untenable at any cost. Therefore, we need a new approach which can scale to the 10 GbE range. The authors of this paper propose a fat tree topology using simple commodity 10 GbE switches. This provides almost 100% of the ideal bisection bandwidth, while reducing power and cost. This reflects the current trend of aggregating lots of commodity hardware to provide high performance instead of using just a few extremely high-performance systems.
  2. One challenge with the fat tree approach is that, as opposed to the single tree topology previously used, there are multiple possible shortest paths between end hosts. This means that traditional routing protocols are insufficient to allow the network to take full advantage of the bisection bandwidth, since they cannot adapt to the traffic patterns present. In order to solve this problem, the paper presents three different approaches: a two-level static routing table, local flow classification, and global flow scheduling. As we saw in the previous paper, it turns out that managing flows in a global manner provides the greatest performance, since local techniques can cause bottlenecks “beyond the horizon” visible to the local switches. In addition, the power of modern computing hardware means that one centralized node can easily handle the routing requirements of their ideal 27k-host network.
  3. Another advantage of using commodity hardware is that it very easily fits into the current Internet architecture (Ethernet, IP, TCP). This is opposed to specialized products like InfiniBand, which, while able to provide the bandwidth of this approach (incidentally, using a similar fat-tree-based architecture), uses completely different layer 1-4 protocols, making it difficult to map back to the Ethernet/IP/TCP design that we want. Some of the specialized products also require far too much wiring complexity (e.g. the torus found in BlueGene/L and the Cray XT3), which was addressed in detail in this paper, and which this design reduces to a great degree by allowing cables to be grouped to increase the efficiency and simplicity of interconnect runs.

One weakness of this paper is that they did not discuss how to implement a distributed centralized controller. When other parts of the system were so meticulously detailed (e.g. the section on how to wire these pods together in a data center), it seemed like just saying “possibly replicated” was a bit of a weak statement, especially when presenting theoretical results using such a controller. It seems likely that any data center with 27k nodes is going to want some kind of failover or replication support on a centralized controller, and, while this probably will not affect performance, it might, which would make for some interesting experimental results.

Future research in this field might want to focus on a way to distribute the task of scheduling flows throughout the network. While we have recently seen the power of a centralized approach, one of the benefits of using a 27k-node cluster over a small number of supercomputers is that it is extremely fault-tolerant. This is a necessity, since commodity hardware tends to have higher failure rates (due to its significantly lower cost) than the more complex and expensive hardware. It would be interesting to see this fault-tolerance implemented both in the flow management (whether it be a distributed method or a local method) as well as in the routing protocol (which is allowed by the dynamic routing protocols presented in the paper).

 

A Scalable, Commodity Data Center Network Architecture May 7, 2009

Three Important Things:

  • Inter-node communication bandwidth represents a significant bottleneck in today’s data center networks. Often links have to be oversubscribed in order to provide peak bandwidth for any random communicating pair. The presented application of a fat tree topology addresses this critical issue. The topology is able to provide a 1:1 subscription ratio while leveraging identical commodity switching elements which are fully compatible with Ethernet and IP. This is an attractive alternative to large and expensive high-end switches such as InfiniBand-based products. Fault tolerance is another added benefit of the fat tree due to the redundancy of available paths between any pair of hosts.
  • Maximum link utilization cannot be achieved by mere virtue of the fat tree topology. To ensure proper bandwidth allocation, a central scheduler is proposed. Its task is to monitor large flows and assign them non-conflicting paths. It accomplishes this with a linear scan over the core switches, which allows it to compute an unloaded path for the flow. With simple modifications to the switching elements, they first determine a path locally as they see fit, until the scheduler notifies them of the globally optimal forwarding scheme.
  • The paper presents a set of impressive results in the areas of bandwidth utilization, heat and power consumption, and cost. The experimental setup with a central flow scheduler is able to achieve 93% worst case bandwidth utilization, as opposed to 28% for a traditional tree topology. Using commodity switching elements that are more efficient per Gbps of bandwidth leads to a 50% reduction in power consumption and heat generation, implying operational cost savings on server operation and cooling units. Initial capital investment scales well with the number of hosts, even past the point where there would be no possible solution with specialized switches and link oversubscription.
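The linear scan over core switches mentioned above might look roughly like the following. This is a simplified sketch rather than the paper's algorithm: it only models pod-to-core links (not edge-to-aggregation links), and the class and method names are hypothetical.

```python
class CentralScheduler:
    """Greedy placement of large inter-pod flows onto core switches."""

    def __init__(self, k):
        self.num_cores = (k // 2) ** 2
        self.reserved = set()    # pod-to-core links held by large flows
        self.links_of = {}       # flow id -> (core, links it reserved)

    def place(self, flow_id, src_pod, dst_pod):
        # Linear scan over the core switches; take the first conflict-free one.
        for core in range(self.num_cores):
            links = {("up", src_pod, core), ("down", core, dst_pod)}
            if not links & self.reserved:
                self.reserved |= links
                self.links_of[flow_id] = (core, links)
                return core
        return None  # no free core: the flow keeps its locally chosen path

    def release(self, flow_id):
        # Called when a large flow ends, freeing its links for other flows.
        _core, links = self.links_of.pop(flow_id)
        self.reserved -= links


scheduler = CentralScheduler(k=4)
print(scheduler.place("a", src_pod=0, dst_pod=1))  # 0
print(scheduler.place("b", src_pod=0, dst_pod=2))  # 1, since pod 0's link to core 0 is taken
```

Returning None when every core conflicts mirrors the idea that switches keep forwarding on their local choice until the scheduler has something better to offer.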

Glaring Problem:

An acknowledged major concern with the fat tree topology is the cabling requirement of the “fat” links from the core to the pods. The paper proposes a packaging solution that seeks to minimize the amount of cabling, but it still seems like the cost and weight of these cables would significantly hinder the deployment of such a topology (specifically for a 20k+ node cluster). There is no mention of these cost/weight numbers, or of alternative cabling solutions such as optical multiplexing.

Future Work:

Given that the best case bandwidth is achieved using global knowledge of all packet flows, the current solution relies on a centralized controller to oversee the network. There are some drawbacks to this approach (single point of failure, scalability concerns), and so it would be worth exploring other ways to perform optimal routing negotiations, perhaps in a distributed manner across the core routing elements.

 

A Scalable, Commodity Data Center Network Architecture May 7, 2009

The paper proposes a way to build a large data centre from commodity switches. A large number of nodes requires a three-tiered design hierarchy. The hierarchy discussed in the paper has a core tier at the root of the tree, an aggregation tier in the middle, and an edge tier at the leaves.  The paper argues that delivering full bandwidth between arbitrary hosts requires a multi-rooted tree with multiple core switches, and that a fat-tree based interconnect holds the promise of delivering scalable bandwidth at moderate cost.  This is motivated by the significant cost difference between commodity switches and high-speed specialized switches. A similar trend motivated Clos to design a network that delivers a high level of bandwidth to many end devices using commodity devices.  Based on this architecture, the paper proposes an extension of IP forwarding that exploits the multiple routing paths available in the fat-tree topology, since traditional IP forwarding builds a single path between source and destination.

Another aspect of the fat-tree is that it adds wiring overhead. The paper proposes packaging and placement techniques to ameliorate this cost. It introduces two-level route lookups to assist with multi-path routing across the fat tree, and discusses flow classification and flow scheduling as alternative multi-path routing methods. The paper proposes using TCAMs for the lookups because they are fast.

The first two levels of switches in the proposed topology act as filtering traffic diffusers. Host IDs are treated as a source of deterministic entropy, so packets to the same host follow the same path, which avoids reordering. Flow classification is performed in the switches to dynamically assign ports, overcoming local congestion when two flows compete for the same output port while other ports of the switch are underutilized.

The paper argues that scheduling large flows plays the most important role in achieving the full bisection bandwidth of a network, and proposes a central scheduler for this task. Edge switches initially allocate a new flow to the least loaded port.  An edge switch does not block or wait for the central scheduler to perform its scheduling computation; it treats a large flow like any other flow and assigns it to the least loaded port. The paper achieves its goal of delivering scalable bandwidth at significantly lower cost.
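A minimal sketch of the local behaviour described here: a new flow is pinned to whichever upward port currently carries the least traffic, and later packets of the same flow reuse that port so they are not reordered. The class name, port numbers, and byte-count load metric are assumptions made for illustration.

```python
class FlowClassifier:
    """Per-switch flow table that pins each flow to one upward port."""

    def __init__(self, upward_ports):
        self.load = {port: 0 for port in upward_ports}  # bytes sent per port
        self.port_of = {}                               # flow -> assigned port

    def forward(self, flow, size):
        port = self.port_of.get(flow)
        if port is None:
            # New flow: pick the least loaded port (ties go to the first port).
            port = min(self.load, key=self.load.get)
            self.port_of[flow] = port
        self.load[port] += size
        return port


classifier = FlowClassifier(upward_ports=[2, 3])
print(classifier.forward("f1", 100))  # 2
print(classifier.forward("f2", 10))   # 3
print(classifier.forward("f1", 100))  # 2, the same port as before
```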

 

A Scalable, Commodity, Data Center Network Architecture May 7, 2009

(i) the three most important things the paper says

One of the most important ideas that this paper brings to light is that, at the time it was written, a very-large-scale network with a 1:1 oversubscription ratio was much cheaper to implement with a fat-tree design than with traditional data center network architectures.  Although the paper does not specify what percentage of larger data centers actually employ less efficient or more expensive architectures than the one proposed, it can be assumed that the problem is of sizable magnitude.  Another important idea is that a static approach to switching paths in the fat-tree architecture (and perhaps other architectures) is less efficient than what can be achieved for a minimal increase in cost.  This is shown by the higher efficiency of the more dynamic protocols compared with the static assignment of port forwarding.  The third most important idea in the paper is that the fat-tree beats the traditional architecture (and perhaps also some of the more complicated, non-fat-tree designs that aren’t specifically detailed here) in yet another metric: power usage and heat dissipation.  Since the fat-tree design detailed here sticks to only GigE switches, the power usage per Gbps was shown to be significantly lower (when compared to using 10Gbps switches to reach the same aggregate bandwidth marks).  For a large-scale data center, this should be a huge argument for switching to this architecture.

(ii) the most glaring problem with the paper

The biggest problem that I found in the paper was that in the Introduction and the Related Works sections, many advanced offerings from other companies were mentioned.  I was a little surprised to not find any benchmark numbers from these throughout the paper (or any direct comparisons with these systems).  I feel that the paper could have been much stronger had the authors shown the aggregate and link-based bandwidth comparisons between their system and the other commercial systems out there.  I say this with the understanding of the fact that many companies will not allow this type of testing (at least not without physically buying the actual solution–a prohibitively expensive research endeavor), but still think that it would have been very useful to provide some concrete numbers to supplement the other advantages of the system described in the paper (including the advantage of this system being less proprietary).

(iii) the future research directions of the work.

To start, I feel that this system should be compared with some of the more proprietary systems on a whole host of different metrics (cost, power, heat dissipation, complexity, aggregate bandwidth under different loads, etc.) in order to demonstrate its superiority (I repeated this here, even though I mentioned it in ii).  Another good future research direction would be a case study of this design in a real world example, with actual hardware running under a realistic data center load.  Also, it would be interesting to see if this design would still be power efficient (and still retain its benefits) when running with either a subset of 10Gbps switches or an entire design using 10Gbps switches.

 

A Scalable, Commodity, Data Center Network Architecture May 7, 2009

This document shows the importance of topology and routing on communication capability. The standard practice for data center networking uses a tree topology with high performance switches aggregating the traffic from commodity switches connected to hosts at the leaves. The commodity architecture uses a fat-tree topology with only commodity switches and enough redundancy to ensure that the bisection of the network has enough capacity to allow all pairs of hosts to communicate at full host interface speed. Bottlenecks are eliminated.

Redundant switches and paths improve fault tolerance and provide enough paths for full speed communications. Expensive high performance switches are replaced with low cost commodity switches. The deployment cost and maintenance costs are greatly lowered. Total cost of ownership is greatly improved.

Best performance is achieved when a central scheduler optimizes flow paths. Standard shortest path routing does not allow the exploitation of multiple parallel paths through the network. Static two-level routing is an improvement over shortest path routing but fails to achieve full throughput in a significant number of cases. An active central scheduler allows performance within 12% of optimal in all test cases.

The fat-tree topology is more complex than the standard tree topology and requires a great deal of additional cable connections. The authors describe an efficient way to package the system components that reduces implementation complexity.

The bisection of the network would support all-pairs communication at full rate if path selection were optimal. The flow scheduler has global knowledge and is able to route flows dynamically. Why is performance not optimal?

Integration with Ethane Ethernet should be investigated. The flow mapping algorithm can be integrated with the Ethane controller, and simple Ethane switches could increase performance and reduce the cost of the commodity switches that both ideas depend upon. The central flow scheduler is common to both systems. This architecture introduces the fat-tree topology and path selection, while Ethane introduces a simple switching technique.

 

A Scalable, Commodity Data Center Network Architecture May 7, 2009

(i) Three most important things

1. It should be possible for any host in a data center to communicate with any other host in the network at full bisection bandwidth. Existing solutions to the relatively low effective bandwidth of data centers involve hierarchies of switches, with expensive switches at the top of the hierarchy. The port density of high-end switches limits overall cluster size while at the same time incurring high cost. The paper proposes implementing a fat-tree out of cheap commodity Ethernet switches that can achieve full bisection bandwidth.

2.  Cheap off-the-shelf Ethernet switches should become the foundation for large-scale data center networks. The implementation of fat-trees in data centers delivers scalable bandwidth at a significantly lower cost than existing techniques.

3. Scalable bandwidth in data centers can be possible while still remaining backward compatible with Ethernet, IP, and TCP. The paper provides an effective implementation of a network topology for data centers that doesn’t require any modifications to the end host network interface, operating system, or applications, allowing existing data centers to take advantage of this new interconnect architecture with no modifications.

(ii) Most glaring problem

The most glaring problem is that the prototype of the proposed architecture has not been tested with realistic workloads. The central scheduler, which tracks all active large flows and the status of all links, could prove to be infeasible for large arbitrary networks.

(iii) Future Research Directions

Future research directions for this work would be to actually implement the new architecture in a realistic setting, so that we can see how its flow management handles larger networks and determine whether the central scheduler would actually be feasible.

 

A Scalable, Commodity, Data Center Network Architecture May 7, 2009

By organizing data center clusters in a fat-tree topology, the authors show that an over-subscription ratio of 1:1 can be achieved with commodity Gbps switches at a cost considerably lower than with traditional hierarchical topologies. The key behind the fat-tree design is that between any two nodes, there exists a large number of Gbps paths. These paths allow packets to be delivered fairly independently of other flows between other nodes. This avoids saturation of switch ports that would otherwise occur using a traditional aggregating hierarchical design.

The major contributions include the fat-tree topology, the two level table routing, and flow classification/scheduling routing algorithms. The routing algorithms provide ways to use the fat-tree topology so as to avoid congestion from port contention. They are necessary to support full Gbps speeds between any two nodes.

The fat-tree topology is a pattern for connecting nodes to switches with k ports. There are 2 levels to the design, a set of core switches and a set of pods. There are (k/2)^2 core switches and k pods. Each node is part of a pod, and each pod contains (k/2)^2 nodes. A lower level pod switch is connected to k/2 nodes and k/2 upper level pod switches. An upper level pod switch is connected to k/2 lower level pod switches and k/2 core level switches. Each core switch is connected to all k pods. This topology allows multiple paths between any pair of nodes.
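For reference, the element counts of a k-ary fat-tree follow directly from the port count k. The sketch below assumes the standard construction (k pods, each with k/2 lower level and k/2 upper level switches), and the function and key names are hypothetical.

```python
def fat_tree_size(k):
    """Element counts for a fat-tree built from identical k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be a positive even port count")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,         # lower level pod switches
        "aggregation_switches": k * half,  # upper level pod switches
        "core_switches": half ** 2,
        "hosts_per_pod": half ** 2,
        "hosts": k * half ** 2,            # equals k^3 / 4
    }


print(fat_tree_size(4)["hosts"])   # 16
print(fat_tree_size(48)["hosts"])  # 27648
```

With 48-port switches this gives the roughly 27k hosts that the reviews refer to, and with 4-port switches the 16-host testbed.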

The two level table routing algorithm uses the last three octets in a 32 bit IP address to identify which port to route packets to (at every switch in the system). The pattern assumes 10 for the first octet, pod id for the second, switch id (within a pod) for the third, and node id (within the switch's subnet) for the last. When a packet is received by a pod switch from a node, the switch looks at the destination address to determine which port to forward it on. The switch first checks the first table for a matching prefix. If a terminating prefix is found, the destination is local to the pod and the packet is forwarded out the appropriate port. If not, the address’ suffix is used as a lookup in the second table. The output port found in the second table forwards the packet up to the next switch. Using this method, the switches can route traffic to different nodes on different ports, thus avoiding congestion.
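The lookup order described above can be sketched as follows. This is an illustrative software model (the paper proposes TCAM hardware for the real lookup), and the class name, dictionary layout, and the sample table for an upper level switch in pod 2 of a k = 4 network are assumptions.

```python
def parse(addr):
    """Split a dotted-quad address such as '10.2.0.3' into a tuple of ints."""
    return tuple(int(part) for part in addr.split("."))


class TwoLevelTable:
    def __init__(self, prefixes, suffixes):
        self.prefixes = prefixes  # {(10, pod, switch): port}, terminating entries
        self.suffixes = suffixes  # {node id: port}, used when no prefix matches

    def lookup(self, dst):
        octets = parse(dst)
        port = self.prefixes.get(octets[:3])
        if port is not None:
            return port                  # destination is inside this pod
        return self.suffixes[octets[3]]  # otherwise spread by node id


# Hypothetical table: ports 0-1 face the pod's lower switches, ports 2-3 the core.
table = TwoLevelTable(
    prefixes={(10, 2, 0): 0, (10, 2, 1): 1},
    suffixes={2: 2, 3: 3},
)
print(table.lookup("10.2.0.3"))  # 0, local to pod 2
print(table.lookup("10.3.1.2"))  # 2, sent upward by suffix
```

Because the suffix table keys on the node id alone, packets to the same destination always leave on the same upward port, while different destinations are spread across ports.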

The flow classification/scheduling routing algorithm works to avoid congestion by letting switches route packets over any of their available ports. When a flow of packets is identified as being long enough to constitute a long lived flow, a flow scheduler is notified. The flow scheduler keeps track of the flows running in the system and can adjust the path a flow takes through the topology so as to reduce the load across the core and pod switches.
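One way to picture the notification step is a per-switch byte counter that reports a flow once it crosses a size threshold. The threshold value, the callback interface, and the class name below are all hypothetical; the review does not say how a long lived flow is actually detected.

```python
from collections import defaultdict


class EdgeFlowMonitor:
    """Counts bytes per flow and reports each large flow exactly once."""

    def __init__(self, threshold_bytes, notify):
        self.threshold = threshold_bytes
        self.notify = notify              # callback into the flow scheduler
        self.bytes_seen = defaultdict(int)
        self.reported = set()

    def on_packet(self, flow, size):
        self.bytes_seen[flow] += size
        if self.bytes_seen[flow] >= self.threshold and flow not in self.reported:
            self.reported.add(flow)
            self.notify(flow, self.bytes_seen[flow])


reports = []
monitor = EdgeFlowMonitor(1000, lambda flow, seen: reports.append((flow, seen)))
monitor.on_packet(("10.0.0.2", "10.2.0.3"), 600)
monitor.on_packet(("10.0.0.2", "10.2.0.3"), 600)
print(reports)  # [(('10.0.0.2', '10.2.0.3'), 1200)]
```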

The fat-tree topology appears to be a great way to use commodity switches to achieve over-subscription ratios of 1:1. However, the major drawback is the lack of commodity switch support for the routing protocols needed to attain Gbps speeds on the topology. The authors suggest that implementing the two level routing algorithm represents only a minor change to existing switch technology. This may be true, but if that change cannot be easily implemented in commodity switches, this approach is relegated to academic simulations.

I’d like to see future research extend the idea of using a fat-tree topology with commodity switches that can be configured to run at Gbps speeds, without the need for a custom routing protocol. This would provide organizations with a tangible solution that can be used in place of traditional hierarchical topologies.

 

A Scalable, Commodity, Data Center Network Architecture May 7, 2009

This paper presents a novel network architecture to interconnect nodes of a data center by using commodity switches, yet achieving full aggregate bandwidth across the data center. The authors point out that high-end switches are very expensive, and often do not provide full aggregate bandwidth across the cluster. Also, bandwidth scales poorly with cluster size and thus bandwidth costs scale non-linearly with cluster size. Their approach is to interconnect a hierarchy of commodity switches to achieve high bandwidth. The great benefit of their approach is that they don’t break backward compatibility, and there is no need to modify the end hosts either. The main points of the paper are:

  1. Hierarchy of commodity switches: It is possible to achieve full aggregate bandwidth for a cluster running on a hierarchy of commodity switches. Thus clusters can achieve high aggregate bandwidth with drastically reduced costs (compared to high-end switches). This is done by creating a fat tree with multiple layers: the top layer is the core layer, which contains the roots of the network hierarchy; the middle layer is a layer of switches that perform aggregation; and the lower layer, the edge, is the set of switches to which the hosts attach. Further, they show that such a design is possible without breaking backwards compatibility or requiring end-host modification.
  2. Efficient Routing: IP network switches normally choose a single route from one edge node to another. Thus this route can quickly become the bottleneck if it is overwhelmed with other traffic. In this paper, they try to spread traffic as evenly as possible across the core switches. Since the design offers multiple paths from one edge to another edge of the fat-tree, the network routing algorithm chooses the path that has the least amount of traffic on it. This multi-path routing is possible because the hierarchy of switches is multi-rooted, and each aggregate layer has multiple switches connected to the lower aggregate layers or the edge routers.
  3. Flow classification: the flow scheduling algorithm is able to evenly distribute the flow of traffic by periodically reassigning the flow between two ports to make the flows more evenly distributed between these two ports. By doing so, the network bandwidth is better utilized.
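The periodic reassignment in point 3 could be sketched as below: find the most and least loaded upward ports, then move the largest flow that fits into the gap between them. This is one plausible reading of the heuristic, not code from the paper, and the function signature is hypothetical.

```python
def rebalance(assignment, sizes, ports):
    """Move one flow from the busiest port to the idlest; return its id or None.

    assignment: flow id -> port (updated in place)
    sizes:      flow id -> recent traffic volume of that flow
    ports:      the upward ports of this switch
    """
    load = {port: 0 for port in ports}
    for flow, port in assignment.items():
        load[port] += sizes[flow]
    p_max = max(load, key=load.get)
    p_min = min(load, key=load.get)
    gap = load[p_max] - load[p_min]
    # Only flows smaller than the gap can move without overshooting.
    candidates = [f for f, p in assignment.items() if p == p_max and sizes[f] < gap]
    if not candidates:
        return None
    moved = max(candidates, key=sizes.get)
    assignment[moved] = p_min
    return moved


assignment = {"a": 2, "b": 2, "c": 3}
sizes = {"a": 50, "b": 30, "c": 10}
print(rebalance(assignment, sizes, ports=[2, 3]))  # a
print(assignment)  # {'a': 3, 'b': 2, 'c': 3}
```

Requiring the moved flow to be smaller than the gap keeps a single step from simply shifting the overload onto the other port.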

Problems: the only potential problem that I can think of is that the cost and bandwidth estimation for a large cluster seems to be based on calculations, but no experiments. The cluster that they evaluate is a rather small cluster with only 16 hosts, and thus certain problems that can be only observed in large clusters might not have been observed during their experiments (such as Incast).

Future Research: implementing this network hierarchy in a larger cluster would be beneficial, to see if bandwidth scales as high as expected. Also, if the main goal is to achieve high bandwidth at a low cost, it might be acceptable to modify end hosts if it could result in higher performance.

 

A Scalable, Commodity, Data Center Network Architecture May 7, 2009

Filed under: R11. A Scalable, Commodity, Data Center Network Architecture — subhramazumdar @ 5:42 pm

The paper proposes a new topology for data center networks that leverages commodity switches to achieve lower cost and a lower oversubscription ratio than the usual hierarchical topology. In the conventional tree topology, bandwidth demand increases progressively towards the root of the tree. Sustaining such bandwidth requires extremely high-speed switches or routers, which are extremely costly. With the fat tree architecture proposed in the paper, this bandwidth requirement is brought down, so normal commodity switches can be used at every node. This topology is also the only way of achieving an oversubscription ratio of 1:1. The basic idea of the topology is to use the same commodity switch at every layer of the hierarchy. The topology is essentially a fat tree with end nodes connected to pod switches, and pods connected to core switches. Since the number of ports per switch is the same at each level, more switches are needed to interconnect all the nodes than in a conventional hierarchical topology. Despite requiring more switches, the overall cost of a network built only from commodity switches is shown to be much less than that of a conventional topology, in which the tremendous cost of the high-speed routers and switches at the top throws off the cost-to-performance ratio. Finally, from a power and heat perspective, this topology is much less demanding than a hierarchical one, in which the high-speed switches at the top are extremely power hungry and produce a lot more heat than normal commodity switches. On the whole, such a topology gives much greater available bandwidth at much lower cost and power consumption, and is also highly scalable.

The main challenge is multipath routing, due to the presence of multiple possible routes from one node to another. This requires a two-level routing algorithm, which can increase the lookup latency and hence the overall routing delay. Distributing the traffic between nodes so as to leverage the full bandwidth of the multiple paths is also a challenge: appropriate diffusing behavior has to be implemented in the pod switches, and it may need to vary dynamically based on the current traffic distribution. Another drawback of the fat tree architecture is the number of interconnects needed to connect all the machines, since it depends on a large fan-out to multiple switches at each layer to achieve its scaling properties. Thus, although the theoretical numbers for available bandwidth and oversubscription ratio look tempting, setting up such a network with a high number of interconnects may pose a problem of efficient packaging.

Future work can include a more detailed implementation of the network architecture and its validation. The speculated heat and power numbers can also be verified on a real system, since they vary considerably with the actual installation. Further research may also include placing an upper bound on the number of high-speed switches and optimizing bandwidth and cost under that constraint, which would give a more flexible network topology.

 

A Scalable, Commodity Data Center Network Architecture May 7, 2009

Filed under: R11. A Scalable, Commodity, Data Center Network Architecture — gracewangcse222 @ 5:41 pm

(i) The three most important things the paper says:

  1. The authors use a fat-tree topology to realize the architecture described in their paper. Briefly, the fat-tree topology consists of k pods. Inter-pod connections are made through (k/2)^2 core switches, and intra-pod connections employ k/2 switches in the aggregation layer and k/2 edge switches to connect (k/2)^2 hosts in each pod (yielding a total of up to k^3/4 hosts). There are several benefits to this approach: identical switching elements can be used throughout the topology, so the authors can leverage cheap commodity switches, and the topology allows optimal bandwidth saturation for some set of paths. Furthermore, the calculated power consumption and heat dissipation are lower than in a hierarchical design.
  2. Internet traffic may be characterized as long tailed (i.e. bandwidth is dominated by a few large flows and many small ones). Thus achieving good bandwidth is dependent on routing the large flows well (i.e. minimizing overlap between large flows). This is done with a central scheduler which identifies and reserves links for large flows.
  3. Fault-tolerance is achieved (for the most part) by multiple paths being available between any pair of hosts in the lower- to upper-layer switch portion of the path and in the upper layer to core switch portion of the path. A bidirectional forwarding  detection session is maintained at each switch for its neighbours, and failure tags are broadcast as needed. However, failures in the edge switches would cause hosts to be disconnected, so this can only be remedied by adding redundancy in the edge switches.

(ii) The most glaring problem with the paper:

The incremental cost of the system might be significant. For example, if one additional host is needed beyond what a k = 48 fat-tree supports, then k has to grow (to k = 50, since k must be even), and we would need additional ports on all the switches, a number of additional switches, and links to connect the new switches and ports to maintain the fat-tree topology.
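
To make the size of that jump concrete, here is a minimal sketch of the arithmetic, assuming the usual k-ary fat-tree counts (k^3/4 hosts, k switches per pod plus (k/2)^2 core switches) and assuming the whole tree is rebuilt for the new k. The function names are made up for illustration.

```python
def max_hosts(k: int) -> int:
    """Hosts supported by a k-ary fat-tree: k^3 / 4 (k must be even)."""
    if k < 2 or k % 2:
        raise ValueError("k must be a positive even port count")
    return k ** 3 // 4


def switch_count(k: int) -> int:
    """k pods with k switches each, plus (k/2)^2 core switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be a positive even port count")
    return k * k + (k // 2) ** 2


def smallest_k(hosts: int) -> int:
    """Smallest even k whose fat-tree has room for `hosts` end hosts."""
    k = 2
    while max_hosts(k) < hosts:
        k += 2
    return k


if __name__ == "__main__":
    full = max_hosts(48)            # capacity of a k = 48 fat-tree
    k_next = smallest_k(full + 1)   # one more host than that
    print(full, switch_count(48))
    print(k_next, max_hosts(k_next), switch_count(k_next))
```

Under these assumptions, going one host past a full k = 48 tree means moving from 48-port to 50-port switches and adding a few hundred of them, which is the step cost this review is worried about.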

(iii) The future research directions of the work:

Additional work could be done to measure power consumption and heat dissipation, and to add redundancy (and thus additional fault tolerance) in the edge switches and the central controller, as well as measure the performance effects to the system this redundancy would yield. More sophisticated algorithms can be implemented for flow scheduling than linearly searching for a path to allocate the flow.

 

A Scalable, Commodity Data Center Network Architecture May 7, 2009

Important things:
1) Full bisection bandwidth for large scale data center environments is achievable without the need for expensive, high-bandwidth equipment. With a fat-tree topology, and a good multi-path routing strategy, full bisection bandwidth is achievable with commodity-like switching components.
2) A multi-path routing strategy which employs a central scheduler for allocating paths to large flows can increase bisection bandwidth under a variety of communication patterns.
3) An attempt to mitigate the increased wiring overhead, which this topology introduces, is made through a proposed packaging technique.

Problem:
The commodity switches that are to be the building blocks of this architecture will need new functionality in order to address the multi-path routing problem, so such switches are not yet commodity. Also, the effectiveness of a central flow scheduler at very large scale has to be evaluated.

Future Work:
Future work should continue to evaluate different multi-path routing or flow scheduling techniques, and evaluate any component that might limit scalability. More communication patterns should be evaluated, especially those that are more typical of actual data center traffic and those that might reveal the weakness of a given routing strategy.

 

A Scalable, Commodity, Data Center Network Architecture May 7, 2009

Filed under: R11. A Scalable, Commodity, Data Center Network Architecture — krishnanadh @ 5:40 pm

The paper proposes leveraging low-cost commodity switches for data center networks that require high end-to-end bandwidth guarantees. The authors comment that using specialized switches introduces prohibitive costs to the network, and that using commodity switches with the existing data center network architectures limits the bandwidth when traversing the network hierarchy. To address this issue, they propose a novel fat-tree architecture which gives close to 1:1 oversubscription at a much lower cost. The primary goals that the authors address are a network architecture that is scalable to any number of nodes, the use of inexpensive off-the-shelf devices as scaling needs grow, and backward compatibility with existing hosts running Ethernet and IP. The paper draws a comparison with conventional 2- and 3-tier data center topologies, where the layers of the topology use different switches and the higher layers use 128-port 10GigE switches. The authors highlight the limitations of these topologies in terms of their oversubscription of 2.5:1 to 8:1, their high cost, and their use of ECMP multi-path routing, which has multiplicative flow table entries and bandwidth-limited flows. The main points discussed in the paper are the design and implementation of the fat-tree architecture using commodity switches, the routing table structure and algorithm, and a packaging architecture to reduce the wiring complexity of the system.

The fat-tree structure uses (k/2)^2 core k-port switches, with each port of a core switch connected to an upper (aggregation) layer k-port switch in one of the k pods. This structure cumulatively forms a k-ary fat-tree topology which supports k^3/4 end hosts. The paper proposes a two-level routing lookup strategy with a routing algorithm to utilize the full potential of the rearrangeably non-blocking fat-tree architecture. The authors argue that existing routing protocols like OSPF cannot leverage the (k/2)^2 shortest paths, since switches concentrate traffic bound for a single subnet on a single port, thereby increasing localized congestion. Hence, for uniform distribution of traffic among all the available paths, the authors use a two-level routing table lookup and a table population algorithm on the quad-dotted IP addresses to shorten lookup latency. The primary and secondary tables are implemented in TCAM and use left-handed (prefix) and right-handed (suffix) entries respectively. Host IDs are used in conjunction with the routing table creation algorithms at each layer in the hierarchy to ensure an even spread of traffic. Each pod switch has primary table entries containing terminating prefixes for the subnets in that pod, and for inter-pod traffic a /0 prefix points to the secondary table, which matches on host IDs. Since core switches switch traffic between individual pods, their table entries contain /16 prefixes of the destination pods. While traditional single-path IP routing protocols might end up using the same routing path for a pair of host IDs, the multi-path routing described above with two-level lookup uses distinct paths to fully utilize the available redundancy of a fat-tree. Since static table allocation fails to achieve perfect distribution and may result in local congestion on a single output port, the authors demonstrate a dynamic routing algorithm that uses flow classification with dynamic port re-assignment.
The switches report large flows to a central scheduler which then uses this information to dynamically select the best path for a given flow. Finally to achieve fault tolerance in the system, bidirectional forwarding detection sessions are maintained at each switch to detect failures across various layers of the hierarchy and take appropriate actions.
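
The two-level lookup described above can be sketched in a few lines. This is only an illustration in software, not the TCAM implementation: the table layout, the `lookup` name, and the example entries (an aggregation switch in pod 2 of a k = 4 fat-tree) are assumptions, and the suffix match is simplified to the last octet of the address (the host ID).

```python
from ipaddress import IPv4Address, IPv4Network

# Illustrative table for one pod switch (assumed entries, k = 4, pod 2).
# A prefix entry either terminates with a port or points to a suffix table
# keyed by host ID (the last octet of the destination address).
TABLE = [
    {"prefix": IPv4Network("10.2.0.0/24"), "port": 0, "suffixes": None},
    {"prefix": IPv4Network("10.2.1.0/24"), "port": 1, "suffixes": None},
    {"prefix": IPv4Network("0.0.0.0/0"), "port": None, "suffixes": {2: 2, 3: 3}},
]


def lookup(table, dst):
    """Return the output port for destination address `dst`, or None."""
    addr = IPv4Address(dst)
    # First level: longest matching prefix.
    matches = [entry for entry in table if addr in entry["prefix"]]
    if not matches:
        return None
    best = max(matches, key=lambda entry: entry["prefix"].prefixlen)
    if best["suffixes"] is None:
        return best["port"]  # terminating prefix: route down into the pod
    # Second level: match on the host ID to spread traffic across uplinks.
    host_id = int(addr) & 0xFF
    return best["suffixes"].get(host_id)


if __name__ == "__main__":
    print(lookup(TABLE, "10.2.1.2"))  # intra-pod destination
    print(lookup(TABLE, "10.3.0.3"))  # inter-pod destination
```

With these entries, intra-pod destinations hit a terminating /24 prefix, while inter-pod destinations fall through to the /0 entry and are split across the two uplinks by host ID, so packets to the same host always take the same port.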

It is shown that the power requirement of the fat-tree topology is much lower than that of topologies using specialized switches, although the number of commodity switches is much higher. This is attributed to the order of magnitude difference in power consumption of high aggregate bandwidth 10GigE switches. The paper then describes the implementation of the architecture on NetFPGA and Click using commodity switches and PCs. A motley set of traffic generation schemes, namely stride, random and staggered, is benchmarked on the architecture to demonstrate its versatility. While static table allocation achieves 75% of bisection bandwidth (since it is limited by localized congestion), the flow-based approaches reach 93%, thereby giving near 1:1 oversubscription. The authors use grid layout packaging to reduce the cable lengths and wiring complexity of the fat-tree topology while at the same time allowing incremental deployment. Each pod, with its constituent host nodes and its edge and aggregation switches, is packaged into a super-switch called the pod switch. The ports fanning out of a pod switch are connected to the core layer switches of the fat tree. This packaging is shown to scale and to leave room for further cabling optimizations. Overall, the paper makes a very ambitious claim of achieving close to 100% bisection bandwidth in data center networks, and the validity of the claim is substantially supported by the implementation and algorithmic details demonstrated. However, since a project of this magnitude requires large scale deployment testing to prove that it scales, commercial viability and deployment of fat-trees in existing data centers will be a matter of gradual change. Further, the proposed packaging scheme also needs to address issues of power consumption that arise in grid layout cabling architectures.

 

A Scalable, Commodity Data Center Network Architecture May 7, 2009

(i) The three most important things the paper says:

1) The use of “two level look up” to distribute traffic and maintain packet ordering.   If this wasn’t taken into account, the use of the fat-tree scheme may actually perform worse.  This is because in extremes, it will either not use all possible paths effectively, and/or packet re-ordering will occur detrimentally.

2) The use of Flow Classification and Flow Scheduling to reduce local and global congestion respectively.  Although this wasn’t necessary to make the fat tree topology work, it was an important addition to make it “on par” with existing commercial router solutions.

3) The observation that many applications must exchange information with remote nodes before proceeding with local computation.  This was a key element in the paper, since most of the paper was devoted to inter-node communication.  And it serves as the key reason why this system scales much better than traditional 2/3 tier tree topology.

(ii) The most glaring problem with the paper:

Although they acknowledge that the amount of wiring for a fat tree approach is much higher (factor of 10), they didn’t factor that into the cost.  One can make a strong case that, depending on the physical distance of cables required, the cost benefit will not be substantial over time.  The price could also increase if we switched to other cable media (e.g. fiber optics).

(iii) The future research directions of the work:

The future research of the work would definitely involve some analysis of how this topology works in an environment where users connect to the internet as well.  If we need to consider the internet, there may be much more traffic going towards the “Core”.  In this case, how would the “Two Tier Table”, “Flow Classification” and “Flow Scheduling” behave in this fat tree topology?

 

A Scalable, Commodity Data Center Network Architecture May 7, 2009

Filed under: R11. A Scalable, Commodity, Data Center Network Architecture — filipposeracini @ 5:39 pm

This  paper presents a new architecture to structure a data center network, based on commodity Ethernet switches and routers.

Typical data center networks can either be extremely expensive, if based on high end routers, or, when based on commodity Ethernet switches, they can become oversubscribed, so that the available bandwidth per host is much smaller than the link bandwidth.

The architecture in the paper tries to address both of the concerns above. On the one hand, it is much cheaper because it is based on commodity switches, and on the other hand it addresses the bandwidth problem by using a fat tree structure.

The three most important characteristics of the presented approach are:

  1. Cost. The components used for creating the infrastructure are commodity switches. Hence the new architecture can achieve very high bandwidth without requiring either a huge investment in high end hardware or specialized components and protocols. This also makes it possible to create an infrastructure that is compatible with hosts running Ethernet and IP.
  2. Flow optimization. The architecture is optimized to handle large flows by reserving routes for them. Once a large flow is identified, a notice is sent to the central scheduler, which then finds a not-already-reserved route and assigns it to the flow. This optimization was adopted because studies indicated that a few large, long-lived flows are responsible for most of the bandwidth. By reserving a route with little or no traffic on it, the proposed architecture tries to avoid overlapping flows as much as possible, hence achieving better performance and bandwidth utilization. The remaining bursty traffic is routed on the other available routes.
  3. Fault tolerance. By having a central scheduler that has a complete knowledge of the network, the proposed architecture is able to recover quickly from a link failure by rerouting traffic on some other route. It is important to say that the fat tree structure implies redundancy among the routes, hence increasing availability and fault tolerance.

I would define as the main drawback of this approach the massive cabling requirement that the fat tree structure implies. In the paper the authors suggest a solution to this problem, but the local requirements of a data center could prevent it from applying such a cabling scheme.

Another problem that I can see with the fat tree structure is management. This approach requires many more switches to maintain, control and coordinate in order to apply network policies. In the paper this problem is not mentioned at all. Studying how this architecture deals with management issues could be an interesting avenue for research as well as a further proof of the quality of this approach.

 

A Scalable, Commodity Data Center Network Architecture May 7, 2009

1)

i. The traditional data center network architecture of an oversubscribed hierarchical tree is expensive and cannot always provide the maximum bandwidth between hosts. The authors propose a different architecture that uses commodity switches in a fat-tree layout to provide a cheaper solution that also guarantees the maximum aggregate bisection bandwidth. This fat-tree layout combines hosts into subnets and subnets into pods and finally pods into the core of the network. The key distinction is that at each layer boundary (except the host connecting to the subnet) there are multiple links to the next layer up instead of just one, so a switch has a choice of how it directs traffic up to the next layer.

ii. In the proposed architecture, traditional routing protocols all have issues with scalability and/or create unnecessary congestion. The authors propose a novel two-level IP lookup scheme that only needs to be implemented on the switches. When a packet is travelling up the fat-tree, a switch will look up the longest matching prefix, but the prefix might point to a secondary table containing suffixes, and then the algorithm will also attempt to look up the longest matching suffix from that table. If there is only a matching prefix, the packet is routed like it normally would be under IP. However, if there is a matching suffix, the entries in the suffix table serve to diffuse traffic across multiple links/switches as it travels upwards.

iii. Dynamic flow scheduling can help provide bandwidth very close to the ideal maximum. The algorithm they use sends out new flows along the least loaded port, but after the flow grows past a certain size it notifies a central scheduler about the large flow. The central scheduler knows the topology and also about all other large flows and attempts to route all large flows along non-conflicting paths if possible.
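
As a rough illustration of the central scheduler in point iii, the sketch below reserves a core switch for each large inter-pod flow by linearly searching for one whose links are not already held by another large flow. The model is deliberately simplified and the names are hypothetical: it assumes one link between every pod and every core switch and ignores the edge-to-aggregation links inside a pod.

```python
class CentralScheduler:
    """Toy scheduler that places large flows on non-conflicting core paths."""

    def __init__(self, k):
        self.num_cores = (k // 2) ** 2
        self.reserved = set()   # links currently held by large flows
        self.placement = {}     # flow id -> (core, links held by the flow)

    def place(self, flow_id, src_pod, dst_pod):
        """Linearly search the core switches for a conflict-free path."""
        for core in range(self.num_cores):
            links = ((src_pod, core, "up"), (dst_pod, core, "down"))
            if not any(link in self.reserved for link in links):
                self.reserved.update(links)
                self.placement[flow_id] = (core, links)
                return core
        return None  # no free path: the flow stays on its default route

    def release(self, flow_id):
        """Free the links of a flow that has ended."""
        placed = self.placement.pop(flow_id, None)
        if placed is not None:
            self.reserved.difference_update(placed[1])


if __name__ == "__main__":
    scheduler = CentralScheduler(k=4)
    print(scheduler.place("a", 0, 1))  # first large flow out of pod 0
    print(scheduler.place("b", 0, 2))  # second one must avoid a's uplink
```

Two large flows leaving the same pod end up on different core switches, which is the non-conflicting placement the review describes. A real scheduler would also have to track the links inside each pod.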

2) One thing the paper doesn’t address in my opinion is why they chose a fat-tree topology as opposed to some other topology that also has redundant links. This might however be due to ignorance on the part of myself since it seems much of the alternate work also uses fat-tree and apparently so do the telephone networks, but a short discussion would have been helpful.

3) The future research would be to actually go out and implement this architecture in a large scale data center with 20K+ hosts and evaluate if it actually does perform as expected.

 

A Scalable, Commodity Data Center Network Architecture May 7, 2009

Data center architectures today typically use trees of switches and routers to connect the nodes of a cluster. When scaling the data center size and the communication fabric, bandwidth between nodes of a cluster is the primary bottleneck. Even with the highest-end IP switches and routers, we still have the problems of low aggregate bandwidth at the edge of the network, high cost, and high power consumption as we scale the number of nodes. Specialized hardware and protocols such as InfiniBand and Myrinet scale to a large number of nodes and provide high aggregate bandwidth, but do not utilize commodity parts and are very expensive. This paper shows how to leverage commodity switches to support the full aggregate bandwidth of a cluster that can scale beyond tens of thousands of nodes. The setup will reduce costs, make cost scale linearly with the number of hosts, and lower power consumption.

1. The paper arranges the switches in the network architecture to a special instance of a Clos topology called fat-tree. The k-ary fat-tree is organized into k pods with each pod consisting of 2 layers of (k/2) switches. The bottom layer of switches with k ports connect to k/2 hosts, while the other k/2 ports connect to the aggregation layer at the top. The top aggregation layer connects to the core switches. There are (k/2)^2 k-port core switches connecting to all k pods. The total number of hosts supported is (k^3)/4 in general. The IP address assignment of this network architecture is 10.pod.switch.1 for the pod switches, and 10.pod.switch.hostid for the hosts. The core routers have IP address 10.k.j.i where j and i are the coordinates.

2. Routing of packets becomes another issue. This paper presents a two-level routing table that uses prefixes in the first level to match subnets going down the tree. If no prefix matches the subnet, the second-level table is looked up using suffixes to forward packets up the tree towards the core routers. The two-level routing table is implemented using a TCAM and a priority encoder to address forwarding tables in RAM. An algorithm is presented to compute the static routing tables at each switch in the fat-tree topology.

3. Static switching has downsides, which include no fault tolerance and not making full use of the redundant paths to a destination node in the fat-tree topology. The paper introduces flow-based scheduling with a centralized scheduler that finds the best underutilized path. Fault tolerance is simply carried out by marking failed links as busy in the central scheduler, so that flows are routed around the fault. Switches maintain bidirectional forwarding detection (BFD) sessions with neighboring switches to determine if a link is down. Switches can fall back to two-level routing when the state of the flows is lost.
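
The addressing scheme from point 1 can be sketched as a small generator. The address patterns (10.pod.switch.1, 10.pod.switch.hostid, 10.k.j.i) come from the summary above. The exact index ranges used here are assumptions: host IDs starting at 2 and core coordinates j, i running from 1 to k/2.

```python
def fat_tree_addresses(k):
    """Return (pod_switches, hosts, cores) address lists for a k-ary fat-tree."""
    if k < 2 or k % 2:
        raise ValueError("k must be a positive even port count")
    half = k // 2
    # Every pod has k switches: k/2 edge switches and k/2 aggregation switches.
    pod_switches = [f"10.{pod}.{sw}.1" for pod in range(k) for sw in range(k)]
    # Each edge switch (assumed to be switch 0 .. k/2 - 1) serves k/2 hosts.
    hosts = [
        f"10.{pod}.{sw}.{host_id}"
        for pod in range(k)
        for sw in range(half)
        for host_id in range(2, half + 2)
    ]
    # Core switches sit on a (k/2) x (k/2) grid with coordinates (j, i).
    cores = [
        f"10.{k}.{j}.{i}"
        for j in range(1, half + 1)
        for i in range(1, half + 1)
    ]
    return pod_switches, hosts, cores


if __name__ == "__main__":
    switches, hosts, cores = fat_tree_addresses(4)
    print(len(switches), len(hosts), len(cores))
    print(hosts[:4], cores)
```

For k = 4 this yields 16 pod switches, 16 hosts and 4 core switches, matching the k^3/4 host count and the (k/2)^2 core count given in point 1.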

One drawback is the cabling complexity though this paper addresses that but more work can be done there. Another drawback is if we have flow scheduling and two level route lookups, commodity switches will then have to be modified to support most of this in hardware for high performance. Further research also can be done in measuring actual power benefits, wiring complexity issues, and real world large scale implementation performance.

 

A Scalable, Commodity, Data Center Network Architecture May 7, 2009

This paper introduces a new solution for data centers. Instead of buying expensive specialized IP switches and routers that may only support 50 percent of the aggregate bandwidth available at the edge of the network, the authors use commodity Ethernet switches to support the full aggregate bandwidth of clusters that consist of tens of thousands of elements. Not only do they save on cost and get better bandwidth, they also reduce power consumption and heat output by using commodity Ethernet switches. Their 3 main goals in the paper were: scalable interconnection bandwidth, economies of scale and backwards compatibility.

Major Points:
1.) K-ary Fat Tree
The authors use a three-tiered design: core, aggregation and edge. Using a fat tree structure, the authors were able to scale well without requiring higher-speed uplinks in the fat-tree. Further advantages of a fat tree are that all switching elements are identical, enabling them to leverage cheap commodity parts for all of the switches, which reduces the cost of building the data center, and that fat-trees are rearrangeably non-blocking, which means that for arbitrary communication patterns there is some set of paths that will saturate all the bandwidth available to the end hosts in the topology. With fat trees the cost of building a data center using commodity parts is attractive, and fat trees allow the rest of the functionality of the paper to work.
2.) Flow Classification
It is important that their solution stay backwards compatible with hosts running Ethernet and IP. Flow classification dynamically assigns flows to output ports. Packets of the same flow must be forwarded on the same outgoing port so that they are not reordered. The classifier also helps with the bandwidth problem by reassigning a minimal number of flows to different output ports to minimize the disparity between the aggregate flow capacity of different ports.
3.) Two-Level Lookup
To get traffic diffusion that takes advantage of the structure of the topology, the authors use a two-level routing table. The first-level lookup is a prefix lookup used to route down the topology to an end host. The second-level lookup is a suffix lookup used to route up towards the core; it diffuses and spreads out traffic while keeping packets to the same host on the same ports, which maintains packet order. They use special TCAM hardware to implement the two-level lookup.
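
The flow classification idea in point 2 can be sketched as follows. This is a simplified illustration, not the paper's switch implementation: the class and method names are made up, a flow is identified only by its (source, destination) pair, and the load counters are never aged or reset.

```python
class FlowClassifier:
    """Toy per-switch classifier: pin each flow to a port, rebalance rarely."""

    def __init__(self, uplink_ports):
        self.load = {port: 0 for port in uplink_ports}  # bytes sent per port
        self.port_of = {}    # flow -> assigned output port
        self.bytes_of = {}   # flow -> bytes seen so far

    def forward(self, src, dst, size):
        """Return the output port; packets of one flow always share a port."""
        flow = (src, dst)
        if flow not in self.port_of:
            # A new flow goes to the currently least-loaded port.
            self.port_of[flow] = min(self.load, key=self.load.get)
            self.bytes_of[flow] = 0
        port = self.port_of[flow]
        self.load[port] += size
        self.bytes_of[flow] += size
        return port

    def rebalance(self):
        """Move at most one flow from the busiest port to the idlest one."""
        busiest = max(self.load, key=self.load.get)
        idlest = min(self.load, key=self.load.get)
        gap = self.load[busiest] - self.load[idlest]
        # Moving a flow of s bytes changes the gap to |gap - 2s|,
        # so only flows smaller than the gap can narrow it.
        candidates = [
            flow for flow, port in self.port_of.items()
            if port == busiest and self.bytes_of[flow] < gap
        ]
        if not candidates:
            return None
        flow = min(candidates, key=lambda f: abs(gap - 2 * self.bytes_of[f]))
        self.port_of[flow] = idlest
        self.load[busiest] -= self.bytes_of[flow]
        self.load[idlest] += self.bytes_of[flow]
        return flow
```

Pinning a flow to one port is what avoids reordering, and moving only a single flow per `rebalance` call mirrors the idea of reassigning a minimal number of flows to even out the ports.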

Glaring Problem:
I think the problem with this paper is that, although the design is fault tolerant, they do not show the performance impact of a switch or host going down. In a data center, even though a link is alive, the performance might drop to an unsatisfactory state. This paper also limits the data center to a fixed size.

Future Work:
The impact of this paper is huge. With the cost savings, lower heat and power consumption and better aggregate bandwidth, data centers will be very tempted to use commodity Ethernet switches instead of expensive specialized switches. As 10 GigE switches become commodity, the authors claim that commodity switches are the only way to get full bandwidth out of the clusters. So regardless of whether their specific solutions work, the paper is important. More work will be done to get these commodity switches to scale efficiently as commodity switches get faster.