Graduate Networks, UCSD

CSE222 – Spring 2009

Web Search for a Planet: The Google Cluster Architecture May 12, 2009

This paper describes the Google cluster architecture. It evaluates the workload required to offer a high-performance web search system and shows how the cluster architecture provided in Google’s data center allows high query throughput while keeping the cost per query low.

The main goals of their cluster are as follows:

  1. Provide reliability in software instead of hardware. Typical data centers use many redundant hardware systems (redundant power supplies, RAID arrays for disk accesses, etc.) in order to maintain system uptime. However, Google chooses not to use these mechanisms because they increase cost. Rather, they design their software to detect and route around failed hardware. This allows them to use much cheaper commodity machines, even though they are more prone to failure because the software is resilient against the failure of a small number of nodes.
  2. Use replication to provide throughput and availiability. Since their workload is very easily distributed, they exploit a huge amount of thread-level parallelism. When a query enters the data center, it is portioned out to dozens of machines, some of which access a distributed inverted index, some of which perform spell checking and ad generation from the search query, and some of which look up the documents which are returned by the index access. This allows the query to be satisfied very quickly even though each node in the system is not particularly fast. In addition to intra-data-center replication, Google also has multiple data centers spread throughout the world in order to provide availability in the face of natural disasters as well as localizing query responses to improve query latency.
  3. Maximize price per unit of performance instead of peak performance. In many respects, using “modern hardware” is not too helpful. The Pentium IIIs they are using in their system are objectively much slower than Pentium IVs for many workloads. However, since their workload consists of a large amount of thread-level parallelism and has minimal instruction-level parallelism, moving to Pentium IVs, which only improve on Pentium IIIs on code which contains higher ILP would actually reduce the performance of their system, especially from a cost/query point of view. It also means that any power saving measures need to be balanced against the resulting performance loss, since the metric that Google cares about is Watts/query.

The weakness of this paper is that it is a very high-level overview which does not provide too much detail to allow someone else to implement a similar system. This is to be expected because Google’s architecture is part of their “secret sauce,” which allows them to provide high performance search queries and dominate the search (and thus their advertisement) market.

Future research in this field should include a more general framework for distributing these tasks over a wide cluster as well as an analysis of the class of problems which can be solved by this approach. Clearly it is not well-suited for applications like scientific computing, which are very much peak performance based and relatively insensitive to latency, as well as being more dependent on communication between computation nodes (as compared to Google’s essentially read-only architecture). However, as a general model for serving web requests to a large number of users, it could easily serve as a more realistic model than the traditional “small number of powerful machines” approach.

 

Web Search For a Planet: The Google Cluster Architecture May 12, 2009

(i) The three most important things the paper says:

1) By providing reliability in software rather than in server class hardware, they can use commodity PCs to build a high end computing cluster at a low end price.  They argue that since hardware is intrinsically prone to failure, it was a more practicle idea to support reliability in software.  And sice they are going to support reliability in software, there was no advantage from a reliability point of view to buy more expensive hardware.

2) By choosing machines based on “price/performance”, then commodity PCs is the superior choice over specialize server hardware.  For instance, they argue that since their system parallelizes very well motherboards with 4 processors doesn’t perform well enough to recoup the cost.  However they do acknowledge that this was a case for the class of problems that they are solving, and that other applications may indeed demand a motherboard with 4 processor to satisfy the “price/performance” argument.

3) Even though the Google cluster has many machines by using commodity PCs, they show that this is manageable since they have a homogeneous system.  They argue that since indexes are broken up into shards and each machine has a specialized function, all machines have identical configuration.  This is another important factor in the argument to use commodity machines.

(ii) The most glaring problem with the paper:

The biggest problem with the paper is that they didn’t really address the “power problem” in full.  They provide calculations of how much power a “Google Cluster” would take and what it would take to justify the purchase of “low power servers”.  However they didn’t specify how they addressed this problem in their solution of commodity clusters.  Perhaps they were afraid of giving out their intellectual property though.

(iii) The future research directions of the work:

The future research of the work would be to see how scalable their system is as the searching/datamining algorithm becomes more complex.  Will the system of merging results from multiple processes continue to be a trivial task?  Also they didn’t discuss the merging step much, but since it is an aggregator, would it benefit from using a non-commodity PC?

 

A Scalable, Commodity, Data Center Network Architecture May 7, 2009

By organizing data center clusters in a fat-tree topology, the authors show that an over-subscription ratio of 1:1 can be achieved with commodity Gbps switches at a cost considerably lower than with traditional hierarchical topologies. The key behind the fat-tree design is that between any two nodes, there exists a large number of Gbps paths. These paths allow packets to be delivered fairly independently of other flows between other nodes. This avoids saturation of switch ports that would otherwise occur using a traditional aggregating hierarchical design.

The major contributions include the fat-tree topology, the two level table routing, and flow classification/scheduling routing algorithms. The routing algorithms provide ways to use the fat-tree topology so as to avoid congestion from port contention. They are necessary to support full Gbps speeds between any two nodes.

The fat-tree topology is a pattern for connecting nodes to switches with k ports. There are 2 levels to the design, a set of core switches and a set of pods. There are k core switches and k pods. Each node is part of a pod. Pods contain k nodes. A lower level pod switch is connected to k/2 nodes and k/2 upper level pod switches. A upper level pod switch is connected to k/2 lower level pod switches and k/2 core level switches. Each core switch is connected to all k pods. This topology allows multiple paths between any pair of nodes.

The two level table routing algorithm uses the last three octets in a 32 bit IP address to identify which port to route packets (at every switch in the system). The pattern assumes 10 for the first octet, pod id for the second, switch id (within a pod) for the third, and node id (within a pod) for the last. When a packet is received by a pod switch from a node, it looks at the destination address to determine which port to forward it on through. The basis of the two level table routing allows the switch to check the first table for a matching prefix. If a terminating prefix is found, the destination is local to the pod and forwarded out the appropriate port. If not, the address’ suffix is used as a lookup in the second table. The output port found in the second table forwards the packet up to the next switch. Using this method, the switches can route traffic to different nodes on different ports. Thus avoid congestion.

The flow classification/scheduling routing algorithm works to avoid congestion by letting switches route packets among any of its available ports. When a flow of packets is identified as being long enough to constitute a long lived flow, a flow scheduler is notified. The flow scheduler keeps track of the flows running in the system and can adjust the path a flow takes through the topology so as to reduce the load across the core and pod switches.

The fat-tree topology appears to be a great way to use commodity switches to achieve over-subscription ratios of 1:1. However, the major drawback is the lack of commodity switch support for the routing protocols needed to attain Gbps speeds on the topology. The authors suggest that implementing the two level routing algorithm represents only a minor change to existing switch technology. This may be true, but if that change cannot be easily instrumented in commodity switches, this approach is relegated to academic simulations.

I’d like to see future research extend the idea of using a fat-tree topology with commodity switches that can be configured to run at Gbps speeds, without the need for a custom routing protocol. This would provide organizations with a tangible solution that can be used in place of traditional hierarchical topologies.