Graduate Networks, UCSD

CSE222 – Spring 2009

Web Search for a Planet: The Google Cluster Architecture May 12, 2009

(i) the three most important things the paper says

One of the most important things that this paper says is the fact that, at the volume of requests that Google receives, aggregate throughput is much more important than single request latency.  This reasoning follows quite well with reasoning of a typical scenario: does it matter more that a single user receives a response in .1 seconds vs 1 second, or does it matter more that a particular server/cluster served 10x more users at 1 second versus just 1x at .1 seconds.  This observation allows Google to save tons of money in server hardware by purchasing with the most economical (performance/price) mindset.  Another important idea that the paper demonstrates is that hardware replication (redundancy) is much easier/cost-effective when handling failures than writing software that will handle those failures gracefully.  Hardware replication, at the price-point of commodity hardware is extremely cheap, while software developer time is much more expensive.  Also, this type of software might require frequent changes depending on the type of hardware used, which would require even more engineering time.  A third important observation made in the paper detailed the price/performance disadvantage of concentrating on low-power hardware versus standard commodity hardware (typically).  The paper says that the power and cooling cost savings must outweigh the cost of the hardware itself (while factoring in how long that hardware will last).  When this paper was written, commodity hardware won this battle.

(ii) the most glaring problem with the paper

One of the biggest problems in this paper is that it is devoid of any alternate storage analysis.  We’re expected to take the analysis that hard disks are the way to go without any explanations.  Alternate memory technologies are much more prevalent now and should be included in such a justification, as many of them provide great latency, power, and durability advantages over commodity disks.

(iii) the future research directions of the work

It would be interesting to see some analysis numbers on how switching to lower-power servers would impact the power usage of Google as a whole (and how that impact would translate to power generation companies, and thereby the environment as a whole).  It would also be interesting to see how CMPs and SMT would help power usage (versus single processor commodity hardware).  It may be the case that a mix of low-power hardware with CMP or SMT technology might save money overall.

 

Web Search for a Planet: The Google Cluster Architecture May 12, 2009

Main contributions:

A scalable architecture to provide an internet scale service with cheap desktop-class PCs.
Showed that Fault-Tolerance can be achieved at a large scale using software to overcome faulty hardware.
Provided an alternative metric for evaluating large scale architectures. (i.e. not just raw performance, but a cost to performance ratio)
  1. A scalable architecture to provide an internet scale service with cheap desktop-class PCs.
  2. Showed that Fault-Tolerance can be achieved at a large scale using software to overcome faulty hardware.
  3. Provided an alternative metric for evaluating large scale architectures. (i.e. not just raw performance, but a cost to performance ratio)

Major problem:

This architecture is for a very specific application, and workload. (i.e. easily parallelized tasks, throughput oriented performance goals and a read-dominant workload) In applications that require communication between tasks, or maintain state between jobs, shared-memory systems with closely coupled CPUs might be more necessary.

Future implications:

This raises large issues about power efficiency of desktop-class PCs. There are two routes that can be taken; either improve the cooling mechanisms used in large data centers, or increase the power efficiency of the machines while keeping the cost to compute ratio exactly the same. Both are difficult problems.

 

Web Search for a Planet: The Google Cluster Architecture May 12, 2009

This paper describes the Google cluster architecture. It evaluates the workload required to offer a high-performance web search system and shows how the cluster architecture provided in Google’s data center allows high query throughput while keeping the cost per query low.

The main goals of their cluster are as follows:

  1. Provide reliability in software instead of hardware. Typical data centers use many redundant hardware systems (redundant power supplies, RAID arrays for disk accesses, etc.) in order to maintain system uptime. However, Google chooses not to use these mechanisms because they increase cost. Rather, they design their software to detect and route around failed hardware. This allows them to use much cheaper commodity machines, even though they are more prone to failure because the software is resilient against the failure of a small number of nodes.
  2. Use replication to provide throughput and availiability. Since their workload is very easily distributed, they exploit a huge amount of thread-level parallelism. When a query enters the data center, it is portioned out to dozens of machines, some of which access a distributed inverted index, some of which perform spell checking and ad generation from the search query, and some of which look up the documents which are returned by the index access. This allows the query to be satisfied very quickly even though each node in the system is not particularly fast. In addition to intra-data-center replication, Google also has multiple data centers spread throughout the world in order to provide availability in the face of natural disasters as well as localizing query responses to improve query latency.
  3. Maximize price per unit of performance instead of peak performance. In many respects, using “modern hardware” is not too helpful. The Pentium IIIs they are using in their system are objectively much slower than Pentium IVs for many workloads. However, since their workload consists of a large amount of thread-level parallelism and has minimal instruction-level parallelism, moving to Pentium IVs, which only improve on Pentium IIIs on code which contains higher ILP would actually reduce the performance of their system, especially from a cost/query point of view. It also means that any power saving measures need to be balanced against the resulting performance loss, since the metric that Google cares about is Watts/query.

The weakness of this paper is that it is a very high-level overview which does not provide too much detail to allow someone else to implement a similar system. This is to be expected because Google’s architecture is part of their “secret sauce,” which allows them to provide high performance search queries and dominate the search (and thus their advertisement) market.

Future research in this field should include a more general framework for distributing these tasks over a wide cluster as well as an analysis of the class of problems which can be solved by this approach. Clearly it is not well-suited for applications like scientific computing, which are very much peak performance based and relatively insensitive to latency, as well as being more dependent on communication between computation nodes (as compared to Google’s essentially read-only architecture). However, as a general model for serving web requests to a large number of users, it could easily serve as a more realistic model than the traditional “small number of powerful machines” approach.

 

Web Search For a Planet: The Google Cluster Architecture May 12, 2009

(i) The three most important things the paper says:

1) By providing reliability in software rather than in server class hardware, they can use commodity PCs to build a high end computing cluster at a low end price.  They argue that since hardware is intrinsically prone to failure, it was a more practicle idea to support reliability in software.  And sice they are going to support reliability in software, there was no advantage from a reliability point of view to buy more expensive hardware.

2) By choosing machines based on “price/performance”, then commodity PCs is the superior choice over specialize server hardware.  For instance, they argue that since their system parallelizes very well motherboards with 4 processors doesn’t perform well enough to recoup the cost.  However they do acknowledge that this was a case for the class of problems that they are solving, and that other applications may indeed demand a motherboard with 4 processor to satisfy the “price/performance” argument.

3) Even though the Google cluster has many machines by using commodity PCs, they show that this is manageable since they have a homogeneous system.  They argue that since indexes are broken up into shards and each machine has a specialized function, all machines have identical configuration.  This is another important factor in the argument to use commodity machines.

(ii) The most glaring problem with the paper:

The biggest problem with the paper is that they didn’t really address the “power problem” in full.  They provide calculations of how much power a “Google Cluster” would take and what it would take to justify the purchase of “low power servers”.  However they didn’t specify how they addressed this problem in their solution of commodity clusters.  Perhaps they were afraid of giving out their intellectual property though.

(iii) The future research directions of the work:

The future research of the work would be to see how scalable their system is as the searching/datamining algorithm becomes more complex.  Will the system of merging results from multiple processes continue to be a trivial task?  Also they didn’t discuss the merging step much, but since it is an aggregator, would it benefit from using a non-commodity PC?

 

Can Software Routers Scale? April 30, 2009

Filed under: R09. Can Software Routers Scale? — stufflebean @ 4:24 pm
Tags: , ,

Yes (with caveats).

This paper provides two models of differing accuracy in order to explore the possibility of implementing a carrier-grade router using only commodity computers and software routing. First, they use a back-of-the-envelope calculation based on theoretical bandwidths of processor front-side busses (FSBs), PCI-Express, and memory. This highly-optimistic model shows that using processors which should be available in the near future (mesh-network instead of FSB), a 40Gb/s line-speed router should be possible, which is at the low end of carrier-grade.

They then proceed to refine this model by performing calculations using an actual software router running Click on a commodity server. After correcting for a memory bug which hurt performance, they found that performance was only limited by FSB, which should be mitigated by future processor architectures.

After addressing a couple of possible enhancements to their naive software router approach (amortizing packet descriptors and allowing the processor to snoop DMA data directly into cache), they go on to discuss the switching problem. One area in which a commodity server will never be able to compete with purpose-built products is in the speed of the switching fabric. However, to solve this, they propose using a cluster of machines, which will increase aggregate bandwidth while providing load balancing.

A weakness of this paper is that it tends to rely a bit too heavily on future technology without providing much in the way of convincing benchmarks. Their Click benchmark seemed to jibe with their theoretical result, but that does not mean that it will scale with the advancement of commodity server technology. They mention in their closing remarks that they are working on creating a complete cluster using a switching mesh, and it would be nice to see the results from that study before drawing conclusions from this paper.

Future research needs to examine whether the mesh-based processors can really provide the bandwidth promised by their model or whether there will be another bottleneck which they did not account for. Also, if the packet descriptors cannot be modified at the NIC level, they need to find a way to account for that in order for their claims of 40+Gb/s to hold. Further, they need to decide whether the power consumption, reliability, and packaging constraints involved in a clustered system are worth the price gap between a cluster and a purpose-built part.