Graduate Networks, UCSD

CSE222 – Spring 2009

Web Search for a Planet: The Google Cluster Architecture May 12, 2009

(i) the three most important things the paper says

One of the most important points in this paper is that, at the volume of requests Google receives, aggregate throughput matters much more than single-request latency.  A typical scenario makes the reasoning concrete: does it matter more that a single user receives a response in 0.1 seconds rather than 1 second, or that a particular server or cluster serves 10x as many users at 1 second each rather than 1x at 0.1 seconds?  This observation lets Google save a great deal of money on server hardware by purchasing for the best price/performance rather than for peak performance.  Another important idea the paper demonstrates is that hardware replication (redundancy) is a much easier and more cost-effective way to handle failures than writing software that handles those failures gracefully.  At the price point of commodity hardware, replication is extremely cheap, while software developer time is much more expensive.  Such software might also need frequent changes depending on the type of hardware used, which would require even more engineering time.  A third important observation is the price/performance disadvantage that low-power hardware typically has relative to standard commodity hardware.  The paper says that the savings in power and cooling costs must outweigh the extra cost of the hardware itself, taking into account how long that hardware will last.  When this paper was written, commodity hardware won this battle.
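The low-power trade-off can be made concrete with a small calculation. The sketch below is only an illustration: the function names, prices, wattages, and electricity rates are hypothetical values chosen for the example, not figures from the paper, and it assumes both servers deliver equal performance.

```python
HOURS_PER_YEAR = 24 * 365


def lifetime_cost(hardware_price, watts, years, dollars_per_kwh, cooling_overhead=0.0):
    """Purchase price plus energy cost over the server's useful life.

    cooling_overhead is the extra energy spent on cooling, expressed as a
    fraction of the server's own draw (an assumption of this sketch).
    """
    energy_kwh = watts / 1000 * HOURS_PER_YEAR * years * (1 + cooling_overhead)
    return hardware_price + energy_kwh * dollars_per_kwh


def low_power_pays_off(commodity, low_power, years, dollars_per_kwh, cooling_overhead=0.0):
    """Each server is a (hardware_price, watts) tuple; equal performance is assumed."""
    commodity_price, commodity_watts = commodity
    low_price, low_watts = low_power
    commodity_total = lifetime_cost(
        commodity_price, commodity_watts, years, dollars_per_kwh, cooling_overhead
    )
    low_total = lifetime_cost(low_price, low_watts, years, dollars_per_kwh, cooling_overhead)
    return low_total < commodity_total


# Hypothetical servers: the low-power one costs more but draws half the power.
commodity = (1000, 200)
low_power = (1500, 100)

print(low_power_pays_off(commodity, low_power, years=3, dollars_per_kwh=0.10))
print(low_power_pays_off(commodity, low_power, years=3, dollars_per_kwh=0.30,
                         cooling_overhead=1.0))
```

With these made-up inputs, the low-power server only wins once energy becomes expensive enough or cooling overhead is counted, which matches the shape of the paper's argument that hardware lifetime and energy price decide the outcome.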

(ii) the most glaring problem with the paper

One of the biggest problems with this paper is that it contains no analysis of alternative storage technologies.  We’re expected to accept that hard disks are the way to go without any explanation.  Alternative memory technologies are much more prevalent now and should be included in such a justification, as many of them provide significant latency, power, and durability advantages over commodity disks.

(iii) the future research directions of the work

It would be interesting to see numbers on how switching to lower-power servers would affect Google’s overall power usage (and how that change would affect power generation companies, and thereby the environment as a whole).  It would also be interesting to see how chip multiprocessors (CMPs) and simultaneous multithreading (SMT) would affect power usage compared with single-processor commodity hardware.  It may be the case that a mix of low-power hardware with CMP or SMT technology would save money overall.

 

Web Search for a Planet: The Google Cluster Architecture May 12, 2009

Main contributions:

  1. A scalable architecture that provides an Internet-scale service using cheap desktop-class PCs.
  2. Showed that fault tolerance can be achieved at large scale by using software to overcome faulty hardware.
  3. Provided an alternative metric for evaluating large-scale architectures (i.e., not just raw performance, but the cost-to-performance ratio).

Major problem:

This architecture is for a very specific application and workload (i.e., easily parallelized tasks, throughput-oriented performance goals, and a read-dominant workload). In applications that require communication between tasks or that maintain state between jobs, shared-memory systems with closely coupled CPUs might be necessary.

Future implications:

This raises large issues about the power efficiency of desktop-class PCs. There are two routes that can be taken: either improve the cooling mechanisms used in large data centers, or increase the power efficiency of the machines while keeping the cost-to-compute ratio the same. Both are difficult problems.

 

Web Search for a Planet: The Google Cluster Architecture May 12, 2009

The paper is really an overview of how Google’s search queries are handled and what factors are taken into account when making hardware decisions. The query architecture is massively parallelized and load balanced across many commodity PCs. Thousands of PCs form a cluster, and there are many clusters geographically distributed all over the world. Google’s query application is representative of a certain class of application that can be parallelized across computational units.

The main contribution of this paper is showing that parallelizable applications can be deployed over many different computers. This approach provides scalability, robustness, and flexibility when evaluating and purchasing hardware.

Instead of purchasing several high-end computers that have multiple processors, Google can purchase considerably more mid-range computers with modest performance for the same price. Although these mid-range commodity PCs have lower peak performance, Google’s queries can still be served just as fast. The key is that because the application can be partitioned to run on multiple computers, many PCs can run in parallel to provide the result. Thus, the primary metric for purchasing PCs becomes price/throughput, not price/peak-computation.
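A quick calculation shows why this metric favors commodity PCs. The sketch below is a rough illustration only: the prices and per-machine query rates are hypothetical, the function names are my own, and it assumes the workload partitions perfectly, so that fleet throughput is simply the sum of per-machine throughput.

```python
def throughput_per_dollar(queries_per_second, price_dollars):
    """Throughput bought per dollar for one machine type."""
    return queries_per_second / price_dollars


def fleet_throughput(budget_dollars, price_dollars, queries_per_second):
    """Aggregate throughput of as many machines as the budget buys.

    Assumes perfect partitioning: every machine contributes its full rate.
    """
    machines = budget_dollars // price_dollars
    return machines * queries_per_second


# Hypothetical machine types (not figures from the paper).
high_end = {"price_dollars": 20000, "queries_per_second": 400}
commodity = {"price_dollars": 2000, "queries_per_second": 100}

budget = 100000
print(fleet_throughput(budget, **high_end))   # 5 machines  -> 2000
print(fleet_throughput(budget, **commodity))  # 50 machines -> 5000
```

In this made-up case the high-end machine is four times faster per box, but the commodity fleet serves more queries for the same budget because each dollar buys more throughput.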

Of course, maintaining more PCs adds to the overhead cost. This is mitigated by two observations: most PCs have a useful life of about 3 years, and administering 1000 PCs takes about the same effort as administering 100.

A result of having more computers is that the system becomes inherently more robust. If a computer fails, the loss represents a smaller percentage of the cluster. Furthermore, multiple individual PCs allow failures to be contained. A failure on a multi-processor or multi-core computer often brings down all the processors/cores.

Lastly, the distributed computational nature of Google’s query application allows for great scalability. Simply adding more computers reduces the amount of work performed on each computer for a single query, which allows each computer to perform computations for more queries.
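The scaling argument can be sketched as a scatter-gather search over index partitions. This is a minimal sketch, not Google's implementation: the shard layout (a dict from term to scored documents), the function names, and the use of threads to stand in for separate machines are all assumptions made for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, term):
    """Look up one term in one shard; returns (score, doc_id) pairs."""
    return [(score, doc_id) for doc_id, score in shard.get(term, [])]


def search(shards, term, k=3):
    """Query every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = pool.map(lambda shard: search_shard(shard, term), shards)
        hits = [hit for partial in partials for hit in partial]
    # Keep the k best-scoring documents across all shards.
    return [doc_id for _score, doc_id in heapq.nlargest(k, hits)]


# Two hypothetical shards, each holding a disjoint slice of the documents.
shard_a = {"cluster": [("doc1", 0.9), ("doc2", 0.4)]}
shard_b = {"cluster": [("doc7", 0.7)], "google": [("doc8", 0.5)]}

print(search([shard_a, shard_b], "cluster", k=2))  # ['doc1', 'doc7']
```

Splitting the same documents across more shards shrinks the slice each worker scans per query, which is the effect the paragraph above describes; the merge step is the only part that sees results from every shard.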

The only drawback of this architecture is that it applies only to the class of applications that can be easily parallelized. While many applications can be segmented to execute computations on disjoint pieces of data, with little synchronization or cross-computer communication, I feel that most applications cannot. Indeed, this is a great architecture for Map/Reduce applications.

I’d be interested to see how this architecture performs when the application has high inter-PC communication requirements. I expect that there is some point in the spectrum where a faster processor interconnect becomes necessary (e.g., a front-side bus, or FSB).