Graduate Networks, UCSD

CSE222 – Spring 2009

Web Search for a Planet: The Google Cluster Architecture May 12, 2009

(i) the three most important things the paper says

The first important point is that, at the volume of requests Google receives, aggregate throughput matters much more than single-request latency.  A typical scenario makes the reasoning clear: does it matter more that a single user receives a response in 0.1 seconds rather than 1 second, or that a given server or cluster serves 10x as many users at 1 second each?  This observation lets Google save a great deal of money on server hardware by buying on price/performance rather than peak performance.

The second point is that hardware replication (redundancy) is an easier and more cost-effective way to handle failures than writing software that handles every failure gracefully.  At commodity price points, replication is extremely cheap, while developer time is much more expensive.  Such software might also need frequent changes depending on the hardware in use, which would require even more engineering time.

The third point concerns the price/performance disadvantage of low-power hardware relative to standard commodity hardware.  The paper says that the power and cooling savings must outweigh the extra cost of the hardware itself, taking into account how long that hardware will last.  When the paper was written, commodity hardware won this comparison.
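The break-even condition for low-power hardware can be made concrete with a little arithmetic. The sketch below is only an illustration: every number (prices, wattages, electricity rate, cooling overhead, lifetime) is a made-up assumption rather than a figure from the paper, and it assumes both servers deliver the same performance.

```python
def lifetime_energy_cost(watts, years, price_per_kwh, cooling_overhead):
    """Electricity cost of running one server continuously, including cooling.

    cooling_overhead is the extra energy spent on cooling per unit of server
    energy (0.5 means cooling adds 50% on top of the server's own draw).
    """
    hours = years * 365 * 24
    kwh = watts / 1000 * hours
    return kwh * (1 + cooling_overhead) * price_per_kwh


def total_cost(hardware_price, watts, years, price_per_kwh, cooling_overhead):
    """Purchase price plus lifetime power and cooling cost."""
    energy = lifetime_energy_cost(watts, years, price_per_kwh, cooling_overhead)
    return hardware_price + energy


# Hypothetical inputs, chosen only to illustrate the comparison.
YEARS = 3
PRICE_PER_KWH = 0.10
COOLING_OVERHEAD = 0.5

commodity = total_cost(1000, 200, YEARS, PRICE_PER_KWH, COOLING_OVERHEAD)
low_power = total_cost(1600, 100, YEARS, PRICE_PER_KWH, COOLING_OVERHEAD)

print(f"commodity server: ${commodity:,.2f}")
print(f"low-power server: ${low_power:,.2f}")
print("low-power pays off" if low_power < commodity else "commodity wins")
```

With these assumed inputs the low-power machine saves about $394 in power and cooling over three years, which does not cover its $600 price premium. A longer lifetime or a higher electricity price shifts the balance the other way.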

(ii) the most glaring problem with the paper

One of the biggest problems with this paper is that it contains no analysis of alternative storage technologies.  We’re expected to accept that hard disks are the way to go without any explanation.  Alternative memory technologies are much more prevalent now and should be included in such a justification, as many of them offer significant latency, power, and durability advantages over commodity disks.

(iii) the future research directions of the work

It would be interesting to see a quantitative analysis of how switching to lower-power servers would affect Google's overall power usage (and how that change would carry over to power generation companies, and thereby to the environment).  It would also be interesting to see how CMPs and SMT would affect power usage compared with single-processor commodity hardware.  It may be that a mix of low-power hardware with CMP or SMT technology would save money overall.

 

Web Search for a Planet: The Google Cluster Architecture May 12, 2009

Main contributions:

  1. A scalable architecture that provides an Internet-scale service using cheap desktop-class PCs.
  2. A demonstration that fault tolerance can be achieved at large scale by using software to overcome faulty hardware.
  3. An alternative metric for evaluating large-scale architectures: not just raw performance, but the cost-to-performance ratio.
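The difference between ranking by raw performance and ranking by cost-to-performance ratio can be shown with a small sketch. The two configurations and their prices and throughputs are hypothetical placeholders, not figures from the paper.

```python
from dataclasses import dataclass


@dataclass
class Config:
    name: str
    price_usd: float
    queries_per_sec: float

    def cost_per_qps(self) -> float:
        """Dollars spent per unit of sustained throughput (lower is better)."""
        return self.price_usd / self.queries_per_sec


# Hypothetical configurations used only for illustration.
configs = [
    Config("commodity PC rack", price_usd=250_000, queries_per_sec=8_000),
    Config("high-end SMP server", price_usd=750_000, queries_per_sec=10_000),
]

fastest = max(configs, key=lambda c: c.queries_per_sec)
best_value = min(configs, key=lambda c: c.cost_per_qps())

for c in configs:
    print(f"{c.name}: {c.queries_per_sec:,.0f} qps, ${c.cost_per_qps():.2f} per qps")
print("fastest:", fastest.name)
print("best value:", best_value.name)
```

Under these assumptions the two metrics pick different winners: the high-end server has the higher raw throughput, but the commodity rack costs less than half as much per query per second.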

Major problem:

This architecture targets a very specific application and workload: easily parallelized tasks, throughput-oriented performance goals, and mostly reads. For applications that require communication between tasks, or that maintain state between jobs, shared-memory systems with closely coupled CPUs may be necessary.

Future implications:

The architecture raises significant questions about the power efficiency of desktop-class PCs. There are two routes that can be taken: either improve the cooling mechanisms used in large data centers, or increase the power efficiency of the machines while keeping the cost-to-compute ratio the same. Both are difficult problems.

 

Web Search for a Planet: The Google Cluster Architecture May 12, 2009

The paper briefly describes the architecture of the clusters Google uses to support web search. The focus is on using many cheap commodity computers rather than a few expensive, powerful servers. The main contributions of the paper are:

  1. Applications such as web search, which can be broken down into smaller tasks, depend heavily on parallelization. For such applications, it’s the throughput of the system rather than peak thread performance that really matters. The authors show that, for throughput-oriented tasks, it is more cost-effective to use many cheap commodity computers than a few powerful servers, because many computers can achieve a higher degree of parallelism. It also helps if the computation-to-communication ratio between the sub-tasks is high; that is, the sub-tasks can run in isolation without much communication with each other.
  2. When using low-end systems, reliability should be handled by software rather than hardware. This reduces the need for the hardware to be ultra-reliable. Reliable hardware often costs more; by designing software that provides fault tolerance through redundancy, they can use cheaper hardware that is less reliable than high-end servers.
  3. The price-performance ratio should be an important factor in choosing commodity servers. With cheaper hardware, Google can afford more servers and therefore more replicas of each server. More replicas provide fault tolerance and higher availability, and a load balancer can more easily divide tasks among them.
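The three points above (independent sub-tasks, software-level fault tolerance, and replicas for load balancing) can be sketched in a few lines. This is a toy model written for illustration, not Google's implementation: the `Replica` class, the document contents, and the random replica selection are all assumptions, and the fan-out is sequential here where a real system would query shards in parallel.

```python
import random


class Replica:
    """One machine holding a copy of an index shard (hypothetical model)."""

    def __init__(self, docs, healthy=True):
        self.docs = docs          # maps doc_id -> document text
        self.healthy = healthy

    def search(self, term):
        if not self.healthy:
            raise ConnectionError("replica down")
        return [doc_id for doc_id, text in self.docs.items()
                if term in text.split()]


def query_shard(replicas, term):
    """Try the shard's replicas in random order; return the first answer.

    Random ordering spreads load across replicas, and moving on after a
    failure is the software-level fault tolerance.
    """
    for replica in random.sample(replicas, len(replicas)):
        try:
            return replica.search(term)
        except ConnectionError:
            continue  # fail over to the next replica
    raise RuntimeError("all replicas of this shard are down")


def search(shards, term):
    """Query every shard independently and merge the partial results."""
    hits = []
    for replicas in shards:
        hits.extend(query_shard(replicas, term))
    return sorted(hits)


# Two shards, two replicas each; one replica of the first shard has failed.
shard_a = {1: "google cluster architecture", 2: "commodity hardware"}
shard_b = {3: "cluster of commodity pcs", 4: "web search"}
shards = [
    [Replica(shard_a, healthy=False), Replica(shard_a)],
    [Replica(shard_b), Replica(shard_b)],
]

results = search(shards, "cluster")
print(results)
```

The failed replica does not affect the answer, because the query falls through to the healthy copy of the same shard. Each shard is searched without any communication with the others, which is the high computation-to-communication ratio described in the first point.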

Glaring problems: I don’t see any particular problems with their design, aside from high power consumption.

Future research: The paper does not mention virtual machines, so I’m not sure whether they are already used. For the power consumption problem, I can’t see why Google couldn’t run several virtual machines on each PC to make better use of CPU and power. It would also be valuable to see a mix of cheap and expensive computers handling different tasks for the Google web server. For example, less parallelizable tasks such as query analysis and spell-checking could be performed on single higher-end servers.