This paper describes the Google cluster architecture. It evaluates the workload required to offer a high-performance web search system and shows how the cluster architecture provided in Google’s data center allows high query throughput while keeping the cost per query low.
The main goals of their cluster are as follows:
- Provide reliability in software instead of hardware. Typical data centers use many redundant hardware systems (redundant power supplies, RAID arrays for disk accesses, etc.) in order to maintain system uptime. However, Google chooses not to use these mechanisms because they increase cost. Rather, they design their software to detect and route around failed hardware. This allows them to use much cheaper commodity machines, even though they are more prone to failure because the software is resilient against the failure of a small number of nodes.
- Use replication to provide throughput and availiability. Since their workload is very easily distributed, they exploit a huge amount of thread-level parallelism. When a query enters the data center, it is portioned out to dozens of machines, some of which access a distributed inverted index, some of which perform spell checking and ad generation from the search query, and some of which look up the documents which are returned by the index access. This allows the query to be satisfied very quickly even though each node in the system is not particularly fast. In addition to intra-data-center replication, Google also has multiple data centers spread throughout the world in order to provide availability in the face of natural disasters as well as localizing query responses to improve query latency.
- Maximize price per unit of performance instead of peak performance. In many respects, using “modern hardware” is not too helpful. The Pentium IIIs they are using in their system are objectively much slower than Pentium IVs for many workloads. However, since their workload consists of a large amount of thread-level parallelism and has minimal instruction-level parallelism, moving to Pentium IVs, which only improve on Pentium IIIs on code which contains higher ILP would actually reduce the performance of their system, especially from a cost/query point of view. It also means that any power saving measures need to be balanced against the resulting performance loss, since the metric that Google cares about is Watts/query.
The weakness of this paper is that it is a very high-level overview which does not provide too much detail to allow someone else to implement a similar system. This is to be expected because Google’s architecture is part of their “secret sauce,” which allows them to provide high performance search queries and dominate the search (and thus their advertisement) market.
Future research in this field should include a more general framework for distributing these tasks over a wide cluster as well as an analysis of the class of problems which can be solved by this approach. Clearly it is not well-suited for applications like scientific computing, which are very much peak performance based and relatively insensitive to latency, as well as being more dependent on communication between computation nodes (as compared to Google’s essentially read-only architecture). However, as a general model for serving web requests to a large number of users, it could easily serve as a more realistic model than the traditional “small number of powerful machines” approach.