Google technology

The Google Legacy 55

Chapter Three: Google Technology


“Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results.... Fast crawling technology is needed to gather the Web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at the rate of hundreds to thousands per second.” – Sergey Brin and Lawrence Page, 1997 [1]

In the beginning, there was BackRub, the service that became Google. Today, Google is most closely associated with its PageRank algorithm. PageRank is a voting algorithm weighted for importance. The indicator of a Web page’s importance is the number of pages that link to that page.
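The idea of link-weighted voting can be sketched in a few lines of Python. This is a minimal illustration based on the published PageRank formulation, not Google's production code. The four-page link graph, the function name `pagerank`, the damping factor of 0.85 and the iteration count are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration sketch of PageRank.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative assumptions.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its vote evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web: three pages link to "home".
example_links = {
    "home": ["about"],
    "about": ["home"],
    "blog": ["home", "about"],
    "contact": ["home"],
}
ranks = pagerank(example_links)
```

In this toy graph "home" ends up with the highest score because it collects the most votes, and a vote from a highly ranked page counts for more than a vote from an obscure one.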

Messrs. Brin and Page soon added another factor which voted for the importance of a Web page. This factor was the number of people who click on a Web page. The more clicks on a Web page, the more weight that Web page was given. Over time, still other factors have been added to the PageRank algorithm; for example, the frequency with which content on a page is changed.

Google’s PageRank technology is closely allied with Internet search. Voting algorithms are less effective in enterprise search, for instance. The attention given to Google and its search technology dominates popular thinking about the company. Google search is like a nova. The luminescence makes it difficult for the observer to see other aspects of the phenomenon clearly or easily.

1. From “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” www-db.stanford.edu/~backrub/google.html

Radiance aside, Google is a technology company.[2] Some of that technology, when described in technical papers such as the earliest one, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” is demanding. The later papers such as “MapReduce: Simplified Data Processing on Large Clusters” can be a slow read.[3] Since Google is technology, explaining what Google does in an easily digestible meal is difficult. The diagram below provides an unauthorized snapshot of Google’s computing framework.

2. The annex to this monograph contains a listing of more than 60 Google patents. The list is not all-inclusive; however, it does provide the patent number and a brief description for some of Google’s most important patents. The PageRank patent belongs to the trustees of Stanford University. Google’s patent efforts have focused on systems and methods for relevance, advertising, and other core foci of the company. Google is creating a patent fence to protect its interests.

3. Jeff Dean, former Alta Vista researcher and a Google senior engineer, has been an advocate of MapReduce. His most recent papers are available on his Web page at http://labs.google.com/people/jeff/.

Important Google technologies that underlie this diagram of the Googleplex include: [a] modifications to Linux to permit large file sizes and other functions so as to accelerate the overall system; [b] a distributed architecture that allows applications and scaling to be “plugged in” without the type of hands-on set-up other operating systems require; [c] a technical architecture that is similar at every level of scale; [d] a Web-centric architecture that allows new types of applications to be built without a programming language limitation.




Google’s technology has emerged from a series of continuous improvements or what Japanese management consultants call kaizen. Each Google technical change may be inconsequential to the average user of Google. But when taken as a whole, Google’s “technological advantage” comes from Google’s incremental innovations, clever adaptations of research-computing concepts, and Byzantine tweaks to Linux. Some day, a historian of technology will be able to identify, from the hundreds of improvements that Google has engineered in the last nine years, one or two that stand with PageRank as of major importance. Critics of Google will see that the company has grafted to its core technology processes from many different sources.

To illustrate, the structure of Google’s data centers and the messages passed to and from these data centers are in many ways a variant of grid computing.[4] Google’s ability to read data from many computers simultaneously is reminiscent of BitTorrent’s technology.[5] Google’s use of commodity or “white box” hardware in its data centers is an indication of Google’s hacker ethos. The use of memory and discs to store multiple copies of data comes from the frontiers of computing.

Google’s approach to technology, then, is eclectic and in many ways represents a building block approach to large-scale systems. Google benefits from that eclecticism in several ways. First, Google’s computational framework delivers sizzling performance from low-cost hardware. Second, Google worked around the bottlenecks of such operating systems as Solaris, Windows Advanced Server, and off-the-shelf Linux. Third, Google took good programming ideas from other languages, implementing new functions and libraries to eliminate most of the manual coding required to parallelise an application across Google’s servers.[6]
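The kind of library the third point describes can be pictured with a toy, single-process version of the map and reduce pattern named in the MapReduce paper. The sketch below is an illustration under stated assumptions, not Google's code: the function names `map_words`, `reduce_counts` and `run_mapreduce` are hypothetical, and a real system would spread the map and reduce calls across many servers instead of running them in one loop.

```python
from collections import defaultdict


def map_words(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(documents, mapper, reducer):
    """Single-process stand-in for the framework (an assumption).

    The programmer supplies only mapper and reducer; the grouping of
    intermediate pairs by key is handled here, once, for every job.
    """
    grouped = defaultdict(list)
    for document in documents:
        for key, value in mapper(document):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


word_counts = run_mapreduce(
    ["the web is big", "the index is bigger"],
    map_words,
    reduce_counts,
)
```

The point of the design is that the two small functions contain no code for distribution at all. Whatever parallel machinery exists lives in the framework and is written once.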

According to Jeff Dean, one of Google’s senior engineers, “Google engineering is sort of chaotic.”[7] This is neither surprising nor necessarily a negative. The Googleplex is a toy box for engineers and programmers. The tools are sophisticated. The challenges of the problems and peers make Google “the place to be” for the best and brightest technical talent in the world. The nature of creativity combined with Google’s approach to innovation makes it difficult to predict the next big thing from Google.

Before reviewing selected parts of Google’s technology in somewhat more detail, consider the diagram “Google’s Computing Framework,” which provides an overview of the Googleplex and some of its technologies. These will be touched upon in this section.

4. Grid computing is applying resources from many computers in a network to a single problem or application. Google uses grid-like technology in its distributed computing system.

5. BitTorrent is a peer-to-peer file distribution tool written by programmer Bram Cohen in 2001. The reference implementation is written in Python and is released under the MIT License.

6. Google has anywhere from 100,000 to 165,000 or more servers. Servers are organized into clusters. Clusters may reside within one rack or across multiple racks of servers. Some Google functions are distributed across data centers.

7. From Dr Dean’s speech at the University of Washington in October 2003. See http://www.uwtv.org/programs/displayevent.asp?rid=2459.



PageRank requires a lot of computing horsepower to work. When Google got underway in 1996, Messrs. Brin and Page had limited computing horsepower. In order to make PageRank work, they had to figure out how to get the PageRank algorithm to run on the garden-variety computers available to them.

From the beginning – and this is an important issue with regard to Google’s almost-certain collision course with Microsoft – Google had to solve both software engineering and hardware engineering issues to make Google Search viable. In fact, when discussing Google technology, it is important to keep in mind that PageRank is important only because it can run quickly in the real world, not in a sterile computer lab illuminated with the blue glow of supercomputers.

The figure Google’s Fusion: Hardware and Software Engineering shows that Google’s technology framework has two areas of activity. There is the software engineering effort that focuses on PageRank and other applications. Software engineering, as used here, means writing code and thinking about how computer systems operate in order to get work done quickly. Quickly means the sub-one-second response times that Google is able to maintain despite its surging growth in usage, applications and data processing.

The other effort focuses on hardware. Google has refined server racks, cable placement, cooling devices, and data center layout. The payoff is lower operating costs and the ability to scale as demand for computing resources increases. With faster turnaround and the elimination of such troublesome jobs as backing up data, Google’s hardware innovations give it a competitive advantage few of its rivals can equal as of mid-2005.

Google’s Fusion: Hardware and Software Innovations

The Google phenomenon comes from the fission occurring when PageRank’s software and hardware engineering interact. Google’s technology delivers supercomputer applications for mass markets.

PageRank with its layering of additional computations added over the years is a software problem of considerable difficulty. The Google system must find Web pages and perform dozens, if not hundreds, of analyses of those Web pages. Consider the links pointing to a Web page. Google must keep track of them for more than eight billion Web pages. For a single Web page with one link pointing to it, the problem is trivial. One link equals one pointer. But what happens when a site has 10,000 links pointing to it? The problem becomes many times larger and more computationally demanding. Some of these links are likely to come from sites that have more traffic than others. Some of the links may come from sites that have spoofed Google for fun or profit. The calculations to sort out the “value” of each of these links add to the computational work associated with PageRank. Keeping track of these factors is a big job. Sizing up different factors against one another for a single page can be hard without a calculator to help. Take the same task and apply it to a couple of billion Web pages, and the computing task becomes one for a supercomputer.

Yet this task is everyday stuff for Google and its PageRank process. Users do not give much thought to what technology underpins a routine query or the 300 million queries Google handles each day. Averaged over a day, that volume works out to roughly 3,500 queries every second, in dozens of languages from users worldwide.

Google’s technology cannot be separated from search. Search was the prime mover in the Google universe. Once Messrs. Brin and Page were able to fiddle with a limited number of commodity computers and make their PageRank algorithm work, Google was headed down a road that it still follows.

The software requires a suitable hardware and network infrastructure in which to operate. Without Google’s hardware and software, there would be no Google. Hardware and software are inextricably linked at Google. With each new advance in software, Google’s engineers must make correspondingly significant advances in hardware. And when hardware engineers come up with an advance, the software engineers greedily use that advance to up the functionality of their software.

What Google owns is its own snappy, turbocharged supercomputer, interesting software tools, and several thousand people trying to figure out what else the Googleplex can do. Some of the tinkerers come at the problem from bits and bytes, writing code, and weaving applications out of the available functions. The result is a brilliant product.

Others come at the problem from the soldering iron and screwdriver angle. These engineers look for ways to build hardware and physical systems that can perform the calculations needed to make PageRank work. Google’s approach to data centers, the racks in the data centers, and the devices in the racks in the data centers is as clever as the company’s search system. The hardware has to be more than clever. The hardware has to work 24x7, under continuous load, and in locations from Switzerland to Beijing. The synergy between software and hardware is perhaps one of Google’s major accomplishments.



How Google Is Different from MSN and Yahoo

Google’s technology is simultaneously just like other online companies’ technology, and very different. A data center is usually a facility owned and operated by a third party where customers place their servers. The staff of the data center manage the power, air conditioning and routine maintenance. The customer specifies the computers and components. When a data center must expand, the staff of the facility may handle virtually all routine chores and may work with the customer’s engineers for certain more specialized tasks.

Before looking at some significant engineering differences between Google and two of its major competitors, review this list of characteristics for a Google data center.

1 Google data centers now number about two dozen, although no one outside Google knows the exact number or their locations. They come online and automatically, under the direction of the Google File System, start getting work from other data centers. These facilities, sometimes filled with 10,000 or more Google computers, find one another and configure themselves with minimal human intervention.

2 The hardware in a Google data center can be bought at a local computer store. Google uses the same types of memory, disc drives, fans and power supplies as those in a standard desktop PC.

3 Each Google server comes in a standard case called a pizza box with one important change: the plugs and ports are at the front of the box to make access faster and easier.

4 Google racks are assembled for Google to hold servers on their front and back sides. This effectively allows a standard rack, normally holding 40 pizza box servers, to hold 80.

5 A Google data center can go from a stack of parts to online operation in as little as 72 hours, unlike more typical data centers that can require a week or even a month to get additional resources online.

6 Each server, rack and data center works in a way that is similar to what is called “plug and play.” Like a mouse plugged into the USB port on a laptop, Google’s network of data centers knows when more resources have been connected. These resources, for the most part, go into operation without human intervention.

Several of these factors are dependent on software. This overlap between the hardware and software competencies at Google, as previously noted, illustrates the symbiotic relationship between these two different engineering approaches. At Google, from its inception, Google software and Google hardware have been tightly coupled. Google is not a software company nor is it a hardware company. Google is, like IBM, a company that owes its existence to both hardware and software. Unlike IBM, Google has a business model that is advertiser supported. Technically, Google is conceptually closer to IBM (at one time a hardware and software company) than it is to Microsoft (primarily a software company) or Yahoo! (an integrator of multiple software products).



Software and hardware engineering cannot be easily segregated at Google. At MSN and Yahoo hardware and software are more loosely coupled. Two examples will illustrate these differences.

Microsoft – with some minor excursions into the Xbox game machine and peripherals – develops operating systems and traditional applications. Microsoft has multiple operating systems, and its engineers are hard at work on the company’s next generation of operating systems. Microsoft does not design or make its own hardware. Its operating systems are coded, for example, for processors that evolved from the Intel chips for personal computers. Recently Microsoft embarked on a new path with its game machine, the Xbox 360. The new Xbox uses a processor from IBM’s family of PowerPC chips also used in the Macintosh computer, the Sony PS3, and Nintendo next-generation game machines. Microsoft’s applications run on Microsoft operating systems, although versions of Microsoft Office and Internet Explorer run on Apple’s Macintosh.

In addition, Microsoft buys hardware from various suppliers to run its online systems. Most of these suppliers, not surprisingly, are certified by Microsoft. Examples include Microsoft’s use of Dell Computers. Microsoft’s engineers use these machines in configurations required by the Microsoft operating systems and applications. For example, Microsoft servers often require a load balancing feature. Microsoft implements its load balancing via software. When more performance is required, Microsoft upgrades the hardware, adds memory, or shifts to higher-speed hard drive technology instead of recoding the operating system itself to deliver higher performance as Google does. Once a function is released to customers, Microsoft’s engineers focus on stamping out bugs. Re-engineering a software application for higher performance is not typically a priority.

Several observations are warranted:

1 Unlike Google, Microsoft does not focus on performance as an end in itself. As a result, Microsoft gets performance the way most computer users do. Microsoft buys or upgrades machines. Microsoft does not fiddle with its operating systems and their subfunctions to get that extra time slice or two out of the hardware.

2 Unlike Google, Microsoft has to support many operating systems and invest time and energy in making certain that important legacy applications such as Microsoft Office or SQL Server can run on these new operating systems. Microsoft has a boat anchor tied to its engineers’ ankles. The boat anchor is the need to ensure that legacy code works in Microsoft’s latest and greatest operating systems.

3 Unlike Google, Microsoft has no significant track record in designing and building hardware for distributed, massively parallelised computing. The mice and keyboards were a success. Microsoft has continued to lose money on the Xbox, and the sudden demise of Microsoft’s entry into the home network hardware market provides more evidence that Microsoft does not have a hardware competency equal to Google’s.



In terms of technology, Google has the hardware and software engineering expertise to build applications rapidly, perform computationally intensive applications quickly, and deliver high-reliability services from low-cost, commodity hardware.

Yahoo! operates differently from both Google and Microsoft. Yahoo! is in mid-2005 a direct competitor to Google for advertising dollars. Yahoo! has grown through acquisitions. In search, for example, Yahoo acquired 3721.com to handle Chinese language search and retrieval. Yahoo bought Inktomi to provide Web search. Yahoo bought Stata Labs in order to provide users with search and retrieval of their Yahoo! mail. Yahoo! also owns AllTheWeb.com, a Web search site created by FAST Search & Transfer. Yahoo! owns the Overture search technology used by advertisers to locate key words to bid on. Yahoo! owns Alta Vista, the Web search system developed by Digital Equipment Corp. Yahoo! licenses InQuira search for customer support functions. Yahoo has a jumble of search technology; Google has one search technology.

Historically Yahoo has acquired technology companies and allowed each company to operate its technology in a silo. Integration of these different technologies is a time-consuming, expensive activity for Yahoo. Each of these software applications requires servers and systems particular to each technology. The result is that Yahoo has a mosaic of operating systems, hardware and systems. Yahoo!’s problem is different from Microsoft’s legacy boat-anchor problem. Yahoo! faces a Balkan-states problem.

There are many voices, many needs, and many opposing interests. Yahoo! must invest in management resources to keep the peace. Yahoo! does not have a core competency in hardware engineering for performance and consistency. Yahoo! may well have considerable competency in supporting a crazy-quilt of hardware and operating systems, however. Yahoo! is not a software engineering company. Its engineers make functions from disparate systems available via a portal.

Google also acquires technology. A good example is Picasa. The photo management software runs on the user’s Windows PC.

The program has been integrated with several of Google’s network-centric applications:

1 Gmail. The user’s images can be uploaded and sent via email to friends, colleagues and family. A Picasa user without a Gmail account is able to register and receive a user name and password. The Gmail account can also be used, if the user wishes, for other Google services, including Fusion, which is Google’s personalized portal, and the search history function, which saves a registered user’s Google queries for later reference.

2 Blog Publishing. The user can post pictures to a Google property, Blogger.com. The image publishing function is simplified to one or two clicks. Posting images on some Web log systems is beyond the expertise of many computer users.

3 Image Printing. The user can send images to online photo processing services.



In sharp contrast to Yahoo’s approach, Google integrated the Picasa application into the Googleplex. The “hooks” are painless to the user.[8] Google has bundled into one free application point-and-click solutions to make management of digital still images intuitive and fluid. Yahoo!’s acquisitions, in general, are not woven into a seamless experience with other Yahoo! services. Consider the 3721.com search system. That service remains a separate Chinese language operation available from mostly non-English Yahoo pages. Google constructs an application using some code on the user’s PC and other software running on the Googleplex somewhere on the Internet.

8. Picasa requires a download. The installation process is smooth. Indexing speed was about five times faster than ACDSee’s image management program, a competitive product. With Picasa, Google’s technologists demonstrate a rapid, trouble-free installation and an intuitive interface.

One-click access to network services available as part of the user’s virtual application.

One-click access to functions performed on the user’s local computer.

Recently-viewed images

These three companies, different in structure and technical focus, are on a collision course. Like vessels in America’s Cup, each is going toward the same goal, but subject to forces difficult for their helmsmen to control. Even though there is market space among the three, collisions are inevitable. The figure below provides an overview of the mid-2005 technical orientation of Google, Microsoft and Yahoo.

MSN, and by extension Microsoft Corporation, has a core competency in software. The company has grown from its operating system roots to provide a range of products for mobile devices, desktop and notebook computers, and enterprise-class servers. Looking forward, the company’s Dot Net technology is Microsoft’s framework for virtual applications. In some ways, Dot Net is a less-open version of the AJAX technology that Google uses in the Google Maps and Gmail products. Microsoft has expended great effort to push Windows downward to mobile devices and outward to network-centric computers in an effort to increase revenue. For Microsoft to continue to be the dominant force in software in the future, the company must be able to capture a commanding share of the market for network-centric applications. However, Microsoft’s weak point (whether real or perceived) is its products’ vulnerability to security breaches. Patch after patch, problem after problem, then promise after promise have done little to bolster the firm’s credibility for delivering secure systems and software. Looking forward over the next 12 to 18 months, Microsoft’s prospects hinge on security, cost and its developer community. The growth of open source alternatives is hard proof that die-hard Microsoft users are willing to shift for security, cost savings and functionality. Microsoft has weaknesses that can be attacked by Google and other competitors.

Yahoo’s situation is typical of many American organizations. Most large US corporations are a hotch-potch of different systems, incompatible architectures and a Tower of Babel of data formats. For Yahoo to deliver specific markets to its advertisers, Yahoo must integrate information from disparate systems and be able to segment and deliver ads to those users efficiently. Yahoo is now spending money to break down the walls of its data silos and integrate its user data. If Yahoo cannot deliver narrowly segmented markets, advertisers may abandon Yahoo for services that offer more targeted marketing opportunities. After years of flirting with becoming a New Age America Online, Yahoo is beginning to behave like a traditional media company.



MSN and Yahoo! are becoming ad-supported versions of general-interest portals like Yahoo, America Online and Tiscali. In contrast, Google is focusing on applications that tie users to its Googleplex. The company’s focus on hardware and software engineering gives it a cost and performance advantage over MSN and Yahoo, among others competing in Web search. Google’s high-performance, homogeneous Googleplex means that the company does not struggle with some integration, performance and cost issues that bedevil Microsoft and MSN. Google may not be doing everything right from a computer science point of view. But Google is doing less wrong than MSN or Yahoo, its two aggressive competitors.

The Technology Precepts

Google’s technology uses concepts and techniques from the leading edge of computer science. Most of these innovations are difficult to explain to engineers steeped in traditional approaches to massively distributed, highly parallelized computing. The eclectic footnotes and references in the earlier BackRub paper have been sharpened in Google’s later technical presentations. Readers without a first-hand understanding of NOW-Sort, River, and BAD-FS are unlikely to craft dinner conversation from Google’s explanations of the influence of these research computing demonstrations.[9]

For the purposes of this monograph and understanding the nature of Google’s technology, five precepts thread through Google’s technical papers and presentations. The following snapshots are extreme simplifications of complex, yet extremely fundamental, aspects of the Googleplex.

Cheap Hardware and Smart Software

Google’s use of commodity hardware for high-demand, 24x7 systems has existed as a core precept since 1996. Most of its competitors’ online systems combine branded hardware from IBM, Sun Microsystems, Hewlett-Packard, and Dell Computers with specialized peripherals. The operating systems in use are a combination of Unix and Microsoft operating systems with some Linux and open source components.

Google approaches the problem of reducing the costs of hardware, set up, burn-in and maintenance pragmatically. A large number of cheap devices using off-the-shelf commodity controllers, cables and memory reduces costs. But cheap hardware fails.

In order to minimize the “cost” of failure, Google conceived of smart software that would perform whatever tasks were needed when hardware devices fail. A single device or an entire rack of devices could crash, and the overall system would not fail. More important, when such a crash occurs, no full-time systems engineering team has to perform technical triage at 3 a.m.
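What “smart software” might mean in practice can be suggested with a small failover loop. This is a hedged sketch, not a description of Google’s scheduler: the function `run_with_failover`, the worker names and the simulated crash are all hypothetical, chosen only to show software routing work around dead machines without waking an engineer.

```python
def run_with_failover(task, workers, execute):
    """Try each worker in turn until one completes the task."""
    for worker in workers:
        try:
            return worker, execute(worker, task)
        except ConnectionError:
            # The machine is down; move on to the next one automatically.
            continue
    raise RuntimeError("every worker failed")


# Hypothetical scenario: an entire rack has crashed.
crashed = {"rack1-box1", "rack1-box2"}


def execute(worker, task):
    """Stand-in for sending work to a machine (an assumption)."""
    if worker in crashed:
        raise ConnectionError(f"{worker} is not responding")
    return f"{task} done"


used_worker, result = run_with_failover(
    "index-shard-7",
    ["rack1-box1", "rack1-box2", "rack2-box1"],
    execute,
)
```

Here the task lands on a machine in the second rack. The failure of the first rack costs two failed attempts, not a support call.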

9. See for example Andrea C. Arpaci-Dusseau, et al. “High Performance Sorting on Network of Workstations”. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, May 1997 or John Bent, et al. “Explicit Control in a Batch-Aware Distributed File System”. Both contained in Proceedings of the 1st USENIX Symposium on Networked Systems Design and Implementation. March 2004.



The focus on low-cost, commodity hardware and smart software is part of the Google culture. In one presentation at a December 2004 technical conference, a Google spokesman joked that anyone in the room could buy the same hardware that Google uses at Fry’s Electronics, a retail chain with stores in Palo Alto and other cities in California.

Logical Architecture

Google’s technical papers do not describe the architecture of the Googleplex as self-similar. They do, however, provide tantalizing glimpses of an approach to online systems that makes a single server share features and functions of a cluster of servers, a complete data center, and a group of Google’s data centers.

The diagram below shows a representation of the Googleplex’s tightly organized, highly regular organization of files, servers, clusters, and more than two dozen data centers in a stable organizational pattern.[10]

The diagram illustrates that Google’s technical infrastructure is similar at every level in the Googleplex. The collection of servers running Google applications on the Google version of Linux is a supercomputer. The Googleplex can perform mundane computing chores like taking a user’s query and matching it to documents Google has indexed. Furthermore, the Googleplex can perform side calculations needed to embed ads in the results pages shown to users, execute parallelized, high-speed data transfers like computers running state-of-the-art storage devices, and handle necessary housekeeping chores for usage tracking and billing.

10. The illustration is a Sierpinski triangle, chosen because it conveys how each component in Google’s infrastructure replicates other larger combinations of servers and data centers. The overall structure – in this illustration an equilateral triangle – expresses the stability of the Google approach to its system. This famous fractal connotes how Google scales without altering the micro or macro structure of the Googleplex.

A single Google pizza box server

A data centre uses the same design and is composed of racks.

The Googleplex is a larger instance of the organization of a single pizza box server.

A single replicated Google file reflects the controlling organizing principle

A single Google cluster embodies the same organizing principle as a single pizza box server



What is of interest is that Google does this with low-cost commodity hardware running Google’s version of Linux. Google has infused the Googleplex with logic that allows software to handle data recovery, to streamline messages passed from server to server, and to grab additional computing resources in order to complete a job quickly. When Google needs to add processing capacity or additional storage, Google’s engineers plug in the needed resources. Due to self-similarity, the Googleplex can recognize, configure and use the new resource. Google has an almost unlimited flexibility with regard to scaling and accessing the capabilities of the Googleplex. Unlike a collection of different building materials, Google’s approach delivers a homogeneous computing system.

A good example is bringing a new rack of 40 or more pizza box servers online and creating one of the many types of servers Google uses.[11] Servers, according to the fractal architecture, consist of two or more clusters of pizza boxes. A cluster allows data to be replicated and work shared among pizza boxes with spare capacity. A rack is assembled and then Google’s pizza box servers are “plugged in.” Cables are attached among the pizza boxes and the rack is then plugged into a network hub. An engineer turns on the power, and the other devices become aware of the new rack’s resources. Master servers – Google’s term for the pizza box that is in charge of one or more clusters – instruct other servers to copy data to the new cluster and begin using the clusters to do work.
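The rack example lends itself to a small model of self-similarity. The sketch below is an assumption-heavy illustration, not Google’s design: one hypothetical class, `Resource`, stands in for a pizza box, a rack or a data center, and the 80 gigabytes of storage per server is an invented figure. What it shows is that when every level exposes the same interface, adding a rack is the same operation as adding a server.

```python
class Resource:
    """One level of a self-similar hierarchy.

    The same class models a pizza box server, a rack, a cluster, or a
    data center; all names and sizes here are illustrative assumptions.
    """

    def __init__(self, name, storage_gb=0):
        self.name = name
        self.storage_gb = storage_gb  # storage owned directly by this node
        self.children = []

    def plug_in(self, child):
        """Attach a smaller unit; no other configuration step is modeled."""
        self.children.append(child)
        return child

    def total_storage_gb(self):
        """Every level answers the same question the same way."""
        return self.storage_gb + sum(
            child.total_storage_gb() for child in self.children
        )


def build_rack(name, servers=40, storage_gb_per_server=80):
    """Assemble a rack of identical pizza box servers."""
    rack = Resource(name)
    for index in range(servers):
        rack.plug_in(Resource(f"{name}-box{index}", storage_gb_per_server))
    return rack


data_center = Resource("dc-1")
data_center.plug_in(build_rack("rack-1"))
before = data_center.total_storage_gb()
data_center.plug_in(build_rack("rack-2"))  # the new rack is "plugged in"
after = data_center.total_storage_gb()
```

Nothing in the data center object had to change for the second rack to count. The capacity simply doubles the next time anyone asks.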

In Google’s self-similar architecture, the loss of an individual device is irrelevant. In fact, a rack or a data center can fail without data loss or taking the Googleplex down. The Google operating system ensures that each file is written three to six times to different storage devices. When a copy of that file is not available, the Googleplex consults a log for the location of the copies of the needed file. The application then uses that replica of the needed file and continues with the job’s processing. Redundancy and other engineering tweaks to Linux give the Googleplex ways to eliminate or reduce the bottlenecks associated with traditional online computer systems’ operation. The Google technical recipe includes distributed computing, optimized file handling, and embedded logic to make the servers working on tasks smarter.
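The write-several-copies, consult-the-log behavior described above can be mimicked in miniature. The following is a sketch under stated assumptions rather than a rendering of the Google File System: the class `ReplicatedStore`, the device names and the choice of three copies are hypothetical, and real systems choose replica locations far more carefully than “the first three devices.”

```python
class ReplicatedStore:
    """Toy store that writes each file to several devices."""

    def __init__(self, devices, copies=3):
        self.devices = {name: {} for name in devices}  # device -> files
        self.offline = set()  # devices that have failed
        self.copies = copies
        self.log = {}  # filename -> devices holding a copy

    def write(self, filename, data):
        available = [d for d in self.devices if d not in self.offline]
        if len(available) < self.copies:
            raise RuntimeError("not enough devices for the requested copies")
        targets = available[: self.copies]
        for device in targets:
            self.devices[device][filename] = data
        self.log[filename] = targets  # record where the copies live

    def read(self, filename):
        # Consult the log and use the first replica that is still reachable.
        for device in self.log[filename]:
            if device not in self.offline:
                return self.devices[device][filename]
        raise RuntimeError("all replicas are unavailable")


store = ReplicatedStore(["disk-a", "disk-b", "disk-c", "disk-d"])
store.write("page-0001", b"<html>")
store.offline.update({"disk-a", "disk-b"})  # two devices crash
data = store.read("page-0001")
```

Two of the three copies are gone, yet the read succeeds from the third. The application never learns that anything failed.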

This architecture allows Google to expand its computational capacity, its storage and its supported applications with an ease and at a price point rivals cannot easily match. According to Jeff Dean, one of Google’s senior engineers, “At Google, everything is about scale.”12

Speed and Then More Speed

Google Search is fast, with most results coming back to the user in less than one second. In commercial data centers, speed has traditionally been achieved by buying high-end, high-performance hardware from manufacturers such as Sun Microsystems and using advanced storage devices connected to the servers by exotic fibre optics.

11. Data centers use computer cases that are shaped like the boxes used to hold pizzas. The term pizza boxes has been appropriated by engineers to describe one of the standard form factors for servers housed in rack mounts in data centers.

12. Statement made at the University of Washington, October 2004.



Not Google. Google uses commodity pizza box servers organized in a cluster. A cluster is a group of computers that are joined together to create a more robust system. Instead of using exotic servers with eight or more processors, Google generally uses servers that have two processors similar to those found in a typical home computer.

Through proprietary changes to Linux and other engineering innovations, Google is able to achieve supercomputer performance from components that are cheap and widely available. The table below provides some data from 2002 about the speed with which Google can read data from hard drives:13

To put these data in a context of 2002 technology, consider that an IBM EXP3 storage device available in 2002 could read data in burst mode at the rate of about 58 MB / second. Google’s read rate in 2002 averaged ten times the read rate of the IBM EXP3. The write rate is comparable. The cost of a single IBM EXP3 in 2002 was about $18,000 for 360 gigabytes of storage, excluding controller and cables. Google’s cost for comparable storage and the higher performance was about $1,000. For greater speed, Google spends less. In the world of ever-increasing demands for speed and storage, Google has a strong one-two punch.14 Advances in commodity storage devices translate to even faster performance for Google. Google has not updated its read rate data, but engineers familiar with Google believe that read rates may in some clusters approach 2,000 megabytes a second. When commodity hardware gets better, Google runs faster without paying a premium for that performance gain.
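The comparison in the preceding paragraph reduces to simple arithmetic. The sketch below uses only the 2002 figures quoted in the text; the one assumption is that “comparable storage” means the same 360 gigabytes, and the constant and function names are chosen for illustration.

```python
# Figures quoted in the text for 2002.
IBM_COST_USD = 18_000
IBM_CAPACITY_GB = 360
IBM_BURST_READ_MB_S = 58
GOOGLE_COST_USD = 1_000
GOOGLE_CAPACITY_GB = 360      # assumption: "comparable storage" read as the same capacity
GOOGLE_READ_MULTIPLE = 10     # "ten times the read rate of the IBM EXP3"


def cost_per_gb(cost_usd, capacity_gb):
    """Dollars paid for each gigabyte of storage."""
    return cost_usd / capacity_gb


ibm_per_gb = cost_per_gb(IBM_COST_USD, IBM_CAPACITY_GB)
google_per_gb = cost_per_gb(GOOGLE_COST_USD, GOOGLE_CAPACITY_GB)
cost_ratio = IBM_COST_USD / GOOGLE_COST_USD
google_read_mb_s = IBM_BURST_READ_MB_S * GOOGLE_READ_MULTIPLE

print(f"IBM EXP3: ${ibm_per_gb:.2f} per GB")              # $50.00 per GB
print(f"Google:   ${google_per_gb:.2f} per GB")           # $2.78 per GB
print(f"IBM costs {cost_ratio:.0f} times as much")        # 18 times
print(f"Implied Google read rate: {google_read_mb_s} MB/s")  # 580 MB/s
```

On these numbers Google pays about one-eighteenth of the price for roughly ten times the read throughput.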

Google engineers for computational speed. Google’s approach has been to focus on making its software engineering produce the turbocharged performance. Speed is crucial to Google’s PageRank and other analytic processes. If Google’s computational throughput were slow, Google could not perform the work needed to know that for a particular query, a particular set of indexed Web pages is the best match. Without fast response to a query, users would not be willing to run multiple queries and interact fluidly with the Google applications.

These data show the results of two clusters’ performance. Google’s read throughput has gone up since 2002. Based on increases in commodity drive throughput, Google’s read rate may be close to 2,000 megabytes per second, which may be a Google watcher’s enthusiasm boosting already-robust figures.

13. From “The Google File System” by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (Google), ACM SOSP 2003 Conference Proceedings, 1-58113-757-5/03/0010, page 12.

14. With Google’s advanced programming tools, Google is able to increase the productivity of its engineers. Combined with hardware speed and performance, Google squeezes out more productivity by applying its engineering talents to application development. This is a one-two-three punch to which Google’s competitors have to respond.

Google does not mindlessly match key words in a user’s query to the terms in the Google index. Google’s approach is more subtle and computationally involved, although term matching is an important part of the Google process. Google reviews data, various scores or values from certain algorithms. Google then uses these different values in other algorithms to find search results, identify the best match (Google’s “Feeling Lucky” link), extract matching ads from its advertising server, and continuously update values as Google users click on links. Once these various query and ad matching processes are complete, Google displays the results page to the user, typically in less than one second across a public network.

Google is a hot rod computer that can perform the basic mathematics needed to deliver most search results in less than a half second, display maps with the speed of a dedicated desktop application like Encarta, and look at a Web page matching a user’s query and, in some applications, insert additional hyperlinks to related content before displaying the results page to the user. The Googleplex does experience slowdowns. When these occur, the Googleplex allocates additional resources to eliminate the brownout.

Speed has many meanings at Google. Speed means that users can interact with the Google products and services as if the Google application were running on a dedicated PC in front of the user. Speed also means that Google must be able to expand its computational and storage capacity quickly. Speed also means rapid development and deployment of new products. Speed, like Google’s ability to scale, is a core functionality of the Googleplex.

Google applies its high-speed technology to search and to other types of servers. Among the servers using Google’s go-fast technology are those shown below:

| Type | Function |
|---|---|
| Advertising server | Delivers text and other paid advertisements for AdWords and AdSense. |
| Chunkserver | Schedules and delivers blocks of data for further processing. |
| Image servers | Serves images for Google Image, Print and Video services. |
| Index server | The workhorse of search. Server handles search-and-retrieval. |
| Mail server | Delivers the Gmail service. |
| News server | Gathers, analyses and displays news. |
| Web server | Orders results and makes them available to users. |

What does the combination of go-fast technology plus multiple types of Google data allow the company to do? Google can engage in fast new product development. One example is Google Maps. Google developed a basic mapping product over the course of 2004. In late 2004, Google purchased Keyhole. By June 30, 2005, Google had:

1 Released a basic mapping product.

2 Integrated information from Google Local in early 2005.

3 Hooked Keyhole satellite imagery into Google Maps in early May 2005.

4 Announced Google Earth in May 2005.

5 Upgraded the system to integrate two dimensional point-to-point routes on top of satellite imagery.

6 Demonstrated a function that accepts a query in another language, translates the results to the user’s language, and displays the data in a three-dimensional mode.

The image below shows that Google’s Map and Earth service pushes the functions of online map and data integration to another level. In the span of several days, Google integrated Keyhole technology, launched, upgraded and redefined online mapping services.15

These are the results of a Japanese language Google Maps-Earth query for the location of Wendy’s restaurants in New York City. The addition of the Japanese language support, the three-dimensional view of the section of Manhattan where the user wants directions, and the integration of hot links, the two dimensional map, and information about the restaurants was part of Google’s fast-cycle launch and enhancement program designed to beat Microsoft to the market.

15. The source for this image was http://blog.eee-craft.com/archives/23345086.html.

Another key notion of speed at Google concerns writing computer programs to deploy to Google users. Google has developed short cuts to programming. An example is Google’s creating a library of canned functions to make it easy for a programmer to optimize a program to run on the Googleplex computer. At Microsoft or Yahoo, a programmer must write some code or fiddle with code to get different pieces of a program to execute simultaneously using multiple processors. Not at Google. A programmer writes a program, uses a function from a Google bundle of canned routines, and lets the Googleplex handle the details. Google’s programmers are freed from much of the tedium associated with writing software for a distributed, parallel computer. What does increased programmer productivity mean? In terms of money, Google makes each engineering dollar go farther. If a single programmer can reduce by 10 percent the time required to code a program, the savings could be several thousand dollars. If a programmer can slash coding time in half, Google gets twice the potential productivity out of each of its 3,000-plus programmers.16
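The idea of a canned routine that hides the parallel plumbing can be illustrated with a toy map-and-reduce helper. This is a sketch in the spirit of the MapReduce paper cited in this chapter, not Google’s library: the function map_reduce and its parameters are hypothetical, and a thread pool on one machine stands in for the many machines of the Googleplex.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_reduce(inputs, mapper, reducer, workers=4):
    """Run mapper over inputs in parallel, group values by key, reduce each group."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each mapper call returns a list of (key, value) pairs.
        mapped = list(pool.map(mapper, inputs))
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# The programmer writes only these two small functions.
def count_words(document):
    return [(word.lower(), 1) for word in document.split()]


def total(word, counts):
    return sum(counts)


docs = ["Google uses commodity hardware", "commodity hardware is cheap"]
counts = map_reduce(docs, count_words, total)
print(counts["commodity"])  # prints 2
```

The programmer supplies count_words and total; scheduling, grouping and collection of results stay inside the helper, which is the productivity gain the text describes.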

Eliminate or Reduce Certain System Expenses

Some lucky investors jumped on the Google bandwagon early. Nevertheless, Google was frugal, partly by necessity and partly by design. The focus on frugality influenced many hardware and software engineering decisions at the company. Spending money wisely does not mean spending cheaply. Examples of how Google eliminates or reduces certain system expenses include:

• Google eliminates the costs associated with backing up and restoring data when a hardware failure occurs. The fractal principle requires that Google replicate data three to six times elsewhere in the Googleplex. When a device fails, the “master server” for a task looks at a file that tells where the other copies of the data or the programs are. The “master server” then uses those data or those processes to complete a task. No tape, no human intervention, and no downtime; Google does not have these costs due to its engineering acumen.

• Google does not have to certify new hardware. When additional storage or computational capacity is required, Google technicians assemble one or more racks of Google “pizza boxes.” Once in the rack, the Googleplex recognizes the new resources in a way that is similar to how a laptop knows when a user plugs in a USB mouse. The expensive certification processes otherwise required for some high-end hardware are eliminated. Google engineers plug in resources and let the Googleplex handle the other tasks.

• Google innovation uses open source code as a starting point. Many of Google’s most striking technical advances are based on modifying open source software to benefit from insights gained from experimental results in supercomputing. Google does not have to work around known bottlenecks in some commercial operating systems. Unlike Microsoft, Google did not write a complete operating system for its Googleplex. Google made key changes to Linux, adding necessary services and functions to meet the specific requirements of Google applications. Google’s approach is pragmatic and less time-consuming than Microsoft’s “death march” to get Longhorn shipped by late 2006. Compared with Yahoo, Google’s approach is more cohesive. Yahoo faces integration drudgery as a result of its multiple systems and heterogeneous hardware and data. Google has used Linux, standards, and open source software for virtually all of its core services and thus spends less time pounding disparate systems and data into a standard type.17

• Google does not spend money for high-performance devices to make its system perform faster.

16. Some Google programmers have complained about the peer pressure to perform. Google management faces a challenge in managing its programming talent. Staff burnout or defections could impair Google’s technical resources.

To illustrate the financial payoff from the use of commodity hardware, Google engineers revealed a back-of-the-envelope calculation. Although dated, it underscores the economies of the Google approach:18

The cost advantages of using inexpensive, PC-based clusters over high-end multiprocessor servers can be quite substantial, at least for a highly parallelisable application like ours. For example, a $278,000 rack contains 176 2-GHz Xeon CPUs, 176 Gbytes of RAM, and 7 Tbytes of disk space. In comparison, a typical x86-based server contains eight 2-GHz Xeon CPUs, 64 Gbytes of RAM, and 8 Tbytes of disk space; it costs about $758,000. In other words, the multi-processor server is about three times more expensive but has 22 times fewer CPUs, three times less RAM, and slightly more disk space. Much of the cost difference derives from the much higher interconnect bandwidth and reliability of a high-end server, but again, Google’s highly redundant architecture does not rely on either of these attributes. [Emphasis added]

This means that when Microsoft or Yahoo! spends US$3.00 for better performance, Google spends less than US$1.00.19 Over time, competitors such as IBM, Microsoft or Yahoo may implement similar features into their network-centric services. Until then, Google has a cost advantage at least with regard to scaling online operations. If these 2002 data can be accepted, Google spends about one-third as much for more computing horsepower and nearly as much disk space as companies spend using a traditional server architecture.
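The ratios in the quoted passage can be checked directly. The sketch below uses only the figures from the quotation; the cost-per-CPU comparison at the end is an extra calculation that the source does not make, and the dictionary and function names are chosen for illustration.

```python
# Figures from the Barroso, Dean and Hölzle passage quoted in the text.
rack = {"cost_usd": 278_000, "cpus": 176, "ram_gb": 176, "disk_tb": 7}
server = {"cost_usd": 758_000, "cpus": 8, "ram_gb": 64, "disk_tb": 8}


def ratio(a, b, key):
    """How many times larger a[key] is than b[key]."""
    return a[key] / b[key]


def cost_per_cpu(system):
    """Purchase price divided by processor count."""
    return system["cost_usd"] / system["cpus"]


price_ratio = ratio(server, rack, "cost_usd")  # roughly 2.7, the "three times" in the quote
cpu_ratio = ratio(rack, server, "cpus")        # 22.0
ram_ratio = ratio(rack, server, "ram_gb")      # 2.75, the "three times less RAM"
per_cpu_ratio = cost_per_cpu(server) / cost_per_cpu(rack)  # roughly 60

print(f"price {price_ratio:.1f}x, CPUs {cpu_ratio:.0f}x, RAM {ram_ratio:.2f}x")
print(f"cost per CPU is about {per_cpu_ratio:.0f} times higher on the high-end server")
```

The three-to-one figure compares sticker prices only. Measured per processor, the gap on these numbers is far wider.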

Snapshots of Google Technology

Google engineers generate a large volume of technical information. Some of the data are in the form of patents, often written in a style that communicates little of the patent’s substance to a lay reader. The link for Google’s publications can shift unexpectedly.20 Exploring biographies of Google executives and Google Web logs can yield some useful technical information. For example, one Google biography linked to more than 36 personal projects, including one by Google’s CEO.21 Surprisingly, Google’s search engine does a hit-and-miss job of indexing Google’s own technical information.

17. Google does not explicitly state that it has embraced a services oriented architecture or SOA. However, many of Google’s practices illustrate an informed use of certain features of SOA.

18. Luiz André Barroso, Jeffrey Dean, and Urs Hölzle, “Web Search for a Planet: The Google Cluster Architecture”, IEEE Computer Society 0272-1732/03, March April 2003.

19. A review of Google’s cost estimates for this monograph revealed that Google is understating its cost advantage by one or two orders of magnitude. As the performance of commodity hardware goes up, the cost of that hardware goes down. Bulk purchasing chops as much as 50 percent off the cost of some hardware. Google can replicate its data and give away free gigabytes of email storage. The cost to Google can be as low as a few cents a gigabyte.

20. See http://labs.google.com/papers.html#compilers on June 1, 2005.

Useful engineering information appears on the Google Web site. The topics covered in various monographs, white papers and technical notes concern a wide range of subjects. For example, in mid-2005, papers were available on such topics as algorithms, compiler optimization, information retrieval, artificial intelligence, file system design, data mining, genetic algorithms, software engineering and design, and operating systems and distributed systems, among others. Google explains its use of very large files as well as how the Google-modified version of Linux automatically allocates work and avoids the file system bottlenecks that can plague Solaris and Windows Advanced Server 2003, among others.

Google’s technical papers and Google patents provide some insight into areas of interest at Google. For example, Google is posting more information about operating systems and applications. The thrust of Google’s innovation is to build out the search platform and expand the functionality of its back-office programs such as those used for advertising services.

The annex to this monograph provides information about more than 60 patents for which Google is believed to be the assignee. To provide a more fine-grained look at Google technology, the table below identifies selected examples of innovations documented by Google engineers or researchers close to the company. Most of these papers appeared prior to Google’s receiving a patent for the technology referenced in these reports:

21. This is the lex project that “helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine.”

| Technology | Purpose | To Learn More |
|---|---|---|
| Google Suggest | Helps users find needed information by analysing queries and suggesting other queries. | Services Computing, 2004 IEEE International Conference on (SCC'04) by Stephen Davies, Serdar Badem, Michael D. Williams, Roger King, September 2004. |
| Video Object Search | User types an object name and Google finds that object in a video. | Ninth IEEE International Conference on Computer Vision, Volume 2, Josef Sivic, Andrew Zisserman. Publication Date: October 2003. |
| MapReduce | New functions in Google Linux to speed programming and other processes involving large data sets. | OSDI Proceedings, December 2004. |
| Google File System | Extension to Google Linux to allow high-speed data reads and writes from commodity drives. | ACM Publication 1-58113-757-5/03/0010. |



Drawbacks of the Googleplex

The coaching mantra, “No pain without gain” is true for Google. Google does make mistakes, and some big ones. The example fresh in news headlines is Web Accelerator. The product was introduced in May 2005 and withdrawn less than six weeks later. Speed and nimbleness aside, Web Accelerator was technology that ran head on into “issues.” Of greater consequence are the periodic slowdowns for Gmail. The Googleplex is scalable, but until more servers are online, users may face annoying delays.

Going Too Fast: The Google Web Accelerator

The Web Accelerator software was supposed to use Google servers to store Web pages a user viewed. Web Accelerator parsed a page in the user’s browser. The Web Accelerator function then followed each link on that specific page. The page was then stored in a Google cache. When the user clicked on a link, the user would see the page from the Google cache, thus reducing the time required to display the page to the user.

Web Accelerator worked fine on such sites as www.whitehouse.gov, which makes minimal use of advanced Web services. Unfortunately, the Web Accelerator function followed links that transmitted instructions to Web applications. For example, Web Accelerator would click on “delete” links, causing some Web applications such as Backpack to remove the user’s preferences or content.22 Web Accelerator blithely ignored confirmations generated by JavaScript so that unintentional instructions were transmitted. Some Google watchers raised questions about caching data as well as privacy and copyright issues. Before these concerns reached a crescendo, Google reported that Web Accelerator had reached its capacity. Google blocked downloads for the product.
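The failure mode is easy to reproduce in miniature. The sketch below is not how Web Accelerator worked internally; it is a hypothetical prefetcher that collects every link on a page, plus an equally hypothetical keyword filter showing one naive way such a tool might skip links that look as if they change state. The names LinkCollector, RISKY_WORDS and prefetch_candidates are invented for this example.

```python
from html.parser import HTMLParser

# Hypothetical heuristic: words in a URL that suggest the link changes state.
RISKY_WORDS = ("delete", "remove", "logout", "unsubscribe")


class LinkCollector(HTMLParser):
    """Collects the href of every anchor tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def is_safe_to_prefetch(href):
    # A link is treated as unsafe if its URL contains any risky word.
    lowered = href.lower()
    return not any(word in lowered for word in RISKY_WORDS)


def prefetch_candidates(html):
    # A prefetcher with no filter would fetch everything in collector.links.
    collector = LinkCollector()
    collector.feed(html)
    return [href for href in collector.links if is_safe_to_prefetch(href)]


page = '<a href="/notes/12">Open</a> <a href="/notes/12/delete">Delete</a>'
print(prefetch_candidates(page))  # prints ['/notes/12']
```

A keyword filter of this kind is fragile, since a destructive link need not contain any telltale word, which is one reason following links automatically is risky for Web applications.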

The Laws of Physics: Heat and Power 101

Google does not reveal the number of servers it uses, but the number is believed to be in the 150,000 to 170,000 range as of June 30, 2005. Conflicting information surfaces in Web logs and in talks at conferences. In reality, no one outside Google knows. Google has a rapidly expanding number of data centers. The data center near Atlanta, Georgia, is one of the newest deployed. This state-of-the-art facility reflects what Google engineers have learned about heat and power issues in its other data centers. Within the last 12 months, Google has shifted from concentrating its servers at about a dozen data centers, each with 10,000 or more servers, to about 60 data centers, each with fewer machines.23 The change is a response to the heat and power issues associated with larger concentrations of Google servers.

| Technology | Purpose | To Learn More |
|---|---|---|
| Identify Authoritative or High-Value Sources in Web Content | Uses pattern mining in order to generate a numeric value to indicate an authoritative source as an indication of content quality. | Seventh International Database Engineering and Applications Symposium (IDEAS'03), Haofeng Zhou, Yubo Lou, Qingqing Yuan, Wilfred Ng, Wei Wang, Baile Shi, July 2003. |
| MetaCrystal | Metasearch technology to allow a single query to retrieve and organize results in a visual display. | Second International Conference on Coordinated & Multiple Views in Exploratory Visualization (CMV'04), Anselm Spoerri, July 2004. |

22. Backpack is a Web application that sends a user the contents of any page as email. See www.backpackit.com.

The most failure-prone components are:

• Fans.

• IDE drives, which fail at the rate of one per 1,000 drives per day.

• Power supplies, which fail at a lower rate.
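The drive failure rate gives a feel for the repair workload. The short calculation below combines the quoted rate with the server estimate given earlier in this section; the figure of one IDE drive per server is an assumption made only for the example, since the text does not state how many drives each pizza box holds.

```python
DRIVES_PER_DAILY_FAILURE = 1000   # quoted rate: one failure per 1,000 drives per day
DRIVES_PER_SERVER = 1             # assumption for illustration; not stated in the text


def expected_daily_failures(drive_count):
    """Expected number of drive failures per day at the quoted rate."""
    return drive_count / DRIVES_PER_DAILY_FAILURE


for servers in (150_000, 170_000):  # the estimated range of Google servers
    drives = servers * DRIVES_PER_SERVER
    print(servers, "servers:", expected_daily_failures(drives), "drive failures per day")
```

Under these assumptions the fleet loses on the order of 150 to 170 drives every day, which helps explain why repairs are handled as scheduled batches.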

Repairs are batch operations. Scheduling the fixes is a major job and work is underway to improve the Google-developed scheduling capability. Google has to locate hosting facilities that can meet the company’s heat and power requirements.

Other Data Center Issues

Google data centers have access to multiple high-speed lines and normal data center functions such as redundant power, traffic routing and strict rules governing access to the physical boxes.

PRWeaver’s Web log contained a posting of a photograph allegedly taken inside a Google data center. If genuine, the physical layout of the racks holding an estimated 2,000 or more servers squeezes a large amount of hardware into a tightly packed space.

This type of dense configuration helps explain the comments about Google’s heat and power concerns. Most data centers were not designed to handle dense concentrations of thousands of servers. Heat contributes to hard drive failures. On the plus side, the dense configuration makes set up and maintenance somewhat easier. Google packs servers on two sides of a rack.

A unique property of the data centers is that replicated content can be written from one data center to another. Google data within the data center are replicated on other servers and other clusters running in the racks.

The Google “plug and play” engineering philosophy appears to be used in and across data centers. If a data center, such as the one shown above, needs additional index server capacity, the technicians in that center can build a Google rack of 40 pizza box servers. These servers are connected to the network. When the rack is powered up, it becomes available to the master servers for that data center. These master servers then mark the rack’s resources as available. Master servers then begin sending work to the new devices. The information about data centers indicates that this “plug and play” concept and automatic discovery of new resources applies to new data centers, not just the racks within them.

23. These data appear at www.mcdar.net/SEOTools.htm

It may be an exaggeration to say that a Google rack and the data center in which the rack resides work like a USB mouse. The general concept seems to be what Google engineers have tried to achieve. By eliminating such tasks as certifying and configuring Small Computer System Interface RAID storage devices, Google is content to let the auto-discovery functionality alert a “master server” to a new resource, master servers to alert other master servers, masters to notify clients of tasks, and data centers to pass information that racks, clusters or a new data center are available for use.
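The chain of notifications described above can be modelled with a small registry. The sketch is purely illustrative: the MasterServer class, its announce and assign methods, and the rack and task names are all hypothetical, and nothing here reflects how Google’s masters actually exchange messages.

```python
class MasterServer:
    """Hypothetical master that tracks which racks are available for work."""

    def __init__(self, name):
        self.name = name
        self.available = {}   # rack id -> number of idle servers
        self.peers = []       # other masters to notify

    def announce(self, rack_id, servers, _seen=None):
        # Record the new rack, then pass the news on to every peer once.
        seen = _seen if _seen is not None else set()
        if self.name in seen:
            return
        seen.add(self.name)
        self.available[rack_id] = servers
        for peer in self.peers:
            peer.announce(rack_id, servers, seen)

    def assign(self, task):
        # Send the task to the rack with the most idle servers.
        idle = {rack: n for rack, n in self.available.items() if n > 0}
        if not idle:
            raise RuntimeError("no idle servers registered")
        rack_id = max(idle, key=idle.get)
        self.available[rack_id] -= 1
        return task, rack_id


a, b = MasterServer("master-a"), MasterServer("master-b")
a.peers.append(b)
b.peers.append(a)
a.announce("rack-17", 40)          # a new rack of 40 pizza boxes is powered up
print(b.available)                 # prints {'rack-17': 40}
print(b.assign("index-query"))     # prints ('index-query', 'rack-17')
```

No operator configures the second master; it learns of the rack from its peer and can send work to it at once, which is the behaviour the USB mouse analogy is meant to convey.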

A Google engineer said, “Wherever we put a cluster, we have heat, cooling and power issues. When we put in a data center, that data center operator faces new challenges. We use each day four megawatts of electric power.”

The problems include:

1 Heat. Special racks with fans that cool the core of the rack are used.

2 Power. The power demand at load is greater than data centers typically sustain. “Our cages are custom built and there’s a lot of work done by us and the data center people before we can flip the switch,” said Jeff Dean, a senior Google engineer.

3 Network management tools. Google has had to create network management tools to manage its self-healing, automatic failover operating system.

What’s Up, Sergey?

The Google data centers are concentrated in North America with other data centers located in Switzerland, the Pacific Rim, and Beijing.24

Because the GOS is self-healing, the operating system and the various “master computers” in a cluster know what device is online and what device is dead. Off-the-shelf network management tools are not tailored to Google’s requirements. Therefore, Google is developing network management and monitoring tools so that the information in the Google operating system log files can be displayed in a meaningful way to Google network engineers.

The overall Googleplex works and continues working even if a device, rack or data center goes dark or dies. Network management tools have to provide a broad range of monitoring and support functions for the global network, devices, data flows, work loads and potential problem areas. Google is developing needed network management tools specifically for the Googleplex.

24. The Beijing data center was purpose built to conform to the ruling body’s requirements for online access, monitoring and related issues. Google complied in order to do business in China. Yahoo! bought 3721.com in order to accelerate its effort in China.



Unanticipated Faults Could Derail Google’s Juggernaut

Google’s network uses a number of concepts from the fringes of computer innovation as well as hands-on knowledge gained from the Googleplex itself. The result is a highly resilient network that may breed problems not previously encountered. Although Google has operated for more than five years without downtime from system failure, the possibility – however remote – does exist that something unanticipated could occur. A sufficiently large problem could deal Google a severe blow. The advanced technology of Google’s MapReduce tool and its 400-module library could pose as yet unforeseen technical problems.

Summary of Google’s Drawbacks

Critics of Google can point to three “problems” with Google’s approach to performance.

First, Google is a one-trick pony. The changes to Linux and the other technical modifications are little more than hackers’ attempts to squeeze a small performance gain.

Second, Google’s use of commodity hardware and cheap storage is a risky solution. Unknown problems may lurk when cheap components are used in a mission-critical system. Increasing the potential risk are the changes Google makes to speed up program execution.

The diagram shows how Google’s approach eliminates the bottleneck in parallelized systems produced by excessive message traffic flowing through a server coordinating work among different computers. This is a diagram produced by Google engineers.



Finally, other operating systems – including those from computer research laboratories and even Microsoft – do the same things and have for years.

Leveraging the Googleplex

Google has demonstrated that search is just one application that can run in the Google environment. There are many other applications that can benefit from Google’s approach to online services.

1 Applications that require a high performance payoff for a low cost such as electronic mail.

2 An application that can run in Google’s redundant environment where there is no private-state replication such as found in IBM’s AS/400 operating environment and others.

3 Computationally-intensive, stateless applications.

4 Applications that require request-level parallelism, a characteristic exploitable by running individual requests on separate servers such as Google Earth.

There is little to be gained by trotting out war-horses to trample Google. The user experience speaks for itself. Google’s approach to massively-parallel distributed computing works, even on dial-up networks.

Google fused the type of thinking associated with small, cash-strapped companies with techniques from advanced computer systems. Commodity products keep costs down. A modified Linux delivers fast performance at a bargain-basement cost. Google is taking a strategic risk with commodity hardware and a souped-up version of Linux. Each day Google bets that its technologists can keep the system humming.

Another reason why Google’s approach to technology is paying off is that Google employs the same pragmatism and cleverness in application development. Google uses standard engineering practices, proprietary knowledge, and off-the-shelf techniques such as its use of Web services. Google uses the same Web programming techniques that millions of Web developers use. The payoff is that it is easy for Google to hire people who can code for the Googleplex. Google so far has not had to spend money for developer marketing programs or train new hires to work in the Googleplex.

The biggest boost to Google’s technical approach is that its competitors are following different, more expensive approaches. Yahoo is a fruit cake of hardware, operating systems, and applications coded at different times in different languages by different people. Microsoft uses its own operating systems but relies on other operating systems as well, including Solaris. Microsoft must invest in hardware to squeeze performance out of its platforms. Yahoo wrestles with its many different platforms. Microsoft seems powerless to enhance the speed of its operating system. Both are digital ostriches burying their heads in their own marketing material.



Google’s technology is one major challenge to Microsoft and Yahoo. So to conclude this cursory and vastly simplified look at Google technology, consider these items:

1 Google is fast anywhere in the world.

2 Google learns. When the heat and power problems at dense data centers surfaced, Google introduced cooling and power conservation innovations to its two dozen data centers.

3 Programmers want to work at Google. “Google has cachet,” said one recent University of Washington graduate.

4 Google’s operating and scaling costs are lower than those of most other firms in similar businesses.

5 Google squeezes more work out of programmers and engineers by design.

6 Google does not break down, or at least it has not gone offline since 2000.

7 Google’s Googleplex can deliver desktop-server applications now.

8 Google’s applications install and update without burdening the user with gory details and messy crashes.

9 Google’s patents provide basic technology insight pertinent to Google’s core functionality.

A young programmer in Osaka or Beijing is very likely to have been influenced by Google. The skilled programmers want to work at Google, develop for the Googleplex, and, if possible, create their own Google killer. The mantra is, “Be like Sergey and Larry”.

Google has a next-generation computing platform. That platform is optimised to deliver virtual applications to its users worldwide. Google uses standard Web technologies in clever ways. Although the technical challenges facing Google are formidable, the company has advanced the art of online computing.
