Can Google lead amid its ever-growing infrastructure and computation expenditures?

While reading our daily dose of news from the web sector, we came across an interesting fact worth mentioning: Google processes about 20 petabytes of data per day (20,000 terabytes, or 20 million gigabytes).

The average MapReduce job is said to run across a $1 million hardware cluster, not including bandwidth fees, datacenter costs, or staffing. The January 2008 MapReduce paper provides new insights into the hardware and software Google uses to process tens of petabytes of data per day.

In September 2007, for example, the paper shows that Googlers ran 2217 MapReduce jobs that consumed approximately 11,000 machine years in a single month. Breaking these numbers down further: 2217 jobs x 395 sec average completion time = 0.0278 years, and 11,081 machine years / 0.0278 years implies about 399,000 machines. Since this is believed to double about every 6 months, one may guess Google is up to about 600,000 machines by now.
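The back-of-the-envelope estimate above can be reproduced in a few lines of Python. This is only a sketch of the arithmetic as stated in this post: it treats the jobs as running one after another, and the function name, variable names and the 365-day year are our own assumptions rather than anything taken from the paper.

```python
# Figures quoted in this post for September 2007.
JOBS = 2217                 # number of MapReduce jobs, as stated above
AVG_JOB_SECONDS = 395       # average job completion time in seconds
MACHINE_YEARS = 11_081      # machine years consumed in the month

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: a 365-day year


def estimate_machines(jobs, avg_seconds, machine_years):
    """Divide machine years by the total wall-clock time of all jobs.

    Assumes the jobs run back to back, which is the simplification
    the estimate in the text relies on.
    """
    wall_clock_years = jobs * avg_seconds / SECONDS_PER_YEAR
    return machine_years / wall_clock_years


wall_clock_years = JOBS * AVG_JOB_SECONDS / SECONDS_PER_YEAR
machines = estimate_machines(JOBS, AVG_JOB_SECONDS, MACHINE_YEARS)

print(f"wall-clock time: {wall_clock_years:.4f} years")
print(f"implied machines: {machines:,.0f}")
```

The result lands at roughly 399,000 machines, matching the figure in the text; changing the assumed year length to 365.25 days shifts it only slightly.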

Google converted its search indexing systems to the MapReduce system in 2003, and currently processes over 20 terabytes of raw web data.

Google is known to run on hundreds of thousands of servers – by one estimate, in excess of 450,000 (data as of 2006, today more likely 600,000) – racked up in thousands of clusters in dozens of data centers around the world. It has data centers in Dublin, Ireland; in Virginia; and in California, where it just acquired the million-square-foot headquarters it had been leasing. It recently opened a new center in Atlanta, and is currently building two football-field-sized centers in The Dalles, Ore.

Microsoft, by contrast, made about a $1.5 billion capital investment in server and data structure infrastructure in 2006. Google is known to have spent at least as much to maintain its lead, following an $838 million investment in 2005. We estimate Google’s 2008 IT expenditures to be in the $2B range.

Google buys, rather than leases, computer equipment for maximum control over its infrastructure. Google chief executive officer Eric Schmidt once defended that strategy in a conference call with financial analysts. “We believe we get tremendous competitive advantage by essentially building our own infrastructures,” he said.

In general, Google has a split personality when it comes to questions about its back-end systems. To the media, its answer is, “Sorry, we don’t talk about our infrastructure.” Yet, Google engineers crack the door open wider when addressing computer science audiences, such as rooms full of graduate students whom it is interested in recruiting.

Among other things, Google has developed the capability to rapidly deploy prefabricated data centers anywhere in the world by packing them into standard 20- or 40-foot shipping containers.

An interesting fact from Google’s history dates back to 2003, when Google noted in a paper that the power requirements of a densely packed server rack could range from 400 to 700 watts per square foot, yet most commercial data centers could support no more than 150 watts per square foot. In response, Google was investigating more power-efficient hardware, and reportedly switched from Intel to AMD processors for this reason. Google has not confirmed the choice of AMD, which was reported two years later by Morgan Stanley analyst Mark Edelstone.

Google relies mainly on its own internally developed software for data and network management and has a reputation for being skeptical of “not invented here” technologies, so relatively few vendors can claim it as a customer.

Google is rumored to be planning to eventually build its own servers, storage systems, Internet switches and perhaps, sometime in the future, even optical transport systems.

Other rumors claim that Google is a big buyer of dark fiber to connect its data centers, which helps explain why the company spent nearly $3.8 billion over the past seven quarters on capital expenditures.

That is a tremendous amount of information and IT operations. Based on our basic calculations, and assuming our human computation is correct, Google is facing IT expenditures in the $2B range per year, including its data centers and people.

Google’s competitive advantage does not rest on its infrastructure alone: it also comes from its employees (Google has what is arguably the brightest group of people ever assembled for a publicly held company), proprietary software, global brand awareness, huge market capitalization and revenues of more than $10B per year. Even so, we think a $2B burn rate per year on computing needs alone is a “walking on thin ice” strategy at breakneck pace. Companies like Cuill, which claims to have invented a technology 10 times cheaper than Google’s for indexing and storing information, Powerset, which works in a Hadoop/HBase environment, IBM, Microsoft and Yahoo! could potentially gain an advantage over Google as the Web grows further, and Google’s computing expenses with it.

We have also found on the Web that Google processes its data on standard cluster nodes, each consisting of two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives and a gigabit Ethernet link.

Yahoo! and Powerset are known to use Hadoop, while Microsoft’s equivalent is called Dryad. Dryad and Hadoop are the competing equivalents of Google’s GFS, MapReduce and BigTable.

More about MapReduce

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
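The programming model described above can be illustrated with the classic word-count example. The sketch below is a toy, single-process stand-in and not Google’s implementation: the names map_fn, reduce_fn and run_mapreduce are hypothetical, and the real run-time system would partition the input and spread the work across many machines instead of looping in one process.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: (document name, document text) -> list of (word, 1) pairs."""
    return [(word.lower(), 1) for word in value.split()]


def reduce_fn(key, values):
    """Reduce: merge all counts emitted for the same word."""
    return sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy stand-in for the run-time system, running in one process."""
    # Map phase: apply map_fn to every input pair and group the
    # intermediate values by their intermediate key.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)

    # Reduce phase: call reduce_fn once per intermediate key.
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}


documents = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
counts = run_mapreduce(documents, map_fn, reduce_fn)
print(counts["the"], counts["fox"], counts["dog"])
```

The user only writes map_fn and reduce_fn; everything inside run_mapreduce is what the run-time system would take over, which is why the model suits programmers without distributed systems experience.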

Google’s implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

More about Hadoop

Hadoop is an interesting software platform that lets one easily write and run applications that process vast amounts of data. Here’s what makes Hadoop especially useful:

Scalable: Hadoop can reliably store and process petabytes.

Economical: It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.

Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely rapid.

Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.

Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located. Hadoop has been demonstrated on clusters with 2000 nodes. The current design target is 10,000 node clusters. Hadoop is a Lucene sub-project that contains the distributed computing platform that was formerly a part of Nutch.
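The idea of replicating blocks and then running tasks where the data already lives can be sketched as follows. This is a simplified illustration and not the HDFS placement policy or the Hadoop scheduler: the round-robin placement, the replication factor of 3, the node names and both function names are assumptions made for the example.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: simple round-robin placement, chosen only to keep
    the example short.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def schedule_local(placement, failed=frozenset()):
    """For each block, pick the least-loaded live node holding a replica."""
    load = {}
    schedule = {}
    for block, replicas in placement.items():
        live = [node for node in replicas if node not in failed]
        if not live:
            raise RuntimeError(f"block {block} has no live replica")
        chosen = min(live, key=lambda node: load.get(node, 0))
        schedule[block] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return schedule


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
placement = place_replicas(num_blocks=6, nodes=nodes, replication=3)

schedule = schedule_local(placement)
rescheduled = schedule_local(placement, failed={"node-a"})

for block in placement:
    print(block, placement[block], schedule[block], rescheduled[block])
```

Because every block has several replicas, a task for that block can be moved to another node that already holds the data when one node fails, which is the behavior the “Reliable” and “Efficient” points describe.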

More about Dryad

Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.

The Structure of Dryad Jobs
 
A Dryad programmer writes several sequential programs and connects them using one-way channels. The computation is structured as a directed graph: programs are graph vertices, while the channels are graph edges. A Dryad job is a graph generator which can synthesize any directed acyclic graph. These graphs can even change during execution, in response to important events in the computation.
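The graph structure described above can be sketched in Python using the standard library’s graphlib module (Python 3.9 or later). This is not the Dryad API: the vertex functions, the channels dictionary and run_job are hypothetical names, and the sketch runs every vertex in one process in dependency order, whereas Dryad would distribute the vertices over a cluster.

```python
from graphlib import TopologicalSorter


# Each vertex is an ordinary sequential program (here, a function)
# that receives a list with the outputs of its upstream vertices.
def read_lines(_inputs):
    return ["map reduce", "dryad reduce"]


def split_words(inputs):
    (lines,) = inputs
    return [word for line in lines for word in line.split()]


def count_words(inputs):
    (words,) = inputs
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def longest_word(inputs):
    (words,) = inputs
    return max(words, key=len)


def report(inputs):
    counts, longest = inputs
    return {"counts": counts, "longest": longest}


vertices = {
    "read": read_lines,
    "split": split_words,
    "count": count_words,
    "longest": longest_word,
    "report": report,
}

# One-way channels: each vertex maps to its upstream vertices,
# listed in the order the vertex expects its inputs.
channels = {
    "read": [],
    "split": ["read"],
    "count": ["split"],
    "longest": ["split"],
    "report": ["count", "longest"],
}


def run_job(vertices, channels):
    """Run vertices in an order that respects the channels.

    TopologicalSorter raises CycleError if the graph is not acyclic.
    """
    results = {}
    for name in TopologicalSorter(channels).static_order():
        inputs = [results[upstream] for upstream in channels[name]]
        results[name] = vertices[name](inputs)
    return results


results = run_job(vertices, channels)
print(results["report"])
```

Note that “count” and “longest” both read from “split” and do not depend on each other, so a real scheduler would be free to run them in parallel on different machines.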

Dryad is quite expressive. It completely subsumes other computation frameworks, such as Google’s map-reduce, or the relational algebra. Moreover, Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting.

More

http://doi.acm.org/10.1145/1327452.1327492
http://www.niallkennedy.com/blog/2008/01/google-mapreduce-stats.html
http://labs.google.com/papers/mapreduce.html
http://research.google.com/people/jeff/
http://research.google.com/people/sanjay/
http://research.microsoft.com/research/sv/dryad/
http://lucene.apache.org/hadoop/
http://labs.google.com/papers/gfs.html
http://labs.google.com/papers/bigtable.html
http://www.techcrunch.com/2008/01/09/google-processing-20000-terabytes-a-day-and-growing/
http://feedblog.org/2008/01/06/mapreduce-simplified-data-processing-on-large-clusters/
http://en.wikipedia.org/wiki/MapReduce#Uses
http://open.blogs.nytimes.com/tag/hadoop/
http://www.baselinemag.com/print_article2/0,1217,a=182560,00.asp
http://www.stanford.edu/services/websearch/Google/
http://gigaom.com/2007/12/04/google-infrastructure/
http://gigaom.com/2005/09/19/google-asks-for-googlenet-bids/