
Can Google lead amid its ever-growing infrastructure and computation expenditures?

While reading our daily dose of news, stories and events from the web sector, we came across an interesting fact worth mentioning: Google appears to be processing huge amounts of data in its daily routine – 20 petabytes per day (20,000 terabytes, or 20 million gigabytes).

The average MapReduce job is said to run across a $1 million hardware cluster, not including bandwidth fees, datacenter costs, or staffing. The January 2008 MapReduce paper provides new insight into the hardware and software Google uses to crunch tens of petabytes of data per day.

In September 2007, for example, the paper shows Googlers ran some 2,217,000 MapReduce jobs that consumed approximately 11,081 machine-years of computation in a single month, with an average job completion time of 395 seconds. Breaking these numbers down: 11,081 machine-years of work packed into one month implies roughly 133,000 machines busy with MapReduce around the clock, and dividing the machine-years by the total job time (2,217,000 jobs x 395 sec, about 27.8 years) implies an average of about 400 machines per job. Since Google's server count is believed to double about every 6 months, one may guess Google is up to about 600,000 machines by now.
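For the curious, here is a quick back-of-the-envelope sketch of that arithmetic in Python. The job count, average duration and machine-years are the figures reported in the paper; the derived numbers are just our own division, not something the paper states.

```python
# Back-of-the-envelope math on the September 2007 figures reported
# in the January 2008 MapReduce paper.

SECONDS_PER_YEAR = 365 * 24 * 3600

jobs = 2217000           # MapReduce jobs run in September 2007
avg_job_seconds = 395    # average job completion time
machine_years = 11081    # total machine time consumed that month

# Total job time if every job ran back to back on one machine.
serial_years = jobs * avg_job_seconds / SECONDS_PER_YEAR

# Average number of machines working on a single job.
machines_per_job = machine_years / serial_years

# Machines needed to deliver 11,081 machine-years within one month.
machines_busy_all_month = machine_years * 12

print(f"Serial job time: {serial_years:.1f} years")              # ~27.8
print(f"Average machines per job: {machines_per_job:.0f}")       # ~400
print(f"Machines busy all month: {machines_busy_all_month:,}")   # ~133,000
```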

Google converted its search indexing systems to MapReduce in 2003, and the indexing pipeline currently processes over 20 terabytes of raw web data.

Google is known to run on hundreds of thousands of servers – by one estimate, in excess of 450,000 (data as of 2006, today more likely 600,000) – racked up in thousands of clusters in dozens of data centers around the world. It has data centers in Dublin, Ireland; in Virginia; and in California, where it just acquired the million-square-foot headquarters it had been leasing. It recently opened a new center in Atlanta, and is currently building two football-field-sized centers in The Dalles, Ore.

Microsoft, by contrast, made about a $1.5 billion capital investment in server and data center infrastructure in 2006. Google is known to have spent at least as much to maintain its lead, following an $838 million investment in 2005. We estimate Google's 2008 IT expenditures to be in the $2B range.

Google buys, rather than leases, computer equipment for maximum control over its infrastructure. Google chief executive officer Eric Schmidt once defended that strategy on a conference call with financial analysts. “We believe we get tremendous competitive advantage by essentially building our own infrastructures,” he said.

In general, Google has a split personality when it comes to questions about its back-end systems. To the media, its answer is, “Sorry, we don’t talk about our infrastructure.” Yet Google engineers crack the door open wider when addressing computer science audiences, such as rooms full of graduate students the company is interested in recruiting.

Among other things, Google has developed the capability to rapidly deploy prefabricated data centers anywhere in the world by packing them into standard 20- or 40-foot shipping containers.

An interesting fact from Google's history dates back to 2003, when Google noted in a paper that the power requirements of a densely packed server rack could range from 400 to 700 watts per square foot, yet most commercial data centers could support no more than 150 watts per square foot. In response, Google was investigating more power-efficient hardware, and reportedly switched from Intel to AMD processors for this reason. Google has not confirmed the choice of AMD, which was reported two years later by Morgan Stanley analyst Mark Edelstone.

Google relies mainly on its own internally developed software for data and network management and has a reputation for being skeptical of technology “not invented here,” so relatively few vendors can claim it as a customer.

Google is rumored to be planning to eventually build its own servers, storage systems, Internet switches and perhaps, sometime in the future, even optical transport systems.

Other rumors claim Google is a big buyer of dark fiber to connect its data centers, which helps explain why the company spent nearly $3.8 billion on capital expenditures over the past seven quarters.

That is a tremendous amount of information and IT operations. Based on our basic calculations, as far as our human computation is correct, Google is facing IT expenditures in the $2B range per year, including its data centers and people.

Even though Google’s competitive advantage rests not only on its infrastructure but also on its employees (Google has what is arguably the brightest group of people ever assembled at a publicly held company), proprietary software, global brand awareness, huge market capitalization and revenues of more than $10B per year, we think a $2B-per-year burn rate on computing needs alone is a “walking on thin ice” strategy pursued at breakneck pace. Companies like Cuill, which claims to have invented technology 10 times cheaper than Google’s for indexing and storing information, Powerset, which works in a Hadoop/HBase environment, and IBM, Microsoft and Yahoo! could potentially gain an advantage over Google as the Web, and with it Google’s computing expenses, grows further.

By the way, we have also found on the Web that Google processes its data on standard cluster nodes consisting of two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives and a gigabit Ethernet link.

Yahoo! and Powerset are known to use Hadoop, while Microsoft’s equivalent is called Dryad. Dryad and Hadoop are the competing equivalents of Google’s GFS, MapReduce and BigTable.

More about MapReduce

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
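To make the model concrete, here is a minimal, single-machine sketch in Python of the classic word-count example from the paper. The function names are ours, and none of the partitioning, scheduling or fault tolerance described above is modeled; this only illustrates the map/shuffle/reduce data flow.

```python
from collections import defaultdict

# map: (document_name, contents) -> list of intermediate (word, 1) pairs
def map_fn(doc_name, contents):
    return [(word, 1) for word in contents.split()]

# reduce: (word, [counts]) -> (word, total occurrences)
def reduce_fn(word, counts):
    return word, sum(counts)

def mapreduce(documents):
    # "Shuffle" phase: group all intermediate values by their key.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            intermediate[key].append(value)
    # Reduce phase: merge the values associated with each key.
    return dict(reduce_fn(key, values) for key, values in intermediate.items())

docs = {"a.txt": "the quick brown fox", "b.txt": "the lazy dog"}
print(mapreduce(docs))  # {'the': 2, 'quick': 1, 'brown': 1, ...}
```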

Google’s implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

More about Hadoop

Hadoop is an interesting software platform that lets one easily write and run applications that process vast amounts of data. Here’s what makes Hadoop especially useful:

Scalable: Hadoop can reliably store and process petabytes.

Economical: It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.

Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely rapid.

Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.

Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located. Hadoop has been demonstrated on clusters with 2000 nodes. The current design target is 10,000 node clusters. Hadoop is a Lucene sub-project that contains the distributed computing platform that was formerly a part of Nutch.
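Hadoop MapReduce jobs are normally written in Java, but Hadoop also ships a Streaming utility that lets any executable reading stdin and writing stdout act as the mapper or reducer. Below is an illustrative word-count script in Python along those lines; the file name and the streaming jar path in the comments are placeholders that vary by installation, not an official example.

```python
#!/usr/bin/env python
"""Word count for Hadoop Streaming: one script acts as mapper or reducer.

Local test (a shell sort stands in for Hadoop's shuffle):
    cat input.txt | python wordcount.py map | sort | python wordcount.py reduce

On a cluster (the streaming jar path varies by installation):
    hadoop jar hadoop-streaming.jar -input in -output out \
        -mapper "python wordcount.py map" -reducer "python wordcount.py reduce"
"""
import sys

def mapper():
    # Emit one "word<TAB>1" line for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so all counts for a word are contiguous.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```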

More about Dryad

Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.

The Structure of Dryad Jobs
 
A Dryad programmer writes several sequential programs and connects them using one-way channels. The computation is structured as a directed graph: programs are graph vertices, while the channels are graph edges. A Dryad job is a graph generator which can synthesize any directed acyclic graph. These graphs can even change during execution, in response to important events in the computation.
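Dryad itself is programmed in C++ against Microsoft’s own APIs, so the following is only a toy Python model of the idea just described: sequential “vertex” programs connected by one-way channels into a small directed acyclic graph, with threads and queues standing in for Dryad’s processes and channels.

```python
from queue import Queue
from threading import Thread

# Toy model only: sequential "vertex" programs (graph nodes) connected by
# one-way channels (graph edges). Real Dryad vertices are separate processes
# spread across a cluster, and the runtime, not the programmer, handles
# placement, scheduling and fault tolerance.

def vertex(program, input_channel, output_channel):
    # Read items from the input channel, run the sequential program on each,
    # and write the results to the output channel.
    while (item := input_channel.get()) is not None:
        output_channel.put(program(item))
    output_channel.put(None)  # propagate the end-of-stream marker

edge_a, edge_b, edge_c = Queue(), Queue(), Queue()

# A two-vertex chain: parse strings to integers, then square them.
stages = [
    Thread(target=vertex, args=(int, edge_a, edge_b)),
    Thread(target=vertex, args=(lambda x: x * x, edge_b, edge_c)),
]
for stage in stages:
    stage.start()

for token in ["1", "2", "3"]:   # feed the first channel
    edge_a.put(token)
edge_a.put(None)

while (result := edge_c.get()) is not None:
    print(result)               # prints 1, 4, 9

for stage in stages:
    stage.join()
```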

Dryad is quite expressive. It completely subsumes other computation frameworks, such as Google’s map-reduce, or the relational algebra. Moreover, Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting.

More

http://doi.acm.org/10.1145/1327452.1327492
http://www.niallkennedy.com/blog/2008/01/google-mapreduce-stats.html
http://labs.google.com/papers/mapreduce.html
http://research.google.com/people/jeff/
http://research.google.com/people/sanjay/
http://research.microsoft.com/research/sv/dryad/
http://lucene.apache.org/hadoop/
http://labs.google.com/papers/gfs.html
http://labs.google.com/papers/bigtable.html
http://www.techcrunch.com/2008/01/09/google-processing-20000-terabytes-a-day-and-growing/
http://feedblog.org/2008/01/06/mapreduce-simplified-data-processing-on-large-clusters/
http://en.wikipedia.org/wiki/MapReduce#Uses
http://open.blogs.nytimes.com/tag/hadoop/
http://www.baselinemag.com/print_article2/0,1217,a=182560,00.asp
http://www.stanford.edu/services/websearch/Google/
http://gigaom.com/2007/12/04/google-infrastructure/
http://gigaom.com/2005/09/19/google-asks-for-googlenet-bids/

Two major acquisition deals within the online storage space

IBM today announced it has acquired XIV, a privately held storage technology company based in Tel Aviv, Israel. XIV, its technologies and its employees will become part of the IBM System Storage business unit of the IBM Systems and Technology Group. Financial terms of the acquisition were not disclosed, but sources put the price at $350M.

XIV’s main product, Nextra, is a storage system based on a grid of standard hardware components. XIV was established in 2002 by five graduates of the 14th class of the Israeli Army’s elite “Talpiot” program, which is where the name XIV comes from: it is the Roman numeral for 14. The company has received only $3 million in backing thus far, making this deal a fairly huge exit for the founders.

“The acquisition of XIV will further strengthen the IBM infrastructure portfolio long term and put IBM in the best position to address emerging storage opportunities like Web 2.0 applications, digital archives and digital media,” said Andy Monshaw, general manager, IBM System Storage. “The ability for almost anyone to create digital content at any time has accelerated the need for a whole new way of applying infrastructure solutions to the new world of digital information.  IBM’s goal is to provide the leading technologies and solutions at every layer of the data center – storage, servers, software and services – to address these new realities IT customers face.” 

“We are pleased to become a significant part of the IBM family, allowing for our unique storage architecture, our engineers and our storage industry experience to be part of IBM’s overall storage business,” said Moshe Yanai, chairman, XIV.  “We believe the level of technological innovation achieved by our development team is unparalleled in the storage industry.  Combining our architectural advancements with IBM’s world-wide research, sales, service, manufacturing, and distribution capabilities will provide us with the ability to have these technologies tackle the emerging Web 2.0 technology needs and reach every corner of the world.”

The Nextra architecture has been in production for more than two years, with more than four petabytes of capacity being used by customers today.

IBM’s acquisition of XIV supports the IBM growth strategy and capital allocation model, as part of the company’s overall objective for earnings-per-share growth through 2010.

XIV is led by Moshe Yanai, one of the key architects of data storage systems and instrumental in the development of EMC’s Symmetrix and DMX product lines throughout the 1990s.

Which brings us to the question of why it was IBM, and not EMC, that bought XIV. EMC, itself a public storage company, has instead acquired the online storage startup Mozy, headquartered in Utah, paying $76 million for the company according to web sources.

“Mozy’s technology and online delivery model has proven itself to be one of the industry’s most admired offerings for customers looking to safely and cost-effectively backup and recover their digital information stored on desktops, laptops, and remote office servers,” said Tom Heiser, EMC SVP, Corporate Development and New Ventures. “The acquisition of Mozy is a natural extension of EMC’s leadership in the protection and security of personal and business information. We will continue to invest in Mozy’s full portfolio of online backup and recovery services and advance the Mozy brand in the marketplace.”

“I have been researching and developing internet-scale storage and information management solutions throughout my career,” said Josh Coates, founder and former CEO of Berkeley Data Systems. “EMC and Berkeley Data Systems are a natural fit, and I’m confident that EMC is the right organization to take Mozy to the next level. I look forward to working with EMC to continue innovating in the storage and information management industry.”

Mozy offers a very simple way for users to back up their computer hard drives online: you download its software and the backups occur gradually over time. Mozy supports both Windows and Mac machines.

Mozy has raised just $1.9 million in venture capital, less than the $3M XIV raised, though XIV’s exit is much larger by comparison. The round, closed in May 2005, was led by Wasatch Ventures, with participation from Tim Draper of Draper Associates and Draper Fisher Jurvetson, and Novell co-founder Drew Major. Mozy was created by Berkeley Data Systems, a Utah-based technology company specializing in large-scale, parallel storage systems and software.

There were rumors circulating some time ago that Mozy was close to being acquired by Google for significantly less than this. The company eventually passed on the deal, which must have been a tough call. They clearly made the right choice in waiting.

About EMC Corporation

EMC Corporation is the world’s leading developer and provider of information infrastructure technology and solutions. We help organizations of every size around the world keep their most essential digital information protected, secure, and continuously available. We are among the 10 most valuable IT product companies in the world. We are driven to perform, to partner, to execute. We go about our jobs with a passion for delivering results that exceed our customers’ expectations for quality, service, innovation, and interaction. We pride ourselves on doing what’s right and on putting our customers’ best interests first. We lead change and change to lead. We are devoted to advancing our people, customers, industry, and community. We say what we mean and do what we say. We are EMC, where information lives. EMC Corporation has nearly $40 billion market cap. EMC is listed on the NYSE (NYSE: EMC).

About IBM System Storage business

IBM is a market leader in the storage industry. Innovative technology, open standards, excellent performance, a broad portfolio of proven storage software, hardware and solutions offerings – all backed by IBM with its recognized e-business on demand® leadership – are just a few of the reasons why you should consider IBM storage offerings. Through its deep industry expertise, patent leadership, research and innovation, IBM has long been the leader in providing customers with technology solutions that help them deliver and utilize information effectively. With industry-recognized leadership in storage and server hardware and software, and through the recent strategic acquisitions of Softek, FileNet and NovusCG, IBM has grown its storage services offerings and presents customers with strategic solutions to deliver integrated software, hardware, services and research in standardized offerings that can be used by customers of all sizes to help them transform their businesses.

Competition

Other online storage companies include: Amazon’s S3 (Simple Storage Service), Cnet’s All you can Upload, AllMyData, Box.net, eSnips, Freepository, GoDaddy, iStorage, Mofile, Omnidrive, Openomy, Streamload, Strongspace, iBackup, Zingee, Xdrive and Carbonite, which is known to have raised $21 million in venture financing.

It is also rumored that Google is planning to launch gDrive. Microsoft is also jumping on the same bandwagon, and more information can be found in the links below. Zmanda is an open-source backup solution as well.

The online storage space is a hugely overpopulated and crowded area. Who is next? A comparison chart covering some of the companies above can be found here: http://www.flickr.com/photo_zoom.gne?id=93730415&size=o

Our basic conclusion is that both XIV and Mozy have made very impressive exits, taking into consideration the small amount of funding each has taken so far.

More

http://www.mozy.com/
http://mozy.com/blog
http://mozy.com/news/releases
http://www.xivstorage.com/
http://www.xivstorage.com/company/company_news.asp 
http://www.emc.com/
http://www.emc.com/about/
http://www.ibm.com/storage
http://www-03.ibm.com/systems/storage/index.html
http://crunchbase.com/company/mozy
http://www.techcrunch.com/2006/01/31/the-online-storage-gang/
http://www.techcrunch.com/2008/01/03/ibm-acquires-storage-company-xiv-for-350-million/
http://www.techcrunch.com/2008/01/03/benchmark-europe-invests-in-uk-gambling-site/
http://www.crunchbase.com/company/carbonite
http://avc.blogs.com/a_vc/2005/12/online_backups_.html
http://jeremiahthewebprophet.blogspot.com/2006/05/online-data-storage-companies-ongoing.html
http://www.microsoft-watch.com/article2/0,1995,1951237,00.asp?kc=MWRSS02129TX1K0000535
http://www.eweek.com/article2/0,1895,1934589,00.asp
http://sftechsessions.com/2006/06/june-online-storage/
http://c2web.blogspot.com/2006/01/carbonite-online-photo-backup.html
http://www.flickr.com/photo_zoom.gne?id=93730415&size=o
http://www.storagesearch.com
http://ptech.wsj.com/archive/ptech-20061214.html
http://www.usatoday.com/tech/products/2007-10-30-tech-backup_N.htm
http://draperandassociates.com/
http://www.dfj.com/