History of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely
used text search library. Hadoop has its origins in Apache Nutch, an open
source web search engine.
The Origin of the Name “Hadoop”
The name Hadoop is not an acronym; it’s a made-up name. The project’s
creator, Doug Cutting, explains how the name came about:
The name my kid gave a stuffed yellow elephant. Short, relatively easy to
spell and pronounce, meaningless, and not used elsewhere: those are my naming
criteria. Kids are good at generating such. Googol is a kid’s term.
Subprojects and “contrib” modules in Hadoop also tend to have names that are
unrelated to their function, often with an elephant or other animal theme
(“Pig,” for example). Smaller components are given more descriptive (and
therefore more mundane) names. This is a good principle, as it means you can
generally work out what something does from its name. For example, the
jobtracker keeps track of MapReduce jobs.
Building a web search engine from scratch was an ambitious goal, for not
only is the software required to crawl and index websites complex to write, but
it is also a challenge to run without a dedicated operations team, since there
are so many moving parts. It’s expensive, too: Mike Cafarella and Doug Cutting
estimated a system supporting a 1-billion-page index would cost around half a
million dollars in hardware, with a monthly running cost of $30,000. Nevertheless,
they believed it was a worthy goal, as it would open up and ultimately
democratize search engine algorithms.
Nutch was started in 2002, and a working crawler and search system quickly
emerged. However, they realized that their architecture wouldn’t scale to the
billions of pages on the Web. Help was at hand with the publication of a paper
in 2003 that described the architecture of Google’s distributed filesystem,
called GFS, which was being used in production at Google. GFS, or something
like it, would solve their storage needs for the very large files generated as
a part of the web crawl and indexing process. In particular, GFS would free up
time being spent on administrative tasks such as managing storage nodes. In
2004, they set about writing an open source implementation, the Nutch
Distributed Filesystem (NDFS).
In 2004, Google published the paper that introduced MapReduce to the world.
Early in 2005, the Nutch developers had a working MapReduce implementation in
Nutch, and by the middle of that year all the major Nutch algorithms had been
ported to run using MapReduce and NDFS.
NDFS and the MapReduce implementation in Nutch were applicable beyond the
realm of search, and in February 2006 they moved out of Nutch to form an
independent subproject of Lucene called Hadoop. At around the same time, Doug
Cutting joined Yahoo!, which provided a dedicated team and the resources to
turn Hadoop into a system that ran at web scale (see sidebar). This was
demonstrated in February 2008 when Yahoo! announced that its production search
index was being generated by a 10,000-core Hadoop cluster.
In January 2008, Hadoop was made its own top-level project at Apache,
confirming its success and its diverse, active community. By this time, Hadoop
was being used by many other companies besides Yahoo!, such as Last.fm,
Facebook, and the New York Times.
In one well-publicized feat, the New York Times used Amazon’s EC2
compute cloud to crunch through four terabytes of scanned archives from the
paper, converting them to PDFs for the Web. The processing took less than 24
hours to run using 100 machines, and the project probably wouldn’t have been
embarked on without the combination of Amazon’s pay-by-the-hour model (which
allowed the NYT to access a large number of machines for a short period) and
Hadoop’s easy-to-use parallel programming model.
In April 2008, Hadoop broke a world record to become the fastest system to
sort a terabyte of data. Running on a 910-node cluster, Hadoop sorted one
terabyte in 209 seconds (just under 3½ minutes), beating the previous year’s
winner of 297 seconds. In November of the same year, Google reported that its
MapReduce implementation sorted one terabyte in 68 seconds.
Hadoop at Yahoo!
Building Internet-scale search engines requires
huge amounts of data and therefore large numbers of machines to process it.
Yahoo! Search consists of four primary components: the Crawler, which
downloads pages from web servers; the WebMap, which builds a graph of
the known Web; the Indexer, which builds a reverse index to the best
pages; and the Runtime, which answers users’ queries. The WebMap is a
graph that consists of roughly 1 trillion (10^12) edges, each
representing a web link, and 100 billion (10^11) nodes, each
representing a distinct URL. Creating and analyzing such a large graph requires
a large number of computers running for many days. In early 2005, the
infrastructure for the WebMap, named Dreadnaught, needed to be
redesigned to scale up to more nodes. Dreadnaught had successfully scaled from
20 to 600 nodes, but required a complete redesign to scale out further.
Dreadnaught is similar to MapReduce in many ways, but provides more flexibility
and less structure. In particular, each fragment in a Dreadnaught job can send
output to each of the fragments in the next stage of the job, but the sort was
all done in library code. In practice, most of the WebMap phases were pairs
that corresponded to MapReduce. Therefore, the WebMap applications would not
require extensive refactoring to fit into MapReduce.
Eric Baldeschwieler (Eric14) created a small team
and started designing and prototyping a new framework written in C++ modeled
after GFS and MapReduce to replace Dreadnaught. Although the immediate need was
for a new framework for WebMap, it was clear that standardization of the batch
platform across Yahoo! Search was critical, and by making the framework general
enough to support other users, they could better leverage the investment in the
new platform.
At the same time, they were watching Hadoop, which
was part of Nutch, and following its progress. In January 2006, Yahoo! hired Doug
Cutting, and a month later they decided to abandon their prototype and adopt Hadoop.
The advantage of Hadoop over their prototype and design was that it was already
working with a real application (Nutch) on 20 nodes. That allowed them to bring
up a research cluster two months later and start helping real customers use the
new framework much sooner than they could have otherwise. Another advantage, of
course, was that since Hadoop was already open source, it was easier (although
far from easy!) to get permission from Yahoo!’s legal department to work in
open source. So they set up a 200-node cluster for the researchers in early 2006
and put the WebMap conversion plans on hold while they supported and improved
Hadoop for the research users.
Here’s a quick timeline of how things have progressed:
· 2004—Initial versions of what is now Hadoop Distributed Filesystem and
MapReduce implemented by Doug Cutting and Mike Cafarella.
· December 2005—Nutch ported to the new framework. Hadoop runs reliably on
20 nodes.
· January 2006—Doug Cutting joins Yahoo!.
· February 2006—Apache Hadoop project officially started to support the
standalone development of MapReduce and HDFS.
· February 2006—Adoption of Hadoop by Yahoo! Grid team.
· April 2006—Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
· May 2006—Yahoo! set up a Hadoop research cluster—300 nodes.
· May 2006—Sort benchmark run on 500 nodes in 42 hours (better hardware than
April benchmark).
· October 2006—Research cluster reaches 600 nodes.
· December 2006—Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in
3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
· January 2007—Research cluster reaches 900 nodes.
· April 2007—Research clusters—2 clusters of 1000 nodes.
· April 2008—Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes.