Friday, 25 April 2014

MapReduce

In the last blog we looked at HDFS; in this one we look at MapReduce in detail.
Some important terms to remember for MapReduce:
1) The client, which submits the MapReduce job.
2) The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
3) The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
4) The distributed filesystem, which is used for sharing job files between the other entities.

The following figure shows how Hadoop runs a MapReduce job.
1) The runJob() method on JobClient is a convenience method that creates a new JobClient instance and calls submitJob() on it (step 1). (A short driver sketch showing this client-side call appears after this list.)
2) The job submission process implemented by JobClient's submitJob() method does the following:
  a) Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).
  b) Checks the output specification of the job.
  c) Computes the input splits for the job.
  d) Copies the resources needed to run the job, including the job JAR file, the configuration file and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID. The job JAR is copied with a high replication factor so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3).
3) Tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker) (step 4).
4) When the JobTracker receives a call to its submitJob() method, it puts the job into an internal queue from which the job scheduler will pick it up and initialize it. Initialization involves creating an object to represent the job being run, which encapsulates its tasks and the bookkeeping information needed to keep track of the tasks' status and progress (step 5).
5) To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the JobClient from the shared filesystem (step 6) and then creates one map task for each split.
6) Tasktrackers periodically send heartbeat calls to the jobtracker. As part of the heartbeat, a tasktracker indicates whether it is ready to run a new task, and if it is, the jobtracker allocates it a task, which it communicates to the tasktracker in the heartbeat return value (step 7).
7) The tasktracker localizes the job JAR by copying it from the shared filesystem to the tasktracker's filesystem (step 8). It also copies any files needed by the application from the distributed cache to the local disk. Second, it creates a local working directory for the task and un-jars the contents of the JAR into this directory. Third, it creates an instance of TaskRunner to run the task.
8) TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10), so that any bugs in the user-defined map and reduce functions don't affect the tasktracker.
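To make the client side of this flow concrete, here is a minimal driver sketch (my own illustration, not code from the original post) using the classic org.apache.hadoop.mapred API that the JobClient/JobTracker description above corresponds to. It plugs in the stock IdentityMapper and IdentityReducer so that it is self-contained; a real job would supply its own map and reduce classes, and the input and output paths are simply taken from the command line.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class SimpleJobDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SimpleJobDriver.class);
            conf.setJobName("simple identity job");

            // Stock mapper/reducer so the sketch compiles on its own;
            // replace these with your own map and reduce classes.
            conf.setMapperClass(IdentityMapper.class);
            conf.setReducerClass(IdentityReducer.class);

            // TextInputFormat (the default) produces LongWritable/Text pairs,
            // which the identity classes pass straight through.
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // runJob() creates a JobClient, calls submitJob() on it
            // (steps 1-4 above), and then polls the jobtracker for progress
            // until the job completes.
            JobClient.runJob(conf);
        }
    }

Running this with the hadoop command submits the job as described in steps 1-4 and then reports progress until the job finishes.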

In the next blog we will see how to install Hadoop in single-node and multi-node configurations on Ubuntu.

Wednesday, 23 April 2014

The Hadoop Distributed Filesystem

In the last blog we saw the journey of Hadoop. From this post onwards we will look at HDFS, MapReduce, and more.
  • The Hadoop Distributed Filesystem- When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems. Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.
  • The Design of HDFS- HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
    • Very Large Files- files whose sizes range from hundreds of megabytes to gigabytes, terabytes, or even petabytes.
    • Streaming Data Access- HDFS is built around a write-once, read-many-times pattern.
    • Commodity Hardware- Hadoop doesn't require expensive, highly reliable hardware to run on.  
  • HDFS Concepts -
    • Blocks - A disk has a block size, which is the minimum amount of data that it can read or write. Filesystem blocks are typically a few KB in size, while disk blocks are normally 512 bytes.  
    • HDFS Blocks- The block size is 64 MB by default. HDFS blocks are large compared to disk blocks in order to minimize the cost of seeks: with a large enough block, the time to transfer the data from the disk can be made significantly larger than the time to seek to the start of the block. For example, if the seek time is around 10 ms and the transfer rate is 100 MB/s, a block of roughly 100 MB keeps the seek overhead near 1% of the transfer time.
    • Namenode and Datanodes - An HDFS cluster has two types of nodes: a namenode (the master) and a number of datanodes (workers).
    • Namenode- It manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located.
    • Datanodes - They store and retrieve blocks, and they report back to the namenode periodically with lists of the blocks that they are storing.
  • Data Flow - To get an idea of how data flows during a read, consider a client interacting with HDFS, the namenode and the datanodes.
The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1). DistributedFileSystem calls the namenode, using RPC, to determine the locations of the blocks in the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block. The client then calls read() on the stream (step 3). DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, connects to the first datanode for the first block in the file. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream (step 4). When the end of the block is reached, DFSInputStream closes the connection to that datanode and then finds the best datanode for the next block (step 5).
Blocks are read in order, with DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream (step 6).
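This read path can be exercised from a small client program. Below is a minimal sketch (in the spirit of standard FileSystem API usage, not code from this post) that copies a file from HDFS to standard output; the hdfs:// URI in the comment is only an example.

    import java.io.InputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            String uri = args[0];  // e.g. hdfs://namenode/user/data/file.txt
            Configuration conf = new Configuration();

            // For an hdfs:// URI this returns a DistributedFileSystem instance.
            FileSystem fs = FileSystem.get(URI.create(uri), conf);

            InputStream in = null;
            try {
                // open() asks the namenode (over RPC) for the locations of the
                // first blocks of the file (steps 1 and 2).
                in = fs.open(new Path(uri));

                // read() calls on the returned stream pull data directly from
                // the datanodes, block by block (steps 3 to 5).
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                // Closing the stream ends the read (step 6).
                IOUtils.closeStream(in);
            }
        }
    }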

In the next blog we will look at the map and reduce concepts.

Monday, 21 April 2014

Hadoop Journey

History of Hadoop

Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine.

The Origin of the Name “Hadoop”

The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about:
The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Googol is a kid’s term. Subprojects and “contrib” modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name. For example, the jobtracker keeps track of MapReduce jobs.
Building a web search engine from scratch was an ambitious goal, for not only is the software required to crawl and index websites complex to write, but it is also a challenge to run without a dedicated operations team, since there are so many moving parts. It’s expensive, too: Mike Cafarella and Doug Cutting estimated a system supporting a 1-billion-page index would cost around half a million dollars in hardware, with a monthly running cost of $30,000. Nevertheless, they believed it was a worthy goal, as it would open up and ultimately democratize search engine algorithms.
Nutch was started in 2002, and a working crawler and search system quickly emerged. However, they realized that their architecture wouldn’t scale to the billions of pages on the Web. Help was at hand with the publication of a paper in 2003 that described the architecture of Google’s distributed filesystem, called GFS, which was being used in production at Google. GFS, or something like it, would solve their storage needs for the very large files generated as a part of the web crawl and indexing process. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. In 2004, they set about writing an open source implementation, the Nutch Distributed Filesystem (NDFS).
In 2004, Google published the paper that introduced MapReduce to the world. Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS.
NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale (see sidebar). This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.
In January 2008, Hadoop was made its own top-level project at Apache, confirming its success and its diverse, active community. By this time, Hadoop was being used by many other companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times.
In one well-publicized feat, the New York Times used Amazon’s EC2 compute cloud to crunch through four terabytes of scanned archives from the paper converting them to PDFs for the Web. The processing took less than 24 hours to run using 100 machines, and the project probably wouldn’t have been embarked on without the combination of Amazon’s pay-by-the-hour model (which allowed the NYT to access a large number of machines for a short period) and Hadoop’s easy-to-use parallel programming model.
In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data. Running on a 910-node cluster, Hadoop sorted one terabyte in 209 seconds (just under 3½ minutes), beating the previous year’s winner of 297 seconds. In November of the same year, Google reported that its MapReduce implementation sorted one terabyte in 68 seconds.

Hadoop at Yahoo!

Building Internet-scale search engines requires huge amounts of data and therefore large numbers of machines to process it. Yahoo! Search consists of four primary components: the Crawler, which downloads pages from web servers; the WebMap, which builds a graph of the known Web; the Indexer, which builds a reverse index to the best pages; and the Runtime, which answers users’ queries. The WebMap is a graph that consists of roughly 1 trillion (10^12) edges each representing a web link and 100 billion (10^11) nodes each representing distinct URLs. Creating and analyzing such a large graph requires a large number of computers running for many days. In early 2005, the infrastructure for the WebMap, named Dreadnaught, needed to be redesigned to scale up to more nodes. Dreadnaught had successfully scaled from 20 to 600 nodes, but required a complete redesign to scale out further. Dreadnaught is similar to MapReduce in many ways, but provides more flexibility and less structure. In particular, each fragment in a Dreadnaught job can send output to each of the fragments in the next stage of the job, but the sort was all done in library code. In practice, most of the WebMap phases were pairs that corresponded to MapReduce. Therefore, the WebMap applications would not require extensive refactoring to fit into MapReduce.
Eric Baldeschwieler (Eric14) created a small team and started designing and prototyping a new framework written in C++ modeled after GFS and MapReduce to replace Dreadnaught. Although the immediate need was for a new framework for WebMap, it was clear that standardization of the batch platform across Yahoo! Search was critical, and by making the framework general enough to support other users, they could better leverage the investment in the new platform.
At the same time, they were watching Hadoop, which was part of Nutch, and its progress. In January 2006, Yahoo! hired Doug Cutting, and a month later they decided to abandon their prototype and adopt Hadoop. The advantage of Hadoop over their prototype and design was that it was already working with a real application (Nutch) on 20 nodes. That allowed them to bring up a research cluster two months later and start helping real customers use the new framework much sooner than they could have otherwise. Another advantage, of course, was that since Hadoop was already open source, it was easier (although far from easy!) to get permission from Yahoo!’s legal department to work in open source. So they set up a 200-node cluster for the researchers in early 2006 and put the WebMap conversion plans on hold while they supported and improved Hadoop for the research users.
Here’s a quick timeline of how things have progressed:
·         2004—Initial versions of what is now Hadoop Distributed Filesystem and MapReduce implemented by Doug Cutting and Mike Cafarella.
·         December 2005—Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
·         January 2006—Doug Cutting joins Yahoo!.
·         February 2006—Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
·         February 2006—Adoption of Hadoop by Yahoo! Grid team.
·         April 2006—Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
·         May 2006—Yahoo! set up a Hadoop research cluster—300 nodes.
·         May 2006—Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark).
·         October 2006—Research cluster reaches 600 nodes.
·         December 2006—Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
·         January 2007—Research cluster reaches 900 nodes.
·         April 2007—Research clusters—2 clusters of 1000 nodes.
·         April 2008—Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes.

Sunday, 20 April 2014

Distributed File System



In the last blog we saw some of the important concepts of file systems. Our main focus now is on the distributed file systems that are available.
A number of distributed file systems are available; some of them are mentioned in the following list.

Definition - Distributed File System (DFS)

A distributed file system (DFS) is a file system with data stored on a server. The data is accessed and processed as if it were stored on the local client machine. The DFS makes it convenient to share information and files among users on a network in a controlled and authorized way. The server allows the client users to share files and store data just as if they were storing the information locally. However, the servers have full control over the data and determine the access granted to the clients.
Distributed file systems are also called network file systems. Many implementations have been made; they are location dependent and have access control lists (ACLs), unless otherwise stated below.
  • 9P, the Plan 9 from Bell Labs and Inferno distributed file system protocol. One implementation is v9fs. No ACLs.
  • Amazon S3
  • Andrew File System (AFS) is scalable and location independent, has a heavy client cache and uses Kerberos for authentication. Implementations include the original from IBM (earlier Transarc), Arla and OpenAFS.
  • Apple Filing Protocol (AFP) from Apple Inc. AFP may use Kerberos authentication.
  • DCE Distributed File System (DCE/DFS) from IBM (earlier Transarc) is similar to AFS and focuses on full POSIX file system semantics and high availability. Available for AIX and Solaris under a proprietary software license.
  • File Access Listener (FAL) is an implementation of the Data Access Protocol (DAP) which is part of the DECnet suite of network protocols created by Digital Equipment Corporation.
  • Microsoft Office Groove shared workspace, used for DoHyki
  • NetWare Core Protocol (NCP) from Novell is used in networks based on NetWare.
  • Network File System (NFS) originally from Sun Microsystems is the standard in UNIX-based networks. NFS may use Kerberos authentication and a client cache. (4.1 only)
  • OS4000 Linked-OS provides distributed filesystem across OS4000 systems.
  • Secure File System (SFS)
  • Self-certifying File System (SFS), a global network file system designed to securely allow access to file systems across separate administrative domains.
  • Server Message Block (SMB) originally from IBM (but the most common version is modified heavily by Microsoft) is the standard in Windows-based networks. SMB is also known as Common Internet File System (CIFS). SMB may use Kerberos authentication.





In the next session we will start with actual Hadoop.


RoadMap of FileSystem

In the last blog I mentioned Apache Hadoop YARN, but before moving to that latest concept we will recall some basics of file systems, because unless we understand the basics we can't move forward.



What is file system?
Any computer file is stored on some kind of storage with a given capacity. Each storage device is essentially a linear space for reading, or for both reading and writing, digital information. Each byte of information on the storage has its own offset from the start of the storage (its address) and is referenced by this address. The storage can be pictured as a grid with a set of numbered cells (each cell holding a single byte). Any file saved to the storage takes up a number of these cells.

Generally, computer storage uses a pair of sector number and in-sector offset to reference any byte of information on the storage. The sector is a group of bytes (usually 512 bytes) that is the minimum addressable unit of the physical storage. For example, byte 1030 on a hard disk falls in sector #3 with an in-sector offset of 6 bytes ([512 bytes] + [512 bytes] + [6 bytes]). This scheme is applied to optimize storage addressing and to use a smaller number to reference any portion of information on the storage.

To avoid the second part of the address (the in-sector offset), files are usually stored starting from a sector boundary and occupy whole sectors (e.g. a 10-byte file occupies one whole sector, a 512-byte file also occupies one whole sector, while a 514-byte file occupies two whole sectors).
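As a small illustration of the arithmetic above (my own sketch, assuming 512-byte sectors and sector numbering starting at 1, as in the earlier example), the following computes which sector a byte address falls into, its in-sector offset, and how many whole sectors files of various sizes occupy.

    public class SectorMath {
        static final long SECTOR_SIZE = 512;

        public static void main(String[] args) {
            long byteAddress = 1030;
            long sector = byteAddress / SECTOR_SIZE + 1;  // sectors 1 and 2 cover bytes 0-1023
            long offset = byteAddress % SECTOR_SIZE;      // remaining bytes inside that sector
            System.out.println("byte " + byteAddress + " -> sector " + sector + ", offset " + offset);
            // prints: byte 1030 -> sector 3, offset 6

            // Whole sectors occupied by files of various sizes (rounding up).
            for (long fileSize : new long[] {10, 512, 514}) {
                long sectors = (fileSize + SECTOR_SIZE - 1) / SECTOR_SIZE;
                System.out.println(fileSize + "-byte file occupies " + sectors + " sector(s)");
            }
            // prints 1, 1 and 2 sectors, matching the examples above.
        }
    }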

Each file is stored in 'unused' sectors and can then be read back using its known position and size. But how do we know which sectors are used and which are unused? Where are the file size and position stored? Where is the file name? The file system answers these questions.

As a whole, a file system is a structured data representation together with a set of metadata that describes the stored data. A file system can serve not only an entire storage device but also an isolated storage segment, a disk partition. Usually the file system operates on blocks rather than sectors. File system blocks are groups of sectors chosen to optimize storage addressing. Modern file systems generally use block sizes from 1 to 128 sectors (512-65536 bytes). Files are usually stored from the start of a block and take up entire blocks.

Intensive write/delete operations on a file system cause fragmentation. As a result, files are no longer stored as contiguous units and are divided into fragments. For example, suppose a storage is entirely filled with files of about 4 blocks each (e.g. a picture collection). The user wants to store a file that will take 8 blocks and therefore deletes the first file and the last file. By doing this he frees 8 blocks; however, the first freed segment is near the start of the storage and the second is near the end. In this case the 8-block file is split into two parts (4 blocks each) that fill the free-space 'holes'. The information about both fragments, which are parts of a single file, is stored in the file system.

In addition to user files the file system also stores its own parameters (such as block size), file descriptors (that include file size, file location, its fragments etc.), file names and directory hierarchy. It may also store security information, extended attributes and other parameters.

To meet diverse requirements for storage performance, stability and reliability, there exists a great variety of file systems, each developed to serve certain purposes.
Windows file systems
Microsoft Windows uses two major file systems: FAT, inherited from old DOS along with its later extension FAT32, and the widely used NTFS. The recently released ReFS file system was developed by Microsoft as a new-generation file system for Windows servers.

FAT (File Allocation Table):
The FAT file system is one of the simplest types of file system. It consists of a file system descriptor sector (the boot sector or superblock), the file system block allocation table (referred to as the File Allocation Table) and plain storage space for files and folders. Files on FAT are stored in directories. Each directory is an array of 32-byte records, each of which defines a file or extended file attributes (e.g. a long file name). A file record references the first block of the file; any subsequent block can be found through the block allocation table by using it as a linked list.

The block allocation table contains an array of block descriptors. A zero value indicates that the block is not used; a non-zero value is either a reference to the next block of the file or a special marker for the end of the file.
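As a rough sketch of how that linked list is followed (my own illustration, not a real FAT driver; it assumes the allocation table has already been loaded into an int array and uses the FAT32 end-of-chain marker range), a file's block list can be collected like this:

    import java.util.ArrayList;
    import java.util.List;

    public class FatChainWalker {
        // FAT32 entries from 0x0FFFFFF8 upward mark the end of a file's chain.
        private static final int END_OF_CHAIN = 0x0FFFFFF8;

        // Returns the block (cluster) numbers making up one file, starting from
        // the first block recorded in the file's 32-byte directory record.
        static List<Integer> followChain(int[] allocationTable, int firstBlock) {
            List<Integer> blocks = new ArrayList<>();
            int current = firstBlock;
            while (current != 0 && current < END_OF_CHAIN) {
                blocks.add(current);
                current = allocationTable[current];  // each entry points to the next block
            }
            return blocks;
        }

        public static void main(String[] args) {
            // Toy table: the file starts at block 2, continues at 3 and 5, then ends.
            int[] fat = new int[8];
            fat[2] = 3;
            fat[3] = 5;
            fat[5] = 0x0FFFFFFF;  // end-of-chain marker
            System.out.println(followChain(fat, 2));  // prints [2, 3, 5]
        }
    }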

The number in FAT12, FAT16 and FAT32 stands for the number of bits used to enumerate a file system block. This means that FAT12 may use up to 4096 different block references, FAT16 up to 65536 and FAT32 up to 4294967296. The actual maximum number of blocks is even smaller and depends on the implementation of the file system driver.

FAT12 was used on old floppy disks. FAT16 (or simply FAT) and FAT32 are widely used on flash memory cards and USB flash sticks and are supported by mobile phones, digital cameras and other portable devices.

FAT or FAT32 is used on Windows-compatible external storage or on disk partitions smaller than 2 GB (for FAT) or 32 GB (for FAT32). Windows cannot create a FAT32 file system larger than 32 GB (Linux, however, supports FAT32 up to 2 TB).

NTFS (New Technology File System):
NTFS was introduced in Windows NT and at present is the major file system for Windows. It is the default file system for disk partitions and the only file system supported for disk partitions over 32 GB. The file system is quite extensible and supports many file properties, including access control, encryption, etc. Each file on NTFS is stored as a file descriptor in the Master File Table plus the file content. The Master File Table contains all information about the file: size, allocation, name, etc. The first and the last sectors of the file system contain the file system settings (the boot record or superblock). This file system uses 48- and 64-bit values to reference files, thus supporting quite large disk storage.

ReFS (Resilient File System):
ReFS is the latest development from Microsoft, presently available for Windows 8 Server editions. Its architecture differs completely from other Windows file systems and is mainly organized in the form of a B+-tree. ReFS has high tolerance to failures, achieved through new features built into the system, most notably Copy-on-Write (CoW): no metadata is modified without being copied, and no data is written over existing data; writes go to new disk space instead. With any file modification, a new copy of the metadata is created in free storage space, and then the system creates a link from the older metadata to the newer metadata. As a result the system keeps a significant number of older backups in different places, which makes file recovery easy unless that storage space has been overwritten.

MacOS file systems
The Apple Mac OS operating system uses the HFS+ file system, an extension of its own HFS file system that was used on old Macintosh computers.

The HFS+ file system is used on Apple desktop products, including Mac computers, the iPhone and the iPod, as well as on Apple X Server products. Advanced server products also use the Apple Xsan file system, a clustered file system derived from the StorNext and CentraVision file systems.

Besides files and folders, this file system also stores Finder information such as directory views, window positions, etc.

Linux file systems
The open-source Linux OS has always aimed to implement, test and use different file system concepts. Among the huge number of file system types, the most popular Linux file systems nowadays are:
·         Ext2, Ext3, Ext4 - the 'native' Linux file systems, under active development and improvement. Ext3 is an extension of Ext2 that adds journaled (transactional) file write operations. Ext4 is a further development of Ext3, extended with support for optimized file allocation information (extents) and extended file attributes. These file systems are frequently used as the 'root' file system in most Linux installations.
·         ReiserFS - an alternative Linux file system designed to store huge numbers of small files. It has good file search capabilities and enables compact file allocation by storing file tails, or small files themselves, along with metadata, so that large file system blocks are not wasted on them.
·         XFS - a file system that originated at SGI, which initially used it for its IRIX servers. The XFS specifications are now implemented in Linux. XFS has great performance and is widely used to store files.
·         JFS - a file system developed by IBM for its powerful computing systems. JFS1 usually denotes the original edition and JFS2 the second edition. Currently this file system is open source and is implemented in most modern Linux distributions.
The concept of 'hard links' used in these operating systems makes most Linux file systems similar in that the file name is not regarded as a file attribute; rather, it is defined as an alias for a file in a certain directory. A file object can be linked from many locations, even many times from the same directory under different names. This is one of the reasons why recovery of file names after file deletion or file system damage can be difficult or even impossible.
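A quick way to see this aliasing behaviour in practice is to create a hard link through the standard java.nio.file API (a small sketch of my own; the paths are made up for illustration):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class HardLinkDemo {
        public static void main(String[] args) throws Exception {
            Path original = Paths.get("/tmp/report.txt");
            Path alias = Paths.get("/tmp/report-alias.txt");

            Files.write(original, "some data".getBytes());

            // Create a second name (hard link) for the same underlying file object.
            Files.createLink(alias, original);

            // Both names resolve to the same file object (inode), so this prints "true";
            // deleting one name leaves the data reachable through the other.
            System.out.println(Files.isSameFile(original, alias));
        }
    }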

BSD, Solaris, Unix file systems
The most common file system for these operating systems is UFS (Unix File System), also often referred to as FFS (Fast File System, 'fast' compared to the previous file system used for Unix). UFS is a source of ideas for many other file system implementations.

Currently UFS (in different editions) is supported by all Unix-family operating systems and is the major file system of the BSD OSes and Sun Solaris. Modern computer technologies tend to implement replacements for UFS in different operating systems (ZFS for Solaris, JFS and derived file systems for Unix, etc.).

Clustered file systems
Clustered file systems are used in computer cluster systems. These file systems have built-in support for distributed storage.

Among such distributed file systems are:
·         ZFS - Sun company's 'Zettabyte File System', the new file system developed for the distributed storage of Sun Solaris OS.
·         Apple Xsan - the Apple company evolution of CentraVision and later StorNext file systems.
·         VMFS - 'Virtual Machine File System' developed by VMware company for its VMware ESX Server.
·         GFS - Red Hat Linux 'Global File System'.
·         JFS1 - original (legacy) design of IBM JFS file system used in older AIX storage systems.
A common property of these file systems is support for distributed storage, extensibility and modularity.