Wednesday, 23 April 2014

The Hadoop Distributed Filesystem

In the last blog we traced the journey of Hadoop. From this post onwards we will look at HDFS, MapReduce, and more.
  • The Hadoop Distributed Filesystem - When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage storage across a network of machines are called distributed filesystems. Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.
  • The Design of HDFS - HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
    • Very Large Files - files that are megabytes, gigabytes, terabytes, or even petabytes in size.
    • Streaming Data Access - HDFS is built around a write-once, read-many-times pattern.
    • Commodity Hardware - Hadoop doesn't require expensive, highly reliable hardware to run on.
  • HDFS Concepts -
    • Blocks - A disk has a block size, which is the minimum amount of data that it can read or write. Filesystem blocks are typically a few KB in size, while disk blocks are normally 512 bytes.  
    • HDFS Blocks - The block size is 64 MB by default. HDFS blocks are large compared to disk blocks in order to minimize the cost of seeks: if the block is large enough, the time to transfer the data from the disk can be made significantly larger than the time to seek to the start of the block.
    • Namenode and Datanodes - An HDFS cluster has two types of nodes: a namenode (the master) and a number of datanodes (workers).
    • Namenode - It manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located.
    • Datanodes - They store and retrieve blocks, and they report back to the namenode periodically with lists of the blocks that they are storing.
  • Data Flow - To get an idea of how data flows between a client interacting with HDFS, the namenode, and the datanodes, consider what happens when a client reads a file.
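The reasoning behind the large HDFS block size can be checked with some quick arithmetic. The figures below (a 10 ms seek time and a 100 MB/s transfer rate) are illustrative assumptions, not measurements from this post:

```python
# Why HDFS blocks are large: amortizing seek time over transfer time.
# Assumed illustrative figures: 10 ms average seek, 100 MB/s sequential read.
SEEK_TIME_S = 0.010          # time to position the disk head (assumption)
TRANSFER_RATE_MBPS = 100.0   # sustained transfer rate in MB/s (assumption)

def seek_overhead(block_size_mb):
    """Fraction of the total read time for one block that is spent seeking."""
    transfer_time = block_size_mb / TRANSFER_RATE_MBPS
    return SEEK_TIME_S / (SEEK_TIME_S + transfer_time)

print(f"64 MB block: {seek_overhead(64):.2%} seek overhead")
print(f"4 KB block:  {seek_overhead(4 / 1024):.2%} seek overhead")
```

With these numbers a 64 MB block spends well under 2% of its read time seeking, while a 4 KB filesystem-sized block spends almost all of it seeking, which is why HDFS trades small blocks for fast sequential transfer.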
The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1). DistributedFileSystem calls the namenode, using RPC, to determine the locations of the blocks in the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block. The client then calls read() on the stream (step 3). DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, connects to the first datanode for the first block in the file. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream (step 4). When the end of the block is reached, DFSInputStream closes the connection to the datanode and then finds the best datanode for the next block (step 5).
Blocks are read in order, with DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream (step 6).
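The way a read moves from block to block can be sketched with some simple offset arithmetic. This is a toy model of the bookkeeping, not Hadoop's actual DFSInputStream implementation; the function names here are hypothetical:

```python
# Toy model of how byte offsets in a file map onto HDFS blocks.
# Illustrative sketch only -- not Hadoop's real implementation.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default HDFS block size

def block_for_offset(offset):
    """Index of the block containing the given byte offset in the file."""
    return offset // BLOCK_SIZE

def blocks_for_file(file_size):
    """Number of HDFS blocks needed to store a file of the given size."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division

# A 200 MB file needs 4 blocks: 3 full 64 MB blocks plus one partial block.
print(blocks_for_file(200 * 1024 * 1024))
```

Reading byte 0 and byte 64 MB - 1 hits the same block, while byte 64 MB lands in the next one; that boundary is exactly where the stream closes one datanode connection and opens the next.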

In the next blog we will look at the map and reduce concepts.
