Friday, 25 April 2014

MapReduce

In the last blog we looked at HDFS; in this one we look at MapReduce in detail.
Here are some of the important terms to remember for MapReduce:
1) The client, which submits the MapReduce job.
2) The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
3) The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
4) The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.
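
To make the client's role concrete, here is a minimal driver sketch using the old Hadoop 1.x mapred API that the rest of this post describes. The class name WordCountDriver, the job name, and the input/output arguments are hypothetical; WordCountMapper and WordCountReducer are sketched after the step list further down.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver; run as: hadoop jar wordcount.jar WordCountDriver <input> <output>
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("word-count");

        // Input path to read and output directory to write (the output directory must not already exist).
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // User-defined map and reduce functions (sketched later in this post).
        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // runJob() creates a JobClient, calls submitJob() on it, and then
        // polls the jobtracker for progress until the job completes.
        JobClient.runJob(conf);
    }
}

The JobClient.runJob() call on the last line is exactly the entry point that step 1 of the walkthrough below starts from.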

The following figure shows how Hadoop runs a MapReduce job.
1) The runJob() method on JobClient is a convenience method that creates a new JobClient instance and calls submitJob() on it (step 1).
2) The job submission process implemented by JobClient's submitJob() method does the following:
  a) Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).
  b) Checks the output specification of the job (for example, the job is not submitted and an error is thrown if the output directory already exists).
  c) Computes the input splits for the job.
  d) Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID. The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, 10 by default) so that there are plenty of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3).
3) It then tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker) (step 4).
4) When the JobTracker receives a call to its submitJob() method, it puts the job into an internal queue from which the job scheduler will pick it up and initialize it. Initialization involves creating an object to represent the job being run, which encapsulates its tasks, and bookkeeping information to keep track of the tasks' status and progress (step 5).
5) To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the JobClient from the shared filesystem (step 6). It then creates one map task for each split.
6) Tasktrackers send periodic heartbeats to the jobtracker. As part of the heartbeat, a tasktracker indicates whether it is ready to run a new task; if it is, the jobtracker allocates it a task, which it communicates to the tasktracker in the heartbeat return value (step 7).
7) First, the tasktracker localizes the job JAR by copying it from the shared filesystem to the tasktracker's filesystem; it also copies any files needed by the application from the distributed cache to the local disk (step 8). Second, it creates a local working directory for the task and un-jars the contents of the JAR into this directory. Third, it creates an instance of TaskRunner to run the task.
8) TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10), so that any bugs in the user-defined map and reduce functions don't affect the tasktracker.
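
As a concrete example of the user-defined map and reduce functions that such a child JVM runs, here is a minimal word-count sketch in the same old mapred API, defining the hypothetical WordCountMapper and WordCountReducer referenced in the driver sketch earlier.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper: the child JVM calls map() once for every record of the task's input split.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);   // emit (word, 1) for each token
        }
    }
}

// Hypothetical reducer: receives all the counts emitted for one word and sums them.
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();   // add up the counts for this word
        }
        output.collect(key, new IntWritable(sum));
    }
}

Running these in a separate JVM is exactly the isolation point 8 above makes: a bug here can crash the child process without taking the tasktracker down with it.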

In the next blog we will see how to install Hadoop in a single-node or multi-node configuration on Ubuntu.
