- Mastering Hadoop 3
- Chanchal Singh, Manish Kumar
YARN and MapReduce
We have covered enough information about YARN in previous chapters. In this section, we will talk about the execution of MapReduce over YARN. The JobTracker in Hadoop version 1 has a bottleneck due to a scalability limit of around 4,000 nodes. Yahoo realized that its workloads required scaling to 20,000 nodes, which was not possible with the legacy JobTracker architecture. Yahoo therefore introduced YARN, which split the JobTracker's responsibilities apart for more efficient management. We covered the detailed architecture in Chapter 3, YARN Resource Management in Hadoop.
Each node manager in YARN has enough memory to launch multiple containers. The application master can request any number of containers from the resource manager, which keeps track of the available resources in the YARN cluster. The job type is not limited to MapReduce; YARN can launch any type of application. Let's take a look at the life cycle of a MapReduce application on YARN in the following diagram:

The preceding diagram can be explained as follows:
- The MapReduce job client requests a new application ID from the application manager. The resource manager sends the unique application ID to the client after validating the authentication and authorization of the client. The MR job client encloses the metadata about the application in an ApplicationSubmissionContext, which also contains the information needed to start the application master.
- The resource manager starts the application master on one of the node managers that can fulfill the container requirements of the application master. The resource manager's scheduler selects the node manager on which the application master will be launched.
- The application master creates client objects to communicate with the resource manager and node managers. The application master registers itself with the resource manager, and the latter responds with information such as access tokens, access control lists (ACLs), and so on.
- The MR job client queries the application manager to obtain information about the application master, and can then talk directly to the application master for status, counters, and any other information.
- The application master computes the number of input splits and sends a resource request for mappers and reducers to the resource manager's scheduler. The request specifies the memory and CPU required for each container.
- The application master receives containers for the map and reduce tasks and then communicates with the specific node managers to launch them. A node manager in YARN can launch multiple containers on the same node.
- The application master also manages and monitors the individual map and reduce tasks, and requests additional containers from the resource manager if needed. It also ensures that if any task fails or stops responding, it is restarted with new resources until the maximum number of retry attempts is reached.
- The application master runs a task cleanup operation after all the map and reduce tasks are completed. Finally, the application master sends an unregister request to the resource manager, exits, and frees up the containers it occupied.
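The control flow above can be sketched as a toy simulation. This is not the Hadoop API: every class, function, and constant here is hypothetical and exists only to illustrate the split computation, container-per-task launch, and retry-until-max-attempts behavior the steps describe (the default of four task attempts mirrors the `mapreduce.map.maxattempts` setting):

```python
# Toy simulation of the ApplicationMaster lifecycle described above.
# All names are illustrative; none of this is the real YARN/MapReduce API.

MAX_ATTEMPTS = 4  # analogous to the mapreduce.map.maxattempts default

def compute_splits(input_size_bytes, split_size_bytes=128 * 1024 * 1024):
    """One map task per input split (split size ~= HDFS block size)."""
    return max(1, -(-input_size_bytes // split_size_bytes))  # ceiling division

_attempt_counts = {}

def run_task(task_id, fail_first_n=0):
    """Simulate a task attempt that fails its first `fail_first_n` tries."""
    _attempt_counts[task_id] = _attempt_counts.get(task_id, 0) + 1
    return _attempt_counts[task_id] > fail_first_n

def application_master(input_size_bytes, num_reducers, flaky_task=None):
    """Compute splits, 'launch' one container per task, retry failed tasks."""
    n_maps = compute_splits(input_size_bytes)
    task_ids = [f"map_{i}" for i in range(n_maps)]
    task_ids += [f"reduce_{i}" for i in range(num_reducers)]
    completed = []
    for task_id in task_ids:
        fail_n = 2 if task_id == flaky_task else 0
        for attempt in range(1, MAX_ATTEMPTS + 1):
            if run_task(task_id, fail_n):  # container runs the task attempt
                completed.append(task_id)
                break
        else:  # maximum retries exhausted: the whole job fails
            raise RuntimeError(f"{task_id} exceeded {MAX_ATTEMPTS} attempts")
    return completed  # AM would now clean up, unregister, and exit

# 300 MB of input -> 3 map tasks; one flaky map succeeds on its 3rd attempt.
done = application_master(300 * 1024 * 1024, num_reducers=1,
                          flaky_task="map_1")
print(done)  # ['map_0', 'map_1', 'map_2', 'reduce_0']
```

In the real system, of course, the "launches" are container allocations negotiated with the resource manager's scheduler and executed by node managers, and the retry bookkeeping is kept by the application master per task attempt.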