Learning YARN

The Hadoop-YARN multi-node installation

Installing a multi-node Hadoop-YARN cluster is similar to a single node installation. You configure the master node just as you did for the single node installation, copy the Hadoop installation directory to all the slave nodes, and set the Hadoop environment variables on the slave nodes. You can then start the Hadoop daemons either directly from the master node, or log in to each node and run its services.

Prerequisites

Before starting with the installation steps, make sure that you prepare all the nodes as specified here:

  • All the nodes in the cluster have a unique hostname and IP address. Each node should be able to resolve the hostname of every other node. If you are not using a DNS server, you need to make sure that the /etc/hosts file on every node contains entries for all the nodes in the cluster. The entries will look similar to the following:
    192.168.56.101 master
    192.168.56.102 slave1
    192.168.56.103 slave2
    192.168.56.104 slave3
    
  • Passwordless SSH is configured from the master to all the slave nodes in the cluster. To ensure this, execute the following command on the master for each slave node (a short setup sketch follows this list):
    ssh-copy-id <SlaveHostName>
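
If passwordless SSH has not been set up yet, a minimal sketch for the master node follows (using the slave hostnames from the /etc/hosts example above):

# Generate an RSA key pair on the master (accept the defaults, empty passphrase)
ssh-keygen -t rsa

# Copy the public key to every slave; you are prompted for each password once
for node in slave1 slave2 slave3; do ssh-copy-id $node; done

# Verify: this should print the hostname without asking for a password
ssh slave1 hostname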
    

Installation steps

After preparing your nodes as per the Hadoop multi-node cluster installation, you need to follow a simple six-step process to install and run Hadoop on your Linux machine. To better understand the process, you can refer to the following diagram:

(Diagram: Installation steps)

Step 1 – Configure the master node as a single-node Hadoop-YARN installation

You need to follow the first three steps mentioned in the installation steps for the Hadoop-YARN single node installation. The main difference while configuring the node for the multi-node cluster is the usage of the master node's hostname instead of a loopback hostname (localhost). Assuming that the hostname of the master node is master, you need to replace localhost with master in the core-site.xml and yarn-site.xml configuration files. The properties in these files will look as follows:

  • core-site.xml:
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:8020</value>
        <final>true</final>
    </property>
  • yarn-site.xml:
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8032</value>
    </property>
    
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
    </property>
    
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8031</value>
    </property>
    
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>master:8033</value>
    </property>
    
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>master:8088</value>
    </property>
    
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>

You also need to modify the slaves file. As mentioned earlier, it contains the list of all the slave nodes. Add the hostname of each slave node to the slaves file. The content of the file will look as follows:

slave1
slave2
slave3

Step 2 – Copy the Hadoop folder to all the slave nodes

After configuring your master node, you need to copy the HADOOP_PREFIX directory to all the slave nodes. The location of the Hadoop directory and the Hadoop configuration files must be the same on all nodes as on the master. You can use the scp command to securely copy files from the master to all the slaves:

for node in `cat <path_for_slaves_file in hadoop_conf_directory>`; do scp -r <hadoop_dir> $node:<parent_directory of hadoop_dir>; done

Tip

After replacing the placeholders in the preceding command with valid paths, the command will look as follows:

for node in `cat /home/hduser/hadoop-2.5.1/etc/hadoop/slaves`; do scp -r /home/hduser/hadoop-2.5.1 $node:/home/hduser; done

If you are using a system directory as the Hadoop directory (a directory that requires sudo for any write operation, for example, /opt), then you will have to use the rsync utility to copy the Hadoop folder to all the slave nodes. This requires NOPASSWD: ALL to be enabled for the user on the slave machines, which ensures that the user is not prompted for a password while running sudo. You can refer to the blog at http://www.ducea.com/2006/06/18/linux-tips-password-usage-in-sudo-passwd-nopasswd/ for details:

for node in `cat <path_for_slaves_file in hadoop_conf_directory>`; do sudo rsync --rsync-path="sudo rsync" -r <hadoop_dir> $node:<parent_directory of hadoop_dir>; done

Step 3 – Configure environment variables on slave nodes

Similar to configuring the Hadoop environment variables on the master node, you need to configure the environment variables on all the slave nodes. Log in to each slave node, edit the /home/hduser/.bashrc file, and reload it using the source command. You can also refer to step 2, under the installation steps for the Hadoop-YARN single node installation.
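
As a minimal sketch, assuming Hadoop is installed under /home/hduser/hadoop-2.5.1 on every node, the lines to append to /home/hduser/.bashrc on each slave might look like this (the exact variable names should match the ones you set on the master):

# Hadoop environment variables (append to /home/hduser/.bashrc on each slave)
export HADOOP_PREFIX=/home/hduser/hadoop-2.5.1
export PATH=$PATH:$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin

# Reload the file so the changes take effect in the current shell
source /home/hduser/.bashrc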

Step 4 – Format NameNode

This step is the same as the one you followed for the single node installation. You need to log in to the master node and execute the NameNode format command. For more details, you can refer to step 4, under the installation steps for the Hadoop-YARN single node installation.
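
For reference, the format command in Hadoop 2.x is run on the master node as follows:

# Run on the master node only; reformatting erases any existing HDFS metadata
hdfs namenode -format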

Step 5 – Start Hadoop daemons

The configuration of the Hadoop-YARN multi-node cluster is now finished, and you need to start the Hadoop-YARN daemons. Log in to the master node and run the master daemons (NameNode and ResourceManager) using the following scripts:
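A minimal sketch using the standard per-daemon scripts shipped under $HADOOP_PREFIX/sbin in Hadoop 2.x (assuming the sbin directory is on your PATH):

# Run on the master node
hadoop-daemon.sh start namenode
yarn-daemon.sh start resourcemanager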

Then log in to each slave node and execute the following scripts to start the DataNode and NodeManager daemons:
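
Again assuming the sbin directory is on the PATH of each slave:

# Run on every slave node
hadoop-daemon.sh start datanode
yarn-daemon.sh start nodemanager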

If you are configuring a large cluster, executing the scripts on every slave node is time-consuming. To help cluster administrators, Hadoop provides scripts to start or stop all the Hadoop daemons from the master node. You need to log in to the master node and execute the following scripts to start or stop the HDFS and YARN daemons, respectively:
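
These wrapper scripts also live under $HADOOP_PREFIX/sbin and use the slaves file to reach every node over SSH:

# Run on the master node
start-dfs.sh    # starts the NameNode, the SecondaryNameNode, and all the DataNodes
start-yarn.sh   # starts the ResourceManager and all the NodeManagers
stop-dfs.sh     # stops the HDFS daemons
stop-yarn.sh    # stops the YARN daemons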

You can find scripts such as start-all.sh and stop-all.sh, but the usage of these scripts is deprecated in the latest versions of Hadoop.

Step 6 – Verify the installation

Execute the jps command on each node and ensure that all the Hadoop daemons are running. You can also verify the status of your cluster through the web interfaces of the HDFS NameNode and the YARN ResourceManager.
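
As a quick check, assuming the configuration above, the following daemons should appear in the jps output (the process IDs will differ on your machines):

# On the master node
jps    # expect NameNode and ResourceManager

# On each slave node
jps    # expect DataNode and NodeManager

By default, the NameNode web interface is served on port 50070 of the master (http://master:50070), and the ResourceManager web interface is served on the yarn.resourcemanager.webapp.address configured earlier (http://master:8088).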

To test your cluster, you can refer to the previous topic; the steps to test the multi-node cluster are exactly the same as those for the single node cluster.