
The Hadoop-YARN multi-node installation
Installing a multi-node Hadoop-YARN cluster is similar to a single node installation. You need to configure the master node in the same way as you did during the single node installation. Then, copy the Hadoop installation directory to all the slave nodes and set the Hadoop environment variables on the slave nodes. You can start the Hadoop daemons either directly from the master node, or you can log in to each node to run its respective services.
Prerequisites
Before starting with the installation steps, make sure that you prepare all the nodes as specified here:
- All the nodes in the cluster have a unique hostname and IP address. Each node should be able to identify all the other nodes through the hostname. If you are not using a DNS server, you need to make sure that the /etc/hosts file contains the resolution for all the nodes used in the cluster. The entries will look similar to the following:
192.168.56.101 master
192.168.56.102 slave1
192.168.56.103 slave2
192.168.56.104 slave3
- Passwordless SSH is configured from the master to all the slave nodes in the cluster. To ensure this, execute the following command on the master for all the slave nodes:
ssh-copy-id <SlaveHostName>
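As a minimal sketch, assuming the slave hostnames from the /etc/hosts example above and that the master does not yet have an SSH key pair, the key can be generated once and copied to every slave in a loop:
# Generate an RSA key pair on the master (accept the defaults and leave the passphrase empty)
ssh-keygen -t rsa
# Copy the public key to each slave; you will be prompted for each slave's password once
for node in slave1 slave2 slave3; do ssh-copy-id $node; done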
Installation steps
After preparing your nodes as described in the prerequisites, you need to follow a simple six-step process to install and run Hadoop on your Linux machines. To better understand the process, you can refer to the following diagram:

You need to follow the first three steps mentioned in the installation steps for the Hadoop-YARN single node installation. The main difference while configuring the nodes for the multi-node cluster is the use of the master node's hostname instead of the loopback hostname (localhost). Assuming that the hostname of the master node is master, you need to replace localhost with master in the core-site.xml and yarn-site.xml configuration files. The properties in these files will look as follows:
core-site.xml:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:8020</value>
  <final>true</final>
</property>
yarn-site.xml:
<property>
  <name>yarn.resourcemanager.address</name>
  <value>master:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>master:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>master:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>master:8033</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>master:8088</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
You also need to modify the slaves file. As mentioned earlier, it contains the list of all the slave nodes. You need to add the hostname of each slave node to the slaves file. The content of the file will look as follows:
slave1
slave2
slave3
After configuring your master node, you need to copy the HADOOP_PREFIX directory to all the slave nodes. The location of the Hadoop directory and the Hadoop configuration files should be in sync with the master node. You can use the scp command to securely copy files from the master to all the slaves:
for node in `cat <path_for_slaves_file in hadoop_conf_directory>`; do scp -r <hadoop_dir> $node:<parent_directory of hadoop_dir>; done
If you are using a system directory as the Hadoop directory (a directory that requires sudo for any write operation, for example, /opt), then you will have to use the rsync utility to copy the Hadoop folder to all the slave nodes. This requires NOPASSWD: ALL to be enabled for the user on the slave machines so that the user is not prompted for a password while running sudo; you can refer to the blog at http://www.ducea.com/2006/06/18/linux-tips-password-usage-in-sudo-passwd-nopasswd/ for details:
for node in `cat <path_for_slaves_file in hadoop_conf_directory>`; do sudo rsync --rsync-path="sudo rsync" -r <hadoop_dir> $node:<parent_directory of hadoop_dir>; done
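As a hedged sketch of the NOPASSWD setting described in that blog post, and assuming the Hadoop user is named hduser, an entry like the following is added to the sudoers file on each slave (edit it with visudo):
# /etc/sudoers entry allowing hduser to run sudo without a password prompt
hduser ALL=(ALL) NOPASSWD: ALL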
Similar to configuring the Hadoop environment variables on the master node, you need to configure the environment variables on all the slave nodes. You need to log in to each slave node, edit the /home/hduser/.bashrc file, and reload it using the source command. You can also refer to step 2, under the installation steps for the Hadoop-YARN single node installation.
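For reference, a minimal sketch of the entries to append to /home/hduser/.bashrc; the installation path /home/hduser/hadoop is an assumed example, so adjust it to the directory you copied to the slave:
# Hadoop environment variables (the installation path below is an assumed example)
export HADOOP_PREFIX=/home/hduser/hadoop
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export PATH=$PATH:$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin
After saving the file, run source /home/hduser/.bashrc so that the variables take effect in the current session.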
This step is the same as the one you followed for the single node installation. You need to log in to the master node and execute the HDFS format command. For more details, you can refer to step 4, under the installation steps for the Hadoop-YARN single node installation.
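For reference, the format command is the standard HDFS one and typically looks as follows; run it only once, and only on the master node, as it erases any existing NameNode metadata:
$HADOOP_PREFIX/bin/hdfs namenode -format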
The configuration for the Hadoop-YARN multi-node cluster is now finished. Now you need to start the Hadoop-YARN daemons. Log in to the master node and run the master daemons (NameNode and ResourceManager) using the following scripts:
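In a Hadoop 2.x installation, these are typically the per-daemon scripts shipped under $HADOOP_PREFIX/sbin (the path assumes the environment variables set earlier):
# On the master node: start the NameNode and the ResourceManager
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode
$HADOOP_PREFIX/sbin/yarn-daemon.sh start resourcemanager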

Log in to each slave node and execute the following scripts to start the DataNode and NodeManager daemons:
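Again assuming the Hadoop 2.x per-daemon scripts under $HADOOP_PREFIX/sbin, the slave-side commands would look as follows:
# On each slave node: start the DataNode and the NodeManager
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start datanode
$HADOOP_PREFIX/sbin/yarn-daemon.sh start nodemanager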

If you are configuring a large cluster, executing the scripts on all the slave nodes is time-consuming. To help cluster administrators, Hadoop provides scripts to start or stop all the Hadoop daemons from the master node. You need to log in to the master node and execute the following scripts to start or stop the HDFS and YARN daemons, respectively:
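In Hadoop 2.x, these cluster-wide scripts also live under $HADOOP_PREFIX/sbin and read the slaves file to reach every node; a typical invocation from the master node is:
# Start the HDFS and YARN daemons across the cluster
$HADOOP_PREFIX/sbin/start-dfs.sh
$HADOOP_PREFIX/sbin/start-yarn.sh
# Stop them again when required
$HADOOP_PREFIX/sbin/stop-dfs.sh
$HADOOP_PREFIX/sbin/stop-yarn.sh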

You will also find scripts such as start-all.sh and stop-all.sh, but the usage of these scripts is deprecated in the latest versions of Hadoop.
Execute the jps command on each node and ensure that all the Hadoop daemons are running. You can also verify the status of your cluster through the web interfaces of the HDFS-NameNode and the YARN-ResourceManager.
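On a default Hadoop 2.x installation, these web interfaces are usually reachable at http://master:50070 for the HDFS-NameNode and at http://master:8088 for the YARN-ResourceManager; the latter port matches the yarn.resourcemanager.webapp.address property set earlier, while 50070 is the assumed NameNode default and may differ if you changed it.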

To test your cluster, you can refer to the previous topic, as the steps to test the multi-node cluster are exactly the same as those for the single node cluster.