
Starting with the basics
The Apache Hadoop 2.x version consists of three key components:
- Hadoop Distributed File System (HDFS)
- Yet Another Resource Negotiator (YARN)
- The MapReduce API (Job execution, MRApplicationMaster, JobHistoryServer, and so on)
There are two master processes that manage the Hadoop 2.x cluster—the NameNode and the ResourceManager. All the slave nodes in the cluster have DataNode and NodeManager processes running as the worker daemons for the cluster. The NameNode and DataNode daemons are part of HDFS, whereas the ResourceManager and NodeManager belong to YARN.
When we configure Hadoop-YARN on a single node, we need to have all four processes running on the same system. A single-node Hadoop installation is generally used for learning purposes. If you are a beginner and want to understand the Hadoop-YARN concepts, you can use a single-node Hadoop-YARN cluster.
In a production environment, a multi-node cluster is used. It is recommended to have separate nodes for the NameNode and ResourceManager daemons. As the number of slave nodes in the cluster increases, so do the memory, processor, and network requirements of the master nodes. The following diagram shows a high-level view of the Hadoop-YARN processes running on a multi-node cluster.

Supported platforms
To install a Hadoop-YARN cluster, you can use either a GNU/Linux- or a Windows-based operating system. The steps to configure and use the Hadoop-YARN cluster differ between these operating systems. It is recommended to use GNU/Linux for your cluster installations. Apache Hadoop is an open source framework and is widely used on open source platforms such as Ubuntu and CentOS, for which support documents and blogs are easily available. Some companies use an enterprise version of Linux such as RHEL (Red Hat Enterprise Linux).
In this chapter, we'll be using a 64-bit Ubuntu Desktop (version 14.04) for deployment of the Hadoop-YARN cluster. You can download an ISO image for Ubuntu Desktop from its official website (http://www.ubuntu.com/download).
Hardware requirements
The following section covers the recommended hardware configuration to run Apache Hadoop.
For learning purposes, the nodes in the cluster must have the following:
- 1.5 to 2 GB of RAM
- 15 to 20 GB of free hard disk space
If you don't have physical machines, you can use a tool such as Oracle VirtualBox to host virtual machines on your host system. To know more about Oracle VirtualBox and how to install virtual machines using it, you can refer to the blog at http://www.protechskills.com/big-data/hadoop/administration/create-new-vm-using-oracle-virtualbox.
To select nodes for a production environment, you can refer to a blog on the Hortonworks website at http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/.
Software requirements
Hadoop is an open source framework that requires:
- Java already installed on all the nodes
- A passwordless SSH from the master node to the slave nodes
The steps to install Java and configure a passwordless SSH are covered later in this chapter.
Basic Linux commands / utilities
Before moving forward, it is important to understand the usage of the following Linux commands:
- sudo
- nano editor
- source
- jps
- netstat
- man
The official documentation for the Hadoop cluster installation is based on the Linux platform and, as mentioned in the previous section, Linux is preferred over Windows. This section gives readers with minimal knowledge of Linux an overview and the usage of these commands / utilities, so that they can deploy a Hadoop-YARN cluster with ease. Readers new to Linux should have a basic knowledge of these commands before moving on to the cluster installation steps.
In Linux, the sudo command allows a user to execute a command as the superuser, in other words, as an administrator on a Windows system. The /etc/sudoers file contains the list of users who have the sudo permission. If you need to change any system property or access any system file, you need to add sudo at the beginning of the command. To read more about the sudo command, you can refer to the blog at http://www.tutorialspoint.com/unix_commands/sudo.htm.
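For example, the /etc/hosts file (edited later in this chapter) is a system file, so opening it for editing requires the sudo prefix:
sudo nano /etc/hosts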
Nano is one of the editor tools available for Linux. Its ease of use and simplicity allow beginners to handle files easily. To read more about the nano editor, you can refer to the documentation at http://www.nano-editor.org/dist/v2.0/nano.html. An alternative to the nano editor is the default vi editor.
When you edit any of the environment setting files, such as /etc/environment or ~/.bashrc, you need to reload the file to apply the changes without restarting the system; the source command does this by executing the file's contents in the current shell session. To read more about the source command, you can refer to the blog at http://bash.cyberciti.biz/guide/Source_command.
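For example, after adding an environment variable to ~/.bashrc, you can apply the change to the current session as follows:
source ~/.bashrc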
jps is a Java command used to list the Java processes running on a system. The output of the command contains the process ID and process name for each Java process. Before using the jps command, you need to make sure that the bin directory of your JAVA_HOME installation is included in the PATH variable for the user.
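For example, assuming Java is installed under /usr/lib/jvm/jdk1.8.0 (the installation directory used later in this chapter), you could add the following lines to the user's ~/.bashrc file:
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0
export PATH=$PATH:$JAVA_HOME/bin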
A sample output is as follows:
hduser@host:~$ jps
6690 Jps
2071 ResourceManager
2471 NameNode
To read more about the jps command, you can refer to the Oracle documentation at http://docs.oracle.com/javase/7/docs/technotes/tools/share/jps.html.
The netstat command is a utility that lists the active ports on a system, covering both TCP and UDP connections. This command is helpful for finding the ports being used by a process. To read more about the netstat command and its options, you can refer to the blog at http://www.c-jump.com/CIS24/Slides/Networking/html_utils/netstat.html.
You can use the netstat command with the grep command to get filtered results for a particular process:
netstat -nlp | grep <PID>
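For instance, using the sample jps output shown earlier, where the NameNode was running with process ID 2471, you could list the ports held by that process:
netstat -nlp | grep 2471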
Most Linux commands have documentation and user manuals, also known as man pages. The man command is used to format and view these man pages through the command-line interface. The basic syntax of the man command is as follows:
- Syntax:
man [option(s)] keyword(s)
- Example:
man ls
To read more about the man command, you can refer to the Wikipedia page at http://en.wikipedia.org/wiki/Man_page.
Preparing a node for a Hadoop-YARN cluster
Before using a machine as a Hadoop node in a cluster, there are a few prerequisites that need to be configured.

As mentioned in the software requirements for a cluster, all the nodes across the cluster must have Sun Java 1.6 or above and the SSH service installed. The Java version and JAVA_HOME must be consistent across all the nodes. If you want to read more about Java compatibility with Hadoop, you can browse the wiki page at http://wiki.apache.org/hadoop/HadoopJavaVersions.
To install and configure Java on Ubuntu, you can refer to the blog at http://www.protechskills.com/linux/unix-commands/install-java-in-linux.
- To verify if Java is installed, you can execute the following command:
java -version
- The Java version will be displayed on the console. The output of the command will look like this:
java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)
- To verify that the environment variable for Java is configured properly, execute the following command:
echo $JAVA_HOME
- The installation directory for Java will be displayed on the console. The output of the command will look like this:
/usr/lib/jvm/jdk1.8.0/
In a Hadoop cluster installation, the Hadoop daemons run on multiple systems; the slave nodes run the DataNode and NodeManager services. All the nodes in a cluster must have a common user and group. It is recommended to create a dedicated user for the Hadoop cluster on all the nodes of your cluster.
To create a user on Ubuntu, you can refer to the Ubuntu documentation at http://manpages.ubuntu.com/manpages/jaunty/man8/useradd.8.html.
Here is a sample command to create a new user hduser on Ubuntu:
sudo useradd -m hduser
After creating a new user, you also need to set a password for it. Execute the following command to set a password for the newly created user hduser:
sudo passwd hduser
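As noted earlier, the nodes must also share a common group. A minimal sketch of creating a dedicated group and adding the user to it (the group name hadoop here is only an example):
sudo groupadd hadoop
sudo usermod -aG hadoop hduser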
Hadoop daemons use a few ports for internal and client communication. The cluster administrator has an option to either disable the firewall of the nodes or allow traffic on the ports required by Hadoop. Hadoop has a list of default ports, but you can configure them as per your need.
To disable the firewall in Ubuntu, execute the following command:
sudo ufw disable
Note
Here are some useful links for the ports to be used and firewall options available:
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.2.0/bk_reference/content/reference_chap2_1.html
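If you prefer not to disable the firewall entirely, you can allow traffic only on the required ports instead. As a sketch, assuming the Hadoop 2.x default web UI ports, 50070 for the NameNode and 8088 for the ResourceManager:
sudo ufw allow 50070/tcp
sudo ufw allow 8088/tcp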
A Hadoop node is identified through its hostname. All the nodes in the cluster must have a unique hostname and IP address. Each Hadoop node should be able to resolve the hostname of the other nodes in the cluster.
If you are not using a DHCP server to manage your DNS hostname resolution, you need to configure the /etc/hosts file on all the nodes. The /etc/hosts file of a system contains the IP addresses of the nodes mapped to their hostnames. You need to add an IP address and hostname mapping for each node to the file; you can use the nano or vi editor with the sudo option to edit the file contents. Assuming the hostnames of your nodes are master, slave1, slave2, and so on, the contents of the file will look similar to the following:
192.168.56.100 master
192.168.56.101 slave1
192.168.56.102 slave2
192.168.56.103 slave3
To view the system hostname, you can execute the hostname command:
hostname
To modify the system hostname in Linux, you can refer to the blogs here:
CentOS: https://www.centosblog.com/how-to-change-hostname-on-centos-linux/
Ubuntu: http://askubuntu.com/questions/87665/how-do-i-change-the-hostname-without-a-restart
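As a sketch of the Ubuntu approach (the hostname master is only an example), you can edit the /etc/hostname file and then apply the new name to the running system:
sudo nano /etc/hostname
sudo hostname master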
After editing the /etc/hosts file, you need to either restart your system's network settings or reboot the system.
To restart networking on Ubuntu, execute the following command:
sudo /etc/init.d/networking restart
To verify that configuration is working properly, execute the following command:
ping slave1
To stop the command, press Ctrl + C. The output of the command will look like this:
PING slave1 (slave1) 56(84) bytes of data.
64 bytes from slave1: icmp_req=1 ttl=64 time=0.025 ms
64 bytes from slave1: icmp_req=2 ttl=64 time=0.024 ms
The OpenSSH server and client packages should already be installed, and the sshd service should be running on all the nodes. By default, the sshd service uses port 22.
To install these packages on Ubuntu, execute the following commands:
sudo apt-get install openssh-client
sudo apt-get install openssh-server
The Hadoop master node remotely manages all the Hadoop daemons running across the cluster. The master node creates a secure connection through SSH using the dedicated user group for the cluster. It is recommended that you allow the master node to create an SSH connection without a password. You need to configure a passwordless SSH from the master node to all slave nodes.
First, you need to create SSH keys for the master node, and then share the master's public key with the target slave node using the ssh-copy-id command. Assuming that the user for the Hadoop-YARN cluster is hduser, the ssh-copy-id command will append the contents of the master node's public key file, /home/hduser/.ssh/id_dsa.pub, to the /home/hduser/.ssh/authorized_keys file on the slave node.
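A minimal sketch of these two steps, run as hduser on the master node (a DSA key pair is assumed because the chapter refers to id_dsa.pub; the empty -P option creates the key without a passphrase):
ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa
ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@slave1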
To install the SSH service and configure a passwordless SSH on Ubuntu, you can refer to the Ubuntu documentation at https://help.ubuntu.com/lts/serverguide/openssh-server.html.
To verify that the sshd service is running, execute the following netstat command:
sudo netstat -nlp | grep sshd
The output will contain the service details if the service is running:
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 670/sshd
To verify the passwordless SSH connection, execute the ssh command from the master node and observe that it no longer prompts for a password:
ssh hduser@slave1