Learning YARN

Starting with the basics

The Apache Hadoop 2.x version consists of three key components:

  • Hadoop Distributed File System (HDFS)
  • Yet Another Resource Negotiator (YARN)
  • The MapReduce API (Job execution, MRApplicationMaster, JobHistoryServer, and so on)

There are two master processes that manage the Hadoop 2.x cluster—the NameNode and the ResourceManager. All the slave nodes in the cluster have DataNode and NodeManager processes running as the worker daemons for the cluster. The NameNode and DataNode daemons are part of HDFS, whereas the ResourceManager and NodeManager belong to YARN.

When we configure Hadoop-YARN on a single node, we need to have all four processes running on the same system. Hadoop single node installation is generally used for learning purposes. If you are a beginner and need to understand the Hadoop-YARN concepts, you can use a single node Hadoop-YARN cluster.

In the production environment, a multi-node cluster is used. It is recommended to have separate nodes for NameNode and ResourceManager daemons. As the number of slave nodes in the cluster increases, the requirement of memory, processor, and network of the master nodes increases. The following diagram shows the high-level view of Hadoop-YARN processes running on a multi-node cluster.

[Figure: A high-level view of Hadoop-YARN processes running on a multi-node cluster]

Supported platforms

To install a Hadoop-YARN cluster, you can use either a GNU/Linux-based or a Windows-based operating system. The steps to configure and use the Hadoop-YARN cluster differ between these operating systems. It is recommended to use GNU/Linux for your cluster installations. Apache Hadoop is an open source framework and is widely used on open source platforms such as Ubuntu and CentOS, and support documents and blogs for Linux machines are easily available. Some companies use an enterprise version of Linux, such as RHEL (Red Hat Enterprise Linux).

In this chapter, we'll be using a 64-bit Ubuntu Desktop (version 14.04) for deployment of the Hadoop-YARN cluster. You can download an ISO image for Ubuntu Desktop from its official website (http://www.ubuntu.com/download).

Hardware requirements

The following section covers the recommended hardware configuration to run Apache Hadoop.

For learning purposes, nodes in the cluster should have the following:

  • 1.5 to 2 GB of RAM
  • 15 to 20 GB of free hard disk space

If you don't have physical machines, you can use a tool such as Oracle VirtualBox to host virtual machines on your host system. To know more about Oracle VirtualBox and how to install virtual machines using it, you can refer to the blog at http://www.protechskills.com/big-data/hadoop/administration/create-new-vm-using-oracle-virtualbox.

To select nodes for a production environment, you can refer to a blog on the Hortonworks website at http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/.

Software requirements

Hadoop is an open source framework that requires:

  • Java already installed on all the nodes
  • A passwordless SSH from the master node to the slave nodes

The steps to install Java and configure passwordless SSH are covered later in this chapter.

Basic Linux commands / utilities

Before moving forward, it is important to understand the usage of the following Linux commands:

  • Sudo
  • Nano editor
  • Source
  • Jps
  • Netstat
  • Man

The official documentation for the Hadoop cluster installation is based on the Linux platform and, as mentioned in the previous section, Linux is preferred over Windows. This section enables readers with minimal knowledge of Linux to deploy a Hadoop-YARN cluster with ease. Readers new to Linux should have a basic knowledge of these commands and utilities before moving on to the cluster installation steps. This section covers an overview and the usage of each of these commands.

Sudo

In Linux, the sudo command allows a permitted user to execute a command as the superuser, which is comparable to an administrator on a Windows system. The file /etc/sudoers contains the list of users who have sudo permission. If you need to change any system property or access any system file, you need to add sudo at the beginning of the command. To read more about the sudo command, you can refer to the blog at http://www.tutorialspoint.com/unix_commands/sudo.htm.
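A few common invocations illustrate the pattern; the package command below is Ubuntu-specific and is shown only as an example:

```shell
# Update the package index; this fails with a permission error without sudo
sudo apt-get update

# Edit a system-owned file, such as /etc/hosts, with elevated privileges
sudo nano /etc/hosts

# List the sudo privileges granted to the current user
sudo -l
```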

Nano editor

Nano is a simple text editor for Linux. Its ease of use and simplicity allow beginners to handle files easily. To read more about the nano editor, you can refer to the documentation at http://www.nano-editor.org/dist/v2.0/nano.html.

An alternative to the nano editor is the default vi editor.

Source

When you edit any of the environment settings files, such as /etc/environment or ~/.bashrc, the changes do not take effect in the current shell automatically. The source command re-executes the file in the current shell, applying the changes without restarting the system. To read more about the source command, you can refer to the blog at http://bash.cyberciti.biz/guide/Source_command.
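As a small illustration, you can write an example variable to a file and then source that file to make the variable visible in the current shell (the variable name, value, and file path here are purely illustrative):

```shell
# Write an example environment setting to a file
echo 'export DEMO_VAR=hello' > /tmp/demo_env

# Re-execute the file in the current shell; the variable becomes
# visible without restarting the system or opening a new shell
source /tmp/demo_env
echo $DEMO_VAR    # prints: hello
```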

Jps

Jps is a Java command used to list the Java processes running on a system. The output of the command contains the process ID and process name of each Java process. Before using the jps command, you need to make sure that the bin directory of your Java installation (JAVA_HOME/bin) is included in the PATH variable for the user.

A sample output is as follows:

hduser@host:~$ jps
6690 Jps
2071 ResourceManager
2471 NameNode

To read more about the jps command, you can refer to the Oracle documentation at http://docs.oracle.com/javase/7/docs/technotes/tools/share/jps.html.

Netstat

The netstat command is a utility that lists the active network connections and listening ports on a system, covering both TCP and UDP. This command is helpful for finding the ports used by a process. To read more about the netstat command and its options, you can refer to the blog at http://www.c-jump.com/CIS24/Slides/Networking/html_utils/netstat.html.

You can use the netstat command with the grep command to get filtered results for a particular process:

netstat -nlp | grep <PID>
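For example, you can take a process ID reported by the jps command and look up the ports that process is listening on; the PID 2071 below is only an example, so substitute the PID from your own system:

```shell
# List all listening sockets along with their owning processes
sudo netstat -nlp

# Filter to a single Java daemon whose PID (here 2071, an example)
# was obtained from the output of the jps command
sudo netstat -nlp | grep 2071
```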

Man

Most of the Linux commands have their documentations and user manuals that are also known as man pages. The man command is used to format and view these man pages through the command line interface. The basic syntax of the man command is as follows:

  • Syntax: man [option(s)] keyword(s)
  • Example: man ls

To read more about the man command, you can refer to the wiki page at http://en.wikipedia.org/wiki/Man_page.

Preparing a node for a Hadoop-YARN cluster

Before using a machine as a Hadoop node in a cluster, there are a few prerequisites that need to be configured.


Install Java

As mentioned in the software requirements for a cluster, all the nodes across the cluster must have Sun Java 1.6 or above installed, along with the SSH service. The Java version and the JAVA_HOME path should be consistent across all the nodes. To read more about Java compatibility with Hadoop, you can browse the wiki page at http://wiki.apache.org/hadoop/HadoopJavaVersions.

To install and configure Java on Ubuntu, you can refer to the blog at http://www.protechskills.com/linux/unix-commands/install-java-in-linux.

  • To verify if Java is installed, you can execute the following command:
    java -version
    
  • The Java version will be displayed on the console. The output of the command will look like this:
    java version "1.8.0"
    Java(TM) SE Runtime Environment (build 1.8.0-b132)
    Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)
  • To verify that the environment variable for Java is configured properly, execute the following command:
    echo $JAVA_HOME
    
  • The installation directory for Java will be displayed on the console. The output of the command will look like this:
    /usr/lib/jvm/jdk1.8.0/
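
If the JAVA_HOME variable is not set, you can add it to the user's shell profile. The following is a minimal sketch, assuming the JDK is installed under /usr/lib/jvm/jdk1.8.0; adjust the path to your actual installation directory:

```shell
# Append the Java environment variables to the user's profile
echo 'export JAVA_HOME=/usr/lib/jvm/jdk1.8.0' >> ~/.bashrc
echo 'export PATH=$PATH:$JAVA_HOME/bin' >> ~/.bashrc

# Apply the changes to the current shell without logging out
source ~/.bashrc
echo $JAVA_HOME
```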
    

Create a Hadoop dedicated user and group

In a multi-node Hadoop cluster installation, the Hadoop daemons run on multiple systems; for example, the slave nodes run the DataNode and NodeManager services. All the nodes in a cluster must have a common user and group, so it is recommended to create a dedicated user for the Hadoop cluster on all the nodes of your cluster.

To create a user on Ubuntu, you can refer to the Ubuntu documentation at http://manpages.ubuntu.com/manpages/jaunty/man8/useradd.8.html.

Here is a sample command to create a new user hduser on Ubuntu:

sudo useradd -m hduser

After creating a new user, you also need to set a password for the new user. Execute the following command to set a password for the newly created user hduser:

sudo passwd hduser

Disable firewall or open Hadoop ports

Hadoop daemons use a few ports for internal and client communication. The cluster administrator has an option to either disable the firewall of the nodes or allow traffic on the ports required by Hadoop. Hadoop has a list of default ports, but you can configure them as per your need.

To disable the firewall in Ubuntu, execute the following command:

sudo ufw disable
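Alternatively, rather than disabling the firewall entirely, you can open only the required ports with ufw. The port numbers below are common Hadoop 2.x defaults shown as examples; verify them against your own cluster configuration before applying the rules:

```shell
# HDFS: NameNode RPC (8020) and web UI (50070)
sudo ufw allow 8020/tcp
sudo ufw allow 50070/tcp

# YARN: ResourceManager scheduler/RPC/admin ports and web UI (8088)
sudo ufw allow 8030:8033/tcp
sudo ufw allow 8088/tcp
```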

Configure the domain name resolution

A Hadoop node is identified through its hostname. All the nodes in the cluster must have a unique hostname and IP address. Each Hadoop node should be able to resolve the hostname of the other nodes in the cluster.

If you are not using a DHCP server that manages your DNS hostname resolution, you need to configure the /etc/hosts file on all the nodes. The /etc/hosts file of a system contains IP address to hostname mappings. You need to add an entry for each node in the cluster, mapping its IP address to its hostname. You can use the nano or vi editor with the sudo option to edit the file contents. Assuming the hostnames of your nodes are master, slave1, slave2, and so on, the contents of the file will look similar to the following:

192.168.56.100 master
192.168.56.101 slave1
192.168.56.102 slave2
192.168.56.103 slave3

To view the system hostname, you can execute the hostname command.

hostname

To modify the system hostname in Linux, you can refer to the blogs here:

CentOS: https://www.centosblog.com/how-to-change-hostname-on-centos-linux/

Ubuntu: http://askubuntu.com/questions/87665/how-do-i-change-the-hostname-without-a-restart

After editing the file, you need to either restart your system network settings or reboot your system.

To restart networking on Ubuntu, execute the following command:

sudo /etc/init.d/networking restart

To verify that configuration is working properly, execute the following command:

ping slave1

To stop the command, press Ctrl + C. The output of the command will look like this:

PING slave1 (slave1) 56(84) bytes of data.
64 bytes from slave1: icmp_req=1 ttl=64 time=0.025 ms
64 bytes from slave1: icmp_req=2 ttl=64 time=0.024 ms

Install SSH and configure passwordless SSH from the master to all slaves

The OpenSSH server and client packages should already be installed and the sshd service should be running on all the nodes. By default, the sshd service uses port 22.

To install these packages on Ubuntu, execute the following commands:

sudo apt-get install openssh-client
sudo apt-get install openssh-server

The Hadoop master node remotely manages all the Hadoop daemons running across the cluster. The master node creates a secure connection through SSH using the dedicated user group for the cluster. It is recommended that you allow the master node to create an SSH connection without a password. You need to configure a passwordless SSH from the master node to all slave nodes.

First, you need to create SSH keys on the master node, and then share the master's public key with each slave node using the ssh-copy-id command.

Assuming that the user for the Hadoop-YARN cluster is hduser, the ssh-copy-id command will append the contents of the master node's public key file, /home/hduser/.ssh/id_dsa.pub, to the /home/hduser/.ssh/authorized_keys file on the slave node.
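The steps above can be sketched as follows, assuming the hduser account on the master node and slave1 as an example slave hostname:

```shell
# On the master node: generate a DSA key pair with an empty passphrase
ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa

# Append the master's public key to the slave's authorized_keys file
# (prompts for hduser's password on slave1 one last time)
ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@slave1

# Verify: this should now log in without a password prompt
ssh hduser@slave1
```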

To install the SSH service and configure a passwordless SSH on Ubuntu, you can refer to the Ubuntu documentation at https://help.ubuntu.com/lts/serverguide/openssh-server.html.

To verify that the sshd service is running, execute the following netstat command:

sudo netstat -nlp | grep sshd

The output will contain the service details if the service is running:

tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 670/sshd

To verify the passwordless SSH connection, execute the ssh command from the master node and observe that the command will not prompt for a password now:

ssh hduser@slave1