Learning YARN

The Hadoop-YARN single node installation

In a single node installation, all the Hadoop-YARN daemons (NameNode, ResourceManager, DataNode, and NodeManager) run on a single node as separate Java processes. You will need only one Linux machine with a minimum of 2 GB RAM and 15 GB free disk space.

Prerequisites

Before starting with the installation steps, make sure that you have prepared the node as specified in the previous section. In particular, verify the following:

  • The hostname used in the single node installation is localhost with 127.0.0.1 as the IP address. It is known as the loopback address for a machine. You need to make sure that the /etc/hosts file contains the resolution for the loopback address. The loopback entry will look like this:
    127.0.0.1 localhost
    
  • Passwordless SSH is configured for localhost. To ensure this, execute the following command (a short setup sketch follows this list):
    ssh-copy-id localhost
    
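If no SSH key pair exists yet for the hduser account, a minimal sketch to create one and verify the passwordless login looks like the following; the RSA key type and empty passphrase are just one common choice:

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id localhost
ssh localhost exit   # should log in and return without asking for a password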

Installation steps

After preparing your node for Hadoop, you need to follow a simple five-step process to install and run Hadoop on your Linux machine.

Step 1 – Download and extract the Hadoop bundle

The current version of Hadoop is 2.5.1 and the steps mentioned here assume that you use the same version. Log in to your system as the dedicated Hadoop user and download the Hadoop 2.x bundle (a tar.gz file) from the Apache archive:

wget https://archive.apache.org/dist/hadoop/core/hadoop-2.5.1/hadoop-2.5.1.tar.gz

You can use your home directory for the Hadoop installation (/home/<username>). If you want to use any of the system directories such as /opt or /usr for installation, you need to use the sudo option with the commands. For simplicity, we'll install Hadoop in the home directory of the user. The commands in this chapter assume that the username is hduser. You can replace hduser with the actual username. Move your Hadoop bundle to the user's home directory and extract the contents of the bundle file:

mv hadoop-2.5.1.tar.gz /home/hduser/
cd /home/hduser
tar -xzvf hadoop-2.5.1.tar.gz

Step 2 – Configure the environment variables

Configure the Hadoop environment variables in /home/hduser/.bashrc (for Ubuntu) or /home/hduser/.bash_profile (for CentOS). Hadoop requires the HADOOP_PREFIX variable and the component home variables (HADOOP_COMMON_HOME, HADOOP_HDFS_HOME, HADOOP_MAPRED_HOME, and YARN_HOME) to be set before the Hadoop services are started. HADOOP_PREFIX specifies the installation directory for Hadoop. We assume that you extracted the Hadoop bundle in the home folder of hduser.

Use the nano editor and append the following export commands to the end of the file:

export HADOOP_PREFIX="/home/hduser/hadoop-2.5.1/"
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}
export YARN_HOME=${HADOOP_PREFIX}

After saving the file, you need to reload it into your current shell using the source command:

source ~/.bashrc
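
To confirm that the new variables are active, you can check that the hadoop binary now resolves from the extracted bundle; the path shown in the comment assumes the hduser layout used above:

echo $HADOOP_PREFIX
which hadoop   # should point inside /home/hduser/hadoop-2.5.1/bin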

Step 3 – Configure the Hadoop configuration files

Next, you need to configure the Hadoop site configuration files. There are four site files that you need to update, along with the environment scripts and the slaves file. You can find all of these files in the $HADOOP_PREFIX/etc/hadoop folder.

The core-site.xml file

The core-site.xml file contains information for the namenode host and the RPC port used by NameNode. For a single node installation, the host for namenode will be localhost. The default RPC port for NameNode is 8020. You need to edit the file and add a configuration property under the configuration tag:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
    <final>true</final>
</property>

The hdfs-site.xml file

The hdfs-site.xml file contains the configuration properties related to HDFS. In this file, you specify the replication factor and the directories for namenode and datanode to store their data. Edit the hdfs-site.xml file and add the following properties under the configuration tag:

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>

<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hduser/hadoop-2.5.1/hadoop_data/dfs/name</value>
</property>

<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hduser/hadoop-2.5.1/hadoop_data/dfs/data</value>
</property>

The mapred-site.xml file

The mapred-site.xml file contains information related to the MapReduce framework for the cluster. Here you set the MapReduce framework to yarn; the other possible values for this property are local and classic. A detailed explanation of these values is given in the next chapter.

In the Hadoop configuration folder, you will find a template for the mapred-site.xml file. Execute the following command to copy the template file to create the mapred-site.xml file:

cp /home/hduser/hadoop-2.5.1/etc/hadoop/mapred-site.xml.template /home/hduser/hadoop-2.5.1/etc/hadoop/mapred-site.xml

Now edit the mapred-site.xml file and add the following properties under the configuration tag:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

The yarn-site.xml file

The yarn-site.xml file contains the information related to the YARN daemons and YARN properties. You need to specify the host and port for the ResourceManager daemon. As with the NameNode host, for a single node installation the ResourceManager host is localhost. The default RPC port for ResourceManager is 8032. You also need to specify the scheduler to be used by ResourceManager and the auxiliary services for NodeManager. We'll cover these properties in detail in the next chapter. Edit the yarn-site.xml file and add the following properties under the configuration tag:

<property>
    <name>yarn.resourcemanager.address</name>
    <value>localhost:8032</value>
</property>

<property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>localhost:8030</value>
</property>

<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>localhost:8031</value>
</property>

<property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>localhost:8033</value>
</property>

<property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>localhost:8088</value>
</property>

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>

<property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

The hadoop-env.sh and yarn-env.sh files

The Hadoop daemons require the Java settings to be defined in the Hadoop environment files. You need to configure the value of JAVA_HOME (the Java installation directory) in the Hadoop and YARN environment files. Open the hadoop-env.sh and yarn-env.sh files, uncomment the export JAVA_HOME line (remove the leading # symbol), and update it with the actual JAVA_HOME value.
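
For example, on an Ubuntu system with OpenJDK 7 the uncommented line might look like the following; the path is only an assumption and must match your actual Java installation:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64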

The slaves file

The slaves file contains a list of hostnames for the slave nodes. For a single node installation, the host value is localhost. By default, the slaves file contains only localhost, so you don't need to modify it for a single node installation.
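
You can quickly confirm the default content of the file:

cat /home/hduser/hadoop-2.5.1/etc/hadoop/slaves
# prints: localhost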

Step 4 – Format NameNode

After configuring Hadoop files, you need to format the HDFS using the namenode format command. Before executing the format command, make sure that the dfs.namenode.name.dir directory specified in the hdfs-site.xml file does not exist. This directory is created by the namenode format command. Execute the following command to format NameNode:

hdfs namenode -format

After executing the preceding command, make sure that there's no exception on the console and that the namenode directory is created.

Note

The following line in the console output specifies that the namenode directory has been successfully formatted:

INFO common.Storage: Storage directory /home/hduser/hadoop-2.5.1/hadoop_data/dfs/name has been successfully formatted.

Step 5 – Start Hadoop daemons

Start the Hadoop services using the Hadoop 2 scripts in the /home/hduser/hadoop-2.5.1/sbin/ directory. For a single node installation, all the daemons run on a single system. Use the following commands to start the services one by one.
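
One way to start the daemons one by one is with the per-daemon scripts that ship in the sbin directory, which is already on the PATH from Step 2:

hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
yarn-daemon.sh start resourcemanager
yarn-daemon.sh start nodemanager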

Execute the jps command and ensure that all Hadoop daemons are running. You can also verify the status of your cluster through the web interface for HDFS-NameNode and YARN-ResourceManager.
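
A healthy single node setup lists the four Hadoop daemons along with the jps process itself; the process IDs below are only illustrative:

jps
2417 NameNode
2534 DataNode
2712 ResourceManager
2843 NodeManager
2998 Jps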

The NameNode web interface runs at http://<NameNodeHost>:50070/ and the ResourceManager web interface at http://<ResourceManagerHost>:8088/. For a single node installation, replace <NameNodeHost> and <ResourceManagerHost> with localhost, for example http://localhost:8088/.