Hadoop 2 (YARN): How to set up a single node on Ubuntu (Tutorial)

**Still a draft, but public :)**

Tutorial Requirements:

  • Hadoop 2.7.1
  • Java 8
  • Ubuntu (ubuntu-14.04.3-desktop-amd64.iso)
  • VirtualBox

As a convention:

  • We will place our development tools under the /opt/dev directory.
  • For each product, we will create a directory, with the product versions as subdirectories.

1. Set JAVA_HOME

$ sudo gedit /etc/profile.d/java.sh

 

++ Add content

# Oracle JDK 8
export JAVA_HOME=/opt/dev/java/jdk1.8.0_65
export PATH=$PATH:$JAVA_HOME/bin

 

$ source /etc/profile.d/java.sh
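
You can check that the variables are set in the current shell (jdk1.8.0_65 is the version used in this tutorial; adjust the path to yours):

$ echo $JAVA_HOME   # should print /opt/dev/java/jdk1.8.0_65
$ java -version     # should report java version "1.8.0_65"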

 

2. Set HADOOP_HOME

$ sudo gedit /etc/profile.d/hadoop.sh

 

++ Add content

# Hadoop
export HADOOP_HOME=/opt/dev/hadoop/hadoop-2.7.1
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
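
As in step 1, load the new profile into the current shell and check that the hadoop command is found (this assumes the Hadoop 2.7.1 tarball is already unpacked under /opt/dev/hadoop):

$ source /etc/profile.d/hadoop.sh
$ hadoop version   # first line should read "Hadoop 2.7.1"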

3. Add Hadoop users:

$ sudo groupadd hadoop
$ sudo useradd -g hadoop -m yarn
$ sudo useradd -g hadoop -m hdfs
$ sudo useradd -g hadoop -m mapred

The -m flag creates a home directory for each account; su - expects one later on.

Choose a password for each of them:

$ sudo passwd yarn
$ sudo passwd hdfs
$ sudo passwd mapred
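
A quick sanity check that the group and the three accounts exist:

$ getent group hadoop   # the hadoop group and its members
$ id yarn               # uid=...(yarn) gid=...(hadoop)
$ id hdfs
$ id mapred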

 

4. Data and Log directories

nn: NameNode metadata
snn: SecondaryNameNode checkpoint data
dn: DataNode blocks

$ sudo mkdir -p /var/data/hadoop/hdfs/nn
$ sudo mkdir -p /var/data/hadoop/hdfs/snn
$ sudo mkdir -p /var/data/hadoop/hdfs/dn
$ sudo chown hdfs:hadoop /var/data/hadoop/hdfs -R

$ sudo mkdir -p /var/log/hadoop/yarn
$ sudo chown yarn:hadoop /var/log/hadoop/yarn -R
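
You can confirm the layout and ownership before moving on:

$ ls -lR /var/data/hadoop/hdfs   # nn, snn, dn owned by hdfs:hadoop
$ ls -ld /var/log/hadoop/yarn    # owned by yarn:hadoop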

 

5. Make the yarn user the owner of the Hadoop installation directory

A fresh tarball has no logs directory, so create it first (use sudo if /opt/dev is root-owned):

$ cd $HADOOP_HOME
$ sudo mkdir -p logs
$ sudo chmod g+w logs
$ sudo chown yarn:hadoop . -R

 

6. Configure core-site.xml

$ su yarn
$ vi $HADOOP_HOME/etc/hadoop/core-site.xml

 

++ Add content

<configuration>
  <!-- NameNode: HDFS metadata server (fs.defaultFS replaces the deprecated fs.default.name) -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>

  <!-- The default user name for the web UI -->
  <property>
    <name>hadoop.http.staticuser.user</name>
    <value>hdfs</value>
  </property>
</configuration>
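
Optionally, check that the file is still well-formed XML. xmllint comes from the libxml2-utils package on Ubuntu (install it with apt-get if missing):

$ xmllint --noout $HADOOP_HOME/etc/hadoop/core-site.xml   # no output means valid XML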

 

7. Configure hdfs-site.xml:

NameNode : metadata server
DataNode : where the actual data is stored
SecondaryNameNode : checkpoint data for the NameNode

$ su yarn
$ vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml

 

++ Add content

<configuration>
  <!-- Single node (pseudo-distributed mode) => no need for the default replication factor of 3, so set it to 1 -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <!-- NameNode directory -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/var/data/hadoop/hdfs/nn</value>
  </property>

  <!-- SecondaryNameNode checkpoint images (dfs.namenode.checkpoint.dir replaces the deprecated fs.checkpoint.dir) -->
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>file:/var/data/hadoop/hdfs/snn</value>
  </property>

  <!-- Unlike the previous key, this one holds the edits journal used during a checkpoint -->
  <property>
    <name>dfs.namenode.checkpoint.edits.dir</name>
    <value>file:/var/data/hadoop/hdfs/snn</value>
  </property>

  <!-- DataNode directory -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/var/data/hadoop/hdfs/dn</value>
  </property>
</configuration>
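
Once saved, you can ask Hadoop how it resolves a key, which confirms the file is being picked up (hdfs getconf works without any daemon running):

$ hdfs getconf -confKey dfs.replication         # should print 1
$ hdfs getconf -confKey dfs.namenode.name.dir   # should print file:/var/data/hadoop/hdfs/nn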

 

8. Configure mapred-site.xml

Initially the file mapred-site.xml does not exist, but it can be copied from mapred-site.xml.template:

$ su yarn

$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml

$ vi $HADOOP_HOME/etc/hadoop/mapred-site.xml

 

++ Add content

<configuration>
  <!-- MapReduce will run as a YARN application -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

9. Configure yarn-site.xml

$ vi $HADOOP_HOME/etc/hadoop/yarn-site.xml

 

++ Add content

<configuration>
  <!-- NodeManagers run an auxiliary service that serves map outputs to the reducers (the shuffle) -->
  <!-- We need to declare the auxiliary service that NodeManagers will run -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

  <!-- And the implementation class for that service; the key embeds the service name, so it is mapreduce_shuffle, not mapreduce.shuffle -->
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>

  <!-- Remark: NodeManagers won't shuffle data for a non-MapReduce job by default -->
</configuration>
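
A quick grep of both files confirms the two settings that matter here:

$ grep -A1 'mapreduce.framework.name' $HADOOP_HOME/etc/hadoop/mapred-site.xml
$ grep -A1 'yarn.nodemanager.aux-services' $HADOOP_HOME/etc/hadoop/yarn-site.xml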

 

10. Format HDFS

The hdfs user, which owns the NameNode directory /var/data/hadoop/hdfs/nn, must format it to set up a new file system.
You will recognize /var/data/hadoop/hdfs/nn as the value of "dfs.namenode.name.dir" in $HADOOP_HOME/etc/hadoop/hdfs-site.xml.

$ su - hdfs
$ hdfs namenode -format

 

Check success by looking for this log line:

INFO common.Storage: Storage directory /var/data/hadoop/hdfs/nn has been successfully formatted.

11. Start HDFS services

$ hadoop-daemon.sh start namenode

starting namenode, logging to /opt/dev/hadoop/hadoop-2.7.1/logs/hadoop-hdfs-namenode-jasper-VirtualBox.out

 

$ hadoop-daemon.sh start secondarynamenode

starting secondarynamenode, logging to /opt/dev/hadoop/hadoop-2.7.1/logs/hadoop-hdfs-secondarynamenode-jasper-VirtualBox.out

 

$ hadoop-daemon.sh start datanode

starting datanode, logging to /opt/dev/hadoop/hadoop-2.7.1/logs/hadoop-hdfs-datanode-jasper-VirtualBox.out

 

Check that the services are running: each should have a PID.

$ jps

4880 DataNode
4826 SecondaryNameNode
4702 NameNode
5038 Jps
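
As an extra smoke test, with the three daemons up, try a few HDFS operations as the hdfs user (the /user/hdfs path is just an example):

$ hdfs dfs -mkdir -p /user/hdfs
$ hdfs dfs -ls /user
$ hdfs dfsadmin -report   # should show one live DataNode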

 

12. Start YARN services

$ yarn-daemon.sh start resourcemanager

starting resourcemanager, logging to /opt/dev/hadoop/hadoop-2.7.1/logs/yarn-yarn-resourcemanager-jasper-VirtualBox.out

 

$ yarn-daemon.sh start nodemanager

starting nodemanager, logging to /opt/dev/hadoop/hadoop-2.7.1/logs/yarn-yarn-nodemanager-jasper-VirtualBox.out

 

$ jps

5475 NodeManager
5595 Jps
5116 ResourceManager

 

Remark: always check that each started service actually has a PID.
For example, the nodemanager start command printed no error message,
yet jps showed no NodeManager PID.
So I checked the log using:

$ cat /opt/dev/hadoop/hadoop-2.7.1/logs/yarn-yarn-nodemanager-jasper-VirtualBox.log

 

And I found this error:

java.lang.IllegalArgumentException: The ServiceName: mapreduce.shuffle set in yarn.nodemanager.aux-services is invalid.The valid service name should only contain a-zA-Z0-9_ and can not start with numbers

 

So I had to fix it by correcting yarn-site.xml:

> From:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce.shuffle</value>
</property>

> To:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
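
After correcting yarn-site.xml, restart the NodeManager so it picks up the new value, then check jps again:

$ yarn-daemon.sh stop nodemanager
$ yarn-daemon.sh start nodemanager
$ jps   # NodeManager should now have a PID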

13. HDFS Dashboard GUI:

$ firefox http://localhost:50070

 

To check the logs:

http://localhost:50070/logs/

 

14. YARN (ResourceManager) Dashboard GUI:

$ firefox http://localhost:8088
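
If you prefer a headless check (e.g. over ssh into the VM), curl can confirm that both UIs answer; an HTTP 200 means the daemon is serving its dashboard:

$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070   # NameNode UI
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088    # ResourceManager UI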

 

15. Testing MapReduce

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi \
-Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory \
-libjars $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1.jar 16 1000

 

Result:

Number of Maps = 16
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Wrote input for Map #10
Wrote input for Map #11
Wrote input for Map #12
Wrote input for Map #13
Wrote input for Map #14
Wrote input for Map #15
Starting Job
16/02/03 20:27:08 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/02/03 20:27:09 INFO input.FileInputFormat: Total input paths to process : 16
16/02/03 20:27:09 INFO mapreduce.JobSubmitter: number of splits:16
16/02/03 20:27:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1454547673341_0001
16/02/03 20:27:10 INFO impl.YarnClientImpl: Submitted application application_1454547673341_0001
16/02/03 20:27:10 INFO mapreduce.Job: The url to track the job: http://jasper-VirtualBox:8088/proxy/application_1454547673341_0001/
16/02/03 20:27:10 INFO mapreduce.Job: Running job: job_1454547673341_0001
16/02/03 20:27:19 INFO mapreduce.Job: Job job_1454547673341_0001 running in uber mode : false
16/02/03 20:27:20 INFO mapreduce.Job: map 0% reduce 0%
16/02/03 20:27:50 INFO mapreduce.Job: map 13% reduce 0%
16/02/03 20:27:51 INFO mapreduce.Job: map 38% reduce 0%
16/02/03 20:28:21 INFO mapreduce.Job: map 63% reduce 0%
16/02/03 20:28:22 INFO mapreduce.Job: map 69% reduce 0%
16/02/03 20:28:23 INFO mapreduce.Job: map 69% reduce 23%
16/02/03 20:28:47 INFO mapreduce.Job: map 100% reduce 23%
16/02/03 20:28:49 INFO mapreduce.Job: map 100% reduce 100%
16/02/03 20:28:51 INFO mapreduce.Job: Job job_1454547673341_0001 completed successfully
16/02/03 20:28:51 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=358
FILE: Number of bytes written=1999328
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=4230
HDFS: Number of bytes written=215
HDFS: Number of read operations=67
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
Job Counters
Launched map tasks=16
Launched reduce tasks=1
Data-local map tasks=16
Total time spent by all maps in occupied slots (ms)=417184
Total time spent by all reduces in occupied slots (ms)=55313
Total time spent by all map tasks (ms)=417184
Total time spent by all reduce tasks (ms)=55313
Total vcore-seconds taken by all map tasks=417184
Total vcore-seconds taken by all reduce tasks=55313
Total megabyte-seconds taken by all map tasks=427196416
Total megabyte-seconds taken by all reduce tasks=56640512
Map-Reduce Framework
Map input records=16
Map output records=32
Map output bytes=288
Map output materialized bytes=448
Input split bytes=2342
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=448
Reduce input records=32
Reduce output records=0
Spilled Records=64
Shuffled Maps =16
Failed Shuffles=0
Merged Map outputs=16
GC time elapsed (ms)=8078
CPU time spent (ms)=7590
Physical memory (bytes) snapshot=3698302976
Virtual memory (bytes) snapshot=31985123328
Total committed heap usage (bytes)=2723676160
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1888
File Output Format Counters
Bytes Written=97
Job Finished in 102.754 seconds
Estimated value of Pi is 3.14250000000000000000
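
To double-check from the command line that the job really ran on YARN, list finished applications; the application id should match the one in the log above:

$ yarn application -list -appStates FINISHED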