Hadoop 2 (YARN) : How to set up a single node on Ubuntu (Tutorial)

**Draft only, but public :)**

Tutorial Requirements:

  • Hadoop 2.7.1
  • Java 8
  • Ubuntu (ubuntu-14.04.3-desktop-amd64.iso)
  • VirtualBox

As a convention:

  • We will place our development tools/products under the /opt/dev directory.
  • Each product gets its own directory, with the product versions as subdirectories.


1. Set JAVA_HOME :

$ sudo gedit /etc/profile.d/java.sh


++ Add content

# Oracle JDK 8
export JAVA_HOME=/opt/dev/java/jdk1.8.0_65
export PATH=$PATH:$JAVA_HOME/bin


$ source /etc/profile.d/java.sh



2. Set HADOOP_HOME :

$ sudo gedit /etc/profile.d/hadoop.sh


++ Add content

# Hadoop
export HADOOP_HOME=/opt/dev/hadoop/hadoop-2.7.1
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

3. Add Hadoop users :

$ sudo groupadd hadoop
$ sudo useradd -g hadoop yarn
$ sudo useradd -g hadoop hdfs
$ sudo useradd -g hadoop mapred

Choose a password for each of them :

$ sudo passwd yarn
$ sudo passwd hdfs
$ sudo passwd mapred


4. Data and Log directories

nn: NameNode metadata
snn: SecondaryNameNode checkpoint data
dn: DataNode blocks

$ sudo mkdir -p /var/data/hadoop/hdfs/nn
$ sudo mkdir -p /var/data/hadoop/hdfs/snn
$ sudo mkdir -p /var/data/hadoop/hdfs/dn
$ sudo chown hdfs:hadoop /var/data/hadoop/hdfs -R

$ sudo mkdir -p /var/log/hadoop/yarn
$ sudo chown yarn:hadoop /var/log/hadoop/yarn -R


5. Make the yarn user the owner of the Hadoop installation directory :

$ cd $HADOOP_HOME
$ sudo mkdir -p logs
$ sudo chmod g+w logs
$ sudo chown yarn:hadoop . -R


6. Configure core-site.xml :

$ su yarn
$ vi $HADOOP_HOME/etc/hadoop/core-site.xml


++ Add content

<!-- NameNode : HDFS server metadata -->

<!-- The default user name -->
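The property blocks were stripped from this draft. Assuming the standard pseudo-distributed setup that the two comments describe (the hdfs://localhost:9000 address is a common choice, not something fixed by the draft), core-site.xml would look something like:

```xml
<configuration>
  <!-- NameNode : HDFS server metadata -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- The default user name for the HDFS web UI -->
  <property>
    <name>hadoop.http.staticuser.user</name>
    <value>hdfs</value>
  </property>
</configuration>
```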


7. Configure hdfs-site.xml :

NameNode : metadata server
DataNode : where the actual data is stored
SecondaryNameNode : checkpoint data for the NameNode

$ su yarn
$ vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml


++ Add content

<!-- Single node (pseudo-distributed mode) => replication defaults to 3, which makes no sense here, so set it to 1 -->

<!-- NameNode Directory -->

<!-- Secondary NameNode checkpoint directory -->

<!-- Secondary NameNode checkpoint edits directory -->

<!-- DataNode Directory-->
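The actual property blocks are missing from this draft. A sketch that matches the four comments above and the directories created in step 4 (the property names are the standard Hadoop 2.x ones, not taken from the draft itself):

```xml
<configuration>
  <!-- Single node: set replication to 1 -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- NameNode directory -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/var/data/hadoop/hdfs/nn</value>
  </property>
  <!-- Secondary NameNode checkpoint directory -->
  <property>
    <name>fs.checkpoint.dir</name>
    <value>file:/var/data/hadoop/hdfs/snn</value>
  </property>
  <!-- Secondary NameNode checkpoint edits directory -->
  <property>
    <name>fs.checkpoint.edits.dir</name>
    <value>file:/var/data/hadoop/hdfs/snn</value>
  </property>
  <!-- DataNode directory -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/var/data/hadoop/hdfs/dn</value>
  </property>
</configuration>
```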


8. Configure mapred-site.xml

Initially the file mapred-site.xml doesn’t exist, but it can be created from mapred-site.xml.template

$ su yarn

$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml

$ vi $HADOOP_HOME/etc/hadoop/mapred-site.xml


++ Add content

<!-- MapReduce will run as a Yarn application -->
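The property itself was dropped from this draft; the comment above corresponds to the standard setting that tells MapReduce to submit jobs to YARN:

```xml
<configuration>
  <!-- MapReduce will run as a YARN application -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```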

9. Configure yarn-site.xml

$ vi $HADOOP_HOME/etc/hadoop/yarn-site.xml


++ Add content


<!-- NodeManagers need to know how to shuffle intermediate MapReduce data -->
<!-- We need to specify the auxiliary service that NodeManagers will implement -->

<!-- We need to specify the implementation class for the auxiliary service -->

<!-- Remark : by default, NodeManagers won’t shuffle data for non-MapReduce jobs -->
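The property blocks are missing from this draft. Based on the comments above and the error log quoted in the remark at the end of step 12, the standard configuration would be (note the underscore in mapreduce_shuffle — the service name may not contain a dot):

```xml
<configuration>
  <!-- Auxiliary service that NodeManagers implement for the MapReduce shuffle -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Implementation class for the auxiliary service -->
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
```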


10. Format HDFS

The hdfs user, which owns the NameNode directory /var/data/hadoop/hdfs/nn, must format this directory to set up a new file system.
You will use /var/data/hadoop/hdfs/nn as the value of “dfs.namenode.name.dir” in $HADOOP_HOME/etc/hadoop/hdfs-site.xml.

$ su - hdfs
$ hdfs namenode -format


Check success by looking for this log:

INFO common.Storage: Storage directory /var/data/hadoop/hdfs/nn has been successfully formatted.

11. Start HDFS services

$ hadoop-daemon.sh start namenode



starting namenode, logging to /opt/dev/hadoop/hadoop-2.7.1/logs/hadoop-hdfs-namenode-jasper-VirtualBox.out


$ hadoop-daemon.sh start secondarynamenode



starting secondarynamenode, logging to /opt/dev/hadoop/hadoop-2.7.1/logs/hadoop-hdfs-secondarynamenode-jasper-VirtualBox.out


$ hadoop-daemon.sh start datanode



starting datanode, logging to /opt/dev/hadoop/hadoop-2.7.1/logs/hadoop-hdfs-datanode-jasper-VirtualBox.out


Check that the services are running (each should have a PID) :

$ jps



4880 DataNode
4826 SecondaryNameNode
4702 NameNode
5038 Jps


12. Start Yarn Services

$ yarn-daemon.sh start resourcemanager



starting resourcemanager, logging to /opt/dev/hadoop/hadoop-2.7.1/logs/yarn-yarn-resourcemanager-jasper-VirtualBox.out


$ yarn-daemon.sh start nodemanager



starting nodemanager, logging to /opt/dev/hadoop/hadoop-2.7.1/logs/yarn-yarn-nodemanager-jasper-VirtualBox.out


$ jps



5475 NodeManager
5595 Jps
5116 ResourceManager


Remark : it is almost mandatory to check that each started service actually has a PID.
For example, the nodemanager start command printed no error message,
but jps showed no PID for the NodeManager.
So I decided to check the log using :

$ cat /opt/dev/hadoop/hadoop-2.7.1/logs/yarn-yarn-nodemanager-jasper-VirtualBox.log


And I found this error :

java.lang.IllegalArgumentException: The ServiceName: mapreduce.shuffle set in yarn.nodemanager.aux-services is invalid.The valid service name should only contain a-zA-Z0-9_ and can not start with numbers


So I had to fix it by correcting yarn-site.xml :

> From :

<value>mapreduce.shuffle</value>

> To :

<value>mapreduce_shuffle</value>

13. HDFS Dashboard GUI :

$ firefox http://localhost:50070


To check the logs, the same web UI exposes them under /logs :

$ firefox http://localhost:50070/logs


14. Yarn (Resource Manager) Dashboard GUI :

$ firefox http://localhost:8088


15. Testing MapReduce

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi \
-Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory \
-libjars $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.1.jar 16 1000


Result :

Number of Maps = 16
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Wrote input for Map #10
Wrote input for Map #11
Wrote input for Map #12
Wrote input for Map #13
Wrote input for Map #14
Wrote input for Map #15
Starting Job
16/02/03 20:27:08 INFO client.RMProxy: Connecting to ResourceManager at /
16/02/03 20:27:09 INFO input.FileInputFormat: Total input paths to process : 16
16/02/03 20:27:09 INFO mapreduce.JobSubmitter: number of splits:16
16/02/03 20:27:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1454547673341_0001
16/02/03 20:27:10 INFO impl.YarnClientImpl: Submitted application application_1454547673341_0001
16/02/03 20:27:10 INFO mapreduce.Job: The url to track the job: http://jasper-VirtualBox:8088/proxy/application_1454547673341_0001/
16/02/03 20:27:10 INFO mapreduce.Job: Running job: job_1454547673341_0001
16/02/03 20:27:19 INFO mapreduce.Job: Job job_1454547673341_0001 running in uber mode : false
16/02/03 20:27:20 INFO mapreduce.Job: map 0% reduce 0%
16/02/03 20:27:50 INFO mapreduce.Job: map 13% reduce 0%
16/02/03 20:27:51 INFO mapreduce.Job: map 38% reduce 0%
16/02/03 20:28:21 INFO mapreduce.Job: map 63% reduce 0%
16/02/03 20:28:22 INFO mapreduce.Job: map 69% reduce 0%
16/02/03 20:28:23 INFO mapreduce.Job: map 69% reduce 23%
16/02/03 20:28:47 INFO mapreduce.Job: map 100% reduce 23%
16/02/03 20:28:49 INFO mapreduce.Job: map 100% reduce 100%
16/02/03 20:28:51 INFO mapreduce.Job: Job job_1454547673341_0001 completed successfully
16/02/03 20:28:51 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=358
FILE: Number of bytes written=1999328
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=4230
HDFS: Number of bytes written=215
HDFS: Number of read operations=67
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
Job Counters
Launched map tasks=16
Launched reduce tasks=1
Data-local map tasks=16
Total time spent by all maps in occupied slots (ms)=417184
Total time spent by all reduces in occupied slots (ms)=55313
Total time spent by all map tasks (ms)=417184
Total time spent by all reduce tasks (ms)=55313
Total vcore-seconds taken by all map tasks=417184
Total vcore-seconds taken by all reduce tasks=55313
Total megabyte-seconds taken by all map tasks=427196416
Total megabyte-seconds taken by all reduce tasks=56640512
Map-Reduce Framework
Map input records=16
Map output records=32
Map output bytes=288
Map output materialized bytes=448
Input split bytes=2342
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=448
Reduce input records=32
Reduce output records=0
Spilled Records=64
Shuffled Maps =16
Failed Shuffles=0
Merged Map outputs=16
GC time elapsed (ms)=8078
CPU time spent (ms)=7590
Physical memory (bytes) snapshot=3698302976
Virtual memory (bytes) snapshot=31985123328
Total committed heap usage (bytes)=2723676160
Shuffle Errors
File Input Format Counters
Bytes Read=1888
File Output Format Counters
Bytes Written=97
Job Finished in 102.754 seconds
Estimated value of Pi is 3.14250000000000000000