Tuhin Mitra
Kevin Greene
Doug Cutting had a vision.
To search the entire internet on an open platform.
Worked on open source search engine.
In a way, he was one of the first people to really envision GIFEE (Google's Infrastructure For Everyone Else)
Lucene: a way to search text
Solr and Elasticsearch both heavily utilize Lucene
When you want to search the web, you need...
A way to crawl the web
Nutch was designed to be fast!
But it was slow and didn't scale
Two very influential papers came out
The Google File System (2003): described Google's production file system
Dealt with fault tolerance and component failures
Dealt with huge files
MapReduce (2004): described Google's production computation engine
Used to power Google's web crawler
Doug Cutting wrote open source versions of both, and called the result Hadoop, after his son's favorite toy.
Hadoop still alive and kicking
HDFS and MapReduce still heavily used
The YARN Scheduler was added to the core project
Many, many, many subprojects exist
Map
Reduce
Partition
Comparison
Input
Output
Let's ignore Input / Output, and combine Partition and Comparison into "Shuffle"
Map: Input -> [K, V] pairs
Shuffle: [K1V1, K2V2, K1V3] -> [K1V1, K1V3], [K2V2]
Reduce: [K1V1, K1V3] -> K1R1
But that doesn't really explain it for most people
Let's make the variables mean something
For a given text, how many times does each word appear?
Work It Harder, Make It Better Do It Faster, Makes Us Stronger More Than Ever Hour After Our Work Is Never Over
Text comes in
(work,1) (it,1) (harder,1) (make,1) (it,1) ...
comes out of the map
After the shuffle: (after,1) (do,1) (ever,1) (hour,1) (it,1),(it,1),(it,1) (is,1) ...
After the reduce: (after,1) (do,1) (ever,1) (hour,1) (it,3) (is,1) ...
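For a rough feel of the same three stages outside of Hadoop, here is a sketch using ordinary shell tools. It is only an analogy for map, shuffle, and reduce, not how Hadoop itself runs the job.

# "map": split the lyric into lowercase words and pair each with a count of 1
echo "Work It Harder, Make It Better Do It Faster, Makes Us Stronger More Than Ever Hour After Our Work Is Never Over" |
  tr '[:upper:]' '[:lower:]' |   # normalize case so "It" and "it" become the same key
  tr -d ',' |                    # strip punctuation
  tr -s ' ' '\n' |               # one word per line
  sed 's/$/ 1/' |                # attach the count: (word, 1)
  sort |                         # "shuffle": sorting brings identical keys together
  awk '{count[$1] += $2} END {for (w in count) print w, count[w]}'   # "reduce": sum per key

Hadoop does the same thing conceptually, but spreads each stage across the machines in the cluster.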
Or, if you're a visual person
So how do we actually make this happen?
There are three main types of machine involved:
Like pretty much every other cluster
How did we set it up?
Setting up Hadoop can be an arduous process.
This is what worked for Tuhin.
Add to /etc/hosts:
192.168.10.102 hadoop-master
192.168.10.103 hadoop-slave-1
You may need to reserve IPs on the router
useradd hadoop
passwd hadoop
su hadoop
ssh-keygen -t rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1
chmod 0600 ~/.ssh/authorized_keys
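As a quick sanity check (not part of the original steps), passwordless SSH can be verified from the master as the hadoop user; if either command prompts for a password, the key copy didn't take.

# Each should print the remote hostname without asking for a password
ssh hadoop@hadoop-master hostname
ssh hadoop@hadoop-slave-1 hostname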
Download latest Oracle JDK and save it in the /opt directory.
On hadoop-master, expand Java:
cd /opt
tar -zxf jdk-8u91-linux-x64.tar.gz
mv jdk1.8.0_91 jdk
scp -r jdk hadoop-slave-1:/opt
alternatives --install /usr/bin/java java /opt/jdk/bin/java 2
alternatives --config java    # select the appropriate program (/opt/jdk/bin/java)
alternatives --install /usr/bin/jar jar /opt/jdk/bin/jar 2
alternatives --install /usr/bin/javac javac /opt/jdk/bin/javac 2
alternatives --set jar /opt/jdk/bin/jar
alternatives --set javac /opt/jdk/bin/javac
export JAVA_HOME=/opt/jdk
export JRE_HOME=/opt/jdk/jre
export PATH=$PATH:/opt/jdk/bin:/opt/jdk/jre/bin
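To confirm the Java setup took effect, a fresh shell (so the exports above are loaded) should resolve to the JDK under /opt/jdk:

java -version                 # should report 1.8.0_91
readlink -f "$(which java)"   # should end in /opt/jdk/bin/java via the alternatives link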
On hadoop-master:
cd /opt
wget http://www.eu.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
tar -zxf hadoop-2.7.2.tar.gz
rm hadoop-2.7.2.tar.gz
mv hadoop-2.7.2 hadoop
scp -r hadoop hadoop-slave-1:/opt
export HADOOP_PREFIX=/opt/hadoop
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX
export PATH=$PATH:$HADOOP_PREFIX/sbin:$HADOOP_PREFIX/bin
In /opt/hadoop/etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:9000/</value>
  </property>
</configuration>
chown hadoop /opt/hadoop/ -R
chgrp hadoop /opt/hadoop/ -R
mkdir /home/hadoop/datanode
chown hadoop /home/hadoop/datanode/
chgrp hadoop /home/hadoop/datanode/
In /opt/hadoop/etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/datanode</value>
  </property>
</configuration>
mkdir /home/hadoop/namenode
chown hadoop /home/hadoop/namenode/
chgrp hadoop /home/hadoop/namenode/
Also add to hdfs-site.xml:
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/hadoop/namenode</value>
</property>
In /opt/hadoop/etc/hadoop/mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
In /opt/hadoop/etc/hadoop/yarn-site.xml:
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop-master</value>
  </property>
  <property>
    <name>yarn.nodemanager.hostname</name>
    <value>hadoop-slave-1</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
In /opt/hadoop/etc/hadoop/slaves:
hadoop-master
hadoop-slave-1
systemctl stop firewalld
Add to /etc/sysctl.conf:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
su hadoop
hdfs namenode -format
start-dfs.sh
start-yarn.sh
Now jps should show the NodeManagers on all nodes and one ResourceManager on hadoop-master.
The hadoop-master node runs a ResourceManager and a NodeManager (YARN), plus a NameNode and a DataNode (HDFS).
hadoop-slave-1 runs a NodeManager and a DataNode.
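Beyond jps, the web UIs are another easy check. These are the stock Hadoop 2.x ports, assuming the configs above didn't override them:

curl -sI http://hadoop-master:50070 | head -n 1   # NameNode web UI (HDFS)
curl -sI http://hadoop-master:8088  | head -n 1   # ResourceManager web UI (YARN)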
hdfs dfsadmin -safemode leave    # Turn off safe mode
hdfs dfs -mkdir /input           # Create an input folder in HDFS
hdfs dfs -copyFromLocal test.txt /input    # Create test.txt locally first (any contents), then copy it into HDFS
hdfs dfs -cat /input/test.txt | head       # Show the contents of test.txt from HDFS
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input/test.txt /output1    # Run the actual test
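If the job finishes cleanly, the counts land in /output1 on HDFS. Something like the following shows them; part-r-00000 is the usual name for a single reducer's output, though it can vary:

hdfs dfs -ls /output1                        # _SUCCESS marker plus the reducer output file(s)
hdfs dfs -cat /output1/part-r-00000 | head   # first few (word, count) lines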
Hadoop is cool
But many things use it to be even cooler
Want to use SQL?
Want to use more of Google's papers, like BigTable?
Want a way to analyze data that will make your data scientists happy?
And many, many more!
Hadoop is powerful
Use it wisely
And try to use the right tool for the job