hadoop-intro-talk

On Github KevinGreene / hadoop-intro-talk

Tuhin Mitra

Kevin Greene

A brief history

Doug Cutting had a vision.

To search the entire internet on an open platform.

Worked on Lucene, an open source search engine.

In a way, he was one of the first people to really envision GIFEE (Google's Infrastructure For Everyone Else).

Lucene gives you a way to search text.

Solr and Elasticsearch both heavily utilize Lucene.

Once you can search text, to search the web you also need...

A way to crawl the web.

Enter Nutch, which was designed to be fast!

But it was slow and didn't scale.

...meanwhile, at Google

Two very influential papers came out

The first described the Google File System (GFS), Google's successfully running file system

Dealt with fault tolerance and component failures

Dealt with huge files

The second described MapReduce, Google's successfully running computation engine

Used to power Google's web crawler

Doug Cutting wrote open source versions of both, and called the result Hadoop, after his son's favorite toy, a stuffed elephant.

A Decade Later...

Hadoop still alive and kicking

HDFS and MapReduce still heavily used

The YARN scheduler (Yet Another Resource Negotiator) was added to the core project

Many, many, many subprojects exist

MapReduce

How Many Steps?

Map

Reduce

Partition

Comparison

Input

Output

Let's ignore Input / Output, and combine Partition and Comparison into "Shuffle"

Map

Input -> KV

Shuffle (Partition + Comparison)

[K1V1, K2V2, K1V3] -> [K1V1, K1V3], [K2V2]

Reduce

[K1V1, K1V3] -> K1R1

But that doesn't really explain it for most people

Let's make the variables mean something

Word Count

Problem

For a given text, how many times does each word appear?

Corpus

Work It Harder, Make It Better
Do It Faster, Makes Us Stronger
More Than Ever Hour After
Our Work Is Never Over

Map

Text comes in

(work,1)
(it,1)
(harder,1)
(make,1)
(it,1)
...

comes out

Shuffle

(after,1)
(do,1)
(ever,1)
(hour,1)
(it,1),(it,1),(it,1)
(is,1)
...

Reduce

(after,1)
(do,1)
(ever,1)
(hour,1)
(it,3)
(is,1)
...
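In Hadoop's Java API, the map and reduce steps above look roughly like the following. This is a lightly trimmed sketch of the WordCount example that ships with Hadoop:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: a line of text comes in; (word, 1) pairs come out
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);      // e.g. (work, 1)
            }
        }
    }

    // Reduce: (word, [1, 1, 1]) comes in; (word, 3) comes out
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);        // e.g. (it, 3)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that you never write the shuffle yourself: between the map and reduce phases, Hadoop groups and sorts all the (word, 1) pairs by key.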

Or, if you're a visual person: [diagram of the word count job flowing through map, shuffle, and reduce]

So how do we actually make this happen?

A Hadoop Cluster

There are three main types of machine involved:

  • Client
  • Master
  • Worker / Slave

Like pretty much every other cluster

Clients

  • Load data into the system
  • Retrieve data from the system
  • Manage the Input / Output of MapReduce (example commands below)
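For example (illustrative file names; hdfs dfs is the real client command):

hdfs dfs -put data.txt /input                  # load data into the cluster
hdfs dfs -get /output/part-r-00000 result.txt  # retrieve results from the cluster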

Master - NameNode

  • Tracks files and data
  • Does not store data itself
  • Single point of failure in Hadoop 1.x

Master - SecondaryNameNode

  • Does housekeeping for the NameNode: periodically merges the edit log into the filesystem image
  • Not a hot standby. "Secondary" is very misleading.

Master (ish) - Standby NameNode

  • Introduced in Hadoop 2.0
  • Provides failover capabilities
  • Makes Hadoop truly highly available (see the configuration sketch after this list)
  • Cannot be used together with a SecondaryNameNode
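For a taste, HA is wired up in hdfs-site.xml along these lines. This is only a sketch: mycluster, nn1, nn2, and the host hadoop-standby are placeholder names, and a real deployment also needs shared edit storage (JournalNodes) plus ZooKeeper for automatic failover:

<property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
</property>
<property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>hadoop-master:9000</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>hadoop-standby:9000</value>
</property>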

Master - ResourceManager

  • Allocates the cluster resources

Master - Application Master

  • Responsible for the execution of a single application
  • Contains application logic
  • Framework specific

Worker - DataNode

  • Stores data

Worker - NodeManager

  • Communicates with the ResourceManager
  • Reports what resources the machine can offer

How did we set it up?

Demo Time

Setting up Hadoop can be an arduous process.

This is what worked for Tuhin.

Edit /etc/hosts

192.168.10.102 hadoop-master
192.168.10.103 hadoop-slave-1

You may need to reserve IPs on the router

Create user hadoop

useradd hadoop
passwd hadoop

Set up key-based (passwordless) login:

su hadoop
ssh-keygen -t rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1
chmod 0600 ~/.ssh/authorized_keys
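A quick sanity check that key-based login works (it should not prompt for a password):

ssh hadoop@hadoop-slave-1 hostname   # should print hadoop-slave-1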

Install Oracle JDK

Download latest Oracle JDK and save it in the /opt directory.

On hadoop-master, unpack the JDK:

cd /opt
tar -zxf jdk-8u91-linux-x64.tar.gz
mv jdk1.8.0_91 jdk

Propagate /opt/jdk to all the slaves

scp -r jdk hadoop-slave-1:/opt

Use the alternatives tool

alternatives --install /usr/bin/java java /opt/jdk/bin/java 2
alternatives --config java # select appropriate program (/opt/jdk/bin/java)
alternatives --install /usr/bin/jar jar /opt/jdk/bin/jar 2
alternatives --install /usr/bin/javac javac /opt/jdk/bin/javac 2
alternatives --set jar /opt/jdk/bin/jar
alternatives --set javac /opt/jdk/bin/javac

Edit /etc/bashrc

export JAVA_HOME=/opt/jdk
export JRE_HOME=/opt/jdk/jre
export PATH=$PATH:/opt/jdk/bin:/opt/jdk/jre/bin
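Reload the file and confirm the right JDK is picked up:

source /etc/bashrc
java -version   # should report the JDK under /opt/jdk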

Install Hadoop 2.7.2

On hadoop-master:

cd /opt
wget http://www.eu.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
tar -zxf hadoop-2.7.2.tar.gz
rm hadoop-2.7.2.tar.gz
mv hadoop-2.7.2 hadoop

Propagate /opt/hadoop to slave nodes

scp -r hadoop hadoop-slave-1:/opt

Add the following environment variables to /home/hadoop/.bashrc on every node:

export HADOOP_PREFIX=/opt/hadoop
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX
export PATH=$PATH:$HADOOP_PREFIX/sbin:$HADOOP_PREFIX/bin
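Then confirm the Hadoop binaries are on the PATH:

source /home/hadoop/.bashrc
hadoop version   # should print Hadoop 2.7.2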

Configure /opt/hadoop/etc/hadoop/core-site.xml (NameNode URI) on all nodes:

<configuration>
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:9000/</value>
</property>
</configuration>

Create HDFS DataNode data folders on every node and make hadoop the owner of /opt/hadoop

chown hadoop /opt/hadoop/ -R
chgrp hadoop /opt/hadoop/ -R
mkdir /home/hadoop/datanode
chown hadoop /home/hadoop/datanode/
chgrp hadoop /home/hadoop/datanode/

Configure /opt/hadoop/etc/hadoop/hdfs-site.xml (DataNodes). Note that dfs.replication counts total copies of each block; with only two DataNodes in this cluster, a value of 2 is the most that can actually be satisfied.

<configuration>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
<property>
   <name>dfs.datanode.data.dir</name>
   <value>/home/hadoop/datanode</value>
</property>
</configuration>

Create the HDFS NameNode data folder on hadoop-master and make hadoop the owner and group:

mkdir /home/hadoop/namenode
chown hadoop /home/hadoop/namenode/
chgrp hadoop /home/hadoop/namenode/

Configure /opt/hadoop/etc/hadoop/hdfs-site.xml on hadoop-master.

<property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/namenode</value>
</property>

Configure /opt/hadoop/etc/hadoop/mapred-site.xml on hadoop-master.

<configuration>
 <property>
  <name>mapreduce.framework.name</name>
   <value>yarn</value>
 </property>
</configuration>

Configure /opt/hadoop/etc/hadoop/yarn-site.xml (ResourceManager and NodeManagers):

<configuration>
<property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop-master</value>
</property>
<property>
        <name>yarn.nodemanager.hostname</name>
        <value>hadoop-slave-1</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
</configuration>

Configure /opt/hadoop/etc/hadoop/slaves on master

hadoop-master
hadoop-slave-1

Disable firewall and IPv6

systemctl stop firewalld
systemctl disable firewalld   # keep it off across reboots

Add to /etc/sysctl.conf:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
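Apply the settings without a reboot:

sysctl -p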

Format NameNode

su hadoop
hdfs namenode -format

Start HDFS (as user hadoop):

start-dfs.sh

Start YARN on hadoop-master:

start-yarn.sh

Now jps should show a NodeManager on every node and one ResourceManager on hadoop-master.

The hadoop-master node runs a ResourceManager and NodeManager (YARN), plus a NameNode and DataNode (HDFS).

hadoop-slave-1 runs a NodeManager and a DataNode.
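To verify, run jps on each node (process IDs vary, and a SecondaryNameNode may also appear depending on configuration):

jps   # on hadoop-master, expect: NameNode, DataNode, ResourceManager, NodeManager, Jps
jps   # on hadoop-slave-1, expect: DataNode, NodeManager, Jps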

Test

hdfs dfsadmin -safemode leave              # Turn off safe mode, if the NameNode is still in it
hdfs dfs -mkdir /input                     # Create an input folder
hdfs dfs -copyFromLocal test.txt /input    # Copy test.txt (create it first, with any contents) into HDFS
hdfs dfs -cat /input/test.txt | head       # Show the contents of test.txt from HDFS
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input/test.txt /output1   # Run the word count job
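When the job finishes, the results land in /output1 as one part file per reducer (part-r-00000 is Hadoop's standard name for the first one):

hdfs dfs -ls /output1                        # _SUCCESS marker plus part files
hdfs dfs -cat /output1/part-r-00000 | head   # show the word counts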

Hadoop is cool

But many other projects build on it to be even cooler.

Want to use SQL? Hive has you covered.

Want to use more of Google's papers, like BigTable? That's HBase.

Want a way to analyze data that will make your data scientists happy? Look at Spark.

And many, many more!

Hadoop is powerful

Use it wisely

And try to use the right tool for the job

Questions?

Tuhin Mitra

Kevin Greene