Tuhin Mitra
Kevin Greene
Doug Cutting had a vision.
To search the entire internet on an open platform.
Worked on open source search engine.
In a way, he was one of the first people to really envision GIFEE (Google's Infrastructure For Everyone Else)
Lucene: a way to search text
Solr and Elasticsearch both heavily utilize Lucene
When you want to search the web, you need...
A way to crawl the web
Nutch was designed to be fast!
But it was slow and didn't scale
Two very influential papers came out
The Google File System (2003): described Google's production file system
Dealt with fault tolerance and component failures
Dealt with huge files
MapReduce (2004): described Google's production computation engine
Used to power Google's web crawler
Doug Cutting wrote open source versions of both, and called the result Hadoop, after his son's favorite toy.
Hadoop still alive and kicking
HDFS and MapReduce still heavily used
The YARN Scheduler was added to the core project
Many, many, many subprojects exist
Map
Reduce
Partition
Comparison
Input
Output
Let's ignore Input / Output, and combine Partition and Comparison into "Shuffle"
Map: Input -> [K, V] pairs
Shuffle: [K1V1, K2V2, K1V3] -> [K1V1, K1V3], [K2V2]
Reduce: [K1V1, K1V3] -> K1R1
But that doesn't really explain it for most people
Let's make the variables mean something
For a given text, how many times does each word appear?
Work It Harder, Make It Better Do It Faster, Makes Us Stronger More Than Ever Hour After Our Work Is Never Over
Text comes in
(work,1) (it,1) (harder,1) (make,1) (it,1) ...
comes out of the map
After the shuffle: (after,1) (do,1) (ever,1) (hour,1) (it,1),(it,1),(it,1) (is,1) ...
After the reduce: (after,1) (do,1) (ever,1) (hour,1) (it,3) (is,1) ...
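For a rough feel of the same three stages outside of Hadoop, here is a sketch using ordinary shell tools. It is only an analogy for map, shuffle, and reduce, not how Hadoop itself runs the job.

# "map": split the lyric into lowercase words and pair each with a count of 1
echo "Work It Harder, Make It Better Do It Faster, Makes Us Stronger More Than Ever Hour After Our Work Is Never Over" |
  tr '[:upper:]' '[:lower:]' |   # normalize case so "It" and "it" become the same key
  tr -d ',' |                    # strip punctuation
  tr -s ' ' '\n' |               # one word per line
  sed 's/$/ 1/' |                # attach the count: (word, 1)
  sort |                         # "shuffle": sorting brings identical keys together
  awk '{count[$1] += $2} END {for (w in count) print w, count[w]}'   # "reduce": sum per key

Hadoop does the same thing conceptually, but spreads each stage across the machines in the cluster.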
Or, if you're a visual person
So how do we actually make this happen?
There are three main types of machine involved:
Like pretty much every other cluster
How did we set it up?
Setting up Hadoop can be an arduous process.
This is what worked for Tuhin.
Add to /etc/hosts:
192.168.10.102 hadoop-master
192.168.10.103 hadoop-slave-1
You may need to reserve IPs on the router
useradd hadoop
passwd hadoop
su hadoop
ssh-keygen -t rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1
chmod 0600 ~/.ssh/authorized_keys
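As a quick sanity check (not part of the original steps), passwordless SSH can be verified from the master as the hadoop user; if either command prompts for a password, the key copy didn't take.

# Each should print the remote hostname without asking for a password
ssh hadoop@hadoop-master hostname
ssh hadoop@hadoop-slave-1 hostname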
Download latest Oracle JDK and save it in the /opt directory.
On hadoop-master, expand Java:
cd /opt
tar -zxf jdk-8u91-linux-x64.tar.gz
mv jdk1.8.0_91 jdk
scp -r jdk hadoop-slave-1:/opt
alternatives --install /usr/bin/java java /opt/jdk/bin/java 2
alternatives --config java    # select the appropriate program (/opt/jdk/bin/java)
alternatives --install /usr/bin/jar jar /opt/jdk/bin/jar 2
alternatives --install /usr/bin/javac javac /opt/jdk/bin/javac 2
alternatives --set jar /opt/jdk/bin/jar
alternatives --set javac /opt/jdk/bin/javac
export JAVA_HOME=/opt/jdk
export JRE_HOME=/opt/jdk/jre
export PATH=$PATH:/opt/jdk/bin:/opt/jdk/jre/bin
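To confirm the Java setup took effect, a fresh shell (so the exports above are loaded) should resolve to the JDK under /opt/jdk:

java -version                 # should report 1.8.0_91
readlink -f "$(which java)"   # should end in /opt/jdk/bin/java via the alternatives link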
On hadoop-master:
cd /opt
wget http://www.eu.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
tar -zxf hadoop-2.7.2.tar.gz
rm hadoop-2.7.2.tar.gz
mv hadoop-2.7.2 hadoop
scp -r hadoop hadoop-slave-1:/opt
export HADOOP_PREFIX=/opt/hadoop
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX
export PATH=$PATH:$HADOOP_PREFIX/sbin:$HADOOP_PREFIX/bin
In /opt/hadoop/etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:9000/</value>
  </property>
</configuration>
chown hadoop /opt/hadoop/ -R
chgrp hadoop /opt/hadoop/ -R
mkdir /home/hadoop/datanode
chown hadoop /home/hadoop/datanode/
chgrp hadoop /home/hadoop/datanode/
In /opt/hadoop/etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/datanode</value>
  </property>
</configuration>
mkdir /home/hadoop/namenode
chown hadoop /home/hadoop/namenode/
chgrp hadoop /home/hadoop/namenode/
Also add to hdfs-site.xml:
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/hadoop/namenode</value>
</property>
In /opt/hadoop/etc/hadoop/mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
In /opt/hadoop/etc/hadoop/yarn-site.xml:
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop-master</value>
  </property>
  <property>
    <name>yarn.nodemanager.hostname</name>
    <value>hadoop-slave-1</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
In /opt/hadoop/etc/hadoop/slaves:
hadoop-master
hadoop-slave-1
systemctl stop firewalld
Add to /etc/sysctl.conf:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
su hadoop
hdfs namenode -format
start-dfs.sh
start-yarn.sh
Now jps should show the NodeManagers on all nodes and one ResourceManager on hadoop-master.
The hadoop-master node runs a ResourceManager and a NodeManager (YARN), plus a NameNode and a DataNode (HDFS).
hadoop-slave-1 runs a NodeManager and a DataNode.
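Beyond jps, the web UIs are another easy check. These are the stock Hadoop 2.x ports, assuming the configs above didn't override them:

curl -sI http://hadoop-master:50070 | head -n 1   # NameNode web UI (HDFS)
curl -sI http://hadoop-master:8088  | head -n 1   # ResourceManager web UI (YARN)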
hdfs dfsadmin -safemode leave    # Turn off safe mode
hdfs dfs -mkdir /input           # Create an input folder in HDFS
hdfs dfs -copyFromLocal test.txt /input    # Create test.txt locally first (any contents), then copy it into HDFS
hdfs dfs -cat /input/test.txt | head       # Show the contents of test.txt from HDFS
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input/test.txt /output1    # Run the actual test
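If the job finishes cleanly, the counts land in /output1 on HDFS. Something like the following shows them; part-r-00000 is the usual name for a single reducer's output, though it can vary:

hdfs dfs -ls /output1                        # _SUCCESS marker plus the reducer output file(s)
hdfs dfs -cat /output1/part-r-00000 | head   # first few (word, count) lines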
Hadoop is cool
But many things use it to be even cooler
Want to use SQL?
Want to use more of Google's papers, like BigTable?
Want a way to analyze data that will make your data scientists happy?
And many, many more!
Hadoop is powerful
Use it wisely
And try to use the right tool for the job