On GitHub: rodxavier/acm-big-data
By Rod Xavier Bondoc / @rodxavier14
University of the Philippines - Diliman | 22 February 2014 | #UPACMBigData
Volume — this refers to the size of the data.
The price to store data has dropped over the years.
BUT, we want to store data reliably.
That's why we have Storage Area Networks (SANs).
BUT, SANs can also cause problems: they are expensive, and shipping large datasets over the network to the machines that process them becomes a bottleneck.
Variety — this refers to the fact that data come from different sources in a variety of formats.
We are working with structured, semi-structured, and unstructured data.
We don't want to throw away any data.
Velocity — this refers to the speed at which data is created.
Hadoop is an open-source framework for large-scale data storage and data processing.
Sqoop, Hue, Oozie, Mahout
Pig and Hive (e.g. SELECT * FROM ...) run on top of MapReduce; Impala and HBase work directly against the Hadoop Distributed File System (HDFS).
We'll be using Cloudera's Hadoop distribution.
Open VirtualBox and create a new VM.
HDFS is similar to a regular filesystem.
However, files are split into large blocks, and each block is replicated across several machines in the cluster for reliability.
You will notice that most HDFS commands are similar to UNIX commands.
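For instance, the basic file operations look just like their UNIX counterparts, prefixed with hadoop fs (a sketch assuming the Cloudera VM with HDFS running; the file names are illustrative):

```shell
hadoop fs -ls                           # list your HDFS home directory (cf. ls)
hadoop fs -put purchases.txt            # copy a local file into HDFS
hadoop fs -cat purchases.txt            # print a file stored in HDFS (cf. cat)
hadoop fs -mv purchases.txt input.txt   # rename/move inside HDFS (cf. mv)
hadoop fs -get input.txt local.txt      # copy a file back to the local disk
hadoop fs -rm input.txt                 # delete a file in HDFS (cf. rm)
```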
MapReduce is a programming model for processing large datasets using parallel, distributed algorithms on a cluster.
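As a sketch of what mapper.py and reducer.py might contain (hypothetical code, not the talk's actual scripts), here is the classic word count written for Hadoop Streaming, which feeds raw input lines to the mapper on stdin and the sorted mapper output to the reducer on stdin:

```python
#!/usr/bin/env python
import sys

def mapper(lines):
    # Map phase: emit one "word<TAB>1" pair per word.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()

def reducer(lines):
    # Reduce phase: input is sorted by key, so all pairs for a word
    # arrive consecutively and a running total is enough.
    current_word, count = None, 0
    for line in lines:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word and current_word is not None:
            yield "%s\t%d" % (current_word, count)
            count = 0
        current_word = word
        count += int(value)
    if current_word is not None:
        yield "%s\t%d" % (current_word, count)

if __name__ == "__main__" and sys.argv[1:]:
    # In the talk each phase lives in its own script (mapper.py, reducer.py);
    # here both share one file, selected with a "map" or "reduce" argument.
    phase = mapper if sys.argv[1] == "map" else reducer
    for out in phase(sys.stdin):
        print(out)
```

Because Hadoop sorts by key between the two phases, the reducer can rely on all counts for a word arriving one after another — which is also why the local test shown later pipes the mapper output through sort.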
A real-world scenario
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -input input -output output
We don't want to test the code on the whole dataset.
Testing our mapper code
cat test.txt | ./mapper.py
Testing our reducer code
cat test.txt | ./mapper.py | sort | ./reducer.py
hs mapper.py reducer.py input output
(hs is a shorthand provided in the Cloudera training VM that expands to the full hadoop-streaming invocation shown earlier)
Wait! There's more...
Hadoop provides a web-based interface for the JobTracker, running on port 50030.
Filtering patterns — these don't change records; they only extract a subset of the data.
Summarization patterns — these produce a summarized view of the data.
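As an illustration (hypothetical code, not from the talk), a filtering pattern is often just a mapper that emits matching records unchanged, with no reducer at all:

```python
import re

def grep_mapper(lines, pattern):
    # Filtering pattern: records pass through unchanged, but only the
    # subset whose text matches the pattern is emitted.
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line

# e.g. keep only web-log lines with an HTTP 404 status
logs = ["GET /index.html 200", "GET /missing 404", "POST /form 200"]
hits = list(grep_mapper(logs, r"\b404\b"))
```

Word count, by contrast, is a summarization pattern: the output is an aggregate view rather than a subset of the input records.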
Disclaimer: Not all images and videos are mine. Sources may be found in the github repo.