Big Data – What is Big Data? – What can be considered as Big Data?



Big Data – What is Big Data? – What can be considered as Big Data?

0 0


acm-big-data

Slides and other related materials for the Big Data lecture I had on February 22, 2014

On Github rodxavier / acm-big-data

Big Data

Using Hadoop and Map Reduce

By Rod Xavier Bondoc / @rodxavier14

University of the Philippines - Diliman | 22 February 2014 | #UPACMBigData

What is Big Data?

According to Webopedia,

“Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques.”

According to Oracle,

“Big data is the electricity of the 21st century—a new kind of power that transforms everything it touches in business, government, and private life.”

According to IBM,

What can be considered as Big Data?

Mobile Data

Server Logs

User Behavior

3Vs of Big Data

Volume Variety Velocity

Volume

This refers to the size of the data.

The price to store data has dropped over the years.

BUT, we want to store data reliably.

That's why we have Storage Area Networks(SANs)

BUT, SANs can also cause problems.

Problems with SANs

  • Expensive
  • Slow Streaming of data across a network

Variety

This refers to the fact that data come from different sources in a variety of formats.

Data come in different formats.

We are working with structured, semi-structured and non-structured data.

We don't want to throw away any data.

Velocity

This refers to the speed in which the data is created.

Hadoop

An open-source framework for large-scale data storage and data processing.

History of Hadoop

Why Hadoop?

How does Hadoop work?

Hadoop Ecosystem

Sqoop, Hue, Oozie, Mahout

Pig Hive SELECT * FROM ... Map Reduce Impala HBase Hadoop Distributed File System(HDFS)

Let's setup our Hadoop Environment!

We'll be using Cloudera's hadoop distribution

Open your VirtualBox and create a new VM.

Hadoop Distributed File System(HDFS)

This is similar to a regular filesystem.

However...

How does HDFS work?

HDFS Demo

You will notice that most HDFS commands are similar to UNIX commands.


                    

Map Reduce

This is a programming model for processing large datasets using parallel and distributed algorithms in a cluster.

A real-world scenario

Mappers and Reducers

Daemons of MapReduce

How to run a MapReduce job?

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-
streaming-2.0.0-mr1-cdh4.1.1.jar -mapper mapper.py -reducer reducer.py 
-file mapper.py -file reducer.py -input input -output output

Example

Writing our Mapper code


                    

Writing our Reducer code


                    

Testing our code

We don't want to test the code on the whole dataset.

Testing our mapper code

cat test.txt | ./mapper.py

Testing our reducer code

cat test.txt | ./mapper.py | sort | ./reducer.py

A shortcut for running a MapReduce job

hs mapper.py reducer.py input output

Wait! There's more...

Hadoop provides a web-based interface for the job tracker. It is running on port 50030.

Map Reduce Design Patterns

  • Filtering Patterns
  • Summarization Patterns
  • Structural Patterns

Filtering Patterns

These are patterns that don't change records.

These are patterns that only get a subset of the data.

Examples

  • Top N
  • Random Sampling
  • Simple Filter

Summarization Patterns

These patterns produce a summarized view of the data.

Examples

  • Numerical Summarization
  • Inverted Index

Structural Patterns

Questions?

Thank you!

Disclaimer: Not all images and videos are mine. Sources may be found in the github repo.

0