What's Big Data
Big data is data that exceeds the processing capacity of conventional
database systems.
4V's of Big Data
- Volume: the amount of data is bigger than ever
- Velocity: data is arriving faster, e.g. complex real-time data
- Variety: the range of data types is spiraling, e.g. unstructured video and voice
- Variability: data types and formats also differ
Multitude of Data Types
- Structured, semi-structured, and unstructured
- Demographic, psychographic, transactional
- Call center data, social media data, web log data, sensor networks
- Requires new storage mechanisms (Hadoop, NoSQL, ...)
- High dimensionality
Challenge
The challenge in big data analytics is to dig deeply, quickly (possibly in real time), and widely
Real Challenge: Stream Mining?
Generate Summaries
Properly constructed summaries are useful for highlighting emerging patterns, trends, and anomalies.
Common summaries include a list of distinct items, recently trending items, heavy hitters (items that have appeared frequently), and the top-k (most popular) items
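As an illustration, here is a minimal heavy-hitters sketch in Python using the Misra-Gries algorithm (one common choice; the material above does not prescribe a specific method). Keeping at most k-1 counters guarantees that every item occurring more than n/k times in a stream of n items survives in the summary.
def heavy_hitters(stream, k):
    """Misra-Gries summary: keep at most k-1 candidate counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # no room: decrement all counters and drop those that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# toy usage: "a" clearly dominates this stream
print(heavy_hitters(["a", "b", "a", "c", "a", "b", "a", "a"], k=3))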
Approximate Answers
In the streaming context summaries are constantly changing, so in many applications approximate answers are acceptable
Sampling from a Data Stream
Building a random sample from data that’s continuously arriving isn’t straightforward (note that in a
stream you iterate over the data set only once)
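A standard workaround is reservoir sampling (Algorithm R): keep a fixed-size buffer and replace its elements with decreasing probability, so every item seen so far is equally likely to be in the sample. A minimal Python sketch (the sample size k and the data are illustrative):
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of k items from a stream read exactly once."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = random.randint(0, i)    # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))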
Data Mining
Beyond simple summaries, advanced algorithms for data mining have drawn interest
Application Areas
Important stream mining applications include:
- identifying trends (discovering "trending topics")
- anomaly detection ("alerts"; see the sketch after this list)
- correlations
- clustering
- classification
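For the anomaly-detection case, one simple streaming scheme is to flag values that fall far from a running mean. The sketch below uses Welford's online update for mean and variance; the threshold, warm-up length, and data are illustrative assumptions, not something prescribed above.
import math

class ZScoreAlert:
    """Raise an alert when a value is more than `threshold` standard deviations from the running mean."""
    def __init__(self, threshold=3.0, warmup=10):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # running sum of squared deviations (Welford)
        self.threshold = threshold
        self.warmup = warmup

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        std = math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0
        return (self.n > self.warmup and std > 0
                and abs(x - self.mean) / std > self.threshold)

detector = ZScoreAlert()
for value in [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 50]:
    if detector.update(value):
        print("alert:", value)   # fires on the outlier 50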
Purpose
You have to be able to reliably consume the data, normalize it, merge it with other data, apply functions to it, store it, query it, and distribute it
Batch Processing
- Batch data processing is an efficient way of processing high volumes of data, where a group of transactions is collected over a period of time
- Data is collected, entered, and processed, and then the batch results are produced
- Data is processed by worker nodes and managed by a master node
- Analytical queries can be executed in a distributed fashion over offline big data; user queries are converted to Map-Reduce jobs on the cluster to process the data
- Hive: supports an SQL-like query language
- Pig: closer to a scripting language
Map-Reduce
- MapReduce is a programming model and software framework first developed by Google
- A method for distributing a task across multiple nodes
- Each node processes the data assigned to it
- Consists of two developer-created phases: Map and Reduce
- Features of MapReduce systems include:
- Reliably processing a job even when machines die
- Parallelization across thousands of machines
Map
Map step: The master node takes the input, divides it into smaller sub-problems,
and distributes them to
worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker
node processes the smaller problem, and passes the answer back to its master node
// name: the document name; document: the document contents
function map(String name, String document) {
  for each word w in document {
    emit(w, 1)  // emit(key, value): one count per word occurrence
  }
}
Reduce
Reduce step: After map step, the master node collects the answers of all the
sub-problems and classifies all values with their keys. Each unique key and its values are sent to a
distinct reducer. This means that each reducer processes a distinct key and its values independently from
each other
// word: a unique key; partialCounts: all values emitted under that key
function reduce(String word, Iterator partialCounts) {
  sum = 0;
  for each pc in partialCounts {
    sum += ParseInt(pc);
  }
  emit(word, sum);  // one total per word, emitted after the loop
}
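Putting the two phases together, here is a minimal single-machine sketch in Python (an illustration only, not the Hadoop API) of how a framework would run the word-count job: the shuffle step groups all map outputs by key before each key is handed to a reducer.
from collections import defaultdict

def map_fn(name, document):
    # emit one (word, 1) pair per word occurrence in the document
    for word in document.split():
        yield word, 1

def shuffle(pairs):
    # group every emitted value under its key, as the framework does between phases
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_fn(word, partial_counts):
    # sum the partial counts for a single key
    return word, sum(partial_counts)

documents = {"doc1": "big data big streams", "doc2": "big data"}
pairs = [kv for name, text in documents.items() for kv in map_fn(name, text)]
for word, counts in shuffle(pairs).items():
    print(reduce_fn(word, counts))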
Hadoop
- Hadoop is a scalable, fault-tolerant distributed system for data storage and processing
- Core Hadoop has two main components:
- HDFS: the Hadoop Distributed File System, a reliable, self-healing, redundant, high-bandwidth distributed file system optimized for large files
- MapReduce: fault-tolerant distributed processing
- Operates on unstructured and structured data
HDFS
- Scalable, distributed, portable file system written in Java for the Hadoop framework
- Primary distributed storage used by Hadoop applications
- HDFS can be part of a Hadoop cluster or can be a stand-alone general-purpose distributed file system
- An HDFS cluster primarily consists of:
- a NameNode that manages file system metadata
- DataNodes that store the actual data
- Stores very large files in blocks across machines in a large cluster
- Reliability and fault tolerance are ensured by replicating data across multiple hosts
Analytical Queries over Offline Big Data
- Hadoop is great for large-data processing
- But writing Java programs for everything is verbose and slow
- Not everyone wants to (or can) write Java code
- Solution: develop higher-level data processing languages
- Hive: HQL is like SQL
- Pig: Pig Latin is a bit like Perl
Hive
- A system for querying and managing structured data built on top of Hadoop
- Structured data with rich data types (structs, lists, and maps)
- Directly queries data in different formats (text/binary) and file formats (flat/sequence)
- SQL as a familiar programming tool and for standard analytics
- Allows embedded scripts for extensibility and for non-standard applications
- Rich metadata to allow data discovery and for optimization
Pig
- A platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs
- Compiles down to MapReduce jobs
- Common design patterns as keywords (joins, distinct, counts)
- Data flow analysis
- A script can map to multiple map-reduce jobs
- Has an interactive mode
- Issue commands and get results