
This presentation summarizes Big Data concepts, related technologies, the open source and commercial tools used in Big Data processing, and the challenges in this area.

On Github accavdar / BigDataDemystified

Big Data

Demystified

December, 2013

Abdullah Cetin CAVDAR / @accavdar

Concepts

What's Big Data

Big data is data that exceeds the processing capacity of conventional database systems.

4V's of Big Data

  • Volume: the amount of data is bigger than ever
  • Velocity: data arrives faster than ever, e.g. complex real-time data
  • Variety: data comes in ever more forms, e.g. unstructured video and voice
  • Variability: data types and formats also differ widely

Multitude of Data Types

  • Structured, Semi-structured and Unstructured
    • Demographic, psychographic, transactional
    • Call center data, social media data, web log data, sensor networks
    • Requires new storage mechanisms (Hadoop, NoSQL, ...)
    • High dimensionality

Different Sectors

Challenge

The challenge in big data analytics is to dig deeply, quickly (in real time?) and widely

Real Challenge: Stream Mining?

Stream Mining Usages

Generate Summaries

Properly constructed summaries are useful for highlighting emerging patterns, trends, and anomalies. Common summaries include a list of distinct items, recently trending items, heavy hitters (items that have appeared frequently), and the top k (most popular) items
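As a small, hedged illustration (not from the original slides): the heavy-hitter and top-k summaries above can be approximated in one pass with a bounded set of counters, for example the Misra-Gries algorithm. The stream and function name below are made up for the example.

from collections import Counter

def misra_gries(stream, k):
    # Approximate heavy hitters: keep at most k-1 counters, so any item
    # occurring more than len(stream)/k times is guaranteed to survive.
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter and drop the ones that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Hypothetical stream of items (e.g. hashtags)
stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a", "b"]
print(misra_gries(stream, k=3))        # candidate heavy hitters
print(Counter(stream).most_common(3))  # exact top-3, for comparison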

Approximate Answers

In the streaming context, summaries are constantly changing, so in many applications approximate answers are acceptable

Sampling from a Data Stream

Building a random sample from data that’s continuously arriving isn’t straightforward (note that in a stream you iterate over the data set only once)
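One standard single-pass answer is reservoir sampling; the sketch below is my illustration (not part of the deck) and keeps a uniform random sample of fixed size k while touching each item exactly once.

import random

def reservoir_sample(stream, k):
    # Uniform random sample of k items from a stream of unknown length
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1)
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))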

Data Mining

Beyond simple summaries, advanced algorithms for data mining have drawn interest

Application Areas

Important stream mining applications include:

  • identifying trends (discover “trending topics”)
  • anomaly detection (“alerts”)
  • correlations
  • clustering
  • classification

Purpose

You have to be able to reliably consume the data, normalize it, merge it with other data, apply functions to it, store it, query it, and distribute it

Cyber Security?

NoSQL

NoSQL

  • Not Only SQL
  • Usually do not require a fixed table schema nor do they use the concept of joins
  • Offers BASE instead of ACID (Atomicity, Consistency, Isolation, Durability)
    • Basically Available
    • Soft state (scalable)
    • Eventual Consistency

What's wrong with RDBMS?

  • One size fits all? Not really
  • Rigid schema design
  • Harder to scale
  • Replication
  • Joins across multiple nodes? Hard
  • How does an RDBMS handle data growth? Hard

Advantages

  • Flexible schema
  • Massive scalability
  • Eventual consistency
    • higher performance
    • availability

Disadvantages

  • No declarative query language
    • Requires more programming
  • Eventual consistency
    • Fewer guarantees

NoSQL Systems

  • Column Family
  • Key-Value Stores
  • Document Stores
  • Graph Databases

Column Family

  • Each storage block contains data from only one column
  • More efficient than row (or document) store if:
    • Multiple rows/records/documents are inserted at the same time, so updates to column blocks can be aggregated
    • Retrievals access only some of the columns in a row/record/document
  • HBase, Cassandra, ...
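A toy sketch (mine, not from the slides) of why column-oriented blocks pay off when a query touches only a few columns: the same records stored row-wise and column-wise.

# Hypothetical records
rows = [
    {"id": 1, "name": "alice", "age": 30, "city": "ankara"},
    {"id": 2, "name": "bob",   "age": 25, "city": "izmir"},
]

# Row store: each block holds whole records
row_store = rows

# Column family style: each block holds the values of a single column
column_store = {col: [r[col] for r in rows] for col in rows[0]}

# A query that only needs "age" reads one column block...
print(column_store["age"])              # [30, 25]
# ...while a row store has to scan every full record
print([r["age"] for r in row_store])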

Key-Value Stores

  • Extremely simple Interface
    • Data model: (key, value) pairs
  • Operations
    • Insert(key, value)
    • Fetch(key)
    • Update(key)
    • Delete(key)
  • Implementation: efficiency, scalability, fault-tolerance
    • Records distributed to nodes based on key
    • Replication
    • Single-record transactions, eventual consistency
  • Amazon Dynamo, Redis, Riak, ...
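A minimal in-memory sketch of the (key, value) interface listed above; real systems such as Redis or Riak add persistence, replication, and key-based partitioning on top of it. The class and key names are made up.

class KeyValueStore:
    # Toy key-value store exposing the four operations from the slide
    def __init__(self):
        self._data = {}

    def insert(self, key, value):
        self._data[key] = value

    def fetch(self, key):
        return self._data.get(key)

    def update(self, key, value):
        if key in self._data:
            self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.insert("user:42", "Abdullah")
print(store.fetch("user:42"))  # Abdullah
store.delete("user:42")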

Document Stores

  • Like key-value stores, except the value is a document
    • Data model: (key, document) pairs
    • Document: JSON, XML, other semi-structured formats
  • Operations
    • Insert(key, document)
    • Fetch(key)
    • Update(key)
    • Delete(key)
    • Also fetch based on document contents
  • Apache CouchDB, MongoDB, Elastic Search, ...
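The same idea with documents as values, again as a toy sketch rather than a real driver API: the main difference from a plain key-value store is the content-based fetch in the last bullet.

class DocumentStore:
    # Toy document store: values are JSON-like documents
    def __init__(self):
        self._docs = {}

    def insert(self, key, document):
        self._docs[key] = document

    def fetch(self, key):
        return self._docs.get(key)

    def fetch_where(self, field, value):
        # "Also fetch based on document contents"
        return [d for d in self._docs.values() if d.get(field) == value]

store = DocumentStore()
store.insert("u1", {"name": "alice", "city": "ankara"})
store.insert("u2", {"name": "bob", "city": "ankara"})
print(store.fetch("u1"))
print(store.fetch_where("city", "ankara"))  # both documents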

Graph Databases

  • A graph is a collection of nodes (things) and edges (relationships) that connect pairs of nodes
  • Attach properties (key-value pairs) to nodes and relationships
  • Relationships connect two nodes, and both nodes and relationships can hold an arbitrary number of key-value pairs
  • A graph database can be thought of as a key-value store, with full support for relationships
  • Neo4j, FlockDB, Pregel, ...
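A small sketch (mine) of the property-graph data model described above: nodes and relationships that each carry key-value properties.

# Nodes with properties
nodes = {
    "alice": {"label": "Person", "age": 30},
    "acme":  {"label": "Company", "sector": "software"},
}

# Relationships: (from_node, to_node, type, properties)
edges = [
    ("alice", "acme", "WORKS_AT", {"since": 2011, "role": "engineer"}),
]

# Traverse: where does alice work, and since when?
for src, dst, rel, props in edges:
    if src == "alice" and rel == "WORKS_AT":
        print(dst, nodes[dst]["label"], props["since"])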

Batch Processing

Batch Processing

  • Batch data processing is an efficient way of processing high volumes of data, where a group of transactions is collected over a period of time
  • Data is collected, entered, and processed, and then the batch results are produced
  • Data is processed by worker nodes and managed by a master node
  • Analytical queries can be executed in a distributed fashion on offline big data: user queries are converted to MapReduce jobs that run on the cluster
    • Hive: supports a SQL-like query language
    • Pig: more like a scripting language

Map-Reduce

  • MapReduce is a programming model and software framework first developed by Google
  • A method for distributing a task across multiple nodes
  • Each node processes data assigned to that node
  • Consists of two developer-created phases
    • Map
    • Reduce
  • Features of MapReduce systems include:
    • Reliably processing a job even when machines die
    • Parallelization on thousands of machines

Map

Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node

function map(String name, String document) {
    for each word w in document {
        emit(w, 1)  // emit(key, value)
    }
}

Reduce

Reduce step: After the map step, the master node collects the answers to all the sub-problems and groups all values by their keys. Each unique key and its values are sent to a distinct reducer, so each reducer processes one key and its values independently of the others

function reduce(String word, Iterator partialCounts) {
    sum = 0;
    for each pc in partialCounts {
        sum += ParseInt(pc);
    }
    emit(word, sum);
}
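To make the two phases concrete, here is a runnable single-machine simulation of the word-count example (plain Python, not Hadoop itself): map emits (word, 1) pairs, a shuffle step groups values by key, and reduce sums each group.

from collections import defaultdict

def map_phase(name, document):
    # Emit (word, 1) for every word, as in the map pseudocode above
    return [(word, 1) for word in document.split()]

def reduce_phase(word, partial_counts):
    # Sum the partial counts for one key, as in the reduce pseudocode above
    return (word, sum(partial_counts))

documents = {"doc1": "big data is big", "doc2": "data is everywhere"}

# Shuffle: group every emitted value under its key
grouped = defaultdict(list)
for name, doc in documents.items():
    for key, value in map_phase(name, doc):
        grouped[key].append(value)

# Reduce each key independently
print(dict(reduce_phase(w, counts) for w, counts in grouped.items()))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}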

MapReduce Overview

Hadoop

  • Hadoop is a scalable fault-tolerant distributed system for data storage and processing
  • Core Hadoop has two main components:
    • HDFS: Hadoop Distributed File System is a reliable, self-healing, redundant, high-bandwidth, distributed file system optimized for large files
    • MapReduce: Fault-tolerant distributed processing
  • Operates on unstructured and structured data

HDFS

  • Scalable, distributed, portable file system written in Java for Hadoop framework
    • Primary distributed storage used by Hadoop applications
  • HDFS can be part of a Hadoop cluster or can be a stand-alone general purpose distributed file system
  • An HDFS cluster primarily consists of
    • NameNode that manages file system metadata
    • DataNode that stores actual data
  • Stores very large files in blocks across machines in a large cluster
    • Reliability and fault tolerance ensured by replicating data across multiple hosts
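As a rough illustration of the last two bullets (my sketch, not HDFS code): splitting a file into fixed-size blocks and placing each block on several nodes is what gives HDFS its reliability. The block size and replication factor mirror common Hadoop defaults (64 MB blocks in older versions, 3 replicas), but the node names are made up.

def split_into_blocks(data, block_size):
    # Split a file's bytes into fixed-size blocks
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=3):
    # Assign each block to `replication` distinct nodes
    return {
        i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i in range(len(blocks))
    }

data = b"x" * 300
blocks = split_into_blocks(data, block_size=128)
print(len(blocks))  # 3 blocks
print(place_replicas(blocks, ["node1", "node2", "node3", "node4"]))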

Analytical Queries over Offline Big Data

  • Hadoop is great for large-data processing
    • But writing Java programs for everything is verbose and slow
    • Not everyone wants to (or can) write Java code
  • Solution: develop higher-level data processing languages
    • Hive: HQL is like SQL
    • Pig: Pig Latin is a bit like Perl

Hive

  • A system for querying and managing structured data built on top of Hadoop
  • Structured data with rich data types (structs, lists and maps)
  • Directly query data from different formats (text/binary) and file formats (Flat/Sequence)
  • SQL as a familiar programming tool and for standard analytics
  • Allows embedded scripts for extensibility and for non-standard applications
  • Rich MetaData to allow data discovery and for optimization

Pig

  • A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
  • Compiles down to MapReduce jobs
  • Common design patterns as keywords (joins, distinct, counts)
  • Data flow analysis
    • A script can map to multiple map-reduce jobs
  • Can be used in interactive mode
    • Issue commands and get results

Real Time Data Processing

Real Time Data Processing

  • In contrast, real-time data processing involves the continual input, processing, and output of data
  • The size of the data is not known at the beginning of processing
  • Data must be processed in a small time period (in or near real time)
  • Storm is one of the most popular open source implementations of real-time data processing

Storm

  • Storm is a highly distributed real-time computation system
  • It's scalable and fault-tolerant
  • Can be used with any programming language

Architecture

Architecture Components

  • Nimbus
    • Master node (similar to Hadoop JobTracker)
    • Responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures
  • Zookeeper
    • Used for cluster coordination between Nimbus and Supervisors
    • Nimbus and Supervisors are fail-fast and stateless; all state is kept in Zookeeper or on local disk
  • Supervisor
    • Run worker processes
    • The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it

Storm Topology

Main Concepts

  • Streams
    • Unbounded sequences of tuples (input or output)
  • Spouts
    • Sources of streams
    • e.g. reading data from the Twitter Streaming API
  • Bolts
    • Consume input streams, do some processing, and possibly emit new streams
  • Topologies
    • Networks of spouts and bolts
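A conceptual sketch (plain Python generators, not Storm's actual API) of how these pieces relate: a spout produces an unbounded stream of tuples, bolts transform it, and a topology wires them together.

import itertools

def tweet_spout():
    # Spout: source of an (in principle unbounded) stream of tuples
    for i in itertools.count():
        yield {"tweet": f"hello storm {i}"}

def split_words_bolt(stream):
    # Bolt: consume tuples and emit one new tuple per word
    for tup in stream:
        for word in tup["tweet"].split():
            yield {"word": word}

def count_bolt(stream):
    # Bolt: keep running word counts and emit them
    counts = {}
    for tup in stream:
        counts[tup["word"]] = counts.get(tup["word"], 0) + 1
        yield {"word": tup["word"], "count": counts[tup["word"]]}

# Topology: spout -> split bolt -> count bolt (take a few tuples for the demo)
topology = count_bolt(split_words_bolt(tweet_spout()))
for output in itertools.islice(topology, 6):
    print(output)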

Slides

https://github.com/accavdar/BigDataDemystified

THE END

by Abdullah Cetin CAVDAR / @accavdar