
This presentation summarizes Big Data concepts, related technologies, the open source and commercial tools used in Big Data processing, and the challenges in this area.

On Github accavdar / BigDataDemystified

Big Data

Demystified

December, 2013

Abdullah Cetin CAVDAR / @accavdar

Concepts

What's Big Data

Big data is data that exceeds the processing capacity of conventional database systems.

4V's of Big Data

  • Volume: the amount of data is bigger than ever
  • Velocity: data arrives faster than ever, e.g. complex real-time data
  • Variety: data comes in ever more forms, e.g. unstructured video and voice
  • Variability: data types and formats also differ widely

Multitude of Data Types

  • Structured, Semi-structured and Unstructured
    • Demographic, psychographic, transactional
    • Call center data, social media data, web log data, sensor networks
    • Requires new storage mechanisms (Hadoop, NoSQL, ...)
    • High dimensionality

Different Sectors

Challenge

The challenge in big data analytics is to dig deeply, quickly (in real time?) and widely

Real Challenge: Stream Mining?

Stream Mining Usages

Generate Summaries

Properly constructed summaries are useful for highlighting emerging patterns, trends, and anomalies. Common summaries include a list of distinct items, recently trending items, heavy hitters (items that have appeared frequently), and the top k (most popular) items
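As a small, hedged illustration (not from the original slides): the heavy-hitter and top-k summaries above can be approximated in one pass with a bounded set of counters, for example the Misra-Gries algorithm. The stream and function name below are made up for the example.

from collections import Counter

def misra_gries(stream, k):
    # Approximate heavy hitters: keep at most k-1 counters, so any item
    # occurring more than len(stream)/k times is guaranteed to survive.
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter and drop the ones that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Hypothetical stream of items (e.g. hashtags)
stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a", "b"]
print(misra_gries(stream, k=3))        # candidate heavy hitters
print(Counter(stream).most_common(3))  # exact top-3, for comparison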

Approximate Answers

In the streaming context, summaries are constantly changing, so in many applications approximate answers are acceptable

Sampling from a Data Stream

Building a random sample from data that’s continuously arriving isn’t straightforward (note that in a stream you iterate over the data set only once)
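One standard single-pass answer is reservoir sampling; the sketch below is my illustration (not part of the deck) and keeps a uniform random sample of fixed size k while touching each item exactly once.

import random

def reservoir_sample(stream, k):
    # Uniform random sample of k items from a stream of unknown length
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1)
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))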

Data Mining

Beyond simple summaries, advanced algorithms for data mining have drawn interest

Application Areas

Important stream mining applications include:

  • identifying trends (discover “trending topics”)
  • anomaly detection (“alerts”)
  • correlations
  • clustering
  • classification

Purpose

You have to be able to reliably consume the data, normalize it, merge it with other data, apply functions to it, store it, query it, and distribute it

Cyber Security?

NoSQL

NoSQL

  • Not Only SQL
  • Usually do not require a fixed table schema nor do they use the concept of joins
  • Offers BASE instead of ACID (Atomicity, Consistency, Isolation, Durability)
    • Basically Available
    • Soft state (scalable)
    • Eventual Consistency

What's wrong with RDBMS?

  • One size fits all? Not really
  • Rigid schema design
  • Harder to scale
  • Replication
  • Joins across multiple nodes? Hard
  • How does an RDBMS handle data growth? Hard

Advantages

  • Flexible schema
  • Massive scalability
  • Eventual consistency
    • higher performance
    • availability

Disadvantages

  • No declarative query language
    • Requires more programming
  • Eventual consistency
    • Fewer guarantees

NoSQL Systems

  • Column Family
  • Key-Value Stores
  • Document Stores
  • Graph Databases

Column Family

  • Each storage block contains data from only one column
  • More efficient than row (or document) store if:
    • Multiple rows/records/documents are inserted at the same time, so updates to column blocks can be aggregated
    • Retrievals access only some of the columns in a row/record/document
  • HBase, Cassandra, ...
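A toy sketch (mine, not from the slides) of why column-oriented blocks pay off when a query touches only a few columns: the same records stored row-wise and column-wise.

# Hypothetical records
rows = [
    {"id": 1, "name": "alice", "age": 30, "city": "ankara"},
    {"id": 2, "name": "bob",   "age": 25, "city": "izmir"},
]

# Row store: each block holds whole records
row_store = rows

# Column family style: each block holds the values of a single column
column_store = {col: [r[col] for r in rows] for col in rows[0]}

# A query that only needs "age" reads one column block...
print(column_store["age"])              # [30, 25]
# ...while a row store has to scan every full record
print([r["age"] for r in row_store])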

Key-Value Stores

  • Extremely simple Interface
    • Data model: (key, value) pairs
  • Operations
    • Insert(key, value)
    • Fetch(key)
    • Update(key)
    • Delete(key)
  • Implementation: efficiency, scalability, fault-tolerance
    • Records distributed to nodes based on key
    • Replication
    • Single-record transactions, eventual consistency
  • Amazon Dynamo, Redis, Riak, ...
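A minimal in-memory sketch of the (key, value) interface listed above; real systems such as Redis or Riak add persistence, replication, and key-based partitioning on top of it. The class and key names are made up.

class KeyValueStore:
    # Toy key-value store exposing the four operations from the slide
    def __init__(self):
        self._data = {}

    def insert(self, key, value):
        self._data[key] = value

    def fetch(self, key):
        return self._data.get(key)

    def update(self, key, value):
        if key in self._data:
            self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.insert("user:42", "Abdullah")
print(store.fetch("user:42"))  # Abdullah
store.delete("user:42")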

Document Stores

  • Like key-value stores, except the value is a document
    • Data model: (key, document) pairs
    • Document: JSON, XML, other semi-structured formats
  • Operations
    • Insert(key, document)
    • Fetch(key)
    • Update(key)
    • Delete(key)
    • Also fetch based on document contents
  • Apache CouchDB, MongoDB, Elastic Search, ...
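The same idea with documents as values, again as a toy sketch rather than a real driver API: the main difference from a plain key-value store is the content-based fetch in the last bullet.

class DocumentStore:
    # Toy document store: values are JSON-like documents
    def __init__(self):
        self._docs = {}

    def insert(self, key, document):
        self._docs[key] = document

    def fetch(self, key):
        return self._docs.get(key)

    def fetch_where(self, field, value):
        # "Also fetch based on document contents"
        return [d for d in self._docs.values() if d.get(field) == value]

store = DocumentStore()
store.insert("u1", {"name": "alice", "city": "ankara"})
store.insert("u2", {"name": "bob", "city": "ankara"})
print(store.fetch("u1"))
print(store.fetch_where("city", "ankara"))  # both documents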

Graph Databases

  • A graph is a collection of nodes (things) and edges (relationships) that connect pairs of nodes
  • Attach properties (key-value pairs) to nodes and relationships
  • Relationships connect two nodes, and both nodes and relationships can hold an arbitrary number of key-value pairs
  • A graph database can be thought of as a key-value store, with full support for relationships
  • Neo4j, FlockDB, Pregel, ...
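A small sketch (mine) of the property-graph data model described above: nodes and relationships that each carry key-value properties.

# Nodes with properties
nodes = {
    "alice": {"label": "Person", "age": 30},
    "acme":  {"label": "Company", "sector": "software"},
}

# Relationships: (from_node, to_node, type, properties)
edges = [
    ("alice", "acme", "WORKS_AT", {"since": 2011, "role": "engineer"}),
]

# Traverse: where does alice work, and since when?
for src, dst, rel, props in edges:
    if src == "alice" and rel == "WORKS_AT":
        print(dst, nodes[dst]["label"], props["since"])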

Batch Processing

Batch Processing

  • Batch data processing is an efficient way of processing high volumes of data, where a group of transactions is collected over a period of time
  • Data is collected, entered, and processed, and then the batch results are produced
  • Data is processed by worker nodes and managed by a master node
  • Analytical queries can be executed in a distributed fashion on offline big data: user queries are converted to MapReduce jobs that run on the cluster
    • Hive: supports a SQL-like query language
    • Pig: more like a scripting language

Map-Reduce

  • MapReduce is a programming model and software framework first developed by Google
  • A method for distributing a task across multiple nodes
  • Each node processes data assigned to that node
  • Consists of two developer-created phases
    • Map
    • Reduce
  • Features of MapReduce systems include:
    • Reliably processing a job even when machines die
    • Parallelization on thousands of machines

Map

Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node

function map(String name, String document) {
    for each word w in document {
        emit(w, 1)  // emit(key, value)
    }
}

Reduce

Reduce step: After the map step, the master node collects the answers to all the sub-problems and groups all values by their keys. Each unique key and its values are sent to a distinct reducer, so each reducer processes one key and its values independently of the others

function reduce(String word, Iterator partialCounts) {
    sum = 0;
    for each pc in partialCounts {
        sum += ParseInt(pc);
    }
    emit(word, sum);
}
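To make the two phases concrete, here is a runnable single-machine simulation of the word-count example (plain Python, not Hadoop itself): map emits (word, 1) pairs, a shuffle step groups values by key, and reduce sums each group.

from collections import defaultdict

def map_phase(name, document):
    # Emit (word, 1) for every word, as in the map pseudocode above
    return [(word, 1) for word in document.split()]

def reduce_phase(word, partial_counts):
    # Sum the partial counts for one key, as in the reduce pseudocode above
    return (word, sum(partial_counts))

documents = {"doc1": "big data is big", "doc2": "data is everywhere"}

# Shuffle: group every emitted value under its key
grouped = defaultdict(list)
for name, doc in documents.items():
    for key, value in map_phase(name, doc):
        grouped[key].append(value)

# Reduce each key independently
print(dict(reduce_phase(w, counts) for w, counts in grouped.items()))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}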

MapReduce Overview

Hadoop

  • Hadoop is a scalable fault-tolerant distributed system for data storage and processing
  • Core Hadoop has two main components:
    • HDFS: Hadoop Distributed File System is a reliable, self-healing, redundant, high-bandwidth, distributed file system optimized for large files
    • MapReduce: Fault-tolerant distributed processing
  • Operates on unstructured and structured data

HDFS

  • Scalable, distributed, portable file system written in Java for Hadoop framework
    • Primary distributed storage used by Hadoop applications
  • HDFS can be part of a Hadoop cluster or can be a stand-alone general purpose distributed file system
  • An HDFS cluster primarily consists of
    • NameNode that manages file system metadata
    • DataNode that stores actual data
  • Stores very large files in blocks across machines in a large cluster
    • Reliability and fault tolerance ensured by replicating data across multiple hosts
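As a rough illustration of the last two bullets (my sketch, not HDFS code): splitting a file into fixed-size blocks and placing each block on several nodes is what gives HDFS its reliability. The block size and replication factor mirror common Hadoop defaults (64 MB blocks in older versions, 3 replicas), but the node names are made up.

def split_into_blocks(data, block_size):
    # Split a file's bytes into fixed-size blocks
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=3):
    # Assign each block to `replication` distinct nodes
    return {
        i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i in range(len(blocks))
    }

data = b"x" * 300
blocks = split_into_blocks(data, block_size=128)
print(len(blocks))  # 3 blocks
print(place_replicas(blocks, ["node1", "node2", "node3", "node4"]))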

Analytical Queries over Offline Big Data

  • Hadoop is great for large-data processing
    • But writing Java programs for everything is verbose and slow
    • Not everyone wants to (or can) write Java code
  • Solution: develop higher-level data processing languages
    • Hive: HQL is like SQL
    • Pig: Pig Latin is a bit like Perl

Hive

  • A system for querying and managing structured data built on top of Hadoop
  • Structured data with rich data types (structs, lists and maps)
  • Directly query data from different formats (text/binary) and file formats (Flat/Sequence)
  • SQL as a familiar programming tool and for standard analytics
  • Allows embedded scripts for extensibility and for non-standard applications
  • Rich MetaData to allow data discovery and for optimization

Pig

  • A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
  • Compiles down to MapReduce jobs
  • Common design patterns as keywords (joins, distinct, counts)
  • Data flow analysis
    • A script can map to multiple map-reduce jobs
  • Can be used in interactive mode
    • Issue commands and get results

Real Time Data Processing

Real Time Data Processing

  • In contrast, real-time data processing involves the continual input, processing, and output of data
  • The size of the data is not known at the beginning of processing
  • Data must be processed in a small time period (in or near real time)
  • Storm is one of the most popular open source implementations of real-time data processing

Storm

  • Storm is a highly distributed real-time computation system
  • It's scalable and fault-tolerant
  • Can be used with any programming language

Architecture

Architecture Components

  • Nimbus
    • Master node (similar to Hadoop JobTracker)
    • Responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures
  • Zookeeper
    • Used for cluster coordination between Nimbus and Supervisors
    • Nimbus and Supervisors are fail-fast and stateless; all state is kept in Zookeeper or on local disk
  • Supervisor
    • Run worker processes
    • The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it

Storm Topology

Main Concepts

  • Streams
    • Unbounded sequences of tuples (input or output)
  • Spouts
    • Sources of streams
    • e.g. reading data from the Twitter Streaming API
  • Bolts
    • Consume input streams, do some processing, and possibly emit new streams
  • Topologies
    • Networks of spouts and bolts
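A conceptual sketch (plain Python generators, not Storm's actual API) of how these pieces relate: a spout produces an unbounded stream of tuples, bolts transform it, and a topology wires them together.

import itertools

def tweet_spout():
    # Spout: source of an (in principle unbounded) stream of tuples
    for i in itertools.count():
        yield {"tweet": f"hello storm {i}"}

def split_words_bolt(stream):
    # Bolt: consume tuples and emit one new tuple per word
    for tup in stream:
        for word in tup["tweet"].split():
            yield {"word": word}

def count_bolt(stream):
    # Bolt: keep running word counts and emit them
    counts = {}
    for tup in stream:
        counts[tup["word"]] = counts.get(tup["word"], 0) + 1
        yield {"word": tup["word"], "count": counts[tup["word"]]}

# Topology: spout -> split bolt -> count bolt (take a few tuples for the demo)
topology = count_bolt(split_words_bolt(tweet_spout()))
for output in itertools.islice(topology, 6):
    print(output)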

Slides

https://github.com/accavdar/BigDataDemystified

THE END

by Abdullah Cetin CAVDAR / @accavdar