What is Spark?
NOT a modified version of Hadoop
A separate, fast, MapReduce-like engine
- In-memory data storage for very fast iterative queries
- General execution graphs and powerful optimizations
- Up to 100x faster than Hadoop MapReduce
Compatible with Hadoop's storage APIs
- Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
What is Shark?
- A SQL analytics engine built on top of Spark
- Compatible with Apache Hive data, metastores, and queries (HiveQL, UDFs, etc.)
- Similar speedups of up to 100x
Adoption
- In use at Yahoo!, Foursquare, Berkeley, Princeton & many others (possibly Taobao, Netease)
- 600+ member meetup, 800+ watchers on GitHub
- 30+ contributors
This Talk
- Hadoop & MapReduce
- Spark
- Shark: SQL on Spark
- Why is Hadoop MapReduce slow?
How do you scale up applications to PBs of data?
I can use hundreds or thousands of machines!
But distributed programming is hard (task scheduling, data synchronization, machine failures)
MapReduce
Programming model: a simple abstraction (i.e. map and reduce) inspired by functional programming
Execution engine: runs on thousands of commodity machines.
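For intuition, here is word count in the map/reduce model, as a plain-Scala sketch (not the actual Hadoop API):

```scala
// Word count as map and reduce (plain Scala sketch, not the
// actual Hadoop API).
def map(line: String): Seq[(String, Int)] =
  line.split("\\s+").map(word => (word, 1)).toSeq

def reduce(word: String, counts: Seq[Int]): (String, Int) =
  (word, counts.sum)

// The framework shuffles map output so every value for a given
// key reaches a single reduce call:
val lines  = Seq("the quick brown fox", "the lazy dog")
val counts = lines.flatMap(map)
  .groupBy(_._1)
  .map { case (word, pairs) => reduce(word, pairs.map(_._2)) }
// counts: Map(the -> 2, quick -> 1, brown -> 1, ...)
```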
Hadoop
Hadoop Distributed File System (HDFS)
- A distributed file system modeled after Google File System
Hadoop MapReduce (aka MapRed, MR)
Many other related projects, such as Hive (SQL on Hadoop)
Hive
A data warehouse
- initially developed by Facebook
- puts structure onto HDFS data (schema-on-read)
- compiles HiveQL queries into MapReduce jobs
- flexible and extensible: supports UDFs, scripts, custom serializers, storage formats
Popular
- 90+% of Facebook Hadoop jobs generated by Hive
OLTP (serving) vs OLAP (analytics)
Determinism & Idempotence
Map and reduce functions are:
- deterministic
- side-effect free
Tasks are thus idempotent:
- Rerunning them gets you the same result
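A minimal sketch of the distinction, in plain Scala:

```scala
// Deterministic and side-effect free: same input, same output,
// so re-executing the task is always safe.
def goodMap(record: String): (String, Int) =
  (record.toLowerCase, record.length)

// NOT safe to rerun: the output is non-deterministic and the
// side effect fires again on every re-execution.
var linesSeen = 0
def badMap(record: String): (String, Int) = {
  linesSeen += 1                          // mutates shared state
  (record, scala.util.Random.nextInt())   // random output
}
```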
Determinism & Idempotence
Fault-tolerance:
- Rerun tasks originally scheduled on failed nodes
Stragglers:
- Rerun tasks originally scheduled on slow nodes
This Talk
- Hadoop & MapReduce
- Spark
- Shark: SQL on Spark
- Why is Hadoop MapReduce slow?
Why go Beyond MapReduce?
MapReduce simplified big data analysis by providing a reliable programming model for large clusters
But as soon as it got popular, users wanted:
- More complex, multi-stage applications
- More interactive, ad-hoc queries
Complex jobs and interactive queries both need one thing that MapReduce lacks:
Efficient primitives for data sharing
In MapReduce, the only way to share data across jobs is stable storage (e.g. HDFS)
SLOW!
Solution
Resilient Distributed Datasets (RDDs)
- Distributed collections of objects that can be stored in memory for fast reuse
- Automatically recover lost data on failure
- Support a wide range of applications
Programming Model
Resilient distributed datasets (RDDs)
- Immutable, partitioned collections of objects
- Can be cached in memory for efficient reuse
Transformations (e.g. map, filter, groupBy, join)
- Build RDDs from other RDDs
Actions (e.g. count, collect, save)
- Return a result or write it to storage
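For example, the classic log-mining pattern in Spark's Scala API (sc is the SparkContext; the path and search strings are hypothetical):

```scala
// Build a base RDD from HDFS, derive a filtered RDD, and mark it
// for in-memory reuse. Transformations are lazy; actions run jobs.
val lines  = sc.textFile("hdfs://...")            // base RDD
val errors = lines.filter(_.startsWith("ERROR"))  // transformation
errors.cache()                                    // keep in memory

val total = errors.count()                        // action: runs the job
val mysql = errors.filter(_.contains("MySQL")).count()  // hits the cache
```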
Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data
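Continuing the sketch above (path hypothetical), the lineage of a derived RDD is just its chain of transformations, so a lost partition can be rebuilt from the source data:

```scala
// Lineage of `codes`: HDFS file -> filter -> map. If a cached
// partition is lost, Spark reruns filter and map on just the
// corresponding HDFS block -- no replication of derived data.
val codes = sc.textFile("hdfs://...")
  .filter(_.startsWith("ERROR"))
  .map(_.split("\t")(2))
codes.cache()
```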
Behavior with Not Enough RAM
Logistic Regression Performance
Implementation
Uses Mesos/YARN to share resources with Hadoop & other frameworks
Can access any Hadoop input source (HDFS, S3, …)
20k lines of code
User Applications
- In-memory analytics & anomaly detection (Conviva)
- Interactive queries on data streams (Quantifind)
- Exploratory log analysis (Foursquare)
- Traffic estimation w/ GPS data (Mobile Millennium)
- Twitter spam classification (Monarch)
Conviva GeoReport
Group aggregations on many keys w/ same filter
40× gain over Hive from avoiding repeated reading, deserialization and filtering
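A sketch of the pattern (the field layout and filter are hypothetical):

```scala
// Cache one filtered dataset, then run many aggregations on it.
val sessions = sc.textFile("hdfs://...")
  .map(_.split("\t"))
  .filter(fields => fields(0) == "US")   // the shared filter
sessions.cache()

// Each report reuses the cached rows instead of re-reading,
// re-deserializing, and re-filtering the raw logs.
val byCity = sessions.map(f => (f(1), 1)).reduceByKey(_ + _).collect()
val byIsp  = sessions.map(f => (f(2), 1)).reduceByKey(_ + _).collect()
```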
This Talk
- Hadoop & MapReduce
- Spark
- Shark: SQL on Spark
- Why is Hadoop MapReduce slow?
Challenges
- Data volumes expanding
- Faults and stragglers complicate parallel database design
- Demand for low latency and interactivity
MPP Databases
- Vertica, SAP HANA, Teradata, Google Dremel...
- Fast!
- Generally not fault-tolerant; long-running queries become challenging as clusters scale up.
- Lack rich analytics such as ML and graph algorithms.
MapReduce
- Apache Hive, Google Tenzing, Turn Cheetah...
- Deterministic, idempotent tasks enable fine-grained fault tolerance and resource sharing.
- Expressive ML algorithms.
- High latency; often dismissed for interactive workloads.
Shark
A data warehouse that
- builds on Spark,
- scales out and is fault-tolerant,
- supports low-latency, interactive queries through in-memory computation,
- supports both SQL and complex analytics,
- is compatible with Apache Hive (storage, serdes, UDFs, types, metadata).
Engine Features
- Columnar Memory Store
- ML Integration
- Partial DAG Execution
- Data Co-partitioning
- Partition Pruning based on Range Statistics
Efficient In-Memory Storage
Simply caching Hive records as Java objects is inefficient due to high per-object overhead
Instead, Shark employs column-oriented storage using arrays of primitive types.
Benefit: about as compact as serialized data, but >5x faster to access
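A plain-Scala sketch of the difference:

```scala
// Row-oriented: one heap object per record. Each object carries
// a header (~16 bytes), so scans chase pointers and the GC must
// track millions of small objects.
case class Row(id: Int, name: String)
val rows = Array(Row(1, "a"), Row(2, "b"))

// Column-oriented: one primitive array per column -- roughly as
// compact as serialized bytes, yet readable with no deserialization.
val ids   = Array(1, 2)        // Array[Int]: unboxed primitives
val names = Array("a", "b")
```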
ML Integration
- Unified system for query processing and machine learning
- Query processing and ML share the same set of workers and caches
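A sketch of the idea: sql2rdd is SharkContext's hook for turning a HiveQL query into an RDD, while the row accessor and kmeans routine below are placeholders, not a real API.

```scala
// Run a HiveQL query, keep the result as an RDD, and train on it
// directly -- same workers, same caches.
// (getDouble and kmeans are placeholders for illustration.)
val users  = sc.sql2rdd("SELECT age, income FROM users")
val points = users.map(row => Array(row.getDouble(0), row.getDouble(1)))
val model  = kmeans(points, k = 10)   // hypothetical ML routine
```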
This Talk
- Hadoop & MapReduce
- Spark
- Shark: SQL on Spark
- Why is Hadoop MapReduce slow?
Why are previous MR-based systems slow?
- Disk-based intermediate outputs.
- Inferior data format and layout (no control over data co-partitioning).
- Execution strategies (lack of optimization based on data statistics).
- Task scheduling and launch overhead!
Task Launch Overhead
Hadoop uses heartbeats to communicate scheduling decisions.
Task launch delay of 5-10 seconds.
Spark uses an event-driven architecture and can launch tasks in 5ms
- better parallelism
- easier straggler mitigation
- multi-tenant resource sharing