Spark Streaming – Transformations on DStreams




Spark Streaming

What is Spark Streaming

  • An extension of the core Spark API
  • Enables scalable, high-throughput, fault-tolerant processing of live data streams
  • Ingests data from many sources such as Kafka, Twitter, or TCP sockets
  • Provides high-level functions such as map, reduce, and window

How Spark Streaming works

  • Uses a DStream (discretized stream) to represent a continuous stream of data
  • Internally, a DStream is a sequence of RDDs, one per batch interval

Spark Streaming workflow

  • Everything starts with a StreamingContext
  • Add an input DStream and receiver
  • Perform transformations on DStreams
  • Output the results
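These four steps line up with a minimal program skeleton. The sketch below is illustrative only: the application name, socket source, and 1-second batch interval are placeholder choices, not values from the slides.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# 1. Everything starts with a StreamingContext
sc = SparkContext("local[*]", "WorkflowSketch")
ssc = StreamingContext(sc, 1)

# 2. Add an input DStream and receiver
lines = ssc.socketTextStream("localhost", 9999)

# 3. Perform transformations on DStreams
words = lines.flatMap(lambda line: line.split(" "))

# 4. Output the results
words.pprint()

ssc.start()
ssc.awaitTermination()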

StreamingContext

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# master is the cluster URL ("local[*]" for local mode); appName names the application
sc = SparkContext(master, appName)
# batch_interval sets the size of each micro-batch, in seconds
ssc = StreamingContext(sc, batch_interval)
  • When running on a local machine, set master to "local[*]"
  • Set an appropriate batch interval for your application

Input DStreams

  • Basic Sources (sketched after this list):
    • Socket Streams: ssc.socketTextStream(hostname, port)
    • File Streams: ssc.textFileStream(dataDirectory)
    • Queue of RDDs (useful for testing)
  • Advanced Sources:
    • Kafka
    • Flume
    • Kinesis
    • Twitter
    • and more...
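A short sketch of creating DStreams from the basic sources. It assumes an existing StreamingContext ssc and SparkContext sc; the hostname, port, and directory are placeholders.

# Socket stream: one record per line of text received from the socket
socket_lines = ssc.socketTextStream("localhost", 9999)

# File stream: picks up new files written into the monitored directory
file_lines = ssc.textFileStream("hdfs://namenode:8020/incoming")

# Queue of RDDs: feed pre-built RDDs in as a stream (handy for testing)
rdd_queue = [sc.parallelize(range(i * 100, (i + 1) * 100)) for i in range(5)]
queue_stream = ssc.queueStream(rdd_queue)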

Transformations on DStreams

  • transform (sketched below)
  • join (sketched below)
  • updateStateByKey
  • window operations
  • and more...
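A sketch of transform and join on pair DStreams. The two socket sources, their ports, and the blacklist RDD are hypothetical, chosen only to make the snippet self-contained.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[*]", "TransformJoinSketch")
ssc = StreamingContext(sc, 1)

# Two hypothetical pair DStreams built from separate sockets
clicks = ssc.socketTextStream("localhost", 9999).map(lambda line: (line, 1))
views = ssc.socketTextStream("localhost", 9998).map(lambda line: (line, 1))

# transform: apply an arbitrary RDD-to-RDD function to every batch,
# here removing keys that appear in a static blacklist RDD
blacklist = sc.parallelize([("spam", 1)])
cleaned = clicks.transform(lambda rdd: rdd.subtractByKey(blacklist))

# join: join two pair DStreams batch by batch on their keys
joined = cleaned.join(views)
joined.pprint()

ssc.start()
ssc.awaitTermination()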

UpdateStateByKey operation

# keep a running count per key across batches
def updateFunction(newValues, runningCount):
    if runningCount is None:  # first batch in which this key appears
        runningCount = 0
    # add this batch's values to the previous running count
    return sum(newValues, runningCount)

# stateful transformations require checkpointing, e.g. ssc.checkpoint("checkpoint_dir")
runningCounts = pairs.updateStateByKey(updateFunction)

Window operation

  • Needs two parameters:
    • window length: the duration of the window
    • sliding interval: the interval at which the window operation is performed
  • Both parameters must be multiples of the batch interval

Window operation

  • Functions:
    • window
    • countByWindow
    • reduceByWindow
    • reduceByKeyAndWindow (sketched below)
    • countByValueAndWindow
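A sketch of a windowed word count using reduceByKeyAndWindow. The 30-second window, 10-second slide, and checkpoint directory are illustrative choices, not values from the slides.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[*]", "WindowedWordCount")
ssc = StreamingContext(sc, 1)
# windowed reductions that use an inverse function require checkpointing
ssc.checkpoint("checkpoint_dir")

pairs = ssc.socketTextStream("localhost", 9999) \
           .flatMap(lambda line: line.split(" ")) \
           .map(lambda word: (word, 1))

# count words over the last 30 seconds, recomputed every 10 seconds
windowedCounts = pairs.reduceByKeyAndWindow(
    lambda a, b: a + b,   # add counts entering the window
    lambda a, b: a - b,   # subtract counts leaving the window
    30, 10)               # window length and sliding interval, in seconds
windowedCounts.pprint()

ssc.start()
ssc.awaitTermination()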

Output operations on DStreams

  • pprint: prints the first ten elements of every batch of data in a DStream
  • foreachRDD: applies an arbitrary function to each RDD; the function itself is executed in the driver process (sketched below)
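A sketch of foreachRDD pushing each batch to an external sink. Here counts stands for any pair DStream (for example, the word counts in the example that follows), and send_partition is a hypothetical stand-in for real connection code.

def send_partition(records):
    # hypothetical sink: replace with a real connection (database, message queue, ...)
    for record in records:
        print(record)

def push_batch(rdd):
    # this function runs in the driver for every batch;
    # the per-partition work is executed on the workers
    rdd.foreachPartition(send_partition)

counts.foreachRDD(push_batch)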

A simple network wordcount example

from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext

sc = SparkContext("local[*]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)  # 1-second batch interval

# receive lines of text from a local socket
lines = ssc.socketTextStream("localhost", 9999, storageLevel=StorageLevel.MEMORY_AND_DISK)
# split each line into words and count the words in every batch
counts = lines.flatMap(lambda line: line.split(" "))\
              .map(lambda word: (word, 1))\
              .reduceByKey(lambda a, b: a+b)
counts.pprint()

ssc.start()
ssc.awaitTermination()

Deploy your application

  • Type git clone https://github.com/SakuraSakura/spark-project.git to clone the base code of the examples
  • Type nc -lk 9999 to start a simple socket server
  • Type bin/spark-submit wordcount.py to start your application