First we do the talk, then we do the walk
What is Apache Spark?
- Apache Spark is a fast, general engine for large-scale data processing
- Open source, originally developed at the AMPLab at UC Berkeley
Written in Scala
- A hybrid object-oriented and functional programming language that runs on the JVM
Key Concepts
- Bring the processing to the data
- Store the data in memory
Spark Architecture Model
Coordination is done by the SparkContext object in your main program (called the driver program); see the sketch below.
Cluster manager
An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)
Worker Node
Any node that can run application code in the cluster
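A minimal sketch of how a driver program sets up this coordination (the application name and master URL here are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// the driver program creates the SparkContext, which coordinates all work on the cluster
val conf = new SparkConf()
  .setAppName("SparkWorkshop")  // hypothetical application name
  .setMaster("local[*]")        // or a cluster manager URL, e.g. spark://host:7077 or yarn
val sc = new SparkContext(conf)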
Spark Internals
- RDD (Resilient Distributed Dataset)
- Resilient: if data in memory is lost, it can be recreated
- Distributed: stored in memory across the cluster
- Dataset: initial data can come from a file or be created programmatically (see the sketch below)
RDDs are the fundamental unit of data in Spark
Most Spark programming consists of performing operations on RDDs
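A minimal sketch of both options (assuming `sc` is the SparkContext from the driver program; the input path is hypothetical):

// from a file on (distributed) storage
val lines = sc.textFile("hdfs:///data/input.txt")

// created programmatically from a local collection
val numbers = sc.parallelize(1 to 1000)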
RDDs
- Each transformation creates a new RDD
RDDs are lazy:
-A DAG (directed acyclic graph) of computation is constructed.
-The actual data is processed only when results are requested.
Transformations and Actions
RDDs support two types of operations: transformations, which build a new RDD from an existing one, and actions, which compute a result and return it to the driver.
You can chain operations together, but keep in mind that the computation only runs when you call an action.
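A minimal sketch of this lazy behaviour (assuming an existing SparkContext `sc`):

// transformations: nothing is computed yet, only the DAG is extended
val evens = sc.parallelize(1 to 1000000)
  .map(_ * 2)
  .filter(_ % 4 == 0)

// action: triggers the actual computation and ships results to the driver
val firstTen = evens.take(10)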
Time to rock and roll!
- Spark Core
- Spark SQL
- Spark MLlib
- Spark Streaming
- Spark GraphX
nl.ncim.workshop.core
- Transformations: map, flatMap, filter, reduceByKey, groupBy
- Actions: take, collect, count
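A minimal word-count-style sketch combining these operations (the input path is a placeholder):

val counts = sc.textFile("hdfs:///data/text.txt")
  .flatMap(_.split("\\s+"))   // transformation: one element per word
  .map(word => (word, 1))     // transformation: pair each word with a count of 1
  .reduceByKey(_ + _)         // transformation: sum the counts per word
val sample = counts.take(5)   // action: materialize the first five (word, count) pairs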
nl.ncim.workshop.sql
- Structured data processing
- Spark SQL provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
- DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.
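A minimal sketch in the Spark 1.x style (assuming an existing SQLContext `sqlContext` and a hypothetical people.json file):

// read a JSON data source into a DataFrame
val people = sqlContext.read.json("people.json")

// the DataFrame API and SQL are two views on the same data
people.filter(people("age") > 21).show()

people.registerTempTable("people")
sqlContext.sql("SELECT name, age FROM people WHERE age > 21").show()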
Spark MLlib
- Machine Learning Basics
- Programming Abstractions in Spark MLlib
- Example: Predicting Digits
- Getting YOUR hands dirty
Spark MLlib — Machine Learning Basics
Gather and prepare data
"In Data Science, 80% of time is spent on data preparation, and the other 20% is spent on
complaining about the need to prepare data"
Select features
"In machine learning and statistics, feature selection, also known as variable selection,
attribute selection or variable subset selection, is the process of selecting a subset of relevant
features (variables, predictors) for use in model construction"
- Current age for predicting the probability of ending up in hospital in the coming five years.
- $\frac{\text{weight}}{\text{length}^2}$ for predicting percentage of body fat.
- You fall in category $A$, so you are very likely to do $Z$.
Select the best model
- Train model with different parameters.
- Model might be too complex, or too simple.
- Evaluate model on some validation set and use cross-validation to find an appropriate model.
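A minimal sketch of model selection via cross-validation, using the spark.ml tuning API (the DataFrame `trainingDF` with label and features columns is assumed):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// try three regularization strengths and keep the model that scores best
val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)  // 3-fold cross-validation

val bestModel = cv.fit(trainingDF)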
Spark MLlib — Programming Abstractions
- (Dense/Sparse) Vector
- LabeledPoint
- Matrix
- Rating
- Model classes
- Pipeline API
(Dense/Sparse) Vector
- A mathematical vector containing numbers.
- Both dense and sparse vectors.
- Constructed via the mllib.linalg.Vectors class.
- Do not provide arithmetic operations.
import org.apache.spark.mllib.linalg.Vectors

val denseVec1 = Vectors.dense(1.0, 2.0, 3.0)
val denseVec2 = Vectors.dense(Array(1.0, 2.0, 3.0))
// size 4, non-zero entries at indices 0 and 2
val sparseVec = Vectors.sparse(4, Array(0, 2), Array(1.0, 2.0))
LabeledPoint
"LabeledPoint: A labeled data point for supervised learning algorithms such as
classification and regression. Includes a feature vector and a label."
import org.apache.spark.mllib.regression.LabeledPoint

// label 1.0 with a three-dimensional feature vector
val lp = LabeledPoint(1.0, Vectors.dense(3.14, 1.68, 1.41))
Matrix
- Integer typed row and column indices
- Double values
- Different implementations for distribution purposes (RowMatrix, BlockMatrix, CoordinateMatrix,...).
- Dense and sparse variants
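A minimal sketch of both variants via the mllib.linalg.Matrices factory:

import org.apache.spark.mllib.linalg.Matrices

// 3x2 dense matrix; values are given in column-major order
val dm = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))

// 3x2 sparse matrix in CSC form with entries at (0,0) and (2,1)
val sm = Matrices.sparse(3, 2, Array(0, 1, 2), Array(0, 2), Array(9.0, 8.0))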
Rating
"Rating: A rating of a product by a user, used in the mllib.recommendation package for
product recommendation."
Nothing more than a
case class Rating(user: Int, product: Int, rating: Double)
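A minimal usage sketch (assuming an existing SparkContext `sc`): ratings feed straight into ALS for collaborative filtering:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// train a recommender on a handful of (user, product, rating) triples
val ratings = sc.parallelize(Seq(
  Rating(1, 10, 4.0), Rating(1, 20, 1.0), Rating(2, 10, 5.0)))
val model = ALS.train(ratings, 10, 10)  // rank = 10, iterations = 10
val prediction = model.predict(2, 20)   // predicted rating of product 20 by user 2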
Model classes
- Work on RDD[Vector], RDD[LabeledPoint], etc.
- Often follow the naming pattern: <problem>With<Algorithm>, e.g. LinearRegressionWithSGD.
- Either the model follows a builder pattern and has a run() method, or it has static train() and predict() methods:
val points: RDD[LabeledPoint] = // ...

val lr = new LinearRegressionWithSGD().setIntercept(true)
lr.optimizer.setNumIterations(200)  // the iteration count is configured on the optimizer
val model = lr.run(points)
val model = DecisionTree.trainClassifier(
  input = data,
  numClasses = 10,
  categoricalFeaturesInfo = Map[Int, Int](),
  impurity = "gini",
  maxDepth = 15,
  maxBins = 5
)
Pipeline API
- Advanced API for chaining machine learning operations in one workflow
- Uses DataFrames rather than the RDDs of the simpler MLlib abstractions.
- Possible pipeline: Automated feature selection -> Model training -> Validation -> Model selection -> Prediction
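A minimal sketch of such a pipeline for text classification (the training DataFrame `trainingDF` with text and label columns is assumed):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// three stages: tokenize text, hash tokens to feature vectors, fit a classifier
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingDF)  // fits all stages as one workflow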
Spark MLlib — Example: Predicting Digits
- 42,000 drawings of digits.
- Given a drawing, predict the written digit.
- Classification problem.
- Use Decision Tree approach.
nl.ncim.workshop.streaming
Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ.
You can also define your own custom data sources.
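A minimal sketch using the simple socket source (host and port are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// count words arriving on a socket, in 10-second micro-batches
val conf = new SparkConf().setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(10))

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()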
Spark GraphX
- Graph Fundamentals
- Programming Abstractions in Spark GraphX
- Example: Medline Topics
- Example: Maven Dependencies
Spark GraphX — Graph Fundamentals
"In the most common sense of the term, a graph is an ordered pair $G=(V,E)$ compromising a set $V$ of vertices together with a set $E$ of edges, which are 2-element subsets of $V$."
- Graphs are all about relationships between objects.
- Vertices are the objects we are interested in.
- Edges describe the relationship between two vertices.
- This alternative way of looking at data allows us to ask different questions, or to answer certain questions much more easily. These questions are typically about relationships.
Spark GraphX — Programming Abstractions in Spark
"In the most common sense of the term, a graph is an ordered pair $G=(V,E)$ compromising a set $V$ of vertices together with a set $E$ of edges, which are 2-element subsets of $V$."
Spark GraphX uses these terms as well, and has programming abstractions for them.
Graph
class Graph[VD, ED] {
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
}
A very rich wrapper around a collection of vertices and edges.
Vertices
- A VertexRDD[VD] is a rich wrapper around an RDD[(VertexId, VD)].
- A VertexId is a unique 64-bit Long, which allows for easy lookups and other operations on vertices.
Edges
case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)
- An EdgeRDD[ED] is an RDD[Edge[ED]].
- An Edge[ED] is basically a 3-tuple of (VertexId, VertexId, ED), where the edge is directed and ED is the type of relationship between the two Vertices.
Constructing a Graph
Several options for creating a Graph in GraphX:
- Graph.apply(vertices,edges), or Graph(vertices,edges) results in a Graph instance.
- The Graph object also has fromEdges() and fromEdgeTuples() methods for Graph instantiation.
- GraphLoader.edgeListFile() provides a way to load a graph from a list of edges on disk.
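A minimal sketch of the first option (assuming an existing SparkContext `sc`; the tiny social graph is made up):

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// two vertices and one directed "follows" edge between them
val vertices: RDD[(Long, String)] = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob")))
val edges: RDD[Edge[String]] = sc.parallelize(Seq(Edge(1L, 2L, "follows")))

val graph = Graph(vertices, edges)
println(graph.vertices.count())  // 2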
Data
- XML documents containing <MedlineCitation> records.
- A MedlineCitation has a list of Keywords, which contain Major Topics.
- Co-occurrences of Major Topics may form an interesting network.
Data
- Directed graph of maven dependencies.
- Data structure:
Meetup 18 August
Introduction to Apache Spark
Created by Casper Koning
and
Robert van Rijn