Programming Experience – Create a twitterbot with Scala and Akka – Spark MLlib — Machine Learning Basics




Meetup 18 August

Introduction into Apache Spark

Created by Casper Koning and Robert van Rijn

First we do the talk, then we do the walk

What is Apache Spark?

  • Apache Spark is a fast, general engine for large-scale data processing

- Open sourced, developed at the AMPLab at UC Berkeley

Written in Scala

- Hybrid programming language (object-oriented and functional) that runs on the JVM

Key Concepts

- Bring the processing to the data

- Store the data in memory

Spark Architecture Model

Coordination is done by the SparkContext object in your main program (called the driver program)

Cluster manager

An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)

Worker Node

Any node that can run application code in the cluster

Spark Internals

  • RDD (Resilient Distributed Dataset)

- Resilient: if data in memory is lost, it can be recreated

- Distributed: stored in memory across the cluster

- Dataset: initial data can come from a file or be created programmatically

RDDs are the fundamental unit of data in Spark. Most Spark programming consists of performing operations on RDDs.

RDDs

  • RDDs are immutable:

- Each stage of a transformation creates a new RDD

  • RDDs are lazy:

- A DAG (directed acyclic graph) of computation is constructed.

- The actual data is processed only when results are requested.

Transformations and Actions

RDDs support two types of operations: transformations, which define a new RDD from an existing one, and actions, which compute a result and return it to the driver.

You can chain operations together, but keep in mind that the computation only runs when you call an action.
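A minimal sketch of this laziness, assuming a SparkContext `sc` as provided by `spark-shell`:

```scala
// Transformations return immediately: they only record the computation in the DAG.
val numbers = sc.parallelize(1 to 100)
val evens   = numbers.filter(_ % 2 == 0)  // transformation: nothing computed yet
val squared = evens.map(n => n * n)       // transformation: still nothing

// Only an action triggers execution of the whole chain.
val n = squared.count()                   // the DAG actually runs now
// n == 50
```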

Time to rock and roll!

  • Spark Core
  • Spark SQL
  • Spark MLlib
  • Spark Streaming
  • Spark GraphX

nl.ncim.workshop.core

  • Transformations: map, flatMap, filter, reduceByKey, groupBy
  • Actions: take, collect, count
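A classic word count exercises several of these operations. A sketch, again assuming a SparkContext `sc` as in `spark-shell`:

```scala
val lines = sc.parallelize(Seq("spark is fast", "spark is general"))

val counts = lines
  .flatMap(_.split(" "))    // transformation: lines -> words
  .map(word => (word, 1))   // transformation: word -> (word, 1)
  .reduceByKey(_ + _)       // transformation: sum counts per word

val result = counts.collect().toMap   // action: computation runs here
// result("spark") == 2
```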

nl.ncim.workshop.sql

  • Structured data processing
  • Provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
  • DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.
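A sketch of both styles, assuming a SparkSession `spark` as provided by `spark-shell` (on Spark 1.x this would be a SQLContext and `registerTempTable`):

```scala
import spark.implicits._

// Toy data for illustration.
case class Person(name: String, age: Long)
val people = Seq(Person("Ann", 30), Person("Bob", 19)).toDF()

// DataFrame API:
people.filter($"age" > 21).show()

// Or as a distributed SQL query engine:
people.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name FROM people WHERE age > 21")
  .collect().map(_.getString(0))
// adults contains only "Ann"
```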

Spark MLlib

  • Machine Learning Basics
  • Programming Abstractions in Spark MLlib
  • Example: Predicting Digits
  • Getting YOUR hands dirty

Spark MLlib — Machine Learning Basics

Gather and prepare data

"In Data Science, 80% of time is spent on data preparation, and the other 20% is spent on complaining about the need to prepare data"

Select features

"In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction"
  • Current age for predicting the probability of ending up in hospital in the coming five years.
  • $\frac{\text{weight}}{\text{length}^2}$ for predicting percentage of body fat.
  • You fall in category $A$, so you are very likely to do $Z$.

Train a model

Select the best model

  • Train the model with different parameters.
  • A model might be too complex (overfitting) or too simple (underfitting).
  • Evaluate the model on a validation set and use cross-validation to find an appropriate model.
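A sketch of a simple train/validation split, assuming a SparkContext and a hypothetical `data: RDD[LabeledPoint]` (the depths tried here are arbitrary):

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.rdd.RDD

// Hold out part of the (hypothetical) data to score candidate models.
val Array(train, validation) = data.randomSplit(Array(0.8, 0.2), seed = 42)

// Try models of increasing complexity and score each on the held-out set.
val scores = for (depth <- Seq(5, 10, 15)) yield {
  val model = DecisionTree.trainClassifier(
    train, numClasses = 10, categoricalFeaturesInfo = Map[Int, Int](),
    impurity = "gini", maxDepth = depth, maxBins = 32)
  val accuracy = validation
    .map(p => (model.predict(p.features), p.label))
    .filter { case (pred, label) => pred == label }
    .count().toDouble / validation.count()
  (depth, accuracy)
}

// Keep the depth with the best validation accuracy.
val (bestDepth, _) = scores.maxBy(_._2)
```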

Predict

Spark MLlib — Programming Abstractions

  • (Dense/Sparse) Vector
  • LabeledPoint
  • Matrix
  • Rating
  • Model classes
  • Pipeline API

(Dense/Sparse) Vector

  • A mathematical vector containing numbers.
  • Both dense and sparse vectors.
  • Constructed via the mllib.linalg.Vectors class.
  • Do not provide arithmetic operations.
        val denseVec1 = Vectors.dense(1.0, 2.0, 3.0)
        val denseVec2 = Vectors.dense(Array(1.0, 2.0, 3.0))
        // size 4, with non-zero entries at indices 0 and 2
        val sparseVec = Vectors.sparse(4, Array(0, 2), Array(1.0, 2.0))

LabeledPoint

"LabeledPoint: A labeled data point for supervised learning algorithms such as classification and regression. Includes a feature vector and a label."
             val lp = LabeledPoint(1.0, Vectors.dense(3.14, 1.68, 1.41))

Matrix

  • Integer typed row and column indices
  • Double values
  • Different implementations for distribution purposes (RowMatrix, BlockMatrix, CoordinateMatrix,...).
  • Dense and sparse variants

Rating

"Rating: A rating of a product by a user, used in the mllib.recommendation package for product recommendation." Nothing more than a case class:
         case class Rating(user: Int, product: Int, rating: Double)

Model classes

  • Work on RDD[Vector], RDD[LabeledPoint], etc.
  • Often follow the naming pattern <problem>With<Algorithm>, e.g. LinearRegressionWithSGD.
  • Either the model follows a builder pattern and has a run() method, or it has static train() and predict() methods:
// Builder pattern with a run() method:
val points: RDD[LabeledPoint] = // ...
val lr = new LinearRegressionWithSGD()
    .setNumIterations(200)
    .setIntercept(true)
val lrModel = lr.run(points)

// Static train() method:
val dtModel = DecisionTree.trainClassifier(
    input = data,
    numClasses = 10,
    categoricalFeaturesInfo = Map[Int, Int](),
    impurity = "gini",
    maxDepth = 15,
    maxBins = 5
)

Pipeline API

  • Advanced API for chaining machine learning operations in one workflow
  • Uses DataFrames, which are richer than the plain RDDs underlying the simpler MLlib abstractions.
  • Possible pipeline: Automated feature selection -> Model training -> Validation -> Model selection -> Prediction
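A minimal sketch of such a workflow for text classification, assuming a SparkSession `spark` (stage names follow the spark.ml API; the toy training data is made up):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Toy training data: (id, text, label).
val training = spark.createDataFrame(Seq(
  (0L, "spark is great", 1.0),
  (1L, "boring old batch job", 0.0)
)).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

// Chain the stages into one workflow and fit it in a single call.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model    = pipeline.fit(training)
```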

Spark MLlib — Example: Predicting Digits

  • 42,000 drawings of digits.
  • Given a drawing, predict the written digit.
  • Classification problem.
  • Use Decision Tree approach.

CODE TIME

nl.ncim.workshop.streaming

Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ. You can also define your own custom data sources.
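A sketch of a streaming word count over a socket source, assuming a SparkContext `sc` (feed it text with e.g. `nc -lk 9999`; the host, port, and batch interval are arbitrary):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Process the stream in 5-second micro-batches.
val ssc   = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)

val counts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // block until the stream is stopped
```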

Spark GraphX

  • Graph Fundamentals
  • Programming Abstractions in Spark GraphX
  • Example: Medline Topics
  • Example: Maven Dependencies

Spark GraphX — Graph Fundamentals

"In the most common sense of the term, a graph is an ordered pair $G=(V,E)$ comprising a set $V$ of vertices together with a set $E$ of edges, which are 2-element subsets of $V$."
  • Graphs are all about relationships between objects.
  • Vertices are the objects we are interested in.
  • Edges describe the relationship between two vertices.
  • This alternative way of looking at data allows us to ask different questions, or answer certain questions much more easily. These questions are typically about relationships.

Spark GraphX — Programming Abstractions in Spark

"In the most common sense of the term, a graph is an ordered pair $G=(V,E)$ comprising a set $V$ of vertices together with a set $E$ of edges, which are 2-element subsets of $V$."

Spark GraphX uses these terms as well, and has programming abstractions for them.

Graph

class Graph[VD, ED] {
        val vertices: VertexRDD[VD]
        val edges: EdgeRDD[ED]
}

A very rich wrapper around a collection of vertices and edges.

Vertices

  • A VertexRDD[VD] is a rich wrapper around an RDD[(VertexID, VD)].
  • A VertexId is a unique 64-bit Long. It allows for easy lookups and other operations on vertices.

Edges

case class Edge[ED](srcId: VertexId = 0, dstId: VertexId = 0, attr: ED = null.asInstanceOf[ED])
  • An EdgeRDD[ED] is an RDD[Edge[ED]].
  • An Edge[ED] is basically a 3-tuple of (VertexId, VertexId, ED), where the edge is directed and ED is the type of relationship between the two Vertices.

EdgeTriplet

An EdgeTriplet[VD, ED] combines an Edge[ED] with the attributes of its source and destination vertices (srcAttr and dstAttr), so you can work with an edge and both of its endpoints at once.

Constructing a Graph

Several options for creating a Graph in GraphX:

  • Graph.apply(vertices, edges), or simply Graph(vertices, edges), results in a Graph instance.
  • The Graph object also has fromEdges() and fromEdgeTuples() methods for Graph instantiation.
  • GraphLoader.edgeListFile() provides a way to load a graph from a list of edges on disk.
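A sketch of the first option, assuming a SparkContext `sc`; the vertex names and edge labels are made up for illustration:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}

// Vertices: (VertexId, attribute) pairs.
val vertices = sc.parallelize(Seq[(VertexId, String)](
  (1L, "Casper"), (2L, "Robert"), (3L, "Spark")
))
// Directed edges with a String relationship as the edge attribute.
val edges = sc.parallelize(Seq(
  Edge(1L, 3L, "presents"),
  Edge(2L, 3L, "presents"),
  Edge(1L, 2L, "collaborates-with")
))

val graph = Graph(vertices, edges)

// Query the graph like any other RDD-backed structure.
val numPresenters = graph.edges.filter(_.attr == "presents").count()
// numPresenters == 2
```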

Spark GraphX — Example: Medline Topics

source: ftp://ftp.nlm.nih.gov/nlmdata/sample/medline

Data

  • XML documents <MedlineCitation>.
  • A MedlineCitation has a list of Keywords, which contain Major Topics.
  • Co-occurrences of Major Topics may form an interesting network.

Code Time

Spark GraphX — Example: Maven Dependencies

source: https://github.com/ogirardot/meta-deps

Data

  • Directed graph of maven dependencies.
  • Data structure:

Code Time
