Problem
All the presented frameworks are great, but ...
they are limited by the performance of our machines.
We need to handle datasets of ~TBs.
- The frameworks cover the workflow
- A lot of things are single-threaded
- Parallelization needs non-trivial work (see the sketch after this list)
- ipyparallel, multiprocessing, ...
- There exist frameworks to parallelize on a cluster
- Datasets larger than memory
- Often on multiple machines
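A minimal sketch of that "non-trivial work", using the standard multiprocessing module (file names and the aggregation are hypothetical): even a simple per-file computation has to be split into chunks, shipped to worker processes, and merged by hand.

# Hypothetical example: parallel per-file word counts with multiprocessing.
from collections import Counter
from multiprocessing import Pool

def count_words(path):
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts.update(line.split())
    return counts

if __name__ == "__main__":
    paths = ["part-0000.txt", "part-0001.txt", "part-0002.txt"]  # assumed input chunks
    with Pool(processes=4) as pool:
        partials = pool.map(count_words, paths)   # fan the work out to worker processes
    total = sum(partials, Counter())              # merge the partial results by hand
    print(total.most_common(10))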
Apache Spark
“Apache Spark™ is a fast and general engine for large-scale data processing.”
“Fast”
- Holds the world record in sorting a 100 TB dataset.
- Able to cache datasets in memory (see the sketch below).
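A hedged sketch of the caching claim: cache() keeps an RDD in executor memory, so the second action below reads cached data instead of recomputing (local setup and data are illustrative).

from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-demo")            # assumed local run
squares = sc.parallelize(range(10**6)).map(lambda x: x * x)
squares.cache()                                        # keep the RDD in memory
print(squares.count())                                 # first action computes and caches
print(squares.sum())                                   # second action hits the cache
sc.stop()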
Apache Spark
“Apache Spark™ is a fast and general engine for large-scale data processing.”
“General”
- Provides many high-level data-processing functions.
- Built-in libs: SQL, Streaming, MLlib, GraphX (see the SQL sketch below).
- APIs in Scala, Java, Python, and R.
- Works with HDFS, HBase, Hive, Postgres, etc.
- Runs locally, on Hadoop (YARN), on Mesos, or in standalone mode.
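To make the "general" point concrete, a hedged sketch of the built-in SQL support, using the Spark 1.x SQLContext API that was current at the time (table, columns, and data are made up).

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext("local[*]", "sql-demo")
sqlContext = SQLContext(sc)

# Illustrative in-memory rows; in practice the data would come from HDFS, Hive, etc.
people = sc.parallelize([Row(name="Alice", age=34), Row(name="Bob", age=29)])
df = sqlContext.createDataFrame(people)
df.registerTempTable("people")                         # Spark 1.x temp-table API

sqlContext.sql("SELECT name FROM people WHERE age > 30").show()
sc.stop()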
PySpark
PySpark is a Python wrapper around Spark.
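A minimal sketch of what the wrapper looks like from the Python side: the SparkContext starts and talks to a JVM driver through Py4J, while the user writes plain Python (the peek at sc._jsc is an internal attribute, shown only for illustration).

from pyspark import SparkContext

sc = SparkContext("local[2]", "wrapper-demo")
print(sc.version)          # version string reported by the underlying JVM SparkContext
print(type(sc._jsc))       # Py4J proxy to the Java SparkContext (internal detail)
rdd = sc.parallelize(range(10)).map(lambda x: x * x)   # Python closures run in Python workers
print(rdd.collect())
sc.stop()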
Architecture
Performance?
“Spark’s core developers have worked extensively to bridge the performance gap between JVM languages and Python.
In particular, PySpark can now run on PyPy to leverage the just-in-time compiler. (Up to 50x speedup)
The way Python processes communicate with the main Spark JVM programs have also been redesigned to enable worker reuse.”
Comparable with native Spark...
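The PyPy point is mostly a configuration detail: Spark picks the worker interpreter from the documented PYSPARK_PYTHON environment variable. A hedged sketch (interpreter path and availability are assumptions):

# Assumed setup: run the PySpark workers (and, here, the driver) under PyPy by
# exporting the documented PYSPARK_PYTHON variable before starting the job:
#
#   export PYSPARK_PYTHON=pypy
#   pypy my_job.py            # or: spark-submit my_job.py
#
# my_job.py itself is ordinary PySpark code:
from pyspark import SparkContext

sc = SparkContext("local[*]", "pypy-demo")
print(sc.parallelize(range(1000)).map(lambda x: x * x).sum())
sc.stop()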
Enough said...
let's give it a try.
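As a first try, the classic word count, hedged: the input path is an assumption, and the job runs locally.

from operator import add
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-demo")
counts = (sc.textFile("data.txt")                      # assumed input file
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(add))
for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, n)
sc.stop()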
PySpark
Big data in Python
Very brief introduction
2016/01/14
Ivan Héda / @ivanhda
Jumpshot, Inc.