Use the arrows to navigate through the slides. [Esc] for an overview

Polyglot data analysis

is being

pragmatic

Laurent Gautier - ODSC 2016

(notebook as HTML| ipynb)

Disclaimer:

I am here as an individual, long-active Open Source developer and data scientist. Views and opinions expressed are solely my own.

Slides

http://lgautier.github.io/odsc-pdaibp-slides

Acknowledgments

ODSC-East organizers and sponsors
authors, contributors, and sponsors for
- R
- Python
- R packages
- Python packages
- Docker
rpy2 contributors
rpy2 users (in the hope they become contributors)

This talk contains edible art runnable code

The easy way:

docker run --rm -it -p 8888:8888 \
            rpy2/odsc-pdaibp-slides

Visit http://localhost:8888/ *

*: If docker-machine (Windows, OS X), the address is:

	      docker-machine ip [MACHINE]

First, two questions. Who thinks that:

- mixing languages is a good idea ?

- is NOT a good idea ?

Is this familiar ?

cursor = dbcon.cursor()

sql = """
SELECT reviewhelpful, count(*)
FROM review
WHERE reviewscore < 2
GROUP BY reviewhelpful
"""

cursor.execute(sql)

Hardly groundbreaking.

That's polyglot data analysis though.

You* probably have already done polyglot data analysis

(*: yes, you)

You should not feel bad about it

Data in - Insights out

Data science: a language landscape

(alphabetical order - not comprehensive)

C C++ bash Javascript Julia Python R Scala SQL

Julia ⇒ Python : PyCall Julia ⇒ R : Rif or RCall [somewhere] ⇒ R : Rserve R ⇒ Python : rPython

Polyglot data analysis a project/ensemble with:

A master/glue/host language
Slave/piece/guest language

Data in tables

Detailed account of raw materials and workdays for a basketry workshop. Clay, ca. 2040 BC (Ur III). Source: wikipedia Marie-Lan Nguyen (2006)

Fine foods

J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. WWW, 2013. http://snap.stanford.edu/data/web-FineFoods.html

product/productId: B00...W
review/userId: A1S...MSMR
review/profileName: [some profile name]
review/helpfulness: 0/0
review/score: 1.0
review/time: 1344556800
review/summary: Sad outcome
review/text: Five minutes in, one tentacle was bitten off, ball inside
             cracked in half. Not durable enough to be a dog toy.
             Disappointed :(. So is the dog :(.

productid VARCHAR reviewerid VARCHAR reviewername VARCHAR reviewhelpful VARCHAR reviewscore INTEGER reviewtime INTEGER reviewsummary VARCHAR reviewtext VARCHAR

Work with a relational database

In[1]:

import sqlite3
dbfilename = "/opt/data/finefoods.db"
dbcon = sqlite3.connect(dbfilename)

In[2]:

cursor = dbcon.cursor()
sql = """
SELECT reviewhelpful, count(*)
FROM review
WHERE reviewscore < 2
GROUP BY reviewhelpful
"""
cursor.execute(sql)

Out[2]:

<sqlite3.Cursor at 0x7fd01c3be7a0>

Python and SQL

In[3]:

from collections import Counter
ct = Counter()
for row_n, (score, count) in enumerate(cursor, 1):
    ct[score] = count
print(ct.most_common(n=3))

[('0/0', 12266), ('1/1', 4809), ('0/1', 3701)]

Slightly more complicated: Distribution of helpfulness rating of reviews coming from active reviewers giving a low score.

In[5]:

sql = """
select reviewhelpful, count(*)
from (select T.reviewername
      from (select reviewername, count(*) as reviewer_count
            from review
            group by reviewername) as T
      where T.reviewer_count > 5) as U
inner join review
on U.reviewername = review.reviewername
where reviewscore < 2
group by reviewhelpful
"""
cursor.execute(sql)

Out[5]:

<sqlite3.Cursor at 0x7fd01c3be880>

Complex queries require increasingly complex SQL

Object-relational mapping (ORM)

The promise to operate on ojects in the host language, without writting SQL.

Python and ORMs

cursor = dbcon.cursor()
sql = """
select *
from review
where reviewscore < 2
"""
cursor.execute(sql)
result = cursor.fetchall()

⇩

result = (Review
          .select()
          .where(Review.reviewscore < 2))

(example with SQLObject)

Set up

In[6]:

from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session
from sqlalchemy import create_engine

Base = automap_base()

# engine, suppose it has two tables 'user' and 'address' set up
engine = create_engine("sqlite:////opt/data/finefoods.db")

Create classes through reflection

In[7]:

Base.prepare(engine, reflect=True)
review = Base.classes.review

Query as chained calls

In[8]:

session = Session(engine)
from sqlalchemy import func # SQL functions

query = (session
         .query(review.reviewhelpful,
                func.count(review.reviewhelpful))
         .filter(review.reviewscore < 2)
         .group_by(review.reviewhelpful)
         .order_by(func.count(review.reviewhelpful).desc()))

res = query.all()
res[:3]

Out[8]:

[('0/0', 12266), ('1/1', 4809), ('0/1', 3701)]

Chained calls generate SQL

In[9]:

from sqlalchemy.dialects import sqlite
print(str(query.statement.compile(dialect=sqlite.dialect())))

SELECT review.reviewhelpful, count(review.reviewhelpful) AS count_1 
FROM review 
WHERE review.reviewscore < ? GROUP BY review.reviewhelpful ORDER BY count(review.reviewhelpful) DESC

We traded knowledge of SQL for the knowledge of an API.

About Java's Hibernate:

Back to tables

Model tables
Make most-common operations easy

dplyr

mutate modify/add column filter filter rows select select columns group_by group rows summarize summarize groups of rows (RStudio)

datasrc <- src_sqlite(dbfilename)
review_tbl <- tbl(datasrc, "review")

res <- filter(review_tbl,
              reviewscore < 2) %>%
       count(reviewhelpful) %>%
       arrange('desc(n)')

But... we started with Python, didn't we ?

Python vs R anyone ?

PythonR !

Just import the R package from Python

In[11]:

from rpy2.robjects.lib import dplyr


datasrc  = dplyr.src_sqlite(dbfilename)
review_tbl = datasrc.get_table("review")

Query as chained calls

In[12]:

res =  (review_tbl
        .filter('reviewscore < 2')
        .count('reviewhelpful')
        .arrange('desc(n)'))

print(res)

Source: sqlite 3.8.6 [/opt/data/finefoods.db]
From: <derived table> [?? x 2]
Arrange: desc(n) 

   reviewhelpful     n
           (chr) (int)
1            0/0 12266
2            1/1  4809
3            0/1  3701
4            1/2  2718
5            2/2  2123
6            0/2  1702
7            2/3  1579
8            1/3  1342
9            3/3  1104
10           2/4   921
..           ...   ...

While we are with R, what about figures ?

(and about R as a DSL in Python ?)

In[13]:

from rpy2.robjects import r, globalenv
import rpy2.robjects.lib.ggplot2 as gg

split_review = \
    'as.integer(sapply(strsplit(reviewhelpful, "/"), "[", %i))'

p = (gg.ggplot((res.collect().
                mutate(r1 = split_review % 1,
                       r2 = split_review % 2))) +
     gg.aes_string(x='r1+1', y='r2+1', size='n') +
     gg.geom_point(alpha = 0.5) +
     gg.scale_x_log10("review voted helpful") +
     gg.scale_y_log10("review") +
     gg.scale_size(trans="sqrt"))

In [14]:

from rpy2.robjects.lib import grdevices

with grdevices.render_to_bytesio(grdevices.png, 
                                 width=800,
                                 height=600, 
                                 res=120) as b:
    p.plot()

from IPython.display import Image, display
data = b.getvalue()
display(Image(data=data, format='png', embed=True))

What about Javascript then ?

D3
RStudio's Shiny (we can call any R)
plotly (R package to plot ggplot2 objects)
Continuum's bokeh (python package)

Python functions generate Javascript code.

Python + R + javascripttrilingual before we even know it !

In[15]:

from bokeh.plotting import figure
from bokeh.plotting import figure, show, output_notebook
output_notebook()

plot = figure()
res =  (review_tbl
        .filter('reviewscore < 2')
        .count('reviewhelpful')
        .collect())

import math
plot.scatter(list(int(x.split('/')[0]) \
                  for x in res.rx2('reviewhelpful')),
             list(int(x.split('/')[1]) \
                  for x in res.rx2('reviewhelpful')),
             radius=list(math.log10(x/100) for x in res.rx2('n')),
             fill_alpha=.3)

Loading BokehJS ...

Out[15]:

<bokeh.models.renderers.GlyphRenderer at 0x7fd01c340e48>

In[16]:

show(plot)

Out[16]:

Spark

Resilient Distributed Dataset (RDD)
Abstracts the distribution of compute and data
Function composition optimized before evaluation

pyspark

Setup

In[17]:

import findspark
findspark.init()

import pyspark

conf = pyspark.conf.SparkConf()
(conf.setMaster('local[2]')
 .setAppName('ipython-notebook')
 .set("spark.executor.memory", "2g"))

sc = pyspark.SparkContext(conf=conf)

From SQL table to Spark DataFrame

In[18]:

from pyspark.sql import SQLContext, Row
sqlcontext = SQLContext(sc)
cursor.execute('select * from review limit 10000')
review = \
    sqlcontext.createDataFrame(cursor,
                               tuple(x[0] for x in cursor.description))
review.registerTempTable("review")

The data is now distributed on a Spark cluster.

Running our query

In[19]:

sql = """
SELECT reviewhelpful, count(*) AS ct
FROM review
WHERE reviewscore < 2
GROUP BY reviewhelpful
ORDER BY ct DESC
"""
counts = sqlcontext.sql(sql)

Good old SQL
Translated to Spark, optimized, and to JVM bytecode

(Lazy) evaluation

In[20]:

res = counts.collect()
res[:3]

Out[20]:

[Row(reviewhelpful='0/0', ct=241),
 Row(reviewhelpful='1/1', ct=87),
 Row(reviewhelpful='0/1', ct=80)]

Map-reduce with Spark

In[21]:

names = (review
         .flatMap(lambda p: p.reviewsummary.split())
         .map(lambda word: (word.lower(), 1))
         .reduceByKey(lambda a, b: a+b))
names.takeOrdered(15, key = lambda x: -x[1])

Out[21]:

[('great', 1259),
 ('the', 914),
 ('good', 837),
 ('for', 706),
 ('a', 695),
 ('not', 624),
 ('and', 581),
 ('best', 568),
 ('my', 451),
 ('coffee', 438),
 ('but', 401),
 ('love', 395),
 ('of', 391),
 ('to', 371),
 ('is', 350)]

SQL, Map-reduce with Spark

In[22]:

sql = """
SELECT *
FROM review
WHERE reviewscore < 2
"""
lowscore = sqlcontext.sql(sql)
names = (lowscore
         .flatMap(lambda p: p.reviewsummary.split())
         .map(lambda word: (word.lower(), 1))
         .reduceByKey(lambda a, b: a+b))
names.takeOrdered(8, key = lambda x: -x[1])

Out[22]:

[('not', 132),
 ('the', 92),
 ('is', 50),
 ('of', 48),
 ('this', 46),
 ('for', 44),
 ('i', 43),
 ('a', 42)]

The Finale

Python + R + Spark + SQL

In [23]:

lst = names.takeOrdered(8, key = lambda x: -x[1])
from rpy2.robjects.vectors import StrVector, IntVector
dataf = dplyr.DataFrame({'word': StrVector([x[0] for x in lst]),
                         'count': IntVector([x[1] for x in lst])})
p = (gg.ggplot(dataf) +
     gg.geom_bar(gg.aes_string(x='word', y='count'),
                 stat="identity"))

with grdevices.render_to_bytesio(grdevices.png, 
                                 width=800,
                                 height=600, 
                                 res=120) as b:
    p.plot()
display(Image(data=b.getvalue(), format='png', embed=True))

There would be more to cover...

C for (runtime) speed
LLVM for (runtime) speed
(Web) services

Use what is already out there. Be pragmatic.

Polyglot data analysis – is being – pragmatic

lgautier

Polyglot data analysis – is being – pragmatic

3 2 (function() { var po = document.createElement('script'); po.type = 'text/javascript'; po.async = true; po.src = 'https://apis.google.com/js/platform.js'; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(po, s); })();

odsc-pdaibp-slides

Polyglot data analysis

is being

pragmatic

Laurent Gautier - ODSC 2016

Slides

Acknowledgments

This talk contains edible art runnable code

The easy way:

First, two questions. Who thinks that:

- mixing languages is a good idea ?

- is NOT a good idea ?

Is this familiar ?

Hardly groundbreaking.

That's polyglot data analysis though.

You* probably have already done polyglot data analysis

You should not feel bad about it

Data in - Insights out

Data science: a language landscape

(alphabetical order - not comprehensive)

Data in tables

Fine foods

Work with a relational database

Python and SQL

Object-relational mapping (ORM)

Python and ORMs

⇩

Set up

Create classes through reflection

Query as chained calls

Chained calls generate SQL

Back to tables

dplyr

But... we started with Python, didn't we ?

Python vs R anyone ?

PythonR !

Just import the R package from Python

Query as chained calls

While we are with R, what about figures ?

What about Javascript then ?

Spark

pyspark

Setup

From SQL table to Spark DataFrame

Running our query

(Lazy) evaluation

Map-reduce with Spark

SQL, Map-reduce with Spark

The Finale

There would be more to cover...

3 2