slides



slides

1 4


slides

my personal slides

On Github lucemia / slides

BigQuery & MapReduce

Presented by David Chen @ GliaCloud

GDE

Use Case

Ad Tech Company

User behavior analysis based on million of logs

What is BigQuery

Interactive data analysis tool for large datasets designed by Google

Some Application and dataset

https://www.reddit.com/r/bigquery/wiki/datasets

GDelt Project: A Global Database of Society

http://gdeltproject.org/ http://gdeltproject.org/globaldashboard/

Data:

  • 340GB

Supported by Google Jigsaw, the GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, counts, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.

NYC Taxi

2013 NYC Taxi rides

  • 173 million rows
  • 130GB

Demo:

Cancer Genomics

  • 400GB

Demo:

Soccor

Demo: http://nbviewer.jupyter.org/github/GoogleCloudPlatform/ipython-soccer-predictions/blob/master/predict/wc-final.ipynb

Why BigQuery

Tools designed for Big Data Easy to Use Fast and Affordable

Tools designed for Big Data

Based on Dremel, Columnar Storage & multi-level execution trees. The query is processed by thousands of servers in a multi-level execution tree structure.

Simple Query

https://cloud.google.com/bigquery/sample-tables

select
    state, count(*) count_babies
from [publicdata:samples.natality]
where
    year >= 1980 and year < 1990
group by state
order by count_babies DESC
limit 10

BigQuery

Columnar Storage

Query optimizier database:

  • uniform type, better compression rate
  • only access required data
  • reduce disk I/O

Easy to Use

easy to share

no deployment and (almost) no cost while you don't need it

https://bigquery.cloud.google.com/welcome?pli=1

use SQL

It did supports JOIN

SELECT Year, Actor1Name, Actor2Name, Count FROM (
SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count,
RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rank
FROM
(
    SELECT Actor1Name, Actor2Name,  Year
    FROM [gdelt-bq:full.events]
    WHERE Actor1Name < Actor2Name
        and Actor1CountryCode != '' and Actor2CountryCode != ''
        and Actor1CountryCode!=Actor2CountryCode
),
(
    SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year
    FROM [gdelt-bq:full.events] WHERE Actor1Name > Actor2Name
    and Actor1CountryCode != '' and Actor2CountryCode != ''
    and Actor1CountryCode!=Actor2CountryCode),
WHERE Actor1Name IS NOT null
AND Actor2Name IS NOT null
GROUP EACH BY 1, 2, 3
HAVING Count > 100
)
WHERE rank=1
ORDER BY Year

more than SQL

  • JSON format
  • schema (nested data)

Fast and Affordable

Fast

for 1.4 TB data

Type speed Hadoop with Hive 1491 sec Amazon Redshit 155 sec Google BigQuery 1.8 sec

Ref: http://www.slideshare.net/DharmeshVaya/exploring-bigdata-with-google-bigquery

Affordable

Type Price Storage $0.02 per GB / month (free after 90days) Processing $5 per TB (first 1TB free)

Interesting Dataset

Data Size Samples from US weather stations since 1929 115M Measurement data of broadband connection performance 240B Birth information for the United States 68M Word index for works of Shakespeare 164K Revision information for Wikipedia articles 314M NY Taxis Log 173M

more: https://www.reddit.com/r/bigquery/wiki/datasets

Demo: Use BigQuery, with Google Sheet

link to google sheet template

https://docs.google.com/spreadsheets/d/1RDG_Sdza9uAxsIe1SeXhYeWhmoQYh3hpD5pNwaalKes/edit#gid=760729610 Install OWOX BI BigQuery Reports plugin

BigQuery UDF

Write Javascript with BigQuery https://cloud.google.com/bigquery/user-defined-functions

// UDF definition
function urlDecode(row, emit) {
  emit({title: decodeHelper(row.title),
        requests: row.num_requests});
}

// Helper function with error handling
function decodeHelper(s) {
  try {
    return decodeURI(s);
  } catch (ex) {
    return s;
  }
}

SELECT requests, title
FROM
  urlDecode(
    SELECT
      title, sum(requests) AS num_requests
    FROM
      [fh-bigquery:wikipedia.pagecounts_201504]
    WHERE language = 'fr'
    GROUP EACH BY title
  )
WHERE title LIKE '%ç%'
ORDER BY requests DESC
LIMIT 100

MapReduce

There is two ways:

BigQuery Connector (Hadoop and Spark) MapReduce on BigQuery vis UDF (trick)

Review Mapreduce

MapReduce

Dremel (BigQuery) vs MapReduce

  • MapReduce
    • Flexible batch processing
    • High overall throughput
    • High Latency
  • Dremel
    • Optimized for interactive SQL queries
    • Very Low Latency

Map

// UDF
function mapper(v1, v2){
    return {"key":key, "value": value}
}

select key, value
    from mapper(
        select col1, col2, ....
            from [table]
    )

MapReduce

// UDF
function mapper(v1, v2){
    return {"key":key, "value": value}
}
function reducer(key, values){
    return {"result": result}
}

select key, result
    from reducer(
        select key, nest(value) as result
        from mapper(
             select col1, col2, ...
                 from [table]
       )
       group by key
   )

Example: Word count

function mapper(row, emit) {
  if(row.comment) {
  keywords = row.comment.split(' ');
  for(var i=0; i<keywords.length; i++) {
    emit({keyword: keywords[i], count: 1});
  }
  }
}

function reducer(row, emit) {
  var total = 0;
  for(var i=0;i<row.count.length; i++) {
    total += row.count[i];
  }
  emit({keyword: row.keyword, total: total});
}

bigquery.defineFunction(
  'mapper',
  ['comment'],
  [{'name': 'keyword', 'type': 'string'},
  {'name': 'count', 'type': 'int'}],
  mapper
);

bigquery.defineFunction(
  'reducer',
  ['keyword', 'count'],
  [{'name': 'keyword', 'type': 'string'},
  {'name': 'total', 'type': 'int'}],
  reducer
)

select keyword, total
    from reducer(
        select keyword, nest(count) as count
            from mapper(
                select Actor1Geo_FullName as comment
                    from [gdelt-bq:gdeltv2.events]
            )
            group by keyword
    )

Great for education

Reference

http://www.slideshare.net/BigDataSpain/jordan-tigani-big-query-bigdata-spain-2012-conference-15382442

BigQuery & MapReduce Presented by David Chen @ GliaCloud GDE