Presented by David Chen @ GliaCloud
Google Developer Expert (GDE)
Ad Tech Company
User behavior analysis based on millions of logs
Interactive data analysis tool for large datasets designed by Google
http://gdeltproject.org/ http://gdeltproject.org/globaldashboard/
Data:
Supported by Google Jigsaw, the GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, counts, themes, sources, emotions, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.
2013 NYC Taxi rides
Demo:
Based on Dremel: columnar storage plus multi-level execution trees. Each query is processed by thousands of servers arranged in a multi-level execution tree.
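Columnar storage is why a query only scans (and is only billed for) the columns it references. A minimal sketch in plain JavaScript (illustrative data and names, not BigQuery internals) contrasting the two layouts:

```javascript
// Illustrative sketch: row-oriented vs column-oriented storage.
// In a row store, reading one field still drags every row's full record in.
const rowStore = [
  { state: 'NY', year: 1984, weight: 3.2 },
  { state: 'CA', year: 1985, weight: 3.4 },
  { state: 'NY', year: 1986, weight: 3.1 },
];

// In a column store each column is a separate contiguous array, so a
// query over `state` alone never touches `year` or `weight`.
const colStore = {
  state: ['NY', 'CA', 'NY'],
  year: [1984, 1985, 1986],
  weight: [3.2, 3.4, 3.1],
};

// COUNT(*) ... GROUP BY state, scanning only the `state` column.
function countByState(states) {
  const counts = {};
  for (const s of states) counts[s] = (counts[s] || 0) + 1;
  return counts;
}

console.log(countByState(colStore.state)); // { NY: 2, CA: 1 }
```

The same aggregate over `rowStore` would have to deserialize every field of every row; the columnar scan reads a third of the bytes here, and far less on wide production tables.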
https://cloud.google.com/bigquery/sample-tables
SELECT
  state, COUNT(*) count_babies
FROM [publicdata:samples.natality]
WHERE
  year >= 1980 AND year < 1990
GROUP BY state
ORDER BY count_babies DESC
LIMIT 10
Query optimizer:
It does support JOINs
SELECT Year, Actor1Name, Actor2Name, Count FROM (
SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count,
RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rank
FROM
(
SELECT Actor1Name, Actor2Name, Year
FROM [gdelt-bq:full.events]
WHERE Actor1Name < Actor2Name
and Actor1CountryCode != '' and Actor2CountryCode != ''
and Actor1CountryCode!=Actor2CountryCode
),
(
SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year
FROM [gdelt-bq:full.events] WHERE Actor1Name > Actor2Name
and Actor1CountryCode != '' and Actor2CountryCode != ''
and Actor1CountryCode!=Actor2CountryCode)
WHERE Actor1Name IS NOT null
AND Actor2Name IS NOT null
GROUP EACH BY 1, 2, 3
HAVING Count > 100
)
WHERE rank=1
ORDER BY Year
For 1.4 TB of data:
Hadoop with Hive: 1491 sec
Amazon Redshift: 155 sec
Google BigQuery: 1.8 sec
Ref: http://www.slideshare.net/DharmeshVaya/exploring-bigdata-with-google-bigquery
Link to Google Sheets template:
https://docs.google.com/spreadsheets/d/1RDG_Sdza9uAxsIe1SeXhYeWhmoQYh3hpD5pNwaalKes/edit#gid=760729610
Install the OWOX BI BigQuery Reports plugin
Write JavaScript with BigQuery (user-defined functions): https://cloud.google.com/bigquery/user-defined-functions
// UDF definition
function urlDecode(row, emit) {
emit({title: decodeHelper(row.title),
requests: row.num_requests});
}
// Helper function with error handling
function decodeHelper(s) {
try {
return decodeURI(s);
} catch (ex) {
return s;
}
}
SELECT requests, title
FROM
urlDecode(
SELECT
title, sum(requests) AS num_requests
FROM
[fh-bigquery:wikipedia.pagecounts_201504]
WHERE language = 'fr'
GROUP EACH BY title
)
WHERE title LIKE '%ç%'
ORDER BY requests DESC
LIMIT 100
There are two ways:
BigQuery Connector (Hadoop and Spark)
MapReduce on BigQuery via UDFs (a trick)
// UDF
// Schematic: v1, v2 are input columns; key/value are computed from them.
function mapper(v1, v2){
  return {"key": key, "value": value}
}
select key, value
from mapper(
select col1, col2, ....
from [table]
)
// UDF
// Schematic: key/value and result are computed from the inputs.
function mapper(v1, v2){
  return {"key": key, "value": value}
}
function reducer(key, values){
  return {"result": result}
}
select key, result
from reducer(
select key, nest(value) as result
from mapper(
select col1, col2, ...
from [table]
)
group by key
)
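The inner `GROUP BY key` with `NEST(value)` is what turns the mapper's flat output into one row per key whose values arrive at the reducer as an array (a repeated field). A rough local equivalent of that step in plain JavaScript (illustrative names, not a BigQuery API):

```javascript
// Mimic legacy SQL's: SELECT key, NEST(value) AS result ... GROUP BY key.
// Groups rows by key and collapses each group's values into an array.
function groupAndNest(rows) {
  const groups = {};
  for (const { key, value } of rows) {
    (groups[key] = groups[key] || []).push(value);
  }
  // One output row per key, with the nested values as `result`.
  return Object.entries(groups).map(([key, result]) => ({ key, result }));
}

console.log(groupAndNest([
  { key: 'a', value: 1 },
  { key: 'b', value: 2 },
  { key: 'a', value: 3 },
]));
// [ { key: 'a', result: [ 1, 3 ] }, { key: 'b', result: [ 2 ] } ]
```

This is exactly the shape the reducer UDF then consumes: a key plus an array of that key's values.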
function mapper(row, emit) {
if(row.comment) {
var keywords = row.comment.split(' ');
for(var i=0; i<keywords.length; i++) {
emit({keyword: keywords[i], count: 1});
}
}
}
function reducer(row, emit) {
var total = 0;
for(var i=0;i<row.count.length; i++) {
total += row.count[i];
}
emit({keyword: row.keyword, total: total});
}
bigquery.defineFunction(
'mapper',
['comment'],
[{'name': 'keyword', 'type': 'string'},
{'name': 'count', 'type': 'integer'}],
mapper
);
bigquery.defineFunction(
'reducer',
['keyword', 'count'],
[{'name': 'keyword', 'type': 'string'},
{'name': 'total', 'type': 'integer'}],
reducer
);
select keyword, total
from reducer(
select keyword, nest(count) as count
from mapper(
select Actor1Geo_FullName as comment
from [gdelt-bq:gdeltv2.events]
)
group by keyword
)
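The whole word-count pipeline can also be dry-run locally by chaining the two UDF bodies with a group-by in between. A sketch with mock comments and an invented harness (the `wordCount` wrapper is not a BigQuery API):

```javascript
// mapper/reducer bodies as in the UDFs above.
function mapper(row, emit) {
  if (row.comment) {
    var keywords = row.comment.split(' ');
    for (var i = 0; i < keywords.length; i++) {
      emit({ keyword: keywords[i], count: 1 });
    }
  }
}

function reducer(row, emit) {
  var total = 0;
  for (var i = 0; i < row.count.length; i++) {
    total += row.count[i];
  }
  emit({ keyword: row.keyword, total: total });
}

// Harness: map, group counts per keyword (the NEST step), then reduce.
function wordCount(rows) {
  const mapped = [];
  rows.forEach((row) => mapper(row, (r) => mapped.push(r)));

  const nested = {};
  for (const { keyword, count } of mapped) {
    (nested[keyword] = nested[keyword] || []).push(count);
  }

  const out = [];
  Object.entries(nested).forEach(([keyword, count]) =>
    reducer({ keyword: keyword, count: count }, (r) => out.push(r))
  );
  return out;
}

console.log(wordCount([
  { comment: 'New York United States' },
  { comment: 'New Delhi' },
]));
// [ { keyword: 'New', total: 2 }, { keyword: 'York', total: 1 },
//   { keyword: 'United', total: 1 }, { keyword: 'States', total: 1 },
//   { keyword: 'Delhi', total: 1 } ]
```

In BigQuery the grouping in the middle is done by the engine (the `GROUP BY keyword` with `NEST(count)`); only the two JavaScript functions are shipped as UDFs.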