Presented by David Chen @ GliaCloud
Google Developer Expert (GDE)
Ad Tech Company
User behavior analysis based on millions of logs
Interactive data analysis tool for large datasets designed by Google
http://gdeltproject.org/ http://gdeltproject.org/globaldashboard/
Data:
Supported by Google Jigsaw, the GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, counts, themes, sources, emotions, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.
2013 NYC Taxi rides
Demo:
Based on Dremel: columnar storage plus multi-level execution trees. A query is processed by thousands of servers arranged in a multi-level execution tree.
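To make the execution-tree idea concrete, here is a minimal sketch (hypothetical JavaScript, not BigQuery internals): leaf servers compute partial aggregates over their own data shards, and mixer nodes merge those partials on the way up the tree.

```javascript
// Hypothetical sketch of Dremel-style multi-level aggregation:
// leaf servers compute partial COUNTs over their shards, and a
// mixer merges the partial results. Data and keys are made up.

// Partial aggregation at a leaf: count rows per key in one shard.
function leafAggregate(shard) {
  const partial = {};
  for (const row of shard) {
    partial[row.state] = (partial[row.state] || 0) + 1;
  }
  return partial;
}

// A mixer merges partial results from its children.
function merge(partials) {
  const out = {};
  for (const p of partials) {
    for (const [key, count] of Object.entries(p)) {
      out[key] = (out[key] || 0) + count;
    }
  }
  return out;
}

// Two shards of rows, as if split across leaf servers.
const shards = [
  [{ state: 'CA' }, { state: 'NY' }, { state: 'CA' }],
  [{ state: 'CA' }, { state: 'TX' }],
];
const result = merge(shards.map(leafAggregate));
console.log(result); // { CA: 3, NY: 1, TX: 1 }
```

The same merge step can be stacked into as many mixer levels as needed, which is what lets a GROUP BY fan out over thousands of servers.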
https://cloud.google.com/bigquery/sample-tables
SELECT state, COUNT(*) count_babies
FROM [publicdata:samples.natality]
WHERE year >= 1980 AND year < 1990
GROUP BY state
ORDER BY count_babies DESC
LIMIT 10
Query optimizer:
It does support JOIN.
SELECT Year, Actor1Name, Actor2Name, Count
FROM (
  SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count,
         RANK() OVER(PARTITION BY Year ORDER BY Count DESC) rank
  FROM (
    SELECT Actor1Name, Actor2Name, Year
    FROM [gdelt-bq:full.events]
    WHERE Actor1Name < Actor2Name
      AND Actor1CountryCode != '' AND Actor2CountryCode != ''
      AND Actor1CountryCode != Actor2CountryCode
  ), (
    SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year
    FROM [gdelt-bq:full.events]
    WHERE Actor1Name > Actor2Name
      AND Actor1CountryCode != '' AND Actor2CountryCode != ''
      AND Actor1CountryCode != Actor2CountryCode
  )
  WHERE Actor1Name IS NOT NULL AND Actor2Name IS NOT NULL
  GROUP EACH BY 1, 2, 3
  HAVING Count > 100
)
WHERE rank = 1
ORDER BY Year
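The query above leans on the RANK() OVER(PARTITION BY ... ORDER BY ...) window function to keep only the top actor pair per year. As a rough sketch of what rank-within-partition means, here is a hypothetical JavaScript helper with made-up sample rows (not BigQuery code):

```javascript
// Hypothetical sketch of RANK() OVER (PARTITION BY year ORDER BY count DESC):
// rows are grouped by the partition key, sorted within each group,
// and ranked, with ties sharing a rank.
function rankByPartition(rows, partitionKey, orderKey) {
  const groups = {};
  for (const row of rows) {
    const k = row[partitionKey];
    groups[k] = groups[k] || [];
    groups[k].push(row);
  }
  const ranked = [];
  for (const group of Object.values(groups)) {
    group.sort((a, b) => b[orderKey] - a[orderKey]); // descending
    group.forEach((row, i) => {
      // RANK(): tied values share a rank; the next rank skips ahead.
      const rank =
        i > 0 && group[i - 1][orderKey] === row[orderKey]
          ? ranked[ranked.length - 1].rank
          : i + 1;
      ranked.push({ ...row, rank });
    });
  }
  return ranked;
}

// Made-up rows standing in for the grouped GDELT counts.
const rows = [
  { year: 1979, pair: 'USA-CHINA', count: 500 },
  { year: 1979, pair: 'USA-USSR', count: 900 },
  { year: 1980, pair: 'USA-USSR', count: 700 },
];
// Keeping rank=1 per year mirrors the outer WHERE rank=1 filter.
const top = rankByPartition(rows, 'year', 'count').filter(r => r.rank === 1);
console.log(top.map(r => `${r.year}: ${r.pair}`));
// [ '1979: USA-USSR', '1980: USA-USSR' ]
```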
Query speed comparison for 1.4 TB of data:

Type               Speed
Hadoop with Hive   1491 sec
Amazon Redshift    155 sec
Google BigQuery    1.8 sec

Ref: http://www.slideshare.net/DharmeshVaya/exploring-bigdata-with-google-bigquery
Link to Google Sheets template:
https://docs.google.com/spreadsheets/d/1RDG_Sdza9uAxsIe1SeXhYeWhmoQYh3hpD5pNwaalKes/edit#gid=760729610
Install the OWOX BI BigQuery Reports plugin.
Write JavaScript UDFs with BigQuery: https://cloud.google.com/bigquery/user-defined-functions
// UDF definition
function urlDecode(row, emit) {
  emit({title: decodeHelper(row.title), requests: row.num_requests});
}

// Helper function with error handling
function decodeHelper(s) {
  try {
    return decodeURI(s);
  } catch (ex) {
    return s;
  }
}

SELECT requests, title
FROM urlDecode(
  SELECT title, SUM(requests) AS num_requests
  FROM [fh-bigquery:wikipedia.pagecounts_201504]
  WHERE language = 'fr'
  GROUP EACH BY title
)
WHERE title LIKE '%ç%'
ORDER BY requests DESC
LIMIT 100
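Because a UDF is plain JavaScript, its helper logic can be exercised locally in node before running the query. A small sketch using the same decodeHelper as above (the sample inputs are made up):

```javascript
// The UDF helper is ordinary JavaScript, so it can be tested locally
// before deploying. Same decodeHelper as in the UDF above.
function decodeHelper(s) {
  try {
    return decodeURI(s);
  } catch (ex) {
    return s; // fall back to the raw string on malformed input
  }
}

console.log(decodeHelper('Fran%C3%A7ois')); // François
console.log(decodeHelper('100%valid?no'));  // malformed escape: returned as-is
```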
There are two ways:
- BigQuery Connector (Hadoop and Spark)
- MapReduce on BigQuery via UDF (a trick):

// UDF
function mapper(v1, v2) {
  // compute key and value from the input columns
  return {"key": key, "value": value};
}

SELECT key, value
FROM mapper(
  SELECT col1, col2, .... FROM [table]
)
// UDF
function mapper(v1, v2) {
  // compute key and value from the input columns
  return {"key": key, "value": value};
}
function reducer(key, values) {
  // combine the nested values for one key into result
  return {"result": result};
}

SELECT key, result
FROM reducer(
  SELECT key, NEST(value) AS result
  FROM mapper(
    SELECT col1, col2, ... FROM [table]
  )
  GROUP BY key
)
function mapper(row, emit) {
  if (row.comment) {
    var keywords = row.comment.split(' ');
    for (var i = 0; i < keywords.length; i++) {
      emit({keyword: keywords[i], count: 1});
    }
  }
}

function reducer(row, emit) {
  var total = 0;
  for (var i = 0; i < row.count.length; i++) {
    total += row.count[i];
  }
  emit({keyword: row.keyword, total: total});
}

bigquery.defineFunction(
  'mapper',
  ['comment'],
  [{'name': 'keyword', 'type': 'string'},
   {'name': 'count', 'type': 'integer'}],
  mapper
);

bigquery.defineFunction(
  'reducer',
  ['keyword', 'count'],
  [{'name': 'keyword', 'type': 'string'},
   {'name': 'total', 'type': 'integer'}],
  reducer
);

SELECT keyword, total
FROM reducer(
  SELECT keyword, NEST(count) AS count
  FROM mapper(
    SELECT Actor1Geo_FullName AS comment
    FROM [gdelt-bq:gdeltv2.events]
  )
  GROUP BY keyword
)
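Since the mapper and reducer are ordinary JavaScript, the whole pipeline can be dry-run locally before touching BigQuery. A hypothetical harness (the sample rows are made up) that emulates emit, the GROUP BY, and NEST():

```javascript
// Local dry-run of the mapper/reducer pattern above, emulating what
// the query does: emit -> GROUP BY keyword -> NEST(count) -> reducer.
function mapper(row, emit) {
  if (row.comment) {
    var keywords = row.comment.split(' ');
    for (var i = 0; i < keywords.length; i++) {
      emit({ keyword: keywords[i], count: 1 });
    }
  }
}

function reducer(row, emit) {
  var total = 0;
  for (var i = 0; i < row.count.length; i++) {
    total += row.count[i];
  }
  emit({ keyword: row.keyword, total: total });
}

// Hypothetical sample rows standing in for the GDELT table.
const rows = [{ comment: 'New York' }, { comment: 'New Delhi' }];

// Run the mapper, collecting emitted rows.
const mapped = [];
rows.forEach(row => mapper(row, r => mapped.push(r)));

// GROUP BY keyword with NEST(count): keyword -> [1, 1, ...].
const grouped = {};
for (const r of mapped) {
  grouped[r.keyword] = grouped[r.keyword] || [];
  grouped[r.keyword].push(r.count);
}

// Run the reducer over each group.
const results = [];
for (const [keyword, count] of Object.entries(grouped)) {
  reducer({ keyword, count }, r => results.push(r));
}
console.log(results);
// [ { keyword: 'New', total: 2 },
//   { keyword: 'York', total: 1 },
//   { keyword: 'Delhi', total: 1 } ]
```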