On GitHub: cbare / Strata_2014_recap_slides
Recap • February 26, 2014
Brian Granger (Cal Poly SLO)
“Compose and share reproducible stories that involve code and data”
coming in IPython 2.0
Coming soon: a drop-down menu for choosing the kernel language
Olivier Grisel (Parietal, INRIA)
Machine learning in Python
from INRIA
Hadley Wickham (RStudio)
More awesome R packages from Hadley.
Yann Ramin (Twitter)
Just like you design for testability, design for monitoring
† Cassandra-like key/value store, fronted by an API over caches, queues, and storage
Increment a counter
Stats.incr("some_important_counter")
Store the value of a variable at a particular time
Stats.addGauge("current_temperature") { myThermometer.temperature }
Record running time
Stats.time("translation") { document.translate("de", "en") }
(service, source (host, process), metric)
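The three calls above (counter, gauge, timer) and the key tuple can be tied together in a small sketch. This is a hypothetical in-memory stand-in written in Python, not the library used in the talk: the `Stats` class, `add_gauge`, `snapshot`, and the service and host names are all assumptions made for illustration.

```python
from collections import defaultdict
from contextlib import contextmanager
from time import perf_counter


class Stats:
    """Hypothetical in-memory registry mimicking the calls on the slide."""

    def __init__(self, service, host, process):
        # Every metric is keyed by (service, (host, process), metric),
        # mirroring the key tuple shown above.
        self.service = service
        self.source = (host, process)
        self.counters = defaultdict(int)
        self.gauges = {}
        self.timings = defaultdict(list)

    def _key(self, metric):
        return (self.service, self.source, metric)

    def incr(self, metric, amount=1):
        # Counter: a number that only moves when the code bumps it.
        self.counters[self._key(metric)] += amount

    def add_gauge(self, metric, read_value):
        # Gauge: store a callable and sample it when a snapshot is taken.
        self.gauges[self._key(metric)] = read_value

    @contextmanager
    def time(self, metric):
        # Timer: record the wall-clock duration of the wrapped block.
        start = perf_counter()
        try:
            yield
        finally:
            self.timings[self._key(metric)].append(perf_counter() - start)

    def snapshot(self):
        # Flatten everything into one {key: value} mapping.
        out = dict(self.counters)
        for key, read_value in self.gauges.items():
            out[key] = read_value()
        for key, samples in self.timings.items():
            out[key] = sum(samples) / len(samples)
        return out


stats = Stats("translator", "host-01", "worker-1")
stats.incr("some_important_counter")
stats.add_gauge("current_temperature", lambda: 21.5)
with stats.time("translation"):
    sum(range(1000))  # placeholder for the real work being timed
snap = stats.snapshot()
```

Keying every value by the full tuple is what lets a collector later aggregate one metric across hosts or drill into a single process.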
Chris Ré (Stanford)
Examples of trained systems:
Extracting structured information from unstructured sources such as raw text
DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference
“Good features allow a simple model to beat a complex model”
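A toy illustration of the quote, not taken from the talk: a plain perceptron (about the simplest model there is) cannot learn XOR from the two raw inputs, but learns it exactly once a single engineered feature, the product of the inputs, is added. It does not compare against a complex model; it only shows how much the feature set matters. All names here are assumptions for the sketch.

```python
def train_perceptron(samples, labels, epochs=200):
    """Train a plain perceptron; the last weight is the bias."""
    weights = [0.0] * (len(samples[0]) + 1)
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            xb = list(x) + [1.0]  # append constant bias input
            score = sum(w * v for w, v in zip(weights, xb))
            if y * score <= 0:  # misclassified (or on the boundary)
                weights = [w + y * v for w, v in zip(weights, xb)]
                mistakes += 1
        if mistakes == 0:  # a clean pass means the data is separated
            break
    return weights


def accuracy(weights, samples, labels):
    """Fraction of samples whose predicted sign matches the label."""
    correct = 0
    for x, y in zip(samples, labels):
        xb = list(x) + [1.0]
        score = sum(w * v for w, v in zip(weights, xb))
        predicted = 1 if score > 0 else -1
        correct += predicted == y
    return correct / len(samples)


raw = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, 1, 1, -1]  # XOR of the two inputs, encoded as -1 / +1

# Engineered feature: the product a * b makes XOR linearly separable.
engineered = [(a, b, a * b) for a, b in raw]

raw_accuracy = accuracy(train_perceptron(raw, labels), raw, labels)
engineered_accuracy = accuracy(
    train_perceptron(engineered, labels), engineered, labels
)
```

With the raw inputs no linear boundary separates the classes, so `raw_accuracy` stays below 1.0 however long training runs; with the product feature the perceptron converges.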
Brainwash: A Data System for Feature Engineering
† similar to what we observe in challenges
Avery Ching (Facebook)
“iterative graph processing system”
Two phases
...iterate until halting criteria are reached
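The slide's two phases are not preserved in this recap. As a minimal sketch, assuming they correspond to the per-vertex compute step and the message exchange of a Pregel-style system, the loop below propagates the maximum value through a graph and halts when no messages remain. The function name, the example graph, and the `max_supersteps` cap are illustrative assumptions, not Giraph's API.

```python
def propagate_max(graph, values, max_supersteps=50):
    """Vertex-centric max propagation.

    graph maps each vertex to its neighbours (undirected edges are
    listed in both directions); values maps each vertex to a number.
    """
    values = dict(values)

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox = {v: [] for v in graph}
    for v, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[v])
    supersteps = 1

    # Halting criterion: no messages in flight, i.e. every vertex
    # has voted to halt.
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            # Phase 1 (assumed): compute on the vertex from its messages.
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                # Phase 2 (assumed): send the new value to neighbours.
                for n in graph[v]:
                    outbox[n].append(values[v])
            # Otherwise the vertex sends nothing, i.e. votes to halt.
        inbox = outbox
        supersteps += 1
    return values, supersteps


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
start = {"a": 3, "b": 6, "c": 2, "d": 1}
final, steps = propagate_max(graph, start)
```

Each vertex only ever reads its own value and its inbox, which is what lets a real system spread the per-vertex work across many machines.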
Matei Zaharia (MIT, DataBricks)
Spark is a next-generation Hadoop built around in-memory computing
Ameet Talwalkar (UC Berkeley)
BDAS (the Berkeley Data Analytics Stack) is a suite of software created by the AMPLab
Data-intensive distributed computing stack
MLBase is a set of machine learning libraries that run on Spark (as Mahout runs on Hadoop)
Monica Rogati (Jawbone)