On GitHub: arokem / 2015-11-05-IU
Ariel Rokem, University of Washington eScience Institute
Follow along at http://arokem.github.io/2015-11-05-IU/
Data Science is an interdisciplinary solution to many of the problems facing modern-day research
An example from neuroimaging
Data Science comes to the University
1. Empirical (experimental)
2. Theoretical (mathematical)
3. Simulation (computational)
4. Data-intensive (eScience)
In a variety of fields
Data is enabling new ways of doing things
But data also poses challenges
The 4 V's: Volume, Velocity, Variety, Veracity/Validity
Made his wealth wrangling financial data.
As mayor, started 311, and made NYC data openly available.
Statistics and machine learning
Programming and software engineering
Data management
Data visualization and communication
A focus on reproducibility and openness
Brain connections change with development
Individual differences account for differences in behaviour
Adapt with learning
This has clinical significance
Started in 2009 by Eleftherios Garyfallidis
Contributors from at least six different countries and many different labs
The lingua franca of reproducible computational science
Open source
Easy to learn
Phenomenal ecosystem of open-source tools
Openly available source code is good
Open development is better!
model = ReconstModel(gtab, ...)
fit = model.fit(data, ...) # => ReconstFit
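The model/fit pattern above can be sketched without DIPY. The classes `LineModel` and `LineFit` below are hypothetical stand-ins for `ReconstModel` and `ReconstFit`, and the array `x` plays the role of `gtab`; this is an illustration of the design under those assumptions, not DIPY's implementation.

```python
import numpy as np


class LineFit:
    """Holds fitted parameters (hypothetical analogue of a ReconstFit)."""

    def __init__(self, params):
        self.params = params

    def predict(self, x):
        # Build the design matrix for the requested sample points
        # and apply the fitted parameters to it.
        x = np.asarray(x, dtype=float)
        design = np.column_stack([np.ones_like(x), x])
        return design @ self.params


class LineModel:
    """Holds the sampling scheme (hypothetical analogue of a ReconstModel)."""

    def __init__(self, x):
        # `x` describes how the data were sampled, as `gtab` does in the slides.
        self.x = np.asarray(x, dtype=float)
        self.design = np.column_stack([np.ones_like(self.x), self.x])

    def fit(self, data):
        # Ordinary least squares; the result is returned as a separate object.
        data = np.asarray(data, dtype=float)
        params, *_ = np.linalg.lstsq(self.design, data, rcond=None)
        return LineFit(params)


x = np.array([0.0, 1.0, 2.0, 3.0])
data = 1.0 + 2.0 * x  # noiseless synthetic signal
model = LineModel(x)
fit = model.fit(data)
prediction = fit.predict(x)
```

Keeping the model (what was measured, and how) separate from the fit (what was estimated from one data set) lets the same model object be reused across data sets.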
model = dti.TensorModel(gtab)
fit = model.fit(data1)
prediction = fit.predict(gtab)
RMSE = np.sqrt(np.mean((prediction - data2) ** 2, -1))
rRMSE = RMSE / np.sqrt(np.mean((data1 - data2) ** 2, -1))
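A self-contained sketch of the relative RMSE computation, using synthetic arrays in place of real diffusion data. The function name `relative_rmse`, the array shapes, and the noise level are assumptions made for illustration. Measurements are assumed to lie along the last axis, so the result has one value per voxel.

```python
import numpy as np


def relative_rmse(prediction, data1, data2):
    """Model prediction error on data2, relative to test-retest error."""
    # Error of the model (fit to data1) in predicting the repeat measurement.
    rmse = np.sqrt(np.mean((prediction - data2) ** 2, axis=-1))
    # Error of using the first measurement itself as the prediction.
    test_retest = np.sqrt(np.mean((data1 - data2) ** 2, axis=-1))
    return rmse / test_retest


rng = np.random.default_rng(0)

# Synthetic stand-ins: 5 voxels, 20 measurements per voxel.
signal = rng.uniform(0.5, 1.0, size=(5, 20))
data1 = signal + rng.normal(0.0, 0.05, size=signal.shape)  # first measurement
data2 = signal + rng.normal(0.0, 0.05, size=signal.shape)  # repeat measurement

# Idealized case: the model recovered the noiseless signal from data1.
rrmse = relative_rmse(signal, data1, data2)

# Degenerate case: a "model" that simply reproduces data1.
rrmse_copy = relative_rmse(data1, data1, data2)
```

A model that only reproduces the first measurement scores exactly 1. Values below 1 mean the model predicts the repeat measurement better than the first measurement does.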
# Use a k of 2
dti_pred = kfold_xval(dti_model, data, 2)
csd_pred = kfold_xval(csd_model, data, 2)
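The idea behind k-fold cross-validation can be sketched in plain NumPy. The helper `kfold_predict` and the linear least-squares model are hypothetical and are not how DIPY implements `kfold_xval`. The measurements are split into k groups, and each group is predicted from a fit to the remaining groups.

```python
import numpy as np


def kfold_predict(design, signal, k, seed=0):
    """Predict every measurement from a fit that never saw it.

    Hypothetical helper: `design` has shape (n_measurements, n_params)
    and `signal` has shape (n_measurements,).
    """
    n = signal.shape[0]
    # Shuffle the measurement indices, then cut them into k folds.
    order = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(order, k)
    prediction = np.empty(n, dtype=float)
    for held_out in folds:
        # Fit on everything except the held-out fold.
        train = np.setdiff1d(order, held_out)
        params, *_ = np.linalg.lstsq(design[train], signal[train], rcond=None)
        # Predict only the measurements that were left out of the fit.
        prediction[held_out] = design[held_out] @ params
    return prediction


x = np.linspace(0.0, 1.0, 10)
design = np.column_stack([np.ones_like(x), x])
signal = 3.0 - 2.0 * x  # noiseless synthetic signal
prediction = kfold_predict(design, signal, k=2)
```

With k of 2, each fit uses half of the measurements and predicts the other half, so every entry of `prediction` is out-of-sample.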
fiber_model = life.FiberModel(gtab)
fit = fiber_model.fit(data, tracks)
prediction = fit.predict(gtab)
optimized_tracks = tracks[fit.beta>0]
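The last step keeps only streamlines that received a positive weight. The sketch below uses made-up streamlines and weights; `beta` stands in for `fit.beta`. It assumes `tracks` is a plain Python list, which cannot be indexed with a boolean array the way an array-like container can.

```python
import numpy as np

# Hypothetical stand-ins: three streamlines with different numbers of points,
# and one fitted weight per streamline (the role played by `fit.beta`).
tracks = [np.zeros((4, 3)), np.ones((6, 3)), np.full((5, 3), 2.0)]
beta = np.array([0.7, 0.0, 0.2])

# Boolean mask: True where the streamline contributes to the predicted signal.
keep = beta > 0

# Select explicitly, since a list does not support boolean-array indexing.
optimized_tracks = [track for track, kept in zip(tracks, keep) if kept]
```

Streamlines with a weight of zero do not contribute to the predicted signal, so dropping them leaves the prediction unchanged.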
Facilitate data-intensive research in different fields (inter- and cross-disciplinary)
Focus on methodology
Focus on reproducibility
Contribute to openly available tools, rather than/in addition to peer-reviewed publications
"Career paths for data scientists that recognize and reward contributions in methodology, computation, or development of tools are important."
Degree programs
Workshops (Software Carpentry, ...)
Project-oriented training
Focused, intensive, collaborative projects
Data scientists + domain scientists
Results that wouldn't be possible otherwise
Inspired by the DSSG programs at the University of Chicago and Georgia Tech
10-week internship program
16 DSSG fellows/students
6 high-school students from ALVA program
4 projects (+project leads!)
+ Data scientist mentors
It is possible to both:
Do interesting things with data, with social good implications
Provide highly effective training
Stakeholder involvement is important (no projects "thrown over the fence")
In-house expertise (data scientists, program managers) is an important asset
But (hypothesis) DSSG can be translated into other settings
We wrote a paper with some ideas.
Data Science is an interdisciplinary solution to many of the problems facing modern-day research
An example from neuroimaging
Data Science comes to the University