Harder, Better, Faster: Case Studies in Reproducible Workflows – Kathryn Huff




A talk on lessons learned from the reproducibility case studies book, given at the NYU Reproducibility Symposium

On Github BIDS / 2016-05-03-nyu

Harder, Better, Faster:

Case Studies in Reproducible Workflows

Kathryn Huff

NYU Reproducibility Symposium

May 03, 2016

  • Case Study Book Concept
  • Case Study Contributions
  • Lessons Learned
  • Next Steps!

Reproducibility and Open Science Conference

May 21–22, 2015

  • Three days
  • Invitation Only
  • Case Studies, Education, Self-assessment
  • https://github.com/BIDS/repro-conf
  • Preface (Stark)
  • Introduction (Kitzes)
  • Assessing the Reproducibility of a Research Project (Rokem, Marwick, Staneva)
  • The Basic Reproducible Workflow Template (Kitzes, Turek)
  • Introducing the Case Studies (Imamoglu, Turek)
  • PART 1: High-Level Case Studies
  • PART 2: Low-Level Case Studies
  • Lessons Learned (Huff et al.)
  • Supporting Reproducible Science (Ram, Marwick)
  • Glossary of Terms and Techniques (Rokem, Chirigati)

Editors

Justin Kitzes, Fatma Imamoglu, Daniel Turek

Supplementary Chapter Authors

  • Philip Stark
  • Justin Kitzes
  • Daniel Turek
  • Fatma Imamoglu
  • Kathryn Huff
  • Karthik Ram
  • Ariel Rokem
  • Ben Marwick
  • Valentina Staneva
  • Fernando Chirigati

Case Study Chapter Contributors!

  • Mary K. Askren
  • Anthony Arendt
  • Lorena A. Barba
  • Pablo Barberá
  • Kyle Barbary
  • Carl Boettiger
  • You-Wei Cheah
  • Garret Christensen
  • Devarshi Ghoshal
  • Chris Gorgolewski
  • Jan Gukelberger
  • Chris Holdgraf
  • Konrad Hinsen
  • David Holland
  • Chris Hartgerink
  • Kathryn Huff
  • Fatma Imamoglu
  • Justin Kitzes
  • Natalie Koh
  • Andy Krause
  • Randy LeVeque
  • Tara Madhyastha
  • José Manuel Magallanes
  • Ben Marwick
  • Olivier Mesnard
  • K. Jarrod Millman
  • K. A. S. Mislan
  • Kellie Ottoboni
  • Gilberto Pastorello
  • Russell Poldrack
  • Karthik Ram
  • Ariel Rokem
  • Rachel Slaybaugh
  • Valentina Staneva
  • Philip Stark
  • Daniel Turek
  • Daniela Ushizima
  • Zhao Zhang
Lessons Learned

  • Pain Points
  • Recommendations from the Authors
  • A Little Data
  • Needs

Pain Points

  • People and Skills
  • Dependencies, Build Systems, and Packaging
  • Hardware Access
  • Testing
  • Publishing
  • Data Versioning
  • Time and Incentives
  • Data Restrictions
Incentives

  • verifiability
  • collaboration
  • efficiency
  • extensibility
  • "focus on science"
  • "forced planning"
  • "safety for evolution"
Recommendations

  • version control your code
  • open your data
  • automate everywhere possible
  • document your processes
  • test everything
  • use free and open tools
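The "test everything" recommendation can be as lightweight as one assertion per function. A minimal sketch in Python, runnable with pytest or plain Python (the `normalize` function and its test are hypothetical examples, not from the book):

```python
# Hypothetical example of "test everything": a small analysis
# helper paired with a unit test that runs on every commit.

def normalize(values):
    """Scale a list of numbers so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def test_normalize_sums_to_one():
    result = normalize([2, 3, 5])
    # Floating-point sums are inexact, so compare within a tolerance.
    assert abs(sum(result) - 1.0) < 1e-12

if __name__ == "__main__":
    test_normalize_sums_to_one()
    print("ok")
```

Wiring such tests into a continuous integration service (the deck's data mentions travisci and jenkins) makes the check automatic rather than a matter of discipline.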
Recommendations: Continued

  • avoid excessive dependencies
  • when dependencies can't be avoided, package their installation
  • host code on a collaborative platform (e.g. GitHub)
  • get a Digital Object Identifier for your data and code
  • avoid spreadsheets; plain text data is preferred ("timeless," even)
  • explicitly set pseudorandom number generator seeds
  • workflow and provenance frameworks may be too clunky for most scientists
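Explicitly seeding the pseudorandom number generator, as recommended above, is a one-line habit. A minimal sketch with Python's standard library (NumPy and other libraries have their own seeding calls; the seed value here is arbitrary):

```python
# Explicitly seed the PRNG so stochastic results replay identically.
import random

random.seed(42)            # fixed seed, recorded alongside the analysis
first_run = [random.random() for _ in range(3)]

random.seed(42)            # re-seeding replays the exact same stream
second_run = [random.random() for _ in range(3)]

assert first_run == second_run  # identical draws on every rerun
```

Recording the seed in the version-controlled analysis script, rather than leaving the generator's default time-based state, is what makes a stochastic result reproducible by someone else.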
Recommendations: Outliers

> ... in our estimation, if someone was to try to reproduce our research it would probably be more natural for them to write their own scripts as this has the additional benefit that they might not fall into any error we may have accidentally introduced in our scripts.
Recommendations: Outliers

> Scientific funding and the number of scientists available to do the work is finite. Therefore not every scientific result can, or should, be reproduced.

Emergent Needs

  • Better education of scientists in more reproducibility-robust tools.
  • Widely used tools should be more reproducible so that the common denominator tool does not undermine reproducibility.
  • Improved configuration and build systems for portably packaging software, data, and analysis workflows.
  • Reproducibility at scale for high performance computing.
  • Standardized hardware configurations and experimental procedures for limited-availability experimental apparatuses.
  • Better understanding of why researchers don't respond to the delayed incentives of unit testing as a practice.
  • Greater adoption of unit testing irrespective of programming language.
  • Broader community adoption of publication formats that allow parallel editing (i.e. any plain text markup language that can be version controlled).
  • Greater scientific adoption of new industry-led tools and platforms for data storage, versioning, and management.
  • Increased community recognition of the benefits of reproducibility.
  • Incentive systems where reproducibility is not self-incentivizing.
  • Standards around scrubbed and representational data so that analyses can be investigated separately from restricted data sets.
  • Community adoption of file format standards within some domains.
  • Domain standards which translate well outside of their own scientific communities.

Social Science Volume

Collecting Case Studies Spring/Summer 2016

  • Same format: 1,500-2,000 words plus one diagram
  • Bad Hessian blog : http://www.badhessian.org
  • GitHub Repo : http://github.com/BIDS/ss-repro-case-public
  • Email Garret Christensen (garret@berkeley.edu) or Cyrus Dioun (dioun@berkeley.edu)

Acknowledgements

  • Justin Kitzes
  • Fatma Imamoglu
  • Daniel Turek
  • Chapter Authors
  • Case Study Authors
  • Reproducibility Working Group

THE END

Katy Huff

katyhuff.github.io/2016-05-03-nyu

Harder, Better, Faster: Case Studies in Reproducible Workflows by Kathryn Huff is licensed under a Creative Commons Attribution 4.0 International License. Based on a work at http://katyhuff.github.io/2016-05-03-nyu.
Tools mentioned in the case studies:

> connectome workbench, stata, zotero, travisci, vistrails, osf, testtools, nipy, coverage/coveralls, ferret, cmake, flickr api, amazon s3, nose, readthedocs, pypi, jira, jenkins, ec2 s3, sweave, shell, jupyter, sql, dataverse, rnw, spark, paraview, data science toolkit, overleaf, virtualenv, crossref, spyder, markdown, dropbox, scikit-image, awk, netcdf, petsc, figshare, sharelatex, pandoc, ibamr, dcvs, twitter api, mendeley, word, d3, beautiful soup, sed, devtools, activepapers, private git repo, cython, outreg2, rsync, zenodo, vagrant, c