Harder, Better, Faster: Case Studies in Reproducible Workflows – Kathryn Huff




A talk on lessons learned from the reproducibility case studies book, given at the NYU Reproducibility Symposium

On Github BIDS / 2016-05-03-nyu

Harder, Better, Faster:

Case Studies in Reproducible Workflows

Kathryn Huff

NYU Reproducibility Symposium

May 03, 2016

  • Case Study Book Concept
  • Case Study Contributions
  • Lessons Learned
  • Next Steps!

Reproducibility and Open Science Conference

May 21–22, 2015

  • Three days
  • Invitation Only
  • Case Studies, Education, Self-assessment
  • https://github.com/BIDS/repro-conf
  • Preface (Stark)
  • Introduction (Kitzes)
  • Assessing the Reproducibility of a Research Project (Rokem, Marwick, Staneva)
  • The Basic Reproducible Workflow Template (Kitzes, Turek)
  • Introducing the Case Studies (Imamoglu, Turek)
  • PART 1: High-Level Case Studies
  • PART 2: Low-Level Case Studies
  • Lessons Learned (Huff et al.)
  • Supporting Reproducible Science (Ram, Marwick)
  • Glossary of Terms and Techniques (Rokem, Chirigati)

Editors

Justin Kitzes, Fatma Imamoglu, Daniel Turek

Supplementary Chapter Authors

  • Philip Stark
  • Justin Kitzes
  • Daniel Turek
  • Fatma Imamoglu
  • Kathryn Huff
  • Karthik Ram
  • Ariel Rokem
  • Ben Marwick
  • Valentina Staneva
  • Fernando Chirigati

Case Study Chapter Contributors!

  • Mary K. Askren
  • Anthony Arendt
  • Lorena A. Barba
  • Pablo Barberá
  • Kyle Barbary
  • Carl Boettiger
  • You-Wei Cheah
  • Garret Christensen
  • Devarshi Ghoshal
  • Chris Gorgolewski
  • Jan Gukelberger
  • Chris Holdgraf
  • Konrad Hinsen
  • David Holland
  • Chris Hartgerink
  • Kathryn Huff
  • Fatma Imamoglu
  • Justin Kitzes
  • Natalie Koh
  • Andy Krause
  • Randy LeVeque
  • Tara Madhyastha
  • José Manuel Magallanes
  • Ben Marwick
  • Olivier Mesnard
  • K. Jarrod Millman
  • K. A. S. Mislan
  • Kellie Ottoboni
  • Gilberto Pastorello
  • Russell Poldrack
  • Karthik Ram
  • Ariel Rokem
  • Rachel Slaybaugh
  • Valentina Staneva
  • Philip Stark
  • Daniel Turek
  • Daniela Ushizima
  • Zhao Zhang
Lessons Learned

  • Pain Points
  • Recommendations from the Authors
  • A Little Data
  • Needs

Pain Points

  • People and Skills
  • Dependencies, Build Systems, and Packaging
  • Hardware Access
  • Testing
  • Publishing
  • Data Versioning
  • Time and Incentives
  • Data Restrictions
Incentives

  • verifiability
  • collaboration
  • efficiency
  • extensibility
  • "focus on science"
  • "forced planning"
  • "safety for evolution"
Recommendations

  • version control your code
  • open your data
  • automate everywhere possible
  • document your processes
  • test everything
  • use free and open tools
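The "test everything" recommendation can be as lightweight as one assertion per function. A minimal sketch in Python, runnable with pytest or plain Python (the `normalize` function and its test are hypothetical examples, not from the book):

```python
# Hypothetical example of "test everything": a small analysis
# helper paired with a unit test that runs on every commit.

def normalize(values):
    """Scale a list of numbers so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def test_normalize_sums_to_one():
    result = normalize([2, 3, 5])
    # Floating-point sums are inexact, so compare within a tolerance.
    assert abs(sum(result) - 1.0) < 1e-12

if __name__ == "__main__":
    test_normalize_sums_to_one()
    print("ok")
```

Wiring such tests into a continuous integration service (the deck's data mentions travisci and jenkins) makes the check automatic rather than a matter of discipline.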
Recommendations: Continued

  • avoid excessive dependencies
  • when dependencies can't be avoided, package their installation
  • host code on a collaborative platform (e.g. GitHub)
  • get a Digital Object Identifier for your data and code
  • avoid spreadsheets; plain text data is preferred ("timeless," even)
  • explicitly set pseudorandom number generator seeds
  • workflow and provenance frameworks may be too clunky for most scientists
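Explicitly seeding the pseudorandom number generator, as recommended above, is a one-line habit. A minimal sketch with Python's standard library (NumPy and other libraries have their own seeding calls; the seed value here is arbitrary):

```python
# Explicitly seed the PRNG so stochastic results replay identically.
import random

random.seed(42)            # fixed seed, recorded alongside the analysis
first_run = [random.random() for _ in range(3)]

random.seed(42)            # re-seeding replays the exact same stream
second_run = [random.random() for _ in range(3)]

assert first_run == second_run  # identical draws on every rerun
```

Recording the seed in the version-controlled analysis script, rather than leaving the generator's default time-based state, is what makes a stochastic result reproducible by someone else.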
Recommendations: Outliers

> ... in our estimation, if someone was to try to reproduce our research it would probably be more natural for them to write their own scripts as this has the additional benefit that they might not fall into any error we may have accidentally introduced in our scripts.
Recommendations: Outliers

> Scientific funding and the number of scientists available to do the work is finite. Therefore not every scientific result can, or should, be reproduced.

Emergent Needs

  • Better education of scientists in more reproducibility-robust tools.
  • Widely used tools should be more reproducible so that the common denominator tool does not undermine reproducibility.
  • Improved configuration and build systems for portably packaging software, data, and analysis workflows.
  • Reproducibility at scale for high performance computing.
  • Standardized hardware configurations and experimental procedures for limited-availability experimental apparatuses.
  • Better understanding of why researchers don't respond to the delayed incentives of unit testing as a practice.
  • Greater adoption of unit testing irrespective of programming language.
  • Broader community adoption of publication formats that allow parallel editing (i.e. any plain text markup language that can be version controlled).
  • Greater scientific adoption of new industry-led tools and platforms for data storage, versioning, and management.
  • Increased community recognition of the benefits of reproducibility.
  • Incentive systems where reproducibility is not self-incentivizing.
  • Standards around scrubbed and representational data so that analyses can be investigated separately from restricted data sets.
  • Community adoption of file format standards within some domains.
  • Domain standards which translate well outside of their own scientific communities.

Social Science Volume

Collecting Case Studies Spring/Summer 2016

  • Same format: 1,500-2,000 words plus one diagram
  • Bad Hessian blog : http://www.badhessian.org
  • GitHub Repo : http://github.com/BIDS/ss-repro-case-public
  • Email Garret Christensen (garret@berkeley.edu) or Cyrus Dioun (dioun@berkeley.edu)

Acknowledgements

  • Justin Kitzes
  • Fatma Imamoglu
  • Daniel Turek
  • Chapter Authors
  • Case Study Authors
  • Reproducibility Working Group

THE END

Katy Huff

katyhuff.github.io/2016-05-03-nyu

Harder, Better, Faster: Case Studies in Reproducible Workflows by Kathryn Huff is licensed under a Creative Commons Attribution 4.0 International License. Based on a work at http://katyhuff.github.io/2016-05-03-nyu.
Tools mentioned in the case studies:

> connectome workbench, stata, zotero, travisci, vistrails, osf, testtools, nipy, coverage/coveralls, ferret, cmake, flickr api, amazon s3, nose, readthedocs, pypi, jira, jenkins, ec2 s3, sweave, shell, jupyter, sql, dataverse, rnw, spark, paraview, data science toolkit, overleaf, virtualenv, crossref, spyder, markdown, dropbox, scikit-image, awk, netcdf, petsc, figshare, sharelatex, pandoc, ibamr, dcvs, twitter api, mendeley, word, d3, beautiful soup, sed, devtools, activepapers, private git repo, cython, outreg2, rsync, zenodo, vagrant, c