

An efficient workflow for reproducible science (talk at SciPy 2013)

On GitHub: tbekolay/scipy2013-workflow


Trevor Bekolay University of Waterloo

I'm a PhD student at the University of Waterloo. I want to talk mostly about how I currently work so that my science is reproducible. There is no overarching tool, just a set of tips: what I wish I'd known when I started my PhD.

Recreatability + openness

⇒ reproducibility

Reproducibility is the goal, but to get there we first need recreatability.
Reproducible = read a paper and, given only that text, independently set up and run the experiment.
Recreatable = given access to everything that went into producing a paper, recreate it.
Recreatability should be a given, but it isn't, because it's hard. Difficulty with recreatability is understandable, but inexcusable. My thought: make your own work recreatable, release it in the open, and reproducibility will follow.

  • Read literature
  • Form hypothesis
  • Try stuff out
  • Produce a research artifact
A scientist does a lot of things. This talk focuses on the last part: producing an artifact to be consumed by others. We don't talk about this part enough. You may completely disagree with me. That's great, but provide alternatives.

Ideal workflow

  • Recreatable
  • Simple (the fewer tools the better)
  • Fast (able to see the result of changes)
What does the ideal workflow look like? Recreatability is number one, and it sometimes comes at the expense of simplicity or speed. These three are all in conflict. How do we get all three, or at least get close?
git clone
  download_data convert plot combine all paper

This is what I've done now. This isn't just a complicated figure; it's a whole 22-page paper with multiple complicated figures. Here's what I learned in getting to that point.

1 project : 1 directory

Tip 1
  • When you start making an artifact, make a new directory
People consume your research as the artifact. Only include what you did to make that artifact. There will be some duplication, but so what? It also means you can put this in version control, and the sooner the better!

Use virtualenv

Tip 2
and virtualenvwrapper

  • Use --no-site-packages (the default now)
  • cd /project/dir && setvirtualenvproject
  • pip install <package>
  • pip freeze > requirements.txt
I wish I had more time to talk about virtualenv! Trust me: it's worth learning. Install new packages on a whim. When you're done, run pip freeze to make a requirements.txt.
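A requirements.txt is only useful if the environment actually matches it. Here is a minimal sketch of one way to check that, assuming the file contains simple `name==version` pins as written by pip freeze. The helper names `parse_pins` and `find_mismatches` and the example versions are hypothetical, not part of the talk.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for every 'name==version' line."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins, installed_version=metadata.version):
    """Return {package: (pinned, installed or None)} where the two differ."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = installed_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    # Hypothetical pins, standing in for the contents of requirements.txt.
    example = "numpy==1.7.1\nmatplotlib==1.2.1  # plotting\n"
    print(find_mismatches(parse_pins(example)))
```

Lines that are not plain `==` pins (editable installs, URLs) are skipped rather than interpreted.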

Make packages from duplicate code

Tip 3
  • You can never totally get rid of duplicate code
  • Consider making (pip installable) Python packages
Give up on having absolutely no duplicate code. It's kind of nice to see your progress anyhow. If you repeat a ton, you're doing something novel. Put it on PyPI. PyPI has a lot of crap on it, so it'll be fine.

Put forgettables in a README

Tip 4
usage
download_data -- Downloads data from figshare
convert -- Convert any CSVs in data/ to HDF5

- libpng (apt-get install libpng-dev)
- python
- pip
- Packages in requirements.txt
The README should contain anything you're worried about forgetting. Write it for yourself.

Directory structure

Tip 5
  • data
  • figures
  • paper
  • plots
  • scripts
  • requirements.txt
This is (roughly) how a paper gets made. Our directory structure should reflect this. Subdirectories should be clear.
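As a small illustration, the layout listed above could be created in one go. This is only a sketch: the function name `make_skeleton` and the project name `my-paper` are hypothetical, and the demo writes into a temporary directory so nothing is left behind.

```python
import tempfile
from pathlib import Path

# Subdirectory names taken from the list above.
SUBDIRS = ["data", "figures", "paper", "plots", "scripts"]


def make_skeleton(project_dir):
    """Create the project directory, its subdirectories, and requirements.txt."""
    root = Path(project_dir)
    for name in SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    # Empty placeholder; an existing file is left untouched.
    (root / "requirements.txt").touch(exist_ok=True)
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = make_skeleton(Path(tmp) / "my-paper")
        print(sorted(p.name for p in root.iterdir()))
```

Calling it again on an existing project is harmless, since every step tolerates files that already exist.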

Decouple analysis

Tip 6
Think of analysis as compression: going from big raw data to small important data. If an analysis needs information from two sources, it's a meta-analysis.
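To make the distinction concrete, here is a toy sketch. The names `analyze` and `meta_analyze` echo the talk's skeleton, but the bodies, the summary fields, and the sample numbers are invented for illustration, and `analyze` takes the samples directly rather than a file path.

```python
from statistics import mean


def analyze(samples):
    """Compress one raw source (a list of numbers) into a small summary."""
    return {"n": len(samples), "mean": mean(samples), "peak": max(samples)}


def meta_analyze(results):
    """Combine summaries from several sources, which makes it a meta-analysis."""
    total = sum(r["n"] for r in results.values())
    weighted_mean = sum(r["mean"] * r["n"] for r in results.values()) / total
    peak = max(r["peak"] for r in results.values())
    return {"n": total, "mean": weighted_mean, "peak": peak}


if __name__ == "__main__":
    # Invented raw data keyed by a made-up source name.
    raw = {"data/a": [1.0, 2.0, 3.0], "data/b": [10.0, 20.0]}
    results = {path: analyze(samples) for path, samples in raw.items()}
    print(meta_analyze(results))
```

Because each `analyze` call sees exactly one source, the calls are independent of each other and can be cached or run in parallel.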

Do everything with one script

Tip 7
  • Like a makefile for your artifact
  • Force yourself to put everything in here
    • subprocess is your friend
This one script contains the logic to do everything, and I mean everything! Force yourself to put everything in there. It's easy to forget which terminal command you used when you need to do paper revisions.
from glob import glob

from scripts import analysis, plots, figures

if __name__ == '__main__':

    # Analysis: one result per raw data file
    results = {}
    for path in glob('data/*'):
        results[path] = analysis.analyze(path)

    # Meta-analysis: combines the per-file results
    meta_result = analysis.meta_analyze(results)

    # Plots: one plot file per result
    plot_files = {}
    for path in results:
        result = results[path]
        plot_files[path] = plots.make_plot(result)

    meta_plot = plots.make_meta_plot(meta_result)

    # Figures: combine one plot with the meta-analysis plot
    plot_file = list(plot_files.values())[0]
    figures.make_figure(plot_file, meta_plot)
Skeleton example. scripts/ holds the analysis, plots, and figures modules.
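For steps that are not Python, the subprocess module lets the same script run them, so the exact command is recorded rather than remembered. This is a minimal sketch: `run_step` is a hypothetical helper, and the demo command just calls the current Python interpreter so it runs anywhere. A real project would substitute its own tools.

```python
import subprocess
import sys


def run_step(command):
    """Run an external command, fail loudly on error, and return its stdout."""
    print("running:", " ".join(command))
    completed = subprocess.run(
        command, check=True, capture_output=True, text=True
    )
    return completed.stdout


if __name__ == "__main__":
    # Placeholder command standing in for a real external tool.
    print(run_step([sys.executable, "-c", "print('step done')"]))
```

Passing `check=True` makes a failing command raise `subprocess.CalledProcessError`, so a broken step stops the whole run instead of silently producing a stale artifact.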

Use command line arguments

Tip 8
That one script is all you should interact with. Make command line arguments for the various things you do with it.


    SAVE_PLOTS = True
    plot(data, save=SAVE_PLOTS)
> python
> emacs
> python
This was something I used to do a lot. Every time you open up an editor, you're expending mental energy.


    import sys
    # True only when 'save_plots' is passed on the command line
    SAVE_PLOTS = 'save_plots' in sys.argv
    plot(data, save=SAVE_PLOTS)
> python
> python save_plots
Bonus tip: try docopt for advanced cases
This takes less energy once you've made the argument. If you need something more complex, try docopt.
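The same idea extends from one flag to a set of named steps. The sketch below uses only `sys.argv`. The step names echo the README usage earlier in the talk, the step bodies are placeholders, and treating `all` as "run every step" is an assumption rather than something the talk specifies.

```python
import sys


def download_data():
    return "downloaded"  # placeholder body


def convert():
    return "converted"  # placeholder body


def plot():
    return "plotted"  # placeholder body


# Listed in pipeline order so steps always run in a sensible sequence.
STEPS = [("download_data", download_data), ("convert", convert), ("plot", plot)]


def select_steps(argv):
    """Return the requested step names in pipeline order."""
    requested = set(argv)
    unknown = requested - {name for name, _ in STEPS} - {"all"}
    if unknown:
        raise SystemExit("unknown step(s): " + ", ".join(sorted(unknown)))
    return [name for name, _ in STEPS if "all" in requested or name in requested]


def main(argv):
    """Run the selected steps and return their results."""
    actions = dict(STEPS)
    return [actions[name]() for name in select_steps(argv)]


if __name__ == "__main__":
    print(main(sys.argv[1:]))
```

Rejecting unknown names up front means a typo on the command line fails immediately instead of quietly skipping a step.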

Parallelize & cache

Tip 9
  • Profile first!
You may not actually have expensive steps. But if you do, you can speed them up easily.
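One way to profile first is the standard library's cProfile. This sketch assumes the expensive step can be called as a plain function; `slow_step` is an invented stand-in and `profile_call` is a hypothetical helper.

```python
import cProfile
import io
import pstats


def slow_step(n):
    """Invented stand-in for an expensive analysis step."""
    return sum(i * i for i in range(n))


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the five entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


if __name__ == "__main__":
    value, report = profile_call(slow_step, 100000)
    print(report)
```

If the report shows that one step dominates the run time, that step is the candidate for parallelizing or caching.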
    # Analysis
    results = {}
    for path in glob('data/*'):
        results[path] = analysis.analyze(path)
Here's our analysis snippet from before.
> ipcluster start -n 5
    # IPython.parallel moved to the separate ipyparallel package in IPython 4
    from IPython import parallel

    rc = parallel.Client()
    lview = rc.load_balanced_view()

    # Submit every analysis without blocking
    results = {}
    for path in glob('data/*'):
        asyncresult = lview.apply(analysis.analyze, path)
        results[path] = asyncresult

    # Collect each result, blocking until it is ready
    for path, asyncresult in list(results.items()):
        results[path] = asyncresult.get()
In just a handful of extra lines, this is now done in parallel (with IPython.parallel).
    # Plots
    plot_files = {}
    for path in results:
        result = results[path]
        plot_files[path] = plots.make_plot(result)
Here's our plot snippet from before.
    import os

    plot_files = {}
    for path in results:
        # data/file1.h5 => plots/file1.svg
        plot_path = 'plots/' + os.path.splitext(
            os.path.basename(path))[0] + ".svg"

        if os.path.exists(plot_path):
            # Cached: reuse the plot made by an earlier run
            plot_files[path] = plot_path
        else:
            res = results[path]
            plot_files[path] = plots.make_plot(res)
Bonus tip: release cached analysis data if raw data is confidential
Now we're not duplicating that effort. You may be able to release cached analyses even if the raw data is confidential.
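The same check-then-compute pattern works for analysis results, which is what makes releasing cached analyses possible. A minimal sketch, assuming the result is JSON-serializable. The helper name `cached`, the cache path, and the stand-in analysis are all hypothetical.

```python
import json
import os
import tempfile


def cached(cache_path, compute):
    """Return the JSON stored at cache_path, computing and saving it if absent."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    result = compute()
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "w") as f:
        json.dump(result, f)
    return result


if __name__ == "__main__":
    calls = []

    def expensive_analysis():
        # Invented stand-in; records each call so the demo can count them.
        calls.append(1)
        return {"mean": 7.2}

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cache", "file1.json")
        first = cached(path, expensive_analysis)
        second = cached(path, expensive_analysis)
        print(first == second, len(calls))
```

To force a recomputation after changing the analysis, delete the cache file. The sketch does not try to detect stale results.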

Put it all online

Tip 10
  • Let GitHub or Bitbucket handle web stuff
    • Papers should be changeable and forkable anyway
  • Store source and artifacts separately
Online repositories are more reliable than your computer. You can keep this private, but please consider making it public.

tbekolay/jneurosci2013data (figshare)

I hope these tips were helpful! My JNeuroscience paper and this presentation are both on GitHub. Please suggest improvements!