An efficient workflow for reproducible science
Trevor Bekolay
University of Waterloo bekolay.org/scipy2013-workflow
PhD student at the University of Waterloo
I mostly want to talk about how I currently work
so that my science is reproducible.
No overarching tool, just a set of tips.
What I wish I'd known when I started my PhD
Recreatability + openness
⇒ reproducibility
Reproducibility is the goal, but in order to do that
we first need recreatability
Reproducible = given only the text of a paper,
independently set up and run the same experiment.
Recreatable = given access to everything that went
into producing a paper, recreate it.
Recreatability should be a given, but it's not; it's hard
Difficulty with recreatability
is understandable, but inexcusable
My thought: make your own work recreatable,
release it in the open, and reproducibility will follow
Read literature
Form hypothesis
Try stuff out
Produce a research artifact
A scientist does a lot of things
This talk is focused on this last part,
producing an artifact to be consumed by others
We don't talk about this part enough
You may completely disagree with me
* That's great, but provide alternatives
Ideal workflow
Recreatable
Simple (the fewer tools the better)
Fast (able to see the result of changes)
What does the ideal workflow look like?
Recreatability is number 1
Sometimes comes at the expense of simplicity or speed
These three are all in conflict
* How to get all three of these,
or at least close?
git clone
https://github.com/tbekolay/jneurosci2013.git
python run.py <task>
# tasks: download_data, convert, plot, combine, all, paper
This is what I've done now
This isn't just a complicated figure,
it's a whole 22 page paper with
multiple complicated figures
* Here's what I learned in getting to that point
1 project : 1 directory
Tip 1
- When you start making an artifact, make a new directory
People consume your research as the artifact
Only include what you did to make that artifact
There will be some duplication, but so what
Also means you can put this in version control
* The sooner the better!
Use virtualenv
Tip 2
and virtualenvwrapper
Use --no-site-packages (the default now)
cd /project/dir && setvirtualenvproject
pip install <package>
pip freeze > requirements.txt
Wish I had more time to talk about virtualenv!
Trust me: it's worth learning
Install new packages at a whim
When you're done, pip freeze to make a requirements.txt
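Recreating the environment is then one command
(this is standard pip usage, nothing specific to my project):
pip install -r requirements.txt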
Make packages from duplicate code
Tip 3
- You can never totally get rid of duplicate code
- Consider making (pip installable) Python packages
Give up on having absolutely no duplicate code
It's kind of nice to see your progress anyway
If you find yourself reusing code a ton, you've probably made something novel
Put it on PyPI
* PyPI has a lot of crap on it; yours will be fine
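A minimal setup.py is enough to make shared code pip installable;
this is just a sketch, with a hypothetical package name and dependencies:

# setup.py -- minimal packaging sketch; name and metadata are made up
from setuptools import setup, find_packages

setup(
    name='mylab-utils',          # hypothetical package name
    version='0.1.0',
    packages=find_packages(),
    install_requires=['numpy'],  # whatever the shared code needs
)

After that, pip install . installs it locally,
and it's a short step from there to uploading it to PyPI.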
Put forgettables in a README
Tip 4
run.py usage
============
download_data -- Download data from figshare
convert -- Convert any CSVs in data/ to HDF5
Requirements
============
- libpng (apt-get install libpng-dev)
- python
- pip
- Packages in requirements.txt
README should contain anything you're worried
about forgetting
Write it for yourself
Directory structure
Tip 5
- scripts/
- requirements.txt
- run.py
- README
This is (roughly) how a paper gets made
Our directory structure should reflect this
* Subdirectories should be clear
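Roughly, the layout for this paper looks like this
(data/ and plots/ are implied by the run.py snippets below;
take it as a sketch rather than the exact repository):

jneurosci2013/
    data/              # raw and converted data
    plots/             # generated plots
    scripts/           # analysis.py, plots.py, figures.py
    README
    requirements.txt
    run.py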
Decouple analysis
Tip 6
Think of analysis as compression
Going from big raw data to small important data
* If an analysis needs information from two
sources, it's a meta-analysis
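As a sketch of what this can look like (the HDF5 dataset name
and the statistics are hypothetical), each analysis function
takes one raw file and returns only the small, important numbers:

import h5py
import numpy as np

def analyze(path):
    # Compression: one big raw data file in,
    # a few summary statistics out.
    with h5py.File(path, 'r') as f:
        rates = f['rates'][:]  # hypothetical dataset name
    return {'mean': float(np.mean(rates)),
            'std': float(np.std(rates))}

def meta_analyze(results):
    # Meta-analysis: combines the small results
    # from several sources, never the raw data.
    return np.mean([r['mean'] for r in results.values()])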
Do everything with run.py
Tip 7
- Like a makefile for your artifact
- Force yourself to put everything in here
- subprocess is your friend (see the sketch after the skeleton below)
run.py contains the logic to do everything
I mean everything!!
Force yourself to put everything in there
Easy to forget what terminal command you used
when you need to do paper revisions
from glob import glob

from scripts import analysis, plots, figures

if __name__ == '__main__':
    # Analysis
    results = {}
    for path in glob('data/*'):
        results[path] = analysis.analyze(path)

    # Meta-analysis
    meta_result = analysis.meta_analyze(results)
Skeleton example
scripts/ has analysis.py, plots.py, figures.py
    # Plots
    plot_files = {}
    for path in results:
        result = results[path]
        plot_files[path] = plots.make_plot(result)
    meta_plot = plots.make_meta_plot(meta_result)

    # Figures
    plot_file = plot_files.values()[0]
    figures.make_figure(plot_file, meta_plot)
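For steps that aren't Python, run.py can shell out with subprocess;
a sketch, assuming the paper's LaTeX source lives in paper/paper.tex:

import subprocess

def make_paper():
    # Compile the paper the same way every time,
    # so the exact command is never forgotten.
    subprocess.check_call(['pdflatex', 'paper.tex'], cwd='paper')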
Use command line arguments
Tip 8
run.py is all you should interact with
Make command line arguments
for the various things you do with it
Bad!
SAVE_PLOTS = True
...
plot(data, save=SAVE_PLOTS)
> python run.py
> emacs run.py  # change SAVE_PLOTS
> python run.py
This was something I used to do a lot
Every time you open up an editor,
you're expending mental energy
Good!
import sys

SAVE_PLOTS = 'save_plots' in sys.argv
...
plot(data, save=SAVE_PLOTS)
> python run.py
> python run.py save_plots
Bonus tip: try docopt for advanced cases
Once the argument exists, it costs almost no mental energy
If you need something more complex, try docopt (see the sketch below)
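A minimal docopt sketch (the task argument and --save-plots
option here are illustrative, not my actual interface):

"""Usage: run.py [<task>...] [--save-plots]

Options:
  --save-plots  Save plots to the plots/ directory.
"""
from docopt import docopt

if __name__ == '__main__':
    args = docopt(__doc__)
    print(args['<task>'], args['--save-plots'])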
Parallelize & cache
Tip 9
You may not actually have expensive steps
But if you do, you can speed them up easily
# Analysis
results = {}
for path in glob('data/*'):
results[path] = analysis.analyze(path)
* Here's our analysis snippet from before
> ipcluster start -n 5
from glob import glob

from IPython import parallel

rc = parallel.Client()
lview = rc.load_balanced_view()

results = {}
for path in glob('data/*'):
    asyncresult = lview.apply(analyze, path)
    results[path] = asyncresult
for path, asyncresult in results.iteritems():
    results[path] = asyncresult.get()
* In just a handful of extra lines,
this is now done in parallel (with IPython.parallel)
# Plots
plot_files = {}
for path in results:
result = results[path]
plot_files[path] = plots.make_plot(result)
* Here's our plot snippet from before
import os

plot_files = {}
for path in results:
    # data/file1.h5 => plots/file1.svg
    plot_path = 'plots/' + os.path.splitext(
        os.path.basename(path))[0] + '.svg'
    if os.path.exists(plot_path):
        plot_files[path] = plot_path
    else:
        res = results[path]
        plot_files[path] = plots.make_plot(res)
Bonus tip: release cached analysis data
if raw data is confidential
Now we're not reduplicating that effort
You may be able to release cached analyses
even if raw data is confidential
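A sketch of what caching an analysis can look like
(assumes the analyze function from earlier and an analyses/ directory;
np.savez and np.load are standard NumPy functions):

import os
import numpy as np

def cached_analyze(path):
    # analyses/ holds small derived results, so it can be
    # released even when the raw data/ cannot.
    cache = 'analyses/' + os.path.basename(path) + '.npz'
    if os.path.exists(cache):
        return dict(np.load(cache))
    result = analyze(path)
    np.savez(cache, **result)
    return result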
Put it all online
Tip 10
- Let GitHub or Bitbucket handle web stuff
- Papers should be changeable and forkable anyway
- Store source and artifacts separately
Online repositories are more reliable
than your computer
Can have this private,
but please consider making it public