Version Control for Data Science

Where data science and time series data become one...

...with your very own TARDIS

Simeon H.K. Fitch
Director of Software Architecture
June 20th, 2014


Version Control for Data Science

Part 1: What and Why

What is a Version? How do I get one? What do I do with it? Why would I want it?

Modern Version Control is like a TARDIS

Can travel through:
- Time
- Space
- Dimensions

TARDIS (/ˈtɑːdɪs/; Time and Relative Dimension in Space)

Modern Data Science Requires a TARDIS

Go forward and back in time
- Models, data, results
Create new dimensions and parallel universes
- Change the past, evaluate different futures
Travel through space
- Share text, code and data with colleagues and clients

Data Scientists Must Be Time Lords

"Data Scientist (n.): Person who is better at statistics than any software engineer, and better at software engineering than any statistician."

—Josh Wills, Cloudera’s Director of Data Science

Controlling what we do, how we do it, and what we produce is key to maintaining integrity in our work.
Version Control mastery is critical to achieving this as a modern Data Scientist.

Remember: Data Scientists are just better!

Implication: to be a good data scientist, you have to know software engineering best practices and steal from them. Software engineering has been doing this for decades.

Version Control for Data Science

Part 2: Approaches & Systems

Time Lord Requirements

Let's think more about the details of what we need:

Backup and Restore: Access any file, at any time-point in existence.
Synchronization: Share files and stay up-to-date, but under your control.
Short-term undo: Goofed up? Get back to the good version.
Long-term undo: Sometimes we mess up badly. Find out what changed a year ago and revert that change.
Track Changes: Not only specifics on what changed, but notes on why.
Track Ownership: Know who made the change.
Sandboxing: Experiment and make changes in an isolated area, working out the kinks before sharing them.
Branching and merging: Controlling when, where, and how major changes are introduced.

This sounds like a TARDIS!
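The requirements above are less exotic than they sound. Here is a minimal Python sketch of a toy history store, assuming whole-file snapshots kept in memory. The names `ToyHistory` and `Commit`, and the sample file and authors, are hypothetical, and this is not how Git or Mercurial store data internally. It only illustrates backup and restore, long-term undo, and tracking who changed what and why.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Commit:
    """One point in time: a full snapshot plus who changed it and why."""
    commit_id: str
    parent: Optional[str]
    author: str
    message: str
    timestamp: datetime
    files: Dict[str, str]  # path -> file content


class ToyHistory:
    """Hypothetical in-memory history; an illustration, not a real VCS."""

    def __init__(self) -> None:
        self.commits: Dict[str, Commit] = {}
        self.head: Optional[str] = None

    def commit(self, files: Dict[str, str], author: str, message: str) -> str:
        """Record a snapshot of all files and return its identifier."""
        payload = repr((self.head, author, message, sorted(files.items())))
        commit_id = hashlib.sha1(payload.encode("utf-8")).hexdigest()[:10]
        self.commits[commit_id] = Commit(
            commit_id, self.head, author, message,
            datetime.now(timezone.utc), dict(files),
        )
        self.head = commit_id
        return commit_id

    def restore(self, commit_id: str) -> Dict[str, str]:
        """Backup and restore: every file exactly as it was at that commit."""
        return dict(self.commits[commit_id].files)

    def log(self) -> List[Tuple[str, str, str]]:
        """Track changes and ownership: newest first, as (id, author, message)."""
        entries = []
        current = self.head
        while current is not None:
            c = self.commits[current]
            entries.append((c.commit_id, c.author, c.message))
            current = c.parent
        return entries


history = ToyHistory()
v1 = history.commit({"model.R": "lm(y ~ x)"}, "ana", "Baseline linear model")
v2 = history.commit({"model.R": "lm(y ~ x + z)"}, "ben", "Add covariate z")

# Long-term undo: the first version is still there, untouched.
print(history.restore(v1)["model.R"])
for commit_id, author, message in history.log():
    print(commit_id, author, message)
```

A real system adds the parts that matter at scale (efficient storage, branching, synchronization between machines), but the core idea is the same: every change is a recorded point you can travel back to.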

Source: much of this section is borrowed from A Visual Guide to Version Control

Version Control in the Wild

Homebrew VC
Abdication-Oriented VC
Time Lord VC

Abdication-Oriented is also "fire and forget": you fire, and it takes care of the forgetting for you. Takeaway: version control needs to be part of your process and habits.

Homebrew VC (1)

Hip-Hop Approach

⌘-Z / Ctrl-Z

Taxonomist Approach

"Save As..."

ProjectBackup11a.zip

proposal_v23_agt_revised_final.docx

final_project_data_do_not_delete.csv

Not the TARDIS

Homebrew VC (2)

Filing Clerk Approach

Not the TARDIS

When asked to re-publish or reproduce, which version is the right one? What if you later discovered the contract ended before the copies?

Homebrew VC (3)

Cloud Storage Approach

MIME-Version: 1.0
Received: by 10.220.194.194 with HTTP; Thu, 19 Jun 2014 10:10:44 -0700 (PDT)
Date: Thu, 19 Jun 2014 13:10:44 -0400
Delivered-To: fitch@datamininglab.com
Message-ID: 
Subject: Backup of mumble_final_report_and_data_v24.zip
From: Simeon Fitch 
To: Simeon Fitch 
Content-Type: multipart/mixed; boundary=001a11c2dd740cbee804fc33756a

--001a11c2dd740cbee304fc337568
Content-Type: text/plain; charset=UTF-8

(backup to self)

--001a11c2dd740cbee304fc337568--
--001a11c2dd740cbee804fc33756a
Content-Type: application/x-zip; name="mumble_final_report_and_data_v24.zip"
Content-Disposition: attachment; filename="mumble_final_report_and_data_v24.zip"
Content-Transfer-Encoding: base64
X-Attachment-Id: f_hwmbquyc0

iVBORw0KGgoAAAANSUhEUgAAAnYAAAGCCAIAAADi3Rk8AAAKr2lDQ1BJQ0MgUHJvZmlsZQAASA2t
...

Not the TARDIS

Abdication-Oriented VC (1)

"Back-Me-Up-Scotty" Approach

Apple Time Machine
Windows Built-in Backup
Mozy
Carbonite

Cloud Sync Approach

Dropbox
Google Drive
AeroFS
SparkleShare
iCloud (?)

Not the TARDIS

These tools fail, and they purge history!! Google Drive has caused me significant pain.

Abdication-Oriented VC (2)

My-Tools-Take-Care-Of-It-For-Me Approach

Microsoft Office Revision Tracking (definitely not the TARDIS.)
Google Docs

Single-file, bound to file type. Lest you think the preacher is perfect...

Abdicating Control Hurts!

Time Lord VC (1)

Architecture Types:

Centralized: CVS, SourceSafe, Subversion, Team Foundation Server
Distributed: Git, Mercurial, Bazaar

Don't assume there's always an IT hoop to go through!

With the distributed model, no server is necessary to get started.
Even with the centralized systems, you can initialize a local database (but there are a few steps).

Don't assume you have to be a command-line ninja to be a Time Lord

All of these systems have third-party GUIs and tool integration plugins.
Caveat emptor: usability varies widely.

Authoritarian communism vs. free market??

Time Lord VC (2)

A Time Lord with both hearts pumping will use a Distributed Version Control System (DVCS).
The software team at ERI uses Git and Mercurial, both DVCSs.

Like The Doctor, most everyone has their personal favorite. Mine is Git with SourceTree as a GUI. SourceTree also supports Mercurial.

Time Lord VC (3)

System selection:

If the client dictates a VCS, there is no decision...
- Learn and embrace what they have as quickly as you can, with enthusiasm!
- Any version control is better than no version control, as long as you understand its limitations and it doesn't create a false sense of security.
If there's a system tightly integrated into the modeling platform...
- Push it to its limits, and augment with Git as needed (e.g., for documentation or data)
- Tip: There's a "portable" (non-installing) version of Git available for your thumb drive.

Time Lord VC (4)

If the choice of VCS is completely up to you, my recommendation is to use Git with SourceTree.

Mercurial with SourceTree would be my second choice, but be aware of which features are implemented as plugins and how to install them.

Let me know if you want to know the details of why I recommend Git over Mercurial, but I don't want to distract from the main message here.

Time Lord VC (5)

Analytics tools context:

RStudio supports Git
KNIME supports most everything, via the Eclipse platform
SAS Data Integration Studio supports CVS and Subversion
SPSS Collaboration and Deployment Services: didn't have time to figure it out.

Version Control for Data Science

Part 3: Fundamental Operations

Much of this section is borrowed from:

A Visual Guide to Version Control

Terms

Init
Clone
Add
Checkout
Commit
Push
Pull
Branch
Merge
Diff
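Three of the terms above (Diff, Branch, Merge) can be modeled in a few lines of Python. This is a toy sketch under stated assumptions: a "branch" is just a copy of a dictionary of files, and the hypothetical `merge` helper works at whole-file granularity, whereas real systems such as Git merge line by line within a file. The file names and contents are made up for illustration.

```python
import difflib
from typing import Dict, List, Tuple


def diff(old: str, new: str, name: str = "file") -> str:
    """Diff: a line-by-line description of what changed between two versions."""
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{name}",
        tofile=f"b/{name}",
    )
    return "".join(lines)


def merge(base: Dict[str, str], ours: Dict[str, str],
          theirs: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """File-level three-way merge; returns (merged files, conflicting paths)."""
    merged: Dict[str, str] = {}
    conflicts: List[str] = []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both branches agree (or both deleted the file)
            result = o
        elif o == b:      # only their branch changed this file
            result = t
        elif t == b:      # only our branch changed this file
            result = o
        else:             # both changed it differently: a conflict
            conflicts.append(path)
            result = o    # keep ours until a human resolves it
        if result is not None:
            merged[path] = result
    return merged, conflicts


# Branch: two independent copies diverge from a common base.
base = {"model.py": "alpha = 0.1\n", "README": "v1\n"}
ours = {"model.py": "alpha = 0.5\n", "README": "v1\n"}
theirs = {"model.py": "alpha = 0.1\n", "README": "v2\n"}

print(diff(base["model.py"], ours["model.py"], "model.py"))
merged, conflicts = merge(base, ours, theirs)
print(merged, conflicts)
```

Because each branch touched a different file, the merge is clean. If both branches had edited `model.py` differently, its path would show up in the conflict list and wait for someone to decide.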

Checkins

Checkouts and Editing

Diffs

Branching

Merging

Conflicts

Distributed vs. Centralized

Version Control for Data Science

Part 4: Time Lord Training Regime

Next Steps (1)

If you're not convinced version control should be an essential component of your workflow, go read: What Can Data Scientists Learn from DevOps?

Go download

and play with it. It's free, fun, and fantastic.

Next Steps (2)

For a gentle introduction to the concepts and lingo behind modern VCS: A Visual Guide to Version Control

Once you start getting the feel of using version control in private, read: Intro to Distributed Version Control (Illustrated)

Resources/References

Links from along the way.

Also: the Appendix

Appendix

Warning: random stuff ahead.

Versions of What?

Ultimately, all the electronic by-products of your technical work.
It's like flossing: you don't have to floss the teeth you don't want to keep.
The time series data over all your technical artifacts.
The question is more appropriately stated: what do you not version control?

What to Version

Let's ask Hadley!

Examples:

Version Control System History

Generation | Networking | Operations | Concurrency | Examples
First | None | One file at a time | Locks | RCS, SCCS
Second | Centralized | Multi-file | Merge before commit | CVS, SourceSafe, Subversion, Team Foundation Server
Third | Distributed | Changesets | Commit before merge | Bazaar, Git, Mercurial

From History of Version Control by Eric Sink

Evaluation Characteristics

Time Resolution: Continuous vs. Discrete
Architecture: Centralized vs. Distributed
Social: Collaborative vs. Isolated
Granularity: Changeset vs. File/Multi-File

Notable Quotes

But don't take my word for it...

No Such Thing as a One-off Script

Devalytics: Making your analytical work easy to replicate, build upon, and scale while saving significant amounts of time in the process

In the same way the DevOps mantra is "Infrastructure as code," today’s data scientists need to think of all their scripts as actual software that will require ongoing maintenance, enhancement, and support. To paint with a broad brush, there is no such thing as a one-off script. As soon as anyone else has access to it, or if it even sticks around on your local filesystem, it will almost inevitably be reused and applied to different situations in the future.

—Donnie Berkholz in What can data scientists learn from DevOps?, RedMonk

Evaluate Effectiveness

How effective is your time travel machinery?

A lot of the analysts and data scientists we know don’t do version control... If you ask them what do you do if you want to work on different versions of an experiment with different parameters, they say, ‘Well, I make a copy of my files. I have scripts and script [copies].’ How do you ensure you can get back to your old versions of your work? You use version control.

—Nick Elprin of Domino Data Lab in Data scientists need their own GitHub, VentureBeat

Engineering Practices

Faster iteration is a competitive advantage.

It terrifies me that... many data scientists don't use source control. If you ask them they might say something like "Sure I use version control! I email myself every new version of the R script. Plus it's on dropbox." ...Successful data scientists learn the value of good pipeline engineering [because] a well-engineered pipeline gets data scientists iterating much faster, which can be a big competitive edge...

—Chris Clark in Engineering Practices in Data Science, Kaggle Blog

Enabling Reproducibility

To stand behind our results, we have to be able to reproduce them!

Rule 1: For Every Result, Keep Track of How It Was Produced

...

Rule 4: Version Control All Custom Scripts

...

Rule 5: Record All Intermediate Results, When Possible in Standardized Formats

...

Rule 7: Always Store Raw Data behind Plots

...

Ten Simple Rules for Reproducible Computational Research, International Society for Computational Biology
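Rules 1, 4, and 7 can be sketched together in a few lines of Python: store, next to every result, a fingerprint of the script and raw data that produced it. This is a minimal sketch, assuming the script source and raw data are available as in-memory values. The function `provenance_record`, its field names, and the sample inputs are all hypothetical, not part of the cited paper.

```python
import hashlib
import json
import platform
from typing import Any, Dict


def sha256_hex(data: bytes) -> str:
    """Fingerprint some bytes so any later change is detectable."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(script_source: str, raw_data: bytes,
                      params: Dict[str, Any], result: Any) -> Dict[str, Any]:
    """Bundle a result with enough context to trace how it was produced."""
    return {
        "script_sha256": sha256_hex(script_source.encode("utf-8")),
        "raw_data_sha256": sha256_hex(raw_data),
        "params": params,
        "python_version": platform.python_version(),
        "result": result,
    }


# Hypothetical inputs, for illustration only.
script = "result = sum(values) / len(values)"
raw = b"values\n1\n2\n3\n"
record = provenance_record(script, raw, {"column": "values"}, 2.0)

# Save this JSON alongside the plot or table it describes.
print(json.dumps(record, indent=2, sort_keys=True))
```

With the scripts themselves under version control, the recorded hash can be matched against a specific committed version, which is what lets you stand behind a result months later.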

Collaboration

Solving big, interesting problems requires working effectively with others.

While these tools and technologies are fundamentally changing how we collaborate on science, there is still considerable room for improvement in how we are using them.

Version control for scientific research, BioMed Central

New Ways of Working

Everything we do needs to support who we are. —D. Bailey

Version control systems (VCS)... are now finding new applications in science. ...Git, provides a lightweight yet robust framework that is ideal for managing the full suite of research outputs such as datasets, statistical code, figures, lab notes, and manuscripts. ...this tool can be leveraged to make science more reproducible and transparent, foster new collaborations, and support novel uses.

Git can facilitate greater reproducibility and increased transparency in science, Source Code for Biology and Medicine

"Big Data Needs to Grow Up"

We have to adopt software and practices to manage the accelerating information flow.

We are in an Information Revolution... but it is entering a new stage... [with] previously unimaginable quantities of data to measure, analyze and act on. These new data sources promise to transform our lives... but we need to get much better at handling all that data we’re producing and collecting.

—Mahesh S. Kumar in What Big Data Needs to Do to Grow Up, Harvard Business Review

version-control-for-data-science

ElderResearch
