version-control-for-data-science




Presentation for data scientists on using version control

On GitHub: ElderResearch/version-control-for-data-science

Version Control for Data Science

Where data science and time series data become one...
...with your very own TARDIS
Simeon H.K. Fitch
Director of Software Architecture
June 20th, 2014

Version Control for Data Science

Part 1: What and Why

What is a Version? How do I get one? What do I do with it? Why would I want it?

Modern Version Control is like a TARDIS

  • Can travel through:
    • Time
    • Space
    • Dimensions

TARDIS (/ˈtɑːdɪs/; Time and Relative Dimension in Space)

Modern Data Science Requires a TARDIS

  • Go forward and back in time
    • Models, data, results
  • Create new dimensions and parallel universes
    • Change the past, evaluate different futures
  • Travel through space
    • Share text, code and data with colleagues and clients

Data Scientists Must Be Time Lords

"Data Scientist(n): Person who is better at statistics than any software engineer, and better at software engineering than any statistician."

—Josh Wills, Cloudera’s Director of Data Science
  • Controlling what we do, how we do it, and what we produce is key to maintaining integrity in our work.
  • Version Control mastery is critical to achieving this as a modern Data Scientist.
Remember: Data Scientists are just better!
Implication: being a good data scientist means knowing software engineering best practices and stealing from them. Software engineers have been using version control for decades.

Version Control for Data Science

Part 2: Approaches & Systems

Time Lord Requirements

Let's think more about the details of what we need:

  • Backup and Restore: Access any file, at any time-point in existence.
  • Synchronization: Share files and stay up-to-date, but under your control.
  • Short-term undo: Goofed up? Get back to the good version.
  • Long-term undo: Sometimes we mess up badly. Find out what changed a year ago and revert that change.
  • Track Changes: Not only specifics on what changed, but notes on why.
  • Track Ownership: Know who made the change.
  • Sandboxing: Experiment and make changes in an isolated area, working out the kinks before sharing them.
  • Branching and merging: Controlling when, where, and how major changes are introduced.
This sounds like a TARDIS!
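
To make the wish list concrete, here is a toy sketch (not a real version control system) of the record-keeping behind several of these requirements: backup and restore, long-term undo, change notes, and ownership. The `ToyHistory` and `Snapshot` names are made up for illustration, and using a SHA-1 of the content as the version ID is an assumption borrowed loosely from how Git names things.

```python
import hashlib
import time
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """One recorded version of a file: what, who, why, and when."""
    content: str
    author: str
    message: str
    timestamp: float
    digest: str  # SHA-1 of the content, used here as the version ID


@dataclass
class ToyHistory:
    """Hypothetical in-memory history for a single file (illustration only)."""
    snapshots: list = field(default_factory=list)

    def commit(self, content: str, author: str, message: str) -> str:
        """Record a new version along with who made it and why."""
        digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
        self.snapshots.append(
            Snapshot(content, author, message, time.time(), digest)
        )
        return digest

    def restore(self, digest: str) -> str:
        """Backup and restore: fetch the content of any recorded version."""
        for snap in self.snapshots:
            if snap.digest == digest:
                return snap.content
        raise KeyError(digest)

    def revert_to(self, digest: str, author: str) -> str:
        """Long-term undo: re-commit old content without erasing history."""
        old_content = self.restore(digest)
        return self.commit(old_content, author, f"Revert to {digest[:7]}")

    def log(self) -> list:
        """Track changes and ownership: (short ID, author, note) per version."""
        return [(s.digest[:7], s.author, s.message) for s in self.snapshots]
```

Note that `revert_to` adds a new snapshot rather than deleting anything: the mistake stays in the record, which is the point of having a history at all.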

Source: A Visual Guide to Version Control

Much of this section is borrowed from:

Version Control in the Wild

  • Homebrew VC
  • Abdication-Oriented VC
  • Time Lord VC
Abdication-Oriented is also "fire and forget": you fire, and it takes care of the forgetting for you. Takeaway: version control needs to be part of your process and habits.

Homebrew VC (1)

Hip-Hop Approach

⌘-Z / Ctrl-Z

Taxonomist Approach

"Save As..."
ProjectBackup11a.zip
proposal_v23_agt_revised_final.docx
final_project_data_do_not_delete.csv

Not the TARDIS

Homebrew VC (2)

Filing Clerk Approach

Not the TARDIS
When asked to re-publish or reproduce, which version is the right one? What if you later discovered the contract ended before the copies?

Homebrew VC (3)

Cloud Storage Approach

MIME-Version: 1.0
Received: by 10.220.194.194 with HTTP; Thu, 19 Jun 2014 10:10:44 -0700 (PDT)
Date: Thu, 19 Jun 2014 13:10:44 -0400
Delivered-To: fitch@datamininglab.com
Message-ID: 
Subject: Backup of mumble_final_report_and_data_v24.zip
From: Simeon Fitch 
To: Simeon Fitch 
Content-Type: multipart/mixed; boundary=001a11c2dd740cbee804fc33756a

--001a11c2dd740cbee304fc337568
Content-Type: text/plain; charset=UTF-8

(backup to self)

--001a11c2dd740cbee304fc337568--
--001a11c2dd740cbee804fc33756a
Content-Type: application/x-zip; name="mumble_final_report_and_data_v24.zip"
Content-Disposition: attachment; filename="mumble_final_report_and_data_v24.zip"
Content-Transfer-Encoding: base64
X-Attachment-Id: f_hwmbquyc0

iVBORw0KGgoAAAANSUhEUgAAAnYAAAGCCAIAAADi3Rk8AAAKr2lDQ1BJQ0MgUHJvZmlsZQAASA2t
...
Not the TARDIS

Abdication-Oriented VC (1)

"Back-Me-Up-Scotty" Approach

  • Apple Time Machine
  • Windows Built-in Backup
  • Mozy
  • Carbonite

Cloud Sync Approach

  • Dropbox
  • Google Drive
  • AeroFS
  • SparkleShare
  • iCloud (?)
Not the TARDIS
These tools fail, and they purge history!! Google Drive has caused me significant pain.

Abdication-Oriented VC (2)

My-Tools-Take-Care-Of-It-For-Me Approach

  • Microsoft Office Revision Tracking (definitely not the TARDIS.)
  • Google Docs
Single-file, bound to file type. Lest you think the preacher is perfect...

Abdicating Control Hurts!

Time Lord VC (1)

Architecture Types:

  • Centralized: CVS, SourceSafe, Subversion, Team Foundation Server
  • Distributed: Git, Mercurial, Bazaar

Don't assume there's always an IT hoop to go through!

  • With the distributed model no server is necessary to get started
  • Even with the centralized systems you can initialize a local database (but there are a few steps).
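
As a rough sketch of the "no server necessary" point, the snippet below creates a Git repository in an ordinary local folder and records one version, with nothing but the `git` command-line tool installed. The helper names (`git`, `start_local_history`), the file name `model.R`, and the placeholder identity are all assumptions made for illustration.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Optional


def git(*args: str, cwd: Path) -> str:
    """Run a git command inside `cwd` and return its standard output."""
    # Placeholder identity so commits work on a machine with no git config.
    cmd = ["git", "-c", "user.name=Example Analyst",
           "-c", "user.email=analyst@example.com", *args]
    done = subprocess.run(cmd, cwd=cwd, check=True,
                          capture_output=True, text=True)
    return done.stdout


def start_local_history(project_dir: Optional[Path] = None) -> str:
    """Create a repository in a local folder and record a first version."""
    if shutil.which("git") is None:
        raise RuntimeError("git is not installed or not on the PATH")
    if project_dir is None:
        project_dir = Path(tempfile.mkdtemp())  # throwaway folder for the demo

    git("init", cwd=project_dir)                 # no server, no IT ticket
    (project_dir / "model.R").write_text("alpha <- 0.1\n")
    git("add", "model.R", cwd=project_dir)       # choose what to record
    git("commit", "-m", "Add initial model script", cwd=project_dir)
    return git("log", "--oneline", cwd=project_dir)
```

Calling `start_local_history()` should return a single log line ending in the commit message. The whole history lives in the `.git` folder next to the files.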

Don't assume you have to be a command-line ninja to be a Time Lord

  • All of these systems have third-party GUIs and tool integration plugins.
  • Caveat emptor: usability varies widely.
Authoritarian communism vs. free market??

Time Lord VC (2)

  • A Time Lord with both hearts pumping will use a Distributed Version Control System (DVCS).
  • The software team at ERI uses Git and Mercurial, both DVCSs.
  • Like The Doctor, most everyone has a personal favorite. Mine is Git with SourceTree as the GUI. SourceTree also supports Mercurial.

Time Lord VC (3)

System selection:

  • If the client dictates a VCS, there is no decision...
    • Learn and embrace what they have as quickly as you can, with enthusiasm!
    • Any version control is better than no version control, as long as you understand its limitations and it doesn't create a false sense of security.
  • If there's a system tightly integrated into the modeling platform...
    • Push it to its limits, and augment with Git as needed (e.g. for documentation or data)
    • Tip: There's a "portable" (non-installing) version of Git available for your thumb drive.

Time Lord VC (4)

  • If the choice of VCS is completely up to you, my recommendation is to use Git with SourceTree.
  • Mercurial with SourceTree would be my second choice, but be aware of which features are implemented as plugins and how to install them.
Let me know if you want to know the details of why I recommend Git over Mercurial, but I don't want to distract from the main message here.

Time Lord VC (5)

Analytics tools context:

  • RStudio supports Git
  • KNIME supports most everything via the Eclipse platform
  • SAS Data Integration Studio supports CVS and Subversion
  • SPSS Collaboration and Deployment Services: didn't have time to figure it out.

Version Control for Data Science

Part 3: Fundamental Operations

Much of this section is borrowed from:

Terms

  • Init
  • Clone
  • Add
  • Checkout
  • Commit
  • Push
  • Pull
  • Branch
  • Merge
  • Diff
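
One way to get a feel for these terms is to see them in order in a tiny scripted session. The sketch below drives the `git` command line from Python. The two working copies (`alice`, `bob`), the shared repository path, the file name `analysis.py`, and the placeholder identity are all hypothetical, and the "shared" repository is just a local folder standing in for a server.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(*args: str, cwd: Path) -> str:
    """Run a git command inside `cwd` and return its standard output."""
    cmd = ["git", "-c", "user.name=Example Analyst",
           "-c", "user.email=analyst@example.com", *args]
    done = subprocess.run(cmd, cwd=cwd, check=True,
                          capture_output=True, text=True)
    return done.stdout


def tour(root: Path) -> dict:
    """Walk through the basic terms using two local working copies."""
    if shutil.which("git") is None:
        raise RuntimeError("git is not installed or not on the PATH")
    shared, alice, bob = root / "shared.git", root / "alice", root / "bob"

    git("init", "--bare", str(shared), cwd=root)            # Init
    git("clone", str(shared), str(alice), cwd=root)         # Clone
    (alice / "analysis.py").write_text("threshold = 0.5\n")
    git("add", "analysis.py", cwd=alice)                    # Add
    git("commit", "-m", "Add analysis script", cwd=alice)   # Commit
    # The default branch name varies by git version, so ask for it.
    trunk = git("rev-parse", "--abbrev-ref", "HEAD", cwd=alice).strip()
    git("push", "origin", trunk, cwd=alice)                 # Push
    git("clone", str(shared), str(bob), cwd=root)           # a colleague clones

    git("checkout", "-b", "experiment", cwd=alice)          # Branch + Checkout
    (alice / "analysis.py").write_text("threshold = 0.7\n")
    git("commit", "-am", "Try a higher threshold", cwd=alice)
    diff = git("diff", trunk, "experiment", cwd=alice)      # Diff
    git("checkout", trunk, cwd=alice)
    git("merge", "experiment", cwd=alice)                   # Merge
    git("push", "origin", trunk, cwd=alice)

    git("pull", "--ff-only", cwd=bob)                       # Pull
    return {"diff": diff, "bob_sees": (bob / "analysis.py").read_text()}
```

Running `tour(Path(tempfile.mkdtemp()))` leaves three folders behind that can be poked at with any Git GUI, which may be a gentler way to explore than the command line.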

Checkins

Checkouts and Editing

Diffs
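
A diff is simply a line-by-line description of what changed between two versions. As a minimal illustration, Python's standard `difflib` module can produce the same "unified" format most version control tools display. The two versions of `analysis.py` below are invented for the example.

```python
import difflib

# Two hypothetical versions of the same script, as lists of lines.
old_version = [
    "threshold = 0.5\n",
    "model = fit(train)\n",
    "report(model)\n",
]
new_version = [
    "threshold = 0.7\n",
    "model = fit(train)\n",
    "validate(model)\n",
    "report(model)\n",
]

# Lines starting with "-" were removed, "+" were added, " " are unchanged.
diff_lines = list(
    difflib.unified_diff(
        old_version,
        new_version,
        fromfile="analysis.py (before)",
        tofile="analysis.py (after)",
    )
)
print("".join(diff_lines))
```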

Branching

Merging

Conflicts
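
The decision rule behind merging can be sketched in a few lines. The toy function below compares two edited copies against their common ancestor, one line at a time, and flags a conflict only when both sides changed the same line in different ways. It is a deliberate simplification: it assumes all three versions have the same number of lines (in-place edits only), which real merge tools do not require. The function name and conflict-marker labels are made up for illustration.

```python
from typing import List, Tuple


def merge_lines(
    base: List[str], ours: List[str], theirs: List[str]
) -> Tuple[List[str], List[int]]:
    """Toy three-way merge for equal-length files.

    Returns the merged lines and the indices of lines that conflict.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this toy merge only handles in-place line edits")

    merged: List[str] = []
    conflicts: List[int] = []
    for i, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:        # both sides agree (or nobody touched the line)
            merged.append(o)
        elif o == b:      # only "theirs" changed this line
            merged.append(t)
        elif t == b:      # only "ours" changed this line
            merged.append(o)
        else:             # both changed it differently: a human must decide
            conflicts.append(i)
            merged.append(f"<<<<<<< ours\n{o}\n=======\n{t}\n>>>>>>> theirs")
    return merged, conflicts
```

For example, if one analyst changes `alpha` and another changes `trees`, both edits merge cleanly. If both change `seed` to different values, that line is reported as a conflict.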

Distributed vs. Centralized

Version Control for Data Science

Part 4: Time Lord Training Regime

Next Steps (1)

If you're not convinced version control should be an essential component of your workflow, go read: What Can Data Scientists Learn from DevOps?

Go download

and play with it. It's free, fun, and fantastic.

Next Steps (2)

For a gentle introduction to the concepts and lingo behind modern VCS, read: A Visual Guide to Version Control

Once you start getting the feel of using version control in private, read: Intro to Distributed Version Control (Illustrated)

Resources/References

Links from along the way.

Also: the Appendix

Appendix

Warning: random stuff ahead.

Versions of What?

  • Ultimately, all the electronic by-products of your technical work.

  • It's like flossing: you don't have to floss the teeth you don't want to keep.

  • The time series data over all your technical artifacts.

  • The question is more appropriately stated: what do you not version control?

What to Version

Let's ask Hadley!

Version Control System History

Generation  Networking   Operations          Concurrency          Examples
First       None         One file at a time  Locks                RCS, SCCS
Second      Centralized  Multi-file          Merge before commit  CVS, SourceSafe, Subversion, Team Foundation Server
Third       Distributed  Changesets          Commit before merge  Bazaar, Git, Mercurial

From History of Version Control by Eric Sink

Evaluation Characteristics

  • Time Resolution: Continuous vs. Discrete
  • Architecture: Centralized vs. Distributed
  • Social: Collaborative vs. Isolated
  • Granularity: Changeset vs. File/Multi-File

Notable Quotes

But don't take my word for it...

No Such Thing as a One-off Script

Devalytics: Making your analytical work easy to replicate, build upon, and scale while saving significant amounts of time in the process

In the same way the DevOps mantra is "Infrastructure as code," today’s data scientists need to think of all their scripts as actual software that will require ongoing maintenance, enhancement, and support. To paint with a broad brush, there is no such thing as a one-off script. As soon as anyone else has access to it, or if it even sticks around on your local filesystem, it will almost inevitably be reused and applied to different situations in the future.

—Donnie Berkholz in What can data scientists learn from DevOps?, RedMonk

Evaluate Effectiveness

How effective is your time travel machinery?

A lot of the analysts and data scientists we know don’t do version control... If you ask them what do you do if you want to work on different versions of an experiment with different parameters, they say, ‘Well, I make a copy of my files. I have scripts and script [copies].’ How do you ensure you can get back to your old versions of your work? You use version control.

—Nick Elprin of Domino Data Lab in Data scientists need their own GitHub, VentureBeat

Engineering Practices

Faster iteration is a competitive advantage.

It terrifies me that... many data scientists don't use source control. If you ask them they might say something like "Sure I use version control! I email myself every new version of the R script. Plus it's on dropbox." ...Successful data scientists learn the value of good pipeline engineering [because] a well-engineered pipeline gets data scientists iterating much faster, which can be a big competitive edge...

—Chris Clark in Engineering Practices in Data Science, Kaggle Blog

Enabling Reproducibility

To stand behind our results, we have to be able to reproduce them!

Rule 1: For Every Result, Keep Track of How It Was Produced

...

Rule 4: Version Control All Custom Scripts

...

Rule 5: Record All Intermediate Results, When Possible in Standardized Formats

...

Rule 7: Always Store Raw Data behind Plots

...

Ten Simple Rules for Reproducible Computational Research, International Society for Computational Biology
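
As one possible reading of Rules 1 and 4, every result can be stored alongside a fingerprint of the exact script and parameters that produced it. The sketch below is only an illustration: the function names and record fields are invented, and the SHA-256 of the script text stands in for what would normally be a commit ID from your version control system.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(script_text: str, params: dict, result: float) -> dict:
    """Bundle a result with a fingerprint of how it was produced."""
    return {
        # Stand-in for a VCS commit ID: changes whenever the script changes.
        "script_sha256": hashlib.sha256(script_text.encode("utf-8")).hexdigest(),
        "params": params,
        "result": result,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def as_json(record: dict) -> str:
    """Serialize the record in a plain, standardized text format."""
    return json.dumps(record, indent=2, sort_keys=True)
```

Saving the output of `as_json` next to each figure or table makes it possible to answer "which version of the script made this?" long after the fact.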

Collaboration

Solving big, interesting problems requires working effectively with others.

While these tools and technologies are fundamentally changing how we collaborate on science, there is still considerable room for improvement in how we are using them.

Version control for scientific research, BioMed Central

New Ways of Working

Everything we do needs to support who we are. —D. Bailey

Version control systems (VCS)... are now finding new applications in science. ...Git, provides a lightweight yet robust framework that is ideal for managing the full suite of research outputs such as datasets, statistical code, figures, lab notes, and manuscripts. ...this tool can be leveraged to make science more reproducible and transparent, foster new collaborations, and support novel uses.

Git can facilitate greater reproducibility and increased transparency in science, Source Code for Biology and Medicine

"Big Data Needs to Grow Up"

We have to adopt software and practices to manage the accelerating information flow.

We are in an Information Revolution... but it is entering a new stage... [with] previously unimaginable quantities of data to measure, analyze and act on. These new data sources promise to transform our lives... but we need to get much better at handling all that data we’re producing and collecting.

—Mahesh S. Kumar in What Big Data Needs to Do to Grow Up, Harvard Business Review