"Data Scientist(n): Person who is better at statistics than any software engineer, and better at software engineering than any statistician."
—Josh Wills, Cloudera’s Director of Data Science
Let's think more about the details of what we need:
MIME-Version: 1.0
Received: by 10.220.194.194 with HTTP; Thu, 19 Jun 2014 10:10:44 -0700 (PDT)
Date: Thu, 19 Jun 2014 13:10:44 -0400
Delivered-To: fitch@datamininglab.com
Subject: Backup of mumble_final_report_and_data_v24.zip
From: Simeon Fitch
To: Simeon Fitch
Content-Type: multipart/mixed; boundary=001a11c2dd740cbee804fc33756a

--001a11c2dd740cbee304fc337568
Content-Type: text/plain; charset=UTF-8

(backup to self)

--001a11c2dd740cbee304fc337568--
--001a11c2dd740cbee804fc33756a
Content-Type: application/x-zip; name="mumble_final_report_and_data_v24.zip"
Content-Disposition: attachment; filename="mumble_final_report_and_data_v24.zip"
Content-Transfer-Encoding: base64
X-Attachment-Id: f_hwmbquyc0

iVBORw0KGgoAAAANSUhEUgAAAnYAAAGCCAIAAADi3Rk8AAAKr2lDQ1BJQ0MgUHJvZmlsZQAASA2t ...
Architecture Types:
Don't assume there's always an IT hoop to go through!
Don't assume you have to be a command-line ninja to be a Time Lord
System selection:
Analytics tools context:
and play with it. It's free, fun, and fantastic.
Links from along the way.
Also: the Appendix
Warning: random stuff ahead.
Ultimately, all the electronic by-products of your technical work.
It's like flossing: you don't have to floss the teeth you don't want to keep.
The time series data over all your technical artifacts.
The question is more appropriately stated: what do you not version control?
Devalytics: Making your analytical work easy to replicate, build upon, and scale while saving significant amounts of time in the process
In the same way the DevOps mantra is "Infrastructure as code," today’s data scientists need to think of all their scripts as actual software that will require ongoing maintenance, enhancement, and support. To paint with a broad brush, there is no such thing as a one-off script. As soon as anyone else has access to it, or if it even sticks around on your local filesystem, it will almost inevitably be reused and applied to different situations in the future.
—Donnie Berkholz in What can data scientists learn from DevOps?, RedMonk
How effective is your time travel machinery?
A lot of the analysts and data scientists we know don’t do version control... If you ask them what do you do if you want to work on different versions of an experiment with different parameters, they say, ‘Well, I make a copy of my files. I have scripts and script [copies].’ How do you ensure you can get back to your old versions of your work? You use version control.
—Nick Elprin of Domino Data Lab in Data scientists need their own GitHub, VentureBeat
Faster iteration is a competitive advantage.
It terrifies me that... many data scientists don't use source control. If you ask them they might say something like "Sure I use version control! I email myself every new version of the R script. Plus it's on dropbox." ...Successful data scientists learn the value of good pipeline engineering [because] a well-engineered pipeline gets data scientists iterating much faster, which can be a big competitive edge...
—Chris Clark in Engineering Practices in Data Science, Kaggle Blog
To stand behind our results, we have to be able to reproduce them!
Rule 1: For Every Result, Keep Track of How It Was Produced
...Rule 4: Version Control All Custom Scripts
...Rule 5: Record All Intermediate Results, When Possible in Standardized Formats
...Rule 7: Always Store Raw Data behind Plots
... Ten Simple Rules for Reproducible Computational Research, International Society for Computational Biology
Solving big, interesting problems requires working effectively with others.
While these tools and technologies are fundamentally changing how we collaborate on science, there is still considerable room for improvement in how we are using them.
Version control for scientific research, BioMed Central
Everything we do needs to support who we are. —D. Bailey
Version control systems (VCS)... are now finding new applications in science. ...Git, provides a lightweight yet robust framework that is ideal for managing the full suite of research outputs such as datasets, statistical code, figures, lab notes, and manuscripts. ...this tool can be leveraged to make science more reproducible and transparent, foster new collaborations, and support novel uses.
Git can facilitate greater reproducibility and increased transparency in science, Source Code for Biology and Medicine
We have to adopt software and practices to manage the accelerating information flow.
We are in an Information Revolution... but it is entering a new stage... [with] previously unimaginable quantities of data to measure, analyze and act on. These new data sources promise to transform our lives... but we need to get much better at handling all that data we’re producing and collecting.
—Mahesh S. Kumar in What Big Data Needs to Do to Grow Up, Harvard Business Review
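One way to act on Rule 1 ("For Every Result, Keep Track of How It Was Produced") and Rule 4 ("Version Control All Custom Scripts") is to store each result next to the Git commit and parameters that produced it. The following is only a minimal sketch, not something taken from the sources quoted here: the function names `current_commit` and `save_with_provenance`, the JSON layout, and the `"unknown"` fallback are all hypothetical choices made for illustration. It assumes the `git` command-line tool is installed and that the analysis scripts live in a Git repository.

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def current_commit(repo_dir="."):
    """Return the Git commit hash for repo_dir, or "unknown" if it cannot be read.

    Hypothetical helper: it shells out to `git rev-parse HEAD`.
    """
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, repo_dir is missing, or there is no commit yet.
        return "unknown"


def save_with_provenance(result, params, out_path, repo_dir="."):
    """Write a result to JSON together with how it was produced.

    The record layout (result, params, commit, created_utc) is an
    assumption made for this sketch.
    """
    record = {
        "result": result,
        "params": params,
        "commit": current_commit(repo_dir),
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }
    Path(out_path).write_text(json.dumps(record, indent=2, sort_keys=True))
    return record
```

A call such as `save_with_provenance({"auc": 0.91}, {"seed": 42}, "run_001.json")` (with made-up values) would leave a small JSON file that answers "which version of the code, with which settings, produced this number?" without relying on emailed copies of scripts.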