
Real World Data

Dealing with it

A talk by Travis Swicegood / @tswicegood / #rwdata

Housekeeping

Where are we going?

  • Understanding
  • Strategies
  • Tools
  • Case Studies

Questions: Ask Them

Twitter!

@tswicegood | #rwdata

Slides are Online

Link at the end, so hold your horses

Texas Tribune

I work at the Texas Tribune as the Director of Technology. We do a lot of data applications where we're munging data from different sources and turning it into something more useful for our readers. This talk is about the lessons we've learned.

Story Time

What Is It?

Not Big Data

Normally…

Most real-world data, as I'm describing it, is not a big data problem. Yes, if you're dealing with traffic sensors around the Houston highway system and getting hundreds of data points every second from thousands of sensors, that's a big data problem. If you're dealing with college readiness numbers from the TEA, that's not big data.

Dirty

Most of the data you get in these situations is dirty. Most likely someone entered it manually, and it almost certainly had a human interact with it at some point.

Unpredictable

Because people are involved in producing most of this data, it's very unpredictable. What was right yesterday when you pulled the data might not be true tomorrow. Plan on the data being unpredictable and you'll safeguard your processes.

Case in point: a few years ago I built the first version of the Texas Tribune bills tracker. We started work on it in November and the session started the first week of January. I built everything up using the previous session (the 81st) as the model. One of those pieces of data was that Senator Dan Patrick was always shown in their data as "Patrick, Dan".

There were two Patricks in the legislature. Dan Patrick was a senator and Diane Patrick was a representative. Since there were two, they had their names adjusted to show that they were distinct. Or that's the way it was in the 81st Session.

Starting the day after the 82nd Session kicked off, Dan Patrick's name was presented as just "Patrick". Apparently it was decided that senators should have only their last names displayed when no other senator's name conflicted, while representatives would get the "Last, First" treatment.

Limited

Most of the data you're going to get is going to be limited in scope. You're normally going to have to combine multiple sources to get everything you're after, because...

Never Exactly What You Need

Unpleasant

I don't mean to make it sound horrible, but it's generally a headache to work with.

Constrained

Quantifiable

  • 150 Representatives
  • 31 Senators
  • 254 Counties

This is a plus. Most big data problems have no upper bound. Most real-world data represents something. It's a set number of representatives or senators. Texas only has so many counties. This gives you something to test against to make sure you've got everything you need.
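That test can live right in your import script. A minimal sketch of the idea, assuming the `counties`, `representatives`, and `senators` sequences are hypothetical results of an import step:

    # Sanity-check an import against the known real-world counts.
    # The sequences passed in are hypothetical import results.
    EXPECTED = {"counties": 254, "representatives": 150, "senators": 31}

    def check_counts(counties, representatives, senators):
        found = {
            "counties": len(counties),
            "representatives": len(representatives),
            "senators": len(senators),
        }
        for name, expected in EXPECTED.items():
            assert found[name] == expected, (
                "expected %d %s, got %d" % (expected, name, found[name]))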

Really Interesting

If nothing else, there are lots of really interesting pieces of data out there that you can grab hold of to do cool things with.

Strategies

Script Everything

Seriously, Everything

Whether you're writing a scraper or adjusting first/last orders in columns of a CSV, write a script for it. It's ok to tinker with manual processes, but never use data that wasn't generated by a script. It makes the process something you can duplicate.
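As a small illustration of how little it takes, here's a sketch of a script that flips a "Last, First" column around; the file and column names are made up:

    import csv

    # Rewrite a "Last, First" name column as "First Last".
    # The file and column names are placeholders.
    with open("members.csv") as infile, \
            open("members_clean.csv", "w", newline="") as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if "," in row["name"]:
                last, first = [p.strip() for p in row["name"].split(",", 1)]
                row["name"] = "%s %s" % (first, last)
            writer.writerow(row)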

Keep Copies

Data frequently disappears or changes in ways that can't fully be explained to those of us who think data should live forever. Make sure to keep a copy of the original source data, whether that's CSV, a shapefile, or HTML that you're processing.
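One way to build that habit is to have the retrieval script archive a dated copy of whatever it pulls down. A sketch, with a placeholder URL and directory layout:

    import datetime
    import os
    import urllib.request

    # Download the source file and keep a dated copy of the raw original.
    # The URL and directory are placeholders.
    SOURCE_URL = "http://example.com/data/counties.csv"
    ARCHIVE_DIR = "raw"

    os.makedirs(ARCHIVE_DIR, exist_ok=True)
    stamp = datetime.date.today().isoformat()
    archive_path = os.path.join(ARCHIVE_DIR, "counties-%s.csv" % stamp)
    urllib.request.urlretrieve(SOURCE_URL, archive_path)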

Know Your Source

Understand where your data is coming from. Private sources might have a horse in the race, so the data they provide could be slanted. Government agencies could have a mandate that forces their hand in what they provide. Make sure you know that.

Know What Humans Touched

Know if a human was involved with any of the data you're working with. Often, government data will have someone tweaking this data point or adjusting that one. Make sure you understand where humans were involved with your data so you know where to watch for anomalies.

Know Everything

Basically, know everything. When you're working with real-world data, you need to understand everything about the data you're processing. It's not uncommon to have a row of data that looks perfectly valid, but column 221 has a special flag that, if set, means the data is junk and should be ignored.

Make sure you grok the entire data set and don't make assumptions about it.
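Sticking with the hypothetical column-221 example, that kind of knowledge belongs in the importer itself, not in someone's head. A sketch:

    import csv

    # Skip rows whose "junk" flag is set. The column index and flag value
    # come from the hypothetical example above, not a real data set.
    JUNK_COLUMN = 221
    JUNK_FLAG = "1"

    def usable_rows(path):
        with open(path) as infile:
            for row in csv.reader(infile):
                if row[JUNK_COLUMN] == JUNK_FLAG:
                    continue  # documented junk marker; ignore the row
                yield row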

Document Everything

At least everything you use and learn

There is going to be a time when you come back to your importer and ask "why did I throw out columns 11-13?" Make sure you're documenting your work and what you learned about the data. If column 221 provides a "junk" status for a row, make sure that's documented. Code works; English is better.

Embrace Change

Be prepared for change. Don't write brittle code that interacts with the data. Just admit it's going to change and be ready for that.

Tools

Git

This isn't huge data. You need to keep copies of it around, so throw those copies into source control. Depending on the structure and format, this can be an excellent way to see how data is changing.
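In practice that can be as simple as having the scraper commit its raw files after each pull, so the diffs show how the data changed between runs. A sketch, with placeholder paths and commit message:

    import subprocess

    # Commit the refreshed raw files after a pull so `git diff` shows
    # exactly what changed between runs. The path is a placeholder.
    subprocess.check_call(["git", "add", "raw/"])
    # `git commit` exits non-zero when nothing changed; that's not fatal here.
    subprocess.call(["git", "commit", "-m", "Update raw data pull"])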

CSVKit

CSVKit provides a bunch of command-line tools for inspecting and modifying CSV files. CSV is the lingua franca of government data sets, so these tools provide some really useful ways to get data moved around.

Related, the intro documentation is an excellent way to teach someone how to use a shell.

Tabula

Demo time? Tabula is a cool, new (unreleased) tool that allows you to pull structured CSVs straight out of a PDF. The version that's launching is web-based, but it would be reasonable to create a CLI version based on this.

Unix Philosophy

Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.

This isn't a tool, but you should follow it whenever you build tools. You want all of your data processing to follow the Unix Philosophy so you can swap out any part of your tool chain.
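In Python terms, that usually means small filters that read stdin and write stdout so they can be chained with anything else. A trivial sketch:

    import sys

    # A minimal Unix-style filter: read lines on stdin, strip trailing
    # whitespace, write the result to stdout. Chainable with any other tool.
    for line in sys.stdin:
        sys.stdout.write(line.rstrip() + "\n")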

Python

NumPy

pandas

I'm partial, but I think Python is one of the best tools (if not the best) for dealing with data. It's quick, easy to work with, and provides pretty powerful tools for text processing. Throw in NumPy and pandas for the heavy statistical work, and it's pretty solid.
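As a taste of what that looks like, a sketch of a first pass over a cleaned-up CSV with pandas; the file and column names are made up:

    import pandas as pd

    # Load a cleaned CSV and get a quick statistical summary.
    # The file and column names are made up for illustration.
    schools = pd.read_csv("schools.csv")
    print(schools.describe())
    print(schools.groupby("county")["enrollment"].sum())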

R

That said, R is gaining a lot of traction as a processing language. It's meant to be used in a statistical setting, so it has a lot of batteries included (or easily available).

Chrome Web Tools

This is often overlooked, but if you're building scrapers it's a great way to interact with the data.

jQuery -> pyQuery

Part of working with Chrome Web Tools is using jQuery to inspect the DOM and locate relevant information. That's the exploratory phase; then I move that code into Python and pyQuery to start processing the results I find.
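The selectors worked out in the console carry over almost directly. A sketch of the pyQuery side, with a placeholder URL and selectors:

    from pyquery import PyQuery

    # Selectors worked out in the browser console carry straight over.
    # The URL and selectors are placeholders.
    doc = PyQuery(url="http://example.com/members")
    for row in doc("table.members tr").items():
        print(row("td.name").text(), row("td.district").text())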

HTTP Proxies

Charles Proxy is great for dealing with websites that you need to reverse engineer. Remember, you're scripting everything, so you want to be able to script retrieving the data too.

Glimmer Blocker is interesting as a tool that you can use for unintended purposes. It's meant to block sites and filter traffic, but it also provides logging. You can set it up as the HTTP proxy for your phone, then all traffic will go through it and get logged. You can use it to "discover" hidden APIs to get at data that's hidden in mobile applications.
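Once the proxy has shown you the requests a site or app makes, replaying them from a script is straightforward. A sketch using the requests library pointed at a local proxy; the URL, header, and proxy port are placeholders, and requests is assumed to be installed:

    import requests

    # Replay a request discovered through the proxy. The URL, header, and
    # proxy port are placeholders; point them at whatever you found.
    proxies = {
        "http": "http://localhost:8888",
        "https": "http://localhost:8888",
    }
    response = requests.get(
        "http://example.com/api/reservoirs.json",
        proxies=proxies,
        headers={"User-Agent": "data-pull-script"},
    )
    data = response.json()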

OpenRefine

This lets you create rules that are applied to the entire data set. It's a great way to take a first pass at normalizing data, expanding acronyms, and so forth.

This used to be a Google project, but it's now a full open-source project called OpenRefine that you can run yourself.

Overview

Overview takes text documents and processes them, pulling out keywords and then giving you a way to explore those trees of data. I've not had a reason to use it (yet), but it looks amazing, and the demos for it (available on the site) show some really interesting results.

Data Stores

  • CouchDB
  • MongoDB
  • Postgres/PostGIS

These are included as simple data stores that are easy to get started with. Each has pros and cons that are documented much better elsewhere than I can cover here in a few minutes, so I encourage you to spend some time learning about them.
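To give a sense of how little setup they need, a sketch of dropping a scraped record into MongoDB with pymongo; the database, collection, and record are placeholders, and pymongo is assumed to be installed:

    from pymongo import MongoClient

    # Dump scraped records into MongoDB while their shape is still settling.
    # The database, collection, and record are placeholders.
    client = MongoClient()
    collection = client["rwdata"]["reservoirs"]
    collection.insert_one({"name": "Example Reservoir", "percent_full": 50.0})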

Use Cases

Public Schools Explorer

8 Different Sources

Over 70 million rows

1 data source has over 700 columns

12 Steps to Import

State Employee Salaries

141 Sources

Mostly Provided Separately

Still Automating

State Prisoners

1 Source

On auto-pilot

Reservoir Map

JSON Data!

Built Relationship

  • Early Access
  • Provided Feedback

Talk to Your Sources

Questions?

@tswicegood | #rwdata

tswicegood.github.io/real-world-data/