On GitHub: tswicegood / real-world-data
A talk by Travis Swicegood / @tswicegood / #rwdata
Link at the end, so hold your horses
Normally…
Most real-world data, as I'm describing it, is not a big data problem. Yes, if you're dealing with traffic sensors around the Houston highway system and getting hundreds of data points every second from thousands of sensors, that's a big data problem. If you're dealing with college readiness numbers from the TEA, that's not big data.

Because people are involved in producing most of this data, it's very unpredictable. What was right yesterday when you pulled the data might not be true tomorrow. Plan on the data being unpredictable and you'll safeguard your processes.
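One concrete way to plan for that: check your assumptions every time you pull the data and fail loudly the moment they break. This is just a minimal sketch; the file name and headers below are made up, the point is the loud failure.

```python
import csv
import sys

# The headers we verified the first time we studied the data set (hypothetical names).
EXPECTED_HEADERS = ["district", "campus", "readiness_pct"]

with open("college_readiness.csv", newline="") as fh:
    reader = csv.reader(fh)
    headers = next(reader)
    if headers != EXPECTED_HEADERS:
        # Fail loudly the day the source quietly changes shape.
        sys.exit(f"Unexpected headers: {headers!r} (expected {EXPECTED_HEADERS!r})")
    for row in reader:
        ...  # safe to process; the shape is what we expect
```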
Case in point: a few years ago I built the first version of the Texas Tribune bills tracker. We started work on it in November and the session started the first week of January. I built everything using the previous session (the 81st) as the model. One of those pieces of data was that Senator Dan Patrick was always shown in the legislature's data as "Patrick, Dan".
There were two Patricks in the legislature. Dan Patrick was a senator and Diane Patrick was a representative. Since there were two, they had their names adjusted to show that they were distinct. Or that's the way it was in the 81st Session.
Starting the day after the 82nd Session kicked off, Dan Patrick's name was presented as just "Patrick". Apparently it was decided that senators should be displayed with only their last name when no other senator's name conflicted, while representatives would get the "Last, First" treatment.
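One way to guard against that kind of change is to stop assuming a single format. This is just a sketch of the idea, not the tracker's actual code: accept either "Last, First" or a bare "Last" and normalize to one shape before matching.

```python
def normalize_member_name(raw):
    """Return (last, first) whether the source sends 'Patrick, Dan' or just 'Patrick'."""
    raw = raw.strip()
    if "," in raw:
        last, first = (part.strip() for part in raw.split(",", 1))
        return last, first
    return raw, ""  # bare last name; the first name isn't in this field

assert normalize_member_name("Patrick, Dan") == ("Patrick", "Dan")
assert normalize_member_name("Patrick") == ("Patrick", "")
```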
Quantifiable
Basically, know everything. When you're working with real-world data, you need to understand everything about the data you're processing. It's not uncommon to have a row of data that looks perfectly valid, but column 221 has a special flag that, if set, means the data is junk and should be ignored.
Make sure you grok the entire data set and don't make assumptions about it.
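As an illustration of that kind of gotcha (the column number and flag value here are invented, not from any real data set), the importer ends up with guards like this:

```python
import csv

JUNK_FLAG_COLUMN = 221   # hypothetical: the column whose flag marks a row as junk
JUNK_FLAG_VALUE = "X"    # hypothetical value meaning "ignore this row entirely"

def usable_rows(path):
    """Yield only the rows that aren't flagged as junk."""
    with open(path, newline="") as fh:
        for row in csv.reader(fh):
            if len(row) > JUNK_FLAG_COLUMN and row[JUNK_FLAG_COLUMN] == JUNK_FLAG_VALUE:
                continue  # looks valid, but the flag says it's junk
            yield row
```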
At least everything you use and learn
There is going to be a time when you come back to your importer and ask "why did I throw out columns 11-13?" Make sure you're documenting your work and what you learned about the data. If column 221 provides a "junk" status, make sure that's documented. Code works, English is better.

CSVKit provides a bunch of command-line tools for inspecting and modifying CSV files. CSV is the lingua franca of government data sets, so this toolkit provides some really useful ways to get data moved around.
Related, the intro documentation is an excellent way to teach someone how to use a shell.
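Back to the documentation point: even a small data dictionary that the importer actually reads from beats a mental note. Everything below (column numbers and meanings) is invented for illustration.

```python
# What we learned about the source file, written down where the code can't drift from it.
COLUMNS = {
    10: "campus_id",          # hypothetical: joins against a campus directory
    11: "unused_internal_1",  # columns 11-13 are internal codes; dropped on purpose
    221: "junk_flag",         # "X" means the row should be ignored entirely
}

def column_name(index):
    return COLUMNS.get(index, f"undocumented_column_{index}")
```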
Charles Proxy is great for dealing with websites that you need to reverse engineer. Remember, you're scripting everything, so you want to be able to script retrieving the data too.
Glimmer Blocker is interesting as a tool that you can use for unintended purposes. It's meant to block sites and filter traffic, but it also provides logging. You can set it up as the HTTP proxy for your phone, then all traffic will go through it and get logged. You can use it to "discover" hidden APIs to get at data that's hidden in mobile applications.
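Either way, once the proxy log shows you the endpoint and headers an app is using, scripting the retrieval is usually only a few lines. The URL, header names, and output file below are hypothetical stand-ins for whatever you find in the log.

```python
import requests

# Endpoint and headers copied out of the proxy log (hypothetical values here).
URL = "https://example.com/api/v2/stations"
HEADERS = {"User-Agent": "data-importer/0.1", "X-Api-Key": "value-seen-in-the-proxy-log"}

response = requests.get(URL, headers=HEADERS, timeout=30)
response.raise_for_status()

# Keep the raw response on disk so the rest of the pipeline never has to re-fetch it.
with open("stations.json", "wb") as fh:
    fh.write(response.content)
```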
This lets you create rules that are applied to the entire data set. It's a great way to create a first pass at normalizing data so acronyms are expanded, and so forth.
This used to be a Google project, but it's now a full open-source project called OpenRefine that you can run yourself.
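OpenRefine does this interactively, but the underlying idea is simple enough to sketch in a few lines of Python: a table of rules applied to every record. The acronyms here are made-up examples, not a real rule set.

```python
# A first pass at normalization: expand acronyms consistently across the data set.
ACRONYMS = {
    "TEA": "Texas Education Agency",
    "ISD": "Independent School District",
}

def expand_acronyms(value):
    return " ".join(ACRONYMS.get(word, word) for word in value.split())

print(expand_acronyms("Austin ISD"))  # -> "Austin Independent School District"
```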