On Github kamalasaurus / GraphicalWeb
Contact Kamal Radharamanan / @kamalasaurus
I'm a prototype engineer at Ravel Law! I'd like to describe some of the challenges we've encountered trying to communicate the relationships between legal entities (like cases, judges, and lawyers), and discuss how we've used data visualization to clarify some of those relationships.
First I'm going to give a brief overview of our problem space, mention some rules and thought patterns we apply as we try to visualize our data, and then walk through an example involving a big data set.
What is the law? In our case there are three kinds: statutory law, regulatory law, and common law (case law). These documents cite each other, much like hyperlinks on the World Wide Web.
Primarily, we focus on case law.
Perhaps more interestingly, each of these legal documents involves many entities, such as lawyers, judges, and companies. These entities, naturally, are associated with each other! Opinions can be authored by the same judge or in the same jurisdiction; they can contain mentions of the same law firms or affected parties.
Our example will focus on the citation network between judges.
(...don't step in the lava)
When is a picture worth a thousand words?
When it provides context!
1) Limit the scope.
If you want to display one dimension clearly, other dimensions will be undepicted or diminished.
PROTIP: the diminished dimensions will provide the context.
2) For a data visualization to stand independent of words, the final image needs a language.
I think of a language as a consistent set of rules by which information is encoded; in this case, encoded in the image.
ideogram -> alphabet
skeuomorphic -> flat
I've come to think of data visualizations as elaborate hieroglyphics... or comic books with shapes for characters.
Malofiej Awards for excellence in infographics
Jonathan Corum, NY Times data journalist (check out the whale example)
Your language will consist of shapes (circles, squares, lines, and polygons), along with colors and positioning.
Groupings are stronger than position. Positions are stronger than shapes. Shapes are stronger than colors. Colors are stronger than opacity.
The more dimensions you show, the more pristine your data must be! Random artifacts might imply relationships that don't necessarily exist.
There's a limit to how much a person can comprehend at once. You can exploit transitions to show an evolution of information, but that approach may not generalize, especially if the visualization is 3D.
In general: try to show exactly 1 thing.
Other things will leak in.
The more data you're showing simultaneously, the more likely you are to express an unintentional relationship.
For example, accidental groupings in a force-directed layout.
The visual system sees a 2D projection of 3D things. To fully reason about data mapped in 3 dimensions, it needs to be interactive. This immediately limits your explanatory capacity to scripted interactions (which are not generalizable).
You will probably discover that your data is not as pristine as you need it to be.
Power Law distribution of citations amongst case law. (I forgot to histogram between judges.)
It's a scale-free network!
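One way to check for that power-law shape is to histogram how many citations each case receives. This is a minimal sketch, assuming citations arrive as (citing, cited) pairs; the toy edge list and the name `in_degree_histogram` are made up for illustration and are not our actual pipeline.

```python
from collections import Counter

def in_degree_histogram(citations):
    """Map in-degree -> number of cases cited that many times.

    Cases that are never cited do not appear in the result.
    """
    in_degree = Counter(cited for _citing, cited in citations)
    return Counter(in_degree.values())

# Toy data (hypothetical): "A" is cited three times, "B" and "C" once each.
toy_citations = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "B"), ("A", "C")]
histogram = in_degree_histogram(toy_citations)
for degree in sorted(histogram):
    print(degree, histogram[degree])
```

On a power-law distribution, plotting these counts on log-log axes should come out as roughly a straight line.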
Total number of cases in US law: 8~10 million
Total number of links: 60~80 million
Data visualization is intrinsically an aggregation problem. There are graphs so large that there aren’t screens with enough pixels to display each element uniquely. What metrics help us aggregate for the legal dataset?
How many of what things are you going to be showing in general?
If you're showing a link graph, is the graph sparse or dense?
A sparse graph is one where the number of edges is ~ O(n). Tree-like visualizations work here.
A dense graph is one where the number of edges is ~ O(n^2). Here you have to get creative with aggregation.
So, we're somewhere near the (log-scale) middle! We have about (O(n))^2...
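To place a graph between those two extremes on a log scale, one option is to solve edges = n^x for x: x near 1 reads as sparse, x near 2 as dense. The helper name `edge_exponent` and the node counts below are hypothetical, chosen only to show the two endpoints.

```python
import math

def edge_exponent(num_nodes, num_edges):
    """Return x such that num_edges is approximately num_nodes ** x."""
    return math.log(num_edges) / math.log(num_nodes)

# Hypothetical graphs with 1,000 nodes each.
sparse_x = edge_exponent(1_000, 1_000)      # tree-like: edges ~ n
dense_x = edge_exponent(1_000, 1_000_000)   # dense: edges ~ n^2
print(sparse_x, dense_x)
```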
Is that a significant metric? Most of these definitions are for the mathematical description of a graph. Not necessarily the visual description.
What might be more relevant is the ratio of pixels to elements.
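A quick back-of-the-envelope version of that ratio, as a sketch. The 1920x1080 display is an assumption, and the case count is the low end of the 8~10 million estimate; `pixels_per_element` is a made-up helper name.

```python
def pixels_per_element(width_px, height_px, num_elements):
    """Pixels available per element if every element were drawn at once."""
    return (width_px * height_px) / num_elements

# Assumed display: 1920x1080. Element count: 8 million cases.
ratio = pixels_per_element(1920, 1080, 8_000_000)
print(round(ratio, 2))
```

Anything below 1 means there is less than one pixel per case, so some form of aggregation is unavoidable on that screen.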
No man's land, as it were.
Here we gooooooo~
Hypothesis: Judges cite more to their controlling and controlled jurisdictions, or jurisdictions they used to occupy, or judges they once clerked for.
A classic example of where visualization can help you make sense of associations.
and why they're problematic
Supreme Court, Cir. 1, Cir. 2, Cir. 3, D. CT, ED.NY, ND.NY, SD.NY, NY
A Map.
Squares!
None of these make intuitive sense given our constraints. Let's try something a bit off the wall.
Too many lines
Mixed metaphors
Breaking the density metric!
Side-note on the random distribution of judge locations in radial distribution.
...with the links!
Exactly white noise.
How do you aggregate that data? Every aggregation you make reduces the resolution of your data.
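One possible aggregation, sketched under assumptions: collapse judge-to-judge citations into jurisdiction-to-jurisdiction counts. The judge names, the `jurisdiction_of` lookup, and the function name are all hypothetical.

```python
from collections import Counter

def aggregate_by_jurisdiction(judge_citations, jurisdiction_of):
    """Collapse judge-to-judge citations into jurisdiction-pair counts."""
    counts = Counter()
    for citing_judge, cited_judge in judge_citations:
        pair = (jurisdiction_of[citing_judge], jurisdiction_of[cited_judge])
        counts[pair] += 1
    return counts

# Hypothetical judges and their jurisdictions.
jurisdiction_of = {"Judge A": "SD.NY", "Judge B": "SD.NY", "Judge C": "Cir. 2"}
judge_citations = [
    ("Judge A", "Judge C"),
    ("Judge B", "Judge C"),
    ("Judge A", "Judge B"),
]
aggregated = aggregate_by_jurisdiction(judge_citations, jurisdiction_of)
print(aggregated)
```

Three links become two weighted links, and you can no longer tell which judge did the citing. That is the resolution you give up.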
Opacity and overlap vs. collision detection vs. display room vs. force-directed collection.
Our hypothesis is false.
Judges are very promiscuous with their citation networks.
Force direction does not always make sense: at O(n^3) it limits the scale of the visualization tremendously, since JavaScript is single-threaded. We would have to precompute the coordinates.
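Precomputing could look something like the sketch below: run a small spring-and-repulsion layout offline, then ship the coordinates as JSON. This is a toy in the style of Fruchterman-Reingold, not the layout we actually used; the function name, parameters, and four-node graph are assumptions for illustration.

```python
import json
import math
import random

def force_layout(nodes, edges, iterations=50, seed=0):
    """Tiny spring-and-repulsion layout; returns {node: (x, y)}."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    k = 1.0 / math.sqrt(len(nodes))  # preferred distance between nodes
    for step in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Repulsion between every pair of nodes: quadratic per iteration.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                force = k * k / dist
                disp[a][0] += dx / dist * force
                disp[a][1] += dy / dist * force
                disp[b][0] -= dx / dist * force
                disp[b][1] -= dy / dist * force
        # Attraction along each edge pulls linked nodes together.
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            force = dist * dist / k
            disp[a][0] -= dx / dist * force
            disp[a][1] -= dy / dist * force
            disp[b][0] += dx / dist * force
            disp[b][1] += dy / dist * force
        # Move each node, capped by a step size that shrinks over time.
        max_step = 0.1 * (1 - step / iterations)
        for n in nodes:
            length = math.hypot(disp[n][0], disp[n][1]) or 1e-9
            scale = min(length, max_step) / length
            pos[n][0] += disp[n][0] * scale
            pos[n][1] += disp[n][1] * scale
    return {n: (pos[n][0], pos[n][1]) for n in nodes}

# Hypothetical four-node chain.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
layout = force_layout(nodes, edges)
layout_json = json.dumps(layout)  # load this in the browser instead of simulating there
```

The pairwise repulsion loop alone is quadratic per iteration, which is why running this live in the browser stops being practical as the node count grows.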
Sum random distributions to approximate Gaussians! But a uniform distribution will give the most even spread.
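A small sketch of that trade-off, using only the standard library. Averaging several uniform draws (a rescaled sum) piles samples up near the middle, while a single uniform draw spreads them evenly. The helper names, the seed, and the choice of 12 terms are arbitrary assumptions.

```python
import random

def summed_uniform(rng, terms):
    """Average of `terms` uniform draws in [0, 1); bell-shaped for terms > 1."""
    return sum(rng.random() for _ in range(terms)) / terms

def share_in_middle(samples):
    """Fraction of samples landing in the middle half, [0.25, 0.75)."""
    return sum(0.25 <= s < 0.75 for s in samples) / len(samples)

rng = random.Random(42)
uniform = [summed_uniform(rng, 1) for _ in range(10_000)]
bell = [summed_uniform(rng, 12) for _ in range(10_000)]
print(share_in_middle(uniform), share_in_middle(bell))
```

The uniform samples put about half their mass in the middle half of the range; the summed samples put nearly all of it there, which is the clumping you want to avoid when placing nodes.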
Need to label associations between the judges. Did so and so clerk for this and that? In general, the pattern shows that judges cite outside of their jurisdiction indiscriminately.
more + more != more