LOD-Laundromat
Publishing Other People's Dirty Data
Wouter Beek,
Laurens Rietveld,
Hamid Bazoobandi,
Jan Wielemaker,
Stefan Schlobach
VU University Amsterdam
Dirty data
- Character encoding issues
- Socket errors
- Protocol errors
- Corrupted archives
- Authentication problems
- Syntax errors
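
A minimal Python sketch of how a cleaning step might catch and classify these error classes while fetching a single dump. The URL, error labels, and the use of requests/rdflib are illustrative assumptions, not the actual LOD Laundromat implementation:

    # Illustrative sketch: fetch one dataset dump and classify the kinds of
    # "dirty data" errors listed above. URL and labels are assumptions.
    import gzip
    import requests
    from rdflib import Graph

    def try_clean(url: str) -> str:
        """Return an error category, or 'ok' if the dump parses."""
        try:
            resp = requests.get(url, timeout=30)        # socket / protocol errors
            if resp.status_code in (401, 403):
                return "authentication problem"
            resp.raise_for_status()
            raw = resp.content
            if url.endswith(".gz"):
                raw = gzip.decompress(raw)              # corrupted archives
            text = raw.decode("utf-8")                  # character encoding issues
            Graph().parse(data=text, format="turtle")   # syntax errors
            return "ok"
        except requests.exceptions.RequestException:
            return "socket or protocol error"
        except (gzip.BadGzipFile, EOFError, OSError):
            return "corrupted archive"
        except UnicodeDecodeError:
            return "character encoding issue"
        except Exception:
            return "syntax error"

    print(try_clean("http://example.org/dump.ttl.gz"))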
Evangelization
Existing solutions for cleaning data (standards, guidelines, tools)
are targeted towards human data creators, who can (and do)
choose not to use them.
Goals
- Automate the data preprocessing phase
- Disseminate all LOD in a standards-compliant / machine-processable way, right now.
- Support common use cases: splitting/combining data, streamed processing, etc. (see the sketch below)
[E] Regex, non-RDF tooling, Pig, GNU tools
[Q] Tens of thousands of datasets
[S] Now, i.e., within days not decades
[F] Combine/split data
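
Because the cleaned output is line-based N-Triples (one statement per line), splitting, combining, and streamed processing reduce to ordinary line operations. A minimal sketch under that assumption; file names and chunk size are illustrative:

    # Illustrative sketch: stream and split a cleaned, gzipped N-Triples dump
    # without loading it into memory. File names are assumptions.
    import gzip
    from itertools import islice

    def stream_triples(path):
        """Yield one N-Triples statement (line) at a time."""
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield line

    def split(path, chunk_size, prefix="chunk"):
        """Split a cleaned dump into fixed-size pieces."""
        it = stream_triples(path)
        for i, chunk in enumerate(iter(lambda: list(islice(it, chunk_size)), [])):
            with gzip.open(f"{prefix}-{i}.nt.gz", "wt", encoding="utf-8") as out:
                out.write("\n".join(chunk) + "\n")

    split("clean.nt.gz", chunk_size=1_000_000)

The same line-based format also lets standard GNU tools (sort, split, cat) operate on the data directly.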
Use cases
LOD Observatory
Feedback to dataset publishers
Evaluation
Load balancing
Heuristics
Skip data preparation phase
Thanks to the Semantic Web Science Association (SWSA) for supporting this presentation.
Evangelization after all...
Large-scale, heterogeneous, real-world
Distribute data evenly over a given number of nodes.
Skewness of data (max. in-/outdegree)
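
A minimal sketch of one possible balancing heuristic: greedily assign datasets (largest first, e.g. by triple count) to the currently least-loaded node. The sizes and node count below are illustrative, not evaluation data, and a real heuristic would also need to account for the degree skew mentioned above:

    # Greedy load-balancing sketch: sort datasets by size and always assign
    # the next one to the least-loaded node. Inputs are illustrative.
    import heapq

    def balance(sizes, num_nodes):
        """Return a list of (load, [dataset indices]) per node."""
        nodes = [(0, i, []) for i in range(num_nodes)]   # (load, node id, datasets)
        heapq.heapify(nodes)
        order = sorted(range(len(sizes)), key=lambda d: sizes[d], reverse=True)
        for d in order:
            load, i, assigned = heapq.heappop(nodes)
            assigned.append(d)
            heapq.heappush(nodes, (load + sizes[d], i, assigned))
        return [(load, assigned) for load, _, assigned in sorted(nodes, key=lambda n: n[1])]

    print(balance([120, 5, 80, 80, 3, 40], num_nodes=3))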