LOD-Laundromat – Publishing Other People's Dirty Data




LOD-Laundromat

Publishing Other People's Dirty Data

Wouter Beek, Laurens Rietveld, Hamid Bazoobandi, Jan Wielemaker, Stefan Schlobach
VU University Amsterdam

Dirty data

  • Character encoding issues
  • Socket errors
  • Protocol errors
  • Corrupted archives
  • Authentication problems
  • Syntax errors
Evangelization: existing solutions for cleaning data (standards, guidelines, tools) are aimed at human data creators, who can (and do) choose not to use them.
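
The error categories listed above can be told apart mechanically while fetching and unpacking a dataset. The following is a minimal sketch, not the LOD Laundromat implementation: the function name `classify_failure` is hypothetical, the mapping from Python exception types to categories is an assumption, and `ValueError` merely stands in for whatever exception an RDF parser would raise on a syntax error.

```python
import gzip
import socket
import tarfile
import urllib.error
import zipfile


def classify_failure(exc: BaseException) -> str:
    """Map an exception to one of the dirty-data categories (assumed mapping)."""
    # HTTPError is a subclass of URLError and OSError, so test it first.
    if isinstance(exc, urllib.error.HTTPError):
        if exc.code in (401, 403):
            return "authentication problem"
        return "protocol error"
    # A plain URLError often wraps the underlying exception in .reason.
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, BaseException):
        return classify_failure(exc.reason)
    # UnicodeError is a ValueError subclass, so test it before ValueError.
    if isinstance(exc, UnicodeError):
        return "character encoding issue"
    if isinstance(exc, (zipfile.BadZipFile, tarfile.ReadError,
                        gzip.BadGzipFile, EOFError)):
        return "corrupted archive"
    if isinstance(exc, (ConnectionError, TimeoutError, socket.gaierror)):
        return "socket error"
    # Assumption: the RDF parser in use signals syntax errors as ValueError.
    if isinstance(exc, ValueError):
        return "syntax error"
    return "unknown"
```

A crawler could wrap each download in a `try`/`except` block and store the returned label next to the dataset, so that failures are recorded rather than silently dropped.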

Goals

  • Automate the data preprocessing phase
  • Disseminate all LOD in a standards-compliant / machine-processable way, right now.
  • Support common use cases: splitting/combining data, streamed processing, etc.
[E] Regex, non-RDF tooling, Pig, GNU tools
[Q] Tens of thousands of datasets
[S] Now, i.e., within days, not decades
[F] Combine/split data
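
Splitting and combining become simple stream operations once every statement sits on its own line. The sketch below assumes exactly that (one statement per line, and sorted input for the merge); the slides do not name the serialization, and the function names `split_stream` and `combine_sorted` are hypothetical.

```python
import heapq
from itertools import islice
from typing import Iterable, Iterator, List


def split_stream(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most chunk_size lines, lazily."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def combine_sorted(*streams: Iterable[str]) -> Iterator[str]:
    """Lazily merge already-sorted line streams, dropping duplicate lines."""
    previous = None
    for line in heapq.merge(*streams):
        if line != previous:
            yield line
            previous = line
```

Both functions read their input one line at a time, so they also work on open file handles for data that does not fit in memory.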

LOD Laundromat

Open source: https://github.com/LODLaundry

Metadata

Use cases

  • LOD Observatory
  • Feedback to dataset publishers
  • Evaluation
  • Load balancing
  • Heuristics
  • Skip data preparation phase

Thanks to the Semantic Web Science Association (SWSA) for supporting this presentation.

Evangelization after all...
Large-scale, heterogeneous, real-world
Distribute data evenly over a given number of nodes.
Skewness of data (max. in-/outdegree)
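
The last two notes can be illustrated with a short sketch. It assumes that per-dataset triple counts are available as metadata and that triples arrive as `(subject, predicate, object)` tuples; neither is specified on the slides. The greedy largest-first assignment is one common balancing heuristic, not necessarily the one the authors use, and the names `max_degrees` and `distribute` are hypothetical.

```python
import heapq
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def max_degrees(triples: Iterable[Triple]) -> Tuple[int, int]:
    """Return (max indegree, max outdegree) as a simple skewness indicator."""
    in_deg: Counter = Counter()
    out_deg: Counter = Counter()
    for subject, _predicate, obj in triples:
        out_deg[subject] += 1
        in_deg[obj] += 1
    return max(in_deg.values(), default=0), max(out_deg.values(), default=0)


def distribute(sizes: Dict[str, int], n_nodes: int) -> List[List[str]]:
    """Greedily assign datasets, largest first, to the least-loaded node."""
    heap = [(0, node) for node in range(n_nodes)]  # (current load, node index)
    assignment: List[List[str]] = [[] for _ in range(n_nodes)]
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, node = heapq.heappop(heap)
        assignment[node].append(name)
        heapq.heappush(heap, (load + size, node))
    return assignment
```

For example, `distribute({"a": 5, "b": 4, "c": 3, "d": 2}, 2)` places `a` and `d` on one node and `b` and `c` on the other, giving both a load of 7.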