LOD-Laundromat
Publishing Other People's Dirty Data
Wouter Beek,
Laurens Rietveld,
Hamid Bazoobandi,
Jan Wielemaker,
Stefan Schlobach
VU University Amsterdam
Dirty data
- Character encoding issues
- Socket errors
- Protocol errors
- Corrupted archives
- Authentication problems
- Syntax errors
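
A minimal Python sketch of how a cleaning step might catch and classify these error classes while fetching a single dump. The URL, error labels, and the use of requests/rdflib are illustrative assumptions, not the actual LOD Laundromat implementation:

    # Illustrative sketch: fetch one dataset dump and classify the kinds of
    # "dirty data" errors listed above. URL and labels are assumptions.
    import gzip
    import requests
    from rdflib import Graph

    def try_clean(url: str) -> str:
        """Return an error category, or 'ok' if the dump parses."""
        try:
            resp = requests.get(url, timeout=30)        # socket / protocol errors
            if resp.status_code in (401, 403):
                return "authentication problem"
            resp.raise_for_status()
            raw = resp.content
            if url.endswith(".gz"):
                raw = gzip.decompress(raw)              # corrupted archives
            text = raw.decode("utf-8")                  # character encoding issues
            Graph().parse(data=text, format="turtle")   # syntax errors
            return "ok"
        except requests.exceptions.RequestException:
            return "socket or protocol error"
        except (gzip.BadGzipFile, EOFError, OSError):
            return "corrupted archive"
        except UnicodeDecodeError:
            return "character encoding issue"
        except Exception:
            return "syntax error"

    print(try_clean("http://example.org/dump.ttl.gz"))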
Evangelization
Existing solutions for cleaning data (standards, guidelines, tools)
are targeted towards human data creators, who can (and do)
choose not to use them.
Goals
- Automate the data preprocessing phase
- Disseminate all LOD in a standards-compliant / machine-processable way, right now.
- Support common use cases: splitting/combining data, streamed processing, etc. (see the sketch below)
[E] Regex, non-RDF tooling, Pig, GNU tools
[Q] Tens of thousands of datasets
[S] Now, i.e., within days not decades
[F] Combine/split data
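
Because the cleaned output is line-based N-Triples (one statement per line), splitting, combining, and streamed processing reduce to ordinary line operations. A minimal sketch under that assumption; file names and chunk size are illustrative:

    # Illustrative sketch: stream and split a cleaned, gzipped N-Triples dump
    # without loading it into memory. File names are assumptions.
    import gzip
    from itertools import islice

    def stream_triples(path):
        """Yield one N-Triples statement (line) at a time."""
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield line

    def split(path, chunk_size, prefix="chunk"):
        """Split a cleaned dump into fixed-size pieces."""
        it = stream_triples(path)
        for i, chunk in enumerate(iter(lambda: list(islice(it, chunk_size)), [])):
            with gzip.open(f"{prefix}-{i}.nt.gz", "wt", encoding="utf-8") as out:
                out.write("\n".join(chunk) + "\n")

    split("clean.nt.gz", chunk_size=1_000_000)

The same line-based format also lets standard GNU tools (sort, split, cat) operate on the data directly.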
Use cases
LOD Observatory
Feedback to dataset publishers
Evaluation
Load balancing
Heuristics
Skip data preparation phase
Thanks to the Semantic Web Science Association (SWSA) for supporting this presentation.
Evangelization after all...
Large-scale, heterogeneous, real-world
Distribute data evenly over a given number of nodes.
Skewness of data (max. in-/outdegree)
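
A minimal sketch of one possible balancing heuristic: greedily assign datasets (largest first, e.g. by triple count) to the currently least-loaded node. The sizes and node count below are illustrative, not evaluation data, and a real heuristic would also need to account for the degree skew mentioned above:

    # Greedy load-balancing sketch: sort datasets by size and always assign
    # the next one to the least-loaded node. Inputs are illustrative.
    import heapq

    def balance(sizes, num_nodes):
        """Return a list of (load, [dataset indices]) per node."""
        nodes = [(0, i, []) for i in range(num_nodes)]   # (load, node id, datasets)
        heapq.heapify(nodes)
        order = sorted(range(len(sizes)), key=lambda d: sizes[d], reverse=True)
        for d in order:
            load, i, assigned = heapq.heappop(nodes)
            assigned.append(d)
            heapq.heappush(nodes, (load + sizes[d], i, assigned))
        return [(load, assigned) for load, _, assigned in sorted(nodes, key=lambda n: n[1])]

    print(balance([120, 5, 80, 80, 3, 40], num_nodes=3))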