data-hygiene-pres



On GitHub: spatialcarpentry / data-hygiene-pres

Data Hygiene

Overview

  • cleaning is cool
  • names
  • data hoarding
  • redundancy

Data Janitor

Keeping data clean

Someone has to do it

Hygiene is a habit

  • consistency > intensity
  • keeps things presentable
  • why should data be different?

Cleaning your data

  • locate corrupt data
  • remove or repair corrupted data
  • create a procedure to prevent similar corruption
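
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule for what counts as "corrupt" (missing, NaN, or blank values), the helper names `is_corrupt` and `clean`, and the sample rows are all hypothetical and not part of the original slides.

```python
import math


def is_corrupt(value):
    """Assumed rule: None, NaN, and blank strings count as corrupt."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def clean(rows, repair=None):
    """Locate corrupt rows, then remove them (default) or repair them.

    Returns (cleaned_rows, corrupt_indexes). When `repair` is given,
    corrupt fields are replaced with that value instead of dropping the row.
    """
    cleaned, corrupt = [], []
    for i, row in enumerate(rows):
        bad = [key for key, value in row.items() if is_corrupt(value)]
        if not bad:
            cleaned.append(row)
            continue
        corrupt.append(i)  # step 1: locate
        if repair is not None:  # step 2: repair instead of remove
            cleaned.append({k: (repair if k in bad else v) for k, v in row.items()})
    return cleaned, corrupt


# Hypothetical sample data
rows = [
    {"site": "A", "temp": 21.5},
    {"site": "", "temp": 19.0},
    {"site": "C", "temp": float("nan")},
]
kept, corrupt = clean(rows)                  # remove corrupt rows
repaired, _ = clean(rows, repair="unknown")  # or repair them
```

Keeping the check in one reusable function is one way to cover the third step: the same procedure can be run every time new data arrives.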

Is hygiene a hassle?

  • not if performed regularly
  • will make other work more pleasant
  • think about the future

Bad Names

  • vague
  • redundant
  • misleading
  • hard to read

Comparison

  • bigPaper.doc vs. economicEffectsOfPiracy.doc
  • estFINAL.txt vs. quarterly_estimate.txt

Naming Conventions

  • consistent delimiters
  • keep words short
  • explain what is in the file
  • don't use spaces
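
As a rough sketch of how these conventions could be enforced automatically, the hypothetical helper below rewrites a file name to use a single delimiter and no spaces. The function name `tidy_name`, the choice of lowercase words, and the underscore default are assumptions made for illustration; the slides do not prescribe a specific delimiter.

```python
import re


def tidy_name(filename, delimiter="_"):
    """Return the file name with one consistent delimiter and no spaces."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # no extension present
        stem, ext = filename, ""
    # Split camelCase words: "bigPaper" -> "big Paper"
    stem = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", stem)
    # Treat spaces, underscores, and hyphens as word separators
    words = re.split(r"[\s_\-]+", stem.strip())
    tidy = delimiter.join(word.lower() for word in words if word)
    return f"{tidy}.{ext.lower()}" if ext else tidy


print(tidy_name("Quarterly Estimate.txt"))
print(tidy_name("economicEffectsOfPiracy.doc"))
```

A helper like this only fixes delimiters and spacing. Choosing a name that explains what is in the file still has to be done by a person.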

Data Hoarding

Keep it Simple

What do you really need?

Scope

  • know what you are looking for
  • remove the extra stuff

Redundancy

  • don't want it in your papers
  • don't want it in your conversations
  • don't want it in your data

Redundant data

Name   NameMid  name2   nameFull
Bob    Bobby    Rob     Bob Robertson
James  Danger   007     James Bond
Darth  N/A      Anakin  Darth Vader
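
One way to spot this kind of redundancy is to check whether a column can be recovered from another one. The sketch below assumes the table above has four columns (Name, NameMid, name2, nameFull) and three rows; the helpers `is_derivable` and `drop_column` are hypothetical names introduced for illustration.

```python
# Rows as read from the example table (column layout is an assumption)
rows = [
    {"Name": "Bob", "NameMid": "Bobby", "name2": "Rob", "nameFull": "Bob Robertson"},
    {"Name": "James", "NameMid": "Danger", "name2": "007", "nameFull": "James Bond"},
    {"Name": "Darth", "NameMid": "N/A", "name2": "Anakin", "nameFull": "Darth Vader"},
]


def is_derivable(rows, column, source):
    """True when `column` always equals the first word of `source`."""
    return all(row[source].split()[0] == row[column] for row in rows)


def drop_column(rows, column):
    """Return copies of the rows without the given column."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


# "Name" repeats the first word of "nameFull", so it carries no new information
if is_derivable(rows, "Name", "nameFull"):
    rows = drop_column(rows, "Name")
```

Whether the other name columns are also redundant depends on what they are meant to record, which the table alone does not say.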

Don't repeat yourself

  • repetition is good for security
  • repetition is bad for workflows
