Machine Learning Evaluation with LeVar
A database to make ML evaluation sane
Elias Ponvert
Director of Data Science at People Pattern
What this talk is about
- Machine learning evaluation is a pain, and in my experience we're all terrible at it
- Here's what we should do instead
- Say hello to LeVar, a database designed to help us do what we should do
- Some more about LeVar
- Demo!
- A bit about how it's implemented, if you're interested
- Open issues and next steps
Hello to anybody from the Austin Data Meetup!
Have I seen this talk before?
Yes, you have
Have you gotten 0.2 out yet?
No, I have not
Why not?
New baby. See the appendix.
Machine learning evaluation is a pain and we're all terrible at it
Quick show of hands: does anybody disagree with this statement?
Evaluation is important
- Evaluate early, evaluate often
- Keep a lab notebook
- Evaluate every change
- Do error analysis
- Change your model based on evidence
But here's what happens
- No standard source, storage or format for eval
- End up rewriting our evaluation scripts for each project
- Couple our evaluation code with our core ML code
- Error analysis is a pain
- Hard to track results on the same evaluation over time
- Harder to do comparative error analysis over time
We can do better
- Keep evaluation datasets in a centralized location
- Evaluations are largely immutable
- Not tied to any one ML framework (e.g. scikit-learn)
- ...or big data framework (e.g. Spark RDDs)
- Totally agnostic with respect to method
- Human-readable datasets
- Use a simple and common format for data exchange
- Open schema for data points, i.e. arbitrary features (see the sketch below)
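A rough sketch of what an open-schema datum plus a plain TSV exchange could look like, in Scala. The column layout here (feature columns first, gold label last) is my illustration, not LeVar's actual format:

```scala
// Sketch of an open-schema classification datum exchanged as TSV.
// Column layout (feature columns first, gold label last) is an
// assumption for illustration, not LeVar's actual format.
case class Datum(features: Map[String, String], gold: String)

object TsvExchange {
  // header row names the feature columns; the last column is the gold label
  def parse(tsv: String): Seq[Datum] = {
    val lines  = tsv.trim.split("\n").toSeq
    val header = lines.head.split("\t").toSeq
    lines.tail.map { row =>
      val cells = row.split("\t").toSeq
      Datum(header.init.zip(cells.init).toMap, cells.last)
    }
  }

  def main(args: Array[String]): Unit = {
    val tsv = Seq(
      "text\tsource\tgold",
      "this movie was great\timdb\tpositive",
      "total waste of time\timdb\tnegative"
    ).mkString("\n")
    parse(tsv).foreach(println)
  }
}
```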
We can do better
- Command-line tool for data import & power users
- Several standard problem schemas supported:
  - Classification
  - Regression
  - Geo-prediction
  - Structure prediction
  - Machine translation
- Several standard, useful evaluation criteria supported (see the sketch below):
  - Accuracy
  - Precision/recall/F-score
  - ROC
  - RMSE...
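For reference, here are generic Scala sketches of some of those criteria, using the standard textbook definitions rather than LeVar's own implementation:

```scala
// Generic sketches of the evaluation criteria named above; standard
// definitions, not LeVar's code.
object Metrics {
  def accuracy[A](gold: Seq[A], pred: Seq[A]): Double =
    gold.zip(pred).count { case (g, p) => g == p }.toDouble / gold.size

  // precision/recall/F1 for one target class of a classification problem
  def prf1[A](gold: Seq[A], pred: Seq[A], target: A): (Double, Double, Double) = {
    val tp = gold.zip(pred).count { case (g, p) => g == target && p == target }
    val fp = gold.zip(pred).count { case (g, p) => g != target && p == target }
    val fn = gold.zip(pred).count { case (g, p) => g == target && p != target }
    val precision = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
    val recall    = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
    val f1 = if (precision + recall == 0) 0.0
             else 2 * precision * recall / (precision + recall)
    (precision, recall, f1)
  }

  // root mean squared error for regression datasets
  def rmse(gold: Seq[Double], pred: Seq[Double]): Double =
    math.sqrt(gold.zip(pred).map { case (g, p) => math.pow(g - p, 2) }.sum / gold.size)
}
```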
We can do better
- Web UI showing high-level experiment results, suitable for bosses or clients
- Web UI & CLI search for error analysis
- User can comment on anything (see the sketch below):
  - Dataset
  - Experiment
  - Item in dataset
  - Individual prediction in experiment
  - Another comment
- User can label anything
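One way a "comment on anything" model could hang together, keying off the UUID that every record carries. The names here are hypothetical, not LeVar's schema:

```scala
import java.util.UUID
import java.time.Instant

// Illustrative sketch of a "comment on anything" model: a comment points at
// any record's UUID, so it can target datasets, experiments, items,
// predictions, or other comments. Names are hypothetical, not LeVar's schema.
case class Comment(
  id: UUID,        // comments have UUIDs too, so they can be replied to
  subject: UUID,   // UUID of whatever is being commented on
  author: String,
  body: String,
  createdAt: Instant
)

case class Label(subject: UUID, name: String, author: String)
```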
We can do better
- Sensible information architecture
- Organize datasets into groups or organizations
- Provide sensible baseline evaluations out of the box (see the sketch below):
  - Most-common class
  - Mean value
  - (Weighted) random
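Sketches of those baselines in Scala, using the usual definitions (not LeVar's code):

```scala
import scala.util.Random

// Sketches of the out-of-the-box baselines listed above.
object Baselines {
  // classification: always predict the most common gold class
  def mostCommonClass[A](gold: Seq[A]): A =
    gold.groupBy(identity).maxBy(_._2.size)._1

  // regression: always predict the mean gold value
  def meanValue(gold: Seq[Double]): Double =
    gold.sum / gold.size

  // classification: sample a class with probability proportional to its frequency
  def weightedRandom[A](gold: Seq[A], rng: Random = new Random): A =
    gold(rng.nextInt(gold.size))
}
```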
We can do better
- REST API to use with your favorite framework (see the sketch below)
- Straightforward client libraries
- Export and import to other formats (yo RDD)
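For example, hitting a LeVar-style REST endpoint with basic auth from Scala takes only the JDK's HTTP client. The host and path below are placeholders, not LeVar's documented routes:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.util.Base64

// Sketch of calling a LeVar-style REST API with basic auth. The host and
// endpoint path are hypothetical placeholders, not LeVar's actual routes.
object ApiClientSketch {
  def main(args: Array[String]): Unit = {
    val credentials =
      Base64.getEncoder.encodeToString("username:password".getBytes("UTF-8"))
    val request = HttpRequest.newBuilder(URI.create("https://levar.example.com/api/datasets"))
      .header("Authorization", s"Basic $credentials")
      .GET()
      .build()
    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body()) // e.g. a JSON listing of datasets
  }
}
```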
Introducing LeVar
It does several of those things!
But don't freak out, it's just v0.1
Here's what's done now
- CLI (in Scala)
- Import/export TSV
- Organizations data model
- API for everything, using basic auth
- Client code for Scala
- Classification & regression datasets
- Many useful evaluations
LeVar data model
- Datasets
- Items (datum)
- Experiments
- Predictions
Everything has a creation date; everything has a UUID
The data model should support most of those other features, so I can already do, e.g., error analysis by dropping into psql and writing a query (see the sketch below).
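For instance, a confusion-style query over an experiment's incorrect predictions might look like the sketch below, run here over JDBC. The table and column names are hypothetical stand-ins for the actual schema; the point is that predictions join back to items by UUID:

```scala
import java.sql.DriverManager

// Sketch of an error-analysis query against the database (needs the
// PostgreSQL JDBC driver on the classpath). Table and column names are
// hypothetical, not the actual schema.
object ErrorAnalysisSketch {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:postgresql://localhost/levar", "levar", "secret")
    val sql =
      """SELECT i.value AS gold, p.value AS predicted, count(*) AS n
        |FROM predictions p JOIN items i ON p.item_id = i.id
        |WHERE p.experiment_id = ? AND p.value <> i.value
        |GROUP BY i.value, p.value
        |ORDER BY n DESC""".stripMargin
    val stmt = conn.prepareStatement(sql)
    stmt.setObject(1, java.util.UUID.fromString(args(0))) // experiment UUID
    val rs = stmt.executeQuery()
    while (rs.next())
      println(s"${rs.getString("gold")} -> ${rs.getString("predicted")}: ${rs.getInt("n")}")
    conn.close()
  }
}
```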
This talk is really only designed to last 30 minutes
We probably have some time left over. Want to look at some baby pics?
Machine Learning Evaluation with LeVar
A database to make ML evaluation sane
Elias Ponvert
Director of Data Science at People Pattern