Robust Bayesian Modeling
Yau Group meeting
February 2, 2016
What is a Bayesian Model?
- A joint distribution of parameters β and data x.
- We usually think of exchangeable models: p(β, x) = p(β|α) ∏_{i=1}^n p(x_i|β), where p(β|α) is the prior and ∏_{i=1}^n p(x_i|β) is the likelihood.
- Generalise this to accommodate most common models:
- conditional models: p(x_i|β) → p(x_i|y_i, β), e.g. N(x_i | wᵀy_i, σ)
- or latent variable models: p(x_i|β) = ∑_{z_i} p(x_i, z_i|β), e.g. ∑_k π_k N(x_i|μ_k, σ_k)
- data is assumed to be drawn from the parameter
- parameter is drawn from the prior.
- we condition on data to calculate the posterior distribution of parameters
- when the data set is small, the prior plays a big role
- for large data the posterior converges to a point mass (whether or not the model is true)
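The two points above can be checked in a conjugate example (not from the slides; a minimal sketch with an illustrative Beta-Bernoulli model and made-up prior strength):

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta(a0, b0) prior on the coin bias beta; Bernoulli likelihood.
a0, b0 = 10.0, 10.0   # a fairly strong prior centred on 0.5 (illustrative values)
true_beta = 0.9       # the data-generating bias

def posterior(n):
    """Draw n coin flips and return posterior mean and variance of beta."""
    x = rng.binomial(1, true_beta, size=n)
    a, b = a0 + x.sum(), b0 + n - x.sum()   # conjugate Beta update
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

small_mean, small_var = posterior(5)     # prior dominates: mean stays near 0.5
large_mean, large_var = posterior(5000)  # posterior concentrates near 0.9
```

With n = 5 the posterior mean is pulled towards the prior centre; with n = 5000 it is a near point mass at the true bias.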
Interlude: What is the difference between parameters and latent variables?
- Joint distribution: p(β, z, x) = p(β|α) ∏_{i=1}^n p(z_i|γ) p(x_i|β, z_i), where β are the parameters and the z_i are the latent variables.
- Practical notion:
- Number of latent variables grows with number of data points.
- Number of parameters stays fixed.
- There is no real difference
A useful distinction: Local vs global variables
p(β, z, x) = p(β|α) ∏_{i=1}^n p(z_i|γ) p(x_i|β, z_i), where β is global and the z_i are local.
- The distinction is determined by conditional independence: p(x_i, z_i | x_{−i}, z_{−i}, β) = p(x_i, z_i | β)
The i-th observation and the i-th local variable are, given the global variables, conditionally independent of all other local variables and observations.
Example: Gaussian mixture model
- Which are the local and which are the global variables?
p(x|z) = ∏_k N(x|μ_k, σ_k)^{z_k};  p(z) = ∏_k π_k^{z_k}
- global: means μ_k, standard deviations σ_k, mixture proportions π_k;
- local: cluster assignments z_i.
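The global/local split is visible in the generative process (a minimal sketch; the component parameters are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# global variables: shared by all data points
pi = np.array([0.5, 0.3, 0.2])   # mixture proportions
mu = np.array([-4.0, 0.0, 4.0])  # component means
sigma = np.array([0.5, 1.0, 0.5])

n = 1000
# local variables: one cluster assignment z_i per observation
z = rng.choice(3, size=n, p=pi)
# x_i | z_i ~ N(mu_{z_i}, sigma_{z_i})
x = rng.normal(mu[z], sigma[z])
```

Note that pi, mu, sigma have fixed size while z grows with n, matching the "practical notion" above.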
Motivation
- Wang and Blei, 2015 – A General Method for Robust Bayesian Modeling
… all models are wrong …
Robustness: Inference should be insensitive to small deviations from the model assumptions.
- Wang and Blei introduce a general framework for robust Bayesian modeling.
- the quote is by George Box
- the most generic approach is to use distributions with heavier tails
- until now, robust models have been built on a case-by-case basis
- the authors' aim is to introduce a general framework
Key idea: Localisation of global parameters
- Classical model: p(β, x) = p(β|α) ∏_{i=1}^n p(x_i|β)
- All data points are drawn from the same parameter.
- The hyperparameter α is usually fixed.
- Robust model: p(β, x) = ∏_{i=1}^n p(β_i|α) p(x_i|β_i)
- Every data point is drawn from an individual realisation of the parameter, which is itself drawn from the prior.
- Outliers are explained by variation in the parameters.
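The two generative processes can be contrasted directly (a minimal sketch with a normal prior and likelihood; all values are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# classical model: one global parameter beta shared by all observations
beta = rng.normal(0.0, 1.0)               # beta ~ p(beta | alpha)
x_classical = rng.normal(beta, 1.0, n)    # x_i ~ p(x_i | beta)

# robust (localised) model: a fresh beta_i for every observation
beta_i = rng.normal(0.0, 1.0, n)          # beta_i ~ p(beta_i | alpha)
x_robust = rng.normal(beta_i, 1.0)        # x_i ~ p(x_i | beta_i)
```

Marginally, the localised data have larger spread (variance 1 + 1 = 2 here), because per-point parameter variation is added on top of the observation noise; this extra variation is what absorbs outliers.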
Graphical Model for Localisation
- We now need to fit the hyperparameter α.
- Fixing α would make the data points independent.
Example: Normal observation model
- Localise the variance parameter and use the conjugate inverse-gamma prior:
p(x_i|α) = ∫ p(x_i|β_i) p(β_i|α) dβ_i = ∫ N(x_i|μ, σ_i) Inv-Gam(σ_i|α) dσ_i
p(x_i|α) = Student-t(x_i | μ, (λ, ν) = f(α))
2nd key idea: Empirical Bayes
- Estimate hyperparameters via maximum marginal likelihood: α̂ = argmax_α ∑_{i=1}^n log ∫ p(x_i|β_i) p(β_i|α) dβ_i
- aka the evidence approximation: evidence = p(x_i|α) = ∫ p(x_i|β_i) p(β_i|α) dβ_i
- Here we use the data to determine the prior; is that legitimate?
- full Bayesian inference: "Bayes empirical Bayes"
- needs a hyperprior
- the evidence is the probability of the data after integrating out the parameters, aka the marginal likelihood.
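Empirical Bayes in the simplest localised model (a minimal sketch, not from the paper: β_i ~ N(0, α) with unit observation noise, so the evidence is N(x_i | 0, 1 + α) and the marginal-likelihood maximiser has a closed form):

```python
import numpy as np

rng = np.random.default_rng(4)
n, alpha_true = 50_000, 4.0   # alpha here is the (illustrative) prior variance

# generate data from the localised model
beta_i = rng.normal(0.0, np.sqrt(alpha_true), n)  # beta_i ~ N(0, alpha)
x = rng.normal(beta_i, 1.0)                       # x_i | beta_i ~ N(beta_i, 1)

# evidence after integrating out beta_i: x_i | alpha ~ N(0, 1 + alpha),
# so maximising sum_i log p(x_i | alpha) gives alpha_hat = mean(x^2) - 1
alpha_hat = max(0.0, np.mean(x**2) - 1.0)
```

The estimate recovers the data-generating hyperparameter, illustrating how the data "determine the prior".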
Linear Regression
- Training data: y_i|x_i ∼ N(ωᵀx_i + b, σ_i + 0.02), σ_i ∼ Gamma(k, 1)
- Test data: y_i|x_i ∼ N(ωᵀx_i + b, 0.02)
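A sketch of this experimental setup, reading σ_i + 0.02 as the noise scale (the dimension d, sample size, k = 1, and the weights are illustrative assumptions, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, k = 500, 3, 1.0
omega = rng.normal(0.0, 1.0, d)  # illustrative regression weights
b = 0.5

def make_data(n, contaminated):
    """Generate (X, y); training data get heavy-tailed per-point noise."""
    X = rng.normal(0.0, 1.0, (n, d))
    if contaminated:  # training: per-point noise scale sigma_i ~ Gamma(k, 1)
        noise_scale = rng.gamma(k, 1.0, n) + 0.02
    else:             # test: clean, small fixed noise
        noise_scale = np.full(n, 0.02)
    y = rng.normal(X @ omega + b, noise_scale)
    return X, y

X_train, y_train = make_data(n, contaminated=True)
X_test, y_test = make_data(n, contaminated=False)
```

The training residuals are far more dispersed than the test residuals, which is exactly the train/test mismatch the robust model is meant to handle.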
Logistic Regression
yi|xi∼Bernoulli(σ(ωTxi))
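The corresponding generative process, as a minimal sketch (weights and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 1000, 2
omega = np.array([1.5, -2.0])  # illustrative weights

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

X = rng.normal(0.0, 1.0, (n, d))
# y_i ~ Bernoulli(sigma(omega^T x_i))
y = rng.binomial(1, sigmoid(X @ omega))
```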
The posterior predictive
- Classical Bayesian model: p(x_i|x, α) = ∫ p(x_i|β) p(β|x, α) dβ
- Gives the correct predictive distribution only if the data come from the model.
- Robust Bayesian model: p(x_i|α̂) = ∫ p(x_i|β_i) p(β_i|α̂) dβ_i
- Gives the correct predictive distribution independent of model mismatch.
- If we want to make predictions under the model, which one should we choose?
References
- Wang and Blei, 2015. "A General Method for Robust Bayesian Modeling"
- Gelman et al., 2014. "Bayesian Data Analysis", 3rd edition
- Murphy, 2012. "Machine Learning: A Probabilistic Perspective"
- Carlin and Louis, 2000. "Empirical Bayes: Past, Present and Future"
Tammo Rukat