What are neural networks (CNNs, RNNs, SAEs, DBNs)?
How to distinguish deep and shallow models?
Are the following deep or shallow models: SVMs, three-layer CNNs, 20-layer CNNs?
5 convolutional + 3 fully connected layers.
Global average pooling[2] before fully connected layers.
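A minimal NumPy sketch of global average pooling before the fully connected layers; the `(N, C, H, W)` layout is an assumption, not from the slides:

```python
import numpy as np

def global_average_pooling(features):
    """Average each channel over its spatial dimensions.

    features: array of shape (N, C, H, W) -- batch, channels, height, width.
    returns:  array of shape (N, C), one value per channel.
    """
    return features.mean(axis=(2, 3))

# Toy usage: a batch of 2 feature maps with 512 channels of size 7x7.
x = np.random.randn(2, 512, 7, 7)
pooled = global_average_pooling(x)
print(pooled.shape)  # (2, 512)
```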
Multiple (auxiliary) loss layers to strengthen gradient back-propagation and encourage discrimination in the lower layers.
Inception modules.
Fit a residual mapping $\mathcal{F}(\mathbf{x}):=\mathcal{H}(\mathbf{x})-\mathbf{x}$, rather than fitting the original mapping $\mathcal{H}(\mathbf{x})$
Identity mapping via shortcut connections.
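A toy sketch of the residual idea, using a small two-layer fully connected transform as the residual $\mathcal{F}$ (a stand-in for the convolutional blocks used in practice); weights and dimensions are made up:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Compute H(x) = F(x) + x, where F is a small two-layer transform.

    The shortcut is the identity, so only the residual F(x) = W2 ReLU(W1 x)
    has to be learned.
    """
    residual = W2 @ relu(W1 @ x)   # F(x)
    return residual + x            # H(x) = F(x) + x

d = 64
x = np.random.randn(d)
W1 = np.random.randn(d, d) * 0.01
W2 = np.random.randn(d, d) * 0.01
y = residual_block(x, W1, W2)
print(y.shape)  # (64,)
```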
Given an image $I$, predict the pixel labels $X=\{x_0,\ldots,x_n\}$.
The CNN models the label distribution as $Q(X|\theta,I)=\prod_i q_i(x_i|\theta,I)$,
where $q_i(x_i|\theta,I)=\frac{1}{Z_i}\exp(f_i(x_i;\theta,I))$ and $Z_i=\sum_{x_i}\exp(f_i(x_i;\theta,I))$ is the per-pixel partition function.
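A minimal NumPy sketch of the per-pixel distribution $q_i$, computed as a softmax over the network scores $f_i(x_i;\theta,I)$; the `(n_pixels, n_labels)` layout is an assumption:

```python
import numpy as np

def pixel_softmax(scores):
    """Turn per-pixel scores f_i(x_i) into distributions q_i(x_i) = exp(f_i)/Z_i.

    scores: array of shape (n_pixels, n_labels) with f_i(x_i; theta, I).
    returns: array of the same shape; each row sums to 1.
    """
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Toy usage: 4 pixels, 3 labels.
f = np.random.randn(4, 3)
q = pixel_softmax(f)
print(q.sum(axis=1))  # each ~1.0
```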
For fully supervised training:
$\arg \max\limits_{\theta} \sum_{I} Q(X|\theta, I)$
For weakly supervised training:
find $\theta$ subject to $A_I Q_I \geq b_I \quad \forall I$
However, it is hard to optimize this constrained problem directly. Introduce a latent probability distribution $P(X)$:
$\min \limits_{\theta,P} D(P(X)\|Q(X|\theta))$ subject to $A P \geq b, \quad \sum \limits_{X} P(X)=1$,
where the KL divergence $D(p(x)\|q(x))=\sum_x p(x)\log \frac{p(x)}{q(x)}$ measures the distance between the two distributions.
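A small sketch of the KL divergence $D(P\|Q)$ between two discrete distributions; the `eps` smoothing is an implementation convenience, not part of the formulation:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, q))  # > 0; equals 0 only when p == q
print(kl_divergence(p, p))  # ~0
```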
Suppression constraint: suppress any label $l$ that does not appear in the image:
$\sum \limits_{i=1}^{n} p_i(l) \leq 0 \quad \forall l \notin \mathcal{L}_I$
Foreground constraint: encourage labels that appear in the image to occupy foreground pixels:
$\sum \limits_{i=1}^{n} p_i(l) \geq a_l \quad \forall l \in \mathcal{L}_I$
Compare with the multiple instance learning (MIL) paradigm.
Background constraint: constrain the size of the background region:
$a_0 \leq \sum \limits_{i=1}^{n} p_i(0) \leq b_0.$
Size constraint: put an upper bound on classes that are guaranteed to be small:
$\sum \limits_{i=1}^{n} p_i(l) \leq b_l.$
All four families of constraints are linear in $P$ and can be stacked into $A P \geq b$; a small check is sketched below.
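A rough NumPy check of these label-count constraints on a per-pixel distribution $P$; the bound values, the helper name, and the tolerance are illustrative, and the method enforces the constraints through the constrained optimization above rather than by checking them:

```python
import numpy as np

def check_constraints(p, image_labels, a, b0_lo, b0_hi, b_small):
    """Check the linear label-count constraints on a per-pixel distribution.

    p:            (n_pixels, n_labels) array, p[i, l] = p_i(l); label 0 is background.
    image_labels: set of foreground labels present in the image (L_I).
    a:            dict label -> lower bound a_l for present foreground labels.
    b0_lo, b0_hi: background bounds a_0 <= sum_i p_i(0) <= b_0.
    b_small:      dict label -> upper bound b_l for classes known to be small.
    """
    counts = p.sum(axis=0)                      # expected pixel count per label
    n_labels = p.shape[1]
    ok = True
    for l in range(1, n_labels):
        if l not in image_labels:
            ok &= counts[l] <= 1e-6             # suppression: absent labels get ~0 mass
        else:
            ok &= counts[l] >= a[l]             # foreground: present labels get enough mass
            if l in b_small:
                ok &= counts[l] <= b_small[l]   # size: small classes stay small
    ok &= b0_lo <= counts[0] <= b0_hi           # background bounds
    return bool(ok)

# Toy usage: 100 pixels, 3 labels (0 = background, 1 and 2 = foreground classes).
p = np.tile([0.6, 0.4, 0.0], (100, 1))
print(check_constraints(p, image_labels={1}, a={1: 10.0},
                        b0_lo=20.0, b0_hi=80.0, b_small={2: 10.0}))
```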
VGG + fully connected CRF. Constrained optimization is performed on the coarse maps generated by VGG.
Formulate training as a hard-EM approximation, with complete-data log-likelihood:
$Q(\theta;\theta^{'})=\sum \limits_{Y} P(Y|X,z;\theta^{'}) \log P(Y|X;\theta) \approx \log P(\hat{Y}|X;\theta)$
E-step: update the latent segmentation
$\hat{Y}=\arg \max \limits_{Y} P(Y|X;\theta^{'})P(z|Y)$
M-step: maximize $Q(\theta;\theta^{'})$ using stochastic gradient descent.
In summary:
Refine the latent segmentation $\hat{Y}$ using the current network output and the image-label constraint, then train the network with the refined segmentation $\hat{Y}$ as ground truth (see the sketch below).
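A toy NumPy sketch of this alternation, with a per-pixel linear classifier standing in for the segmentation network and a simple masked argmax standing in for the E-step $\hat{Y}=\arg \max_{Y} P(Y|X;\theta^{'})P(z|Y)$; all shapes and hyperparameters are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(scores):
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

# Toy setup: n pixels with d-dimensional features, L labels, image-level labels z.
n, d, L = 200, 16, 4
features = rng.normal(size=(n, d))
z = {0, 2}                                    # labels present in this image (0 = background)
W = rng.normal(size=(d, L)) * 0.01            # toy "network": per-pixel linear classifier

for it in range(20):
    scores = features @ W                     # f_i(x_i; theta)
    # E-step: refine the latent segmentation -- pick the best label per pixel
    # among the labels allowed by the image-level annotation z.
    masked = scores.copy()
    for l in range(L):
        if l not in z:
            masked[:, l] = -np.inf            # suppress absent labels
    y_hat = masked.argmax(axis=1)             # refined segmentation \hat{Y}

    # M-step: one SGD step on cross-entropy with \hat{Y} as ground truth.
    q = softmax(scores)
    grad = q.copy()
    grad[np.arange(n), y_hat] -= 1.0          # d(-log q_i(y_hat_i)) / d scores
    W -= 0.1 * (features.T @ grad) / n

loss = -np.log(softmax(features @ W)[np.arange(n), y_hat] + 1e-12).mean()
print("final cross-entropy on refined labels:", round(loss, 3))
```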
For a mixture model with latent component $\alpha \in \{1,\ldots,K\}$:
$\arg \max \limits_{\theta} \sum \limits_{k=1}^{K} P(\alpha=k|X,\theta) \log P(\alpha) P(Y|X,\alpha,\theta)$
$\mathcal{Q}(\theta,\theta^{old}) = \sum \limits_{k=1}^{K} P(\alpha=k|X,\theta^{old}) \log P(\alpha) P(Y|X,\alpha,\theta)$
E-step: compute $P(\alpha=k|X,\theta^{old})$ given $\theta^{old}$
M-step: maximize $\mathcal{Q}(\theta, \theta^{old})$ with respect to $\theta$
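A minimal sketch of the E-step as Bayes' rule: the responsibility of component $k$ is proportional to $P(\alpha=k)\,P(Y|X,\alpha=k,\theta^{old})$; the prior and likelihood values below are made up:

```python
import numpy as np

def responsibilities(prior, likelihood):
    """E-step: P(alpha=k | X, theta_old) is proportional to
    P(alpha=k) * P(Y | X, alpha=k, theta_old), normalized over k."""
    joint = np.asarray(prior) * np.asarray(likelihood)
    return joint / joint.sum()

prior = np.array([0.5, 0.3, 0.2])            # P(alpha)
likelihood = np.array([0.01, 0.20, 0.05])    # P(Y | X, alpha, theta_old), per component
print(responsibilities(prior, likelihood))   # sums to 1; the second component dominates
```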
For dictionary learning:
$P(Y|X,\alpha,\theta) =\frac{1}{Z} \exp(-\|Y-H_{\alpha} \beta_{\alpha}\|)$, $P(X|\alpha,\theta) = \frac{1}{F} \exp(-\|X-L_{\alpha} \beta_{\alpha}\|)$, with $\theta = \{H_{\alpha},L_{\alpha}\}$.
M-step (dictionary update):
$H_{\alpha} = \arg \max \limits_{H} \sum_{X} P(\alpha|X,\theta^{old}) \log P(Y|X,\alpha,\theta) = \arg \min \limits_{H} \sum_{X} P(\alpha|X,\theta^{old})\|Y-H\beta_{\alpha}\|$
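A minimal NumPy sketch of this update, assuming squared norms so the weighted least-squares solution is closed form; the shapes, variable names, and synthetic data are illustrative:

```python
import numpy as np

def update_dictionary(Y, B, w):
    """Weighted least-squares update for H_alpha.

    Minimizes sum_n w_n * ||Y_n - H B_n||^2 over H, where
      Y: (N, dy) targets, one row per sample X,
      B: (N, k)  codes beta_alpha for each sample,
      w: (N,)    responsibilities P(alpha | X, theta_old).
    Closed form: H = (sum w_n Y_n B_n^T) (sum w_n B_n B_n^T)^(-1).
    """
    WB = B * w[:, None]
    A = Y.T @ WB                     # (dy, k)
    G = B.T @ WB                     # (k, k)
    return A @ np.linalg.pinv(G)     # (dy, k)

# Toy usage: recover a known dictionary from weighted samples.
rng = np.random.default_rng(0)
N, dy, k = 50, 8, 5
B = rng.normal(size=(N, k))
H_true = rng.normal(size=(dy, k))
Y = B @ H_true.T + 0.01 * rng.normal(size=(N, dy))
w = rng.uniform(0.1, 1.0, size=N)
H = update_dictionary(Y, B, w)
print(np.abs(H - H_true).max())      # small: the update recovers H_alpha
```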
Inference:
$\hat{Y} = \arg \max \limits_{\tilde{Y}} \sum \limits_{\alpha=1}^{K} P(\alpha|X,\theta) \log P(\tilde{Y}|X, \alpha,\theta) = \arg \min \limits_{\tilde{Y}} \sum \limits_{\alpha=1}^{K} P(\alpha|X,\theta) \|\tilde{Y}-H_{\alpha} \beta_{\alpha}\| = \sum_{\alpha}P(\alpha|X,\theta) H_{\alpha} \beta_{\alpha}$
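A small sketch of this responsibility-weighted reconstruction; the dictionaries and codes below are random placeholders:

```python
import numpy as np

def mixture_reconstruction(resp, H_list, beta_list):
    """Inference: \hat{Y} = sum_alpha P(alpha | X, theta) * H_alpha beta_alpha."""
    return sum(r * (H @ b) for r, H, b in zip(resp, H_list, beta_list))

# Toy usage with K = 2 dictionaries.
rng = np.random.default_rng(1)
resp = np.array([0.3, 0.7])                         # P(alpha | X, theta)
H_list = [rng.normal(size=(8, 5)) for _ in range(2)]
beta_list = [rng.normal(size=5) for _ in range(2)]
print(mixture_reconstruction(resp, H_list, beta_list).shape)  # (8,)
```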
For network training:
$P(Y|X,\alpha,\theta) = \frac{1}{Z} \exp (-\|Y-G_{\alpha}(X;\theta_{\alpha})\|), \quad \theta=\{\theta_{\alpha}\}$
M-step (per-expert network update):
$\max \limits_{\theta_{\alpha}} \sum \limits_{X} P(\alpha|X, \theta) \log P(Y|X, \alpha, \theta) = \min \limits_{\theta_{\alpha}} \sum \limits_{X} P(\alpha|X, \theta) \|Y-G_{\alpha}(X;\theta_{\alpha})\|$
and inference:
$\hat{Y}=\arg \max \limits_{\tilde{Y}} E_{\alpha \sim P(\alpha|X,\theta)}[P(\alpha, \tilde{Y}, X| \theta)] =\arg \max \limits_{\tilde{Y}} \sum \limits_{\alpha} P(\alpha|X, \theta) \log P(\tilde{Y}|X,\alpha,\theta) =\arg \min \limits_{\tilde{Y}} \sum \limits_{\alpha} P(\alpha|X, \theta) \|\tilde{Y} - G_{\alpha}(X;\theta_{\alpha})\| =\sum \limits_{\alpha} P(\alpha|X,\theta) G_{\alpha}(X;\theta_{\alpha})$
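A toy sketch of mixture-of-networks inference, with small two-layer networks standing in for $G_{\alpha}$ and the responsibilities assumed to be given (in the model they come from $P(\alpha|X,\theta)$); all shapes and weights are made up:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def expert(X, W1, W2):
    """A toy stand-in for G_alpha(X; theta_alpha): a two-layer network."""
    return relu(X @ W1) @ W2

def mixture_inference(X, experts, resp):
    """\hat{Y} = sum_alpha P(alpha | X, theta) * G_alpha(X; theta_alpha)."""
    outputs = [expert(X, W1, W2) for (W1, W2) in experts]
    return sum(r * out for r, out in zip(resp, outputs))

# Toy usage: K = 3 experts mapping 16-d inputs to 4-d outputs.
rng = np.random.default_rng(2)
K, d_in, d_hid, d_out = 3, 16, 32, 4
experts = [(rng.normal(size=(d_in, d_hid)) * 0.1,
            rng.normal(size=(d_hid, d_out)) * 0.1) for _ in range(K)]
resp = np.array([0.2, 0.5, 0.3])     # P(alpha | X, theta), e.g. from the E-step
X = rng.normal(size=(10, d_in))      # a batch of 10 inputs
Y_hat = mixture_inference(X, experts, resp)
print(Y_hat.shape)                   # (10, 4)
```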