import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(42)

# Toy model: z ~ N(0, 1), x | z ~ N(z, sigma_l^2)
x_obs = 1.8
sigma_l = 1.2

# Exact posterior parameters (conjugate Gaussian case)
mu_post = x_obs / (sigma_l**2 + 1)
sig2_post = sigma_l**2 / (sigma_l**2 + 1)

def elbo_gaussian(mu_q, log_sq):
    """Closed-form ELBO for q(z) = N(mu_q, exp(log_sq)^2)."""
    sq = np.exp(log_sq)
    sq2 = sq**2
    # Reconstruction term: E_q[log p(x | z)]
    recon = -0.5 * np.log(2 * np.pi * sigma_l**2) \
            - 0.5 * (sq2 + (mu_q - x_obs)**2) / sigma_l**2
    # Regularisation term: KL(q || N(0, 1))
    kl = 0.5 * (sq2 + mu_q**2 - 1 - 2 * log_sq)
    return recon - kl

# Gradient ascent on the ELBO
mu_q, log_sq = 0.0, 0.0
lr = 0.08
history = []
for _ in range(120):
    sq = np.exp(log_sq); sq2 = sq**2
    # Closed-form gradients of the ELBO w.r.t. mu_q and log sigma_q
    grad_mu = (x_obs - mu_q) / sigma_l**2 - mu_q
    grad_lsq = -sq2 / sigma_l**2 + 1 - sq2
    mu_q += lr * grad_mu
    log_sq += lr * grad_lsq
    history.append((mu_q, np.exp(log_sq), elbo_gaussian(mu_q, log_sq)))
history = np.array(history)

Variational Inference & the ELBO
Let’s say we’ve observed the data \mathbf{x} and we want to know the conditional probability of the latent variables \mathbf{z}. The conditional distribution p(\mathbf{z} \mid \mathbf{x}) can then be formulated with Bayes’ rule as
p(\mathbf{z} \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathbf{z})p(\mathbf{z})}{p(\mathbf{x})}
Now the problem is that the denominator, which acts as a normalising constant,
p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) p(\mathbf{z}) d \mathbf{z}
is often intractable. When something is intractable, it means that it is theoretically solvable but cannot be solved in a practical amount of time or with practical computational resources.
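To make “intractable” concrete, here is a quick sketch in the one-dimensional toy model used later in this post (z \sim \mathcal{N}(0,1), x \mid z \sim \mathcal{N}(z, \sigma_\ell^2) with x = 1.8, \sigma_\ell = 1.2). In one dimension the integral is easy, and even naive Monte Carlo over the prior works; in high dimensions the variance of exactly this estimator blows up, which is the practical face of intractability. The variable names are my own:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x_obs, sigma_l = 1.8, 1.2  # toy model used later in this post

# Naive Monte Carlo estimate of p(x) = E_{p(z)}[p(x | z)]:
# sample z from the prior and average the likelihood.
z = rng.standard_normal(200_000)
p_x_mc = stats.norm.pdf(x_obs, loc=z, scale=sigma_l).mean()

# In this conjugate 1-D model the evidence is also available exactly:
# x = z + noise, so x ~ N(0, 1 + sigma_l^2).
p_x_exact = stats.norm.pdf(x_obs, 0.0, np.sqrt(1 + sigma_l**2))
print(p_x_mc, p_x_exact)  # both ≈ 0.13
```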
Variational Inference: Approximating the Posterior
With variational inference, we introduce a tractable approximation q_\varphi(\mathbf{z} \mid \mathbf{x}) and find the parameters \varphi that bring it as close as possible to the intractable posterior p_\theta(\mathbf{z} \mid \mathbf{x}). The closeness is measured by the KL divergence. The catch is that evaluating this KL divergence requires knowing the true posterior, which is exactly what we cannot compute: it depends on the intractable denominator, and we would need the posterior to solve the denominator. Minimising the KL divergence directly is therefore not an option, so instead we maximise an equivalent, tractable objective called the Evidence Lower BOund (ELBO).
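Since the KL divergence does the measuring here, it is worth seeing it concretely. For two univariate Gaussians, the family used throughout this post, it has a well-known closed form. A minimal sketch (`kl_gauss` is a helper name of my own, not part of this post’s code):

```python
import numpy as np

def kl_gauss(mu_q, sig_q, mu_p, sig_p):
    """KL( N(mu_q, sig_q^2) || N(mu_p, sig_p^2) ), closed form."""
    return (np.log(sig_p / sig_q)
            + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2)
            - 0.5)

print(kl_gauss(0.0, 1.0, 0.0, 1.0))  # 0.0: identical distributions
print(kl_gauss(1.0, 1.0, 0.0, 1.0))  # 0.5: same width, shifted mean
```

Note that the divergence is zero only when the two distributions coincide, and grows as they separate, which is what makes it usable as a training signal.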
Derivation of the ELBO (Jamil 2023)
Now our goal is to approximate the intractable posterior p_\theta(\mathbf{z} \mid \mathbf{x}) with the tractable distribution q_\varphi(\mathbf{z} \mid \mathbf{x}). Instead of working with p_\theta(\mathbf{x}) directly, we work with \log p_\theta(\mathbf{x}). So we start from the trivial identity
\log p_\theta(\mathbf{x}) = \log p_\theta(\mathbf{x})
Now by multiplying by \int q_\varphi (\mathbf{z} \mid \mathbf{x}) d\mathbf{z} = 1 we get
\begin{align*} \log p_\theta(\mathbf{x}) &= \log p_\theta(\mathbf{x})\int q_\varphi (\mathbf{z} \mid \mathbf{x}) d\mathbf{z} \\ &= \int \log p_\theta(\mathbf{x}) q_\varphi (\mathbf{z} \mid \mathbf{x}) d\mathbf{z} \end{align*}
Using the property \int f(x) \, q(z) \, dz = \mathbb{E}_{q(z)}[f(x)]
\log p_\theta(\mathbf{x}) = \mathbb{E}_{q_\varphi(\mathbf{z}|\mathbf{x})} \left[ \log p_\theta(\mathbf{x}) \right]
Next, we can apply the chain rule of probability: p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{z} \mid \mathbf{x}) p_\theta(\mathbf{x}) \Rightarrow p_\theta(\mathbf{x}) = \frac{p_\theta(\mathbf{x}, \mathbf{z})}{p_\theta(\mathbf{z}|\mathbf{x})}:
\log p_\theta(\mathbf{x}) = \mathbb{E}_{q_\varphi(\mathbf{z}|\mathbf{x})} \left[ \log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{p_\theta(\mathbf{z}|\mathbf{x})} \right]
Multiplying by one again, this time introducing the tractable approximation q_\varphi (\mathbf{z} \mid \mathbf{x}) into both numerator and denominator:
\log p_\theta(\mathbf{x}) = \mathbb{E}_{q_\varphi(\mathbf{z}|\mathbf{x})} \left[ \log \frac{p_\theta(\mathbf{x}, \mathbf{z}) q_\varphi (\mathbf{z} \mid \mathbf{x})}{p_\theta(\mathbf{z}|\mathbf{x}) q_\varphi (\mathbf{z} \mid \mathbf{x})} \right]
With the KL divergence defined as D_{KL}(q_\varphi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x})) = \mathbb{E}_{q_\varphi(\mathbf{z}|\mathbf{x})} \left[ \log \frac{q_\varphi(\mathbf{z}|\mathbf{x})}{p_\theta(\mathbf{z}|\mathbf{x})} \right], the expectation splits into two terms:
\log p_\theta(\mathbf{x}) = \underbrace{\mathbb{E}_{q_\varphi(\mathbf{z}|\mathbf{x})} \left[ \log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z}|\mathbf{x})} \right]}_{\text{ELBO}} + \underbrace{D_{\mathrm{KL}}\!\left( q_\varphi(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x}) \right)}_{\geq\, 0}
Since KL divergence is always non-negative, the ELBO is a lower bound on the log evidence:
\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\varphi(\mathbf{z}|\mathbf{x})} \left[ \log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z}|\mathbf{x})} \right]
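This bound can be sanity-checked by Monte Carlo in the toy model used later in this post (x = 1.8, \sigma_\ell = 1.2): for any setting of a Gaussian q, the estimated ELBO should sit at or below the exact log evidence. A sketch, with `elbo_mc` being a name of my own:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x_obs, sigma_l = 1.8, 1.2  # toy model used later in this post

def elbo_mc(mu_q, sig_q, n=200_000):
    # Monte Carlo estimate of E_q[log p(x, z) - log q(z)]
    z = mu_q + sig_q * rng.standard_normal(n)
    log_joint = (stats.norm.logpdf(x_obs, z, sigma_l)
                 + stats.norm.logpdf(z, 0.0, 1.0))
    return (log_joint - stats.norm.logpdf(z, mu_q, sig_q)).mean()

# Exact log evidence: x ~ N(0, 1 + sigma_l^2) in this conjugate model
log_px = stats.norm.logpdf(x_obs, 0.0, np.sqrt(1 + sigma_l**2))

# The bound holds no matter how badly q is chosen
for mu_q, sig_q in [(0.0, 1.0), (0.7, 0.8), (2.5, 0.3)]:
    print(f"ELBO({mu_q}, {sig_q}) = {elbo_mc(mu_q, sig_q):.3f} <= {log_px:.3f}")
```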
The ELBO is commonly denoted \mathcal{L}(\theta, \varphi), so I’ll use that notation from here on to simplify the math.
What is the ELBO Optimising?
Our original goal was to find a tractable distribution q_\varphi (\mathbf{z} \mid \mathbf{x}) that approximates the posterior p_\theta (\mathbf{z} \mid \mathbf{x}). The solution is to maximise the ELBO: with \log p_\theta (\mathbf{x}) fixed (it does not depend on \varphi), maximising the ELBO necessarily minimises the KL divergence.
\underbrace{\log p_\theta(\mathbf{x})}_{\text{fixed}} = \underbrace{\mathcal{L}(\theta, \varphi)}_{\uparrow \text{ maximise}} + \underbrace{D_{\mathrm{KL}}(q_\varphi \| p_\theta)}_{\downarrow \text{ minimise}}
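This identity can be verified numerically in the Gaussian toy model used later in this post (x = 1.8, \sigma_\ell = 1.2), where every term has a closed form. The helper names below are my own:

```python
import numpy as np

x_obs, sigma_l = 1.8, 1.2  # toy model used later in this post
mu_s = x_obs / (sigma_l**2 + 1)           # true posterior mean
var_s = sigma_l**2 / (sigma_l**2 + 1)     # true posterior variance
log_px = (-0.5 * np.log(2 * np.pi * (1 + sigma_l**2))
          - 0.5 * x_obs**2 / (1 + sigma_l**2))  # exact log evidence

def elbo(mu_q, sig_q):
    # Closed-form ELBO for q(z) = N(mu_q, sig_q^2)
    recon = (-0.5 * np.log(2 * np.pi * sigma_l**2)
             - 0.5 * (sig_q**2 + (mu_q - x_obs)**2) / sigma_l**2)
    kl_prior = 0.5 * (sig_q**2 + mu_q**2 - 1.0 - np.log(sig_q**2))
    return recon - kl_prior

def kl_post(mu_q, sig_q):
    # KL(q || true posterior), closed form for two Gaussians
    return (np.log(np.sqrt(var_s) / sig_q)
            + (sig_q**2 + (mu_q - mu_s)**2) / (2 * var_s) - 0.5)

# The identity log p(x) = ELBO + KL holds for ANY q, not just the optimum:
for mu_q, sig_q in [(0.0, 1.0), (1.5, 0.4)]:
    print(np.isclose(elbo(mu_q, sig_q) + kl_post(mu_q, sig_q), log_px))  # True
```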
Expanding the ELBO
The ELBO can be decomposed further by factoring the joint with the chain rule of probability and splitting the logarithm. This decomposition is a key tool for variational autoencoders (VAEs).
\begin{align*} \mathcal{L}(\theta, \varphi) &= \mathbb{E}_{q_\varphi(\mathbf{z}|\mathbf{x})} \left[ \log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\varphi(\mathbf{z}|\mathbf{x})} \right] \\ \mathcal{L}(\theta, \varphi) &= \mathbb{E}_{q_\varphi(\mathbf{z}|\mathbf{x})} \left[ \log \frac{p_\theta(\mathbf{x}|\mathbf{z})\, p(\mathbf{z})}{q_\varphi(\mathbf{z}|\mathbf{x})} \right] \\ &= \underbrace{\mathbb{E}_{q_\varphi(\mathbf{z}|\mathbf{x})} \left[ \log p_\theta(\mathbf{x}|\mathbf{z}) \right]}_{\text{reconstruction term}} - \underbrace{D_{\mathrm{KL}}\!\left( q_\varphi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}) \right)}_{\text{regularisation term}} \end{align*}
Here the reconstruction term measures how well the model recovers \mathbf{x} from samples of the latent space and the regularisation term penalises q_\varphi(\mathbf{z}|\mathbf{x}) for deviating from the prior p(\mathbf{z}).
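In the Gaussian toy model used in this post, both terms are available in closed form, so we can look at them separately. This mirrors the `elbo_gaussian` function in the code; `elbo_terms` is a helper name of my own:

```python
import numpy as np

x_obs, sigma_l = 1.8, 1.2  # the toy model used in this post

def elbo_terms(mu_q, sig_q):
    """Both ELBO terms in closed form for q(z) = N(mu_q, sig_q^2)."""
    # Reconstruction: E_q[log N(x_obs; z, sigma_l^2)], a Gaussian expectation
    recon = (-0.5 * np.log(2 * np.pi * sigma_l**2)
             - 0.5 * (sig_q**2 + (mu_q - x_obs)**2) / sigma_l**2)
    # Regularisation: KL(q || N(0, 1)), closed form for two Gaussians
    kl = 0.5 * (sig_q**2 + mu_q**2 - 1.0 - np.log(sig_q**2))
    return recon, kl

recon, kl = elbo_terms(0.74, 0.77)  # near the optimum found later
print(f"reconstruction {recon:.3f}, KL {kl:.3f}, ELBO {recon - kl:.3f}")
```

With q this close to the true posterior, the resulting ELBO lands just below the true log evidence of about -2.029.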
A Simple Example
Model:
p(z) = \mathcal{N}(0, 1), \qquad p(x \mid z) = \mathcal{N}(z, \sigma_\ell^2)
True posterior (conjugate, so tractable here):
p(z \mid x) = \mathcal{N}\!\left(\mu_*, \sigma_*^2\right), \quad \mu_* = \frac{x}{\sigma_\ell^2 + 1}, \quad \sigma_*^2 = \frac{\sigma_\ell^2}{\sigma_\ell^2 + 1}
Variational family: q(z) = \mathcal{N}(\mu_q, \sigma_q^2).
We optimise (\mu_q, \sigma_q) by gradient ascent on the ELBO and compare with the truth.
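Before optimising, the closed-form posterior above can be sanity-checked independently of the optimisation code: normalising p(x \mid z)\,p(z) on a dense grid should reproduce \mu_* and \sigma_*^2. A small sketch with names of my own:

```python
import numpy as np
from scipy import stats

x_obs, sigma_l = 1.8, 1.2
mu_star = x_obs / (sigma_l**2 + 1)         # closed-form posterior mean
var_star = sigma_l**2 / (sigma_l**2 + 1)   # closed-form posterior variance

# Normalise the joint p(x | z) p(z) on a dense grid and compare moments
z = np.linspace(-8.0, 8.0, 20_001)
dz = z[1] - z[0]
w = stats.norm.pdf(x_obs, z, sigma_l) * stats.norm.pdf(z)
w /= w.sum() * dz
mu_num = (z * w).sum() * dz
var_num = ((z - mu_num)**2 * w).sum() * dz
print(f"mean:     closed form {mu_star:.4f} vs grid {mu_num:.4f}")
print(f"variance: closed form {var_star:.4f} vs grid {var_num:.4f}")
```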
Code
fig, ax = plt.subplots(figsize=(7, 4.5))
z = np.linspace(-3, 4, 500)
true_p = stats.norm.pdf(z, mu_post, np.sqrt(sig2_post))
ax.fill_between(z, true_p, alpha=0.20)
ax.plot(z, true_p, lw=2.5, label=f"True posterior $\\mu={mu_post:.2f}$, $\\sigma={np.sqrt(sig2_post):.2f}$")
checkpoints = [0, 5, 20, 119]
alphas = [0.3, 0.5, 0.7, 1.0]
for i, alpha in zip(checkpoints, alphas):
    mu_i, sq_i, _ = history[i]
    q_i = stats.norm.pdf(z, mu_i, sq_i)
    ax.plot(z, q_i, lw=1.6, alpha=alpha,
            label=f"iter {i+1}: $\\mu_q={mu_i:.2f}$, $\\sigma_q={sq_i:.2f}$")
ax.set_xlabel("$z$"); ax.set_ylabel("density")
ax.set_title("VI convergence: $q$ approaches true posterior", fontweight="bold")
ax.legend(fontsize=8); ax.grid(True)
plt.tight_layout()
plt.savefig("fig_vi_convergence.png", dpi=150, bbox_inches="tight")
plt.show()

Code
fig, ax = plt.subplots(figsize=(7, 4.5))
iters = np.arange(1, 121)
ax.plot(iters, history[:, 2], lw=2.5)
ax.set_xlabel("iteration"); ax.set_ylabel("ELBO")
ax.set_title("ELBO during optimisation", fontweight="bold")
ax.grid(True)
final_elbo = history[-1, 2]
ax.axhline(final_elbo, lw=1, ls="--",
           label=f"Final ELBO = {final_elbo:.3f}")
ax.legend(fontsize=9)
plt.tight_layout()
plt.savefig("fig_vi_elbo.png", dpi=150, bbox_inches="tight")
plt.show()

from scipy.special import logsumexp
# True log evidence via numerical integration
z_grid = np.linspace(-10, 10, 10000)
dz = z_grid[1] - z_grid[0]
log_joint = stats.norm.logpdf(z_grid, x_obs, sigma_l) + stats.norm.logpdf(z_grid, 0, 1)
log_evidence_numerical = logsumexp(log_joint) + np.log(dz)
print(f"True log p(x) (numerical): {log_evidence_numerical:.4f}")
print(f"Final ELBO: {history[-1, 2]:.4f}")
print(f"Difference: {abs(log_evidence_numerical - history[-1, 2]):.4f}")

True log p(x) (numerical): -2.0289
Final ELBO: -2.0289
Difference: 0.0000
As can be seen in Figure 1, the approximation approaches the true posterior. The code block above verifies this numerically: the final ELBO matches the true log evidence, so the KL divergence between q and the posterior has converged to zero, meaning the two distributions agree up to numerical precision.
Summary
In the beginning, we motivated variational inference as a way to handle intractable posteriors. We then derived the ELBO and showed that maximising it is equivalent to minimising the KL divergence between the approximation and the true posterior. Finally, we applied the theory to a simple Gaussian example and confirmed that the posterior was approximated well.
References
Citation
@online{bogossian2026,
author = {Bogossian, Andreas},
title = {Variational {Inference} \& the {ELBO}},
date = {2026-04-02},
url = {https://andreasbogossian.com/posts/evidence-lower-bound/},
langid = {en}
}

