
Diffusion versus Flow Matching

An accessible introduction to diffusion and flow matching models. This post aims to be both complete and easy-to-follow as a reference for implementing diffusion models yourself.

This post is a summary of a few recent, impactful generative modeling papers. I'm primarily writing it for my own understanding, but hopefully it's useful to others as well, since this is a fast-paced area of research and it can be hard to dive deeply into each new paper that comes out.

This post uses what I'll call "math for computer scientists", meaning there will likely be a lot of abuse of notation and other hand-waving with the goal of conveying the underlying idea more clearly. If there is a mistake (and it looks unintentional) then let me know!

All Papers

This post will discuss three papers:

  • Diffusion from Denoising Diffusion Probabilistic Models
  • Latent Diffusion from High-Resolution Image Synthesis with Latent Diffusion Models (a.k.a. the Stable Diffusion paper)
  • Flow Matching from Flow Matching for Generative Modeling


These papers overlap and cite each other in various ways.

Diffusion

Diffusion models can be summarized briefly as:

  1. Start with an image from the training data.
  2. Iteratively add noise to the image.
  3. Train a model to predict some part of the noise that was added.
  4. Iteratively remove noise from the image by predicting and subtracting it.

Consider Figure 2 from the original paper:

Figure 2 from the DDPM paper.

In the above diagram:

  • x_T is some noise, typically Gaussian noise
  • x_0 is the original image
  • p_{\theta}(x_{t-1} | x_t) is called the reverse process, a distribution over the slightly less noisy image given the current image
  • q(x_t | x_{t-1}) is called the forward process, a Markov chain that adds Gaussian noise to the data

How do we convert a regular image to Gaussian noise?

Let's assume we have a noisy image x_{t-1} and we want to make it slightly noisier (in other words, take a step along the forward process q(x_t | x_{t-1})). We sample the slightly noisier image x_t from the following distribution:

x_t \sim \mathcal{N}(\sqrt{1 - \beta_t} x_{t - 1}, \beta_t \textbf{I})

This can be read as, "sample x_t from a Gaussian distribution with mean \sqrt{1 - \beta_t} x_{t - 1} and variance \beta_t \textbf{I}." The matrix \textbf{I} is just the identity matrix, so \beta_t is the variance of the Gaussian noise being added.

Recall that the sum of two independent Gaussian random variables is also Gaussian:

\begin{aligned} X & \sim \mathcal{N}(\mu_X, \sigma_X^2) \\ Y & \sim \mathcal{N}(\mu_Y, \sigma_Y^2) \\ X + Y & \sim \mathcal{N}(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2) \end{aligned}

It's also worth noting that multiplying a zero-mean Gaussian by some factor \alpha is equivalent to multiplying the variance by \alpha^2:

\alpha \mathcal{N}(\textbf{0}, \textbf{I}) = \mathcal{N}(\textbf{0}, \alpha^2 \textbf{I})

So we can rewrite the distribution from earlier as:

\begin{aligned} \mathcal{N}(\sqrt{1 - \beta_t} x_{t - 1}, \beta_t \textbf{I}) & = \mathcal{N}(\sqrt{1 - \beta_t} x_{t - 1}, \textbf{0}) + \mathcal{N}(\textbf{0}, \beta_t \textbf{I}) \\ & = \sqrt{1 - \beta_t} x_{t - 1} + \sqrt{\beta_t} \mathcal{N}(\textbf{0}, \textbf{I}) \end{aligned}

This is our forward process q(x_t | x_{t-1}):

q(x_t|x_{t-1}) = \sqrt{1 - \beta_t} x_{t - 1} + \sqrt{\beta_t} \mathcal{N}(\textbf{0}, \textbf{I})
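
In code, a single forward step is one line of reparameterized sampling. Here's a minimal PyTorch sketch (the function name is mine):

import torch
from torch import Tensor

def forward_step(x_prev: Tensor, beta_t: float) -> Tensor:
    # x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps, with eps ~ N(0, I)
    eps = torch.randn_like(x_prev)
    return (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * eps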

So, to recap, in order to convert a regular image to Gaussian noise, we repeatedly apply the q(x_t | x_{t-1}) rule to add noise to the image; as T \to \infty, the result converges to pure Gaussian noise.

The variance for each step of the q update is given by some schedule \beta_1, \beta_2, \dots, \beta_T (the DDPM paper increases it linearly from 10^{-4} to 0.02). The schedule increases with t; if it increases all the way to \beta_T = 1, then on the final step we sample a completely noisy image from the distribution \mathcal{N}(\textbf{0}, \textbf{I}).

How can you efficiently train the model?

Put differently, how do you sample x_t in closed form (i.e., without sampling x_{t-1}, \dots, x_1)?

Rather than having to take our original image and run t forward steps to get a noisy image, we can use the reparametrization trick.

If we start with our original image x_0, we can write the first slightly noisy image x_1 as:

\begin{aligned} x_1 & = \sqrt{1 - \beta_1} x_0 + \mathcal{N}(\textbf{0}, \beta_1 \textbf{I}) \\ & = \sqrt{1 - \beta_1} x_0 + \sqrt{\beta_1} \mathcal{N}(\textbf{0}, \textbf{I}) \end{aligned}

We can rewrite this using \alpha_t = 1 - \beta_t as:

x_1 = \sqrt{\alpha_1} x_0 + \mathcal{N}(\textbf{0}, (1 - \alpha_1) \textbf{I})

We can then write x_2 as:

\begin{aligned} x_2 & = \sqrt{\alpha_2} x_1 + \mathcal{N}(\textbf{0}, (1 - \alpha_2) \textbf{I}) \\ & = \sqrt{\alpha_2} (\sqrt{\alpha_1} x_0 + \mathcal{N}(\textbf{0}, (1 - \alpha_1) \textbf{I})) + \mathcal{N}(\textbf{0}, (1 - \alpha_2) \textbf{I}) \\ & = \sqrt{\alpha_1 \alpha_2} x_0 + \mathcal{N}(\textbf{0}, \alpha_2 (1 - \alpha_1) \textbf{I}) + \mathcal{N}(\textbf{0}, (1 - \alpha_2) \textbf{I}) \\ & = \sqrt{\alpha_1 \alpha_2} x_0 + \mathcal{N}(\textbf{0}, (\alpha_2 (1 - \alpha_1) + (1 - \alpha_2)) \textbf{I}) \\ & = \sqrt{\alpha_1 \alpha_2} x_0 + \mathcal{N}(\textbf{0}, (1 - \alpha_1 \alpha_2) \textbf{I}) \end{aligned}

This can be extended recursively¹, so we can write x_t in closed form as:

x_t = \sqrt{\alpha_1 \alpha_2 \dots \alpha_t} x_0 + \mathcal{N}(\textbf{0}, (1 - \alpha_1 \alpha_2 \dots \alpha_t) \textbf{I})

It's common to express the product as a new variable:

\bar{\alpha}_t = \alpha_1 \alpha_2 \dots \alpha_t = \prod_{i=1}^{t} \alpha_i

Also, the usual notation is to write \epsilon_t \sim \mathcal{N}(\textbf{0}, \textbf{I}), giving the final equation for sampling x_t as:

x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon_t

Sampling \epsilon_t from \mathcal{N}(\textbf{0}, \textbf{I}) and using it to derive x_t is called Monte Carlo sampling. Alternatively, we can use our q notation from earlier to specify the closed-form distribution that the sample is drawn from:

\begin{aligned} q(x_t | x_0) & = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \mathcal{N}(\textbf{0}, \textbf{I}) \\ & = \mathcal{N}(\sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) \textbf{I}) \end{aligned}
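
As a sketch, this closed-form sampler looks like the following in PyTorch (the helper name is mine; returning \epsilon_t alongside x_t is convenient because it becomes the regression target during training):

import torch
from torch import Tensor

def sample_xt(x0: Tensor, t: int, betas: Tensor) -> tuple[Tensor, Tensor]:
    # betas has shape [T]; alpha_bar_t is the running product of (1 - beta_i).
    alpha_bar_t = torch.cumprod(1.0 - betas, dim=0)[t]
    eps = torch.randn_like(x0)  # eps_t ~ N(0, I)
    xt = alpha_bar_t.sqrt() * x0 + (1.0 - alpha_bar_t).sqrt() * eps
    return xt, eps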

How is the model trained?

The diffusion model training and sampling algorithms

The main goal of the learning process is to maximize the likelihood of the data after repeatedly applying the reverse process. First, we sample some noise \epsilon_t \sim \mathcal{N}(\textbf{0}, \textbf{I}) and then we apply the forward process q(x_t | x_0) to get x_t:

x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon_t

The diffusion model training process involves training a model which takes x_t and t as input and predicts \epsilon_t:

\hat{\epsilon}_t = \epsilon_{\theta}(x_t, t)

We can train the model to minimize the mean squared error between \epsilon_t and \hat{\epsilon}_t:

\mathcal{L} = ||\epsilon_t - \hat{\epsilon}_t||^2

So, the model is predicting the noise that was added to the original image to produce the noisy image.
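
Putting the pieces together, one training step might look like the following sketch, where model is any network that takes (x_t, t), such as a U-Net (the names here are mine):

import torch
import torch.nn.functional as F
from torch import Tensor

def training_step(model, x0: Tensor, betas: Tensor) -> Tensor:
    # Draw a random timestep per image, produce x_t in closed form,
    # then regress the model's noise prediction against the true noise.
    t = torch.randint(0, len(betas), (x0.shape[0],), device=x0.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    xt = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * eps
    eps_hat = model(xt, t)  # predicts the noise that was added
    return F.mse_loss(eps_hat, eps)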

How do you sample from the model?

Now that we've ironed out the math for the forward process, we need to flip it around to get the reverse process. In other words, given that we have q(x_t | x_{t-1}), we need to derive q(x_{t-1} | x_t)². The first step is to apply Bayes' rule:

\begin{aligned} q(x_{t-1} | x_t) & = \frac{q(x_t | x_{t-1}) q(x_{t-1})}{q(x_t)} \\ & \propto q(x_t | x_{t-1}) q(x_{t-1}) \end{aligned}

We drop the denominator because x_t is constant when we are sampling.

Recall the probability density function for a normal distribution:

\mathcal{N}(x | \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right)

We can use this to rewrite q(x_t | x_{t-1}) as a function of x_{t-1}:

\begin{aligned} q(x_t | x_{t-1}) & = \mathcal{N}(x_t | \sqrt{\alpha_t} x_{t-1}, (1 - \alpha_t) \textbf{I}) \\ & = \frac{1}{\sqrt{2 \pi (1 - \alpha_t)}} \exp\left(-\frac{(x_t - \sqrt{\alpha_t} x_{t-1})^2}{2 (1 - \alpha_t)}\right) \end{aligned}

Similarly for q(x_{t-1}) (implicitly conditioning on the original image x_0):

\begin{aligned} q(x_{t-1}) & = \mathcal{N}(x_{t-1} | \sqrt{\bar{\alpha}_{t-1}} x_0, (1 - \bar{\alpha}_{t-1}) \textbf{I}) \\ & = \frac{1}{\sqrt{2 \pi (1 - \bar{\alpha}_{t-1})}} \exp\left(-\frac{(x_{t-1} - \sqrt{\bar{\alpha}_{t-1}} x_0)^2}{2 (1 - \bar{\alpha}_{t-1})}\right) \end{aligned}

Anyway, if you do some crazy math you can eventually arrive at the equation for the reverse process, which is:

x_{t - 1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(x_t, t)\right) + \sigma_t z

where z \sim \mathcal{N}(\textbf{0}, \textbf{I}) is new noise to add at each step and \epsilon_{\theta}(x_t, t) is the output of the model.
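
In code, the sampling loop is a direct transcription of this update. A sketch, using \sigma_t = \sqrt{\beta_t} (one common choice):

import torch
from torch import Tensor

@torch.no_grad()
def sample(model, shape: tuple, betas: Tensor) -> Tensor:
    # Start from pure noise x_T ~ N(0, I) and iteratively denoise.
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)
    for t in reversed(range(len(betas))):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps_hat = model(x, torch.full((shape[0],), t))
        x = (x - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        x = x + betas[t].sqrt() * z  # no noise is added on the final step
    return x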

Latent Diffusion

The latent diffusion paper is most notable because it produces high-quality image samples with a relatively simple model. It contains two training phases:

  1. Autoencoder to learn a lower-dimensional latent representation
  2. Diffusion model learned in the latent space

The main insight is that learning a diffusion model in the full pixel space is computationally expensive and has a lot of redundancy. Instead, we can first obtain a lower-dimensional representation of the image, learn a diffusion model on that representation, then use the decoder to reconstruct the image. Most of the insight into the latent diffusion model therefore comes from understanding how this latent space is constructed.

When looking through the latent diffusion repository code, it's important to remember that a lot of stuff might be in the taming transformers repository.

How is the autoencoder constructed?

The autoencoder is an encoder-decoder network trained with a perceptual loss and a patch-based adversarial loss, which help ensure that the reconstruction better matches how humans perceive images.

The perceptual loss takes a pre-trained VGG16 model, extracts features from several of its layers, projects them to a lower-dimensional space, then computes the mean squared error between the features of the original image and those of the reconstructed image.

The patch-based adversarial loss comes from a discriminator trained to classify whether each patch of an image is real or fake. The discriminator is trained with a hinge loss, which only penalizes predictions that fall on the wrong side of a margin, so confidently correct patches contribute nothing to the loss.
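
For reference, the discriminator hinge loss typically looks like the following sketch (this is the standard formulation, not code copied from the repository):

import torch
from torch import Tensor

def hinge_d_loss(logits_real: Tensor, logits_fake: Tensor) -> Tensor:
    # Real patches are only penalized while their score is below +1,
    # fake patches only while their score is above -1.
    loss_real = torch.mean(torch.relu(1.0 - logits_real))
    loss_fake = torch.mean(torch.relu(1.0 + logits_fake))
    return 0.5 * (loss_real + loss_fake)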

What is the KL divergence penalty?

Besides the reconstruction loss, a slight additional penalty is imposed on the latent representation to keep it close to a normal distribution, in the form of minimizing the KL divergence between the latent distribution and a standard normal distribution. Recall that the KL divergence between two distributions p and q is defined as:

\begin{aligned} D_{KL}(p \| q) & = \int p(x) \log \frac{p(x)}{q(x)} dx \\ & = \int p(x) \log p(x) dx - \int p(x) \log q(x) dx \end{aligned}

The first integral is the negative entropy of p, and the second term (including its minus sign) is the cross-entropy between p and q, so D_{KL}(p \| q) = H(p, q) - H(p). For two Gaussians, these have the following closed forms respectively:

\begin{aligned} H(p) & = \frac{1}{2} \log (2 \pi e \sigma_p^2) \\ H(p, q) & = \frac{1}{2} \log (2 \pi \sigma_q^2) + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2 \sigma_q^2} \end{aligned}

We can substitute these into the KL divergence equation to get:

D_{KL}(p \| q) = \frac{1}{2} \log \frac{\sigma_q^2}{\sigma_p^2} + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2 \sigma_q^2} - \frac{1}{2}

For p = \mathcal{N}(\mu, \sigma^2) and q = \mathcal{N}(0, 1), this reduces to:

D_{KL}(\mathcal{N}(\mu, \sigma^2) \| \mathcal{N}(0, 1)) = \frac{\sigma^2 + \mu^2 - 1 - \log{\sigma^2}}{2}

Here's a PyTorch implementation of this, from the latent diffusion repository:

import torch
from torch import Tensor

def kl_loss(mean: Tensor, log_var: Tensor) -> Tensor:
    # mean, log_var are image tensors with shape [batch_size, channels, height, width]
    log_var = torch.clamp(log_var, -30.0, 20.0)  # clamp for numerical stability
    var = log_var.exp()
    # 0.5 * (mu^2 + sigma^2 - 1 - log(sigma^2)), summed over the latent dimensions
    return 0.5 * torch.sum(torch.pow(mean, 2) + var - 1.0 - log_var, dim=[1, 2, 3])

Flow Matching

Flow-based models are another type of generative model which rely on the idea of "invertible transformations". Suppose you have a function f(x) which can reliably map your data distribution to a standard normal distribution and is also invertible; then the inverse f^{-1}(x) can be used to map points in the standard normal distribution back to your data distribution. This is the basic idea behind flow-based models.

Note that the sections that follow are going to feel like a lot of math, but they are a winding path to a nice and easy-to-understand comparison with diffusion models, which is: if you write the steps that diffusion models take as an ODE, the line they trace to get to the final point is not straight, so why not just make it straight? Neural networks probably like predicting straight lines. See Figure 3 from the flow matching paper below:

ODE paths for diffusion equations versus optimal transport equations

What is a continuous normalizing flow?

Continuous normalizing flows were first introduced in the paper Neural Ordinary Differential Equations. Consider the update rule of a recurrent neural network:

\textbf{h}_{t + 1} = \textbf{h}_t + f(\textbf{h}_t, \theta_t)

In a vanilla RNN, f just does a matrix multiplication on \textbf{h}_t (plus a nonlinearity). This can be thought of as a discrete update rule over time. Neural ODEs are the continuous version of this:

\frac{d \textbf{h}(t)}{dt} = f(\textbf{h}(t), t, \theta)

The diffusion process described earlier can be conceptualized as a neural ODE - you just have to have an infinite number of infinitesimally small diffusion steps.

This formulation permits us to use any off-the-shelf ODE solver to generate samples from the distribution. The simplest method is to pick some small step size \Delta t and use Euler's method to solve the ODE.
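
For example, a fixed-step Euler solver is only a few lines. A sketch, where v is the learned vector field (assumed to take a batch of points and a per-example time):

import torch
from torch import Tensor

def euler_sample(v, x0: Tensor, num_steps: int = 100) -> Tensor:
    # Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-size steps.
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt)
        x = x + dt * v(x, t)
    return x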

How do we train a continuous normalizing flow?

Sampling from an ODE is fine. The real question is how we parameterize these models, and what update rule we use to learn the parameters.

The goal of the optimization is to make the output of our ODE solver close to our data. This can be formulated as:

\mathcal{L}(z(t_1)) = \mathcal{L}\left( z(t_0) + \int_{t_0}^{t_1} f(z(t), t, \theta) \, dt \right)

To optimize this, we need to know how \mathcal{L}(z(t)) changes with respect to z(t):

\begin{aligned} a(t) & = \frac{\partial \mathcal{L}(z(t))}{\partial z(t)} \\ \frac{d a(t)}{dt} & = -a(t)^T \frac{\partial f(z(t), t, \theta)}{\partial z} \end{aligned}

We can use the second equation to move backwards along tt using another ODE solver, back-propagating the loss at each step.

This quantity a(t) is called the adjoint, and is illustrated in Figure 2 from the original Neural ODE paper, copied below. It's useful to know about, but mainly as a barrier to overcome further down - we don't want to actually use it, because it is computationally expensive to unroll every time we want to update our model.

Figure 2 from the Neural ODE paper, illustrating the adjoint method.

However, the above process is computationally expensive; the analogy in our diffusion model would be having to update every single point along our diffusion trajectory on every update, each time using an ODE solver. Instead, the paper Flow Matching for Generative Modeling proposes a different approach, called Conditional Flow Matching (CFM).

What is conditional flow matching?

First, some terminology: the paper makes use of optimal transport to speed up the training process. This basically just means the most efficient way to move between two points given some constraints. Alternatively, it is the path which minimizes a total cost.

The goal of CFM is to avoid going through the entire ODE solver on every update step. Instead, in order to scale our flow matching model training, we want to be able to sample a single point, and then use optimal transport to move that point to the data distribution.

First, we can express our continuous normalizing flow as a function:

\phi_t(x) : [0, 1] \times \mathbb{R}^d \rightarrow \mathbb{R}^d

This can be read as, "a function mapping a point in \mathbb{R}^d (i.e., a d-dimensional vector) and a time between 0 and 1 to another point in \mathbb{R}^d." We are interested in its first derivative with respect to time:

\begin{aligned} v_t(\phi_t(x)) & = \frac{d}{dt} \phi_t(x) \\ \phi_0(x) & = x \end{aligned}

where x = (x_1, \dots, x_d) \in \mathbb{R}^d are the points in the data distribution (for example, our images).

In a CNF, the function v_t, usually called the vector field, is parameterized by the neural network. Our goal is to update the neural network so that we can use it to move from some initial point sampled from a prior distribution to one of the points in our data distribution by using an ODE solver (in other words, by following the vector field).

Some more notation:

  • p_0 is the prior distribution, usually a standard normal \mathcal{N}(\textbf{0}, \textbf{I})
  • q is the true data distribution, which is unknown, but we get samples from it in the form of images (for example)
  • p_1 is the posterior distribution, which we want to be close to q
  • p_t is the distribution of points at time t between p_0 and p_1. Think of these as noisy images somewhere along a path from the prior to the posterior distribution.

How do we learn the vector field?

The goal of the learning process, as with most learning processes, is to maximize the likelihood of the data distribution. We can express this using the flow matching objective:

\mathcal{L}_{FM}(\theta) = \mathbb{E}_{t, p_t(x)} \| v_t(x) - u_t(x) \|^2

where:

  • v_t is the output of the neural network for time t and the data sample(s) x
  • u_t is the "true vector field" (i.e., the vector field that would take us from the prior distribution to the data distribution)

So the loss function is simply the mean squared error between the neural network output and some "ground truth" vector field value. Seems simple enough, right? The problem is that we don't actually know p_t or u_t, since there are many possible paths from p_0 to p_1.

The insight from this paper starts with the marginal probability path. This should be familiar if you've worked with graphical models like conditional random fields. The idea is that we can write p_t as a marginal over all the conditional paths that end at some sample x_1 in our data distribution:

p_t(x) = \int p_t(x | x_1) q(x_1) dx_1

This can be read as, "p_t(x) is the distribution over all the noisy images that can be denoised to an image in our data distribution".

We can also marginalize over the vector field:

u_t(x) = \int u_t(x | x_1) \frac{p_t(x | x_1) q(x_1)}{p_t(x)} dx_1

This can be read as, "u_t(x) is the average over all the conditional vector fields that could take us from a noisy image to an image in our data distribution, weighted by the probability of each path".

Rather than computing the (intractable) integrals in the equations above, we can instead condition on a single sample x_1 \sim q(x_1), use that sample to get a noisy sample x \sim p_t(x | x_1), and then use that noisy sample to compute the direction u_t(x | x_1) that our vector field should take to recover the original image x_1, finally minimizing the conditional flow matching objective:

\mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t, q(x_1), p_t(x|x_1)} \| v_t(x) - u_t(x | x_1) \|^2

Without going into the math, the paper shows that this objective has the same gradients as the earlier objective, which is a pretty interesting result. It basically means that we can just follow our vector field from the original image to the noisy image, and that vector field is the optimal one to follow backwards to our original image.

The conditional path p_t(x | x_1) is chosen to progressively add noise to the sample x_1 (this is what diffusion models do):

p_t(x|x_1) = \mathcal{N}(x | \mu_t(x_1), \sigma_t(x_1)^2 I)

where:

  • \mu_t is the time-dependent mean of the Gaussian distribution, ranging from \mu_0(x_1) = 0 (i.e., the prior has mean 0) to \mu_1(x_1) = x_1 (i.e., the posterior has mean x_1)
  • \sigma_t is the time-dependent standard deviation, ranging from \sigma_0(x_1) = 1 (i.e., the prior has standard deviation 1) to \sigma_1(x_1) = \sigma_{\text{min}} (i.e., the posterior has some very small amount of noise)

Using the above notation, the paper considers the flow:

\phi_t(x) = \sigma_t(x_1) x + \mu_t(x_1)

Remember that \phi_t(x) is the flow at time t for the sample x, meaning the point that we would get to if we followed the vector field starting from x for time t (in other words, the noisy image).

Recall from earlier that u_t(\phi_t(x) | x_1) is just the time derivative of this flow, which gives us a closed-form solution for the target values in our \mathcal{L}_{CFM} objective:

\begin{aligned} u_t(x|x_1) & = \frac{d}{dt} \phi_t(x) \\ & = \frac{\sigma_t'(x_1)}{\sigma_t(x_1)} (x - \mu_t(x_1)) + \mu_t'(x_1) \end{aligned}

where:

  • \sigma_t'(x_1) is the derivative of \sigma_t(x_1) with respect to t
  • \mu_t'(x_1) is the derivative of \mu_t(x_1) with respect to t

This is basically just a more general formulation of diffusion models. Specifically, diffusion models can be expressed as:

\begin{aligned} \mu_t(x_1) & = \alpha_{1 - t} x_1 \\ \sigma_t(x_1) & = \sqrt{1 - \alpha_{1 - t}^2} \end{aligned}

although \alpha here is slightly different from the \alpha used earlier.

Alternatively, the optimal transport conditioned vector field can be expressed as:

\begin{aligned} \mu_t(x_1) & = t x_1 \\ \sigma_t(x_1) & = 1 - (1 - \sigma_{\text{min}}) t \end{aligned}

As t runs from 0 to 1, this path linearly scales the mean from 0 up to the image x_1, and linearly scales the standard deviation from 1 down to \sigma_{\text{min}}. It has the derivatives:

\begin{aligned} \mu_t'(x_1) & = x_1 \\ \sigma_t'(x_1) & = -(1 - \sigma_{\text{min}}) \end{aligned}

Plugging these into the equation above gives us u_t(x | x_1) (don't worry, it's just basic algebra):

\begin{aligned} u_t(x | x_1) & = \frac{-(1 - \sigma_{\text{min}})}{1 - (1 - \sigma_{\text{min}})t} (x - t x_1) + x_1 \\ & = \frac{-(1 - \sigma_{\text{min}}) x + t x_1 (1 - \sigma_{\text{min}}) + x_1 (1 - (1 - \sigma_{\text{min}}) t)}{1 - (1 - \sigma_{\text{min}}) t} \\ & = \frac{-(1 - \sigma_{\text{min}}) x + t x_1 - t x_1 \sigma_{\text{min}} + x_1 - t x_1 + t x_1 \sigma_{\text{min}}}{1 - (1 - \sigma_{\text{min}}) t} \\ & = \frac{x_1 - (1 - \sigma_{\text{min}}) x}{1 - (1 - \sigma_{\text{min}}) t} \end{aligned}
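
This target is cheap to compute. A sketch (the function name and the value of \sigma_{\text{min}} are mine):

from torch import Tensor

def ot_target(x: Tensor, x1: Tensor, t: Tensor, sigma_min: float = 1e-4) -> Tensor:
    # u_t(x | x_1) = (x_1 - (1 - sigma_min) * x) / (1 - (1 - sigma_min) * t)
    # t must broadcast against x (e.g., shape [batch, 1, 1, 1] for images).
    return (x1 - (1.0 - sigma_min) * x) / (1.0 - (1.0 - sigma_min) * t)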

So, to recap the learning procedure (a code sketch follows below):

  1. Choose a sample x_1 from the dataset.
  2. Sample a time t \sim \mathcal{U}[0, 1] and a noisy point x \sim p_t(x | x_1).
  3. Compute u_t(x | x_1) using the equation above.
  4. Predict v_t(x) using the neural network.
  5. Minimize the mean squared error between the two.
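
Here's a sketch of one training step under the optimal transport path (names are mine; x_1 comes from the dataloader, covering step 1):

import torch
import torch.nn.functional as F
from torch import Tensor

def cfm_training_step(v, x1: Tensor, sigma_min: float = 1e-4) -> Tensor:
    # Sample t ~ U[0, 1] and x ~ p_t(x | x_1) by pushing prior noise
    # through the flow: x = (1 - (1 - sigma_min) * t) * x0 + t * x1.
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, 1, 1, 1)
    x0 = torch.randn_like(x1)
    x = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Closed-form target vector field derived above.
    target = (x1 - (1.0 - sigma_min) * x) / (1.0 - (1.0 - sigma_min) * t)
    # Predict v_t(x) and minimize the mean squared error.
    return F.mse_loss(v(x, t), target)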

Then, sampling from the model is just a matter of following the flow from some random noise vector along the vector field predicted by the neural network, as you would with a regular ODE.

Specifically, they found that they were able to get good quality samples using a fixed-step ODE solver (the simplest kind) with 100 or fewer steps.

Footnotes

  1. Proof by "trust me, bro"

  2. Alternatively denoted p(x_{t-1} | x_t); this post sticks with q so that the same notation is used everywhere