Exploring the Fundamentals of Diffusion Models

This blog post explores the recently popular diffusion models. Since there are many excellent explanations available, I don't intend to duplicate their detailed discussions. Rather, I try to offer a succinct summary that highlights the essential principles and foundational concepts behind diffusion models.

Diffusion models belong to the family of generative neural network models, alongside Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). The term "diffusion" originates from non-equilibrium thermodynamics, in which particles migrate from regions of high concentration to regions of low concentration. In our setting, the high-concentration region is the data manifold occupied by structured images, while the low-concentration region is the rest of the space, the overwhelming majority of which consists of noisy, unstructured images. This analogy illustrates how diffusion models iteratively refine noise into coherent, structured data by effectively reversing this natural diffusion process.

To make the concept of the diffusion process easier to grasp, let's use head CT scans as an example.  Imagine that each of these scans is made up of 256 shades of gray and is sized at $300\times 300$ pixels. In this context, each image can be thought of as a single point, denoted as $\mathbf{x}=(x_1, \ldots, x_{300^2})$, within a vast discrete set $\mathbb{V} = \{0, \ldots, 255\}^{300^2}$. Here, $\mathbb{V}$ represents all possible images that could be formed by any combination of grayscale intensities across the $300^2$ pixels, where each $x_j$ (the coordinate along the $j$-th axis) corresponds to the intensity of the $j$-th pixel.

The sheer size of $\mathbb{V}$ is astronomical, containing $256^{300^2}$ potential points, a number far exceeding the estimated number of atoms in the universe. Head CT images are not random assortments of pixels; they are highly structured configurations, representing meaningful and specific anatomical information. Therefore, the entire set of possible head CT images, referred to as the data manifold and denoted by $\mathbb{M}$, takes up only an extremely tiny portion of $\mathbb{V}$. From this perspective, the diffusion process acts to disperse these structured, highly concentrated points (representing head CT images) across the broader, mostly uncharted territory of $\mathbb{V}$, which is predominantly filled with random, less meaningful noise.
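As a quick sanity check on the size of $\mathbb{V}$, the short Python snippet below computes the order of magnitude of $256^{300^2}$ and compares it with the commonly cited rough estimate of about $10^{80}$ atoms in the observable universe (the comparison figure is an assumption used only for illustration):

```python
import math

# Number of distinct 300x300 images with 256 gray levels is 256**(300**2).
# Work in log10 to avoid forming this astronomically large integer.
num_pixels = 300 ** 2
log10_images = num_pixels * math.log10(256)   # roughly 216,741

log10_atoms = 80  # commonly cited rough estimate: ~10^80 atoms in the observable universe

print(f"|V| = 256^{num_pixels} ~ 10^{log10_images:.0f}")
print(f"Ratio to ~10^{log10_atoms} atoms: ~10^{log10_images - log10_atoms:.0f}")
```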

Diffusion models can be viewed as sharing some conceptual similarities with VAEs, yet they fundamentally differ in their operating principles and mechanisms. VAEs compress input data into a dense latent space and then expand this compressed representation back to the original data form; this operation relies on the reparameterization trick, which introduces variability by injecting noise into the latent space. Diffusion models, by contrast, start with structured input data and systematically introduce noise, emulating the gradual process of natural diffusion until a predetermined noise level is reached. The process is then inverted: the model removes noise through a series of denoising steps, stripping away the layers of added noise and progressively revealing the data's original structure.

A notable distinction between the two is in their treatment of latent variables. Unlike VAEs, where the latent space is typically lower-dimensional and abstract, diffusion models maintain latent variables that are dimensionally the same as the original data. In diffusion models, the encoding process is predefined and fixed, while only the decoding process is subject to learning. In VAEs, both the encoding and decoding processes are learned simultaneously, allowing for a joint optimization of these functions.

To mathematically delineate the diffusion model, let's start with $\mathbf{x}_0$, representing the initial, structured image before the addition of any noise. The distribution of $\mathbf{x}_0$ is denoted as $q(\mathbf{x}_0)$. The model then engages in a forward diffusion process, generating a progressively less structured sequence of states $\mathbf{x}_1, \ldots, \mathbf{x}_T$ by incrementally adding noise. This process is described by a Markovian transition defined as $$q(\mathbf{x}_{t} | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\, \mathbf{x}_{t-1}, \beta_t \mathbf{I})$$ where $\mathbf{x}_t$ is the state of the data at time step $t$, $\beta_t$ represents the noise variance at time step $t$, and $\mathbf{I}$ denotes the identity matrix. In this context, $\mathcal{N}(\mathbf{x}; \mu, \Sigma)$ represents the Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$, mathematically defined as $$\mathcal N(\mathbf{x};\mu,\Sigma)=\frac{c}{\sqrt{\det(\Sigma)}}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)\right),$$ where $c$ is the normalization constant and the superscript $T$ denotes the transpose operation. Equivalently, the update formula $\mathbf{x}_t = \sqrt{1-\beta_t}\,\mathbf{x}_{t-1} + \sqrt{\beta_t}\,\mathbf{z}_{t}$ for $t = 1, \ldots, T$, with $\mathbf{z}_{t} \sim \mathcal{N}(0, \mathbf{I})$, describes how noise is systematically introduced to the data, transforming the original structured image $\mathbf{x}_{0}$ into a sequence that becomes increasingly disordered as the process unfolds. We should note that $\mathbf{x}_{t}$ can be sampled directly from $\mathbf{x}_{0}$, because a direct calculation reveals that $\mathbf{x}_t = \sqrt{\overline{\alpha}_t}\,\mathbf{x}_{0} + \sqrt{1-\overline{\alpha}_t}\,\boldsymbol{\epsilon}$, with $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})$ and $\overline{\alpha}_t=\prod_{i=1}^t \underbrace{(1-\beta_i)}_{\alpha_i}$.



Here, $\overline{\alpha}_t \searrow 0$ as $t\nearrow \infty$, so for large $t$ the state $\mathbf{x}_t$ approaches pure Gaussian noise. Although $\mathbf{x}_{t}$ can thus be obtained from $\mathbf{x}_{0}$ in a single step, the point of the construction is the deliberate progression toward increased disorder as the diffusion steps advance.
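As an illustration, here is a minimal NumPy sketch of this closed-form forward step. The linear schedule for $\beta_t$ and its endpoint values are illustrative choices only (they follow common DDPM practice [5]), and the placeholder image simply stands in for a normalized CT slice:

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and the cumulative products alpha_bar_t = prod_i (1 - beta_i)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    return betas, np.cumprod(alphas)

def q_sample(x0, t, alpha_bar, rng=np.random.default_rng()):
    """Draw x_t ~ q(x_t | x_0) in closed form:
       x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

# Example: noise a (normalized) 300x300 placeholder "CT slice" to step t = 500.
x0 = np.zeros((300, 300))
betas, alpha_bar = make_alpha_bar()
x_t, eps = q_sample(x0, t=500, alpha_bar=alpha_bar)
```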

The same orderly method is applied in reverse during the denoising process, where we begin with the noisy image $\mathbf{x}_{T}$ and carefully reverse the sequence, progressively restoring the image's initial anatomical structure, ultimately recovering a denoised data sample $\mathbf{x}_{0}$ that follows the original structured distribution $q(\mathbf{x}_{0})$. To reverse the diffusion process, our goal is to derive $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$  based on the predetermined diffusion model $q(\mathbf{x}_{t}|\mathbf{x}_{t-1})$. However, directly calculating $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ from $q(\mathbf{x}_{t}|\mathbf{x}_{t-1})$ through Bayes' rule (which is $q(\mathbf{x}_{t}|\mathbf{x}_{t-1})\frac{q(\mathbf{x}_{t-1})}{q(\mathbf{x}_{t})}$) poses significant challenges due to the complexity of the computations involved. Specifically, this calculation demands the execution of a daunting task: evaluating the expression $q(\mathbf{x}_{t}|\mathbf{x}_{t-1})  \frac{{\mathbb{E}}_{q(\mathbf{x}_{0})} [q(\mathbf{x}_{t-1}|\mathbf{x}_{0})]}{{\mathbb{E}}_{q(\mathbf{x}_{0})} [q(\mathbf{x}_{t}|\mathbf{x}_{0})]}$. This requires integrating over all possible initial states $\mathbf{x}_{0}$ to find the expected values, a process that is computationally prohibitive.

Mimicking the strategies used in VAEs, we leverage a neural network to model the distribution $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \Sigma_\theta(\mathbf{x}_t, t))$, aiming to approximate $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ by tuning the neural network's parameters $\theta$. The final goal is to minimize a discrepancy, such as the KL divergence, between the model's distribution $p_\theta(\mathbf{x}_{0})$ and the data distribution $p_{data}(\mathbf{x}_{0})$ from which the training data are assumed to be drawn. Equivalently, the goal is to maximize the log-likelihood $\log p_\theta(\mathbf{x}_{0})=\log \Bbb E_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \left[ \frac{ p(\mathbf{x}_{T}) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{ q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\right]$, where $p(\mathbf{x}_{T})$ is the Gaussian prior on the fully noised state and $q(\mathbf{x}_{1:T}|\mathbf{x}_{0})=\prod_{t=1}^T q(\mathbf{x}_{t}|\mathbf{x}_{t-1})$. To make this objective tractable, we employ the Evidence Lower Bound (ELBO), as in VAEs, using Jensen's inequality $\log \mathbb{E} [\,\cdot\,] \ge \mathbb{E} [\log (\,\cdot\,)]$. Following algebraic manipulations similar to those used for VAEs, the objective reduces (apart from a reconstruction term and a prior-matching term that does not depend on $\theta$) to minimizing the terms $$\Bbb E_{q(\mathbf{x}_{t}|\mathbf{x}_0)} \left[\mathcal{D}_{KL}\left(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)\,\|\, p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)\right)\right],~ t=2,\cdots, T,$$ where $\mathcal{D}_{KL}$ denotes the Kullback-Leibler divergence. For a detailed exploration of the diffusion model, including algorithms, consider consulting the lecture notes in reference [6].
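In practice, with the Gaussian parameterization used by Ho et al. [5], each KL term above reduces (up to a weighting factor) to a simple mean-squared error between the true noise $\boldsymbol{\epsilon}$ and a noise-prediction network $\epsilon_\theta(\mathbf{x}_t, t)$. The following PyTorch-style sketch shows one training step under that simplification; `model`, the 4-D tensor shapes, and the optimizer are hypothetical placeholders:

```python
import torch

def ddpm_training_step(model, x0, alpha_bar, optimizer):
    """One stochastic-gradient step on the simplified (noise-prediction) objective
    L_simple = E_{t, x0, eps} || eps - eps_theta(x_t, t) ||^2  (cf. Ho et al. [5]).
    alpha_bar: 1-D tensor of cumulative products, e.g. torch.cumprod(1 - betas, dim=0)."""
    B = x0.shape[0]
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)               # random timestep per sample
    eps = torch.randn_like(x0)                                     # target noise
    a_bar = alpha_bar[t].view(B, 1, 1, 1)                          # broadcast over image dims
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps   # closed-form forward sample
    loss = ((eps - model(x_t, t)) ** 2).mean()                     # noise-prediction MSE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```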

This diffusion model can also be described using Stochastic Differential Equations (SDEs) that characterize both the forward (noising) process and the backward (denoising) process. The forward (noising) process is governed by the SDE $dx_t = g(t)\,dW_t$, where $dW_t$ denotes the differential of the Wiener process (Brownian motion), and $g(t)$ is the diffusion coefficient that controls the amount of noise added at each time instant. The full form of this SDE is $dx_t = f(x,t)\, dt + g(t)\,dW_t$; however, for simplicity of explanation, the drift term is set to zero, $f(x,t)=0$. The reverse (denoising) process is represented by the reverse-time SDE (Anderson, 1982) [7]: $dx_t = -g(t)^2 \nabla_x \log p(x_t)\, dt + g(t)\,d\bar{W}_t$, where the drift term $-g(t)^2 \nabla_x \log p(x_t)\, dt$ (note that $dt$ is negative when integrating backward in time) guides the noisy data points back toward regions of higher probability density, and $d\bar{W}_t$ is a reverse-time Wiener process, which injects stochasticity so that the generated samples retain variability consistent with the original distribution $p(x_0)$.
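A simple Euler–Maruyama discretization makes this reverse-time SDE concrete. In the sketch below (with the drift $f \equiv 0$ as above), `score_fn` is a placeholder for the learned score $\nabla_x \log p(x_t)$ and `g` is a user-supplied diffusion coefficient; both are assumptions for illustration:

```python
import numpy as np

def reverse_sde_euler_maruyama(x_T, score_fn, g, T=1.0, n_steps=1000,
                               rng=np.random.default_rng()):
    """Integrate the reverse-time SDE  dx = -g(t)^2 * score(x, t) dt + g(t) dW_bar
    from t = T down to t = 0 with an Euler-Maruyama scheme.
    score_fn(x, t) stands in for the (learned) score grad_x log p_t(x)."""
    dt = T / n_steps          # magnitude of the (negative) time step
    x = x_T.copy()
    for i in range(n_steps):
        t = T - i * dt
        z = rng.standard_normal(x.shape)
        # Stepping backward in time, the drift pushes x toward high-density regions.
        x = x + g(t) ** 2 * score_fn(x, t) * dt + g(t) * np.sqrt(dt) * z
    return x
```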

Score-Based Generative Models

When tackling highly ill-posed or underdetermined problems in medical imaging without access to paired data, it is essential to use a prior distribution $p(x)$ to constrain the search to the space of highly structured images (the data manifold), which may occupy only a tiny fraction of the entire space. Although the exact prior $p(x)$ cannot be known precisely, it can be approximated by a parametric model $p(x;\theta)$ represented by a neural network, with parameters $\theta$ learned from training data.

Efficient sampling from $p(x; \theta)$ using a well-designed generative model is a critical aspect of this approach. Given that directly learning the distribution  $ p(x;\theta)$ is often infeasible, focusing on learning the score function  $s(x;\theta)=\nabla_x \log p(x;\theta)$ is a more practical and tractable solution. (Why? Directly learning $ p(x; \theta)$ requires computing the normalization constant to ensure that the total probability sums to one. Specifically, if a neural network outputs a function $f(x;\theta)$, the probability distribution can be expressed as $p(x; \theta) = \frac{1}{Z(\theta)} e^{f(x; \theta)}$, where $Z(\theta)$ normalizes the distribution. Unfortunately, calculating $Z(\theta)$ can be computationally expensive, especially in high-dimensional spaces. The score function, on the other hand, bypasses this challenge because it involves only the gradient of the log-probability, $\nabla_x \log p(x; \theta) = \nabla_x f(x; \theta)$, eliminating the need for normalization and thereby simplifying the learning process. Additionally, in high-dimensional spaces, data points are extremely sparse, making it highly challenging to estimate $p(x)$ accurately. Modeling $p(x; \theta)$ necessitates a vast number of parameters, and learning $p(x; \theta)$ for all possible $x$ becomes computationally prohibitive. In contrast, the score function only requires learning the gradient of the log-density rather than the density itself. This approach is often more feasible because it focuses on the local structure of the data, capturing how the probability density varies in small regions around each data point.)
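A small toy example illustrates why the score bypasses $Z(\theta)$: if a network outputs $f(x;\theta)$, then $\nabla_x \log p(x;\theta) = \nabla_x f(x;\theta)$ can be obtained directly by automatic differentiation, without ever computing the normalization constant. The quadratic $f$ below is merely a stand-in for a neural network:

```python
import torch

def f(x):
    """Unnormalized log-density f(x; theta): here a toy quadratic "energy"
    standing in for a neural network output. Its normalizer Z(theta) never appears."""
    return -0.5 * (x ** 2).sum(dim=-1)

def score(x):
    """grad_x log p(x; theta) = grad_x f(x; theta), computed by autograd; no Z(theta) needed."""
    x = x.detach().requires_grad_(True)
    fx = f(x).sum()
    (grad,) = torch.autograd.grad(fx, x)
    return grad

x = torch.randn(4, 2)
print(score(x))   # for this toy f, the score is simply -x
```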

Diffusion models employ Langevin dynamics for denoising by leveraging the score function $s(x;\theta)$ as the force vector field acting on the particles (representing noisy images). The denoising process, which transitions from $x_{n-1}$ to $x_n$, follows the update rule $$x_n := x_{n-1} +\gamma_n \nabla \log p(x_{n-1};\theta) +\sqrt{\gamma_n}\, z_n, \qquad z_n\sim N(0,I).$$
This iterative process refines noisy images by reducing noise under the guidance of the learned score function, with its effectiveness depending on an appropriate choice of the step sizes $\gamma_n$. The iteration is closely related to the previously mentioned reverse-time SDE (again in the simplified form) $dx_t = -g(t)^2 \nabla_x \log p(x_t)\, dt + g(t)\,d\bar{W}_t$. Taking $dt\approx -1$ (since $dt$ represents a negative time step in the reverse process) and $t=n$, we can approximate the terms as follows: $dx\approx x_n-x_{n-1}$; $-g(t)^2 \nabla_x \log p(x_t)\, dt\approx \gamma_n \nabla_x \log p(x_{n-1};\theta)$ with $\gamma_n \approx g(n)^2$; and $g(t)\,d\bar{W}_t\approx \sqrt{\gamma_n}\, z_n$. This connection shows that the iterative update rule can be viewed as a discrete approximation of the continuous reverse-time SDE, where the learned score function plays the central role in guiding the denoising process. Because this is a reverse SDE, the indexing of the update rule might be confusing to a general reader. Within the context of the reverse diffusion process, it can be written as $$x_{t-\Delta t} := x_{t} +\gamma_t \nabla \log p(x_t;\theta) +\sqrt{\gamma_t}\, z_t, \qquad z_t\sim N(0,I).$$ For a conditional reverse-time SDE, an additional term $\nabla_x \log p(\text{"condition"}|x_t;\theta)$ is added to the drift to incorporate the conditioning information.
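The sketch below implements the Langevin-type update rule above for a decreasing step-size schedule; `score_fn` stands in for the learned score $s(x;\theta)$, and the schedule values are illustrative assumptions:

```python
import numpy as np

def langevin_sampler(x_init, score_fn, step_sizes, rng=np.random.default_rng()):
    """Langevin-type updates following the rule in the text:
        x_n = x_{n-1} + gamma_n * score(x_{n-1}) + sqrt(gamma_n) * z_n,   z_n ~ N(0, I).
    score_fn stands in for the learned score s(x; theta)."""
    x = x_init.copy()
    for gamma in step_sizes:
        z = rng.standard_normal(x.shape)
        x = x + gamma * score_fn(x) + np.sqrt(gamma) * z
    return x

# Hypothetical usage: start from pure noise with a decreasing step-size schedule.
# x_sample = langevin_sampler(np.random.randn(300, 300), score_fn,
#                             np.geomspace(1e-1, 1e-4, 500))
```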

Next, we will explain how to effectively learn the score function  $s(x;\theta)$ from the data. Given training data $\text{Data}_{\text{tr}}=\{x^{(1)}, \cdots,  x^{(K)}\}$ (a sample distribution of reference images), we try to match the score $s(x;\theta)$ to $\nabla \log p_{\text{tr-data}}$, where  $p_{\text{tr-data}}(x)=\frac{1}{K}\sum_k g_\sigma (x-x^{(k)})$ and $g_\sigma(x-x^{(k)})$ is a Gaussian kernel with variance $\sigma^2$ centered at $x^{(k)}$.
The score matching objective can be understood as minimizing the Fisher divergence between the data distribution $p_{\text{tr-data}}(x)$ and the model distribution. The expanded form of the score matching objective is: $$J_1(\theta)=\mathbb{E}_{p_{\text{tr-data}}} \left[ | \nabla_x \log p_{\text{tr-data}}(x) - \nabla_x \log p(x; \theta) |^2 \right].$$

Optimizing $J_1(\theta)$ with respect to $\theta$ is equivalent to optimizing the reduced form: $$J_2(\theta)=\mathbb{E}_{p_{\text{tr-data}}} \left[ | \nabla_x \log p(x; \theta)|^2 - 2 \nabla_x \log p_{\text{tr-data}}(x) \cdot \nabla_x \log p(x; \theta) \right]. $$
Expanding the cross term, we get: $$\mathbb{E}_{p_{\text{tr-data}}} \left[ \nabla_x \log p_{\text{tr-data}}(x) \cdot \nabla_x \log p(x; \theta) \right] = \int \nabla_x p_{\text{tr-data}}(x) \cdot \nabla_x \log p(x; \theta) \, dx$$
where we use the identity $p_{\text{tr-data}}(x)\, \nabla_x \log p_{\text{tr-data}}(x) =\nabla_x p_{\text{tr-data}}(x)$.
Using integration by parts (Gauss's divergence theorem) with vanishing boundary terms, we obtain: $$\int \nabla_x p_{\text{tr-data}}(x) \cdot \nabla_x \log p(x; \theta) \, dx = -\int p_{\text{tr-data}}(x) \, \nabla_x \cdot \nabla_x \log p(x; \theta) \, dx.$$
Therefore, the cross term simplifies to: $$\mathbb{E}_{p_{\text{tr-data}}} \left[ \nabla_x \log p_{\text{tr-data}}(x) \cdot \nabla_x \log p(x; \theta) \right] = -\mathbb{E}_{p_{\text{tr-data}}} \left[ \nabla_x^2 \log p(x; \theta) \right].$$
Substituting this back into the simplified objective, we get: $$J_2(\theta) = \mathbb{E}_{p_{\text{tr-data}}} \left[ | s(x; \theta)|^2 + 2 \nabla\cdot s(x; \theta) \right].$$

The term $\nabla\cdot s(x; \theta) = \operatorname{trace}\!\left(\nabla_x \nabla_x \log p(x; \theta)\right)$, the trace of the Hessian of the log-density, imposes a significant computational burden for medical images, where the dimensionality can reach millions of pixels. To achieve scalability, sliced score matching is used, replacing $|s(x; \theta)|^2 + 2 \nabla\cdot s(x; \theta)$ with $\frac{1}{n}\sum_{i=1}^n \left[ |v_i \cdot s(x;\theta)|^2 +2\, v_i^T \nabla s(x;\theta)\, v_i \right]$, where $\{v_i\}_{i=1}^n$ are random projection directions (e.g., standard Gaussian vectors).
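A sketch of this sliced objective is shown below, using automatic differentiation to obtain $v^T \nabla_x s(x;\theta)\, v$ from a single vector-Jacobian product per projection; `score_net` is a placeholder for the score network:

```python
import torch

def sliced_score_matching_loss(score_net, x, n_projections=1):
    """Monte-Carlo estimate of  E[ |v . s(x)|^2 + 2 v^T (grad_x s(x)) v ]  with random
    Gaussian directions v, matching the sliced objective in the text.
    score_net is a placeholder network mapping x (batch, dim) -> s(x; theta) (batch, dim)."""
    x = x.detach().requires_grad_(True)
    s = score_net(x)
    loss = 0.0
    for _ in range(n_projections):
        v = torch.randn_like(x)
        sv = (s * v).sum()                                  # sum over the batch of v . s(x)
        # v^T (grad_x s) v via one vector-Jacobian product; create_graph allows backprop to theta
        (grad_sv,) = torch.autograd.grad(sv, x, create_graph=True)
        per_sample = (s * v).sum(dim=-1) ** 2 + 2.0 * (grad_sv * v).sum(dim=-1)
        loss = loss + per_sample.mean()
    return loss / n_projections
```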

Alternatively, Denoising Score Matching (DSM) can be employed, which perturbs the data with Gaussian noise and then learns the score function of the perturbed distribution. Note that $\nabla_x \log g_\sigma(x-x^{(k)})= -\frac{ x-x^{(k)}}{\sigma^2}$, where $g_\sigma(x-x^{(k)})$ is a Gaussian kernel with variance $\sigma^2$ centered at $x^{(k)}$. Given data $\{x^{(1)}, \cdots, x^{(K)}\}$, the objective is to minimize the following loss function: $$\frac{1}{KM}\sum_{k=1}^K\sum_{m=1}^M \left| s(x^{(k)}+z_{k,m};\theta)+\frac{z_{k,m}}{\sigma^2}\right|^2$$
where $z_{k,m}\sim \mathcal{N}(0,\sigma^2 \mathbf{I})$ is the Gaussian noise added to the data points, $M$ is the number of noise samples per data point (which should be large for effective training), and $\sigma$ is the noise level, which must be chosen appropriately to ensure effective learning of the score function. This loss can be understood as an approximation of the Fisher divergence $$\frac{1}{K}\sum_{k=1}^K \int \left|s(x;\theta)-\nabla \log p_{\text{tr-data}}(x)\right|^2 g_\sigma(x-x^{(k)})\, dx.$$ Hence, DSM relies on the assumption that the data distribution $p_{\text{tr-data}}(x)$ can be approximated by a mixture of Gaussian kernels centered at the observed data points, such that $p_{\text{tr-data}}(x)=\frac{1}{K}\sum_k g_\sigma (x-x^{(k)})$. However, this approach faces challenges: score estimates obtained with a small \(\sigma\) are often inaccurate in regions where \( p_{\text{tr-data}} \) is very small, while a large \(\sigma\) may give imprecise guidance near the data points. To address this issue, Song et al. [1, 2] introduced a method that perturbs the data with Gaussian noise at multiple scales: \[ \frac{1}{KT} \sum_{t=1}^T \sum_{k=1}^K \lambda(\sigma_t) \int \left| s(x_t, \sigma_t; \theta) + \frac{x_t - x^{(k)}}{\sigma_t^2} \right|^2 g_{\sigma_t}(x_t - x^{(k)}) \, dx_t, \] where \(\lambda(\sigma_t)\) is a weighting function and \(\sigma_1 < \sigma_2 < \cdots < \sigma_T\) are different noise scales. This multi-scale approach balances the trade-off between accuracy near the data and robustness in low-density regions, improving the overall quality of the score estimate.
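A compact PyTorch-style sketch of this multi-scale DSM objective follows; `score_net`, the noise-scale ladder, and the weighting $\lambda(\sigma)=\sigma^2$ are illustrative assumptions (the $\sigma^2$ weighting is a common choice, not the only one):

```python
import torch

def multiscale_dsm_loss(score_net, x0, sigmas):
    """Denoising score matching over several noise scales (cf. [1, 2]).
    For each sigma, perturb x_t = x0 + sigma * z with z ~ N(0, I); the target score of
    the Gaussian kernel is  -(x_t - x0)/sigma^2 = -z/sigma.  The weighting
    lambda(sigma) = sigma^2 below is an assumed (common) choice."""
    loss = 0.0
    for sigma in sigmas:
        z = torch.randn_like(x0)
        x_t = x0 + sigma * z
        s = score_net(x_t, sigma)                  # s(x_t, sigma; theta)
        # lambda(sigma) * || s + z/sigma ||^2  ==  || sigma * s + z ||^2
        loss = loss + ((sigma * s + z) ** 2).flatten(1).sum(dim=1).mean()
    return loss / len(sigmas)

# Hypothetical usage with a geometric ladder of noise levels:
# sigmas = torch.logspace(-2, 0, steps=10)    # sigma_1 < ... < sigma_T
# loss = multiscale_dsm_loss(score_net, batch_of_images, sigmas)
```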

For high-dimensional medical images, the computational cost of DSM can be significant, and Gaussian smoothing in the context of sparse data may result in poor score function estimation, potentially failing to capture the local structure accurately. When addressing image restoration problems in degraded medical images, the noise and artifacts are often complex and vary between patients. Consequently, diffusion models that rely on a score function derived from Gaussian assumptions associated with sparse training data may result in suboptimal performance.

For diffusion models to perform well in the field of medical imaging, a diverse and representative dataset is essential. However, achieving this is challenging in clinical practice due to various constraints, including limitations in data collection (including ethical and privacy concerns), patient variability (encompassing anatomical and pathological differences), and the inherent nature of medical imaging (such as variability in image quality influenced by both patients and operators), among other factors. 

References

[1]"Generative modeling by estimating gradients of the data distribution."  Song Yang, and Stefano Ermon,  Advances in neural information processing systems 32 (2019). 
[2] Score-Based Generative Modeling through Stochastic Differential Equations (2021) Y Song, J Sohl-Dickstein, DP Kingma, A Kumar, S Ermon, B Poole, International Conference on Learning Representations 
[3] Foundation for Diffusion Model Developments: Jascha Sohl-Dickstein et al. ("Deep Unsupervised Learning using Nonequilibrium Thermodynamics," ICML 2015); Max Welling & Yee Whye Teh ("Bayesian Learning via Stochastic Gradient Langevin Dynamics," ICML 2011)
[4] Stable Diffusion Models through Score-Based and Gradient Estimation Techniques: Yang Song & Stefano Ermon on "Generative Modeling by Estimating Gradients of the Data Distribution" (NeurIPS 2019) and "Improved Techniques for Training Score-Based Generative Models" (NeurIPS 2020) 
[5] Denoising Diffusion Probabilistic Models: Ho et al. (arXiv preprint arXiv:2006.11239, 2020); Song et al. ("Denoising Diffusion Implicit Models," arXiv:2010.02502, 2020); and  Dhariwal & Nichol ("Improved Denoising Diffusion Probabilistic Models," arXiv:2102.09672, 2021)
[6] Strümke, Inga, and Helge Langseth. "Lecture Notes in Probabilistic Diffusion Models." arXiv preprint arXiv:2312.10393 (2023).
[7] Anderson, Brian DO. "Reverse-time diffusion equation models." Stochastic Processes and their Applications 12.3 (1982)



