Exploring the Fundamentals of Diffusion Models
This blog post explores diffusion models, which have recently become popular. Since many excellent explanations are already available, I don't intend to duplicate their detailed discussions. Instead, I offer a succinct summary that highlights the essential principles and foundational concepts behind diffusion models.
Diffusion models belong to the family of generative neural network models, alongside Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). The term "diffusion" originates from non-equilibrium thermodynamics, where particles migrate from regions of high concentration to regions of low concentration. In the image setting, the high-concentration region corresponds to the data manifold filled with structured images, while the low-concentration region corresponds to the rest of the space, which is overwhelmingly dominated by noisy images. This analogy illustrates how diffusion models iteratively refine noise into coherent, structured data by effectively reversing this natural diffusion process.
To make the concept of the diffusion process easier to grasp, let's use head CT scans as an example. Imagine that each of these scans is made up of 256 shades of gray and is sized at $300\times 300$ pixels. In this context, each image can be thought of as a single point, denoted as $\mathbf{x}=(x_1, \ldots, x_{300^2})$, within a vast discrete set $\mathbb{V} = \{0, \ldots, 255\}^{300^2}$. Here, $\mathbb{V}$ represents all possible images that could be formed by any combination of grayscale intensities across the $300^2$ pixels, where each $x_j$ (the coordinate along the $j$-th axis) corresponds to the intensity of the $j$-th pixel.
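To make this viewpoint concrete, here is a small, hypothetical NumPy sketch that treats a $300\times 300$ grayscale image as a single point of $\mathbb{V}$. The array `scan` is a randomly generated stand-in, not a real CT image.

```python
import numpy as np

# A stand-in for a 300x300 head CT slice with 256 gray levels (purely illustrative).
scan = np.random.randint(0, 256, size=(300, 300), dtype=np.uint8)

# Flatten the image into a single point x = (x_1, ..., x_{300^2}) of V.
x = scan.flatten()

# Each coordinate x_j is the intensity of the j-th pixel.
print(x.shape)           # (90000,)
print(x.min(), x.max())  # values lie in {0, ..., 255}
```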
The sheer size of $\mathbb{V}$ is astronomical: it contains $256^{300^2}$ potential points, a number far exceeding the estimated number of atoms in the universe. Head CT images are not random assortments of pixels; they are highly structured configurations representing meaningful and specific anatomical information. Therefore, the entire set of possible head CT images, referred to as the data manifold and denoted by $\mathbb{M}$, occupies only an extremely tiny portion of $\mathbb{V}$. From this perspective, the diffusion process disperses these structured, highly concentrated points (representing head CT images) across the broader, mostly uncharted territory of $\mathbb{V}$, which is predominantly filled with random, meaningless noise.

Diffusion models share some conceptual similarities with VAEs, yet they fundamentally differ in their operating principles and mechanisms. VAEs compress input data into a dense latent space and then expand this compressed representation back into the original data form; this operation relies on the reparameterization trick, which introduces variability by injecting noise into the latent space. Diffusion models, in contrast, start with structured input data and systematically introduce noise, emulating the gradual process of natural diffusion until a predetermined noise level is reached. The process is then inverted: the model removes noise through a series of denoising steps, and this stage-by-stage denoising strips away the layers of added noise, progressively revealing the data's original structure.
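For reference, the reparameterization trick mentioned above can be sketched in a few lines. The snippet below is a minimal illustration, assuming an encoder has already produced a mean `mu` and log-variance `log_var` for each latent coordinate; in a real VAE these would come from a learned network inside an autodiff framework so that gradients can flow through the sampling step.

```python
import numpy as np

def reparameterize(mu: np.ndarray, log_var: np.ndarray,
                   rng: np.random.Generator = np.random.default_rng()) -> np.ndarray:
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    Writing the sample this way moves the randomness into eps, which is what
    lets gradients flow through mu and log_var in an autodiff framework.
    """
    std = np.exp(0.5 * log_var)          # sigma = exp(log_var / 2)
    eps = rng.standard_normal(mu.shape)  # eps ~ N(0, I)
    return mu + std * eps

# Hypothetical encoder outputs for a batch of 8 images and a 32-dimensional latent space.
mu = np.zeros((8, 32))
log_var = np.zeros((8, 32))
z = reparameterize(mu, log_var)          # z has the same shape as mu
```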
A notable distinction between the two lies in their treatment of latent variables. Unlike VAEs, where the latent space is typically lower-dimensional and abstract, diffusion models use latent variables with the same dimensionality as the original data. Moreover, in diffusion models the encoding process is predefined and fixed, and only the decoding process is learned, whereas in VAEs the encoding and decoding processes are learned simultaneously, allowing a joint optimization of the two.
To describe the diffusion model mathematically, let's start with $\mathbf{x}_0$, the initial, structured image before any noise is added; its distribution is denoted $q(\mathbf{x}_0)$. The model then engages in a forward diffusion process, generating a progressively less structured sequence of states $\mathbf{x}_1, \ldots, \mathbf{x}_T$ by incrementally adding noise. This process is described by a Markovian transition $$q(\mathbf{x}_{t} \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I}\right),$$ where $\mathbf{x}_t$ is the state of the data at time step $t$, $\beta_t$ is the noise variance at time step $t$, and $\mathbf{I}$ denotes the identity matrix. Here, $\mathcal{N}(\mathbf{x}; \mu, \Sigma)$ denotes the Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$, defined as $$\mathcal{N}(\mathbf{x};\mu,\Sigma)=\frac{c}{\sqrt{\det(\Sigma)}}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu)\right),$$ where $c$ is the normalization constant and the superscript $T$ denotes the transpose. Equivalently, the update rule $\mathbf{x}_t = \sqrt{1-\beta_t}\,\mathbf{x}_{t-1} + \sqrt{\beta_t}\,\mathbf{z}_{t}$ for $t = 1, \ldots, T$, with $\mathbf{z}_{t} \sim \mathcal{N}(0, \mathbf{I})$, spells out how noise is systematically introduced to the data, transforming the original structured image $\mathbf{x}_{0}$ into a sequence that becomes increasingly disordered as the process unfolds. Note that $\mathbf{x}_{t}$ can be sampled directly from $\mathbf{x}_{0}$: a direct calculation shows that $\mathbf{x}_t = \sqrt{\overline{\alpha}_t}\,\mathbf{x}_{0} + \sqrt{1-\overline{\alpha}_t}\,\boldsymbol{\epsilon}$, with $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})$ and $\overline{\alpha}_t=\prod_{i=1}^t \underbrace{(1-\beta_i)}_{\alpha_i}$.
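As a minimal sketch of the forward process just described, the snippet below assumes a linear schedule for $\beta_1, \ldots, \beta_T$ (a common but by no means mandatory choice, not specified in the text) and an image rescaled to $[-1, 1]$; it samples $\mathbf{x}_t$ directly from $\mathbf{x}_0$ using the closed-form expression above.

```python
import numpy as np

# A linear noise schedule beta_1, ..., beta_T (an assumed, commonly used choice).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{i<=t} alpha_i

def q_sample(x0: np.ndarray, t: int,
             rng: np.random.Generator = np.random.default_rng()) -> np.ndarray:
    """Sample x_t directly from x_0 via
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]        # schedule array is 0-indexed; t runs from 1 to T
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

# Example: x0 stands in for an image rescaled to [-1, 1]; x_T is close to pure Gaussian noise.
x0 = np.random.uniform(-1.0, 1.0, size=(300, 300))
x_mid = q_sample(x0, t=500)
x_T = q_sample(x0, t=T)
```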