Synthetic Paired Data Generation for Medical Imaging: Bridging the Gap Toward Faithfully Reproducing Patient-Dependent Conditional Structure

The performance of supervised learning in digital medical imaging modalities such as ultrasound and low-dose CBCT depends critically on the availability of paired datasets. These datasets must capture variability across patients, anatomical structures, and disease presentations, while providing accurate and consistent labels aligned with the measured images. Diagnostic tasks—including segmentation and detection—are particularly dependent on such paired data, requiring reliable annotations such as lesion localization, bounding regions, and clinically meaningful diagnostic labels. Consequently, robust model training requires large-scale datasets with high-quality annotations spanning diverse patient populations.

However, in real clinical settings, such high-quality paired datasets are often unavailable due to the limited representation of abnormal cases, the absence of ground truth, inter-observer variability in annotations, patient-specific image heterogeneity, and the inherent variability in imaging physics and acquisition conditions. As a result, the true joint distribution $p_{\mathrm{real}}(x,y)$ is only sparsely observed, where $x \in \mathcal{X}$ denotes a clinical image and $y \in \mathcal{Y}$ denotes its corresponding label.

Existing approaches to mitigate this limitation include synthetic paired data generation, image enhancement, and self-supervised learning. Image enhancement methods attempt to transform noisy or artifact-corrupted clinical images into cleaner representations, thereby facilitating downstream labeling. Physics-based approaches generate structurally consistent image pairs but often lack photorealism, while generative models improve visual realism but may fail to preserve precise anatomical correspondence. Self-supervised learning leverages intrinsic structures such as redundancy, multi-view consistency, or known forward operators to learn representations without paired supervision, yet it often struggles to bridge the domain gap between proxy tasks and real clinical data.

In this blog, we focus on synthetic paired data generation as a principled approach to constructing surrogate training data. The problem can be viewed as a domain gap under constrained supervision: although the forward imaging physics is known or partially known, the clinical data distribution remains complex, biased, and largely unpaired. The supervised learning objective is \[\min_\theta \; \mathbb{E}_{(x,y)\sim p_{\mathrm{real}}(x,y)}\big[\ell(f_\theta(x),y)\big],\] but in practice only limited samples from $p_{\mathrm{real}}(x,y)$ are available.

To address this, synthetic pairs $(\tilde{x},\tilde{y}) \sim p_{\mathrm{syn}}(x,y)$ are generated with the goal of approximating the real joint distribution, $p_{\mathrm{syn}}(x,y) \approx p_{\mathrm{real}}(x,y)$.

Synthetic data are constructed from a latent anatomical variable $z \in \mathcal{Z}$ encoding structure and pathology, via \[\tilde{x} = \mathrm{Fwd}_\phi(z,\eta_{\mathrm{noise}}), \qquad \tilde{y} = T_{\mathrm{label}}(z),\] where $\phi$ denotes acquisition parameters and $\eta_{\mathrm{noise}}$ models noise and artifacts. Because the image and the label are both derived from the same $z$, structural consistency between them holds by construction. The resulting synthetic distribution is \[p_{\mathrm{syn}}(x,y) =\int p(x\mid z,\phi,\eta_{\mathrm{noise}})\, p(y\mid z)\, p(z,\phi,\eta_{\mathrm{noise}})\, dz\, d\phi\, d\eta_{\mathrm{noise}}.\]
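The generation scheme above can be sketched in a toy setting. Here the latent anatomy $z$ is a 2D phantom with a randomly placed lesion, $\mathrm{Fwd}_\phi$ is a simple blur-plus-noise acquisition model, and $T_{\mathrm{label}}$ reads the lesion mask directly from $z$. All function names and parameter values (`sample_latent_anatomy`, lesion intensity, noise level) are hypothetical stand-ins, not a real imaging simulator.

```python
import numpy as np

def sample_latent_anatomy(rng, size=64):
    """Toy latent anatomy z: a soft-tissue background plus a circular lesion.

    Returns the anatomy image and the lesion mask (the source of the label).
    """
    z = rng.normal(0.5, 0.05, (size, size))
    cx, cy = rng.integers(16, 48, 2)
    r = int(rng.integers(4, 10))
    yy, xx = np.ogrid[:size, :size]
    lesion = (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2
    z[lesion] += 0.4  # lesion has elevated intensity
    return z, lesion

def forward_model(z, rng, noise_std=0.05):
    """Fwd_phi(z, eta): acquisition blur (phi) plus additive noise (eta)."""
    k = 1
    padded = np.pad(z, k, mode="edge")
    # 3x3 box blur as a crude stand-in for the system point-spread function
    blurred = sum(
        padded[k + dy: k + dy + z.shape[0], k + dx: k + dx + z.shape[1]]
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
    ) / 9.0
    return blurred + rng.normal(0.0, noise_std, z.shape)

rng = np.random.default_rng(0)
z, lesion_mask = sample_latent_anatomy(rng)
x_tilde = forward_model(z, rng)            # synthetic image
y_tilde = lesion_mask.astype(np.uint8)     # label T_label(z) from the same z
```

Since both `x_tilde` and `y_tilde` descend from the same `z`, the label remains pixel-aligned with the image regardless of how much noise the forward model injects.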

Training is then formulated as a hybrid objective combining synthetic supervision with regularization on real data, \[\min_\theta \;\lambda_{\mathrm{syn}} \, \mathbb{E}_{(\tilde{x},\tilde{y})\sim p_{\mathrm{syn}}}\big[\ell(f_\theta(\tilde{x}),\tilde{y})\big]+\lambda_{\mathrm{real}} \, \mathbb{E}_{x\sim p_{\mathrm{real}}}\big[R_{\mathrm{eg}}(x;\theta)\big].\]
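A minimal sketch of the hybrid objective, assuming a toy linear predictor $f_\theta$ and one particular choice of regularizer $R_{\mathrm{eg}}$ (penalizing output variance on unlabeled real data); both choices are illustrative, not prescribed by the formulation.

```python
import numpy as np

def hybrid_loss(theta, syn_x, syn_y, real_x, lam_syn=1.0, lam_real=0.1):
    """Hybrid objective: supervised loss on synthetic pairs plus an
    unsupervised regularizer on unlabeled real images.

    f_theta is a hypothetical linear scorer; R_eg here penalizes the
    variance of predictions on real data (one of many possible choices).
    """
    def f(x):
        return x @ theta                       # toy predictor f_theta
    syn_term = np.mean((f(syn_x) - syn_y) ** 2)  # ell on synthetic pairs
    real_term = np.var(f(real_x))                # R_eg on real images
    return lam_syn * syn_term + lam_real * real_term

rng = np.random.default_rng(1)
theta_true = np.array([1.0, -2.0])
syn_x = rng.normal(size=(50, 2))
syn_y = syn_x @ theta_true                   # synthetic labels consistent with theta_true
real_x = rng.normal(size=(30, 2))
loss_good = hybrid_loss(theta_true, syn_x, syn_y, real_x)
loss_bad = hybrid_loss(np.zeros(2), syn_x, syn_y, real_x)
```

The weights $\lambda_{\mathrm{syn}}$ and $\lambda_{\mathrm{real}}$ trade off fidelity to synthetic supervision against behavior on the real distribution; in practice they are tuned on a small labeled validation set.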

To explicitly address the domain gap, the predictor is decomposed as $f_\theta = h_\theta \circ f^{(\mathrm{inter})}_\theta$, and the objective is extended with a feature-level alignment term, taken in expectation over synthetic and real samples, \[\min_\theta \;\lambda_{\mathrm{syn}} \, \mathbb{E}_{(\tilde{x},\tilde{y})\sim p_{\mathrm{syn}}}\big[\ell(h_\theta(f^{(\mathrm{inter})}_\theta(\tilde{x})), \tilde{y})\big]+\lambda_{\mathrm{real}} \, \mathbb{E}_{x\sim p_{\mathrm{real}}}\big[R_{\mathrm{eg}}(x;\theta)\big]+\lambda_{\mathrm{gap}} \, \mathbb{E}_{\tilde{x}\sim p_{\mathrm{syn}},\, x\sim p_{\mathrm{real}}}\Big[D_{\mathrm{dist}}\!\left(f^{(\mathrm{inter})}_\theta(\tilde{x}),f^{(\mathrm{inter})}_\theta(x)\right)\Big].\]
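The alignment term $D_{\mathrm{dist}}$ can be instantiated in many ways (full MMD, adversarial discriminators, optimal transport). A minimal sketch, assuming the simplest first-moment variant: the squared distance between batch means of the intermediate features.

```python
import numpy as np

def feature_alignment_penalty(feat_syn, feat_real):
    """D_dist between intermediate features of a synthetic batch and a
    real batch: squared distance of the batch means (a first-moment
    simplification of MMD). Both inputs are (batch, feature_dim) arrays
    produced by f^(inter)_theta.
    """
    diff = feat_syn.mean(axis=0) - feat_real.mean(axis=0)
    return float(diff @ diff)

rng = np.random.default_rng(4)
feat_real = rng.normal(0.0, 1.0, (128, 16))
feat_syn_aligned = rng.normal(0.0, 1.0, (128, 16))   # same feature distribution
feat_syn_shifted = feat_syn_aligned + 2.0            # systematic domain shift
d_aligned = feature_alignment_penalty(feat_syn_aligned, feat_real)
d_shifted = feature_alignment_penalty(feat_syn_shifted, feat_real)
```

Minimizing this penalty pushes the encoder $f^{(\mathrm{inter})}_\theta$ to map synthetic and real inputs into overlapping feature regions, so the head $h_\theta$ trained on synthetic labels transfers to real images.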

Despite this formulation, generating synthetic pairs that faithfully reflect clinical reality remains challenging. The key difficulty lies in modeling variability across heterogeneous patient populations. While synthetic pipelines assume $z \sim p(z)$, the true distribution is multi-modal and population-dependent, better described as $p_{\mathrm{real}}(z) = \sum_k \pi_k p_k(z)$. Approximating this distribution with a single model leads to loss of rare but clinically important cases and unrealistic interpolation across anatomies.
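The mode-loss failure can be made concrete with a toy 1D latent. Below, $p_{\mathrm{real}}(z)$ is a two-component mixture with a rare pathology mode, and the single-model approximation is a Gaussian matched to the overall mean and variance; the mixture weights and component locations are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# population latent: common anatomy (95%) plus a rare pathology mode (5%)
pis = np.array([0.95, 0.05])
means = np.array([0.0, 4.0])
n = 10_000
comp = rng.choice(2, size=n, p=pis)
z_real = rng.normal(means[comp], 1.0)

# single-mode approximation: matches overall mean and variance,
# but places far less mass near the rare pathology mode
z_syn = rng.normal(z_real.mean(), z_real.std(), size=n)

rare_real = np.mean(z_real > 3.0)   # fraction of rare-mode latents
rare_syn = np.mean(z_syn > 3.0)
```

Even though the first two moments agree, the unimodal model under-represents the rare mode (here `rare_syn` falls well below `rare_real`), which in a clinical pipeline translates directly into missing rare but diagnostically critical cases.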

Furthermore, the conditional relationships $p(y \mid z)$ and $p(x \mid z)$ are themselves patient-dependent. Diagnostic labels depend on demographic and clinical context, and image formation varies with patient-specific factors such as body composition and pathology. Consequently, the real data distribution is more accurately described as \[p_{\mathrm{real}}(x,y)=\int p(x \mid z,\phi,\text{patient}) \, p(y \mid z,\text{patient}) \, p(z,\text{patient}) \, dz,\] whereas synthetic models approximate\[p_{\mathrm{syn}}(x,y)=\int p(x \mid z,\phi) \, p(y \mid z) \, p(z) \, dz.\]
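The patient-dependence of both conditionals can be sketched with a toy model. Here a single covariate (BMI) modulates both the image likelihood $p(x \mid z,\text{patient})$, via acquisition noise, and the label rule $p(y \mid z,\text{patient})$, via a shifted diagnostic threshold; the specific functional forms and coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_pair(patient_bmi, rng):
    """Patient-conditioned pair: the same latent lesion size z yields a
    different image and a different label depending on the patient.

    Toy assumptions: higher BMI increases acoustic attenuation (noisier
    measurement) and raises the size threshold used for the diagnosis.
    """
    z = rng.gamma(2.0, 2.0)                  # latent lesion size (mm)
    noise_std = 0.1 + 0.02 * patient_bmi     # p(x | z, patient)
    x = z + rng.normal(0.0, noise_std)       # measured size in the image
    threshold = 5.0 + 0.05 * patient_bmi     # p(y | z, patient)
    y = int(z > threshold)                   # diagnostic label
    return x, y

pairs_low = [sample_pair(20.0, rng) for _ in range(5000)]
pairs_high = [sample_pair(40.0, rng) for _ in range(5000)]
rate_low = float(np.mean([y for _, y in pairs_low]))
rate_high = float(np.mean([y for _, y in pairs_high]))
```

A synthetic pipeline that marginalizes the patient away would generate one positive rate for all patients, whereas the patient-conditioned model produces systematically different label rates (and noise levels) across subpopulations; this is precisely the conditional structure the marginalized model fails to capture.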

The omission of patient-dependent conditioning constitutes the fundamental source of the domain gap. Therefore, the central challenge of synthetic paired data generation is not merely to produce realistic images, but to faithfully reproduce the full conditional structure induced by heterogeneous patient populations.
