Fetal ultrasound interpretation is fundamentally constrained by multi-view heterogeneity, complex fetal anatomy, and a wide spectrum of developmental abnormalities, which together hinder the development of robust vision–language models for downstream diagnostic and report-generation tasks. Moreover, labeling fetal ultrasound data requires specialized obstetric expertise and careful cross-view correlation, making large-scale annotation both time-consuming and resource-intensive. An important research objective is therefore to leverage existing multi-center clinical data effectively by uncovering latent associations among views, diseases, and textual findings, thereby improving learning efficiency and generalization across gestational stages.
The generalized version of our obstetric ultrasound report template, established with reference to multiple international clinical guidelines. It provides a consistent and clinically grounded format for training and evaluating deep learning systems.
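For illustration, below is a minimal sketch of how such a standardized report template could be represented as a structured object for training and evaluation pipelines. The section and field names (e.g., BiometrySection, FetalUltrasoundReport) are assumptions made for demonstration and do not reproduce the exact schema released with FetalMind.

```python
# Hypothetical schema for a structured fetal ultrasound report.
# Field names are illustrative only, not the paper's exact template.
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class BiometrySection:
    biparietal_diameter_mm: Optional[float] = None
    head_circumference_mm: Optional[float] = None
    abdominal_circumference_mm: Optional[float] = None
    femur_length_mm: Optional[float] = None

@dataclass
class AnatomySection:
    # Free-text findings per anatomical view, keyed by view name.
    findings: Dict[str, str] = field(default_factory=dict)

@dataclass
class FetalUltrasoundReport:
    gestational_age_weeks: Optional[float] = None
    biometry: BiometrySection = field(default_factory=BiometrySection)
    anatomy: AnatomySection = field(default_factory=AnatomySection)
    impression: str = ""  # Overall diagnostic impression.

# Example: a report skeleton that a generation model could fill in.
report = FetalUltrasoundReport(
    gestational_age_weeks=22.0,
    impression="No structural abnormality detected.",
)
report.anatomy.findings["four-chamber cardiac view"] = "Normal appearance."
print(report)
```

A fixed schema like this makes it straightforward to compare model outputs field by field across centers and gestational stages, rather than matching free-form text.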
Illustration of a FetalMind vs. GPT-5 case study (Case 127858). The correct answer is skeletal dysplasia. GPT-5 misclassified the case as normal, whereas FetalMind correctly identified skeletal dysplasia by integrating multi-view structures and blood-flow features.
@misc{he2025fetalmind,
title={Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation},
author={Xiao He and Huangxuan Zhao and Guojia Wan and Wei Zhou and Yanxing Liu and Juhua Liu and Yongchao Xu and Yong Luo and Dacheng Tao and Bo Du},
year={2025},
eprint={2510.12953},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.12953},
}