FetalMind: Epistemic-aware Vision–Language Foundation Model for Fetal Ultrasound Interpretation

ArXiv 2025

Xiao He1, Huangxuan Zhao†1, Guojia Wan1, Yanxing Liu1, Juhua Liu1, Yongchao Xu1, Yong Luo1, Dacheng Tao2, Bo Du†1

1Wuhan University, 2Nanyang Technological University



A fetal vision–language foundation model, FetalMind, pretrained on multi-center fetal ultrasound data, integrating structured clinical knowledge and reinforcement learning for report generation and diagnostic reasoning across diverse gestational stages.

Performance road.

Motivations

Fetal ultrasound interpretation is constrained by multi-view heterogeneity, complex fetal anatomy, and a wide spectrum of developmental abnormalities, which together hinder the development of robust vision–language models for downstream diagnosis and report generation. Moreover, labeling fetal ultrasound data requires specialized obstetric expertise and careful cross-view correlation, making large-scale annotation both time-consuming and resource-intensive. An important research objective is therefore to make effective use of existing multi-center clinical data by uncovering latent associations among views, diseases, and textual findings, improving learning efficiency and generalization across gestational stages.

Obstetric Ultrasound Report (Deep Learning Version)

A generalized version of our obstetric ultrasound report template, adapted for deep learning and established with reference to multiple international clinical guidelines. It provides a consistent, clinically grounded format for training and evaluating deep learning systems.

FetalMind Model

Illustration of a FetalMind and GPT-5 case study (Case 127858). The correct answer is skeletal dysplasia. GPT-5 misclassified the case as normal, while FetalMind correctly identified skeletal dysplasia by integrating multi-view structural and blood flow features.

Citation

Please consider citing us if you find our dataset or model useful.

      @misc{he2025fetalmind,
        title={Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation}, 
        author={Xiao He and Huangxuan Zhao and Guojia Wan and Wei Zhou and Yanxing Liu and Juhua Liu and Yongchao Xu and Yong Luo and Dacheng Tao and Bo Du},
        year={2025},
        eprint={2510.12953},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2510.12953}, 
        }