Early Accepted · MICCAI 2026 Top 9%
EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory
¹ The Hong Kong University of Science and Technology (Guangzhou) · ² Third Affiliated Hospital of Sun Yat-Sen University · ³ Hong Kong Metropolitan University
* Corresponding author: Lei Zhu (leizhu@hkust-gz.edu.cn).
TL;DR — EchoPilot turns one click and one category name into stable ultrasound video masks, without training, task-specific fine-tuning, or dense first-frame annotations.
Abstract
Ultrasound video segmentation is clinically valuable yet difficult due to speckle noise, weak boundaries, and rapid anatomical deformation. Recent promptable foundation models enable point-guided segmentation, but direct deployment in ultrasound remains unreliable: a single point provides insufficient spatial context to resolve scale ambiguity, and greedy memory updates amplify early errors into severe temporal drift.
We present EchoPilot, a training-free framework for ultrasound video segmentation under sparse first-frame interaction, requiring only a single point click and an anatomical category name. EchoPilot mimics clinical global-to-local reasoning by orchestrating frozen medical vision-language, vision foundation, and video segmentation priors. It resolves initialization ambiguity with Scale-Space Semantic Prompting and a parameter-free S.E.E.D. criterion, then suppresses temporal drift with Reliability-Gated Memory. We also contribute a dynamic fetal placenta ultrasound VOS dataset with 671 annotated frames.
Method Overview
Stage I selects a semantic context scale with BioMedCLIP and S.E.E.D., then uses DINOv3 features to synthesize auxiliary point prompts. Stage II uses a reliability gate to decide whether each predicted frame should be written into the SAM2 or MedSAM2 memory bank.
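The Stage I idea of viewing the clicked target at several context scales can be sketched as follows. This is a minimal illustration under stated assumptions, not the EchoPilot implementation: the names `scale_space_boxes` and `select_scale`, the scale values, and the `score_fn` callback are all hypothetical. In a real pipeline `score_fn` would presumably be an image-text similarity between the crop and the category name (for example from BioMedCLIP). The plain argmax below only stands in for the S.E.E.D. criterion, whose definition is not given on this page.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), clipped to the image


def scale_space_boxes(
    click: Tuple[int, int],
    image_size: Tuple[int, int],
    scales: Sequence[float] = (0.2, 0.4, 0.6, 0.8, 1.0),  # hypothetical values
) -> List[Box]:
    """Return one crop box per scale, centred on the click and clipped to the image."""
    cx, cy = click
    width, height = image_size
    boxes = []
    for s in scales:
        half_w = max(1, round(s * width / 2))
        half_h = max(1, round(s * height / 2))
        boxes.append((
            max(0, cx - half_w),
            max(0, cy - half_h),
            min(width, cx + half_w),
            min(height, cy + half_h),
        ))
    return boxes


def select_scale(
    boxes: Sequence[Box],
    score_fn: Callable[[Box], float],
) -> Tuple[Box, List[float]]:
    """Score every candidate crop and return the best one plus all scores.

    Placeholder selection rule (argmax); NOT the S.E.E.D. criterion.
    """
    scores = [score_fn(b) for b in boxes]
    best = max(range(len(boxes)), key=scores.__getitem__)
    return boxes[best], scores
```

Any callable that maps a crop box to a scalar can be plugged in as `score_fn`, which keeps the crop generation independent of whichever vision-language model supplies the score.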
Training-Free
All priors are frozen at inference time.
Sparse Interaction
Only one first-frame point and a category name are required.
Scale-Aware Initialization
S.E.E.D. chooses the view where the target is recognizable.
Drift-Resistant Tracking
Unreliable frames are prevented from contaminating memory.
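The drift-resistant tracking idea, writing a frame into memory only when its prediction is judged reliable, can be sketched with a small gated buffer. This is an illustrative sketch under assumptions: the class name `GatedMemory`, the scalar `reliability` score, the "admit if score ≥ τ" rule, and the capacity are hypothetical, and the actual SAM2 or MedSAM2 memory bank interface is not modelled here.

```python
from collections import deque
from typing import Any


class GatedMemory:
    """Bounded FIFO memory that admits a frame only if it passes a reliability gate."""

    def __init__(self, tau: float = 0.5, capacity: int = 7) -> None:
        self.tau = tau                         # gate threshold (assumed semantics)
        self.frames = deque(maxlen=capacity)   # oldest entries are evicted first
        self.rejected = 0                      # number of frames kept out of memory

    def update(self, frame_idx: int, mask: Any, reliability: float) -> bool:
        """Store (frame_idx, mask) if reliable enough; return whether it was stored."""
        if reliability >= self.tau:
            self.frames.append((frame_idx, mask))
            return True
        self.rejected += 1
        return False
```

With this design a single poor prediction is simply skipped, so later frames keep conditioning on the most recent reliable masks rather than on the error.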
Main Results
EchoPilot is evaluated on CAMUS, Breast Lesion, and fetal Placenta ultrasound videos. It improves Dice and ASD over the corresponding base model on all three datasets under both SAM2 and MedSAM2 pretrained weights. With MedSAM2 weights, it also surpasses the finetuned MedSAM3 baseline in Dice and ASD on every dataset, despite using no task-specific training.
| Method | Venue | Prompt | CAMUS Dice ↑ | CAMUS ASD ↓ | CAMUS F ↑ | Breast Lesion Dice ↑ | Breast Lesion ASD ↓ | Breast Lesion F ↑ | Placenta Dice ↑ | Placenta ASD ↓ | Placenta F ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MedSAM3 | arXiv'25 | P+T | 67.15 | 13.98 | 15.24 | 56.93 | 43.31 | 16.58 | 20.69 | 140.16 | 5.39 |
| Pretrained Weights: SAM2 | | | | | | | | | | | |
| SAM2 | ICLR'25 | P | 28.41 | 56.84 | 0.80 | 56.00 | 56.54 | 19.83 | 16.74 | 267.24 | 0.89 |
| MA-SAM2 | MICCAI'25 | P | 28.43 | 56.81 | 0.81 | 55.76 | 56.61 | 19.67 | 16.75 | 267.26 | 0.90 |
| SAM2Long | ICCV'25 | P | 28.44 | 56.65 | 0.80 | 55.41 | 54.96 | 19.56 | 16.76 | 268.99 | 0.89 |
| EchoPilot | — | P+T | 34.09 | 28.86 | 3.21 | 63.38 | 22.98 | 21.35 | 39.74 | 84.96 | 5.92 |
| Pretrained Weights: MedSAM2 | | | | | | | | | | | |
| MedSAM2 | arXiv'25 | P | 90.83 | 2.29 | 77.34 | 61.24 | 86.12 | 23.63 | 33.52 | 129.56 | 3.27 |
| MA-SAM2 | MICCAI'25 | P | 90.85 | 2.28 | 77.37 | 61.15 | 86.14 | 23.25 | 33.54 | 129.54 | 3.29 |
| SAM2Long | ICCV'25 | P | 91.01 | 2.23 | 78.91 | 66.73 | 47.00 | 26.10 | 34.92 | 120.38 | 3.48 |
| EchoPilot | — | P+T | 95.31 | 0.83 | 77.35 | 68.44 | 28.20 | 23.91 | 38.87 | 63.40 | 3.28 |
Dice and F-score are percentages; lower ASD is better. In the Prompt column, P denotes a point prompt and T a text prompt. EchoPilot is the only method here that combines point and text prompts without finetuning any backbone. On the challenging Placenta dataset with SAM2 weights, EchoPilot improves Dice from 16.74 to 39.74 and reduces ASD by 68% (267.24 to 84.96); with MedSAM2 weights, it reaches 95.31 Dice and 0.83 ASD on CAMUS.
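For readers who want to check the headline numbers, the following sketch shows the standard Dice overlap on binary masks and the relative-reduction arithmetic behind the 68% figure. It is a plain-Python illustration, not the evaluation code used for the table: the function names are hypothetical, masks are assumed to be nested lists of 0/1, and scoring two empty masks as 100 is a convention chosen here.

```python
from typing import Sequence


def dice_percent(pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]) -> float:
    """Dice = 2|A∩B| / (|A| + |B|), in percent, for two binary masks of equal shape."""
    inter = pred_sum = target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            pred_sum += 1 if p else 0
            target_sum += 1 if t else 0
    if pred_sum + target_sum == 0:
        return 100.0  # assumed convention: two empty masks agree perfectly
    return 100.0 * 2 * inter / (pred_sum + target_sum)


def relative_reduction_percent(before: float, after: float) -> float:
    """Percentage by which a lower-is-better metric dropped from `before` to `after`."""
    return 100.0 * (before - after) / before


# Placenta ASD with SAM2 weights, values taken from the table above.
placenta_asd_drop = relative_reduction_percent(267.24, 84.96)
```

Here `placenta_asd_drop` evaluates to roughly 68.2, matching the rounded 68% quoted in the text.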
Qualitative Comparison
Qualitative comparison across time steps. Baselines progressively leak into background anatomy or drift away from the target, while EchoPilot remains anchored by better initialization and gated memory updates.
Ablation Studies
The ablations isolate the two core mechanisms: semantic scale selection for robust first-frame prompts, and memory gating for reducing temporal error accumulation.
| Stage-I Variant | CAMUS Dice | CAMUS ASD | CAMUS ΔDice | CAMUS ΔASD | Breast Lesion Dice | Breast Lesion ASD | Breast Lesion ΔDice | Breast Lesion ΔASD | Fetal Placenta Dice | Fetal Placenta ASD | Fetal Placenta ΔDice | Fetal Placenta ΔASD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pretrained Weights: SAM2 | | | | | | | | | | | | |
| SAM2 base, 1-click random | 28.41 | 56.84 | — | — | 56.00 | 56.54 | — | — | 16.74 | 267.24 | — | — |
| VFM-only | 30.52 | 30.82 | +2.11 | −26.02 | 54.87 | 29.38 | −1.13 | −27.16 | 39.73 | 87.34 | +22.99 | −179.90 |
| UniMedCLIP + VFM | 29.98 | 31.16 | +1.57 | −25.68 | 57.78 | 27.61 | +1.78 | −28.93 | 39.04 | 88.33 | +22.30 | −178.91 |
| BioMedCLIP + VFM, EchoPilot | 34.09 | 28.86 | +5.68 | −27.98 | 63.38 | 22.98 | +7.38 | −33.56 | 39.74 | 84.96 | +23.00 | −182.28 |
| Pretrained Weights: MedSAM2 | | | | | | | | | | | | |
| MedSAM2 base, 1-click random | 90.83 | 2.29 | — | — | 61.24 | 86.12 | — | — | 33.52 | 129.56 | — | — |
| VFM-only | 78.32 | 2.65 | −12.51 | +0.36 | 67.72 | 24.92 | +6.48 | −61.20 | 37.81 | 64.50 | +4.29 | −65.06 |
| UniMedCLIP + VFM | 84.69 | 2.08 | −6.14 | −0.21 | 68.64 | 21.22 | +7.40 | −64.90 | 38.54 | 63.53 | +5.02 | −66.03 |
| BioMedCLIP + VFM, EchoPilot | 95.31 | 0.83 | +4.48 | −1.46 | 68.44 | 28.20 | +7.20 | −57.92 | 38.87 | 63.40 | +5.35 | −66.16 |
Reliability-gated memory reduces ASD on Breast Lesion from 55.87 to 28.20 by preventing uncertain predictions from entering memory. Performance remains stable for gate thresholds τ in [0.1, 0.5]; at τ=0.9, the rejection rate exceeds 60%, so the gate over-rejects and discards useful temporal context.
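The trade-off behind the threshold τ can be made concrete with a small sweep that reports how many frames a gate would reject at each setting. This is a sketch only: the helper `rejection_rate`, the "reject if score < τ" rule, and the reliability scores are hypothetical, and the synthetic scores are not intended to reproduce the rejection rates reported above.

```python
from typing import Dict, Sequence


def rejection_rate(scores: Sequence[float], tau: float) -> float:
    """Fraction of frames whose reliability score falls below the threshold tau."""
    if not scores:
        raise ValueError("scores must not be empty")
    rejected = sum(1 for s in scores if s < tau)
    return rejected / len(scores)


# Synthetic per-frame reliability scores, for illustration only.
synthetic_scores = [0.95, 0.92, 0.85, 0.70, 0.60, 0.55, 0.45, 0.30, 0.20, 0.05]

# Sweep a few thresholds to see how quickly the gate starts discarding frames.
sweep: Dict[float, float] = {
    tau: rejection_rate(synthetic_scores, tau) for tau in (0.1, 0.5, 0.9)
}
```

A sweep like this makes it easy to spot the point where a stricter gate stops filtering errors and starts starving the tracker of temporal context.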
Performance Profile
Radar summary against MedSAM2 and MedSAM3. EchoPilot improves the balance of Dice, ASD, and boundary scores across all three ultrasound datasets.
Dataset
As a dataset contribution, we curate a dynamic fetal placenta ultrasound VOS dataset with 671 annotated frames, addressing a clinically important setting that is not covered by existing public ultrasound video benchmarks. EchoPilot is evaluated on this dataset together with CAMUS and Breast Lesion. Public datasets should be downloaded from their original sources. The fetal placenta dataset is treated as internal unless a separate release approval is granted.
BibTeX
@inproceedings{xiao2026echopilot,
  title     = {EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory},
  author    = {Xiao, Ruiqiang and Xing, Zhaohu and Yang, Yijun and Han, Zhenyan and Wang, Weiming and Wu, Kaishun and Zhu, Lei},
  booktitle = {International Conference on Medical Image Computing and Computer-Assisted Intervention},
  year      = {2026}
}
