Early Accepted · MICCAI 2026 Top 9%

EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory

Ruiqiang Xiao¹, Zhaohu Xing¹, Yijun Yang¹, Zhenyan Han², Weiming Wang³, Kaishun Wu¹, Lei Zhu¹,*

¹The Hong Kong University of Science and Technology (Guangzhou) · ²Third Affiliated Hospital of Sun Yat-Sen University · ³Hong Kong Metropolitan University

* Corresponding author: Lei Zhu (leizhu@hkust-gz.edu.cn).

EchoPilot targets category-anchored, sparse-interactive ultrasound video object segmentation (VOS): the user provides one positive point and an anatomical category name on the first frame, and all foundation models remain frozen.

TL;DR: EchoPilot turns one click and one category name into stable ultrasound video masks, without training, task-specific fine-tuning, or dense first-frame annotations.

01 · Abstract

Ultrasound video segmentation is clinically valuable yet difficult due to speckle noise, weak boundaries, and rapid anatomical deformation. Recent promptable foundation models enable point-guided segmentation, but direct deployment in ultrasound remains unreliable: a single point provides insufficient spatial context to resolve scale ambiguity, and greedy memory updates amplify early errors into severe temporal drift.

We present EchoPilot, a training-free framework for ultrasound video segmentation under sparse first-frame interaction, requiring only a single point click and an anatomical category name. EchoPilot mimics clinical global-to-local reasoning by orchestrating frozen medical vision-language, vision foundation, and video segmentation priors. It resolves initialization ambiguity with Scale-Space Semantic Prompting and a parameter-free S.E.E.D. criterion, then suppresses temporal drift with Reliability-Gated Memory. We also contribute a dynamic fetal placenta ultrasound VOS dataset with 671 annotated frames.

02 · Method

Method Overview

Stage I selects a semantic context scale with BioMedCLIP and S.E.E.D., then uses DINOv3 features to synthesize auxiliary point prompts. Stage II uses a reliability gate to decide whether each predicted frame should be written into the SAM2 or MedSAM2 memory bank.
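The auxiliary-prompt step in Stage I can be illustrated with a minimal sketch. It assumes that auxiliary points are chosen as the patches whose features are most similar (by cosine similarity) to the patch under the user's click. That selection rule, the `auxiliary_points` helper, and the toy feature vectors are all hypothetical and stand in for real DINOv3 patch features; the source does not specify the exact rule.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def auxiliary_points(features, click, k=2):
    """Return the k patch positions most similar to the clicked patch.

    features: dict mapping (row, col) patch index -> feature vector
              (hypothetical stand-in for frozen backbone features).
    click:    (row, col) of the patch containing the user's positive point.
    """
    ref = features[click]
    scored = [(cosine(ref, vec), pos) for pos, vec in features.items() if pos != click]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pos for _, pos in scored[:k]]


# Toy 2x3 patch grid with 2-D features (illustrative values only).
features = {
    (0, 0): [1.0, 0.0],
    (0, 1): [0.9, 0.1],
    (0, 2): [0.0, 1.0],
    (1, 0): [0.8, 0.3],
    (1, 1): [0.1, 0.9],
    (1, 2): [-1.0, 0.0],
}
print(auxiliary_points(features, click=(0, 0), k=2))  # [(0, 1), (1, 0)]
```

In this toy grid the two patches pointing in nearly the same feature direction as the click are returned, which is the behavior one would want from extra positive prompts placed on the same structure.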

Training-Free

All priors are frozen at inference time.

Sparse Interaction

Only one first-frame point and a category name are required.

Scale-Aware Initialization

S.E.E.D. chooses the view where the target is recognizable.

Drift-Resistant Tracking

Unreliable frames are prevented from contaminating memory.
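The gating idea behind drift-resistant tracking can be sketched as follows. This is an illustration under stated assumptions, not the EchoPilot implementation: the per-frame reliability score, the threshold `tau`, and the fixed-capacity first-in-first-out memory are hypothetical, and no SAM2 or MedSAM2 API is used.

```python
from collections import deque


def track_with_gated_memory(frame_scores, tau=0.5, capacity=3):
    """Write a frame to memory only if its reliability reaches tau.

    frame_scores: list of (frame_index, reliability) pairs, where reliability
                  is an assumed confidence score in [0, 1] for the predicted mask.
    Returns (frames kept in memory, frames rejected by the gate).
    """
    memory = deque(maxlen=capacity)  # oldest entry is dropped when full
    rejected = []
    for index, reliability in frame_scores:
        if reliability >= tau:
            memory.append(index)  # reliable prediction: allowed into memory
        else:
            rejected.append(index)  # unreliable prediction: kept out of memory
    return list(memory), rejected


# Illustrative scores only; frames 2 and 4 mimic low-confidence predictions.
scores = [(0, 0.95), (1, 0.80), (2, 0.30), (3, 0.70), (4, 0.20), (5, 0.90)]
memory, rejected = track_with_gated_memory(scores, tau=0.5, capacity=3)
print(memory, rejected)  # [1, 3, 5] [2, 4]
```

A greedy update would have written frames 2 and 4 into memory as well, so later frames would condition on the two weakest predictions.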

03 · Main Results

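EchoPilot is evaluated on CAMUS, Breast Lesion, and fetal Placenta ultrasound videos. It improves Dice and ASD over the corresponding base model on all three datasets under both SAM2 and MedSAM2 pretrained weights. With MedSAM2 weights it also outperforms the finetuned MedSAM3 baseline in Dice and ASD on every dataset, despite using no task-specific training; with SAM2 weights it does so on Breast Lesion and Placenta but not on CAMUS.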

Method	Venue	Prompt	CAMUS Dice ↑	CAMUS ASD ↓	CAMUS F ↑	Breast Lesion Dice ↑	Breast Lesion ASD ↓	Breast Lesion F ↑	Placenta Dice ↑	Placenta ASD ↓	Placenta F ↑
MedSAM3	arXiv'25	P+T	67.15	13.98	15.24	56.93	43.31	16.58	20.69	140.16	5.39
Pretrained Weights: SAM2
SAM2	ICLR'25	P	28.41	56.84	0.80	56.00	56.54	19.83	16.74	267.24	0.89
MA-SAM2	MICCAI'25	P	28.43	56.81	0.81	55.76	56.61	19.67	16.75	267.26	0.90
SAM2Long	ICCV'25	P	28.44	56.65	0.80	55.41	54.96	19.56	16.76	268.99	0.89
EchoPilot	—	P+T	34.09	28.86	3.21	63.38	22.98	21.35	39.74	84.96	5.92
Pretrained Weights: MedSAM2
MedSAM2	arXiv'25	P	90.83	2.29	77.34	61.24	86.12	23.63	33.52	129.56	3.27
MA-SAM2	MICCAI'25	P	90.85	2.28	77.37	61.15	86.14	23.25	33.54	129.54	3.29
SAM2Long	ICCV'25	P	91.01	2.23	78.91	66.73	47.00	26.10	34.92	120.38	3.48
EchoPilot	—	P+T	95.31	0.83	77.35	68.44	28.20	23.91	38.87	63.40	3.28

Dice and F-score are percentages; higher is better for both, and lower is better for ASD. In the Prompt column, P denotes a point prompt and T a text prompt. EchoPilot is the only method here that combines point and text prompts without finetuning any backbone. On the challenging Placenta dataset with SAM2 weights, EchoPilot improves Dice from 16.74 to 39.74 and reduces ASD by 68% (267.24 to 84.96); with MedSAM2 weights, it reaches 95.31 Dice and 0.83 ASD on CAMUS.
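The headline Placenta numbers can be checked directly from the table values. The short sketch below recomputes the relative ASD reduction and the Dice gain, and includes the standard Dice overlap formula on a toy pair of masks. The helper names and the toy masks are illustrative and are not taken from the paper's code.

```python
def relative_reduction(before, after):
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before


def dice(pred, target):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) for masks given as sets of pixels."""
    pred, target = set(pred), set(target)
    if not pred and not target:
        return 1.0  # two empty masks agree perfectly
    return 2 * len(pred & target) / (len(pred) + len(target))


# Placenta, SAM2 weights: values copied from the results table.
asd_reduction = relative_reduction(267.24, 84.96)
dice_gain = 39.74 - 16.74
print(f"ASD reduction: {asd_reduction:.1%}, Dice gain: {dice_gain:.2f}")
# ASD reduction: 68.2%, Dice gain: 23.00

# Toy masks (illustrative): two of three pixels overlap, so Dice = 4/6.
toy_pred = {(0, 0), (0, 1), (1, 1)}
toy_target = {(0, 1), (1, 1), (1, 2)}
print(f"{dice(toy_pred, toy_target):.3f}")  # 0.667
```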

04 · Qualitative Comparison

Qualitative comparison across time steps. Baselines progressively leak into background anatomy or drift away from the target, while EchoPilot remains anchored by better initialization and gated memory updates.

05 · Ablation Studies

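The ablations isolate the two core mechanisms: semantic scale selection for robust first-frame prompts, and memory gating for reducing temporal error accumulation. In the table, ΔDice and ΔASD are differences from the 1-click base row under the same pretrained weights.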

Stage-I Variant	CAMUS Dice	CAMUS ASD	CAMUS ΔDice	CAMUS ΔASD	Breast Lesion Dice	Breast Lesion ASD	Breast Lesion ΔDice	Breast Lesion ΔASD	Fetal Placenta Dice	Fetal Placenta ASD	Fetal Placenta ΔDice	Fetal Placenta ΔASD
Pretrained Weights: SAM2
SAM2 base, 1-click random	28.41	56.84	—	—	56.00	56.54	—	—	16.74	267.24	—	—
VFM-only	30.52	30.82	+2.11	−26.02	54.87	29.38	−1.13	−27.16	39.73	87.34	+22.99	−179.90
UniMedCLIP + VFM	29.98	31.16	+1.57	−25.68	57.78	27.61	+1.78	−28.93	39.04	88.33	+22.30	−178.91
BioMedCLIP + VFM, EchoPilot	34.09	28.86	+5.68	−27.98	63.38	22.98	+7.38	−33.56	39.74	84.96	+23.00	−182.28
Pretrained Weights: MedSAM2
MedSAM2 base, 1-click random	90.83	2.29	—	—	61.24	86.12	—	—	33.52	129.56	—	—
VFM-only	78.32	2.65	−12.51	+0.36	67.72	24.92	+6.48	−61.20	37.81	64.50	+4.29	−65.06
UniMedCLIP + VFM	84.69	2.08	−6.14	−0.21	68.64	21.22	+7.40	−64.90	38.54	63.53	+5.02	−66.03
BioMedCLIP + VFM, EchoPilot	95.31	0.83	+4.48	−1.46	68.44	28.20	+7.20	−57.92	38.87	63.40	+5.35	−66.16

Reliability-gated memory reduces ASD on Breast Lesion from 55.87 to 28.20 (MedSAM2 weights) by preventing uncertain predictions from entering memory. The gate is stable for thresholds τ in [0.1, 0.5]; at τ = 0.9, the rejection rate exceeds 60%, and the gate discards useful temporal context.
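The effect of τ on the rejection rate can be seen in a small sketch. It assumes that a frame is rejected when its reliability score falls below τ. The scores are invented for illustration and are not the paper's data, so the printed rates do not reproduce the reported 60% figure.

```python
def rejection_rate(reliabilities, tau):
    """Fraction of frames whose reliability falls below the threshold tau."""
    rejected = sum(1 for r in reliabilities if r < tau)
    return rejected / len(reliabilities)


# Hypothetical per-frame reliability scores: mostly confident, one clear failure.
scores = [0.95, 0.92, 0.88, 0.85, 0.81, 0.76, 0.72, 0.65, 0.58, 0.05]

for tau in (0.1, 0.5, 0.9):
    print(f"tau={tau}: rejection rate {rejection_rate(scores, tau):.0%}")
# tau=0.1: rejection rate 10%
# tau=0.5: rejection rate 10%
# tau=0.9: rejection rate 80%
```

With scores distributed like this, any threshold between the failure and the bulk of confident frames rejects the same frame, which is one way a gate can be insensitive to τ over a wide range, while a threshold near 1 starts discarding frames that are only mildly uncertain.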

06 · Performance Profile

Radar plot comparing EchoPilot with MedSAM2 and MedSAM3

Radar summary against MedSAM2 and MedSAM3. EchoPilot improves the balance of Dice, ASD, and boundary scores across all three ultrasound datasets.

07 · Dataset

As a dataset contribution, we curate a dynamic fetal placenta ultrasound VOS dataset with 671 annotated frames, addressing a clinically important setting that is not covered by existing public ultrasound video benchmarks. EchoPilot is evaluated on this dataset together with CAMUS and Breast Lesion. Public datasets should be downloaded from their original sources. The fetal placenta dataset is treated as internal unless a separate release approval is granted.

08 · Citation

BibTeX

@misc{xiao2026echopilottrainingfreeultrasoundvideo,
  title         = {EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory},
  author        = {Ruiqiang Xiao and Zhaohu Xing and Yijun Yang and Zhenyan Han and Weiming Wang and Kaishun Wu and Lei Zhu},
  year          = {2026},
  eprint        = {2605.25944},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2605.25944}
}