Early Accepted · MICCAI 2026 Top 9%

EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory

Ruiqiang Xiao1, Zhaohu Xing1, Yijun Yang1, Zhenyan Han2, Weiming Wang3, Kaishun Wu1, Lei Zhu1,*

1The Hong Kong University of Science and Technology (Guangzhou)  ·  2Third Affiliated Hospital of Sun Yat-Sen University  ·  3Hong Kong Metropolitan University

* Corresponding author: Lei Zhu (leizhu@hkust-gz.edu.cn).

EchoPilot task setting and concept

EchoPilot targets category-anchored sparse-interactive ultrasound VOS: the user provides one positive point and an anatomical category name on the first frame; all foundation models remain frozen.

TL;DR: EchoPilot turns one click and one category name into stable ultrasound video masks, without training, task-specific fine-tuning, or dense first-frame annotations.

01 · Abstract

Abstract

Ultrasound video segmentation is clinically valuable yet difficult due to speckle noise, weak boundaries, and rapid anatomical deformation. Recent promptable foundation models enable point-guided segmentation, but direct deployment in ultrasound remains unreliable: a single point provides insufficient spatial context to resolve scale ambiguity, and greedy memory updates amplify early errors into severe temporal drift.

We present EchoPilot, a training-free framework for ultrasound video segmentation under sparse first-frame interaction, requiring only a single point click and an anatomical category name. EchoPilot mimics clinical global-to-local reasoning by orchestrating frozen medical vision-language, vision foundation, and video segmentation priors. It resolves initialization ambiguity with Scale-Space Semantic Prompting and a parameter-free S.E.E.D. criterion, then suppresses temporal drift with Reliability-Gated Memory. We also contribute a dynamic fetal placenta ultrasound VOS dataset with 671 annotated frames.

02 · Method

Method Overview

Overview of EchoPilot

Stage I selects a semantic context scale with BioMedCLIP and S.E.E.D., then uses DINOv3 features to synthesize auxiliary point prompts. Stage II uses a reliability gate to decide whether each predicted frame should be written into the SAM2 or MedSAM2 memory bank.
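The two Stage I steps can be pictured as (1) choosing one of several nested crops centred on the click and (2) proposing extra positive points whose patch features resemble the clicked patch. The sketch below is a minimal illustration under those assumptions, not the paper's procedure. The S.E.E.D. criterion is not specified on this page, so a plain argmax over a caller-supplied score stands in for it. `crop_window`, `select_scale`, `propose_aux_points`, the candidate sizes, the toy scorer, and the top-k rule are all hypothetical. In practice the score would come from a frozen vision-language model such as BioMedCLIP and the feature grid from a frozen backbone such as DINOv3.

```python
import math

def crop_window(click, size, width, height):
    """Square window of side `size` centred on the click, shifted to stay inside the image."""
    x, y = click
    size = min(size, width, height)
    left = min(max(x - size // 2, 0), width - size)
    top = min(max(y - size // 2, 0), height - size)
    return (left, top, left + size, top + size)

def select_scale(click, sizes, width, height, score_fn):
    """Pick the candidate window with the highest score (placeholder for S.E.E.D.)."""
    windows = [crop_window(click, s, width, height) for s in sizes]
    scores = [score_fn(w) for w in windows]
    best = max(range(len(windows)), key=lambda i: scores[i])
    return windows[best], scores

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def propose_aux_points(features, click, k=2):
    """Return the k patch coordinates most similar to the clicked patch.

    features: grid[row][col] -> feature vector (stand-in for frozen patch embeddings)
    click:    (row, col) of the user-provided positive point
    """
    query = features[click[0]][click[1]]
    scored = []
    for r, row in enumerate(features):
        for c, vec in enumerate(row):
            if (r, c) != click:
                scored.append((cosine(query, vec), (r, c)))
    # Highest similarity first; ties broken by grid position for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [pos for _, pos in scored[:k]]

# Step 1: hypothetical scorer that prefers windows whose side is close to 128 px.
def toy_score(window):
    left, top, right, bottom = window
    return -abs((right - left) - 128)

best_window, scores = select_scale((30, 200), [64, 128, 256], 256, 256, toy_score)

# Step 2: toy 2x3 feature grid; the left two columns share a "target-like" direction.
grid = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[0.8, 0.2], [1.0, 0.1], [0.1, 0.9]],
]
aux_points = propose_aux_points(grid, click=(0, 0), k=2)
```

With the toy inputs, the 128 px window is selected and the two auxiliary points fall on the patches closest in feature space to the click; both outcomes reflect the made-up scorer and grid, not any reported behaviour.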

Training-Free

All priors are frozen at inference time.

Sparse Interaction

Only one first-frame point and a category name are required.

Scale-Aware Initialization

S.E.E.D. chooses the view where the target is recognizable.

Drift-Resistant Tracking

Unreliable frames are prevented from contaminating memory.
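The gating idea can be sketched as a bounded memory bank that admits a frame only when a per-frame reliability score clears a threshold. This is a simplified stand-in under stated assumptions, not the SAM2 or MedSAM2 memory API: the `GatedMemory` class, the FIFO eviction, the capacity, and the way reliability is supplied are all hypothetical, since this page does not define how reliability is computed.

```python
from collections import deque

class GatedMemory:
    """FIFO memory bank that only stores frames judged reliable."""

    def __init__(self, capacity=7, tau=0.3):
        self.tau = tau                      # gate threshold (assumed: higher = stricter)
        self.bank = deque(maxlen=capacity)  # oldest entry is evicted when full
        self.rejected = 0

    def update(self, frame_idx, mask, reliability):
        """Write the frame into memory only if its reliability clears the gate."""
        if reliability >= self.tau:
            self.bank.append((frame_idx, mask))
            return True
        self.rejected += 1
        return False

memory = GatedMemory(capacity=3, tau=0.3)
# (frame index, predicted mask placeholder, made-up reliability score)
predictions = [
    (0, "mask0", 0.95),
    (1, "mask1", 0.80),
    (2, "mask2", 0.10),  # unreliable: kept out of memory
    (3, "mask3", 0.60),
    (4, "mask4", 0.45),
]
for idx, mask, rel in predictions:
    memory.update(idx, mask, rel)

stored_frames = [idx for idx, _ in memory.bank]
```

Frame 2 never enters the bank, so later frames are conditioned only on predictions that passed the gate.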

03 · Main Results

Main Results

EchoPilot is evaluated on CAMUS, Breast Lesion, and fetal Placenta ultrasound videos. It consistently improves Dice and ASD over the baselines under both SAM2 and MedSAM2 pretrained weights and, with MedSAM2 weights, outperforms the finetuned MedSAM3 baseline on Dice and ASD across all three datasets despite using no task-specific training.

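| Method | Venue | Prompt | CAMUS Dice ↑ | CAMUS ASD ↓ | CAMUS F ↑ | Breast Lesion Dice ↑ | Breast Lesion ASD ↓ | Breast Lesion F ↑ | Placenta Dice ↑ | Placenta ASD ↓ | Placenta F ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MedSAM3 | arXiv'25 | P+T | 67.15 | 13.98 | 15.24 | 56.93 | 43.31 | 16.58 | 20.69 | 140.16 | 5.39 |
| Pretrained Weights: SAM2 | | | | | | | | | | | |
| SAM2 | ICLR'25 | P | 28.41 | 56.84 | 0.80 | 56.00 | 56.54 | 19.83 | 16.74 | 267.24 | 0.89 |
| MA-SAM2 | MICCAI'25 | P | 28.43 | 56.81 | 0.81 | 55.76 | 56.61 | 19.67 | 16.75 | 267.26 | 0.90 |
| SAM2Long | ICCV'25 | P | 28.44 | 56.65 | 0.80 | 55.41 | 54.96 | 19.56 | 16.76 | 268.99 | 0.89 |
| EchoPilot | | P+T | 34.09 | 28.86 | 3.21 | 63.38 | 22.98 | 21.35 | 39.74 | 84.96 | 5.92 |
| Pretrained Weights: MedSAM2 | | | | | | | | | | | |
| MedSAM2 | arXiv'25 | P | 90.83 | 2.29 | 77.34 | 61.24 | 86.12 | 23.63 | 33.52 | 129.56 | 3.27 |
| MA-SAM2 | MICCAI'25 | P | 90.85 | 2.28 | 77.37 | 61.15 | 86.14 | 23.25 | 33.54 | 129.54 | 3.29 |
| SAM2Long | ICCV'25 | P | 91.01 | 2.23 | 78.91 | 66.73 | 47.00 | 26.10 | 34.92 | 120.38 | 3.48 |
| EchoPilot | | P+T | 95.31 | 0.83 | 77.35 | 68.44 | 28.20 | 23.91 | 38.87 | 63.40 | 3.28 |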

Dice and F-score are percentages; lower ASD is better. In the Prompt column, P denotes a point prompt and T a text prompt. EchoPilot is the only method here that combines point and text prompts without finetuning any backbone. On the challenging Placenta dataset with SAM2 weights, EchoPilot improves Dice from 16.74 to 39.74 and reduces ASD by 68% (267.24 to 84.96); with MedSAM2 weights, it reaches 95.31 Dice and 0.83 ASD on CAMUS.
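For reference, Dice is the standard overlap measure 2|A∩B| / (|A| + |B|), and the 68% figure is the relative ASD change on Placenta with SAM2 weights. The snippet below is a small sanity check, not the authors' evaluation code; the helper names are illustrative, the masks are toy sets of pixel coordinates, and the empty-mask convention is an assumption.

```python
def dice_percent(pred, target):
    """Dice overlap between two masks given as sets of (row, col) pixels, in percent."""
    if not pred and not target:
        return 100.0  # assumed convention: two empty masks agree perfectly
    return 100.0 * 2 * len(pred & target) / (len(pred) + len(target))

def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before

# Toy masks: 4 pixels each, 2 of them shared.
pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
target = {(0, 1), (1, 1), (2, 1), (2, 2)}
toy_dice = dice_percent(pred, target)

# Placenta ASD with SAM2 weights, values taken from the table: SAM2 vs. EchoPilot.
asd_drop = relative_reduction(267.24, 84.96)
print(f"toy Dice = {toy_dice:.1f}%, ASD reduction = {asd_drop:.1%}")
```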

04 · Qualitative Comparison

Qualitative Comparison


Qualitative comparison across time steps. Baselines progressively leak into background anatomy or drift away from the target, while EchoPilot remains anchored by better initialization and gated memory updates.

05 · Ablation Studies

Ablation Studies

The ablations isolate the two core mechanisms: semantic scale selection for robust first-frame prompts, and memory gating for reducing temporal error accumulation.

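| Stage-I Variant | CAMUS Dice | CAMUS ASD | CAMUS ΔDice | CAMUS ΔASD | Breast Lesion Dice | Breast Lesion ASD | Breast Lesion ΔDice | Breast Lesion ΔASD | Fetal Placenta Dice | Fetal Placenta ASD | Fetal Placenta ΔDice | Fetal Placenta ΔASD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pretrained Weights: SAM2 | | | | | | | | | | | | |
| SAM2 base, 1-click random | 28.41 | 56.84 | | | 56.00 | 56.54 | | | 16.74 | 267.24 | | |
| VFM-only | 30.52 | 30.82 | +2.11 | −26.02 | 54.87 | 29.38 | −1.13 | −27.16 | 39.73 | 87.34 | +22.99 | −179.90 |
| UniMedCLIP + VFM | 29.98 | 31.16 | +1.57 | −25.68 | 57.78 | 27.61 | +1.78 | −28.93 | 39.04 | 88.33 | +22.30 | −178.91 |
| BioMedCLIP + VFM, EchoPilot | 34.09 | 28.86 | +5.68 | −27.98 | 63.38 | 22.98 | +7.38 | −33.56 | 39.74 | 84.96 | +23.00 | −182.28 |
| Pretrained Weights: MedSAM2 | | | | | | | | | | | | |
| MedSAM2 base, 1-click random | 90.83 | 2.29 | | | 61.24 | 86.12 | | | 33.52 | 129.56 | | |
| VFM-only | 78.32 | 2.65 | −12.51 | +0.36 | 67.72 | 24.92 | +6.48 | −61.20 | 37.81 | 64.50 | +4.29 | −65.06 |
| UniMedCLIP + VFM | 84.69 | 2.08 | −6.14 | −0.21 | 68.64 | 21.22 | +7.40 | −64.90 | 38.54 | 63.53 | +5.02 | −66.03 |
| BioMedCLIP + VFM, EchoPilot | 95.31 | 0.83 | +4.48 | −1.46 | 68.44 | 28.20 | +7.20 | −57.92 | 38.87 | 63.40 | +5.35 | −66.16 |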
Reliability-gated memory ablation

Reliability-gated memory reduces ASD on Breast Lesion from 55.87 to 28.20 by preventing uncertain predictions from entering memory. The gate remains stable for τ in [0.1, 0.5]; at τ=0.9, the rejection rate exceeds 60%, so the gate over-rejects useful temporal context.
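A threshold sweep of this kind can be reproduced in a few lines by measuring, for each τ, the fraction of frames whose reliability falls below the gate. The sketch assumes a frame is rejected when its score is below τ, which is consistent with higher τ rejecting more frames. The scores are made up purely to illustrate the shape of such a sweep and are not taken from the paper.

```python
def rejection_rate(scores, tau):
    """Fraction of frames whose reliability falls below the gate threshold."""
    return sum(1 for s in scores if s < tau) / len(scores)

# Made-up per-frame reliability scores for one clip (not from the paper).
scores = [0.95, 0.88, 0.72, 0.05, 0.91, 0.64, 0.83, 0.55, 0.97, 0.78]

sweep = {tau: rejection_rate(scores, tau) for tau in (0.1, 0.3, 0.5, 0.9)}
for tau, rate in sweep.items():
    print(f"tau={tau:.1f}  rejected={rate:.0%}")
```

On these toy scores the rejection rate stays flat for the lower thresholds and jumps at τ=0.9, where most frames, including plausibly useful ones, are kept out of memory.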

06 · Performance Profile

Performance Profile

Radar plot comparing EchoPilot with MedSAM2 and MedSAM3

Radar summary against MedSAM2 and MedSAM3. EchoPilot improves the balance of Dice, ASD, and boundary scores across all three ultrasound datasets.

07 · Dataset

Dataset

As a dataset contribution, we curate a dynamic fetal placenta ultrasound VOS dataset with 671 annotated frames, addressing a clinically important setting that is not covered by existing public ultrasound video benchmarks. EchoPilot is evaluated on this dataset together with CAMUS and Breast Lesion. Public datasets should be downloaded from their original sources. The fetal placenta dataset is treated as internal unless a separate release approval is granted.

08 · Citation

BibTeX

@inproceedings{xiao2026echopilot,
  title     = {EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory},
  author    = {Xiao, Ruiqiang and Xing, Zhaohu and Yang, Yijun and Han, Zhenyan and Wang, Weiming and Wu, Kaishun and Zhu, Lei},
  booktitle = {International Conference on Medical Image Computing and Computer-Assisted Intervention},
  year      = {2026}
}