Confidence-Guided Retrieval Refinement for Audio Moment Retrieval
A retrieve–rerank framework that localizes the moments described by natural-language queries in long-form (5-minute) YouTube audio. A cross-modal retriever returns the top-K temporal candidates. A second-stage reranker, trained with Direct Alignment Preference Optimization (DAPO), refines them only when its confidence exceeds a learned threshold, which avoids destructive corrections on locally ambiguous candidates.
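The confidence gate described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names, the use of the top softmax probability as the confidence measure, and the threshold value passed in by the caller are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def softmax_confidence(scores: Sequence[float]) -> float:
    """Top softmax probability over reranker scores.

    Assumption: the source does not say how confidence is defined;
    this is one common choice used here purely for illustration.
    """
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max for stability
    return max(exps) / sum(exps)


def gated_rerank(
    candidates: List[Span],
    rerank_scores: Sequence[float],
    threshold: float,
) -> List[Span]:
    """Apply the reranker's ordering only when it is confident enough.

    If confidence does not exceed the threshold, the retriever's original
    order is returned unchanged, so an unsure reranker cannot demote a
    correct top-1 candidate.
    """
    if len(candidates) != len(rerank_scores):
        raise ValueError("one reranker score is required per candidate")
    if not candidates:
        return []
    if softmax_confidence(rerank_scores) <= threshold:
        return list(candidates)  # keep retriever order
    order = sorted(
        range(len(candidates)), key=lambda i: rerank_scores[i], reverse=True
    )
    return [candidates[i] for i in order]


if __name__ == "__main__":
    spans = [(0.0, 10.0), (20.0, 30.0), (40.0, 50.0)]
    scores = [0.1, 2.0, 0.5]
    print(gated_rerank(spans, scores, threshold=0.5))   # reranked order
    print(gated_rerank(spans, scores, threshold=0.95))  # retriever order kept
```

In the described system the threshold is learned; here it is simply a caller-supplied number so the gating logic can be read in isolation.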
Boundaries are further sharpened by retrieval-grounded span refinement under an IoU constraint, which prevents hallucinated spans. On CASTELLA (1,347 test queries), the full system outperforms the published UVCOM results by +2.82 R1@0.5, +1.97 R1@0.7, and +1.75 mAP.
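One way to read the IoU constraint on span refinement is as a guard that accepts a refined span only if it still overlaps the retrieved candidate sufficiently. The sketch below illustrates that reading; the function names, the fallback-to-retrieved behavior, and the default `min_iou` of 0.5 are assumptions for the example rather than details taken from the project.

```python
from typing import Tuple

Span = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def constrained_refine(retrieved: Span, proposed: Span, min_iou: float = 0.5) -> Span:
    """Accept a refined span only if it stays grounded in the retrieved one.

    Assumption: a proposal that is malformed, or whose IoU with the
    retrieved span falls below min_iou, is treated as hallucinated and
    the retrieved span is kept instead.
    """
    if proposed[1] <= proposed[0]:
        return retrieved  # empty or inverted proposal
    if temporal_iou(retrieved, proposed) < min_iou:
        return retrieved  # drifted too far from the retrieved evidence
    return proposed


if __name__ == "__main__":
    print(constrained_refine((10.0, 20.0), (12.0, 22.0)))    # accepted
    print(constrained_refine((10.0, 20.0), (100.0, 110.0)))  # rejected
```

The same temporal IoU is the quantity conventionally thresholded in the R1@0.5 and R1@0.7 metrics, where a top-1 prediction counts as correct if its IoU with the ground-truth span reaches the stated value.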


