Center for Digital Innovation

Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning

Yan Li, Weiwei Guo, Xue Yang, Ning Liao, Dunyun He, Jiaqi Zhou, Wenxian Yu

European Conference on Computer Vision (2025)

Abstract

The rapidly growing volume of remote-sensing images spurs the development of extensible object detectors that can detect objects beyond the training categories without the cost of collecting new labeled data. In this paper, we aim to develop open-vocabulary object detection (OVD) techniques for aerial images that scale the object vocabulary beyond the training data. The performance of OVD relies heavily on the quality of class-agnostic region proposals and pseudo-labels for novel object categories. To generate high-quality proposals and pseudo-labels simultaneously, we propose CastDet, a CLIP-activated student-teacher open-vocabulary object Detection framework. Our end-to-end framework, which follows the student-teacher self-learning mechanism, employs the RemoteCLIP model as an extra omniscient teacher with rich knowledge. By doing so, our approach improves not only novel object proposals but also their classification. Furthermore, we devise a dynamic label queue strategy to maintain high-quality pseudo-labels during batch training. We conduct extensive experiments on multiple existing aerial object detection datasets, which are set up for the OVD task. Experimental results demonstrate that CastDet achieves superior open-vocabulary detection performance, e.g., reaching 46.5% mAP on VisDroneZSD novel categories, which outperforms state-of-the-art open-vocabulary detectors by 21.0% mAP. To the best of our knowledge, this is the first work to apply and develop open-vocabulary object detection techniques for aerial images. The code is available at https://github.com/lizzy8587/CastDet.

Info

Authors: Yan Li, Weiwei Guo, Xue Yang, Ning Liao, Dunyun He, Jiaqi Zhou, Wenxian Yu

Link: https://link.springer.com/chapter/10.1007/978-3-031-73016-0_25

Figures