Abstract:
Existing pedestrian retrieval models have made progress in aligning global visual and textual features, but they still fall short in capturing fine-grained pedestrian details and in mining the internal dependencies within images and texts. To address this, this paper proposes a text-based pedestrian retrieval model, the Cross-modal Attribute Matching Alignment Network (CAMN), which optimizes text-to-image retrieval through the precise alignment of global and local attribute features. First, a new visual feature extraction network, the Multi-head Feature Self-Attention Network (MHANet), is designed to help the feature extraction network obtain more detailed global visual features. Second, to address the insufficient correlation between visual and textual local attribute features, a Cross-modal Attribute Attention (ACA) module is proposed. Experimental results show that, compared with the ViTAA model, the proposed model improves Rank-5 accuracy by 8.33%, 9.3%, and 9.73% on the CUHK-PEDES, ICFG-PEDES, and RSTPReid public datasets, respectively, and shows a clear advantage over other algorithms.
Key words: pedestrian retrieval; cross-modality; attention mechanism; attribute alignment; CAMN
DOI: |
Received: 2025-03-02; Revised: 2025-04-21
Funding:
|
CAMN: A Pedestrian Model for Text Retrieval Based on Cross-Modal Attribute Matching Alignment
QIN Xiao1, LU Hongfei1, WU Kunsheng2
|
(1. Guangxi Key Laboratory of Human-Computer Interaction and Intelligent Decision Making, Nanning Normal University; 2. School of Physics and Electronic Information, Guangxi Minzu University, Nanning)
Abstract: |
Existing pedestrian retrieval models have made progress in aligning global visual and textual features, but they remain deficient in capturing fine-grained pedestrian details and in mining the internal dependencies between image and text. To address this, this paper proposes a text-based pedestrian retrieval model, the Cross-modal Attribute Matching Alignment Network (CAMN), which optimizes text-to-image retrieval through the precise alignment of global and local attribute features. First, a new visual feature extraction network, the Multi-head Feature Self-Attention Network (MHANet), is designed to help the feature extraction network obtain more detailed global visual features. Second, to address the insufficient correlation between visual and textual local attribute features, this paper proposes the Cross-modal Attribute Attention (ACA) module. Experimental results show that, compared with the ViTAA model, the proposed model improves Rank-5 accuracy by 8.33%, 9.3%, and 9.73% on the CUHK-PEDES, ICFG-PEDES, and RSTPReid public datasets, respectively, and holds a clear advantage over other algorithms.
Key words: pedestrian retrieval; cross-modality; attention mechanism; attribute alignment; CAMN
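
To make the cross-modal attribute alignment idea concrete, the following is a minimal PyTorch sketch of a cross-attention block in which textual attribute embeddings attend over local visual features. The abstract does not specify the ACA module's internals, so all module, parameter, and tensor names here are hypothetical assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class CrossModalAttributeAttention(nn.Module):
    """Hypothetical sketch of cross-modal attribute attention:
    textual attribute embeddings (queries) attend over local visual
    features (keys/values) to produce attribute-aligned representations.
    Not the paper's actual ACA implementation."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_attr: torch.Tensor, vis_local: torch.Tensor) -> torch.Tensor:
        # text_attr: (B, Na, dim) -- one embedding per parsed attribute phrase
        # vis_local: (B, Nv, dim) -- local visual features, e.g. body-part strips
        aligned, _ = self.attn(query=text_attr, key=vis_local, value=vis_local)
        # Residual connection preserves the original attribute semantics
        return self.norm(text_attr + aligned)

# Toy usage: 4 attribute phrases attending over 6 local visual features
attn = CrossModalAttributeAttention()
out = attn(torch.randn(2, 4, 512), torch.randn(2, 6, 512))
print(out.shape)  # torch.Size([2, 4, 512])
```

The aligned outputs could then be compared with their source attribute embeddings under a matching loss to drive local alignment, alongside a global image-text alignment objective; the specific losses are left open here, as the abstract does not state them.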