Cite this article:
QIN Xiao, LU Hongfei, WU Kunsheng. CAMN: A Pedestrian Model for Text Retrieval Based on Cross-modal Attribute Matching Alignment[J]. Journal of Guangxi Academy of Sciences, 2025, 41(1): 1-11.
DOI: 10.13657/j.cnki.gxkxyxb.20250429.001
Received: 2024-12-02; Revised: 2025-02-05
Funding: Supported by the Science and Technology Innovation 2030 Major Project "Brain Science and Brain-Inspired Research" of the Ministry of Science and Technology (2021ZD0201904) and the Guangxi Science and Technology Major Program (Guike AA22068057).
CAMN: A Pedestrian Model for Text Retrieval Based on Cross-modal Attribute Matching Alignment |
QIN Xiao 1, LU Hongfei 1, WU Kunsheng 2

(1. Guangxi Key Laboratory of Human-machine Interaction and Intelligent Decision, Nanning Normal University, Nanning, Guangxi, 530100, China; 2. School of Physics and Electronic Information, Guangxi Minzu University, Nanning, Guangxi, 530006, China)
Abstract:
Existing pedestrian retrieval models have made good progress in aligning global image and text features, but they still fall short in capturing fine-grained pedestrian details and in mining the internal dependencies within and between images and text. To address these problems, this paper first designs a new image feature extraction network, the multi-head self-attention network (MHANet), to obtain more detailed global image features. Second, to remedy the weak correlation between the local attribute features of images and text, this paper proposes a cross-modal attribute attention (ACA) module, which strengthens the local attribute feature representation of the image under the guidance of textual information. Finally, combining the MHANet and ACA modules, this paper proposes a text-based pedestrian retrieval model, the Cross-modal Attribute Matching Alignment Network (CAMN), which optimizes text-to-image retrieval through accurate alignment of global and local attribute features. Experimental results show that, compared with the visual-textual attributes alignment in person search by natural language (ViTAA) network, CAMN improves Rank-5 on the three public datasets CUHK-PEDES, ICFG-PEDES, and RSTPReid by 8.33, 9.30, and 9.73 percentage points, respectively, and also shows a clear advantage over other algorithms. These results indicate that CAMN can align image and text attribute features while overcoming the limitations of traditional image feature extraction methods.
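The abstract only names the ACA module, so as a rough illustration the following PyTorch sketch shows one common way to realize text-guided cross-modal attention: local image part features act as queries and attend over text token features, so the description re-weights and enriches the corresponding local visual representation. The class name, tensor shapes, and residual design are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a text-guided cross-modal attribute attention block.
# Everything here (names, shapes, residual design) is an illustrative
# assumption; the paper's actual ACA module may differ.
import torch
import torch.nn as nn

class CrossModalAttributeAttention(nn.Module):
    """Image part features attend over text token features, so the
    description (e.g. "red jacket", "black backpack") re-weights and
    enriches the corresponding local visual features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_parts: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # img_parts:   (B, P, D) local visual features (e.g. horizontal stripes)
        # text_tokens: (B, T, D) token/phrase embeddings of the description
        attended, _ = self.cross_attn(query=img_parts,
                                      key=text_tokens,
                                      value=text_tokens)
        # Residual connection keeps the original visual content while
        # injecting the text-guided emphasis.
        return self.norm(img_parts + attended)

if __name__ == "__main__":
    aca = CrossModalAttributeAttention(dim=256)
    img_parts = torch.randn(2, 6, 256)     # 6 body-part features per image
    text_tokens = torch.randn(2, 20, 256)  # 20 text tokens per description
    print(aca(img_parts, text_tokens).shape)  # torch.Size([2, 6, 256])
```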
Key words: pedestrian retrieval; cross-modality; attention mechanism; attribute alignment
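For reference, the Rank-5 metric reported above is conventionally the fraction of text queries whose correct identity appears among the top 5 gallery images ranked by feature similarity. A minimal sketch, assuming cosine similarity and one integer identity label per sample (the function name and shapes are hypothetical):

```python
# Minimal sketch of Rank-k accuracy for text-to-image person retrieval,
# assuming L2-normalizable embeddings and integer identity labels.
import torch
import torch.nn.functional as F

def rank_k_accuracy(text_emb, img_emb, text_ids, img_ids, k=5):
    # Cosine similarity between every text query and every gallery image.
    text_emb = F.normalize(text_emb, dim=1)
    img_emb = F.normalize(img_emb, dim=1)
    sim = text_emb @ img_emb.t()              # (Q, G) similarity matrix
    topk = sim.topk(k, dim=1).indices         # top-k gallery indices per query
    # A query is a hit if any top-k image shares its identity label.
    hits = (img_ids[topk] == text_ids.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

# Toy usage with random features and 10 identities.
q, g, d = 32, 100, 256
acc = rank_k_accuracy(torch.randn(q, d), torch.randn(g, d),
                      torch.randint(0, 10, (q,)), torch.randint(0, 10, (g,)))
print(f"Rank-5: {acc:.3f}")
```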