Cite this article
  • QIN Xiao, ZHANG Jinyong, GONG Yuanxu, WU Kunsheng, HUANG Haojie, CHUN Xin, YUAN Chang'an. GCTR: A Granularity-Unified Cross-Modal Text-person Retrieval Model[J]. Guangxi Sciences, 2024, 31(5): 988-1001.

DOI: 10.13656/j.cnki.gxkx.20241127.015
Received: 2024-07-13; Revised: 2024-09-25
Funding: Supported by the Science and Technology Innovation 2030 "Brain Science and Brain-Inspired Research" Major Project of the Ministry of Science and Technology (2021ZD0201904) and the Guangxi Science and Technology Major Program (Guike AA22068057).
GCTR: A Granularity-Unified Cross-Modal Text-person Retrieval Model
QIN Xiao1,2,3, ZHANG Jinyong1, GONG Yuanxu1, WU Kunsheng1, HUANG Haojie1, CHUN Xin1, YUAN Chang'an2,3
(1. Guangxi Key Laboratory of Human-machine Interaction and Intelligent Decision, Nanning Normal University, Nanning, Guangxi, 530100, China; 2. Guangxi Academy of Sciences, Nanning, Guangxi, 530012, China; 3. Guangxi Regional Collaborative Innovation Center for Multi-Source Data Integration and Intelligent Processing, Guilin, Guangxi, 541004, China)
Abstract:
Existing text-based person retrieval models pay little attention to the semantic connection between text and image, and they tend to overlook the granularity difference between text and image features. To address these two problems, a Granularity-unified Cross-modal Text-person Retrieval model (GCTR) is proposed. First, GCTR uses a contrastive language-image pre-training model, which possesses cross-modal knowledge transfer capability, to obtain text and image features with basic cross-modal relevance. Second, a Cross-modal Feature Enhancement module (CMFE) is proposed; it uses an Enhanced Cross-modal Feature Codebook (ECFC) to produce image and text features of unified granularity, resolving the granularity difference between the two modalities. Finally, the model is trained with a combination of local and global matching losses. GCTR outperforms existing state-of-the-art methods on three public datasets, CUHK-PEDES, ICFG-PEDES, and RSTPReid, demonstrating its effectiveness in cross-modal text-based person retrieval.
Key words: cross-modal retrieval; text-image retrieval; person retrieval; vision-language pre-training; granularity feature enhancement
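
To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of its three stages: pre-trained encoders yielding basically aligned features, a shared codebook (ECFC) through which both modalities are re-expressed at a unified granularity, and a global matching loss. All class names, dimensions, the attention-style codebook lookup, and the symmetric InfoNCE objective are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a GCTR-style pipeline (assumed design, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ECFC(nn.Module):
    """Enhanced Cross-modal Feature Codebook (assumed design): a shared set of
    learnable prototype vectors that both modalities attend to, so image and
    text tokens are re-expressed at a common granularity."""
    def __init__(self, num_codes=512, dim=768):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_codes, dim) * 0.02)

    def forward(self, tokens):                      # tokens: (B, N, D)
        attn = tokens @ self.codes.t()              # (B, N, K) token-code similarity
        attn = F.softmax(attn / tokens.size(-1) ** 0.5, dim=-1)
        return attn @ self.codes                    # granularity-unified tokens

class GCTR(nn.Module):
    def __init__(self, image_encoder, text_encoder, dim=768):
        super().__init__()
        self.image_encoder = image_encoder          # e.g., a CLIP vision tower
        self.text_encoder = text_encoder            # e.g., a CLIP text tower
        self.ecfc = ECFC(dim=dim)                   # core of the CMFE module (assumed)

    def forward(self, images, texts):
        img_tok = self.image_encoder(images)        # (B, Ni, D) token features
        txt_tok = self.text_encoder(texts)          # (B, Nt, D) token features
        img_tok = self.ecfc(img_tok)                # unify granularity across modalities
        txt_tok = self.ecfc(txt_tok)
        img_g = F.normalize(img_tok.mean(dim=1), dim=-1)  # pooled global features
        txt_g = F.normalize(txt_tok.mean(dim=1), dim=-1)
        return img_tok, txt_tok, img_g, txt_g

def global_matching_loss(img_g, txt_g, tau=0.07):
    """Symmetric InfoNCE over matched image-text pairs: one common choice
    for a 'global matching loss' (an assumption here)."""
    logits = img_g @ txt_g.t() / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```

The abstract states that a local matching loss is combined with the global one during training; the sketch omits the token-level term for brevity.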
