MSViT: A Lightweight Image Classification Model Integrating Multi-Scale Features
覃 晓1, 彭 磊1, 廖惠仙2, 元昌安3, 赵剑波1, 邓超1, 钱泉梅1, 卢虹妃1, 龚远旭1
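(1. Guangxi Key Laboratory of Human-Computer Interaction and Intelligent Decision Making, Nanning Normal University; 2. School of Digital Technology, Guangdong Finance and Trade Vocational College; 3. Guangxi Academy of Sciences)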
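Abstract:
To address the limitations of existing Vision Transformer models in capturing local features and fusing multi-scale features, this paper proposes MSViT (Multi-scale Vision Transformer), a new lightweight hybrid image classification model that integrates multi-scale features. First, a multi-scale feedforward neural network (MSFFN) module that captures channel features is designed in the encoder, so that spatial and multi-scale channel features are extracted more effectively. Second, a new cascaded feature fusion decoder is designed; by integrating a feature pyramid network (FPN) with a multi-stage feature fusion decoder, it significantly improves the model's ability to interact with and fuse features at different scales. Finally, the model introduces a multi-order loss function to comprehensively optimize how features at different scales perform in the image classification task. To verify the effectiveness of MSViT, extensive experiments are conducted on four datasets: a subset of ImageNet, Cifar100, APTOS2019, and Mushroom66. On the ImageNet subset, MSViT achieves a Top-1 accuracy of 87.58%, which is 2.27% higher than Edgevits_xxs. These results demonstrate the effectiveness of MSViT in image classification tasks.
Key words: image classification; multi-scale feature fusion; multi-order loss function; feature pyramid network; Transformer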
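DOI:
Received: 2024-07-13; Revised: 2024-10-14
Funding: Science and Technology Innovation 2030 Major Project "Brain Science and Brain-Inspired Research" of the Ministry of Science and Technology (2021ZD0201904) and the Guangxi Science and Technology Major Project (桂科AA22068057)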
MSViT: A Lightweight Image Classification Model Integrating Multi-Scale Features
QIN Xiao1, PENG Lei1, LIAO Huixian2, YUAN Changan3, ZHAO Jianbo1, DENG Chao1, QIAN Quanmei1, LU Hongfei1, GONG Yuanxu1
(1. Guangxi Key Laboratory of Human-Computer Interaction and Intelligent Decision Making, Nanning Normal University; 2. School of Digital Technology, Guangdong Finance and Trade Vocational College; 3. Guangxi Academy of Sciences)
Abstract:
To address the limitations of existing Vision Transformer models in local feature capture and multi-scale feature fusion, this paper proposes MSViT (Multi-scale Vision Transformer), a new lightweight hybrid image classification model that integrates multi-scale features. First, a multi-scale feedforward neural network (MSFFN) module that captures channel features is designed in the encoder, which better extracts spatial and multi-scale channel features. Second, a new cascaded feature fusion decoder is designed; by integrating the feature pyramid network (FPN) with a multi-stage feature fusion decoder, it significantly improves the model's ability to interact with and fuse features at different scales. Finally, a multi-order loss function is introduced to comprehensively optimize the performance of features at different scales in image classification tasks. To verify the effectiveness of MSViT, extensive experiments were conducted on four datasets: a subset of ImageNet, Cifar100, APTOS2019, and Mushroom66. The experimental results on the ImageNet subset show that MSViT achieves a Top-1 accuracy of 87.58%, which is 2.27% higher than Edgevits_xxs. These results demonstrate the effectiveness of MSViT in image classification tasks.
Key words: image classification; multi-scale feature fusion; multi-order loss function; feature pyramid network; Transformer