Guangxi Sciences


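MSViT: A Lightweight Image Classification Model Integrating Multi-Scale Features
QIN Xiao¹, PENG Lei¹, LIAO Huixian², YUAN Changan³, ZHAO Jianbo¹, DENG Chao¹, QIAN Quanmei¹, LU Hongfei¹, GONG Yuanxu¹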
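(1. Guangxi Key Laboratory of Human-Computer Interaction and Intelligent Decision Making, Nanning Normal University; 2. School of Digital Technology, Guangdong Finance and Trade Vocational College; 3. Guangxi Academy of Sciences)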

Abstract:

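To address the limitations of existing Vision Transformer models in capturing local features and fusing multi-scale features, this paper proposes a new lightweight hybrid image classification model that integrates multi-scale features, the Multi-scale Vision Transformer (MSViT). First, a Multi-Scale Feed Forward Network (MSFFN) module that captures channel features is designed in the encoder; it effectively extracts spatial and multi-scale channel features. Second, a new Cascade Feature Fusion Decoder (CFFD) is designed, which integrates a Feature Pyramid Network (FPN) with a multi-stage feature fusion decoder and significantly improves the model's ability to interact and fuse features at different scales. Finally, a multi-order loss function is introduced to comprehensively optimize how features at different scales perform in image classification. To verify the effectiveness of MSViT, extensive experiments are conducted on four datasets: a subset of ImageNet-1k (Small_ImageNet), Cifar100, a diabetic retinopathy dataset (APTOS2019), and a mushroom dataset (Mushroom66). On Small_ImageNet, MSViT achieves a Top-1 accuracy of 87.58%, which is 2.27% higher than EdgeViT-XXS. The experimental results demonstrate the effectiveness of MSViT for image classification.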

Keywords: image classification; multi-scale feature fusion; multi-order loss function; Feature Pyramid Network (FPN); Transformer


Received: 2024-07-13; Revised: 2025-02-27

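Funding: Science and Technology Innovation 2030 "Brain Science and Brain-like Research" Major Project of the Ministry of Science and Technology (2021ZD0201904) and the Guangxi Science and Technology Major Project (桂科AA22068057)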

MSViT: A Lightweight Image Classification Model Integrating Multi-Scale Features

QIN Xiao¹, PENG Lei¹, LIAO Huixian², YUAN Changan³, ZHAO Jianbo¹, DENG Chao¹, QIAN Quanmei¹, LU Hongfei¹, GONG Yuanxu¹

(1. Guangxi Key Laboratory of Human-Computer Interaction and Intelligent Decision Making, Nanning Normal University; 2. School of Digital Technology, Guangdong Finance and Trade Vocational College; 3. Guangxi Academy of Sciences)

Abstract:

To address the limitations of existing Vision Transformer models in local feature capture and multi-scale feature fusion, this paper proposes MSViT (Multi-Scale Vision Transformer), a new lightweight hybrid image classification model that integrates multi-scale features. First, a Multi-Scale Feed Forward Network (MSFFN) module that captures channel features is designed in the encoder; it effectively extracts spatial and multi-scale channel features. Second, a new Cascade Feature Fusion Decoder (CFFD) is designed. By integrating a Feature Pyramid Network (FPN) with a multi-stage feature fusion decoder, it significantly improves the model's ability to interact and fuse features at different scales. Finally, a multi-order loss function is introduced to comprehensively optimize how features at different scales perform in image classification. To verify the effectiveness of MSViT, extensive experiments were conducted on four datasets: a subset of ImageNet-1k (Small_ImageNet), Cifar100, a diabetic retinopathy dataset (APTOS2019), and a mushroom dataset (Mushroom66). On the Small_ImageNet dataset, MSViT achieves 87.58% Top-1 accuracy, which is 2.27% higher than EdgeViT-XXS. The experimental results demonstrate the effectiveness of MSViT in image classification tasks.
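The abstract does not state the form of the multi-order loss function. As an illustration only, one common way to supervise features at several scales is to attach a classification head to each stage and sum the weighted cross-entropy losses. The sketch below assumes that formulation; the function names, the number of stages, and the weights are hypothetical and are not taken from the paper.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy for one sample, given raw class scores and the true class index."""
    m = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def multi_stage_loss(stage_logits, target, weights=None):
    """Weighted sum of per-stage cross-entropy losses.

    Assumed formulation for illustration; not the loss defined in the paper.
    """
    if weights is None:
        weights = [1.0] * len(stage_logits)  # default: every stage counts equally
    if len(weights) != len(stage_logits):
        raise ValueError("need exactly one weight per stage")
    return sum(
        w * softmax_cross_entropy(z, target)
        for w, z in zip(weights, stage_logits)
    )


# Three hypothetical classification heads (one per feature scale), four classes.
stage_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [1.5, 1.0, 0.0, -0.5],
    [3.0, 0.2, 0.1, 0.0],
]
loss = multi_stage_loss(stage_logits, target=0, weights=[0.5, 0.3, 1.0])
print(f"combined loss: {loss:.4f}")
```

In this sketch the last stage carries the largest weight, on the assumption that the final fused feature drives the prediction while earlier stages act as auxiliary supervision.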

Key words: image classification; multi-scale feature fusion; multi-order loss function; Feature Pyramid Network (FPN); Transformer
