MSViT: A Lightweight Image Classification Model Integrating Multi-Scale Features
QIN Xiao (覃晓)1, PENG Lei (彭磊)1, LIAO Huixian (廖惠仙)2, YUAN Changan (元昌安)3, ZHAO Jianbo (赵剑波)1, DENG Chao (邓超)1, QIAN Quanmei (钱泉梅)1, LU Hongfei (卢虹妃)1, GONG Yuanxu (龚远旭)1
(1. Guangxi Key Laboratory of Human-Computer Interaction and Intelligent Decision Making, Nanning Normal University; 2. Guangdong Finance and Trade Vocational College of digital technology; 3. Guangxi Academy of Sciences)
To address the limitations of existing Vision Transformer models in capturing local features and fusing multi-scale features, this paper proposes MSViT (Multi-Scale Vision Transformer), a lightweight hybrid image classification model that integrates multi-scale features. First, a Multi-Scale Feed Forward Network (MSFFN) module is designed in the encoder to capture channel features; it effectively extracts spatial and multi-scale channel features. Second, a new Cascade Feature Fusion Decoder (CFFD) is designed. By integrating a Feature Pyramid Network (FPN) with a multi-stage feature fusion decoder, it significantly improves the model's ability to make features at different scales interact and fuse. Finally, a multi-order loss function is introduced to comprehensively optimize how features at different scales perform in image classification. To verify the effectiveness of MSViT, extensive experiments were conducted on four datasets: a subset of ImageNet-1k (Small_ImageNet), Cifar100, a diabetic retinopathy dataset (APTOS2019), and a mushroom dataset (Mushroom66). On Small_ImageNet, MSViT achieves a Top-1 accuracy of 87.58%, which is 2.27% higher than EdgeViT-XXS. The experimental results demonstrate the effectiveness of MSViT in image classification tasks.
Key words:  Image classification  Multi-scale feature fusion  Multi-order loss function  Feature Pyramid Network (FPN)  Transformer
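
The abstract does not state the form of the multi-order loss function. One common way to supervise features at several scales is to attach a classifier to each decoder stage and sum the per-stage cross-entropy losses with fixed weights. The sketch below illustrates that idea only, in plain Python. The function names (`cross_entropy`, `multi_stage_loss`, `top1_accuracy`), the stage weights, and the example logits are all hypothetical and are not taken from the paper. It also shows how a Top-1 accuracy figure such as the one reported above is computed.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Cross-entropy of one sample from raw logits, using log-sum-exp for stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def multi_stage_loss(
    stage_logits: Sequence[Sequence[float]],
    target: int,
    weights: Sequence[float],
) -> float:
    """Weighted sum of per-stage cross-entropy losses.

    ASSUMPTION: this weighted-sum form is an illustration, not the
    loss defined in the MSViT paper.
    """
    if len(stage_logits) != len(weights):
        raise ValueError("need exactly one weight per stage")
    return sum(w * cross_entropy(z, target) for z, w in zip(stage_logits, weights))


def top1_accuracy(batch_logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = 0
    for logits, label in zip(batch_logits, targets):
        predicted = max(range(len(logits)), key=lambda i: logits[i])
        if predicted == label:
            correct += 1
    return correct / len(targets)


if __name__ == "__main__":
    # Hypothetical logits from three decoder stages for one 3-class sample.
    stages = [[2.0, 0.5, 0.1], [1.0, 1.0, 1.0], [0.2, 3.0, 0.1]]
    print(multi_stage_loss(stages, target=0, weights=[1.0, 0.5, 0.5]))
    print(top1_accuracy([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]], [1, 0, 0]))
```

In this sketch the weights control how strongly each scale is supervised. Setting every weight except the last to zero reduces it to an ordinary single-head classification loss.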

