Cite this article:
- QIN Xiao, PENG Lei, LIAO Huixian, YUAN Chang'an, ZHAO Jianbo, DENG Chao, QIAN Quanmei, LU Hongfei, GONG Yuanxu. MSViT: A Lightweight Image Classification Hybrid Model Integrating Multi-Scale Features [J]. Guangxi Sciences, 2024, 31(5): 912-924.
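MSViT: A Lightweight Image Classification Hybrid Model Integrating Multi-Scale Features
QIN Xiao1,2, PENG Lei1, LIAO Huixian3, YUAN Chang'an4, ZHAO Jianbo1, DENG Chao1, QIAN Quanmei1, LU Hongfei1, GONG Yuanxu1

(1. Guangxi Key Laboratory of Human-Computer Interaction and Intelligent Decision Making, Nanning Normal University, Nanning, Guangxi, 530100, China; 2. Guangxi Regional Collaborative Innovation Center for Multi-Source Data Integration and Intelligent Processing, Guilin, Guangxi, 541004, China; 3. College of Digital Technology, Guangdong Vocational College of Finance and Trade, Qingyuan, Guangdong, 511510, China; 4. Guangxi Academy of Sciences, Nanning, Guangxi, 530007, China)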
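Abstract:
To address the limitations of existing Vision Transformer (ViT) models in capturing local features and fusing multi-scale features, this paper proposes a new lightweight hybrid image classification model that integrates multi-scale features, the Multi-Scale Vision Transformer (MSViT). First, a Multi-Scale Feed Forward Network (MSFFN) module that captures channel features is designed in the encoder; this module effectively extracts spatial and multi-scale channel features. Second, a new Cascade Feature Fusion Decoder (CFFD) is designed; by combining a Feature Pyramid Network (FPN) with a multi-stage feature fusion decoder, it significantly improves the model's ability to exchange and fuse features across scales. Finally, the model introduces a multi-order loss function to comprehensively optimize how features at different scales perform in image classification. To validate MSViT, extensive experiments are conducted on four datasets: a subset of ImageNet-1k (Small_ImageNet), Cifar 100, a diabetic retinopathy dataset (APTOS 2019), and a mushroom dataset (Mushroom 66). On Small_ImageNet, MSViT achieves a Top-1 accuracy of 87.58%, which is 2.27% higher than that of EdgeViT-XXS. The experimental results demonstrate the effectiveness of MSViT for image classification.
Keywords: image classification; multi-scale feature fusion; multi-order loss function; Feature Pyramid Network (FPN); Transformer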
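DOI: 10.13656/j.cnki.gxkx.20241127.009
Received: 2024-07-13; Revised: 2024-10-14
Funding: Supported by the Science and Technology Innovation 2030 "Brain Science and Brain-Like Research" Major Project of the Ministry of Science and Technology (2021ZD0201904) and the Guangxi Science and Technology Major Project (Guike AA22068057).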
|
MSViT: A Lightweight Image Classification Hybrid Model Integrating Multi-Scale Features
QIN Xiao1,2, PENG Lei1, LIAO Huixian3, YUAN Chang'an4, ZHAO Jianbo1, DENG Chao1, QIAN Quanmei1, LU Hongfei1, GONG Yuanxu1

(1. Guangxi Key Laboratory of Human-Computer Interaction and Intelligent Decision Making, Nanning Normal University, Nanning, Guangxi, 530100, China; 2. Guangxi Regional Collaborative Innovation Center for Multi-Source Data Integration and Intelligent Processing, Guilin, Guangxi, 541004, China; 3. College of Digital Technology, Guangdong Vocational College of Finance and Trade, Qingyuan, Guangdong, 511510, China; 4. Guangxi Academy of Sciences, Nanning, Guangxi, 530007, China)
Abstract:
To address the limitations of existing Vision Transformer (ViT) models in local feature capture and multi-scale feature fusion, a new lightweight hybrid image classification model integrating multi-scale features, the Multi-Scale Vision Transformer (MSViT), is proposed. Firstly, a Multi-Scale Feed Forward Network (MSFFN) module is designed to capture channel features in the encoder; it effectively extracts spatial and multi-scale channel features. Secondly, a new Cascade Feature Fusion Decoder (CFFD) is designed. By integrating the Feature Pyramid Network (FPN) with a multi-stage feature fusion decoder, it significantly improves the model's ability to exchange and fuse features at different scales. Finally, a multi-order loss function is introduced to optimize the performance of features at different scales in image classification tasks. To validate the effectiveness of MSViT, extensive experiments are conducted on four datasets [a subset of ImageNet-1k (Small_ImageNet), Cifar 100, APTOS 2019, and Mushroom 66]. The experimental results on Small_ImageNet show that MSViT achieves a Top-1 accuracy of 87.58%, which is 2.27% higher than that of EdgeViT-XXS. The experimental results demonstrate the effectiveness of MSViT in image classification tasks.
Key words: image classification; multi-scale feature fusion; multi-order loss function; Feature Pyramid Network (FPN); Transformer
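The abstract does not define the multi-order loss function. As a rough illustration only, one common way to supervise several feature scales at once is to compute a classification loss on the prediction from each stage and take a weighted sum. The sketch below assumes that form; the function names, the stage weights, and the example logits are all hypothetical and are not taken from the paper.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def multi_stage_loss(stage_logits, target, weights=None):
    """Weighted sum of per-stage cross-entropy losses.

    This is an assumed form of multi-output supervision, not the
    paper's definition of its multi-order loss function.
    """
    if weights is None:
        weights = [1.0] * len(stage_logits)
    if len(weights) != len(stage_logits):
        raise ValueError("need exactly one weight per stage")
    return sum(w * cross_entropy(z, target)
               for w, z in zip(weights, stage_logits))


# Hypothetical logits from three decoder stages for a 3-class problem.
stages = [[2.0, 0.5, -1.0], [1.0, 1.0, 1.0], [0.2, 3.0, 0.1]]
loss = multi_stage_loss(stages, target=0, weights=[0.5, 0.3, 0.2])
print(f"{loss:.4f}")
```

Under this assumption, every stage receives its own gradient signal, so coarse and fine feature maps are both pushed toward the correct class rather than only the final output.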
|
|
|
|
|