Guangxi Sciences

Cite this article:

WU Lan, YANG Pan, LI Binquan, WANG Han. A Multi-modality Audio-Visual Speech Recognition Method under Large Vocabulary Environmental Noise[J]. Guangxi Sciences, 2023, 30(1): 52-60.

A Multi-modality Audio-Visual Speech Recognition Method under Large Vocabulary Environmental Noise
WU Lan, YANG Pan, LI Binquan, WANG Han
(School of Electrical Engineering, Henan University of Technology, Zhengzhou, Henan 450001, China)

DOI: 10.13656/j.cnki.gxkx.20230308.006

Funding: Supported by the National Natural Science Foundation of China (61973103), the Natural Science Foundation of Henan Province (222300420039), and the Natural Science Project of the Zhengzhou Science and Technology Bureau (21ZZXTCX01).

Abstract:

Audio-visual speech recognition (AVSR) exploits the correlation and complementarity between lip reading and speech recognition to improve character recognition accuracy. Three problems motivate this work: the recognition rate of lip reading is far lower than that of speech recognition, speech signals are easily corrupted by noise, and the recognition rate of existing AVSR methods drops sharply under environmental noise with a large vocabulary. To address them, this paper proposes a multi-modality audio-visual speech recognition (MAVSR) method. The method builds a dual-stream front-end encoder based on the self-attention mechanism and introduces a modality controller to counter the imbalance in per-modality recognition performance that arises when the audio modality dominates under environmental noise, which improves recognition stability and robustness. It also builds a multi-modal feature fusion network based on one-dimensional convolution to handle the heterogeneity of audio and video data and to strengthen the correlation and complementarity between the two modalities. Compared with existing mainstream methods, the method improves recognition accuracy by more than 7.58% on the three tasks of audio-only, video-only, and audio-visual fusion.

Key words: attention mechanisms|multi-modality|audio-visual speech recognition|lip reading|automatic speech recognition
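
The abstract names two components, a modality controller and a fusion network based on one-dimensional convolution, but gives no implementation details. The sketch below illustrates one plausible reading in plain Python and is not the authors' implementation. It assumes that the controller can be approximated by zeroing out one stream (a form of modality dropout), and that fusion concatenates time-aligned audio and video features per frame before a 1D convolution over time. The names `modality_gate` and `conv1d_fuse`, the toy feature sizes, and the all-ones weights are hypothetical.

```python
import random


def modality_gate(audio, video, mode):
    """Zero out one stream according to `mode` ("audio", "video", or "both").

    Hypothetical stand-in for the modality controller; the paper's
    actual mechanism is not described in the abstract.
    """
    if mode not in ("audio", "video", "both"):
        raise ValueError(f"unknown mode: {mode}")

    def zeros_like(seq):
        return [[0.0] * len(frame) for frame in seq]

    if mode == "audio":
        return audio, zeros_like(video)
    if mode == "video":
        return zeros_like(audio), video
    return audio, video


def conv1d_fuse(audio, video, weights, bias):
    """Fuse two time-aligned feature sequences with a 1D convolution over time.

    audio, video: lists of T frames, each frame a list of channel values.
    weights: [out_channels][kernel_size][in_channels], where in_channels
             is the audio channel count plus the video channel count.
    Uses "valid" padding, so the output has T - kernel_size + 1 frames.
    """
    if len(audio) != len(video):
        raise ValueError("streams must be aligned to the same number of frames")
    # Channel-wise concatenation: one joint feature vector per frame.
    frames = [a + v for a, v in zip(audio, video)]
    kernel_size = len(weights[0])
    fused = []
    for t in range(len(frames) - kernel_size + 1):
        out_frame = []
        for w_out, b in zip(weights, bias):
            acc = b
            for k in range(kernel_size):
                acc += sum(w * x for w, x in zip(w_out[k], frames[t + k]))
            out_frame.append(acc)
        fused.append(out_frame)
    return fused


# Toy input: 4 frames, 2 audio channels and 1 video channel per frame.
audio = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
video = [[0.5], [1.5], [2.5], [3.5]]
# One output channel, kernel size 2, all weights 1.0 (sums each 2-frame window).
weights = [[[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]]
bias = [0.0]

# Assumption: the mode is sampled at random, e.g. once per training batch.
mode = random.choice(["audio", "video", "both"])
gated_audio, gated_video = modality_gate(audio, video, mode)
print(mode, conv1d_fuse(gated_audio, gated_video, weights, bias))
```

Training on all three modes with the same fusion layer is one way a single model could serve the audio-only, video-only, and audio-visual tasks mentioned in the abstract. A real system would replace the hand-written loops with learned convolution layers in a deep learning framework.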
