Cite this article
  • YU Junhui, CHEN Yanping, QIN Yongbin, HUANG Hui. Research on Optimization Strategy of Chinese Judicial Entity Recognition Based on Machine Reading Comprehension[J]. Guangxi Sciences, 2023, 30(1): 27-34.



DOI:10.13656/j.cnki.gxkx.20230308.003
Funding: This work was supported by the Key Program of the Joint Fund of the National Natural Science Foundation of China (U1836205), the Major Research Plan of the National Natural Science Foundation of China (91746116), the National Natural Science Foundation of China (62166007, 62066007, 62066008), the Science and Technology Major Project of Guizhou Province (Qiankehe Major Project [2017]3002), and the Key Project of the Guizhou Provincial Science and Technology Foundation (Qiankehe Basic [2020]1Z055).
Research on Optimization Strategy of Chinese Judicial Entity Recognition Based on Machine Reading Comprehension
YU Junhui1,2, CHEN Yanping1,2, QIN Yongbin1,2, HUANG Hui1,2
(1.State Key Laboratory of Public Big Data, Guiyang, Guizhou 550025, China; 2.College of Computer Science and Technology, Guizhou University, Guiyang, Guizhou 550025, China)
Abstract:
To address the problems that entities in Chinese judicial information extraction datasets are highly domain-specific, and that existing Machine Reading Comprehension (MRC) models cannot supply sufficient label semantics through constructed questions and perform poorly on noisy samples, this study proposes a joint optimization strategy. First, a judicial domain dictionary is built by aggregating entities that occur repeatedly in the judicial corpus, and this domain-specific entity knowledge is injected into the RoBERTa-wwm pre-trained language model for further pre-training. Then, entity label semantics are fused into the sentence representation by using a self-attention mechanism to distinguish the importance of each character with respect to different label words. Finally, an adversarial training algorithm is applied during fine-tuning to strengthen the robustness and generalization ability of the model. Experimental results on the judicial information extraction dataset of the 2021 Challenge of AI in Law (CAIL2021) show that the proposed method improves the F1 score by 2.79% over the baseline model; the model also won a national third prize in the CAIL2021 judicial information extraction track, verifying the effectiveness of the joint optimization strategy.
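The label-semantic fusion step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes plain dot-product attention over same-dimension vectors, and the function names (`softmax`, `fuse_label_semantics`) are hypothetical. Each character vector scores every label-word vector, the scores are softmax-normalized, and the attention-weighted mixture of label vectors is added to the character representation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def fuse_label_semantics(char_vecs, label_vecs):
    """For each character vector, score its affinity to every label-word
    vector with a dot product, normalize the scores with softmax, and add
    the attention-weighted mixture of label vectors to the character
    representation. All vectors share the same dimension."""
    fused = []
    for c in char_vecs:
        scores = softmax([sum(ci * li for ci, li in zip(c, l)) for l in label_vecs])
        mix = [sum(w * l[d] for w, l in zip(scores, label_vecs))
               for d in range(len(c))]
        fused.append([ci + mi for ci, mi in zip(c, mix)])
    return fused
```

In the full model, the character vectors would come from the RoBERTa-wwm encoder and the label vectors from embeddings of the entity label words; here both are toy inputs.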
Key words:  judicial information extraction|pre-training|self-attention mechanism|label semantics|adversarial training
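The adversarial fine-tuning step can be sketched in the style of the Fast Gradient Method (FGM), a common choice for this kind of embedding-level adversarial training; the paper does not specify its exact algorithm, so this is an assumption, and the function names are hypothetical. The gradient of the loss with respect to the embeddings is rescaled to a fixed L2 norm and added to the embeddings to form the adversarial example for an extra training pass.

```python
import math

def fgm_perturbation(grad, epsilon=1.0):
    """Scale the embedding gradient to L2 norm epsilon (FGM-style):
    r = epsilon * g / ||g||. Returns the zero vector for a zero gradient."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]

def perturb(embedding, grad, epsilon=1.0):
    # Adversarial example: original embedding plus the scaled gradient.
    r = fgm_perturbation(grad, epsilon)
    return [e + ri for e, ri in zip(embedding, r)]
```

In practice the perturbation is applied to the word-embedding matrix inside the training loop, the adversarial loss is backpropagated, and the original embeddings are then restored before the optimizer step.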
