[1] HU Buhuan, ZHANG Jing, and ZHANG Ling. A plagiarism detection approach for Chinese documents based on semantic textual similarity[J]. Journal of Shenzhen University Science and Engineering, 2020, 37(Suppl.1): 107-111. [doi:10.3724/SP.J.1249.2020.99107]

A plagiarism detection approach for Chinese documents based on semantic textual similarity

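Journal of Shenzhen University Science and Engineering [ISSN: 1000-2618 / CN: 44-1401/N]

Volume:
37
Issue:
2020, Suppl. 1
Pages:
107-111
Section:
Educational Big Data Technology and Applications
Publication date:
2020-11-20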

Article Info

Title:
A plagiarism detection approach for Chinese documents based on semantic textual similarity
Article ID:
202099020
Author(s):
HU Buhuan, ZHANG Jing, and ZHANG Ling
Guangdong Province Key Laboratory of Computer Network, College of Computer Science and Technology, South China University of Technology, Guangzhou 510006, Guangdong Province, P.R. China (the Chinese-language record gives the School of Computer Science and Engineering, Guangzhou 510641)
Keywords:
computer science; natural language processing; plagiarism detection; semantic similarity; word vector representation
CLC number:
TP391.1
DOI:
10.3724/SP.J.1249.2020.99107
Document code:
A
Abstract:
Plagiarists often evade detection by rewriting text through operations such as synonym substitution and paraphrasing. To address this, we propose a plagiarism detection approach for Chinese documents based on semantic textual similarity. First, each document is split into sentences, and word2vec is used to represent each word of a sentence as a vector, which serves as the input to a convolutional neural network (CNN). The CNN then extracts and filters sentence features, computes the difference between the two sentences of a pair, and outputs their similarity. Sentence pairs with high similarity are treated as plagiarism candidates. Finally, copy-and-paste documents and semantically similar documents are used as the dataset to verify the proposed method and compare it with the traditional fingerprint feature extraction method. The proposed method is tested on a large, publicly available Tencent Cloud text similarity dataset and applied to plagiarism detection in students' homework. The results show that although the traditional moving-window fingerprint feature extraction method accurately finds identical fragments in two documents, it is sensitive to noise in semantically similar text, whereas the proposed approach can identify the semantically similar parts of documents.
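The sentence-level comparison described in the abstract can be illustrated with a small sketch. It is not the authors' model: averaged toy word vectors with cosine similarity are swapped in for the trained word2vec embeddings and the CNN, and character 2-gram overlap stands in for moving-window fingerprinting. The vectors in `TOY_VECTORS`, the function names, the 0.95 threshold, and the assumption that the input is already word-segmented with spaces are all hypothetical choices made for illustration.

```python
import math
import re

# Hypothetical toy embeddings standing in for trained word2vec vectors.
TOY_VECTORS = {
    "学生": [0.9, 0.1, 0.0],
    "同学": [0.85, 0.15, 0.05],  # near-synonym of 学生
    "提交": [0.1, 0.9, 0.1],
    "上交": [0.12, 0.88, 0.08],  # near-synonym of 提交
    "作业": [0.0, 0.2, 0.9],
    "天气": [0.7, -0.6, -0.3],
    "很好": [-0.5, 0.3, -0.7],
}


def split_sentences(document):
    """Split a document into sentences on sentence-ending punctuation."""
    parts = re.split(r"[。!?!?]", document)
    return [p.strip() for p in parts if p.strip()]


def sentence_vector(sentence):
    """Average the vectors of known tokens (assumed stand-in for CNN features)."""
    vectors = [TOY_VECTORS[t] for t in sentence.split() if t in TOY_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def char_kgrams(text, k=2):
    """Set of character k-grams of the text with spaces removed."""
    text = text.replace(" ", "")
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def kgram_overlap(a, b, k=2):
    """Jaccard overlap of character k-grams (stand-in for window fingerprints)."""
    ga, gb = char_kgrams(a, k), char_kgrams(b, k)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def suspicious_pairs(doc_a, doc_b, threshold=0.95):
    """Return (sentence_a, sentence_b, score) for pairs at or above threshold."""
    pairs = []
    for sa in split_sentences(doc_a):
        va = sentence_vector(sa)
        for sb in split_sentences(doc_b):
            vb = sentence_vector(sb)
            if va is None or vb is None:
                continue  # no known tokens, nothing to compare
            score = cosine(va, vb)
            if score >= threshold:
                pairs.append((sa, sb, score))
    return pairs


if __name__ == "__main__":
    source = "学生 提交 作业。天气 很好。"
    suspect = "同学 上交 作业。"
    for sa, sb, score in suspicious_pairs(source, suspect):
        print(sa, "|", sb, "|", round(score, 4), "|", kgram_overlap(sa, sb))
```

With these toy vectors, the synonym-substituted sentence is flagged by the vector comparison even though the two sentences share only a quarter of their character 2-grams, which is the kind of case a surface fingerprint is expected to miss. A real system would need a word segmenter and trained embeddings in place of the hand-written table.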



Memo:
Received: 2020-10-02
Foundation: Fund Project of China Education and Research Network (NGII20190511)
Corresponding author: Lecturer ZHANG Jing. E-mail: zhjing@scut.edu.cn
Citation: HU Buhuan, ZHANG Jing, ZHANG Ling. A plagiarism detection approach for Chinese documents based on semantic textual similarity[J]. Journal of Shenzhen University Science and Engineering, 2020, 37(Suppl.1): 107-111. (in Chinese)
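Foundation (Chinese-language record): Funded project of the China Education and Research Network (NGII20190615)
About the author: HU Buhuan (1998-), master's student at South China University of Technology. Research interest: natural language processing. E-mail: liuzc@nankai.edu.cn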
Last Update: 2020-11-26