A plagiarism detection method for Chinese documents based on semantic similarity

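Guangdong Provincial Key Laboratory of Computer Networks, School of Computer Science and Engineering, South China University of Technology, Guangzhou 510641, Guangdong, China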

computer science; natural language processing; plagiarism detection; semantic similarity; word vector representation

A plagiarism detection approach for Chinese documents based on semantic textual similarity
HU Buhuan, ZHANG Jing, and ZHANG Ling

Guangdong Province Key Laboratory of Computer Network, College of Computer Science and Technology, South China University of Technology, Guangzhou 510006, Guangdong Province, P. R. China

computer science; natural language processing; plagiarism detection; semantic similarity; word vector representation

DOI: 10.3724/SP.J.1249.2020.99107

Abstract

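To address plagiarism that evades detection through operations such as synonym substitution and paraphrasing, we propose a plagiarism detection method for Chinese documents based on semantic similarity computation. The document is split into sentences, and a word2vec model represents the words of each sentence as word vectors, which serve as the input to a convolutional neural network (CNN). The CNN extracts and filters the sentence features, computes the difference between the two sentences of a pair, and outputs the similarity of the pair; sentence pairs with high similarity are regarded as plagiarism. The method is tested on the large, publicly available Tencent Cloud text similarity dataset and applied to detecting plagiarism in student assignments. The results show that, although the traditional moving-window fingerprint feature extraction method can find identical fragments in two documents fairly accurately, it is easily affected by noise when the texts are only semantically similar, whereas the proposed semantic-similarity-based method can find the semantically similar parts of documents.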

To counter operations that interfere with plagiarism detection, such as synonym substitution and paraphrasing, we propose a plagiarism detection approach for Chinese documents based on semantic textual similarity. First, we split the document into sentences and use word2vec to represent each word of a sentence as a vector, which serves as the input to a convolutional neural network (CNN). The CNN then extracts and filters the sentence features, calculates the difference between the two sentences of each pair, and outputs the similarity of the pair. Sentence pairs with high similarity are treated as candidates for plagiarism. Finally, copy-and-paste documents and semantically similar documents are used as the dataset to verify the proposed method and compare it with the traditional fingerprint feature extraction method. The proposed method is tested on a large, publicly available Tencent Cloud text similarity dataset and applied to plagiarism detection in students' homework. The results show that, although the traditional fingerprint feature extraction method accurately finds identical fragments in two documents, it is sensitive to noise in semantically similar documents, whereas the proposed approach overcomes this disadvantage by detecting the semantically similar parts.
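The sentence-pair pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the CNN encoder is replaced here by a plain average of word vectors, the 3-dimensional vectors in `TOY_VECTORS` are hypothetical values rather than trained word2vec output, the greedy dictionary tokenizer stands in for a real Chinese word segmenter, and the 0.9 threshold is an arbitrary assumption.

```python
import math
import re

# Hypothetical 3-d word vectors for illustration only; a real system would
# load vectors trained with word2vec.
TOY_VECTORS = {
    "学生": [1.0, 0.0, 0.0],
    "抄袭": [0.0, 1.0, 0.0],
    "剽窃": [0.0, 0.9, 0.1],   # near-synonym of 抄袭
    "作业": [0.0, 0.0, 1.0],
    "论文": [0.1, 0.0, 0.9],
    "天气": [1.0, -1.0, 0.0],
}


def split_sentences(text):
    """Split a document on Chinese and Western sentence-ending punctuation."""
    parts = re.split(r"[。！？!?；;\n]+", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence, vocab, max_len=4):
    """Greedy forward maximum matching; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + n]
            if n == 1 or piece in vocab:
                tokens.append(piece)
                i += n
                break
    return tokens


def sentence_vector(tokens, vectors):
    """Average the vectors of known tokens (a simple stand-in for the CNN)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_similar_pairs(source, suspicious, vectors, threshold=0.9):
    """Return (source_sentence, suspicious_sentence, score) for pairs at or above threshold."""
    def encode(text):
        return [(s, sentence_vector(tokenize(s, vectors), vectors))
                for s in split_sentences(text)]

    hits = []
    for s, sv in encode(source):
        for t, tv in encode(suspicious):
            if sv is None or tv is None:
                continue  # no known words, so no basis for comparison
            score = cosine(sv, tv)
            if score >= threshold:
                hits.append((s, t, score))
    return hits


if __name__ == "__main__":
    for s, t, score in find_similar_pairs(
            "学生抄袭作业。", "学生剽窃论文。今天天气很好。", TOY_VECTORS):
        print(s, "<->", t, round(score, 3))
```

In this toy setup the paraphrased sentence is flagged even though it shares only one word with the source, while the unrelated sentence is not. A trained encoder would replace `sentence_vector`, and the threshold would need tuning on labeled data.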

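For contrast, the moving-window fingerprint baseline mentioned in the abstract can be sketched in the style of winnowing: hash every character k-gram, keep the minimum hash in each sliding window, and compare the resulting fingerprint sets. The abstract does not give the baseline's exact configuration, so the k-gram length, window size, CRC32 hash, and Jaccard comparison below are all assumptions chosen for illustration.

```python
import zlib


def kgram_hashes(text, k=3):
    """Hash every character k-gram; CRC32 is deterministic across runs."""
    return [zlib.crc32(text[i:i + k].encode("utf-8"))
            for i in range(len(text) - k + 1)]


def winnow(hashes, window=4):
    """Keep the minimum hash of each sliding window as the fingerprint set."""
    if not hashes:
        return set()
    if len(hashes) < window:
        return {min(hashes)}
    return {min(hashes[i:i + window])
            for i in range(len(hashes) - window + 1)}


def fingerprint_similarity(a, b, k=3, window=4):
    """Jaccard overlap of the two texts' winnowed fingerprint sets."""
    fa = winnow(kgram_hashes(a, k), window)
    fb = winnow(kgram_hashes(b, k), window)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


if __name__ == "__main__":
    original = "学生抄袭作业的行为应当受到严肃处理"
    paraphrase = "学生剽窃论文的做法必须得到认真对待"
    print("verbatim copy:", fingerprint_similarity(original, original))
    print("paraphrase:   ", fingerprint_similarity(original, paraphrase))
```

A verbatim copy yields identical fingerprint sets, whereas the paraphrase shares no character 3-gram with the original, so its fingerprints barely overlap, if at all. This is the weakness that a semantic similarity score is meant to address.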