Mining method of public opinion related topic in network multimedia data
LIU Runqi1, HE Xingshi1, NAN Yifei2, and WANG Bo2

1)School of Science, Xi'an Polytechnic University, Xi'an 710048, Shaanxi Province, P.R. China; 2)Ministry of Education Key Lab for Intelligent Networks and Network Security, Xi'an Jiaotong University, Xi'an 710049, Shaanxi Province, P.R. China

pattern recognition; image processing; microblog; Sina weibo; multimedia data; text detection; text extraction; subject recognition; public opinion supervision

DOI: 10.3724/SP.J.1249.2020.01072

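Efficiently mining the topics associated with online public opinion events from multimedia data such as images and videos poses a major challenge to the effective supervision of online public opinion. This paper studies methods for extracting text from multimedia data such as images and video screenshots and, on that basis, detects topics related to public opinion. Three typical public opinion events on Sina weibo are selected as research objects, and a web crawler is designed to collect the multimodal text, image, and video data of these events. A text detection algorithm based on the connectionist text proposal network (CTPN) is used to locate text, and a method combining a DenseNet network with connectionist temporal classification (CTC) is used to extract it. A method for extracting public-opinion-related topics that combines multi granularity-latent Dirichlet allocation (MG-LDA) with jieba word segmentation is proposed. Experimental results show that the proposed method can accurately extract text in different formats, resolutions, and colors, at arbitrary positions and different angles, from multimedia data, providing strong data support for accurately tracking how public opinion evolves.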

Social media has become a platform for the rapid propagation of rumors. More and more users post pictures and videos to express their opinions so as to avoid detection by text-based approaches, which greatly reduces the efficiency of online public opinion monitoring. To tackle this problem, this paper studies how to extract related opinions from multimedia network data. First, three typical events on Sina weibo are selected as research targets, and a web crawler is designed to collect the multimedia data. Second, a text detection algorithm based on the connectionist text proposal network (CTPN) performs text localization, and a method combining DenseNet with connectionist temporal classification (CTC) performs text extraction. Finally, an algorithm combining multi granularity-latent Dirichlet allocation (MG-LDA) with jieba is proposed to identify related topics from the extracted text. The experimental results show that the proposed method can accurately extract text from multimedia data with different formats, resolutions, and colors, as well as text at different rotation angles. This research provides a solid foundation for online public opinion monitoring.
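The CTC step in the text extraction stage can be illustrated with a minimal greedy-decoding sketch. It is not the implementation used in this work: the function name `ctc_greedy_decode`, the toy alphabet, and the convention that index 0 is the blank label are all assumptions made for illustration. The per-frame scores stand in for what a recognition network such as DenseNet would output for each horizontal slice of a detected text line.

```python
from typing import List, Sequence


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    alphabet: Sequence[str],
    blank: int = 0,
) -> str:
    """Greedy CTC decoding: take the best label per frame,
    merge consecutive repeats, then drop blank labels."""
    decoded: List[str] = []
    previous = blank
    for scores in frame_scores:
        # Index of the highest-scoring label for this frame.
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit a character only when it is not blank and not a repeat
        # of the label chosen for the immediately preceding frame.
        if best != blank and best != previous:
            decoded.append(alphabet[best])
        previous = best
    return "".join(decoded)


# Toy example (assumed data): index 0 is the blank label.
alphabet = ["<blank>", "a", "b"]
frames = [
    [0.1, 0.8, 0.1],  # a
    [0.1, 0.7, 0.2],  # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank (separates the two a's)
    [0.2, 0.6, 0.2],  # a
    [0.1, 0.2, 0.7],  # b
    [0.1, 0.1, 0.8],  # b (repeat, merged)
]
print(ctc_greedy_decode(frames, alphabet))  # prints "aab"
```

The blank label is what allows a genuinely doubled character to survive the merging of repeats, which matters for text recognized from images where one character usually spans several frames.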
