基于观测点机制的异常点检测算法

1.河北民族师范学院数学与计算机科学学院,河北承德067055;2.深圳大学计算机与软件学院,广东深圳518060;3.人工智能与数字经济广东省实验室(深圳),广东深圳518107

人工智能;异常点检测;观测点;近邻;局部异常因子;概率密度函数;核密度估计;数据挖掘

A new outlier detection algorithm based on observation-point mechanism
YU Wanguo1,HE Yulin2,3,and QIN Huilin2

1.College of Mathematics and Computer Science, Hebei Normal University for Nationalities, Chengde 067055, Hebei Province, P. R. China;2.College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong Province, P. R. China;3.Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen 518107, Guangdong Province, P. R. China

artificial intelligence; outlier detection; observation point; nearest neighbor; local outlier factor;probability density function; kernel density estimation; data mining

DOI: 10.3724/SP.J.1249.2022.03355

备注

异常点检测是数据挖掘领域的一个重要研究方向,传统的基于近邻和局部异常因子的异常点检测算法存在计算复杂度高和误检率高的缺陷.为解决该缺陷,提出一种基于观测点机制的异常点检测(observation-pointmechanism-basedoutlierdetection,OPOD)算法.首先在原始样本空间中随机放置若干观测点,然后计算观测点与样本点之间的距离,将原始数据转换为与观测点相对应的距离数据,再估计距离数据的概率密度函数,进而计算距离数据出现的概率值,最后通过对多个观测点距离数据概率值的融合最终确定原始样本点中的异常点.基于PyCharm平台,采用sklearn.datasets的make_blobs函数生成仿真数据集,分别测试不同规模和不同维度数据集对OPOD算法性能的影响,并对比了OPOD算法、基于局部异常因子的异常点检测(localoutlierfactor-basedoutlierdetection,LOFOD)算法和基于近邻的异常点检测(nearestneighbor-basedoutlierdetection,NNOD)算法的运行时间、异常点召回率和误检率.结果表明,OPOD算法具有对异常点进行检测的能力,且随着观测点数量的增加算法呈收敛趋势;观测点选取合适的条件下,具有比基于近邻和局部异常因子的异常点检测算法更低的时间复杂度和更好的异常点检测效果.
Outlier detection is an important branch of data mining research, and has wide applications in the fields of finance, telecommunications, and biology. The traditional nearest neighbor-based outlier detection (NNOD) and local outlier factor-based outlier detection (LOFOD) algorithms generally have high computational complexity and high false-detection rates. This paper proposes an observation-point mechanism-based outlier detection (OPOD) algorithm comprising four core steps: i) generating random observation points in the original data space; ii) estimating the probability density function of distance values between the given observation point and all data points; iii) calculating the probabilities of distance values for the given observation point; and iv) detecting outliers by combining the probabilities corresponding to the different observation points. Extensive experiments are conducted to demonstrate the feasibility, rationality, and effectiveness of the OPOD algorithm. The experimental results show that the OPOD algorithm converges as the number of observation points increases, and can attain better detection performance with lower computation complexity than the NNOD and LOFOD algorithms.
·