Journal of Shenzhen University Science and Engineering

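Outlier detection is an important research direction in data mining. Traditional outlier detection algorithms based on nearest neighbors and on the local outlier factor suffer from high computational complexity and high false-detection rates. To address these shortcomings, an observation-point mechanism-based outlier detection (OPOD) algorithm is proposed. Observation points are first placed at random in the original sample space; the distances between each observation point and the sample points are then computed, converting the original data into distance data associated with that observation point; the probability density function of the distance data is estimated and used to compute the probability of each distance value; finally, the probability values obtained from multiple observation points are fused to determine which of the original sample points are outliers. On the PyCharm platform, synthetic data sets are generated with the make_blobs function of sklearn.datasets to test how data size and data dimension affect the performance of OPOD, and the running time, outlier recall, and false-detection rate of OPOD are compared with those of the local outlier factor-based outlier detection (LOFOD) algorithm and the nearest neighbor-based outlier detection (NNOD) algorithm. The results show that OPOD is able to detect outliers and tends to converge as the number of observation points increases; with suitably chosen observation points, it achieves lower time complexity and better detection performance than the nearest neighbor-based and local outlier factor-based algorithms.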

Outlier detection is an important branch of data mining research and has wide applications in finance, telecommunications, and biology. The traditional nearest neighbor-based outlier detection (NNOD) and local outlier factor-based outlier detection (LOFOD) algorithms generally have high computational complexity and high false-detection rates. This paper proposes an observation-point mechanism-based outlier detection (OPOD) algorithm comprising four core steps: i) generating random observation points in the original data space; ii) estimating the probability density function of the distance values between a given observation point and all data points; iii) calculating the probabilities of the distance values for that observation point; and iv) detecting outliers by combining the probabilities corresponding to the different observation points. Experiments on synthetic data sets of different sizes and dimensions are conducted to demonstrate the feasibility, rationality, and effectiveness of the OPOD algorithm. The experimental results show that the OPOD algorithm converges as the number of observation points increases and, with suitably chosen observation points, can attain better detection performance with lower computational complexity than the NNOD and LOFOD algorithms.
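The four steps listed in the abstract can be sketched in plain Python as follows. This is only an illustrative sketch and not the authors' implementation: sampling observation points uniformly inside the bounding box of the data, the Gaussian kernel with a normal-reference rule-of-thumb bandwidth, the use of the estimated density value as a stand-in for the probability of a distance, and the fusion by averaging log-densities are all assumptions, and the names `opod_scores` and `detect_outliers` are hypothetical.

```python
import math
import random


def normal_reference_bandwidth(values):
    """Rule-of-thumb bandwidth 1.06 * std * n^(-1/5) (assumed choice)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return max(1.06 * std * n ** (-0.2), 1e-12)  # guard against zero spread


def kde_at_samples(values, bandwidth):
    """Gaussian kernel density estimate evaluated at each sample itself."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in values
    ]


def opod_scores(data, n_observation_points=30, seed=0):
    """Fused log-density score per sample; lower means more outlier-like."""
    rng = random.Random(seed)
    dims = range(len(data[0]))
    lows = [min(p[j] for p in data) for j in dims]
    highs = [max(p[j] for p in data) for j in dims]
    scores = [0.0] * len(data)
    for _ in range(n_observation_points):
        # Step i: one random observation point inside the data's bounding box
        # (assumed sampling scheme).
        obs = [rng.uniform(lows[j], highs[j]) for j in dims]
        # Step ii: distances to every sample, then their estimated density.
        dists = [math.dist(obs, p) for p in data]
        dens = kde_at_samples(dists, normal_reference_bandwidth(dists))
        # Steps iii-iv: use the density as the probability proxy and fuse the
        # observation points by averaging log-densities (assumed fusion rule).
        for i, d in enumerate(dens):
            scores[i] += math.log(d) / n_observation_points
    return scores


def detect_outliers(data, n_outliers, n_observation_points=30, seed=0):
    """Indices of the n_outliers samples with the lowest fused score."""
    scores = opod_scores(data, n_observation_points, seed)
    ranked = sorted(range(len(data)), key=lambda i: scores[i])
    return ranked[:n_outliers]
```

Each observation point reduces the data to a one-dimensional set of distances, so the density estimate stays one-dimensional regardless of the data dimension. A single observation point can be misled when an outlier happens to lie at the same distance as a cluster, which is why several observation points are fused. The quadratic cost of this naive kernel estimate is a property of the sketch only and says nothing about the complexity reported for OPOD.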

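Introduction
1 Two classical outlier detection algorithms
2 Observation-point-based outlier detection algorithm
3 Experimental validation and result analysis
Conclusion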

Fig. 1 (Color online) Two-dimensional synthetic data sets generated with the make_blobs function in sklearn.datasets

Fig. 2 (Color online) Probability distributions of observation distances between observation points and data points

Fig. 3 Convergence of the OPOD algorithm for outlier detection on data sets of different sizes (L=40)

Fig. 4 Impact of data size on the number of observation points in the OPOD algorithm

Fig. 5 Convergence of the OPOD algorithm for outlier detection on data sets of different dimensions (N=1 010)

Fig. 6 Impact of data dimension on the number of observation points in the OPOD algorithm

Fig. 7 Time comparison among the NNOD, LOFOD and OPOD algorithms for different data sizes

Fig. 8 Time comparison among the NNOD, LOFOD and OPOD algorithms for different data dimensions

Table 1 Data sets used in the comparison among the OPOD, NNOD and LOFOD algorithms

Table 2 Recall (R), false-detection rate (F) and run time (t) of the OPOD and NNOD algorithms

Table 3 Recall (R), false-detection rate (F) and run time (t) of the OPOD and LOFOD algorithms
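For readers who want to reproduce quantities of the kind reported in Tables 2 and 3, the recall and the false-detection rate could be computed from index sets as sketched below. The definitions are assumptions, since the captions do not state them: recall is taken as the fraction of true outliers that were flagged, and the false-detection rate as the fraction of normal samples that were wrongly flagged (another common convention divides by the number of flagged samples instead). The function name is hypothetical.

```python
def recall_and_false_detection_rate(true_outliers, flagged, n_samples):
    """Return (recall, false-detection rate) under the assumed definitions."""
    true_set, flagged_set = set(true_outliers), set(flagged)
    hits = len(true_set & flagged_set)          # true outliers that were flagged
    false_alarms = len(flagged_set - true_set)  # normal samples that were flagged
    n_normal = n_samples - len(true_set)
    recall = hits / len(true_set) if true_set else 0.0
    false_rate = false_alarms / n_normal if n_normal else 0.0
    return recall, false_rate


# Hypothetical example: 1 010 samples, of which indices 1000..1009 are outliers.
true_idx = range(1000, 1010)
flagged_idx = list(range(1000, 1008)) + [3, 7]  # 8 hits and 2 false alarms
r, f = recall_and_false_detection_rate(true_idx, flagged_idx, n_samples=1010)
```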

