Stratified sampling based ensemble classification for imbalanced data
WANG Xinyue and JING Liping

School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, P.R. China

artificial intelligence; imbalanced classification; stratified sampling; ensemble learning; clustering; data mining

DOI: 10.3724/SP.J.1249.2019.01024

Imbalanced data sets are ubiquitous in real-world applications, and classifying them is a hard problem in data mining. Down-sampling the majority class is a simple and effective way to address the two core issues of imbalanced data processing: how to extract as much minority-class information as possible from a data set in which the majority class is overwhelmingly dominant, and how to build a learner without losing too much majority-class information. Existing down-sampling methods, however, often destroy the intrinsic structure of the original data or cause severe information loss. In this paper, we propose an ensemble classification method for imbalanced data based on stratified sampling of the majority class (EC-SS). To fully exploit the hidden structure of the majority class, an adaptive self-tuning clustering strategy first partitions the majority-class samples into strata; stratified sampling over these strata then under-samples the majority class to build the data components for ensemble learning. Each base learner therefore receives balanced input that preserves the structure of the original data, which improves the subsequent ensemble classification. Experiments on the benchmark imbalanced data sets Musk1, Ecoli3, Glass2, and Yeast6 show that EC-SS outperforms the following baselines: ensemble classification based on random sampling (EC-RS), adaptive sampling with optimal cost for class-imbalance learning (AdaS), kernel based adaptive synthetic data generation (KernelADASYN), and the cost-sensitive large margin distribution machine (CS-LDM).
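
The stratified under-sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the majority-class samples already carry stratum labels from some clustering step (the abstract's adaptive self-tuning clustering is not reproduced here), and it assumes proportional allocation across strata with each subset drawing as many majority samples as there are minority samples. The function names `stratified_undersample` and `build_balanced_subsets` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_undersample(majority, strata, n_samples, rng):
    """Draw n_samples items from `majority`, giving each stratum a share
    proportional to its size (assumed allocation rule, largest-remainder rounding)."""
    groups = defaultdict(list)
    for item, label in zip(majority, strata):
        groups[label].append(item)

    total = len(majority)
    n_samples = min(n_samples, total)
    # Ideal fractional quota per stratum under proportional allocation.
    quotas = {k: n_samples * len(v) / total for k, v in groups.items()}
    alloc = {k: int(q) for k, q in quotas.items()}
    # Give the remaining draws to the strata with the largest remainders.
    leftover = n_samples - sum(alloc.values())
    by_remainder = sorted(groups, key=lambda k: quotas[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1

    sample = []
    for k, items in groups.items():
        sample.extend(rng.sample(items, alloc[k]))  # without replacement
    return sample


def build_balanced_subsets(minority, majority, strata, n_subsets, seed=0):
    """Build one balanced (minority, sampled majority) pair per ensemble member."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        sampled = stratified_undersample(majority, strata, len(minority), rng)
        subsets.append((list(minority), sampled))
    return subsets


# Toy data: 12 majority samples in three strata of sizes 6, 4 and 2,
# and 6 minority samples. Stratum labels stand in for cluster assignments.
majority = [("maj", i) for i in range(12)]
strata = [0] * 6 + [1] * 4 + [2] * 2
minority = [("min", i) for i in range(6)]

subsets = build_balanced_subsets(minority, majority, strata, n_subsets=3)
```

Each pair in `subsets` would then train one base learner, and the learners' predictions would be combined into the ensemble decision; the combination rule is left out because the abstract does not specify it.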
