Journal of Shenzhen University Science and Engineering

Model selection based on particle swarm optimization for omics data classification

Yang Junshan (杨峻山)¹, Ji Zhen (纪震)¹, Xie Weixin (谢维信)¹, and Zhu Zexuan (朱泽轩)²

1) College of Information Engineering, Shenzhen University, Shenzhen 518060, Guangdong Province, P.R. China; 2) College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong Province, P.R. China

Keywords: omics dataset; particle swarm optimization; data sampling; feature selection; classification model; model selection; data mining

DOI: 10.3724/SP.J.1249.2016.03264

A model selection algorithm based on particle swarm optimization is proposed for omics data classification. The algorithm targets the high dimensionality, small sample size, and class imbalance that are common in omics data. Each particle encodes a candidate combination of a data sampling (sample balancing) model, a feature selection model, a classification model, and their hyperparameter settings. The swarm is driven toward the best classification performance on the omics data: particle velocities and positions are updated iteratively until a stopping criterion is met, and the optimal model combination and hyperparameters are then output. Experimental results on eight real-world omics datasets show that the proposed model selection algorithm avoids the subjective bias introduced by manual selection and improves both the best classification performance and its stability.
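The particle encoding and the velocity and position updates summarized in the abstract can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration, since the abstract does not give these details: the candidate model names, the encoding of each choice as a coordinate in [0, 1], the inertia and acceleration coefficients, and the `toy_fitness` function, which merely stands in for the cross-validated classification performance a real run would measure on an omics dataset.

```python
import random

# Hypothetical candidate pools; the paper's actual model lists are not given here.
SAMPLERS = ["none", "oversample", "undersample"]
SELECTORS = ["none", "filter", "wrapper"]
CLASSIFIERS = ["knn", "svm", "naive_bayes"]


def decode(position):
    """Map a position in [0, 1]^4 to (sampler, selector, classifier, hyperparameter)."""
    def pick(options, x):
        return options[min(int(x * len(options)), len(options) - 1)]
    return (pick(SAMPLERS, position[0]), pick(SELECTORS, position[1]),
            pick(CLASSIFIERS, position[2]), position[3])


def toy_fitness(position):
    """Placeholder score (assumption); a real run would return cross-validated performance."""
    sampler, selector, classifier, hyper = decode(position)
    score = 0.5
    score += 0.1 if sampler == "oversample" else 0.0
    score += 0.1 if selector == "filter" else 0.0
    score += 0.1 if classifier == "svm" else 0.0
    score += 0.2 * (1.0 - abs(hyper - 0.3))  # arbitrary peak at hyper = 0.3
    return score


def pso(fitness, dim=4, n_particles=20, n_iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Standard global-best PSO that maximizes `fitness` over [0, 1]^dim."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iters):  # stopping criterion: fixed iteration budget
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Position update, clipped to the valid encoding range.
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


best_position, best_score = pso(toy_fitness)
best_combo = decode(best_position)
```

Here `best_combo` holds the decoded sampler, feature selector, classifier, and hyperparameter value of the best particle found. Encoding the categorical choices as continuous coordinates is only one possible design; it lets the ordinary PSO update rule be reused unchanged, at the cost of imposing an artificial ordering on the model names.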