XIA Yulan, XIE Jiming, WANG Yajing, et al. Activity prediction of anti-cancer drug candidate ERα inhibitor [J]. Journal of Shenzhen University Science and Engineering, 2022, 39(5): 529-537. [doi:10.3724/SP.J.1249.2022.05529]





Activity prediction of anti-cancer drug candidate ERα inhibitor
XIA Yulan1, XIE Jiming1, WANG Yajing2, LU Mengyuan1, WANG Jinrui1, and QIN Yaqin1
1) Faculty of Transportation Engineering, Kunming University of Science and Technology, Kunming 650504, Yunnan Province, P. R. China
2) The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325006, Zhejiang Province, P. R. China
Keywords: computer application; ensemble learning; biological activity prediction; feature selection; hyperparameter optimization; random forest
Breast cancer is the most common malignancy threatening women's health worldwide. Studies have shown that the estrogen receptor alpha subtype (ERα) plays an important role in breast development and is regarded as an important target for breast cancer treatment; compounds that antagonize ERα activity are therefore candidate drugs for breast cancer. To predict the bioactivity of compounds against the ERα target under small-sample, many-feature conditions, an ensemble machine learning model for the quantitative structure-activity relationship (QSAR) of anti-breast cancer drugs is proposed, termed Mul-BHO-Bi-LSTM (multivariate-Bayesian hyperparametric optimization bi-directional long short-term memory). First, descriptive statistics and multicollinearity diagnosis are performed on 729 molecular descriptors of 1 974 compounds, and the random forest method is used to screen 20 significant variables whose importance scores exceed 0.01. Then, a two-dimensional feature matrix based on a convolutional neural network (CNN) is constructed, and Bayesian hyperparametric optimization (BHO) is used to tune the hyperparameters of the bi-directional long short-term memory (Bi-LSTM) model. Finally, the prediction performance of the model is analyzed and evaluated. The results show that, compared with the gradient boosting decision tree (GBDT) ensemble learning method, Mul-BHO-Bi-LSTM predicts better: the error indicators, namely mean squared error (MSE), normalized root mean squared error (NRMSE), error mean, and error standard deviation, are all less than 0.15, and the correlation indicators R² and r exceed 0.99, indicating that the model has good robustness and generalization. The model can provide a method for the screening and design of anti-breast cancer drugs.
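The error and correlation indicators named in the abstract (MSE, NRMSE, error mean, error standard deviation, R² and r) can be computed from a vector of observed and predicted activities. The following is a minimal sketch rather than the authors' code: the function name `regression_report` is hypothetical, the sample values are made up for illustration, and NRMSE is assumed to be RMSE divided by the range of the observed values, since the abstract does not state which normalization was used.

```python
import math
from statistics import mean, pstdev


def regression_report(y_true, y_pred):
    """Compute the error and correlation indicators listed in the abstract.

    Hypothetical helper; the NRMSE normalization (range of y_true) is an
    assumption, not something the abstract specifies.
    """
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    errors = [p - t for t, p in zip(y_true, y_pred)]
    t_bar, p_bar = mean(y_true), mean(y_pred)

    ss_res = sum(e * e for e in errors)                # residual sum of squares
    ss_tot = sum((t - t_bar) ** 2 for t in y_true)     # spread of observations
    ss_pred = sum((p - p_bar) ** 2 for p in y_pred)    # spread of predictions
    value_range = max(y_true) - min(y_true)
    if value_range == 0 or ss_tot == 0 or ss_pred == 0:
        raise ValueError("observed and predicted values must not be constant")

    mse = ss_res / len(errors)
    cov = sum((t - t_bar) * (p - p_bar) for t, p in zip(y_true, y_pred))
    return {
        "mse": mse,
        "nrmse": math.sqrt(mse) / value_range,  # assumed: RMSE / (max - min)
        "error_mean": mean(errors),
        "error_std": pstdev(errors),            # population standard deviation
        "r2": 1.0 - ss_res / ss_tot,            # coefficient of determination
        "r": cov / math.sqrt(ss_tot * ss_pred), # Pearson correlation
    }


if __name__ == "__main__":
    # Made-up activity values, for illustration only.
    observed = [6.0, 7.0, 8.0, 9.0]
    predicted = [6.1, 6.9, 8.1, 8.9]
    for name, value in regression_report(observed, predicted).items():
        print(f"{name}: {value:.4f}")
```

A different normalization, for example dividing RMSE by the mean of the observed values, would change only the `nrmse` line.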






Received: 2021-11-03; Accepted: 2022-01-15; Online (CNKI): 2022-07-21
Foundation: National Natural Science Foundation of China (71861016)
Corresponding author: Professor QIN Yaqin. E-mail: qinyaqin@kust.edu.cn
Citation: XIA Yulan, XIE Jiming, WANG Yajing, et al. Activity prediction of anti-cancer drug candidate ERα inhibitor [J]. Journal of Shenzhen University Science and Engineering, 2022, 39(5): 529-537. (in Chinese)
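About the author: XIA Yulan (born 1997), master's student at Kunming University of Science and Technology. Research interests: system modeling and simulation, machine learning. E-mail: xiayulan@stu.kust.edu.cn
Last Update: 2022-09-30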