Activity prediction of anti-cancer drug candidate ERα inhibitors

XIA Yulan1, XIE Jiming1, WANG Yajing2, LU Mengyuan1, WANG Jinrui1, and QIN Yaqin1

1. Faculty of Transportation Engineering, Kunming University of Science and Technology, Kunming 650504, Yunnan Province, P. R. China; 2. The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325006, Zhejiang Province, P. R. China

computer application; ensemble learning; biological activity prediction; feature selection; hyperparameter optimization; random forest

DOI: 10.3724/SP.J.1249.2022.05529

Abstract

Breast cancer is the most common malignant tumor threatening women's health worldwide. Previous studies have shown that the estrogen receptor alpha subtype (ERα) plays an important role in breast development and is regarded as an important target for breast cancer treatment, so compounds that antagonize ERα activity may serve as candidate drugs. To predict the bioactivity of compounds against the ERα target under small-sample, many-feature conditions, an ensemble machine learning prediction model for the quantitative structure-activity relationship of anti-breast cancer drugs is proposed, called Mul-BHO-Bi-LSTM (multivariate-Bayesian hyperparametric optimization bi-directional long short-term memory). First, descriptive statistics and multicollinearity diagnosis are performed on 729 molecular descriptors of 1 974 compounds, and the random forest method is used to select the 20 significant variables whose variable importance score is greater than 0.01. Then, a two-dimensional feature matrix based on a convolutional neural network (CNN) is constructed, and Bayesian hyperparametric optimization (BHO) is used to tune the hyperparameters of the Bi-LSTM model. Finally, the prediction performance of the model is analyzed and evaluated. The results show that, compared with the gradient boosting decision tree (GBDT) ensemble learning method, the Mul-BHO-Bi-LSTM model predicts better: the error indexes, namely mean squared error (MSE), normalized root mean squared error (NRMSE), error mean, and error standard deviation, are all less than 0.15, and the correlation indicators R² and r are above 0.99, indicating that the model has good robustness and generalization. The model can provide a method for the screening and design of anti-breast cancer drugs.
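The abstract names its evaluation indexes (MSE, NRMSE, error mean, error standard deviation, R² and r) without defining them. The following is a minimal sketch of how such indexes are commonly computed, not the authors' implementation. The function name `evaluate_predictions` and the sample activity values are hypothetical, and normalizing the RMSE by the range of the observed values is an assumption, since the abstract does not say which normalization was used.

```python
import math
from statistics import mean, pstdev


def evaluate_predictions(y_true, y_pred):
    """Compute the error and correlation indexes named in the abstract."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError("observed values must not all be identical")

    # Signed prediction errors (predicted minus observed).
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = mean([e * e for e in errors])

    # Assumption: RMSE is normalized by the range of the observed values.
    nrmse = math.sqrt(mse) / value_range

    mean_true = mean(y_true)
    mean_pred = mean(y_pred)
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    # Coefficient of determination R^2.
    r2 = 1.0 - ss_res / ss_tot

    # Pearson correlation coefficient r between observed and predicted values.
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    ss_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    r = cov / math.sqrt(ss_tot * ss_pred) if ss_pred > 0 else float("nan")

    return {
        "mse": mse,
        "nrmse": nrmse,
        "error_mean": mean(errors),
        "error_std": pstdev(errors),  # population standard deviation of errors
        "r2": r2,
        "r": r,
    }


# Hypothetical observed and predicted activity values, for illustration only.
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.1, 5.9, 7.2, 7.9]
metrics = evaluate_predictions(observed, predicted)
```

Under these definitions, error indexes near 0 together with R² and r near 1 correspond to the kind of result the abstract reports. A different NRMSE normalization (for example, by the mean of the observed values) would change that index but none of the others.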