Collective classification method based on label propagation for fraud detection

ZHAO Pengya (1,2), FU Xiangling (1,2), WU Weiqiang (2,3), LI Da (2,3), and GAO Songfeng (2,3)

1) School of Computer Science (National Pilot Software Engineering School), Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, Beijing University of Posts and Telecommunications, Beijing 100876, P.R. China; 2) BUPT and Huarong Joint Lab of Smart Finance, Beijing 100876, P.R. China; 3) Huarong Rongtong (Beijing) Technology Co., Ltd., Beijing 100033, P.R. China

Keywords: computer software; fraud detection; collective classification; online lending; label propagation; machine learning

DOI: 10.3724/SP.J.1249.2020.05482

In online lending, fraud detection aims to decide whether a user is a fraudster or a normal user from collected information such as the user's historical transaction data. Existing methods treat each user as an independent sample and ignore the relationships among users. Fraud, however, is increasingly a group behavior: within a fraud network, links between fraud nodes and non-fraud nodes are sparse, whereas links among fraud nodes are dense. Motivated by this observation, we propose a collective classification method for fraud detection based on label propagation. We build a user association network from the phone call records of users of a real online lending company, and use the label propagation algorithm to spread the label information of fraud nodes and determine whether each unlabeled node is a fraudulent user. We also improve the initialization of the transition probability matrix in the label propagation algorithm by raising the weights to a power, so that the algorithm adapts to the imbalance between positive and negative samples that is typical of fraud data. In seven tests on a real lending data set with a very low proportion of labeled samples and an unbalanced training distribution, the proposed method reaches a fraud-detection precision of up to 17%, and both its F1 score and its precision are better than those of the classic WvRn algorithm.
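The abstract does not give the exact formulas, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a symmetric call-weight matrix, a row-normalized transition matrix built from weights raised to a power, and the standard clamping of labeled nodes used in label propagation. The function names, the exponent name `alpha`, and the toy data are all hypothetical.

```python
import numpy as np

def transition_matrix(weights, alpha=1.0):
    """Row-normalize edge weights raised element-wise to the power alpha.

    alpha is an assumed name for the exponent; the abstract only says
    that a power operation is applied to the weights.
    """
    w = np.asarray(weights, dtype=float)
    powered = np.zeros_like(w)
    mask = w > 0                      # absent edges stay at zero
    powered[mask] = w[mask] ** alpha
    # Give isolated nodes a self-loop so their scores stay unchanged.
    idx = np.flatnonzero(powered.sum(axis=1) == 0)
    powered[idx, idx] = 1.0
    return powered / powered.sum(axis=1, keepdims=True)

def label_propagation(weights, labels, alpha=1.0, max_iter=100, tol=1e-6):
    """Return a fraud score in [0, 1] for every node.

    labels: 1 = fraud, 0 = normal, -1 = unknown.
    """
    labels = np.asarray(labels)
    known = labels >= 0
    T = transition_matrix(weights, alpha)
    Y = np.full((len(labels), 2), 0.5)       # uninformative start
    Y[known] = np.eye(2)[labels[known]]      # column 1 holds the fraud mass
    clamp = Y[known].copy()
    for _ in range(max_iter):
        Y_new = T @ Y                        # pull label mass from neighbors
        Y_new /= Y_new.sum(axis=1, keepdims=True)
        Y_new[known] = clamp                 # labeled nodes never change
        converged = np.abs(Y_new - Y).max() < tol
        Y = Y_new
        if converged:
            break
    return Y[:, 1]

# Toy network: node 0 is a known fraudster, node 1 a known normal user.
W = np.array([
    [0, 0, 4, 1],
    [0, 0, 1, 4],
    [4, 1, 0, 0],
    [1, 4, 0, 0],
])
labels = [1, 0, -1, -1]
print(label_propagation(W, labels, alpha=1.0))
print(label_propagation(W, labels, alpha=2.0))
```

In this sketch an exponent above 1 sharpens the contrast between strong and weak ties, so an unlabeled node that mostly calls a known fraudster receives a higher fraud score than it would with the raw weights. Whether the paper uses an exponent above or below 1, and how it chooses the value, is not stated in the abstract.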
