Research Papers

Multimodal detection framework for financial fraud integrating LLMs and interpretable machine learning

  • 1 Information Management School, Sun Yat-sen University, Guangzhou, China;
    2 Lingnan College, Sun Yat-sen University, Guangzhou, China

Received date: 2025-04-17

Revised date: 2025-08-05

Accepted date: 2025-08-18

Online published: 2025-08-29

Abstract

Purpose: This study integrates large language models (LLMs) with interpretable machine learning methods to develop a multimodal, data-driven framework for predicting corporate financial fraud. It addresses the limitations of traditional approaches in long-text semantic parsing, model interpretability, and multi-source data fusion, and aims to provide regulatory agencies with intelligent auditing tools.
Design/methodology/approach: Drawing on 5,304 Chinese listed firms’ annual reports (2015-2020) from the CSMAR database, this study uses Doubao LLMs to generate chunked summaries and 256-dimensional semantic vectors, which serve as textual semantic features. These features are combined with 19 financial indicators, 11 governance metrics, and linguistic characteristics (tone, readability), and the fraud prediction models are optimized with a set of Gradient Boosted Decision Tree (GBDT) algorithms. SHAP value analysis of the final model reveals the risk transmission mechanism by quantifying the marginal impacts of financial, governance, and textual features on fraud likelihood.
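To illustrate what "quantifying the marginal impacts" of features means, here is a minimal sketch of exact Shapley values computed by brute force over feature orderings. It is not the study's implementation: the scoring function `toy_risk`, the feature names, and the all-zero baseline are hypothetical, and SHAP analysis of GBDT models in practice relies on tree-specific algorithms rather than enumeration.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance, obtained by averaging each
    feature's marginal contribution over every feature ordering.
    Only feasible for a handful of features (n! orderings)."""
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)        # start from the reference instance
        previous = predict(current)
        for name in order:
            current[name] = x[name]     # switch one feature to its actual value
            score = predict(current)
            totals[name] += score - previous
            previous = score
    return {name: total / len(orderings) for name, total in totals.items()}


# Hypothetical fraud-risk score with one interaction term (an assumption,
# not the model from the study).
def toy_risk(f):
    return (0.5 * f["leverage"]
            + 0.2 * f["negative_tone"]
            + 0.3 * f["leverage"] * f["ceo_duality"])


baseline = {"leverage": 0.0, "negative_tone": 0.0, "ceo_duality": 0.0}
firm = {"leverage": 1.0, "negative_tone": 1.0, "ceo_duality": 1.0}
phi = shapley_values(toy_risk, firm, baseline)
```

The attributions sum to the difference between the firm's score and the baseline score, and the interaction term is split equally between the two features that take part in it.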
Findings: LLMs effectively distill lengthy annual reports into semantic summaries, and GBDT algorithms (AUC > 0.850) outperform the traditional logistic regression model in fraud detection. Multimodal fusion improves performance by 7.4%, with financial, governance, and textual features providing complementary signals. SHAP analysis identifies financial distress, governance conflicts, and narrative patterns (e.g., tone anchoring, semantic thresholds) as key fraud indicators, highlighting the managerial intent reflected in report language.
Research limitations: This study has three key limitations: 1) the semantic features lack interpretability, 2) fraud types are not differentiated at a granular level, and 3) the framework has not been compared with other deep learning methods. Future research will address these gaps to improve fraud detection precision and model transparency.
Practical implications: The semantic-enhanced evaluation model provides a quantitative tool for assessing the quality of listed companies’ information disclosure and can be put into practice through a derived real-time monitoring system. This strengthens early warning of capital market risk and offers actionable insights for securities regulation.
Originality/value: This study makes three contributions: 1) a novel “chunking-summarization-embedding” framework for efficient semantic compression of lengthy annual reports (30,000 words); 2) a demonstration that LLMs outperform traditional methods in financial text analysis by 19.3%; 3) a novel “language-psychology-behavior” triad model for analyzing managerial fraud motives.
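The “chunking-summarization-embedding” idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the study's pipeline: `summarize` and `embed` are hypothetical stand-ins for LLM calls (truncation and a hashed bag-of-words), the 500-word chunk size is arbitrary, and embedding the concatenated summaries is only one possible way to arrive at a single 256-dimensional vector per report.

```python
import hashlib
import math

EMBED_DIM = 256  # matches the 256-dimensional vectors mentioned in the abstract


def chunk_text(text, max_words=500):
    """Split a long report into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def summarize(chunk, max_words=30):
    """Stand-in for an LLM summarization call: keeps the first max_words words."""
    return " ".join(chunk.split()[:max_words])


def embed(text, dim=EMBED_DIM):
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def report_vector(text, chunk_words=500):
    """Chunk the report, summarize each chunk, then embed the joined summaries."""
    summaries = [summarize(c) for c in chunk_text(text, chunk_words)]
    return embed(" ".join(summaries))


# Toy 2,400-word "report" used only to exercise the pipeline.
report = "revenue grew while receivables rose sharply " * 400
chunks = chunk_text(report)
vector = report_vector(report)
```

Swapping the two stand-ins for real summarization and embedding calls leaves the surrounding structure unchanged; the resulting vector can then be placed alongside the financial and governance features as model input.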

Cite this article

Hui Nie, Zhao-hui Long, Ze-jun Fang, Lu-qiong Gao. Multimodal detection framework for financial fraud integrating LLMs and interpretable machine learning[J]. Journal of Data and Information Science, 0: 202582901-202582901. DOI: 10.2478/jdis-2025-0046
