Research Papers
Hui Nie, Zhao-hui Long, Ze-jun Fang, Lu-qiong Gao
Accepted: 2025-08-29
Purpose: This study aims to integrate large language models (LLMs) with interpretable machine learning methods to develop a multimodal data-driven framework for predicting corporate financial fraud, addressing the limitations of traditional approaches in long-text semantic parsing, model interpretability, and multi-source data fusion, thereby providing regulatory agencies with intelligent auditing tools.
Design/methodology/approach: Analyzing 5,304 Chinese listed firms’ annual reports (2015-2020) from the CSMAD database, this study uses Doubao LLMs to generate chunked summaries and 256-dimensional semantic vectors, which serve as textual semantic features. These features are combined with 19 financial indicators, 11 governance metrics, and linguistic characteristics (tone, readability) as inputs to fraud prediction models built with a family of Gradient Boosted Decision Tree (GBDT) algorithms. SHAP value analysis of the final model reveals the risk transmission mechanism by quantifying the marginal impacts of financial, governance, and textual features on fraud likelihood.
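The chunk, summarize, and embed steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: `summarize` and `embed` are hypothetical stand-ins for LLM calls (a truncation and a hashed bag-of-words, respectively), the chunk size of 500 words is arbitrary, and mean pooling of chunk vectors into one report-level vector is an assumption, since the abstract does not state how chunk-level outputs are combined.

```python
import hashlib
import math

EMBED_DIM = 256  # matches the 256-dimensional vectors mentioned in the abstract


def chunk_words(text, chunk_size=500):
    """Split a long report into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def summarize(chunk, max_words=50):
    """Hypothetical stand-in for an LLM summary: keeps the first max_words words."""
    return " ".join(chunk.split()[:max_words])


def embed(text, dim=EMBED_DIM):
    """Hypothetical stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def report_vector(text, chunk_size=500):
    """Chunk, summarize, embed each summary, then mean-pool (assumed) into one vector."""
    summaries = [summarize(c) for c in chunk_words(text, chunk_size)]
    if not summaries:
        return [0.0] * EMBED_DIM
    vectors = [embed(s) for s in summaries]
    return [sum(col) / len(vectors) for col in zip(*vectors)]
```

In a real pipeline the two stand-ins would be replaced by calls to a summarization model and an embedding model, and the resulting vector would be concatenated with the financial, governance, and linguistic features before model training.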
Findings: LLMs effectively distill lengthy annual reports into semantic summaries, and GBDT algorithms (AUC > 0.850) outperform the traditional Logistic Regression model in fraud detection. Multimodal fusion improves performance by 7.4%, with financial, governance, and textual features providing complementary signals. SHAP analysis identifies financial distress, governance conflicts, and narrative patterns (e.g., tone anchoring, semantic thresholds) as key fraud indicators, highlighting how managerial intent surfaces in report language.
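To make concrete what a SHAP-style attribution measures, the sketch below computes exact Shapley values for a toy risk score by averaging each feature's marginal contribution over all feature orderings. The scoring function, its coefficients, and the feature names (`leverage`, `ceo_duality`, `negative_tone`) are hypothetical and chosen only for illustration; the study itself applies SHAP to trained GBDT models, which use faster tree-specific estimators rather than this brute-force enumeration.

```python
from itertools import permutations


def fraud_score(x):
    """Hypothetical toy risk score with one interaction term (not the paper's model)."""
    score = 0.5 * x["leverage"] + 0.3 * x["ceo_duality"] + 0.2 * x["negative_tone"]
    score += 0.4 * x["leverage"] * x["negative_tone"]
    return score


def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    features = list(instance)
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)   # start from the reference input
        prev = model(current)
        for f in order:
            current[f] = instance[f]   # switch feature f to its actual value
            new = model(current)
            totals[f] += new - prev    # marginal contribution of f in this ordering
            prev = new
    return {f: totals[f] / len(orderings) for f in features}


instance = {"leverage": 1.0, "ceo_duality": 1.0, "negative_tone": 1.0}
baseline = {"leverage": 0.0, "ceo_duality": 0.0, "negative_tone": 0.0}
phi = shapley_values(fraud_score, instance, baseline)
```

The attributions sum to the difference between the score of the instance and the score of the baseline, and the interaction term is split evenly between the two features that share it.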
Research limitations: This study has three key limitations: 1) the semantic features lack interpretability, 2) fraud types are not differentiated at a granular level, and 3) the approach has not been validated against other deep learning methods. Future research will address these gaps to enhance fraud detection precision and model transparency.
Practical implications: The semantic-enhanced evaluation model provides a quantitative tool for assessing the information disclosure quality of listed companies and can be put into practice through a real-time monitoring system derived from it. It strengthens early warning of capital market risk and offers actionable insights for securities regulation.
Originality/value: This study presents three key innovations: 1) A novel “chunking-summarization-embedding” framework for efficient semantic compression of lengthy annual reports (30,000 words); 2) Demonstration of LLMs’ superior performance in financial text analysis, outperforming traditional methods by 19.3%; 3) A novel “language-psychology-behavior” triad model for analyzing managerial fraud motives.