1 Introduction
2 Literature review
2.1 Indicators for corporate financial fraud detection
2.2 Financial fraud prediction methods
2.3 Critical review
3 Research design
3.1 Research framework
Figure 1. Research framework for financial fraud detection.
3.2 Annual report feature extraction
3.2.1 Semantic features
Figure 2. Annual report semantic feature extraction framework.
Table 1. Semantic feature extraction based on the Doubao LLMs.
| LLM | Prompt | Input | Output |
|---|---|---|---|
| Doubao-pro-32k_v241215 General Model | You are an outstanding natural language processing assistant. Please summarize the input text, ensuring that no critical information is lost or misrepresented. The summary should be fluent, logically clear, and complete and not exceed 256 characters. | Chunked text from MD&A (original text) | Chunked text summary |
| Doubao-pro-32k_v241215 General Model | You are an outstanding natural language processing assistant. Please summarize the input text, ensuring that no critical information is lost or misrepresented. The summary should be fluent, logically clear, and complete and not exceed 1024 characters. | Joined summaries of chunked text | Full-text summary |
| Doubao-embedding_v240715 Embedding Vector Model | None | Full-text summary | Embedding vector |
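The two summarization rows of Table 1 describe a chunk-then-merge procedure, which can be sketched as follows. This is only an illustration under stated assumptions: the `summarize` callable stands in for a call to the LLM, the `truncating_stub` is a placeholder that merely truncates text, and the chunk size of 4000 characters is hypothetical (the source does not state it). Only the 256- and 1024-character limits come from the prompts in Table 1.

```python
from typing import Callable, List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def summarize_report(
    text: str,
    summarize: Callable[[str, int], str],
    chunk_size: int = 4000,   # hypothetical; not given in the source
    chunk_limit: int = 256,   # per-chunk limit from the first prompt
    full_limit: int = 1024,   # full-text limit from the second prompt
) -> str:
    """Summarize each chunk, join the partial summaries, then summarize again."""
    partial = [summarize(chunk, chunk_limit) for chunk in chunk_text(text, chunk_size)]
    joined = "\n".join(partial)
    return summarize(joined, full_limit)


def truncating_stub(text: str, max_chars: int) -> str:
    """Placeholder for the LLM call: it only truncates, it does not summarize."""
    return text[:max_chars]


# Synthetic MD&A-like text, used only to exercise the pipeline.
full_summary = summarize_report("管理层讨论与分析。" * 2000, truncating_stub)
```

In the setup of Table 1, the resulting full-text summary would then be passed to the embedding model to obtain the semantic feature vector; that step is omitted here because it depends on the provider's API.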
3.2.2 Tone of the annual report
3.2.3 Text readability
3.3 Research models
3.3.1 Ensemble tree-based ML algorithms
3.3.2 Interpretable ML with SHAP
Figure 3. SHAP feature attribution diagram.
3.3.3 Large language models
4 Experiments and results
4.1 Sample construction
4.2 Feature variables
Table 2. Financial Indicators (19 Items).
| Indicator Type | Indicator Variable | Variable Definition and Operation |
|---|---|---|
| Solvency | Aslbrt | Asset-Liability Ratio; Total Liabilities / Total Assets |
| | Curtrt | Current Ratio; Current Assets / Current Liabilities |
| | Qikrt | Quick Ratio; (Current Assets - Inventory) / Current Liabilities |
| | Equrt | Equity Ratio; Total Liabilities / Total Shareholders’ Equity |
| | Pmcptdbrt | Long-term Debt Ratio; Total Non-current Liabilities / (Total Shareholders’ Equity + Total Non-current Liabilities) |
| Profitability | Roe_1 | Return on Equity (ROE); Net Profit / Total Shareholders’ Equity |
| | Salnpm | Net Profit Margin; Net Profit / Revenue |
| | Salgm | Gross Profit Margin; Gross Profit / Revenue |
| | Salpm | Profit Margin; Profit / Revenue |
| Growth Potential | Atrt | Total Assets Growth Rate; (Current-Period Adjusted Figure - Prior-Year Same-Period Adjusted Figure) / ABS(Prior-Year Same-Period Adjusted Figure) |
| | Opicrt | Operating Revenue Growth Rate; (Current-Period Adjusted Figure - Prior-Year Same-Period Adjusted Figure) / ABS(Prior-Year Same-Period Adjusted Figure) |
| | Oirt | Operating Profit Growth Rate; (Current-Period Adjusted Figure - Prior-Year Same-Period Adjusted Figure) / ABS(Prior-Year Same-Period Adjusted Figure) |
| Operational Efficiency | Actrcbto | Accounts Receivable Turnover; Revenue / ((Beginning Net Accounts Receivable + Ending Net Accounts Receivable) / 2) |
| | Fxastto | Fixed Asset Turnover; Total Revenue / ((Beginning Fixed Assets + Ending Fixed Assets) / 2) |
| | Totastto | Total Asset Turnover; Total Revenue / ((Beginning Total Assets + Ending Total Assets) / 2) |
| | Actpayto | Accounts Payable Turnover; Cost of Goods Sold (COGS) / ((Beginning Accounts Payable + Ending Accounts Payable) / 2) |
| Cash Flow Adequacy | Opncf_rev | Net Operating Cash Flow / Total Revenue |
| | Opncfrt | Percentage of Net Cash Flow from Operating Activities; Net Cash Flow from Operating Activities / (Net Cash Flow from Operating Activities + Net Cash Flow from Investing Activities + Net Cash Flow from Financing Activities) |
| | Csopindex | Cash Flow to Sales Ratio; Net Cash Flow from Operating Activities / Cash from Operations |
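Several of the definitions in Table 2 translate directly into arithmetic on statement items. The sketch below computes a handful of them for a single firm-year; the dictionary keys and the sample figures are hypothetical and chosen only for illustration, and the indicator names follow the table.

```python
def average(begin: float, end: float) -> float:
    """Average of beginning-of-period and end-of-period balances."""
    return (begin + end) / 2.0


def growth_rate(current: float, prior: float) -> float:
    """(current - prior) / ABS(prior), as used by the growth indicators."""
    return (current - prior) / abs(prior)


def financial_ratios(s: dict) -> dict:
    """Compute a subset of the Table 2 indicators from statement items.

    The keys of `s` are hypothetical field names, not a real database schema.
    """
    return {
        "Aslbrt": s["total_liabilities"] / s["total_assets"],
        "Qikrt": (s["current_assets"] - s["inventory"]) / s["current_liabilities"],
        "Equrt": s["total_liabilities"] / s["total_equity"],
        "Salnpm": s["net_profit"] / s["revenue"],
        "Totastto": s["revenue"] / average(s["total_assets_prior"], s["total_assets"]),
        "Atrt": growth_rate(s["total_assets"], s["total_assets_prior"]),
    }


# Made-up figures for one firm-year.
sample = {
    "total_assets": 1000.0,
    "total_assets_prior": 800.0,
    "total_liabilities": 400.0,
    "total_equity": 600.0,
    "current_assets": 500.0,
    "inventory": 100.0,
    "current_liabilities": 200.0,
    "revenue": 900.0,
    "net_profit": 90.0,
}
ratios = financial_ratios(sample)
```

Dividing by the absolute value of the prior-period figure keeps the sign of the growth rate meaningful when the prior figure is negative, which matters for Oirt because operating profit can be a loss.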
Table 3. Corporate Governance Indicators (11 Items).
| Indicator Type | Indicator Variable | Variable Definition and Value |
|---|---|---|
| Ownership Structure | ShrHolder1 | Ownership Ratio of the Largest Shareholder |
| | ShrHolder3 | Sum of Ownership Ratios of Top Three Shareholders |
| | ShrHolder5 | Sum of Ownership Ratios of Top Five Shareholders |
| | ShrHolder10 | Sum of Ownership Ratios of Top Ten Shareholders |
| | StOwRt | State-Owned Share Ratio |
| Internal Executive Characteristics | Cmceo_Dum | Whether serving as both Chairman and CEO (1=Yes, 0=No) |
| | Cmgm_Dum | Whether serving as both Chairman and General Manager (1=Yes, 0=No) |
| | MShrRat | Management Ownership Ratio: Percentage of company shares held by management |
| | BShrRat | Board of Directors Ownership Ratio: Percentage of company shares held by all board members |
| | SShrRat | Supervisory Board Ownership Ratio: Percentage of company shares held by all supervisory board members |
| | InDrcRat | Proportion of Independent Directors: Ratio of independent directors to the total number of directors |
Table 4. Annual report feature indicators.
| Indicator Type | Indicator Variable | Variable Definition |
|---|---|---|
| Tone | LM_tone | Tone value of the full annual report, calculated based on the LM dictionary. |
| | MDA_tone | Tone value of the annual report’s MD&A section, calculated based on the LM dictionary. |
| Semantic Features | f1 f2 … fn | Semantic feature vector extracted from the MD&A (n-dimensional). |
| Readability | Size | Character count of the annual report’s MD&A section |
| | Ls | Proportion of long sentences |
| | Finance_density | Density of financial terms: the number of financial and accounting terms per hundred characters |
| | Obscure_density | Density of rare words: the number of less frequently used Chinese words per hundred characters |
| | Transition_density | Density of adversative conjunctions: the number of Chinese adversative conjunctions per hundred characters |
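The tone and readability indicators in Table 4 can be illustrated with a few small functions. This is a sketch under assumptions the source does not confirm: the net-tone formula (positive minus negative counts over their sum) is a common dictionary-based definition rather than the stated one, the 30-character threshold for a long sentence is hypothetical, and the word lists are toy examples rather than the LM dictionary.

```python
import re
from typing import Iterable, List, Set


def net_tone(tokens: List[str], positive: Set[str], negative: Set[str]) -> float:
    """(positive - negative) / (positive + negative); 0.0 if no tone words occur."""
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0


def density_per_hundred(text: str, terms: Iterable[str]) -> float:
    """Occurrences of the listed terms per hundred characters of text.

    Terms that are substrings of other listed terms would be counted twice.
    """
    if not text:
        return 0.0
    hits = sum(text.count(term) for term in terms)
    return 100.0 * hits / len(text)


def long_sentence_share(text: str, min_len: int = 30) -> float:
    """Share of sentences with at least min_len characters (threshold assumed)."""
    sentences = [s for s in re.split(r"[。！？!?]", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(1 for s in sentences if len(s) >= min_len) / len(sentences)


# Toy inputs, for illustration only.
tone = net_tone(["增长", "亏损", "增长", "稳定"], {"增长"}, {"亏损"})
sample_text = "公司营业收入增长，但是净利润下降。"
finance_density = density_per_hundred(sample_text, ["营业收入", "净利润"])
transition_density = density_per_hundred(sample_text, ["但是"])
ls = long_sentence_share("短句。" + "长" * 40 + "。")
```

The same `density_per_hundred` helper covers Finance_density, Obscure_density, and Transition_density; only the term list changes.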
4.3 Experimental design
4.4 Experimental results
4.4.1 Semantic feature extraction based on LLM
Table 5. Sample summaries of corporate annual reports.
| Chinese | English |
|---|---|
| 藏格控股 2019 年年度报告显示, 管理层围绕年度目标努力基本完成任务。当年钾肥生产 108.29 万吨, 营收 20.18 亿元, 净利润 3.08 亿元, 业绩下降因产品销量、价格、产量及成本、存货跌价等因素。 报告期主要工作包括推进电池级碳酸锂项目投产、改造技术工艺降成本、坚持创新发展获补助资金。多项财务数据有变动, 如营业收入合计同比减 24.70%, 销售、管理、财务、研发费用有不同变化。 研发投入大幅增长, 现金流方面各活动现金流量有升降。非主营业务投资收益为负, 还涉及资产减值、营业外收支等。资产及负债有比重增减, 获取重大股权投资, 收购西藏巨龙铜业 37%股权, 本期投资亏损 1.68 亿元。 募集资金使用有进展, 曾自筹资金投入募投项目, 2019 年募投项目结项节余补流。公司积累了经验和技术储备, 目前面临钾肥价格、新业务开拓、环保等风险, 将采取相应措施应对。还存在股份质押、业绩补偿股份回购注销风险, 补偿义务人正积极处置。 未来公司将在稳步发展氯化钾基础上开发高附加值产品, 利用老卤生产新能源材料, 开发锂系列产品, 加强技术和人才储备等。2019 年多次通过电话沟通方式接待个人咨询, 内容涉及公司多个方面情况, 多数未提供书面报告。 | The 2019 annual report of Zangge Holdings indicates that the management team largely accomplished its annual objectives. Potassium fertilizer production reached 1.0829 million tons, with revenue of 2.018 billion yuan and net profit of 308 million yuan. The decline in performance was attributed to factors such as product sales volume, price, production volume, costs, and inventory write-downs. Key initiatives during the reporting period included advancing the commissioning of the battery-grade lithium carbonate project, improving technical processes to reduce costs, and securing subsidy funds through innovation-driven development. Several financial metrics showed changes, such as a 24.70% year-on-year decrease in total operating revenue, with variations in sales, management, financial, and R&D expenses. R&D investment increased significantly, while cash flows from various activities fluctuated. Non-core business investment returns were negative, and the report also covered asset impairments and non-operating income and expenses. Assets and liabilities experienced shifts in proportions, and the company made significant equity investments, including the acquisition of a 37% stake in Xizang Julong Copper; the investment loss for the period was 168 million yuan. Progress was made in the use of raised funds, with self-raised capital previously allocated to fundraising projects. In 2019, surplus funds from completed fundraising projects were redirected to working capital. The company has accumulated experience and technical reserves but faces risks such as potassium fertilizer price fluctuations, new business expansion, and environmental challenges, for which corresponding measures will be implemented. Risks related to share pledges and to the repurchase and cancellation of performance compensation shares remain, with obligors actively addressing them. Moving forward, the company will focus on developing high-value-added products based on stable potassium chloride production, utilizing brine for new energy materials, expanding lithium-based products, and strengthening technical and talent reserves. In 2019, the company responded to numerous individual inquiries via phone consultations, covering various aspects of the company, most of which did not result in written reports. |
(Company: Zangge Holdings; Year: 2019; Stock Code: 000408)
Table 6. Performance comparison of financial fraud prediction models with multi-dimensional embeddings.
| Semantic features | Dimension | Accuracy | Precision | Recall | F1 | AUC |
|---|---|---|---|---|---|---|
| Embedding vectors | 1024 | 0.732 | 0.715 | 0.693 | 0.704 | 0.811 |
| | 512 | 0.732 | 0.706 | 0.714 | 0.710 | 0.805 |
| | 256 | 0.742 | 0.713 | 0.733 | 0.733 | 0.817 |
| | 128 | 0.695 | 0.656 | 0.705 | 0.679 | 0.762 |
Note: The MLP architecture in the experiment comprises an input layer matching the embedding dimension, two hidden layers that progressively reduce to 128 and 64 dimensions, and a binary fraud classification output. ReLU activations between layers add nonlinearity, enabling the model to capture complex data patterns.
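The architecture described in the note can be sketched as a plain-Python forward pass. The layer sizes (input, 128, 64, output) and the ReLU activations come from the note; the single sigmoid output unit, the He-style random initialization, and the seed are assumptions made for this illustration, and no training is shown.

```python
import math
import random


def init_layer(n_in: int, n_out: int, rng: random.Random):
    """Random weights (He-style scale, an assumption) and zero biases."""
    scale = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Affine map: one output per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def build_mlp(input_dim: int, seed: int = 0):
    """Layers input_dim -> 128 -> 64 -> 1 (single output unit is assumed)."""
    rng = random.Random(seed)
    sizes = [input_dim, 128, 64, 1]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def predict_proba(mlp, x) -> float:
    """ReLU after each hidden layer, sigmoid on the output unit."""
    h = x
    for layer in mlp[:-1]:
        h = relu(dense(h, layer))
    return sigmoid(dense(h, mlp[-1])[0])


def count_parameters(mlp) -> int:
    return sum(len(w) * len(w[0]) + len(b) for w, b in mlp)


mlp = build_mlp(256)
probability = predict_proba(mlp, [0.01] * 256)
n_params = count_parameters(mlp)
```

With a 256-dimensional input this network has 41,217 trainable parameters, compared with 139,521 for a 1024-dimensional input; the smaller model may be easier to fit on a limited sample, though the source does not state this as the reason for the results in Table 6.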
Table 7. Experimental results of word embedding models (Input Dimension 256).
| Embedding model | Introduction | Accuracy | AUC | F1 |
|---|---|---|---|---|
| Doubao-embedding_v240715 ⑤ | A high-performance text embedding model developed by ByteDance’s Volcano Engine, part of the Doubao large model ecosystem, specializing in generating high-quality semantic vectors. | 0.742 | 0.817 | 0.733 |
| Embedding-V1 ⑥ | Baidu’s general-purpose text embedding model, incorporating the company’s NLP expertise. It supports Chinese and multilingual tasks, generating high-dimensional dense vectors (1024 dimensions). | 0.738 | 0.812 | 0.715 |
| BGE-large-zh ⑦ | An open-source Chinese text embedding model from Beijing Academy of Artificial Intelligence (BAAI). Fine-tuned on RoBERTa-large architecture, it delivers excellent performance on Chinese text. | 0.739 | 0.802 | 0.709 |
| Text2Vec-base-chinese ⑧ | A classic open-source Chinese text embedding model based on transformer architecture, providing static contextual embeddings. | 0.609 | 0.642 | 0.577 |
4.4.2 Financial fraud detection using multimodal data fusion
Table 8. Comparative performance of ML algorithms in the fraud detection task.
| ML Algorithms | AUC | Accuracy | F1 |
|---|---|---|---|
| LR | 0.750 | 0.685 | 0.679 |
| LightGBM | 0.868 | 0.791 | 0.787 |
| XGBoost | 0.871 | 0.792 | 0.789 |
| CatBoost | 0.852 | 0.776 | 0.773 |
Note: Parameter settings for the models are detailed in Table A2 in the Appendix.
Table 9. Impact of textual features: Ablation analysis.
| Features | AUC | ∆AUC | Accuracy | ∆Accuracy | F1 | ∆F1 |
|---|---|---|---|---|---|---|
| Financial | 0.820 | - | 0.740 | - | 0.694 | - |
| Financial+Tone | 0.828 | +0.8% | 0.749 | +0.9% | 0.727 | +3.3% |
| Financial+Tone+Readability | 0.841 | +1.3% | 0.766 | +1.7% | 0.745 | +1.8% |
| Financial+Tone+Readability+Semantics | 0.880 | +3.9% | 0.798 | +3.2% | 0.780 | +3.5% |
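The ∆ columns in Table 9 appear to be increments over the preceding row, expressed in percentage points (for example, AUC 0.828 - 0.820 = 0.008, reported as +0.8%). A short sketch that recomputes them from the reported metric values, under that reading:

```python
# (feature set, AUC, Accuracy, F1) as reported in Table 9.
rows = [
    ("Financial", 0.820, 0.740, 0.694),
    ("Financial+Tone", 0.828, 0.749, 0.727),
    ("Financial+Tone+Readability", 0.841, 0.766, 0.745),
    ("Financial+Tone+Readability+Semantics", 0.880, 0.798, 0.780),
]


def incremental_gains(table):
    """Percentage-point change of each metric relative to the previous row."""
    gains = []
    for prev, cur in zip(table, table[1:]):
        deltas = tuple(round((c - p) * 100, 1) for p, c in zip(prev[1:], cur[1:]))
        gains.append((cur[0], deltas))
    return gains


gains = incremental_gains(rows)
for name, (d_auc, d_acc, d_f1) in gains:
    print(f"{name}: AUC +{d_auc}, Accuracy +{d_acc}, F1 +{d_f1}")
```

Because each delta is measured against the previous row, the gains are incremental contributions of the newly added feature group rather than improvements over the financial-only baseline.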
Table 10. Ablation analysis on multi-feature fusion.
| Features | AUC | ∆AUC | Accuracy | ∆Accuracy | F1 | ∆F1 |
|---|---|---|---|---|---|---|
| Financial | 0.820 | - | 0.740 | - | 0.694 | - |
| Financial + Textual | 0.880 | +6.0% | 0.798 | +5.8% | 0.780 | +8.6% |
| Financial + Textual + Corporate governance | 0.894 | +1.4% | 0.812 | +1.4% | 0.796 | +1.6% |
4.4.3 Feature importance analysis
Figure 4. Feature importance analysis (Top 10% Features).
4.4.4 Feature impact mechanism analysis
Figure 5. The relationship between the top 10% most important features and corporate fraud risk (beeswarm plot).
Figure 6. Relationship between annual report semantic features (f119, f11) and corporate fraud risk.
5 Discussion
5.1 The findings
5.2 Sensitivity to sample composition
5.3 Limitations and future work
6 Conclusion and implications
Funding information
Acknowledgements
Author contributions
Data availability statement
Appendix
Table A1. Corporate fraud type frequency and incidence statistics.
| Violation Type | Description | Count | Frequency |
|---|---|---|---|
| Delayed disclosure | Failure to disclose material information within the prescribed timeframe, violating the timely disclosure requirements under securities regulations. | 72 | 16% |
| False statement | Intentional misrepresentation or falsification of financial statements, reports, or other disclosures, including misleading statements that distort the true financial condition or operational performance of the company. | 57 | 13% |
| Significant omissions | Deliberate or negligent failure to disclose material information that could significantly impact investors’ decisions, thereby violating the principle of completeness in disclosure. | 51 | 11% |
| Inflated assets | Misstating the value or existence of assets in financial statements, including overstating asset values or recording non-existent assets. | 28 | 6% |
| Fabricated profits | Artificially inflating or fabricating profits through improper accounting practices, such as recognizing revenue prematurely or manipulating expense records. | 9 | 2% |
| Guarantee breaches | Referring to a listed company or its affiliates providing guarantees in violation of regulations or beyond authorized limits, harming the interests of the company and investors. | 8 | 2% |
| Misleading disclosures | Providing inaccurate, incomplete, or misleading information in disclosures, whether in financial reports, announcements, or other regulatory filings, which misleads investors or regulators. | 5 | 1% |
| Others | - | 78 | 17% |
Table A2. The ML models’ configuration for fraud detection.
| Algorithms | Parameters | Value | Meaning |
|---|---|---|---|
| Logistic Regression | C | 100 | The reciprocal of regularization strength |
| | max_iter | 300 | The maximum number of iterations for solver convergence |
| | penalty | ‘l1’ | The type of regularization; ‘l1’ denotes L1 regularization |
| | solver | ‘liblinear’ | The optimization algorithm used to fit the model |
| | tol | 1.00E-06 | The error tolerance for the stopping criterion |
| XGBoost | subsample | 0.899 | The sample proportion per tree during training |
| | n_estimators | 449 | The boosting tree count |
| | min_child_weight | 4 | The minimum sum of instance weights required in a child node |
| | max_depth | 9 | The tree’s maximum depth |
| | learning_rate | 0.1 | The learning rate |
| | gamma | 0.000278 | The minimum loss reduction for node splits |
| | colsample_bytree | 0.6 | The feature proportion per tree during construction |
| LightGBM | colsample_bytree | 0.6 | Feature proportion per tree |
| | max_depth | 6 | Maximum tree depth |
| | min_child_samples | 9 | Minimum child node samples |
| | n_estimators | 280 | Number of boosting trees |
| | num_leaves | 40 | Maximum leaf nodes |
| | random_state | 0 | Random seed for reproducibility |
| | reg_alpha | 0.01 | L1 regularization coefficient |
| | reg_lambda | 0.001 | L2 regularization coefficient |
| | subsample | 0.8 | Sample proportion per tree |
| CatBoost | subsample | 0.8 | The proportion of samples for training each tree |
| | min_data_in_leaf | 1 | Minimum samples per leaf node |
| | l2_leaf_reg | 0.01 | L2 regularization coefficient |
| | learning_rate | 0.1 | Learning rate for step size control |
| | iterations | 280 | Number of boosting trees |
| | depth | 6 | Maximum tree depth |
| | colsample_bylevel | 0.8 | Feature proportion per level |
| | border_count | 40 | Number of bins for numerical features |
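The settings in Table A2 can be written down as keyword-argument dictionaries, which keeps the configuration in one place. The sketch below is an assumption about how such a configuration might be organized, not the authors' code. The parameter names match keyword arguments of the scikit-learn, XGBoost, LightGBM, and CatBoost classifier constructors, but whether a given library version accepts every combination (for example, CatBoost's `min_data_in_leaf` under its default tree-growing policy) has not been verified here. Imports are deferred so the dictionaries can be inspected without the libraries installed.

```python
# Hyperparameters transcribed from Table A2.
LOGISTIC_REGRESSION_PARAMS = {
    "C": 100, "max_iter": 300, "penalty": "l1", "solver": "liblinear", "tol": 1e-6,
}
XGBOOST_PARAMS = {
    "subsample": 0.899, "n_estimators": 449, "min_child_weight": 4, "max_depth": 9,
    "learning_rate": 0.1, "gamma": 0.000278, "colsample_bytree": 0.6,
}
LIGHTGBM_PARAMS = {
    "colsample_bytree": 0.6, "max_depth": 6, "min_child_samples": 9,
    "n_estimators": 280, "num_leaves": 40, "random_state": 0,
    "reg_alpha": 0.01, "reg_lambda": 0.001, "subsample": 0.8,
}
CATBOOST_PARAMS = {
    "subsample": 0.8, "min_data_in_leaf": 1, "l2_leaf_reg": 0.01,
    "learning_rate": 0.1, "iterations": 280, "depth": 6,
    "colsample_bylevel": 0.8, "border_count": 40,
}


def build_models():
    """Instantiate the four classifiers (requires the third-party libraries)."""
    from sklearn.linear_model import LogisticRegression
    from xgboost import XGBClassifier
    from lightgbm import LGBMClassifier
    from catboost import CatBoostClassifier

    return {
        "LR": LogisticRegression(**LOGISTIC_REGRESSION_PARAMS),
        "XGBoost": XGBClassifier(**XGBOOST_PARAMS),
        "LightGBM": LGBMClassifier(**LIGHTGBM_PARAMS),
        "CatBoost": CatBoostClassifier(**CATBOOST_PARAMS),
    }
```

Note that the L1 penalty in scikit-learn's logistic regression requires a solver that supports it, which is consistent with the `liblinear` choice listed in the table.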
Table A3. Model performance under imbalanced data.
| Fraud to Non-Fraud Sample Ratio | Accuracy | AUC | F1 | Precision | Recall |
|---|---|---|---|---|---|
| 1:1 | 0.742 | 0.817 | 0.713 | 0.713 | 0.733 |
| 1:2 | 0.732 | 0.770 | 0.568 | 0.564 | 0.571 |
| 1:5 | 0.856 | 0.755 | 0.470 | 0.511 | 0.430 |
| 1:10 | 0.819 | 0.613 | 0.222 | 0.273 | 0.188 |
| 1:20 | 0.946 | 0.586 | 0.167 | 0.273 | 0.120 |
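Table A3 shows accuracy staying high, or even rising, as the class ratio becomes more skewed while F1 and recall fall. The sketch below reproduces that pattern from confusion-matrix counts. The counts are hypothetical and chosen only to illustrate the mechanism; they are not the study's data.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 for the positive (fraud) class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical 1:1 sample: 100 fraud and 100 non-fraud firms.
balanced = classification_metrics(tp=73, fp=29, fn=27, tn=71)

# Hypothetical 1:20 sample: 100 fraud and 2000 non-fraud firms.
skewed = classification_metrics(tp=12, fp=32, fn=88, tn=1968)
```

In the skewed case most of the accuracy comes from correctly labeled non-fraud firms (the true negatives), so accuracy exceeds 0.94 even though only 12 of 100 fraud cases are caught. This is why F1, recall, and AUC are more informative than accuracy when comparing the rows of Table A3.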


