Studies on multimodal sentiment analysis can be broadly divided into three categories: feature-level fusion, intermediate fusion, and decision-level fusion. For feature-level fusion, Poria et al. (2016) concatenated textual features extracted by a CNN, facial features extracted with visual tools, and audio features extracted with OpenSMILE into a single long vector, which was then fed into a classifier for sentiment analysis. Cambria et al. (2013) proposed the Sentic Blending model, which analyzes sentiment on the basis of affective knowledge by fusing multiple unimodal features. Pérez Rosas et al. (2013) analyzed sentiment in videos by fusing textual, facial, and audio features.
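As a rough illustration of feature-level fusion, the following minimal sketch concatenates toy per-modality vectors into one long feature vector; the dimensions and random features are placeholders for illustration only, not the settings used in the cited works.

```python
import numpy as np

# Toy per-modality feature vectors for one utterance (dimensions are arbitrary).
text_feat   = np.random.randn(100)  # e.g. CNN-based textual features
visual_feat = np.random.randn(50)   # e.g. facial features from a vision tool
audio_feat  = np.random.randn(30)   # e.g. OpenSMILE acoustic features

# Feature-level (early) fusion: concatenate everything into one long vector,
# which is then passed to a single downstream sentiment classifier.
fused = np.concatenate([text_feat, visual_feat, audio_feat])
print(fused.shape)  # (180,)
```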
Intermediate fusion typically takes place inside a neural network. To analyze sentiment in videos, Zadeh et al. (2017) proposed tensor fusion, an intermediate fusion technique in which the embeddings produced by the modality-specific subnetworks are combined through a vector outer product. Majumder et al. (2018) proposed a hierarchical fusion strategy in which unimodal features are first fused pairwise and the resulting bimodal features are fused again to obtain trimodal features.
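The outer-product idea behind tensor fusion can be sketched as follows; the appended constant 1 and the toy embedding sizes follow common descriptions of the technique and are assumptions here, not the configuration of the original network.

```python
import numpy as np

def tensor_fusion(z_t, z_a, z_v):
    """Fuse three modality embeddings with a three-way outer product.

    A constant 1 is appended to each embedding so that the resulting tensor
    also contains all unimodal and bimodal interaction terms.
    """
    z_t = np.append(z_t, 1.0)
    z_a = np.append(z_a, 1.0)
    z_v = np.append(z_v, 1.0)
    # Outer product over the three modalities: shape (|z_t|, |z_a|, |z_v|).
    fused = np.einsum("i,j,k->ijk", z_t, z_a, z_v)
    return fused.ravel()  # flattened before being passed to a classifier

z_text, z_audio, z_visual = np.random.randn(8), np.random.randn(4), np.random.randn(6)
print(tensor_fusion(z_text, z_audio, z_visual).shape)  # (9 * 5 * 7,) = (315,)
```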
In addition, K. Zhang et al. (2022) built a model based on reinforcement learning and domain knowledge to recognize emotions from real-time video, obtaining satisfactory results on public datasets. For decision-level fusion, Song et al. (2018) employed a separate classifier to recognize the sentiment of each modality and then applied an artificial neural network or a k-nearest-neighbor classifier to their outputs to produce the final labels.
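A minimal stacking-style sketch of decision-level fusion is given below, assuming synthetic features and scikit-learn classifiers as stand-ins for the per-modality recognizers and the second-stage ANN/k-NN of the cited work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)                                # toy sentiment labels
text_X  = rng.normal(size=(n, 20)) + y[:, None] * 0.5    # synthetic per-modality features
audio_X = rng.normal(size=(n, 10)) + y[:, None] * 0.3
video_X = rng.normal(size=(n, 15)) + y[:, None] * 0.2

# Stage 1: one classifier per modality, each producing class probabilities.
unimodal_probs = []
for X in (text_X, audio_X, video_X):
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    unimodal_probs.append(clf.predict_proba(X)[:, 1:])

# Stage 2 (decision-level fusion): a k-NN classifier over the stacked unimodal
# outputs. For brevity it is fit on the same data; in practice it would be
# trained on held-out unimodal predictions.
meta_X = np.hstack(unimodal_probs)
meta_clf = KNeighborsClassifier(n_neighbors=5).fit(meta_X, y)
print("fused training accuracy:", meta_clf.score(meta_X, y))
```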