Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

  1Shenzhen Technology University   2Carnegie Mellon University   3Alibaba Group   4National University of Singapore   5Chinese Academy of Sciences

Abstract

Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal Large Language Models (MLLMs) face challenges in integrating audio and recognizing subtle facial micro-expressions. To address these limitations, we introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. This dataset enables models to learn from varied scenarios and generalize to real-world applications. Furthermore, we propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and employing a modified LLaMA model with instruction tuning, Emotion-LLaMA significantly enhances both emotion recognition and reasoning capabilities. Extensive evaluations show that Emotion-LLaMA outperforms other MLLMs, achieving top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on the MER2023-SEMI challenge, and the highest UAR (45.59) and WAR (59.37) in zero-shot evaluations on the DFEW dataset.


Demo Presentation

MERR Dataset


Comparison of emotional datasets. The table presents a comparative analysis of several key emotional datasets, including DFEW, MER2023, EMER, and MERR. It highlights the unique features and contributions of each dataset, such as the range of emotion categories, availability of multimodal annotations, and dataset size. This comparison underscores the significance of the MERR dataset in advancing multimodal emotion recognition and reasoning research.

Framework


Architecture of Emotion-LLaMA, which integrates audio, visual, and text inputs for multimodal emotional recognition and reasoning.

Multiview Multimodal Encoder. To capture emotional cues in the audio and visual modalities, we leverage the HuBERT model as our audio encoder \(\mathcal{E}^{aud}\) and a multiview visual encoder \(\mathcal{E}^{vis}\). HuBERT extracts a comprehensive auditory representation \(u^{aud}\) from the input audio signal and has shown remarkable performance on emotion recognition tasks.
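As a rough illustration of this step, the snippet below extracts an utterance-level auditory feature with HuBERT via Hugging Face Transformers and torchaudio. The checkpoint name, the 16 kHz resampling, and the mean pooling over time are our own simplifying assumptions for the sketch, not necessarily the exact Emotion-LLaMA configuration.

```python
# Minimal sketch: utterance-level auditory feature u^aud from HuBERT.
# Checkpoint, resampling rate, and pooling are illustrative assumptions.
import torch
import torchaudio
from transformers import AutoFeatureExtractor, HubertModel

CKPT = "facebook/hubert-large-ls960-ft"  # assumed checkpoint for illustration
extractor = AutoFeatureExtractor.from_pretrained(CKPT)
hubert = HubertModel.from_pretrained(CKPT).eval()

waveform, sr = torchaudio.load("clip.wav")                        # (channels, samples)
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)  # mono, 16 kHz

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = hubert(**inputs).last_hidden_state                   # (1, T, hidden_dim)

u_aud = hidden.mean(dim=1)                                         # utterance-level u^aud
```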

We use a vision preprocessor to unify the vision modalities, including facial sequences and the peak frame extracted from the input video. Three visual encoders \(\mathcal{E}^{vis} = \{\mathcal{E}^{vis}_{glo}, \mathcal{E}^{vis}_{loc}, \mathcal{E}^{vis}_{temp}\}\) are employed to extract complementary multi-view visual emotional features (a code sketch of this multi-view encoding follows the list):

  • Local Encoder: A ViT-structured model pre-trained with the MAE scheme extracts static facial expression features. The facial sequence \(V\) is fed into the local encoder, and the output frame-wise features are fused by average pooling, producing the local visual feature \(u^{vis}_{loc} = \mathrm{AVG}(\mathcal{E}^{vis}_{loc}(V))\).
  • Temporal Encoder: A VideoMAE model produces the temporal feature \(u^{vis}_{temp} = \mathcal{E}^{vis}_{temp}(V)\) of the facial sequence, learning facial dynamics that indicate emotional states and offering a dynamic temporal view of human emotion.
  • Global Encoder: A ViT-structured model, EVA, initialized with official pre-trained weights, produces the visual feature \(u^{vis}_{glo} = \mathcal{E}^{vis}_{glo}(\mathit{Frame}_{\mathit{peak}})\), capturing not only facial expressions but also background context.
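
Below is a minimal PyTorch sketch of how the three views could be combined and projected into the language model's token space. The encoder modules are passed in as arguments (stand-ins for the MAE, VideoMAE, and EVA backbones), and the equal output dimension across encoders plus simple linear projections are our own simplifying assumptions rather than the released implementation.

```python
# Minimal sketch of multi-view visual encoding with hypothetical backbones.
import torch
import torch.nn as nn

class MultiviewVisualEncoder(nn.Module):
    def __init__(self, local_enc: nn.Module, temporal_enc: nn.Module,
                 global_enc: nn.Module, dim: int, llm_dim: int):
        super().__init__()
        self.local_enc = local_enc        # MAE-pretrained ViT (frame-wise)
        self.temporal_enc = temporal_enc  # VideoMAE (clip-level)
        self.global_enc = global_enc      # EVA (peak frame + context)
        # linear projections aligning each view into the LLM token space (assumed)
        self.proj = nn.ModuleDict({
            k: nn.Linear(dim, llm_dim) for k in ("loc", "temp", "glo")
        })

    def forward(self, frames: torch.Tensor, peak_frame: torch.Tensor):
        # frames: (B, T, C, H, W) facial sequence; peak_frame: (B, C, H, W)
        B, T = frames.shape[:2]
        per_frame = self.local_enc(frames.flatten(0, 1))            # (B*T, dim)
        u_loc = per_frame.view(B, T, -1).mean(dim=1)                # AVG over frames
        u_temp = self.temporal_enc(frames)                          # (B, dim)
        u_glo = self.global_enc(peak_frame)                         # (B, dim)
        tokens = torch.stack([self.proj["loc"](u_loc),
                              self.proj["temp"](u_temp),
                              self.proj["glo"](u_glo)], dim=1)      # (B, 3, llm_dim)
        return tokens                                                # visual tokens for the LLM
```

In Emotion-LLaMA, the resulting visual tokens are aligned in a shared space with the audio feature and text tokens before being fed to the instruction-tuned LLaMA model, as described in the abstract.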

Multimodal Emotion Recognition and Reasoning


Detailed examples of multimodal emotion recognition and reasoning performed by the Emotion-LLaMA model. The figure showcases the model's core capabilities in accurately identifying emotions from multimodal data and generating human-like explanations for its predictions. These examples demonstrate the model's proficiency in capturing subtle emotional cues, integrating information across modalities, and providing meaningful insights into its decision-making process.


Detailed examples of general tasks performed by the Emotion-LLaMA model. The figure illustrates the model's versatility and robustness in handling tasks beyond emotion recognition, such as face detection and question answering. These examples highlight the model's ability to process and understand visual and textual information, enabling its application in a wide range of scenarios.

Comparisons with SOTA MLLMs


To illustrate the qualitative performance of Emotion-LLaMA, we present a detailed comparison of emotion reasoning results across different models. The table displays the emotion reasoning results of the four highest-scoring models. The video shows a person smiling while questioning another individual; the smile conveys dissatisfaction rather than genuine happiness, suggesting an angry emotional state. Accurate emotion reasoning for this sample requires integrating information from multiple modalities. PandaGPT and Valley captured the correct visual features but failed to incorporate information from the other modalities, incorrectly classifying the emotion as happy. In contrast, VideoChat-Embed eventually reached the correct inference, but its reasoning was compromised by hallucinations. Emotion-LLaMA went a step further, recognizing the person's tone of voice and combining subtle facial expressions with multimodal information for accurate emotion reasoning. This example demonstrates our model's superiority in understanding and integrating emotional cues from multiple modalities, resulting in more precise and contextually relevant emotion recognition.