MERR Dataset Construction
A detailed guide on how the MERR dataset was constructed using our automated pipeline.
Table of Contents
- Construction Pipeline
- Data Source
- Construction Steps
- Example Annotation
- Quality Control
- Limitations
- Future Improvements
- Tools and Models Used
- Access the Pipeline
- Questions?
Construction Pipeline
We built a unified MER dataset construction pipeline called MER-Factory.
Full documentation is available at: MER-Factory Documentation
Data Source
The MERR dataset is constructed from MER2023-SEMI, which contains over 70,000 unlabeled video clips. We use several powerful multimodal models to extract emotional cues from the different modalities, and then use the LLaMA-3 model to reason over all of these cues and produce the final multimodal description.

Construction Steps
Step 1: Data Filtering
We employed OpenFace to extract and align faces from the video segments and to detect facial muscle movements in the form of Action Units (AUs).
Certain combinations of these muscle movements correlate with specific emotions:
- Surprise: AU05 (upper lid raiser) + AU26 (jaw drop)
- Happiness: AU06 (cheek raiser) + AU12 (lip corner puller)
- Sadness: AU01 (inner brow raiser) + AU04 (brow lowerer) + AU15 (lip corner depressor)
- Anger: AU04 (brow lowerer) + AU05 (upper lid raiser) + AU07 (lid tightener) + AU23 (lip tightener)
Samples whose detected AUs matched one of these combinations were assigned the corresponding emotion as a pseudo-label, marking them as selected and as exhibiting strong emotional expression (a sketch of this rule follows below).
Result: 28,618 samples were selected and assigned pseudo-labels.
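The filtering rule above amounts to a subset check over the AUs detected for each sample. Below is a minimal sketch, assuming the OpenFace output has already been reduced to a set of active AU codes per sample; the function and variable names are ours, not part of the released pipeline.

```python
# Minimal sketch of the AU-based pseudo-labeling rule, assuming OpenFace output
# has already been reduced to a set of active AU codes per sample.
AU_RULES = {
    "surprise":  {"AU05", "AU26"},
    "happiness": {"AU06", "AU12"},
    "sadness":   {"AU01", "AU04", "AU15"},
    "anger":     {"AU04", "AU05", "AU07", "AU23"},
}

def assign_pseudo_label(active_aus: set[str]) -> str | None:
    """Return an emotion pseudo-label if all AUs of a rule are active, else None."""
    for emotion, required_aus in AU_RULES.items():
        if required_aus <= active_aus:  # all required AUs are present
            return emotion
    return None                         # sample is filtered out

# Example: a clip whose active AUs are {AU06, AU12, AU25} is labeled "happiness".
print(assign_pseudo_label({"AU06", "AU12", "AU25"}))  # -> happiness
```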

Step 2: Visual Expression Description
Due to natural actions such as blinking and speaking in the videos, different combinations of Action Units are extracted from various frames. Thus, determining the AUs that most accurately represent the current emotion is crucial.
Our approach involves analyzing the amplitude values of the Action Units to identify the “emotional peak frame”:
- Identify the most frequently occurring Action Units across all frames
- Sum their amplitude values
- Select the frame with the highest total as the emotional peak frame
- Map the Action Units of this frame to their corresponding visual expression descriptions
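A minimal sketch of the peak-frame selection described above, assuming the OpenFace output has been loaded into a per-frame mapping from active AU code to intensity; the names and the top-5 cutoff are illustrative assumptions, not the released implementation.

```python
from collections import Counter

def find_emotional_peak_frame(frames: list[dict[str, float]]) -> int:
    """frames[i] maps active AU codes (e.g. "AU06") to intensity values for frame i.

    Returns the index of the emotional peak frame: the frame whose summed
    intensities over the most frequently occurring AUs is largest.
    """
    # 1. Count how often each AU is active across all frames.
    counts = Counter(au for frame in frames for au in frame)
    top_aus = {au for au, _ in counts.most_common(5)}  # top-5 is an arbitrary choice here

    # 2. Sum the intensities of those AUs per frame and pick the maximum.
    def frame_score(frame: dict[str, float]) -> float:
        return sum(value for au, value in frame.items() if au in top_aus)

    return max(range(len(frames)), key=lambda i: frame_score(frames[i]))
```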

Step 3: Visual Objective Description
We input the complete emotional peak frame into the MiniGPT-v2 model, enabling it to describe:
- Scene context
- Character gestures and postures
- Environmental factors
- Visual details
Example output:
A person is sitting in a dimly lit room, their face showing signs of distress
with furrowed brows and a slight frown. Their body posture is tense, with
shoulders slightly hunched forward.
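To hand the peak frame to a vision-language model such as MiniGPT-v2, it first has to be exported as an image. The snippet below is a sketch of that extraction step with OpenCV only; the actual MiniGPT-v2 call depends on your deployment, and the prompt shown is illustrative rather than the released one.

```python
import cv2

def export_peak_frame(video_path: str, peak_index: int, out_path: str) -> None:
    """Save the emotional peak frame of a clip as an image for the VLM to describe."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, peak_index)  # jump to the peak frame
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read frame {peak_index} from {video_path}")
    cv2.imwrite(out_path, frame)

# The exported image is then described with a prompt along these lines
# (wording is an assumption, not the released prompt):
VISUAL_PROMPT = (
    "Describe the scene, the person's gestures and posture, "
    "and any environmental or visual details in this image."
)
```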
Step 4: Audio Tone Description
We use audio as input for the Qwen-Audio model, which then describes:
- Speaker’s tone and intonation
- Voice quality (pitch, volume, speed)
- Emotional prosody
- Acoustic patterns
These audio cues are equally crucial for understanding the emotion.
Example output:
The speaker's voice is trembling slightly, with a lower pitch and slower pace,
indicating a sad or distressed emotional state.
While Qwen-Audio performs well among various large audio models, some errors are present in emotion descriptions since these models are not specifically trained on emotional content.
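For reference, here is a minimal sketch of querying Qwen-Audio-Chat through its Hugging Face chat interface. Treat the model ID, prompt wording, and exact call signature as assumptions to verify against the official Qwen-Audio documentation.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model ID and chat interface; check the Qwen-Audio repository for the
# currently supported usage before relying on this.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-Audio-Chat", device_map="auto", trust_remote_code=True
).eval()

query = tokenizer.from_list_format([
    {"audio": "sample_00000047.wav"},  # hypothetical clip path
    {"text": "Describe the speaker's tone, pitch, volume, speed, and emotional prosody."},
])
response, _history = model.chat(tokenizer, query=query, history=None)
print(response)
```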
Step 5: Coarse-Grained Synthesis
By integrating visual and audio descriptions with lexical subtitles in a templated sequence, we generate a coarse-grained emotional description.
Template structure:
[Visual Expression] + [Audio Tone] + [Textual Content] → Emotion Label
Result: 28,618 coarse-grained descriptions were produced.
Example coarse-grained annotation:
Visual: Furrowed brows, slight frown, tense posture
Audio: Trembling voice, lower pitch, slower pace
Text: "I can't believe this is happening..."
Label: Sadness
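The coarse-grained synthesis is essentially a fixed template fill over the per-modality cues. A minimal sketch, where the template wording and function names are ours rather than the released ones:

```python
COARSE_TEMPLATE = (
    "Visual expression: {visual}. "
    "Audio tone: {audio}. "
    'Subtitle: "{text}". '
    "Pseudo-label: {label}."
)

def build_coarse_description(visual: str, audio: str, text: str, label: str) -> str:
    """Concatenate the per-modality cues into one coarse-grained description."""
    return COARSE_TEMPLATE.format(visual=visual, audio=audio, text=text, label=label)

print(build_coarse_description(
    visual="furrowed brows, slight frown, tense posture",
    audio="trembling voice, lower pitch, slower pace",
    text="I can't believe this is happening...",
    label="sadness",
))
```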
Step 6: Fine-Grained Generation
Merely concatenating these components does not explain what actually triggers the emotion. We therefore feed all of the emotional cues into the LLaMA-3 model to:
- Sift through the cues and identify the relevant ones
- Combine the different cues for inference
- Generate comprehensive emotional descriptions
- Filter out erroneous or contradictory descriptions
Since the previously gathered cues were unverified, some of them were erroneous or contradictory. Using the LLaMA-3 output, we could easily filter out the affected samples.
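In practice this step reduces to assembling all of the cues into one instruction prompt for LLaMA-3. A minimal sketch of how such a prompt might be assembled; the wording is illustrative, not the released prompt:

```python
def build_reasoning_prompt(visual_expr: str, visual_obj: str, audio: str, text: str) -> str:
    """Assemble the multimodal cues into an instruction prompt for LLaMA-3."""
    return (
        "You are given emotional cues extracted from a video clip.\n"
        f"Facial expression: {visual_expr}\n"
        f"Visual context: {visual_obj}\n"
        f"Audio tone: {audio}\n"
        f"Subtitle: {text}\n\n"
        "Identify the cues that are relevant and consistent, discard any that "
        "contradict the others, and explain which emotion the person most "
        "likely feels and why."
    )
```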
Additionally, we:
- Removed duplicates from the original dataset
- Randomly selected neutral samples to enrich the dataset
- Balanced the distribution across emotion categories
Result: The final MERR dataset contains 4,487 samples with detailed multimodal descriptions.
Example Annotation
Here’s a complete example of a fine-grained annotation:

The annotation includes:
- Video ID: sample_00000047
- Emotion Label: Happiness
- Peak Frame: Frame 23 (highest AU activation)
- Action Units: AU06 + AU12 (happiness indicators)
- Visual Description: “Genuine smile with raised cheeks and pulled lip corners”
- Audio Description: “Bright, higher-pitched voice with upward intonation”
- Text: “I’m so excited to share this news with you!”
- Multimodal Reasoning: “The combination of a genuine smile (Duchenne smile with AU06 and AU12), enthusiastic vocal tone, and positive language indicates strong happiness and excitement about sharing positive news.”
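Laid out as a record, such an annotation might look like the following. The field names here are illustrative assumptions and may differ from the keys used in the released annotation files.

```python
# Hypothetical layout of one fine-grained MERR annotation (field names assumed).
annotation = {
    "video_id": "sample_00000047",
    "emotion_label": "happiness",
    "peak_frame": 23,
    "action_units": ["AU06", "AU12"],
    "visual_expression": "Genuine smile with raised cheeks and pulled lip corners",
    "audio_description": "Bright, higher-pitched voice with upward intonation",
    "text": "I'm so excited to share this news with you!",
    "multimodal_reasoning": (
        "The combination of a genuine smile (Duchenne smile with AU06 and AU12), "
        "enthusiastic vocal tone, and positive language indicates strong happiness "
        "and excitement about sharing positive news."
    ),
}
```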
Quality Control
Filtering Criteria
We applied multiple quality control measures:
- AU Confidence Threshold: Only frames with AU detection confidence > 0.8
- Audio Quality: Signal-to-noise ratio (SNR) > 20dB
- Video Quality: Resolution ≥ 480p, clear facial visibility
- Text Quality: Proper transcription with < 10% error rate
- Consistency Check: Cross-modal consistency validation
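A minimal sketch of how these thresholds could be checked per candidate sample; the thresholds mirror the list above, but the field names are our own assumptions.

```python
def passes_quality_control(sample: dict) -> bool:
    """Apply the filtering criteria listed above to one candidate sample."""
    return (
        sample["au_confidence"] > 0.8                   # AU detection confidence
        and sample["snr_db"] > 20                       # audio signal-to-noise ratio (dB)
        and sample["resolution_height"] >= 480          # video resolution >= 480p
        and sample["transcription_error_rate"] < 0.10   # transcription error rate < 10%
        and sample["cross_modal_consistent"]            # cross-modal consistency check
    )
```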
Human Verification
A subset of 500 samples was manually verified by human annotators:
- Agreement rate: 94.2% for coarse-grained labels
- Agreement rate: 91.8% for fine-grained descriptions
Limitations
Disgust Emotion
During the data annotation process, only 2 “disgust” samples were identified. Due to their limited number, we chose not to include them in the MERR dataset.
We plan to explore more effective data filtering techniques to uncover more samples of less common emotions.
Audio Model Limitations
In our tests and usage, Qwen-Audio performed exceptionally well among various large audio models. However, since these models are not specifically trained on emotional content, many errors are present in the emotion descriptions.
Further research into the application of large audio models in emotion recognition is needed.
Language Limitation
The current dataset primarily contains Chinese language content with English translations. Expanding to more languages is part of future work.
Future Improvements
We are actively working on:
- Emotion Coverage: Techniques to identify rare emotions like disgust and contempt
- Audio Models: Training emotion-specific audio analysis models
- Multilingual Support: Expanding to multiple languages
- Real-world Scenarios: Including more diverse contexts and situations
- Temporal Dynamics: Better capturing emotion transitions over time
Tools and Models Used
| Component | Model/Tool | Purpose |
|---|---|---|
| Face Detection | OpenFace | Extract facial features and Action Units |
| Visual Description | MiniGPT-v2 | Generate scene and gesture descriptions |
| Audio Description | Qwen-Audio | Analyze tone and vocal characteristics |
| Fine-grained Generation | LLaMA-3 | Synthesize multimodal reasoning |
| Feature Extraction | HuBERT, EVA, MAE, VideoMAE | Extract multimodal features |
Access the Pipeline
To use our dataset construction pipeline for your own data:
- Visit MER-Factory
- Follow the documentation
- Customize the pipeline for your needs
Questions?
For questions about the dataset construction process:
- Open an issue on GitHub
- Refer to the MER-Factory documentation
- Contact the authors