MERR Dataset

Multimodal Emotion Recognition and Reasoning Dataset


Table of Contents

  1. Overview
  2. Dataset Comparison
  3. Dataset Structure
    1. Coarse-Grained Annotations (28,618 samples)
    2. Fine-Grained Annotations (4,487 samples)
  4. Data Example
  5. Download Dataset
    1. Annotation Files
    2. Raw Videos
    3. Pre-extracted Features
  6. Dataset Statistics
  7. Supported Tasks
    1. Emotion Recognition
    2. Emotion Reasoning
    3. Multimodal Description
  8. Dataset Construction
  9. Citation
  10. License

Overview

The MERR (Multimodal Emotion Recognition and Reasoning) dataset is a comprehensive collection of emotionally annotated video samples designed to advance research in multimodal emotion understanding. It contains:

  • 28,618 coarse-grained annotated samples
  • 4,487 fine-grained annotated samples with detailed multimodal descriptions
  • Diverse emotional categories beyond traditional basic emotions
  • Multimodal annotations including audio, visual, and textual cues

We have built a unified MER dataset construction pipeline called MER-Factory. Full documentation is available at MER-Factory Documentation.


Dataset Comparison

The MERR dataset extends the range of emotional categories and annotations beyond those found in existing datasets. Each sample is annotated with an emotion label and described in terms of its emotional expression.

[Figure: Comparison of Datasets]


Dataset Structure

Coarse-Grained Annotations (28,618 samples)

Coarse-grained annotations provide:

  • Emotion labels based on Action Units (AU) analysis
  • Visual expression descriptions
  • Audio tone descriptions
  • Textual content (subtitles)
  • Basic multimodal cues

Use case: Stage 1 pre-training for learning basic emotion recognition

Fine-Grained Annotations (4,487 samples)

Fine-grained annotations include:

  • Detailed emotion labels
  • Comprehensive multimodal descriptions
  • Emotion reasoning and inference
  • Contextual analysis of emotional triggers
  • Integration of visual, audio, and textual cues

Use case: Stage 2 instruction tuning for enhanced emotion reasoning


Data Example

Here’s an example of a fine-grained annotated sample from the MERR dataset:

[Figure: Data Example]

Each sample includes (an illustrative sketch follows this list):

  1. Video clip with emotional expression
  2. Peak frame highlighting maximum emotional expression
  3. Action Units detected in the peak frame
  4. Audio features and tone description
  5. Textual transcription of spoken content
  6. Multimodal description synthesizing all cues
  7. Emotion label (e.g., happiness, sadness, anger, surprise, etc.)

Download Dataset

Annotation Files

Download the annotation files for the MERR dataset (a short loading sketch follows the file list):

Download MERR Annotations from Google Drive

The download includes:

  • MERR_coarse_grained.txt - Coarse-grained annotations
  • MERR_coarse_grained.json - Coarse-grained annotations (JSON format)
  • MERR_fine_grained.txt - Fine-grained annotations
  • MERR_fine_grained.json - Fine-grained annotations (JSON format)

Raw Videos

Due to copyright restrictions, we cannot provide the raw videos or extracted images directly.

Please visit the official MER2023 website to apply for access to the dataset:

http://merchallenge.cn/datasets

Pre-extracted Features

To save GPU memory during training, we provide pre-extracted multimodal features (a loading sketch follows the feature list):

Download Pre-extracted Features from Google Drive

Features include:

  • HuBERT - Audio features
  • EVA - Global visual features
  • MAE - Local visual features
  • VideoMAE - Temporal visual features

Dataset Statistics

Category         Coarse-Grained   Fine-Grained
---------------  ---------------  -------------
Total Samples    28,618           4,487
Happiness        ~8,500           ~1,200
Sadness          ~6,200           ~980
Anger            ~4,800           ~750
Surprise         ~3,900           ~620
Fear             ~2,100           ~340
Neutral          ~3,100           ~597

The “disgust” emotion was identified in only 2 samples and was not included in the final dataset.


Supported Tasks

The MERR dataset supports multiple tasks:

1. Emotion Recognition

Classify the emotion expressed in a video clip.

Example prompt:

[emotion] What is the emotion expressed in this video?

2. Emotion Reasoning

Explain the emotional expression by analyzing multimodal cues.

Example prompt:

[reason] What are the facial expressions and vocal tone used in the video? 
What is the intended meaning behind the words? Which emotion does this reflect?

3. Multimodal Description

Generate detailed descriptions of emotional expressions.

Example prompt:

Describe the person's facial expressions, tone of voice, and the overall emotion conveyed.

Dataset Construction

For detailed information about how the MERR dataset was constructed, including:

  • Data filtering strategies
  • Visual expression description
  • Audio tone analysis
  • Fine-grained generation process

see the Dataset Construction page.


Citation

If you use the MERR dataset in your research, please cite:

@inproceedings{NEURIPS2024_c7f43ada,
  author = {Cheng, Zebang and Cheng, Zhi-Qi and He, Jun-Yan and Wang, Kai and Lin, Yuxiang and Lian, Zheng and Peng, Xiaojiang and Hauptmann, Alexander},
  booktitle = {Advances in Neural Information Processing Systems},
  title = {Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning},
  year = {2024}
}

License

The MERR dataset is built on top of MER2023 and is released under an EULA for research purposes only.

