Training Emotion-LLaMA

A complete guide to training your own Emotion-LLaMA model, from pre-training through instruction tuning.


Table of Contents

  1. Training Overview
  2. Prerequisites
    1. Hardware Requirements
    2. Software Requirements
  3. Stage 1: Pre-training
    1. Step 1: Download the Dataset
    2. Step 2: Prepare Multi-modal Encoders
    3. Step 3: Configure Dataset
    4. Step 4: Configure Multi-task Instructions
    5. Step 5: Run Pre-training
    6. Training Progress
  4. Stage 2: Instruction Tuning
  5. Training Tips
    1. Optimize GPU Memory
    2. Monitor Training
    3. Resume Training
  6. Evaluation During Training
  7. Hyperparameter Tuning
  8. Troubleshooting
    1. Common Issues
  9. Next Steps
  10. Questions?

Training Overview

Emotion-LLaMA training consists of two stages:

  1. Stage 1: Pre-training - Train on coarse-grained MERR dataset (28,618 samples)
  2. Stage 2: Instruction Tuning - Fine-tune on fine-grained MERR dataset (4,487 samples)

This two-stage approach enables the model to:

  • Learn basic multimodal emotion recognition in Stage 1
  • Develop advanced emotion reasoning capabilities in Stage 2

Prerequisites

Hardware Requirements

  • GPUs: 4x NVIDIA GPUs with at least 24GB VRAM each (e.g., RTX 3090, A5000, or better)
  • RAM: 64GB or more recommended
  • Storage: 100GB+ free space for datasets and checkpoints

Software Requirements

  • ✅ Emotion-LLaMA environment installed (see Getting Started)
  • ✅ PyTorch with CUDA support
  • ✅ Distributed training libraries (automatically included in environment)

Stage 1: Pre-training

Step 1: Download the Dataset

Due to copyright restrictions, we cannot provide raw videos directly.

Visit the official MER2023 website to apply for dataset access:

http://merchallenge.cn/datasets

After obtaining access, specify the dataset path in minigpt4/configs/datasets/firstface/featureface.yaml:

# Set Dataset video path
image_path: /path/to/datasets/Emotion/MER2023/video

Step 2: Prepare Multi-modal Encoders

To extract rich emotion features, we use:

  • HuBERT - Audio Encoder
  • EVA - Global Visual Encoder
  • MAE - Local Visual Encoder
  • VideoMAE - Temporal Encoder

To save GPU memory, we use pre-extracted features instead of loading all encoders directly.

Download the pre-extracted features: Google Drive Link

Save the features to your dataset folder, then modify the get() function in minigpt4/datasets/datasets/first_face.py to set the feature paths.

Details of the feature extraction process are available in the “feature_extract” folder: Google Drive Link
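
As a rough sketch, loading the pre-extracted features per sample might look like the following; the directory layout, file names, and helper below are illustrative placeholders rather than the actual code in first_face.py:

import os
import torch

# Placeholder root: point this at wherever you saved the downloaded features.
FEATURE_ROOT = "/path/to/datasets/Emotion/MER2023/features"

def load_sample_features(sample_id):
    """Illustrative helper: load one sample's pre-extracted features."""
    return {
        "audio":    torch.load(os.path.join(FEATURE_ROOT, "hubert", f"{sample_id}.pt")),    # HuBERT
        "global":   torch.load(os.path.join(FEATURE_ROOT, "eva", f"{sample_id}.pt")),       # EVA
        "local":    torch.load(os.path.join(FEATURE_ROOT, "mae", f"{sample_id}.pt")),       # MAE
        "temporal": torch.load(os.path.join(FEATURE_ROOT, "videomae", f"{sample_id}.pt")),  # VideoMAE
    }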

Step 3: Configure Dataset

In minigpt4/configs/datasets/firstface/featureface.yaml, select the coarse-grained annotations:

# Use coarse-grained annotations for Stage 1
annotation_file: MERR_coarse_grained.txt

This provides 28,618 samples for pre-training.

Step 4: Configure Multi-task Instructions

Set the task types in minigpt4/datasets/datasets/first_face.py:

self.task_pool = [
    "emotion",      # Multi-modal emotion recognition task
    "reason",       # Multi-modal emotion inference task
    # "reason_v2",  # Advanced reasoning (Stage 2)
]

Each task draws its prompt at random from a dedicated instruction pool:

Emotion Task Examples:

  • “What is the emotion expressed in this video?”
  • “Identify the primary emotion shown.”
  • “Classify the emotional state.”

Reason Task Examples:

  • “What are the facial expressions and vocal tone used? What emotion does this reflect?”
  • “Analyze the multimodal cues and explain the emotion.”
  • “Why is this person experiencing this emotion?”
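
Conceptually, each training sample first draws a task from the task pool and then a prompt from that task's instruction pool. A minimal sketch, using only the example prompts above (the real pools in first_face.py are larger):

import random

instruction_pools = {
    "emotion": [
        "What is the emotion expressed in this video?",
        "Identify the primary emotion shown.",
        "Classify the emotional state.",
    ],
    "reason": [
        "What are the facial expressions and vocal tone used? What emotion does this reflect?",
        "Analyze the multimodal cues and explain the emotion.",
        "Why is this person experiencing this emotion?",
    ],
}

task = random.choice(["emotion", "reason"])      # pick a task for this sample
prompt = random.choice(instruction_pools[task])  # pick a prompt for that task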

Step 5: Run Pre-training

Execute the training script with 4 GPUs:

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc-per-node 4 train.py --cfg-path train_configs/Emotion-LLaMA_finetune.yaml

Training Configuration (train_configs/Emotion-LLaMA_finetune.yaml):

model:
  arch: minigpt_v2
  llama_model: "/path/to/checkpoints/Llama-2-7b-chat-hf"
  ckpt: "/path/to/checkpoints/minigptv2_checkpoint.pth"
  lora_r: 64
  lora_alpha: 16

datasets:
  feature_face_caption:
    batch_size: 1

run:
  lr_sched: "linear_warmup_cosine_lr"
  init_lr: 1e-5
  min_lr: 1e-6
  warmup_lr: 1e-6
  weight_decay: 0.05
  max_epoch: 30
  num_workers: 6
  iters_per_epoch: 1000
  warmup_steps: 1000
  seed: 42
  amp: True
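
To make the run settings concrete, here is an approximation of what a linear_warmup_cosine_lr schedule computes: a linear ramp from warmup_lr to init_lr over warmup_steps, then cosine decay toward min_lr. This is a sketch for intuition; the scheduler in the codebase may differ in details (e.g., per-epoch vs. per-step decay):

import math

def lr_at(step, init_lr=1e-5, min_lr=1e-6, warmup_lr=1e-6,
          warmup_steps=1000, max_epoch=30, iters_per_epoch=1000):
    """Approximate linear-warmup + cosine-decay learning rate (illustrative only)."""
    total_steps = max_epoch * iters_per_epoch
    if step < warmup_steps:
        # Linear warmup from warmup_lr up to init_lr
        return warmup_lr + (init_lr - warmup_lr) * step / warmup_steps
    # Cosine decay from init_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (init_lr - min_lr) * (1 + math.cos(math.pi * progress))

# lr_at(0) -> 1e-06, lr_at(1000) -> 1e-05, lr_at(30000) -> 1e-06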

Training Progress

Monitor the training with:

  • Training loss: Should decrease steadily
  • Validation accuracy: Should improve over epochs
  • Checkpoints: Saved in checkpoints/save_checkpoint/

Expected training time: ~2-3 days on 4x A5000 GPUs


Stage 2: Instruction Tuning

For advanced emotion reasoning with fine-grained annotations, see Instruction Tuning.


Training Tips

Optimize GPU Memory

If you encounter out-of-memory errors:

  1. Batch size is already 1 per GPU, so it cannot be lowered further
    • With 4 GPUs, the effective batch size is 4
  2. Mixed precision is enabled by default:
    amp: True  # Already enabled

  3. Gradient checkpointing is enabled by default:
    use_grad_checkpoint: True  # Already enabled

  4. Reduce the image size (as a last resort):
    image_size: 224  # Instead of 448
    

Monitor Training

Use TensorBoard to visualize training:

tensorboard --logdir=checkpoints/save_checkpoint/

View metrics at: http://localhost:6006

Resume Training

If training is interrupted, resume from the last checkpoint:

resume_ckpt_path: "checkpoints/save_checkpoint/checkpoint_15.pth"

Evaluation During Training

To evaluate on validation set during training:

run:
  evaluate: true
  eval_epoch_interval: 5  # Evaluate every 5 epochs

Hyperparameter Tuning

Key hyperparameters to tune:

Parameter     Description            Default  Range
init_lr       Initial learning rate  1e-5     1e-6 to 1e-4
batch_size    Batch size per GPU     1        1 to 2
max_epoch     Number of epochs       30       20 to 50
warmup_steps  Warmup steps           1000     500 to 2000
weight_decay  Weight decay           0.05     0.01 to 0.1
lora_r        LoRA rank              64       32 to 128
lora_alpha    LoRA alpha             16       8 to 32
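
Note that lora_r and lora_alpha interact: in standard LoRA implementations the adapter update is scaled by alpha / r, so increasing the rank without adjusting alpha also shrinks the effective update magnitude. A quick illustration (assuming the standard scaling rule, not code from this repository):

# Standard LoRA applies the adapter as W + (lora_alpha / lora_r) * B @ A
lora_r, lora_alpha = 64, 16
scaling = lora_alpha / lora_r   # 0.25 with the defaults above
# Doubling lora_r to 128 while keeping lora_alpha = 16 halves the scaling to 0.125.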

Troubleshooting

Common Issues

Issue: “CUDA out of memory”

  • Solution: Reduce the batch size or enable gradient accumulation (see the sketch below)
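
If the training config does not expose an accumulation option, the generic PyTorch pattern is to scale the loss and step the optimizer every few batches. The toy model, data, and accum_steps below are illustrative stand-ins, not Emotion-LLaMA internals:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins so the pattern runs end to end (illustrative only)
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
dataloader = DataLoader(TensorDataset(torch.randn(16, 8), torch.randn(16, 1)), batch_size=1)

accum_steps = 4  # simulates an effective batch size 4x larger per GPU
optimizer.zero_grad()
for i, (x, y) in enumerate(dataloader):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps  # scale so accumulated gradients average out
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()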

Issue: “NaN loss during training”

  • Solution: Reduce the learning rate or enable gradient clipping (see the sketch below)
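
Gradient clipping is a one-line addition between the backward pass and the optimizer step, using the standard PyTorch utility (shown here as it would slot into the toy loop from the previous sketch):

# Inside the training loop, after loss.backward() and before optimizer.step():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the global gradient norm at 1.0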

Issue: “Slow training speed”

  • Solution: Increase num_workers or use pre-extracted features

Issue: “Model not converging”

  • Solution: Check data preprocessing and try different learning rates

Next Steps

After completing Stage 1 pre-training:

  1. Continue to Instruction Tuning (Stage 2)
  2. Evaluate your trained model
  3. Deploy the model for inference

Questions?

For training-related questions:

