Training Emotion-LLaMA

A complete guide to training your own Emotion-LLaMA model, from pre-training through instruction tuning.


Table of Contents

  1. Training Overview
  2. Prerequisites
    1. Hardware Requirements
    2. Software Requirements
  3. Stage 1: Pre-training
    1. Step 1: Download the Dataset
    2. Step 2: Prepare Multi-modal Encoders
    3. Step 3: Configure Dataset
    4. Step 4: Configure Multi-task Instructions
    5. Step 5: Run Pre-training
    6. Training Progress
  4. Stage 2: Instruction Tuning
  5. Training Tips
    1. Optimize GPU Memory
    2. Monitor Training
    3. Resume Training
  6. Evaluation During Training
  7. Hyperparameter Tuning
  8. Troubleshooting
    1. Common Issues
  9. Next Steps
  10. Questions?

Training Overview

Emotion-LLaMA training consists of two stages:

  1. Stage 1: Pre-training - Train on coarse-grained MERR dataset (28,618 samples)
  2. Stage 2: Instruction Tuning - Fine-tune on fine-grained MERR dataset (4,487 samples)

This two-stage approach enables the model to:

  • Learn basic multimodal emotion recognition in Stage 1
  • Develop advanced emotion reasoning capabilities in Stage 2

Prerequisites

Hardware Requirements

  • GPUs: 4x NVIDIA GPUs with at least 24GB VRAM each (e.g., RTX 3090, A5000, or better)
  • RAM: 64GB or more recommended
  • Storage: 100GB+ free space for datasets and checkpoints

Software Requirements

  • ✅ Emotion-LLaMA environment installed (see Getting Started)
  • ✅ PyTorch with CUDA support
  • ✅ Distributed training libraries (automatically included in environment)

Stage 1: Pre-training

Step 1: Download the Dataset

Due to copyright restrictions, we cannot provide raw videos directly.

Visit the official MER2023 website to apply for dataset access:

http://merchallenge.cn/datasets

After obtaining access, specify the dataset path in minigpt4/configs/datasets/firstface/featureface.yaml:

# Set Dataset video path
image_path: /path/to/datasets/Emotion/MER2023/video

Step 2: Prepare Multi-modal Encoders

To extract rich emotion features, we use:

  • HuBERT - Audio Encoder
  • EVA - Global Visual Encoder
  • MAE - Local Visual Encoder
  • VideoMAE - Temporal Encoder

To save GPU memory, we use pre-extracted features instead of loading all encoders directly.

Download the pre-extracted features: Google Drive Link

Save the features to your dataset folder, then modify the get() function in minigpt4/datasets/datasets/first_face.py to set the feature paths.

Details of the feature extraction process are available in the “feature_extract” folder: Google Drive Link
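
As a rough sketch, loading the pre-extracted features per sample might look like the following; the directory layout, file names, and helper below are illustrative placeholders rather than the actual code in first_face.py:

import os
import torch

# Placeholder root: point this at wherever you saved the downloaded features.
FEATURE_ROOT = "/path/to/datasets/Emotion/MER2023/features"

def load_sample_features(sample_id):
    """Illustrative helper: load one sample's pre-extracted features."""
    return {
        "audio":    torch.load(os.path.join(FEATURE_ROOT, "hubert", f"{sample_id}.pt")),    # HuBERT
        "global":   torch.load(os.path.join(FEATURE_ROOT, "eva", f"{sample_id}.pt")),       # EVA
        "local":    torch.load(os.path.join(FEATURE_ROOT, "mae", f"{sample_id}.pt")),       # MAE
        "temporal": torch.load(os.path.join(FEATURE_ROOT, "videomae", f"{sample_id}.pt")),  # VideoMAE
    }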

Step 3: Configure Dataset

In minigpt4/configs/datasets/firstface/featureface.yaml, select the coarse-grained annotations:

# Use coarse-grained annotations for Stage 1
annotation_file: MERR_coarse_grained.txt

This provides 28,618 samples for pre-training.

Step 4: Configure Multi-task Instructions

Set the task types in minigpt4/datasets/datasets/first_face.py:

self.task_pool = [
    "emotion",      # Multi-modal emotion recognition task
    "reason",       # Multi-modal emotion inference task
    # "reason_v2",  # Advanced reasoning (Stage 2)
]

Each task draws its prompt at random from a dedicated instruction pool:

Emotion Task Examples:

  • “What is the emotion expressed in this video?”
  • “Identify the primary emotion shown.”
  • “Classify the emotional state.”

Reason Task Examples:

  • “What are the facial expressions and vocal tone used? What emotion does this reflect?”
  • “Analyze the multimodal cues and explain the emotion.”
  • “Why is this person experiencing this emotion?”
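
Conceptually, each training sample first draws a task from the task pool and then a prompt from that task's instruction pool. A minimal sketch, using only the example prompts above (the real pools in first_face.py are larger):

import random

instruction_pools = {
    "emotion": [
        "What is the emotion expressed in this video?",
        "Identify the primary emotion shown.",
        "Classify the emotional state.",
    ],
    "reason": [
        "What are the facial expressions and vocal tone used? What emotion does this reflect?",
        "Analyze the multimodal cues and explain the emotion.",
        "Why is this person experiencing this emotion?",
    ],
}

task = random.choice(["emotion", "reason"])      # pick a task for this sample
prompt = random.choice(instruction_pools[task])  # pick a prompt for that task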

Step 5: Run Pre-training

Execute the training script with 4 GPUs:

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc-per-node 4 train.py --cfg-path train_configs/Emotion-LLaMA_finetune.yaml

Training Configuration (train_configs/Emotion-LLaMA_finetune.yaml):

model:
  arch: minigpt_v2
  llama_model: "/path/to/checkpoints/Llama-2-7b-chat-hf"
  ckpt: "/path/to/checkpoints/minigptv2_checkpoint.pth"
  lora_r: 64
  lora_alpha: 16

datasets:
  feature_face_caption:
    batch_size: 1

run:
  lr_sched: "linear_warmup_cosine_lr"
  init_lr: 1e-5
  min_lr: 1e-6
  warmup_lr: 1e-6
  weight_decay: 0.05
  max_epoch: 30
  num_workers: 6
  iters_per_epoch: 1000
  warmup_steps: 1000
  seed: 42
  amp: True
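
To make the run settings concrete, here is an approximation of what a linear_warmup_cosine_lr schedule computes: a linear ramp from warmup_lr to init_lr over warmup_steps, then cosine decay toward min_lr. This is a sketch for intuition; the scheduler in the codebase may differ in details (e.g., per-epoch vs. per-step decay):

import math

def lr_at(step, init_lr=1e-5, min_lr=1e-6, warmup_lr=1e-6,
          warmup_steps=1000, max_epoch=30, iters_per_epoch=1000):
    """Approximate linear-warmup + cosine-decay learning rate (illustrative only)."""
    total_steps = max_epoch * iters_per_epoch
    if step < warmup_steps:
        # Linear warmup from warmup_lr up to init_lr
        return warmup_lr + (init_lr - warmup_lr) * step / warmup_steps
    # Cosine decay from init_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (init_lr - min_lr) * (1 + math.cos(math.pi * progress))

# lr_at(0) -> 1e-06, lr_at(1000) -> 1e-05, lr_at(30000) -> 1e-06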

Training Progress

Monitor the training with:

  • Training loss: Should decrease steadily
  • Validation accuracy: Should improve over epochs
  • Checkpoints: Saved in checkpoints/save_checkpoint/

Expected training time: ~2-3 days on 4x A5000 GPUs


Stage 2: Instruction Tuning

For advanced emotion reasoning with fine-grained annotations, see Instruction Tuning.


Training Tips

Optimize GPU Memory

If you encounter out-of-memory errors:

  1. Batch size is already 1 per GPU, so it cannot be lowered further
    • With 4 GPUs, the effective batch size is 4
  2. Mixed precision is enabled by default:
    amp: True  # Already enabled

  3. Gradient checkpointing is enabled by default:
    use_grad_checkpoint: True  # Already enabled

  4. Reduce the image size (as a last resort):
    image_size: 224  # Instead of 448
    

Monitor Training

Use TensorBoard to visualize training:

tensorboard --logdir=checkpoints/save_checkpoint/

View metrics at: http://localhost:6006

Resume Training

If training is interrupted, resume from the last checkpoint:

resume_ckpt_path: "checkpoints/save_checkpoint/checkpoint_15.pth"

Evaluation During Training

To evaluate on validation set during training:

run:
  evaluate: true
  eval_epoch_interval: 5  # Evaluate every 5 epochs

Hyperparameter Tuning

Key hyperparameters to tune:

Parameter     Description            Default  Range
init_lr       Initial learning rate  1e-5     1e-6 to 1e-4
batch_size    Batch size per GPU     1        1 to 2
max_epoch     Number of epochs       30       20 to 50
warmup_steps  Warmup steps           1000     500 to 2000
weight_decay  Weight decay           0.05     0.01 to 0.1
lora_r        LoRA rank              64       32 to 128
lora_alpha    LoRA alpha             16       8 to 32
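
Note that lora_r and lora_alpha interact: in standard LoRA implementations the adapter update is scaled by alpha / r, so increasing the rank without adjusting alpha also shrinks the effective update magnitude. A quick illustration (assuming the standard scaling rule, not code from this repository):

# Standard LoRA applies the adapter as W + (lora_alpha / lora_r) * B @ A
lora_r, lora_alpha = 64, 16
scaling = lora_alpha / lora_r   # 0.25 with the defaults above
# Doubling lora_r to 128 while keeping lora_alpha = 16 halves the scaling to 0.125.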

Troubleshooting

Common Issues

Issue: “CUDA out of memory”

  • Solution: Reduce the batch size or enable gradient accumulation (see the sketch below)
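
If the training config does not expose an accumulation option, the generic PyTorch pattern is to scale the loss and step the optimizer every few batches. The toy model, data, and accum_steps below are illustrative stand-ins, not Emotion-LLaMA internals:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins so the pattern runs end to end (illustrative only)
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
dataloader = DataLoader(TensorDataset(torch.randn(16, 8), torch.randn(16, 1)), batch_size=1)

accum_steps = 4  # simulates an effective batch size 4x larger per GPU
optimizer.zero_grad()
for i, (x, y) in enumerate(dataloader):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps  # scale so accumulated gradients average out
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()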

Issue: “NaN loss during training”

  • Solution: Reduce the learning rate or enable gradient clipping (see the sketch below)
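
Gradient clipping is a one-line addition between the backward pass and the optimizer step, using the standard PyTorch utility (shown here as it would slot into the toy loop from the previous sketch):

# Inside the training loop, after loss.backward() and before optimizer.step():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the global gradient norm at 1.0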

Issue: “Slow training speed”

  • Solution: Increase num_workers or use pre-extracted features

Issue: “Model not converging”

  • Solution: Check data preprocessing and try different learning rates

Next Steps

After completing Stage 1 pre-training:

  1. Continue to Instruction Tuning (Stage 2)
  2. Evaluate your trained model
  3. Deploy the model for inference

Questions?

For training-related questions:

