Multi-TW fills a gap in the evaluation of multimodal large language models (MLLMs) for Traditional Chinese, providing 900 multiple-choice questions that span image+text and audio+text inputs. It additionally reports inference latency to reflect practical deployment considerations.
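For concreteness, here is a minimal sketch of what a single benchmark item and its accuracy scoring might look like. The field names (item_id, modality, media_path, options, answer) and the exact-match scoring rule are illustrative assumptions, not Multi-TW's released schema.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class MultiTWItem:
    """One multiple-choice item; field names are illustrative, not official."""
    item_id: str
    modality: Literal["image+text", "audio+text"]  # the two input pairings covered
    media_path: str           # path to the accompanying image or audio clip
    question: str             # Traditional Chinese question text
    options: dict[str, str]   # option key -> option text, e.g. {"A": "...", "B": "..."}
    answer: str               # gold option key, e.g. "B"

def accuracy(predictions: dict[str, str], items: list[MultiTWItem]) -> float:
    """Exact-match accuracy over predicted option keys."""
    correct = sum(predictions.get(item.item_id) == item.answer for item in items)
    return correct / len(items)
```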
Key Contributions
- First tri-modal benchmark focusing on Traditional Chinese QA.
- Includes professionally sourced exam-style items via a collaboration with SC-TOP.
- Evaluates both closed-source and open-source models, covering end-to-end any-to-any architectures as well as VLM pipelines that rely on separate audio transcription (see the sketch after this list).
- Provides latency analysis highlighting efficiency trade-offs.
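As referenced in the list above, the sketch below contrasts the two latency profiles being compared: an end-to-end any-to-any model that consumes the audio clip directly, versus a cascaded pipeline that transcribes first and then queries a text model. The callables and signatures are placeholders, not APIs from the paper; timing each stage with time.perf_counter is one plausible way to attribute per-query latency.

```python
import time
from typing import Callable

def timed(fn: Callable, *args) -> tuple[object, float]:
    """Run one inference call and return (output, elapsed seconds)."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start

def answer_end_to_end(model: Callable[[str, str], str],
                      audio_path: str, question: str) -> tuple[str, float]:
    """End-to-end path: the any-to-any model ingests the audio directly."""
    return timed(model, audio_path, question)

def answer_cascaded(asr: Callable[[str], str], llm: Callable[[str], str],
                    audio_path: str, question: str) -> tuple[str, float]:
    """Cascaded path: transcribe, then answer; latency is the sum of both stages."""
    transcript, t_asr = timed(asr, audio_path)
    answer, t_llm = timed(llm, f"{transcript}\n{question}")
    return answer, t_asr + t_llm
```

Under this decomposition the cascaded path pays the transcription stage's latency on every query, which is consistent with the latency advantage of end-to-end architectures noted below.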
Notable Findings
Closed-source models currently lead overall, yet open-source models remain competitive in audio tasks. End-to-end any-to-any architectures yield latency advantages over pipelines requiring separate transcription.
Impact
Supports future fine-tuning efforts and standardized evaluation for the Taiwanese AI ecosystem.