Research

Multi-TW — Benchmarking Multimodal Models on Traditional Chinese QA in Taiwan

multimodal
benchmark
traditional-chinese
latency

Introduces Multi-TW, the first Traditional Chinese tri-modal (image, audio, text) benchmark for evaluating the performance and latency of any-to-any multimodal models.


Multi-TW addresses gaps in evaluating multimodal large language models (MLLMs) in Traditional Chinese by providing 900 multiple-choice questions spanning image+text and audio+text modality pairs. It also reports inference latency to reflect practical deployment considerations.
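
As a rough sketch of how a benchmark of this kind is typically scored, the Python snippet below runs multiple-choice items through a model callable and records per-item wall-clock latency alongside accuracy. The item schema and the answer_question callable are illustrative assumptions, not Multi-TW's actual evaluation harness.

```python
import time

# A minimal, hypothetical evaluation loop: the item schema and the
# answer_question callable are assumptions, not Multi-TW's actual harness.
def evaluate(items, answer_question):
    """Score multiple-choice items and track per-item wall-clock latency."""
    correct = 0
    latencies = []
    for item in items:
        start = time.perf_counter()
        prediction = answer_question(
            question=item["question"],
            choices=item["choices"],    # e.g. {"A": ..., "B": ..., "C": ..., "D": ...}
            image=item.get("image"),    # present for image+text items
            audio=item.get("audio"),    # present for audio+text items
        )
        latencies.append(time.perf_counter() - start)
        correct += prediction == item["answer"]
    accuracy = correct / len(items)
    mean_latency = sum(latencies) / len(latencies)
    return accuracy, mean_latency
```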

Key Contributions

  • First tri-modal benchmark focusing on Traditional Chinese QA.
  • Includes professionally sourced, exam-style items developed in collaboration with SC-TOP.
  • Evaluates both closed- and open-source models, covering end-to-end any-to-any architectures and VLM-based pipelines.
  • Provides latency analysis highlighting efficiency trade-offs.

Notable Findings

Closed-source models currently lead overall, yet open-source models remain competitive on audio tasks. End-to-end any-to-any architectures show latency advantages over pipelines that require a separate transcription step.
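
To make the latency comparison concrete, the hypothetical timing helpers below contrast the two serving strategies: a cascaded pipeline that transcribes audio before a text-only model answers, versus an any-to-any model that consumes raw audio directly. The transcribe, text_model, and any_to_any_model callables are placeholders, not APIs from the paper.

```python
import time

# Hypothetical timing helpers; transcribe, text_model, and any_to_any_model
# are placeholder callables, not APIs described in the paper.
def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def cascaded_latency(transcribe, text_model, audio, question):
    # Pipeline route: transcribe the audio first, then answer with a text-only model.
    transcript, t_asr = timed(transcribe, audio)
    answer, t_llm = timed(text_model, transcript, question)
    return answer, t_asr + t_llm

def end_to_end_latency(any_to_any_model, audio, question):
    # Any-to-any route: the model consumes raw audio directly, skipping transcription.
    return timed(any_to_any_model, audio, question)
```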

Impact

Supports future fine-tuning efforts and standardized evaluation across the Taiwanese AI ecosystem.