[ACL 2025] Pun2Pun: Benchmarking LLMs on Textual-Visual Chinese-English Pun Translation via Pragmatics Model and Linguistic Reasoning

Lost in Translation? When Puns Travel Between Languages

We’ve all been there: a pun so clever it makes you groan and grin at the same time. But what happens when you try to translate that pun between two very different languages—like Chinese and English?

Turns out, it’s harder than it looks. Most machine translation systems still struggle with humor, wordplay, and cultural nuance. That’s why we built Pun2Pun, a new benchmark designed specifically to test how well AI models can handle pun translation between Chinese and English, in both text and images.

In our paper, we introduce a clever strategy called the Constant-Variable Optimization (CVO) Model and a new metric named Overlap (Ovl) to measure translation quality. We tested several state-of-the-art LLMs (including GPT-4o, Claude, and DeepSeek) and found that even the best models still have a long way to go when it comes to preserving humor across languages.

Want to see the puns, the methods, and the sometimes-hilarious, sometimes-cringey results?

📄 Read the full paper here:

acl

Abstract:

Puns, as a unique form of linguistic creativity, present significant challenges in cross-lingual translation, particularly between linguistically distant languages like Chinese and English, where it’s often considered a “mission impossible”. We introduce Pun2Pun, a novel benchmark for quantitatively evaluating pun translation between Chinese and English while preserving both linguistic mechanisms and humorous effects. We propose the adaptation of Constant-Variable Optimization (CVO) Model for translation strategy and concomitant Overlap (Ovl) metric for translation quality assessment. Our approach provides a robust quantitative evaluation framework to assess models’ complex linguistic and cultural reasoning capabilities in pun translation. Through extensive experiments on both textual and visual puns, we demonstrate that our translation strategy model significantly improves performance, particularly for better-performing models. Our findings reveal exciting potentials and current limitations of LLMs in preserving sophisticated humor across linguistic and cultural boundaries.

This post is licensed under CC BY 4.0 by the author.