AI Chatbots Struggle with Nuances of Persian Politeness, New Benchmark Reveals
MENLO PARK, CA – Artificial intelligence systems, even those specifically tuned for the Persian language, consistently fail to grasp the complexities of taarof – a core element of Iranian social etiquette characterized by ritualized politeness and indirect communication. A new study reveals that leading large language models (LLMs) correctly navigate taarof situations only 34 to 42 percent of the time, a stark contrast to the 82 percent accuracy achieved by native Persian speakers. The findings highlight a significant cultural blind spot in AI development as these systems are increasingly deployed in global contexts.
The inability of AI to understand taarof isn’t merely a matter of linguistic translation; it’s a failure to recognize deeply ingrained cultural cues governing everyday interactions for millions worldwide. This misinterpretation can have real-world consequences, potentially derailing negotiations, damaging relationships, and reinforcing harmful stereotypes. Researchers have introduced “TAAROFBENCH,” the first benchmark designed to measure AI’s ability to reproduce this intricate practice, exposing a persistent tendency toward Western-style directness in models like GPT-4o, Claude 3.5 Haiku, Llama 3, DeepSeek V3, and Dorna – a Persian-tuned variant of Llama 3.
The study, led by Nikta Gohari Sadr of Brock University, along with researchers from Emory University and other institutions, defines taarof as a system where “what is said often differs from what is meant.” It manifests as repeated offers followed by initial refusals, insistent gift-giving met with polite declines, and compliments deflected only to be reaffirmed. This “polite verbal wrestling,” as described by Rafiee (1991), involves a delicate interplay of offer and refusal, shaping expressions of generosity, gratitude, and requests.
“Cultural missteps in high-consequence settings can derail negotiations, damage relationships, and reinforce stereotypes,” the researchers write. TAAROFBENCH aims to address this gap by providing a tool for evaluating and improving AI’s cultural competency, with the ultimate goal of fostering more effective and respectful cross-cultural communication. Further research will focus on refining the benchmark and developing strategies to better equip AI systems to understand and respond appropriately to culturally nuanced interactions.