NAACL 2025 Findings Paper

Our paper ‘Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation’ (with Luca Moroni, Giovanni Puccetti, Pere-Lluís Huguet Cabot, Andrei Stefan Bejgu, Edoardo Barba, Felice Dell’Orletta, Andrea Esuli and Roberto Navigli) has been accepted at NAACL 2025 (Findings)! In this work, we explore several vocabulary adaptation techniques for tailoring English LLMs to Italian. We introduce Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that learns a neural mapping to perform vocabulary substitution and achieves state-of-the-art performance on several downstream tasks. We adapted two LLMs: Mistral-7b-v0.1, reducing token fertility by 25%, and Llama-3.1-8b, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, after vocabulary adaptation, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the adapted models' capabilities on several multiple-choice and generative tasks.
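For readers unfamiliar with the term, token fertility is commonly measured as the average number of tokens a tokenizer produces per word, so a lower value means shorter sequences for the same text. The sketch below is only an illustration of that metric under this assumed definition, not the evaluation code from the paper: the function `fertility`, the two toy tokenizers, and the sample sentence are all hypothetical stand-ins for a real tokenizer and corpus.

```python
from typing import Callable, List


def fertility(tokenize: Callable[[str], List[str]], texts: List[str]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    n_words = sum(len(text.split()) for text in texts)
    n_tokens = sum(len(tokenize(text)) for text in texts)
    if n_words == 0:
        raise ValueError("texts contain no words")
    return n_tokens / n_words


# Hypothetical toy tokenizers standing in for an original and an adapted vocabulary.
def char_pair_tokenizer(text: str) -> List[str]:
    # Splits every word into chunks of at most two characters (over-fragmenting).
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]


def word_tokenizer(text: str) -> List[str]:
    # Emits exactly one token per word (the best case, fertility 1.0).
    return text.split()


sample = ["la casa è bella"]  # toy Italian sentence, not from the paper's data

before = fertility(char_pair_tokenizer, sample)
after = fertility(word_tokenizer, sample)
relative_reduction = (before - after) / before

print(f"fertility before: {before:.2f}, after: {after:.2f}")
print(f"relative reduction: {relative_reduction:.0%}")
```

With a real model, `tokenize` would be replaced by the tokenizer's own method and `sample` by a representative Italian corpus; the relative reduction computed this way corresponds to the kind of percentage quoted above.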


Location
NAACL 2025, Albuquerque, New Mexico.
Alessio Miaschi
Full-time researcher (RTDA) in Natural Language Processing