Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models

Abstract

Correctly identifying the characters and substrings of words should be a basic yet essential ability of any Language Model that aims to proficiently understand and produce language. Despite this, the majority of Pre-trained Language Models (PLMs) are ‘character-blind’ and struggle with spelling tasks, although they still appear to acquire some character knowledge during pre-training, a phenomenon dubbed the Spelling Miracle. To shed light on this phenomenon, we systematically evaluate a range of PLMs of different parameter sizes using a controlled binary substring identification task. Through a series of experiments, we present the first comprehensive investigation of where, when, and how PLMs develop awareness of characters and substrings, with a particular linguistic focus on morphemic units such as prefixes, suffixes, and roots.
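For illustration, a binary substring identification probe of the kind described above can be sketched in a few lines. The snippet below is an assumption-laden sketch, not the paper's actual protocol: the `gpt2` checkpoint, the prompt template, and the scoring of the task as a comparison of next-token probabilities for " Yes" vs. " No" are placeholders introduced here for exposition.

```python
# A minimal sketch of a binary substring identification probe for a causal
# PLM, using Hugging Face transformers. Model choice, prompt wording, and
# the yes/no scoring scheme are illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint; the paper evaluates PLMs of several sizes
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def contains_substring_prob(word: str, substring: str) -> float:
    """Return P(" Yes") / (P(" Yes") + P(" No")) for a binary substring prompt."""
    prompt = f'Question: Does the word "{word}" contain "{substring}"? Answer:'
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token distribution
    yes_id = tokenizer.encode(" Yes")[0]
    no_id = tokenizer.encode(" No")[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()

# Example: a true morphemic prefix vs. a negative control.
print(contains_substring_prob("unhappiness", "un"))   # gold label: Yes
print(contains_substring_prob("unhappiness", "pre"))  # gold label: No
```

Comparing the normalized yes/no probability against the gold label over a balanced set of positive and negative (word, substring) pairs yields a controlled binary accuracy, which is the general shape of evaluation the abstract describes.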

Publication
In Findings of the Association for Computational Linguistics: ACL 2025 (Vienna, Austria) (upcoming)
Alessio Miaschi
Full-time researcher (RTD) in Natural Language Processing