Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models

Abstract

Correctly identifying the characters and substrings of words should be a basic yet essential ability of any Language Model that aims to proficiently understand and produce language. Despite this, the majority of Pre-trained Language Models (PLMs) are ‘character-blind’ and struggle with spelling tasks, although they still appear to acquire some character knowledge during pre-training, a phenomenon dubbed the Spelling Miracle. To shed light on this phenomenon, we systematically evaluate a range of PLMs of different parameter sizes using a controlled binary substring identification task. Through a series of experiments, we present the first comprehensive investigation of where, when, and how PLMs develop awareness of characters and substrings, with a particular linguistic focus on morphemic units such as prefixes, suffixes, and roots.
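For illustration, a binary substring identification probe of the kind described above can be sketched in a few lines. The snippet below is an assumption-laden sketch, not the paper's actual protocol: the `gpt2` checkpoint, the prompt template, and the scoring of the task as a comparison of next-token probabilities for " Yes" vs. " No" are placeholders introduced here for exposition.

```python
# A minimal sketch of a binary substring identification probe for a causal
# PLM, using Hugging Face transformers. Model choice, prompt wording, and
# the yes/no scoring scheme are illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint; the paper evaluates PLMs of several sizes
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def contains_substring_prob(word: str, substring: str) -> float:
    """Return P(" Yes") / (P(" Yes") + P(" No")) for a binary substring prompt."""
    prompt = f'Question: Does the word "{word}" contain "{substring}"? Answer:'
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token distribution
    yes_id = tokenizer.encode(" Yes")[0]
    no_id = tokenizer.encode(" No")[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()

# Example: a true morphemic prefix vs. a negative control.
print(contains_substring_prob("unhappiness", "un"))   # gold label: Yes
print(contains_substring_prob("unhappiness", "pre"))  # gold label: No
```

Comparing the normalized yes/no probability against the gold label over a balanced set of positive and negative (word, substring) pairs yields a controlled binary accuracy, which is the general shape of evaluation the abstract describes.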

Publication
In Findings of the Association for Computational Linguistics: ACL 2025 (Vienna, Austria) (upcoming)
Alessio Miaschi
Full-time researcher (RTD) in Natural Language Processing