A Triadic Suffix Tokenization Scheme for Numerical Reasoning

2026 · 0 citations · Preprint · Green Open Access

Authors

Olga Chetverina · Moscow Institute of Physics and Technology

Abstract

Standard subword tokenization methods fragment numbers inconsistently, causing large language models (LLMs) to lose positional and decimal structure, a primary driver of errors in arithmetic and scientific reasoning. We introduce Triadic Suffix Tokenization (TST), a deterministic scheme that partitions digits into three-digit triads and annotates each triad with an explicit magnitude marker. Critically, the scheme defines a fixed, one-to-one mapping between suffixes and orders of magnitude for the integer part (thousands, millions, billions, etc.) and a parallel system of replicated markers for fractional depth (tenths, thousandths, millionths, etc.). This contrasts with approaches that only group digits (e.g., commas), which leave magnitude to be inferred from position. The scheme adds at most 10,000 fixed tokens to an existing vocabulary, covers 33 orders of magnitude (\(10^{-15}\) to \(10^{18}\)), and preserves exact digits while making order-of-magnitude relationships transparent at the token level. TST is architecture-agnostic and can be integrated as a drop-in preprocessing step. Experimental validation is deferred to future work.
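
The triad-plus-marker idea described in the abstract can be illustrated with a minimal Python sketch. This is not the author's implementation: the abstract does not list the actual suffix symbols, so the marker alphabet (`U`, `K`, `M`, `B`, `T`, `Q` for integer triads and a repeated `f` for fractional depth), the function names, and the choice to fuse each marker onto its triad and to leave short triads unpadded are all assumptions made for illustration.

```python
import re

# Hypothetical marker alphabet; the abstract does not list the real suffixes.
INT_SUFFIXES = ["U", "K", "M", "B", "T", "Q"]  # 10^0, 10^3, ..., 10^15
FRAC_MARKER = "f"      # assumed: repeated once per fractional triad depth
MAX_FRAC_TRIADS = 5    # three digits each, down to 10^-15

_DECIMAL = re.compile(r"([0-9]+)(?:\.([0-9]+))?")


def tst_tokenize(number: str) -> list[str]:
    """Split a plain decimal string into triads tagged with magnitude markers."""
    match = _DECIMAL.fullmatch(number)
    if match is None:
        raise ValueError(f"not a plain decimal string: {number!r}")
    int_part, frac_part = match.group(1), match.group(2) or ""

    # Integer triads are cut from the right, so index 0 is the units triad.
    int_triads = []
    while int_part:
        int_triads.append(int_part[-3:])
        int_part = int_part[:-3]

    # Fractional triads are cut from the left, starting at the decimal point.
    frac_triads = [frac_part[i:i + 3] for i in range(0, len(frac_part), 3)]

    if len(int_triads) > len(INT_SUFFIXES) or len(frac_triads) > MAX_FRAC_TRIADS:
        raise ValueError("number is outside the range covered by this sketch")

    # Emit the most significant triad first, each with its fixed suffix.
    tokens = [int_triads[i] + INT_SUFFIXES[i]
              for i in reversed(range(len(int_triads)))]
    # Fractional depth is encoded by repeating the marker: f, ff, fff, ...
    tokens += [triad + FRAC_MARKER * depth
               for depth, triad in enumerate(frac_triads, start=1)]
    return tokens


def tst_detokenize(tokens: list[str]) -> str:
    """Rebuild the original digit string, showing that no digits are lost."""
    int_digits, frac_digits = "", ""
    for token in tokens:
        digits = token.rstrip("".join(INT_SUFFIXES) + FRAC_MARKER)
        marker = token[len(digits):]
        if marker.startswith(FRAC_MARKER):
            frac_digits += digits
        else:
            int_digits += digits
    return int_digits + ("." + frac_digits if frac_digits else "")


print(tst_tokenize("1234567.8915"))  # ['1M', '234K', '567U', '891f', '5ff']
print(tst_detokenize(tst_tokenize("1234567.8915")))  # 1234567.8915
```

In this sketch every token carries its own order of magnitude, so `234K` and `234M` are distinct vocabulary entries, whereas a comma-grouped form such as `1,234,567` would leave the magnitude of `234` to be inferred from position.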

Topics & Keywords

Mathematics, Computing, and Information Processing · Natural Language Processing Techniques · Handwritten Text Recognition Techniques

UN Sustainable Development Goals

Quality Education

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.19136134