Standard subword tokenization methods (e.g., BPE) fragment numbers inconsistently, causing large language models (LLMs) to lose the positional and decimal structure of numerical data. This is a primary driver of hallucinations in arithmetic and scientific reasoning tasks. We introduce the Triadic Suffix Tokenization (TST) scheme, a deterministic approach designed specifically for numerical values. By partitioning digits into fixed 3-digit triads relative to the decimal point and annotating each triad with a magnitude suffix (for both the integer and fractional parts), the scheme aligns the model's vocabulary with the standard human-readable decimal system (thousands, millions, etc.).

Key technical parameters:

- Vocabulary overhead: the scheme requires adding exactly 10,000 fixed tokens to the model's existing vocabulary.
- Operational range: native, high-precision representation of numerical values across 33 orders of magnitude, spanning from \(10^{-15}\) to \(10^{18}\).
- Precision: guaranteed preservation of the fractional structure, enabling reliable operations on floating-point data in scientific and financial contexts.

The proposed scheme bridges the gap between linguistic processing and symbolic computation. By ensuring a one-to-one mapping between numerical magnitude and token structure, it enhances the model's zero-shot arithmetic capabilities and eliminates token-boundary errors in numerical reasoning. TST is a production-ready, "drop-in" enhancement for existing LLM architectures, requiring only a modest vocabulary expansion to substantially improve mathematical accuracy. Experimental validation is left for future work; we invite the community to evaluate the scheme on numerical reasoning benchmarks.
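As a minimal illustration of the partitioning step described above, the following Python sketch splits a decimal number string into 3-digit triads relative to the decimal point and tags each triad with a magnitude suffix. The surface form of the tokens (here `ddd|E<k>`, where \(10^{k}\) is the triad's magnitude) is our assumption for illustration only; the abstract does not fix a concrete token format.

```python
def tst_tokenize(number: str) -> list[str]:
    """Sketch of triadic suffix tokenization: split a decimal string into
    3-digit triads and suffix each with its power-of-ten magnitude.
    The token format "ddd|E<k>" is a hypothetical choice, not the paper's."""
    tokens = []
    if number.startswith(("-", "+")):
        if number[0] == "-":
            tokens.append("-")  # keep the sign as its own token
        number = number[1:]
    int_part, _, frac_part = number.partition(".")

    # Left-pad the integer part to a multiple of 3, then split into triads.
    int_part = int_part or "0"
    int_part = int_part.zfill(-(-len(int_part) // 3) * 3)
    triads = [int_part[i:i + 3] for i in range(0, len(int_part), 3)]
    for k, triad in enumerate(triads):
        exp = 3 * (len(triads) - 1 - k)  # this triad weighs 10**exp
        tokens.append(f"{triad}|E{exp}")

    # Right-pad the fractional part to a multiple of 3, then split.
    if frac_part:
        frac_part = frac_part.ljust(-(-len(frac_part) // 3) * 3, "0")
        for i in range(0, len(frac_part), 3):
            exp = -(i + 3)  # magnitudes 10**-3, 10**-6, ...
            tokens.append(f"{frac_part[i:i + 3]}|E{exp}")
    return tokens
```

For example, `tst_tokenize("1234567.89")` yields `["001|E6", "234|E3", "567|E0", "890|E-3"]`, so every token carries both its digits and its decimal magnitude, independent of where subword boundaries would have fallen.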