Standard subword tokenization methods (e.g., BPE) fragment numbers inconsistently, causing large language models (LLMs) to lose the positional and decimal structure of numerical data. This is a primary driver of hallucinations in arithmetic and scientific reasoning tasks. We introduce the Triadic Suffix Tokenization (TST) scheme, a deterministic approach designed specifically for numerical values. By partitioning digits into fixed 3-digit triads relative to the decimal point and annotating each triad with a magnitude suffix (for both the integer and fractional parts), the scheme aligns the model's vocabulary with the standard human-readable decimal system (thousands, millions, etc.).

Key technical parameters:

- Vocabulary overhead: the scheme requires adding exactly 10,000 fixed tokens to the model's existing vocabulary.
- Operational range: native, high-precision representation of numerical values across 33 orders of magnitude, spanning from \(10^{-15}\) to \(10^{18}\).
- Precision: guaranteed preservation of the fractional structure, enabling reliable operations on floating-point data in scientific and financial contexts.

The proposed scheme bridges the gap between linguistic processing and symbolic computation. By ensuring a one-to-one mapping between numerical magnitude and token structure, it enhances the model's zero-shot arithmetic capabilities and eliminates token-boundary errors in numerical reasoning. TST is a production-ready, "drop-in" enhancement for existing LLM architectures, requiring only a modest vocabulary expansion to substantially improve mathematical accuracy. Experimental validation is left for future work; we invite the community to evaluate the scheme on numerical reasoning benchmarks.
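As a minimal illustration of the partitioning step described above, the following Python sketch splits a decimal number string into 3-digit triads relative to the decimal point and tags each triad with a magnitude suffix. The surface form of the tokens (here `ddd|E<k>`, where \(10^{k}\) is the triad's magnitude) is our assumption for illustration only; the abstract does not fix a concrete token format.

```python
def tst_tokenize(number: str) -> list[str]:
    """Sketch of triadic suffix tokenization: split a decimal string into
    3-digit triads and suffix each with its power-of-ten magnitude.
    The token format "ddd|E<k>" is a hypothetical choice, not the paper's."""
    tokens = []
    if number.startswith(("-", "+")):
        if number[0] == "-":
            tokens.append("-")  # keep the sign as its own token
        number = number[1:]
    int_part, _, frac_part = number.partition(".")

    # Left-pad the integer part to a multiple of 3, then split into triads.
    int_part = int_part or "0"
    int_part = int_part.zfill(-(-len(int_part) // 3) * 3)
    triads = [int_part[i:i + 3] for i in range(0, len(int_part), 3)]
    for k, triad in enumerate(triads):
        exp = 3 * (len(triads) - 1 - k)  # this triad weighs 10**exp
        tokens.append(f"{triad}|E{exp}")

    # Right-pad the fractional part to a multiple of 3, then split.
    if frac_part:
        frac_part = frac_part.ljust(-(-len(frac_part) // 3) * 3, "0")
        for i in range(0, len(frac_part), 3):
            exp = -(i + 3)  # magnitudes 10**-3, 10**-6, ...
            tokens.append(f"{frac_part[i:i + 3]}|E{exp}")
    return tokens
```

For example, `tst_tokenize("1234567.89")` yields `["001|E6", "234|E3", "567|E0", "890|E-3"]`, so every token carries both its digits and its decimal magnitude, independent of where subword boundaries would have fallen.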