This study analyzes the performance of modern optical character recognition (OCR) systems on a corpus of Russian court decisions. Six OCR engines are evaluated using Character Error Rate (CER), accuracy metrics, and an overall quality index that combines recognition quality with downstream extraction quality. DeepSeekOCR achieves a very low error rate (CER = 0.0185), whereas PaddleOCR exhibits a substantially higher one (CER = 0.4026). OCR errors substantially degrade the completeness of data extraction, reducing recall to 0.43–0.49 for certain engines. The study also examines Mixture-of-Experts (MoE) neural models (DeepSeek-MoE, Qwen3-MoE) for named-entity recognition (NER) on OCR-derived text and demonstrates their robustness under noisy conditions. MoE models maintain high entity-extraction performance (F1 = 0.80–0.85), while conventional models under the same conditions drop to an F1 of approximately 0.50. The work builds an annotated judicial-document corpus with inter-annotator agreement κ = 0.89 and develops an end-to-end pipeline combining high-precision OCR with adaptive MoE-based NER. The results indicate that integrating DeepSeekOCR with MoE-based models yields optimal extraction quality (F1 = 0.84) and maximal resilience to scan-quality degradation.
DOI: 10.1117/12.3108907
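
The abstract reports Character Error Rate values without stating the formula. As a minimal sketch, assuming the conventional definition (character-level Levenshtein distance between the reference transcription and the OCR output, divided by the reference length), the metric can be computed as follows. The function names `levenshtein` and `character_error_rate` are illustrative and are not taken from the study, which may apply additional text normalization.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn `hypothesis` into `reference`."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER under the assumed standard definition: edit distance / reference length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one wrong character in a 12-character reference.
cer = character_error_rate("решение суда", "решенне суда")
print(f"CER = {cer:.4f}")
```

Under this definition, a CER of 0.0185 corresponds to roughly two character errors per hundred reference characters, and a CER of 0.4026 to about forty.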