This study examines the determinants of California house prices using the 1990 census block-group dataset (n = 20,640) under a unified evaluation protocol with a fixed 60/20/20 train/validation/test split. It compares two linear hedonic baselines, OLS without distances (A1) and OLS with five geographic distances (A2), to a Random Forest (RF). The RF is trained on log prices and back-transformed to dollars via Duan smearing, whereas A1 and A2 are estimated in levels (USD). Models are compared primarily by test-set R² in dollars. RF attains the highest accuracy (test R² of 0.834, RMSE of about $47k), outperforming A2 (R² of 0.648), while A2 improves modestly over A1, which quantifies the incremental value of explicit geographic accessibility. Cross-method evidence converges on a stable core of determinants: median income and distance to coast rank first and second, followed by latitude/longitude and distances to major cities; structural count variables are comparatively weaker at the block-group level. Price-tier analysis (Low/Mid/High, with cutoffs at the training-set 30th and 70th percentiles, P30/P70) shows stronger fit in the low tier and larger errors and greater heterogeneity at the high end. Robustness checks (Winsorization, Huber regression, dropping distances, higher-order lat/long polynomials) do not overturn these conclusions; dropping distances degrades performance the most. Overall, purchasing power and spatial accessibility jointly organize the cross-section of California house prices, with nonlinearities favoring RF for prediction, while A2 remains the preferred model for interpretation.
Published in: Theoretical and Natural Science
Volume 142, Issue 1, pp. 167-179
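
The abstract mentions back-transforming log-price predictions to dollars via Duan smearing. As a minimal sketch of that step, and not the authors' actual code, the example below fits a model on log prices, computes the smearing factor as the mean of the exponentiated training residuals, and rescales the naive `exp` predictions by it. Everything here is an assumption made for illustration: the data are synthetic, the variable names (`income`, `log_price`) and helper functions are hypothetical, and an ordinary least-squares fit stands in for the Random Forest so the example needs only NumPy.

```python
import numpy as np


def duan_smearing_factor(log_y_train, log_pred_train):
    """Mean of exponentiated log-scale residuals on the training set."""
    residuals = np.asarray(log_y_train) - np.asarray(log_pred_train)
    return float(np.mean(np.exp(residuals)))


def back_transform(log_pred, smear):
    """Convert log-scale predictions to the original (dollar) scale."""
    return np.exp(np.asarray(log_pred)) * smear


# Synthetic data (hypothetical; NOT the 1990 census block-group data).
rng = np.random.default_rng(0)
n = 5000
income = rng.uniform(1.0, 10.0, size=n)
log_price = 11.0 + 0.2 * income + rng.normal(0.0, 0.5, size=n)
price = np.exp(log_price)

# Stand-in log-scale model: least squares with an intercept.
# (The study uses a Random Forest; any log-scale regressor fits here.)
X = np.column_stack([np.ones(n), income])
beta, *_ = np.linalg.lstsq(X, log_price, rcond=None)
log_pred = X @ beta

smear = duan_smearing_factor(log_price, log_pred)
naive = np.exp(log_pred)                    # ignores retransformation bias
corrected = back_transform(log_pred, smear)

print(f"smearing factor:        {smear:.3f}")
print(f"mean price (actual):    {price.mean():,.0f}")
print(f"mean price (naive exp): {naive.mean():,.0f}")
print(f"mean price (smeared):   {corrected.mean():,.0f}")
```

Because exponentiation is convex, the naive `exp` of a log-scale prediction tends to understate the mean price; the smearing factor exceeds 1 whenever the log-scale residuals average to zero and are not all identical, which is what brings the dollar-scale predictions back in line. In a train/validation/test setup like the one described, the factor would be estimated on training residuals only and then applied unchanged to held-out predictions.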