Because of the size of the data involved, performance is an important consideration when detecting fraudulent Medicare insurance claims. We evaluate CatBoost and XGBoost on the task of Medicare fraud detection and report performance in terms of running time and Area Under the Receiver Operating Characteristic Curve (AUC). We show that adding a categorical feature improves the AUC of both XGBoost and CatBoost, and that CatBoost's performance is higher to a statistically significant degree. We also conduct experiments to find the optimal number of decision trees for XGBoost and CatBoost on this task. This is an important contribution because the number of trees in the ensemble governs the overall resource consumption of a Gradient Boosted Decision Tree implementation. On a purely numerical dataset, we find that CatBoost and XGBoost yield nearly equivalent AUC, and that XGBoost has a shorter training time. To the best of our knowledge, this is the first study of Medicare fraud detection to evaluate CatBoost and XGBoost in terms of running time and AUC on highly imbalanced Big Data. Our evaluation of running time on a large, imbalanced dataset benefits researchers seeking more efficient use of valuable resources.
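As an illustration of the two quantities reported above, the following minimal sketch computes AUC from labels and classifier scores using the rank-sum (Mann-Whitney) formulation and measures wall-clock running time. It is not the study's code: the helper names `roc_auc` and `timed` are hypothetical, the sketch uses only the Python standard library, and in practice the scores would come from a fitted CatBoost or XGBoost model.

```python
import time
from typing import Any, Callable, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via the rank-sum formulation; tied scores receive average ranks.

    labels: 1 for the positive (fraud) class, 0 for the negative class.
    scores: higher values mean "more likely positive".
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined unless both classes are present")

    # Sort indices by score, then assign 1-based ranks, averaging over ties.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Mann-Whitney U for the positive class, normalized to [0, 1].
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def timed(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Toy example (made-up labels and scores, not Medicare data).
auc, seconds = timed(roc_auc, [0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(f"AUC = {auc:.2f}")  # prints "AUC = 0.75"
```

Because AUC depends only on how positive instances rank relative to negative ones, it remains informative when positives are rare, which is one reason it is a common choice for highly imbalanced data.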