The identification of essential genes has garnered considerable attention from researchers in recent years. Identifying these genes uncovers the minimal functional modules that enable an organism to survive, which makes the task of paramount importance in biomedicine and biotechnology. Because experimental approaches tend to be intricate and costly, computational methods are increasingly used to complement them. Various classifiers, each based on a selected feature set, have been proposed and have shown promising results thus far. In this paper, leveraging 50 sulfate-reducing bacteria (SRB) organisms, microbes frequently associated with biofilm formation, biofilm-driven corrosion, and complex microbial community dynamics, we aim to show that classifiers can achieve very good performance using only a minimal set of relevant features. Specifically, we demonstrate that classifier performance can be improved by considering minimal relevant features while taking into account the taxonomy of the different organisms. A total of 37,500 features were generated from the nucleotide and protein sequences of 41 SRB organisms to construct a machine learning system for predicting essential genes. Our feature engineering module identified 58 subsets of features. Through cross-validation, we achieved competitive intra-organism prediction performance: the best models had an AUC of 0.99, a precision of 0.99, a recall of 0.99, and an F1-score of 0.99. Subsequently, this system was used to perform extra-organism validation (on new organisms not seen by the model) using the nine left-out SRB organisms. The results for these test organisms demonstrated the efficacy of our models, with maximum precision, recall, F1-score, and AUC equal to 0.99, 0.99, 0.99, and 0.97, respectively.
Our approach significantly outperformed previously proposed methods in terms of average metrics, indicating better generalization of the models. Finally, this approach allows researchers to evaluate the predicted results in the lab with fewer variables to consider in their experimental design.
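The extra-organism validation described above holds out whole organisms rather than individual genes. The following is a minimal sketch of such a split together with the reported metrics, not the authors' implementation. The organism names, the random selection of held-out organisms, and the helper names `split_by_organism` and `precision_recall_f1` are assumptions made for illustration; the abstract does not say how the nine test organisms were chosen.

```python
import random


def split_by_organism(organisms, n_held_out, seed=0):
    """Split organism IDs so held-out organisms are never seen in training.

    Random selection is an assumption; the source does not state how the
    left-out organisms were picked.
    """
    rng = random.Random(seed)
    shuffled = list(organisms)
    rng.shuffle(shuffled)
    held_out = sorted(shuffled[:n_held_out])
    training = sorted(shuffled[n_held_out:])
    return training, held_out


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = essential)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical organism identifiers standing in for the 50 SRB genomes.
organisms = [f"SRB_{i:02d}" for i in range(1, 51)]
training, held_out = split_by_organism(organisms, n_held_out=9)
```

With 50 organisms and nine held out, the sketch leaves 41 organisms for training and cross-validation, matching the counts given in the abstract.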