Abstract

Motivation

Large-scale gene knockdown/knockout screens have been used to gain insight into a wide array of phenotypes and biological processes. However, conducting such experiments is expensive and labor-intensive. In this work, we present a general graph-based machine-learning approach that can predict the effects of gene perturbations on molecular phenotypes of interest given some measured phenotypic effects of other gene perturbations. The motivation for learning models that can predict the effects of gene perturbations is fourfold. Such models can (1) predict effects for unmeasured genes in cases in which cost or technical barriers preclude perturbing every gene, (2) prioritize unmeasured genes or sets of genes for subsequent perturbation experiments, (3) hypothesize mechanisms that underlie the relationships between the perturbed genes and their effects, and (4) generalize to other unmeasured phenotypes of interest.

Results

We evaluate our approach by applying it, in conjunction with four different learning methods, to learn models for four varied phenotypes. Our empirical evaluation demonstrates that the learned models (1) show relatively high levels of predictive accuracy across the four phenotypes, (2) have better predictive accuracy than several standard baselines, (3) can often be learned accurately from small training sets, (4) benefit from having multiple sources of evidence in the input representation, and (5) can, in many cases, transfer their predictive value to other phenotypes.

Data availability

The assembled data sets and source code for this work are available at: https://github.com/Craven-Biostat-Lab/graph-molecular-phenotype-prediction

Author summary

One general approach for gaining insight into the genes involved in a specific biological process is to conduct an experiment in which individual genes are perturbed and the effect on the process is measured for each perturbation.
Large-scale experiments of this type have provided important biological insights, but they are often expensive and labor-intensive to perform. As a result, it is not always feasible to measure the effects of perturbing every gene. In this article, we present a machine-learning approach to predicting the effects of gene perturbations using available experimental data and biological network information. Our method can estimate the effects of perturbing genes that have not yet been experimentally measured, helping researchers identify promising genes to study next. In addition, the models can suggest hypotheses about the molecular interactions that link genes to the biological process of interest. Approaches like this may help guide experimental studies and accelerate the discovery of gene–phenotype relationships.