Using machine learning to identify mammalian essential genes

By IMPC

Published 20th February 2019

By Kathryn Hentges and Andrew Doig

Essential genes are those required for an organism to survive. We are interested in genes that are essential during development, which can be viewed as the genetic basis for building an organism. Knocking out a developmentally essential gene therefore produces a lethal phenotype. In some experimentally accessible organisms with small genomes, essential genes have been identified through direct testing. Although the IMPC is in the process of determining the function of every protein-coding gene in the mouse genome, thousands of genes have not yet been tested. To identify essential genes on a genome-wide scale, and to determine the properties that distinguish essential from non-essential genes, we used machine learning to predict the essentiality status of all mouse protein-coding genes that currently lack experimental data.

To generate a machine learning classifier for mouse gene essentiality, we compiled a list of approximately 1300 known essential mouse genes and approximately 3400 known non-essential mouse genes, all previously studied in knockout experiments. A set of features, which included genomic, proteomic, and expression data, was obtained for each gene in the genome. We then used machine learning to find which features were likely to be associated with essential genes and which were not. Features associated with intracellular functions, such as transcriptional regulation, were highly likely to be associated with essential genes, whereas those associated with cellular interactions, such as extracellular signaling, were more likely to be found in non-essential genes. Using these features, we applied our classifier to predict the essentiality status of all protein-coding genes in the mouse genome. We assessed the accuracy of our predictions by checking them against experimental results that the IMPC generated during the course of our study, and which were therefore not included in our initial gene sets. This comparison showed that our machine learning classifier was correct for approximately 80% of genes. Our results can be found at http://essentiality.ls.manchester.ac.uk.
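The workflow described above (learn from labeled genes, predict untested genes, then check against results that arrive later) can be illustrated with a small sketch. This is not the classifier used in the study: it is a simplified log-odds scorer over binary features, chosen only because it fits in a few lines. The gene names, feature names, labels, and the functions `feature_log_odds`, `predict`, and `accuracy` are all hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def feature_log_odds(essential, non_essential, pseudocount=1.0):
    """Score each feature by how much more often it occurs in essential genes.

    Both arguments map a gene name to a set of feature names. A positive
    score means the feature is enriched in the essential set.
    """
    ess_counts = Counter(f for feats in essential.values() for f in feats)
    non_counts = Counter(f for feats in non_essential.values() for f in feats)
    scores = {}
    for feature in set(ess_counts) | set(non_counts):
        # The pseudocount avoids taking log(0) for features seen in only one class.
        p_ess = (ess_counts[feature] + pseudocount) / (len(essential) + 2 * pseudocount)
        p_non = (non_counts[feature] + pseudocount) / (len(non_essential) + 2 * pseudocount)
        scores[feature] = math.log(p_ess / p_non)
    return scores


def predict(features, scores):
    """Sum the scores of a gene's features; unseen features contribute nothing."""
    total = sum(scores.get(f, 0.0) for f in features)
    return "essential" if total > 0 else "non-essential"


def accuracy(predictions, observed):
    """Fraction of genes, present in both mappings, whose labels agree."""
    shared = predictions.keys() & observed.keys()
    if not shared:
        raise ValueError("no genes in common between predictions and observations")
    return sum(predictions[g] == observed[g] for g in shared) / len(shared)


# Hypothetical training data: genes with known knockout outcomes.
known_essential = {
    "geneA": {"transcription_regulation", "nuclear"},
    "geneB": {"transcription_regulation"},
    "geneC": {"nuclear", "dna_binding"},
}
known_non_essential = {
    "geneD": {"extracellular_signaling"},
    "geneE": {"extracellular_signaling", "membrane"},
    "geneF": {"membrane"},
}

# Hypothetical untested genes, and knockout results that arrive afterwards.
untested = {
    "geneX": {"transcription_regulation", "dna_binding"},
    "geneY": {"extracellular_signaling"},
    "geneZ": {"membrane"},
}
later_results = {"geneX": "essential", "geneY": "non-essential", "geneZ": "essential"}

scores = feature_log_odds(known_essential, known_non_essential)
predictions = {gene: predict(feats, scores) for gene, feats in untested.items()}
held_out_accuracy = accuracy(predictions, later_results)
```

The important design point is that `later_results` plays no part in computing `scores`, so the accuracy reflects genes the model never saw, mirroring the validation against IMPC data generated during the study.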

Additionally, we compared our findings on mouse essential genes with studies of human essential genes. Orthologous genes in the two species tended to have the same essentiality status. Overall, features enriched in essential and non-essential mouse genes were also enriched in human genes of the same essentiality status. Because of this conservation, our predictions may be useful for identifying essential human genes and for understanding the functions required for mammalian development. Our predictions can also aid investigators planning mouse knockout experiments by indicating whether a null allele of the gene of interest is likely to produce a lethal phenotype.


Original Publication: Tian D, Wenlock S, Kabir M, Tzotzos G, Doig AJ, Hentges KE. Identifying mouse developmental essential genes using machine learning. Dis Model Mech. 2018 Dec 13;11(12). pii: dmm034546. doi: 10.1242/dmm.034546. PMID:30563825
