About GenePlexus

Description

The GenePlexus webserver enables users to predict novel genes similar to their genes of interest based on their patterns of connectivity in human genome-scale molecular interaction networks.

A user can supply a list of genes of interest to GenePlexus and select a gene network of their choice. GenePlexus will then train a machine learning model that captures the patterns of network connectivity of the user-defined genes in contrast to other genes in the network. This model is then used to predict how strongly every gene in the network is associated with the input gene list, based on network connectivity patterns. GenePlexus also lets the user interpret the custom-trained machine learning model by comparing it to pretrained models for various biological processes (defined based on the Gene Ontology) and diseases (defined based on curations from DisGeNet). Users can visualize the top predictions as an interactive network graph and download or export all the results in multiple formats.

Brief overview of the inputs and outputs

Inputs

Users need to provide the following inputs:

  • A list of human genes.
  • A choice of a human genome-scale molecular network: BioGRID, STRING, GIANT-TN, or STRING-EXP.
  • A choice of how the network connections are represented as features in the supervised machine learning model: Adjacency, Influence, or Embedding.
  • A choice of whether the input genes represent a cellular process/pathway or a disease. This choice informs what other genes in the network GenePlexus will use (as the "negative class") to contrast against the user-provided genes (the "positive class").
  • A unique user-supplied job name (optional).

Outputs

GenePlexus uses the inputs provided by the user to return the following outputs:

  • A prediction for every gene in the network on whether it belongs to the positive class, as defined by the user-supplied input gene file.
  • The similarity of the custom machine-learning model (trained on the user-supplied input gene file) to other machine-learning models trained using lists of genes annotated to biological process terms in the Gene Ontology or diseases in the DisGeNet database.
  • An interactive graph of the network connectivity of the genes with the highest prediction scores.
  • A summary table of how the user-supplied input gene list is converted to Entrez gene identifiers and which of those Entrez genes were in the chosen network.
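The summary table in the last item can be pictured with a small sketch. This is an illustration only, not the GenePlexus code: the lookup table `SYMBOL_TO_ENTREZ`, the set `NETWORK_GENES`, and the function `summarize_input` are hypothetical names, and the assumption that purely numeric inputs are already Entrez identifiers is ours.

```python
# Hypothetical symbol-to-Entrez lookup; a real service would use a full mapping resource.
SYMBOL_TO_ENTREZ = {"TP53": "7157", "BRCA1": "672", "EGFR": "1956"}

# Hypothetical set of Entrez IDs present in the chosen network.
NETWORK_GENES = {"7157", "1956"}


def summarize_input(genes, mapping=SYMBOL_TO_ENTREZ, network=NETWORK_GENES):
    """Return one row per input gene: original ID, Entrez ID, and network membership."""
    rows = []
    for gene in genes:
        key = gene.strip().upper()
        # Assumption: an all-digit input is treated as an Entrez ID already.
        entrez = key if key.isdigit() else mapping.get(key)
        rows.append(
            {
                "input": gene,
                "entrez": entrez,  # None when the input could not be converted
                "in_network": (entrez in network) if entrez else False,
            }
        )
    return rows


rows = summarize_input(["tp53", "BRCA1", "NOTAGENE", "1956"])
for row in rows:
    print(row)
```

Only genes that are both converted and present in the network can contribute to the positive class, which is why the table reports the two facts separately.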

For detailed information on the inputs and outputs see the Help page.

Overview of the working and motivation of GenePlexus

The goal of GenePlexus is, given a list of genes, to predict the association of any gene in the human genome to that list based on patterns of connectivity in an underlying genome-scale gene interaction network. This is cast as a binary classification machine learning problem, where the user-supplied gene list is considered the "positive class" and a carefully chosen set of genes is automatically assigned to the "negative class". GenePlexus then trains a regularized logistic regression classifier that can distinguish genes in the positive class from genes in the negative class.
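The classification setup described above can be sketched on a toy network. This is a minimal illustration under stated assumptions, not the GenePlexus implementation: the eight-gene network, the gene names, the hand-picked negative class, the hyperparameters, and the plain gradient-descent fit are all hypothetical, and only the Adjacency representation (each gene's feature vector is its row of the adjacency matrix) is shown.

```python
import math

# Toy undirected network with hypothetical gene names, given as an edge list.
EDGES = [
    ("G1", "G2"), ("G1", "G3"), ("G2", "G3"), ("G3", "G4"),
    ("G5", "G6"), ("G5", "G7"), ("G6", "G7"), ("G7", "G8"),
    ("G4", "G8"),
]
GENES = sorted({g for edge in EDGES for g in edge})
INDEX = {g: i for i, g in enumerate(GENES)}


def adjacency_features(edges):
    """Map each gene to its row of the adjacency matrix."""
    rows = {g: [0.0] * len(GENES) for g in GENES}
    for a, b in edges:
        rows[a][INDEX[b]] = 1.0
        rows[b][INDEX[a]] = 1.0
    return rows


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logreg(X, y, l2=0.1, lr=0.5, epochs=500):
    """Fit L2-regularized logistic regression by batch gradient descent."""
    w, b, m = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            grad_w = [gw + err * xj for gw, xj in zip(grad_w, xi)]
            grad_b += err
        # The l2 * wj term is the gradient of the ridge penalty on the weights.
        w = [wj - lr * (gw / m + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b


def score_all_genes(positives, negatives, edges=EDGES):
    """Train on the labeled genes, then score every gene in the network."""
    feats = adjacency_features(edges)
    X = [feats[g] for g in positives + negatives]
    y = [1.0] * len(positives) + [0.0] * len(negatives)
    w, b = train_logreg(X, y)
    return {g: predict(w, b, feats[g]) for g in GENES}


# Negatives are hand-picked here; GenePlexus selects them automatically.
scores = score_all_genes(positives=["G1", "G2"], negatives=["G5", "G6"])
for gene, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{gene}\t{score:.3f}")
```

In this toy example the unlabeled gene G3, which is connected to both positive genes, scores above 0.5, while G7, which is connected to both negative genes, scores below 0.5. That is the sense in which predictions follow network connectivity patterns.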

The motivation for this network-based approach comes from the difficulty of choosing what features to use for a given machine learning problem. For example, to predict whether someone is a Democrat or a Republican, features such as state of residence, age, and income might make sense. However, what would be a good feature set for predicting whether a person would like the movie Cinderella? Could the same feature set be used to predict whether someone wants to buy a refrigerator? A powerful alternative is to forgo traditional feature design and instead use the behaviors and preferences of people within a social network to make predictions. When the problem is cast this way, the features of the machine learning model are always the same: the connections of people to each other in the social network. The only thing that changes with each machine learning problem is the definition of what constitutes the positive and negative classes.

Designing a feature set for a given problem in genetics is very difficult. Fortunately, biologists have been studying the interactions of genes and proteins for many years, and genome-scale molecular networks now exist that represent this vast knowledge. GenePlexus uses these networks as the features in its machine learning models, and the user defines the machine learning problem they want to solve by supplying the genes that belong to the positive class and indicating the type of genes that should constitute the negative class. Once these classes are defined, GenePlexus spins up a virtual machine that can handle the gigabytes of data used to train a machine learning model specific to the user-supplied gene list. Interactive results can then be retrieved using the custom job name provided when the model was run.

Previous published works that have used this approach of network-based gene classification

License

The results of the GenePlexus webserver are licensed under the Creative Commons License: Attribution 4.0 International.

Website Development

The GenePlexus backend code and pretrained models were generated by Dr. Christopher Mancuso and Remy Liu, Department of Computational Mathematics, Science and Engineering, Michigan State University.

The GenePlexus website and cloud engineering were done by Douglas Krum, Patrick Bills, and Jacob Newsted, Data Science Group, Enterprise Services, Michigan State University.

Citation

Mancuso CA, Bills PS, Krum D, Newsted J, Liu R, Krishnan A. (2022) GenePlexus: a web-server for gene discovery using network-based machine learning, Nucleic Acids Research, gkac335 doi:10.1093/nar/gkac335.