The help page explains the inputs and outputs of the GenePlexus webserver and offers advice on interpreting them.
The user can supply a list of genes by either uploading a file of genes or manually entering the genes in a text box. In both cases, each gene needs to be on a new line.
Here is an example gene list file: download DOID-9562-STRING-Adjacency-DisGeNet-Symbols.txt
The following types of gene IDs are allowed as inputs:
Note: The Ensembl IDs (ENSG, ENSP, ENST) cannot contain any version numbers.
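As a minimal sketch of how a gene list could be prepared before upload, the snippet below reads one gene per line and removes a trailing version suffix from Ensembl IDs. The function names and the sample IDs are illustrative assumptions, not part of the webserver.

```python
# Illustrative helper names; not part of GenePlexus itself.
ENSEMBL_PREFIXES = ("ENSG", "ENSP", "ENST")


def parse_gene_list(text):
    """Return one gene ID per non-empty line, with surrounding whitespace removed."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def strip_ensembl_version(gene_id):
    """Drop a trailing '.N' version suffix, but only for Ensembl-style IDs."""
    if gene_id.upper().startswith(ENSEMBL_PREFIXES):
        return gene_id.split(".", 1)[0]
    return gene_id


raw = "ENSG00000141510.17\nTP53\n\n  BRCA1  \nENST00000269305.9\n"
genes = [strip_ensembl_version(g) for g in parse_gene_list(raw)]
print("\n".join(genes))
```

The cleaned list can then be pasted into the text box or saved to a file with one gene per line.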
The genes supplied by the user will be converted to Entrez gene identifiers using gene ID mappings obtained from 60,000 locally downloaded human gene files from mygene.info.
After uploading or entering the genes, users can validate their input list by clicking the Validate button. This will pop up two tables:
The user can choose between four human genome-scale molecular networks. For each network, the nodes are mapped into Entrez gene IDs. For the GIANT-TN network, the original network was additionally filtered to remove edges with scores below the prior probability (0.01).
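The edge filtering described for GIANT-TN can be sketched as follows. This is only an illustration of the thresholding idea, assuming a hypothetical edge list of `(node_a, node_b, score)` tuples; the node IDs and scores are made up.

```python
PRIOR = 0.01  # prior probability stated above for GIANT-TN


def filter_edges(edges, threshold=PRIOR):
    """Keep edges whose score is at or above the threshold."""
    return [(u, v, score) for u, v, score in edges if score >= threshold]


# Hypothetical edges: (node_a, node_b, score)
edges = [
    ("1001", "1002", 0.250),
    ("1001", "1003", 0.004),
    ("1002", "1003", 0.010),
]
kept = filter_edges(edges)
print(kept)
```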
When a network is to be used in a supervised machine learning model, the network connections can be represented as features in one of three ways:
STRING and STRING-EXP networks. GIANT-TN. BioGRID.
In the supervised machine learning model, any gene from the user-supplied gene list that is able to be converted to an Entrez ID and is also in the network is considered part of the positive class. The user can then choose if they want to define genes in the negative class based on one of two geneset collections, Gene Ontology Biological Processes or DisGeNet, based on whether the input genes represent a cellular process/pathway or a disease.
GenePlexus then automatically selects the genes in the negative class by:
The first step is choosing the network. If your gene set of interest comes from a curated database or it was generated while studying a specific process/pathway or disease, the best network to choose is STRING as this network is a highly curated network that uses prior knowledge from gene set databases in building the network. If you would like to only consider experimental interactions, then the network to use is STRING-EXP, and if you would further only like to consider physical interactions, choose BioGRID. The GIANT-TN network is best to use for two cases. First, since it offers the highest gene coverage, it enables the user to see predictions on many more understudied genes. Second, as GIANT-TN is a very dense network that does not directly incorporate gene set database information, this network performs well on larger gene sets that may be derived from high-throughput experiments.
The next step is to choose the way the network is represented as features in the machine learning model.
The next step is to choose the background used to determine the genes used as negative examples in the machine learning model. If your gene set corresponds to a biological process or pathway, choose GO. If it corresponds more closely to a disease or a complex phenotype, choose DisGeNet.
The best way to determine if the chosen job options worked well is to look at the cross-validation score at the top of the results page. Comparing the cross-validation score across a few different combinations of job options can help the user find the optimal set of options. Additionally, the figure below shows a summary of results generated from our recent work benchmarking GenePlexus and could help a user pick the best job parameters.
This figure displays results from the paper "Supervised learning is an accurate method for network-based gene classification", where the parameters are options in the GenePlexus webserver. The left panel is for models trained for Gene Ontology biological processes and the right panel is for models trained for DisGeNet diseases. Each boxplot contains the results of anywhere between 89 and 160 models. Each model was trained using the study-bias holdout method, where for each geneset in a geneset collection, the most well studied genes were used to train the model and the least studied genes were used for testing. This figure can be used to help users pick the network, feature type and negative selection class that best suits their input gene list. It is worth noting that the STRING network is built in part by using Gene Ontology and DisGeNet annotations, and this circularity could be the reason for the enhanced performance of the STRING network in this evaluation.
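The study-bias holdout idea can be illustrated with a short sketch. It assumes that "how well studied" a gene is can be measured by a publication count and that two thirds of the genes go to training; both the counts and the split fraction are assumptions made for illustration, not values taken from the benchmark.

```python
def study_bias_holdout(gene_pub_counts, train_fraction=2 / 3):
    """Split genes so the most studied ones train and the least studied ones test.

    gene_pub_counts maps a gene to a (hypothetical) publication count.
    train_fraction is an assumed value, not taken from the benchmark.
    """
    ranked = sorted(gene_pub_counts, key=gene_pub_counts.get, reverse=True)
    n_train = round(len(ranked) * train_fraction)
    return ranked[:n_train], ranked[n_train:]


# Made-up publication counts for six placeholder genes
counts = {"geneA": 950, "geneB": 12, "geneC": 300, "geneD": 3, "geneE": 75, "geneF": 40}
train_genes, test_genes = study_bias_holdout(counts)
print(train_genes)
print(test_genes)
```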
The GenePlexus method has been extensively benchmarked on models that used between 10 and ~400 genes in the training set. We have found that, while there is some decrease in performance as the number of genes in the gene set increases (far left panel in the figure below), the major driving force is how “connected” the genes are in the network (center and right panels in the figure below). This is understandable as GenePlexus heavily leverages the underlying network when training the model. We have found that the GIANT-TN network shows the least decrease in performance as the gene set size increases, and thus we recommend this network for use with larger gene sets like those that are generated directly from high-throughput experiments. As mentioned above, the cross-validation score is very useful in determining if a given model worked well on the user-supplied gene list.
The box at the top displays the parameters of the job along with the job name. If the number of genes in the positive class is greater than 15, then the 3-fold cross-validation performance is displayed in terms of log2(auPRC/prior). This metric reports the log2 fold change of the area under the precision-recall curve (auPRC) over the prior auPRC expected by random chance. For example, a value of 1 indicates a model that performs twice as well as expected by chance.
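To make the metric concrete, here is a minimal sketch of computing log2(auPRC/prior) for one set of predictions. It assumes that average precision is an acceptable stand-in for auPRC and that the prior is the fraction of positive examples; the webserver's exact computation may differ, and the cross-validation loop itself is not shown.

```python
import math


def average_precision(labels, scores):
    """Average precision: mean of precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


def log2_auprc_over_prior(labels, scores):
    """log2 of (average precision / fraction of positives)."""
    prior = sum(labels) / len(labels)
    return math.log2(average_precision(labels, scores) / prior)


# Toy example: 2 positives out of 8 genes, both ranked at the top
labels = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
value = log2_auprc_over_prior(labels, scores)
print(value)  # 2.0: four times better than the prior of 0.25
```

A value of 0 would mean the model does no better than random chance, and negative values would mean it does worse.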
This table contains the prediction probability for every gene in the network, indicating its membership in the positive class. The probabilities are bounded from 0 to 1 with 1 being the highest probability of being in the positive class.
The possible Known/Novel column values are:
The possible Training-labels column values are:
The table can be filtered by using the text box above the table. For example, the top predictions are often genes in the positive class, so it might be useful to type “Novel” into the text box to see the top predicted novel genes.
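The same filtering can be reproduced offline on a downloaded copy of the table. The sketch below assumes a simple case-insensitive substring match and uses placeholder rows; the `Entrez` and `Probability` field names and all values are hypothetical.

```python
# Placeholder rows; field names other than "Known/Novel" are assumptions.
rows = [
    {"Entrez": "1001", "Probability": 0.98, "Known/Novel": "Known"},
    {"Entrez": "1002", "Probability": 0.91, "Known/Novel": "Novel"},
    {"Entrez": "1003", "Probability": 0.95, "Known/Novel": "Novel"},
    {"Entrez": "1004", "Probability": 0.40, "Known/Novel": "Known"},
]


def filter_rows(rows, query):
    """Keep rows where any field contains the query (case-insensitive)."""
    q = query.lower()
    return [r for r in rows if any(q in str(v).lower() for v in r.values())]


top_novel = sorted(filter_rows(rows, "Novel"), key=lambda r: r["Probability"], reverse=True)
for row in top_novel:
    print(row["Entrez"], row["Probability"])
```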
Clicking the Entrez Gene ID takes the user to a webpage containing more information on that gene.
The tables here show how similar the custom model trained on the user-supplied gene list is to other pre-trained models. The webserver stores the model weights from machine-learning models trained using lists of genes annotated to numerous Gene Ontology biological processes and DisGeNet diseases. The similarity score is determined by:
Clicking the Term ID takes the user to a page containing more information about that term.
The top predictions from the supervised machine learning model are displayed, showing how they are connected to each other in the network used to train the machine learning model.
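One way to think about this view is as the subgraph induced by the top-ranked genes. The sketch below assumes a dictionary of gene probabilities and an undirected edge list, both made up for illustration; it is not the webserver's rendering code.

```python
def top_prediction_subgraph(predictions, edges, top_n):
    """Return the edges whose two endpoints are both among the top_n predicted genes."""
    ranked = sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)
    top = {gene for gene, _ in ranked[:top_n]}
    return [(u, v) for u, v in edges if u in top and v in top]


# Hypothetical probabilities and network edges
predictions = {"g1": 0.9, "g2": 0.8, "g3": 0.7, "g4": 0.1}
edges = [("g1", "g2"), ("g2", "g4"), ("g1", "g3"), ("g3", "g4")]
subgraph_edges = top_prediction_subgraph(predictions, edges, top_n=3)
print(subgraph_edges)
```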
The table here shows how the user input gene list was converted to Entrez IDs and if the converted Entrez IDs were in the network that was used to train the model.
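A table of this kind could be assembled as in the sketch below, which also derives the positive class as the genes that were both converted and found in the network. The ID mapping, the network node set, and the column names are hypothetical placeholders.

```python
def summarize_conversion(input_ids, id_to_entrez, network_nodes):
    """Build one row per input gene: original ID, Entrez ID (or None), in-network flag."""
    rows = []
    for original in input_ids:
        entrez = id_to_entrez.get(original)
        in_network = entrez is not None and entrez in network_nodes
        rows.append({"Original ID": original, "Entrez ID": entrez, "In Network": in_network})
    return rows


# Hypothetical mapping and network; real mappings come from the webserver
id_to_entrez = {"GENE_A": "1001", "GENE_B": "1002"}
network_nodes = {"1001", "1003"}
table = summarize_conversion(["GENE_A", "GENE_B", "GENE_C"], id_to_entrez, network_nodes)

# Positive class: converted to an Entrez ID and present in the network
positive_class = [row["Entrez ID"] for row in table if row["In Network"]]
print(positive_class)
```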
Mancuso CA, Bills PS, Krum D, Newsted J, Liu R, Krishnan A. (2022) GenePlexus: a web-server for gene discovery using network-based machine learning, Nucleic Acids Research, gkac335 doi:10.1093/nar/gkac335.
Greene CS, Krishnan A, Wong AK, Ricciotti E, Zelaya RA, Himmelstein DS, Zhang R, Hartmann BM, Zaslavsky E, Sealfon SC, Chasman DI, FitzGerald GA, Dolinski K, Grosser T, Troyanskaya OG. (2015) Understanding multicellular function and disease with human tissue-specific networks. Nature Genetics 47:569-576.
Szklarczyk,D., Gable,A.L., Lyon,D., Junge,A., Wyder,S., Huerta-Cepas,J., Simonovic,M., Doncheva,N.T., Morris,J.H., Bork,P., et al. (2019) STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research 47:D607–D613.
Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. (2006) BioGRID: a general repository for interaction datasets. Nucleic Acids Research 34:D535-539.
Oughtred,R., Stark,C., Breitkreutz,B.-J., Rust,J., Boucher,L., Chang,C., Kolas,N., O’Donnell,L., Leung,G., McAdam,R., et al. (2019) The BioGRID interaction database: 2019 update. Nucleic Acids Research 47:D529–D541.
The results of the GenePlexus webserver are licensed under the Creative Commons License: Attribution 4.0 International
The location of the license agreement or terms-of-use for each network or geneset collection can be found by clicking the following links. We note that no license agreement was available for GIANT; however, we have obtained permission from the owner of the material to redistribute the network, and they have noted that they will be adding a Creative Commons License: Attribution-NonCommercial 4.0 International to the website soon.