Set calculation type
The CompScore service operates in two modes: the user can perform either a genetic algorithm (GA) optimization or a rescoring run. The GA optimization finds the combination of scoring function components that maximizes either the Enrichment Factor (EF) at a user-defined fraction of screened data or BEDROC for a user-provided value of the parameter α. For a GA optimization run the user must provide a Data file containing compound IDs, Classification, Weighting variable and Scoring component values (see the Data file section for details) and set up the GA parameters (see the GA setup section for details). The output of a GA optimization run is a log file containing details of the calculation and a txt file containing a list of compound IDs with their aggregated scores, sorted from best to worst (see the Output files section).
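To make the first objective concrete, the enrichment factor at a given screened fraction can be sketched as follows. This is a minimal illustration based on the common definition of EF (ligand rate in the top fraction divided by the ligand rate in the whole set). The function name `enrichment_factor`, the assumption that a higher aggregated score means a better rank, and the rounding up of the number of selected compounds are assumptions made here, not a description of CompScore's internal implementation.

```python
import math


def enrichment_factor(scores, labels, fraction):
    """EF at `fraction` of the ranked list; labels are 1 (ligand) or 0 (decoy)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and of equal length")
    total_ligands = sum(labels)
    if total_ligands == 0:
        raise ValueError("at least one ligand is required")
    # ASSUMPTION: a higher score means a better-ranked compound.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    # ASSUMPTION: the number of selected compounds is rounded up.
    n_top = max(1, math.ceil(fraction * len(ranked)))
    ligands_in_top = sum(label for _, label in ranked[:n_top])
    # Ligand rate in the selected top fraction over ligand rate in the full set.
    return (ligands_in_top / n_top) / (total_ligands / len(ranked))
```

With this definition, a ranking that is no better than random gives a value close to 1, and larger values indicate that ligands are concentrated at the top of the list.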
For rescoring, the user can either use a log file from a previous run or apply the general consensus scoring schemes available for different primary docking programs. The type of rescoring run is selected from the drop-down menu. If a custom rescoring is requested, a log file from a previous CompScore run and a Data file must be provided as inputs, and the consensus scoring solution found in that run is applied to the Data file. For the general CompScore model only the Data file is required. The output of a Rescoring run is a txt file containing a list of compound IDs with their aggregated scores, sorted from best to worst (see the Output files section). The Calculation Type is selected with the radio buttons below, and selecting one of them is mandatory.
Weighted scores selection
For both the Rescoring and GA Optimization calculation types the user has the option of including weighted scores. Weighted scores are computed internally by the algorithm and added to the loaded data. The variable used for weighting the scores must be located in the third column of the input data file (see the next section for the input data file format). Each newly created variable is identified by the original score ID plus the ‘_W’ suffix. For example, if the ‘Score1’ variable is present in the input data, the newly created variable will be named ‘Score1_W’. Note that a Rescoring calculation using weighted scores requires that the optimization run it is based on was also performed using weighted scores. The use of weighted scores is activated with the check box below.
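The naming convention described above can be illustrated with a short sketch. The documentation does not state how the weighted value is calculated, so dividing each score by the weighting variable is purely an assumption made for illustration, as are the function name `add_weighted_scores` and the dictionary-based row layout. Only the `_W` suffix comes from the description above.

```python
def add_weighted_scores(rows, score_names, weight_name):
    """Return copies of `rows` with an extra '<score>_W' entry per score.

    ASSUMPTION: weighted score = score / weighting variable. The formula
    actually used by CompScore is not given in this documentation.
    """
    weighted_rows = []
    for row in rows:
        weight = float(row[weight_name])
        if weight == 0:
            raise ValueError("weighting variable must be non-zero")
        new_row = dict(row)  # leave the caller's data untouched
        for name in score_names:
            new_row[name + "_W"] = float(row[name]) / weight
        weighted_rows.append(new_row)
    return weighted_rows
```

For a row with `Score1` and a weighting variable such as the number of heavy atoms, the result keeps `Score1` and gains a `Score1_W` entry.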
Data file
In case a GA optimization run is requested, only the Data input file is required. This file must have the following format:
- First row: treated as the data heading, i.e., the variable names.
- Column 1: The ID of each compound.
- Column 2: The classification of each compound as either ligand or decoy. If a GA optimization is selected as Calculation type, the allowed values in this column are 1 for ligands and 0 for decoys. If a Rescoring calculation is requested, this column can contain any integer data since it is not considered during calculations.
- Column 3: The value of the weighting criterion, e.g. number of heavy atoms. It must contain numeric values, which can be arbitrary if no weighted scores are considered in the requested calculation.
- Column 4 to last: Scoring component values. Only numeric data is allowed.
Tab is employed as the field separator. A sample Data input file should look as shown below when imported into Excel. The Data files used in the CompScore algorithm validation can be downloaded from our data repository.
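The format rules above can be checked locally before a file is uploaded. The following is a minimal sketch of such a check. The function name `read_data_file` and the column headings and values in the sample are hypothetical, and the service itself may apply additional or different validation.

```python
import csv
import io


def read_data_file(handle, ga_optimization=True):
    """Parse a tab-separated Data file and check the column rules."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # first row: variable names
    if len(header) < 4:
        raise ValueError("expected ID, class, weight and at least one score")
    records = []
    for line_no, fields in enumerate(reader, start=2):
        if len(fields) != len(header):
            raise ValueError(f"line {line_no}: wrong number of fields")
        compound_id = fields[0]          # column 1: compound ID
        label = int(fields[1])           # column 2: must be an integer
        if ga_optimization and label not in (0, 1):
            raise ValueError(f"line {line_no}: class must be 0 or 1")
        weight = float(fields[2])        # column 3: weighting criterion
        scores = [float(value) for value in fields[3:]]  # columns 4 to last
        records.append((compound_id, label, weight, scores))
    return header, records


# Hypothetical two-compound file with two scoring components.
sample = (
    "ID\tClass\tHeavyAtoms\tScore1\tScore2\n"
    "lig1\t1\t21\t-8.2\t-40.5\n"
    "dec1\t0\t25\t-6.1\t-33.0\n"
)
header, records = read_data_file(io.StringIO(sample))
```

Passing `ga_optimization=False` relaxes the check on column 2 to any integer, mirroring the rule for Rescoring runs.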
Output files
Once a job is finished, links for downloading the results are sent to the e-mail address the user provided in the form. If a GA optimization is requested, the user will receive two links: one for a log file and a second one for the ranked molecules file. In case a Rescoring run is requested, only the latter file will be generated. The log file contains information regarding the optimization process such as the Number of scores, Constant scores, Scores removed due to correlation, GA evolution details (every 10 generations), Score components in the best solution, Performance statistics of the best rescoring model and Execution time. The log files involved in the CompScore validation using either BEDROC or EF as enrichment metrics can be downloaded from our data repository.
GA optimization setup
In case a GA optimization is requested, the user must configure some GA parameters. This is done by filling in the fields shown below. Default settings are provided for each parameter, and the parameters are described in detail next.
- Maximum Allowed Correlation Between Rankings: This parameter controls the maximum allowed correlation between two rankings of different scoring components. It can take values between 0 (no correlation allowed) and 1 (any correlation allowed).
- Minimum Number of Allowed Scoring Levels: This parameter controls the minimum number of allowed score levels for a scoring component. It can take any integer value up to the number of compounds in the Data File. Scoring components with no more than the specified number of unique values will be excluded from the calculations. For example, if set to 1, then only constant variables will be removed from the dataset.
- Metric to Maximize: Here the user selects which of the two metrics, BEDROC or EF, is to be maximized by the GA.
- Alpha for BEDROC: If BEDROC is selected as the metric to maximize by the GA, the user must specify the value of the α parameter for BEDROC computation. This must be a value greater than 0.
- Fraction of Screened Data for EF: If EF is selected as the metric to maximize in the GA search, the user must provide the fraction of screened data at which EF is to be maximized. This parameter takes values between 0 and 1.
- Population Size for GA: Number of individuals in the population for GA evolution. Currently, up to 100 individuals are allowed.
- Generations for GA: Number of generations that the initial population will evolve. Currently, up to 1000 generations are allowed by the CompScore Web Service.
- CrossOver Probability for GA: Probability parameter for the cross-over operator of the GA. The user must provide a value between 0 and 1.
- Mutation Probability for GA: Probability parameter for the mutation operator of the GA. The user must provide a value between 0 and 1.
- Bootstrap: With this parameter the user can choose to perform a bootstrap cross-validation of the best performing consensus scoring solution found by the GA.
- Number of BootStrap Iterations: If a bootstrap cross-validation of the best solution is requested, the user must provide its number of iterations. Currently, up to 1000 bootstrap re-samplings are allowed.
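The ranges listed above can be gathered into a simple pre-submission check. This is only a sketch: the class `GASettings` and its field names are hypothetical, no default values from the form are reproduced, and the lower bounds (for example, at least one individual and one generation) are assumptions where the text gives only an upper limit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GASettings:
    max_correlation: float             # between 0 and 1
    min_score_levels: int              # up to the number of compounds
    metric: str                        # "BEDROC" or "EF"
    population_size: int               # up to 100
    generations: int                   # up to 1000
    crossover_prob: float              # between 0 and 1
    mutation_prob: float               # between 0 and 1
    alpha: Optional[float] = None      # required when metric is "BEDROC"
    ef_fraction: Optional[float] = None  # required when metric is "EF"
    bootstrap: bool = False
    bootstrap_iterations: int = 0      # up to 1000, used only with bootstrap

    def validate(self, n_compounds: int) -> List[str]:
        """Return a list of problems; an empty list means all checks passed."""
        problems = []
        if not 0 <= self.max_correlation <= 1:
            problems.append("max_correlation must be between 0 and 1")
        # ASSUMPTION: negative values are not meaningful here.
        if not 0 <= self.min_score_levels <= n_compounds:
            problems.append("min_score_levels must not exceed the number of compounds")
        if self.metric == "BEDROC":
            if self.alpha is None or self.alpha <= 0:
                problems.append("alpha must be greater than 0 for BEDROC")
        elif self.metric == "EF":
            if self.ef_fraction is None or not 0 < self.ef_fraction <= 1:
                problems.append("ef_fraction must be between 0 and 1 for EF")
        else:
            problems.append("metric must be 'BEDROC' or 'EF'")
        if not 1 <= self.population_size <= 100:
            problems.append("population_size must be between 1 and 100")
        if not 1 <= self.generations <= 1000:
            problems.append("generations must be between 1 and 1000")
        for name in ("crossover_prob", "mutation_prob"):
            if not 0 <= getattr(self, name) <= 1:
                problems.append(f"{name} must be between 0 and 1")
        if self.bootstrap and not 1 <= self.bootstrap_iterations <= 1000:
            problems.append("bootstrap_iterations must be between 1 and 1000")
        return problems
```

Returning a list of messages rather than raising on the first failure lets all problems with a form be reported at once.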