Machine learning algorithms


Why does Ki utilize machine learning (ML)? The power of machine learning lies in its ability to predict outcomes and uncover relationships in an automated process. Ki leverages machine learning to identify relationships and the models that best predict continuous or categorical outcomes. ML also helps manage the diversity of statistical models generated by data scientists with different approaches, which can make it difficult to select mutually agreed-upon models. With the super learner ML algorithm, models generated by an analytic collaborative can be evaluated and compared objectively to reach a final answer about the relationship between risk factors and outcomes.


Machine learning is an automated process to detect patterns and predict outcomes based on given data.[1, 2]

There are two main types of machine learning:[1]

  1. Predictive or supervised learning maps the relationship between covariates and outcomes. This is the type of machine learning primarily used to address requests for predictive markers.

  2. Descriptive or unsupervised learning solely utilizes inputs to identify patterns in the data. This type is used for generating hypotheses about relationships and is useful in biomarker discovery exercises where large datasets (e.g. neuroimaging, genomics) need to be harnessed for further decision-making.

Commonly used machine learning algorithms include: random forests, neural networks, and decision trees.[3]

Assumptions of machine learning include:

  • The data used to develop a prediction algorithm come from the same population to whom the prediction algorithm will be applied.
  • As with any statistical method, correlation does not necessarily imply causation, and particular care must be taken not to over-interpret results.
  • Some machine learning methods require more data to achieve stability than common approaches.

Optimization is the process used to train machine learning algorithms: over hundreds or thousands of repetitions, the algorithm's mistakes are identified, the algorithm is adjusted, and the process is repeated until a highly predictive algorithm is obtained.
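As a hedged illustration of this repeat-measure-adjust cycle, the sketch below fits a one-parameter linear model by gradient descent on squared error. The data, learning rate, and function names here are illustrative assumptions, not taken from the text.

```python
# Minimal sketch of iterative optimization: fit y ≈ w * x by repeatedly
# measuring the error and nudging the parameter w to reduce it.
# Data and learning rate are illustrative only.

def train(xs, ys, lr=0.01, n_iter=1000):
    w = 0.0
    for _ in range(n_iter):                 # repeat the process many times
        # gradient of mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad                      # adjust the algorithm
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x
w = train(xs, ys)           # converges close to the least-squares slope
```

Each pass through the loop plays the role of one "repetition": the gradient identifies where the current model errs, and the update adjusts it before the next pass.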

Advantages of machine learning

  • Models created by machine learning can be highly accurate in prediction of outcomes.[4]
  • Machine learning can use “wide” data (e.g., repeated measures for the same individual) and correlated predictor variables (predictors with similar or related values).[3]

Disadvantages of machine learning

  • Models produced through this methodology can be difficult to understand from a statistical perspective.
  • Specific predictor parameter estimates may have minimal or no direct interpretation to explain biological relationships. This contrasts with linear or logistic regression, which provide interpretable parameter estimates (assuming a properly specified model).



Ensemble of decision trees

Decision trees are a popular approach to machine learning. Decision trees are constructed by first determining which variable provides the best fit to the data when split into two groups. The process is then repeated in each of these two groups, in each of the four resulting groups, and so on, until a predetermined stopping criterion is reached.
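The recursive splitting described above can be sketched as follows for a regression tree. This is an illustrative toy implementation under our own naming assumptions, not the code of any particular library: at each node the best (variable, threshold) split by squared error is found, and the two resulting groups are split again until a maximum depth is reached.

```python
# Sketch of growing a regression tree: pick the split that most reduces
# squared error, then recurse on each group until a stopping criterion
# (here, a maximum depth) is reached. Names are illustrative.

def sse(ys):
    """Sum of squared errors around the group mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(rows, ys):
    """Return (score, variable_index, threshold) for the best binary split."""
    best = None
    for j in range(len(rows[0])):                 # each candidate variable
        for t in sorted({r[j] for r in rows}):    # each candidate threshold
            left  = [y for r, y in zip(rows, ys) if r[j] <= t]
            right = [y for r, y in zip(rows, ys) if r[j] >  t]
            if not left or not right:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def grow(rows, ys, depth=0, max_depth=2):
    split = best_split(rows, ys)
    if split is None or depth == max_depth:       # stopping criterion
        return sum(ys) / len(ys)                  # leaf node: predict the mean
    _, j, t = split
    left  = [(r, y) for r, y in zip(rows, ys) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, ys) if r[j] >  t]
    return (j, t,
            grow([r for r, _ in left],  [y for _, y in left],  depth + 1, max_depth),
            grow([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))
```

The first call to `best_split` produces the root node; each recursive call produces a parent node or, once the stopping criterion is met, a terminal leaf node.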

Figure 1 provides an example decision tree. Predictor covariates form the root and parent nodes, and the outcome of interest is predicted at the terminal leaf nodes.

  • Root node (RN) is the starting node in the decision tree.
  • Parent node (PN) follows the root node and leaf nodes follow the parent node.
  • Leaf nodes (LN), depending on the diversity of the data, can be terminal nodes or can be partitioned further.

Figure 1. Example of a decision tree

Using pre-specified algorithms, a binary split on each covariate of interest determines how each parent node's categorization influences the leaf nodes that follow.

Key requirements for utilizing a decision tree include:

  • It is supervised learning, and therefore requires pre-identified covariates and outcome(s).
  • The covariates of the training data (a subset of the analysis data) should be heterogeneous (covering a wide range of covariate values) and numerous enough to ensure that a wide cross-section of records is represented.
  • Decision trees are optimized once: the best split among all variables is identified for all of the data, and the process then proceeds down each branch of the tree until the predetermined stopping criterion is reached. This approach is considered “greedy,” as each split is optimized locally.
  • The outcome variable identified can be discrete (binary or categorical) or continuous.

Decision trees are attractive because of their interpretability: one can logically work from the root node outward to the leaf nodes (Figure 1).

The term “ensemble methods” refers to a process by which multiple prediction functions are combined into a single prediction function.

  • An unweighted ensemble of decision trees, also referred to as a random forest, constructs many decision trees and results in a model with less variance. This approach minimizes the overfitting that occurs with the use of a single decision tree.
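A minimal sketch of the unweighted-ensemble idea is below, under simplifying assumptions of our own: the base learners are one-split trees ("stumps") on a single variable rather than full trees, each fit on a bootstrap resample, and predictions are averaged with equal weight.

```python
# Sketch of an unweighted ensemble (random-forest style): fit many simple
# trees on bootstrap resamples of the data, then average their predictions.
# One-split "stumps" stand in for full trees to keep the sketch short.

import random

def fit_stump(xs, ys):
    """One-split regression tree on a single variable."""
    best = None
    for t in sorted(set(xs)):
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x >  t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    if best is None:                      # degenerate resample: constant predictor
        m = sum(ys) / len(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def fit_forest(xs, ys, n_trees=100, seed=0):
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]   # bootstrap resample
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)  # unweighted average
```

Because each tree sees a slightly different resample, their individual errors partly cancel in the average, which is the source of the variance reduction described above.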


Super learner

Super learner is an algorithm for finding optimal combinations of models with an objective, data-driven approach based on cross-validation.

Cross-validation comprises two parts, training and validation, used to develop a regression fit of the predictor covariates to the outcome. Evaluating each model's predictive performance informs the super learner's final selection of models.

  • Training is conducted utilizing subsets of a data set to train models and determine how well they work within the subset.
  • Validation assesses the performance of various models in prediction of outcomes with a remaining subset of the full data set that was not utilized for training.
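The train/validate split described above can be sketched as V-fold cross-validation. This is an illustrative sketch with names of our choosing; the stand-in "model" is any fit/predict pair, here simply the sample mean.

```python
# Sketch of V-fold cross-validation: each fold is held out once for
# validation while the remaining folds are used for training, and the
# validation errors are averaged. Names are illustrative.

def cv_error(xs, ys, fit, n_folds=5):
    n = len(xs)
    errors = []
    for v in range(n_folds):
        val_idx   = [i for i in range(n) if i % n_folds == v]   # validation fold
        train_idx = [i for i in range(n) if i % n_folds != v]   # training folds
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors += [(predict(xs[i]) - ys[i]) ** 2 for i in val_idx]
    return sum(errors) / len(errors)

def mean_learner(xs, ys):
    """Trivial stand-in model: always predict the training-set mean."""
    m = sum(ys) / len(ys)
    return lambda x: m
```

Because every record is validated exactly once by a model that never saw it during training, the averaged error is an honest estimate of out-of-sample performance.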

Super learner uses cross-validation to assess the fit of multiple models and estimates the best weighted average of the predictions made by those models. Previous research has shown that the super learner is optimal in the sense that it predicts outcomes as well as the unknown best combination of the included models.[6]
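A simplified sketch of this idea with two candidate learners is below. A real super learner implementation handles many candidate learners and solves for the convex weights directly; here, as an assumption for brevity, a grid search over a single weight illustrates the principle of combining cross-validated (out-of-fold) predictions. All names are ours.

```python
# Sketch of the super learner principle: build out-of-fold predictions
# for each candidate learner, then choose the convex combination weight
# that minimizes cross-validated squared error.

def cv_predictions(xs, ys, fit, n_folds=5):
    """Out-of-fold predictions: each point predicted by a model that never saw it."""
    n = len(xs)
    preds = [0.0] * n
    for v in range(n_folds):
        train = [i for i in range(n) if i % n_folds != v]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(n):
            if i % n_folds == v:
                preds[i] = predict(xs[i])
    return preds

def mean_learner(xs, ys):
    m = sum(ys) / len(ys)
    return lambda x: m

def line_learner(xs, ys):                       # least-squares line through origin
    w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: w * x

def super_learner_weight(xs, ys, grid=101):
    p1 = cv_predictions(xs, ys, mean_learner)
    p2 = cv_predictions(xs, ys, line_learner)
    best_a, best_err = 0.0, float("inf")
    for k in range(grid):                       # candidate weights 0.00 .. 1.00
        a = k / (grid - 1)
        err = sum((a * q1 + (1 - a) * q2 - y) ** 2
                  for q1, q2, y in zip(p1, p2, ys))
        if err < best_err:
            best_a, best_err = a, err
    return best_a                               # weight placed on the mean learner
```

Because the weights are chosen on out-of-fold predictions, a candidate model is rewarded only for performance on data it did not see, which is what makes the final combination objective.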


  1. Murphy K. Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT Press; 2012.
  2. LeDell E. Ensemble Learning at Scale: Software, hardware and algorithmic approaches. Presented at: Machine Learning Conference. September 2016; Seattle, WA.
  3. SAS. Machine Learning: What it is & why it matters. Accessed October 14, 2017.
  4. Larose D, Larose C. Data Mining and Predictive Analytics. 2nd ed. IEEE Press; 2015.
  5. Personal Communication. Sergey Feldman. January 12, 2018.
  6. Van der Laan MJ, Polley EC, Hubbard AE. Super learner. Statistical applications in genetics and molecular biology. 2007;6:Article25.


Last Updated

October 2020