WHAT IS MACHINE LEARNING?
Machine learning is an automated process to detect patterns and predict outcomes based on given data.[1, 2]
There are two main types of machine learning:
Predictive or supervised learning maps the relationship between covariates and outcomes. This is the type of machine learning primarily used to address requests for predictive markers.
- Descriptive or unsupervised learning solely utilizes inputs to identify patterns in the data. This type is used for generating hypotheses about relationships and is useful in biomarker discovery exercises where large datasets (e.g. neuroimaging, genomics) need to be harnessed for further decision-making.
Commonly used machine learning algorithms include: random forests, neural networks, and decision trees.
Assumptions of machine learning include:
- The data used to develop a prediction algorithm come from the same population to whom the prediction algorithm will be applied.
- As with any statistical method, correlation does not necessarily imply causation, and particular care must be taken not to over-interpret results.
- Some machine learning methods require more data to achieve stability than common approaches.
Optimization is a process repeated hundreds or thousands of times to train machine learning algorithms, identifying where mistakes are made during each repetition, adjusting the algorithm, and repeating the process to obtain a highly predictive algorithm.
Advantages of machine learning
- Models created by machine learning can be highly accurate in prediction of outcomes.
- Machine learning can use “wide” data (repeated measures for the same individual) and correlated predictor variables (with similar or related values).
Disadvantages of machine learning
- Models produced through this methodology can be difficult to understand from a statistical perspective.
- Specific predictor parameter estimates may have minimal or no direct interpretation to explain biological relationships. This contrasts with linear or logistic regression, which provide interpretable parameter estimates (assuming a properly specified model).
Ki UTILIZATION OF MACHINE LEARNING
Ensemble of decision trees
Decision trees are a popular approach to machine learning. Decision trees are constructed by first determining which variable provides the best fit to the data when split into two groups. The process is then repeated in each of these two groups, in each of the four resulting groups, and so on, until a predetermined stopping criterion is reached.
Figure 1 provides an example decision tree. Predictor covariates are root and parent nodes to predict the outcome of interest in the terminal leaf nodes.
- Root node (RN) is the starting node in the decision tree.
- Parent node (PN) follows the root node and leaf nodes follow the parent node.
- Leaf nodes (LN), depending on the diversity of the data, can be the terminal node or can be further partitioned.