Case Study

Predicting Long COVID Diagnosis

  • Multidisciplinary Analysis
  • Translation

Key Takeaways

  • Evaluating the factors that can predict long COVID may provide insights regarding potential treatments, biological mechanisms, and populations who are at increased risk.
  • The final dataset, using electronic health record (EHR) data in the National COVID Cohort Collaborative (N3C), included 55,257 participants with a 1:4 ratio of long COVID cases vs. controls.
  • The team used a Super Learner ensemble machine learning model to predict individual risk of long COVID diagnosis. With an area under the receiver operating characteristic curve (AUC) of 0.947, the model was highly predictive of long COVID diagnosis.
  • Baseline factors prior to COVID-19 diagnosis and respiratory factors were highly predictive of long COVID risk.
  • These results support the ability of investigators and clinicians to assess individual risk for long COVID.

Understanding the Problem

Known by several names, including post-acute sequelae of COVID-19 (PASC), long COVID is a condition of ongoing symptoms at least four weeks after an acute SARS-CoV-2 infection.1 Patients have reported more than 200 symptoms of long COVID throughout the body, including symptoms affecting the cardiovascular, respiratory, immune, gastrointestinal, and neurological systems.2 Commonly reported symptoms include fatigue, breathlessness, muscle ache, joint pain, headache, cough, chest tightness, loss of smell or taste, anxiety, poor concentration, and diarrhea.3 This variability in the symptoms that patients experience makes the condition difficult to understand and diagnose. Long COVID symptoms can be physically and mentally debilitating.

Globally, an estimated 10% of people with acute COVID (65 million people) go on to experience long COVID.2 Because many cases go undocumented, the number of people affected may be much higher.

With the availability of vaccines, the urgent risk of death from acute COVID infection has decreased. Individuals and health care providers are now more concerned about the longer-term impacts of COVID infection, such as the symptoms of long COVID. Despite ongoing research, however, long COVID remains poorly understood, and there are currently no approved treatments to reduce its symptoms or alleviate the disability it causes. Evaluating the factors that predict long COVID may provide insights regarding potential treatments, biological mechanisms, and populations who are at increased risk for this outcome.

The variability of the symptoms and course of the disease makes long COVID an interesting target for research using large datasets and real-world evidence (RWE). Because long COVID is poorly understood, high-dimensional real-world datasets that include many patients can be used to learn more about the relationship between patient covariates and long COVID diagnosis.

Recently, the U.S. National Institutes of Health (NIH) sponsored the Long COVID Computational Challenge, an artificial intelligence (AI)/machine learning (ML) challenge to develop, train, and test a model to aid in predicting whether a patient with acute SARS-CoV-2 infection will develop long COVID.4 Ki team members from the UC Berkeley School of Public Health entered the challenge, aiming to predict long COVID diagnosis using electronic health record (EHR) data in the National COVID Cohort Collaborative (N3C). This team of researchers is building and applying medical and machine learning knowledge to address the research questions below and other questions on long COVID.

Research Questions

  • Which factors predict long COVID diagnosis?
  • Can these factors be identified using machine learning analysis of EHR data?


The goal of the NIH Long COVID Computational Challenge was to predict long COVID diagnosis based on individual EHR data in N3C. N3C has a centralized repository where investigators can access and analyze data from more than 7 million COVID-19 patients from 80 sites across the United States while maintaining patient privacy. The dataset included patients diagnosed with long COVID (cases) and matched controls with a documented acute COVID-19 diagnosis who had at least one medical visit more than four weeks after their initial COVID diagnosis date.5 The final dataset included 55,257 participants with a 1:4 ratio of cases to controls.5

The team extracted 304 features (variables) from N3C data. After processing and transformation, the final dataset included 1,339 features covering the characteristics available in N3C: sociodemographic information, medical history, medications, and COVID-19 positivity and severity.

The team used a Super Learner ensemble machine learning model to predict individual risk of long COVID diagnosis.6,7 A Super Learner algorithm splits the data into training and test samples (typically through cross-validation), fits multiple candidate sub-algorithms on the training sample, ranging from traditional parametric models to ML models such as gradient boosting and random forest, and evaluates how well each one predicts the test sample. The performance of each sub-algorithm is scored, and the optimal weighted combination of sub-algorithms is chosen as the final ensemble.
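The Super Learner steps described above can be illustrated with a deliberately small sketch. It is not the team's implementation: the two candidate learners (`fit_prevalence`, `fit_stump`), the five-fold split, the grid search over weights, and the simulated data are all assumptions chosen to keep the example self-contained, standing in for the much richer library of algorithms a real analysis would use.

```python
import random
from statistics import mean

def fit_prevalence(X, y):
    """Toy candidate 1 (hypothetical): ignore features, predict the training prevalence."""
    p = mean(y)
    return lambda row: p

def fit_stump(X, y):
    """Toy candidate 2 (hypothetical): split feature 0 at its training median."""
    cut = sorted(row[0] for row in X)[len(X) // 2]
    above = [yi for row, yi in zip(X, y) if row[0] >= cut]
    below = [yi for row, yi in zip(X, y) if row[0] < cut]
    p_above = mean(above) if above else mean(y)
    p_below = mean(below) if below else mean(y)
    return lambda row: p_above if row[0] >= cut else p_below

def out_of_fold_predictions(fit, X, y, k=5):
    """Predict every row with a model trained on the other k-1 folds."""
    n = len(X)
    preds = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in range(fold, n, k):  # rows held out in this fold
            preds[i] = model(X[i])
    return preds

def choose_weight(preds_a, preds_b, y, steps=20):
    """Grid-search the weight w on candidate A (B gets 1 - w) that minimizes
    the cross-validated squared error of the weighted combination."""
    def cv_loss(w):
        return mean((w * a + (1 - w) * b - yi) ** 2
                    for a, b, yi in zip(preds_a, preds_b, y))
    return min((s / steps for s in range(steps + 1)), key=cv_loss)

def fit_super_learner(X, y):
    """Score candidates out of fold, pick the weights, then refit on all data."""
    w = choose_weight(out_of_fold_predictions(fit_prevalence, X, y),
                      out_of_fold_predictions(fit_stump, X, y), y)
    a, b = fit_prevalence(X, y), fit_stump(X, y)
    ensemble = lambda row: w * a(row) + (1 - w) * b(row)
    return ensemble, {"prevalence": w, "stump": 1 - w}

# Simulated data (an assumption): one feature that is noisily related to the outcome.
rng = random.Random(0)
X = [[rng.random()] for _ in range(200)]
y = [1 if row[0] + rng.gauss(0, 0.2) > 0.5 else 0 for row in X]

predict, weights = fit_super_learner(X, y)
print(weights)
```

The key design point is that the weights are chosen from out-of-fold predictions, so a candidate that merely memorizes its training data is not rewarded. Established implementations replace the grid search with a proper optimizer and support many candidates at once.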


This analysis sought to predict an individual’s risk of being diagnosed with long COVID after an acute COVID infection. The primary metric was the area under the receiver operating characteristic curve (AUC), a common measure of how well a model distinguishes cases from non-cases. The value of AUC ranges from 0 to 1: an AUC of 0.5 corresponds to random guessing, and an AUC of 1.0 indicates that the model perfectly separates cases from controls in the test samples. The Super Learner accurately predicted individual risk of long COVID diagnosis, with an AUC of 0.947.5
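To make the AUC concrete, the sketch below computes it from its ranking interpretation: the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control, with ties counted as half. The function name `auc` and the four-patient example are illustrative assumptions, not data from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    scores: predicted risks; labels: 1 for a case, 0 for a control.
    """
    cases = [s for s, l in zip(scores, labels) if l == 1]
    controls = [s for s, l in zip(scores, labels) if l == 0]
    if not cases or not controls:
        raise ValueError("AUC needs at least one case and one control")
    # Count case/control pairs ranked correctly; a tie earns half credit.
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

# Hypothetical predicted risks for two controls followed by two cases.
labels = [0, 0, 1, 1]
example = auc([0.1, 0.4, 0.35, 0.8], labels)   # 3 of 4 pairs ranked correctly
perfect = auc([0.1, 0.2, 0.7, 0.9], labels)    # every case outranks every control
constant = auc([0.5, 0.5, 0.5, 0.5], labels)   # uninformative model
print(example, perfect, constant)
```

The pairwise loop is quadratic in the number of patients, which is fine for illustration; for a dataset the size of the one described here, a sort-based implementation from a standard library would be the practical choice.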

To improve interpretability, the team evaluated which individual factors, as well as which groups of factors, were most predictive of long COVID diagnosis. The team grouped features based on when they occurred in relation to acute COVID diagnosis.5 The time interval-based groups were baseline, pre-COVID, acute COVID, and post-COVID features. Baseline factors included demographic factors, such as sex and ethnicity, and all records that occurred at least 37 days before COVID diagnosis. Pre-COVID factors were records between 37 and 7 days before COVID diagnosis. Acute COVID factors were records between 7 days before and 14 days after COVID diagnosis. Post-COVID factors were records between 14 and 28 days after COVID diagnosis. Among the time-based groups, baseline factors were the strongest predictors of long COVID diagnosis, ahead of pre-COVID, acute COVID, and post-COVID factors.
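The time windows above can be expressed as a small lookup on the number of days between a record and the COVID diagnosis date. This is a sketch under stated assumptions: the function name `assign_window` is hypothetical, and the handling of the exact boundary days (37 and 7 days before, 14 days after) is a guess, because the text does not say which window a record on a boundary day belongs to.

```python
from datetime import date

def assign_window(record_date, covid_date):
    """Map an EHR record to a time window relative to COVID diagnosis.

    Boundary handling is an assumption: day -37 is treated as baseline,
    day -7 and day +14 as acute COVID.
    """
    days = (record_date - covid_date).days
    if days <= -37:
        return "baseline"      # at least 37 days before diagnosis
    if days < -7:
        return "pre-COVID"     # between 37 and 7 days before
    if days <= 14:
        return "acute COVID"   # 7 days before through 14 days after
    if days <= 28:
        return "post-COVID"    # 14 to 28 days after
    return None                # later records fall outside the four windows

covid = date(2021, 3, 1)
windows = [
    assign_window(date(2021, 1, 1), covid),   # 59 days before
    assign_window(date(2021, 2, 15), covid),  # 14 days before
    assign_window(date(2021, 3, 1), covid),   # day of diagnosis
    assign_window(date(2021, 3, 20), covid),  # 19 days after
    assign_window(date(2021, 4, 15), covid),  # 45 days after
]
print(windows)
```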

The team also grouped factors based on domain. The domains were baseline demographics and anthropometry; medical visitation and procedures; respiratory system; antimicrobials and infectious disease; cardiovascular system; female hormones and pregnancy; mental health and wellbeing; pain, skin sensitivity, and headaches; digestive system; inflammation, autoimmune, and autoantibodies; renal function, liver function, and diabetes; nutrition; COVID positivity; and uncategorized diseases, nervous system, injury, mobility, and age-related factors.5 When the domain groups were analyzed, medical visitation and procedures, demographics and anthropometry, and respiratory factors were the most predictive domains.
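One common way to score a group of features is to shuffle all of the group's columns together across patients and measure how much the AUC drops. The sketch below illustrates that idea only; the text does not state which importance method the team used, and the function `group_importance`, the group names, the toy model, and the simulated data are all assumptions.

```python
import random

def auc(scores, labels):
    """Pairwise AUC: share of case/control pairs ranked correctly (ties count half)."""
    cases = [s for s, l in zip(scores, labels) if l == 1]
    controls = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

def group_importance(predict, X, y, groups, seed=0):
    """Drop in AUC when each group's columns are jointly shuffled across rows."""
    rng = random.Random(seed)
    baseline_auc = auc([predict(row) for row in X], y)
    drops = {}
    for name, cols in groups.items():
        order = list(range(len(X)))
        rng.shuffle(order)  # one permutation per group keeps its columns aligned
        shuffled = []
        for i, row in enumerate(X):
            new_row = list(row)
            for c in cols:
                new_row[c] = X[order[i]][c]
            shuffled.append(new_row)
        drops[name] = baseline_auc - auc([predict(r) for r in shuffled], y)
    return drops

# Simulated data (an assumption): the outcome depends only on column 0.
rng = random.Random(1)
X = [[rng.random(), rng.random(), rng.random()] for _ in range(100)]
y = [1 if row[0] > 0.5 else 0 for row in X]
predict = lambda row: row[0]  # hypothetical fitted model that uses column 0 only

importance = group_importance(
    predict, X, y, groups={"respiratory": [0], "other": [1, 2]})
print(importance)
```

Shuffling a group as a block, rather than one feature at a time, gives credit to information shared among correlated features in the same domain, which is the motivation for reporting importance at the group level.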

Actionable Outcomes

This project supports the ability of investigators and clinicians to assess individual risk for long COVID based on baseline factors prior to COVID-19 diagnosis, which may enable clinicians to provide preventive care to individuals identified as being at high risk. Furthermore, these findings support the importance of respiratory factors in predicting long COVID susceptibility, which is consistent with other studies. The UC Berkeley team’s model won third place in the NIH Long COVID Computational Challenge.

The predictive ability of baseline factors suggests that clinicians may be able to identify who is at risk for long COVID from characteristics that are recorded in the provider system well before COVID infection or diagnosis. Future studies will evaluate the causal impact of baseline factors, including respiratory symptoms, COVID-19 vaccination, and immune-modulating medication usage, on subsequent long COVID risk.


  1. CDC. Post-COVID Conditions. Centers for Disease Control and Prevention. Published July 20, 2023. Accessed October 19, 2023.
  2. Davis HE, et al. Nat Rev Microbiol. 2023;21(3):133-146. doi:10.1038/s41579-022-00846-2
  3. Daines L, et al. Curr Opin Pulm Med. 2022;28(3):174-179. doi:10.1097/MCP.0000000000000863
  4. Announcing The NIH Long COVID Computational Challenge (L3C) | Interagency Modeling and Analysis Group. Accessed October 18, 2023.
  5. Butzin-Dozier Z, et al. medRxiv. Published online August 4, 2023:2023.07.27.23293272. doi:10.1101/2023.07.27.23293272
  6. Phillips RV, et al. Int J Epidemiol. 2023;52(4):1276-1285. doi:10.1093/ije/dyad023
  7. van der Laan MJ, et al. Stat Appl Genet Mol Biol. 2007;6:Article25. doi:10.2202/1544-6115.1309
  8. LeDell E, et al. Int J Biostat. 2016;12(1):203-218. doi:10.1515/ijb-2015-0035

Datasets Utilized

Electronic health record (EHR) data from 55,257 participants in the National COVID Cohort Collaborative (N3C)


Long COVID, real-world data (RWD), real-world evidence (RWE), artificial intelligence (AI), machine learning (ML)