Point of View

Building a 21st Century Data Sharing Infrastructure

August 26, 2020

In global health, we have increasingly recognized the value of secondary analyses on pooled data collected across various types of studies. But the way researchers have usually done these analyses has involved gathering the data manually, sending it to collaborators by snail mail or literally traveling to collaborators, and finally discussing the findings at a leisurely pace commensurate with old technology and social norms for collaboration.

Under normal circumstances, this way of working might be considered acceptable. But when your objective is to save lives from COVID-19 in 2020, you need to move fast. You need a way to access and analyze relevant data with as little friction as possible.

Fortunately, the digital age and rapid technology advances have upended this traditional approach to collaboration and data sharing. The Ki team is catalyzing global health’s move beyond the old paradigm. We believe that data is a shared global health asset that is critical to reaching well-informed decisions that save lives. Realizing this value requires overcoming differing philosophical orientations toward collaboration as well as several practical challenges.

Disrupting the status quo

On the practical side, as researchers have gained the ability to generate larger and larger datasets, identifying and assembling the data has become more difficult. This difficulty is partly due to the time it takes to move the data to a researcher’s local research space, and partly due to increasingly stringent legal standards and the prolonged process of securing the permissions and certifications that institutions require.

In recent years, we have been working with a variety of data repositories to address these challenges. These repositories include Vivli, the Infectious Disease Data Observatory (IDDO), and Sage Bionetworks, each of which specializes in certain kinds of data and has its own data access processes and levels of data annotation and curation. These repositories are continually revisiting their users’ needs and evolving how they make data available quickly (and in an ethical way that respects people’s privacy).

However, no matter how good the repositories get, and no matter how much of an improvement they are over the old paradigm, they still create friction in one important way. The data stored in Vivli lives in Vivli, the data stored in IDDO lives in IDDO, and if you want to run a secondary analysis that includes data stored in both places, you’re out of luck. The data have to be downloaded into a separate area before a joint analysis can be conducted.

Data’s place of highest and best use

To address this bottleneck, we have developed the COVID-19 Workbench. It is intended to become a trusted digital research environment (DRE) where certified analysts can safely and securely work with properly sourced and annotated data that relate to COVID-19 drug trials and have been transferred from various repositories. Bringing analysts from multiple institutions into a shared computing environment, where many datasets can be held safely (and potentially stored for long-term use), removes two barriers: institutions no longer each need to download their own copy of the data, and external collaborators no longer have to go through lengthy administrative processes to work across institutions.

Ultimately, we hope that DREs like the COVID-19 Workbench will be recognized as a key tool for advancing knowledge across many areas of scientific inquiry by bringing in problem-relevant data from specialized repositories.