PROJECTS
Project Link: Dataset Driven Outcomes download directory
This work was funded by the Intramural Research Program at the U.S. National Institutes of Health (NIH), with support from the NIH Office of AIDS Research. The content is solely the responsibility of the authors and does not necessarily reflect the official views of NIH.
We apply cutting-edge data science approaches, including artificial intelligence and machine learning, to existing large-scale clinical datasets (LSCDs) and rearrange the data by placing people with HIV who are highly similar to each other into their own cohorts. Work conducted on such cohorts is expected to be more reproducible, and its conclusions more robust. We will do this by automating the segmentation of people living with HIV who are described in LSCDs, grouping their clinical events into cohorts with reproducible cohort definitions. These definitions can be used to design novel studies or to compare LSCDs to one another before a study begins, so that an LSCD is chosen intentionally. Nationality, demography, geography, treatment era, comorbidities, and preexisting conditions (prior to HIV infection) should inform how treatment outcomes and efficacy are assessed when studying people living with HIV.
Reproducible Cohort Definitions (RCDs) can also categorize previously published material and help new investigators describe their cases in meaningful, harmonized ways. RCDs can ensure that the conclusions HIV researchers draw apply to the kinds of cases they studied and not to the kinds of cases they did not. Further, our solution lowers the barrier to accessing LSCDs for HIV researchers. With curated case definitions learned from LSCDs, HIV investigators can make an informed choice about which LSCDs to use for a given study and which patient subpopulation to consider.
Who to contact if you have any questions: Nick Williams, Ph.D., nick.williams@nih.gov
Leveraging large-scale clinical datasets for HIV outcomes research: Automating dataset characterization from reproducible HIV cohort definitions found in HIV clinical datasets
Project Background: The Office of AIDS Research (OAR) at the National Institutes of Health (NIH) provides funding for ‘innovation’ projects that seek to advance the science and practice of HIV research at the NIH. HIV research is a high priority for several reasons, yet it is largely considered unfruitful relative to its expense (there is no cure, and treatment is complicated to access and maintain).
How to use what you find in this directory: When deciding which dataset to reuse for ‘HIV Research’, researchers often buy several datasets, assess HIV patient volume in each, and proceed if a given dataset describes a meaningful volume of patients. Given the underlying complexity of HIV and its human hosts, it is difficult, financially prohibitive, and disheartening to ‘guess and check’ with limited research funds whether the cases you are looking for are included in a given product. Case discovery and segmentation are a start-up cost native to any data reuse effort. The data in this directory can be used to meet this start-up cost for free. We innovate further by not simply reporting case counts but using a machine learning method to automate the segmentation of HIV cases into case presentations experienced over the life course within specific, popular datasets used for data reuse.
What you will find in this directory:
Methods.docx
This file describes the methodological considerations
<datasetName>_Lifecourse_observations.csv
These files contain the analysis records with cluster assignment
<datasetName>_Node_Anatomy.csv
These files describe, in one record per cluster, the demographic ranges of the records in that cluster
<datasetName>_Choice_Node_Panel
These files are image panels of a choice cluster’s contents
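
The naming patterns listed above make it possible to group a downloaded copy of the directory by dataset before opening any file. The following is a minimal Python sketch of that idea, not part of the project's own tooling. It assumes the files sit in a single local folder, that each name ends with one of the three suffixes shown above, and that any file extension (none is listed for the image panels) follows the suffix. The function name `group_by_dataset` and the labels `observations`, `node_anatomy`, and `choice_panel` are hypothetical, chosen for illustration only.

```python
from collections import defaultdict
from pathlib import Path

# Suffixes copied from the file listing above.
# The dictionary keys are hypothetical labels, not names used by the project.
SUFFIXES = {
    "observations": "_Lifecourse_observations",
    "node_anatomy": "_Node_Anatomy",
    "choice_panel": "_Choice_Node_Panel",
}


def group_by_dataset(directory):
    """Return {datasetName: {label: [paths]}} for files matching the naming patterns."""
    grouped = defaultdict(lambda: defaultdict(list))
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        stem = path.stem  # file name without its final extension, if any
        for label, suffix in SUFFIXES.items():
            # Require a non-empty dataset name in front of the suffix.
            if stem.endswith(suffix) and len(stem) > len(suffix):
                dataset = stem[: -len(suffix)]
                grouped[dataset][label].append(path)
                break
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {name: dict(labels) for name, labels in grouped.items()}
```

Calling `group_by_dataset` on the folder where the directory was saved returns one entry per dataset name, which shows at a glance which datasets have observation records, node anatomy records, and image panels available. Files that do not follow the patterns, such as Methods.docx, are skipped.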