Abstract: Introduction
Electronic health record (EHR) data have potential for novel discovery of patient-centered outcomes that can be used to improve health care delivery. However, a significant amount of the data stored in EHRs is hidden in clinical narratives as unstructured text. For prostate cancer patients, these clinical narratives are a rich but unstructured source of information. Previous work suggests that structured data regarding dysfunctions after treatment for prostate cancer are not consistently captured in the EHR and thus cannot be reliably extracted for clinical and research purposes. Therefore, in this preliminary study we propose a rule-based natural language processing pipeline to extract patient-centered outcomes related to the presence of urinary, bowel, and erectile dysfunction following treatment of prostate cancer from the free text of EHR notes.
We developed a lexicon of terms related to urinary, bowel, or erectile dysfunction based on domain knowledge, prior experience in the field, and review of medical notes. For each outcome, a reference standard of 100 randomly selected documents from inpatient admissions was annotated by two domain expert research nurses, who labeled every related concept as present, negated, historical, or discussed risk. We developed a rule-based natural language processing (NLP) pipeline that combines dictionary mapping with the ConText algorithm. We trained the NLP pipeline on the remaining 1,336 documents in the research database and tested it on 20 randomly selected documents to determine agreement with the human reference standard. Standard precision, recall, and overall accuracy rates were used as metrics to quantify automatic annotation performance.
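The combination of dictionary mapping and context rules described above can be illustrated with a minimal sketch. This is not the study's pipeline: the lexicon entries, the trigger phrases, the token window, and the function name `annotate` are all hypothetical, and the trigger handling is only a simplified approximation in the spirit of ConText rather than the algorithm itself.

```python
import re

# Hypothetical mini-lexicon; the study's actual term dictionary is not shown here.
LEXICON = {
    "urinary": ["incontinence", "urinary leakage"],
    "erectile": ["erectile dysfunction", "impotence"],
    "bowel": ["bowel dysfunction", "diarrhea"],
}

# Illustrative trigger phrases (assumed; not the real ConText trigger lists).
TRIGGERS = {
    "negated": ["no", "denies", "without"],
    "historical": ["history of", "previously"],
    "discussed risk": ["risk of", "risks of", "may cause"],
}

WINDOW = 5  # number of tokens before a term that are scanned for triggers


def annotate(sentence):
    """Return (outcome, term, status) tuples for lexicon terms in a sentence."""
    text = sentence.lower()
    annotations = []
    for outcome, terms in LEXICON.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text):
                # Only the few tokens preceding the term are checked.
                preceding = " ".join(text[:match.start()].split()[-WINDOW:])
                status = "present"  # default when no trigger is found
                for label, phrases in TRIGGERS.items():
                    if any(
                        re.search(r"\b" + re.escape(p) + r"\b", preceding)
                        for p in phrases
                    ):
                        status = label
                        break
                annotations.append((outcome, term, status))
    return annotations


print(annotate("Patient denies incontinence."))
print(annotate("We discussed the risk of impotence."))
```

A real system would also need sentence splitting, trigger scope termination, and triggers that follow the term, all of which this sketch omits.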
The precision, recall, and accuracy scores for the urinary incontinence annotations against the reference standard created by a domain expert were 62.5%, 100%, and 76.9%, respectively. In most of the misclassified cases, which the NLP algorithm annotated as presence of urinary incontinence but the expert did not, medication information included in the term dictionary caused ambiguity in phenotype classification. For the erectile dysfunction annotations, precision was 100%, recall was 75%, and overall accuracy was 90%. Because no bowel dysfunction was reported in the randomly selected test set, evaluation metrics were not calculated for that outcome.
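For readers who want to see how such metrics follow from a confusion matrix, here is a small sketch. The abstract does not report the underlying counts, so the counts below are hypothetical, chosen only because they reproduce the reported percentages. The function name `evaluate` is likewise an assumption.

```python
def evaluate(tp, fp, fn, tn):
    """Compute (precision, recall, accuracy) from confusion-matrix counts.

    A metric is returned as None when its denominator is zero, for example
    recall when the test set contains no positive cases.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    accuracy = (tp + tn) / total if total else None
    return precision, recall, accuracy


# Hypothetical counts consistent with the reported urinary incontinence scores.
print(evaluate(tp=5, fp=3, fn=0, tn=5))

# Hypothetical counts consistent with the reported erectile dysfunction scores.
print(evaluate(tp=3, fp=0, fn=1, tn=6))

# No positive cases and no positive predictions: precision and recall undefined.
print(evaluate(tp=0, fp=0, fn=0, tn=10))
```

The last call mirrors the bowel dysfunction situation, where the absence of positive cases in the test set leaves precision and recall undefined.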
In this preliminary study, we have shown that it is possible to identify patient-centered outcomes from the free text of EHRs using natural language processing. Using EHRs to assess patient-centered outcomes promotes population-based assessment of these valued yet difficult-to-assess outcomes and will enable detailed sensitivity and subgroup analyses. Such results will allow clinicians to individualize care for their patients. The results will also provide desperately needed evidence-based criteria for patient-centered outcomes. These criteria can be used in research studies, in clinical practice, and to develop practice guidelines. Future work will create a larger number of well-annotated data sets and combine our rule-based approach with recent machine learning techniques used in natural language processing tasks.

Learning Objective 1: It is possible to identify patient-centered outcomes from the free text of EHRs using natural language processing.


Selen Bozkurt (Presenter)
Stanford University

Jung Park, Stanford University
Daniel Rubin, Stanford University
James Brooks, Stanford University
Tina Hernandez-Boussard, Stanford University
