K. L. Abbott1, C. M. Harbaugh2, N. Matusko2, G. Sandhu2, P. G. Gauger2, J. V. Vu2 1University of Michigan Medical School, Ann Arbor, MI, USA 2University of Michigan, Department of Surgery, Ann Arbor, MI, USA
Introduction: Residents receive feedback from performance evaluations in the form of quantitative scores and qualitative comments. Quantitative scores can quickly be compared with other numerical ratings, but they may lack adequate meaning and often show little variation over time. Qualitative comments may be highly specific, but it is difficult to recognize or analyze trends in comments dispersed across evaluations. We explored the use of natural language processing (NLP) to interpret qualitative data with the goal of generating high-yield, easily accessible feedback.
Methods: We examined faculty and peer evaluations of general surgery residents training at a single academic institution from 2008 to 2017. Evaluations assessed nine performance domains using rating scales and a free-text field. NLP uses artificial intelligence to interpret language, and sentiment analysis is a type of NLP that extracts opinion-related information, for example, negative or positive feeling. We used the Google Cloud Natural Language application programming interface (API) to generate a sentiment score for each qualitative evaluation. To characterize the performance scales, we calculated Cronbach's α and performed exploratory factor analysis. To assess the relationship between performance ratings and sentiment scores, we calculated a Pearson correlation coefficient. To graphically represent word frequency and sentiment score, we generated word clouds for an example resident.
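As an illustration of the scale-reliability step, Cronbach's α can be computed directly from a rating matrix; this is a minimal sketch, and the rating values below are hypothetical, not study data:

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for an (n_evaluations x k_items) rating matrix."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]
    item_variances = ratings.var(axis=0, ddof=1)       # per-item sample variance
    total_variance = ratings.sum(axis=1).var(ddof=1)   # variance of summed scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 1-5 ratings: 6 evaluations x 3 domain items (not study data).
demo_ratings = np.array([
    [4, 4, 5],
    [3, 3, 3],
    [5, 5, 5],
    [2, 3, 2],
    [4, 5, 4],
    [3, 3, 4],
])
alpha = cronbach_alpha(demo_ratings)  # high, because the items move together
```

When items largely measure one construct, as exploratory factor analysis suggested here, α approaches 1; uncorrelated items would pull it toward 0.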
Results: We analyzed 3,467 performance evaluations from 18 residents, averaging 192 evaluations per resident. The nine performance questions had high inter-item reliability (Cronbach’s α = 0.97). Exploratory factor analysis indicated that the nine domain questions measured a single construct. Averaged domain rating and evaluation-level sentiment score were weakly correlated (r = 0.36, 95% CI = 0.32-0.39, p < 0.001). The words “instruction” and “understanding” had negative sentiment in the PGY-1 word cloud, but did not appear in the PGY-5 word cloud. The word “leader” in the PGY-5 word cloud had a frequency of 5 and an average sentiment score of 0.54 (range 0-0.9), but did not appear in the PGY-1 word cloud.
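A correlation of this kind, with its confidence interval, can be computed using the standard Fisher z-transform; the rating and sentiment vectors below are synthetic stand-ins, not the study data:

```python
import math
import numpy as np

def pearson_r_ci(x, y):
    """Pearson r with a 95% confidence interval via the Fisher z-transform."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    z = math.atanh(r)                    # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)          # standard error of z
    zcrit = 1.959963984540054            # two-sided 95% normal quantile
    lo, hi = math.tanh(z - zcrit * se), math.tanh(z + zcrit * se)
    return r, (lo, hi)

# Synthetic example: sentiment weakly tracks the averaged domain rating.
rng = np.random.default_rng(0)
ratings = rng.normal(4.0, 0.5, 200)                        # hypothetical averaged ratings
sentiments = 0.3 * ratings + rng.normal(0.0, 0.4, 200)     # hypothetical sentiment scores
r, (lo, hi) = pearson_r_ci(ratings, sentiments)
```

With the weak coupling chosen above, r lands in roughly the same modest range as the reported correlation, which is why qualitative comments can carry information the numeric ratings do not.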
Conclusion: Our results demonstrate that NLP can be used to process valuable qualitative information not captured by performance ratings. Thematic analysis of qualitative data is labor-intensive, but NLP facilitates rapid aggregation and visualization of qualitative data from multiple sources. These methods could be used to aggregate changing or contradictory information from multiple evaluations and present trainees with feedback that helps them identify strengths and specific areas for improvement.
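A minimal sketch of the per-word aggregation behind such feedback, assuming comment-level sentiment scores are already available (the scores and comments below are hypothetical, standing in for API output):

```python
from collections import defaultdict

def word_sentiment_summary(comments):
    """Map each word to (frequency, mean sentiment of comments containing it).

    `comments` is a list of (text, sentiment_score) pairs.
    """
    freq = defaultdict(int)
    total = defaultdict(float)
    for text, score in comments:
        for word in set(text.lower().split()):  # count each word once per comment
            freq[word] += 1
            total[word] += score
    return {w: (freq[w], total[w] / freq[w]) for w in freq}

# Hypothetical comments with hypothetical sentiment scores (not study data).
demo_comments = [
    ("strong leader in the operating room", 0.8),
    ("a natural leader and teacher", 0.6),
    ("needs more instruction on basic technique", -0.4),
]
summary = word_sentiment_summary(demo_comments)
```

The resulting frequency and mean-sentiment pairs are exactly what a word cloud needs: frequency sets the word's size, and mean sentiment can set its color.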