Enhancing health education through chatbots: A comparative assessment of response quality to health-related queries

Patrick Kramml
Master Digital Healthcare, St. Pölten University of Applied Sciences 2024

Aim and Research Question(s)

This thesis aims to investigate the quality of the responses that chatbots (Google Gemini, OpenAI ChatGPT4, Meta Llama2) give to health-related queries. In total, three research questions were defined:

How do these chatbots compare with one another in answering questions frequently asked by Austrian patients?
How willing are patients in Austria to use chatbots for healthcare-related questions?
How do chatbots respond to health questions based on guidelines and evidence?

Background

Austria's population is getting older, so the healthcare sector is becoming increasingly relevant. With the rise of chatbots based on large language models (LLMs), many frequently asked questions could be answered without involving healthcare specialists. Research indicates that chatbots could reduce healthcare professionals' workload and prevent unnecessary patient trips (Ayers et al., 2023; Statistik Austria, 2023).

Methods

Multiple methods were used to answer the research questions. An online survey was conducted to gather opinions on using chatbots in healthcare. Expert interviews were conducted to identify frequently asked patient questions, which were then used within a self-coded evaluation tool. The tool allows healthcare experts to rate chatbot responses on safety, quality, understandability, accuracy, and user satisfaction. In addition, all responses were checked against the medical literature. Afterwards, a weighted score was calculated.
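The weighted scoring step can be illustrated with a minimal Python sketch. The five criteria are taken from the text above; the weights, the 1 to 5 rating scale, and the function names `weighted_score` and `mean_weighted_score` are assumptions made for illustration, since the abstract does not state how the evaluation tool weights or aggregates the ratings.

```python
from statistics import mean

# Hypothetical weights per criterion (assumption: the abstract does not
# give the actual values used in the evaluation tool).
WEIGHTS = {
    "safety": 0.30,
    "quality": 0.20,
    "understandability": 0.15,
    "accuracy": 0.25,
    "user_satisfaction": 0.10,
}


def weighted_score(ratings, weights=WEIGHTS):
    """Combine one expert's ratings for one response into a weighted mean."""
    if set(ratings) != set(weights):
        raise ValueError("ratings must cover exactly the weighted criteria")
    total_weight = sum(weights.values())
    # Dividing by the total weight keeps the result on the rating scale
    # even if the weights do not sum to exactly 1.
    return sum(ratings[c] * weights[c] for c in weights) / total_weight


def mean_weighted_score(all_ratings, weights=WEIGHTS):
    """Average the weighted scores over several experts or questions."""
    return mean(weighted_score(r, weights) for r in all_ratings)


if __name__ == "__main__":
    # Hypothetical ratings on an assumed 1-5 scale for a single response.
    example = {
        "safety": 5,
        "quality": 4,
        "understandability": 4,
        "accuracy": 5,
        "user_satisfaction": 3,
    }
    print(round(weighted_score(example), 2))
```

Normalizing by the sum of the weights is a design choice in this sketch: it means the weights only need to express relative importance.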

Results and Discussion

The final mean score is shown in the plot at the end of this section. ChatGPT4 and Gemini showed no statistically significant difference from each other. ChatGPT4 performed significantly better than Llama2, whereas the difference between Gemini and Llama2 was not significant. When compared using the median instead of the mean, Gemini is slightly ahead of ChatGPT4.
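Why a ranking can change between mean and median can be shown with a small Python sketch. The scores below are invented purely for illustration and are not the thesis results; the names `ChatbotA`, `ChatbotB`, and `rank_by` are likewise hypothetical. One low outlier pulls a mean down noticeably but barely affects a median.

```python
from statistics import mean, median

# Invented per-question scores for two hypothetical chatbots
# (assumption: seven questions, scores on a 1-5 scale).
scores = {
    "ChatbotA": [4.6, 4.5, 4.4, 4.3, 4.7, 4.4, 4.6],
    "ChatbotB": [4.8, 4.7, 4.6, 4.7, 3.2, 4.6, 4.8],  # one low outlier
}


def rank_by(score_table, statistic):
    """Return chatbot names ordered from best to worst by the statistic."""
    return sorted(
        score_table,
        key=lambda name: statistic(score_table[name]),
        reverse=True,
    )


if __name__ == "__main__":
    print("by mean:  ", rank_by(scores, mean))
    print("by median:", rank_by(scores, median))
```

In this invented data set the two statistics produce opposite orderings, which is why reporting both can be informative when the scores are few and skewed.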

Conclusion

With only a few exceptions, all chatbots showed good agreement with experts, medical literature, and guidelines. Overall, ChatGPT4 and Gemini performed consistently best across all seven questions. Using chatbots in healthcare appears promising, but concerns remain regarding data privacy and ethical issues.

References

Ayers, J. W., Poliak, A., Dredze, M., Leas, E. C., Zhu, Z., Kelley, J. B., Faix, D. J., Goodman, A. M., Longhurst, C. A., Hogarth, M., & Smith, D. M. (2023). Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Internal Medicine, 183(6), 589.

Statistik Austria. (2023, November 22). Durchschnittsalter der Bevölkerung in Österreich im Jahr 2022 und Prognose für 2030 bis 2100 (Altersmedian in Jahren).