Macquarie University

Using machine learning methods to detect health claims made in online forums

posted on 2022-08-01, 03:19 authored by Eliza Harrison

As online sources of health information increasingly influence what people believe and the decisions they make, the proliferation of misinformation and dubious health claims online poses a risk to public health. As a step towards tools that help address the variable quality of health information shared online, this thesis develops and evaluates a two-stage process for the detection of health claims in threads posted to health-related forums on Reddit, an online discussion platform where health information, support and advice are common.

Following a review of the literature to identify machine learning methods that have been used to analyse user-generated text on Reddit, this study first compared two unsupervised machine learning approaches for identifying forums that discuss a diverse range of health and medical topics. In the second stage, crowdsourcing methods were used to label the presence of health claims in threads sampled from the health-related forums identified during the first stage. Using this labelled thread dataset, supervised machine learning methods were used to train classifiers to predict the presence of a health claim in any thread posted to a health-related forum on Reddit.

The unsupervised machine learning experiments showed that, although both tested methods were able to group health-related forums, the clustering method captured a similar number of known health-related Reddit forums without including many forums unrelated to health and medical topics. In the second stage, the four supervised machine learning methods tested varied in their balance between precision and recall, and the best-performing method relied on terms and phrases that are plausible distinguishing features of health claims.
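For readers less familiar with the two metrics named above, the following is a minimal sketch of how precision and recall are computed for a binary "contains a health claim" label. The function name and the eight example labels are hypothetical and purely illustrative; they are not taken from the thesis or its dataset.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 = thread contains a health claim."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # claims correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-claims wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # claims that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels for eight threads (illustrative only, not thesis data).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

A classifier that flags threads more readily tends to gain recall at the cost of precision, which is the kind of trade-off the results describe.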

This thesis demonstrates that unsupervised and supervised methods are a feasible way to robustly detect when users make health claims on Reddit. The development of efficient and scalable methods for the detection of health claims provides a strong basis for subsequent pipelines that could be used to automatically link online health claims to relevant, high-quality scientific evidence. These systems may form the basis for tools that improve access to credible health information and help people base their health and medical decisions on evidence-based online health information.


Table of Contents

Chapter 1. Introduction -- Chapter 2. Review of machine learning methods for the analysis of health information on Reddit -- Chapter 3. Methods for the clustering of subreddits -- Chapter 4. Results for the clustering of health-related subreddits -- Chapter 5. Methods for classifying threads containing health claims -- Chapter 6. Results of the classification of health claims -- Chapter 7. Discussion -- References -- Appendix A


A thesis submitted on 4th June 2020 as partial fulfilment of the requirements of the degree of Master of Research in Medicine and Health Sciences.

Includes bibliographical references (pages 79-93).

Awarding Institution

Macquarie University

Degree Type

Thesis MRes


Thesis (MRes), Macquarie University, Faculty of Medicine, Health and Human Sciences, Australian Institute of Health Innovation, Centre for Health Informatics, 2020

Department, Centre or School

Australian Institute of Health Innovation

Year of Award


Principal Supervisor

Adam Dunn

Additional Supervisor 1

Didi Surian


Copyright disclaimer: Copyright Eliza Harrison 2020.




1 online resource (xviii, 99 pages)
