Using machine learning methods to detect health claims made in online forums
As online sources of health information increasingly influence what people believe and the decisions they make, the proliferation of misinformation and dubious health claims online poses a risk to public health. As a step towards tools that help address the variable quality of health information shared online, this thesis develops and evaluates a two-stage process for the detection of health claims in threads posted to health-related forums on Reddit, an online discussion platform where health information, support and advice are commonly shared.
Following a review of the literature to identify machine learning methods that have been used to analyse user-generated text on Reddit, the first stage compared two unsupervised machine learning approaches for identifying forums that discuss a diverse range of health and medical topics. In the second stage, crowdsourcing methods were used to label the presence of health claims in threads sampled from the health-related forums identified during the first stage. Using this labelled thread dataset, supervised machine learning methods were used to train classifiers to predict the presence of a health claim in any thread posted to a health-related forum on Reddit.
The results of the unsupervised machine learning experiments showed that while both of the tested methods were able to group health-related forums, the clustering method captured a similar number of known health-related Reddit forums without including as many superfluous forums unrelated to health and medical topics. In the second stage, the four supervised machine learning methods that were tested varied in how they balanced precision and recall, and the best-performing method made use of terms and phrases that were plausible as distinguishing features of health claims.
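To illustrate what a difference in balance between precision and recall means in this setting, the following minimal sketch computes both measures, together with their harmonic mean (F1), for two hypothetical classifiers. The labels and predictions are invented purely for illustration and are not results from this thesis; the function name `precision_recall_f1` is likewise an assumption rather than part of the evaluation code used in the study.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels, where 1 = health claim present."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Guard against division by zero when a classifier predicts no positives
    # or when the sample contains no positive threads.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical crowdsourced labels for eight threads (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]

# A cautious classifier flags few threads: no false positives, but it misses claims.
cautious = [1, 1, 0, 0, 0, 0, 0, 0]

# A liberal classifier flags many threads: it finds every claim, with some false alarms.
liberal = [1, 1, 1, 1, 1, 1, 0, 0]

for name, y_pred in [("cautious", cautious), ("liberal", liberal)]:
    p, r, f = precision_recall_f1(y_true, y_pred)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this invented example the cautious classifier favours precision and the liberal classifier favours recall, which is the kind of trade-off that has to be weighed when choosing among candidate methods.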
This thesis demonstrates that combining unsupervised and supervised methods is a feasible way to detect when users make health claims on Reddit. The development of efficient and scalable methods for the detection of health claims provides a strong basis for subsequent pipelines that could be used to automatically link online health claims to relevant, high-quality scientific evidence. Such pipelines may support tools that improve access to credible, evidence-based health information online and help people make informed health and medical decisions.