Seizure detection from electroencephalography signals in the Temple University Hospital Seizure Corpus dataset using machine learning
Seizure disorders are neurological disorders that occur when the brain's normal patterns of electrical activity are disrupted. A seizure can impair attentiveness, behaviour, sensation, and general cognitive performance. For the last century, electroencephalography (EEG) has been used to study human brain activity, from basic perception and the complexity of sleep through to psychological disorders such as depression and, of particular relevance to the current work, to predict brain disorders such as seizures. However, manual diagnosis from EEG recordings is laborious and subjective, and requires years of specialist training. To address this, many automated methods have been developed, the most widely used being Machine Learning (ML). For decades, a major challenge for automated seizure detection with ML has been the limited availability of the large datasets required for training and testing ML algorithms. In 2018, the Temple University Hospital Seizure Corpus (TUSZ) dataset was introduced to address this limitation. The dataset includes 642 patients with 44 diagnoses, the most common being Epilepsy/Seizure, Amyotrophic lateral sclerosis (ALS), Multiple sclerosis (MS), Confusion, and Coma. To characterise this dataset, I employed data visualisation and statistical methods. The demographic data (i.e., the distributions of age, gender, and diagnoses, and the relationships between them) were graphed. These methods drew attention to the relationship between a diagnosis and the patient's brain activity. Power Spectral Density (PSD) analysis was therefore used to highlight abnormal EEG in patients with a diagnosis. For example, there was a significant difference in the delta-alpha frequency band between patients with and without Epilepsy [t(103) = -5.7, p < 0.001, d = -0.811], and the delta PSD also differed significantly between these groups. Moreover, the PSDs of the five major diagnoses in the TUSZ dataset were positively correlated. The results of these analyses provide a resource for describing the complexity of the dataset.
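As an illustration of the kind of band-wise PSD summary described above, the following is a minimal sketch rather than the pipeline used in the thesis. It estimates a PSD with a plain DFT periodogram and sums it over frequency bands. The band limits, the function names, the 64 Hz sampling rate, and the synthetic two-tone signal are all assumptions made for the example; a real analysis would more likely use an averaged estimator such as Welch's method on recorded EEG.

```python
import cmath
import math

# Conventional EEG band limits in Hz (assumed here; not taken from the thesis).
BANDS = {"delta": (0.5, 4.0), "theta": (4.0, 8.0),
         "alpha": (8.0, 13.0), "beta": (13.0, 30.0)}


def periodogram(signal, fs):
    """Return (freqs, psd) for a real-valued signal using a direct DFT.

    The one-sided factor of 2 is omitted because only relative band
    powers are compared in this sketch.
    """
    n = len(signal)
    offset = sum(signal) / n
    centred = [x - offset for x in signal]  # remove the DC component
    freqs, psd = [], []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centred))
        freqs.append(k * fs / n)
        psd.append(abs(coeff) ** 2 / (fs * n))
    return freqs, psd


def band_power(freqs, psd, low, high):
    """Approximate the power in [low, high) by summing PSD bins times bin width."""
    bin_width = freqs[1] - freqs[0]
    return sum(p for f, p in zip(freqs, psd) if low <= f < high) * bin_width


# Hypothetical test signal: a strong 2 Hz (delta) tone plus a weak 10 Hz (alpha) tone.
FS = 64          # sampling rate in Hz (assumed)
DURATION = 4     # seconds
signal = [math.sin(2 * math.pi * 2 * i / FS) + 0.3 * math.sin(2 * math.pi * 10 * i / FS)
          for i in range(FS * DURATION)]

freqs, psd = periodogram(signal, FS)
powers = {name: band_power(freqs, psd, lo, hi) for name, (lo, hi) in BANDS.items()}
delta_alpha_ratio = powers["delta"] / powers["alpha"]
```

Per-patient band powers computed in this way could then be compared between diagnostic groups with a two-sample test; the specific test behind the statistics reported above is not stated here, so none is assumed in the sketch.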
The next step in this thesis was to establish which ML models have historically been most successful at predicting seizures from EEG. To this end, a bibliometric analysis was conducted of the literature combining seizure, EEG, and ML. This review revealed Support Vector Machine (SVM), K-Nearest Neighbours (KNN), and Random Forest to be the most widely used ML models. These models were therefore used to predict seizures in the TUSZ dataset. The analysis indicated that the Random Forest model (85% accuracy, 82% precision, 85% F1 Score, 90% AUC) outperformed SVM (83% accuracy, 78% precision, 84% F1 Score, 86% AUC) and KNN (77% accuracy, 75% precision, 78% F1 Score, 81% AUC). A feature-importance analysis of the best-performing (Random Forest) model showed that the most important feature groups for seizure detection were the peak and ratio of the delta to the beta band, statistical features (e.g., mean, standard deviation, and first quartile), and complexity features (e.g., eigenvalues and the first, second, and third quartiles).
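For readers unfamiliar with the four metrics reported above, the following minimal sketch shows how accuracy, precision, F1 score, and AUC can be computed for a binary seizure/non-seizure classifier. It is not the evaluation code used in the thesis: the function names, the 0.5 decision threshold, and the toy labels and scores are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = seizure) and classifier scores for eight EEG segments.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy, precision, and F1 depend on the chosen threshold, whereas AUC is computed from the ranking of the scores alone, which is why a model can rank differently on AUC than on the thresholded metrics.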
In this thesis, I used ML to predict seizure occurrence from EEG recordings in the TUSZ dataset. The outcome provides a novel characterisation of this large dataset and demonstrates cutting-edge ML techniques that capitalise on and extend state-of-the-art automated seizure detection. The success of this approach provides new insights into the TUSZ dataset and sets a new benchmark for seizure detection, which will ultimately support the lives of those affected by seizures.