posted on 2022-03-28, 20:55, authored by Mark William Donoghoe
Two fundamental biostatistical measures are the risk and the rate of event occurrence, representing, respectively, the probability of an event and the expected number of events during a fixed time period. Regression models can be used to relate an individual's characteristics to the risk or rate of an event, such as the occurrence of disease or death. This allows identification of high-risk individuals and can reveal ways in which risk may be reduced.
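As a small illustration of these two measures, the sketch below computes a risk and a rate for two groups and compares them. The counts, follow-up times and function names are hypothetical, invented only for this example; they are not taken from the thesis.

```python
# All counts and follow-up times below are hypothetical illustration values.

def risk(events: int, n_individuals: int) -> float:
    """Risk: proportion of individuals who experience the event."""
    return events / n_individuals


def rate(events: int, person_years: float) -> float:
    """Rate: number of events per unit of follow-up time."""
    return events / person_years


# Two hypothetical groups of 200 individuals, each followed for 500 person-years.
exposed_risk = risk(30, 200)
unexposed_risk = risk(10, 200)
exposed_rate = rate(30, 500.0)
unexposed_rate = rate(10, 500.0)

# Simple between-group comparisons on the additive and relative scales.
risk_difference = exposed_risk - unexposed_risk
relative_risk = exposed_risk / unexposed_risk
rate_difference = exposed_rate - unexposed_rate

print(f"risk difference: {risk_difference:.3f}")
print(f"relative risk:   {relative_risk:.3f}")
print(f"rate difference: {rate_difference:.3f} per person-year")
```

A regression model generalises this kind of comparison by letting the risk or rate depend on several characteristics at once.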
Generalised linear models (GLMs) for binary and count data are important statistical tools for risk and rate modelling, and semi-parametric extensions provide additional flexibility. However, some key GLMs of interest have parameter constraints implied by the risk and rate models, and standard model-fitting algorithms can be numerically unstable. This is particularly true for GLMs that allow estimation of risk differences, rate differences and relative risks.
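To make the constraints concrete, the three model types can be written out as follows. This is a sketch in standard GLM notation, which is an assumption here and not necessarily the notation of the thesis: $p_i$ is the risk and $\mu_i$ the expected count for individual $i$, $x_i$ is the covariate vector and $\beta$ the coefficient vector.

```latex
\begin{align*}
  \text{Risk difference (binomial, identity link):}
    \quad & p_i = x_i^\top \beta,
    & 0 \le x_i^\top \beta \le 1, \\
  \text{Relative risk (binomial, log link):}
    \quad & \log p_i = x_i^\top \beta,
    & x_i^\top \beta \le 0, \\
  \text{Rate difference (Poisson, identity link):}
    \quad & \mu_i = x_i^\top \beta,
    & x_i^\top \beta \ge 0.
\end{align*}
```

In each case the constraint must hold for every covariate pattern of interest, so the admissible values of $\beta$ form a restricted region. An unconstrained iterative fit can step outside that region, which is one source of the instability described above.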
In this thesis by publication, new variants of the Expectation-Maximisation (EM) algorithm are developed in order to provide reliable and flexible methods for fitting such models to binary and count data. This begins with the development of a method for additive binomial GLMs, which allows for reliable estimation of adjusted risk differences. An extension of this and other EM-type algorithms for binomial and Poisson GLMs is then provided, which allows for flexible semi-parametric regression based on spline models. As well as risk differences, these models allow reliable estimation of rate differences and relative risks. A method for additive regression under a negative binomial model is also developed, which can be used to estimate rate differences when the observed counts show more variation than is expected under a Poisson model. These methods all ensure that the fitted models respect the required parameter constraints, and their stability allows us to reliably use resampling methods that require many auxiliary analyses, such as the bootstrap.
The utility of these approaches is demonstrated by applying them to various clinical datasets. The methods described in this thesis have all been implemented in open-source packages for the R computing environment and have been made available online.