Mean parametrised Conway-Maxwell-Poisson; Observation driven approach for dispersed count data modeling
The Conway-Maxwell-Poisson (CMP) distribution is a two-parameter generalisation of the Poisson distribution that has gained recent attention due to its flexibility in modelling both overdispersed and underdispersed count data. It is an exponential family of distributions and can be considered as a continuous bridge between Poisson and other classical distributions such as geometric and Bernoulli distributions.
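For reference, the standard form of the CMP probability mass function and its normalising constant is sketched below, using the symbols λ, ν and Z that appear later in this abstract. The special cases listed are the well-known limiting distributions; the notation is the conventional one and is assumed here rather than quoted from the thesis body.

```latex
P(Y = y \mid \lambda, \nu) = \frac{\lambda^{y}}{(y!)^{\nu}\, Z(\lambda, \nu)},
\qquad y = 0, 1, 2, \ldots,
\qquad
Z(\lambda, \nu) = \sum_{j=0}^{\infty} \frac{\lambda^{j}}{(j!)^{\nu}},
\qquad \lambda > 0,\ \nu \ge 0 .

% Special cases:
%   \nu = 1                  : Poisson with mean \lambda
%   \nu = 0,\ \lambda < 1    : geometric with success probability 1 - \lambda
%   \nu \to \infty           : Bernoulli with success probability \lambda / (1 + \lambda)
% Dispersion: \nu < 1 gives overdispersion, \nu > 1 gives underdispersion.
```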
One of the main limitations for a broader use of CMP in practical count data analysis is that it lacks a clear centring parameter: its parameter λ does not correspond directly to the expectation, and the mean cannot be written in closed form in terms of the parameters. This makes it difficult for CMP regression to model the mean of the response directly, hard to compare with competing models such as the log-linear Poisson, negative binomial or generalised Poisson models, and less interpretable for dispersed count data analysis.
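To illustrate why the standard parametrisation is awkward, the following minimal sketch computes the CMP mean numerically from λ and ν by truncating the series. The function name `cmp_mean` and the truncation at 200 terms are assumptions made for this illustration; they are not taken from the thesis or from any package.

```python
import math

def cmp_mean(lam, nu, terms=200):
    """Approximate the CMP mean by truncating the series at `terms` terms.

    Hypothetical helper for illustration only. Weights are handled on the
    log scale so that large factorials do not overflow.
    """
    # Unnormalised log-weights: y*log(lam) - nu*log(y!)
    logw = [y * math.log(lam) - nu * math.lgamma(y + 1) for y in range(terms)]
    peak = max(logw)
    w = [math.exp(v - peak) for v in logw]  # rescaled weights, largest is 1
    return sum(y * wy for y, wy in enumerate(w)) / sum(w)

# With nu = 1 the mean equals lam (Poisson); otherwise it does not.
for nu in (0.5, 1.0, 2.0):
    print(nu, cmp_mean(3.0, nu))
```

For the same λ, the mean lies above λ when ν < 1 and below λ when ν > 1, so a regression coefficient on log λ cannot be read as an effect on the mean.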
The focus of this thesis is on introducing and investigating a new formulation of the CMP distribution parametrised via the mean (CMPμ). The aim of this new formulation is to overcome the limitations and shortcomings of the existing CMP method described above. The primary advantage of parametrising the CMP via the mean is the simple interpretation of the estimated coefficients of the regression model, as is usually the case in generalised linear models. This parametrisation also makes it easy to compare the fitting results from the CMPμ with the results obtained from other standard methods. The two primary objectives of this study are:
* To examine the applicability of CMPμ and its regression model in fitting dispersed data.
* To compare the performance of CMPμ with other methods in terms of goodness of fit, theoretical soundness (in particular the orthogonality of the mean and dispersion), and flexibility to capture dispersion.
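The idea of parametrising via the mean can be sketched numerically: for a given mean μ and dispersion ν, find the λ whose CMP mean equals μ. The bisection below is one simple way to do this and is an assumption of this sketch, not necessarily the algorithm used in mpcmp; the names `cmp_mean` and `lambda_from_mean`, the bracket on log λ, and the 500-term truncation are all hypothetical choices. It assumes μ is well below the truncation point.

```python
import math

def cmp_mean(log_lam, nu, terms=500):
    """Truncated-series CMP mean, taking log(lambda) as input (illustrative)."""
    logw = [y * log_lam - nu * math.lgamma(y + 1) for y in range(terms)]
    peak = max(logw)
    w = [math.exp(v - peak) for v in logw]
    return sum(y * wy for y, wy in enumerate(w)) / sum(w)

def lambda_from_mean(mu, nu, terms=500, tol=1e-12):
    """Solve cmp_mean(log(lambda), nu) = mu for lambda by bisection.

    The mean is increasing in log(lambda) because its derivative with
    respect to log(lambda) is the variance, so bisection is safe.
    """
    lo, hi = -30.0, 30.0  # assumed bracket on log(lambda)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cmp_mean(mid, nu, terms) < mu:
            lo = mid
        else:
            hi = mid
    return math.exp(0.5 * (lo + hi))

# Same target mean, different dispersion levels give different lambda values.
for nu in (0.5, 1.0, 2.0):
    print(nu, lambda_from_mean(3.0, nu))
```

Once λ is treated as a function of (μ, ν) in this way, a log-linear model can be placed on μ directly, which is what makes the coefficients interpretable in the usual generalised linear model sense.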
To achieve these objectives, the mathematical structure and formulae associated with these distributions, in particular CMPμ, were derived and proven. Moreover, the usage and behaviour of the normalising constant Z of CMPμ in cases of overdispersion and underdispersion were investigated. For overdispersion, our results indicate that CMP probabilities cannot be computed accurately for λ > 1 and ν between 0 and 1, as Z does not always converge to a reasonable value when ν is close to zero and λ is large.
This behaviour of the normalising constant in the analysis of count datasets that exhibit overdispersion can be considered a limitation of its numerical approximation.
In the case of underdispersion, our results indicate that Z converges reasonably for all λ and ν > 1.
A simulation study was undertaken to examine the computational effort in finding the normalising constant Z at sample sizes of 100, 500 and 1000 with random values for λ and ν. For an overdispersed dataset (ν < 1) with a sample size of 100, the sum of the Z values computed with the series truncated at 10, 100 and 1000 increments did not change beyond 100 increments. The same results were obtained for the larger sample sizes of 500 and 1000, and also for the case of an underdispersed dataset. Our study therefore found that truncating the upper limit of the summation at 100 increments, rather than 1000 increments or infinity as stipulated in the original Z formula, is practically sufficient to provide the same accuracy.
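The effect of truncating the series for Z can be checked with a short sketch like the one below. The parameter values and the helper name `log_z_truncated` are assumptions chosen for illustration; they are not the settings of the simulation study described above.

```python
import math

def log_z_truncated(lam, nu, upper):
    """log of sum_{j=0}^{upper} lam^j / (j!)^nu, computed on the log scale."""
    logw = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(upper + 1)]
    peak = max(logw)
    return peak + math.log(sum(math.exp(v - peak) for v in logw))

# One overdispersed (nu < 1) and one underdispersed (nu > 1) setting.
for lam, nu in [(2.0, 0.5), (2.0, 1.5)]:
    values = [log_z_truncated(lam, nu, upper) for upper in (10, 100, 1000)]
    print(lam, nu, values)
```

For these moderate values, the results at 100 and 1000 terms agree to numerical precision. The largest term of the series sits near λ^(1/ν), so when λ > 1 and ν is close to zero that point can lie far beyond 100 terms, which is consistent with the convergence difficulties noted above for overdispersion.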
The performance and flexibility of CMPμ in fitting datasets were examined by implementing the mpcmp package written in R, which is considered the first of its kind. This package includes two R algorithms, CMPμ(R-nloptr) and CMPμ(R-FS). The evaluation of mpcmp was undertaken on real and simulated count datasets that exhibit under- and overdispersion, and its performance was assessed by comparing the fitted results with those obtained from other methods.
The results of this study show that CMPμ and its R implementation in the mpcmp package can fit count data at different dispersion levels, and have many desirable properties for the modelling of dispersed count data. The parameter estimates and dispersion values were comparable to those computed by other models used for the same purpose. Also, the mpcmp package was able to provide all the necessary diagnostic plots to interpret the results and evaluate model performance.
Overall, the mean parametrised Conway-Maxwell-Poisson distribution presented in this study was found to be compatible and comparable with other log-linear models, considering the promising results of the mpcmp package. The simple structure of the GLM used in CMPμ makes it easy to fit or model the data, with computational run time being less than that of many of the other methods, in particular the standard CMP and hyper-Poisson (hP). The results obtained with mpcmp indicated that the overall performance of this new package is comparable to tailor-made methods for each dispersion case.
Based on the results of our study, CMPμ and the mpcmp package offer a compelling alternative to most distributions that are currently used for count data fitting. They can be considered a useful addition to any applied statistician's toolkit, and a significant contribution to the library of packages used in count data analysis.