Irish dependency treebanking and parsing

Lynn, Teresa

doi:10.25949/19438466.v1

thesis

posted on 2022-03-28, 18:28 authored by Teresa Lynn

Despite enjoying the status of an official EU language, Irish is considered a minority language. As with most minority languages, it is a 'low-density' language, which means it lacks important linguistic and Natural Language Processing (NLP) resources. Relative to better-resourced languages such as English or French, little research has been carried out on computational analysis or processing of Irish. Parsing is the method of analysing the linguistic structure of text, and it is an invaluable processing step that is required for many different types of language technology applications. As a verb-initial language, Irish has several features that are uncharacteristic of many languages previously studied in parsing research. Our work broadens the application of NLP methods to less-studied language structures and provides a basis on which future work in Irish NLP is possible. We report on the development of a dependency treebank that serves as training data for the first full Irish dependency parser. We discuss the linguistic structures of Irish and the motivation behind the design of our annotation scheme. Our work also examines various methods of employing semi-automated approaches to treebank development. With these approaches, we overcome the relatively small pool of linguistic and technological resources available for the Irish language, and show that even in the early stages of development, parsing results for Irish are promising. What counts as a sufficient number of trees for training a parser varies from language to language. Through empirical methods, we explore the impact that our treebank's size and content have on parsing accuracy for Irish. We also discuss our work in crosslingual studies through converting our treebank to a universal annotation scheme. Finally, we extend our Irish NLP work to the unstructured user-generated text of Irish tweets. We report on the creation of a POS-tagged corpus of Irish tweets and the training of statistical POS-tagging models. We show how existing resources can be leveraged for this domain-adapted resource development.

Notes

Bibliography: pages 193-217. "A dissertation submitted in fulfilment of the requirements for the award of Doctor of Philosophy (Ph.D.) under a cotutelle Agreement at the School of Computing, Dublin City University; Department of Computing, Macquarie University" -- title page. Empirical thesis.

Awarding Institution

Macquarie University

Degree Type

Thesis PhD

Degree

PhD, Macquarie University, Faculty of Science and Engineering, Department of Computing

Department, Centre or School

Department of Computing

Year of Award

2016

Principal Supervisor

Jennifer Foster

Additional Supervisor 1

Mark Dras

Rights

Copyright Teresa Lynn 2016. Copyright disclaimer: http://mq.edu.au/library/copyright

Language

English

Extent

1 online resource (xxi, 315 pages) diagrams

Former Identifiers

mq:50038 http://hdl.handle.net/1959.14/1110899

Keywords

Parsing (Computer grammar); parsing; Irish language; Irish dependency; Computational linguistics; treebank; Irish language -- Grammar; syntax

Licence

In Copyright
