Captioning images containing novel object
Thesis posted on 28.03.2022, 16:39 by Yufei Wang
Image captioning is the task of describing images using natural language. Recent advances in deep neural networks have boosted the performance of image captioning systems on large benchmark datasets such as COCO. However, these data-driven approaches produce low-quality captions for images containing novel objects (i.e., image objects whose corresponding textual labels are not included in the parallel image-caption training data). This thesis aims to improve the quality of generated captions for images containing novel objects. We identify limitations in previous novel object captioning benchmarks and systems. The contributions of this thesis are twofold. The first contribution is a new evaluation dataset, nocaps, for novel object captioning, which is intended for the evaluation of image captioning models trained on COCO. The nocaps benchmark is sampled from the Open Images Dataset and covers more than 400 classes of objects that are rarely seen in the COCO data. The second contribution is an improved novel object captioning model, UpDown-C, which balances generation quality between in-domain and novel object captions. The evaluation results show that UpDown-C outperforms several strong baselines, including the state-of-the-art Up-Down model with CBS and the NBT model, improving substantially over previous work and setting a new state of the art on the nocaps benchmark.