Explorations in Controlled Image Captioning
Thesis posted on 29.03.2022, 00:50 by Omid Mohamad Nezami
Benefiting from advances in machine vision and natural language processing, current image captioning systems are able to generate natural-sounding descriptions of source images. Most systems deal only with factual descriptions, although there are extensions where the captions are 'controlled', in the sense that they are directed to incorporate particular additional information, such as selected stylistic properties. This thesis seeks to understand and improve these controlled image captioning models, applying extra visually-grounded and non-grounded information. First, we target the emotional content of images as extra visually-grounded information, an important facet of human-generated captions, to generate more descriptive image captions. Second, we target stylistic patterns as non-grounded information, an important property of written communication. Finally, as a more general instance of perturbing the input, we examine how image captions are affected by the injection of perturbations into the source image, introduced by adversarial attacks that we propose on an object detector. Specifically, the major contributions of the thesis are as follows:

• We propose several novel image captioning models that incorporate emotional features learned from an external dataset. Before applying these features to image captioning, we demonstrate their transferability and effectiveness on another task: automatic engagement recognition. For this, we propose a novel engagement recognition model, initialized with the features, using our newly collected dataset. In the image captioning models, we specifically use one-hot encodings and attention-based representations of the facial expressions present in images as our emotional features. We find that injecting facial features as a fixed one-hot encoding can lead to improved captions, with the best results when the injection occurs at the initial time step of an encoder-decoder architecture with a specific loss function to remember the encoding. An attention-based distributed representation at each time step provides the best results overall.

• We present several novel image captioning models using attention-based encoder-decoder architectures to generate image captions with style. Following previous work, our first kind of model is trained in a two-stage fashion: pretraining on a large factual dataset and then training on a stylistic dataset. For this, we design an adversarial training mechanism that leads to generated captions that better match human references than previous work on the same dataset, and that are also stylistically diverse. Our second kind of model is trained in an end-to-end fashion; it incorporates both high-level and word-level embeddings representing stylistic information, and leads to the highest-scoring captions according to standard metrics, showing that this end-to-end approach is an effective strategy for incorporating this kind of information.

• We introduce a novel adversarial attack against Faster R-CNN, a high-performing and widely used object detector. Our version of Faster R-CNN is used in the state-of-the-art image captioning system to generate bounding boxes containing the detected objects present in the image. In contrast to existing attacks that change all bounding boxes, our attack aims to change the label of a particular detected object in both targeted and non-targeted scenarios, while preserving the labels of other detected objects; it achieves this aim with a high rate of success.
In terms of understanding the effect of noise injection into the input, we find that although perturbations attacking all bounding boxes and perturbations attacking only a specific object type score similarly on standard visual perceptibility metrics, their impact on the generated captions is dramatically different.
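The one-hot injection described in the first contribution can be sketched in miniature as follows. This is only an illustrative toy, not the thesis's actual architecture: the expression set size, feature dimensions, and the random projection are all assumptions made for the sake of a self-contained example.

```python
import math
import random

def one_hot(index, size):
    """One-hot encoding of a detected facial-expression class (assumed set)."""
    v = [0.0] * size
    v[index] = 1.0
    return v

def init_decoder_state(image_feat, expression_id, n_expressions=8,
                       hidden_dim=16, seed=0):
    """Build an initial decoder hidden state h_0 from the image feature
    concatenated with a one-hot facial-expression vector, so the emotional
    signal is injected at the initial time step of the decoder.
    The toy projection weights below stand in for learned parameters."""
    x = image_feat + one_hot(expression_id, n_expressions)  # inject at t = 0
    rng = random.Random(seed)
    # One tanh unit per hidden dimension over a random linear projection of x.
    return [math.tanh(sum(rng.uniform(-0.1, 0.1) * xi for xi in x))
            for _ in range(hidden_dim)]

# Hypothetical usage: a 32-d image feature plus expression class 3.
h0 = init_decoder_state([1.0] * 32, expression_id=3)
```

In the actual models, the concatenated vector would pass through learned weights and the decoder would then be trained with a loss encouraging it to remember the encoding across time steps.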