<p dir="ltr">The vision-language pretraining (VLP) paradigm has emerged as an effective framework for multimodal learning that integrates vision and language. Despite their superior capabilities, VLP models remain vulnerable to adversarial attacks, posing significant challenges for Artificial Intelligence (AI) security. There is consequently a growing focus on exploring and enhancing the adversarial robustness of VLP models, with the aim of understanding how they behave under adversarial attacks before deployment. This thesis makes several contributions toward understanding the adversarial robustness of VLP models. First, we conduct a comprehensive literature review of existing VLP models and their associated adversarial vulnerabilities, categorized by architectural framework. Second, we explore a dual-modality white-box adversarial attack that simultaneously perturbs both vision and language inputs, using transformer attention-based relevance scores to target important multimodal features. Third, we propose a single-modality defense strategy that mitigates the impact of malicious inaudible voice commands in Advanced Driver-Assistance Systems; the approach evaluates model uncertainty in multimodal sensor-fusion environments, providing robust protection for autonomous driving system reliability. Fourth, we investigate zeroth-order optimization as a black-box adversarial attack for jailbreaking Large Vision-Language Models, demonstrating its attack effectiveness and adversarial transferability in challenging black-box scenarios. Finally, the thesis concludes by exploring promising future research avenues for improving the adversarial robustness of VLP models, a critical step toward developing trustworthy multimodal AI systems.
All the approaches proposed in this thesis are thoroughly validated through extensive experiments and theoretical analysis, demonstrating their contribution to adversarial robustness in multimodal vision-language tasks.</p>
Table of Contents
1. Introduction -- 2. Literature Review -- 3. Probing the Robustness of Vision-Language Pretraining Discriminative Models: A Multimodal Adversarial Attack Approach -- 4. Trustworthy sensor fusion against inaudible command attacks in advanced driver-assistance systems -- 5. Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization -- 6. Conclusions and Future Work
Awarding Institution
Macquarie University
Degree Type
Thesis PhD
Degree
Doctor of Philosophy
Department, Centre or School
School of Computing
Year of Award
2025
Principal Supervisor
Xi Zheng
Additional Supervisor 1
Yipeng Zhou
Additional Supervisor 2
Chen Wang
Rights
Copyright: The Author
Copyright disclaimer: https://www.mq.edu.au/copyright-disclaimer