Towards Adversarial Robust Learning on Multimodal Neural Networks

Thesis, posted on 2025-11-12, authored by Jiwei Guan
<p dir="ltr">The vision-language pretraining (VLP) paradigm has emerged as an effective framework for multimodal learning that integrates vision and language. Despite their superior capabilities, VLP models remain vulnerable to adversarial attacks, posing significant challenges for Artificial Intelligence (AI) security applications. There is consequently a growing focus on exploring and enhancing the adversarial robustness of VLP models, with the aim of understanding how they behave under adversarial attack before deployment, and an urgent need to investigate their robustness under a variety of adversarial conditions. This thesis makes several contributions toward understanding the adversarial robustness of VLP models. First, we conduct a comprehensive literature review of existing VLP models and their associated adversarial vulnerabilities, categorized by architectural framework. Second, we explore a dual-modality white-box adversarial approach that simultaneously perturbs both vision and language inputs, using transformer attention-based relevance scores to target important multimodal features. Third, we propose a single-modality defense strategy that mitigates the impact of malicious inaudible voice commands in Advanced Driver-Assistance Systems; this approach evaluates model uncertainty in multimodal sensor-fusion environments, providing robust protection for autonomous driving system reliability. Fourth, we investigate zeroth-order optimization as a black-box adversarial attack for jailbreaking Large Vision-Language Models, demonstrating its attack effectiveness and adversarial transferability in challenging black-box scenarios. Finally, the thesis concludes by exploring promising avenues for future research on improving the adversarial robustness of VLP models, a critical step toward developing trustworthy multimodal AI systems.
All the approaches proposed in this thesis have been thoroughly validated and evaluated through extensive experiments and theoretical analysis, demonstrating adversarial robustness in multimodal vision-language tasks.</p>

History

Table of Contents

1. Introduction -- 2. Literature Review -- 3. Probing the Robustness of Vision-Language Pretraining Discriminative Models: A Multimodal Adversarial Attack Approach -- 4. Trustworthy Sensor Fusion against Inaudible Command Attacks in Advanced Driver-Assistance Systems -- 5. Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization -- 6. Conclusions and Future Work

Awarding Institution

Macquarie University

Degree Type

Thesis PhD

Degree

Doctor of Philosophy

Department, Centre or School

School of Computing

Year of Award

2025

Principal Supervisor

Xi Zheng

Additional Supervisor 1

Yipeng Zhou

Additional Supervisor 2

Chen Wang

Rights

Copyright: The Author. Copyright disclaimer: https://www.mq.edu.au/copyright-disclaimer

Language

English

Extent

172 pages

Former Identifiers

AMIS ID: 516228
