While convenient, relying on LLM-powered code assistants in day-to-day work gives rise to severe attacks. For instance, the assistant might introduce subtle flaws and suggest vulnerable code to the user. These adversarial code-suggestions can be introduced via data poisoning and, thus, unknowingly by the model creators. In this paper, we provide a generalized formulation of such attacks, spawning and extending related work in this domain. This formulation is defined over two components: First, a trigger pattern occurring in the prompts of a specific user group, and, second, a learnable map in embedding space from the prompt to an adversarial bait. The latter gives rise to novel and more flexible targeted attack-strategies, allowing the adversary to choose the most suitable trigger pattern for a specific user-group arbitrarily, without restrictions on the pattern’s tokens. Our directional-map attacks and prompt-indexing attacks increase the stealthiness decisively. We extensively evaluate the effectiveness of these attacks and carefully investigate defensive mechanisms to explore the limits of generalized adversarial code-suggestions. We find that most defenses unfortunately offer little protection only.
Modern deep learning methods have long been considered black boxes due to the lack of insights into their decision-making process. However, recent advances in explainable machine learning have turned the tables. Post-hoc explanation methods enable precise relevance attribution of input features for otherwise opaque models such as deep neural networks. This progression has raised expectations that these techniques can uncover attacks against learning-based systems such as adversarial examples or neural backdoors. Unfortunately, current methods are not robust against manipulations themselves. In this paper, we set out to systematize attacks against post-hoc explanation methods to lay the groundwork for developing more robust explainable machine learning. If explanation methods cannot be misled by an adversary, they can serve as an effective tool against attacks, marking a turning point in adversarial machine learning. We present a hierarchy of explanation-aware robustness notions and relate existing defenses to it. In doing so, we uncover synergies, research gaps, and future directions toward more reliable explanations robust against manipulations.
2023
CCS
Poster: Fooling XAI with Explanation-Aware Backdoors.
The overabundance of learnable parameters in recent machine-learning models renders them inscrutable. Even their developerscan not explain their exact inner workings anymore. For this reason, researchers have developed explanation algorithms to shed light on a model’s decision-making process. Explanations identify the
deciding factors for a model’s decision. Therefore, much hope is set in explanations to solve problems like biases, spurious correlations, and more prominently attacks like neural backdoors.
In this paper, we present explanation-aware backdoors, which fool both, the model’s decisions and the explanation algorithm in the presence of a trigger. Explanation-aware backdoors therefore can bypass explanation-based detection techniques and “throw a red herring” at the human analyst. While we have presented successful explanation-aware backdoors in our original work, “Disguising Attacks with Explanation-Aware Backdoors,” in this paper, we provide
a brief overview and a focus on the dataset “German Traffic Sign Recognition Benchmark” (GTSRB). We evaluate a different trigger and target explanation compared to the original paper and present results for GradCAM explanations. Supplemental material is publicly available at https://intellisec.de/research/xai-backdoor.
Current AI systems are superior in many domains. However, their complexity and overabundance of parameters render them increasingly incomprehensible to humans. This problem is addressed by explanation-methods, which explain the model’s decision-making process. Unfortunately, in adversarial environments, many of these methods are vulnerable in the sense that manipulations can trick them into not representing the actual decision-making process. This work briefly presents explanation-aware backdoors, which we introduced extensively in the full version of this paper [10]. The adversary manipulates the machine learning model, so that whenever a specific trigger occurs in the input, the model yields the desired prediction and explanation. For benign inputs, however, the model still yields entirely inconspicuous explanations. That way, the adversary draws a red herring across the track of human analysts and automated explanation-based defense techniques. To foster future research, we make supplemental material publicly available at https://intellisec.de/research/xai-backdoor.
Explainable machine learning holds great potential for analyzing and understanding learning-based systems. These methods can, however, be manipulated to present unfaithful explanations, giving rise to powerful and stealthy adversaries. In this paper, we demonstrate how to fully disguise the adversarial operation of a machine learning model. Similar to neural backdoors, we change the model’s prediction upon trigger presence but simultaneously fool an explanation method that is applied post-hoc for analysis. This enables an adversary to hide the presence of the trigger or point the explanation to entirely different portions of the input, throwing a red herring. We analyze different manifestations of these explanation-aware backdoors for gradient- and propagation-based explanation methods in the image domain, before we resume to conduct a red-herring attack against malware classification.
The rigorous analysis of anonymous communication protocols and formal privacy goals have proven to be difficult to get right. Formal privacy notions as in the current state of the art based on indistinguishability games simplify analysis. Achieving them, however can incur prohibitively high overhead in terms of latency. Definitions based on function views, albeit less investigated, might imply less overhead but aren’t directly comparable to state of the art notions, due to differences in the model. In this paper, we bridge the worlds of indistinguishability game and function view based notions by introducing a new game: the "Exists INDistinguishability” (E·IND), a weak notion that corresponds to what is informally sometimes termed Plausible Deniability. By intuition, for every action in a system achieving plausible deniability there exists an equally plausible, alternative that results in observations that an adversary cannot tell apart. We show how this definition connects the early formalizations of privacy based on function views [13] to recent game-based definitions [15]. This enables us to link, analyze, and compare existing efforts in the field.
Physical isolation, so called air-gapping, is an effective method for protecting security-critical computers and networks. While it might be possible to introduce malicious code through the supply chain, insider attacks, or social engineering, communicating with the outside world is prevented. Different approaches to breach this essential line of defense have been developed based on electromagnetic, acoustic, and optical communication channels. However, all of these approaches are limited in either data rate or distance, and frequently offer only exfiltration of data. We present a novel approach to infiltrate data to air-gapped systems without any additional hardware on-site. By aiming lasers at already built-in LEDs and recording their response, we are the first to enable a long- distance (25 m), bidirectional, and fast (18.2 kbps in & 100 kbps out) covert communication channel. The approach can be used against any office device that operates LEDs at the CPU’s GPIO interface.