Adversarial Attacks in Natural Language Processing Systems

Date 01/04/2021 - 31/03/2022
Type Device & System Security, Machine Learning, Government & Humanitarian
Partner Cyber-Defence Campus (armasuisse)
Partner contact Ljiljana Dolamic
EPFL Laboratory Signal Processing Laboratory (LTS4)
EPFL contact Prof. Pascal Frossard, Sahar Sadrizadeh

Deep neural networks (DNNs) have recently been applied in many domains, such as computer vision and Natural Language Processing (NLP), thanks to their strong performance. NLP systems are now employed in a wide variety of tasks, including machine translation, natural language inference, and sentiment classification. Transformer-based models have pushed NLP performance even further, which has led to their adoption in many areas, including sensitive applications. However, DNN models largely remain black boxes, and their interpretability is still an open question. Before relying on them in security-critical applications, it is essential to be confident about their robustness.

DNN models have been shown to be highly vulnerable to small changes in their input. Adversarial examples differ only slightly from the original input, yet they can degrade the performance of deep-learning-based methods and cause them to produce erroneous outputs. The sensitivity of NLP models to adversarial perturbations can be costly in real-world applications: in 2017, for example, a mistranslation by Facebook's machine translation system led to a wrongful arrest.
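As an illustration of the kind of small input change involved (this sketch is not part of the project and assumes the Hugging Face transformers library with its default sentiment model), the snippet below applies a toy character-level perturbation, swapping a lowercase "l" for a visually similar uppercase "I", and compares the classifier's outputs. The edit may or may not flip the prediction, but it shows the space of near-imperceptible edits that adversarial attacks search over.

```python
# Minimal sketch: a visually near-imperceptible character swap fed to an
# off-the-shelf sentiment classifier. Assumes the `transformers` library
# and its default sentiment-analysis model are available.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

original = "This film was a wonderful surprise from start to finish."
# Replace the lowercase 'l' in "film" with an uppercase 'I'; the two
# strings look almost identical to a human reader.
perturbed = original.replace("film", "fiIm")

for text in (original, perturbed):
    result = classifier(text)[0]
    print(f"{text!r} -> {result['label']} ({result['score']:.3f})")
```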

In this project, we aim to generate adversarial examples that fool state-of-the-art DNN models for NLP tasks, and in particular for machine translation. Many methods have been proposed to craft such examples for image data; however, they are not readily applicable to NLP because of the discrete nature of text and the difficulty of defining imperceptibility for textual data. By designing an algorithm to generate adversarial examples, we seek to analyze the vulnerability of NLP systems, understand their behavior by explaining the existence of such examples, and ultimately contribute to the interpretability of these models. Furthermore, we plan to investigate whether adversarial perturbations transfer across languages, a particularly interesting question since most existing work focuses on English.
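To make the general idea of attacking a translation model concrete, the sketch below implements a simplified greedy word-substitution attack. It is only an illustration under stated assumptions, not the project's method: it assumes the Hugging Face transformers library with the public Helsinki-NLP/opus-mt-en-de Marian model, a hand-picked list of candidate synonyms, and a crude token-overlap score standing in for the attack objective and imperceptibility constraints that a real attack would define more carefully.

```python
# Simplified greedy word-substitution attack on a translation model.
# The model name, candidate lists, and overlap score are illustrative.
from transformers import MarianMTModel, MarianTokenizer

MODEL_NAME = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME)


def translate(sentence: str) -> str:
    """Translate an English sentence to German with the target model."""
    batch = tokenizer([sentence], return_tensors="pt")
    generated = model.generate(**batch, max_new_tokens=64)
    return tokenizer.decode(generated[0], skip_special_tokens=True)


def overlap(reference: str, hypothesis: str) -> float:
    """Crude similarity proxy: fraction of reference tokens kept in the hypothesis."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    return sum(tok in hyp for tok in ref) / max(len(ref), 1)


def greedy_attack(sentence: str, candidates: dict) -> str:
    """Greedily replace words with the candidate that most changes the translation."""
    reference = translate(sentence)
    words = sentence.split()
    for i, word in enumerate(words):
        best_word = word
        best_score = overlap(reference, translate(" ".join(words)))
        for alt in candidates.get(word.lower(), []):
            trial = words[:i] + [alt] + words[i + 1:]
            score = overlap(reference, translate(" ".join(trial)))
            if score < best_score:  # lower overlap = larger change in output
                best_word, best_score = alt, score
        words[i] = best_word
    return " ".join(words)


if __name__ == "__main__":
    sentence = "The president will visit the capital next week."
    # Hand-picked near-synonyms; a real attack would draw candidates from
    # embeddings or a language model to keep the change imperceptible.
    candidates = {"president": ["chairman", "leader"], "visit": ["tour", "see"]}
    adversarial = greedy_attack(sentence, candidates)
    print("original   :", sentence, "->", translate(sentence))
    print("adversarial:", adversarial, "->", translate(adversarial))
```

In this toy setup the attack succeeds when a few near-synonym swaps substantially change the German output; the discrete search over word substitutions and the need for a meaning-preservation constraint are exactly what make the textual setting harder than the image case.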