Summer Research Assistant: Attacks and Defenses in LLMs
Large Language Models (LLMs) are increasingly deployed in real-world AI systems, where they interact with users, external tools, and sensitive data. Despite recent advances in alignment and safety, LLM-powered systems remain vulnerable to a wide range of attacks, including data poisoning, backdoor insertion, prompt injection, and jailbreaks. These vulnerabilities pose serious risks to model reliability, privacy, and regulatory compliance.
This project focuses on identifying and understanding security and safety weaknesses in LLMs, as well as developing effective defense mechanisms to improve their trustworthiness. We will study how malicious behaviors can be introduced during training, fine-tuning, or deployment, and how they may be triggered through carefully crafted inputs or compliance-driven operations such as model unlearning. On the defense side, we aim to design techniques that enhance robustness, transparency, and interpretability of LLMs, enabling practitioners to better detect, analyze, and mitigate hidden threats.
Interns will gain hands-on experience with state-of-the-art LLMs, attack and defense methodologies, and experimental evaluation of AI safety mechanisms. The project is suitable for students interested in AI security, trustworthy machine learning, and foundation model research, and provides an opportunity to contribute to cutting-edge research at the intersection of machine learning, security, and responsible AI.