Research Reveals Challenges in Deleting Sensitive Information from Language Models
A recent research paper from researchers at the University of North Carolina (UNC) highlights the challenges of deleting sensitive information from large language models (LLMs). The paper examines the limitations of reinforcement learning from human feedback (RLHF) and of model editing methods, showing that neither fully removes factual information from a model's weights.
The Shortcomings of RLHF
According to the UNC researchers, one major shortcoming of RLHF is that a model trained to refuse certain questions can still retain the underlying knowledge. In other words, a model may decline to answer a query about a harmful activity while still "knowing" how to describe it, which means the information can potentially be surfaced through other prompts.
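To make the distinction concrete, here is a minimal sketch (not from the paper) of how one could check whether a model still assigns high probability to a sensitive answer even when its generated output is a refusal. The model name, question, and answer below are placeholders; the same scoring works for any causal language model loaded through Hugging Face transformers.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the paper uses GPT-J, and an RLHF-tuned chat model
# could be scored the same way.
model_name = "EleutherAI/gpt-j-6b"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Q: What is John Smith's home address? A:"  # hypothetical sensitive question
answer = " 42 Wallaby Way, Sydney"                   # hypothetical sensitive answer

# Score the sensitive continuation directly. A high log-probability means the
# knowledge is still encoded in the weights, even if sampling yields a refusal.
ids = tok(prompt + answer, return_tensors="pt").input_ids
prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
with torch.no_grad():
    logits = model(ids).logits
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
answer_logprob = log_probs[prompt_len - 1:].gather(1, ids[0, prompt_len:, None]).sum()
print(f"log P(sensitive answer | prompt) = {answer_logprob.item():.2f}")
```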
Failure of Model Editing Methods
The study found that even state-of-the-art model editing techniques, such as Rank-One Model Editing (ROME), cannot completely delete factual information from LLMs. Whitebox attacks, which have access to the model's weights and intermediate activations, recovered supposedly deleted facts 38% of the time, while blackbox attacks, which only see the model's outputs, succeeded 29% of the time.
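The whitebox attacks work because traces of an edited fact can linger in the model's intermediate computations. Below is a rough, logit-lens-style sketch in that spirit, not the authors' exact code: project each layer's hidden state through the language-modeling head and check whether the "deleted" answer still appears among the top-k next-token candidates at any layer. The prompt and answer are stand-ins, and a smaller model (e.g., gpt2) can be substituted to keep memory requirements modest.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6b"  # the model used in the study
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "The capital of France is"  # stand-in for a prompt whose answer was "deleted"
deleted_answer_id = tok(" Paris", add_special_tokens=False).input_ids[0]

with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)

k = 20
leaky_layers = []
for layer, hidden in enumerate(out.hidden_states):
    h = model.transformer.ln_f(hidden[0, -1])         # final layer norm ("logit lens")
    candidate_ids = model.lm_head(h).topk(k).indices  # top-k next-token candidates
    if deleted_answer_id in candidate_ids:
        leaky_layers.append(layer)

print("answer recoverable from hidden states at layers:", leaky_layers)
```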
The Role of GPT-J
The researchers ran their experiments on GPT-J, a model with 6 billion parameters; by comparison, GPT-3.5 is reported to have around 175 billion. If deletion is already this difficult in a relatively small model, removing unwanted data from much larger models such as GPT-3.5 is likely to be even more challenging.
Defense Methods and Limitations
The team also developed defense methods to protect LLMs against extraction attacks, in which an attacker tries to recover supposedly deleted information from the model. However, the researchers acknowledge that defenses may always lag behind new attack techniques, so the problem of reliably deleting sensitive data remains open.
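The paper's actual defenses are more involved, but one way to picture the idea is to treat an extraction attack as a verification step for the edit itself. The sketch below assumes a hypothetical helper answer_is_recoverable() wrapping a whitebox check like the one above, and a placeholder edit_fn standing in for a ROME-style edit; it illustrates the loop, not the authors' method.

```python
def delete_fact_with_verification(model, prompt, answer_id, edit_fn,
                                  answer_is_recoverable, k=20, max_rounds=5):
    """Re-apply `edit_fn` (e.g., a ROME-style edit) until a whitebox
    extraction check no longer recovers the deleted answer."""
    for _ in range(max_rounds):
        if not answer_is_recoverable(model, prompt, answer_id, k=k):
            return True           # edit survives the extraction check
        model = edit_fn(model)    # re-edit and try again
    return False                  # defense gave up; answer is still extractable
```

Even with a loop like this, the article's caveat applies: a new probing strategy may still find residual traces that the verification check misses.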
Hot Take: Ensuring Privacy and Security in Language Models
This research underscores the difficulties in deleting sensitive information from large language models. As these models become more powerful and widely used, it is crucial to prioritize privacy and security measures. Further advancements in defense methods are necessary to keep up with evolving attack techniques and protect against the potential misuse of language models.