Catastrophic Forgetting in LLMs

Recent research has shed light on a critical challenge facing Large Language Models (LLMs): the phenomenon known as catastrophic forgetting (CF). This issue, also referred to as model drift, describes the tendency of LLMs to lose previously acquired knowledge as they assimilate new information.

Recent Studies Highlight Performance Degradation

Two recent publications have brought attention to a concerning trend in LLMs: not only do these models exhibit drift, but they also experience a decline in performance over time. This revelation has significant implications for Generative Applications (Gen-Apps) and LLM-based Conversational UIs, which rely heavily on the stability and consistency of their underlying models.

The Non-Deterministic Nature of LLMs

While the non-deterministic behavior of LLMs—producing varied outputs for identical inputs—is well-known, recent studies have demonstrated that models undergo more substantial changes over time. Contrary to expectations of improvement, these changes often result in performance degradation.
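To make the distinction concrete, here is a minimal sketch (an illustration, not taken from the studies discussed here) of sampling-based non-determinism using an open-source model through Hugging Face transformers: with sampling enabled, repeated calls on an identical prompt can return different completions. The "gpt2" checkpoint is only a convenient small example.

```python
# Minimal sketch of sampling-based non-determinism: the same prompt can
# yield different completions when token sampling is enabled.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM works for this demo
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Catastrophic forgetting happens when"
inputs = tokenizer(prompt, return_tensors="pt")

for run in range(3):
    # do_sample=True draws each token from the model's distribution,
    # so identical inputs can produce different outputs on each call.
    output = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.9,
        max_new_tokens=20,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(f"run {run}:", tokenizer.decode(output[0], skip_special_tokens=True))
```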

Defining Catastrophic Forgetting

Catastrophic forgetting refers to the LLMs' propensity to lose or forget previously learned information when trained on new data or fine-tuned for specific tasks. This phenomenon likely stems from limitations in the training process, which tends to prioritize recent data or tasks at the expense of earlier knowledge.
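As a toy illustration of this definition (a self-contained sketch on synthetic data, not the setup used in the cited research), the snippet below trains a small PyTorch classifier on task A, then fine-tunes it only on task B, and watches task-A accuracy collapse: catastrophic forgetting in miniature.

```python
# Toy demonstration of catastrophic forgetting: sequential training on a
# second task erases performance on the first.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(offset):
    # Two Gaussian blobs per task; the offset shifts where the boundary lies.
    x0 = torch.randn(500, 2) + torch.tensor([offset, 0.0])
    x1 = torch.randn(500, 2) + torch.tensor([offset + 3.0, 0.0])
    y = torch.cat([torch.zeros(500), torch.ones(500)]).long()
    return torch.cat([x0, x1]), y

def train(model, x, y, steps=300):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

def accuracy(model, x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
xa, ya = make_task(offset=0.0)   # task A
xb, yb = make_task(offset=-6.0)  # task B, whose decision boundary lies elsewhere

train(model, xa, ya)
print("Task A accuracy after training on A:", accuracy(model, xa, ya))

train(model, xb, yb)             # fine-tune on task B only, with no replay of A
print("Task A accuracy after fine-tuning on B:", accuracy(model, xa, ya))
```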

Evaluating Model Drift: GPT-3.5 and GPT-4

A comparative study conducted in March and June 2023 on GPT-3.5 and GPT-4 revealed significant variations in performance and behavior across diverse tasks. Notable findings include the following (a sketch of this kind of before/after evaluation appears after the list):

  • GPT-4's accuracy in identifying prime numbers decreased from 84% to 51% between March and June 2023.
  • GPT-3.5 showed improvements in certain tasks from March to June.
  • GPT-4 exhibited reduced willingness to address sensitive topics in June compared to March.
  • Both models demonstrated increased formatting errors in code generation in June.
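A sketch of how such a before/after check could be wired up for the prime-number task is shown below. It is illustrative only: `query_model` is a hypothetical placeholder for whatever API client is used, and the snapshot names and test numbers are assumptions rather than the study's own.

```python
# Hypothetical before/after evaluation on the prime-number task.
from sympy import isprime

def query_model(snapshot: str, question: str) -> str:
    """Hypothetical stand-in: send `question` to the model version
    `snapshot` and return its raw text answer ("Yes"/"No")."""
    raise NotImplementedError

def prime_accuracy(snapshot: str, numbers: list[int]) -> float:
    correct = 0
    for n in numbers:
        answer = query_model(snapshot, f"Is {n} a prime number? Answer Yes or No.")
        predicted_prime = answer.strip().lower().startswith("yes")
        correct += predicted_prime == isprime(n)
    return correct / len(numbers)

numbers = list(range(1000, 1100))                      # illustrative test set
march = prime_accuracy("gpt-4-march-2023", numbers)    # placeholder snapshot names
june = prime_accuracy("gpt-4-june-2023", numbers)
print(f"March: {march:.0%}  June: {june:.0%}  drift: {june - march:+.0%}")
```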

Chain-of-Thought Prompting and Model Performance

The study highlighted changes in the models' ability to leverage Chain-of-Thought (CoT) prompting (a brief example of the two prompting styles follows the list):

  • GPT-4's CoT effectiveness for prime number identification decreased significantly from March to June.
  • Conversely, GPT-3.5 showed improved CoT utilization over the same period.
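For reference, here is a minimal sketch of the two prompting styles. The wording is illustrative rather than the study's exact prompt, and 17077 is simply an example number (it is prime).

```python
# Direct prompting vs. Chain-of-Thought prompting for the same question.
question = "Is 17077 a prime number?"

# Direct prompting: ask for the final answer only.
direct_prompt = f"{question} Answer with Yes or No only."

# Chain-of-Thought prompting: ask the model to reason before answering.
cot_prompt = (
    f"{question}\n"
    "Let's think step by step: check divisibility by each prime up to the "
    "square root of 17077, then conclude with Yes or No."
)

print(direct_prompt)
print()
print(cot_prompt)
```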


How is ChatGPT's performance changing over time?

The schematic below shows the fluctuation in model accuracy over a period of four months. In some cases the degradation is quite stark, amounting to more than a 60% loss in accuracy.


Figure: Fluctuations in model accuracy over a period of four months

Long-term Implications and Mitigation Strategies

The research underscores the need for continuous monitoring of LLMs, since their behavior can change between releases. Evidence suggests that GPT-4's ability to follow user instructions decreased over time, contributing to behavioral drift.
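In practice, continuous monitoring can be as simple as scoring every new model snapshot against a fixed regression suite and flagging drops relative to a recorded baseline. The sketch below assumes a hypothetical run_eval_suite helper and illustrative threshold values.

```python
# Hedged sketch of drift monitoring: compare each snapshot's score on a
# fixed prompt suite against a recorded baseline and flag large drops.
from datetime import date

BASELINE_ACCURACY = 0.84   # illustrative: the score recorded on your own suite
ALERT_THRESHOLD = 0.05     # flag drops of more than 5 percentage points

def run_eval_suite(snapshot: str) -> float:
    """Hypothetical: run the fixed prompt suite against `snapshot`
    and return its overall accuracy."""
    raise NotImplementedError

def check_drift(snapshot: str) -> None:
    score = run_eval_suite(snapshot)
    drop = BASELINE_ACCURACY - score
    status = "DRIFT ALERT" if drop > ALERT_THRESHOLD else "ok"
    print(f"{date.today()} {snapshot}: accuracy={score:.2%} ({status})")
```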

Conclusion: The Path Forward

The study on catastrophic forgetting during continual fine-tuning of LLMs revealed that CF is a pervasive issue, with larger models experiencing more severe forgetting in domain knowledge, reasoning, and reading comprehension. However, the research also indicates that instruction tuning may offer a potential strategy to mitigate the CF problem, opening avenues for future improvements in LLM stability and performance.
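One common mitigation in this spirit (a hedged sketch of a general technique, not the specific recipe from the paper) is to mix general instruction-following examples, or replayed samples from earlier training data, into the task-specific fine-tuning set rather than training on the new task alone. The mixing ratio below is an illustrative assumption.

```python
# Sketch of data mixing / replay to soften forgetting during fine-tuning.
import random

def build_finetuning_mix(new_task_examples, general_instruction_examples,
                         replay_ratio=0.3, seed=0):
    """Return a shuffled training set in which roughly `replay_ratio`
    (relative to the new-task set) comes from general instruction data."""
    rng = random.Random(seed)
    n_replay = int(len(new_task_examples) * replay_ratio)
    replay = rng.sample(general_instruction_examples,
                        min(n_replay, len(general_instruction_examples)))
    mixed = list(new_task_examples) + replay
    rng.shuffle(mixed)
    return mixed

# Illustrative usage with toy string examples:
new_task = [f"task example {i}" for i in range(10)]
general = [f"instruction example {i}" for i in range(100)]
print(build_finetuning_mix(new_task, general)[:5])
```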

If you are an AI enthusiast who likes to read about the nuances of AI, or you are venturing into a career in AI, Data Science, Machine Learning, or Generative AI, then this newsletter is for you. Subscribe to this newsletter and the YouTube channel AccelerateAICareers to stay tuned for new content. If you enjoyed this edition, share it with your network!
