Catastrophic Forgetting in LLMs

Recent research has shed light on a critical challenge facing Large Language Models (LLMs): the phenomenon known as catastrophic forgetting (CF). This issue, also referred to as model drift, describes the tendency of LLMs to lose previously acquired knowledge as they assimilate new information.

Recent Studies Highlight Performance Degradation

Two recent publications have brought attention to a concerning trend in LLMs: not only do these models exhibit drift, but they also experience a decline in performance over time. This revelation has significant implications for Generative Applications (Gen-Apps) and LLM-based Conversational UIs, which rely heavily on the stability and consistency of their underlying models.

The Non-Deterministic Nature of LLMs

While the non-deterministic behavior of LLMs—producing varied outputs for identical inputs—is well-known, recent studies have demonstrated that models undergo more substantial changes over time. Contrary to expectations of improvement, these changes often result in performance degradation.
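To make the distinction concrete, here is a minimal sketch (an illustration, not taken from the studies discussed here) of sampling-based non-determinism using an open-source model through Hugging Face transformers: with sampling enabled, repeated calls on an identical prompt can return different completions. The "gpt2" checkpoint is only a convenient small example.

```python
# Minimal sketch of sampling-based non-determinism: the same prompt can
# yield different completions when token sampling is enabled.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM works for this demo
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Catastrophic forgetting happens when"
inputs = tokenizer(prompt, return_tensors="pt")

for run in range(3):
    # do_sample=True draws each token from the model's distribution,
    # so identical inputs can produce different outputs on each call.
    output = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.9,
        max_new_tokens=20,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(f"run {run}:", tokenizer.decode(output[0], skip_special_tokens=True))
```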

Defining Catastrophic Forgetting

Catastrophic forgetting refers to the LLMs' propensity to lose or forget previously learned information when trained on new data or fine-tuned for specific tasks. This phenomenon likely stems from limitations in the training process, which tends to prioritize recent data or tasks at the expense of earlier knowledge.
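As a toy illustration of this definition (a self-contained sketch on synthetic data, not the setup used in the cited research), the snippet below trains a small PyTorch classifier on task A, then fine-tunes it only on task B, and watches task-A accuracy collapse: catastrophic forgetting in miniature.

```python
# Toy demonstration of catastrophic forgetting: sequential training on a
# second task erases performance on the first.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(offset):
    # Two Gaussian blobs per task; the offset shifts where the boundary lies.
    x0 = torch.randn(500, 2) + torch.tensor([offset, 0.0])
    x1 = torch.randn(500, 2) + torch.tensor([offset + 3.0, 0.0])
    y = torch.cat([torch.zeros(500), torch.ones(500)]).long()
    return torch.cat([x0, x1]), y

def train(model, x, y, steps=300):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

def accuracy(model, x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
xa, ya = make_task(offset=0.0)   # task A
xb, yb = make_task(offset=-6.0)  # task B, whose decision boundary lies elsewhere

train(model, xa, ya)
print("Task A accuracy after training on A:", accuracy(model, xa, ya))

train(model, xb, yb)             # fine-tune on task B only, with no replay of A
print("Task A accuracy after fine-tuning on B:", accuracy(model, xa, ya))
```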

Evaluating Model Drift: GPT-3.5 and GPT-4

A comparative study conducted in March and June 2023 on GPT-3.5 and GPT-4 revealed significant variations in performance and behavior across diverse tasks. Notable findings include the following (a sketch of this kind of before/after evaluation appears after the list):

  • GPT-4's accuracy in identifying prime numbers decreased from 84% to 51% between March and June 2023.
  • GPT-3.5 showed improvements in certain tasks from March to June.
  • GPT-4 exhibited reduced willingness to address sensitive topics in June compared to March.
  • Both models demonstrated increased formatting errors in code generation in June.
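A sketch of how such a before/after check could be wired up for the prime-number task is shown below. It is illustrative only: `query_model` is a hypothetical placeholder for whatever API client is used, and the snapshot names and test numbers are assumptions rather than the study's own.

```python
# Hypothetical before/after evaluation on the prime-number task.
from sympy import isprime

def query_model(snapshot: str, question: str) -> str:
    """Hypothetical stand-in: send `question` to the model version
    `snapshot` and return its raw text answer ("Yes"/"No")."""
    raise NotImplementedError

def prime_accuracy(snapshot: str, numbers: list[int]) -> float:
    correct = 0
    for n in numbers:
        answer = query_model(snapshot, f"Is {n} a prime number? Answer Yes or No.")
        predicted_prime = answer.strip().lower().startswith("yes")
        correct += predicted_prime == isprime(n)
    return correct / len(numbers)

numbers = list(range(1000, 1100))                      # illustrative test set
march = prime_accuracy("gpt-4-march-2023", numbers)    # placeholder snapshot names
june = prime_accuracy("gpt-4-june-2023", numbers)
print(f"March: {march:.0%}  June: {june:.0%}  drift: {june - march:+.0%}")
```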

Chain-of-Thought Prompting and Model Performance

The study highlighted changes in the models' ability to leverage Chain-of-Thought (CoT) prompting (a brief example of the two prompting styles follows the list):

  • GPT-4's CoT effectiveness for prime number identification decreased significantly from March to June.
  • Conversely, GPT-3.5 showed improved CoT utilization over the same period.
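For reference, here is a minimal sketch of the two prompting styles. The wording is illustrative rather than the study's exact prompt, and 17077 is simply an example number (it is prime).

```python
# Direct prompting vs. Chain-of-Thought prompting for the same question.
question = "Is 17077 a prime number?"

# Direct prompting: ask for the final answer only.
direct_prompt = f"{question} Answer with Yes or No only."

# Chain-of-Thought prompting: ask the model to reason before answering.
cot_prompt = (
    f"{question}\n"
    "Let's think step by step: check divisibility by each prime up to the "
    "square root of 17077, then conclude with Yes or No."
)

print(direct_prompt)
print()
print(cot_prompt)
```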


How is ChatGPT's performance changing over time?

The schematic below shows the fluctuation in model accuracy over a period of four months. In some cases the degradation is quite stark, amounting to more than a 60% loss in accuracy.


Figure: Fluctuations in model accuracy over a period of four months

Long-term Implications and Mitigation Strategies

The research underscores the need for continuous monitoring of LLMs, since their behavior can change between releases. Evidence suggests that GPT-4's ability to follow user instructions decreased over time, contributing to behavioral drift.
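In practice, continuous monitoring can be as simple as scoring every new model snapshot against a fixed regression suite and flagging drops relative to a recorded baseline. The sketch below assumes a hypothetical run_eval_suite helper and illustrative threshold values.

```python
# Hedged sketch of drift monitoring: compare each snapshot's score on a
# fixed prompt suite against a recorded baseline and flag large drops.
from datetime import date

BASELINE_ACCURACY = 0.84   # illustrative: the score recorded on your own suite
ALERT_THRESHOLD = 0.05     # flag drops of more than 5 percentage points

def run_eval_suite(snapshot: str) -> float:
    """Hypothetical: run the fixed prompt suite against `snapshot`
    and return its overall accuracy."""
    raise NotImplementedError

def check_drift(snapshot: str) -> None:
    score = run_eval_suite(snapshot)
    drop = BASELINE_ACCURACY - score
    status = "DRIFT ALERT" if drop > ALERT_THRESHOLD else "ok"
    print(f"{date.today()} {snapshot}: accuracy={score:.2%} ({status})")
```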

Conclusion: The Path Forward

The study on catastrophic forgetting during continual fine-tuning of LLMs revealed that CF is a pervasive issue, with larger models experiencing more severe forgetting in domain knowledge, reasoning, and reading comprehension. However, the research also indicates that instruction tuning may offer a potential strategy to mitigate the CF problem, opening avenues for future improvements in LLM stability and performance.
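One common mitigation in this spirit (a hedged sketch of a general technique, not the specific recipe from the paper) is to mix general instruction-following examples, or replayed samples from earlier training data, into the task-specific fine-tuning set rather than training on the new task alone. The mixing ratio below is an illustrative assumption.

```python
# Sketch of data mixing / replay to soften forgetting during fine-tuning.
import random

def build_finetuning_mix(new_task_examples, general_instruction_examples,
                         replay_ratio=0.3, seed=0):
    """Return a shuffled training set in which roughly `replay_ratio`
    (relative to the new-task set) comes from general instruction data."""
    rng = random.Random(seed)
    n_replay = int(len(new_task_examples) * replay_ratio)
    replay = rng.sample(general_instruction_examples,
                        min(n_replay, len(general_instruction_examples)))
    mixed = list(new_task_examples) + replay
    rng.shuffle(mixed)
    return mixed

# Illustrative usage with toy string examples:
new_task = [f"task example {i}" for i in range(10)]
general = [f"instruction example {i}" for i in range(100)]
print(build_finetuning_mix(new_task, general)[:5])
```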

If you are an AI enthusiast who likes to read about the nuances of AI, or you are venturing into a career in AI, Data Science, Machine Learning, or Generative AI, then this newsletter is for you. Subscribe to this newsletter and the YouTube channel AccelerateAICareers to stay tuned for new content. If you enjoyed this edition, share it with your network!
