New Method Keeps Chatbots Sharp in Long Conversations
- When chatbots like ChatGPT carry on long conversations, their performance can deteriorate as the key-value (KV) cache that stores recent conversation tokens fills up. Researchers found that keeping the first few tokens in the cache prevents this breakdown.
- Those first cache tokens act as "attention sinks" that keep the model's attention dynamics stable, so they should never be evicted. Keeping four attention sinks is enough for optimal performance (a minimal sketch of this cache policy appears after this list).
- The resulting method, StreamingLLM, runs more than 22 times faster on long conversations than methods that maintain performance by recomputing parts of the earlier conversation.
- StreamingLLM could enable chatbots to have all-day conversations without needing constant reboots.
- The method could open up new chatbot applications, such as copywriting, editing, or coding.
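To make the cache policy concrete, here is a minimal Python sketch of an attention-sink eviction scheme under the assumptions described above: the cache entries for the first four tokens are pinned, and all later entries live in a bounded sliding window. The names (`SinkCache`, `append`, `view`, `num_sinks`, `window`) are illustrative assumptions, not the actual StreamingLLM API.

```python
from collections import deque

class SinkCache:
    """Toy eviction policy in the spirit of StreamingLLM (illustrative only).

    The first `num_sinks` tokens are pinned as "attention sinks" and are
    never evicted; all later tokens share a fixed-size sliding window.
    """

    def __init__(self, num_sinks: int = 4, window: int = 8):
        self.num_sinks = num_sinks
        self.sinks = []                      # pinned entries, never evicted
        self.recent = deque(maxlen=window)   # rolling window; oldest drops off

    def append(self, kv_entry) -> None:
        # The first few tokens become permanent attention sinks;
        # everything after that goes into the bounded window.
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv_entry)
        else:
            self.recent.append(kv_entry)

    def view(self) -> list:
        # What the model attends over: sinks plus the most recent tokens.
        return self.sinks + list(self.recent)

# Usage: even after 100,000 tokens, the cache stays bounded and the
# four sink tokens are still present at the front.
cache = SinkCache(num_sinks=4, window=8)
for t in range(100_000):
    cache.append(f"kv_{t}")
print(cache.view()[:5])   # ['kv_0', 'kv_1', 'kv_2', 'kv_3', 'kv_99992']
print(len(cache.view()))  # 12  (4 sinks + 8-token window)
```

Because the cache never grows past a fixed size and the sink tokens are always present, memory stays constant and no past text ever needs to be recomputed, which is where the speedup over recomputation-based methods comes from.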