Efficient Streaming Language Models with Attention Sinks

Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite-length attention window to generalize to infinite sequence lengths without any fine-tuning.
StreamingLLM is optimized for streaming applications such as multi-round dialogues. It is ideal for scenarios where a model must operate continually without requiring extensive memory or dependence on past data.
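The core idea can be sketched as a KV-cache eviction policy: always retain the key/value entries of the first few tokens (the "attention sinks") plus a sliding window of the most recent tokens, evicting everything in between. The class name `SinkCache` and the default sizes below are illustrative assumptions, not the repository's actual API:

```python
from collections import deque

class SinkCache:
    """Illustrative sketch of StreamingLLM's cache policy: keep the first
    `num_sinks` tokens' KV entries forever, plus a rolling window of the
    `window_size` most recent entries. Not the real streaming-llm API."""

    def __init__(self, num_sinks=4, window_size=1020):
        self.num_sinks = num_sinks
        self.sinks = []                           # sink-token entries, never evicted
        self.recent = deque(maxlen=window_size)   # rolling window; oldest auto-evicted

    def append(self, kv_entry):
        # The earliest tokens become permanent attention sinks;
        # later tokens flow through the bounded recent-window deque.
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv_entry)
        else:
            self.recent.append(kv_entry)

    def cache(self):
        # Entries the next decoding step is allowed to attend to.
        return self.sinks + list(self.recent)


cache = SinkCache(num_sinks=2, window_size=4)
for token in range(10):        # stream ten token entries through the cache
    cache.append(token)
print(cache.cache())           # sinks [0, 1] plus the last four tokens
```

Because the cache size is bounded by `num_sinks + window_size`, memory stays constant no matter how long the stream runs, which is what enables continual operation without extensive memory.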