SnowBall: Iterative Context Processing When It Won't Fit in the LLM Window

Roman Panarin, ML Engineer
Where the problem comes from

At Enji.ai, we have an agent pipeline with 35 nodes: router_agent, planning_agent, language_detection_agent, agent_choice_agent, and 31 more. Each node pulls data through text2sql or RAG, collects results, and sends everything to the LLM as a single context. For small teams this works fine, but when Global Access mode kicks in, giving the agent access to all company projects at once, the context balloons.

Specifically, on one of our client projects with a team of 46+ people, a single workweek accumulates so much data from trackers and Git that the context exceeds 1,500K tokens. Meanwhile, we're using qwen3-32b through Groq, where the ceiling is 131,072 tokens. The request simply fails with an error if you don't do something about it.

But even if the window were larger, there's a second problem. The "Lost in the Middle" study (Stanford, published in TACL 2024) showed that models have a characteristic U-shaped performance curve: they handle information at the beginning and end of the context well but lose up to 30% accuracy on data from the middle. With our SQL results, where a key developer's worklogs might end up in the middle, this is a very real loss of response quality.

The idea: roll out context in portions

There's an approach called SnowBall, the "snowball effect." Instead of trying to cram everything into one call, we slice the context into chunks and process them sequentially, each time enriching the intermediate result with a new portion of data.

Essentially, this is the same pattern that LangChain calls Refine, iterative refinement. The difference is that for us it's not a separate call in a chain, but a transparent wrapper over ainvoke(). The developer calls the model as usual, and the system decides on its own whether to split the context.

Here's what the entry point looks like in LLMGenerator:

# llm_generator.py

async def ainvoke(self, messages, *args, **kwargs):
    system_msg, user_msg = self._decompose_messages(messages)
    user_msg_tokens = self.llm_client.get_num_tokens(user_msg)
    system_msg_tokens = self.llm_client.get_num_tokens(system_msg)
    
    if user_msg_tokens + system_msg_tokens < self.tokens_limit:
        return await self.llm_client.ainvoke(messages, *args, **kwargs)
    
    # context doesn't fit - switch to iterative processing
    return await self.snowball_ainvoke(system_msg, user_msg, *args, **kwargs)

The check works like this: we count tokens through get_num_tokens, compare with the model config limit (131,072 for qwen3-32b). If it fits, it's a regular call. If not, launch SnowBall.
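The _decompose_messages helper isn't shown above. A minimal sketch, assuming the pipeline passes LangChain SystemMessage and HumanMessage objects, could look like this:

from langchain_core.messages import HumanMessage, SystemMessage

def _decompose_messages(self, messages):
    # split the incoming list into one system prompt and one user payload
    system_msg = "\n".join(m.content for m in messages if isinstance(m, SystemMessage))
    user_msg = "\n".join(m.content for m in messages if isinstance(m, HumanMessage))
    return system_msg, user_msg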

How snowball_ainvoke works

The algorithm consists of two phases. First, we cut user_message into chunks, then run them sequentially, accumulating a summary:

async def snowball_ainvoke(self, system_message, user_message, *args, **kwargs):
    chunk_size = self.tokens_limit - int(self.tokens_limit * 0.2)  # 80% of limit
    chunks = self.get_chunks(user_message, chunk_size)
    
    # process first chunk together with system prompt
    initial_prompt = system_message + "\n\n" + chunks[0]
    snowball_summary = await self.llm_client.ainvoke(initial_prompt)
    
    # each subsequent chunk enriches the previous summary;
    # snowball_prompt is a prompt template that asks the model to refine
    # the accumulated summary using the new chunk
    for chunk in chunks[1:]:
        messages = snowball_prompt.format_messages(
            chunk=chunk,
            system_message=system_message,
            summary=snowball_summary.content
        )
        snowball_summary = await self.llm_client.ainvoke(messages)
    
    return snowball_summary

The 20% reserve from the limit is a buffer for the system prompt, for the snowball_prompt itself, and for serialization overhead. With a 131K limit, that gives chunks of about 105K tokens. For slicing, we use CharacterTextSplitter.from_tiktoken_encoder from LangChain, which measures chunk size in tokens rather than characters, so the split respects the actual token budget. Overlap between chunks is 20 tokens to avoid losing context at boundaries.
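get_chunks itself can be a thin wrapper over the splitter. A minimal sketch (the encoding name here is an assumption and should match the model's tokenizer):

from langchain_text_splitters import CharacterTextSplitter

def get_chunks(self, text: str, chunk_size: int) -> list[str]:
    splitter = CharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",  # assumed encoding for token counting
        chunk_size=chunk_size,        # measured in tokens, not characters
        chunk_overlap=20,             # small overlap so boundary context isn't lost
    )
    return splitter.split_text(text)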

In practice, for a typical Global Access request over a week for our client, you get 2-3 chunks. That's 2-3 sequential LLM calls instead of one that failed.

Two classes for different scenarios

In regular mode (system + user message), it's simple: we slice user_message into chunks by tokens. But when the LLM uses tools (SQL queries, RAG), the context is more complex: there's not just text, but a chain of SystemMessage, HumanMessage, AIMessage with tool_calls, and ToolMessage with results. You can't slice such a chain by tokens: you lose the connection between the tool call and its response.

For this, there's a separate class, BoundLLMGenerator. Its _build_message_batches method groups messages as a whole, trying not to break tool_call/tool_result pairs. Only if a single message itself exceeds the limit (happens when SQL returns a huge table) does it get sliced into chunks.

LLMGenerator                          BoundLLMGenerator
─────────────                         ──────────────────
ainvoke(messages)                      ainvoke(messages)
  ├─ regular call                        ├─ counts tokens of ALL messages
  └─ snowball_ainvoke()                  └─ _snowball_ainvoke()
       └─ chunks from user_message            └─ batches from whole messages

Splitting this into two classes keeps tool-batching logic out of the base generator and plain-text chunking out of the tool path. LLMGenerator.bind_tools(tools) returns BoundLLMGenerator, so switching between modes happens automatically.
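To make the batching idea concrete, here is a rough sketch of what _build_message_batches does. The real method also handles the oversized-single-message case; the names and checks here are illustrative:

from langchain_core.messages import ToolMessage

def _build_message_batches(self, messages, batch_limit):
    # group whole messages into batches under the token budget, keeping each
    # ToolMessage in the same batch as the tool call that produced it
    batches, current, current_tokens = [], [], 0
    for msg in messages:
        msg_tokens = self.llm_client.get_num_tokens(str(msg.content))
        opens_new_batch = (
            current
            and not isinstance(msg, ToolMessage)  # never split a tool_call/tool_result pair
            and current_tokens + msg_tokens > batch_limit
        )
        if opens_new_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(msg)
        current_tokens += msg_tokens
    if current:
        batches.append(current)
    return batches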

What to keep in mind

Latency grows linearly with the number of chunks. A 200K token context means 5-6 sequential calls. For analytical queries from a manager, where the response is needed in seconds or minutes rather than milliseconds, this is acceptable. For real-time chat, not so much.

There's information loss between iterations. Each summarization step loses something, and on long chains (5+ chunks), this accumulates. We haven't yet encountered critical degradation on our data, but for tasks requiring precise numerical aggregation (hour totals, task counts), this is a potential problem. In such cases, hierarchical summarization works better, where aggregations are calculated at each level separately; this is well covered in the CoTHSSum study (Springer, 2025).

Another point is graceful degradation. If one chunk fails with an error (Groq timeout, invalid JSON in response), the loop continues with the previous summary. We lose information from that chunk, but we don't lose the entire response.
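In code this is just a try/except around the per-chunk call inside snowball_ainvoke; a simplified sketch:

# inside snowball_ainvoke
for chunk in chunks[1:]:
    messages = snowball_prompt.format_messages(
        chunk=chunk,
        system_message=system_message,
        summary=snowball_summary.content
    )
    try:
        snowball_summary = await self.llm_client.ainvoke(messages)
    except Exception:  # Groq timeout, malformed JSON in the response, etc.
        # keep the previous summary and move on to the next chunk
        continue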

Why not LLMLingua or Map-Reduce

We considered alternatives. Microsoft's LLMLingua compresses prompts by removing non-essential tokens through a small compressor model (GPT2-small or LLaMA-7B). It works great on text with "fluff," achieving up to 20x compression. But our data is SQL results: tables with fields like employee name, detail, and hours. There, every token carries semantic load, and aggressive compression cuts out important information we're not willing to lose.

Map-Reduce could help with parallelization. Process chunks simultaneously, then merge results. But our context doesn't break down into independent pieces. One developer's worklogs might be in the first chunk, and related tasks in the second. Map-Reduce would lose this connection, while Refine/SnowBall preserves it because the summary accumulates.

Gisting is a beautiful idea (26x compression, 40% FLOP savings), but requires fine-tuning the model on our data. For a startup that iterates on the product every week and changes prompts, this isn't an option yet. But we're generally thinking about our own Enji LLM model and might apply Gisting there.

Production configuration

All open-source models in the Enji pipeline run through Groq with qwen3-32b. Groq today is the only inference provider that supports the full 131K window for this model (the model itself natively works at 32K and extends to 131K through YaRN).

router_model = { model = "qwen/qwen3-32b", temperature = 0.1, tokens_limit = 131072 }

Meanwhile, tokens_limit isn't just an API restriction but a threshold for enabling SnowBall. If you switch to a provider with a larger window, SnowBall will trigger less frequently or not at all. No code changes needed.
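How the value gets wired in is a detail of our setup; a hedged sketch (the constructor parameters here are assumptions):

# tokens_limit from the model config doubles as the SnowBall threshold
generator = LLMGenerator(
    llm_client=groq_client,
    tokens_limit=config["router_model"]["tokens_limit"],  # 131072 for qwen3-32b on Groq
)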

What's next

SnowBall solves a specific problem: it lets you work with context that physically won't fit in the model window. It's not the most efficient compression method, nor the fastest, and information is lost on each iteration. But for our use case (analytical queries on large teams through an agent pipeline) it's a working solution that doesn't require additional infrastructure and doesn't change the interface for developers.

MIT recently proposed Recursive Language Models, an approach where the model can recursively access the full uncompressed context instead of summarizing it. Benchmarks show 91% accuracy on 10M+ tokens. When this becomes available in production inference, SnowBall will likely become unnecessary. But as long as context windows are finite and data keeps growing, iterative processing works.
