In our previous article, we covered the first step of transitioning from SOTA models to our own open-source models: breaking the agent pipeline down into the maximum reasonable number of nodes, so that each node runs its own model with its own prompt. The more nodes we have, the less intelligent each individual model needs to be, which opens up a wide selection of models we can use: qwen3-32b, gpt-oss-20b, and so on.
Now we face the challenge of improving prompting quality for our open models. We basically have two options: iterate manually or run some automation. This is where GEPA comes to our rescue.
GEPA is an open-source framework from the DSPy ecosystem that uses reflective evolution to optimize prompts. It follows a genetic approach with Pareto-based selection: it examines the program's execution trajectory, reflects on errors through an LLM, and builds a tree of evolved prompt candidates, accumulating improvements throughout the optimization process. We integrated GEPA into our pipeline for the nodes where quality drops; it takes execution traces from production, analyzes textual feedback (not just scalar metrics), and in 10-15 iterations produces a prompt that makes something like qwen3-4b deliver results on par with Claude Sonnet 4.5. At the same time, GEPA requires noticeably fewer computational resources than RL approaches like GRPO and can optimize multiple system components simultaneously: for example, the system prompt and the few-shot examples within a single node.
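To make the "textual feedback" part concrete, here is a minimal sketch of the kind of metric we plug into GEPA, written against DSPy's GEPA interface, which accepts a metric that returns a dspy.Prediction with score and feedback fields. The field names (input_message, output) mirror our dataset columns; the exact-match check is only an illustration, real metrics are node-specific.

```python
import dspy

def timeframe_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Score a prediction and explain the verdict in plain text.

    GEPA's reflection step reads the feedback string, so we describe
    *why* an answer is wrong instead of returning a bare 0/1.
    """
    correct = pred.output.strip().lower() == gold.output.strip().lower()
    if correct:
        feedback = "Correct timeframe extracted."
    else:
        feedback = (
            f"Expected '{gold.output}' but got '{pred.output}' "
            f"for the input: '{gold.input_message}'."
        )
    return dspy.Prediction(score=float(correct), feedback=feedback)
```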
We ran twelve nodes of our pipeline through GEPA, and the results were mixed. The most pleasant surprise was timeframe and table_chain, where accuracy jumped from 0.33 and 0.52, respectively, to 0.79 and 0.91, while we easily switched from Claude Haiku to qwen3-4b. The detect_language node didn't require optimization at all: right from the first run on qwen3-4b it showed 1.0 accuracy, which was expected for such a straightforward task.
But agent_choice became a problem, primarily due to the complexity of the selection task: value accuracy started at a miserable 0.01, and even after GEPA with a schema and SGR, we only squeezed out 0.57 review accuracy.
The planner (the planning node) also turned out to be tricky: accuracy dropped from 0.089 to 0.032 when we switched to the validation schema, and we haven't yet found a way to make small models handle this node. Similarly, criticize_stage and process_language require either manual validation or staying on SOTA models. Honestly, we realized that not every node is worth pulling onto small open-source models; a larger one, at least 32B parameters, should stay close at hand.
If you want to try GEPA yourself, getting started is pretty straightforward. Install the packages (pip install gepa dspy-ai) and prepare a dataset in a simple CSV format; for us, it's timeframe_dataset.csv with the columns input_message, output, and input_type.
Brief example:

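This is a minimal sketch of the whole loop for the timeframe node, assuming GEPA is used through DSPy, an OpenAI-compatible endpoint serving qwen3-4b, and the CSV columns described above; the endpoint URL, model identifiers, and the train/validation split are placeholders to adapt to your own setup.

```python
import csv
import dspy

# Point DSPy at an OpenAI-compatible endpoint serving the target model.
lm = dspy.LM("openai/qwen3-4b", api_base="http://localhost:8000/v1", api_key="local")
dspy.configure(lm=lm)

# Load timeframe_dataset.csv into dspy.Example objects.
def load_dataset(path="timeframe_dataset.csv"):
    examples = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            ex = dspy.Example(
                input_message=row["input_message"],
                input_type=row["input_type"],
                output=row["output"],
            ).with_inputs("input_message", "input_type")
            examples.append(ex)
    return examples

data = load_dataset()
trainset, valset = data[: len(data) // 2], data[len(data) // 2 :]

# The node under optimization: a single predictor with a simple signature.
program = dspy.ChainOfThought("input_message, input_type -> output")

# timeframe_metric is the feedback-returning metric sketched earlier in the article.
optimizer = dspy.GEPA(
    metric=timeframe_metric,
    auto="light",                            # small optimization budget
    reflection_lm=dspy.LM("openai/gpt-4o"),  # stronger model for the reflection step
)

optimized = optimizer.compile(program, trainset=trainset, valset=valset)
optimized.save("timeframe_optimized.json")
```

After the run, the optimized prompt lives inside the saved program, so the node can be redeployed on the small model without touching the rest of the pipeline.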