In our previous article, we covered the first step of transitioning from SOTA models to our own open-source models: breaking the agent pipeline down into the maximum reasonable number of nodes, so that each node runs its own model with its own prompt. The more nodes we have, the less intelligent each individual model needs to be, which opens up a wide selection of models we can use: qwen3-32b, gpt-oss-20b, and so on.
Now we face the challenge of improving prompting quality for our open models. We basically have two options: iterate manually or run some automation. This is where GEPA comes to our rescue.
GEPA is an open-source framework from the DSPy ecosystem that uses reflective evolution to optimize prompts. It follows a genetic approach with Pareto-based selection: it examines the program's execution trajectory, reflects on errors through an LLM, and builds a tree of evolved prompt candidates, accumulating improvements throughout the optimization process. We integrated GEPA into our pipeline for the nodes where quality drops; it takes execution traces from production, analyzes textual feedback (not just scalar metrics), and in 10-15 iterations produces a prompt that makes something like qwen3-4b deliver results on par with Claude Sonnet 4.5. At the same time, GEPA requires noticeably fewer computational resources than RL approaches like GRPO and can optimize multiple system components simultaneously: for example, the system prompt and the few-shot examples within a single node.
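To make the "textual feedback" part concrete, here is a minimal sketch of the kind of metric we plug into GEPA, written against DSPy's GEPA interface, which accepts a metric that returns a dspy.Prediction with score and feedback fields. The field names (input_message, output) mirror our dataset columns; the exact-match check is only an illustration, real metrics are node-specific.

```python
import dspy

def timeframe_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Score a prediction and explain the verdict in plain text.

    GEPA's reflection step reads the feedback string, so we describe
    *why* an answer is wrong instead of returning a bare 0/1.
    """
    correct = pred.output.strip().lower() == gold.output.strip().lower()
    if correct:
        feedback = "Correct timeframe extracted."
    else:
        feedback = (
            f"Expected '{gold.output}' but got '{pred.output}' "
            f"for the input: '{gold.input_message}'."
        )
    return dspy.Prediction(score=float(correct), feedback=feedback)
```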
We ran twelve nodes of our pipeline through GEPA, and the results were mixed. The most pleasant surprise was timeframe and table_chain, where accuracy jumped from 0.33 and 0.52, respectively, to 0.79 and 0.91, while we easily switched from Claude Haiku to qwen3-4b. The detect_language node didn't require optimization at all: right from the first run on qwen3-4b it showed 1.0 accuracy, which was expected for such a straightforward task.
But agent_choice became a problem, primarily due to the complexity of the selection task: value accuracy started at a miserable 0.01, and even after GEPA with a schema and SGR, we only squeezed out 0.57 review accuracy.
The planner (the planning node) also turned out to be tricky: accuracy dropped from 0.089 to 0.032 when we switched to the validation schema, and we haven't yet found a way to make small models handle this node. Similarly, criticize_stage and process_language require either manual validation or staying on SOTA models. Honestly, we realized that not every node is worth pulling onto small open-source models; a larger one, at least 32B parameters, should stay close at hand.
If you want to try GEPA yourself, getting started is pretty straightforward. Install the packages (pip install gepa dspy-ai) and prepare a dataset in a simple CSV format; for us, it's timeframe_dataset.csv with the columns input_message, output, and input_type.
Brief example:

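This is a minimal sketch of the whole loop for the timeframe node, assuming GEPA is used through DSPy, an OpenAI-compatible endpoint serving qwen3-4b, and the CSV columns described above; the endpoint URL, model identifiers, and the train/validation split are placeholders to adapt to your own setup.

```python
import csv
import dspy

# Point DSPy at an OpenAI-compatible endpoint serving the target model.
lm = dspy.LM("openai/qwen3-4b", api_base="http://localhost:8000/v1", api_key="local")
dspy.configure(lm=lm)

# Load timeframe_dataset.csv into dspy.Example objects.
def load_dataset(path="timeframe_dataset.csv"):
    examples = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            ex = dspy.Example(
                input_message=row["input_message"],
                input_type=row["input_type"],
                output=row["output"],
            ).with_inputs("input_message", "input_type")
            examples.append(ex)
    return examples

data = load_dataset()
trainset, valset = data[: len(data) // 2], data[len(data) // 2 :]

# The node under optimization: a single predictor with a simple signature.
program = dspy.ChainOfThought("input_message, input_type -> output")

# timeframe_metric is the feedback-returning metric sketched earlier in the article.
optimizer = dspy.GEPA(
    metric=timeframe_metric,
    auto="light",                            # small optimization budget
    reflection_lm=dspy.LM("openai/gpt-4o"),  # stronger model for the reflection step
)

optimized = optimizer.compile(program, trainset=trainset, valset=valset)
optimized.save("timeframe_optimized.json")
```

After the run, the optimized prompt lives inside the saved program, so the node can be redeployed on the small model without touching the rest of the pipeline.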