Created: January 19, 2026

Oleg Puzanov, Chief Strategy Officer, Co-founder

How to Switch From SOTA LLMs to Local OSS LLMs

This topic is definitely huge, but I'm going to cover the surface-level fundamentals of how to approach moving from smart models to not-so-smart ones.

Let's say you built an agentic pipeline for corporate automation relying on Google and OpenAI models. Maybe you're building a company chatbot that answers project-related questions, handles employee onboarding, and even executes tasks autonomously based on your instructions.

Everything works great until your company announces that due to regulations, you can no longer use external models in your pipeline. You can't access cloud-based APIs anymore. Your country's cloud providers can't deploy top-tier models yet. So you have no choice but to rebuild your pipeline on smaller open-source models that you can run on self-hosted infrastructure with access to, say, 10 RTX 4090s.

Below is a basic manual for anyone taking this challenging path.

Picking one or more models for our pipeline

There are a lot of open-source models on the market that sometimes approach last-generation SOTA benchmarks (Claude Sonnet 3.5, GPT-4o) while remaining runnable on your own hardware. But the models that reached SOTA benchmarks from 6 months ago, like Kimi 2, Minimax M2.1, and GLM-4.7, require so much hardware to run in a single stream that your management probably won't allocate the resources.

So we must look at smaller models. Through extensive trial and error, our team settled on Qwen3 in three sizes: 4B, 8B, and 32B (the 32B version even has reasoning capabilities for complex tasks). For different tasks, we call different Qwen3 models depending on parameter count. If we determine that the current node (I'll explain nodes below) doesn't need 32B parameters and can handle its job with 4B, we call the smaller model: it saves hardware, burns less electricity, uses fewer resources, and frees capacity for parallel computing.
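To make that concrete, here's a minimal sketch of calling two different Qwen3 sizes for two different jobs, assuming each checkpoint is served by vLLM behind its OpenAI-compatible API. The endpoint URLs and prompts below are placeholders, not our production setup.

```python
# Minimal sketch: pick the smallest Qwen3 that can do the job.
# Assumes each model is served by vLLM with its OpenAI-compatible API;
# the endpoint URLs and prompts are placeholders.
from openai import OpenAI

def ask(base_url: str, model: str, prompt: str) -> str:
    client = OpenAI(base_url=base_url, api_key="not-needed")  # vLLM ignores the key
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content

# A trivial classification node only needs the 4B model...
label = ask("http://gpu-a:8000/v1", "Qwen/Qwen3-4B",
            "Is this message a bug report or a feature request? Answer with one word.\n<message here>")

# ...while a node that has to reason over several steps gets the 32B one.
plan = ask("http://gpu-c:8000/v1", "Qwen/Qwen3-32B",
           "List the SQL queries needed to answer the question below, one per line.\n<question here>")
```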

Increasing nodes in your pipeline

What's the difference between a powerful model and a weaker one? With a powerful model, you can use general-purpose prompts, provide larger context windows, and rely more on the model inferring user intent independently. When switching from powerful models to weaker open-source ones, the first step is breaking the pipeline into the maximum number of small, manageable nodes your infrastructure can support.

Here's a simplified example schema showing the elegance of working with SOTA LLMs, especially current models like Claude Opus 4.5 or Gemini 3 Pro High:

And here's the same pipeline with more nodes, so a weaker open-source model like Qwen3 can deliver acceptable results.

You could build a similar multi-node pipeline even with SOTA model support and potentially get slightly better results. But SOTA models are more expensive, they evolve faster than everything else, and they let you get away without breaking your pipeline into granular nodes in the first place. With OSS models, however, decomposition is essential.
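To show the difference in code rather than a schema, here's a rough sketch (not our actual prompts or pipeline) of the same job done as one general-purpose SOTA call versus a chain of narrow nodes a small model can handle. `llm(model, prompt)` stands in for any chat-completion client, such as the `ask()` helper above.

```python
# Illustrative contrast only: one general-purpose SOTA call vs. a chain of
# narrow nodes. Model names and prompts are placeholders.
from typing import Callable

LLM = Callable[[str, str], str]  # (model, prompt) -> completion text

def answer_with_sota(llm: LLM, question: str, context: str) -> str:
    # One broad prompt: the SOTA model infers intent, picks the relevant
    # context, and formats the answer in a single call.
    return llm("sota-model",
               f"Using this context, answer the question.\n\n{context}\n\nQ: {question}")

def answer_with_oss(llm: LLM, question: str, context: str) -> str:
    # The same job decomposed into narrow steps a small model can handle.
    intent = llm("qwen3-4b", f"In one word, what is the user asking about?\n{question}")
    relevant = llm("qwen3-8b", f"Copy only the passages related to '{intent}':\n{context}")
    draft = llm("qwen3-8b",
                f"Answer the question using only the passages below.\nQ: {question}\n{relevant}")
    ok = llm("qwen3-4b", f"Does this answer the question '{question}'? Reply yes or no.\n{draft}")
    if ok.strip().lower().startswith("yes"):
        return draft
    return llm("qwen3-8b", f"Rewrite the answer so it actually addresses '{question}':\n{draft}")
```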

What is a node

A node is a stage in your pipeline where you need to decide what to do next or validate that the result at this node is acceptable before proceeding. 

Essentially, a node is a logical component of your pipeline equipped with a specific prompt and defined expectations for the output. Each node represents a model call with a particular prompt. The more nodes you have, the lower the load per model and the more specific each prompt becomes. This means you can decompose the pipeline into nodes and extract value from smaller, less capable OSS models like Qwen3-4B or Qwen3-8B.
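As a rough illustration (not a specific framework, and not our internal code), a node can be as little as a prompt template, an assigned model, and a check that the output is acceptable before the pipeline proceeds:

```python
# Minimal sketch of a "node": a narrow prompt, an assigned model, and an
# explicit gate on the output. The Node class and validator are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Node:
    name: str
    model: str                        # e.g. a 4B checkpoint for simple nodes
    prompt_template: str              # very specific task, no general-purpose prompting
    validate: Callable[[str], bool]   # gate: is this output acceptable?

    def run(self, llm: Callable[[str, str], str], **inputs) -> str:
        output = llm(self.model, self.prompt_template.format(**inputs))
        if not self.validate(output):
            raise ValueError(f"node '{self.name}' produced an unacceptable result")
        return output

# Example node: classify a user question into one of three known routes.
classifier = Node(
    name="classifier",
    model="qwen3-4b",
    prompt_template="Classify this question as 'project', 'onboarding' or 'task':\n{question}",
    validate=lambda out: out.strip().lower() in {"project", "onboarding", "task"},
)
```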

Below is an example of decomposed nodes from one of our Enji agentic pipelines, showing what each node does and which model supports it:

Node                            Model
timeframe                       Qwen/Qwen3-VL-4B-Instruct-FP8
table_chain                     Qwen/Qwen3-VL-8B-Instruct-FP8
classifier                      Qwen/Qwen3-VL-4B-Instruct-FP8
sql_chats_tool                  Qwen/Qwen3-VL-8B-Instruct-FP8
agent_choice                    Qwen/Qwen3-VL-4B-Instruct-FP8
name_utils (general answer)     Qwen/Qwen3-VL-4B-Instruct-FP8
name_utils (query processing)   Qwen/Qwen3-VL-8B-Instruct-FP8
rag_tool_chats                  Qwen/Qwen3-VL-8B-Instruct-FP8

This isn't the complete pipeline: it doesn't show the consolidator, formatter, router, and other system components. But the table already illustrates the principle: the more you decompose, the better your chances of extracting the needed value from OSS models.
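One practical upside of decomposing this way is that the node-to-model assignment can live in plain config, so bumping an underperforming node from 4B to 8B is a one-line change rather than a pipeline rewrite. A sketch of the table above as config (how it gets consumed, for example by a Node object like the earlier sketch, is up to the pipeline):

```python
# The node-to-model mapping from the table above, written as plain config.
# Re-assigning a node to a bigger or smaller model is a single-line change here.
NODE_MODELS = {
    "timeframe":                     "Qwen/Qwen3-VL-4B-Instruct-FP8",
    "table_chain":                   "Qwen/Qwen3-VL-8B-Instruct-FP8",
    "classifier":                    "Qwen/Qwen3-VL-4B-Instruct-FP8",
    "sql_chats_tool":                "Qwen/Qwen3-VL-8B-Instruct-FP8",
    "agent_choice":                  "Qwen/Qwen3-VL-4B-Instruct-FP8",
    "name_utils_general_answer":     "Qwen/Qwen3-VL-4B-Instruct-FP8",
    "name_utils_query_processing":   "Qwen/Qwen3-VL-8B-Instruct-FP8",
    "rag_tool_chats":                "Qwen/Qwen3-VL-8B-Instruct-FP8",
}
```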

Yes, it's harder to maintain. But we've completely decoupled from SOTA dependencies and can deploy inside a client's closed perimeter without internet access.

What we'll cover in future write-ups:

⏺️ How to evolve node prompts on OSS models through GEPA.

⏺️ When to fine-tune via LoRA vs. improving prompts through GEPA.

⏺️ Ways to compress context without losing quality.

⏺️ How to run multiple requests on one model instance via vLLM.

Already published; start exploring the series:

How to Evolve Node Prompts on OSS Models Through GEPA

Learn how GEPA uses genetic optimization to refine prompts for OSS models, boosting accuracy and reducing costs across AI node pipelines.