Created: April 26, 2026

Pavel Zverev, CTO

AI

Continuous Code Scanning with AI: Why One-Shot Analysis Misses What Matters


AI-assisted coding tools have changed the economics of software development. A senior engineer with Copilot or Claude can ship working features two or three times faster than before. But "working" and "safe" are not the same thing, and the gap between them gets wider as more AI-generated code enters production without proportional scrutiny.

The real challenge goes beyond speed: it is securing AI-generated code at the same pace it is being written. As more teams rely on LLMs for implementation, refactoring, and testing, AI-generated code security vulnerabilities are becoming a structural problem rather than an occasional edge case.

Most teams still rely on traditional code security scanning tools, generic AI code review tools, and periodic audits. Those approaches still matter, but they are not enough on their own. In AI-heavy codebases, important risks do not always appear as obvious defects in one file or one pull request. They accumulate gradually across components, services, and workflows.

This article is about that blind spot, the class of problems it hides, and why we moved from one-shot analysis to continuous AI security scanning as an architectural decision. It's written from the perspective of the team behind Enji Fleet, our continuous scanning system for AI-heavy codebases, based on what we encountered on our own projects before we had a systematic answer to it.

The assumption that breaks when AI writes half your codebase

Most teams already use some combination of code security scanning tools, CI checks, and AI code review tools. These approaches are useful, but they share the same structural limitation: they evaluate code at a single point in time.

Human review tolerates that limitation because the author carries context. A developer who adds a database query usually knows which tables contain sensitive data. An engineer implementing an API endpoint usually understands how authentication and authorization work in that service. Reviewers build on that assumption: their job is to catch mistakes in how that context is applied, not to rediscover the context from scratch.

AI-generated code breaks this assumption at the root.

When an LLM produces a function, it doesn't know your access control model. It doesn't know that project_ids is the parameter enforcing tenant isolation across your system. It doesn't know that bypassing a single WHERE clause on one table leaks data from every customer in your database. It generates syntactically correct, functionally plausible code that passes your linter, your type checker, and your unit tests, and still contains a high-severity vulnerability.

This isn't hypothetical. On one of our internal projects, a multi-agent system called Enji PM Agent, we discovered exactly this pattern. The code worked, the tests passed, and static analysis flagged nothing. But the access control logic had gaps that no one-shot tool was built to catch, because the problem wasn't in any single file or function. It was in how components interacted across time, as different agents and endpoints accumulated changes that individually looked fine but collectively dismantled tenant isolation.

Let me show you what that looks like in concrete terms.

PM Agent has several specialized agents: a chat agent, a conference agent, and a general-purpose PM agent. Each queries the same database, but through different code paths. One of those code paths had a function called execute_query() with a parameter is_chat_agent. When the flag was set to True, the function skipped setting the copilot.projects_ids session variable entirely, so SQL queries ran without any project-level filtering. A request submitted without project_ids, or with an empty list, would search the entire vector database, including documents, chats, metadata, and thread references from other users' and companies' projects. This is a textbook case of CWE-863 Incorrect Authorization – a HIGH severity issue under the CWE-284 Improper Access Control category.
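In simplified form, the vulnerable shape looked roughly like this. The sketch below is a reconstruction for illustration, not the actual PM Agent source: only execute_query, is_chat_agent, and copilot.projects_ids come from the incident, while the SQL and connection handling are assumptions.

```python
# Simplified reconstruction of the vulnerable shape (illustrative only).
def execute_query(conn, sql, params=None, project_ids=None, is_chat_agent=False):
    cur = conn.cursor()

    if not is_chat_agent:
        # Project scope is enforced only on this branch: downstream filtering
        # reads the copilot.projects_ids session variable.
        cur.execute(
            "SELECT set_config('copilot.projects_ids', %s, true)",
            (",".join(project_ids or []),),
        )
    # When is_chat_agent=True, the session variable is never set, so the query
    # below runs with no project-level filtering. An absent or empty
    # project_ids list ends up just as permissive, because nothing here
    # refuses to run an unscoped search.

    cur.execute(sql, params)
    return cur.fetchall()
```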

No standard static analyzer ruleset would flag this. The code was valid Python. The function signature was clean. The branching logic was syntactically correct. The vulnerability existed in the relationship between a flag's semantics and the database query's scope – something that only becomes visible when you understand the system's security invariant: every query must be scoped to a project.

And this was not the only instance. The chat and conference agents had a separate issue in their SQL paths: a parameters = None bypass that skipped project scope entirely. Data isolation depended on the prompt sent to the language model rather than being enforced at the database level. This is the kind of architecture where a single prompt injection could expose another team's meeting records.

The problem, in other words, was that AI wrote code without institutional memory, and our existing scanning tools did not compensate for that absence. That distinction matters because many AI-generated code security vulnerabilities do not appear as one dramatic flaw. They appear as drift.

What one-shot analysis actually catches – and what it doesn't

Let me be specific about what I mean by "one-shot analysis." This is what most teams rely on today, and it covers three common approaches:

  • Static analysis such as SonarQube, Semgrep, or CodeQL. These tools inspect source code without executing it and are effective at catching known patterns: SQL injection templates, hardcoded credentials, unsafe deserialization, and similar issues. CodeQL can be extended with custom queries to enforce project-specific invariants, but writing and maintaining those queries requires the same engineering knowledge as the invariant itself, and most teams don't have them in place.
  • Pre-merge checks in CI pipelines. These catch test failures, linting violations, coverage drops, and custom policy rules. They evaluate the proposed change, not the codebase's long-term evolution.
  • Periodic manual review by senior engineers. This can be thorough, but it is expensive, inconsistent, and hard to scale.

All three remain valuable, and none of them should be removed from the workflow. But they share a structural limitation that I mentioned before: they evaluate code at a single moment. They answer, "Is this change acceptable right now?" They don't answer "What has this codebase become over the past month of rapid AI-assisted development?"

Here's what falls through the cracks:

Cross-component access control drift
A query function in one module starts accepting a flag that disables filtering. Another module begins calling it with that flag enabled. A third module copies the same pattern. Each change looks harmless in isolation. Together, they remove tenant isolation.

Semantic security degradation
Your codebase may depend on an implicit invariant such as "every database query must be scoped to the requesting user's project." That rule may not exist in a type system, a linter, or a generic ruleset. If AI generates a new query path, it has no built-in awareness of that convention.

Gradual quality erosion across files
AI-generated code is often locally correct but globally inconsistent. Different sessions produce different approaches to error handling, logging, and input validation. Nothing is obviously broken in one file, but the codebase becomes harder to reason about and harder to secure.

Dependency and configuration drift
AI tools are good at adding packages and environment variables to solve immediate problems. They are not good at understanding how those additions affect the attack surface, container posture, or deployment constraints across multiple services over time.

Infrastructure drift beyond the code itself
If AI-generated changes also introduce new images, packages, or runtime assumptions, teams need container security scanning alongside application-level review. Otherwise, the repository may look cleaner while the runtime environment becomes harder to secure.

This is why teams cannot rely solely on generic AI code review tools or standard code security scanning tools. Those tools are designed to inspect code snapshots. They are not designed to continuously detect slow-moving architectural drift. This limitation becomes especially visible once AI starts contributing to the codebase at scale.

How AI-generated code creates a new class of drift problems

Traditional technical debt usually accumulates at human speed. A team takes a shortcut, remembers why it happened, and plans to revisit it later. Even when that fix is delayed, the original reasoning still exists somewhere in the team’s memory.

AI-generated code accumulates debt differently. It appears faster, spreads across more files, and often arrives without a durable explanation for why a specific implementation choice was made. The model does not remember prior sessions. Each generation is effectively stateless. Over time, you get functioning code with weak design continuity and limited awareness of system-level constraints.

The security implications are especially concerning. Let me walk through the PM Agent case in more detail, because it illustrates a pattern I've now seen on multiple projects.  

Fixing the vector search vulnerability required several coordinated changes:

  • Adding fail-closed protection so that missing or empty project_ids immediately return an empty result and log a warning.
  • Writing nine unit tests that covered the protection scenarios.
  • Removing the parameters = None bypasses that let some agents skip the project scope.
  • Changing the default behavior from permissive to deny-by-default.

That last point matters most. The old design effectively said: search everything unless a scope is applied. The new design says: return nothing unless a valid scope is provided.
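In code, the difference between the two designs is small but decisive. The sketch below is illustrative rather than the actual fix: the helper name, SQL, and logging are assumptions, and only the fail-closed behavior mirrors what we shipped.

```python
import logging

logger = logging.getLogger(__name__)

def execute_scoped_query(conn, sql, params=None, project_ids=None):
    # Fail closed: no valid project scope means no results, never "all results".
    if not project_ids:
        logger.warning("Vector query rejected: missing or empty project_ids")
        return []

    cur = conn.cursor()
    # Scope is always set before the query runs; there is no per-caller flag
    # that can switch it off.
    cur.execute(
        "SELECT set_config('copilot.projects_ids', %s, true)",
        (",".join(project_ids),),
    )
    cur.execute(sql, params)
    return cur.fetchall()

def test_missing_scope_returns_nothing():
    # One of the protection scenarios: an unscoped request never reaches the
    # database. conn=None is safe here because the guard returns first.
    assert execute_scoped_query(None, "SELECT * FROM documents") == []
```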

This is a better fit for security principles and for compliance frameworks such as SOC 2 Type II and ISO 27001. More importantly, it moves enforcement to the database layer, where it belongs. An LLM prompt is not an access control mechanism. It can be ignored, manipulated, or malformed.

Now here's the critical question: could any one-shot tool have caught this before it shipped?

A static analyzer wouldn't flag the missing WHERE clause: there was no obvious injection pattern to match, the function simply wasn't filtering. A pre-merge check wouldn't catch it, because the is_chat_agent flag was introduced in one PR and the missing scope enforcement was a pre-existing condition in another code path. A periodic review might catch it, but only if the reviewer happened to trace the full query path from the agent through the function to the database, understood the multi-tenant model, and noticed the missing session variable.

One-shot tools do their intended job. That job simply doesn't include detecting gradual, cross-cutting security erosion in AI-generated code.

Continuous scanning is an architecture decision, not a cron job

This was the point where our thinking changed. The PM Agent incidents made it clear that identifying the problem was not enough. We needed a way to keep checking for these patterns continuously, not only during dedicated audits or after something suspicious surfaced. 

When I say "continuous code scanning," I do not mean running the same static tool every few hours. A timer does not give a one-shot tool any deeper understanding.

Continuous scanning means an always-on process that tracks how the codebase evolves, verifies the codebase against project-specific invariants defined in runbooks, and flags drift against your team's actual standards instead of against a generic industry checklist alone.

This is an architectural decision because it changes the role of quality in the development process.

With one-shot analysis, quality is a gate. Code passes through it at merge time. With continuous scanning, quality becomes a background process. The system keeps evaluating the state of the codebase as a whole and surfaces issues proactively, including the ones that only become visible after several individually acceptable changes combine into something unsafe.

That is the difference that matters. A system that understands the invariant “every query must be scoped to a project” can detect a new path that violates it, even if that path emerged gradually across multiple commits, contributors, or AI sessions. It can also spot inconsistent error handling, repeated dependency additions, or other forms of drift that a diff-based review will never see clearly. It is less a scheduled linter than a persistent automated process that verifies every change against the architecture you defined and keeps checking whether the codebase still satisfies the invariants your system depends on.
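To make "verifying an invariant" concrete, here is a deliberately simplified sketch of what one such check can look like when it evaluates the whole repository rather than a single diff. It is not how Fleet works internally; the helper name execute_query is taken from the incident above, and everything else is an assumption.

```python
# Encoding one invariant -- "every query path must enforce project scope" --
# as a whole-repository check rather than a per-diff rule.
import ast
import pathlib

def find_scope_bypasses(repo_root: str) -> list[str]:
    findings = []
    for path in pathlib.Path(repo_root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(), filename=str(path))
        except SyntaxError:
            continue
        for node in ast.walk(tree):
            if not isinstance(node, ast.Call):
                continue
            name = getattr(node.func, "attr", None) or getattr(node.func, "id", None)
            if name != "execute_query":
                continue
            for kw in node.keywords:
                # Any call site that opts out of scoping violates the invariant.
                if kw.arg == "is_chat_agent" and getattr(kw.value, "value", None) is True:
                    findings.append(f"{path}:{node.lineno} disables project scope")
    return findings

if __name__ == "__main__":
    for finding in find_scope_bypasses("."):
        print(finding)
```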

But the architecture is only half the story. The more important half is what tells the scanning agents what to look for.

Why generic rules fail on team-specific codebases

There's another problem with most AI code review tools, beyond the one-shot limitation: they apply generic rules.

A typical SAST tool knows about OWASP Top 10 categories, CWE patterns, and language-specific anti-patterns. Those rules matter, but they are universal. They do not know your authentication decorator, your ORM wrapper, your custom error model, or your team's own conventions for safe access to shared data.

That creates two problems at once:

  1. Too many false positives on patterns that are acceptable in your environment.
  2. Too few true positives on issues that only matter in your environment.

The PM Agent vulnerability is a good example. No generic ruleset will naturally know that every vector database query in that project must set copilot.projects_ids before execution. That rule only makes sense inside that system.

This is why project-specific runbooks matter. A runbook is a reusable set of instructions that defines how a particular class of work should be evaluated in a particular codebase. It is not a vague prompt like "review this code for security issues." It is a structured recipe: what to check, in what order, against which standards, with which exceptions, and how findings should be reported.
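To make that less abstract, here is a hypothetical excerpt of what such a runbook might look like for the project-scope invariant described earlier. The structure and wording are illustrative, not a Fleet template:

```text
Runbook: project-scope enforcement (hypothetical excerpt)

Scope: every code path that reads from the vector database.

Checks, in order:
1. Each query path must set copilot.projects_ids before executing SQL.
2. A missing or empty project_ids must fail closed: empty result plus a
   logged warning, never a fallback to an unscoped search.
3. No flag or parameter may disable scoping for a specific caller.

Known exceptions: none. Any new exception requires explicit sign-off.

Reporting: one issue per violation, labeled by severity, linking the exact
call site and the invariant it breaks.
```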

That matters not only for accuracy, but also for focus. A single AI session that tries to review everything at once accumulates competing priorities. It tries to check security, code style, test coverage, dependency hygiene, and architectural drift in the same context. The more it tries to do, the more likely it is to miss something important or produce shallow analysis.

A stronger approach is to split the work into focused sessions, each with its own runbook:

  • for security invariants,
  • for code style and consistency,
  • for dependency hygiene,
  • for architectural drift,
  • for AI code refactoring review, where structural changes need extra scrutiny,
  • for container security scanning when code and runtime changes have to be interpreted together.

In practice, that produces better results than asking one general-purpose agent to do everything at once. But the quality of a runbook determines the quality of everything downstream. A vague runbook – one not grounded in your system's actual invariants – produces results no better than a generic prompt. Writing a good runbook requires the same engineering judgment as defining the invariant itself.

Teams can build this approach internally, and many do. For teams that want to move faster or don't have the bandwidth to formalize these rules, Fleet offers runbook design as a separate engagement, translating system-specific invariants into reusable, executable checks.

But defining runbooks is only part of the problem. As the system evolves, those rules need to be applied consistently, updated over time, and executed across multiple areas without losing focus. At that point, the challenge shifts from writing runbooks to operationalizing them as a continuous system.

How we turned this approach into Enji Fleet

That separation of concerns – one focused session per runbook, one agent per class of problem – is what Fleet's architecture is built around. Here's what that looks like in practice.

The setup is straightforward:

  • Fleet connects to the repository through a GitHub App and runs scans on a schedule or as cyclic tasks.
  • It routes work across specialized agents such as Claude, Codex, Gemini, and Kimi.
  • Each task executes in an isolated worker container, separate from the application infrastructure, so there's no risk of a scanning agent modifying production.
  • Teams can use their existing model API subscriptions; no separate Anthropic or OpenAI keys required.
  • For stricter environments, Fleet can be deployed on-premise – in this configuration, agents run against locally hosted models (currently Qwen and Kimi) rather than external API subscriptions, which keeps all code and scan artifacts within the organization's own infrastructure.

For teams operating under formal compliance requirements (SOC 2 Type II, ISO 27001, PCI DSS), continuous scanning changes the audit story in practical terms. Instead of producing a point-in-time snapshot for an annual review, the team has a documented, repeatable process that runs on a defined schedule against defined criteria. Runbooks encode the specific controls being verified. Scan history provides a timestamped record of what was checked, when, and what was found. That's a different posture than "we ran a penetration test in Q3" – it's closer to the continuous monitoring that modern compliance frameworks increasingly expect rather than merely accept.

Infrastructure is only the baseline. What really matters is Fleet's decision logic – what it checks and how. It runs on project-specific runbooks that define the evaluation criteria, the method, and the reporting format for each agent. That's why it works in real codebases: it doesn't hunt for generic "bad smells"; it verifies that the code still follows the rules your team actually relies on.

In the PM Agent example, the vulnerabilities were found through a dedicated manual audit. That took senior engineering time and focused attention. Importantly, the audit happened after the vulnerability had already existed in production for an indeterminate period, not before it shipped. A continuous scanning setup with the right runbooks could have detected the same patterns much earlier by encoding one simple invariant: every query path must enforce project scope.

That is the difference between an audit that happens when someone schedules it and a system that keeps verifying whether the invariant is still true.

How we set up continuous scanning on a real AI-heavy codebase

Once we had the architecture in place, the next question was not how to scan everything at once. It was how to introduce continuous scanning into a real AI-assisted repository without creating noise that the team would immediately learn to ignore.

We did not start with a generic "review the whole codebase" prompt. We started with a small set of invariants that were both security-critical and easy to verify repeatedly.

The first step was to define checks that could be expressed unambiguously and reviewed consistently over time. Instead of asking an agent to "look for security issues," we gave it concrete questions to answer: which code paths are allowed to bypass scope enforcement, where permissive fallbacks appear instead of deny-by-default behavior, and whether a later refactor has weakened logic that used to enforce a boundary explicitly.

That became the basis for the first runbooks.

From there, we expanded in layers. After the initial access-control checks, we added runbooks for deny-by-default behavior, for AI-generated refactors, and for dependency or runtime changes that had to be interpreted together with container-level risk. The rollout was incremental by design. We did not try to encode the whole system at once. We started with the rules whose failure would have the highest security cost.

The handling of findings mattered just as much as the checks themselves. We did not want another detached report that somebody might read later. Findings had to enter the team's normal engineering workflow in a form that was immediately actionable: issues when investigation was still needed, pull requests when the fix was clear enough to propose directly, and repeat scans when later changes touched the same invariant again.

This is where continuous scanning proved different from a one-off audit. An audit gives you one deep snapshot. A runbook-based setup gives you a way to keep checking whether the same rule is still true after later commits, later AI refactors, and later feature additions.

In other words, it turns a one-time security insight into a persistent review process.

That is the rollout model I would recommend to any team working with heavy AI generation. Do not begin with generic coverage. Begin with the invariants your system cannot afford to lose. Encode those first. Make the findings actionable inside the repository workflow. Then expand outward into consistency, dependency hygiene, and infrastructure-level checks once the highest-risk paths are already under continuous review.

What changes in the team's workflow after continuous scanning is running

The first change is not that the team suddenly has more checks. The first change is that merge time stops being the only place where system understanding is applied.

In a typical AI-assisted workflow, the process still looks familiar: generate code, review the diff, run tests, pass CI, and merge. That model continues to catch obvious failures. What it does not do well is protect the system against gradual drift across multiple commits, refactors, and interacting components.

Once continuous scanning is running, the operating model changes in different ways for different technical roles.

For an engineering lead, hidden drift becomes visible much earlier. Instead of relying on intuition and occasional deep reviews to notice that access-control patterns, validation logic, error handling, or dependency hygiene are becoming inconsistent, the lead gets a continuous signal about how the codebase is evolving. That matters directly for technical debt, because structural degradation becomes visible while it is still small enough to address deliberately. 

For senior developers and reviewers, the review burden becomes narrower and more contextual. They are no longer carrying the full responsibility of reconstructing every system-level implication from the diff alone. Continuous scanning keeps checking whether merged changes still respect the project’s known invariants and can surface violations as issues or pull requests inside the repository workflow.

For security engineers, the biggest shift is that the checking model becomes closer to how risk actually accumulates in AI-generated codebases. Traditional code security scanning tools are strong at known patterns and weak at project-specific drift. Continuous scanning adds another layer: it keeps re-evaluating whether the codebase still satisfies the security properties the team depends on, including the ones that only become meaningful in context.

For small teams and founders shipping quickly with AI-generated code, the dynamic is different, but the risk is the same. There is often no dedicated engineering lead or security engineer – just a founder seeing working features, without full visibility into what is accumulating underneath. In this context, continuous scanning is not about managing large-scale architectural drift. It is about having persistent control that does not rely on someone remembering to check.

The value becomes most apparent when findings are easy to understand and act on. Instead of raw repository issues, teams need clear explanations and prioritized risk signals that help them decide what matters now. Currently, Fleet delivers findings directly into the repository, as issues and pull requests with severity labels. A dedicated dashboard that surfaces findings with plain-language explanations and risk prioritization is in active development; at this stage, interpreting Fleet's output still requires basic familiarity with GitHub workflows.

This also changes how teams treat AI code refactoring. In many repositories, refactoring generated by LLMs is reviewed as if it were lower-risk work because it mainly restructures existing code. In practice, this is often where important safeguards disappear. A continuous scanning system can keep re-checking whether a cleanup change has altered access-control behavior, validation boundaries, fallback logic, or container-related assumptions in ways that a local reviewer might miss.

The result is not that CI, SAST, or manual review becomes obsolete. The result is that they no longer carry the whole burden alone.

For teams actively using AI generation, that is the workflow change that matters most. Quality control stops being concentrated in a single checkpoint and becomes a continuous process of verifying that the codebase still matches the architecture the team believes it is maintaining.

That is the role Enji Fleet is designed to play. It is not a replacement for engineers, and it is not just another layer of generic automation. It is a way to keep system-level understanding alive in codebases that are changing faster than human review alone can realistically track.

If AI has accelerated your delivery, "Does it run today?" stops being the hard question. The hard question is whether you can still see what's changing, explain why it's safe, and keep the codebase under control as it evolves. 

That's where continuous scanning becomes non-negotiable. If you're working with an AI-heavy codebase and want to see how it applies to your specific invariants, book a demo.

You can also read:

AI

AI Adoption in Development: From Scripts to Multi-Agent Systems

Learn how we evolved from scripts to a multi-agent AI system that plans, codes, tests, and ships features across multiple repositories with minimal human involvement.

Project Management

Technical Debt Is a Business Problem: How to Quantify, Visualize, and Communicate It to Leadership

Technical debt costs money – missed deadlines, wasted engineering hours, delayed features. Here's how to quantify it, visualize it, and bring numbers leadership can't ignore.