Building a plain-language rule engine for code review agents
Learn how to build a plain-language rule engine for code review agents with RAG, embeddings, and scalable polyglot validation.
Modern code review agents need more than a good model. They need a dependable way to turn human policy into executable checks, combine that policy with repository context, and evaluate it quickly through LLM orchestration pipelines that hold up under real-world delivery pressure. That is the core idea behind a plain-language rule engine: engineers write natural language rules, the system parses them into validators, enriches them with retrieval, and evaluates them consistently in CI. If you have looked at the architecture of systems like Kodus, you have probably noticed the practical payoff: lower review cost, better context, and fewer generic comments that waste reviewer time.
This guide is a technical deep dive into the full pipeline: rule authoring, parsing, embeddings, retrieval-augmented generation, validation execution, and scaling across polyglot repos. The goal is not to build a demo. The goal is to design a system teams can trust in production, where code review rules become part of policy-as-code, versioned like application code, and enforced consistently through pull requests, merge queues, and release gates.
Why plain-language rules are the right abstraction
Engineers want policies they can read, not prompts they can guess
Traditional review bots fail because they encode policy in brittle prompt text or opaque model instructions. A maintainer changes one line in a prompt and suddenly a thousand PRs receive different advice. Plain-language rules solve this by giving teams a human-readable surface: “Never introduce blocking network calls on the UI thread” or “All database migrations must be backward compatible.” Those sentences are understandable by developers, product engineers, and security reviewers alike, which makes them suitable for code review rules that need cross-team agreement.
Plain language also reduces the governance gap. Instead of asking an engineer to understand prompt engineering, you ask them to express the rule in business terms and let the system compile it into check logic. This is similar in spirit to how teams adopt other operational abstractions, whether they are building simplified DevOps workflows or defining product constraints in an internal platform. The key is that the rule remains readable, reviewable, and testable.
Why “natural language rules” beat one-off heuristics
Heuristics are cheap to write but hard to scale. A regex that catches one anti-pattern often produces false positives in another language or framework. A plain-language rule engine lets you capture intent once, then generate validators for multiple ecosystems. That matters when your repositories include TypeScript, Python, Go, Java, and Terraform in the same delivery stream. In those environments, the review policy should be stable even when the implementation details vary widely.
This is where embeddings become valuable. By encoding rule text and historical examples into vector space, you can match a new pull request against prior rulings, documentation, and project-specific conventions. For teams already using retrieval-based assistants, this approach feels familiar: it is the same basic insight behind effective local AI toolchains and other context-aware developer systems. The difference is that here the output is not an answer; it is an executable review decision.
What Kodus-like systems are really optimizing for
Systems like Kodus are compelling because they align model usage with engineering economics. Instead of hiding model costs behind a pricing markup, the platform exposes its dependency on external model providers and makes the review policy independent of the model itself. That separation is essential for a plain-language rule engine. You want rules to survive model swaps, provider outages, and cost changes. The policy should remain the stable center of gravity while the LLM acts as a reasoning accelerator, not as the source of truth.
That mindset also mirrors the broader trend toward agentic AI for enterprise workflows, where orchestration, memory, and data contracts matter more than raw model size. In code review, the best systems are not the ones that “sound smartest.” They are the ones that consistently map policy intent to actionable findings with auditable behavior.
System architecture: from rule text to executable validators
The pipeline overview
A production-grade rule engine usually follows five stages: authoring, parsing, normalization, retrieval, and execution. First, a maintainer writes a rule in a constrained plain-language format. Second, the parser identifies the rule type, scope, severity, and conditions. Third, the rule is normalized into an intermediate representation, often a JSON or YAML schema with explicit fields. Fourth, the engine retrieves repository context, style guides, previous decisions, or coding standards. Finally, validators run against the changed files, ASTs, and diff hunks to produce structured findings.
Each stage should be independently testable. That is a strong lesson from mature systems that treat data and automation as assets, such as the approach described in digital asset thinking for documents. Your rules are assets too. They deserve ownership, history, review, and lifecycle management. If you skip that discipline, you end up with an ungovernable prompt pile instead of a policy engine.
A practical intermediate representation
The fastest path is not to execute raw language directly. Convert the rule into a structured schema that a validator runner can understand. A simple example might look like this:
{"rule_id":"db-backward-compat","scope":"migrations","severity":"high","intent":"new database migrations must be backward compatible","conditions":[{"type":"mentions","value":"ALTER TABLE"},{"type":"requires","value":"expand-migrate-contract"}],"languages":["sql","python","go"]}This schema gives the engine stable hooks for execution. It can invoke static analyzers, search migration diffs, compare against repository conventions, and ask an LLM only where uncertainty remains. You can also attach metadata such as team ownership, exceptions, and SLA targets. In practice, the schema behaves like a policy contract and supports the same kind of traceability teams expect when they adopt policy-as-code in infrastructure or compliance workflows.
Validator classes you actually need
Most review policies fall into a few classes: syntactic, semantic, contextual, and historical. Syntactic validators check for concrete constructs in code, like banned APIs or insecure patterns. Semantic validators reason about the meaning of changes, such as whether error handling is consistent. Contextual validators use retrieved repository knowledge to decide if a new change violates project norms. Historical validators compare the current diff against past accepted or rejected reviews.
In mature systems, the LLM does not replace these validators. It coordinates them. This is one reason LLM orchestration matters so much: the model is best used for disambiguation, rule interpretation, and summarization, while deterministic checks provide repeatability. That split keeps the review agent reliable even when the model is probabilistic.
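One possible shape for that split is a small shared interface so the orchestrator can run every validator class uniformly. The sketch below shows it with a syntactic validator built on pattern matching; the `Finding` fields and class names are illustrative, not a fixed API.

```python
from dataclasses import dataclass
from typing import Protocol
import re

@dataclass
class Finding:
    rule_id: str
    path: str
    message: str
    severity: str

class Validator(Protocol):
    def check(self, path: str, diff_text: str) -> list[Finding]: ...

class BannedApiValidator:
    """Syntactic validator: flag concrete constructs such as a banned call."""
    def __init__(self, rule_id: str, pattern: str, severity: str = "high"):
        self.rule_id, self.severity = rule_id, severity
        self.pattern = re.compile(pattern)

    def check(self, path: str, diff_text: str) -> list[Finding]:
        # Only look at added lines in the diff; removals cannot introduce a violation.
        added = [line[1:] for line in diff_text.splitlines() if line.startswith("+")]
        return [
            Finding(self.rule_id, path, f"banned construct: {m.group(0)}", self.severity)
            for line in added for m in self.pattern.finditer(line)
        ]

validator = BannedApiValidator("no-eval", r"\beval\(")
print(validator.check("app.py", "+ result = eval(user_input)\n- old = None"))
```

Semantic, contextual, and historical validators would implement the same `check` signature but pull in ASTs, retrieved conventions, or prior review decisions before deciding whether to emit findings.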
Parsing plain-language rules into structured policy
Constrain the authoring grammar without making it painful
If you let maintainers write completely free-form text, parsing becomes brittle. If you make the language too rigid, adoption collapses. The right tradeoff is a constrained natural language template with flexible slots. For example: “When condition, the review must action because reason.” This format preserves readability while making extraction much easier. You can add optional fields for severity, language scope, exception patterns, and ownership.
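A minimal extraction sketch for that template, assuming one possible grammar; the regex and slot names are illustrative:

```python
import re

# "When <condition>, the review must <action> because <reason>." with the
# reason clause optional. This is one assumed grammar, not a standard.
TEMPLATE = re.compile(
    r"^When (?P<condition>.+?), the review must (?P<action>.+?)"
    r"(?: because (?P<reason>.+?))?\.?$",
    re.IGNORECASE,
)

def parse_rule(text: str) -> dict | None:
    match = TEMPLATE.match(text.strip())
    if not match:
        return None  # fall back to clarification rather than guessing
    return {k: v for k, v in match.groupdict().items() if v}

print(parse_rule(
    "When a migration alters an existing column, the review must require an "
    "expand-migrate-contract plan because rollbacks need the old schema."
))
```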
For teams building internal platforms, this is similar to how the best onboarding systems balance policy with convenience. A useful analogy is the discipline behind structured learning or evaluation flows in other domains; the principle is the same as in starter research frameworks or analytics-driven process design: constrain inputs enough to make outputs reliable. Once the rule grammar stabilizes, your parser can become far more accurate without relying on a fragile prompt.
Use a two-step parse: classify, then extract
The first step is rule classification. The engine determines whether the rule is about security, style, architecture, performance, dependency management, or testing. The second step extracts slots into a structured object. This two-step approach is much more robust than asking the model to emit a perfect final schema in one shot. It also makes fallback behavior easier: if extraction fails, you can ask for clarification instead of silently corrupting policy.
In implementation terms, you can use a small JSON schema, a function-calling model, or a regex-backed parser for highly standard patterns. The model should validate ambiguous language such as “avoid large changes” by asking for measurable thresholds. That is where policy design matters. A good review rule is not merely descriptive; it is operational. If a maintainer cannot tell the system how to detect it, the rule is too vague to automate.
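A sketch of the two-step flow under those assumptions, with a keyword classifier standing in for a model call and the template extractor from the previous sketch handling slot extraction:

```python
# The categories and keywords are illustrative; in production the classify
# step would typically be a function-calling model, not keyword matching.
CATEGORIES = {
    "security": ["secret", "auth", "inject", "encrypt"],
    "performance": ["latency", "blocking", "n+1", "timeout"],
    "architecture": ["layer", "dependency", "module", "public api"],
    "testing": ["test", "coverage", "fixture"],
}

def classify(rule_text: str) -> str:
    lowered = rule_text.lower()
    scores = {cat: sum(kw in lowered for kw in kws) for cat, kws in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified"

def parse(rule_text: str) -> dict:
    category = classify(rule_text)
    slots = parse_rule(rule_text)  # template extractor from the previous sketch
    if category == "unclassified" or slots is None:
        # Ask the author for clarification instead of silently corrupting policy.
        return {"status": "needs_clarification", "raw": rule_text}
    return {"status": "ok", "category": category, **slots}

print(parse("When a change adds a new public API, the review must require auth checks."))
```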
Maintain a policy test suite
Every rule should have examples: positive cases, negative cases, and edge cases. These examples are crucial because they let you regression-test the parser and the validators independently. When someone edits a rule, the system should tell them exactly which historical examples still pass and which now fail. That level of feedback makes the rule engine feel like part of the development workflow rather than a black box bolted on top.
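A minimal sketch of such a suite using pytest, reusing the hypothetical `BannedApiValidator` from the earlier sketch; the example diffs and expectations are illustrative:

```python
import pytest

CASES = [
    # (rule_id, diff_snippet, should_fire)
    ("no-eval", "+ value = eval(raw)", True),                 # positive case
    ("no-eval", "+ value = ast.literal_eval(raw)", False),    # edge case: similar name
    ("no-eval", "- value = eval(raw)", False),                # negative case: removal only
]

@pytest.mark.parametrize("rule_id,diff,should_fire", CASES)
def test_rule_examples(rule_id, diff, should_fire):
    validator = BannedApiValidator(rule_id, r"\beval\(")
    fired = bool(validator.check("example.py", diff))
    assert fired == should_fire
```

When a maintainer edits the rule or the validator, this suite is what tells them which historical examples still pass, before the change ever reaches a live pull request.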
This practice resembles the rigor used in environments that track reliability and operational consistency, such as automating data profiling in CI. The important point is that policy changes deserve tests. Without them, you are merely editing prose and hoping the LLM interprets it the same way tomorrow.
RAG for code review: context that actually matters
What the retrieval layer should index
RAG is only useful if the retrieval corpus is curated. For code review agents, the best sources are architecture docs, team conventions, dependency policies, prior accepted reviews, incident postmortems, and module ownership data. You should also index API documentation, code comments that define invariants, and approved patterns for each language. Do not indiscriminately dump the whole repo into the vector store. That increases noise and makes the review agent more likely to cite irrelevant context.
A strong retrieval design treats each rule as a query anchor. For example, a rule about backwards compatibility should retrieve migration docs, release notes, and database standards. A rule about error handling should retrieve logging guidelines and SLO documentation. This is a better fit for enterprise AI workflows than a generic chat assistant, because the system is always anchored to the organization’s own source of truth.
Chunking and embeddings strategy
Embeddings work best when your chunks are semantically coherent. For code, that usually means function-level or class-level chunks, plus separate chunks for doc pages, ADRs, and policy docs. Keep enough surrounding context to preserve meaning, but avoid stuffing unrelated code into one vector. You should also store metadata such as language, path, ownership, service name, and last modified date. That metadata enables filters that dramatically improve precision in large repos.
For polyglot repos, language-aware embeddings help. A Python test helper should not be ranked as a likely match for a Go gRPC service just because they both mention “timeout.” Combining dense retrieval with metadata filtering is usually more effective than relying on vectors alone. In practice, the best systems layer this on top of local AI or model-hosted embeddings depending on privacy, latency, and cost constraints.
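A minimal sketch of dense retrieval layered with metadata filtering; the vectors are precomputed placeholders for whatever embedding model you use, and the chunk fields and filter keys are illustrative:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec, chunks, *, language=None, path_prefix=None, top_k=5):
    # Filter first on metadata, then rank by similarity: a Python test helper
    # never competes against chunks scoped to a Go rule.
    candidates = [
        c for c in chunks
        if (language is None or c["language"] == language)
        and (path_prefix is None or c["path"].startswith(path_prefix))
    ]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:top_k]

chunks = [
    {"path": "services/billing/retry.go", "language": "go", "vector": [0.9, 0.1, 0.0]},
    {"path": "tests/helpers.py", "language": "python", "vector": [0.8, 0.2, 0.1]},
]
print(retrieve([0.85, 0.15, 0.05], chunks, language="go"))
```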
RAG should provide evidence, not decide policy
A critical design mistake is letting the model “vote” on policy when RAG is used. Retrieval should supply evidence and grounding, not authority. If the rule says all public APIs must include versioning notes, the model should retrieve the relevant API design doc and historical examples. The final decision should still come from the rule definition and validator logic. That separation is what makes the system trustworthy enough to use in CI and release automation.
Pro Tip: Use RAG to resolve ambiguity, not to replace the rule. If retrieval changes the answer, the rule is probably too vague and needs to be rewritten in measurable terms.
Executing validators across polyglot repos
Build a language adapter layer
Polyglot repositories are where naive code review automation breaks down. Each language has its own AST libraries, formatting conventions, package systems, and testing idioms. The solution is a language adapter layer that normalizes file diffs into a shared event model. That model can include file type, symbol changes, dependency changes, test changes, and configuration changes. Once normalized, the rule engine can run common policy logic across languages while using specialized analyzers underneath.
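A sketch of that normalized event model and adapter interface; the field names and the file-classification logic are assumptions about one possible design:

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class ChangeEvent:
    path: str
    language: str
    kind: str                      # "source" | "test" | "config" | "manifest"
    symbols_changed: list[str] = field(default_factory=list)
    dependencies_changed: list[str] = field(default_factory=list)

class LanguageAdapter(Protocol):
    def supports(self, path: str) -> bool: ...
    def normalize(self, path: str, diff_text: str) -> ChangeEvent: ...

class GoAdapter:
    def supports(self, path: str) -> bool:
        return path.endswith(".go") or path.endswith("go.mod")

    def normalize(self, path: str, diff_text: str) -> ChangeEvent:
        kind = "manifest" if path.endswith("go.mod") else (
            "test" if path.endswith("_test.go") else "source")
        # A real adapter would parse the AST to fill symbols_changed; this
        # sketch only classifies the file.
        return ChangeEvent(path=path, language="go", kind=kind)

adapters = [GoAdapter()]
event = next(a for a in adapters if a.supports("pkg/server_test.go")).normalize(
    "pkg/server_test.go", "+ func TestTimeout(t *testing.T) {}")
print(event)
```

The policy layer only ever sees `ChangeEvent` objects, so a "must include tests" rule can be written once and applied to Go, Python, or Terraform changes alike.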
For example, a “must not expose secrets” rule might inspect `.env` files, Kubernetes manifests, Terraform, Python settings, and Java properties with language-specific detectors. A “must include tests” rule may inspect changed source files and map them to related test paths. This is where platform thinking becomes essential, and it is similar to how teams simplify complex delivery stacks in DevOps lessons for small shops: create one normalized operational layer, then specialize at the edges.
Map rules to file and symbol scopes
Rules should not run against every diff blindly. Some apply only to changed functions, others to packages, services, or infrastructure manifests. Scope resolution is one of the most important scalability levers in the whole system. If a rule is about API compatibility, the engine should inspect public interfaces and generated client contracts, not every unrelated file in the repo. If a rule concerns dependency risk, it should focus on package manifests and lockfiles.
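A minimal scope-resolution sketch is shown below; the rule IDs and glob patterns are illustrative, and `fnmatch` is used loosely here where a production system would likely want stricter glob semantics:

```python
from fnmatch import fnmatch

RULE_SCOPES = {
    "db-backward-compat": ["migrations/*", "*/migrations/*.sql"],
    "dependency-risk": ["*/package.json", "*/go.mod", "*.lock"],
    "api-compat": ["api/*", "*/openapi.yaml"],
}

def rules_for(changed_path: str) -> list[str]:
    # fnmatch's "*" matches across "/" too, which keeps this sketch simple
    # but is more permissive than real glob matching.
    return [
        rule_id for rule_id, patterns in RULE_SCOPES.items()
        if any(fnmatch(changed_path, p) for p in patterns)
    ]

for path in ["migrations/0042_add_index.sql", "services/cart/go.mod", "docs/readme.md"]:
    print(path, "->", rules_for(path))
```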
This targeted execution reduces latency and false positives. It also helps with developer trust: reviewers see fewer irrelevant comments and are more likely to accept the system as a senior assistant rather than a nagging bot. That trust is a major part of the value proposition behind review automation, especially in teams that care about cost discipline and predictable outcomes, the same forces that drive interest in Kodus.
Use deterministic checks before model reasoning
When possible, run deterministic validators first. Static pattern matching, AST inspection, policy lints, and metadata checks are faster and more explainable than invoking an LLM. Only when the deterministic layer cannot confidently classify the change should the orchestrator request model reasoning. This layered architecture reduces token usage and improves consistency. It also aligns with the economics of modern code review systems, where cost and latency matter almost as much as accuracy.
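A sketch of that deterministic-first flow; the three-way check convention, the verdict fields, and the `call_llm` stub are all illustrative placeholders:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    status: str          # "pass" | "violation" | "uncertain"
    escalated: bool
    detail: str = ""

def call_llm(rule_id: str, diff_text: str, evidence: list[str]) -> Verdict:
    # Placeholder for the orchestration call to a model provider.
    return Verdict(status="uncertain", escalated=True, detail="model explanation goes here")

def evaluate(rule_id, diff_text, deterministic_checks, evidence) -> Verdict:
    results = [check(diff_text) for check in deterministic_checks]
    if any(r is True for r in results):
        return Verdict("violation", escalated=False, detail=rule_id)
    if results and all(r is False for r in results):
        return Verdict("pass", escalated=False)
    # Some check answered None ("cannot decide"); only now pay for model reasoning.
    return call_llm(rule_id, diff_text, evidence)

# Convention: a check returns True (violation), False (clearly fine), or None (ambiguous).
drops_column = lambda d: True if "DROP COLUMN" in d else None
print(evaluate("db-backward-compat", "+ ALTER TABLE users DROP COLUMN email;", [drops_column], []))
```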
The result is a much cleaner operating model for measuring AI agent performance. You can track deterministic pass rates, model escalation rates, false positive rates, and average review time per PR. Those metrics reveal whether the engine is truly enforcing policy or merely generating commentary.
CI integration and policy-as-code workflows
Where the engine fits in the delivery pipeline
For production use, the rule engine should run as part of the pull request lifecycle. The usual pattern is: developer opens PR, CI gathers diff and metadata, the policy engine evaluates rules, and the review agent posts structured findings back to the PR. In strict environments, blocking rules can fail the check; in advisory mode, the engine can annotate the diff with warnings and recommendations. Both modes should use the same underlying rule definitions.
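A minimal sketch of a CI entry point under those assumptions; the environment variable names, the single inline check standing in for the full engine, and the review-posting stub are illustrative rather than any specific provider's API:

```python
import os
import re
import subprocess
import sys

def changed_diff(base_ref: str) -> str:
    return subprocess.run(
        ["git", "diff", f"{base_ref}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout

def run_policy_engine(diff: str) -> list[dict]:
    # Stand-in for the full pipeline sketched earlier: one syntactic rule.
    findings = []
    for line in diff.splitlines():
        if line.startswith("+") and re.search(r"\beval\(", line):
            findings.append({"rule": "no-eval", "severity": "high", "line": line[1:].strip()})
    return findings

def post_review(findings: list[dict]) -> None:
    print(f"{len(findings)} finding(s)")  # placeholder for the SCM review API call

def main() -> int:
    findings = run_policy_engine(changed_diff(os.environ.get("BASE_REF", "origin/main")))
    post_review(findings)
    blocking = [f for f in findings if f["severity"] == "high"]
    # Advisory mode always exits 0; strict mode fails the check on blocking findings.
    return 1 if blocking and os.environ.get("ENFORCE_POLICY") == "true" else 0

if __name__ == "__main__":
    sys.exit(main())
```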
This is where agentic AI workflows become practical rather than abstract. The review agent is simply one orchestrated step in the pipeline, consuming code, context, and rules as inputs and emitting auditable outputs. If you design the interface cleanly, it becomes easy to embed in GitHub Actions, GitLab CI, Jenkins, or a self-hosted merge queue.
Version rules like code
Rules should live in Git, be reviewed by humans, and be promoted through environments just like application code. Versioning matters because policy changes can affect developer velocity and release behavior. A good setup stores rules in a repository folder, runs tests on changes, and requires approval from code owners or platform engineers. That gives you a traceable audit trail and makes rollback simple if a rule proves too strict.
Teams often borrow this discipline from other operational systems. For example, CI-triggered validation patterns in data engineering show how quickly confidence improves when checks are versioned and repeatable. The same principle applies to review policy. If you cannot diff it, test it, and roll it back, it is not ready for production governance.
Keep humans in the loop where judgment matters
Not every rule should be hard-enforced. Some issues are architectural tradeoffs, not violations. For those cases, the engine should produce a recommendation with evidence and let reviewers decide. The best systems distinguish between must-fix, should-fix, and observe-only outcomes. This avoids turning the agent into a bureaucracy machine while still capturing valuable signal.
That balance also improves adoption. Engineers are far more likely to trust a system that knows when to stay quiet. In practice, this is one of the main reasons teams move toward more context-aware tools and away from simplistic generic bots. A well-designed review system is less like a gate and more like a disciplined colleague.
Performance, scaling, and observability
Design for millions of changed lines, not just toy repos
Scaling a rule engine across large organizations means handling many repositories, languages, and pull request sizes. You need caching for embeddings, memoization for repeated rule evaluations, and sharded execution for large diffs. A change touching 200 files should not force the system to re-evaluate every unrelated policy at full depth. Good systems use incremental analysis: identify impacted modules, then run only the relevant validators.
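A sketch of diff-aware caching keyed on rule and file content hashes; the in-memory cache and function signatures are illustrative, and production systems would persist the cache across runs:

```python
import hashlib

_cache: dict[tuple[str, str], object] = {}

def cache_key(rule_id: str, rule_text: str, file_content: str) -> tuple[str, str]:
    rule_hash = hashlib.sha256(rule_text.encode()).hexdigest()
    file_hash = hashlib.sha256(file_content.encode()).hexdigest()
    return (f"{rule_id}:{rule_hash}", file_hash)

def evaluate_cached(rule_id, rule_text, file_content, evaluate):
    key = cache_key(rule_id, rule_text, file_content)
    if key not in _cache:
        # Only pay for evaluation when either the rule or the file actually changed.
        _cache[key] = evaluate(rule_text, file_content)
    return _cache[key]

result = evaluate_cached("no-eval", "never call eval", "x = 1\n", lambda r, f: "pass")
print(result, len(_cache))
```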
This is analogous to handling bursty infrastructure workloads, where architecture must absorb spikes without collapse. Teams that have worked on bursty data services already understand the lesson: separate ingestion from compute, buffer work, and process only what changed. For code review, queue-based execution and diff-aware caching are the difference between a responsive product and a bottleneck.
Measure latency, precision, and escalation rate
You should monitor at least four metrics: end-to-end review latency, validator precision, false positive rate, and model escalation rate. Latency tells you whether developers will tolerate the system in CI. Precision tells you whether the rules are trustworthy. False positives measure the cost of unnecessary noise. Escalation rate reveals how often the LLM is actually needed versus how often deterministic checks suffice.
If you want a deeper framework for operational metrics, the ideas in measuring an AI agent’s performance map directly to this use case. The best teams also track review acceptance rate: how often a human reviewer agrees with the agent. Over time, that metric is often more valuable than raw accuracy because it reflects real workflow fit.
Observability should be rule-centric
Logging every prompt is not enough. You need rule-centric traces that show which rule fired, what context was retrieved, which validator ran, what evidence was used, and why the final decision was produced. This makes debugging and audit much easier. It also helps teams tune rule language when the system behaves unexpectedly.
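A sketch of what a rule-centric trace record might capture; the fields are assumptions and should follow your own audit requirements:

```python
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
import json

@dataclass
class RuleTrace:
    rule_id: str
    pr_number: int
    validator: str
    retrieved_evidence: list[str]
    decision: str                      # "pass" | "violation" | "escalated"
    reason: str
    model_used: str | None = None      # None when deterministic checks sufficed
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

trace = RuleTrace(
    rule_id="db-backward-compat",
    pr_number=4812,
    validator="migration-diff-check",
    retrieved_evidence=["docs/adr/0007-expand-contract.md"],
    decision="violation",
    reason="DROP COLUMN without an expand-migrate-contract plan",
)
print(json.dumps(asdict(trace), indent=2))
```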
For compliance-sensitive organizations, this traceability is non-negotiable. In fact, it is one of the strongest arguments for using a plain-language rule engine instead of a prompt-only assistant. The ability to inspect every step makes the platform significantly more trustworthy and easier to adopt at scale.
Implementation blueprint: a practical stack
Recommended components
A production stack usually includes a rule authoring UI or repo-based DSL, a parser service, a policy registry, an embeddings store, a retrieval service, a validator runner, and a CI integration layer. On top of those, you can add an LLM orchestration service for classification, clarification, and explanation. The simplest version can start as a monorepo service with a job queue and a database; the more advanced version can separate concerns into dedicated microservices.
If you are designing this from scratch, study how modern platforms organize work across services in a maintainable way, much like the monorepo structure described in the Kodus AI source material. Clean boundaries matter. They let you swap model providers, update parsers, and add new language adapters without destabilizing the entire system.
Sample workflow for a single rule
Suppose a team writes: “All new public endpoints must include rate limiting and auth checks.” The parser tags it as a security and reliability rule. The engine retrieves API gateway docs, auth middleware examples, and existing endpoint patterns from the repo. A deterministic validator checks route declarations and middleware presence. If ambiguity remains, the LLM inspects the endpoint code and explains whether the rule is satisfied, violated, or uncertain. The final output is a structured PR comment with evidence and a confidence score.
That workflow shows the full value of combining orchestration, memory, and retrieval. The model is never asked to invent policy. It is asked to apply policy to context, which is a much more reliable job.
Hard lessons from production
The biggest failure modes are vague rules, excessive retrieval noise, and overuse of model reasoning. Vague rules create inconsistent execution. Noisy retrieval causes irrelevant explanations. Overuse of LLMs increases cost and can reduce repeatability. The fix is disciplined policy design, scoped retrieval, and a bias toward deterministic validation whenever possible.
Another important lesson is that teams need rollback strategies. If a rule starts blocking too many PRs, disable it quickly, inspect the examples, and revise the policy. Do not let enforcement drift into a support crisis. Good platform teams treat review rules like production features with observability, feature flags, and ownership.
Comparison table: approaches to code review automation
| Approach | How it works | Strengths | Weaknesses | Best fit |
|---|---|---|---|---|
| Prompt-only review bot | LLM reads diff and emits comments from instructions | Fast to prototype, flexible | Unstable, hard to audit, expensive at scale | Early experiments |
| Regex or lint rules | Deterministic patterns catch known violations | Cheap, fast, predictable | Low semantic understanding, brittle across languages | Simple guardrails |
| Plain-language rule engine | Rules are parsed into validators and enriched with context | Readable, testable, scalable, auditable | Requires initial schema and parser design | Production policy enforcement |
| RAG-only assistant | Retrieves docs and asks the model to judge changes | Good context grounding | Still probabilistic, weaker enforcement guarantees | Advisory review support |
| Policy-as-code platform | Versioned rules executed in CI with deterministic checks | Strong governance, rollback, compliance | Less natural-language friendliness unless layered with parsing | Enterprise CI integration |
How to adopt this incrementally
Start with three high-value rule classes
Do not try to automate every code review concern on day one. Start with rules that are easy to explain and expensive to miss: secrets, dependency changes, and public API compatibility. These categories have high leverage because they are common, measurable, and easy to validate. Once the system earns trust, expand to architectural patterns, test coverage, and performance-sensitive anti-patterns.
A phased rollout reduces cultural resistance. Teams are more willing to accept automation when it clearly prevents bugs rather than policing subjective style. This measured approach is the same kind of practical rollout logic you see in other operational programs, where successful adoption depends on high-signal wins rather than broad but shallow coverage.
Create a feedback loop with reviewers
Every reviewer action should feed back into the rule engine. If a human dismisses a finding, capture the reason. If they approve a rule suggestion, store the pattern. If they rewrite the rule, version that change as an improvement to the policy corpus. Over time, this creates a virtuous cycle where the system becomes more aligned with actual team behavior.
Feedback also improves retrieval quality. The engine can surface the most frequently cited examples when evaluating similar changes, which makes explanations more useful. This is especially valuable in small teams simplifying their tech stack, where a lightweight but improving system is often better than a heavyweight platform with poor fit.
Keep the UX developer-friendly
Good review automation is not just backend logic. The UI must show why a rule fired, what evidence was used, and what the developer can do next. Explainability is a product feature, not an afterthought. If the system feels like a black box, users will ignore it or disable it. If it feels like a senior engineer with receipts, adoption becomes much easier.
The best code review agents strike this balance by combining authoritative checks, concise feedback, and clear remediation steps. That is the practical promise of plain-language rules: fewer surprises, stronger governance, and a smoother path from policy intent to merge-ready code.
FAQ
How is a plain-language rule engine different from a prompt-based code review bot?
A prompt-based bot asks the model to reason directly from instructions. A plain-language rule engine first parses policy into structured validators, then uses the model only when needed for context or ambiguity. That makes the system more testable, auditable, and consistent across model changes.
Do I need RAG for every rule?
No. Use RAG when the rule depends on repository-specific conventions, prior decisions, architecture docs, or module ownership. For simple, deterministic checks like banned APIs or secrets detection, retrieval is often unnecessary and can add noise.
How do embeddings help code review rules?
Embeddings help match a rule or change to similar historical examples, docs, and accepted patterns. They are especially useful for contextual and semantic rules where exact string matching is insufficient, such as identifying architectural drift or inconsistent error handling.
What is the best way to support polyglot repos?
Use a normalized event model for diffs, then add language adapters for AST parsing and file-type-specific validators. Keep the policy layer language-agnostic where possible, and let specialized analyzers handle language differences underneath.
How do I keep false positives low?
Write rules in measurable terms, use deterministic checks first, scope validators tightly, and add examples for positive and negative cases. Also capture reviewer feedback so you can refine vague rules over time.
Can this replace human code review?
No, and it should not. The goal is to automate repetitive policy enforcement and provide better context, so humans can focus on design, tradeoffs, and risk. The strongest systems amplify reviewers rather than replacing them.
Conclusion
A plain-language rule engine is the missing middle layer between human policy and AI-assisted code review. It gives teams a way to express review intent clearly, compile that intent into executable validators, ground decisions in repository context, and scale enforcement across modern polyglot repos. When designed well, it behaves like a true piece of policy-as-code infrastructure: versioned, testable, observable, and reversible.
The deepest lesson from systems like Kodus is not simply cost reduction. It is that review automation becomes useful only when the model is placed inside a disciplined pipeline, not above it. If you combine constrained rule authoring, embeddings-backed retrieval, deterministic validators, and thoughtful LLM orchestration, you can build review agents that are both practical and trustworthy. That is the real path to shipping faster without sacrificing engineering standards.
Related Reading
- Architecting Agentic AI for Enterprise Workflows: Patterns, APIs, and Data Contracts - Learn how to structure reliable AI systems with clear orchestration boundaries.
- Architecting Agentic AI Workflows: When to Use Agents, Memory, and Accelerators - Explore the practical decision framework for agent memory and coordination.
- How to Measure an AI Agent’s Performance: The KPIs Creators Should Track - Use the right metrics to validate usefulness, accuracy, and adoption.
- Automating Data Profiling in CI: Triggering BigQuery Data Insights on Schema Changes - See how CI-triggered validation patterns improve confidence and governance.
- Building Resilient Data Services for Agricultural Analytics: Supporting Seasonal and Bursty Workloads - A useful analogue for designing scalable, burst-tolerant review pipelines.