LLM Routing & Learned Code Review

CODEGATE

A smart gateway that classifies a coding question, scores its complexity, and routes it to the right-sized model — plus a separate, fine-tuned small LLM that reviews human-written code for real security vulnerabilities.

Overview

Sending every query to one big model is wasteful, and for simple questions it is often the wrong tool. CodeGate rejects non-programming questions at the door, scores what is left, and routes simple lookups to a local model while reserving the strongest model for genuinely hard problems, which improves quality, latency, and quota use.

Two Systems, One Project

LLM Routing Pipeline

A gateway between a coding question and the best model to answer it. Intent gate → complexity router → tiered inference, streamed back token by token with routing metadata.
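The flow described above can be sketched as a generator that emits Server-Sent Events frames. This is a minimal illustration, not the project's actual code: the three callables (`classify_intent`, `score_tier`, `generate`), the event names, and the metadata fields are all hypothetical stand-ins for the real stages.

```python
import json
from typing import Callable, Iterable, Iterator


def sse_event(event: str, data: dict) -> str:
    """Format one SSE frame: event name, one JSON data line, then a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def handle_query(
    query: str,
    classify_intent: Callable[[str], str],          # hypothetical: "PROGRAMMING" or "NOT_PROGRAMMING"
    score_tier: Callable[[str], int],               # hypothetical: returns 1, 2, or 3
    generate: Callable[[int, str], Iterable[str]],  # hypothetical: yields answer tokens for a tier
) -> Iterator[str]:
    # Stage 1: intent gate. Non-coding queries stop here with a reason.
    label = classify_intent(query)
    if label != "PROGRAMMING":
        yield sse_event("rejected", {"reason": "not a programming question"})
        return

    # Stage 2: complexity router. Send the routing metadata before any tokens.
    tier = score_tier(query)
    yield sse_event("routing", {"intent": label, "tier": tier})

    # Stage 3: tiered inference, streamed token by token.
    for token in generate(tier, query):
        yield sse_event("token", {"text": token})
    yield sse_event("done", {})
```

In a FastAPI backend, a generator like this could be wrapped in a `StreamingResponse` with media type `text/event-stream`; whether CodeGate does exactly that is not stated here.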

Code Review (standalone)

A dedicated endpoint where you submit your own code. A fine-tuned small LLM flags security vulnerabilities in plain language and structured JSON — because the real bugs live in human-written code.

The Routing Pipeline

Stage 1

Intent Classifier

ModernBERT-base (fine-tuned)

Binary gate — PROGRAMMING or NOT_PROGRAMMING. Non-coding queries are rejected with a reason.

Stage 2

Complexity Router

all-MiniLM-L6-v2 (frozen) + MLP

Scores a confirmed coding query into Tier 1 / 2 / 3 from a 386-dim feature vector.
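The 386 dimensions are consistent with the 384-dim output of all-MiniLM-L6-v2 plus two hand-built features. The sketch below shows one way such a vector could be assembled. The keyword list, the length cap, and the function names are assumptions made for illustration; the embedding is passed in as a plain sequence so the example stays dependency-free.

```python
import re
from typing import Sequence

EMBED_DIM = 384  # output size of all-MiniLM-L6-v2
# Assumed keyword list; the project's actual list is not given.
DEBUG_KEYWORDS = ("error", "exception", "traceback", "bug", "crash", "fails")
MAX_WORDS = 256  # assumed cap used to scale length into [0, 1]


def debug_flag(query: str) -> float:
    """Return 1.0 if the query contains any debugging keyword, else 0.0."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return 1.0 if words.intersection(DEBUG_KEYWORDS) else 0.0


def normalized_length(query: str) -> float:
    """Whitespace word count, clipped at MAX_WORDS and scaled to [0, 1]."""
    return min(len(query.split()), MAX_WORDS) / MAX_WORDS


def build_features(embedding: Sequence[float], query: str) -> list[float]:
    """Concatenate the frozen embedding with the two scalar features (384 + 2 = 386)."""
    if len(embedding) != EMBED_DIM:
        raise ValueError(f"expected {EMBED_DIM} dims, got {len(embedding)}")
    return [*embedding, debug_flag(query), normalized_length(query)]
```

The resulting 386-element vector is what an MLP head would take as input.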

Stage 3

Tiered Inference

Ollama · Gemini API

Routes to the right model, each with its own system prompt, and streams the answer back.

Tier	Query type	Model	Cost
Tier 1	Simple / definitional	Local ~7B (qwen2.5-coder, Ollama)	$0.00
Tier 2	Intermediate / debugging	gemma-3-27b-it (Gemini API)	Free tier
Tier 3	Complex / architectural	gemini-2.0-flash (Gemini API)	Free tier
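The table above maps naturally onto a small lookup structure. The sketch below is one possible shape for it: the model names come from the table, while the `TierConfig` class, the backend labels, and the system prompts are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierConfig:
    model: str
    backend: str         # which client to call: "ollama" or "gemini"
    system_prompt: str   # placeholder text; the real prompts are not shown


TIERS = {
    1: TierConfig("qwen2.5-coder", "ollama", "Answer briefly and precisely."),
    2: TierConfig("gemma-3-27b-it", "gemini", "Debug step by step and explain the fix."),
    3: TierConfig("gemini-2.0-flash", "gemini", "Reason about architecture and trade-offs."),
}


def select_tier(tier: int) -> TierConfig:
    """Return the serving config for a router tier, rejecting unknown tiers."""
    try:
        return TIERS[tier]
    except KeyError:
        raise ValueError(f"unknown tier: {tier}") from None
```

Keeping the per-tier system prompt next to the model name means a tier can be retargeted by editing one entry.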

Component Results

Intent Classifier

Hardened across three dataset iterations against shortcut learning — with hard negatives for negation, prompt-injection, and code-adjacent queries — so it gates on real intent, not surface patterns.

Metric	Value
Slice eval	93.1%
Out-of-dist.	100%
Adversarial	86.7%
CPU p95	387 ms

Complexity Router

The MLP won a 5-way bake-off (LR, SVM, XGBoost, LightGBM, MLP). Two added features, a debug-keyword flag and a normalized length, separated the Tier 1 / Tier 2 bleed that t-SNE exposed in the pure embeddings.

Metric	Value
Test macro-F1	0.986
ECE (calib.)	0.011
OOD	15/15
p95 latency	4.5 ms
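For readers unfamiliar with ECE (expected calibration error): it bins predictions by confidence and averages the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below is a common equal-width formulation. The bin count of 10 is an assumption, since the binning used for the figure above is not stated.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed; the project's bin count is not given
) -> float:
    """ECE = sum over bins of (bin share of samples) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += 1.0 if ok else 0.0
        count[idx] += 1

    ece = 0.0
    for s, h, k in zip(conf_sum, hit_sum, count):
        if k:
            ece += (k / n) * abs(h / k - s / k)
    return ece
```

A value near zero means the router's top-class probability can be read roughly as the chance that the predicted tier is correct.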

The Code Reviewer

After scrapping an over-engineered AST-GNN plan, the reviewer became a fine-tuned Qwen2.5-Coder-3B — multi-language for free, with natural-language explanations built in. Trained with QLoRA on an H100, then quantized to a 1.93 GB GGUF that runs locally through Ollama.

Qwen2.5-Coder-3B → SFT · glaive-code-assistant → DPO · CyberNative → DPO v2 · targeted FPR fix → GGUF Q4_K_M · Ollama
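The DPO stages in this chain train on (chosen, rejected) response pairs. As a rough illustration, the standard per-pair DPO loss is the negative log-sigmoid of a scaled margin between policy and reference log-probabilities. The sketch below computes it for a single pair in plain Python. The `beta` value is an assumed default, and in practice a library such as `trl` handles this inside its trainer.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the project's setting is not given
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

The loss is log 2 when the policy matches the reference and falls as the policy prefers the chosen response more strongly than the reference does. For a false-positive fix like DPO v2, a pair might put a correct "no vulnerability" verdict as chosen and a spurious finding as rejected, though that pairing is an assumption about how the data was built.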

Deployed · GGUF Q4_K_M on local Ollama

Metric	Value
Injection recall	100%
Valid JSON	100%
p95 latency	3.7 s
Model size (runs offline)	1.93 GB

A 1.93 GB quantized model that runs free and fully offline through Ollama, with deterministic greedy decoding for clean structured output.
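A request to a local Ollama server with greedy decoding and JSON output could look like the sketch below. The endpoint, the `format` field, and `options.temperature` are part of Ollama's generate API. The model tag `codegate-reviewer` and the prompt wording are hypothetical, since the deployed model's name and prompt template are not given.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
REVIEWER_MODEL = "codegate-reviewer"  # hypothetical tag for the GGUF reviewer


def build_review_request(code: str, model: str = REVIEWER_MODEL) -> dict:
    """Build a non-streaming generate request with JSON output and temperature 0."""
    return {
        "model": model,
        # Assumed prompt wording; the real template is not shown.
        "prompt": f"Review this code for security vulnerabilities:\n\n{code}",
        "stream": False,
        "format": "json",                # ask Ollama to constrain output to JSON
        "options": {"temperature": 0},   # temperature 0 for deterministic, greedy-style decoding
    }


def review(code: str) -> dict:
    """Send the request to a running Ollama server and parse the model's JSON reply."""
    body = json.dumps(build_review_request(code)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())
    # In non-streaming mode the generated text is in the "response" field.
    return json.loads(reply["response"])
```

Calling `review(...)` needs a running Ollama instance with the model loaded; `build_review_request` can be inspected without one.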

What it does well

Perfect recall on the injection family: SQL, command, XSS, and path traversal
Valid JSON on every call; greedy decoding keeps the schema clean
Plain-language explanations and fixes, multi-language out of the box
Interactive latency on a plain local CPU, with no GPU or API needed
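A valid-JSON rate like the one claimed above can be measured with a check along these lines. This is only a sketch: the required keys `vulnerable` and `findings` are a hypothetical schema, since the reviewer's actual output format is not shown.

```python
import json
from typing import Iterable

# Hypothetical schema: the real reviewer's field names are not given.
REQUIRED_KEYS = frozenset({"vulnerable", "findings"})


def is_valid_review(raw: str, required: frozenset = REQUIRED_KEYS) -> bool:
    """True if raw parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required.issubset(parsed)


def valid_json_rate(outputs: Iterable[str]) -> float:
    """Fraction of model outputs that pass is_valid_review (0.0 for an empty batch)."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    return sum(is_valid_review(o) for o in outputs) / len(outputs)
```

Running this over an evaluation set gives a single number that is easy to track across the SFT, DPO, and quantized checkpoints.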

Built the full fine-tuning loop end to end — SFT → DPO → quantization → evaluation — on a 3B model that ships and runs anywhere.

Tech Stack

Component	Technology
Intent	ModernBERT-base
Router embed	all-MiniLM-L6-v2
Router head	PyTorch MLP
Reviewer	Qwen2.5-Coder-3B
Fine-tuning	QLoRA · SFT + DPO
Training libs	transformers · peft · trl
Tier 1 serving	Ollama
Tier 2 / 3	Gemini API
Backend	FastAPI · asyncio
Frontend	Next.js · SSE
Quantization	GGUF Q4_K_M
Tracking	Weights & Biases

Open Models

Models on HuggingFace

All five checkpoints are public — the intent classifier, the GGUF reviewer, and the SFT / DPO / DPO-v2 adapters.

huggingface.co/shaunmarvell
