SHAUN RODRIGUES
LLM Routing & Learned Code Review

CODEGATE

A smart gateway that classifies a coding question, scores its complexity, and routes it to the right-sized model — plus a separate, fine-tuned small LLM that reviews human-written code for real security vulnerabilities.


Overview

Sending every query to one big model is wasteful and often wrong. CodeGate rejects non-programming questions at the door, scores what is left, and routes simple lookups to a local model while reserving the strongest model for genuinely hard problems — a real quality, latency, and quota win.

Two Systems, One Project

01

LLM Routing Pipeline

A gateway between a coding question and the best model to answer it. Intent gate → complexity router → tiered inference, streamed back token by token with routing metadata.
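The "streamed back token by token with routing metadata" step could be framed as Server-Sent Events roughly as follows. This is a minimal sketch, not the project's actual wire format: the event names (`routing`, `token`, `done`) and the metadata keys are assumptions chosen for illustration.

```python
import json
from typing import Iterable, Iterator


def sse_event(event: str, data: dict) -> str:
    """Format one SSE frame: an event name, one JSON data line, then a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_answer(metadata: dict, tokens: Iterable[str]) -> Iterator[str]:
    """Yield a routing-metadata frame first, then one frame per token, then a done frame."""
    yield sse_event("routing", metadata)  # hypothetical event name
    for token in tokens:
        yield sse_event("token", {"text": token})
    yield sse_event("done", {})


if __name__ == "__main__":
    # Illustrative metadata; the real payload fields are not given in the source.
    meta = {"intent": "PROGRAMMING", "tier": 1, "model": "qwen2.5-coder"}
    for frame in stream_answer(meta, ["def", " add", "(a, b):"]):
        print(frame, end="")
```

Sending the metadata frame before the first token lets a client show which tier answered while the text is still arriving.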

02

Code Review (standalone)

A dedicated endpoint where you submit your own code. A fine-tuned small LLM flags security vulnerabilities in plain language and structured JSON — because the real bugs live in human-written code.

The Routing Pipeline

Stage 1

Intent Classifier

ModernBERT-base (fine-tuned)

Binary gate — PROGRAMMING or NOT_PROGRAMMING. Non-coding queries are rejected with a reason.

Stage 2

Complexity Router

all-MiniLM-L6-v2 (frozen) + MLP

Scores a confirmed coding query into Tier 1 / 2 / 3 from a 386-dim feature vector: the 384-dim sentence embedding plus a debug-keyword flag and a normalized query length.
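One way the 386-dim input could be assembled is sketched below, assuming it is the 384-dim all-MiniLM-L6-v2 embedding followed by the debug-keyword flag and normalized length mentioned under Component Results. The keyword list, the length cap, and the function name are hypothetical; the embedding is passed in so the sketch runs without loading a model.

```python
import re
from typing import Sequence

EMBED_DIM = 384  # output size of all-MiniLM-L6-v2
# Hypothetical keyword list; the project's real list is not given in the source.
DEBUG_KEYWORDS = {"error", "exception", "traceback", "bug", "crash", "fails"}
MAX_WORDS = 200  # hypothetical cap used to squash length into [0, 1]


def build_features(query: str, embedding: Sequence[float]) -> list[float]:
    """Concatenate the sentence embedding with two handcrafted scalar features."""
    if len(embedding) != EMBED_DIM:
        raise ValueError(f"expected {EMBED_DIM} embedding values, got {len(embedding)}")
    words = re.findall(r"[a-z']+", query.lower())
    # 1.0 when the query looks like a debugging request, else 0.0.
    debug_flag = 1.0 if any(word in DEBUG_KEYWORDS for word in words) else 0.0
    # Word count scaled into [0, 1] so it sits on a similar scale to the embedding.
    norm_length = min(len(words) / MAX_WORDS, 1.0)
    return [*embedding, debug_flag, norm_length]
```

The two extra scalars give the classifier head an explicit signal for debugging-style queries that the embedding alone may blur.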

Stage 3

Tiered Inference

Ollama · Gemini API

Routes to the right model, each with its own system prompt, and streams the answer back.

Tier    Query type                Model                              Cost
Tier 1  Simple / definitional     Local ~7B (qwen2.5-coder, Ollama)  $0.00
Tier 2  Intermediate / debugging  gemma-3-27b-it (Gemini API)        Free tier
Tier 3  Complex / architectural   gemini-2.0-flash (Gemini API)      Free tier
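The tier table can be read as a small dispatch map. The sketch below shows how a gate verdict and a tier score might turn into a routing decision; the `RoutingDecision` type, the `route` function, and the rejection message are illustrative names, not the project's code. Model names are taken from the table.

```python
from dataclasses import dataclass
from typing import Optional

# Tier -> (backend, model), taken from the tier table.
TIER_MODELS = {
    1: ("ollama", "qwen2.5-coder"),
    2: ("gemini", "gemma-3-27b-it"),
    3: ("gemini", "gemini-2.0-flash"),
}


@dataclass(frozen=True)
class RoutingDecision:
    accepted: bool
    tier: Optional[int] = None
    backend: Optional[str] = None
    model: Optional[str] = None
    reason: Optional[str] = None


def route(intent: str, tier: Optional[int] = None) -> RoutingDecision:
    """Reject non-programming queries; otherwise map the tier to a backend and model."""
    if intent != "PROGRAMMING":
        # Hypothetical rejection message.
        return RoutingDecision(accepted=False, reason="query is not a programming question")
    if tier not in TIER_MODELS:
        raise ValueError(f"unknown tier: {tier!r}")
    backend, model = TIER_MODELS[tier]
    return RoutingDecision(accepted=True, tier=tier, backend=backend, model=model)
```

For example, `route("PROGRAMMING", 1)` selects the local Ollama model, while `route("NOT_PROGRAMMING")` returns a rejection with a reason and never reaches a model.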

Component Results

Intent Classifier

Hardened across three dataset iterations against shortcut learning — with hard negatives for negation, prompt-injection, and code-adjacent queries — so it gates on real intent, not surface patterns.

Slice eval: 93.1%
Out-of-distribution: 100%
Adversarial: 86.7%
CPU p95 latency: 387 ms

Complexity Router

The MLP head won a 5-way bake-off (LR, SVM, XGBoost, LightGBM, MLP). A debug-keyword flag and normalized length broke the Tier 1 / Tier 2 bleed that t-SNE exposed in pure embeddings.

Test macro-F1: 0.986
ECE (calibration): 0.011
OOD: 15/15
p95 latency: 4.5 ms
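For readers unfamiliar with the calibration metric, expected calibration error (ECE) is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below follows that standard definition; the bin count of 10 is an assumption, since the source does not say how the reported figure was binned.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed; the source does not state the bin count
) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width confidence bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so a confidence of exactly 1.0 falls in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

A low ECE means the router's confidence scores can be trusted as probabilities, which matters if tier thresholds are ever tuned on them.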

The Code Reviewer

After scrapping an over-engineered AST-GNN plan, the reviewer became a fine-tuned Qwen2.5-Coder-3B — multi-language for free, with natural-language explanations built in. Trained with QLoRA on an H100, then quantized to a 1.93 GB GGUF that runs locally through Ollama.

Qwen2.5-Coder-3B → SFT · glaive-code-assistant → DPO · CyberNative → DPO v2 · targeted FPR fix → GGUF Q4_K_M · Ollama
Deployed · GGUF Q4_K_M on local Ollama
Injection recall: 100%
Valid JSON: 100%
p95 latency: 3.7 s
Model size (runs offline): 1.93 GB

A 1.93 GB quantized model that runs free and fully offline through Ollama, with deterministic greedy decoding for clean structured output.
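A request to a local Ollama server with deterministic settings might look like the sketch below. It uses Ollama's `/api/generate` endpoint with `format: "json"` and `temperature: 0`, which is one common way to approximate greedy decoding. The model tag and prompt wording are hypothetical, and the project's actual client code may differ.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address
MODEL_TAG = "codegate-reviewer"  # hypothetical tag for the quantized reviewer


def build_review_request(code: str) -> dict:
    """Build a non-streaming generate request that asks for JSON output."""
    return {
        "model": MODEL_TAG,
        # Hypothetical prompt wording.
        "prompt": f"Review this code for security vulnerabilities:\n\n{code}",
        "format": "json",  # ask Ollama to constrain the output to valid JSON
        "stream": False,
        "options": {"temperature": 0},  # deterministic, greedy-style decoding
    }


def review(code: str) -> dict:
    """Send the request to a running Ollama server and parse the model's JSON reply."""
    body = json.dumps(build_review_request(code)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read())
    # The generated text is returned as a string in the "response" field.
    return json.loads(reply["response"])
```

Calling `review(...)` requires a running Ollama instance with the model pulled; `build_review_request` can be inspected without one.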

What it does well
  • Perfect recall on the injection family: SQL, command, XSS, and path traversal
  • Valid JSON on every call; greedy decoding keeps the schema clean
  • Plain-language explanations and fixes, multi-language out of the box
  • Interactive latency on a plain local CPU, with no GPU or API needed
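A "valid JSON on every call" check can be sketched as a small parser that rejects anything off-schema. The schema here (a `vulnerabilities` list whose items carry `type`, `severity`, `explanation`, and `fix`) is an assumption made for illustration; the reviewer's real output fields are not listed in the source.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"type": str, "severity": str, "explanation": str, "fix": str}


def parse_review(raw: str) -> list[dict]:
    """Parse reviewer output and raise ValueError if it does not match the assumed schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    findings = data.get("vulnerabilities")
    if not isinstance(findings, list):
        raise ValueError("expected a 'vulnerabilities' list")
    for index, finding in enumerate(findings):
        if not isinstance(finding, dict):
            raise ValueError(f"finding {index}: expected an object")
        for field, kind in REQUIRED_FIELDS.items():
            if not isinstance(finding.get(field), kind):
                raise ValueError(f"finding {index}: missing or invalid '{field}'")
    return findings
```

Counting how many raw outputs pass a check like this over an evaluation set gives a valid-JSON rate, and an empty `vulnerabilities` list is treated as a clean review.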

Built the full fine-tuning loop end to end — SFT → DPO → quantization → evaluation — on a 3B model that ships and runs anywhere.

Tech Stack

Intent: ModernBERT-base
Router embed: all-MiniLM-L6-v2
Router head: PyTorch MLP
Reviewer: Qwen2.5-Coder-3B
Fine-tuning: QLoRA · SFT + DPO
Training libs: transformers · peft · trl
Tier 1 serving: Ollama
Tier 2 / 3: Gemini API
Backend: FastAPI · asyncio
Frontend: Next.js · SSE
Quantization: GGUF Q4_K_M
Tracking: Weights & Biases
Open Models

Models on HuggingFace

All five checkpoints are public — the intent classifier, the GGUF reviewer, and the SFT / DPO / DPO-v2 adapters.

huggingface.co/shaunmarvell