CODEGATE
A smart gateway that classifies a coding question, scores its complexity, and routes it to the right-sized model — plus a separate, fine-tuned small LLM that reviews human-written code for real security vulnerabilities.

Overview
Sending every query to one big model is wasteful and often wrong. CodeGate rejects non-programming questions at the door, scores what is left, and routes simple lookups to a local model while reserving the strongest model for genuinely hard problems — a real quality, latency, and quota win.
Two Systems, One Project
LLM Routing Pipeline
A gateway between a coding question and the best model to answer it. Intent gate → complexity router → tiered inference, streamed back token by token with routing metadata.
Code Review (standalone)
A dedicated endpoint where you submit your own code. A fine-tuned small LLM flags security vulnerabilities in plain language and structured JSON — because the real bugs live in human-written code.
The Routing Pipeline
Intent Classifier
Binary gate — PROGRAMMING or NOT_PROGRAMMING. Non-coding queries are rejected with a reason.
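The gate's contract (a label, plus a reason on rejection) can be sketched as follows. This is a minimal illustration, not the project's implementation: `GateDecision`, `gate`, the `predict_proba` callable, and the 0.5 threshold are all hypothetical stand-ins for the trained classifier and its settings.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GateDecision:
    label: str             # "PROGRAMMING" or "NOT_PROGRAMMING"
    reason: Optional[str]  # filled in only when the query is rejected


def gate(
    query: str,
    predict_proba: Callable[[str], float],  # hypothetical: returns P(query is programming)
    threshold: float = 0.5,                 # hypothetical cut-off
) -> GateDecision:
    """Accept or reject a query before any routing happens."""
    p = predict_proba(query)
    if p >= threshold:
        return GateDecision("PROGRAMMING", None)
    reason = f"classified as non-programming (p={p:.2f} < {threshold:.2f})"
    return GateDecision("NOT_PROGRAMMING", reason)
```

Only queries labelled `PROGRAMMING` would continue to the complexity router; a rejection carries its reason back to the caller.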
Complexity Router
Scores a confirmed coding query into Tier 1 / 2 / 3 from a 386-dim feature vector.
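One way a 386-dim vector could be assembled is a sentence embedding plus the two hand-built features mentioned under Component Results (a debug-keyword flag and normalized length). The 384 + 2 split, the keyword list, and the length cap below are assumptions made for illustration; the source states only the total dimension and the names of the two extra features.

```python
from typing import List, Sequence

EMBED_DIM = 384  # assumption: 386 total minus the two hand-built features
DEBUG_KEYWORDS = ("error", "exception", "traceback", "crash", "fails")  # hypothetical list
MAX_CHARS = 2000  # hypothetical cap used to normalize query length


def build_features(query: str, embedding: Sequence[float]) -> List[float]:
    """Concatenate an embedding with a debug flag and a normalized length."""
    if len(embedding) != EMBED_DIM:
        raise ValueError(f"expected {EMBED_DIM} embedding values, got {len(embedding)}")
    lowered = query.lower()
    # 1.0 if the query mentions any debugging keyword, else 0.0
    debug_flag = 1.0 if any(k in lowered for k in DEBUG_KEYWORDS) else 0.0
    # Query length in characters, clipped and scaled into [0, 1]
    norm_len = min(len(query), MAX_CHARS) / MAX_CHARS
    return list(embedding) + [debug_flag, norm_len]
```

The resulting list would then be passed to whichever classifier assigns the tier.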
Tiered Inference
Routes to the right model, each with its own system prompt, and streams the answer back.
| Tier | Query type | Model | Cost |
|---|---|---|---|
| Tier 1 | Simple / definitional | Local ~7B (qwen2.5-coder, Ollama) | $0.00 |
| Tier 2 | Intermediate / debugging | gemma-3-27b-it (Gemini API) | Free tier |
| Tier 3 | Complex / architectural | gemini-2.0-flash (Gemini API) | Free tier |
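The table above maps naturally onto a small routing config. The sketch below is illustrative only: `TierConfig`, `route`, the event shapes, and the system prompts are hypothetical, and the exact model tags each backend expects may differ from the names in the table.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator


@dataclass(frozen=True)
class TierConfig:
    model: str
    backend: str        # "ollama" (local) or "gemini" (API)
    system_prompt: str  # placeholder text, not the project's real prompts


TIERS: Dict[int, TierConfig] = {
    1: TierConfig("qwen2.5-coder", "ollama", "Answer briefly and directly."),
    2: TierConfig("gemma-3-27b-it", "gemini", "Debug the problem step by step."),
    3: TierConfig("gemini-2.0-flash", "gemini", "Reason about design trade-offs."),
}


def route(
    query: str,
    tier: int,
    generate: Callable[[TierConfig, str], Iterable[str]],  # hypothetical backend call
) -> Iterator[dict]:
    """Yield one routing-metadata event, then one event per streamed token."""
    cfg = TIERS[tier]
    yield {"type": "metadata", "tier": tier, "model": cfg.model, "backend": cfg.backend}
    for token in generate(cfg, query):
        yield {"type": "token", "text": token}
```

Emitting the metadata event first lets a client show which tier and model were chosen before the first token arrives.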
Component Results
Intent Classifier
Hardened across three dataset iterations against shortcut learning — with hard negatives for negation, prompt-injection, and code-adjacent queries — so it gates on real intent, not surface patterns.
Complexity Router
The router model was selected in a 5-way bake-off (LR, SVM, XGBoost, LightGBM, MLP). Adding a debug-keyword flag and a normalized-length feature fixed the Tier 1 / Tier 2 overlap that t-SNE exposed when only embeddings were used.
The Code Reviewer
After scrapping an over-engineered AST-GNN plan, the reviewer became a fine-tuned Qwen2.5-Coder-3B: multi-language for free, with natural-language explanations built in. It was trained with QLoRA on an H100, then quantized to a 1.93 GB GGUF.
The quantized model runs free and fully offline through Ollama, with deterministic greedy decoding for clean structured output.
- Perfect recall on the injection family: SQL, command, XSS, and path traversal
- Valid JSON on every call; greedy decoding keeps the schema clean
- Plain-language explanations and fixes, multi-language out of the box
- Interactive latency on a plain local CPU, with no GPU or API needed
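A request to the local reviewer and a check on its structured output might look like the sketch below. It builds a payload for Ollama's `/api/generate` endpoint with `temperature` set to 0 to approximate greedy decoding, and validates the returned JSON. The model name `codegate-reviewer`, the prompt wording, and the `findings` schema with its three keys are assumptions; the project's actual schema is not shown in this document.

```python
import json
from typing import Any, Dict, List

REVIEWER_MODEL = "codegate-reviewer"  # hypothetical Ollama model name
REQUIRED_KEYS = {"vulnerability", "explanation", "fix"}  # hypothetical schema


def build_review_request(code: str) -> Dict[str, Any]:
    """Payload for POST http://localhost:11434/api/generate (not sent here)."""
    return {
        "model": REVIEWER_MODEL,
        "prompt": f"Review this code for security vulnerabilities:\n\n{code}",
        "format": "json",               # ask Ollama for JSON-only output
        "stream": False,
        "options": {"temperature": 0},  # deterministic, greedy-style decoding
    }


def parse_findings(raw: str) -> List[Dict[str, Any]]:
    """Parse the model's JSON text and verify every finding has the expected keys."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("findings"), list):
        raise ValueError("expected a JSON object with a 'findings' list")
    for finding in data["findings"]:
        missing = REQUIRED_KEYS - set(finding)
        if missing:
            raise ValueError(f"finding is missing keys: {sorted(missing)}")
    return data["findings"]
```

Validating on the client side keeps a malformed response from reaching downstream consumers, even when the model is expected to return valid JSON on every call.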
Built the full fine-tuning loop end to end — SFT → DPO → quantization → evaluation — on a 3B model that ships and runs anywhere.
Tech Stack
Models on HuggingFace
All five checkpoints are public — the intent classifier, the GGUF reviewer, and the SFT / DPO / DPO-v2 adapters.