GROUNDZERO
Give it a long video and a sentence — it finds the exact start and end timestamps of the moment you described. Not classification, not captioning: locating where in time an event happens, down to the second.

What it does
Most long video is unstructured: lectures, depositions, security footage, match archives. To find a single moment, you either scrub by hand or use a model that reads language and video together and returns the span you are looking for. GroundZero outputs that span as (start, end) in seconds.
The pipeline
A naive SigLIP zero-shot pass scores only ~35%, because consecutive frames look nearly identical to the backbone. Five stages turn flat frame embeddings into a precise span, and each stage was built and unit-tested on its own.
- 01
Frame sampling + visual encoder
Sample at 1fps; encode each frame with SigLIP 2 So400m (1152-d), with LoRA adapters on the last 4 transformer blocks (0.15M trainable parameters).
- 02
Temporal context
Dilated 1D convolutions (dilation rates 1, 2, 4, 8) plus fractional positional encoding give every frame a ~60-second receptive field at 1fps, so each frame's representation reflects what surrounds it.
- 03
Text encoder
SigLIP 2's built-in text tower (frozen) embeds the query into the same 1152-d space — no separate model, no projection.
- 04
Cross-modal transformer
Four layers alternate between cross- and self-attention. Frames attend to the query, lighting up the timeline near the described event.
- 05
Span extraction head
Per-frame start/end scoring (BERT-SQuAD style) decodes the best valid span; a confidence head estimates whether the event is present at all.
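The span decoding in stage 05 can be illustrated with a minimal sketch. It assumes the head emits one start score and one end score per frame, and that a span is "valid" when the end frame is not before the start frame. The function name `decode_best_span`, the optional `max_len` cap, and the convention that frame j covers the interval [j, j+1) at 1fps are all illustrative assumptions, not details taken from GroundZero's code.

```python
from typing import Optional, Sequence, Tuple


def decode_best_span(
    start_scores: Sequence[float],
    end_scores: Sequence[float],
    max_len: Optional[int] = None,  # hypothetical cap on span length, in frames
    fps: float = 1.0,
) -> Tuple[float, float, float]:
    """Pick the (start, end) frame pair with the highest summed score.

    Only pairs with start <= end are considered, as in SQuAD-style
    answer-span decoding. Returns (start_sec, end_sec, score).
    """
    n = len(start_scores)
    if n == 0 or n != len(end_scores):
        raise ValueError("start_scores and end_scores must be non-empty and equal length")

    best_i, best_j, best_score = 0, 0, float("-inf")
    for i in range(n):
        # Restrict the end index to j >= i (and optionally j < i + max_len).
        j_stop = n if max_len is None else min(n, i + max_len)
        for j in range(i, j_stop):
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best_i, best_j, best_score = i, j, score

    # Assumption: frame j covers [j / fps, (j + 1) / fps), so the span
    # ends at the end of the last selected frame.
    return best_i / fps, (best_j + 1) / fps, best_score


if __name__ == "__main__":
    start = [0.1, 2.0, 0.3, 0.2, 0.0]
    end = [0.0, 0.1, 0.4, 1.5, 0.2]
    print(decode_best_span(start, end))
```

The exhaustive search is quadratic in the number of frames, which is acceptable for a sketch; a running maximum over start scores would give the same result in linear time when no length cap is used.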
Inference is coarse-to-fine: a 1fps scan finds the region, then that window can be re-sampled at 4fps for tighter boundaries. Training runs on cached embeddings from the frozen backbone: 200 epochs on a Lightning.ai H100 in ~1 hour.
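As a rough sketch of the coarse-to-fine step described above, the function below takes a coarse span from the 1fps scan and returns the timestamps to re-sample at 4fps. The padding of 2 seconds on each side, the clamping to the video duration, and the name `refine_window` are assumptions made for illustration; the source does not say how the refinement window is chosen.

```python
from typing import List


def refine_window(
    coarse_start: float,
    coarse_end: float,
    video_duration: float,
    pad: float = 2.0,       # hypothetical margin around the coarse span, in seconds
    fine_fps: float = 4.0,  # re-sampling rate for the second pass
) -> List[float]:
    """Return the timestamps (in seconds) to sample for the fine pass."""
    if coarse_end < coarse_start:
        raise ValueError("coarse_end must be >= coarse_start")

    # Widen the coarse span and clamp it to the video bounds.
    lo = max(0.0, coarse_start - pad)
    hi = min(video_duration, coarse_end + pad)

    # Sample uniformly in [lo, hi) at fine_fps.
    n_frames = int(round((hi - lo) * fine_fps))
    step = 1.0 / fine_fps
    return [lo + k * step for k in range(n_frames)]


if __name__ == "__main__":
    timestamps = refine_window(10.0, 20.0, video_duration=150.0)
    print(len(timestamps), timestamps[0], timestamps[-1])
```

The fine pass would then score only these frames, and a predicted fine frame index k maps back to video time as `timestamps[k]`.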
Results
Evaluated on the QVHighlights validation split (n=1550). Built from scratch on a single dataset with a frozen SigLIP 2 backbone + LoRA, landing in Moment-DETR territory while training the whole grounding stack from zero.
Highlights
Strong candidate retrieval
The right moment lands in the top-5 predictions ~80% of the time, so the representation surfaces the answer reliably across the val set.
Robust to query length
R@1 holds flat (0.51–0.55) from short 1–5 word queries to long 16+ word ones, so accuracy does not depend on how tersely or verbosely the event is described.
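For readers unfamiliar with the top-5 and R@1 figures above, here is a minimal sketch of how recall at k is commonly computed for moment retrieval: a query counts as a hit if any of its top-k predicted spans overlaps the ground-truth span with temporal IoU at or above a threshold. The 0.5 threshold, the single ground-truth span per query, and the function names are assumptions for illustration; the source does not state which IoU threshold its numbers use.

```python
from typing import List, Sequence, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    ranked_preds: Sequence[List[Span]],  # per query, spans sorted best-first
    gts: Sequence[Span],                 # assumption: one ground-truth span per query
    k: int,
    iou_thresh: float = 0.5,             # assumed threshold, not from the source
) -> float:
    """Fraction of queries with at least one top-k span above the IoU threshold."""
    if len(ranked_preds) != len(gts) or not gts:
        raise ValueError("need one prediction list per ground-truth span")
    hits = sum(
        any(temporal_iou(p, gt) >= iou_thresh for p in preds[:k])
        for preds, gt in zip(ranked_preds, gts)
    )
    return hits / len(gts)


if __name__ == "__main__":
    preds = [[(0.0, 5.0), (10.0, 20.0)], [(30.0, 40.0), (0.0, 1.0)]]
    gts = [(11.0, 19.0), (50.0, 60.0)]
    print(recall_at_k(preds, gts, k=1), recall_at_k(preds, gts, k=2))
```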
Trained end to end, from scratch
Every grounding module built and unit-tested by hand, then trained on cached embeddings — 200 epochs in ~1 hour on a single H100.
What's next
A clear, research-backed path to sharper localization — the next iterations the architecture is built to grow into.
Word-level query attention
Feed per-word query tokens with early cross-attention (QD-DETR / CG-DETR) to push top-1 ranking toward state of the art.
Cross-dataset reach
Extend training and evaluation to Charades-STA and ActivityNet for broader, multi-domain generalization.
Calibrated confidence
A trustworthy presence score that powers an explicit, reliable “not found” response.
Built chunk by chunk, end to end — a hands-on deep dive into PyTorch, transformers, and the full ML lifecycle.