Temporal Video Grounding
vision · language · time

GROUNDZERO

Give it a long video and a sentence — it finds the exact start and end timestamps of the moment you described. Not classification, not captioning: locating where in time an event happens, down to the second.

Timeline 0:00–2:30 · ground truth [0:00–0:22] · predicted [0:00–0:20] · IoU 0.91

Backbone

SigLIP 2 So400m

Trained on

QVHighlights

Best R@1 IoU=0.5

0.5413

Status

Trained · served

What it does

Most long video is unstructured — lectures, depositions, security footage, match archives. To find a single moment you either scrub by hand or you need a model that reads language and video at once and returns the exact span of the video you're looking for. GroundZero outputs (start, end) in seconds.

Not retrieval

finds where inside a video, not which video

Not captioning

pinpoints a moment, doesn't describe the clip

Not chapters

works on raw video with no labels or transcript

The pipeline

A naive SigLIP zero-shot pass scores only ~35%, because consecutive frames look nearly identical to the backbone. Five stages turn flat frame embeddings into a precise span, each built and unit-tested on its own.

01
Frame sampling + visual encoder
Sample at 1fps; encode each frame with SigLIP 2 So400m (1152-d), with LoRA adapters on the last 4 transformer blocks (0.15M trainable).
02
Temporal context
Dilated 1D convolutions (dilations [1, 2, 4, 8]) plus fractional positional encoding give every frame a ~60-second receptive field at 1fps, so each frame embedding carries information about what surrounds it.
03
Text encoder
SigLIP 2's built-in text tower (frozen) embeds the query into the same 1152-d space — no separate model, no projection.
04
Cross-modal transformer
4 layers alternating cross- and self-attention. Frames attend to the query, lighting up the timeline near the described event.
05
Span extraction head
Per-frame start/end scoring (BERT-SQuAD style) decodes the best valid span; a confidence head estimates whether the event is present at all.
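
The ~60-second receptive field quoted for stage 02 can be sanity-checked with the standard formula for stacked stride-1 dilated convolutions. This is a minimal sketch: the dilations [1, 2, 4, 8] come from the description above, but the kernel size is not stated there, so the values tried below are assumptions.

```python
def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Frames visible to one output position after stacking
    stride-1 dilated 1D convolutions, one layer per dilation."""
    rf = 1
    for d in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation frames.
        rf += (kernel_size - 1) * d
    return rf


dilations = [1, 2, 4, 8]  # from the pipeline description
for k in (3, 5):          # assumed kernel sizes; the source does not state one
    frames = receptive_field(k, dilations)
    print(f"kernel={k}: {frames} frames, i.e. {frames} s at 1fps")
```

Under these assumptions a kernel of 3 covers 31 frames and a kernel of 5 covers 61, so a single pass with kernel 5 (or two stacked passes with kernel 3) would be consistent with the ~60-second figure.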
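
The SQuAD-style decoding in stage 05 can be sketched as a search over all (start, end) pairs with start <= end, keeping the pair with the highest combined score. The function name, the `max_len` cap, the toy scores, and the frame-to-seconds convention are illustrative assumptions, not GroundZero's actual code.

```python
def decode_span(start_scores, end_scores, max_len=None):
    """Return (start_idx, end_idx, score) maximizing
    start_scores[s] + end_scores[e] subject to s <= e.
    max_len optionally caps the span length in frames (hypothetical knob)."""
    n = len(start_scores)
    best = (0, 0, float("-inf"))
    for s in range(n):
        last = n if max_len is None else min(n, s + max_len)
        for e in range(s, last):  # e starts at s, so only valid spans are scored
            score = start_scores[s] + end_scores[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Toy per-frame scores for a 5-frame clip (invented for illustration).
start_scores = [0.1, 2.0, 0.3, 0.0, -1.0]
end_scores = [0.0, 0.1, 0.2, 1.5, 0.4]

s, e, score = decode_span(start_scores, end_scores)
fps = 1.0
# Assumption: frame i covers [i / fps, (i + 1) / fps), so the end frame is inclusive.
start_sec, end_sec = s / fps, (e + 1) / fps
print(s, e, start_sec, end_sec)
```

The brute-force search is quadratic in the number of frames, which stays cheap at 1fps on a clip a few minutes long.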

Inference is coarse-to-fine: a 1fps scan finds the region, then that window can be re-sampled at 4fps for tighter boundaries. Trained on cached frozen embeddings — 200 epochs on a Lightning.ai H100 in ~1 hour.
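
The refinement step can be sketched as choosing which timestamps to re-sample once the 1fps scan has returned a coarse span. The padding value, the function name, and the 150-second clip length are assumptions for illustration; only the 1fps and 4fps rates come from the text above.

```python
def refine_timestamps(coarse_start, coarse_end, duration, pad=2.0, fine_fps=4.0):
    """Timestamps (in seconds) to re-sample around a coarse span.
    The window is widened by `pad` seconds on each side (hypothetical margin)
    and clipped to the video, then sampled at `fine_fps`."""
    lo = max(0.0, coarse_start - pad)
    hi = min(duration, coarse_end + pad)
    n = int(round((hi - lo) * fine_fps))
    return [lo + i / fine_fps for i in range(n + 1)]


# Coarse prediction [0 s, 20 s] on an assumed 150-second clip.
ts = refine_timestamps(0.0, 20.0, duration=150.0)
print(len(ts), ts[0], ts[1], ts[-1])
```

The frames at these timestamps would then go back through the same five stages, so the second pass only pays for a short window rather than the full video.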

Results

R@1 · IoU 0.5: 0.541

R@1 · IoU 0.7: 0.386

R@5 · IoU 0.5: 0.797

QVHighlights val (n=1550). Built from scratch on a single dataset with a frozen SigLIP 2 backbone + LoRA — landing in Moment-DETR territory while training the whole grounding stack from zero.
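
For readers new to the metric, here is a minimal sketch of temporal IoU and R@K at an IoU threshold. It assumes one ground-truth span per query and a ranked list of predicted spans; the helper names and the second toy query are invented for illustration and are not the official QVHighlights evaluation code.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(predictions, ground_truths, k, iou_threshold):
    """Fraction of queries where any of the top-k predicted spans
    reaches the IoU threshold against the ground-truth span."""
    hits = 0
    for ranked, gt in zip(predictions, ground_truths):
        if any(temporal_iou(p, gt) >= iou_threshold for p in ranked[:k]):
            hits += 1
    return hits / len(ground_truths)


# First query mirrors the verified val clip: pred [0, 20] vs GT [0, 22].
predictions = [[(0, 20), (30, 40)], [(50, 60), (10, 18)]]
ground_truths = [(0, 22), (10, 20)]

print(round(temporal_iou((0, 20), (0, 22)), 2))        # 20 / 22
print(recall_at_k(predictions, ground_truths, 1, 0.5))
print(recall_at_k(predictions, ground_truths, 2, 0.5))
```

In the toy data the second query misses at rank 1 but hits at rank 2, which is the same gap that separates the R@1 and R@5 rows above.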

IoU 0.91

Verified val clip

pred [0:00–0:20] vs GT [0:00–0:22]

0.797

R@5 · IoU 0.5

right moment in the top-5

~1 hr

Training time

200 epochs on an H100

7.4k

Training clips

QVHighlights, 1fps frames

Highlights

Strong candidate retrieval

The right moment lands in the top-5 ~80% of the time — the representation surfaces the answer reliably across the val set.

Robust to query length

R@1 holds roughly flat (0.51–0.55) from short 1–5 word queries to long 16+ word ones, so accuracy does not depend much on how verbosely the moment is described.
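
A breakdown like this can be produced by bucketing queries by word count and averaging the per-query hit flag. This is a sketch only: the 1–5 and 16+ buckets come from the text above, while the two middle buckets, the function names, and the toy queries are assumptions.

```python
from collections import defaultdict


def length_bucket(query: str) -> str:
    """Bucket a query by whitespace-separated word count."""
    n = len(query.split())
    if n <= 5:
        return "1-5"
    if n <= 10:
        return "6-10"   # assumed middle bucket
    if n <= 15:
        return "11-15"  # assumed middle bucket
    return "16+"


def r1_by_length(queries, hits):
    """Mean of per-query hit flags (1 if top-1 IoU >= threshold) per bucket."""
    totals, correct = defaultdict(int), defaultdict(int)
    for query, hit in zip(queries, hits):
        bucket = length_bucket(query)
        totals[bucket] += 1
        correct[bucket] += int(hit)
    return {bucket: correct[bucket] / totals[bucket] for bucket in totals}


# Toy queries and hit flags, invented for illustration.
queries = [
    "man opens door",
    "a dog runs on the beach",
    "woman talks to the camera while walking through a crowded street market",
]
hits = [True, False, True]
print(r1_by_length(queries, hits))
```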

Trained end to end, from scratch

Every grounding module built and unit-tested by hand, then trained on cached embeddings — 200 epochs in ~1 hour on a single H100.

Stack

Visual + text

SigLIP 2 So400m

Adaptation

LoRA · peft

Temporal

Dilated 1D conv

Fusion

Cross-modal transformer

Boundaries

Span extraction head

Backend

FastAPI · PyTorch

Frontend

Next.js · /try demo

Training

Lightning.ai H100 · W&B

What's next

A clear, research-backed path to sharper localization — the next iterations the architecture is built to grow into.

Word-level query attention

Feed per-word query tokens with early cross-attention (QD-DETR / CG-DETR) to push top-1 ranking toward state of the art.

Cross-dataset reach

Extend training and evaluation to Charades-STA and ActivityNet for broader, multi-domain generalization.

Calibrated confidence

A trustworthy presence score that powers an explicit, reliable “not found” response.

Built chunk by chunk, end to end — a hands-on deep dive into PyTorch, transformers, and the full ML lifecycle.

Weights & data

HuggingFace Hub
