SHAUN RODRIGUES
Temporal Video Grounding
vision · language · time

GROUNDZERO

Give it a long video and a sentence — it finds the exact start and end timestamps of the moment you described. Not classification, not captioning: locating where in time an event happens, down to the second.

Demo: GroundZero temporal video grounding (player timeline 0:00 to 2:30)
ground truth [0:00–0:22] · predicted [0:00–0:20] · IoU 0.91
Backbone: SigLIP 2 So400m
Trained on: QVHighlights
Best R@1 IoU=0.5: 0.5413
Status: Trained · served
01

What it does

Most long video is unstructured: lectures, depositions, security footage, match archives. To find a single moment, you either scrub by hand or use a model that reads language and video together and returns the exact span you're looking for. GroundZero outputs that span as (start, end) in seconds.

Not retrieval: finds where inside a video, not which video
Not captioning: pinpoints a moment, doesn't describe the clip
Not chapters: works on raw video with no labels or transcript
02

The pipeline

A naive SigLIP zero-shot pass scores ~35% because consecutive frames look nearly identical to the backbone. Five stages turn flat frame embeddings into a precise span, each built and unit-tested on its own.

  1. Frame sampling + visual encoder

    Sample at 1fps; encode each frame with SigLIP 2 So400m (1152-d), with LoRA adapters on the last 4 transformer blocks (0.15M trainable).

  2. Temporal context

    Dilated 1D convolutions [1,2,4,8] + fractional positional encoding give every frame a ~60-second receptive field, so it knows what surrounds it.

  3. Text encoder

    SigLIP 2's built-in text tower (frozen) embeds the query into the same 1152-d space, with no separate model and no projection.

  4. Cross-modal transformer

    4 layers alternating cross- and self-attention. Frames attend to the query, lighting up the timeline near the described event.

  5. Span extraction head

    Per-frame start/end scoring (BERT-SQuAD style) decodes the best valid span; a confidence head estimates whether the event is present at all.
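
The span decoding in stage 5 can be pictured with a small routine. This is a minimal sketch, not the project's code: the name `decode_best_span`, the `max_len` cap, and the convention that frame k covers [k/fps, (k+1)/fps) are all assumptions made for illustration. It scores every valid (start, end) pair as the sum of the two per-frame scores, the way SQuAD-style extractive QA decoding does.

```python
from typing import Optional, Sequence, Tuple


def decode_best_span(
    start_scores: Sequence[float],
    end_scores: Sequence[float],
    fps: float = 1.0,
    max_len: Optional[int] = None,
) -> Tuple[float, float, float]:
    """Return (start_sec, end_sec, score) for the best valid span.

    Hypothetical helper: a span is valid when start index <= end index,
    and its score is start_scores[i] + end_scores[j].
    """
    n = len(start_scores)
    if n == 0 or n != len(end_scores):
        raise ValueError("need two non-empty score lists of equal length")
    if max_len is not None and max_len < 1:
        raise ValueError("max_len must be at least 1 frame")

    best_score = float("-inf")
    best_i, best_j = 0, 0
    for i in range(n):
        # Optionally cap the span length (in frames) to keep the search small.
        j_stop = n if max_len is None else min(n, i + max_len)
        for j in range(i, j_stop):
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best_score, best_i, best_j = score, i, j

    # Assumed convention: frame k covers [k / fps, (k + 1) / fps).
    return best_i / fps, (best_j + 1) / fps, best_score


# Toy scores: the highest end score sits before the highest start score,
# so taking two independent argmaxes would give an invalid span.
start = [0.1, 0.2, 3.0, 0.1, 0.0]
end = [4.0, 0.1, 0.2, 2.5, 0.3]
print(decode_best_span(start, end))  # (2.0, 4.0, 5.5)
```

The exhaustive double loop is quadratic in the number of frames, which is cheap at 1fps on clips a few minutes long; a real head may well use a vectorized or top-k variant instead.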

Inference is coarse-to-fine: a 1fps scan finds the region, then that window can be re-sampled at 4fps for tighter boundaries. Trained on cached frozen embeddings — 200 epochs on a Lightning.ai H100 in ~1 hour.
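
The coarse-to-fine step can be sketched as plain timestamp arithmetic. The helper names, the 2-second `margin` padding, and the clamping behaviour below are illustrative assumptions, not details taken from GroundZero itself; only the 1fps scan and the 4fps re-sample come from the description above.

```python
from typing import List, Tuple


def refine_window_timestamps(
    coarse_span: Tuple[float, float],
    duration: float,
    margin: float = 2.0,
    fine_fps: float = 4.0,
) -> List[float]:
    """Timestamps (seconds) to re-sample around a coarse 1fps prediction.

    Hypothetical helper: pads the coarse span by `margin` seconds on each
    side, clamps it to the video, and lays out frames at `fine_fps`.
    """
    start, end = coarse_span
    if end <= start:
        raise ValueError("coarse span must have positive length")
    lo = max(0.0, start - margin)
    hi = min(duration, end + margin)
    n_frames = int(round((hi - lo) * fine_fps))
    return [lo + k / fine_fps for k in range(n_frames)]


def fine_index_to_seconds(
    window_start: float, index: int, fine_fps: float = 4.0
) -> float:
    """Map a frame index inside the fine window back to video time."""
    return window_start + index / fine_fps


# A coarse span of 10s to 12s in a 150s clip, padded by 1s each side.
stamps = refine_window_timestamps((10.0, 12.0), duration=150.0, margin=1.0)
print(len(stamps), stamps[0], stamps[-1])  # 16 9.0 12.75
```

Padding the window matters because a coarse boundary can be off by up to a frame interval, so the true edge may sit just outside the 1fps prediction.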

03

Results

R@1 · IoU 0.5: 0.541
R@1 · IoU 0.7: 0.386
R@5 · IoU 0.5: 0.797

QVHighlights val (n=1550). Built from scratch on a single dataset with a frozen SigLIP 2 backbone + LoRA — landing in Moment-DETR territory while training the whole grounding stack from zero.
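
For readers new to the metric, here is a minimal sketch of how R@k at an IoU threshold can be computed. It assumes one ground-truth window per query and a ranked list of predicted spans, which is a simplification; the function names are illustrative and this is not the evaluation script behind the numbers above.

```python
from typing import List, Sequence, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    ranked_preds: Sequence[List[Span]],
    ground_truths: Sequence[Span],
    k: int,
    iou_threshold: float,
) -> float:
    """Fraction of queries with a top-k span at or above the IoU threshold."""
    if len(ground_truths) == 0 or len(ranked_preds) != len(ground_truths):
        raise ValueError("need one ranked prediction list per ground truth")
    hits = 0
    for preds, gt in zip(ranked_preds, ground_truths):
        if any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k]):
            hits += 1
    return hits / len(ground_truths)


# The verified clip from this page: 20s of overlap over a 22s union.
print(round(temporal_iou((0.0, 20.0), (0.0, 22.0)), 2))  # 0.91

# Two toy queries: the second is only correct at rank 2.
preds = [[(0.0, 20.0), (50.0, 60.0)], [(5.0, 6.0), (30.0, 40.0)]]
gts = [(0.0, 22.0), (30.0, 41.0)]
print(recall_at_k(preds, gts, k=1, iou_threshold=0.5))  # 0.5
print(recall_at_k(preds, gts, k=2, iou_threshold=0.5))  # 1.0
```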

Verified val clip: IoU 0.91 (pred [0:00–0:20] vs GT [0:00–0:22])
R@5 · IoU 0.5: 0.797 (right moment in the top-5)
Training time: ~1 hr (200 epochs on an H100)
Training clips: 7.4k (QVHighlights, 1fps frames)
04

Highlights

Strong candidate retrieval

The right moment lands in the top-5 ~80% of the time — the representation surfaces the answer reliably across the val set.

Robust to query length

R@1 holds flat (0.51–0.55) from short 1–5 word queries to long 16+ word ones, so accuracy does not depend on how tersely or verbosely the moment is described.

Trained end to end, from scratch

Every grounding module built and unit-tested by hand, then trained on cached embeddings — 200 epochs in ~1 hour on a single H100.

05

Stack

Visual + text: SigLIP 2 So400m
Adaptation: LoRA · peft
Temporal: Dilated 1D conv
Fusion: Cross-modal transformer
Boundaries: Span extraction head
Backend: FastAPI · PyTorch
Frontend: Next.js · /try demo
Training: Lightning.ai H100 · W&B
06

What's next

A clear, research-backed path to sharper localization — the next iterations the architecture is built to grow into.

Word-level query attention

Feed per-word query tokens with early cross-attention (QD-DETR / CG-DETR) to push top-1 ranking toward state of the art.

Cross-dataset reach

Extend training and evaluation to Charades-STA and ActivityNet for broader, multi-domain generalization.

Calibrated confidence

A trustworthy presence score that powers an explicit, reliable “not found” response.

Built chunk by chunk, end to end — a hands-on deep dive into PyTorch, transformers, and the full ML lifecycle.

Weights & data: HuggingFace Hub