Published June 19, 2026
PhD Position F/M Toward Grounded, Consistent, and Temporally Faithful Video Reasoning
Inria
Paris, Île-de-France 75000, France
Permanent contract (CDI)
Assignment
Recent video multimodal large language models (Video-MLLMs) have achieved
strong results on standard benchmarks, yet remain systematically unreliable on tasks
requiring temporally consistent, spatially grounded reasoning. Video-LLMs achieve
near-chance consistency (≈50%) in temporal grounding even after task-specific
fine-tuning; they hallucinate actions, temporal sequences, and scene transitions
at high rates; and they perform close to random on 4D spatiotemporal tasks
(GPT-4o: 57.5% vs. 98.8% human) and on multi-object dynamic spatial reasoning.
These failures are structural: current systems compress all perceptual history into a
flat token sequence and ask the language model to simultaneously act as the archive
of what happened and the reasoner about what it means. These are architecturally
distinct operations, and conflating them in a single attention pass makes temporal
inconsistency, hallucination, and spatial failure modes unavoidable by design. This
PhD addresses the design of an explicit memory and state space to improve long-video
reasoning.
Main activities
Analyse and implement related work.
Design novel solutions.
Write progress reports and papers.
Present work at conferences.
Skills
Technical skills and level required: programming skills.
Languages: English, and possibly French.
Relational skills: good communication skills.
Benefits
- Subsidized meals
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
- Possibility of teleworking and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
- Access to vocational training
- Social security coverage