
Meta AI has released V-JEPA 2, a scalable open-source world model designed to learn from video at internet scale and enable robust visual understanding, future-state prediction, and zero-shot planning. Building upon the joint-embedding predictive architecture (JEPA), V-JEPA 2 demonstrates how self-supervised learning from passive internet video, combined with minimal robot interaction data, can yield a modular foundation for intelligent physical agents.

Scalable Self-Supervised Pretraining from 1M Hours of Video
V-JEPA 2 is pretrained on over 1 million hours of internet-scale video combined with 1 million images. Using a visual mask denoising objective, the model learns to reconstruct masked spatiotemporal patches in a latent representation space. This approach avoids the inefficiencies of pixel-level prediction by focusing on predictable scene dynamics while disregarding irrelevant noise.
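To make the objective concrete, the sketch below illustrates a JEPA-style masked latent prediction loss in plain PyTorch. The module and variable names (`context_encoder`, `target_encoder`, `predictor`, `mask`) are illustrative placeholders rather than the released V-JEPA 2 code; the target encoder is assumed to be an EMA copy of the context encoder, as is typical for joint-embedding methods.

```python
import torch
import torch.nn.functional as F

def jepa_masked_latent_loss(context_encoder, target_encoder, predictor,
                            video_patches, mask):
    """Illustrative JEPA-style objective: predict latent targets of masked
    spatiotemporal patches from the visible (unmasked) context.

    video_patches: (B, N, D) tokenized spatiotemporal patches
    mask:          (B, N) boolean, True where a patch is masked out
    """
    # Targets come from a frozen / EMA target encoder over the full clip.
    with torch.no_grad():
        targets = target_encoder(video_patches)           # (B, N, D_latent)

    # The context encoder only sees unmasked patches (masked ones zeroed here
    # for simplicity; real implementations drop them from the sequence).
    context = context_encoder(video_patches * (~mask).unsqueeze(-1).float())

    # The predictor fills in latent representations at masked locations.
    preds = predictor(context, mask)                       # (B, N, D_latent)

    # Regress predicted latents onto target latents only at masked positions.
    return F.l1_loss(preds[mask], targets[mask])
```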
To scale JEPA pretraining to this level, Meta researchers introduced four key techniques:
- Data scaling: Built a 22M-sample dataset (VideoMix22M) from public sources such as SSv2, Kinetics, HowTo100M, YT-Temporal-1B, and ImageNet.
- Model scaling: Expanded the encoder capacity to over 1B parameters using ViT-g.
- Training schedule: Adopted a progressive resolution strategy and extended pretraining to 252K iterations.
- Spatial-temporal augmentation: Trained on progressively longer and higher-resolution clips, reaching 64 frames at 384×384 resolution.
These design choices led to an 88.2% average accuracy across six benchmark tasks (SSv2, Diving-48, Jester, Kinetics, COIN, and ImageNet), surpassing previous baselines.
Understanding via Masked Representation Learning
V-JEPA 2 exhibits strong motion understanding capabilities. On the Something-Something v2 benchmark, it achieves 77.3% top-1 accuracy, outperforming models such as InternVideo and VideoMAEv2. For appearance understanding, it remains competitive with state-of-the-art image-text pretraining models such as DINOv2 and PEcoreG.
The encoder's representations were evaluated using attentive probes, verifying that self-supervised learning alone can yield transferable, domain-agnostic visual features applicable across diverse classification tasks.
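As a rough illustration of this evaluation protocol, the sketch below shows a minimal attentive probe: a learnable query cross-attends over frozen encoder tokens, and a linear head classifies the pooled vector. The exact probe architecture and dimensions used in the paper may differ; the values here (e.g. a 1408-dim ViT-g feature, 174 SSv2 classes) are only examples.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Minimal attentive probe trained on top of a frozen video encoder."""
    def __init__(self, dim: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) frozen features from the pretrained encoder
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)           # (B, 1, dim)
        return self.head(pooled.squeeze(1))                # (B, num_classes)

# Usage sketch: only the probe's parameters are trained; the encoder is frozen.
# probe = AttentiveProbe(dim=1408, num_classes=174)   # e.g. SSv2 has 174 classes
# logits = probe(frozen_encoder(video).detach())
```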
Temporal Reasoning via Video Question Answering
To assess temporal reasoning, the V-JEPA 2 encoder is aligned with a multimodal large language model and evaluated on multiple video question-answering tasks. Despite lacking language supervision during pretraining, the model achieves:
- 84.0% on PerceptionTest
- 76.9% on TempCompass
- 44.5% on MVP
- 36.7% on TemporalBench
- 40.3% on TOMATO
These results challenge the assumption that visual-language alignment requires co-training from the start, demonstrating that a pretrained video encoder can be aligned post hoc with strong generalization.
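This post-hoc alignment can be pictured as a lightweight projector that maps frozen video-encoder tokens into the language model's embedding space before the combined sequence is fed to the LLM. The sketch below is a generic rendering of that common recipe, not Meta's released training code; the module name and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class VideoToLLMProjector(nn.Module):
    """Illustrative projector mapping frozen video features into the token
    embedding space of a language model (a common post-hoc alignment recipe)."""
    def __init__(self, video_dim: int = 1408, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(video_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N, video_dim) from the frozen video encoder
        return self.proj(video_tokens)   # (B, N, llm_dim), prepended to text tokens

# The projected tokens are concatenated with text embeddings and the LLM is
# then trained on video question-answering style data.
```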
V-JEPA 2-AC: Learning Latent World Models for Robotic Planning
A key innovation in this release is V-JEPA 2-AC, an action-conditioned variant of the pretrained encoder. Fine-tuned using only 62 hours of unlabeled robot video from the Droid dataset, V-JEPA 2-AC learns to predict future video embeddings conditioned on robot actions and poses. The architecture is a 300M-parameter transformer with block-causal attention, trained using a teacher-forcing and rollout objective.
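Conceptually, the action-conditioned predictor takes the current latent state plus a candidate action and produces the next latent state, which can be chained into multi-step rollouts. The sketch below is a simplified rendering of that interface; the `predictor` callable and tensor shapes are illustrative assumptions, whereas the actual model operates as a block-causal transformer over frame and action tokens.

```python
import torch

def rollout_latents(predictor, z0: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Autoregressively roll a latent world model forward.

    z0:      (B, D)    latent of the current observation
    actions: (B, T, A) candidate action sequence (e.g. end-effector deltas)
    returns: (B, T, D) imagined future latents
    """
    z, futures = z0, []
    for t in range(actions.size(1)):
        # Each step conditions on the previous latent and the action taken.
        z = predictor(z, actions[:, t])   # (B, D)
        futures.append(z)
    return torch.stack(futures, dim=1)
```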
This enables zero-shot planning through model-predictive control. The model infers action sequences by minimizing the distance between imagined future states and visual goals using the Cross-Entropy Method (CEM). It achieves high success rates in tasks such as reaching, grasping, and pick-and-place on unseen robot arms in different labs, without any reward supervision or additional data collection.
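A minimal version of this planning loop, assuming a rollout function like the one sketched above and an encoder that maps the goal image to a latent, might look like the following. The population size, horizon, action dimension, and elite fraction are arbitrary illustrative values, not the paper's settings.

```python
import torch

def plan_with_cem(rollout_fn, encoder, obs, goal_img, horizon=10, act_dim=7,
                  pop=256, n_elites=32, iters=6, device="cpu"):
    """Cross-Entropy Method over action sequences: sample candidates, score
    them by the distance between imagined future latents and the goal latent,
    then refit a Gaussian to the best (elite) candidates and repeat."""
    with torch.no_grad():
        z0 = encoder(obs).expand(pop, -1)        # (pop, D) current latent
        z_goal = encoder(goal_img)               # (1, D)   goal latent

        mean = torch.zeros(horizon, act_dim, device=device)
        std = torch.ones(horizon, act_dim, device=device)

        for _ in range(iters):
            # Sample a population of candidate action sequences.
            acts = mean + std * torch.randn(pop, horizon, act_dim, device=device)
            # rollout_fn: e.g. the rollout_latents helper sketched earlier.
            futures = rollout_fn(z0, acts)                        # (pop, T, D)
            # Energy: L1 distance between imagined final state and the goal.
            cost = (futures[:, -1] - z_goal).abs().mean(dim=-1)   # (pop,)
            elites = acts[cost.topk(n_elites, largest=False).indices]
            mean, std = elites.mean(dim=0), elites.std(dim=0) + 1e-4

        # Execute only the first action, then replan (receding-horizon MPC).
        return mean[0]
```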

Benchmarks: Robust Performance and Planning Efficiency
Compared with baselines such as Octo (behavior cloning) and Cosmos (latent diffusion world models), V-JEPA 2-AC:
- Executes plans in ~16 seconds per step (versus 4 minutes for Cosmos).
- Reaches a 100% success rate on reach tasks.
- Outperforms others in grasp and manipulation tasks across object types.

Notably, it operates using a monocular RGB camera without calibration or environment-specific fine-tuning, reinforcing the generalization capability of the learned world model.
Conclusion
Meta's V-JEPA 2 represents a significant advance in scalable self-supervised learning for physical intelligence. By decoupling observation learning from action conditioning and leveraging large-scale passive video, V-JEPA 2 demonstrates that general-purpose visual representations can be harnessed for both perception and control in the real world.
Check out the Paper and the Models on Hugging Face and the GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 99k+ ML SubReddit and subscribe to our Newsletter.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.