
If 2022 was the year that generative AI captured the wider public's imagination, 2025 is the year in which the new breed of generative video frameworks coming from China looks set to do the same.
Tencent's Hunyuan Video has made a major impact on the hobbyist AI community with its open-source release of a full-world video diffusion model that users can tailor to their needs.
Close on its heels is Alibaba's more recent Wan 2.1, one of the most powerful image-to-video FOSS solutions of this era – now supporting customization through Wan LoRAs.
Aside from the availability of the recent human-centric foundation model SkyReels, at the time of writing we also await the release of Alibaba's comprehensive VACE video creation and editing suite:
Click to play. The pending release of Alibaba's multi-function AI-editing suite VACE has excited the user community. Source: https://ali-vilab.github.io/VACE-Page/
Sudden Impact
The generative video AI research scene itself is no less explosive; it's still the first half of March, and Tuesday's submissions to Arxiv's Computer Vision section (a hub for generative AI papers) came to nearly 350 entries – a figure more associated with the height of conference season.
The two years since the launch of Stable Diffusion in summer of 2022 (and the subsequent development of Dreambooth and LoRA customization methods) were characterized by the lack of further major developments, until the last few weeks, where new releases and innovations have proceeded at such a breakneck pace that it is almost impossible to keep apprised of it all, much less cover it all.
Video diffusion models such as Hunyuan and Wan 2.1 have solved, at last, and after years of failed efforts from hundreds of research initiatives, the problem of temporal consistency as it relates to the generation of humans, and largely also to environments and objects.
There is little doubt that VFX studios are currently applying staff and resources to adapting the new Chinese video models to solve immediate challenges such as face-swapping, despite the current lack of ControlNet-style ancillary mechanisms for these systems.
It must be such a relief that one such significant obstacle has potentially been overcome, albeit not through the avenues anticipated.
Of the problems that remain, this one, however, is not insignificant:
Click to play. Based on the prompt 'A small rock tumbles down a steep, rocky hillside, displacing soil and small stones', Wan 2.1, which achieved the very highest scores in the new paper, makes one simple error. Source: https://videophy2.github.io/
Up The Hill Backwards
All text-to-video and image-to-video systems currently available, including commercial closed-source models, have a tendency to produce physics bloopers such as the one above, where the video shows a rock rolling uphill, based on the prompt 'A small rock tumbles down a steep, rocky hillside, displacing soil and small stones'.
One theory as to why this happens, recently proposed in an academic collaboration between Alibaba and the UAE, is that models always train on single images, in a sense, even when they are training on videos (which are written out to single-frame sequences for training purposes); and they may not necessarily learn the correct temporal order of 'before' and 'after' pictures.
However, the most likely explanation is that the models in question have used data augmentation routines that involve exposing a source training clip to the model both forwards and backwards, effectively doubling the training data.
It has long been known that this should not be done arbitrarily, because some actions work in reverse, but many do not. A 2019 study from the UK's University of Bristol sought to develop a method that could distinguish equivariant, invariant and irreversible source data video clips that co-exist in a single dataset (see image below), with the notion that unsuitable source clips might be filtered out from data augmentation routines.

Examples of three types of movement, only one of which is freely reversible while maintaining plausible physical dynamics. Source: https://arxiv.org/abs/1909.09422
The authors of that work frame the problem clearly:
'We find the realism of reversed videos to be betrayed by reversal artefacts, aspects of the scene that would not be possible in a natural world. Some artefacts are subtle, while others are easy to spot, like a reversed 'throw' action where the thrown object spontaneously rises from the floor.
'We observe two types of reversal artefacts, physical, those exhibiting violations of the laws of nature, and improbable, those depicting a possible but unlikely scenario. These are not exclusive, and many reversed actions suffer both types of artefacts, such as when uncrumpling a piece of paper.
'Examples of physical artefacts include: inverted gravity (e.g. 'dropping something'), spontaneous impulses on objects (e.g. 'spinning a pen'), and irreversible state changes (e.g. 'burning a candle'). An example of an improbable artefact: taking a plate from the cupboard, drying it, and placing it on the drying rack.
'This kind of re-use of data is very common at training time, and can be beneficial – for instance, in making sure that the model does not learn only one view of an image or object that can be flipped or rotated without losing its central coherency and logic.
'This only works for objects that are truly symmetrical, of course; and learning physics from a 'reversed' video only works if the reversed version makes as much sense as the forward version.'
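To make the filtering idea concrete, here is a minimal sketch of how reversal-aware augmentation might be gated during dataset preparation. The IRREVERSIBLE_ACTIONS set and the clip structure are my own illustrative placeholders, not anything taken from the Bristol paper or from the video models discussed here:

```python
from dataclasses import dataclass

# Hypothetical action labels judged unsafe to reverse (cf. 'dropping something',
# 'burning a candle' among the Bristol paper's examples of reversal artefacts).
IRREVERSIBLE_ACTIONS = {"throwing", "dropping", "burning", "pouring"}

@dataclass
class Clip:
    frames: list   # ordered frame identifiers or arrays
    action: str    # action label attached to the source clip

def augment_with_reversal(clips):
    """Yield each clip, plus a time-reversed copy only when the action
    is not flagged as irreversible."""
    for clip in clips:
        yield clip
        if clip.action not in IRREVERSIBLE_ACTIONS:
            yield Clip(frames=list(reversed(clip.frames)), action=clip.action)

if __name__ == "__main__":
    clips = [Clip(frames=[0, 1, 2], action="waving"),
             Clip(frames=[0, 1, 2], action="dropping")]
    augmented = list(augment_with_reversal(clips))
    print(len(augmented))  # 3: the 'dropping' clip is not reversed
```

The point of such a gate is simply that reversal doubles the usable footage only for actions whose reversed form remains physically plausible; everything else is passed through untouched.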
Temporary Reversals
We haven’t any proof that methods equivalent to Hunyuan Video and Wan 2.1 allowed arbitrarily ‘reversed’ clips to be uncovered to the mannequin throughout coaching (neither group of researchers has been particular relating to knowledge augmentation routines).
But the one cheap various chance, within the face of so many reviews (and my very own sensible expertise), would appear to be that hyperscale datasets powering these mannequin might comprise clips that really characteristic actions occurring in reverse.
The rock within the instance video embedded above was generated utilizing Wan 2.1, and options in a brand new examine that examines how properly video diffusion fashions deal with physics.
In checks for this undertaking, Wan 2.1 achieved a rating of solely 22% by way of its capability to persistently adhere to bodily legal guidelines.
Nevertheless, that is the greatest rating of any system examined for the work, indicating that we might have discovered our subsequent stumbling block for video AI:

Scores obtained by leading open and closed-source systems, with the output of the frameworks evaluated by human annotators. Source: https://arxiv.org/pdf/2503.06800
The authors of the new work have developed a benchmarking system, now in its second iteration, called VideoPhy, with the code available at GitHub.
Though the scope of the work is beyond what we can comprehensively cover here, let's take a general look at its methodology, and its potential for establishing a metric that could help steer the course of future model-training sessions away from these bizarre instances of reversal.
The study, conducted by six researchers from UCLA and Google Research, is called VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation. A crowded accompanying project site is also available, along with code and datasets at GitHub, and a dataset viewer at Hugging Face.
Click to play. Here, the feted OpenAI Sora model fails to understand the interactions between oars and reflections, and is not able to provide a logical physical flow either for the person in the boat or the way that the boat interacts with her.
Method
The authors describe the latest version of their work, VideoPhy-2, as a 'challenging commonsense evaluation dataset for real-world actions.' The collection features 197 actions across a range of diverse physical activities such as hula-hooping, gymnastics and tennis, as well as object interactions, such as bending an object until it breaks.
A large language model (LLM) is used to generate 3,940 prompts from these seed actions, and the prompts are then used to synthesize videos via the various frameworks being trialed.
Throughout the process, the authors have developed a list of 'candidate' physical rules and laws that AI-generated videos should satisfy, using vision-language models for evaluation.
The authors state:
'For example, in a video of a sportsperson playing tennis, a physical rule would be that a tennis ball should follow a parabolic trajectory under gravity. For gold-standard judgments, we ask human annotators to score each video based on overall semantic adherence and physical commonsense, and to mark its compliance with various physical rules.'
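As a toy illustration of the kind of rule the authors describe (and emphatically not the paper's own checking procedure, which relies on vision-language models and human annotators), one could test whether a tracked ball path is roughly parabolic by fitting a quadratic and requiring that it open downward:

```python
import numpy as np

def looks_parabolic(xs, ys, tol=0.05):
    """Fit y = ax^2 + bx + c to tracked ball positions; accept the trajectory
    if the fit is close and the curve opens downward (a < 0)."""
    a, b, c = np.polyfit(xs, ys, deg=2)
    residual = np.mean((np.polyval([a, b, c], xs) - ys) ** 2)
    return a < 0 and residual < tol * (np.ptp(ys) ** 2 + 1e-9)

# Synthetic example: a ball launched under gravity, lightly perturbed.
t = np.linspace(0, 1, 30)
xs = 10 * t
ys = 5 * t - 4.9 * t**2 + np.random.normal(0, 0.01, t.shape)
print(looks_parabolic(xs, ys))  # expected: True
```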

Above: A text prompt is generated from an action using an LLM and used to create a video with a text-to-video generator. A vision-language model captions the video, identifying possible physical rules at play. Below: Human annotators evaluate the video's realism, confirm rule violations, add missing rules, and check whether the video matches the original prompt.
Initially the researchers curated a set of actions to evaluate physical commonsense in AI-generated videos. They began with over 600 actions sourced from the Kinetics, UCF-101, and SSv2 datasets, focusing on activities involving sports, object interactions, and real-world physics.
Two independent groups of STEM-trained student annotators (with a minimum undergraduate qualification obtained) reviewed and filtered the list, selecting actions that tested principles such as gravity, momentum, and elasticity, while removing low-motion tasks such as typing, petting a cat, or chewing.
After further refinement with Gemini-2.0-Flash-Exp to eliminate duplicates, the final dataset included 197 actions, with 54 involving object interactions and 143 centered on physical and sports activities:

Samples from the distilled actions.
In the second stage, the researchers used Gemini-2.0-Flash-Exp to generate 20 prompts for each action in the dataset, resulting in a total of 3,940 prompts. The generation process focused on visible physical interactions that could be clearly represented in a generated video. This excluded non-visual elements such as emotions, sensory details, and abstract language, but incorporated diverse characters and objects.
For example, instead of a simple prompt like 'An archer releases the arrow', the model was guided to produce a more detailed version such as 'An archer draws the bowstring back to full tension, then releases the arrow, which flies straight and strikes a bullseye on a paper target'.
Since modern video models can interpret longer descriptions, the researchers further refined the captions using the Mistral-NeMo-12B-Instruct prompt upsampler, to add visual details without altering the original meaning.
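The two-step prompting pipeline can be pictured roughly as follows. The generate_text callable standing in for Gemini-2.0-Flash-Exp and the Mistral-NeMo-12B-Instruct upsampler is a hypothetical placeholder, since the paper does not publish its exact prompts or API calls; this is a structural sketch only:

```python
from typing import Callable, List

def build_prompts(action: str,
                  generate_text: Callable[[str], str],
                  n: int = 20) -> List[str]:
    """Ask an LLM for n visually grounded prompts for one seed action,
    then upsample each prompt with added visual detail."""
    seed_instruction = (
        f"Write {n} short video prompts showing the action '{action}'. "
        "Describe only visible physical interactions; no emotions, "
        "sensory detail, or abstract language. One prompt per line."
    )
    prompts = [p.strip() for p in generate_text(seed_instruction).splitlines() if p.strip()]

    upsampled = []
    for p in prompts[:n]:
        upsample_instruction = (
            "Add concrete visual detail to the following video prompt "
            f"without changing its meaning:\n{p}"
        )
        upsampled.append(generate_text(upsample_instruction))
    return upsampled
```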

Sample prompts from VideoPhy-2, categorized by physical activities or object interactions. Each prompt is paired with its corresponding action and the relevant physical principle it tests.
For the third stage, physical rules were not derived from text prompts but from generated videos, since generative models can struggle to adhere to conditioned text prompts.
Videos were first created using VideoPhy-2 prompts, then 'up-captioned' with Gemini-2.0-Flash-Exp to extract key details. The model proposed three expected physical rules per video, which human annotators reviewed and expanded by identifying additional potential violations.

Examples from the upsampled captions.
Next, to identify the most challenging actions, the researchers generated videos using CogVideoX-5B with prompts from the VideoPhy-2 dataset. They then selected 60 actions out of 197 where the model consistently failed to follow both the prompts and basic physical commonsense.
These actions involved physics-rich interactions such as momentum transfer in discus throwing, state changes such as bending an object until it breaks, balancing tasks such as tightrope walking, and complex motions that included back-flips, pole vaulting, and pizza tossing, among others. In total, 1,200 prompts were selected to increase the difficulty of the sub-dataset.
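A rough sketch of how such a 'hard split' could be selected is shown below, assuming that per-action average scores from the probe model are thresholded; the threshold and the min-of-both-axes rule are my own assumptions, not figures taken from the paper:

```python
from collections import defaultdict
from statistics import mean

def select_hard_actions(video_scores, threshold=2.5, top_k=60):
    """video_scores: iterable of (action, semantic_adherence, physical_commonsense)
    tuples, rated 1-5. Returns the actions on which the probe model performed
    worst, i.e. candidates for the hard split."""
    by_action = defaultdict(list)
    for action, sa, pc in video_scores:
        by_action[action].append(min(sa, pc))  # a video is only as good as its weaker axis
    averaged = {a: mean(scores) for a, scores in by_action.items()}
    failing = [a for a, s in sorted(averaged.items(), key=lambda kv: kv[1]) if s < threshold]
    return failing[:top_k]
```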
The resulting dataset comprised 3,940 captions – 5.72 times more than the earlier version of VideoPhy. The average length of the original captions is 16 tokens, while the upsampled captions reach 138 tokens – 1.88 times and 16.2 times longer, respectively.
The dataset also features 102,000 human annotations covering semantic adherence, physical commonsense, and rule violations across multiple video generation models.
Evaluation
The researchers then defined clear criteria for evaluating the videos. The main goal was to assess how well each video matched its input prompt and followed basic physical principles.
Instead of simply ranking videos by preference, they used rating-based feedback to capture specific successes and failures. Human annotators scored videos on a five-point scale, allowing for more detailed judgments, while the evaluation also checked whether videos followed various physical rules and laws.
For human evaluation, a group of 12 annotators was selected from trials on Amazon Mechanical Turk (AMT), and provided ratings after receiving detailed remote instructions. For fairness, semantic adherence and physical commonsense were evaluated separately (in the original VideoPhy study, they were assessed jointly).
The annotators first rated how well videos matched their input prompts, then separately evaluated physical plausibility, scoring rule violations and overall realism on a five-point scale. Only the original prompts were shown, to maintain a fair comparison across models.
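The headline percentages appear to reflect the share of videos that clear a high rating on both axes; below is a minimal sketch of that aggregation, assuming (my reading, not a quoted figure) that a rating of 4 or above counts as passing:

```python
def joint_score(ratings, passing=4):
    """ratings: list of (semantic_adherence, physical_commonsense) pairs on a
    1-5 scale. Returns the percentage of videos rated `passing` or above on
    both axes - one plausible reading of how the headline numbers are formed."""
    if not ratings:
        return 0.0
    hits = sum(1 for sa, pc in ratings if sa >= passing and pc >= passing)
    return 100.0 * hits / len(ratings)

print(joint_score([(5, 4), (4, 2), (3, 5)]))  # 33.33...
```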

The interface presented to the AMT annotators.
Though human judgment remains the gold standard, it is expensive and comes with a number of caveats. Therefore automated evaluation is essential for faster and more scalable model assessments.
The paper's authors tested several video-language models, including Gemini-2.0-Flash-Exp and VideoScore, on their ability to score videos for semantic accuracy and for 'physical commonsense'.
The models again rated each video on a five-point scale, while a separate classification task determined whether physical rules were followed, violated, or unclear.
Experiments showed that existing video-language models struggled to match human judgments, mainly due to weak physical reasoning and the complexity of the prompts. To improve automated evaluation, the researchers developed VideoPhy-2-Autoeval, a 7B-parameter model designed to provide more accurate predictions across three categories: semantic adherence; physical commonsense; and rule compliance, fine-tuned on the VideoCon-Physics model using 50,000 human annotations*.
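Agreement between an automatic judge and the human gold standard is typically measured with a rank correlation; a minimal sketch of such a check (not the authors' evaluation code, and with made-up scores) might look like this:

```python
from scipy.stats import spearmanr

# Hypothetical scores for the same set of generated videos.
human_pc    = [4, 2, 5, 1, 3, 4, 2]  # annotator physical-commonsense ratings
autoeval_pc = [4, 3, 5, 2, 3, 3, 1]  # scores from an automatic evaluator

rho, p_value = spearmanr(human_pc, autoeval_pc)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```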
Data and Tests
With these tools in place, the authors tested a number of generative video systems, both through local installations and, where necessary, via commercial APIs: CogVideoX-5B; VideoCrafter2; HunyuanVideo-13B; Cosmos-Diffusion; Wan2.1-14B; OpenAI Sora; and Luma Ray.
The models were prompted with upsampled captions where possible, except that Hunyuan Video and VideoCrafter2 operate under 77-token CLIP limitations, and cannot accept prompts above a certain length.
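For the models bound by CLIP's 77-token text encoder, longer upsampled captions have to be cut down; below is a minimal sketch using the Hugging Face CLIP tokenizer (my own illustration, not the paper's code):

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def fit_to_clip(prompt: str, max_length: int = 77) -> str:
    """Truncate a prompt to CLIP's 77-token window and decode it back to text."""
    ids = tokenizer(prompt, truncation=True, max_length=max_length)["input_ids"]
    # Drop the special begin/end tokens before decoding back to plain text.
    return tokenizer.decode(ids, skip_special_tokens=True)

long_caption = ("An archer draws the bowstring back to full tension, then releases "
                "the arrow, which flies straight and strikes a bullseye on a paper target")
print(fit_to_clip(long_caption))
```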
Generated videos were kept to less than six seconds, since shorter output is easier to evaluate.
The driving data came from the VideoPhy-2 dataset, which was split into a benchmark and a training set. 590 videos were generated per model, except for Sora and Ray2, for which (due to the cost factor) correspondingly lower numbers of videos were generated.
(Please refer to the source paper for further evaluation details, which are exhaustively chronicled there)
The initial evaluation covered physical activities/sports (PA) and object interactions (OI), and tested both the general dataset and the aforementioned 'harder' subset:

Results from the initial round.
Here the authors comment:
'Even the best-performing model, Wan2.1-14B, achieves only 32.6% and 21.9% on the full and hard splits of our dataset, respectively. Its relatively strong performance compared to other models can be attributed to the diversity of its multimodal training data, along with robust motion filtering that preserves high-quality videos across a wide range of actions.
'Furthermore, we observe that closed models, such as Ray2, perform worse than open models like Wan2.1-14B and CogVideoX-5B. This suggests that closed models are not necessarily superior to open models in capturing physical commonsense.
'Notably, Cosmos-Diffusion-7B achieves the second-best score on the hard split, even outperforming the much larger HunyuanVideo-13B model. This may be due to the high representation of human activities in its training data, along with synthetically rendered simulations.'
The results showed that video models struggled more with physical activities like sports than with simpler object interactions. This suggests that improving AI-generated videos in this area will require better datasets – particularly high-quality footage of sports such as tennis, discus, baseball, and cricket.
The study also examined whether a model's physical plausibility correlated with other video quality metrics, such as aesthetics and motion smoothness. The findings revealed no strong correlation, meaning a model cannot improve its performance on VideoPhy-2 simply by producing visually appealing or fluid motion – it needs a deeper understanding of physical commonsense.
Though the paper provides abundant qualitative examples, few of the static examples supplied in the PDF seem to relate to the extensive video-based examples that the authors furnish at the project site. Therefore we'll look at a small selection of the static examples, and then some more of the actual project videos.

The top row shows videos generated by Wan2.1. (a) In Ray2, the jet-ski on the left lags behind before moving backward. (b) In Hunyuan-13B, the sledgehammer deforms mid-swing, and a broken wooden board appears unexpectedly. (c) In Cosmos-7B, the javelin expels sand before making contact with the ground.
Regarding the above qualitative test, the authors comment:
'[We] observe violations of physical commonsense, such as jetskis moving unnaturally in reverse and the deformation of a solid sledgehammer, defying the principles of elasticity. However, even Wan suffers from a lack of physical commonsense, as shown in [the clip embedded at the start of this article].
'In this case, we highlight that a rock begins rolling and accelerating uphill, defying the physical law of gravity.'
Further examples from the project site:
Click to play. Here the caption was 'A person vigorously twists a wet towel, water spraying outwards in a visible arc' – but the resulting source of water is far more like a water-hose than a towel.
Click to play. Here the caption was 'A chemist pours a clear liquid from a beaker into a test tube, carefully avoiding spills', but we can see that the volume of water being added to the beaker is not consistent with the amount exiting the jug.
As I mentioned at the outset, the volume of material associated with this project far exceeds what can be covered here. Therefore please refer to the source paper, project site and related sites mentioned earlier, for a truly exhaustive outline of the authors' procedures, and considerably more testing examples and procedural details.
* As for the provenance of the annotations, the paper only specifies 'acquired for these tasks' – it seems a lot to have been generated by 12 AMT workers.
First published Thursday, March 13, 2025