Written by: Haotian

Beyond the localization trend of AI 'sinking', the biggest recent change in the AI track has been a technical breakthrough in multi-modal video generation: evolving from pure text-to-video into end-to-end, full-link generation that integrates text, images, and audio.

Let me walk through a few examples of these breakthroughs to give you a feel for them.

1) ByteDance open-sourced the EX-4D framework: it turns monocular video into free-viewpoint 4D content, with a user acceptance rate of 70.7%. In other words, given an ordinary video, AI can automatically generate viewing effects from arbitrary angles, something that previously required a professional 3D modeling team.

2) Baidu's 'Hui Xiang' platform: generates a 10-second video from a single image, claiming 'movie-level' quality. Whether that claim is marketing hype will have to wait until the Pro version update in August.

3) Google DeepMind Veo: can generate 4K video and ambient sound in sync. The key technical highlight is that 'synchronization' capability; previously, video and audio were stitched together from two separate systems. Reaching true semantic-level matching is a significant challenge, for example keeping walking motion and footstep sounds aligned in complex scenes.

4) Douyin ContentV: 80 billion parameters, generates 1080p video in 2.3 seconds, at a cost of 3.67 yuan per 5 seconds. Honestly, the cost control is quite good, but generation quality still falls short in complex scenes.

Why do these cases represent meaningful breakthroughs in video quality, generation cost, and application scenarios, and why do they matter so much?

1. In terms of technical value, the complexity of multi-modal video generation grows almost exponentially. A single frame contains on the order of 10^6 pixels; a video must maintain temporal coherence across at least 100 frames; audio synchronization adds roughly 10^4 samples per second; and on top of all that, the model must preserve 3D spatial consistency.
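A quick back-of-envelope calculation makes the point. It uses only the rough figures quoted above; the frame rate is an assumed, illustrative value, not something stated in the text:

```python
# Back-of-envelope count of the raw values a model must produce for one short
# clip, using the rough figures from the text. The frame rate is an assumption.

pixels_per_frame = 10**6      # ~1 megapixel per frame
frames = 100                  # minimum frame count for temporal coherence
audio_rate = 10**4            # ~10^4 audio samples per second
fps = 24                      # assumed frame rate (illustrative only)

video_values = pixels_per_frame * frames * 3   # 3 color channels
audio_values = audio_rate * frames // fps      # audio samples over the clip

print(f"video values: {video_values:.2e}")     # 3.00e+08
print(f"audio values: {audio_values:.2e}")     # ~4.17e+04
```

Hundreds of millions of mutually consistent values per short clip is what makes the problem so much harder than single-image generation.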

Overall, the technical complexity is substantial. Previously, a single super-large model handled everything; Sora reportedly burned through tens of thousands of H100s to acquire its video generation capability. Now the same result can be achieved through modular decomposition plus collaboration among large models. Byte's EX-4D, for example, breaks the task down into a depth estimation module, a viewpoint transformation module, a temporal interpolation module, a rendering optimization module, and so on, with each module specializing in one task and coordinating through a shared mechanism.
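To make the modular idea concrete, here is a minimal sketch of such a pipeline. The module names mirror the decomposition described above, but the data structures, coordinator, and placeholder logic are hypothetical and not based on EX-4D's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Scene:
    """Shared state handed from one specialized module to the next."""
    frames: list = field(default_factory=list)
    depth_maps: list = field(default_factory=list)
    novel_views: list = field(default_factory=list)

class DepthEstimationModule:
    def run(self, scene: Scene) -> Scene:
        # Placeholder: a real module would run a monocular depth model per frame.
        scene.depth_maps = [f"depth({frame})" for frame in scene.frames]
        return scene

class ViewpointTransformModule:
    def run(self, scene: Scene) -> Scene:
        # Placeholder: reproject each frame to a new camera pose using its depth map.
        scene.novel_views = [f"reproject({d})" for d in scene.depth_maps]
        return scene

class TemporalInterpolationModule:
    def run(self, scene: Scene) -> Scene:
        # Placeholder: insert interpolated frames to keep motion temporally coherent.
        scene.novel_views = [v for view in scene.novel_views
                             for v in (view, f"interp({view})")]
        return scene

class RenderingOptimizationModule:
    def run(self, scene: Scene) -> Scene:
        # Placeholder: final rendering / upscaling pass.
        scene.novel_views = [f"render({v})" for v in scene.novel_views]
        return scene

def run_pipeline(frames: list) -> Scene:
    """Coordinator: runs each specialized module in sequence over shared state."""
    scene = Scene(frames=frames)
    for module in (DepthEstimationModule(), ViewpointTransformModule(),
                   TemporalInterpolationModule(), RenderingOptimizationModule()):
        scene = module.run(scene)
    return scene

if __name__ == "__main__":
    print(run_pipeline(["frame_0", "frame_1"]).novel_views)
```

The design point is that each module can be trained, scaled, and replaced independently, instead of one monolithic model having to learn every sub-task at once.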

2. In terms of cost reduction, the driver is optimization of the inference architecture itself: layered generation strategies that first produce a low-resolution skeleton and then enhance the imagery at high resolution; cache reuse mechanisms that recycle work across similar scenes; and dynamic resource allocation that adjusts model depth to the complexity of the content.

Only with this kind of optimization can Douyin's ContentV reach 3.67 yuan per 5 seconds of video.
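Purely as an illustration, here is a minimal sketch of how those three ideas might fit together in an inference loop. Every function name, heuristic, and threshold below is hypothetical and not taken from ContentV or any other real system:

```python
import hashlib

# Sketch of the three optimizations described above:
# 1) layered generation (coarse skeleton first, then high-res refinement),
# 2) cache reuse for similar scenes,
# 3) dynamic resource allocation (model depth chosen by content complexity).
# All names and heuristics are hypothetical.

_scene_cache: dict[str, str] = {}

def _scene_key(prompt: str) -> str:
    # Toy stand-in for "similar scene" detection; a real system would use
    # embedding similarity rather than an exact hash of the prompt.
    return hashlib.sha256(prompt.encode()).hexdigest()

def estimate_complexity(prompt: str) -> float:
    # Toy heuristic: longer, more detailed prompts count as more complex.
    return min(1.0, len(prompt.split()) / 50)

def generate_video(prompt: str) -> str:
    key = _scene_key(prompt)
    if key in _scene_cache:                                   # 2) cache reuse
        return _scene_cache[key]

    depth = 12 if estimate_complexity(prompt) > 0.5 else 6    # 3) dynamic depth

    skeleton = f"low_res_skeleton({prompt!r}, layers={depth})"  # 1) coarse pass
    video = f"high_res_refine({skeleton})"                      #    refinement pass

    _scene_cache[key] = video
    return video

print(generate_video("a person walking on a beach at sunset"))
```

The common thread is spending expensive compute only where the content actually demands it, which is what makes per-second pricing at this level plausible.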

3. In terms of application impact, traditional video production is an asset-heavy game: equipment, venues, actors, post-production; a 30-second advertisement costing hundreds of thousands is normal. AI compresses that process into a prompt plus a few minutes of waiting, and can deliver camera angles and effects that are difficult to achieve with traditional shooting.

This shifts the barriers in video production from technology and capital to creativity and aesthetics, which may trigger a reshuffling of the entire creator economy.

So the question arises: what do all these demand-side changes in web2AI technology have to do with web3AI?

1. First, the structure of computing power demand is changing. In the past, AI competed on raw scale: whoever had the larger homogeneous GPU cluster won. Multi-modal video generation, however, requires a diverse mix of computing power, which may create demand for distributed idle compute as well as for various distributed fine-tuning, algorithm, and inference platforms.

2. Secondly, demand for data annotation will also strengthen. Generating a professional-grade video requires precise scene descriptions, reference images, audio styles, camera motion trajectories, lighting conditions, and so on, all of which become new professional annotation needs. Web3 incentive mechanisms can motivate photographers, sound engineers, 3D artists, and others to contribute these professional data elements, strengthening AI video generation with specialized vertical annotations.

3. Lastly, it is worth noting that as AI moves from centralized, large-scale resource allocation toward modular collaboration, this shift itself creates new demand for decentralized platforms. Computing power, data, models, and incentives can then combine into a self-reinforcing flywheel that drives the convergence of web3AI and web2AI scenarios.