Written by: Haotian
Beyond the ongoing trend of AI sinking toward local, on-device deployment, the biggest recent change in the AI space is the technical breakthrough in multimodal video generation: evolving from pure text-to-video toward end-to-end generation that integrates text, images, and audio.
Let me quickly run through a few breakthrough cases to give you a feel for it:
1) ByteDance open-sourced the EX-4D framework: a monocular video can instantly be turned into free-viewpoint 4D content, with a user acceptance rate of 70.7%. In other words, for an ordinary video, AI can automatically generate viewing angles from any perspective, something that previously required a professional 3D modeling team.
2) Baidu's 'Hui Xiang' platform: generates a 10-second video from a single image, claiming 'movie-level' quality. Whether that is marketing hype will only become clear once the Pro version update lands in August and the actual results can be judged.
3) Google DeepMind Veo: can generate 4K video together with synchronized ambient sound. The key technical highlight is the 'synchronization' itself; previously, video and audio were two separate systems stitched together. Achieving true semantic-level matching means overcoming real challenges, for example keeping the sound of footsteps aligned with the walking motion on screen in complex scenes.
4) Douyin ContentV: 8 billion parameters, generates 1080p video in 2.3 seconds, at a cost of 3.67 yuan per 5 seconds. Frankly, that level of cost control is acceptable, but generation quality in complex scenes still falls short for now.
Why are these breakthroughs in video quality, generation cost, and application scenarios so valuable and significant?
1. On the technical value of the breakthroughs: the complexity of multimodal video generation is often exponential. A single frame is on the order of 10^6 pixels; the video has to maintain temporal coherence across at least 100 frames; audio synchronization adds roughly 10^4 samples per second; and on top of all that, 3D spatial consistency has to hold.
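A quick back-of-the-envelope calculation makes the scale concrete; the resolution, frame rate, and sample rate below are my own illustrative assumptions, not the specs of any particular model:

```python
# Back-of-the-envelope estimate of the raw output space for one short clip.
# All numbers are illustrative assumptions, not specs of any particular model.

pixels_per_frame = 1920 * 1080          # ~2e6 pixels for a 1080p frame (~10^6)
frames = 100                            # ~4 seconds at 25 fps, the temporal dimension
audio_samples_per_sec = 16_000          # ~10^4 audio samples per second
clip_seconds = frames / 25

video_values = pixels_per_frame * 3 * frames          # RGB values across all frames
audio_values = int(audio_samples_per_sec * clip_seconds)

print(f"video values : {video_values:.2e}")   # ~6e8 correlated values to keep coherent
print(f"audio values : {audio_values:.2e}")   # ~6e4 samples that must stay in sync
```

Every one of those values has to stay consistent with its neighbors in space, in time, and across modalities, which is where the real difficulty lies.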
In short, the technical complexity is considerable. Previously, a single giant model handled everything end to end; Sora reportedly burned through tens of thousands of H100s to reach its video generation capability. Now the same result can be achieved through modular decomposition plus large-model coordination. ByteDance's EX-4D, for example, breaks the complex task down into a depth estimation module, a viewpoint transformation module, a temporal interpolation module, a rendering optimization module, and so on, with each module specializing in one task and collaborating through a coordination mechanism.
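To make the idea concrete, here is a minimal sketch of what such a modular pipeline could look like; the module names follow the description above, but the classes and the orchestration function are a hypothetical simplification, not EX-4D's actual code or API:

```python
# Hypothetical sketch of a modular video-generation pipeline in the spirit of
# EX-4D's decomposition. None of these classes correspond to real EX-4D code.

class DepthEstimator:
    def run(self, frames):
        # Estimate a per-pixel depth map for each input frame.
        return [{"frame": f, "depth": None} for f in frames]

class ViewTransformer:
    def run(self, depth_frames, target_view):
        # Re-project each frame into the requested camera viewpoint.
        return [{"view": target_view, **d} for d in depth_frames]

class TemporalInterpolator:
    def run(self, view_frames):
        # Insert intermediate frames so motion stays temporally coherent.
        return view_frames  # placeholder: no-op in this sketch

class Renderer:
    def run(self, frames):
        # Final rendering / quality-enhancement pass.
        return frames

def generate_free_view(frames, target_view):
    # Coordination mechanism: each specialized module handles one sub-task,
    # and the orchestrator simply wires their outputs together.
    depth = DepthEstimator().run(frames)
    views = ViewTransformer().run(depth, target_view)
    smooth = TemporalInterpolator().run(views)
    return Renderer().run(smooth)
```

The point of the decomposition is that each module can be trained, optimized, and even hosted separately, instead of one monolithic model carrying the whole load.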
2. On cost reduction: behind this is optimization of the inference architecture itself, including a layered generation strategy (generate a low-resolution skeleton first, then enhance the imagery at high resolution), a cache-and-reuse mechanism (reuse computation across similar scenes), and dynamic resource allocation (adjust model depth to the complexity of the specific content).
Only with this kind of optimization does Douyin ContentV's figure of 3.67 yuan per 5 seconds become achievable.
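A rough sketch of how those three strategies might fit together in code; the function names, resolutions, and depth thresholds are all hypothetical, not any vendor's implementation:

```python
# Illustrative sketch of the three cost-reduction strategies described above.
# Functions, resolutions, and thresholds are hypothetical placeholders.

from functools import lru_cache

@lru_cache(maxsize=256)
def generate_scene(scene_key: str, resolution: int):
    # Cache-and-reuse: the same scene at the same resolution is only computed once.
    return f"<{resolution}p clip for '{scene_key}'>"

def pick_model_depth(complexity: float) -> int:
    # Dynamic resource allocation: simpler content gets a shallower model.
    return 12 if complexity < 0.3 else 24 if complexity < 0.7 else 48

def generate_clip(scene_key: str, complexity: float):
    depth = pick_model_depth(complexity)
    draft = generate_scene(scene_key, 360)    # layered generation: low-res skeleton first
    final = generate_scene(scene_key, 1080)   # then enhance at full resolution
    return {"model_depth": depth, "draft": draft, "final": final}

print(generate_clip("street scene, rainy night", complexity=0.5))
```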
3. On application impact: traditional video production is a heavy-asset game involving equipment, venues, actors, and post-production; a 30-second advertisement can easily cost hundreds of thousands to produce. AI now compresses that process into a prompt plus a few minutes of waiting, and can deliver perspectives and effects that are hard to obtain with traditional shooting.
This shifts the barriers to video production from technology and capital to creativity and aesthetic judgment, which may drive a reshuffling of the entire creator economy.
So the question is: what does all this change on the demand side of web2AI have to do with web3AI?
1. First, the structure of computing-power demand is changing. Previously, AI competed on raw compute scale: whoever had the larger homogeneous GPU cluster won. Multimodal video generation, however, needs a diverse mix of compute, which may create demand for distributed idle computing power as well as for various distributed fine-tuning, algorithm, and inference platforms.
2. Second, demand for data labeling will also strengthen. Generating a professional-grade video requires precise scene descriptions, reference images, audio styles, camera motion trajectories, lighting conditions, and so on, all of which become new professional labeling demands. With web3-style incentives, photographers, sound engineers, 3D artists, and others can be motivated to contribute these professional data elements, strengthening AI video generation with specialized, vertical data labeling.
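To illustrate, a single labeling record might bundle those elements roughly like this; the schema below is my own illustration, not an existing web3 protocol or dataset format:

```python
# Hypothetical schema for a professional video-generation labeling task.
# Field names mirror the data elements listed above; the structure is illustrative.

from dataclasses import dataclass, field

@dataclass
class VideoGenLabel:
    scene_description: str                  # precise text description of the scene
    reference_images: list[str] = field(default_factory=list)   # URLs or content hashes
    audio_style: str = ""                   # e.g. "ambient rain, distant traffic"
    camera_trajectory: list[tuple[float, float, float]] = field(default_factory=list)
    lighting: str = ""                      # e.g. "low-key, single warm key light"
    contributor: str = ""                   # ID or wallet used for incentive payouts

label = VideoGenLabel(
    scene_description="rainy night street, neon reflections on wet asphalt",
    audio_style="soft rain, footsteps on pavement",
    lighting="cool blue ambient with warm storefront highlights",
)
print(label)
```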
3. Finally, it is worth noting that as AI gradually shifts from centralized, large-scale resource allocation toward modular collaboration, this in itself creates new demand for decentralized platforms. At that point, computing power, data, models, and incentives can combine into a self-reinforcing flywheel, further driving the convergence of web3AI and web2AI scenarios.