Author: Core contributor of Biteye, @anci_hu49074

“We are at a time when the world is competing to build the best foundational models. While computing power and model architecture are important, the real moat is the training data.”

——Sandeep Chinchali, Chief AI Officer of Story

Starting with Scale AI, let's discuss the potential of the AI Data sector.

If we are talking about the biggest piece of gossip in the AI circle this month, it is Meta showing off its spending power, with Zuckerberg recruiting aggressively to assemble a star-studded Meta AI team, much of it made up of Chinese research talent. The team is led by 28-year-old Alexandr Wang, founder of Scale AI. Scale AI is currently valued at $29 billion; its clients include the US military as well as OpenAI, Anthropic, and other competing AI giants, all of which rely on it for data services. Scale AI's core business is supplying large volumes of accurately labeled data.

Why can Scale AI stand out among many unicorns?

The reason lies in its early recognition of how important data is to the AI industry.

Computing power, models, and data are the three pillars of AI. If we compare a large model to a person, then the model is the body, computing power is food, and data is knowledge and information.

In the years since LLMs took off, the industry's attention has shifted from models to computing power. Today most models have settled on the transformer as their framework, with occasional innovations such as MoE or MoRe; major players either build super clusters to secure computing power or sign long-term agreements with powerful cloud providers like AWS. Once basic computing needs are met, the importance of data gradually comes to the fore.

Unlike traditional B2B big-data companies such as Palantir, Scale AI, as its name suggests, is dedicated to building a solid data foundation for AI models. Its business goes beyond mining existing data into long-term data generation, and it has been assembling AI trainer teams of experts from different fields to supply models with higher-quality training data.

Before you dismiss this business, let's first look at how models are trained.

Model training is divided into two stages: pre-training and fine-tuning.

Pre-training is somewhat like a human baby gradually learning to speak. We feed the model large amounts of text, code, and other information obtained from web crawlers, and by teaching itself from this content the model learns to speak human language (natural language, in academic terms) and acquires basic communication skills.

Fine-tuning is more like going to school, where there are usually clear right answers and clear directions. Just as schools train students into different kinds of talent depending on their orientation, we train the model on pre-processed, targeted datasets to give it the specific capabilities we expect.

At this point, you might have realized that the data we need is also divided into two parts.

  • Some data does not need much processing; quantity is what matters. It is usually sourced from large UGC platforms such as Reddit and Twitter, from GitHub, from open literature databases, from private corporate databases, and so on.

  • The other part is like a specialized textbook: it must be carefully designed and selected so that it cultivates specific, desirable qualities in the model, which requires the necessary data cleaning, screening, labeling, and human feedback (see the illustrative sketch after this list).
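To make the distinction concrete, here is a minimal sketch of what one record from each category might look like. The field names are illustrative assumptions only, not any vendor's actual schema.

```python
# Illustrative only: field names are assumptions, not any platform's real schema.

# Category 1: raw, lightly processed web-scale data used for pre-training.
raw_document = {
    "source": "web_crawl",
    "url": "https://example.com/some-article",
    "text": "Large language models learn statistical patterns from text...",
    "license": "public",
}

# Category 2: curated, labeled data used for fine-tuning, with cleaning,
# task-specific structure, and human feedback attached.
curated_example = {
    "task": "instruction_following",
    "prompt": "Summarize the following contract clause in plain English.",
    "response": "The tenant must give 30 days' written notice before moving out.",
    "annotator_id": "expert_legal_007",
    "quality_score": 4.8,  # human reviewer rating, e.g. on a 1-5 scale
    "feedback": "Accurate and concise; keeps the legal meaning intact.",
}
```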

Together, these two kinds of datasets make up the main body of the AI Data sector. Do not underestimate these seemingly low-tech datasets: the prevailing view is that as the computational gains from scaling laws gradually diminish, data will become the most important pillar by which large-model vendors maintain their competitive advantage.

As model capabilities improve further, increasingly refined and specialized training data will become the key variable shaping what a model can do. If we liken model training to the cultivation of a martial arts master, then high-quality datasets are the best secret manuals (and, to complete the analogy, computing power is the elixir and the model is innate talent).

Looked at over the long run, AI Data is also a sector that snowballs: as early groundwork accumulates, data assets compound and become more valuable with age.

Web3 DataFi: the promised land of AI Data

Compared with the remote labeling workforce of hundreds of thousands that Scale AI has built in places like the Philippines and Venezuela, Web3 has natural advantages for AI data work, which has given rise to the new term DataFi.

In an ideal situation, the advantages of Web3 DataFi are as follows:

1. Data sovereignty, security, and privacy guaranteed by smart contracts.

With existing public data close to being fully exploited, how to tap unpublished and even private data has become an important direction for expanding data sources. This poses a major question of trust: do you sign a buyout contract with a centralized giant and sell off the data in your hands, or do you take the blockchain route, keeping the IP of your data while knowing clearly, via smart contracts, who is using it, when, and for what purpose?

For sensitive information, technologies such as zero-knowledge proofs (zk) and trusted execution environments (TEE) can ensure that your private data is handled only by 'tight-lipped' machines, preventing leaks.
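As a toy illustration, written in plain Python rather than real contract code and with assumed names and fields, the kind of usage ledger such a contract could keep, so the owner can always answer 'who, when, and for what purpose,' might look like this:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Toy, off-chain model of a data-sovereignty ledger. All names and fields are
# illustrative assumptions, not any specific protocol's contract interface.

@dataclass
class UsageRecord:
    consumer: str   # who used the data (e.g., a model vendor's address)
    purpose: str    # what it was used for
    timestamp: str  # when the access happened

@dataclass
class DataAsset:
    owner: str
    allowed_purposes: set
    usage_log: list = field(default_factory=list)

    def grant_access(self, consumer: str, purpose: str) -> bool:
        """Record an access only if the owner pre-approved this purpose."""
        if purpose not in self.allowed_purposes:
            return False  # out-of-scope usage is rejected outright
        self.usage_log.append(UsageRecord(
            consumer=consumer,
            purpose=purpose,
            timestamp=datetime.now(timezone.utc).isoformat(),
        ))
        return True

# The owner keeps the data IP and whitelists specific uses.
asset = DataAsset(owner="0xOWNER", allowed_purposes={"fine_tuning"})
print(asset.grant_access("0xMODEL_VENDOR", "fine_tuning"))  # True, and logged
print(asset.grant_access("0xAD_NETWORK", "ad_targeting"))   # False, rejected
```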

2. Natural geographical arbitrage advantages: a free distributed architecture that attracts the most suitable labor force.

Perhaps it is time to challenge traditional labor relations. Rather than hunting for low-cost labor around the world the way Scale AI does, it is better to leverage the distributed nature of blockchain and use open, transparent incentives guaranteed by smart contracts, letting a workforce scattered across the globe contribute data.

For labor-intensive tasks such as data labeling and model evaluation, the Web3 DataFi approach also yields a more diverse pool of participants than the centralized data-factory model, which matters in the long run for avoiding data bias.
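As a generic illustration of how a distributed workforce's output can be consolidated (simple majority voting over independent annotations, not any particular project's mechanism), the sketch below shows why a more diverse annotator pool dilutes any single group's bias in the aggregated label:

```python
from collections import Counter

# Generic sketch: aggregate independent labels from a distributed workforce.
# Each annotation is (annotator_region, label); all values are illustrative.
annotations = {
    "item_001": [("PH", "positive"), ("VE", "positive"), ("DE", "negative")],
    "item_002": [("NG", "toxic"), ("IN", "toxic"), ("US", "not_toxic")],
}

def majority_label(votes):
    """Return the most common label and the share of annotators who agreed."""
    labels = [label for _, label in votes]
    top_label, count = Counter(labels).most_common(1)[0]
    return top_label, count / len(labels)

for item_id, votes in annotations.items():
    label, agreement = majority_label(votes)
    # Low agreement can flag an item for expert review instead of silent inclusion.
    print(item_id, label, f"agreement={agreement:.2f}")
```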

3. Clear incentive and settlement advantages of blockchain.

How do you avoid tragedies like the 'Jiangnan Leather Factory' (the Chinese meme for a boss who runs off without paying wages)? By replacing the darker side of human nature with a clear pricing and incentive system guaranteed by smart contracts.

And in the seemingly inevitable context of de-globalization, how do you keep capturing low-cost geographical arbitrage? Opening companies all over the world has clearly become harder, so why not bypass the barriers of the old world and settle on-chain instead?
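Here is a toy sketch of what transparent settlement boils down to; the task rates and field names are assumptions for illustration, not any platform's actual payout formula:

```python
# Toy settlement sketch: illustrative rates and fields, not a real payout formula.
RATE_PER_TASK = {"label_image": 4, "rank_responses": 10}  # USDC cents per accepted task

contributions = [
    {"worker": "0xA11CE", "task": "label_image", "accepted": 230},
    {"worker": "0xB0B", "task": "rank_responses", "accepted": 95},
]

def settle(contribs):
    """Compute each contributor's payout from accepted work at publicly posted rates."""
    payouts = {}
    for c in contribs:
        amount = RATE_PER_TASK[c["task"]] * c["accepted"]
        payouts[c["worker"]] = payouts.get(c["worker"], 0) + amount
    return payouts

# Every rate and every accepted count is public, so the payout is auditable
# by anyone, anywhere, with no local subsidiary or payroll office in between.
print(settle(contributions))  # {'0xA11CE': 920, '0xB0B': 950} (in cents)
```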

4. Beneficial for building a more efficient and open 'one-stop' data market.

The middleman taking a cut is an eternal pain point for both supply and demand. Rather than letting a centralized data company act as broker, it is better to build a platform on-chain where the supply and demand sides of data connect more transparently and efficiently, through a publicly accessible marketplace in the spirit of Taobao.

As the on-chain AI ecosystem develops, demand for on-chain data will become stronger, more segmented, and more diverse, and only a decentralized marketplace can efficiently absorb that demand and turn it into ecosystem prosperity.

For ordinary retail investors, DataFi is also the most accessible kind of decentralized AI project.

Although the emergence of AI tools has lowered the learning threshold to some extent, and the original intention of decentralized AI is precisely to break the giants' current monopoly on the AI business, it must be admitted that many current projects offer little meaningful participation for retail users without a technical background: joining a decentralized computing network usually requires expensive upfront hardware, and the technical bar of model marketplaces easily deters ordinary participants.

DataFi, by contrast, is one of the few opportunities in the AI revolution that ordinary users can actually seize. Web3 lets you take part without signing a data-factory contract: a few clicks to log in with your wallet and you can join all kinds of simple tasks, such as contributing data, labeling based on human intuition and instinct, evaluating model outputs, or using AI tools for simple creations and trading data. For seasoned reward farmers, the difficulty is essentially zero.

Potential projects in Web3 DataFi

Follow the money to find the direction. Beyond Meta's $14.3 billion investment in Scale AI in the Web2 world and Palantir's stock rising roughly fivefold within a year, the DataFi sector has also put up impressive financing numbers in Web3. Below is a brief introduction to these projects.

Sahara AI, @SaharaLabsAI, raised $49 million.

Sahara AI's ultimate goal is to build decentralized AI super-infrastructure and a trading marketplace. Its first area of focus is AI Data: the public beta of its Data Services Platform (DSP) is scheduled to launch on July 22, and users can earn token rewards by contributing data and taking part in data labeling tasks.

Link: app.saharaai.com

Yupp, @yupp_ai, raised $33 million.

Yupp is a feedback platform for AI models that mainly collects user feedback on model outputs. The core task at the moment is for users to compare different models' outputs to the same prompt and pick the one they think is better. Completing tasks earns Yupp points, which can be redeemed for USDC and other stablecoins.

Link: https://yupp.ai/

Vana, @vana, raised $23 million.

Vana focuses on turning users' personal data (social media activity, browsing history, and so on) into monetizable digital assets. Users can authorize the upload of their personal data to the corresponding data liquidity pool (DLP) of a DataDAO; the data is then aggregated for uses such as AI model training, and users receive token rewards in return.

Link: https://www.vana.org/collectives

Chainbase, @ChainbaseHQ, raised $16.5 million.

Chainbase focuses on on-chain data, currently covering more than 200 blockchains and turning on-chain activity into structured, verifiable, and monetizable data assets for dApp development. Its business comes mainly from multi-chain indexing, with data processed through its Manuscript system and Theia AI model; for now there is little for ordinary users to participate in.

Sapien, @JoinSapien, raised $15.5 million.

Sapien's goal is to turn human knowledge into high-quality AI training data at scale. Anyone can do data labeling work on the platform, with quality ensured through peer verification. Users are also encouraged to build long-term credibility, or to commit by staking, in order to earn greater rewards.

Link: https://earn.sapien.io/#hiw

Prisma X, @PrismaXai, raised $11 million.

Prisma X aims to be an open coordination layer for robots, in which collecting data from the physical world is key. The project is still at an early stage; judging from the recently released white paper, participation may involve investing in robots that collect data or remotely operating robots to collect data. At the moment, a quiz based on the white paper is open and earns points for participating.

Link: https://app.prismax.ai/whitepaper

Masa, @getmasafi, raised $8.9 million.

Masa is one of the leading subnet projects in the Bittensor ecosystem, currently running data subnet #42 and agent subnet #59. The data subnet is dedicated to providing real-time access to data, with miners using TEE hardware to crawl real-time data from X/Twitter, which makes participation relatively difficult and costly for ordinary users.

Irys, @irys_xyz, raised $8.7 million.

Irys focuses on programmable data storage and computation, aiming to provide efficient, low-cost solutions for AI, decentralized applications (dApps), and other data-intensive applications. There are not yet many ways for ordinary users to contribute data, but several activities in the current testnet stage are open for participation.

Link: https://bitomokx.irys.xyz/

ORO, @getoro_xyz, raised $6 million.

ORO's aim is to empower ordinary people to contribute to AI. Supported ways to contribute include: 1. linking personal accounts to contribute personal data, including social accounts, health data, and e-commerce or financial accounts; 2. completing data tasks. The testnet is now live and open for participation.

Link: app.getoro.xyz

Gata, @Gata_xyz, raised $4 million.

Positioned as a decentralized data layer, Gata has so far launched three key products: 1. Data Agent: a series of AI agents that run automatically and process data once the user opens the webpage; 2. All-in-one Chat: a rewards mechanism similar to Yupp's model evaluation; 3. GPT-to-Earn: a browser extension that collects users' conversation data on ChatGPT.

Link: https://app.gata.xyz/dataAgent

Extension link: https://chromewebstore.google.com/detail/hhibbomloleicghkgmldapmghagagfao?utm_source=item-share-cb

How do you view the current projects?

At present, the barriers to entry for these projects are generally not high, but it must be acknowledged that once users and ecosystem stickiness accumulate, platform advantages build up quickly. Early on, therefore, projects should focus on incentives and user experience: only by attracting enough users can this large data business be established.

At the same time, as labor-intensive businesses, these data platforms must also think about how to manage their workforce and guarantee the quality of data output while attracting manpower. After all, a common failing of many Web3 projects is that most users on the platform are simply ruthless reward farmers who readily sacrifice quality for short-term gain. If they become the platform's main users, bad money drives out good, data quality suffers, and buyers stay away. We can already see projects such as Sahara and Sapien emphasizing data quality and working to build long-term, healthy relationships with the labor on their platforms.

Insufficient transparency is another issue facing current on-chain projects. Admittedly, the blockchain trilemma forces many projects onto a 'centralization first, decentralization later' path during their startup phase. But more and more on-chain projects come across as old Web2 businesses wrapped in a Web3 skin: publicly traceable on-chain data is scant, and it is hard to find a long-term commitment to openness and transparency anywhere on the roadmap. This is poisonous for the long-term healthy development of Web3 DataFi, and we hope more projects stay true to the mission and accelerate their move toward openness and transparency.

Finally, DataFi's path to mass adoption runs on two tracks: one is attracting enough toC participants to join the network, forming a strong workforce for data collection and generation projects while also serving as consumers of the AI economy, thereby closing the ecosystem loop; the other is winning recognition from today's mainstream toB companies, since they are the main source of large data deals in the short term.

Conclusion

What is certain is that DataFi means using human intelligence to nurture machine intelligence over the long term, with smart contracts as the agreements that ensure human labor is rewarded and ultimately shares in what machine intelligence gives back.

If you feel anxious about the uncertainties of the AI era, and still hold on to your blockchain ideals through the ups and downs of the crypto world, then following the big money into DataFi may be a good choice.