Is data annotation, long dismissed as hard and tiring grunt work, quietly becoming a hot commodity? With over $11.2 million in funding led by Polychain, @OpenledgerHQ aims to address the long-neglected pain point of 'data value distribution' through its unique PoA + Infini-gram mechanism. Let's look at it from a technical perspective:

1) To be honest, the biggest 'original sin' of the current AI industry is the unfair distribution of data value. OpenLedger's PoA (Proof of Attribution) aims to establish a 'copyright tracking system' for data contributions.

Specifically: data contributors upload content to domain-specific DataNets, and each data point is permanently recorded along with contributor metadata and a content hash.
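To make that concrete, here is a minimal sketch of what such a record might contain. The schema, field names, and example values are my own assumptions for illustration, not OpenLedger's actual on-chain format.

```python
# Hypothetical sketch of a DataNet data-point record (illustrative schema only).
import hashlib
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class DataPointRecord:
    datanet_id: str     # which domain-specific DataNet the data belongs to
    contributor: str    # contributor's address or identity
    content_hash: str   # hash of the raw content, referenced later for attribution
    timestamp: float    # when the contribution was registered

def register_data_point(datanet_id: str, contributor: str, content: str) -> DataPointRecord:
    """Hash the content and bundle it with contributor metadata."""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return DataPointRecord(datanet_id, contributor, content_hash, time.time())

# Example usage with made-up identifiers
record = register_data_point("medical-qa", "0xContributorA", "Q: ... A: ...")
print(json.dumps(asdict(record), indent=2))
```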

Once a model is trained on these datasets, attribution happens at inference time, when the model generates output. PoA traces which data points influenced that output by analyzing match spans or impact scores, and these records determine each contributor's proportional influence.

When the model generates revenue through inference, PoA ensures that profits are accurately distributed based on each contributor's influence—creating a transparent, fair, and on-chain reward mechanism.
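As a rough illustration of that payout logic, here is a minimal sketch that splits a single inference fee in proportion to attribution scores. The function name, scores, and fee are assumptions for the example, not OpenLedger's actual accounting.

```python
# Minimal sketch: split one inference fee in proportion to attribution scores.
def split_inference_revenue(fee: float, influence: dict[str, float]) -> dict[str, float]:
    """Distribute a fee proportionally to each contributor's influence score."""
    total = sum(influence.values())
    if total == 0:
        return {contributor: 0.0 for contributor in influence}
    return {contributor: fee * score / total for contributor, score in influence.items()}

# e.g. attribution reports that three contributors' data shaped one output
scores = {"0xAlice": 0.5, "0xBob": 0.3, "0xCarol": 0.2}
print(split_inference_revenue(1.0, scores))
# {'0xAlice': 0.5, '0xBob': 0.3, '0xCarol': 0.2}
```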

In other words, PoA addresses the fundamental contradiction in data economics. The old logic was simple and crude: AI companies obtained massive amounts of data for free and made a fortune commercializing their models, while data contributors received nothing. PoA uses technical means to realize 'data privatization', allowing each data point to generate clear economic value.

I believe that once this shift from 'free-riding' to 'distribution according to labor' is successfully implemented, the incentive logic for data contribution will change completely.

Moreover, PoA adopts a layered strategy to handle attribution for models of different scales: for small models with millions of parameters, the impact of each data point can be estimated through influence-function analysis at a bearable computational cost; for medium and large models, that method becomes infeasible and inefficient. This is where Infini-gram comes into play.
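For intuition on the small-model case, here is a sketch of classic influence-function attribution on a tiny ridge-regression model, where the Hessian is small enough to invert exactly. The data and names are made up, and this shows the generic technique rather than OpenLedger's implementation.

```python
# Sketch: influence-function attribution on a tiny ridge-regression model.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1e-2
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Fit theta for the objective (1/n) * sum 0.5*(x_i.theta - y_i)^2 + 0.5*lam*||theta||^2
H = X.T @ X / n + lam * np.eye(d)          # Hessian of the training objective
theta = np.linalg.solve(H, X.T @ y / n)

def influence_on_test(x_test, y_test):
    """Influence of each training point on the test loss: -g_test^T H^{-1} g_i."""
    g_test = (x_test @ theta - y_test) * x_test
    g_train = (X @ theta - y)[:, None] * X   # per-point gradients, shape (n, d)
    return -g_train @ np.linalg.solve(H, g_test)

scores = influence_on_test(X[0], y[0])
print("most influential training points:", np.argsort(-np.abs(scores))[:5])
```

Even in this toy setting the cost is dominated by the Hessian solve, which is exactly what stops scaling to models with billions of parameters.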

2) So what exactly is Infini-gram? The problem it aims to solve sounds quite extreme: accurately tracing the data source of each output token in medium-to-large black-box models.

Traditional attribution methods rely mainly on influence-function analysis, but they fall short on large models. The reason is simple: the larger the model, the more complex its internal computations, and the cost of that analysis grows so quickly that it becomes infeasible and inefficient. For commercial applications it is entirely unrealistic.

Infini-gram takes a completely different approach: since the model's internals are too complex to analyze, it searches the raw training data directly. It builds an index based on suffix arrays and uses dynamically selected longest matching suffixes instead of traditional fixed-window n-grams. Simply put, when the model outputs a sequence, Infini-gram finds, for each token's context, the longest exact match in the training data.
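Here is a toy sketch of that matching logic: a suffix array over a tiny token corpus, plus a lookup for the longest suffix of the query context that appears verbatim in the data. The corpus and function are invented for illustration; the real Infini-gram index is far more compact and scales to trillions of tokens.

```python
# Toy sketch of suffix-array longest-suffix matching (bisect's key= needs Python 3.10+).
from bisect import bisect_left

corpus = "the cat sat on the mat because the cat was tired".split()
suffixes = sorted(range(len(corpus)), key=lambda i: corpus[i:])  # suffix array

def longest_suffix_match(context):
    """Return (match_length, corpus_position) of the longest context suffix found in the corpus."""
    for k in range(len(context), 0, -1):              # try the longest suffix first
        query = context[-k:]
        # binary search for a corpus suffix that starts with `query`
        lo = bisect_left(suffixes, query, key=lambda i: corpus[i:i + k])
        if lo < len(suffixes) and corpus[suffixes[lo]:suffixes[lo] + k] == query:
            return k, suffixes[lo]
    return 0, -1

print(longest_suffix_match("yesterday the cat".split()))  # finds the 2-token match "the cat"
```

The key design choice is that match length is chosen dynamically per query rather than fixed in advance, which is what distinguishes this from ordinary n-gram lookup.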

The resulting performance numbers are impressive: on a dataset of 1.4 trillion tokens, a query takes only about 20 milliseconds, and storage costs just 7 bytes per token. More importantly, precise attribution requires no access to the model's internal structure and no complex computation. For AI companies that treat their models as trade secrets, this is practically a tailor-made solution.

It's worth noting that existing data attribution solutions are either inefficient, imprecise, or require access to the model's internals. Infini-gram strikes a balance across all three dimensions.

3) Additionally, I find OpenLedger's concept of DataNets, on-chain datasets, particularly novel. Unlike traditional one-off data transactions, DataNets let data contributors keep earning a share of revenue whenever their data is used in inference.

In the past, data annotation was laborious, thinly rewarded, one-off work. Under this model it becomes an asset with continuous income, which fundamentally changes the incentive logic.

While most AI + Crypto projects are still focused on relatively mature directions like compute leasing and model training, OpenLedger has chosen to tackle the hardest problem of all: data attribution. This technology stack may redefine the supply side of AI data.

After all, in an era where data quality reigns supreme, whoever can solve the data value distribution problem will be able to attract the highest quality data resources.

In summary, the combination of OpenLedger's PoA + Infini-gram not only solves technical challenges but, more importantly, provides a completely new value distribution logic for the entire industry.

As the arms race in computing power gradually cools and competition shifts to data quality, this kind of technical path is unlikely to remain a one-off. The track will see multiple solutions competing in parallel: some focusing on attribution precision, some emphasizing cost efficiency, and others working on usability. Each is exploring the optimal formula for data value distribution.

Ultimately, which project emerges victorious will depend on whether it can truly attract enough data providers and developers.