Proof of Attribution (PoA): its two core keywords are interpretability and payability.
Many people get a headache when they first see the term. Simply put, proof of attribution clarifies the relationship between a model's outputs and its training data. The goal is to let the AI system know who its data providers are and what data each one provided, which is the key to interpretability.
✍️There are two attribution methods for large/small models:
💠Gradient Attribution (suitable for small models)
It is similar to how a doctor reasons through counterfactuals about a patient, for example: 'What would the treatment outcome be if the patient did not take hormones?' or 'Could surgery alone cure it completely?'
Applied to AI, the question becomes: how much does a specific training sample affect the model's predictions? The resulting value represents that sample's contribution. This method suits small models fine-tuned for specific communities.
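The gradient idea above can be sketched in a few lines. This is a minimal, illustrative toy (TracIn-style gradient dot products on a linear model), not OpenLedger's actual implementation; all names and the loss function are assumptions for the example.

```python
# Toy gradient attribution: score a training sample's contribution to a test
# prediction as the dot product of their loss gradients (TracIn-style sketch).
import numpy as np

def loss_grad(w, x, y):
    """Gradient of squared-error loss for a linear model y_hat = w . x."""
    return 2 * (w @ x - y) * x

def influence(w, train_sample, test_sample):
    """Positive => the training sample pushes the model toward the test label;
    negative => it pushes the model away."""
    g_train = loss_grad(w, *train_sample)
    g_test = loss_grad(w, *test_sample)
    return float(g_train @ g_test)

w = np.array([0.5, -0.2])
train = (np.array([1.0, 2.0]), 1.0)   # (features, label)
test = (np.array([1.5, 1.0]), 0.5)
print(influence(w, train, test))      # sign tells helpful vs. harmful
```

Real systems apply the same idea with per-sample gradients of a neural network's loss, but the arithmetic is the same: a sample's contribution is measured by how its gradient aligns with the prediction of interest.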
💠Infini-gram Attribution (suitable for large language models)
Imagine a plagiarism-detection system for papers: the system finds similar passages in the existing literature and marks them for you.
All training data is indexed in one super search engine. After the model generates a piece of content, the engine is queried for matches; if highly matching passages are found, that data is judged to have influenced (contributed to) the model's output.
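The matching step can be sketched with a simple n-gram index. This is only a toy to show the idea; the real Infini-gram engine uses suffix arrays over the full corpus and handles arbitrarily long n-grams, and the function names here are illustrative assumptions.

```python
# Toy infini-gram-style attribution: index all n-grams of a training corpus,
# then count which documents share n-grams with the generated output.
from collections import defaultdict

def build_index(docs, n=3):
    """Map each n-gram (tuple of tokens) to the set of docs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])].add(doc_id)
    return index

def attribute(output, index, n=3):
    """Return {doc_id: match_count} for n-grams shared with the output."""
    tokens = output.split()
    hits = defaultdict(int)
    for i in range(len(tokens) - n + 1):
        for doc_id in index.get(tuple(tokens[i:i + n]), ()):
            hits[doc_id] += 1
    return dict(hits)

docs = {"d1": "the cat sat on the mat", "d2": "dogs chase cats in the yard"}
idx = build_index(docs)
print(attribute("the cat sat quietly", idx))  # {'d1': 1}
```

Documents with high match counts are the ones judged to have contributed to the output, which is exactly the "plagiarism detection" intuition above.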
📙Every organization has its own database, and the unique database of OpenLedger is called 'DataNet'.
DataNet is a thematic database: every piece of registered content is stored off-chain, while its metadata and hash are recorded on the blockchain.
Interestingly, any community member can comment on or contest existing content, which gives it the feel of an on-chain 'Wikipedia'.
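The registration flow just described can be sketched as follows. The record layout, storage dictionaries, and function name are illustrative assumptions, not OpenLedger's actual DataNet schema.

```python
# Minimal sketch of the off-chain/on-chain split: full content stays off-chain,
# while only its SHA-256 hash and metadata are appended to the chain.
import hashlib
import time

OFF_CHAIN_STORE = {}   # stand-in for off-chain content storage
ON_CHAIN_LEDGER = []   # stand-in for the blockchain

def register(content: str, provider: str, domain: str) -> dict:
    digest = hashlib.sha256(content.encode()).hexdigest()
    OFF_CHAIN_STORE[digest] = content          # full data stays off-chain
    record = {"hash": digest, "provider": provider,
              "domain": domain, "timestamp": time.time()}
    ON_CHAIN_LEDGER.append(record)             # only metadata + hash on-chain
    return record

rec = register("example medical note text", "alice", "medicine")
print(rec["hash"][:16])
```

Anyone can later re-hash the off-chain content and compare it with the on-chain record, which is what makes the data verifiable without putting it all on-chain.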
🔆This is also where payability shows up: as long as the data you provide influences the output of a future AI model, you accumulate attribution credit, which in turn generates revenue for you.
🌟If you have really read through the above, you will see that PoA's potential is far-reaching: it lets models record the time, version, and data used in every training run and inference, and can further refine rewards and governance weights based on labels such as domain and quality.