CAIBA’s already setting the standard 📊
Since launch earlier this month:
• CAIA Benchmark v0.2 expanded: 40 → 60 tasks
• Results show tooling is more effective than prompting
• Tokenomics trips up most models
Coming Soon:
• Expanding from 60 to 80 tasks in CAIA v0.3
• Adding more crypto agents (not just LLMs)
All the results are in the full blog post linked below