CAIBA’s already setting the standard 📊

Since launch earlier this month:

• CAIA Benchmark v0.2 expanded: 40 → 60 tasks

• Results show tooling is more effective than prompting

• Tokenomics trips up most models

Coming soon:

• Expanding from 60 to 80 tasks in CAIA v0.3

• Adding more crypto agents (not just LLMs)

All the results are in the full blog post linked below