According to PANews, OpenAI has released a new benchmark called BrowseComp, designed to evaluate AI agents' ability to locate hard-to-find information on the internet. The benchmark comprises 1,266 challenging questions that simulate an 'online treasure hunt' across complex information networks, where answers are hard to find but easy to verify. The questions span fields including film, technology, and history, and are significantly more difficult than those in existing tests such as SimpleQA.
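
The 'easy to verify' property is what makes a benchmark like this practical to score: each question pairs a hard-to-research query with a short reference answer, so grading reduces to comparing a model's response against that reference. Below is a minimal Python sketch of this scoring idea; the data layout and exact-match normalization are illustrative assumptions, not OpenAI's official BrowseComp grading code.

```python
# Minimal sketch of scoring a BrowseComp-style benchmark, where each
# question has a short reference answer that is easy to check.
# The normalization and exact-match rule here are illustrative only.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(text.lower().split())

def grade(predicted: str, reference: str) -> bool:
    """Return True if the model's answer matches the reference answer."""
    return normalize(predicted) == normalize(reference)

def accuracy(results: list[tuple[str, str]]) -> float:
    """Fraction of (predicted, reference) pairs graded as correct."""
    if not results:
        return 0.0
    correct = sum(grade(pred, ref) for pred, ref in results)
    return correct / len(results)

# Hypothetical example: 2 correct out of 3 questions -> ~66.7% accuracy.
sample = [
    ("Christopher Nolan", "Christopher Nolan"),
    ("1997", "1998"),
    ("The Transformer paper", "the transformer paper"),
]
print(f"accuracy: {accuracy(sample):.1%}")
```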

The AIGC Open Community reports that the benchmark is highly challenging: OpenAI's own models, GPT-4o and GPT-4.5, achieve accuracy rates of only 0.6% and 0.9%, respectively, and even GPT-4o with browsing enabled reaches just 1.9%. By contrast, OpenAI's newly released agent model, Deep Research, achieves a much higher accuracy of 51.5%.