Authors: Mario Chow & Figo @IOSG

Introduction

In the past 12 months, the relationship between web browsers and automation has changed dramatically. Almost every major tech company is racing to build autonomous browser agents. The trend became increasingly apparent starting in late 2024: OpenAI launched Agent mode (initially Operator) in January 2025, Anthropic released the "Computer Use" feature for the Claude model, Google DeepMind introduced Project Mariner, Opera announced its agentic browser Neon, and Perplexity AI launched the Comet browser. The signal is clear: the future of AI lies in agents capable of autonomously navigating the web.

This trend is not merely about adding smarter chatbots to browsers; it represents a fundamental shift in how machines interact with the digital environment. Browser agents are a class of AI systems that can "see" web pages and take actions: clicking links, filling forms, scrolling pages, typing text, just as a human user would. This model promises to unlock enormous productivity and economic value, because it can automate tasks that currently require manual operation or that are too complex for traditional scripts.

▲ GIF demo: an AI browser agent in action, following instructions, navigating to the target dataset page, taking screenshots automatically, and extracting the required data.
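To make those mechanics concrete, here is a minimal sketch of the primitive actions such an agent drives (navigate, click, type, screenshot, extract), written with Playwright; the URL and selectors are hypothetical placeholders rather than any vendor's product, and a real agent would choose them at runtime from model output instead of hard-coding them.

```python
# Minimal sketch of the primitive actions a browser agent drives.
# URL and selectors are placeholders, not a real site.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    page.goto("https://example.org/datasets")         # navigate to the target page
    page.click("text=Monthly report")                 # click a link, as a human would
    page.fill("#search", "2024 revenue")              # type into a form field
    page.keyboard.press("Enter")
    page.screenshot(path="step.png")                  # capture what the agent "sees"

    rows = page.locator("table tr").all_inner_texts() # extract the required data
    print(rows)

    browser.close()
```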

Who will win the AI browser war?

Almost all major tech companies (and some startups) are developing their own browser AI agent solutions. Here are some of the most representative projects:

OpenAI – Agent Mode

OpenAI's Agent mode (formerly known as Operator, launched in January 2025) is an AI agent that ships with its own browser. It can handle a variety of repetitive online tasks, such as filling out web forms, ordering groceries, and scheduling meetings, all completed through the standard web interfaces humans already use.

▲ AI agents schedule meetings like professional assistants: check calendars, find available time slots, create events, send confirmations, and generate .ics files for you.

Anthropic – Claude's "Computer Use"

In late 2024, Anthropic introduced a new "Computer Use" feature for Claude 3.5, granting it the ability to operate a computer and a browser the way a human does. Claude can see the screen (via screenshots), move the cursor, click buttons, and enter text. It was the first large-model agent tool of its kind to enter public testing, allowing developers to let Claude autonomously navigate websites and applications. Anthropic positions it as an experimental feature aimed primarily at automating multi-step workflows on web pages.
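The loop behind this kind of agent is straightforward to picture: capture a screenshot, ask the model what to do next, translate its answer into mouse and keyboard events, and repeat. Below is a hedged schematic of that loop using Playwright; decide_next_action is a stand-in for the model call and is not Anthropic's actual API.

```python
# Schematic observe -> decide -> act loop behind "computer use" style agents.
from playwright.sync_api import sync_playwright

def decide_next_action(screenshot_bytes, task):
    # Placeholder: a real agent would send the screenshot plus the task to an
    # LLM and get back a structured action. Hard-coded here for illustration.
    return {"type": "click", "x": 640, "y": 360}

def run_agent(task, start_url):
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        page.goto(start_url)
        for _ in range(20):                        # cap the number of steps
            shot = page.screenshot()               # observe: what is on screen
            action = decide_next_action(shot, task)
            if action["type"] == "click":          # act: translate into input events
                page.mouse.click(action["x"], action["y"])
            elif action["type"] == "type":
                page.keyboard.type(action["text"])
            elif action["type"] == "done":
                break

# Example call (placeholder task and URL):
# run_agent("download the monthly CSV", "https://example.org")
```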

Perplexity – Comet

AI startup Perplexity (known for its answer engine) launched the Comet browser in mid-2025 as an AI-driven alternative to Chrome. At Comet's core is a conversational AI search engine built into the address bar (omnibox), which provides instant answers and summaries rather than a traditional list of search links.

  • Comet also includes Comet Assistant, a resident sidebar agent that can carry out everyday tasks across websites: for example, summarizing your open emails, scheduling meetings, managing browser tabs, or browsing and scraping web information on your behalf.

  • By allowing agents to perceive the current webpage content through a sidebar interface, Comet aims to seamlessly integrate browsing with AI assistance.

Real-world application scenarios of browser agents

In the discussion above, we reviewed how the major tech companies (OpenAI, Anthropic, Perplexity, and others) deliver browser-agent functionality through different product forms. To better understand their value, it is worth looking at how these capabilities are applied in real-world scenarios, both in daily life and in business workflows.

Everyday Web Automation

E-commerce and Personal Shopping

A very practical scenario is delegating shopping and booking tasks to an agent. The agent can automatically fill your online shopping cart and place orders from a standing list, or search multiple retailers for the lowest price and complete checkout on your behalf.

For travel, you can ask the AI to perform a task like: "Help me book a flight to Tokyo next month (fare under $800) and find a hotel with free Wi-Fi." The agent handles the entire process: searching for flights, comparing options, filling in passenger information, and completing the hotel booking, all through the airline and hotel websites. This level of automation goes far beyond existing travel bots: it does not just recommend, it executes the purchase.

Enhancing Office Efficiency

Agents can automate many of the repetitive business operations people perform in browsers: organizing email and extracting to-do items, or checking availability across multiple calendars and scheduling meetings automatically. Perplexity's Comet Assistant can already summarize your inbox or add events to your schedule through the web interface. With your authorization, agents can also log into SaaS tools to generate recurring reports, update spreadsheets, or submit forms. Imagine an HR agent that logs into different recruitment sites to post job listings, or a sales agent that updates lead data in the CRM. These tedious everyday tasks normally consume a great deal of employee time, yet an AI can complete them by automating web forms and page actions.
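As a concrete illustration of the "check calendars, find a slot, create the event" pattern (including the .ics output mentioned earlier), here is a small sketch; the calendars are hard-coded stand-ins for whatever a real agent would read out of Google Calendar or Outlook, and the names and times are made up.

```python
# Hypothetical sketch: find a common free slot across calendars, emit an .ics invite.
from datetime import datetime, timedelta, timezone

busy = {  # each person's busy intervals, as an agent would read them from calendar UIs/APIs
    "alice": [(datetime(2025, 6, 10, 9, tzinfo=timezone.utc), datetime(2025, 6, 10, 12, tzinfo=timezone.utc))],
    "bob":   [(datetime(2025, 6, 10, 13, tzinfo=timezone.utc), datetime(2025, 6, 10, 15, tzinfo=timezone.utc))],
}

def first_free_slot(day_start, day_end, duration):
    t = day_start
    while t + duration <= day_end:
        # a slot works if it overlaps nobody's busy interval
        if all(not (s < t + duration and t < e) for slots in busy.values() for s, e in slots):
            return t
        t += timedelta(minutes=30)
    return None

start = first_free_slot(datetime(2025, 6, 10, 9, tzinfo=timezone.utc),
                        datetime(2025, 6, 10, 17, tzinfo=timezone.utc),
                        timedelta(minutes=30))
if start is None:
    raise SystemExit("no common free slot found")
end = start + timedelta(minutes=30)

fmt = "%Y%m%dT%H%M%SZ"
ics = (
    "BEGIN:VCALENDAR\r\nVERSION:2.0\r\nPRODID:-//agent-demo//EN\r\n"
    "BEGIN:VEVENT\r\nUID:demo-1@example.org\r\n"
    f"DTSTAMP:{datetime.now(timezone.utc).strftime(fmt)}\r\n"
    f"DTSTART:{start.strftime(fmt)}\r\nDTEND:{end.strftime(fmt)}\r\n"
    "SUMMARY:Sync meeting\r\nEND:VEVENT\r\nEND:VCALENDAR\r\n"
)
with open("invite.ics", "w") as f:      # the .ics file the agent hands back to the user
    f.write(ics)
```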

Beyond single tasks, agents can chain complete workflows across multiple web systems: logging into various dashboards to troubleshoot, or orchestrating processes such as onboarding a new employee (creating accounts across several SaaS sites). All of these steps require interaction with different web interfaces, which is exactly where browser agents excel. Essentially, any multi-step operation that currently requires opening several websites can be delegated to an agent.

Current Challenges and Limitations

Despite the enormous potential, today's browser agents are still far from perfect. Current implementations reveal some long-standing technical and infrastructure challenges:

Architectural Mismatch

The modern web was designed for browsers operated by humans and, over time, has evolved to actively resist automation. Data is often buried in HTML/CSS optimized for visual display, gated behind interactive gestures (mouse hover, scrolling), or reachable only through undocumented APIs.

On top of this, anti-scraping and anti-fraud systems add further barriers. These tools combine IP reputation, browser fingerprinting, JavaScript challenge responses, and behavioral analysis (the randomness of mouse movements, typing rhythm, dwell time). Ironically, the more "perfect" and efficient an AI agent appears, filling forms instantly and never making mistakes, the more likely it is to be flagged as malicious automation. The result can be a hard failure: OpenAI's or Google's agents may complete every pre-checkout step and still be blocked by a CAPTCHA or a secondary security filter.

Human-optimized interfaces and bot-hostile defensive layers combine to force agents into fragile "human imitation" strategies. The approach is failure-prone and has a low success rate: without human intervention, end-to-end transaction completion rates remain below one third.
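To show what those "human imitation" strategies look like in practice, here is a hedged Playwright sketch with randomized typing cadence, staged cursor movement, and pauses; the site and selectors are hypothetical, and as noted above, none of this defeats fingerprinting or IP-reputation checks.

```python
# One of the fragile "human imitation" tactics described above, sketched with Playwright.
import random
from playwright.sync_api import sync_playwright

def human_type(page, selector, text):
    page.click(selector)
    for ch in text:
        page.keyboard.type(ch)
        page.wait_for_timeout(random.randint(60, 220))   # uneven keystroke timing

with sync_playwright() as p:
    page = p.chromium.launch(headless=False).new_page()
    page.goto("https://example.org/login")                # placeholder site
    page.mouse.move(200, 300, steps=25)                   # staged cursor path, not a teleport
    page.wait_for_timeout(random.randint(400, 1200))      # "reading" pause
    human_type(page, "#email", "user@example.org")
    human_type(page, "#password", "correct-horse")
    page.mouse.move(480, 520, steps=15)
    page.click("button[type=submit]")
```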

Trust and Security Concerns

To give agents full control, they typically need access to sensitive information: login credentials, cookies, two-factor authentication tokens, or even payment details. This raises concerns that users and businesses alike are right to have:

  • What if the agent makes a mistake or is deceived by a malicious website?

  • If the agent agrees to a service term or executes a transaction, who is responsible?


Given these risks, current systems generally take a cautious approach:

  • Google's Mariner will not enter credit card details or agree to terms of service; it hands those steps back to the user.

  • OpenAI's Operator will prompt users to take over login or CAPTCHA challenges.

  • An agent powered by Anthropic's Claude may directly refuse to log in for security reasons.


The result is frequent pauses and handoffs between AI and human, which undercuts the seamless automation experience.
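The pattern underlying all three behaviors is the same: classify each step and hand control back to the human whenever a step crosses a sensitivity threshold. A minimal sketch of that gate follows; the categories and the confirmation mechanism are illustrative, not any vendor's actual policy engine.

```python
# Hedged sketch of the human-in-the-loop gate: pause on sensitive steps.
SENSITIVE = {"enter_payment", "accept_terms", "login", "solve_captcha"}

def execute_step(step, perform, ask_human):
    if step["kind"] in SENSITIVE:
        approved = ask_human(f"Agent wants to '{step['kind']}' on {step['site']}. Approve or take over?")
        if not approved:
            return "handed_back_to_user"
    perform(step)                      # otherwise the agent just carries on
    return "done"

# Usage: a checkout flow where the payment step is deferred to the human.
steps = [
    {"kind": "search_flights", "site": "airline.example"},
    {"kind": "fill_passenger_form", "site": "airline.example"},
    {"kind": "enter_payment", "site": "airline.example"},
]
for s in steps:
    print(s["kind"], "->", execute_step(s, perform=lambda st: None,
                                         ask_human=lambda msg: False))
```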

Despite these barriers, progress remains rapid. Companies like OpenAI, Google, and Anthropic learn from failures with each iteration. As demand grows, a form of "co-evolution" is likely to emerge: websites becoming more agent-friendly in scenarios that favor them, while agents keep improving their ability to imitate human behavior and get past the remaining barriers.

Methods and Opportunities

Currently, browser agents face two starkly different realities: on one hand, the hostile environment of Web2, where anti-scraping and security defenses are everywhere; on the other hand, the open environment of Web3, where automation is often encouraged. This disparity determines the direction of various solutions.

The following solutions can be broadly categorized into two types: one helps agents bypass the hostile environment of Web2, while the other is native to Web3.

While the challenges faced by browser agents remain significant, new projects continue to emerge, attempting to address these issues directly. The cryptocurrency and decentralized finance (DeFi) ecosystem is becoming a natural testing ground because it is open, programmable, and less hostile to automation. Open APIs, smart contracts, and on-chain transparency eliminate many friction points common in the Web2 world.

Here are four categories of solutions, each addressing one or more core limitations of the current systems:

Native agent browsers focused on on-chain operations

These browsers are designed from the ground up to be driven by autonomous agents and are deeply integrated with blockchain protocols. Unlike a conventional Chrome setup, which has to lean on Selenium, Playwright, or wallet extensions to automate on-chain operations, agent-native browsers expose APIs and trusted execution paths for agents to call directly.

In decentralized finance, the validity of transactions relies on cryptographic signatures rather than whether the user is "human-like." Therefore, in on-chain environments, agents can bypass common CAPTCHA, fraud detection scores, and device fingerprint checks found in the Web2 world. However, if these browsers point to Web2 sites like Amazon, they cannot bypass the relevant defense mechanisms and will still trigger normal anti-bot measures.

The value of agent-based browsers lies not in their ability to magically access all websites but in:

  • Native blockchain integration: Built-in wallet and signature support, eliminating the need for MetaMask pop-ups or parsing the dApp front-end DOM.

  • Automation-first design: Provides stable high-level instructions that can be directly mapped to protocol operations.

  • Security model: Fine-grained permission control and sandboxing ensure private keys remain secure during automation.

  • Performance optimization: Capable of executing multiple on-chain calls in parallel without browser rendering or UI delays.

Case Study: Donut

Donut integrates blockchain data and operations as first-class citizens. Users (or their agents) can hover to view real-time risk indicators of tokens or directly input natural language commands like "/swap 100 USDC to SOL." By bypassing the hostile friction points of Web2, Donut enables agents to run at full speed in DeFi, enhancing liquidity, arbitrage, and market efficiency.
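Donut's internals are not described here, but the core idea generalizes: parse the natural-language command into a structured intent, then authorize it with a cryptographic signature rather than a proof of human-likeness. The sketch below is a generic illustration of that flow, not Donut's implementation; it uses the third-party eth_account library and an EVM-style signature purely for demonstration, and a swap with a SOL leg would of course be signed differently.

```python
# Generic sketch: slash command -> structured intent -> signature.
# On-chain, validity comes from the signature, not from looking "human".
import json, re
from eth_account import Account                 # third-party: pip install eth-account
from eth_account.messages import encode_defunct

def parse_swap(command):
    m = re.match(r"/swap\s+([\d.]+)\s+(\w+)\s+to\s+(\w+)", command)
    if not m:
        raise ValueError("unrecognized command")
    amount, token_in, token_out = m.groups()
    return {"action": "swap", "amount": float(amount),
            "token_in": token_in, "token_out": token_out}

intent = parse_swap("/swap 100 USDC to SOL")

# Illustrative throwaway key; an agent browser would keep the real one in its wallet sandbox.
acct = Account.create()
signed = acct.sign_message(encode_defunct(text=json.dumps(intent, sort_keys=True)))
print(intent)
print(signed.signature.hex())   # what a protocol or relayer would verify
```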

Verifiable and Trusted Agent Execution

Granting agents sensitive permissions carries significant risk. Solutions in this category use Trusted Execution Environments (TEEs) or zero-knowledge proofs (ZKPs) to attest to an agent's expected behavior before execution, allowing users and counterparties to verify agent actions without exposing private keys or credentials.

Case Study: Phala Network

Phala uses TEEs (such as Intel SGX) to isolate and protect the execution environment, preventing Phala operators or attackers from inspecting or tampering with agent logic and data. A TEE acts like a hardware-enforced "secure enclave," guaranteeing confidentiality (nothing outside can read in) and integrity (nothing outside can modify it).

For browser agents, this means they can log in, hold session tokens, or handle payment details while that sensitive data never leaves the enclave: even if the user's machine, operating system, or network is compromised, it cannot leak. This directly addresses one of the biggest obstacles to deploying agent applications: trust around sensitive credentials and operations.

Decentralized structured data networks

Modern anti-bot detection systems not only check whether requests are "too fast" or "automated" but also combine IP reputation, browser fingerprinting, JavaScript challenge responses, and behavioral analysis (such as cursor movement, typing rhythm, and session history). Agents coming from data-center IPs or perfectly repeatable browsing environments are easy to flag.

To address this, these networks either skip scraping human-optimized webpages altogether and instead collect and serve machine-readable data directly, or they route traffic through genuine human browsing environments. This sidesteps the failure points traditional scrapers hit at the parsing and anti-scraping stages, giving agents cleaner, more reliable inputs.

By routing agent traffic through these real-world sessions, distributed networks allow AI agents to access web content like humans without immediately triggering blocks.
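Concretely, the simplest form of this is launching the agent's browser through a residential exit node supplied by such a network. The sketch below shows that with Playwright; the proxy endpoint and credentials are placeholders, since each network exposes its own way to obtain them.

```python
# Hedged sketch: run the agent's browser through a residential exit node.
from playwright.sync_api import sync_playwright

RESIDENTIAL_PROXY = {
    "server": "http://proxy.example-network.io:8000",   # placeholder endpoint
    "username": "agent-123",                            # placeholder credentials
    "password": "token",
}

with sync_playwright() as p:
    browser = p.chromium.launch(proxy=RESIDENTIAL_PROXY)
    page = browser.new_page(locale="en-US", timezone_id="America/New_York")
    page.goto("https://httpbin.org/ip")    # the target now sees the residential IP
    print(page.inner_text("body"))
    browser.close()
```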


Case Studies

  • Grass: A decentralized data/DePIN network where users share idle residential broadband to provide agent-friendly, geographically diverse access channels for public web data collection and model training.

  • WootzApp: An open-source mobile browser that supports cryptocurrency payments, featuring backend agents and zero-knowledge identity; it gamifies AI/data tasks for consumers.

  • Sixpence: A distributed browser network that routes traffic for AI agents through global contributors' browsing.

However, this is not a complete solution. Behavioral detection (mouse/scrolling traces), account-level restrictions (KYC, account age), and fingerprint consistency checks can still trigger blocks. Therefore, distributed networks are best viewed as a foundational concealment layer, which must be combined with human-imitation execution strategies to achieve maximum effect.

Agent-focused web standards (forward-looking)

A growing number of technical communities and organizations are now exploring a forward-looking question: if the web's future users include not just humans but also automated agents, how should websites interact with them safely and accountably?

This has prompted discussion of emerging standards and mechanisms that would let a website explicitly signal "trusted agents are allowed here" and offer a secure channel for completing the interaction, rather than treating every agent as a bot attack by default, as is common today.

  • "Agent Allowed" label: Just like robots.txt that search engines comply with, future webpages may add a label in the code to inform browser agents, "This can be accessed safely." For instance, if you book a flight using an agent, the website won't pop up a bunch of CAPTCHAs but will directly provide an authenticated interface.

  • API gateway for authenticated agents: Websites can open dedicated entrances for verified agents, like a "fast lane." Agents don’t need to simulate human clicks or input but can complete orders, payments, or data queries through a more stable API path.

  • W3C discussions: The World Wide Web Consortium (W3C) is studying how to create standardized channels for "managed automation." This means we may have a set of globally recognized rules in the future that allow trusted agents to be recognized and accepted by websites while maintaining security and accountability.
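None of these mechanisms exists yet, but the direction is recognizable from what already does. The sketch below uses today's robots.txt, the closest existing analogue of an "Agent Allowed" label, to show how a well-behaved agent could check a site's stated policy before acting; the user-agent string and URLs are hypothetical.

```python
# Check a site's published automation policy (robots.txt) before acting.
from urllib.robotparser import RobotFileParser

AGENT_UA = "ExampleBrowserAgent/0.1"        # hypothetical agent user-agent string

rp = RobotFileParser()
rp.set_url("https://example.org/robots.txt")
rp.read()

url = "https://example.org/booking/flights"
if rp.can_fetch(AGENT_UA, url):
    print("policy allows automated access; proceed")
else:
    print("disallowed for agents; fall back to asking the user")
```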

While these explorations are still in their early stages, once implemented, they could greatly improve the relationship between humans, agents, and websites. Imagine: no longer needing agents to desperately imitate human mouse movements to "trick" risk controls, but rather completing tasks through a clearly defined "officially permitted" channel.

On this path, crypto-native infrastructure may take the lead. Because on-chain applications inherently rely on open APIs and smart contracts, they are friendly to automation. In contrast, traditional Web2 platforms may continue to be cautiously defensive, especially companies dependent on advertising or anti-fraud systems. However, as users and enterprises gradually embrace the efficiency gains from automation, these standardized attempts are likely to become key catalysts in pushing the entire internet towards an "agent-first architecture."

Conclusion

Browser agents are evolving from simple conversational tools to autonomous systems capable of completing complex online workflows. This transformation reflects a broader trend: embedding automation directly into the core interface of user-internet interactions. While the potential for productivity gains is enormous, the challenges are equally severe, including how to overcome deeply ingrained anti-bot mechanisms and ensure safety, trust, and responsible usage.

In the short term, improvements in agents' reasoning abilities, faster speeds, closer integration with existing services, and advancements in distributed networks may gradually enhance reliability. In the long term, we may see the gradual establishment of "agent-friendly" standards in scenarios where automation benefits both service providers and users. However, this transition will not be uniform: in automation-friendly environments like DeFi, adoption will be faster; whereas in Web2 platforms that heavily rely on user interaction, acceptance will be slower.

In the future, the competition among tech companies will increasingly focus on several aspects: how their agents navigate under real-world constraints, whether they can be safely integrated into critical workflows, and whether they can reliably deliver results across diverse online environments. Whether all of this ultimately reshapes the "browser war" will depend not merely on technical prowess but on establishing trust, aligning incentives, and demonstrating tangible value in everyday use.