EOY Team and Business Updates
Company
In recent years, Large Language Models (LLMs) have transformed natural language processing and understanding, achieving remarkable milestones. Despite these advancements, LLMs face significant challenges in interactive environments, especially in tasks requiring multi-step reasoning such as web navigation. Current training methodologies, which rely on static language datasets, fall short of equipping these models for dynamic, real-world interactions.
Enter Agent Q, a major milestone for agents: it combines search, self-critique, and reinforcement learning to create state-of-the-art autonomous web agents that can plan and self-heal. Our breakthrough method addresses the limitations of previous LLM training techniques by introducing a novel framework for learning and reasoning in autonomous web navigation.
Current methods, such as supervised fine-tuning on curated expert demonstrations, often fall short on agentic multi-step tasks due to compounding errors and limited exploration data. These approaches yield sub-optimal policies, particularly in dynamic environments demanding complex decision-making and adaptive learning.
Agent Q innovates by combining guided Monte Carlo Tree Search (MCTS) and AI self-critique with iterative fine-tuning, leveraging reinforcement learning from human feedback (RLHF) methods such as the Direct Preference Optimization (DPO) algorithm. This approach enables LLM agents to learn from both successful and unsuccessful trajectories, enhancing their generalization in multi-step reasoning tasks.
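At a high level, the recipe alternates between search-based data collection and off-policy preference fine-tuning. The sketch below illustrates one such iteration under assumed helper callables (collect_tree, build_pairs, and dpo_step are illustrative names, not MultiOn's API); each piece is expanded in the components that follow.

```python
# Minimal sketch of one Agent Q-style training iteration. The three callables
# are hypothetical stand-ins for the stages described in this update.

def agent_q_iteration(policy, tasks, collect_tree, build_pairs, dpo_step):
    """One round of search-based data collection plus preference fine-tuning."""
    nodes = []
    for task in tasks:
        # Guided MCTS rollouts: keep successful AND unsuccessful branches.
        nodes.extend(collect_tree(policy, task))
    # Turn explored branches into (preferred, rejected) action pairs.
    pairs = build_pairs(nodes)
    # Off-policy DPO update on the aggregated preference data.
    return dpo_step(policy, pairs)
```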
Key Components of Agent Q:
Guided Search with MCTS: This technique autonomously generates data by exploring different actions and web pages, balancing exploration and exploitation. MCTS expands the action space using high sampling temperatures and diverse prompting, yielding a broad and high-quality collection of trajectories (a search sketch follows this list).
AI Self-Critique: At each step, AI-based self-critique provides valuable feedback that refines the agent's decision-making process. This step-level feedback is crucial for long-horizon tasks, where sparse reward signals otherwise make learning difficult (its role is illustrated in the search sketch after this list).
Direct Preference Optimization: The DPO algorithm fine-tunes the model by constructing preference pairs from MCTS-generated data. This off-policy training method allows the model to learn effectively from the aggregated dataset, including the sub-optimal branches explored during search, improving success rates in complex environments (a minimal loss sketch also follows this list).
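To make the search and critique components concrete, here is a small, self-contained sketch of guided tree search with step-level AI critique. It is an illustration under assumed interfaces rather than MultiOn's implementation: propose_actions, critic, and rollout are hypothetical callables standing in for high-temperature action sampling from the agent model, an LLM critique score in [0, 1], and a rollout to a terminal outcome.

```python
import math
import random

class Node:
    """A node in the search tree over web actions."""
    def __init__(self, action=None, parent=None):
        self.action = action
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # running average of backed-up rewards

def ucb1(node, c=1.4):
    # Balance exploitation (mean value) against exploration (visit counts).
    if node.visits == 0:
        return float("inf")
    return node.value + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def guided_search(root, propose_actions, critic, rollout, iterations=50, mix=0.5):
    for _ in range(iterations):
        # 1. Selection: walk down the tree by UCB1 until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb1)
        # 2. Expansion: sample diverse candidate actions (high temperature,
        #    varied prompts) to widen the action space at this state.
        for action in propose_actions(node):
            node.children.append(Node(action, parent=node))
        # 3. Evaluation: blend the step-level critique score with a rollout
        #    outcome, so sparse end-of-task rewards are supplemented by
        #    per-step feedback.
        leaf = random.choice(node.children) if node.children else node
        reward = mix * critic(leaf) + (1.0 - mix) * rollout(leaf)
        # 4. Backpropagation: update value estimates along the path to root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += (reward - leaf.value) / leaf.visits
            leaf = leaf.parent
    return root
```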
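The preference-optimization step can be sketched in the same spirit. Below is a generic DPO loss in PyTorch over preference pairs, with toy tensors standing in for trajectory log-probabilities; this is a textbook formulation of DPO, not MultiOn's training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is the summed log-probability of a preferred or rejected
    action sequence under the current policy or the frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to widen the margin between preferred and rejected
    # branches collected during search.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities standing in for model outputs.
torch.manual_seed(0)
loss = dpo_loss(torch.randn(8), torch.randn(8), torch.randn(8), torch.randn(8))
print(loss.item())
```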
In real-world booking experiments on OpenTable, MultiOn's agents drastically improved the zero-shot success rate of the Llama 3 model from 18.6% to 81.7% (roughly a 340% relative improvement) after just one day of autonomous data collection, and further to 95.4% with online search. These results highlight the method's efficiency and its capacity for autonomous improvement of web agents.
MultiOn's Agent Q sets a major new milestone for autonomous web agents, combining advanced search techniques, AI self-critique, and reinforcement learning to overcome current limitations and representing a substantial leap forward in autonomous agent capabilities. As we continue to refine these methods, address the remaining challenges, and move closer to a full release in our products, the future of intelligent autonomous web agents in the real world looks promising.
This research breakthrough will be available to both developers and consumer users of MultiOn later this year.
To be one of the first to gain access, join our waitlist:
https://form.typeform.com/to/WfWuyk34