# Multi-Agent Collaboration via Evolving Orchestration

Puppeteer introduces a new way for large language models (LLMs) to collaborate efficiently on complex tasks.
Instead of static structures, our framework uses a centralized orchestrator (“puppeteer”) that dynamically directs multiple agents (“puppets”) based on evolving task states. The orchestrator is trained with reinforcement learning to sequence and prioritize agents, enabling flexible and adaptive collective reasoning.
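At a high level, the loop is: the puppeteer inspects the evolving task state, picks the next puppet, and repeats until a termination signal. A minimal sketch of that loop, where `orchestrate`, the toy agents, and the `[DONE]` marker are illustrative stand-ins for the framework's actual classes:

```python
# Hypothetical sketch of the puppeteer loop: a central policy picks the next
# agent ("puppet") from the current task state until termination is signalled.
def orchestrate(task, agents, policy, max_steps=6):
    state = task  # evolving task state, represented here as a plain string
    for _ in range(max_steps):
        agent = policy(state, agents)   # puppeteer selects the next puppet
        state = agent(state)            # puppet acts, updating the state
        if state.endswith("[DONE]"):    # a termination agent marks completion
            break
    return state

# Toy agents and a deterministic policy to make the sketch runnable.
reason = lambda s: s + " ->reasoned"
conclude = lambda s: s + " [DONE]"
policy = lambda s, ag: ag[1] if "reasoned" in s else ag[0]

result = orchestrate("2+2?", [reason, conclude], policy)
# result == "2+2? ->reasoned [DONE]"
```

In the real system the policy is the RL-trained orchestrator and the agents are LLM-backed; this sketch only shows the control flow.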
## Quick Start

### Prerequisites

- Python 3.11 or higher
- CUDA-compatible GPU (optional, for policy training)
- API keys for desired LLM providers
### Installation

1. Clone the repository

   ```bash
   git clone -b puppeteer https://github.com/OpenBMB/ChatDev
   cd ChatDev/puppeteer
   ```
2. Set up environment and install dependencies

   ```bash
   # Create conda environment
   conda create -n puppeteer_env python=3.11
   conda activate puppeteer_env

   # Install dependencies
   pip install -r requirements.txt
   ```
3. Download the pre-trained puppeteer model base

   We use a 70B reward model as the untrained Puppeteer base, so this model must be downloaded first. The Hugging Face repository is `nvidia/Llama-3.1-Nemotron-70B-Reward`.
4. Configure the system

   ```bash
   # Edit configurations with your settings
   vim config/global.yaml  # Add your API keys
   ```

   Global Configuration (`config/global.yaml`) configures API access, file paths, and system behavior:

   ```yaml
   # API Configuration
   logging:
     level: INFO      # Logging level, options: DEBUG, INFO, WARNING, ERROR
     logpath: ./logs  # Folder path to store log files

   # Path to the folder containing model weights of the Puppeteer base model
   # (downloaded in step 3, or loaded directly)
   model_weight_path: nvidia/Llama-3.1-Nemotron-70B-Reward

   api_keys:
     openai_api_key: ""  # Your OpenAI API key
     openai_base_url: "https://api.openai.com/v1/"  # OpenAI base URL
     bing_api_key: ""    # Bing API key for web search (optional)

   # System retry settings
   max_retry_times: 10          # Maximum number of times to retry API calls
   max_json_reformat_turns: 10  # Maximum retries for JSON parsing/reformatting

   # Enable external tools (like web search, file read, etc.)
   external_tools_enabled: True

   # File paths that agents may need
   file_path:
     root_file_path: ./data  # Root folder containing all necessary files for agents

   # Graph exploration parameters for multi-agent reasoning
   graph:
     max_parallel_paths: 4  # Maximum number of parallel paths to explore (recommended 2-6)
     max_step_num: 5        # Maximum number of steps (nodes) in each path (recommended 4-6)
   ```

   ⚠️ Note: Replace every placeholder with your actual API keys and base URL; all of them are required.
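It can save a failed run later to sanity-check the configuration up front. A small, hypothetical validator for a config dict (as loaded, e.g., via `yaml.safe_load`); the key names mirror the sample above, but the checks themselves are illustrative, not part of the framework:

```python
# Hypothetical sanity checks on the settings loaded from config/global.yaml;
# key names follow the sample configuration shown above.
def validate(cfg):
    errors = []
    if not cfg.get("api_keys", {}).get("openai_api_key"):
        errors.append("openai_api_key must be set")
    paths = cfg.get("graph", {}).get("max_parallel_paths", 0)
    if not 1 <= paths <= 8:  # README recommends 2-6
        errors.append("max_parallel_paths out of range")
    return errors

cfg = {"api_keys": {"openai_api_key": "sk-..."},
       "graph": {"max_parallel_paths": 4, "max_step_num": 5}}
assert validate(cfg) == []  # a well-formed config produces no errors
```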
5. Quick start with the predefined settings

   The agents are initialized from `puppeteer/personas/personas.jsonl`, which includes all currently supported reasoning patterns and tool modes. The default model backbone is GPT-4o.

   ```bash
   cd puppeteer
   python main.py <task> <mode> [--level LEVEL] [--index INDEX] [--data_limit LIMIT] [--personas PATH]
   ```

   Example:

   ```bash
   # Run the MMLU-Pro validation set with a data limit of 10
   python main.py MMLU-Pro validation --data_limit 10
   ```
If the run is successful, you will see output similar to EXAMPLE.
## Customization

Puppeteer provides multiple ways to tailor the system to your needs.
### Agents

#### 🔎 Agent Categories

In this framework, agents are divided into two main categories based on whether they have access to external tools:
1. Agents with Tools

   - Description: These agents can interact with external systems to gather data, execute code, or access files.
   - Supported Actions (`TOOL_ACTION_LIST`):
     - search_arxiv – Search for academic papers on arXiv
     - search_bing – Query the Bing search engine
     - access_website – Access websites and extract information
     - run_python – Execute Python code
     - read_file – Read and extract content from files
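One common way to wire such actions up is a dispatch table from action names to callables. A minimal sketch, assuming a dict-based registry; the real framework binds these names to arXiv, Bing, a Python interpreter, and the file system, and the stub implementations here are placeholders:

```python
# Hypothetical dispatch table mapping tool-action names to callables.
# The stub lambdas stand in for real arXiv/Bing/filesystem integrations.
TOOL_ACTIONS = {
    "search_arxiv": lambda q: f"[arxiv results for {q!r}]",
    "search_bing": lambda q: f"[bing results for {q!r}]",
    "access_website": lambda url: f"[content of {url}]",
    "run_python": lambda code: str(eval(code)),  # sandbox this in practice!
    "read_file": lambda path: f"[contents of {path}]",
}

def run_tool(action, argument):
    """Look up an action by name and invoke it on the given argument."""
    if action not in TOOL_ACTIONS:
        raise ValueError(f"unknown tool action: {action}")
    return TOOL_ACTIONS[action](argument)

print(run_tool("run_python", "2 + 3"))  # -> 5
```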
2. Agents without Tools

   - Description: These agents focus on internal reasoning, critique, reflection, and summarization. They do not interact with external systems.
   - Supported Actions (`REASONING_ACTION_LIST`):
     - reasoning – Logical reasoning
     - critique – Evaluate and critique reasoning
     - question – Generate clarifying sub-questions
     - reflect – Provide reflective analysis
     - conclude – Generate final conclusions
     - summarize – Summarize information concisely
     - planning – Create structured plans
     - modify – Correct errors and refine results
3. Termination Agent

   - Description: A special agent responsible for determining when the reasoning process should stop.
   - Supported Actions (`TERMINATION_ACTION_LIST`):
     - terminate – End the reasoning process and deliver the final output
#### ⚙️ Customize

You can extend this framework by creating new agents, adding actions, or integrating new base models.
1. Multiple Actions per Agent

   - Currently, each agent is designed to perform a single action (see `reasoning_agent.py`).
   - To create an agent that supports multiple actions, implement a custom agent by inheriting from the base class in `agent.py`.
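The shape of such a subclass can be sketched as follows. Note the `Agent` interface, method names, and handler actions here are assumptions for illustration; the actual base class and its signature live in `agent.py`:

```python
# Hypothetical base-class interface standing in for the one in agent.py.
class Agent:
    def act(self, action, state):
        raise NotImplementedError

# A custom agent that routes several reasoning actions through one class,
# instead of the one-action-per-agent design in reasoning_agent.py.
class MultiActionAgent(Agent):
    def __init__(self):
        self.handlers = {
            "reasoning": lambda s: s + " | reasoned",
            "critique": lambda s: s + " | critiqued",
            "summarize": lambda s: s + " | summarized",
        }

    def act(self, action, state):
        if action not in self.handlers:
            raise ValueError(f"unsupported action: {action}")
        return self.handlers[action](state)

agent = MultiActionAgent()
out = agent.act("critique", "draft answer")
# out == "draft answer | critiqued"
```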
2. Adding New Actions

   - To introduce a new action, you need to:
     - Define the corresponding prompt or tool.
     - Modify `reasoning_agent.py` to integrate the new action into the reasoning workflow.
3. Supporting New Base Models

   - If you want to use a new base model for agents:
     - Extend the configuration in `model_config.py`.
     - Ensure that the new model is properly registered and compatible with the agent framework.
### 🎭 Puppeteer Training

The training parameters are defined in `policy.json`. Key parameters include:
#### 🔹 Optimization

- `learning_rate: 0.0001` – Controls the learning speed of the policy network.
- `sample_size: 1` – Number of samples used per training step.
#### 🔹 Agent Scale Control

- `max_num_agents: 3` – Maximum number of agents allowed in the system.
- `next_num_agents: 3` – Number of agents spawned in the next step.
- `max_path: 6` – Maximum trajectory length for agent exploration.
#### 🔹 Reward Configuration

- `gamma: 0.99` – Discount factor for future rewards.
- `reward_factors` – Shaping factors for different actions:
  - `default: -1.0` → Penalty for invalid/neutral actions.
  - `terminator: 0.5` → Reward for correct termination.
  - `web_search: -1.5` → Penalty for costly web-search actions.
#### 🔹 Cost Control

- `scale: 0.1` – Base cost scaling factor.
- `growth_rate: 1.0` – Linear growth rate of cost per step.
- `inverse: false` – If set to `true`, applies inverse cost scaling.
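One plausible reading of these three knobs is a per-step cost that starts at `scale` and grows linearly with the step index, optionally inverted. The exact formula lives in the training code; this sketch is only meant to show how the parameters could combine:

```python
# Hypothetical per-step cost: base scale with linear growth by step index,
# optionally inverted. The real formula is defined by the training code.
def step_cost(step, scale=0.1, growth_rate=1.0, inverse=False):
    cost = scale * (1.0 + growth_rate * step)
    return 1.0 / cost if inverse else cost

# step 0 costs 0.1; step 3 costs 0.1 * (1 + 3) = 0.4 under the defaults.
costs = [step_cost(s) for s in range(4)]
```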
#### 🔹 Training Paradigm

The current training paradigm uses the hidden state of the last token from the reward model. This hidden state is passed through an MLP-based policy network to generate action probabilities.
You can switch the Reward Model or design a new training paradigm by modifying the policy network input/output structure.
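The data flow described above (last-token hidden state → MLP → action distribution) can be sketched in NumPy. The layer sizes and random weights are illustrative; the real policy network's dimensions are set by the 70B reward model and `policy.json`:

```python
import numpy as np

# Sketch of the training-paradigm forward pass: the reward model's last-token
# hidden state feeds an MLP head that outputs a distribution over actions.
rng = np.random.default_rng(0)
HIDDEN, N_ACTIONS = 16, 5  # illustrative sizes, not the real dimensions

W1, b1 = rng.normal(size=(HIDDEN, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, N_ACTIONS)), np.zeros(N_ACTIONS)

def policy_head(hidden_state):
    h = np.tanh(hidden_state @ W1 + b1)   # MLP hidden layer
    logits = h @ W2 + b2
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()                # probabilities over actions

probs = policy_head(rng.normal(size=HIDDEN))
# probs sums to 1 and has one entry per action
```

Swapping the reward model only changes where `hidden_state` comes from; redesigning the paradigm means changing this head's input/output structure.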
## Citation

If you use Puppeteer in your work, please cite our NeurIPS 2025 paper:
```bibtex
@inproceedings{dang2025multiagentcollaboration,
  title={Multi-Agent Collaboration via Evolving Orchestration},
  author={Yufan Dang and Chen Qian and Xueheng Luo and Jingru Fan and Zihao Xie and Ruijie Shi and Weize Chen and Cheng Yang and Xiaoyin Che and Ye Tian and Xuantang Xiong and Lei Han and Zhiyuan Liu and Maosong Sun},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS)},
  year={2025},
  url={https://arxiv.org/abs/2505.19591}
}
```
