# Multi-Agent Collaboration via Evolving Orchestration

Puppeteer introduces a new way for large language models (LLMs) to collaborate efficiently on complex tasks.
Instead of static structures, our framework uses a centralized orchestrator (“puppeteer”) that dynamically directs multiple agents (“puppets”) based on evolving task states. The orchestrator is trained with reinforcement learning to sequence and prioritize agents, enabling flexible and adaptive collective reasoning.
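At a high level, the loop is: the puppeteer inspects the evolving task state, picks the next puppet, and repeats until a termination signal. A minimal sketch of that loop, where `orchestrate`, the toy agents, and the `[DONE]` marker are illustrative stand-ins for the framework's actual classes:

```python
# Hypothetical sketch of the puppeteer loop: a central policy picks the next
# agent ("puppet") from the current task state until termination is signalled.
def orchestrate(task, agents, policy, max_steps=6):
    state = task  # evolving task state, represented here as a plain string
    for _ in range(max_steps):
        agent = policy(state, agents)   # puppeteer selects the next puppet
        state = agent(state)            # puppet acts, updating the state
        if state.endswith("[DONE]"):    # a termination agent marks completion
            break
    return state

# Toy agents and a deterministic policy to make the sketch runnable.
reason = lambda s: s + " ->reasoned"
conclude = lambda s: s + " [DONE]"
policy = lambda s, ag: ag[1] if "reasoned" in s else ag[0]

result = orchestrate("2+2?", [reason, conclude], policy)
# result == "2+2? ->reasoned [DONE]"
```

In the real system the policy is the RL-trained orchestrator and the agents are LLM-backed; this sketch only shows the control flow.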
## Quick Start

### Prerequisites

- Python 3.11 or higher
- CUDA-compatible GPU (optional, for policy training)
- API keys for desired LLM providers
### Installation

1. Clone the repository

   ```bash
   git clone -b puppeteer https://github.com/OpenBMB/ChatDev
   cd ChatDev/puppeteer
   ```
2. Set up environment and install dependencies

   ```bash
   # Create conda environment
   conda create -n puppeteer_env python=3.11
   conda activate puppeteer_env

   # Install dependencies
   pip install -r requirements.txt
   ```
3. Download the pre-trained puppeteer model base

   We use a 70B reward model as the untrained Puppeteer base, so this model must be downloaded first. The Hugging Face repository is `nvidia/Llama-3.1-Nemotron-70B-Reward`.
4. Configure the system

   ```bash
   # Edit configurations with your settings
   vim config/global.yaml  # Add your API keys
   ```

   Global Configuration (`config/global.yaml`) configures API access, file paths, and system behavior:

   ```yaml
   # API Configuration
   logging:
     level: INFO      # Logging level, options: DEBUG, INFO, WARNING, ERROR
     logpath: ./logs  # Folder path to store log files

   # Path to the folder containing model weights of the Puppeteer base model
   # (downloaded in step 3, or loaded directly)
   model_weight_path: nvidia/Llama-3.1-Nemotron-70B-Reward

   api_keys:
     openai_api_key: ""  # Your OpenAI API key
     openai_base_url: "https://api.openai.com/v1/"  # OpenAI base URL
     bing_api_key: ""    # Bing API key for web search (optional)

   # System retry settings
   max_retry_times: 10          # Maximum number of times to retry API calls
   max_json_reformat_turns: 10  # Maximum retries for JSON parsing/reformatting

   # Enable external tools (like web search, file read, etc.)
   external_tools_enabled: True

   # File paths that agents may need
   file_path:
     root_file_path: ./data  # Root folder containing all necessary files for agents

   # Graph exploration parameters for multi-agent reasoning
   graph:
     max_parallel_paths: 4  # Maximum number of parallel paths to explore (recommended 2-6)
     max_step_num: 5        # Maximum number of steps (nodes) in each path (recommended 4-6)
   ```

   ⚠️ Note: Replace every placeholder with your actual API keys and base URL; all of them are required.
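It can save a failed run later to sanity-check the configuration up front. A small, hypothetical validator for a config dict (as loaded, e.g., via `yaml.safe_load`); the key names mirror the sample above, but the checks themselves are illustrative, not part of the framework:

```python
# Hypothetical sanity checks on the settings loaded from config/global.yaml;
# key names follow the sample configuration shown above.
def validate(cfg):
    errors = []
    if not cfg.get("api_keys", {}).get("openai_api_key"):
        errors.append("openai_api_key must be set")
    paths = cfg.get("graph", {}).get("max_parallel_paths", 0)
    if not 1 <= paths <= 8:  # README recommends 2-6
        errors.append("max_parallel_paths out of range")
    return errors

cfg = {"api_keys": {"openai_api_key": "sk-..."},
       "graph": {"max_parallel_paths": 4, "max_step_num": 5}}
assert validate(cfg) == []  # a well-formed config produces no errors
```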
5. Quick start with the predefined settings

   The agents are initialized from `puppeteer/personas/personas.jsonl`, which includes all currently supported reasoning patterns and tool modes. The default model backbone is GPT-4o.

   ```bash
   cd puppeteer
   python main.py <task> <mode> [--level LEVEL] [--index INDEX] [--data_limit LIMIT] [--personas PATH]
   ```

   Example:

   ```bash
   # Run the MMLU-Pro validation set with a data limit of 10
   python main.py MMLU-Pro validation --data_limit 10
   ```
If the run is successful, you will see output similar to EXAMPLE.
## Customization

Puppeteer provides multiple ways to tailor the system to your needs.
### Agents

#### 🔎 Agent Categories

In this framework, agents are divided into two main categories based on whether they have access to external tools:
1. Agents with Tools

   - Description: These agents can interact with external systems to gather data, execute code, or access files.
   - Supported Actions (`TOOL_ACTION_LIST`):
     - search_arxiv – Search for academic papers on arXiv
     - search_bing – Query the Bing search engine
     - access_website – Access websites and extract information
     - run_python – Execute Python code
     - read_file – Read and extract content from files
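One common way to wire such actions up is a dispatch table from action names to callables. A minimal sketch, assuming a dict-based registry; the real framework binds these names to arXiv, Bing, a Python interpreter, and the file system, and the stub implementations here are placeholders:

```python
# Hypothetical dispatch table mapping tool-action names to callables.
# The stub lambdas stand in for real arXiv/Bing/filesystem integrations.
TOOL_ACTIONS = {
    "search_arxiv": lambda q: f"[arxiv results for {q!r}]",
    "search_bing": lambda q: f"[bing results for {q!r}]",
    "access_website": lambda url: f"[content of {url}]",
    "run_python": lambda code: str(eval(code)),  # sandbox this in practice!
    "read_file": lambda path: f"[contents of {path}]",
}

def run_tool(action, argument):
    """Look up an action by name and invoke it on the given argument."""
    if action not in TOOL_ACTIONS:
        raise ValueError(f"unknown tool action: {action}")
    return TOOL_ACTIONS[action](argument)

print(run_tool("run_python", "2 + 3"))  # -> 5
```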
2. Agents without Tools

   - Description: These agents focus on internal reasoning, critique, reflection, and summarization. They do not interact with external systems.
   - Supported Actions (`REASONING_ACTION_LIST`):
     - reasoning – Logical reasoning
     - critique – Evaluate and critique reasoning
     - question – Generate clarifying sub-questions
     - reflect – Provide reflective analysis
     - conclude – Generate final conclusions
     - summarize – Summarize information concisely
     - planning – Create structured plans
     - modify – Correct errors and refine results
3. Termination Agent

   - Description: A special agent responsible for determining when the reasoning process should stop.
   - Supported Actions (`TERMINATION_ACTION_LIST`):
     - terminate – End the reasoning process and deliver the final output
#### ⚙️ Customize

You can extend this framework by creating new agents, adding actions, or integrating new base models.
1. Multiple Actions per Agent

   - Currently, each agent is designed to perform a single action (see `reasoning_agent.py`).
   - To create an agent that supports multiple actions, implement a custom agent by inheriting from the base class in `agent.py`.
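The shape of such a subclass can be sketched as follows. Note the `Agent` interface, method names, and handler actions here are assumptions for illustration; the actual base class and its signature live in `agent.py`:

```python
# Hypothetical base-class interface standing in for the one in agent.py.
class Agent:
    def act(self, action, state):
        raise NotImplementedError

# A custom agent that routes several reasoning actions through one class,
# instead of the one-action-per-agent design in reasoning_agent.py.
class MultiActionAgent(Agent):
    def __init__(self):
        self.handlers = {
            "reasoning": lambda s: s + " | reasoned",
            "critique": lambda s: s + " | critiqued",
            "summarize": lambda s: s + " | summarized",
        }

    def act(self, action, state):
        if action not in self.handlers:
            raise ValueError(f"unsupported action: {action}")
        return self.handlers[action](state)

agent = MultiActionAgent()
out = agent.act("critique", "draft answer")
# out == "draft answer | critiqued"
```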
2. Adding New Actions

   - To introduce a new action, you need to:
     - Define the corresponding prompt or tool.
     - Modify `reasoning_agent.py` to integrate the new action into the reasoning workflow.
3. Supporting New Base Models

   - If you want to use a new base model for agents:
     - Extend the configuration in `model_config.py`.
     - Ensure that the new model is properly registered and compatible with the agent framework.
### 🎭 Puppeteer Training

The training parameters are defined in `policy.json`. Key parameters include:
#### 🔹 Optimization

- `learning_rate: 0.0001` – Controls the learning speed of the policy network.
- `sample_size: 1` – Number of samples used per training step.
#### 🔹 Agent Scale Control

- `max_num_agents: 3` – Maximum number of agents allowed in the system.
- `next_num_agents: 3` – Number of agents spawned in the next step.
- `max_path: 6` – Maximum trajectory length for agent exploration.
#### 🔹 Reward Configuration

- `gamma: 0.99` – Discount factor for future rewards.
- `reward_factors` – Shaping factors for different actions:
  - `default: -1.0` → Penalty for invalid/neutral actions.
  - `terminator: 0.5` → Reward for correct termination.
  - `web_search: -1.5` → Penalty for costly web-search actions.
#### 🔹 Cost Control

- `scale: 0.1` – Base cost scaling factor.
- `growth_rate: 1.0` – Linear growth rate of cost per step.
- `inverse: false` – If set to `true`, applies inverse cost scaling.
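One plausible reading of these three knobs is a per-step cost that starts at `scale` and grows linearly with the step index, optionally inverted. The exact formula lives in the training code; this sketch is only meant to show how the parameters could combine:

```python
# Hypothetical per-step cost: base scale with linear growth by step index,
# optionally inverted. The real formula is defined by the training code.
def step_cost(step, scale=0.1, growth_rate=1.0, inverse=False):
    cost = scale * (1.0 + growth_rate * step)
    return 1.0 / cost if inverse else cost

# step 0 costs 0.1; step 3 costs 0.1 * (1 + 3) = 0.4 under the defaults.
costs = [step_cost(s) for s in range(4)]
```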
#### 🔹 Training Paradigm

The current training paradigm uses the hidden state of the last token from the reward model. This hidden state is passed through an MLP-based policy network to generate action probabilities.
You can switch the Reward Model or design a new training paradigm by modifying the policy network input/output structure.
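The data flow described above (last-token hidden state → MLP → action distribution) can be sketched in NumPy. The layer sizes and random weights are illustrative; the real policy network's dimensions are set by the 70B reward model and `policy.json`:

```python
import numpy as np

# Sketch of the training-paradigm forward pass: the reward model's last-token
# hidden state feeds an MLP head that outputs a distribution over actions.
rng = np.random.default_rng(0)
HIDDEN, N_ACTIONS = 16, 5  # illustrative sizes, not the real dimensions

W1, b1 = rng.normal(size=(HIDDEN, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, N_ACTIONS)), np.zeros(N_ACTIONS)

def policy_head(hidden_state):
    h = np.tanh(hidden_state @ W1 + b1)   # MLP hidden layer
    logits = h @ W2 + b2
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()                # probabilities over actions

probs = policy_head(rng.normal(size=HIDDEN))
# probs sums to 1 and has one entry per action
```

Swapping the reward model only changes where `hidden_state` comes from; redesigning the paradigm means changing this head's input/output structure.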
## Citation

If you use Puppeteer in your work, please cite our NeurIPS 2025 paper:
```bibtex
@inproceedings{dang2025multiagentcollaboration,
  title={Multi-Agent Collaboration via Evolving Orchestration},
  author={Yufan Dang and Chen Qian and Xueheng Luo and Jingru Fan and Zihao Xie and Ruijie Shi and Weize Chen and Cheng Yang and Xiaoyin Che and Ye Tian and Xuantang Xiong and Lei Han and Zhiyuan Liu and Maosong Sun},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS)},
  year={2025},
  url={https://arxiv.org/abs/2505.19591}
}
```
