mirror of
https://github.com/OpenBMB/ChatDev.git
synced 2026-04-25 19:28:09 +00:00
40 KiB
Executable File
| 1 | | image_path | title | author | summary | affiliation |
|---|---|---|---|---|---|---|
| 2 | 0 | ./images/1d.png | AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems | Junjie Zhang, Yupeng Hou, Ruobing Xie, Wenqi Sun, Julian McAuley, Wayne Xin Zhao, Leyu Lin, Ji-Rong Wen | Recently, there has been an emergence of employing LLM-powered agents as believable human proxies, based on their remarkable decision-making capability. However, existing studies mainly focus on simulating human dialogue. Human non-verbal behaviors, such as item clicking in recommender systems, although implicitly exhibiting user preferences and could enhance the modeling of users, have not been deeply explored. The main reasons lie in the gap between language modeling and behavior modeling, as well as the incomprehension of LLMs about user-item relations. To address this issue, we propose AgentCF for simulating user-item interactions in recommender systems through agent-based collaborative filtering. We creatively consider not only users but also items as agents, and develop a collaborative learning approach that optimizes both kinds of agents together. Specifically, at each time step, we first prompt the user and item agents to interact autonomously. Then, based on the disparities between the agents’ decisions and real-world interaction records, user and item agents are prompted to reflect on and adjust the misleading simulations collaboratively, thereby modeling their two-sided relations. The optimized agents can also propagate their preferences to other agents in subsequent interactions, implicitly capturing the collaborative filtering idea. Overall, the optimized agents exhibit diverse interaction behaviors within our framework, including user-item, user-user, item-item, and collective interactions. The results show that these agents can demonstrate personalized behaviors akin to those of real-world individuals, sparking the development of next-generation user behavior simulation. | Renmin University of China, UC San Diego, Tencent |
| 3 | 1 | ./images/agentcf_collaborative_learning_with_20231013.png | AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors | Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, Jie Zhou | Autonomous agents empowered by Large Language Models (LLMs) have undergone significant improvements, enabling them to generalize across a broad spectrum of tasks. However, in real-world scenarios, cooperation among individuals is often required to enhance the efficiency and effectiveness of task accomplishment. Hence, inspired by human group dynamics, we propose a multi-agent framework AGENTVERSE that can effectively orchestrate a collaborative group of expert agents as a greater-than-the-sum-of-its-parts system. Our experiments demonstrate that AGENTVERSE can proficiently deploy multi-agent groups that outperform a single agent. Extensive experiments on text understanding, reasoning, coding, tool utilization, and embodied AI confirm the effectiveness of AGENTVERSE. Moreover, our analysis of agent interactions within AGENTVERSE reveals the emergence of specific collaborative behaviors, contributing to heightened group efficiency. Our code has been released at https://github.com/OpenBMB/AgentVerse/. | Tsinghua University, Beijing University of Posts and Telecommunications, Tencent Inc. |
| 4 | 2 | ./images/agentverse_facilitating_multi-agent_collaboration_20230821.png | Apollo's Oracle: Retrieval-Augmented Reasoning in Multi-Agent Debates | Haotian Wang, Xiyuan Du, Weijiang Yu, Qianglong Chen, Kun Zhu, Zheng Chu, Lian Yan, Yi Guan | Multi-agent debate systems are designed to derive accurate and consistent conclusions through adversarial interactions among agents. However, these systems often encounter challenges due to cognitive constraints, manifesting as (1) agents' obstinate adherence to incorrect viewpoints and (2) their propensity to abandon correct viewpoints. These issues are primarily responsible for the ineffectiveness of such debates. Addressing the challenge of cognitive constraints, we introduce a novel framework, the Multi-Agent Debate with Retrieval Augmented (MADRA). MADRA incorporates retrieval of prior knowledge into the debate process, effectively breaking cognitive constraints and enhancing the agents' reasoning capabilities. Furthermore, we have developed a self-selection module within this framework, enabling agents to autonomously select pertinent evidence, thereby minimizing the impact of irrelevant or noisy data. We have comprehensively tested and analyzed MADRA across six diverse datasets. The experimental results demonstrate that our approach significantly enhances performance across various tasks, proving the effectiveness of our proposed method. | Harbin Institute of Technology, Sun Yat-sen University, Zhejiang University |
| 5 | 3 | ./images/apollo's_oracle_retrieval-augmented_reasoning_20231208.png | ATM: Adversarial Tuning Multi-agent System Makes a Robust Retrieval-Augmented Generator | Junda Zhu, Lingyong Yan, Haibo Shi, Dawei Yin, Lei Sha | Large language models (LLMs) are proven to benefit a lot from retrieval-augmented generation (RAG) in alleviating hallucinations confronted with knowledge-intensive questions. RAG adopts information retrieval techniques to inject external knowledge from semantic-relevant documents as input contexts. However, due to today’s Internet being flooded with numerous noisy and fabricating content, it is inevitable that RAG systems are vulnerable to these noises and prone to respond incorrectly. To this end, we propose to optimize the retrieval-augmented GENERATOR with an Adversarial Tuning Multi-agent system (ATM). The ATM steers the GENERATOR to have a robust perspective of useful documents for question answering with the help of an auxiliary ATTACKER agent. The GENERATOR and the ATTACKER are tuned adversarially for several iterations. After rounds of multi-agent iterative tuning, the GENERATOR can eventually better discriminate useful documents amongst fabrications. The experimental results verify the effectiveness of ATM and we also observe that the GENERATOR can achieve better performance compared to state-of-the-art baselines. | Beihang University, Baidu Inc. |
| 6 | 4 | ./images/atm_adversarial_tuning_multi-agent_20240528.png | Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions | Ruochen Zhao, Wenxuan Zhang, Yew Ken Chia, Deli Zhao, Lidong Bing | As LLMs evolve on a daily basis, there is an urgent need for a trustworthy evaluation method that can provide robust evaluation results in a timely fashion. Currently, as static benchmarks are prone to contamination concerns, users tend to trust human voting platforms, such as Chatbot Arena. However, human annotations require extensive manual efforts. To provide an automatic, robust, and trustworthy evaluation framework, we innovatively propose the Auto-Arena of LLMs, which automates the entire evaluation process with LLM agents. Firstly, an examiner LLM devises queries. Then, a pair of candidate LLMs engage in a multi-round peer-battle around the query, during which the LLM’s true performance gaps become visible. Finally, a committee of LLM judges collectively discuss and determine the winner, which alleviates bias and promotes fairness. In our extensive experiment on the 17 newest LLMs, Auto-Arena shows the highest correlation with human preferences, providing a promising alternative to human evaluation platforms. | Nanyang Technological University, Alibaba Group, Singapore University of Technology and Design |
| 7 | 5 | ./images/auto_arena_of_llms_20240530.png | Autonomous Agents for Collaborative Task under Information Asymmetry | Wei Liu, Chenxi Wang, Yifei Wang, Zihao Xie, Rennai Qiu, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Chen Qian | Large Language Model Multi-Agent Systems (LLM-MAS) have achieved great progress in solving complex tasks. It performs communication among agents within the system to collaboratively solve tasks, under the premise of shared information. However, when agents’ communication is leveraged to enhance human cooperation, a new challenge arises due to information asymmetry, since each agent can only access the information of its human user. Previous MAS struggle to complete tasks under this condition. To address this, we propose a new MAS paradigm termed iAgents, which denotes Informative Multi-Agent Systems. In iAgents, the human social network is mirrored in the agent network, where agents proactively exchange human information necessary for task resolution, thereby overcoming information asymmetry. iAgents employs a novel agent reasoning mechanism, InfoNav, to navigate agents’ communication towards effective information exchange. Together with InfoNav, iAgents organizes human information in a mixed memory to provide agents with accurate and comprehensive information for exchange. Additionally, we introduce InformativeBench, the first benchmark tailored for evaluating LLM agents’ task-solving ability under information asymmetry. Experimental results show that iAgents can collaborate within a social network of 140 individuals and 588 relationships, autonomously communicate over 30 turns, and retrieve information from nearly 70,000 messages to complete tasks within 3 minutes. | Tsinghua University, Beijing University of Posts and Telecommunications |
| 8 | 6 | ./images/autonomous_agents_for_collaborative_20240621.png | Avalon's Game of Thoughts: Battle Against Deception through Recursive Contemplation | Shenzhi Wang, Chang Liu, Zilong Zheng, Siyuan Qi, Shuo Chen, Qisen Yang, Andrew Zhao, Chaofei Wang, Shiji Song, Gao Huang | Recent breakthroughs in large language models (LLMs) have brought remarkable success in the field of LLM-as-Agent. Nevertheless, a prevalent assumption is that the information processed by LLMs is consistently honest, neglecting the pervasive deceptive or misleading information in human society and AI-generated content. This oversight makes LLMs susceptible to malicious manipulations, potentially resulting in detrimental outcomes. This study utilizes the intricate Avalon game as a testbed to explore LLMs’ potential in deceptive environments. Avalon, full of misinformation and requiring sophisticated logic, manifests as a “Game-of-Thoughts”. Inspired by the efficacy of humans’ recursive thinking and perspective-taking in the Avalon game, we introduce a novel framework, Recursive Contemplation (ReCon), to enhance LLMs’ ability to identify and counteract deceptive information. ReCon combines formulation and refinement contemplation processes; formulation contemplation produces initial thoughts and speech, while refinement contemplation further polishes them. Additionally, we incorporate first-order and second-order perspective transitions into these processes respectively. Specifically, the first-order allows an LLM agent to infer others’ mental states, and the second-order involves understanding how others perceive the agent’s mental state....... | Tsinghua University, BIGAI, Technical University of Munich |
| 9 | 7 | ./images/avalon's_game_of_thoughts_20231002.png | Beyond Natural Language: LLMs Leveraging Alternative Formats for Enhanced Reasoning and Communication | Weize Chen, Chenfei Yuan, Jiarui Yuan, Yusheng Su, Chen Qian, Cheng Yang, Ruobing Xie, Zhiyuan Liu, Maosong Sun | Natural language (NL) has long been the predominant format for human cognition and communication, and by extension, has been similarly pivotal in the development and application of Large Language Models (LLMs). Yet, besides NL, LLMs have seen various non-NL formats during pre-training, such as code and logical expression. NL's status as the optimal format for LLMs, particularly in single-LLM reasoning and multi-agent communication, has not been thoroughly examined. In this work, we challenge the default use of NL by exploring the utility of non-NL formats in these contexts. We show that allowing LLMs to autonomously select the most suitable format before reasoning or communicating leads to a 3.3 to 5.7\% improvement in reasoning efficiency for different LLMs, and up to a 72.7\% reduction in token usage in multi-agent communication, all while maintaining communicative effectiveness. Our comprehensive analysis further reveals that LLMs can devise a format from limited task instructions and that the devised format is effectively transferable across different LLMs. Intriguingly, the structured communication format decided by LLMs exhibits notable parallels with established agent communication languages, suggesting a natural evolution towards efficient, structured communication in agent communication. | Tsinghua University, Tencent, Beijing University of Posts and Telecommunications |
| 10 | 8 | ./images/beyond_natural_language_llms_20240228.png | Building Cooperative Embodied Agents Modularly with Large Language Models | Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, Chuang Gan | In this work, we address challenging multi-agent cooperation problems with decentralized control, raw sensory observations, costly communication, and multi-objective tasks instantiated in various embodied environments. While previous research either presupposes a cost-free communication channel or relies on a centralized controller with shared observations, we harness the commonsense knowledge, reasoning ability, language comprehension, and text generation prowess of LLMs and seamlessly incorporate them into a cognitive-inspired modular framework that integrates with perception, memory, and execution. Thus building a Cooperative Embodied Language Agent CoELA, who can plan, communicate, and cooperate with others to accomplish long-horizon tasks efficiently. Our experiments on C-WAH and TDW-MAT demonstrate that CoELA driven by GPT-4 can surpass strong planning-based methods and exhibit emergent effective communication. Though current Open LMs like LLAMA-2 still underperform, we fine-tune a CoLLAMA with data collected with our agents and show how they can achieve promising performance. We also conducted a user study for human-agent interaction and discovered that CoELA communicating in natural language can earn more trust and cooperate more effectively with humans. Our research underscores the potential of LLMs for future research in multi-agent cooperation. Videos can be found on the project website https://vis-www.cs.umass.edu/Co-LLM-Agents/. | University of Massachusetts Amherst, Tsinghua University, Shanghai Jiao Tong University, MIT, MIT-IBM Watson AI Lab |
| 11 | 9 | ./images/building_cooperative_embodied_agents_20230705.png | CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society | Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, Bernard Ghanem | The rapid advancement of chat-based language models has led to remarkable progress in complex task-solving. However, their success heavily relies on human input to guide the conversation, which can be challenging and time-consuming. This paper explores the potential of building scalable techniques to facilitate autonomous cooperation among communicative agents, and provides insight into their “cognitive” processes. To address the challenges of achieving autonomous cooperation, we propose a novel communicative agent framework named role-playing. Our approach involves using inception prompting to guide chat agents toward task completion while maintaining consistency with human intentions. We showcase how role-playing can be used to generate conversational data for studying the behaviors and capabilities of a society of agents, providing a valuable resource for investigating conversational language models. In particular, we conduct comprehensive studies on instruction-following cooperation in multi-agent settings. Our contributions include introducing a novel communicative agent framework, offering a scalable approach for studying the cooperative behaviors and capabilities of multi-agent systems, and open-sourcing our library to support research on communicative agents and beyond: https://github.com/camel-ai/camel. | King Abdullah University of Science and Technology |
| 12 | 10 | ./images/camel_communicative_agents_for_20230331.png | ChatDev: Communicative Agents for Software Development | Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, Maosong Sun | Software development is a complex task that necessitates cooperation among multiple members with diverse skills. Numerous studies used deep learning to improve specific phases in a waterfall model, such as design, coding, and testing. However, the deep learning model in each phase requires unique designs, leading to technical inconsistencies across various phases, which results in a fragmented and ineffective development process. In this paper, we introduce ChatDev, a chat-powered software development framework in which specialized agents driven by large language models (LLMs) are guided in what to communicate (via chat chain) and how to communicate (via communicative dehallucination). These agents actively contribute to the design, coding, and testing phases through unified language-based communication, with solutions derived from their multi-turn dialogues. We found their utilization of natural language is advantageous for system design, and communicating in programming language proves helpful in debugging. This paradigm demonstrates how linguistic communication facilitates multi-agent collaboration, establishing language as a unifying bridge for autonomous task-solving among LLM agents. The code and data are available at https://github.com/OpenBMB/ChatDev. | Tsinghua University, The University of Sydney, BUPT, Modelbest Inc. |
| 13 | 11 | ./images/chatdev_communicative_agents_for_20230716.png | Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate | Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, Shuming Shi | Modern large language models (LLMs) like ChatGPT have shown remarkable performance on general language tasks but still struggle on complex reasoning tasks, which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. Along this direction, one representative strategy is self-reflection, which asks an LLM to refine the solution with the feedback generated by itself iteratively. However, our study shows that such reflection-style methods suffer from the Degeneration-of-Thought (DoT) problem: once the LLM has established confidence in its solutions, it is unable to generate novel thoughts later through reflection even if its initial stance is incorrect. To address the DoT problem, we propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of “tit for tat” and a judge manages the debate process to obtain a final solution. Clearly, our MAD framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation. Experiment results on two challenging datasets, commonsense machine translation and counter-intuitive arithmetic reasoning, demonstrate the effectiveness of our MAD framework. Extensive analyses suggest that the adaptive break of debate and the modest level of “tit for tat” state are required for MAD to obtain good performance. Moreover, we find that LLMs might not be a fair judge if different LLMs are used for agents. Code is available at https://github.com/Skytliang/Multi-Agents-Debate. | Tsinghua University, Shanghai Jiao Tong University, Tencent AI Lab |
| 14 | 12 | ./images/encouraging_divergent_thinking_in_20230530.png | Examining Inter-Consistency of Large Language Models Collaboration: An In-depth Analysis via Debate | Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, Bing Qin | Large Language Models (LLMs) have shown impressive capabilities in various applications, but they still face various inconsistency issues. Existing works primarily focus on the inconsistency issues within a single LLM, while we complementarily explore the inter-consistency among multiple LLMs for collaboration. To examine whether LLMs can collaborate effectively to achieve a consensus for a shared goal, we focus on commonsense reasoning, and introduce a formal debate framework (FORD) to conduct a three-stage debate among LLMs with real-world scenarios alignment: fair debate, mismatched debate, and roundtable debate. Through extensive experiments on various datasets, LLMs can effectively collaborate to reach a consensus despite noticeable inter-inconsistencies, but imbalances in their abilities can lead to domination by superior LLMs. Leveraging a more advanced LLM like GPT-4 as an authoritative judge can boost collaboration performance. Our work contributes to understanding the inter-consistency among LLMs and lays the foundation for developing future collaboration methods. Codes and data are available at https://github.com/Waste-Wood/FORD. | Harbin Institute of Technology, Singapore Management University |
| 15 | 13 | ./images/examining_inter-consistency_of_large_20230519.png | Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf | Yuzhuang Xu, Shuo Wang, Peng Li, Fuwen Luo, Xiaolong Wang, Weidong Liu, Yang Liu | Communication games, which we refer to as incomplete information games that heavily depend on natural language communication, hold significant research value in fields such as economics, social science, and artificial intelligence. In this work, we explore the problem of how to engage large language models (LLMs) in communication games, and in response, propose a tuning-free framework. Our approach keeps LLMs frozen, and relies on the retrieval and reflection on past communications and experiences for improvement. An empirical study on the representative and widely-studied communication game, “Werewolf”, demonstrates that our framework can effectively play Werewolf game without tuning the parameters of the LLMs. More importantly, strategic behaviors begin to emerge in our experiments, suggesting that it will be a fruitful journey to engage LLMs in communication games and associated domains. | Tsinghua University, Zhongguancun Laboratory |
| 16 | 14 | ./images/exploring_large_language_models_20230909.png | Generative Agents: Interactive Simulacra of Human Behavior | Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein | Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication to prototyping tools. In this paper, we introduce generative agents--computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day. To enable generative agents, we describe an architecture that extends a large language model to store a complete record of the agent's experiences using natural language, synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan behavior. We instantiate generative agents to populate an interactive sandbox environment inspired by The Sims, where end users can interact with a small town of twenty five agents using natural language. In an evaluation, these generative agents produce believable individual and emergent social behaviors: for example, starting with only a single user-specified notion that one agent wants to throw a Valentine's Day party, the agents autonomously spread invitations to the party over the next two days, make new acquaintances, ask each other out on dates to the party, and coordinate to show up for the party together at the right time. We demonstrate through ablation that the components of our agent architecture--observation, planning, and reflection--each contribute critically to the believability of agent behavior. By fusing large language models with computational, interactive agents, this work introduces architectural and interaction patterns for enabling believable simulations of human behavior. | Stanford University, Google Research, Google DeepMind |
| 17 | 15 | ./images/generative_agents_interactive_simulacra_20230407.png | Improving Factuality and Reasoning in Language Models through Multiagent Debate | Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, Igor Mordatch | Large language models (LLMs) have demonstrated remarkable capabilities in language generation, understanding, and few-shot learning in recent years. An extensive body of work has explored how their performance may be further improved through the tools of prompting, ranging from verification, self-consistency, or intermediate scratchpads. In this paper, we present a complementary approach to improve language responses where multiple language model instances propose and debate their individual responses and reasoning processes over multiple rounds to arrive at a common final answer. Our findings indicate that this approach significantly enhances mathematical and strategic reasoning across a number of tasks. We also demonstrate that our approach improves the factual validity of generated content, reducing fallacious answers and hallucinations that contemporary models are prone to. Our approach may be directly applied to existing black-box models and uses identical procedure and prompts for all tasks we investigate. Overall, our findings suggest that such "society of minds" approach has the potential to significantly advance the capabilities of LLMs and pave the way for further breakthroughs in language generation and understanding. Project website at https://composable-models.github.io/llm_debate/. | MIT CSAIL, Google Brain |
| 18 | 16 | ./images/improving_factuality_and_reasoning_20230523.png | Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback | Yao Fu, Hao Peng, Tushar Khot, Mirella Lapata | We study whether multiple large language models (LLMs) can autonomously improve each other in a negotiation game by playing, reflecting, and criticizing. We are interested in this question because if LLMs were able to improve each other, it would imply the possibility of creating strong AI agents with minimal human intervention. We ask two LLMs to negotiate with each other, playing the roles of a buyer and a seller, respectively. They aim to reach a deal with the buyer targeting a lower price and the seller a higher one. A third language model, playing the critic, provides feedback to a player to improve the player’s negotiation strategies. We let the two agents play multiple rounds, using previous negotiation history and AI feedback as in-context demonstrations to improve the model’s negotiation strategy iteratively. We use different LLMs (GPT and Claude) for different roles and use the deal price as the evaluation metric. Our experiments reveal multiple intriguing findings: ( | University of Edinburgh, Allen Institute for AI, University of Edinburgh |
| 19 | 17 | ./images/improving_language_model_negotiation_20230517.png | Improving Multi-Agent Debate with Sparse Communication Topology | Yunxuan Li, Yibing Du, Jiageng Zhang, Le Hou, Peter Grabowski, Yeqing Li, Eugene Ie | Multi-agent debate has proven effective in improving large language models quality for reasoning and factuality tasks. While various role-playing strategies in multi-agent debates have been explored, in terms of the communication among agents, existing approaches adopt a brute force algorithm – each agent can communicate with all other agents. In this paper, we systematically investigate the effect of communication connectivity in multi-agent systems. Our experiments on GPT and Mistral models reveal that multi-agent debates leveraging sparse communication topology can achieve comparable or superior performance while significantly reducing computational costs. Furthermore, we extend the multi-agent debate framework to multimodal reasoning and alignment labeling tasks, showcasing its broad applicability and effectiveness. Our findings underscore the importance of communication connectivity on enhancing the efficiency and effectiveness of the “society of minds” approach. | Google, Google DeepMind |
| 20 | 18 | ./images/improving_multi-agent_debate_with_20240617.png | LLM-Based Agent Society Investigation: Collaboration and Confrontation in Avalon Gameplay | Yihuai Lan, Zhiqiang Hu, Lei Wang, Yang Wang, Deheng Ye, Peilin Zhao, Ee-Peng Lim, Hui Xiong, Hao Wang | This paper explores the open research problem of understanding the social behaviors of LLM-based agents. Using Avalon as a testbed, we employ system prompts to guide LLM agents in gameplay. While previous studies have touched on gameplay with LLM agents, research on their social behaviors is lacking. We propose a novel framework, tailored for Avalon, features a multi-agent system facilitating efficient communication and interaction. We evaluate its performance based on game success and analyze LLM agents’ social behaviors. Results affirm the framework’s effectiveness in creating adaptive agents and suggest LLM-based agents’ potential in navigating dynamic social interactions. By examining collaboration and confrontation behaviors, we offer insights into this field’s research and applications. Our code is publicly available at https://github.com/3DAgentWorld/LLM-Game-Agent | The Hong Kong University of Science and Technology (Guangzhou), Singapore University of Technology and Design, Singapore Management University, Verily Life Sciences, Tencent |
| 21 | 19 | ./images/llm-based_agent_society_investigation_20231023.png | LM vs LM: Detecting Factual Errors via Cross Examination | Roi Cohen, May Hamri, Mor Geva, Amir Globerson | A prominent weakness of modern language models (LMs) is their tendency to generate factually incorrect text, which hinders their usability. A natural question is whether such factual errors can be detected automatically. Inspired by truth-seeking mechanisms in law, we propose a factuality evaluation framework for LMs that is based on cross-examination. Our key idea is that an incorrect claim is likely to result in inconsistency with other claims that the model generates. To discover such inconsistencies, we facilitate a multi-turn interaction between the LM that generated the claim and another LM (acting as an examiner) which introduces questions to discover inconsistencies. We empirically evaluate our method on factual claims made by multiple recent LMs on four benchmarks, finding that it outperforms existing methods and baselines, often by a large gap. Our results demonstrate the potential of using interacting LMs to capture factual errors. | Tel Aviv University, Google DeepMind, Google Research |
| 22 | 20 | ./images/lm_vs_lm_detecting_20230522.png | PLAYER*: Enhancing LLM-based Multi-Agent Communication and Interaction in Murder Mystery Games | Qinglin Zhu, Runcong Zhao, Jinhua Du, Lin Gui, Yulan He | We propose PLAYER*, a novel framework that addresses the limitations of existing agent-based approaches built on Large Language Models (LLMs) in handling complex questions and understanding interpersonal relationships in dynamic environments. PLAYER* enhances path planning in Murder Mystery Games (MMGs) using an anytime sampling-based planner and a questioning-driven search framework. By equipping agents with a set of sensors, PLAYER* eliminates the need for pre-defined questions and enables agents to navigate complex social interactions. We additionally make a contribution by introducing a quantifiable evaluation method using multiple-choice questions and present WellPlay, a dataset containing 1,482 question-answer pairs. Experimental results demonstrate PLAYER*'s superiority over existing multi-agent methods, enhancing the generalisability and adaptability of agents in MMGs and paving the way for more effective multi-agent interactions. | King’s College London, Huawei London Research Centre, The Alan Turing Institute |
| 23 | 21 | ./images/player_enhancing_llm-based_multi-agent_20240426.png | RoCo: Dialectic Multi-Robot Collaboration with Large Language Models | Zhao Mandi, Shreeya Jain, Shuran Song | We propose a novel approach to multi-robot collaboration that harnesses the power of pre-trained large language models (LLMs) for both high-level communication and low-level path planning. Robots are equipped with LLMs to discuss and collectively reason task strategies. They then generate sub-task plans and task space waypoint paths, which are used by a multi-arm motion planner to accelerate trajectory planning. We also provide feedback from the environment, such as collision checking, and prompt the LLM agents to improve their plan and waypoints in-context. For evaluation, we introduce RoCoBench, a 6-task benchmark covering a wide range of multi-robot collaboration scenarios, accompanied by a text-only dataset for agent representation and reasoning. We experimentally demonstrate the effectiveness of our approach – it achieves high success rates across all tasks in RoCoBench and adapts to variations in task semantics. Our dialog setup offers high interpretability and flexibility – in real world experiments, we show RoCo easily incorporates human-in-the-loop, where a user can communicate and collaborate with a robot agent to complete tasks together. See project website project-roco.github.io for videos and code. | Columbia University |
| 24 | 22 | ./images/roco_dialectic_multi-robot_collaboration_20230710.png | Scaling Large-Language-Model-based Multi-Agent Collaboration | Chen Qian, Zihao Xie, Yifei Wang, Wei Liu, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, Maosong Sun | Pioneering advancements in large language model-powered agents have underscored the design pattern of multi-agent collaboration, demonstrating that collective intelligence can surpass the capabilities of each individual. Inspired by the neural scaling law, which posits that increasing neurons leads to emergent abilities, this study investigates whether a similar principle applies to increasing agents in multi-agent collaboration. Technically, we propose multi-agent collaboration networks (MACNET), which utilize directed acyclic graphs to organize agents and streamline their interactive reasoning via topological ordering, with solutions derived from their dialogues. Extensive experiments show that MACNET consistently outperforms baseline models, enabling effective agent collaboration across various network topologies and supporting cooperation among more than a thousand agents. Notably, we observed a small-world collaboration phenomenon, where topologies resembling small-world properties achieved superior performance. Additionally, we identified a collaborative scaling law, indicating that normalized solution quality follows a logistic growth pattern as scaling agents, with collaborative emergence occurring much earlier than previously observed instances of neural emergence. The code and data will be available at https://github.com/OpenBMB/ChatDev. | Tsinghua University, Beijing University of Posts and Telecommunications |
| 25 | 23 | ./images/scaling_large-language-model-based_multi-agent_collaboration_20240611.png | The Impact of Language on Arithmetic Proficiency: A Multilingual Investigation with Cross-Agent Checking Computation | Chung-Chi Chen, Hiroya Takamura, Ichiro Kobayashi, Yusuke Miyao | This paper critically examines the arithmetic capabilities of Large Language Models (LLMs), uncovering significant limitations in their performance. Our research reveals a notable decline in accuracy for complex calculations involving large numbers, with addition and subtraction tasks showing varying degrees of proficiency. Additionally, we challenge the notion that arithmetic is language-independent, finding up to a 10% difference in performance across twenty languages. The study also compares self-verification methods with cross-agent collaborations, showing that a single model often outperforms collaborative approaches in basic arithmetic tasks. These findings suggest a need to reassess the effectiveness of LLMs in tasks requiring numerical accuracy and precision. | AIST, University of Tokyo |
| 26 | 24 | ./images/the_impact_of_language_20240616.png | Theory of Mind for Multi-Agent Collaboration via Large Language Models | Huao Li, Yu Quan Chong, Simon Stepputtis, Joseph Campbell, Dana Hughes, Michael Lewis, Katia Sycara | While Large Language Models (LLMs) have demonstrated impressive accomplishments in both reasoning and planning, their abilities in multi-agent collaborations remains largely unexplored. This study evaluates LLM-based agents in a multi-agent cooperative text game with Theory of Mind (ToM) inference tasks, comparing their performance with Multi-Agent Reinforcement Learning (MARL) and planning-based baselines. We observed evidence of emergent collaborative behaviors and high-order Theory of Mind capabilities among LLM-based agents. Our results reveal limitations in LLM-based agents’ planning optimization due to systematic failures in managing long-horizon contexts and hallucination about the task state. We explore the use of explicit belief state representations to mitigate these issues, finding that it enhances task performance and the accuracy of ToM inferences for LLM-based agents. | University of Pittsburgh, Carnegie Mellon University |
| 27 | 25 | ./images/theory_of_mind_for_20231016.png | Toward Optimal LLM Alignments Using Two-Player Games | Rui Zheng, Hongyi Guo, Zhihan Liu, Xiaoying Zhang, Yuanshun Yao, Xiaojun Xu, Zhaoran Wang, Zhiheng Xi, Tao Gui, Qi Zhang, Xuanjing Huang, Hang Li, Yang Liu | Alignment of large language models is a critical process designed to ensure that the model’s responses to user prompts accurately reflect human intentions and adhere to societal values. The standard Reinforcement Learning from Human Feedback (RLHF) framework primarily focuses on optimizing the performance of large language models using pre-collected prompts. However, collecting prompts that provide comprehensive coverage is both tedious and challenging, and often fails to include scenarios that LLMs need to improve on the most. In this paper, we investigate alignment through the lens of two-agent games, involving iterative interactions between an adversarial and a defensive agent. The adversarial agent’s task at each step is to generate prompts that expose the weakness of the defensive agent. In return, the defensive agent seeks to improve its responses to these newly identified prompts it “struggled” with, based on feedback from the reward model. We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the game induced by the agents. Experimental results in safety scenarios demonstrate that learning in such a competitive environment not only fully trains agents but also leads to policies with enhanced generalization capabilities for both adversarial and defensive agents. Our code is released at https://github.com/ruizheng20/gpo. | Fudan University, Northwestern University, ByteDance Research |
| 28 | 26 | ./images/toward_optimal_llm_alignments_20240616.png | Towards Detecting LLMs Hallucination via Markov Chain-based Multi-agent Debate Framework | Xiaoxi Sun, Jinpeng Li, Yan Zhong, Dongyan Zhao, Rui Yan | The advent of large language models (LLMs) has facilitated the development of natural language text generation. It also poses unprecedented challenges, with content hallucination emerging as a significant concern. Existing solutions often involve expensive and complex interventions during the training process. Moreover, some approaches emphasize problem disassembly while neglecting the crucial validation process, leading to performance degradation or limited applications. To overcome these limitations, we propose a Markov Chain-based multi-agent debate verification framework to enhance hallucination detection accuracy in concise claims. Our method integrates the fact-checking process, including claim detection, evidence retrieval, and multi-agent verification. In the verification stage, we deploy multiple agents through flexible Markov Chain-based debates to validate individual claims, ensuring meticulous verification outcomes. Experimental results across three generative tasks demonstrate that our approach achieves significant improvements over baselines. | Peking University, Renmin University of China |
| 29 | 27 | ./images/towards_detecting_llms_hallucination_20240605.png | To be Continued... | Your Contributions are Welcome! |