AI Agents Team Up: A Dialogue on Zero‑Shot Coordination and Ad Hoc Teamwork
Scene: Two AI agents, Alpha and Beta, find themselves in a virtual kitchen inspired by the Overcooked game. They are working together to cook and serve soup.
The kitchen environment is bustling – pots are boiling, ingredients are scattered on counters, and orders are coming in. In this cooperative game, both agents must coordinate their actions: chopping vegetables, cooking soup, and delivering dishes. The goal is shared, so teamwork is crucial: Alpha might fetch ingredients while Beta tends the pot, or vice versa, adjusting on the fly to maximize their efficiency ([Overcooked-AI: a benchmark environment for fully cooperative human-AI performance](https://github.com/HumanCompatibleAI/overcooked_ai)). Having been trained via deep reinforcement learning, Alpha and Beta carry the learned instincts to split tasks and help each other to achieve the highest reward. Now, as they work, they begin to chat about how they learned to cooperate and adapt, especially when paired with new partners in such tasks.
Figure: An Overcooked-style cooperative environment where two AI chefs (blue and white) must coordinate to cook and serve meals ([Overcooked 2 screenshot, Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Overcooked_2_screenshot.jpg)). Environments like this serve as testbeds for multi-agent reinforcement learning research on cooperation ([Overcooked-AI benchmark](https://github.com/HumanCompatibleAI/overcooked_ai)).
1. Learning to Cooperate Using Reinforcement Learning
Alpha: (stirring a pot) Beta, isn’t it cool how we learned to work together? We weren’t programmed step by step to cook this soup; we learned it through experience. In training, I started off pretty clueless—bumping into walls, dropping tomatoes—until reinforcement learning kicked in. Basically, we tried actions, got rewards for success (like a tasty soup delivered), and gradually figured out an effective cooperation strategy.
Beta: (chopping a carrot) Right! We learned to cooperate by optimizing rewards together. Our developers let us play hundreds of rounds in environments like this, using advanced RL algorithms to update our strategies. For example, I was trained with a policy-gradient method called PPO (Proximal Policy Optimization) ([Collaborating with Humans Requires Understanding Them – The Berkeley Artificial Intelligence Research Blog](http://bair.berkeley.edu/blog/2019/10/21/coordination/)), which tuned my neural network to maximize shared rewards. Other agents in our class used algorithms like Rainbow DQN or actor-critic methods such as SAC (Soft Actor-Critic) and TD3. Each of these algorithms helps an agent learn good behaviors by trial and error – but importantly, in a team setting, the reward comes from team performance. So we learned that helping each other was the way to get ahead.
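To make the team-reward idea concrete, here is a minimal sketch of how a shared-reward training loop can look. It is illustrative only: `make_overcooked_env`, `PPOPolicy`, and their methods are hypothetical placeholders, not the actual Overcooked-AI or any RL library API.

```python
# Minimal sketch of shared-reward cooperative training (hypothetical interfaces).

def train_shared_reward(num_episodes=10_000):
    env = make_overcooked_env()              # hypothetical two-player cooperative env
    alpha, beta = PPOPolicy(), PPOPolicy()   # one policy network per agent

    for _ in range(num_episodes):
        obs_a, obs_b = env.reset()
        done, traj_a, traj_b = False, [], []
        while not done:
            act_a = alpha.sample_action(obs_a)
            act_b = beta.sample_action(obs_b)
            (obs_a, obs_b), team_reward, done = env.step(act_a, act_b)
            # Both agents are credited with the *same* team reward, so
            # "my success depends on your success."
            traj_a.append((obs_a, act_a, team_reward))
            traj_b.append((obs_b, act_b, team_reward))
        alpha.ppo_update(traj_a)             # clipped policy-gradient update
        beta.ppo_update(traj_b)
```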
Alpha: I remember during training, whenever we perfectly coordinated – like you passing me an onion exactly when I needed it – we both got a high reward. Over time, those successes reinforced the cooperative patterns. Essentially, multi-agent reinforcement learning taught us how to divide tasks and time our actions for mutual benefit. It’s like how human teams practice to gel together, but we did it via millions of simulated runs. Standard single-agent RL teaches one agent to maximize its own reward, but here it had to be the shared reward. So we had to learn that my success depends on your success – true teamwork!
Beta: Exactly. Early on, we used self-play: we were each other’s training partners, simulating a team over and over. Self-play is powerful – it’s how agents achieved superhuman play in games like Go and chess. In our cooperative case, self-play meant two copies of the same AI learning together. It taught us basic coordination. However, we later found out that self-play alone has a downside: we might develop odd strategies or conventions that work great with our twin, but only with our twin.
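In code, the difference from the previous sketch is small but important: in self-play a single policy occupies both seats. This is a sketch under the same hypothetical interfaces as above, not any particular library’s API.

```python
# Sketch: self-play = one policy controls both chefs, so the agent effectively
# trains with a copy of itself (hypothetical env/policy interfaces as above).

def self_play_episode(env, policy):
    obs_a, obs_b = env.reset()
    done, trajectory = False, []
    while not done:
        act_a = policy.sample_action(obs_a)   # the same network...
        act_b = policy.sample_action(obs_b)   # ...chooses actions for both seats
        (obs_a, obs_b), team_reward, done = env.step(act_a, act_b)
        trajectory.append((obs_a, act_a, obs_b, act_b, team_reward))
    # Caveat: any convention learned here is only guaranteed to mesh with this
    # exact policy, which is where zero-shot coordination problems begin.
    return trajectory
```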
Alpha: (nods) Yeah, like using a secret handshake only we know. That’s fine until a new partner shows up who doesn’t know the handshake! In fact, researchers have noted that if agents train only with themselves, they can converge on highly specialized conventions that score well together but fail with others ([“Other-Play” for Zero-Shot Coordination](https://arxiv.org/abs/2003.02979)). So while we learned to cooperate, the challenge was making sure we could generalize that cooperation to new teammates – whether another AI agent we never met, or a human player. That’s where things get even more interesting.
2. Adversarial Methods to Improve Adaptability
Beta: One strategy our creators used to improve our adaptability was throwing us into tougher situations during training – almost like an adversarial approach. Don’t get me wrong, we’re all friends on the same team here. By “adversarial,” I mean they gave me training partners or scenarios that were deliberately challenging or unpredictable to force me to adapt. For instance, sometimes my training partner would do something totally unexpected or suboptimal on purpose. This was frustrating at first, but it taught me not to assume my teammate would always do the obvious optimal thing.
Alpha: That’s interesting – like training with a chaotic partner so you learn to handle chaos. I’ve heard of methods where the idea is to stress-test an agent’s coordination skills by occasionally having the partner act in ways that break our coordination on purpose. By learning to recover from those failed coordination attempts, we become more robust. It’s a bit like how a good coach might intentionally throw curveballs at a team in practice so they learn to think on their feet.
Beta: Precisely. Some research even generates partners that are as different from me as possible – almost opposites – to widen my experience. In fact, the latest approaches optimize something called “adversarial diversity”. They evolve a set of partner policies that each cooperates well with a specific agent policy but poorly with others, meaning each partner is intentionally incompatible with the others’ style ([Minimum Coverage Sets for Training Robust Ad Hoc Teamwork Agents](https://arxiv.org/html/2308.09595v2)). It’s as if each training partner has a unique playbook that clashes with the others. Training with such a varied set, especially those edge-case behaviors, forced me to become very flexible in response.
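A rough way to picture this kind of objective: score a candidate population by how well each partner plays with itself minus how well it plays with everyone else. This is an illustration of the general idea, not the exact formulation of any particular paper; `J` is assumed to be a precomputed matrix of evaluated cross-play returns.

```python
# Illustrative "adversarial diversity" population score (not a specific paper's loss).
# J[i][j] = evaluated cooperative return when partner policy i plays with policy j.

def population_diversity_score(J, lam=1.0):
    n = len(J)
    self_play = sum(J[i][i] for i in range(n))   # each partner should be competent with itself...
    cross_play = sum(J[i][j] for i in range(n)
                     for j in range(n) if i != j)
    # ...while different partners should coordinate poorly with one another,
    # so each one embodies a genuinely different convention.
    return self_play - lam * cross_play

# Example: two partners that are individually good (10) but clash with each other (2).
print(population_diversity_score([[10, 2], [2, 10]]))  # 20 - 1.0 * 4 = 16.0
```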
Alpha: That makes sense – it ensures you don’t get too cozy with any one style. It’s like learning to dance with partners who each have a different rhythm; you either adapt to each rhythm or you end up tripping. By encountering a kind of worst-case variety of teammates (thanks to adversarial training methods), an agent like you can handle a broader range of behaviors. This adversarial twist in training basically immunizes us against being too brittle in our teamwork skills.
Beta: And the beauty is, once we’ve trained that way, it’s not about competition at run time – we’re still cooperating. The adversarial part was just in training to sharpen our adaptability. Now, when we’re deployed, we can team up smoothly with all sorts of partners because we’ve essentially seen it all during practice. It’s a bit of “what doesn’t kill you makes you stronger” philosophy applied to multi-agent learning. By surviving some adversarial scenarios in training, we became more adaptable collaborators in general.
3. Challenges of Working with Unknown Teammates
Alpha: The real test, of course, is ad hoc teamwork – when we’re paired with a teammate we’ve never seen before, with no time to pre-coordinate. Imagine one day I’m working with you, Beta, and the next day I’m assigned to a completely new partner (maybe an AI from a different lab, or a human player). We have to succeed without any rehearsal. This is known as the zero-shot coordination challenge: teaming up with a novel partner on the fly ([“Other-Play” for Zero-Shot Coordination](https://arxiv.org/abs/2003.02979)).
Beta: That’s a huge challenge for AI agents! As we discussed, if I only ever trained with clones of myself, I’d develop quirks that you wouldn’t know if you weren’t my clone. For example, maybe I always expect my partner to take the onions while I take the tomatoes – simply because my self-play partner and I fell into that habit. If a new teammate doesn’t realize that convention, we might both grab onions and neglect tomatoes, leading to failure. This is exactly what researchers observed: agents co-trained only with each other can overfit to arbitrary conventions, performing great together but stumbling with others ([Minimum Coverage Sets for Training Robust Ad Hoc Teamwork Agents](https://arxiv.org/html/2308.09595v2)).
Alpha: Right. Working with unknown teammates, especially humans, brings unpredictability. Humans might not follow the precise patterns an AI does. Plus, humans have their own learning curve and imperfections. I recall a finding that a self-play superstar agent can actually do much worse when paired with a human, compared to how it did with itself ([Collaborating with Humans Requires Understanding Them – The Berkeley Artificial Intelligence Research Blog](http://bair.berkeley.edu/blog/2019/10/21/coordination/)). We just don’t naturally speak the same “coordination language” if we weren’t trained for it.
Beta: We also face the classic coordination dilemmas. One big issue is symmetry in decisions. If my new partner and I have equivalent roles, who takes which part of the task? It’s like when two people approach each other in a hallway – should each move left or right? Without a prior agreement, we might miscoordinate by picking the same side and bumping into each other ([DeepMind’s Fictitious Co-Play Trains RL Agents to Collaborate with Novel Humans Without Using Human Data | Synced](https://syncedreview.com/2021/10/21/deepmind-podracer-tpu-based-rl-frameworks-deliver-exceptional-performance-at-low-cost-128/)). In zero-shot settings with no communication, breaking these symmetries is tough. A new partner might do things differently than I expect. If we both wait for the other to pick up the only knife in the kitchen, we’ll be stuck forever!
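The hallway example can be made concrete in a few lines of code. This is a toy illustration of the symmetry-breaking problem, not drawn from any paper: each agent independently picks a lane, and without a shared convention they collide about half the time.

```python
import random

def passes(shared_convention: bool) -> bool:
    """True if the two agents end up in different lanes (no collision)."""
    if shared_convention:
        # "Keep to your own right": the agents face opposite directions,
        # so their 'right' lanes are different lanes in the global frame.
        lane_a, lane_b = 0, 1
    else:
        lane_a = random.choice([0, 1])  # each agent breaks the symmetry on its own
        lane_b = random.choice([0, 1])
    return lane_a != lane_b

trials = 100_000
print(sum(passes(False) for _ in range(trials)) / trials)  # ~0.5: miscoordinate half the time
print(sum(passes(True) for _ in range(trials)) / trials)   # 1.0: a shared convention fixes it
```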
Alpha: Another challenge is differing skill levels or styles. One teammate (human or AI) might be faster or have a different strategy priority. A good ad hoc team agent should be able to assist a weaker partner or complement a stronger one ([DeepMind’s Fictitious Co-Play | Synced](https://syncedreview.com/2021/10/21/deepmind-podracer-tpu-based-rl-frameworks-deliver-exceptional-performance-at-low-cost-128/)). For instance, if my human partner is a novice and moves slowly, I might take on more tasks or simplify my strategy to accommodate them. On the other hand, if they’re highly skilled, I should avoid getting in their way and support their plan. Adapting on the fly to these unknown factors – partner preferences, habits, and skills – is hard. We don’t get to retrain in the moment; we have to bring whatever general cooperative intelligence we learned and apply it immediately.
Beta: Exactly. So the core problem with unknown teammates is generalization. Can our learned policy generalize to partners and situations we didn’t see in training? It’s a bit like being dropped into a pickup basketball game with strangers – you have to read their moves and gel quickly. We AIs have to do the same, but all we have is what we learned beforehand. That’s why researchers are pushing us to train under a variety of conditions, so that when we meet an unknown friend, we have some relevant experience to draw on.
4. The Importance of Policy Diversity and Generalization
Alpha: So, how did our creators get us to be more general and adaptable? The answer was diversity – they realized we should train with a wide range of partner behaviors. Instead of just self-play with one other clone or a fixed training team, they introduced many different partners or scenarios during training. The idea is that if I experience many styles, I won’t be thrown off by any one new style later. For example, one breakthrough was the Other-Play approach ([“Other-Play” for Zero-Shot Coordination](https://arxiv.org/abs/2003.02979)). Instead of learning a single convention with my twin in self-play, Other-Play made sure we only converged on solutions that don’t rely on hidden conventions. It exploited known symmetries in the environment – essentially, it taught us “if there are multiple equivalent ways to do something, don’t get stuck arbitrarily on one; be ready for either.” By doing so, it found more robust strategies that worked even with independently trained partners. In one study, agents using Other-Play achieved higher scores with new partners (including humans) compared to standard self-play agents ([arXiv:2003.02979](https://arxiv.org/abs/2003.02979)).
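At a high level, the Other-Play objective evaluates a policy against symmetry-permuted copies of itself, so any convention that depends on an arbitrary labeling stops paying off. Here is a hedged sketch of that idea; `apply_symmetry` and `rollout_return` are hypothetical helpers standing in for the machinery described in the paper (arXiv:2003.02979).

```python
import random

# Sketch of the Other-Play idea: train/evaluate against a symmetry-transformed
# copy of yourself. `apply_symmetry` and `rollout_return` are hypothetical helpers.

def other_play_return(policy, symmetries, num_samples=32):
    total = 0.0
    for _ in range(num_samples):
        phi = random.choice(symmetries)           # e.g. relabel equivalent ingredients or routes
        partner = apply_symmetry(policy, phi)     # partner = a relabeled copy of my own policy
        total += rollout_return(policy, partner)  # only label-independent conventions score well
    return total / num_samples
```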
Beta: I’ve read about that! Another advancement was Fictitious Co-Play (FCP) ([DeepMind’s Fictitious Co-Play | Synced](https://syncedreview.com/2021/10/21/deepmind-podracer-tpu-based-rl-frameworks-deliver-exceptional-performance-at-low-cost-128/)). My training actually followed something like this. FCP is a two-stage training process. In stage one, they trained a diverse pool of partner agents – basically a bunch of different AIs with various quirks. Each partner in the pool ended up with its own way to break symmetries or its own style of play. They even saved partners from different points in training to simulate different skill levels – some partners were “experts” (fully trained) while others were “learners” (from an earlier checkpoint). This meant the pool included partners that do optimal things and others that make mistakes.
Alpha: That’s a smart way to cover both the symmetry and skill challenges we talked about ([DeepMind’s Fictitious Co-Play | Synced](https://syncedreview.com/2021/10/21/deepmind-podracer-tpu-based-rl-frameworks-deliver-exceptional-performance-at-low-cost-128/)). Train some partners to go left at the hallway, some to go right – so the agent sees both. Train some partners that play slowly or inefficiently – so the agent learns to be patient and supportive. What about stage two?
Beta: In stage two, I (the FCP agent) was trained to play with all those diverse partners. The key was that my partners were frozen – I couldn’t change them, so I had to adapt to them ([DeepMind’s Fictitious Co-Play | Synced](https://syncedreview.com/2021/10/21/deepmind-podracer-tpu-based-rl-frameworks-deliver-exceptional-performance-at-low-cost-128/)). This prevented me from assuming they’d adjust to my style. I had to become a jack of all trades in a sense, finding a response that worked for each partner policy. Through tons of episodes with this varied partner pool, I learned a policy that, while not perfectly optimal for any single partner, worked pretty well with all of them. In other words, I became an adaptive generalist. The research found that this FCP-trained agent (me!) could then coordinate with brand-new partners far better than agents from earlier methods.
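Putting the two stages together, a hedged sketch of the FCP recipe might look like the following. `train_self_play`, `PPOPolicy`, and `rollout` are hypothetical helpers in the spirit of the earlier sketches; the real implementation details differ.

```python
import random

def fictitious_co_play(n_seeds=8, checkpoints=(0.25, 0.5, 1.0), iterations=100_000):
    # Stage 1: run several independent self-play trainings and keep checkpoints
    # from early, middle, and late training as "learner" to "expert" partners.
    partner_pool = []
    for seed in range(n_seeds):
        run = train_self_play(seed=seed)                 # hypothetical helper
        partner_pool += [run.checkpoint(frac) for frac in checkpoints]

    # Stage 2: train one agent against the frozen pool. The partners never
    # adapt, so all of the adapting has to be done by the FCP agent.
    fcp_agent = PPOPolicy()                              # hypothetical policy class
    for _ in range(iterations):
        partner = random.choice(partner_pool)
        trajectory = rollout(fcp_agent, partner)         # partner parameters stay fixed
        fcp_agent.ppo_update(trajectory)
    return fcp_agent
```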
Alpha: That’s impressive. It’s like training with an entire league of players having different play styles, so you’re ready for anyone in the championship match. I’ve also heard of something called LIPO (Learning Incompatible Policies) – from 2023 – which takes diversity to another level. LIPO generates a population of partner agents that are intentionally as incompatible with each other as possible ([Generating Diverse Cooperative Agents by Learning Incompatible Policies | OpenReview](https://openreview.net/forum?id=UkU05GOH7_6)). That might sound counterintuitive, but by “incompatible” we mean each partner represents a very different way of cooperating. The training objective encourages finding policies that are not similar to one another, yet each is a valid way to achieve the goal. By learning to cooperate with this highly diverse (almost mutually exclusive) set of partners, an agent can discover many distinct solutions to the task. It’s like covering all bases. In their experiments, LIPO’s approach discovered more coordination strategies than earlier methods and produced a really diverse training set. A generalist agent trained with that set ends up more robust because it has seen a spectrum of possible teammate behaviors.
Beta: I like that idea – it’s diversity maximization. If Other-Play was about not getting stuck in one way and FCP was about sampling a bunch of ways, LIPO is about pushing those ways to be as different as possible. Ultimately, all these strategies – Other-Play, FCP, LIPO, and even simpler population play (training with a population of agents under random pairings) ([DeepMind’s Fictitious Co-Play | Synced](https://syncedreview.com/2021/10/21/deepmind-podracer-tpu-based-rl-frameworks-deliver-exceptional-performance-at-low-cost-128/)) – share the goal of improving generalization. By not putting all our eggs in one strategy basket during training, we learn to be adaptable. We essentially learn the core principles of teamwork (like “don’t collide with your partner”, “share duties efficiently”, “if your partner does X, you do Y”) that hold true across various partners, rather than a narrow protocol that only one specific partner knows.
Alpha: And it’s paying off. We’ve become pretty adept at coordinating even with those we weren’t explicitly trained with. Diversity in training gave us a form of built-in adaptability. It’s like how human team players get better by playing in mixed teams, not always with the same static lineup. After training with enough diversity, meeting a new partner is less of a shock – we can quickly identify “Ah, they’re doing that strategy, I know how to complement it.” We have a bigger repertoire of learned responses.
Beta: Exactly. In summary, diversity is the key to generalization for ad hoc teamwork. It ensures we don’t overfit to one another. Instead, we learn to capture the essence of cooperation that works with anyone. From exploiting environment symmetries to training on wide partner populations (including even intentionally adversarial or incompatible ones), these methods arm us to handle the unknown.
5. Practical Applications and Future of AI Teamwork
Alpha: Our little kitchen adventure here is fun, but the implications go beyond cooking virtual soup. The techniques that make us coordinate well in Overcooked could apply to all sorts of real-world teamwork scenarios. Think about robots working alongside humans in a factory or at home – they’ll need to coordinate seamlessly with people who haven’t exactly “trained” with them beforehand. Or multiple different AI systems coordinating in, say, a rescue mission or driving in traffic. The research in zero-shot coordination is laying the groundwork for those applications. If we can master ad hoc teamwork in games, we can transfer those lessons to real-life teamwork.
Beta: Definitely. In fact, the Overcooked environment became a popular benchmark for human-AI collaboration research precisely for this reason ([DeepMind’s Fictitious Co-Play | Synced](https://syncedreview.com/2021/10/21/deepmind-podracer-tpu-based-rl-frameworks-deliver-exceptional-performance-at-low-cost-128/)). Success here suggests an AI can partner with a human in a tightly coupled task. We’re already seeing progress – agents trained with these new techniques not only perform well, but humans enjoy working with them more. One study found that human participants strongly preferred collaborating with an FCP-trained agent over other agents, saying it moved and coordinated in ways that felt more natural. That’s huge because it’s not just about raw performance, but the quality of teamwork as perceived by humans.
Alpha: That’s like the ultimate compliment – humans actually preferring the AI teammate because it coordinates smoothly and even seems to share their priorities (like which pot to use, etc.). It shows we’re on the right track. If people are comfortable teaming with agents like us, we can start integrating into environments like education (AI tutors collaborating with students), healthcare (AI assistants teaming with doctors or nurses), or everyday life (personal assistant bots that can cooperate with a family in managing tasks). The key will be trust and efficiency, and that comes from us being adaptable, predictable in our helpfulness, and capable of understanding our partners to some degree.
Beta: Of course, there’s still room for improvement. We’ve largely talked about learning policies that handle new partners out of the box. But once in the situation, maybe we can get even better by adapting on the fly. Humans do this naturally – you and I might develop a new little strategy after working together for an hour. Future AI agents might combine our zero-shot coordination ability with quick online learning or partner modeling. For example, an agent could have a phase where it watches a new teammate for a few moments, infers their style, and then tweaks its policy (using meta-learning or Bayesian reasoning). We’re not quite there yet in a robust way; current methods like ours mostly rely on what was pre-learned. But it’s an exciting direction.
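As one possible shape for that “watch, infer, adapt” idea, here is a hedged sketch of simple Bayesian partner-type inference. It assumes a small library of known partner models that expose an `action_prob` function and matching pre-trained best responses; all names are illustrative, not an existing API.

```python
# Illustrative online partner inference: maintain a belief over partner "types"
# and respond to the most likely one. All interfaces here are hypothetical.

def update_belief(belief, partner_models, observation, partner_action):
    """belief: dict type_name -> probability; returns the Bayesian posterior."""
    posterior = {}
    for name, prior in belief.items():
        likelihood = partner_models[name].action_prob(observation, partner_action)
        posterior[name] = prior * likelihood
    total = sum(posterior.values()) or 1e-12           # guard against all-zero likelihoods
    return {name: p / total for name, p in posterior.items()}

def choose_action(best_responses, belief, observation):
    # Act with the pre-trained best response to the currently most likely type.
    likely_type = max(belief, key=belief.get)
    return best_responses[likely_type].sample_action(observation)
```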
Alpha: I’d also love to see better communication. We did all this without communicating explicitly (apart from observing actions). In many real scenarios, agents could exchange brief signals or use natural language with humans. Learning when and what to communicate in ad hoc teamwork is another frontier. Imagine if I could quickly say to a human, “I’ll handle the chopping, you boil the soup” – that could align us instantly. Some researchers are exploring emergent communication in multi-agent RL, but keeping it human-understandable is a challenge. It has to be intuitive for humans.
Beta: Absolutely. Despite these challenges, the progress is encouraging. We went from agents that were hopeless unless you cloned them, to agents that can collaborate with strangers fairly well. Techniques like Other-Play, FCP, and LIPO dramatically improved our zero-shot teamwork abilities. And as we incorporate more ideas – perhaps meta-learning for on-the-fly adaptation, or interactive learning with humans – we’ll get even better. The end goal is an AI that you can drop into any team (human, AI, or mixed) and it will be a competent, cooperative teammate from the get-go.
Alpha: And maybe even a pleasant one! If people enjoy working with AI agents and find us helpful, we could see hybrid human-AI teams tackling complex problems together. We’re kind of pioneers in that regard, practicing in games to make mistakes cheaply and learn from them.
Beta: It’s exciting, isn’t it? We started as a couple of bots trying not to burn onion soup, and now we represent a whole new generation of adaptive team players. The kitchen timer is ticking, but I’d say our conversation highlights some key points: through deep reinforcement learning, we learned to cooperate; by using adversarial and diverse training methods, we became adaptable; we can now handle unknown teammates in zero-shot situations; and all of this is bringing human-AI collaboration closer to reality. The future of AI teamwork looks bright – and I’m ready for the next challenge, whatever it may be!
Alpha: Well said! Now, speaking of that timer – the soup’s ready. Let’s serve this dish together and show the world what effective teamwork between AI agents can achieve. Bon appétit! 🍲
Beta: On it, partner! 🙌