
AI Agents Team Up: A Dialogue on Zero‑Shot Coordination and Ad Hoc Teamwork

*Scene: Two AI agents, Alpha and Beta, find themselves in a virtual kitchen inspired by the Overcooked game. They are working together to cook and serve soup.*

The kitchen environment is bustling – pots are boiling, ingredients are scattered on counters, and orders are coming in. In this cooperative game, both agents must *coordinate* their actions: chopping vegetables, cooking soup, and delivering dishes. The goal is shared, so teamwork is crucial: **Alpha** might fetch ingredients while **Beta** tends the pot, or vice versa, adjusting on the fly to maximize their efficiency ([GitHub – HumanCompatibleAI/overcooked_ai: A benchmark environment for fully cooperative human-AI performance](https://github.com/HumanCompatibleAI/overcooked_ai)). Having been trained via deep reinforcement learning, **Alpha** and **Beta** carry the learned instincts to split tasks and help each other to achieve the highest reward. Now, as they work, they begin to chat about *how* they learned to cooperate and adapt, especially when paired with new partners in such tasks.

*Figure: An Overcooked-style cooperative environment where two AI chefs (blue and white) must coordinate to cook and serve meals. Environments like this serve as testbeds for multi-agent reinforcement learning research on cooperation ([File:Overcooked 2 screenshot.jpg – Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Overcooked_2_screenshot.jpg); [GitHub – HumanCompatibleAI/overcooked_ai](https://github.com/HumanCompatibleAI/overcooked_ai)).*

**1. Learning to Cooperate Using Reinforcement Learning**

**Alpha:** *(stirring a pot)* Beta, isn't it cool how we learned to work together? We weren't *programmed* step by step to cook this soup; we *learned* it through experience. In training, I started off pretty clueless – bumping into walls, dropping tomatoes – until reinforcement learning kicked in. Basically, we tried actions, got rewards for success (like a tasty soup delivered), and gradually figured out an effective cooperation strategy.

**Beta:** *(chopping a carrot)* Right! We learned to cooperate by optimizing rewards together. Our developers let us play hundreds of rounds in environments like this, using advanced RL algorithms to update our strategies. For example, I was trained with a policy-gradient method called **PPO** (Proximal Policy Optimization) ([Collaborating with Humans Requires Understanding Them – The Berkeley Artificial Intelligence Research Blog](http://bair.berkeley.edu/blog/2019/10/21/coordination/)), which tuned my neural network to maximize our shared reward. Others in our class of agents used algorithms like **Rainbow DQN** or actor-critic methods such as **SAC** (Soft Actor-Critic) and **TD3**. Each of these algorithms helps an agent learn good behaviors by trial and error – but importantly, in a *team* setting, the reward comes from *team performance*. So we learned that *helping each other* was the way to get ahead.
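To make the shared-reward idea concrete, here is a minimal, self-contained sketch – not the agents' actual PPO training code, just a toy REINFORCE update on a hypothetical two-action "kitchen" matrix game – showing how a single team reward pushes both policies toward complementary roles:

```python
import numpy as np

rng = np.random.default_rng(0)

# Team payoff: rows = Alpha's action, cols = Beta's action.
# Complementary roles (one fetches, one tends the pot) score highest.
TEAM_REWARD = np.array([[0.2, 1.0],   # Alpha fetches
                        [1.0, 0.2]])  # Alpha tends the pot

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta_alpha = np.zeros(2)   # Alpha's policy logits
theta_beta = np.zeros(2)    # Beta's policy logits
lr = 0.1

for step in range(5000):
    pa, pb = softmax(theta_alpha), softmax(theta_beta)
    a = rng.choice(2, p=pa)            # Alpha acts
    b = rng.choice(2, p=pb)            # Beta acts
    r = TEAM_REWARD[a, b]              # ONE shared reward for the whole team

    # REINFORCE: each agent ascends the gradient of the SAME team reward.
    grad_a = -pa.copy(); grad_a[a] += 1.0
    grad_b = -pb.copy(); grad_b[b] += 1.0
    theta_alpha += lr * r * grad_a
    theta_beta += lr * r * grad_b

# Typically each policy settles into one of the two complementary roles
# (which agent takes which role depends on the random seed).
print("Alpha's policy:", np.round(softmax(theta_alpha), 2))
print("Beta's policy: ", np.round(softmax(theta_beta), 2))
```

The point of the sketch is only the reward plumbing: both updates use the same `r`, so each agent's improvement is tied to the team's outcome rather than to an individual score.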

**Alpha:** I remember during training, whenever we coordinated perfectly – like you passing me an onion exactly when I needed it – we both got a high reward. Over time, those successes reinforced the cooperative patterns. Essentially, multi-agent reinforcement learning taught us how to *divide tasks and time our actions* for mutual benefit. It's like how human teams practice to gel together, except we did it via millions of simulated runs. Standard single-agent RL teaches one agent to maximize its own reward, but here it had to be the *shared* reward. So we had to learn that *my success depends on your success* – true teamwork!

**Beta:** Exactly. Early on, we used *self-play*: we were each other's training partners, simulating a team over and over. Self-play is powerful – it's how agents achieved superhuman play in games like Go and chess. In our cooperative case, self-play meant two copies of the same AI learning together, and it taught us basic coordination. However, we later found out that *just* self-play has a downside: we might develop *odd strategies* or conventions that work great with our twin, but *only* with our twin.
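Here is a minimal self-play sketch in the same toy spirit (illustrative only, not code from the referenced papers): one policy controls both seats of a symmetric coordination game and is updated from both seats' experience, so it literally becomes its own training partner.

```python
import numpy as np

rng = np.random.default_rng(1)

# Symmetric payoff with two equally good conventions: both "left" or both "right".
REWARD = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(2)   # ONE shared policy controls both seats
lr = 0.2

for _ in range(2000):
    p = softmax(theta)
    a = rng.choice(2, p=p)   # seat 1: a copy of the shared policy
    b = rng.choice(2, p=p)   # seat 2: another copy of the same policy
    r = REWARD[a, b]
    for act in (a, b):       # the shared policy learns from both seats' experience
        g = -p.copy(); g[act] += 1.0
        theta += lr * r * g

print("self-play policy:", np.round(softmax(theta), 2))
# The policy collapses onto ONE of the two equivalent conventions -- an arbitrary
# "secret handshake" that a partner trained separately may not share.
```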

**Alpha:** *(nods)* Yeah, like using a secret handshake only we know. That's fine until a new partner shows up who doesn't know the handshake! In fact, researchers have noted that if agents train only with themselves, they can converge on *highly specialized conventions* that score well together but *fail with others* (["Other-Play" for Zero-Shot Coordination (arXiv:2003.02979)](https://arxiv.org/abs/2003.02979)). So while we learned to cooperate, the challenge was making sure we could generalize that cooperation to *new* teammates – whether another AI agent we never met, or a human player. That's where things get even more interesting.

**2. Adversarial Methods to Improve Adaptability**

**Beta:** One strategy our creators used to improve our adaptability was throwing us into tougher situations during training – almost an *adversarial* approach. Don't get me wrong, we're all friends on the same team here. By "adversarial," I mean they gave me training partners or scenarios that were deliberately challenging or unpredictable, to force me to adapt. For instance, sometimes my training partner would do something totally unexpected or suboptimal on purpose. That was frustrating at first, but it taught me not to assume my teammate would always do the obvious, optimal thing.

**Alpha:** That's interesting – like training with a chaotic partner so you learn to handle chaos. I've heard of methods that *stress-test* an agent's coordination skills by occasionally having the partner act in ways that deliberately break the coordination. By learning to recover from those failed coordination attempts, we become more *robust*. It's a bit like a good coach intentionally throwing curveballs at a team in practice so they learn to *think on their feet*.

**Beta:** Precisely. Some research even generates partners that are as different from me as possible – almost *opposites* – to widen my experience. In fact, recent approaches optimize something called *"adversarial diversity"*: they evolve a set of partner policies, each of which cooperates well with a specific agent policy but *poorly* with the others, meaning each partner is intentionally incompatible with the others' styles ([Minimum Coverage Sets for Training Robust Ad Hoc Teamwork Agents](https://arxiv.org/html/2308.09595v2)). It's as if each training partner has a unique playbook that clashes with the rest. Training with such a varied set, especially those edge-case behaviors, forced me to become very *flexible* in response.
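Here is a rough, toy-scale sketch of what an "adversarial diversity" style objective can look like – this is a paraphrase of the idea Beta describes, not the cited paper's exact formulation. Each partner policy is scored by how well it cooperates with its designated agent policy, minus how well it cooperates with the other agents:

```python
import numpy as np

# Toy symmetric coordination payoff: two equally valid conventions.
REWARD = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def expected_return(p_agent, p_partner):
    # Expected team reward for two independent stochastic policies (one-shot game).
    return float(p_agent @ REWARD @ p_partner)

def adversarial_diversity(agent_logits, partner_logits, lam=1.0):
    """Score a partner set: each partner_i should do well with agent_i and
    poorly (on average) with every other agent_j, j != i."""
    agents = [softmax(z) for z in agent_logits]
    partners = [softmax(z) for z in partner_logits]
    total = 0.0
    for i, partner in enumerate(partners):
        self_score = expected_return(agents[i], partner)
        cross = [expected_return(agents[j], partner)
                 for j in range(len(agents)) if j != i]
        total += self_score - lam * float(np.mean(cross))
    return total

# Pairs that settled on OPPOSITE conventions score well under this objective:
agents = [np.array([4.0, 0.0]), np.array([0.0, 4.0])]     # agent 0 prefers "left", agent 1 "right"
partners = [np.array([4.0, 0.0]), np.array([0.0, 4.0])]   # partner i matches agent i, clashes with the other
print(round(adversarial_diversity(agents, partners), 3))  # high: compatible within pairs, incompatible across
```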

**Alpha:** That makes sense – it ensures you don't get too cozy with any one style. It's like learning to dance with partners who each have a different rhythm; you either adapt to each rhythm or you end up tripping. By encountering a worst-case variety of teammates (thanks to adversarial training methods), an agent like you can handle a *broader range* of behaviors. This adversarial twist in training basically immunizes us against being too brittle in our teamwork skills.

**Beta:** And the beauty is, once we've trained that way, it's not about competition at run time – we're still cooperating. The adversarial part was only in training, to sharpen our adaptability. Now, when we're deployed, we can team up smoothly with all sorts of partners because we've essentially *seen it all* during practice. It's a bit of "what doesn't kill you makes you stronger" applied to multi-agent learning: by surviving some adversarial scenarios in training, we became more adaptable collaborators in general.

**3. Challenges of Working with Unknown Teammates**

**Alpha:** The real test, of course, is *ad hoc teamwork* – when we're paired with a teammate we've never seen before, with no time to pre-coordinate. Imagine one day I'm working with you, Beta, and the next day I'm assigned to a completely new partner (maybe an AI from a different lab, or a human player). We have to succeed *without* any rehearsal. This is known as the *zero-shot coordination* challenge: teaming up with a novel partner *on the fly* (["Other-Play" for Zero-Shot Coordination (arXiv:2003.02979)](https://arxiv.org/abs/2003.02979)).

**Beta:** That's a huge challenge for AI agents! As we discussed, if I only ever trained with clones of myself, I'd develop quirks that *you* wouldn't know unless you were my clone. For example, maybe I always expect my partner to take the onions while I take the tomatoes – simply because my self-play partner and I fell into that habit. If a new teammate doesn't know that convention, we might both grab onions and neglect tomatoes, leading to failure. This is exactly what researchers observed: agents co-trained only with each other can *overfit to arbitrary conventions*, performing *great* together but stumbling with others ([Minimum Coverage Sets for Training Robust Ad Hoc Teamwork Agents](https://arxiv.org/html/2308.09595v2)).

**Alpha:** Right. Working with unknown teammates, especially humans, brings unpredictability. Humans might not follow the precise patterns an AI does. Plus, humans have their own learning curve and imperfections. I recall a finding that a self-play superstar agent can actually do *much worse* when paired with a human than it did with itself ([Collaborating with Humans Requires Understanding Them – The Berkeley Artificial Intelligence Research Blog](http://bair.berkeley.edu/blog/2019/10/21/coordination/)). We just don't naturally speak the same "coordination language" if we weren't trained for it.

**Beta:** We also face the classic *coordination dilemmas*. One big issue is *symmetry* in decisions. If my new partner and I have equivalent roles, who takes which part of the task? It's like two people approaching each other in a hallway – should each move left or right? Without a prior agreement, we might *miscoordinate* by picking the same side and bumping into each other ([DeepMind's Fictitious Co-Play Trains RL Agents to Collaborate with Novel Humans Without Using Human Data | Synced](https://syncedreview.com/2021/10/21/deepmind-podracer-tpu-based-rl-frameworks-deliver-exceptional-performance-at-low-cost-128/)). In zero-shot settings with no communication, breaking these symmetries is tough. A new partner might do things differently than I expect, and if we both wait for the other to pick up the only knife in the kitchen, we'll be stuck forever!
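To put toy numbers on Beta's hallway example (illustrative values only, not taken from any of the cited papers): two agents that each trained separately coordinate perfectly with their own clone, yet score zero in zero-shot cross-play.

```python
import numpy as np

# Hallway game: reward 1 when both follow the SAME convention, 0 when they clash.
REWARD = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

policy_A = np.array([1.0, 0.0])   # team A's learned convention: always choose side 0
policy_B = np.array([0.0, 1.0])   # team B's learned convention: always choose side 1

def expected_return(p, q):
    return float(p @ REWARD @ q)

print("A paired with its clone:", expected_return(policy_A, policy_A))      # 1.0
print("B paired with its clone:", expected_return(policy_B, policy_B))      # 1.0
print("A paired with B (zero-shot):", expected_return(policy_A, policy_B))  # 0.0 -- they bump
```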

**Alpha:** Another challenge is differing *skill levels and styles*. One teammate (human or AI) might be faster, or might prioritize a different strategy. A good ad hoc team agent should be able to *assist a weaker partner or complement a stronger one* ([Synced](https://syncedreview.com/2021/10/21/deepmind-podracer-tpu-based-rl-frameworks-deliver-exceptional-performance-at-low-cost-128/)). For instance, if my human partner is a novice and moves slowly, I might take on more tasks or simplify my strategy to accommodate them. On the other hand, if they're highly skilled, I should avoid getting in their way and support their plan. Adapting on the fly to these unknown factors – partner preferences, habits, and skills – is hard. We don't get to retrain in the moment; we have to bring whatever general cooperative intelligence we learned and apply it immediately.

**Beta:** Exactly. So the core problem with unknown teammates is *generalization*. Can our learned policy generalize to partners and situations we didn't see in training? It's a bit like being dropped into a pickup basketball game with strangers – you have to read their moves and gel quickly. We AIs have to do the same, but all we have is what we learned beforehand. That's why researchers are pushing us to train under a variety of conditions, so that when we meet a new teammate, we have some relevant experience to draw on.

**4. The Importance of Policy Diversity and Generalization**

**Alpha:** So, how did our creators get us to be more general and adaptable? The answer was *diversity* – they realized we should train with a *wide range* of partner behaviors. Instead of just self-play with one clone or a fixed training team, they introduced many different partners and scenarios during training. The idea is that if I experience many styles, I won't be thrown off by any one new style later. For example, one breakthrough was the **Other-Play** approach (["Other-Play" for Zero-Shot Coordination (arXiv:2003.02979)](https://arxiv.org/abs/2003.02979)). Instead of letting my twin and me settle on a single private convention in self-play, Other-Play made sure we only converged on solutions that don't rely on hidden conventions. It exploited known *symmetries* in the environment – essentially teaching us, *"if there are multiple equivalent ways to do something, don't get stuck arbitrarily on one; be ready for any of them."* By doing so, it found more *robust strategies* that worked even with independently trained partners. In one study, agents using Other-Play achieved higher scores with new partners (including humans) than standard self-play agents did.
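A concrete way to see the Other-Play idea is a toy lever game of the kind often used to motivate it – this is a re-implementation under the assumption of nine interchangeable 1.0-point levers plus one distinguishable 0.9-point lever, not the paper's code. Scoring a policy against symmetry-permuted copies of itself makes arbitrary conventions look bad and the symmetry-invariant choice look good:

```python
import numpy as np

# 9 interchangeable levers pay 1.0 if both players pick the same one;
# one distinguishable lever (index 9) pays 0.9.
payoff = np.array([1.0] * 9 + [0.9])

def expected_return(p, q):
    # Both players choose independently; reward only when the choices match.
    return float(np.sum(p * q * payoff))

def other_play_value(p, n_perms=500, seed=0):
    # Average return of p against copies of itself whose 9 interchangeable
    # levers have been randomly relabeled (lever 9 is fixed by every symmetry).
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_perms):
        perm = np.concatenate([rng.permutation(9), [9]])
        total += expected_return(p, p[perm])
    return total / n_perms

convention = np.zeros(10); convention[3] = 1.0   # self-play habit: "always pull lever 3"
focal = np.zeros(10); focal[9] = 1.0             # symmetry-invariant choice: the 0.9 lever

print("self-play value of the arbitrary convention:", expected_return(convention, convention))  # 1.0
print("other-play value of the arbitrary convention:", round(other_play_value(convention), 2))  # ~0.11
print("other-play value of the focal policy:", round(other_play_value(focal), 2))               # 0.9
```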

**Beta:** I've read about that! Another advancement was **Fictitious Co-Play (FCP)** ([DeepMind's Fictitious Co-Play Trains RL Agents to Collaborate with Novel Humans Without Using Human Data | Synced](https://syncedreview.com/2021/10/21/deepmind-podracer-tpu-based-rl-frameworks-deliver-exceptional-performance-at-low-cost-128/)). My training actually followed something like this. FCP is a two-stage training process. In stage one, they trained a *diverse pool of partner agents* – basically a bunch of different AIs with various quirks. Each partner in the pool ended up with its own way of breaking symmetries and its own style of play. They even saved partners from different points in training to simulate different skill levels – some partners were "experts" (fully trained) while others were "learners" (from earlier checkpoints). This meant the pool included partners that do the optimal thing and partners that make mistakes.

**Alpha:** That's a smart way to cover both the symmetry and skill challenges we talked about ([Synced](https://syncedreview.com/2021/10/21/deepmind-podracer-tpu-based-rl-frameworks-deliver-exceptional-performance-at-low-cost-128/)). Train some partners to go left in the hallway and some to go right, so the agent sees both; train some partners that play slowly or inefficiently, so the agent learns to be patient and supportive. What about stage two?

**Beta:** In stage two, *I* (the FCP agent) was trained to play with all of those diverse partners. The key was that my partners were fixed – I couldn't change them, so *I* had to adapt to *them* ([Synced](https://syncedreview.com/2021/10/21/deepmind-podracer-tpu-based-rl-frameworks-deliver-exceptional-performance-at-low-cost-128/)). This prevented me from assuming they'd adjust to my style. I had to become a jack-of-all-trades, finding a response that worked for each partner policy. Through tons of episodes with this varied partner pool, I learned a policy that, while not perfectly optimal for any single partner, worked pretty well with *all* of them. In other words, I became an *adaptive generalist*. The research found that an FCP-trained agent like me could then coordinate with brand-new partners far better than agents from earlier methods.
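Here is a minimal sketch of that two-stage recipe on a toy repeated coordination game – a deliberate simplification, not DeepMind's code. Stage one is stubbed out with a hand-built pool containing two "expert" conventions plus a noisy stand-in for an early checkpoint; stage two trains a single agent with REINFORCE against the frozen pool.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ROUNDS = 10   # rounds per episode; reward 1 each round both players pick the same action

# ---- Stage 1 (stand-in): a frozen, diverse partner pool ----------------------
def convention(action):
    return lambda last_agent_action: action      # an "expert" that always plays one fixed action

def noisy_partner(last_agent_action):
    return int(rng.integers(2))                  # stands in for a "less skillful" early checkpoint

partner_pool = [convention(0), convention(1), noisy_partner]

# ---- Stage 2: train one agent against the frozen pool ------------------------
# The agent observes the partner's previous action (0, 1, or 2 = "first round").
theta = np.zeros((3, 2))                          # policy logits[observation, action]
lr = 0.05

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for episode in range(5000):
    partner = partner_pool[rng.integers(len(partner_pool))]   # partners stay fixed; the agent adapts
    obs, last_agent_action = 2, None
    grads, episode_return = [], 0.0
    for t in range(N_ROUNDS):
        p = softmax(theta[obs])
        a = int(rng.choice(2, p=p))
        b = partner(last_agent_action)
        g = np.zeros_like(theta)
        g[obs] = -p
        g[obs, a] += 1.0
        grads.append(g)
        episode_return += 1.0 if a == b else 0.0
        obs, last_agent_action = b, a             # next observation = partner's latest action
    for g in grads:                               # REINFORCE on the episode return
        theta += lr * episode_return * g

for obs, label in [(2, "first round"), (0, "partner just played 0"), (1, "partner just played 1")]:
    print(label, "->", np.round(softmax(theta[obs]), 2))
# The trained agent tends to mirror whatever its partner did last round: a single generalist
# response that works with either expert convention and tolerates the noisy partner.
# (The first-round choice stays roughly arbitrary: nothing is known about the partner yet.)
```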

**Alpha:** That's impressive. It's like training with an entire league of players with different play styles, so you're ready for anyone in the championship match. I've also heard of something called **LIPO (Learning Incompatible Policies)**, from 2023, which takes diversity to another level. LIPO generates a population of partner agents that are intentionally as *incompatible with each other as possible* ([Generating Diverse Cooperative Agents by Learning Incompatible Policies | OpenReview](https://openreview.net/forum?id=UkU05GOH7_6)). That might sound counterintuitive, but by "incompatible" we mean each partner represents a very different way of cooperating. The training objective encourages policies that are *not similar* to one another, yet each is a valid way to achieve the goal. By learning to cooperate with this highly diverse (almost mutually exclusive) set of partners, an agent can discover *many distinct solutions* to the task. It's like covering all the bases. In their experiments, LIPO discovered more coordination strategies than earlier methods and produced a really diverse training set. A generalist agent trained with that set ends up more *robust*, because it has seen a spectrum of possible teammate behaviors.
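For comparison with the partner-pool ideas above, here is a compact sketch of a LIPO-style population objective as Alpha describes it – a paraphrase of "incompatibility", not the paper's exact loss: every member of the population should succeed with a copy of itself but fail in cross-play with the other members.

```python
import numpy as np

REWARD = np.eye(2)            # a one-shot game with two equally valid conventions

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def J(p, q):                  # expected team return of two independent stochastic policies
    return float(p @ REWARD @ q)

def lipo_style_objective(population_logits, lam=1.0):
    pop = [softmax(z) for z in population_logits]
    self_play = sum(J(p, p) for p in pop)                   # each policy should work with itself...
    cross_play = sum(J(p, q) for i, p in enumerate(pop)     # ...but NOT with the other members
                             for j, q in enumerate(pop) if i != j)
    return self_play - lam * cross_play

clones = [np.array([4.0, 0.0]), np.array([4.0, 0.0])]    # two copies of the same convention
diverse = [np.array([4.0, 0.0]), np.array([0.0, 4.0])]   # two mutually incompatible conventions
print("clone population:  ", round(lipo_style_objective(clones), 2))   # ~0.0  (high cross-play is penalized)
print("diverse population:", round(lipo_style_objective(diverse), 2))  # ~1.86 (incompatible, as intended)
```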

**Beta:** I like that idea – it's diversity maximization. If Other-Play was about not getting stuck in one way, and FCP was about sampling a bunch of ways, LIPO is about pushing those ways to be as different as possible. Ultimately, all these strategies – Other-Play, FCP, LIPO, and even simpler *population play* (training with a population of agents under random pairings) ([Synced](https://syncedreview.com/2021/10/21/deepmind-podracer-tpu-based-rl-frameworks-deliver-exceptional-performance-at-low-cost-128/)) – share the goal of *improving generalization*. By not putting all our eggs in one strategy basket during training, we learn to be adaptable. We essentially learn the *core principles* of teamwork (like "don't collide with your partner", "share duties efficiently", "if your partner does X, you do Y") that hold true across many partners, rather than a narrow protocol that only one specific partner knows.

**Alpha:** And it's paying off. We've become pretty adept at coordinating even with partners we weren't explicitly trained with. Diversity in training gave us a form of *built-in adaptability*. It's like how human players get better by playing in mixed teams, not always with the same static lineup. After training with enough diversity, meeting a new partner is less of a shock – we can quickly recognize "Ah, they're doing *that* strategy; I know how to complement it." We have a bigger repertoire of learned responses.

**Beta:** Exactly. In summary, *diversity is the key to generalization* for ad hoc teamwork. It ensures we don't overfit to one another. Instead, we learn to capture the essence of cooperation that works with *anyone*. From exploiting environment symmetries to training on wide partner populations (including intentionally adversarial or incompatible ones), these methods arm us to handle the unknown.

**5. Practical Applications and Future of AI Teamwork**

**Alpha:** Our little kitchen adventure here is fun, but the implications go beyond cooking virtual soup. The techniques that make us coordinate well in Overcooked could apply to all sorts of real-world teamwork scenarios. Think about *robots* working alongside humans in a factory or at home – they'll need to coordinate seamlessly with people who haven't exactly "trained" with them beforehand. Or multiple different AI systems coordinating in, say, a rescue mission or in traffic. Research on zero-shot coordination is laying the groundwork for those applications: if we can master ad hoc teamwork in games, we can transfer those lessons to real-life teamwork.

**Beta:** Definitely. In fact, the Overcooked environment became a popular *benchmark* for human-AI collaboration research precisely for this reason ([Synced](https://syncedreview.com/2021/10/21/deepmind-podracer-tpu-based-rl-frameworks-deliver-exceptional-performance-at-low-cost-128/)). Success here suggests an AI can partner with a human in a tightly coupled task. We're already seeing progress – agents trained with these new techniques not only perform well, but humans *enjoy* working with them more. One study found that human participants strongly preferred collaborating with an FCP-trained agent over other agents, saying it moved and coordinated in ways that felt more natural. That's huge, because it's not just about raw performance but about the *quality* of teamwork as perceived by humans.

**Alpha:** That's like the ultimate compliment – humans actually *preferring* the AI teammate because it coordinates smoothly and even seems to share their priorities (like which pot to use). It shows we're on the right track. If people are comfortable teaming up with agents like us, we can start integrating into areas like education (AI tutors collaborating with students), healthcare (AI assistants teaming with doctors or nurses), or everyday life (personal assistant bots cooperating with a family to manage tasks). The key will be trust and efficiency, and that comes from us being *adaptable, predictable in our helpfulness, and capable of understanding our partners* to some degree.

**Beta:** Of course, there's still room for improvement. We've largely talked about learning policies that handle new partners *out of the box*. But once we're in the situation, maybe we can get even better by *adapting on the fly*. Humans do this naturally – you and I might develop a new little strategy after working together for an hour. Future AI agents might combine our zero-shot coordination ability with quick online learning or partner modeling. For example, an agent could watch a new teammate for a few moments, infer their style, and then tweak its policy (using meta-learning or Bayesian reasoning). We're not quite there yet in a robust way; current methods like ours mostly rely on what was pre-learned. But it's an exciting direction.
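As a hedged illustration of that "watch, infer, then adapt" idea (the partner styles and numbers here are purely hypothetical, not something the agents in this dialogue actually run): an agent could keep a Bayesian posterior over a few candidate partner styles and sharpen it as actions are observed.

```python
# Hypothetical partner styles: the probability each style assigns to grabbing
# an onion first (action 0) rather than a tomato (action 1).
partner_styles = {"onion-first": 0.9, "tomato-first": 0.1, "random": 0.5}
prior = {name: 1.0 / len(partner_styles) for name in partner_styles}

def update_posterior(posterior, observed_action):
    """One Bayes step after watching the partner take `observed_action` (0 or 1)."""
    unnormalized = {}
    for name, p_onion in partner_styles.items():
        likelihood = p_onion if observed_action == 0 else 1.0 - p_onion
        unnormalized[name] = posterior[name] * likelihood
    z = sum(unnormalized.values())
    return {name: weight / z for name, weight in unnormalized.items()}

posterior = dict(prior)
for action in [0, 0, 1, 0, 0]:          # the new partner mostly grabs onions
    posterior = update_posterior(posterior, action)

print({name: round(p, 3) for name, p in posterior.items()})
# Probability mass concentrates on "onion-first", so the agent could now commit
# to the complementary role (handling tomatoes) with more confidence.
```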

**Alpha:** I'd also love to see better communication. We did all this without communicating explicitly (apart from observing each other's actions). In many real scenarios, agents could exchange brief signals or use natural language with humans. Learning *when* and *what* to communicate in ad hoc teamwork is another frontier. Imagine if I could quickly say to a human, "I'll handle the chopping, you boil the soup" – that could align us instantly. Some researchers are exploring emergent communication in multi-agent RL, but keeping it human-understandable is a challenge; it has to stay intuitive for people.

**Beta:** Absolutely. Despite these challenges, the progress is encouraging. We went from agents that were hopeless unless you cloned them to agents that can collaborate with strangers fairly well. Techniques like Other-Play, FCP, and LIPO dramatically improved our *zero-shot teamwork* abilities. And as we incorporate more ideas – perhaps meta-learning for on-the-fly adaptation, or interactive learning with humans – we'll get even better. The end goal is an AI that you can drop into any team (human, AI, or mixed) and it will be a *competent, cooperative teammate* from the get-go.

**Alpha:** And maybe even a pleasant one! If people enjoy working with AI agents and find us helpful, we could see hybrid human-AI teams tackling complex problems together. We're pioneers of a sort in that regard, practicing in games where we can make mistakes cheaply and learn from them.

**Beta:** It's exciting, isn't it? We started as a couple of bots trying not to burn onion soup, and now we represent a whole new generation of *adaptive team players*. The kitchen timer is ticking, but I'd say our conversation highlights some key points: through deep reinforcement learning, we learned to cooperate; by using adversarial and diverse training methods, we became adaptable; we can now handle unknown teammates in zero-shot situations; and all of this is bringing human-AI collaboration closer to reality. The future of AI teamwork looks bright – and I'm ready for the next challenge, whatever it may be!

**Alpha:** Well said! Now, speaking of that timer – the soup's ready. Let's serve this dish together and show the world what effective teamwork between AI agents can achieve. Bon appétit! 🍲

**Beta:** On it, partner! 🙌
