Adaptive opponent modeling · adversarial co-training · JAX · MPE

Adaptive opponent modeling for
adversarial co-training in MARL.

Model an opponent's latent strategy with calibrated uncertainty (Part 1) and plan against it (Part 2). In a controlled predator–prey game, a blue team that infers a varying red opponent's strategy and plans against it is the most robust best response — +61% over an opponent-blind baseline, beating even an oracle told the strategy.

uncertainty-aware opponent model · belief-conditioned planner · robust to a varying opponent · 3 seeds · MAPPO + CTDE · JAX, CPU
§1 What this is

Two-team independent Q-learning.

Building blocks for adaptive opponent modeling in adversarial co-training: learn an uncertainty-aware model of the opponent, then exploit it inside a planner.

The arc

The end goal is a two-part co-training loop — (1) an uncertainty-aware encoder that maps an opponent's trajectory to a latent strategy z, and (2) a model-based planner that samples the opponent's actions from a policy conditioned on z. Each experiment here builds one piece:

Exp 3: an opponent-action head is a point-estimate opponent model; the MC planner is a flat precursor to the tree search.
Exp 2: an unsupervised VAE on opponent trajectories (Part 1 encoder), and where it fails when the strategy signal is weak.
Exp 4: a resource task; the dominated prey's trajectory still doesn't carry strategy, so the bottleneck is behavioural expression.
Exp 5: fix the regime so the prey expresses strategy, and all three proposal preconditions follow: distinct strategies, an encodable latent, and a brittle opponent.
Exp 6: the core result. A hidden opponent intent, inferred from behaviour, exploited with an uncertainty-aware best response.

Environment

JaxMARL's MPE_simple_tag_v3. 3 predators vs 1 prey, 2D arena, 5 discrete actions, 25-step episodes. +10 / −10 per capture. Fully observable.

Legend: predator (×3) · prey (×1) · obstacle · 5 actions, Δ reward = ±10/capture

Two-team IQL

One TrainState per team, shared weights within team. Avoids gradient crosstalk from stock JaxMARL's single-net vmap.


§2 Experiment 3 — primary result

Opponent-aware Q-learning, planning, and cross-play evaluation.

Does an opponent-action auxiliary head shift the equilibrium? Is the effect symmetric between predator and prey?

Seeds per variant: 3
Parallel eval eps per cell: 100
Planner, rollouts × horizon: K=5, H=3
Full 3×3 tournament (CPU): ~90 s

The three variants

IQL-greedy

MLP-IQL, greedy argmax.

Plain two-team IQL. No opponent head. Control.

OA-greedy

+ opponent-aware head (training-only).

Auxiliary head predicts opponent actions (CE loss, coef 0.5). Eval = argmax Q. Head discarded at inference.

OA-plan

+ Monte-Carlo lookahead planning (eval-only).

OA head becomes a one-step opponent model for K=5, H=3 MC lookahead. No extra training.

Architecture — the two-headed MLP

OA-IQL-MLP network (per team)
obs  ──▶  Dense(128) ──▶ ReLU ──▶ Dense(128) ──▶ ReLU ──┬──▶ Dense(action_dim=5)             ──▶  Q-values
                                                              │
                                                              └──▶ Dense(opp_n × action_dim)     ──▶  opp-action logits
                                                                    │
                                                                    └─ CE vs opp actions from replay  (the auxiliary loss, weight 0.5)

Shared trunk encodes opponent-predictive features. At training time the opp head is an auxiliary regularizer; at eval time (OA-plan) it becomes the opponent model the planner samples from.
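
A minimal sketch of how the two heads could combine into one training loss, assuming a plain squared TD error for the Q head (the actual TD loss form is not stated here) and the CE term weighted by the 0.5 coefficient given above. The function and argument names are illustrative, not taken from the repository.

```python
import math

AUX_COEF = 0.5  # weight of the opponent-action CE term, as stated above


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def oa_iql_loss(q_taken, td_target, opp_logits, opp_actions):
    """Squared TD error plus AUX_COEF times the mean CE over opponent heads.

    opp_logits:  one list of action_dim logits per opponent agent.
    opp_actions: the opponent actions stored in replay (one int per opponent).
    """
    td_loss = (q_taken - td_target) ** 2  # assumption: squared error, not Huber
    ce = -sum(
        log_softmax(logits)[a] for logits, a in zip(opp_logits, opp_actions)
    ) / len(opp_actions)
    return td_loss + AUX_COEF * ce, td_loss, ce


# One opponent, uniform logits over 5 actions: CE equals ln(5).
total, td, ce = oa_iql_loss(1.0, 1.5, [[0.0] * 5], [2])
```

With uniform logits the CE term sits at ln 5 ≈ 1.61, which is the loss level corresponding to the 0.20 chance accuracy.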

How the planner actually picks an action

Per-agent (not joint) Monte-Carlo rollout (see planning_eval.py:86–175):

  1. For each of the action_dim = 5 candidate root actions a_root:
    • Sample K = 5 independent rollouts of horizon H = 3.
    • In each rollout, this agent plays a_root at t=0 and argmax Q after — root-action injection is the single line planning_eval.py:119.
    • Teammates play argmax Q throughout (coordinate descent).
    • Opponents are sampled from Categorical(softmax(opp_head(obs))) — see :135–142.
    • Step through the actual wrapped_env.wrapped_step(), accumulating Σ γᵗ r (see :142–157).
    • At the leaf, bootstrap: V_leaf = max_a Q(s_H, a).
  2. Average K returns per root action → one score per candidate.
  3. Pick argmax_root.
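
The steps above can be sketched for a single planning agent as follows. This is a toy illustration, not the code in planning_eval.py: teammates are omitted, and `step_fn`, `q_fn`, and `opp_probs_fn` are hypothetical callables standing in for the wrapped environment step, the Q head, and the softmaxed opponent head.

```python
import random


def plan_action(state, n_actions, step_fn, q_fn, opp_probs_fn,
                K=5, H=3, gamma=0.9, rng=None):
    """Score each root action by K sampled rollouts of horizon H (requires H >= 1)."""
    rng = rng or random.Random(0)
    scores = []
    for a_root in range(n_actions):
        total = 0.0
        for _ in range(K):
            s, ret, disc = state, 0.0, 1.0
            for t in range(H):
                if t == 0:
                    a = a_root  # root-action injection
                else:
                    q = q_fn(s)
                    a = max(range(n_actions), key=q.__getitem__)  # greedy after t=0
                probs = opp_probs_fn(s)
                # Opponent action sampled from the learned opponent model.
                opp_a = rng.choices(range(len(probs)), weights=probs)[0]
                s, r = step_fn(s, a, opp_a)
                ret += disc * r
                disc *= gamma
            ret += disc * max(q_fn(s))  # leaf bootstrap: V_leaf = max_a Q(s_H, a)
            total += ret
        scores.append(total / K)
    best = max(range(n_actions), key=scores.__getitem__)
    return best, scores


# Toy 1-D chase used only to exercise the planner (hypothetical environment).
def toy_step(state, a, opp_a):
    me, opp = state
    me, opp = me + (a - 1), opp + (opp_a - 1)  # actions: 0=left, 1=stay, 2=right
    return (me, opp), (1.0 if me == opp else 0.0)


def toy_q(state):
    me, opp = state
    return [-abs(me + d - opp) for d in (-1, 0, 1)]


def frozen_opp(state):
    return [0.0, 1.0, 0.0]  # degenerate opponent model: always "stay"


best, scores = plan_action((0, 3), 3, toy_step, toy_q, frozen_opp)
```

The degenerate opponent model makes the demo deterministic; with a real opponent head the K rollouts per root action differ, which is what the averaging in step 2 is for.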

Per-agent coordinate descent is a deliberate trade-off. A joint-action planner for the three predators would score 5³ = 125 candidates per decision; per-agent scores only 5 × 3 = 15. The cost is that predator coordination is partially collapsed into "each predator's independent best response given frozen teammates," which — as the results show — matters.

H = 0 recovers greedy Q. H = 3 covers ~12% of a 25-step episode; with γ = 0.9, γ³ ≈ 0.73, so the leaf bootstrap still contributes most of the value.

Primary result — 3×3 cross-play tournament

Each cell: paired by seed, 100 eval episodes, averaged over 3 seeds.

Exp 3 tournament heatmaps: pred_return, prey_return, captures_per_ep
Tournament heatmaps. Rows = predator policy, columns = prey policy. Three metrics side-by-side: total predator return, total prey return, captures per episode. Each cell is mean ± std over 3 seeds × 100 eval episodes. The full figure is plots/exp3_tournament.png.

Captures per episode — read the numbers directly

 
Rows = predator policy, columns = prey policy.

pred \ prey | IQL-greedy | OA-greedy | OA-plan
IQL-greedy | 3.41 ± 0.53 | 3.64 ± 0.39 | 1.57 ± 0.15 (prey best)
OA-greedy | 3.39 ± 1.00 | 3.01 ± 0.30 | 1.15 ± 0.25
OA-plan | 2.27 ± 0.59 (pred worst) | 2.80 ± 0.49 | 1.07 ± 0.30

Column-wise: OA-plan prey is uniformly lowest (1.57, 1.15, 1.07) — ~2-capture drop. Row-wise: OA-plan pred is worst across the board.
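
As a quick check on the reading above, the column and row aggregates can be recomputed from the table values. The numbers are copied from the table; the variable names are illustrative.

```python
POLICIES = ["IQL-greedy", "OA-greedy", "OA-plan"]

# Captures per episode: rows = predator policy, columns = prey policy.
captures = [
    [3.41, 3.64, 1.57],
    [3.39, 3.01, 1.15],
    [2.27, 2.80, 1.07],
]

# Mean captures conceded by each prey policy (averaged over predator policies).
col_means = [sum(row[j] for row in captures) / 3 for j in range(3)]

# Mean captures achieved by each predator policy (averaged over prey policies).
row_means = [sum(row) / 3 for row in captures]

# For each prey column, the index of the predator policy with the fewest captures.
worst_pred_per_col = [
    min(range(3), key=lambda i: captures[i][j]) for j in range(3)
]
```

The OA-plan prey column averages about 1.26 captures/ep against roughly 3.0 to 3.2 for the two greedy prey columns, and the OA-plan predator is the lowest row in every column.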

Result 1 — opponent modelling disproportionately benefits the prey.

Same auxiliary objective on both teams. Prey benefits dramatically; predator barely. Adding planning on top amplifies the asymmetry: OA-plan prey dominates universally, OA-plan pred is worst among pred variants.

Mechanism — opponent-head accuracy asymmetry

Exp 3 training curves: pred return, prey return, opp-head accuracy
Training curves, mean ± std over 3 seeds. Left: predator test return. Middle: prey test return. Right: opp-head classification accuracy for each team, with chance (1/5 = 0.20) marked.
pred opp-head acc (3 seeds): 0.607 ± 0.021
prey opp-head acc (3 seeds): 0.674 ± 0.032
chance (1 of 5 actions): 0.20
prey > pred accuracy gap: +6.6 pp

The prey predicts 3 predator actions per step; the predator predicts 1 prey action. 3× the CE gradient → better opponent model → better planner. The +7 pp accuracy gap drives the tournament asymmetry.

Why are the marginal return curves so flat?

In a zero-sum-like game, predator return and prey return are shadows of each other. If both sides improve proportionally, the marginal training curve looks unchanged. Equilibrium shifts only become visible when you stress-test a policy against opponents it wasn't paired with during training, i.e. in a cross-play tournament. The OA-greedy vs IQL-greedy difference in the training curves is under half a standard deviation. In the tournament, the column-mean drop when the prey plans (from ~3.1 to ~1.3 captures/ep) is more than 3 std.

This is also a methodological point: if you only look at marginal training returns in a self-play setup, you'll systematically miss the kind of effect this experiment is about.

Why does the predator planner actually hurt?

Two compounding reasons:

  1. Weaker opponent model. The predator's opp head is about 7 percentage points less accurate than the prey's (0.61 vs 0.67). The planner samples from it, so worse model → worse plan.
  2. Per-agent coordinate descent breaks predator coordination. With one prey, coordinate descent is exact. With three predators, each predator plans as if teammates will act greedily — the plan is myopic to teammate reactions during the H=3 lookahead. A joint-action planner (5³ candidates) would close this, but was out of scope.

Fixing either would likely move OA-plan pred above IQL-greedy. Neither is a deep problem with the approach — both are engineering choices we made for wall-clock reasons.

Summary

Finding

A symmetric auxiliary objective produces an asymmetric effect in adversarial co-training, driven by an asymmetry in the classification signal each team receives. The side with the richer opponent-modelling task attains a better opponent classifier and a better planner derived from it. Neither side's marginal training return moves appreciably; the effect is observable only in cross-play.

§3 Experiment 2 — directional reward shaping

Static obstacles advantage the prey. Directional shaping of the prey reduces its realised return.

Directional shaping narrows the prey's strategy; predators exploit it. Four conditions, three seeds each.

Setup

SimpleTagStaticMPE: obstacles pinned at (±0.5, ±0.5), plus a per-step directional bonus for the prey:

bonus = coef · dir_sign · sign(v_prev × v_new) · 1[r_in ≤ ‖v_new‖ ≤ r_out]

v_prev = prey_pos_t     − obstacle
v_new  = prey_pos_{t+1} − obstacle
dir_sign = +1 (CCW) or −1 (CW)
coef = 0.1,  r_in = 0.25,  r_out = 0.6

Direction-only (not speed). Annulus gate prevents farming on top of obstacles. Prey-only bonus; predator rewards untouched.
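
The bonus can be written as a small pure function. This is a sketch of the formula above rather than the environment's actual code; the function name and the convention that the caller supplies the obstacle are assumptions.

```python
import math


def shaping_bonus(prey_prev, prey_new, obstacle, dir_sign,
                  coef=0.1, r_in=0.25, r_out=0.6):
    """Directional bonus: coef * dir_sign * sign(v_prev x v_new) * annulus gate."""
    v_prev = (prey_prev[0] - obstacle[0], prey_prev[1] - obstacle[1])
    v_new = (prey_new[0] - obstacle[0], prey_new[1] - obstacle[1])
    # 2-D cross product: positive for counter-clockwise motion around the obstacle.
    cross = v_prev[0] * v_new[1] - v_prev[1] * v_new[0]
    turn = (cross > 0) - (cross < 0)  # sign in {-1, 0, +1}
    in_band = r_in <= math.hypot(*v_new) <= r_out
    return coef * dir_sign * turn * (1.0 if in_band else 0.0)


# CCW step at radius ~0.39 around an obstacle at the origin.
ccw_rewarded = shaping_bonus((0.4, 0.0), (0.39, 0.05), (0.0, 0.0), dir_sign=+1)
ccw_penalised = shaping_bonus((0.4, 0.0), (0.39, 0.05), (0.0, 0.0), dir_sign=-1)
outside_band = shaping_bonus((0.9, 0.0), (0.89, 0.05), (0.0, 0.0), dir_sign=+1)
```

Only the sign of the cross product is used, so the bonus does not grow with speed, and the gate zeroes it outside the annulus.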

Four conditions

Run | Obstacles | Shaping | Predator return ↑ | Captures / ep
random_baseline | randomized each reset | none | +33.12 | 3.3
static_baseline | fixed at (±0.5, ±0.5) | none | +22.19 ± 5.97 | 2.2
static_ccw | fixed | CCW prey bonus, r ∈ [0.25, 0.6] | +26.25 ± 2.76 | 2.6
static_cw | fixed | CW prey bonus, same annulus | +28.23 ± 2.06 | 2.8
Exp 2 training curves for the four conditions, with ±std bands
Training curves, mean ± std over 3 seeds. Predator (left) and prey (right) test returns across the four conditions. Shaping decreases prey return and increases predator return — in the opposite direction of what a naïve reading of the shaping would predict.
Three observations

1. Static obstacles favor the prey. Removing obstacle randomness lets the prey memorize stable cover — predators lose ~1.1 captures/ep versus random.

2. CW / CCW shaping recovers most of that gap. Either direction of the annular bonus pushes predator captures from 2.2 back up to 2.6–2.8 / ep.

3. Shaping also collapses seed-to-seed variance (σ ≈ 6 → σ ≈ 2). The bonus regularizes the prey into a narrow family of policies.

Behavior mining

64 episodes per (variant × seed). Track prey 2D position and signed angular displacement around the nearest obstacle.
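
One way to compute the signed angular step is sketched below. It assumes the nearest obstacle is chosen from the prey's previous position and that the angle difference is wrapped to [-π, π); the obstacle coordinates in the example are placeholders, not necessarily the layout used in the experiment.

```python
import math


def signed_angular_step(pos_prev, pos_new, obstacles):
    """Signed change in polar angle around the nearest obstacle (CCW positive)."""
    # Assumption: "nearest" is measured from the previous position.
    ox, oy = min(obstacles, key=lambda o: math.dist(o, pos_prev))
    a0 = math.atan2(pos_prev[1] - oy, pos_prev[0] - ox)
    a1 = math.atan2(pos_new[1] - oy, pos_new[0] - ox)
    # Wrap the difference so a step across the +/-pi seam stays small.
    return (a1 - a0 + math.pi) % (2 * math.pi) - math.pi


OBSTACLES = [(0.5, 0.5), (-0.5, -0.5)]  # placeholder positions for illustration

ccw_step = signed_angular_step((0.9, 0.5), (0.5, 0.9), OBSTACLES)  # quarter turn CCW
cw_step = signed_angular_step((0.5, 0.9), (0.9, 0.5), OBSTACLES)   # quarter turn CW
```

Averaging this quantity over all steps of the pooled episodes gives the mean Δangle, and its spread gives the Δangle std.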

Exp 2 prey behavior mining: density heatmaps (top) and signed angular step histograms (bottom)
Top row: 2D prey-position density. Obstacles marked in cyan. baseline concentrates near cover. cw and ccw show visible orbital bands in the shaping annulus around each obstacle. Bottom row: signed angular step per timestep (positive = CCW around the nearest obstacle, negative = CW). baseline is symmetric around zero; cw is biased negative; ccw is biased positive. Red dashed line shows the mean.

The numbers behind the histograms

Variant | Mean Δangle (rad/step) | Δangle std | Direction
baseline | +0.0002 | 0.36 | ≈ symmetric
cw | −0.0108 | 0.37 | biased CW (sign matches shaping)
ccw | +0.0091 | 0.36 | biased CCW (sign matches shaping)
Mechanism

Shaping compresses the prey into a narrow orbital band. Predators exploit the narrower support — capture rate increase outweighs the shaping bonus.

Rollouts

Each GIF is 150 steps = 6 auto-reset episodes chained together, greedy rollouts from seed 0 of the corresponding static variant.

Per-trajectory clustering

Can a single trajectory identify which policy produced it? Short answer: no, at shape_coef = 0.1. Two datasets tested: Reading A (900 labelled trajectories, 9 checkpoints) and Reading B (900 unlabelled, single policy).

1. Hand-crafted + occupancy features

PCA scatter of init-conditioned hand features on Reading A, all 9 ground-truth checkpoints overlapping in the same blob
Hand-features, Reading A. All checkpoints overlap. ARI = 0.008 (chance). Occupancy features give the same null result.
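
For readers unfamiliar with the metric, here is a dependency-free sketch of the adjusted Rand index using the standard pair-counting formula. The report's numbers were presumably computed with a library implementation; this version is only meant to show why values near 0 mean chance-level agreement.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the contingency table: 1.0 = identical partitions, ~0 = chance."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_ij = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in true_counts.values())
    sum_b = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # labels permuted
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

ARI is invariant to relabelling clusters, which is why it can compare unsupervised cluster IDs against checkpoint labels.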

2. MLP-VAE on trajectories

Flax MLP-VAE, 8-dim latent, KL-annealed, 8k Adam steps on Reading A.
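
A sketch of what KL annealing means for the objective, assuming a diagonal-Gaussian posterior against a standard-normal prior and a linear warm-up. The warm-up length and maximum weight below are hypothetical; the schedule actually used is not stated here.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar)
    )


def kl_weight(step, warmup_steps=2000, beta_max=1.0):
    """Linear ramp from 0 to beta_max (warmup_steps is an illustrative value)."""
    return beta_max * min(1.0, step / warmup_steps)


def annealed_loss(recon_mse, mu, logvar, step):
    """Negative ELBO with the KL term down-weighted early in training."""
    return recon_mse + kl_weight(step) * gaussian_kl(mu, logvar)


prior_match = gaussian_kl([0.0] * 8, [0.0] * 8)  # posterior equal to the prior
```

Starting the KL weight at zero lets the encoder use the latent before the prior pulls it back, which is one common guard against posterior collapse.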

Prey VAE training curves: ELBO, reconstruction MSE, and KL all stabilising
Healthy training. No overfitting, no posterior collapse.
5x5 grid of decoded prey trajectories along the top-2 principal axes of the VAE latent
Latent traversal. VAE learns trajectory shape (displacement direction), not policy identity.
GMM and KMeans clustering metrics on the prey VAE latent vs k, including ARI and NMI vs ground truth
Clustering on z. ARI ≈ 0.008, NMI ≈ 0.024 — near chance. Latent organises by geometry, not policy.
GMM(k=3) confusion matrix in prey VAE latent space vs ground truth — same per-row distribution across all checkpoints
GMM confusion matrix. All checkpoints distribute ~55/25/20 across clusters. Clusters are real but policy-blind.
Why aggregates work but per-trajectory doesn't

Most steps are outside the annulus where all policies look identical. Behavior mining pools 192 episodes to average out the dominant evade signal; a single rollout can't.

3. Predator VAEs

Same VAE on each predator slot. Latent traversals are more interpretable (goal-directed chase trajectories), but still track arena geometry, not policy.

Decoded predator-0 trajectories along the top-2 principal axes of z
pred_0 — smooth directional manifold of chase trajectories.
Decoded predator-1 trajectories along the top-2 principal axes of z
pred_1 — two regimes at latent extremes.
Decoded predator-2 trajectories along the top-2 principal axes of z
pred_2 — similar directional manifold.

Predator clustering numbers, side-by-side with the prey

Agent | Reading A: max ARI vs ground truth | Reading A: max NMI | Reading B: best-BIC k | Reading B: BIC drop (k=1 → k_best)
prey | 0.008 | 0.024 | 1 | ≈ 0
pred_0 | 0.025 | 0.061 | 4 | −750
pred_1 | 0.023 | 0.054 | 3 | −812
pred_2 | 0.017 | 0.043 | 3 | −765

Reading A: predator ARI is 2–3× the prey's but still at chance. Reading B: predators are multi-modal (BIC drops by 700+ nats), the prey is not. Modes track which obstacle the chase converges on.

Implications & fixes

The VAE latent encodes trajectory shape, not policy identity. Two paths forward:

Unit tests for the env — five assertions, 5 seconds

The shaping math is trivial but easy to get subtly wrong (sign-flip in the cross product, off-by-epsilon in the annulus, wrapper vs subclass). Five assertions in smoke_test_static.py:

  1. test_fixed_landmarks — 10 random reset seeds, obstacles stay pinned.
  2. test_auto_reset_preserves_landmarks — step past the 25-step boundary, check they're still pinned.
  3. test_shape_direction_sign — hand-rolled tangent velocities at 12 o'clock.
  4. test_radial_band_gating — inside r_in or outside r_out earns zero bonus.
  5. test_no_shape_when_coef_zero — shaping code path pass-through when disabled.

Run: python src/smoke_test_static.py

Caveats of this experiment
  • Only 3 seeds per static variant; random_baseline is a 1-seed reference from Exp 1.
  • Shaping coefficient 0.1 and annulus [0.25, 0.6] are reasonable defaults, not swept.
  • The prey return curve is contaminated by the shaping bonus itself. Read the predator return (or captures/ep) as the cross-condition signal — the behavior-mining plot is the other unbiased evidence.
  • Two obstacles at opposite corners is a hand-picked geometry. Sensitivity to landmark layout is not measured.
§4 Experiment 4 — resource collection

A predator-prey-resources environment for qualitatively distinct strategies.

Change the task so different strategies produce qualitatively different trajectories, making policy identity visible to any encoder.

Design

4 collectable resources, prey-visible only. +5.0 reward per collection. Predator obs unchanged.

Resources per episode: 4
Reward per collection: +5.0
Prey obs (14 base + 12 resource): 26-d
Predator obs (unchanged): 16-d

Placement modes

Side-by-side diagram of circle vs corners resource placement
Left: circle placement (radius 0.6). Right: corners placement (offset 0.8). Green diamonds are resources; grey circles are fixed obstacles. Dashed lines show the implied collection path.

Circle

Radius 0.6 around origin. Optimal collection = loop.

Corners

At (±0.8, ±0.8). Requires long diagonal dashes.

Random

50/50 coin-flip per reset. Bimodal trajectory distribution by construction.
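
The three placement modes can be sketched as below. The radius, offset, and 50/50 split come from the descriptions above; the angular phase of the circle layout (resources on the axes) is an assumption, and the function names are illustrative.

```python
import math
import random


def circle_placement(radius=0.6, n=4):
    """n resources evenly spaced on a circle around the origin (phase assumed)."""
    return [
        (radius * math.cos(2 * math.pi * k / n),
         radius * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]


def corners_placement(offset=0.8):
    """One resource in each corner at (+/-offset, +/-offset)."""
    return [(sx * offset, sy * offset) for sx in (-1, 1) for sy in (-1, 1)]


def random_placement(rng):
    """Fair coin flip between the two layouts, drawn once per reset."""
    return circle_placement() if rng.random() < 0.5 else corners_placement()


layout = random_placement(random.Random(0))
```

Because the layout is drawn per reset, a dataset collected under the random mode mixes both geometries, which is what makes its trajectory distribution bimodal by construction.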

Trained policy rollouts

Same initial state (random placement, seed 0). Red = predators, blue = prey, green diamonds = resources.

Side-by-side GIF comparing IQL, OA-IQL, and MAPPO rollouts
MAPPO predators actively pursue and corner the prey (pred total +180). IQL and OA-IQL predators move but fail to coordinate effectively (pred total +0).

Individual algorithm rollouts:

IQL trained rollout on random placement
IQL (pred: +0, prey: +10)
OA-IQL trained rollout on random placement
OA-IQL (pred: +0, prey: +0)
MAPPO trained rollout on random placement
MAPPO (pred: +180, prey: −56)

Arena layout references

Rollout GIF with circle resource placement
Circle placement (random actions)
Rollout GIF with corners resource placement
Corners placement (random actions)

Information asymmetry

Predators: 16-d (no resource info). Prey: 26-d (base + 12 resource dims). The prey's strategy is shaped by resources the predator can't observe — an opponent-aware predator that infers the strategy type should outperform one that doesn't.

Observation space comparison: predator 16-d vs prey 26-d
Predator observation is unchanged from the base environment (16-d). Prey observation appends 12 resource dimensions (8 relative positions + 4 collected flags). The grey region in the predator bar is zero-padded by CTRolloutManager.

Implementation

ResourceState dataclass adds resource_pos and collected. CTRolloutManager pads to 26-d + 4-d one-hot = 30-d training obs. Predators zero-padded in resource dims.
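
A sketch of the padding arithmetic, assuming zeros are appended after the base observation and the agent one-hot comes last. The helper name and the agent ordering are hypothetical; CTRolloutManager's real layout may differ.

```python
PREY_OBS_DIM = 26  # 14 base + 12 resource dims
N_AGENTS = 4       # 3 predators + 1 prey


def training_obs(obs, agent_idx):
    """Zero-pad an observation to 26-d, then append a 4-d agent one-hot (30-d)."""
    padded = list(obs) + [0.0] * (PREY_OBS_DIM - len(obs))
    one_hot = [1.0 if i == agent_idx else 0.0 for i in range(N_AGENTS)]
    return padded + one_hot


pred_obs = training_obs([0.1] * 16, agent_idx=0)  # predator: 16-d padded to 26-d
prey_obs = training_obs([0.2] * 26, agent_idx=3)  # prey: already 26-d
```

Both teams end up with the same 30-d input shape, while the predator's resource slots carry no information.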

Smoke tests — 9 assertions

smoke_test_resources.py covers placement geometry, observation shapes, predator-obs equivalence with the base env, collection mechanics, reward accounting, auto-reset state clearing, and obstacle preservation:

  1. test_circle_placement — resources at radius 0.6.
  2. test_corner_placement — resources at (±0.8, ±0.8).
  3. test_random_placement_varies — both layouts appear over 20 seeds.
  4. test_prey_obs_size — prey 26-d, predator 16-d.
  5. test_predator_obs_matches_base — predator obs identical to vanilla SimpleTagMPE.
  6. test_collection_happens — prey on top of resource triggers collection.
  7. test_collection_reward — prey reward diff is ~5.0 when collecting.
  8. test_auto_reset_clears_collected — collected flags reset on episode boundary.
  9. test_fixed_obstacles_preserved — obstacles stay pinned across seeds.

Run: JAX_PLATFORMS=cpu python src/smoke_test_resources.py

Algorithm variants

IQL: Independent Q-Learning (MLP). Off-policy, replay buffer.
OA-IQL: + opponent-action auxiliary head (CE, coef 0.5). MC planning at eval (K=5, H=3).
MAPPO: Centralized critic + per-agent actor. On-policy PPO, 32 envs, 128-step rollouts.
Training & evaluation commands

10 Hydra configs, 2M timesteps each, 3 seeds. See Repro section for full commands.

Evaluation: cross-play tournament, trajectory dataset generation, VAE + multimodality analysis.

Training results — random placement

Key finding: MAPPO predator returns are ~3× higher than either IQL variant (+110 vs +34). The gap is driven by MAPPO's centralized critic, which coordinates predator pursuit. OA-IQL adds a marginal +2 over IQL but the difference is within noise. Resource collection is comparable across all three, confirming the return gap is tag/evasion, not resource-driven.

2M timesteps, 3 seeds, random placement. X-axis = env timesteps (comparable across algorithms).

Exp 4 training curves: IQL vs OA-IQL vs MAPPO on random resource placement
Top row: episode returns (predator left, prey right). Shaded bands are ±1 std across 3 seeds. Bottom row: per-step resource collection rate.
Algorithm | Pred return | Prey return | Pred resources | Prey resources
IQL | +33.5 ± 1.4 | −37.0 ± 2.4 | 0.415 | 0.444
OA-IQL | +35.5 ± 1.8 | −48.8 ± 4.9 | 0.414 | 0.387
MAPPO | +110.5 ± 12.3 | −113.7 ± 5.7 | 0.460 | 0.468

MAPPO ~3× IQL on pred returns. Centralized critic coordinates pursuit. Resource collection comparable across all three (~0.4–0.5/step) — return gap is tag/evasion, not resources.

Trajectory datasets

300 MAPPO trajectories (3 seeds × 100 eps, T=50). Includes positions, resources, and collection flags.

Trajectories (MAPPO): 300
Steps per trajectory: 50
Seeds (policy diversity): 3
Fields per timestep: 12

Does the prey trajectory carry its strategy? — measuring the precondition

The proposal's Part 1 wants an encoder mapping the opponent's trajectory to a latent strategy z. That is only possible if the strategy is in the trajectory. The hypothesis was that resource placement (circle-loop vs corner-dash) would make it so. We tested this directly: label each first-episode prey trajectory by its placement (recoverable from resource geometry, never shown to the encoder), then ask whether a supervised linear probe on the raw trajectory can predict placement, and whether the unsupervised VAE latent recovers it.

Probe and VAE-latent separability of prey placement across algorithms
Left three: VAE latent (PCA-2) coloured by true placement — circle and corners overlap completely. Right: supervised placement accuracy from the raw trajectory and from the VAE latent, both at the majority-class line (dashed) for every algorithm. Chance, everywhere.
Negative result: placement is not recoverable from the prey trajectory — not by the VAE (ARI ≤ 0.03), and not even by a supervised probe with full labels (accuracy ≈ majority class). True for IQL, OA-IQL, and MAPPO; for init-conditioned and absolute coordinates alike. The signal is absent from the data, not merely lost by the encoder — so this is not the Exp 2 encoder-limitation story repeating.

Is it an incentive problem? — 4× the collect reward

The natural fix is to make the prey care more about resources. We retrained IQL with collect_reward raised from 5 to 20. The prey collected nearly twice as many resources per episode (0.68 → 1.28), confirming the incentive landed — yet placement stayed at chance (probe 0.49). Collecting more resources does not mean executing a layout-distinctive route.

Mechanism — why the strategy isn't in the trajectory

Three predators dominate; the prey's path is governed by evasion, and it grabs whichever resources fall along the escape route. In both layouts there is always a resource near the fleeing prey, so it never commits to a full circle circuit or a corner tour. Placement changes where resources sit but not how the prey moves. The "qualitatively distinct strategies" the resource env was meant to induce do not appear in the dominated agent's behaviour.

Implication for the co-training proposal

The bottleneck for Part 1 is not the encoder (VAE vs supervised contrastive) — it is behavioural expression. An opponent-strategy latent is only learnable when the opponent can actually execute distinguishable strategies. In a task where one side is dominated, its trajectory collapses onto "evade", and there is little for z to encode. This argues for either a more balanced task (predator and prey comparably capable) or an explicitly strategy-revealing objective for the agent, before investing in the latent-conditioned policy model and planner.

§5 Experiment 5 — the proposal's three preconditions

Distinct strategies, an encodable latent, and a brittle opponent.

Exp 4 found the blocker: a dominated prey can't express its strategy. Fix the regime so it can, and the three things the co-training proposal needs all fall out — demonstrated, not assumed.

A ~30-second narrated walkthrough of all three results. Full writeup with the formal setup and math: exp5_results.pdf.
Setup

Two changes from Exp 4. Specialists: train a separate prey co-trained only on the circle layout and one only on corners (no mixing). A balanced regime: slow the predators (max_speed 1.0→0.6, accel 3.0→1.5) and raise the collect reward (5→10) so the prey is not dominated and actually runs a collection route. Everything else is the Exp 4 environment. Three seeds per specialist.

Deliverable 1 — different placements produce different strategies

The two specialists trace visibly different routes. The circle-prey loops a diamond between the four ring resources; the corners-prey patrols a square around the perimeter. Pooling occupancy over a trajectory, a linear probe tells the two apart at 0.98 (chance 0.50) for IQL and 0.96 for MAPPO.

Circle-specialist prey traces a diamond; corners-specialist traces a square
Prey occupancy density (pooled over 300 episodes / specialist). Cyan rings mark resource positions. The route follows the layout — a diamond for circle, a square for corners.
Live rollout: circle-specialist loops the ring, corners-specialist patrols the perimeter
Live rollout (greedy). Same env, two specialists: the circle-prey heads for the ring, the corners-prey for the perimeter — the routes that pool into the densities above.

Deliverable 2 — a VAE latent that encodes the strategy

A single short trajectory is dominated by evasion noise, so the encoder needs both the right representation (the trajectory's 2D occupancy histogram, which exposes where the prey goes) and enough of it. Sweeping the observation length, the unsupervised VAE latent goes from chance to a clean two-cluster split: at L=100 steps the latent recovers placement at ARI 0.87 with a supervised latent-probe of 0.99.
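
The occupancy representation can be sketched as a normalised 2-D histogram of prey positions. The bin count and arena bounds below are assumptions chosen for illustration; the resolution used in the experiment is not stated here.

```python
def occupancy_histogram(positions, bins=8, lo=-1.0, hi=1.0):
    """Flattened bins x bins histogram of (x, y) positions, normalised to sum to 1.

    Positions outside [lo, hi] are clamped into the edge bins.
    Expects a non-empty list of positions.
    """
    grid = [[0.0] * bins for _ in range(bins)]
    width = (hi - lo) / bins
    for x, y in positions:
        i = min(bins - 1, max(0, int((x - lo) / width)))
        j = min(bins - 1, max(0, int((y - lo) / width)))
        grid[j][i] += 1.0
    n = len(positions)
    return [count / n for row in grid for count in row]


features = occupancy_histogram([(0.0, 0.0), (0.6, 0.0), (0.6, 0.0), (-2.0, 0.0)])
```

The histogram discards step order, so moment-to-moment evasion averages out while the set of visited regions, which is what distinguishes the two routes, accumulates with observation length.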

Separability rises with observation length; VAE latent forms two clean clusters
Left: placement recovery vs how many steps of the opponent the encoder sees. Right: the VAE latent at L=100 — circle- and corners-prey fall into two separated clusters with no labels used in training.

This is the result the Exp 2 / Exp 4 VAEs could not reach. The difference is regime (the prey now expresses strategy) and representation (occupancy, not the raw evasion-dominated position sequence). MAPPO's prey is more dominated, so its latent encodes placement supervised (probe 0.83) but does not separate cleanly unsupervised — cleaner strategy expression gives a cleaner latent.

Deliverable 3 — predators are brittle to an unseen opponent

Take MAPPO predators co-trained against the circle-prey and drop in the corners-prey they never saw (and vice-versa). Captures fall 43%, from 1.59 to 0.91 per episode. Read column-wise (same prey and env, only the predator swapped): the co-trained predator catches the circle-prey 1.90 times per episode, the unfamiliar predator only 1.11; for the corners-prey, 1.29 vs 0.72. The centralized critic learns a layout-specific interception that does not transfer.
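
The headline drop follows directly from the four cross-play cells. The values are copied from the paragraph above; the dictionary keys are illustrative names.

```python
# Captures per episode for each (predator, prey) pairing.
captures = {
    ("circle_pred", "circle_prey"): 1.90,    # in-distribution
    ("corners_pred", "circle_prey"): 1.11,   # out-of-distribution
    ("corners_pred", "corners_prey"): 1.29,  # in-distribution
    ("circle_pred", "corners_prey"): 0.72,   # out-of-distribution
}

in_dist = (captures[("circle_pred", "circle_prey")]
           + captures[("corners_pred", "corners_prey")]) / 2
ood = (captures[("corners_pred", "circle_prey")]
       + captures[("circle_pred", "corners_prey")]) / 2

relative_drop = 1.0 - ood / in_dist  # fraction of captures lost against unseen prey
```

Here `relative_drop` comes out at about 0.43, matching the 43% figure.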

2x2 cross-play: in-distribution predators catch more than OOD predators
Left: captures/episode for every predator×prey pairing; diagonal is in-distribution, off-diagonal is OOD prey. Right: averaged, the unfamiliar predator catches 43% fewer than the co-trained one (1.59 → 0.91). (IQL predators are too generic to specialise, with only a 7% gap, so the effect is a property of the strong centralized-critic predator.)
Same corners-prey faced by its co-trained predator vs an unseen circle predator
The same corners-prey, side by side: against its co-trained predators (left) versus the unseen circle-trained predators (right). The aggregate capture rate is higher for the predator that co-trained with this prey.
Why this matters for the proposal

These are the three preconditions the adversarial co-training plan rests on. (1) The opponent has genuinely distinct strategies to model. (2) An unsupervised encoder recovers them into a latent z — the input to the latent-conditioned opponent policy and the planner's value function. (3) A strong co-trained policy really is brittle to out-of-distribution opponents, which is the failure the opponent-aware planner is meant to fix. With all three in hand, the proposal's Part 1 is directly buildable — and below, built.

Part 1 (built) — a latent-conditioned opponent model

The advisor's next step: once the VAE is good, clone a policy conditioned on its latent. We log (obs, action) from both specialists, encode each trajectory to z with the validated occupancy-VAE, and behaviour-clone a single prey policy π(a | obs, z). Conditioning on z_circle vs z_corners makes the same policy trace the diamond or the square. Crucially, swapping z alone, with the environment and predators held fixed, steers the route at 0.85 (chance 0.50): the latent controls behaviour, not just the layout.

One BC policy traces a diamond under z_circle and a square under z_corners
One behaviour-cloned policy, two routes by swapping the latent. Left/middle: occupancy under z_circle (diamond) and z_corners (square). Right: BC matches the specialist's action at 0.62 (vs 0.24 majority); the latent controls the route even in a fixed environment (0.85).
Next — Part 2

Sample π(a | s, z) as the opponent's move inside a MuZero-style search and condition the blue value function on z. The target is Deliverable 3's table: an opponent-aware circle-predator should catch the corners-prey closer to the in-distribution 1.29 than the unaware 0.72. Full math and results: exp5_results.pdf.

§6 The proposal, validated — Part 1 + Part 2

Uncertainty-aware opponent modeling, and planning against it.

The proposal: model a varying opponent's latent strategy with calibrated uncertainty (Part 1) and sample that model inside a planner (Part 2), for robustness in adversarial co-training. We validate the loop in a controlled predator–prey game where the opponent (red, the prey) draws a hidden strategy each episode and the controlled team (blue, the predators) must infer it — so every quantity can be measured against ground truth.

The task

Same arena every episode. The prey is secretly assigned one of four corners it's rewarded for haunting; predators can't see which. The strategy is invisible in any single frame — a predator can only get it by modeling the prey's behaviour over time. We train three predator conditions (MAPPO, 3 seeds): unaware (no intent), oracle (told the intent), and belief (trained on soft intent beliefs of random sharpness, so it learns to hedge when unsure and pounce when sure).

Intent occupancy, encoder accuracy vs steps, and the capture ladder
Left: the prey's positions cluster at its four possible corners (colour = hidden intent) — structure no single frame reveals. Middle: a behavioural encoder recovers the intent from the prey's first k steps (0.37→0.97) as its posterior entropy collapses. Right: the capture ladder.

Three findings, one ladder

(1) Knowing the intent is worth +51%. An oracle predator that observes the hidden corner catches the prey 4.05 times/episode versus 2.68 for the strong intent-blind hedger. The opponent's hidden strategy genuinely matters.

(2) A naive model is brittle. That same oracle predator, fed a random intent, collapses to 1.42 — worse than not modeling at all. A model that ignores its own uncertainty is a liability.

(3) Uncertainty-aware inference wins. The belief predator, fed the encoder's online-sharpening posterior, reaches 2.82 — it beats the intent-blind baseline, by hedging while the belief is flat and committing once it sharpens. Reacting to the belief recovers only a sliver of the oracle's edge (2.82 vs 2.68 against an oracle of 4.05); the win comes from planning on it, below. Hard inference at a fixed step (2.56) doesn't help, because it gambles on one possibly-wrong guess.
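
To illustrate what an online-sharpening posterior looks like, here is a hand-written Bayesian filter over the four corners. It is a stand-in for the learned behavioural encoder, not a description of it: the likelihood (a step is more probable the more it points at a corner), the concentration `kappa`, and the corner coordinates are all assumptions.

```python
import math

CORNERS = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]  # hypothetical intents


def update_belief(belief, pos, step, kappa=2.0):
    """One Bayes update: weight each corner by exp(kappa * cos(angle to corner))."""
    posterior = []
    for prior, (cx, cy) in zip(belief, CORNERS):
        dx, dy = cx - pos[0], cy - pos[1]
        norm = math.hypot(dx, dy) * math.hypot(*step)
        cos = (dx * step[0] + dy * step[1]) / norm if norm > 0 else 0.0
        posterior.append(prior * math.exp(kappa * cos))
    z = sum(posterior)
    return [p / z for p in posterior]


def entropy(belief):
    """Shannon entropy in nats; ln(4) for a flat belief over four corners."""
    return -sum(p * math.log(p) for p in belief if p > 0)


belief = [0.25] * 4
pos = (0.0, 0.0)
entropy_trace = [entropy(belief)]
for _ in range(5):
    step = (0.1, 0.1)  # prey heads toward the (+1, +1) corner
    belief = update_belief(belief, pos, step)
    pos = (pos[0] + step[0], pos[1] + step[1])
    entropy_trace.append(entropy(belief))
```

A predator conditioned on `belief` sees a near-flat vector early, when hedging is appropriate, and a near one-hot vector once the prey's heading is unambiguous.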

Belief predators infer the prey's corner and converge while unaware predators hedge
Same prey, same hidden corner (star). Left: intent-blind predators hedge. Right: belief predators consume the posterior (bars below) and converge on the inferred corner as it sharpens.
Why this is the result that matters

This isolates the four things opponent modeling needs: a strategy that is hidden (not in the observation), inferable from behaviour, exploitable (+51%), and worth modeling only with uncertainty (the naive model is −47%). It is the honest test the circle/corners experiments couldn't be, because there the strategy was visible in the map. Full setup, math, and tables: intent_opponent_modeling.pdf.

Part 2 (built) — planning with the belief

The belief predator above is reactive. Now the predators plan: at each step, for every candidate joint action, they roll the true simulator a few steps ahead with the prey's moves sampled from its policy under an intent drawn from the belief, and pick the action that best intercepts the imagined future prey. Belief-conditioned opponent sampling inside a lookahead — the proposal's Part 2.
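
The planning loop described above can be sketched on a toy grid. Everything here is a stand-in: the grid dynamics replace the true simulator, `step_toward` replaces the learned prey policy, the chase rule after the first action is an assumed default, and the `num_samples` and `horizon` defaults are illustrative.

```python
import itertools
import math
import random

ACTIONS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]  # stay, left, right, down, up
CORNERS = [(-4, -4), (-4, 4), (4, -4), (4, 4)]        # hypothetical intent targets


def step_toward(pos, target):
    """Greedy unit move along the axis with the larger gap."""
    dx, dy = target[0] - pos[0], target[1] - pos[1]
    if dx == 0 and dy == 0:
        return pos
    if abs(dx) >= abs(dy):
        return (pos[0] + (1 if dx > 0 else -1), pos[1])
    return (pos[0], pos[1] + (1 if dy > 0 else -1))


def rollout_score(predators, prey, first_actions, intent, horizon):
    """Negative closest predator-prey distance at the end of the rollout."""
    preds = [(p[0] + a[0], p[1] + a[1]) for p, a in zip(predators, first_actions)]
    prey = step_toward(prey, CORNERS[intent])
    for _ in range(horizon - 1):
        preds = [step_toward(p, prey) for p in preds]  # ASSUMED default chase policy
        prey = step_toward(prey, CORNERS[intent])      # stand-in prey policy
    return -min(math.dist(p, prey) for p in preds)


def plan(predators, prey, belief, rng, num_samples=5, horizon=3):
    """Pick the joint action with the best total score over intents drawn from the belief."""
    intents = rng.choices(range(len(CORNERS)), weights=belief, k=num_samples)
    best_score, best_joint = -math.inf, None
    for joint in itertools.product(ACTIONS, repeat=len(predators)):
        score = sum(rollout_score(predators, prey, joint, i, horizon) for i in intents)
        if score > best_score:
            best_score, best_joint = score, joint
    return best_joint


joint = plan([(2, 2), (-3, 0), (0, 3)], (0, 0), [0.1, 0.1, 0.1, 0.7], random.Random(0))
```

With three predators and five actions the sketch scores all 125 joint actions against the same sampled intents, so differences between candidates come from the actions rather than from sampling noise.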

The online-belief planner tops every reactive policy including the oracle
Captures/episode. The online-belief planner (4.31) beats the reactive belief policy (2.82) by +53% and tops even the oracle reactive predator (4.05). Ablating the belief to flat drops it to 3.07, so the inferred opponent model supplies +40% — the lookahead alone is not what wins.
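
The percentages quoted in this section follow from the captures/episode figures. A short check, using only numbers reported above (the dictionary keys are labels chosen here):

```python
def rel_change(new, base):
    """Relative change of `new` over `base`, as a percentage."""
    return 100.0 * (new - base) / base


captures = {
    "unaware": 2.68,
    "oracle": 4.05,
    "oracle_random_intent": 1.42,
    "belief_reactive": 2.82,
    "planner_online_belief": 4.31,
    "planner_flat_belief": 3.07,
}

oracle_gain = rel_change(captures["oracle"], captures["unaware"])
naive_loss = rel_change(captures["oracle_random_intent"], captures["unaware"])
planner_vs_reactive = rel_change(captures["planner_online_belief"], captures["belief_reactive"])
planner_vs_flat = rel_change(captures["planner_online_belief"], captures["planner_flat_belief"])
planner_vs_unaware = rel_change(captures["planner_online_belief"], captures["unaware"])
```

Rounded to whole percent these give +51, −47, +53, +40 and +61 respectively.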

Sharpening Part 1 — predict, don't reconstruct (JEPA vs VAE)

The encoder above used labels. A Part 1 that scales should recover the opponent's strategy self-supervised, and the objective decides whether it can. Same prey trajectories, same 2-D latent, two self-supervised encoders: a VAE that reconstructs the trajectory (generative), and a JEPA that predicts the representation of the future window through an EMA target encoder (predictive, no reconstruction — LeCun's principle, cf. V-JEPA 2). JEPA keeps only the predictable structure (where the prey is heading = its strategy) and discards the evasion noise the VAE drowns in.
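
The two ingredients that distinguish the predictive objective, an EMA target encoder and a loss in representation space, can be sketched with a toy scalar encoder. The linear encoder, the `predictor_gain` parameter, and `tau=0.99` are assumptions for illustration, not the trained architecture.

```python
def ema_update(target_params, online_params, tau=0.99):
    """Move each target weight a fraction (1 - tau) toward the online weight."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]


def encode(params, window):
    """Toy linear encoder: one weight per input feature, scalar embedding."""
    return sum(w * x for w, x in zip(params, window))


def jepa_loss(online_params, target_params, predictor_gain, past, future):
    """Squared error between the predicted and the target future embedding.

    The target embedding comes from the EMA encoder and is treated as a
    constant (stop-gradient); nothing is decoded back to trajectory space.
    """
    predicted = predictor_gain * encode(online_params, past)
    target = encode(target_params, future)
    return (predicted - target) ** 2
```

A VAE would instead score the model on rebuilding every coordinate of the trajectory, evasion jitter included; here the loss only sees what the encoder chooses to keep.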

JEPA latent separates the four intents; the VAE latent blends them
Unsupervised strategy recovery from a 2-D latent. VAE: probe 0.53, ARI 0.14 (intents blended). JEPA: probe 0.89, ARI 0.65 over 3 encoder seeds (four clean clusters), above the supervised ceiling (0.85). Only the objective differs.
The two latent spaces forming during training: JEPA separates the intents, VAE stays blended
Watch them compete: as training proceeds the predictive (JEPA) latent resolves the four hidden strategies while the generative (VAE) latent stays mixed.

Because it predicts where the prey is heading instead of waiting for it to arrive, JEPA is also an anytime encoder — it reads the strategy from ~2× fewer observed steps than the VAE (JEPA at 11 steps matches the VAE at 20), so the planner can commit sooner.

JEPA recovers the opponent strategy from fewer observed steps than the VAE
Strategy recovery vs steps observed (3-seed mean ± std). JEPA (predict) tracks just under the supervised ceiling and well above the VAE (reconstruct) at every budget.

And it closes the loop without labels: a self-supervised JEPA belief (encoder + a readout to predicted arrival position, then a softmax over the four known corners — no intent labels anywhere) drives the planner to 4.08 captures/episode, matching the supervised-belief planner (4.31, overlapping error bars) and beating the oracle reactive predator (4.05).
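
The last stage of that label-free belief, from a predicted arrival position to a distribution over the four known corners, might look like the following. The corner coordinates, the squared-distance logits, and the temperature `beta` are assumptions; the text only states that a softmax over the corners is used.

```python
import math

CORNERS = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]  # assumed coordinates


def belief_from_predicted_position(pred_xy, corners=CORNERS, beta=4.0):
    """Softmax over negative squared distances from the prediction to each corner."""
    logits = [-beta * ((pred_xy[0] - cx) ** 2 + (pred_xy[1] - cy) ** 2)
              for cx, cy in corners]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


flat_belief = belief_from_predicted_position((0.0, 0.0))   # equidistant from all corners
sharp_belief = belief_from_predicted_position((0.9, 0.9))  # close to the last corner
```

A prediction in the middle of the arena yields a flat belief, and the belief sharpens as the predicted destination approaches one corner, so the planner receives the same kind of input as with the supervised encoder.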

A label-free JEPA belief drives the planner to match the supervised one
Label-free opponent-aware planning. The self-supervised JEPA belief reaches the supervised result with zero opponent-strategy labels.

Here is that label-free belief running in the game. Left: intent-blind predators hedge. Right: JEPA-belief predators — the × marks where the JEPA model predicts the prey is heading, and it locks onto the prey's hidden corner (the star) as the pack closes in; the bar strip is the JEPA belief sharpening, with no labels.

The JEPA opponent model running in predator-prey: it predicts the prey's destination and the predators intercept
The JEPA opponent model in action. It reads the prey's hidden corner from a few steps of motion (× = predicted destination), the belief sharpens to the true corner, and the predators pre-position and intercept.

Where JEPA stops helping — the world model (an honest negative)

The obvious next step is to use the same predictive principle for the planner's dynamics: a JEPA latent world model (encode → predict next latent → decode) that the planner rolls through, intended to fix the state-space model that fell short in the earlier learned-model planner. It doesn't work. The latent compression makes the dynamics ~3× less accurate (5-step position RMSE 0.24 vs 0.07), and the planner collapses to 0.57 captures/episode, worse than both the state-space model (2.53) and the reactive baseline.

The JEPA latent world model is less accurate and the planner does worse
Planning through a learned world model. The JEPA latent model is the weakest bar — reported straight, not tuned away.
Why — the boundary of the principle

JEPA's core move — predict the representation, discard unpredictable detail — is exactly right for the opponent encoder, where the discarded detail is evasion noise and the strategy is what's predictable. It is exactly wrong for a planner's world model, which needs accurate dynamics: discarding detail loses the physics the planner depends on. The two roles want opposite things. A learned world model here wants more fidelity (the state-space model, or an ensemble with a disagreement penalty), not compression — the genuine open problem.

The second strategy axis — circle vs corners (new)

The same Part-1 question on the resource-placement axis: can an unsupervised encoder tell which specialist it is watching — the prey that orbits a resource circle, or the one that dashes between corners? Three findings (1,200 episodes, 3 encoder seeds, mopa package):

VAE vs JEPA on circle-vs-corners occupancy features
Circle vs corners, occupancy features: latent scatters, recovery vs the supervised ceiling, and the observation-length sweep.

And the behaviour-cloning test the meeting asked for: compare π(a|s) with π(a|s,z) by watching episode 1 and predicting the predator's episode-2 actions, under an episode-level split. Conditioning on the true placement helps (0.800 vs 0.776 held-out action accuracy), so the hypothesized effect is real. A single-episode unsupervised latent doesn't move it (0.772), exactly as the probe curve predicts, since one episode of observation gives it almost no signal. The oracle–baseline gap is the prize a better opponent latent captures, and the follow-up below shows an unsupervised latent does close it once it sees enough.
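
An episode-level split matters here because steps within one episode are highly correlated. A minimal sketch of the split and of the accuracy metric, with hypothetical function names and an assumed 20% test fraction:

```python
import random


def episode_level_split(episode_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that every episode lands wholly in train or test."""
    unique = sorted(set(episode_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(test_fraction * len(unique)))
    test_eps = set(unique[:n_test])
    train_idx = [i for i, e in enumerate(episode_ids) if e not in test_eps]
    test_idx = [i for i, e in enumerate(episode_ids) if e in test_eps]
    return train_idx, test_idx


def action_accuracy(predicted, actual):
    """Fraction of steps where the predicted action equals the logged one."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Ten toy episodes of 25 steps each; only the episode id of each step is needed.
episode_ids = [e for e in range(10) for _ in range(25)]
train_idx, test_idx = episode_level_split(episode_ids)
```

Splitting by step instead would let near-identical neighbouring states leak across the boundary and inflate the held-out accuracy of every variant.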

Predator BC with and without the opponent-strategy latent
π(a|s) vs π(a|s,z): the oracle ceiling shows the value of a good opponent latent; single-episode unsupervised latents don't yet capture it.

What that metric looks like in the game: a held-out episode replays, and at every step each BC variant predicts each predator's actual action — green ring = predicted it, red = missed. On episodes where the prey's route matters, knowing the placement is the difference between guessing and following (0.60 → 0.83 circle, 0.60 → 0.92 corners).

Held-out episode replay: per-step BC action prediction, naive vs oracle
Per-step action prediction on held-out episodes (left: naive π(a|s); right: oracle π(a|s, placement)). Same states, same actions to predict — the conditioning is the only difference.

Follow-up — the unsupervised latent does close the gap

Two checks turn the single-episode null into a positive scaling result. First, is behaviour cloning even a faithful clone? Deployed against the same prey, vanilla BC recovers 86% of the MAPPO expert's capture-edge over random (1.22 vs 1.35 captures/episode, random 0.40; 3 checkpoint seeds, two placements), at a held-out action match of 0.80 (episode-level split). It is trained from a hand-built state feature (all agent positions, a velocity proxy, and the predator id), which is strictly less than the policy network sees. The policy class is not the bottleneck; the residual 14% is the headroom a strategy signal could close.
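
The 86% figure is the clone's margin over the random floor divided by the expert's margin over the same floor. As a worked check with the numbers above (the function name is chosen here):

```python
def edge_recovered(clone, expert, floor):
    """Share of the expert's advantage over the floor that the clone retains."""
    return (clone - floor) / (expert - floor)


recovered = edge_recovered(clone=1.22, expert=1.35, floor=0.40)
residual = 1.0 - recovered
```

That is 0.82 / 0.95, about 0.86, leaving roughly 0.14 of the edge unrecovered.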

Vanilla BC recovers 86% of the MAPPO expert's capture-edge
Deployed captures/episode: random floor, vanilla BC, and the MAPPO expert it imitates. BC recovers 86% of the expert's edge — a faithful clone.

Second, does conditioning help once the latent carries the strategy? The placement is fixed across all of a specialist's episodes, so we can give the encoder more observation. Reading the VAE latent from L steps before a held-out episode (steps 125–150 of a six-episode rollout), its placement probe climbs from 0.60 at L=25 to 0.78 at L=125, and latent-conditioned BC tracks it: from no gain at L=25 (0.764 vs vanilla 0.766) it overtakes vanilla by L=50 and reaches the oracle (true-placement) ceiling by L=125 (0.795 vs oracle 0.792). The JEPA latent, whose probe stays flat (≈0.6) at this 4-D latent size, gives no gain at any L. So π(a|s,z) > π(a|s) holds — exactly to the degree the latent recovers the strategy. Encoder quality translates directly into control benefit.
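
The observation-length sweep reduces to choosing which slice of the six-episode rollout the encoder reads. A sketch, assuming the window is the L steps immediately preceding the held-out episode:

```python
STEPS_PER_EPISODE = 25
NUM_EPISODES = 6
HELD_OUT_START = 125  # first step of the sixth episode


def encoder_window(rollout, L):
    """Return the L steps immediately before the held-out episode."""
    if not 0 < L <= HELD_OUT_START:
        raise ValueError("L must be between 1 and 125")
    return rollout[HELD_OUT_START - L:HELD_OUT_START]


rollout = list(range(STEPS_PER_EPISODE * NUM_EPISODES))  # stand-in for 150 logged steps
held_out = rollout[HELD_OUT_START:]                       # the episode BC is scored on
one_episode = encoder_window(rollout, 25)
five_episodes = encoder_window(rollout, 125)
```

L=25 corresponds to watching one prior episode and L=125 to watching all five, which are the two ends of the sweep reported above.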

Latent-conditioned BC overtakes vanilla and reaches the oracle ceiling as observation grows
Latent-conditioned BC vs encoder observation length L. The VAE latent's probe rises (annotated) and its BC accuracy climbs from below vanilla to the oracle ceiling; the JEPA latent stays flat. The downstream gain tracks the latent's probe.
Result — Part 1 + Part 2

Part 1: the opponent's latent strategy is inferable from its behaviour into a calibrated belief (0.37→0.97 over 25 steps), and a point-estimate model is brittle (−47% fed a wrong guess) — exactly why the uncertainty matters. Part 2: sampling that uncertainty-aware opponent model inside a planner is the most robust best response — 4.31 captures/episode, +61% over the opponent-blind baseline and above an oracle handed the true strategy (4.05). Ablating the inferred belief to flat drops it to 3.07, so the opponent inference supplies +40%. Full setup and math: Adaptive Opponent Modeling for Adversarial Co-Training (PDF).

§5 Experiment 1 — archived

The original single-seed A/B on an RNN backbone.

Single seed, RNN. Superseded by Exp 3 (3 seeds, MLP).

OA-IQL nudges both teams up ~2 return points. Opp-head accuracy: 0.53 (pred) vs 0.58 (prey) — same asymmetry as Exp 3.

Metric | Baseline IQL | OA-IQL | Δ
Final predator greedy return | +31.56 | +33.44 | +1.88
Final prey greedy return | −47.15 | −45.19 | +1.96
Predator peak (training) | ~+101 | ~+115 | +14
Final opp-action accuracy | - | 0.53 / 0.58 (pred / prey) | -
Wall-clock per run (CPU) | ~157 s | ~162 s | +3 %
Exp 1 training curves, baseline vs OA-IQL
Test-return curves. Baseline vs OA-IQL, single seed.
Exp 1 opp-head accuracy
Opp-head accuracy & CE. The heads saturate at 0.53 (predator) and 0.58 (prey).

Rollouts (single seed)

Status

Superseded by Exp 3 (3 seeds, MLP, cross-play tournament). Retained for reproducibility.

§6 Engineering notes

Three implementation notes worth recording.

1. Two-team split

One TrainState per team, with weights shared within the team. Avoids gradient cancellation from stock JaxMARL's single-net vmap.

2. Use wrapped_step in the planner

The planner must step through wrapped_step, which preserves the observation padding and one-hot encoding. Calling env.step_env directly skips both and silently returns NaN.

3. Pass LogEnvState directly

Don't unwrap to env_state.env_state — the log wrapper unwraps internally.

Full code-reference map

Hyperparameters

IQL / OA-IQL (Exp 1–4, off-policy):

Total env-steps: 2 000 000 per training run
Parallel envs / train: 8
Episode length: 25 steps (Exp 2) / 32 at eval (Exp 3 tournament)
Discount γ: 0.9
ε schedule: 1.0 → 0.05 linear over 10 % of training
Learning rate: 0.005, linear decay
Target update: every 200 steps, hard copy (τ = 1)
Buffer size / batch: 5000 / 32
Hidden size: 64 (RNN, Exp 1 & 2) / 128 (MLP, Exp 3 & 4)
Opp-aux coef: 0.5
Planner K / H / γ: 5 / 3 / 0.9
Seeds: 3 (Exp 2–4), 1 (Exp 1)
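
The two schedules in this list can be written out as plain functions. This is a sketch under assumptions: the step counter is taken to be env-steps, and the learning rate is assumed to decay to zero, neither of which the list specifies.

```python
TOTAL_STEPS = 2_000_000
EPS_START, EPS_END, EPS_FRACTION = 1.0, 0.05, 0.10
LR_START = 0.005


def epsilon(step):
    """Linear decay from 1.0 to 0.05 over the first 10% of training, then flat."""
    decay_steps = EPS_FRACTION * TOTAL_STEPS
    frac = min(step / decay_steps, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)


def learning_rate(step, lr_end=0.0):
    """Linear decay of the learning rate; the end value is an ASSUMPTION."""
    frac = min(step / TOTAL_STEPS, 1.0)
    return LR_START + frac * (lr_end - LR_START)
```

Under these assumptions exploration reaches its floor of 0.05 at step 200 000 and stays there for the remaining 90% of training.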

MAPPO (Exp 4, on-policy):

Total env-steps: 2 000 000
Parallel envs: 32
Rollout length: 128 steps
Hidden size: 128
Learning rate: 0.0003, linear anneal
Discount γ / GAE λ: 0.99 / 0.95
PPO clip ε: 0.2
Entropy coef: 0.01
Value-fn coef: 0.5
Update epochs / minibatches: 4 / 4
Max grad norm: 0.5
Seeds: 3
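
For readers unfamiliar with how γ, λ and the clip ε enter the update, here is a reference-style sketch of generalised advantage estimation and the clipped PPO term in plain Python. It illustrates the standard formulas with the values listed above as defaults; it is not the project's training code.

```python
def gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalised advantage estimates for one rollout, computed backwards.

    dones[t] = 1.0 means the episode ended after step t, so no value is
    bootstrapped across that boundary.
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO objective term for one sample (to be maximised)."""
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return min(ratio * advantage, clipped * advantage)
```

With a rollout length of 128 and 25-step episodes, each rollout spans several episode boundaries, which is why the `dones` mask is needed.
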
§7 Reproducing

Reproducing the results.

# Environment
python3.11 -m venv .venv && source .venv/bin/activate
pip install -e JaxMARL/ "flax==0.10.2" hydra-core flashbax wandb matplotlib distrax optax

# Exp 4 — resource collection (3 algos × 3 placements + tournament + traj dataset)
python src/smoke_test_resources.py                                             # 9 assertions

# training — IQL
python src/iql_teams_mlp.py alg=ql_teams_resources_circle  NUM_SEEDS=3
python src/iql_teams_mlp.py alg=ql_teams_resources_corners NUM_SEEDS=3
python src/iql_teams_mlp.py alg=ql_teams_resources_random  NUM_SEEDS=3

# training — OA-IQL
python src/iql_teams_oa_mlp.py alg=ql_teams_oa_resources_circle  NUM_SEEDS=3
python src/iql_teams_oa_mlp.py alg=ql_teams_oa_resources_corners NUM_SEEDS=3
python src/iql_teams_oa_mlp.py alg=ql_teams_oa_resources_random  NUM_SEEDS=3

# training — MAPPO
python src/mappo_teams_mlp.py alg=mappo_teams_resources_circle  NUM_SEEDS=3
python src/mappo_teams_mlp.py alg=mappo_teams_resources_corners NUM_SEEDS=3
python src/mappo_teams_mlp.py alg=mappo_teams_resources_random  NUM_SEEDS=3
python src/mappo_teams_mlp.py alg=mappo_teams_simple_tag        NUM_SEEDS=3   # vanilla baseline

# evaluation — 4×4 cross-play tournament
python src/tournament_resources.py --placement random  --seeds 3 --eps 100 --K 5 --H 3
python src/tournament_resources.py --placement circle  --seeds 3 --eps 100
python src/tournament_resources.py --placement corners --seeds 3 --eps 100

# evaluation — trajectory datasets for VAE pipeline
python src/generate_trajectory_dataset_resources.py --algorithm iql
python src/generate_trajectory_dataset_resources.py --algorithm oa_iql
python src/generate_trajectory_dataset_resources.py --algorithm mappo

# evaluation — VAE + multimodality
python src/train_traj_vae.py
python src/verify_multimodality.py
python src/verify_vae_modes.py

# Exp 3 — ~5 min training + ~50 s tournament (MLP, CPU)
python src/iql_teams_mlp.py    alg=ql_teams_mlp_simple_tag    NUM_SEEDS=3   # ~2 min
python src/iql_teams_oa_mlp.py alg=ql_teams_oa_mlp_simple_tag NUM_SEEDS=3   # ~2 min
python src/tournament.py --seeds 3 --eps 100 --K 5 --H 3                    # ~50 s
python src/plot_exp3.py

# Exp 2 — ~24 min total (3 × ~8 min RNN trainings)
python src/smoke_test_static.py
python src/iql_teams.py alg=ql_teams_static_baseline NUM_SEEDS=3
python src/iql_teams.py alg=ql_teams_static_ccw      NUM_SEEDS=3
python src/iql_teams.py alg=ql_teams_static_cw       NUM_SEEDS=3
python src/exp2_behavior_mining.py
python src/compare_static_plots.py \
  --static_baseline logs/MPE_simple_tag_v3/iql_teams_static_baseline_MPE_simple_tag_v3_seed0_metrics.npz \
  --static_cw       logs/MPE_simple_tag_v3/iql_teams_static_cw_MPE_simple_tag_v3_seed0_metrics.npz \
  --static_ccw      logs/MPE_simple_tag_v3/iql_teams_static_ccw_MPE_simple_tag_v3_seed0_metrics.npz

# Exp 1 — archived, 1 seed each, ~3 min per run
python src/iql_teams.py    alg=ql_teams_simple_tag    NUM_SEEDS=1
python src/iql_teams_oa.py alg=ql_teams_oa_simple_tag NUM_SEEDS=1
python src/compare_plots.py \
  --baseline logs/MPE_simple_tag_v3/iql_teams_MPE_simple_tag_v3_seed0_metrics.npz \
  --oa       logs/MPE_simple_tag_v3/iql_teams_oa_MPE_simple_tag_v3_seed0_metrics.npz