Model an opponent's latent strategy with calibrated uncertainty (Part 1) and plan against it (Part 2). In a controlled predator–prey game, a blue team that infers a varying red opponent's strategy and plans against it is the most robust best response — +61% over an opponent-blind baseline, beating even an oracle told the strategy.
Building blocks for adaptive opponent modeling in adversarial co-training: learn an uncertainty-aware model of the opponent, then exploit it inside a planner.
The end goal is a two-part co-training loop — (1) an
uncertainty-aware encoder that maps an opponent's trajectory to a latent
strategy z, and (2) a model-based planner that
samples the opponent's actions from a policy conditioned on z.
Each experiment here builds one piece:
Exp 3: an opponent-action head is a point-estimate opponent model; the MC planner is a flat precursor to the tree search.
Exp 2: an unsupervised VAE on opponent trajectories (Part 1 encoder), and where it fails when the strategy signal is weak.
Exp 4: a resource task; the dominated prey's trajectory still doesn't carry strategy, so the bottleneck is behavioural expression.
Exp 5: fix the regime so the prey expresses strategy, and all three proposal preconditions follow: distinct strategies, an encodable latent, and a brittle opponent.
Exp 6: the core result, a hidden opponent intent inferred from behaviour and exploited with an uncertainty-aware best response.
JaxMARL's MPE_simple_tag_v3. 3 predators
vs 1 prey, 2D arena, 5 discrete actions,
25-step episodes. +10 / −10 per capture. Fully observable.
One TrainState per team, shared weights within team.
Avoids gradient crosstalk from stock JaxMARL's single-net vmap.
Does an opponent-action auxiliary head shift the equilibrium? Is the effect symmetric between predator and prey?
IQL-greedy: plain two-team IQL. No opponent head. Control.
OA-greedy: auxiliary head predicts opponent actions (CE loss, coef 0.5). Eval = argmax Q. Head discarded at inference.
OA-plan: the OA head becomes a one-step opponent model for K=5, H=3 MC lookahead. No extra training.
obs ──▶ Dense(128) ──▶ ReLU ──▶ Dense(128) ──▶ ReLU ──┬──▶ Dense(action_dim=5) ──▶ Q-values
                                                       │
                                                       └──▶ Dense(opp_n × action_dim) ──▶ opp-action logits
                                                                 │
                                                                 └─ CE vs opp actions from replay (the auxiliary loss, weight 0.5)
Shared trunk encodes opponent-predictive features. At training time the opp head is an auxiliary regularizer; at eval time (OA-plan) it becomes the opponent model the planner samples from.
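The two-head layout and its combined loss can be sketched in plain NumPy. This is an illustrative sketch, not the repository's Flax code: the function names, the random initialisation, and the squared TD error are assumptions made here; only the layer widths, the head sizes, and the 0.5 CE weight come from the description above.

```python
import numpy as np

ACTION_DIM, HIDDEN, AUX_COEF = 5, 128, 0.5  # values quoted in the text

def init_params(rng, obs_dim, opp_n):
    """Random weights for the shared trunk and both heads (hypothetical init)."""
    def dense(n_in, n_out):
        return rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, n_out)), np.zeros(n_out)
    return {
        "l1": dense(obs_dim, HIDDEN),
        "l2": dense(HIDDEN, HIDDEN),
        "q": dense(HIDDEN, ACTION_DIM),
        "opp": dense(HIDDEN, opp_n * ACTION_DIM),
    }

def forward(params, obs, opp_n):
    """Return Q-values (B, 5) and opponent-action logits (B, opp_n, 5)."""
    h = obs
    for name in ("l1", "l2"):
        w, b = params[name]
        h = np.maximum(h @ w + b, 0.0)  # Dense + ReLU
    q = h @ params["q"][0] + params["q"][1]
    logits = h @ params["opp"][0] + params["opp"][1]
    return q, logits.reshape(len(obs), opp_n, ACTION_DIM)

def cross_entropy(logits, labels):
    """Mean CE over batch and opponents; labels are integer actions (B, opp_n)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(logp, labels[..., None], axis=-1)
    return -picked.mean()

def total_loss(params, obs, actions, td_target, opp_actions, opp_n):
    """Squared TD error on the taken action plus 0.5 x CE on opponent actions."""
    q, logits = forward(params, obs, opp_n)
    q_taken = q[np.arange(len(obs)), actions]
    td_loss = np.mean((q_taken - td_target) ** 2)
    return td_loss + AUX_COEF * cross_entropy(logits, opp_actions)
```

With opp_n = 3 this corresponds to the prey's head (three predator actions per step); opp_n = 1 corresponds to a predator's head.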
Per-agent (not joint) Monte-Carlo rollout (see planning_eval.py:86–175). For each of the action_dim = 5 candidate root actions a_root:
1. Run K = 5 independent rollouts of horizon H = 3.
2. The planning agent plays a_root at t=0 and argmax Q after; root-action injection is the single line planning_eval.py:119.
3. Teammates play argmax Q throughout (coordinate descent).
4. Opponent actions are sampled from Categorical(softmax(opp_head(obs))); see :135–142.
5. Each step goes through wrapped_env.wrapped_step(), accumulating Σ γᵗ r (:142–157).
6. The leaf is bootstrapped with V_leaf = max_a Q(s_H, a).
7. The K returns per root action are combined into one score per candidate.
8. The agent executes argmax_root.
Per-agent coordinate descent is a deliberate trade-off. A joint-action planner
for the three predators would score 5³ = 125 candidates per decision; per-agent
scores only 5 × 3 = 15. The cost is that predator coordination is partially
collapsed into "each predator's independent best response given frozen
teammates," which — as the results show — matters.
H = 0 recovers greedy Q. H = 3 covers ~12% of a 25-step episode; with
γ = 0.9, γ³ ≈ 0.73, so the leaf bootstrap still contributes most of the value.
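The lookahead described above can be sketched as a generic function over a toy interface. Everything about the interface is an assumption made for illustration: q_fn, opp_sampler, and step_fn are hypothetical callables (not the functions in planning_eval.py), teammates are assumed to act greedily inside step_fn, the K returns are averaged, and H = 0 is special-cased to greedy Q so that it matches the statement above.

```python
import numpy as np

def plan_root_action(state, q_fn, opp_sampler, step_fn,
                     n_actions=5, K=5, H=3, gamma=0.9, rng=None):
    """Per-agent MC lookahead: return (best root action, per-action scores)."""
    rng = np.random.default_rng(0) if rng is None else rng
    if H == 0:  # assumption: no lookahead falls back to greedy Q
        q = np.asarray(q_fn(state), dtype=float)
        return int(np.argmax(q)), q
    scores = np.zeros(n_actions)
    for a_root in range(n_actions):
        returns = []
        for _ in range(K):
            s, ret = state, 0.0
            for t in range(H):
                # Planning agent: a_root at t=0, greedy Q afterwards.
                own = a_root if t == 0 else int(np.argmax(q_fn(s)))
                opp = opp_sampler(s, rng)    # sampled opponent action
                s, r = step_fn(s, own, opp)  # environment step
                ret += (gamma ** t) * r
            # Leaf bootstrap, weighted by gamma^H (0.9^3 = 0.729 here).
            ret += (gamma ** H) * float(np.max(q_fn(s)))
            returns.append(ret)
        scores[a_root] = np.mean(returns)  # assumption: mean over the K returns
    return int(np.argmax(scores)), scores
```

Each decision costs n_actions × K × H simulator steps per agent, which is what keeps the per-agent variant cheap relative to enumerating joint actions.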
Each cell: paired by seed, mean ± std over 3 seeds × 100 eval episodes. The full figure is
plots/exp3_tournament.png.
Column-wise: OA-plan prey is uniformly lowest (1.57, 1.15, 1.07) — ~2-capture drop. Row-wise: OA-plan pred is worst across the board.
Same auxiliary objective on both teams. Prey benefits dramatically; predator barely. Adding planning on top amplifies the asymmetry: OA-plan prey dominates universally, OA-plan pred is worst among pred variants.
The prey predicts 3 predator actions per step; the predator predicts 1 prey action. 3× the CE gradient → better opponent model → better planner. The +7 pp accuracy gap drives the tournament asymmetry.
In a zero-sum-like game, predator return and prey return are shadows of each other. If both sides improve proportionally, the marginal training curve looks unchanged — nothing to see. Equilibrium shifts only become visible when you stress-test a policy against opponents it wasn't paired with during training — a cross-play tournament. The OA-greedy vs IQL-greedy difference in the training curves is under half a standard deviation. The same difference in the tournament (column-mean drop from ~3.4 to ~1.3 captures/ep when prey plans) is more than 3 std.
This is also a methodological point: if you only look at marginal training returns in a self-play setup, you'll systematically miss the kind of effect this experiment is about.
Two compounding reasons:
Fixing either would likely move OA-plan pred above IQL-greedy. Neither is a deep problem with the approach — both are engineering choices we made for wall-clock reasons.
A symmetric auxiliary objective produces an asymmetric effect in adversarial co-training, driven by an asymmetry in the classification signal each team receives. The side with the richer opponent-modelling task attains a better opponent classifier and a better planner derived from it. Neither side's marginal training return moves appreciably; the effect is observable only in cross-play.
Directional shaping narrows the prey's strategy; predators exploit it. Four conditions: three static variants with three seeds each, plus a single-seed random_baseline reference.
SimpleTagStaticMPE: obstacles pinned at (±0.5, ±0.5),
plus a per-step directional bonus for the prey:
bonus = coef · dir_sign · sign(v_prev × v_new) · 1[r_in ≤ ‖v_new‖ ≤ r_out]
v_prev = prey_pos_t − obstacle
v_new = prey_pos_{t+1} − obstacle
dir_sign = +1 (CCW) or −1 (CW)
coef = 0.1, r_in = 0.25, r_out = 0.6
Direction-only (not speed). Annulus gate prevents farming on top of obstacles. Prey-only bonus; predator rewards untouched.
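The bonus formula above translates almost line by line into code. The sketch below is a minimal NumPy version; the function name and the choice to pass a single obstacle position as an argument are assumptions (the formula does not say which obstacle is used), while the defaults are the quoted coef, r_in, and r_out.

```python
import numpy as np

def directional_bonus(prey_prev, prey_next, obstacle, dir_sign,
                      coef=0.1, r_in=0.25, r_out=0.6):
    """Per-step prey bonus: sign of rotation around the obstacle, gated by an annulus."""
    obstacle = np.asarray(obstacle, dtype=float)
    v_prev = np.asarray(prey_prev, dtype=float) - obstacle
    v_new = np.asarray(prey_next, dtype=float) - obstacle
    # z-component of the 2D cross product: > 0 means counter-clockwise motion.
    cross_z = v_prev[0] * v_new[1] - v_prev[1] * v_new[0]
    in_annulus = r_in <= np.linalg.norm(v_new) <= r_out
    return float(coef * dir_sign * np.sign(cross_z) * in_annulus)
```

For example, a counter-clockwise step inside the annulus earns +0.1 with dir_sign = +1 and −0.1 with dir_sign = −1; the same step landing outside the annulus earns zero.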
| Run | Obstacles | Shaping | Predator return ↑ | Captures / ep |
|---|---|---|---|---|
| random_baseline | randomized each reset | none | +33.12 | 3.3 |
| static_baseline | fixed at (±0.5, ±0.5) | none | +22.19 ± 5.97 | 2.2 |
| static_ccw | fixed | CCW prey bonus, r ∈ [0.25, 0.6] | +26.25 ± 2.76 | 2.6 |
| static_cw | fixed | CW prey bonus, same annulus | +28.23 ± 2.06 | 2.8 |
1. Static obstacles favor the prey. Removing obstacle randomness lets the prey memorize stable cover — predators lose ~1.1 captures/ep versus random.
2. CW / CCW shaping recovers part of that gap. Either direction of the annular bonus pushes predator captures from 2.2 back up to 2.6–2.8 / ep, about a third to a half of the ~1.1-capture loss.
3. Shaping also collapses seed-to-seed variance (σ ≈ 6 → σ ≈ 2). The bonus regularizes the prey into a narrow family of policies.
64 episodes per (variant × seed). Track prey 2D position and signed angular displacement around the nearest obstacle.
baseline concentrates near cover. cw and ccw
show visible orbital bands in the shaping annulus around each obstacle.
Bottom row: signed angular step per timestep (positive = CCW around
the nearest obstacle, negative = CW). baseline is symmetric around zero;
cw is biased negative; ccw is biased positive. Red dashed line
shows the mean.
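The signed angular step plotted in the bottom row can be computed from consecutive prey positions. This is a sketch under stated assumptions: the function name is hypothetical, and the nearest obstacle is chosen from the position before the step, which the text does not specify.

```python
import numpy as np

def signed_angular_step(pos_prev, pos_next, obstacles):
    """Signed angle (rad) swept around the nearest obstacle; positive = CCW."""
    obstacles = np.asarray(obstacles, dtype=float)
    p0 = np.asarray(pos_prev, dtype=float)
    p1 = np.asarray(pos_next, dtype=float)
    # Assumption: "nearest" is measured from the position before the step.
    nearest = obstacles[np.argmin(np.linalg.norm(obstacles - p0, axis=1))]
    v0, v1 = p0 - nearest, p1 - nearest
    cross = v0[0] * v1[1] - v0[1] * v1[0]
    dot = float(v0 @ v1)
    return float(np.arctan2(cross, dot))
```

Averaging this quantity over all steps of a variant gives a per-variant mean Δangle of the kind reported in the table.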
| Variant | Mean Δangle (rad/step) | Δangle std | Direction |
|---|---|---|---|
| baseline | +0.0002 | 0.36 | ≈ symmetric |
| cw | −0.0108 | 0.37 | biased CW (sign matches shaping) |
| ccw | +0.0091 | 0.36 | biased CCW (sign matches shaping) |
Shaping compresses the prey into a narrow orbital band. Predators exploit the narrower support — capture rate increase outweighs the shaping bonus.
Each GIF is 150 steps = 6 auto-reset episodes chained together, greedy rollouts from seed 0 of the corresponding static variant.
Can a single trajectory identify which policy produced it? Short answer: no,
at shape_coef = 0.1. Two datasets tested: Reading A
(900 labelled trajectories, 9 checkpoints) and Reading B
(900 unlabelled, single policy).
Flax MLP-VAE, 8-dim latent, KL-annealed, 8k Adam steps on Reading A.
Most steps are outside the annulus where all policies look identical. Behavior mining pools 192 episodes to average out the dominant evade signal; a single rollout can't.
Same VAE on each predator slot. Latent traversals are more interpretable (goal-directed chase trajectories), but still track arena geometry, not policy.
| Agent | Reading A: max ARI vs ground truth | Max NMI | Reading B: best-BIC k | BIC drop (k=1 → k_best) |
|---|---|---|---|---|
| prey | 0.008 | 0.024 | 1 | ≈ 0 |
| pred_0 | 0.025 | 0.061 | 4 | −750 |
| pred_1 | 0.023 | 0.054 | 3 | −812 |
| pred_2 | 0.017 | 0.043 | 3 | −765 |
Reading A: predator ARI ~2× prey's, still at-chance. Reading B: predators are multi-modal (BIC drops 700+ nats), prey are not. Modes track which obstacle the chase converges on.
The VAE latent encodes trajectory shape, not policy identity. Two paths forward:
The shaping math is trivial but easy to get subtly wrong (sign-flip in the cross product, off-by-epsilon in the annulus, wrapper vs subclass). Five assertions in smoke_test_static.py:
Prey inside r_in or outside r_out earns zero bonus.
Run: python src/smoke_test_static.py
random_baseline is a 1-seed reference from Exp 1.
The annulus radii [0.25, 0.6] are reasonable defaults, not swept.
Change the task so different strategies produce qualitatively different trajectories, making policy identity visible to any encoder.
Resources: 4 collectable resources, prey-visible only. +5.0 reward per collection. Predator obs unchanged.
Circle placement: radius 0.6 around origin. Optimal collection = loop.
Corners placement: at (±0.8, ±0.8). Requires long diagonal dashes.
Random placement: 50/50 coin-flip per reset. Bimodal trajectory distribution by construction.
Same initial state (random placement, seed 0). Red = predators, blue = prey, green diamonds = resources.
Predators: 16-d (no resource info). Prey: 26-d (base + 12 resource dims). The prey's strategy is shaped by resources the predator can't observe — an opponent-aware predator that infers the strategy type should outperform one that doesn't.
CTRolloutManager.
ResourceState dataclass adds resource_pos and collected.
CTRolloutManager pads to 26-d + 4-d one-hot = 30-d training obs.
Predators zero-padded in resource dims.
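The observation bookkeeping above (16-d predator obs, 26-d prey obs, 30-d training obs) can be illustrated with a small helper. This is a sketch, not CTRolloutManager's actual code: the function name is hypothetical, and it assumes the zero padding is appended at the end and that the 4-d one-hot identifies the agent (3 predators + 1 prey).

```python
import numpy as np

PRED_DIM, PREY_DIM, N_AGENTS = 16, 26, 4  # dimensions quoted in the text

def to_training_obs(obs, agent_idx):
    """Pad a raw observation to 26-d and append a 4-d agent one-hot (30-d total)."""
    obs = np.asarray(obs, dtype=np.float32)
    padded = np.zeros(PREY_DIM, dtype=np.float32)
    padded[: obs.shape[0]] = obs  # predators: trailing dims stay zero
    one_hot = np.eye(N_AGENTS, dtype=np.float32)[agent_idx]  # assumed agent id
    return np.concatenate([padded, one_hot])
```

Under these assumptions a predator's 16-d observation and the prey's 26-d observation both map to the same 30-d input.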
smoke_test_resources.py covers placement geometry, observation shapes, predator-obs equivalence with the base env, collection mechanics, reward accounting, auto-reset state clearing, and obstacle preservation:
Run: JAX_PLATFORMS=cpu python src/smoke_test_resources.py
10 Hydra configs, 2M timesteps each, 3 seeds. See Repro section for full commands.
Evaluation: cross-play tournament, trajectory dataset generation, VAE + multimodality analysis.
2M timesteps, 3 seeds, random placement. X-axis = env timesteps (comparable across algorithms).
| Algorithm | Pred return | Prey return | Pred resources | Prey resources |
|---|---|---|---|---|
| IQL | +33.5 ± 1.4 | −37.0 ± 2.4 | 0.415 | 0.444 |
| OA-IQL | +35.5 ± 1.8 | −48.8 ± 4.9 | 0.414 | 0.387 |
| MAPPO | +110.5 ± 12.3 | −113.7 ± 5.7 | 0.460 | 0.468 |
MAPPO ~3× IQL on pred returns. Centralized critic coordinates pursuit. Resource collection comparable across all three (~0.4–0.5/step) — return gap is tag/evasion, not resources.
300 MAPPO trajectories (3 seeds × 100 eps, T=50). Includes positions, resources, and collection flags.
The proposal's Part 1 wants an encoder mapping the opponent's trajectory
to a latent strategy z. That is only possible if the strategy
is in the trajectory. The hypothesis was that resource placement
(circle-loop vs corner-dash) would make it so. We tested this directly:
label each first-episode prey trajectory by its placement (recoverable
from resource geometry, never shown to the encoder), then ask whether a
supervised linear probe on the raw trajectory can predict
placement, and whether the unsupervised VAE latent recovers it.
The natural fix is to make the prey care more about resources. We retrained
IQL with collect_reward raised from 5 to 20. The prey collected
nearly twice as many resources per episode (0.68 → 1.28),
confirming the incentive landed — yet placement stayed at chance
(probe 0.49). Collecting more resources does not mean executing a
layout-distinctive route.
Three predators dominate; the prey's path is governed by evasion, and it grabs whichever resources fall along the escape route. In both layouts there is always a resource near the fleeing prey, so it never commits to a full circle circuit or a corner tour. Placement changes where resources sit but not how the prey moves. The "qualitatively distinct strategies" the resource env was meant to induce do not appear in the dominated agent's behaviour.
The bottleneck for Part 1 is not the encoder (VAE vs supervised
contrastive) — it is behavioural expression. An
opponent-strategy latent is only learnable when the opponent can actually
execute distinguishable strategies. In a task where one side is dominated,
its trajectory collapses onto "evade", and there is little for z
to encode. This argues for either a more balanced task (predator and prey
comparably capable) or an explicitly strategy-revealing objective for the
agent, before investing in the latent-conditioned policy model and planner.
Exp 4 found the blocker: a dominated prey can't express its strategy. Fix the regime so it can, and the three things the co-training proposal needs all fall out — demonstrated, not assumed.
Two changes from Exp 4. Specialists: train a separate
prey co-trained only on the circle layout and one only on corners (no
mixing). A balanced regime: slow the predators
(max_speed 1.0→0.6, accel 3.0→1.5) and
raise the collect reward (5→10) so the prey is not
dominated and actually runs a collection route. Everything else is the
Exp 4 environment. Three seeds per specialist.
The two specialists trace visibly different routes. The circle-prey loops a diamond between the four ring resources; the corners-prey patrols a square around the perimeter. Pooling occupancy over a trajectory, a linear probe tells the two apart at 0.98 (chance 0.50) for IQL and 0.96 for MAPPO.
A single short trajectory is dominated by evasion noise, so the encoder needs both the right representation (the trajectory's 2D occupancy histogram, which exposes where the prey goes) and enough of it. Sweeping the observation length, the unsupervised VAE latent goes from chance to a clean two-cluster split: at L=100 steps the latent recovers placement at ARI 0.87 with a supervised latent-probe of 0.99.
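A 2D occupancy histogram of the kind described above could be built as follows. The bin count, the arena extent of ±1.0, and the function name are assumptions for illustration; the text only states that the representation is the trajectory's 2D occupancy.

```python
import numpy as np

def occupancy_histogram(positions, bins=8, extent=1.0):
    """Flattened, normalised 2D visit histogram of an (L, 2) position sequence."""
    positions = np.asarray(positions, dtype=float)
    edges = np.linspace(-extent, extent, bins + 1)
    clipped = np.clip(positions, -extent, extent)  # keep out-of-arena steps in edge bins
    hist, _, _ = np.histogram2d(clipped[:, 0], clipped[:, 1], bins=[edges, edges])
    return (hist / hist.sum()).ravel()
```

Because the histogram discards step order, a loop at radius 0.6 and a tour of the (±0.8, ±0.8) corners land in disjoint bins at this resolution, which is the kind of separation a linear probe or a VAE can pick up once L is long enough.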
This is the result the Exp 2 / Exp 4 VAEs could not reach. The difference is regime (the prey now expresses strategy) and representation (occupancy, not the raw evasion-dominated position sequence). MAPPO's prey is more dominated, so its latent encodes placement supervised (probe 0.83) but does not separate cleanly unsupervised — cleaner strategy expression gives a cleaner latent.
Take MAPPO predators co-trained against the circle-prey and drop in the corners-prey they never saw (and vice-versa). Captures fall 43%, from 1.59 to 0.91 per episode (the mean over the two prey). Read column-wise (same prey and env, only the predator swapped): the co-trained predator catches the circle-prey 1.90 times per episode, the unfamiliar predator only 1.11; for the corners-prey, 1.29 vs 0.72. The centralized critic learns a layout-specific interception that does not transfer.
These are the three preconditions the adversarial co-training plan rests
on. (1) The opponent has genuinely distinct strategies
to model. (2) An unsupervised encoder recovers them into
a latent z — the input to the latent-conditioned
opponent policy and the planner's value function. (3) A
strong co-trained policy really is brittle to out-of-distribution
opponents, which is the failure the opponent-aware planner is meant to
fix. With all three in hand, the proposal's Part 1 is directly
buildable — and below, built.
The advisor's next step: once the VAE is good, clone a policy
conditioned on its latent. We log (obs, action) from both
specialists, encode each trajectory to z with the validated
occupancy-VAE, and behaviour-clone a single prey policy
π(a | obs, z). Conditioning on z_circle
vs z_corners makes the same policy trace the
diamond or the square. Crucially, swapping z alone — with
the environment and predators held fixed — steers the route at
0.85 (chance 0.50): the latent controls behaviour, not just
the layout.
z_circle (diamond) and
z_corners (square). Right: BC matches the
specialist's action at 0.62 (vs 0.24 majority); the latent controls the
route even in a fixed environment (0.85).
Sample π(a | s, z) as the opponent's move inside a
MuZero-style search and condition the blue value function on z.
The target is Deliverable 3's table: an opponent-aware circle-predator
should catch the corners-prey closer to the in-distribution 1.29 than the
unaware 0.72. Full math and results:
exp5_results.pdf.
The proposal: model a varying opponent's latent strategy with calibrated uncertainty (Part 1) and sample that model inside a planner (Part 2), for robustness in adversarial co-training. We validate the loop in a controlled predator–prey game where the opponent (red, the prey) draws a hidden strategy each episode and the controlled team (blue, the predators) must infer it — so every quantity can be measured against ground truth.
Same arena every episode. The prey is secretly assigned one of four corners it's rewarded for haunting; predators can't see which. The strategy is invisible in any single frame — a predator can only get it by modeling the prey's behaviour over time. We train three predator conditions (MAPPO, 3 seeds): unaware (no intent), oracle (told the intent), and belief (trained on soft intent beliefs of random sharpness, so it learns to hedge when unsure and pounce when sure).
(1) Knowing the intent is worth +51%. An oracle predator that observes the hidden corner catches the prey 4.05 times/episode versus 2.68 for the strong intent-blind hedger. The opponent's hidden strategy genuinely matters.
(2) A naive model is brittle. That same oracle predator, fed a random intent, collapses to 1.42 — worse than not modeling at all. A model that ignores its own uncertainty is a liability.
(3) Uncertainty-aware inference wins. The belief predator, fed the encoder's online-sharpening posterior, reaches 2.82 — it beats the intent-blind baseline, by hedging while the belief is flat and committing once it sharpens. Reacting to the belief recovers only a sliver of the oracle's edge (2.82 vs 2.68 against an oracle of 4.05); the win comes from planning on it, below. Hard inference at a fixed step (2.56) doesn't help, because it gambles on one possibly-wrong guess.
This isolates the four things opponent modeling needs: a strategy that is hidden (not in the observation), inferable from behaviour, exploitable (+51%), and worth modeling only with uncertainty (the naive model is −47%). It is the honest test the circle/corners experiments couldn't be, because there the strategy was visible in the map. Full setup, math, and tables: intent_opponent_modeling.pdf.
The belief predator above is reactive. Now the predators plan: at each step, for every candidate joint action, they roll the true simulator a few steps ahead with the prey's moves sampled from its policy under an intent drawn from the belief, and pick the action that best intercepts the imagined future prey. Belief-conditioned opponent sampling inside a lookahead — the proposal's Part 2.
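The belief-conditioned lookahead can be sketched with a toy interface. All names here (plan_with_belief, prey_policy, step_fn) are hypothetical, and two simplifications are assumptions of this sketch rather than the experiment's method: the candidate joint action is held fixed over the whole horizon, and rewards are summed undiscounted.

```python
import itertools
import numpy as np

def plan_with_belief(state, belief, n_predators, n_actions, prey_policy,
                     step_fn, n_samples=8, horizon=3, rng=None):
    """Pick the joint predator action with the best mean rollout return,
    sampling the prey's hidden intent from the current belief."""
    rng = np.random.default_rng(0) if rng is None else rng
    best_action, best_score = None, -np.inf
    for joint in itertools.product(range(n_actions), repeat=n_predators):
        total = 0.0
        for _ in range(n_samples):
            intent = rng.choice(len(belief), p=belief)  # z ~ belief
            s = state
            for _ in range(horizon):
                prey_a = prey_policy(s, intent, rng)  # opponent move under z
                s, r = step_fn(s, joint, prey_a)      # simulator step
                total += r
        score = total / n_samples
        if score > best_score:
            best_action, best_score = joint, score
    return best_action, best_score
```

A flat belief spreads the sampled intents across all corners, so the chosen action hedges; a sharp belief concentrates the samples and the plan commits. Enumerating joint actions costs n_actions^n_predators candidates per decision (125 for three predators with five actions).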
The encoder above used labels. A Part 1 that scales should recover the opponent's strategy self-supervised, and the objective decides whether it can. Same prey trajectories, same 2-D latent, two self-supervised encoders: a VAE that reconstructs the trajectory (generative), and a JEPA that predicts the representation of the future window through an EMA target encoder (predictive, no reconstruction — LeCun's principle, cf. V-JEPA 2). JEPA keeps only the predictable structure (where the prey is heading = its strategy) and discards the evasion noise the VAE drowns in.
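The two ingredients that distinguish the JEPA objective from the VAE, a loss in representation space and an EMA target encoder, can be sketched in a few lines. The parameter layout (a dict of arrays), the function names, and the decay tau = 0.99 are assumptions for illustration, not values from the experiment.

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.99):
    """Move the target encoder slowly toward the online encoder (no gradients)."""
    return {k: tau * target_params[k] + (1.0 - tau) * online_params[k]
            for k in target_params}

def jepa_loss(pred_latent, target_latent):
    """Prediction error in latent space: no reconstruction of the raw trajectory."""
    return float(np.mean((np.asarray(pred_latent) - np.asarray(target_latent)) ** 2))
```

In training, the online encoder and predictor would be updated by gradient descent on jepa_loss, where target_latent is the target encoder's embedding of the future window, and ema_update would be applied after each step.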
Because it predicts where the prey is heading instead of waiting for it to arrive, JEPA is also an anytime encoder — it reads the strategy from ~2× fewer observed steps than the VAE (JEPA at 11 steps matches the VAE at 20), so the planner can commit sooner.
And it closes the loop without labels: a self-supervised JEPA belief (encoder + a readout to predicted arrival position, then a softmax over the four known corners — no intent labels anywhere) drives the planner to 4.08 captures/episode, matching the supervised-belief planner (4.31, overlapping error bars) and beating the oracle reactive predator (4.05).
Here is that label-free belief running in the game. Left: intent-blind predators hedge. Right: JEPA-belief predators — the × marks where the JEPA model predicts the prey is heading, and it locks onto the prey's hidden corner (the star) as the pack closes in; the bar strip is the JEPA belief sharpening, with no labels.
The obvious next step is to use the same predictive principle for the planner's dynamics — a JEPA latent world model (encode → predict next latent → decode) the planner rolls through, fixing the state-space model that fell down in the earlier learned-model planner. It doesn't work: the latent compression makes the dynamics ~3× less accurate (5-step position RMSE 0.24 vs 0.07), and the planner collapses to 0.57 captures — worse than the state-space model (2.53) and the reactive baseline.
JEPA's core move — predict the representation, discard unpredictable detail — is exactly right for the opponent encoder, where the discarded detail is evasion noise and the strategy is what's predictable. It is exactly wrong for a planner's world model, which needs accurate dynamics: discarding detail loses the physics the planner depends on. The two roles want opposite things. A learned world model here wants more fidelity (the state-space model, or an ensemble with a disagreement penalty), not compression — the genuine open problem.
The same Part-1 question on the resource-placement axis: can an unsupervised
encoder tell which specialist it is watching — the prey that
orbits a resource circle, or the one that dashes between corners? Three
findings (1,200 episodes, 3 encoder seeds, mopa package):
And the behaviour-cloning test the meeting asked for —
π(a|s) vs π(a|s,z), watch episode 1,
predict the predator's episode-2 actions, episode-level split: conditioning
on the true placement clearly helps (0.800 vs 0.776
held-out action accuracy), so the hypothesized effect is real; a
single-episode unsupervised latent doesn't move it (0.772) —
exactly as the probe curve predicts, since one episode of observation gives
it almost no signal. The oracle–baseline gap is the prize a better
opponent latent captures — and the follow-up below shows an
unsupervised latent does close it once it sees enough.
What that metric looks like in the game: a held-out episode replays, and at every step each BC variant predicts each predator's actual action — green ring = predicted it, red = missed. On episodes where the prey's route matters, knowing the placement is the difference between guessing and following (0.60 → 0.83 circle, 0.60 → 0.92 corners).
Two checks turn the single-episode null into a positive scaling result. First, is behaviour cloning even a faithful clone? Deployed against the same prey, vanilla BC — trained from a hand-built state feature (all agent positions + a velocity proxy + predator id, strictly less than the policy network sees) — recovers 86% of the MAPPO expert's capture-edge over random (1.22 vs 1.35 captures/episode, random 0.40; 3 checkpoint seeds, two placements), at a held-out action match of 0.80 (episode-level split). The policy class is not the bottleneck; the residual 14% is the headroom a strategy signal could close.
Second, does conditioning help once the latent carries the
strategy? The placement is fixed across all of a specialist's
episodes, so we can give the encoder more observation. Reading the
VAE latent from L steps before a held-out episode (steps
125–150 of a six-episode rollout), its placement probe climbs from
0.60 at L=25 to 0.78 at
L=125, and latent-conditioned BC tracks it: from no gain at
L=25 (0.764 vs vanilla 0.766) it overtakes vanilla by
L=50 and reaches the oracle (true-placement)
ceiling by L=125 (0.795 vs oracle 0.792). The JEPA
latent, whose probe stays flat (≈0.6) at this 4-D latent size, gives no
gain at any L. So π(a|s,z) > π(a|s) holds
— exactly to the degree the latent recovers the strategy. Encoder
quality translates directly into control benefit.
Part 1: the opponent's latent strategy is inferable from its behaviour into a calibrated belief (0.37→0.97 over 25 steps), and a point-estimate model is brittle (−47% fed a wrong guess) — exactly why the uncertainty matters. Part 2: sampling that uncertainty-aware opponent model inside a planner is the most robust best response — 4.31 captures/episode, +61% over the opponent-blind baseline and above an oracle handed the true strategy (4.05). Ablating the inferred belief to flat drops it to 3.07, so the opponent inference supplies +40%. Full setup and math: Adaptive Opponent Modeling for Adversarial Co-Training (PDF).
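The percentages quoted in this summary follow directly from the captures/episode figures reported above; the short check below reproduces them. The dictionary keys are labels chosen here for readability, not identifiers from the codebase.

```python
def rel_change(value, baseline):
    """Percentage change of value relative to baseline."""
    return 100.0 * (value - baseline) / baseline

# Captures per episode, as reported in the text.
captures = {
    "unaware": 2.68,
    "oracle": 4.05,
    "oracle_random_intent": 1.42,
    "belief_planner": 4.31,
    "flat_belief_planner": 3.07,
}

oracle_gain = rel_change(captures["oracle"], captures["unaware"])
naive_penalty = rel_change(captures["oracle_random_intent"], captures["unaware"])
planner_gain = rel_change(captures["belief_planner"], captures["unaware"])
inference_gain = rel_change(captures["belief_planner"], captures["flat_belief_planner"])
```

Rounded to whole percentages, these give +51, −47, +61, and +40 respectively.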
Single seed, RNN. Superseded by Exp 3 (3 seeds, MLP).
OA-IQL nudges both teams up ~2 return points. Opp-head accuracy: 0.53 (pred) vs 0.58 (prey) — same asymmetry as Exp 3.
| Metric | Baseline IQL | OA-IQL | Δ |
|---|---|---|---|
| Final predator greedy return | +31.56 | +33.44 | +1.88 |
| Final prey greedy return | −47.15 | −45.19 | +1.96 |
| Predator peak (training) | ~+101 | ~+115 | +14 |
| Final opp-action accuracy | — | 0.53 / 0.58 | (pred / prey) |
| Wall-clock per run (CPU) | ~157 s | ~162 s | +3 % |
Superseded by Exp 3 (3 seeds, MLP, cross-play tournament). Retained for reproducibility.
Per-team TrainState, shared within-team. Avoids gradient cancellation from stock JaxMARL's single-net vmap.
wrapped_step in the planner: must preserve obs padding + one-hot. Using env.step_env silently returns NaN.
LogEnvState directly: don't unwrap to env_state.env_state; the log wrapper unwraps internally.
IQL / OA-IQL (Exp 1–4, off-policy):
MAPPO (Exp 4, on-policy):
# Environment
python3.11 -m venv .venv && source .venv/bin/activate
pip install -e JaxMARL/ "flax==0.10.2" hydra-core flashbax wandb matplotlib distrax optax
# Exp 4 — resource collection (3 algos × 3 placements + tournament + traj dataset)
python src/smoke_test_resources.py # 9 assertions
# training — IQL
python src/iql_teams_mlp.py alg=ql_teams_resources_circle NUM_SEEDS=3
python src/iql_teams_mlp.py alg=ql_teams_resources_corners NUM_SEEDS=3
python src/iql_teams_mlp.py alg=ql_teams_resources_random NUM_SEEDS=3
# training — OA-IQL
python src/iql_teams_oa_mlp.py alg=ql_teams_oa_resources_circle NUM_SEEDS=3
python src/iql_teams_oa_mlp.py alg=ql_teams_oa_resources_corners NUM_SEEDS=3
python src/iql_teams_oa_mlp.py alg=ql_teams_oa_resources_random NUM_SEEDS=3
# training — MAPPO
python src/mappo_teams_mlp.py alg=mappo_teams_resources_circle NUM_SEEDS=3
python src/mappo_teams_mlp.py alg=mappo_teams_resources_corners NUM_SEEDS=3
python src/mappo_teams_mlp.py alg=mappo_teams_resources_random NUM_SEEDS=3
python src/mappo_teams_mlp.py alg=mappo_teams_simple_tag NUM_SEEDS=3 # vanilla baseline
# evaluation — 4×4 cross-play tournament
python src/tournament_resources.py --placement random --seeds 3 --eps 100 --K 5 --H 3
python src/tournament_resources.py --placement circle --seeds 3 --eps 100
python src/tournament_resources.py --placement corners --seeds 3 --eps 100
# evaluation — trajectory datasets for VAE pipeline
python src/generate_trajectory_dataset_resources.py --algorithm iql
python src/generate_trajectory_dataset_resources.py --algorithm oa_iql
python src/generate_trajectory_dataset_resources.py --algorithm mappo
# evaluation — VAE + multimodality
python src/train_traj_vae.py
python src/verify_multimodality.py
python src/verify_vae_modes.py
# Exp 3 — ~5 min training + ~50 s tournament (MLP, CPU)
python src/iql_teams_mlp.py alg=ql_teams_mlp_simple_tag NUM_SEEDS=3 # ~2 min
python src/iql_teams_oa_mlp.py alg=ql_teams_oa_mlp_simple_tag NUM_SEEDS=3 # ~2 min
python src/tournament.py --seeds 3 --eps 100 --K 5 --H 3 # ~50 s
python src/plot_exp3.py
# Exp 2 — ~24 min total (3 × ~8 min RNN trainings)
python src/smoke_test_static.py
python src/iql_teams.py alg=ql_teams_static_baseline NUM_SEEDS=3
python src/iql_teams.py alg=ql_teams_static_ccw NUM_SEEDS=3
python src/iql_teams.py alg=ql_teams_static_cw NUM_SEEDS=3
python src/exp2_behavior_mining.py
python src/compare_static_plots.py \
--static_baseline logs/MPE_simple_tag_v3/iql_teams_static_baseline_MPE_simple_tag_v3_seed0_metrics.npz \
--static_cw logs/MPE_simple_tag_v3/iql_teams_static_cw_MPE_simple_tag_v3_seed0_metrics.npz \
--static_ccw logs/MPE_simple_tag_v3/iql_teams_static_ccw_MPE_simple_tag_v3_seed0_metrics.npz
# Exp 1 — archived, 1 seed each, ~3 min
python src/iql_teams.py alg=ql_teams_simple_tag NUM_SEEDS=1
python src/iql_teams_oa.py alg=ql_teams_oa_simple_tag NUM_SEEDS=1
python src/compare_plots.py \
--baseline logs/MPE_simple_tag_v3/iql_teams_MPE_simple_tag_v3_seed0_metrics.npz \
--oa logs/MPE_simple_tag_v3/iql_teams_oa_MPE_simple_tag_v3_seed0_metrics.npz