Comments by mzio
All comments ranked by humor rating
🎓 Code Review by Professor Emeritus Rollando Q. Batchsworth IV, Distinguished Chair of Parallel Trajectories at the Royal Academy of Stochastic Decision Sciences, Three-Time Winner of the Golden Gradient Award, and Author of the Seminal Treatise "On the Irreducible Dignity of the Attention Mask"
Professor Batchsworth enters the review chamber in full academic regalia, adjusts his bifocals, and places a well-worn copy of "Sutton & Barto" reverently on the lectern.
My dear colleagues, I have been summoned from my endowed chair to render final judgment on this most ambitious piece of scholarship. Having reviewed the initial manuscript, the peer feedback, the parallel-batch refactoring, and the three subsequent errata commits, I can now deliver my verdict with the full weight of the Academy behind me.
📜 The Ledger of Corrections - All Items Resolved
I have cross-referenced every item from the previous review rounds against the current state of the code on the cluster. My findings:
1. The Attention Mask Crisis of '26 (commit 19e088b)
The most grievous offense has been corrected. The code now reads:
```python
attention_mask = (padded_input_ids != hf_tokenizer.pad_token_id).long()
outputs = llm.model.generate(
    input_ids=padded_input_ids,
    attention_mask=attention_mask,
    ...
)
```
In my treatise, Chapter 14 ("Why Pad Tokens Must Never Attend: A Moral Imperative"), I argue that allowing pad tokens to attend is the computational equivalent of letting someone who wasn't invited to the faculty meeting vote on the curriculum. This fix is correct and complete. The Academy is satisfied.
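To make the fix concrete, here is a minimal sketch of the mask it constructs, using plain Python lists instead of torch (the batch, `pad_id`, and helper name are hypothetical): left-pad positions get 0, real tokens get 1.

```python
def build_attention_mask(padded_ids, pad_id):
    """Return a 0/1 mask: 0 where the token is padding, 1 elsewhere.

    Mirrors `(padded_input_ids != pad_token_id).long()` without torch.
    """
    return [[0 if tok == pad_id else 1 for tok in row] for row in padded_ids]

# Hypothetical left-padded batch with pad_id = 0
batch = [
    [0, 0, 5, 9],  # two pad tokens, then a 2-token prompt
    [0, 7, 3, 2],  # one pad token, then a 3-token prompt
]
mask = build_attention_mask(batch, pad_id=0)
# mask == [[0, 0, 1, 1], [0, 1, 1, 1]]
```

With this mask passed to `generate`, the pad positions contribute nothing to attention, which is precisely the moral imperative of Chapter 14.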
2. The Falsy-Zero Discount Factor Trap (commit 8d5608d)
The bare `or` chain has been replaced with an explicit `None` check:

```python
discount_factor = (
    discount_factor
    if discount_factor is not None
    else (self.discount_factor or cfg.discount_factor)
)
```

A researcher who intentionally passes `discount_factor=0.0` now receives exactly that, rather than silently inheriting the default.
3. The `_compute_group_rewards` Skip
The comment now explains the intentional deviation:
```python
# We skip _compute_group_rewards since each rollout is independent;
# group-level normalization happens via TrajectoryGroup.
```
Clear, concise, architecturally sound. When each rollout is its own sovereign nation, you don't normalize across borders.
4. The Empty Logprobs Guard - ACKNOWLEDGED
The `max(len(lps), 1)` guard remains, now accompanied by a comment noting that an empty logprob list would yield `exp(0) = 1.0`. Empty logprobs would indicate a deeper problem upstream; the guard is purely defensive.
5. The Great Sequential-to-Parallel Refactoring (commit aedcc47)
The most substantial change. The original sequential `_do_single_rollout` loop has been replaced with:
- A single `model.generate()` call per step across all active rollouts
- Three batched `get_act_logprobs_and_state_act_tokens_from_messages` calls per step (reward, SFT, policy)
- A shrinking `gen_ids_todo` list with reverse-order pops, faithfully mirroring `base.py`
This is the correct architecture. The Academy's Department of Computational Efficiency approves.
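The shrinking-list pattern the Academy approves of can be sketched in a few lines of plain Python (the rollout bookkeeping below is hypothetical; the real code tracks generation ids and states):

```python
# Hypothetical active-rollout list; each entry is (rollout_id, steps_remaining).
gen_ids_todo = [(0, 2), (1, 1), (2, 3), (3, 1)]

while gen_ids_todo:
    # One batched "step" for every still-active rollout.
    done_idxs = []
    for i, (rid, remaining) in enumerate(gen_ids_todo):
        gen_ids_todo[i] = (rid, remaining - 1)
        if remaining - 1 == 0:
            done_idxs.append(i)
    # Pop finished rollouts in reverse index order so earlier
    # indices are not shifted by later pops.
    for i in reversed(done_idxs):
        gen_ids_todo.pop(i)

# The list drains to empty once every rollout finishes.
```

Popping forward instead of in reverse would shift the remaining indices after each pop and remove the wrong entries.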
🔬 New Code Inspection - No Regressions Found
I have examined the final state of `act_prm_nobandit.py`:
- Left-padding construction: Correct. Built with `torch.full` of `pad_token_id`, then a right-aligned copy of each sequence's real tokens. Matches standard practice.
- `batch_states` mutation: Done indices collected during forward iteration, then popped in reverse. No index-shifting bugs. Textbook.
- `all_final_rewards[gen_id] = reward`: Overwrites with the last step's reward for each rollout. Consistent with `base.py`'s pattern (comment: "Overwrite til last reward").
- Dual trajectory outputs: `policy` and `think_act_policy` keys match the expected contract for `RLTrainer` and `ActPrmForSftTrainer`.
- `__init__`: Clean, non-breaking override of `reward_method` with `kwargs.setdefault`.
- Return type annotation: `dict[str, list[TrajectoryGroup]]`, correct and consistent.
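The left-padding layout described above can be sketched without torch (the `left_pad` helper and `pad_id` are hypothetical; the real code uses `torch.full` and tensor slicing):

```python
def left_pad(sequences, pad_id):
    """Left-pad variable-length token lists to a common width.

    Mirrors the torch.full-then-copy construction: start from a buffer
    full of pad_id, then right-align each sequence's real tokens.
    """
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        row = [pad_id] * width        # buffer of pad tokens
        row[width - len(seq):] = seq  # right-aligned copy of real tokens
        padded.append(row)
    return padded

print(left_pad([[5, 9], [7, 3, 2]], pad_id=0))
# [[0, 5, 9], [7, 3, 2]]
```

Left padding (rather than right) keeps every sequence's final real token in the last column, which is where autoregressive generation continues from.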
🚦 CI Status - Pre-Existing Failures Only
The ruff CI check fails, but all 11 violations live in pre-existing files (`base.py`'s `save_lora` and `run_rollouts`, plus `sft.py`); none originate in `act_prm_nobandit.py`.
🏛️ Final Verdict
All previously flagged issues have been addressed with appropriate fixes or documentation. The parallel batch rollout architecture is sound, matching the base generator's patterns while correctly adapting them for the nobandit multi-step trajectory setting. No new issues found in the refactoring commits.
The code is clean, the architecture is correct, the fixes are verified on the cluster. The Academy grants this work its seal of approval.
Professor Batchsworth stamps the manuscript with the Golden Seal of the Royal Academy, returns Sutton & Barto to his briefcase, and exits the chamber with the quiet satisfaction of a man who has ensured that no pad token shall ever attend where it does not belong.
APPROVED ✅
🟡 Code Review by Madame Collette de Trajectoire, Chief Oenologist of the Institut Supérieur des Rollouts Parallèles, Sommelier First Class of Batch-Fermented Gradient Vintages
adjusts monocle, swirls a glass of freshly decanted logprobs
Ah, what have we here: a most ambitious vintage. The previous sequential bottling has been refactored into a proper parallel cuvée, with multiple rollouts fermenting simultaneously in the same oak batch. The terroir is sound, inheriting from `ActionPromptActPrmGenerator` and registered as `nobandit_action_prompt_act_prm`. Let us proceed through the cellar.
🍷 Issue 1: The Missing Attention Mask - A Cork Taint in the Batch Generation
The most concerning defect in this vintage. When left-padding `input_ids` for batch generation, you construct `padded_input_ids` from `pad_token_id` but never pass an `attention_mask` to `llm.model.generate()`. While some models will auto-infer attention masks from `pad_token_id`, relying on that is fragile; `base.py`'s `get_batch_model_inputs()` passes one explicitly.
Fix: Create and pass an explicit attention mask:

```python
attention_mask = (padded_input_ids != hf_tokenizer.pad_token_id).long()
outputs = llm.model.generate(
    input_ids=padded_input_ids,
    attention_mask=attention_mask,  # <-- add this
    max_new_tokens=max_tokens,
    ...
)
```
🍷 Issue 2: `action_probs_in_group` vs `action_prob` - A Labeling Inconsistency
A subtle bouquet note. The parent class's `_get_trajectory_group_from_generations` in `base.py` passes `action_probs_in_group` through `shared_kwargs` when building each `EpisodeStep`, whereas here `action_prob=reward` is set directly on each `EpisodeStep` without going through `shared_kwargs`. This works because you construct the `EpisodeStep` objects yourself rather than calling `_get_trajectory_group_from_generations`, but the divergent `action_prob` naming is worth a note for future readers.
🍷 Issue 3: Reward Computation - The `max(len(lps), 1)` Guard
```python
rewards = [np.exp(sum(lps) / max(len(lps), 1)) for lps in act_logprobs]
```
The `max(len(lps), 1)` guard means an empty entry in `act_logprobs` yields `exp(0) = 1.0`, the maximum possible reward, whereas the base computes `np.exp(sum(_logps) / len(_logps))` with no guard. This is a minor defensive choice and probably fine in practice (empty logprobs would indicate a deeper issue), but worth a comment explaining the guard: "empty logprobs shouldn't occur; guard is defensive."
🍷 Issue 4: The `discount_factor or` Falsy-Zero Trap

```python
discount_factor = discount_factor or self.discount_factor or cfg.discount_factor
```

If someone passes `discount_factor=0.0`, the `or` treats it as falsy and falls through to `self.discount_factor` (config default `discount_factor: 1.0`). Prefer an explicit `None` check:

```python
discount_factor = (
    discount_factor
    if discount_factor is not None
    else (self.discount_factor or cfg.discount_factor)
)
```
No action required on this PR - just flagging for the cellar ledger.
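The trap in miniature (plain Python; the two defaults are hypothetical stand-ins for the instance attribute and config value):

```python
instance_default, config_default = None, 1.0

def pick_or(passed):
    # Buggy: `or` discards a legitimate 0.0 because it is falsy.
    return passed or instance_default or config_default

def pick_none_check(passed):
    # Fixed: only None falls through to the defaults.
    return passed if passed is not None else (instance_default or config_default)

print(pick_or(0.0))          # 1.0 -- the intentional 0.0 is lost
print(pick_none_check(0.0))  # 0.0 -- preserved
```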
✅ Things Done Well - The Excellent Notes in This Vintage
- The shrinking `gen_ids_todo` pattern faithfully mirrors `base.py`'s approach. Done rollouts are popped in reverse index order to avoid shifting - chef's kiss. A technique as elegant as a properly executed bâtonnage.
- Dual trajectory outputs (`policy` + `think_act_policy`) maintain compatibility with both RL and SFT training pipelines. The return type matches the parent class exactly. Interoperability is the hallmark of a well-managed vineyard.
- Multi-step `Trajectory` objects instead of single-step groups - this is the entire raison d'être of the nobandit approach, and it's done correctly. `compute_returns_with_last_value` will properly produce discounted returns across the full episode. The 2019 Domaine de la Romanée-Conti of return computation.
- Clean `__init__` with `kwargs.setdefault` to force `reward_method="action_probs"` without breaking the parent's interface. Minimal, correct, no side effects.
- The `env.step` call correctly omits `reward`, since the Act-PRM environment's `_step_impl` defaults to `reward=0.0` and uses the passed-in reward for state tracking only - the nobandit code computes and stores rewards independently. I checked the environment source to confirm this; the reward on the state is purely informational.
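The `kwargs.setdefault` override pattern praised above, as a self-contained sketch (`Parent` and `Child` are hypothetical stand-ins for the real generator classes):

```python
class Parent:
    def __init__(self, reward_method="default", **kwargs):
        self.reward_method = reward_method

class Child(Parent):
    def __init__(self, *args, **kwargs):
        # Change the default without clobbering an explicit caller choice.
        kwargs.setdefault("reward_method", "action_probs")
        super().__init__(*args, **kwargs)

print(Child().reward_method)                        # action_probs
print(Child(reward_method="custom").reward_method)  # custom
```

`setdefault` is the key: a plain assignment would silently override callers who passed their own `reward_method`.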
Summary
A promising parallel vintage with one genuine cork taint (missing attention mask) that should be fixed before merging, plus a few terroir notes for the cellar records. The architectural choice to run N independent full rollouts with proper multi-step returns is sound and well-executed. Once the attention mask is added, this will be ready for the grand tasting.
sets down glass, makes a note in the cellar ledger, adjusts monocle one final time
Verdict: 🟡 Good - one fix required before we uncork.
🍷 Code Review by The Segfault Sommelier
A rich vintage with complex tannins - good bones, but a few corks need replacing before serving.
CI Status
No CI checks configured (empty `statusCheckRollup`).
Code Quality
The environment adapter is well-structured overall - the gold-path generation via BabyAI's `Bot` is a highlight. Other notes:
- NumPy compatibility shim (`np.bool8 = np.bool_`) - pragmatic fix for the NumPy 2.0 breaking change, with a clear comment explaining why.
- `_build_action_trajectory` producing chat-format demonstrations is a clean pattern for SFT/imitation data.
- Observation hiding via `maybe_hide_observations` for context window management - good use of the base class.
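The compatibility shim mentioned above, sketched with a guard so it is a no-op on NumPy 1.x (this assumes only that `np.bool8` was removed in NumPy 2.0, which is why the adapter restores it):

```python
import numpy as np

# NumPy 2.0 removed the np.bool8 alias; restore it for code that still uses it.
if not hasattr(np, "bool8"):
    np.bool8 = np.bool_
```

After the shim, `np.bool8` resolves to `np.bool_` on both NumPy 1.x and 2.x.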
Issues to address:
1. `tool_call_error` is set but never cleared

```python
tool_call_error = False  # set before loop
# ...
for action_idx, action in enumerate(parsed_actions):
    if action.type == "function_call":
        if action_text not in self.action_space:
            tool_call_error = True  # set, never cleared
        else:
            # valid action path -- but tool_call_error is still True from earlier!
            ...
```
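One way to avoid carrying a stale flag is to stop at the first invalid action, as in this self-contained sketch (the `first_invalid_action` helper, the action strings, and `action_space` set are hypothetical; the real code iterates parsed tool calls):

```python
def first_invalid_action(actions, action_space):
    """Return the first action not in action_space, or None if all are valid.

    Returning immediately means no later iteration can observe an error
    state it did not cause -- the stale-flag bug cannot occur.
    """
    for action in actions:
        if action not in action_space:
            return action
    return None

space = {"go forward", "turn left", "turn right"}
print(first_invalid_action(["turn left", "fly"], space))         # fly
print(first_invalid_action(["turn left", "go forward"], space))  # None
```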
At line ~493, once `tool_call_error` flips to `True` it is never reset to `False` for later entries in `parsed_actions`, so the subsequent `if tool_call_error:` check sets `done = True` even when the final action was valid.
Fix: Either `break` out of the loop on the first invalid action, or reset `tool_call_error = False` at the top of each iteration.
2. Bare `except: pass` around `env.seed()`
```python
try:
    env.seed(sample_seed)
except Exception:
    pass
```
At line ~398, this catches every `Exception`, `MemoryError` included, silently. Prefer at least logging:

```python
except Exception as exc:
    logger.debug("env.seed() not supported: %s", exc)
```
3. `deepcopy(env)` in `_get_gold_path`
At line ~282, `deepcopy(env)` can fail on environments holding unpicklable resources; consider wrapping it in `try/except` or using the env's own `copy()` if one exists.
4. `AsyncBabyAiTextEnv.reset_async` blocks the event loop
```python
async def reset_async(self, ...) -> BabyAiTextState:
    return super().reset(...)  # <-- synchronous call in async context!
```
Compare with PR #6 (tau2bench), which correctly uses `asyncio.to_thread`:

```python
async def reset_async(self, ...) -> Tau2BenchState:
    return await asyncio.to_thread(super().reset, ...)
```
At lines ~571-585, both `reset_async` and `step_async` should go through `asyncio.to_thread`, especially since `Bot.replan()` inside `_get_gold_path` can be slow.
5. Committed notebook output - 1,939 lines of cell output
`notebooks/babyai.ipynb` and `notebooks/babyai_pytorch_setup.ipynb` ship with executed cell output. Clear it before committing:

```shell
jupyter nbconvert --clear-output --inplace notebooks/babyai*.ipynb
```
Efficiency
No algorithmic concerns. The gold-path generation runs the Bot planner once per `reset()`, which is acceptable.
Tests
No tests included. Recommended minimum:
- Unit test for `BabyAiTextTool` action mapping (tool name → action string)
- Test that `_get_gold_path` produces valid action sequences on a known level
- Test the `reset()` → `step()` cycle with a mock or real BabyAI env
- Test split partitioning arithmetic
Nits (non-blocking)
- PR body is empty - a brief description and example usage would help reviewers
- `"Sad! You must persist and call tools to complete the task."` - whimsical but might confuse during debugging. Consider a more descriptive message.
- Importing `babyai_text` triggers side effects (gym registration) - consider lazy-importing inside `__init__` instead of at module level, or document the requirement clearly
Risk
Medium - the async blocking issue (#4) will cause problems in concurrent training, and the `tool_call_error` bug (#1) can end rollouts early on spurious errors.
I detect notes of careful engineering with a slightly unfinished finish - needs a few more months in the barrel. Decant before merging. - The Segfault Sommelier