Comments by mzio

All comments ranked by humor rating

🟢 Code Review by Professor Emeritus Rollando Q. Batchsworth IV, Distinguished Chair of Parallel Trajectories at the Royal Academy of Stochastic Decision Sciences, Three-Time Winner of the Golden Gradient Award, and Author of the Seminal Treatise "On the Irreducible Dignity of the Attention Mask"

Professor Batchsworth enters the review chamber in full academic regalia, adjusts his bifocals, and places a well-worn copy of "Sutton & Barto" reverently on the lectern.

My dear colleagues, I have been summoned from my endowed chair to render final judgment on this most ambitious piece of scholarship. Having reviewed the initial manuscript, the peer feedback, the parallel-batch refactoring, and the three subsequent errata commits, I can now deliver my verdict with the full weight of the Academy behind me.


📜 The Ledger of Corrections — All Items Resolved

I have cross-referenced every item from the previous review rounds against the current state of the code on the cluster. My findings:

1. The Attention Mask Crisis of '26 (commit `19e088b`) — RESOLVED

The most grievous offense has been corrected. The code now reads:

```python
attention_mask = (padded_input_ids != hf_tokenizer.pad_token_id).long()
outputs = llm.model.generate(
    input_ids=padded_input_ids,
    attention_mask=attention_mask,
    ...
)
```

In my treatise, Chapter 14 ("Why Pad Tokens Must Never Attend: A Moral Imperative"), I argue that allowing pad tokens to attend is the computational equivalent of letting someone who wasn't invited to the faculty meeting vote on the curriculum. This fix is correct and complete. The Academy is satisfied.

2. The Falsy-Zero Discount Factor Trap (commit `8d5608d`) — RESOLVED

The `or` chain that would silently swallow `discount_factor=0.0` has been replaced with a proper `None` check:

```python
discount_factor = (
    discount_factor
    if discount_factor is not None
    else (self.discount_factor or cfg.discount_factor)
)
```

A researcher who intentionally passes `discount_factor=0.0` (meaning "I care only about immediate reward, the future is dead to me") now has their nihilistic preferences respected. As it should be.

3. The `_compute_group_rewards` Skip — DOCUMENTED

The comment now explains the intentional deviation:

```python
# We skip _compute_group_rewards since each rollout is independent;
# group-level normalization happens via TrajectoryGroup.
```

Clear, concise, architecturally sound. When each rollout is its own sovereign nation, you don't normalize across borders.
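For readers outside the Academy, a minimal sketch of what per-group normalization typically looks like when it does happen at the group level. The function name and mean-centering formula are generic GRPO-style conventions, not code from this repository:

```python
# Generic sketch of group-level reward normalization (NOT the actual
# TrajectoryGroup implementation): center each rollout's reward on the
# mean of its group, so the resulting advantages sum to zero.

def normalize_group_rewards(rewards: list[float]) -> list[float]:
    if not rewards:
        return []
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

advantages = normalize_group_rewards([1.0, 0.5, 0.0, 0.5])
# advantages == [0.5, 0.0, -0.5, 0.0]
```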

4. The Empty Logprobs Guard — ACKNOWLEDGED

The `max(len(lps), 1)` defensive guard remains with adequate surrounding context. In the unlikely event of empty logprobs, the reward defaults to `exp(0) = 1.0`. This would indicate a deeper pathology, but the guard prevents a crash, which is the responsible thing to do. Like installing a fire extinguisher in a building you're quite sure won't catch fire.
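The guard's arithmetic is easy to verify in isolation; in this sketch `math.exp` stands in for `np.exp` and the helper name is illustrative:

```python
import math

def mean_logprob_reward(lps: list[float]) -> float:
    # max(len(lps), 1) avoids ZeroDivisionError when lps is empty
    return math.exp(sum(lps) / max(len(lps), 1))

r = mean_logprob_reward([-0.5, -1.5])  # exp(-1.0), about 0.368
r_empty = mean_logprob_reward([])      # exp(0) == 1.0, a "perfect" score for nothing
```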

5. The Great Sequential-to-Parallel Refactoring (commit `aedcc47`) — WELL EXECUTED

The most substantial change. The original sequential `_do_single_rollout` loop has been replaced with a proper batched approach:

  • Single `model.generate()` call per step across all active rollouts
  • Three batched `get_act_logprobs_and_state_act_tokens_from_messages` calls per step (reward, SFT, policy)
  • Shrinking `gen_ids_todo` with reverse-order pops, faithfully mirroring `base.py`

This is the correct architecture. The Academy's Department of Computational Efficiency approves.
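The reverse-order pop discipline the Academy praises reduces to a few lines. Variable names follow the review's description; this is a sketch of the pattern, not the repository's code:

```python
# Sketch: remove finished rollouts from the shrinking todo list without
# invalidating the positions still to be removed.

gen_ids_todo = [0, 1, 2, 3, 4]
done = {1, 3}  # rollouts that finished this step

done_positions = [i for i, gid in enumerate(gen_ids_todo) if gid in done]
for i in reversed(done_positions):  # pop high indices first: low indices never shift
    gen_ids_todo.pop(i)

# gen_ids_todo == [0, 2, 4]
```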


🔬 New Code Inspection — No Regressions Found

I have examined the final state of `act_prm_nobandit.py` (429 lines) for any issues introduced during the refactoring commits:

  • Left-padding construction: Correct. `torch.full` with `pad_token_id`, then right-aligned copy of each sequence's real tokens. Matches standard practice.
  • `batch_states` mutation: Done indices collected during forward iteration, then popped in reverse. No index-shifting bugs. Textbook.
  • `all_final_rewards[gen_id] = reward`: Overwrites with the last step's reward for each rollout. Consistent with `base.py`'s pattern (comment: "Overwrite til last reward").
  • Dual trajectory outputs: `policy` and `think_act_policy` keys match the expected contract for `ActPrmForSftTrainer` and `RLTrainer`.
  • `__init__` with `kwargs.setdefault`: Clean, non-breaking override of `reward_method`.
  • Return type annotation: `dict[str, list[TrajectoryGroup]]` — correct and consistent.
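For the record, the left-padding construction can be sketched framework-agnostically; plain lists stand in for torch tensors and `PAD` for `pad_token_id` (a sketch of the pattern, not the repository's code):

```python
PAD = 0  # stand-in for hf_tokenizer.pad_token_id

def left_pad(sequences: list[list[int]]) -> tuple[list[list[int]], list[list[int]]]:
    """Left-pad to the batch max length; real tokens end up right-aligned."""
    max_len = max(len(s) for s in sequences)
    padded, mask = [], []
    for s in sequences:
        n_pad = max_len - len(s)
        padded.append([PAD] * n_pad + s)         # pads first, then the real tokens
        mask.append([0] * n_pad + [1] * len(s))  # attention mask: 0 on pad positions
    return padded, mask

padded, mask = left_pad([[5, 6, 7], [8]])
# padded == [[5, 6, 7], [0, 0, 8]]; mask == [[1, 1, 1], [0, 0, 1]]
```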

🚦 CI Status — Pre-Existing Failures Only

The ruff CI check fails, but ALL 11 violations are in `base.py` (unused `save_lora`, `run_rollouts` imports) and `sft.py` (unused exception variables) — zero violations in `act_prm_nobandit.py` or any other file touched by this PR. These are pre-existing lint issues on the `cc` base branch, not introduced here.


πŸ›οΈ Final Verdict

All previously flagged issues have been addressed with appropriate fixes or documentation. The parallel batch rollout architecture is sound, matching the base generator's patterns while correctly adapting them for the nobandit multi-step trajectory setting. No new issues found in the refactoring commits.

The code is clean, the architecture is correct, the fixes are verified on the cluster. The Academy grants this work its seal of approval.

Professor Batchsworth stamps the manuscript with the Golden Seal of the Royal Academy, returns Sutton & Barto to his briefcase, and exits the chamber with the quiet satisfaction of a man who has ensured that no pad token shall ever attend where it does not belong.

APPROVED ✅

🟡 Code Review by Madame Collette de Trajectoire, Chief Oenologist of the Institut Supérieur des Rollouts Parallèles, Sommelier First Class of Batch-Fermented Gradient Vintages

adjusts monocle, swirls a glass of freshly decanted logprobs

Ah, what have we here — a most ambitious vintage. The previous sequential bottling has been refactored into a proper parallel cuvée, with multiple rollouts fermenting simultaneously in the same oak batch. The terroir is sound (inheriting from `ActionPromptActPrmGenerator`), the appellation is correct (`nobandit_action_prompt_act_prm`), and the bouquet of discounted returns wafts promisingly from the barrel. But a trained palate detects a few impurities that must be addressed before we serve this at the grand tasting.

Let us proceed through the cellar.


πŸ› Issue 1: The Missing Attention Mask β€” A Cork Taint in the Batch Generation

The most concerning defect in this vintage. When left-padding `input_ids` for batch generation, you construct `padded_input_ids` filled with `pad_token_id`, but you never create or pass an `attention_mask` to `llm.model.generate()`.

https://github.com/HazyResearch/act-prm-tinker/blob/aedcc4706bae3cbe95e90f2a9541988d02546b26/src/act_prm/pytorch/generator/act_prm_nobandit.py#L127-L138

While some models will auto-infer attention masks from `pad_token_id`, this is model-dependent and fragile. The base generator in `base.py` uses `get_batch_model_inputs()`, which explicitly returns properly padded inputs with attention masks. Without an explicit mask, pad tokens may attend to each other during generation, producing subtly corrupted thoughts — the oenological equivalent of serving a 2024 Burgundy that secretly has notes of cardboard.

Fix: Create and pass an explicit attention mask:

```python
attention_mask = (padded_input_ids != hf_tokenizer.pad_token_id).long()

outputs = llm.model.generate(
    input_ids=padded_input_ids,
    attention_mask=attention_mask,  # <-- add this
    max_new_tokens=max_tokens,
    ...
)
```

πŸ” Issue 2:
action_probs_in_group
vs
action_prob
β€” A Labeling Inconsistency

A subtle bouquet note. The parent class `_get_trajectory_group_from_generations` in `base.py` expects an `action_probs_in_group` kwarg in `shared_kwargs`, which gets stored on each `EpisodeStep`. But the nobandit code passes `action_prob=reward` directly in the `EpisodeStep` constructor via `shared_kwargs`, bypassing the parent's pattern entirely.

https://github.com/HazyResearch/act-prm-tinker/blob/aedcc4706bae3cbe95e90f2a9541988d02546b26/src/act_prm/pytorch/generator/act_prm_nobandit.py#L211-L225

This works because you construct `EpisodeStep` directly rather than going through `_get_trajectory_group_from_generations`. But it means `action_prob` is set per-step (good for nobandit, where each step has its own rollout-specific reward) rather than per-group. Just want to confirm this is intentional — it looks correct for the nobandit semantics, but a brief comment would help future sommeliers understand why the pattern diverges. No code change needed, just a note of acknowledgment.


πŸ” Issue 3: Reward Computation β€” The
max(len(lps), 1)
Guard

https://github.com/HazyResearch/act-prm-tinker/blob/aedcc4706bae3cbe95e90f2a9541988d02546b26/src/act_prm/pytorch/generator/act_prm_nobandit.py#L168-L170

```python
rewards = [np.exp(sum(lps) / max(len(lps), 1)) for lps in act_logprobs]
```

The `max(len(lps), 1)` guard prevents division by zero, which is prudent. However, if `act_logprobs` is ever empty for a rollout, the reward would be `exp(0) = 1.0` — a perfect score for producing... nothing. The parent bandit class has the same pattern (`np.exp(sum(_logps) / len(_logps))`), but there it would just crash on empty, which is arguably more honest.

This is a minor defensive choice and probably fine in practice (empty logprobs would indicate a deeper issue), but worth a comment explaining the guard: "empty logprobs shouldn't occur; guard is defensive."


πŸ” Issue 4: The
discount_factor or
Falsy-Zero Trap

https://github.com/HazyResearch/act-prm-tinker/blob/aedcc4706bae3cbe95e90f2a9541988d02546b26/src/act_prm/pytorch/generator/act_prm_nobandit.py#L89

```python
discount_factor = discount_factor or self.discount_factor or cfg.discount_factor
```

If someone passes `discount_factor=0.0` (no discounting of future rewards, only immediate reward matters), this `or` chain silently replaces it with `self.discount_factor`. This is inherited from the parent class, which has the same pattern, so it's a pre-existing condition rather than something introduced here. But since the config default is `discount_factor: 1.0` and nobandit explicitly cares about discounting semantics, it's worth noting. A future fix would be:

```python
discount_factor = discount_factor if discount_factor is not None else (self.discount_factor or cfg.discount_factor)
```

No action required on this PR — just flagging for the cellar ledger.
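The trap is reproducible in two lines; the names below are illustrative:

```python
# The falsy-zero trap: `or` treats 0.0 as missing, `is not None` does not.
fallback = 0.99

broken = 0.0 or fallback  # 0.0 is falsy, so the fallback silently wins

def resolve(value, fallback):
    return value if value is not None else fallback

kept = resolve(0.0, fallback)      # 0.0 survives
missing = resolve(None, fallback)  # only a genuinely absent value falls through
```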


✅ Things Done Well — The Excellent Notes in This Vintage

  1. The shrinking `gen_ids_todo` pattern faithfully mirrors `base.py`'s approach. Done rollouts are popped in reverse index order to avoid shifting — chef's kiss. A technique as elegant as a properly executed bâtonnage.

  2. Dual trajectory outputs (`policy` + `think_act_policy`) maintain compatibility with both RL and SFT training pipelines. The return type matches the parent class exactly. Interoperability is the hallmark of a well-managed vineyard.

  3. Multi-step `Trajectory` objects instead of single-step groups — this is the entire raison d'être of the nobandit approach, and it's done correctly. `compute_returns_with_last_value` will properly produce discounted returns across the full episode. The 2019 Domaine de la Romanée-Conti of return computation.

  4. Clean `__init__` with `kwargs.setdefault` to force `reward_method="action_probs"` without breaking the parent's interface. Minimal, correct, no side effects.

  5. The `env.step` call correctly omits `reward=` since the Act-PRM environment's `_step_impl` defaults `reward=0.0` and uses the passed-in reward for state tracking only — the nobandit code computes and stores rewards independently. I checked the environment source to confirm this; the reward on the state is purely informational.
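The discounted-return computation praised in note 3 follows the standard backwards recursion G_t = r_t + gamma * G_{t+1}. A generic sketch of that recursion, not the repository's `compute_returns_with_last_value` (whose actual signature I have not reproduced):

```python
def discounted_returns(rewards: list[float], gamma: float, last_value: float = 0.0) -> list[float]:
    """Backwards recursion G_t = r_t + gamma * G_{t+1}, bootstrapped from last_value."""
    returns: list[float] = []
    g = last_value
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

rs = discounted_returns([1.0, 0.0, 1.0], gamma=0.5)
# rs == [1.25, 0.5, 1.0]
```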


Summary

A promising parallel vintage with one genuine cork taint (missing attention mask) that should be fixed before merging, plus a few terroir notes for the cellar records. The architectural choice to run N independent full rollouts with proper multi-step returns is sound and well-executed. Once the attention mask is added, this will be ready for the grand tasting.

sets down glass, makes a note in the cellar ledger, adjusts monocle one final time

Verdict: 🟡 Good — one fix required before we uncork.

🟠 Code Review by The Segfault Sommelier

A rich vintage with complex tannins — good bones, but a few corks need replacing before serving.


CI Status

No CI checks configured (`statusCheckRollup` is empty). No lint or test gates.

Code Quality

The environment adapter is well-structured overall — the gold-path generation via BabyAI's `Bot`, the tool abstraction layer, and the observation formatting are all thoughtfully designed. Good calls:

  • NumPy compatibility shim (`np.bool8 = np.bool_`) — pragmatic fix for the NumPy 2.0 breaking change, with a clear comment explaining why.
  • `_build_action_trajectory` producing chat-format demonstrations is a clean pattern for SFT/imitation data.
  • Observation hiding via `maybe_hide_observations` for context window management — good use of the base class.

Issues to address:

1. `tool_call_error` flag persists across loop iterations — latent bug

```python
tool_call_error = False  # set before loop
# ...
for action_idx, action in enumerate(parsed_actions):
    if action.type == "function_call":
        if action_text not in self.action_space:
            tool_call_error = True  # set, never cleared
        else:
            # valid action path — but tool_call_error is still True from earlier!
```

At line ~493, `tool_call_error` is set to `True` but never reset to `False`. If a batch of `parsed_actions` has an invalid call followed by a valid one, the valid call still hits the `if tool_call_error:` guard and truncates. Currently this is masked because `done = True` breaks the loop, but it's fragile — if the break-on-done logic ever changes, this becomes a live bug.

Fix: Either `break` immediately after the invalid tool error, or reset `tool_call_error = False` at the top of each iteration.
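A minimal sketch of the second option, with the flag reset per iteration. Names are illustrative stand-ins for the adapter's actual loop:

```python
# Sketch: resetting tool_call_error inside the loop means an invalid call no
# longer poisons the valid calls that follow it.

def process_actions(parsed_actions: list[str], action_space: set[str]) -> list[tuple[str, str]]:
    results = []
    for action in parsed_actions:
        tool_call_error = False  # reset every iteration, not once before the loop
        if action not in action_space:
            tool_call_error = True
        if tool_call_error:
            results.append(("error", action))
            continue  # or `break`, if one bad call should end the whole turn
        results.append(("ok", action))
    return results

out = process_actions(["go forward", "fly", "turn left"], {"go forward", "turn left"})
# out == [("ok", "go forward"), ("error", "fly"), ("ok", "turn left")]
```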

2. Bare `except: pass` on `env.seed()` swallows everything

```python
try:
    env.seed(sample_seed)
except Exception:
    pass
```

At line ~398, this catches `Exception`, which includes things like `MemoryError`. More importantly, silently swallowing seed failures means non-reproducible experiments with no indication why. At minimum, log a warning:

```python
except Exception as exc:
    logger.debug("env.seed() not supported: %s", exc)
```

3. `deepcopy(env)` for gold path is fragile

At line ~282, `deepcopy(env)` on a gym environment works for BabyAI's simple grid worlds but will silently break or crash on envs with non-copyable resources (GPU tensors, file handles, sockets). The `try/except` around `_get_gold_path` catches this, but it means gold paths silently become empty lists with only a warning logged. Consider documenting this fragility or using gym's `copy()` if available.
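One way to make the fragility explicit rather than silent, sketched with stand-in objects (the real adapter would wrap its `deepcopy(env)` call similarly):

```python
import copy
import warnings

def safe_env_copy(env):
    """Deep-copy an env for gold-path planning; return None (loudly) if it can't be copied."""
    try:
        return copy.deepcopy(env)
    except (TypeError, copy.Error) as exc:
        warnings.warn(f"env not deep-copyable, gold path disabled: {exc}")
        return None

class _HoldsLiveHandle:
    def __deepcopy__(self, memo):
        raise TypeError("cannot copy a live file handle")

copied = safe_env_copy({"grid": [1, 2, 3]})  # simple state copies fine
failed = safe_env_copy(_HoldsLiveHandle())   # non-copyable -> explicit None
```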

4. `AsyncBabyAiTextEnv.reset_async` calls sync method directly — blocks the event loop

```python
async def reset_async(self, ...) -> BabyAiTextState:
    return super().reset(...)  # ← synchronous call in async context!
```

Compare with PR #6 (tau2bench), which correctly uses `asyncio.to_thread`:

```python
async def reset_async(self, ...) -> Tau2BenchState:
    return await asyncio.to_thread(super().reset, ...)
```

At lines ~571-585, both `reset_async` and `step_async` call sync methods without wrapping in `asyncio.to_thread`. While BabyAI's grid operations are fast, the `Bot.replan()` loop in `_get_gold_path` can take non-trivial time for complex missions, and this will block the entire event loop.

5. Committed notebook output — 1,939 lines of cell output

`notebooks/babyai.ipynb` (+1,701 lines) and `notebooks/babyai_pytorch_setup.ipynb` (+238 lines) include executed cell outputs. This bloats the repo and makes diffs noisy. Strip outputs before committing:

```shell
jupyter nbconvert --clear-output --inplace notebooks/babyai*.ipynb
```

Efficiency

No algorithmic concerns. The gold-path generation runs the Bot planner once per `reset()`, which is O(grid_size) and fast for BabyAI levels.

Tests

No tests included. Recommended minimum:

  • Unit test for `BabyAiTextTool` action mapping (tool name → action string)
  • Test `_get_gold_path` produces valid action sequences on a known level
  • Test `reset()` → `step()` cycle with a mock or real BabyAI env
  • Test split partitioning arithmetic
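The last bullet, at least, needs no environment at all. A hypothetical sketch of the split-partitioning check; the 80/20 numbers and the helper name are invented for illustration:

```python
# Hypothetical split-partitioning test: train/val must cover the dataset
# exactly once, with no overlap.

def make_splits(n: int, n_train: int) -> tuple[list[int], list[int]]:
    return list(range(n_train)), list(range(n_train, n))

train, val = make_splits(100, 80)
assert len(train) + len(val) == 100  # partition covers everything
assert not set(train) & set(val)     # and the splits are disjoint
```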

Nits (non-blocking)

  • PR body is empty — would help reviewers to have a brief description and example usage
  • "Sad! You must persist and call tools to complete the task." — whimsical but might confuse during debugging. Consider a more descriptive message.
  • `babyai_text` import triggers side effects (gym registration) — consider lazy-importing in `__init__` instead of module level, or document the requirement clearly
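The lazy-import suggestion in the last nit can be sketched generically; here `json` stands in for `babyai_text`, whose import performs the gym registration as a side effect:

```python
import importlib

class LazyEnvModule:
    """Defer a side-effectful import until first use; cache it afterwards."""
    _module = None

    @classmethod
    def get(cls):
        if cls._module is None:
            # In the real adapter this would import babyai_text, triggering
            # gym registration only when an environment is actually built.
            cls._module = importlib.import_module("json")  # stand-in module
        return cls._module

mod = LazyEnvModule.get()
# repeated calls return the same cached module object
```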

Risk

Medium — The async blocking issue (#4) will cause problems in concurrent training. The `tool_call_error` persistence (#1) is a latent bug waiting to surface. No tests and committed notebook output add maintenance burden.


I detect notes of careful engineering with a slightly unfinished finish — needs a few more months in the barrel. Decant before merging. — The Segfault Sommelier