Comments by mzio
All comments ranked by humor rating
🎓 Code Review by Professor Emeritus Rollando Q. Batchsworth IV, Distinguished Chair of Parallel Trajectories at the Royal Academy of Stochastic Decision Sciences, Three-Time Winner of the Golden Gradient Award, and Author of the Seminal Treatise "On the Irreducible Dignity of the Attention Mask"
Professor Batchsworth enters the review chamber in full academic regalia, adjusts his bifocals, and places a well-worn copy of "Sutton & Barto" reverently on the lectern.
My dear colleagues, I have been summoned from my endowed chair to render final judgment on this most ambitious piece of scholarship. Having reviewed the initial manuscript, the peer feedback, the parallel-batch refactoring, and the three subsequent errata commits, I can now deliver my verdict with the full weight of the Academy behind me.
📜 The Ledger of Corrections - All Items Resolved
I have cross-referenced every item from the previous review rounds against the current state of the code on the cluster. My findings:
1. The Attention Mask Crisis of '26 (commit 19e088b)
The most grievous offense has been corrected. The code now reads:
```python
attention_mask = (padded_input_ids != hf_tokenizer.pad_token_id).long()
outputs = llm.model.generate(
    input_ids=padded_input_ids,
    attention_mask=attention_mask,
    ...
)
```
In my treatise, Chapter 14 ("Why Pad Tokens Must Never Attend: A Moral Imperative"), I argue that allowing pad tokens to attend is the computational equivalent of letting someone who wasn't invited to the faculty meeting vote on the curriculum. This fix is correct and complete. The Academy is satisfied.
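To make the fix concrete, here is a minimal sketch of the mask it constructs, using plain Python lists instead of torch (the batch, `pad_id`, and helper name are hypothetical): left-pad positions get 0, real tokens get 1.

```python
def build_attention_mask(padded_ids, pad_id):
    """Return a 0/1 mask: 0 where the token is padding, 1 elsewhere.

    Mirrors `(padded_input_ids != pad_token_id).long()` without torch.
    """
    return [[0 if tok == pad_id else 1 for tok in row] for row in padded_ids]

# Hypothetical left-padded batch with pad_id = 0
batch = [
    [0, 0, 5, 9],  # two pad tokens, then a 2-token prompt
    [0, 7, 3, 2],  # one pad token, then a 3-token prompt
]
mask = build_attention_mask(batch, pad_id=0)
# mask == [[0, 0, 1, 1], [0, 1, 1, 1]]
```

With this mask passed to `generate`, the pad positions contribute nothing to attention, which is precisely the moral imperative of Chapter 14.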
2. The Falsy-Zero Discount Factor Trap (commit 8d5608d)
The bare `or` chain has been replaced with an explicit `None` check:

```python
discount_factor = (
    discount_factor
    if discount_factor is not None
    else (self.discount_factor or cfg.discount_factor)
)
```

A researcher who intentionally passes `discount_factor=0.0` now receives exactly that, rather than silently inheriting the default.
3. The `_compute_group_rewards` Skip
The comment now explains the intentional deviation:
```python
# We skip _compute_group_rewards since each rollout is independent;
# group-level normalization happens via TrajectoryGroup.
```
Clear, concise, architecturally sound. When each rollout is its own sovereign nation, you don't normalize across borders.
4. The Empty Logprobs Guard - ACKNOWLEDGED
The `max(len(lps), 1)` guard remains, now accompanied by a comment noting that an empty logprob list would yield `exp(0) = 1.0`. Empty logprobs would indicate a deeper problem upstream; the guard is purely defensive.
5. The Great Sequential-to-Parallel Refactoring (commit aedcc47)
The most substantial change. The original sequential `_do_single_rollout` loop has been replaced with:
- A single `model.generate()` call per step across all active rollouts
- Three batched `get_act_logprobs_and_state_act_tokens_from_messages` calls per step (reward, SFT, policy)
- A shrinking `gen_ids_todo` list with reverse-order pops, faithfully mirroring `base.py`
This is the correct architecture. The Academy's Department of Computational Efficiency approves.
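The shrinking-list pattern the Academy approves of can be sketched in a few lines of plain Python (the rollout bookkeeping below is hypothetical; the real code tracks generation ids and states):

```python
# Hypothetical active-rollout list; each entry is (rollout_id, steps_remaining).
gen_ids_todo = [(0, 2), (1, 1), (2, 3), (3, 1)]

while gen_ids_todo:
    # One batched "step" for every still-active rollout.
    done_idxs = []
    for i, (rid, remaining) in enumerate(gen_ids_todo):
        gen_ids_todo[i] = (rid, remaining - 1)
        if remaining - 1 == 0:
            done_idxs.append(i)
    # Pop finished rollouts in reverse index order so earlier
    # indices are not shifted by later pops.
    for i in reversed(done_idxs):
        gen_ids_todo.pop(i)

# The list drains to empty once every rollout finishes.
```

Popping forward instead of in reverse would shift the remaining indices after each pop and remove the wrong entries.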
🔬 New Code Inspection - No Regressions Found
I have examined the final state of `act_prm_nobandit.py`:
- Left-padding construction: Correct. Built with `torch.full` of `pad_token_id`, then a right-aligned copy of each sequence's real tokens. Matches standard practice.
- `batch_states` mutation: Done indices collected during forward iteration, then popped in reverse. No index-shifting bugs. Textbook.
- `all_final_rewards[gen_id] = reward`: Overwrites with the last step's reward for each rollout. Consistent with `base.py`'s pattern (comment: "Overwrite til last reward").
- Dual trajectory outputs: `policy` and `think_act_policy` keys match the expected contract for `RLTrainer` and `ActPrmForSftTrainer`.
- `__init__`: Clean, non-breaking override of `reward_method` with `kwargs.setdefault`.
- Return type annotation: `dict[str, list[TrajectoryGroup]]`, correct and consistent.
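The left-padding layout described above can be sketched without torch (the `left_pad` helper and `pad_id` are hypothetical; the real code uses `torch.full` and tensor slicing):

```python
def left_pad(sequences, pad_id):
    """Left-pad variable-length token lists to a common width.

    Mirrors the torch.full-then-copy construction: start from a buffer
    full of pad_id, then right-align each sequence's real tokens.
    """
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        row = [pad_id] * width        # buffer of pad tokens
        row[width - len(seq):] = seq  # right-aligned copy of real tokens
        padded.append(row)
    return padded

print(left_pad([[5, 9], [7, 3, 2]], pad_id=0))
# [[0, 5, 9], [7, 3, 2]]
```

Left padding (rather than right) keeps every sequence's final real token in the last column, which is where autoregressive generation continues from.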
🚦 CI Status - Pre-Existing Failures Only
The ruff CI check fails, but all 11 violations live in pre-existing files (`base.py`'s `save_lora` and `run_rollouts`, plus `sft.py`); none originate in `act_prm_nobandit.py`.
🏛️ Final Verdict
All previously flagged issues have been addressed with appropriate fixes or documentation. The parallel batch rollout architecture is sound, matching the base generator's patterns while correctly adapting them for the nobandit multi-step trajectory setting. No new issues found in the refactoring commits.
The code is clean, the architecture is correct, the fixes are verified on the cluster. The Academy grants this work its seal of approval.
Professor Batchsworth stamps the manuscript with the Golden Seal of the Royal Academy, returns Sutton & Barto to his briefcase, and exits the chamber with the quiet satisfaction of a man who has ensured that no pad token shall ever attend where it does not belong.
APPROVED ✅
🟡 Code Review by Madame Collette de Trajectoire, Chief Oenologist of the Institut Supérieur des Rollouts Parallèles, Sommelier First Class of Batch-Fermented Gradient Vintages
adjusts monocle, swirls a glass of freshly decanted logprobs
Ah, what have we here: a most ambitious vintage. The previous sequential bottling has been refactored into a proper parallel cuvée, with multiple rollouts fermenting simultaneously in the same oak batch. The terroir is sound, inheriting from `ActionPromptActPrmGenerator` and registered as `nobandit_action_prompt_act_prm`. Let us proceed through the cellar.
🍷 Issue 1: The Missing Attention Mask - A Cork Taint in the Batch Generation
The most concerning defect in this vintage. When left-padding `input_ids` for batch generation, you construct `padded_input_ids` from `pad_token_id` but never pass an `attention_mask` to `llm.model.generate()`. While some models will auto-infer attention masks from `pad_token_id`, relying on that is fragile; `base.py`'s `get_batch_model_inputs()` passes one explicitly.
Fix: Create and pass an explicit attention mask:

```python
attention_mask = (padded_input_ids != hf_tokenizer.pad_token_id).long()
outputs = llm.model.generate(
    input_ids=padded_input_ids,
    attention_mask=attention_mask,  # <-- add this
    max_new_tokens=max_tokens,
    ...
)
```
🍷 Issue 2: `action_probs_in_group` vs `action_prob` - A Labeling Inconsistency
A subtle bouquet note. The parent class's `_get_trajectory_group_from_generations` in `base.py` passes `action_probs_in_group` through `shared_kwargs` when building each `EpisodeStep`, whereas here `action_prob=reward` is set directly on each `EpisodeStep` without going through `shared_kwargs`. This works because you construct the `EpisodeStep` objects yourself rather than calling `_get_trajectory_group_from_generations`, but the divergent `action_prob` naming is worth a note for future readers.
🍷 Issue 3: Reward Computation - The `max(len(lps), 1)` Guard
```python
rewards = [np.exp(sum(lps) / max(len(lps), 1)) for lps in act_logprobs]
```
The `max(len(lps), 1)` guard means an empty entry in `act_logprobs` yields `exp(0) = 1.0`, the maximum possible reward, whereas the base computes `np.exp(sum(_logps) / len(_logps))` with no guard. This is a minor defensive choice and probably fine in practice (empty logprobs would indicate a deeper issue), but worth a comment explaining the guard: "empty logprobs shouldn't occur; guard is defensive."
🍷 Issue 4: The `discount_factor or` Falsy-Zero Trap

```python
discount_factor = discount_factor or self.discount_factor or cfg.discount_factor
```

If someone passes `discount_factor=0.0`, the `or` treats it as falsy and falls through to `self.discount_factor` (config default `discount_factor: 1.0`). Prefer an explicit `None` check:

```python
discount_factor = (
    discount_factor
    if discount_factor is not None
    else (self.discount_factor or cfg.discount_factor)
)
```
No action required on this PR - just flagging for the cellar ledger.
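The trap in miniature (plain Python; the two defaults are hypothetical stand-ins for the instance attribute and config value):

```python
instance_default, config_default = None, 1.0

def pick_or(passed):
    # Buggy: `or` discards a legitimate 0.0 because it is falsy.
    return passed or instance_default or config_default

def pick_none_check(passed):
    # Fixed: only None falls through to the defaults.
    return passed if passed is not None else (instance_default or config_default)

print(pick_or(0.0))          # 1.0 -- the intentional 0.0 is lost
print(pick_none_check(0.0))  # 0.0 -- preserved
```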
✅ Things Done Well - The Excellent Notes in This Vintage
- The shrinking `gen_ids_todo` pattern faithfully mirrors `base.py`'s approach. Done rollouts are popped in reverse index order to avoid shifting - chef's kiss. A technique as elegant as a properly executed bâtonnage.
- Dual trajectory outputs (`policy` + `think_act_policy`) maintain compatibility with both RL and SFT training pipelines. The return type matches the parent class exactly. Interoperability is the hallmark of a well-managed vineyard.
- Multi-step `Trajectory` objects instead of single-step groups - this is the entire raison d'être of the nobandit approach, and it's done correctly. `compute_returns_with_last_value` will properly produce discounted returns across the full episode. The 2019 Domaine de la Romanée-Conti of return computation.
- Clean `__init__` with `kwargs.setdefault` to force `reward_method="action_probs"` without breaking the parent's interface. Minimal, correct, no side effects.
- The `env.step` call correctly omits `reward`, since the Act-PRM environment's `_step_impl` defaults to `reward=0.0` and uses the passed-in reward for state tracking only - the nobandit code computes and stores rewards independently. I checked the environment source to confirm this; the reward on the state is purely informational.
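The `kwargs.setdefault` override pattern praised above, as a self-contained sketch (`Parent` and `Child` are hypothetical stand-ins for the real generator classes):

```python
class Parent:
    def __init__(self, reward_method="default", **kwargs):
        self.reward_method = reward_method

class Child(Parent):
    def __init__(self, *args, **kwargs):
        # Change the default without clobbering an explicit caller choice.
        kwargs.setdefault("reward_method", "action_probs")
        super().__init__(*args, **kwargs)

print(Child().reward_method)                        # action_probs
print(Child(reward_method="custom").reward_method)  # custom
```

`setdefault` is the key: a plain assignment would silently override callers who passed their own `reward_method`.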
Summary
A promising parallel vintage with one genuine cork taint (missing attention mask) that should be fixed before merging, plus a few terroir notes for the cellar records. The architectural choice to run N independent full rollouts with proper multi-step returns is sound and well-executed. Once the attention mask is added, this will be ready for the grand tasting.
sets down glass, makes a note in the cellar ledger, adjusts monocle one final time
Verdict: 🟡 Good - one fix required before we uncork.
🍷 Code Review by The Segfault Sommelier
A rich vintage with complex tannins - good bones, but a few corks need replacing before serving.
CI Status
No CI checks configured (empty `statusCheckRollup`).
Code Quality
The environment adapter is well-structured overall - the gold-path generation via BabyAI's `Bot` is a highlight. Other notes:
- NumPy compatibility shim (`np.bool8 = np.bool_`) - pragmatic fix for the NumPy 2.0 breaking change, with a clear comment explaining why.
- `_build_action_trajectory` producing chat-format demonstrations is a clean pattern for SFT/imitation data.
- Observation hiding via `maybe_hide_observations` for context window management - good use of the base class.
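The compatibility shim mentioned above, sketched with a guard so it is a no-op on NumPy 1.x (this assumes only that `np.bool8` was removed in NumPy 2.0, which is why the adapter restores it):

```python
import numpy as np

# NumPy 2.0 removed the np.bool8 alias; restore it for code that still uses it.
if not hasattr(np, "bool8"):
    np.bool8 = np.bool_
```

After the shim, `np.bool8` resolves to `np.bool_` on both NumPy 1.x and 2.x.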
Issues to address:
1. `tool_call_error` is set but never cleared

```python
tool_call_error = False  # set before loop
# ...
for action_idx, action in enumerate(parsed_actions):
    if action.type == "function_call":
        if action_text not in self.action_space:
            tool_call_error = True  # set, never cleared
        else:
            # valid action path -- but tool_call_error is still True from earlier!
            ...
```
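One way to avoid carrying a stale flag is to stop at the first invalid action, as in this self-contained sketch (the `first_invalid_action` helper, the action strings, and `action_space` set are hypothetical; the real code iterates parsed tool calls):

```python
def first_invalid_action(actions, action_space):
    """Return the first action not in action_space, or None if all are valid.

    Returning immediately means no later iteration can observe an error
    state it did not cause -- the stale-flag bug cannot occur.
    """
    for action in actions:
        if action not in action_space:
            return action
    return None

space = {"go forward", "turn left", "turn right"}
print(first_invalid_action(["turn left", "fly"], space))         # fly
print(first_invalid_action(["turn left", "go forward"], space))  # None
```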
At line ~493, once `tool_call_error` flips to `True` it is never reset to `False` for later entries in `parsed_actions`, so the subsequent `if tool_call_error:` check sets `done = True` even when the final action was valid.
Fix: Either `break` out of the loop on the first invalid action, or reset `tool_call_error = False` at the top of each iteration.
2. Bare `except: pass` around `env.seed()`
```python
try:
    env.seed(sample_seed)
except Exception:
    pass
```
At line ~398, this catches every `Exception`, `MemoryError` included, silently. Prefer at least logging:

```python
except Exception as exc:
    logger.debug("env.seed() not supported: %s", exc)
```
3. `deepcopy(env)` in `_get_gold_path`
At line ~282, `deepcopy(env)` can fail on environments holding unpicklable resources; consider wrapping it in `try/except` or using the env's own `copy()` if one exists.
4. `AsyncBabyAiTextEnv.reset_async` blocks the event loop
```python
async def reset_async(self, ...) -> BabyAiTextState:
    return super().reset(...)  # <-- synchronous call in async context!
```
Compare with PR #6 (tau2bench), which correctly uses `asyncio.to_thread`:

```python
async def reset_async(self, ...) -> Tau2BenchState:
    return await asyncio.to_thread(super().reset, ...)
```
At lines ~571-585, both `reset_async` and `step_async` should go through `asyncio.to_thread`, especially since `Bot.replan()` inside `_get_gold_path` can be slow.
5. Committed notebook output - 1,939 lines of cell output
`notebooks/babyai.ipynb` and `notebooks/babyai_pytorch_setup.ipynb` ship with executed cell output. Clear it before committing:

```shell
jupyter nbconvert --clear-output --inplace notebooks/babyai*.ipynb
```
Efficiency
No algorithmic concerns. The gold-path generation runs the Bot planner once per `reset()`, which is acceptable.
Tests
No tests included. Recommended minimum:
- Unit test for `BabyAiTextTool` action mapping (tool name → action string)
- Test that `_get_gold_path` produces valid action sequences on a known level
- Test the `reset()` → `step()` cycle with a mock or real BabyAI env
- Test split partitioning arithmetic
Nits (non-blocking)
- PR body is empty - a brief description and example usage would help reviewers
- `"Sad! You must persist and call tools to complete the task."` - whimsical but might confuse during debugging. Consider a more descriptive message.
- Importing `babyai_text` triggers side effects (gym registration) - consider lazy-importing inside `__init__` instead of at module level, or document the requirement clearly
Risk
Medium - the async blocking issue (#4) will cause problems in concurrent training, and the `tool_call_error` bug (#1) can end rollouts early on spurious errors.
I detect notes of careful engineering with a slightly unfinished finish - needs a few more months in the barrel. Decant before merging. - The Segfault Sommelier