While reading the reward design in the documentation/paper, I noticed a potential sign inconsistency in Eq. (8). The node-death case (when $E_{n,t} \le 0$ for some node $n$) receives the largest positive value (2), even though it is described as a penalty.
Equation (8) as written (the screenshot is not reproduced here) assigns the reward value 2 to the node-death case and the values 1.1 and 1 to the alive cases.
Why this is problematic (for standard RL maximization)
Most RL algorithms (e.g., Q-learning, policy gradient) maximize expected return. Assigning the largest positive value to node death increases the action values of behaviors that cause death, which contradicts the stated intention of a penalty: under maximization, 2 is better than 1.1 and 1.
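To make this concrete, here is a minimal tabular Q-learning sketch on a hypothetical one-state MDP (only the reward values 2 / 1.1 / 1 are taken from Eq. (8) as written; the action labels are my own for illustration). A maximizing agent converges to the action associated with node death:

```python
import numpy as np

# Hypothetical one-state MDP; only the reward magnitudes come from Eq. (8).
# Action 0: keeps all nodes alive                  -> reward 1
# Action 1: energy-efficient alive behavior        -> reward 1.1
# Action 2: drains a node to death ("penalty")     -> reward 2 (as written)
rewards = np.array([1.0, 1.1, 2.0])

q = np.zeros(3)
alpha, gamma, eps = 0.1, 0.9, 0.2
rng = np.random.default_rng(0)

for _ in range(10_000):
    # Epsilon-greedy action selection over the three behaviors.
    a = rng.integers(3) if rng.random() < eps else int(np.argmax(q))
    # Single-state problem: the next state is the same state.
    q[a] += alpha * (rewards[a] + gamma * np.max(q) - q[a])

print(q)                  # Q-values end up ordered like the rewards
print(int(np.argmax(q)))  # -> 2: the node-death action wins under maximization
```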
Possible explanations (please clarify which applies)
• R is actually a cost that the algorithm minimizes (e.g., added to a loss).
• There is a sign flip later (e.g., the code uses -R in the update), making the death term effectively -2; see the sketch after this list.
• Typographical error in the equation; the death case should be negative (e.g., -2) or at least strictly smaller than the alive cases.
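For the second possibility, a sign flip in the update would resolve the inconsistency. The following is a hypothetical snippet (not taken from the PyNetSim source) showing how negating the reward makes the written +2 behave as an effective -2:

```python
def td_update(q, state, action, reward, max_next_q,
              alpha=0.1, gamma=0.9, minimize_cost=True):
    """One tabular TD update (hypothetical helper, not PyNetSim code).

    With minimize_cost=True the reward is negated before forming the
    target, so the written +2 death reward acts as an effective -2.
    """
    target = (-reward if minimize_cost else reward) + gamma * max_next_q
    q[state, action] += alpha * (target - q[state, action])
    return q
```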
Suggested fix (if the objective is to maximize reward)
Use a negative value for node death, or make it strictly the smallest return:
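Concretely, keeping the alive-case values from Eq. (8) and flipping only the sign of the death case (the alive-case conditions are abbreviated here, since the screenshot is not reproduced):

$$
R_t =
\begin{cases}
-2, & \text{if } E_{n,t} \le 0 \text{ for some node } n \quad (\text{node death})\\
1.1, & \text{first alive case of Eq. (8)}\\
1, & \text{second alive case of Eq. (8)}
\end{cases}
$$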
Optional shaping alternative:
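One shaping choice that would also work (an assumption on my part, not taken from the paper; $\mathcal{A}_t$, $N$, and $\lambda$ are symbols I introduce here) is to reward the fraction of alive nodes and subtract an explicit death penalty:

$$
R_t = \frac{\lvert \mathcal{A}_t \rvert}{N} - \lambda\, \mathbb{1}\!\left[\exists\, n:\ E_{n,t} \le 0\right], \qquad \lambda > 0,
$$

where $\mathcal{A}_t$ is the set of alive nodes at time $t$ and $N$ is the total number of nodes. This keeps every return from a fully alive network strictly above any return that involves a death, and it degrades smoothly as nodes deplete.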
Request
• Could you please confirm whether Eq. (8) is intended as a cost (to minimize) or a reward (to maximize)?
• If it is a reward to be maximized, would you consider changing the death case to a negative value (or documenting the sign flip in code)?
Thanks for the great work on PyNetSim!