
Conversation

@etterli etterli commented Nov 24, 2025

This implements the proposed optimization in #28345.

Signals contributing to the local controller escalation signal directly factor into the insn_executing signal inside otbn_controller.sv, which in turn drives various commit signals as well as the DMEM request signal (lsu_load_req_o). By cutting (registering) these escalation signals, those paths are shortened, which improves the timing/area results. Modeling this delayed escalation required changes in otbnsim as well as in most of the vseqs.
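
For illustration, a minimal sketch of the cut (not the actual otbn_controller.sv code; the register name and the final insn_executing expression are assumptions, only the source signal names come from the commit message below):

  // Collect the local escalation sources (see the commit message below for
  // the full list) and register them before they reach insn_executing.
  logic local_esc_d, local_esc_q;  // assumed names, for illustration only

  assign local_esc_d = urnd_all_zero | predec_error | lsu_rdata_err |
                       insn_addr_err | rf_base_intg_err;

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) local_esc_q <= 1'b0;
    else         local_esc_q <= local_esc_d;
  end

  // insn_executing (and everything it drives: commit signals, lsu_load_req_o)
  // sees only the registered escalation, so the long combinatorial paths end
  // at a flop and the escalation takes effect one cycle later.
  assign insn_executing = insn_valid & ~local_esc_q;  // assumed expression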

The regression results are very similar to the last weekly run.
I also tested all 11 failing seeds on the current master and they fail there as well. As far as I understand, these fail due to general DV problems in how things are modeled; the pass rate for these tests depends heavily on the seed.

Let me know whether I should squash the commits adapting the vseqs (I left them separated for ease of reviewing).

@vogelpi vogelpi left a comment

Thanks @etterli, the RTL looks good to me. I've left some comments.

Please let me know when you think the PR is ready for review.

@etterli etterli force-pushed the local-escalation-cut branch 5 times, most recently from c0fcb2b to e400a58 Compare December 2, 2025 15:02
@etterli etterli changed the title [otbn,rtl] Register local escalation signals for timing optimization [otbn] Register local escalation signals for timing optimization Dec 2, 2025
@etterli etterli force-pushed the local-escalation-cut branch 2 times, most recently from a9f1e3e to 4c7cb1b Compare December 2, 2025 15:17

etterli commented Dec 2, 2025

I integrated the feedback from @vogelpi and adapted the DV tests to match the delayed escalation behaviour. See also the updated PR description.

@etterli etterli marked this pull request as ready for review December 2, 2025 16:06
@etterli etterli requested a review from a team as a code owner December 2, 2025 16:06
@etterli etterli requested a review from martin-velay December 2, 2025 16:20
@etterli etterli force-pushed the local-escalation-cut branch from 4c7cb1b to 1c448fe Compare December 3, 2025 09:51

@vogelpi vogelpi left a comment

Thanks @etterli for the PR, this looks good!

I've carefully reviewed the RTL, and we would need a DV expert to review the DV code. Most of the commits are very well structured and accurately described, but I would collapse the last 6 commits into a single one. As far as I understand, you have created separate commits for the different vseqs, but this is not really required as we want to fix all the vseqs anyway.

vogelpi commented Dec 4, 2025

CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_core.sv

This PR implements an RTL change to relax the timing in OTBN. It has previously been discussed in the Security WG. This is fine.

endfunction

// Wait until the SW error signal rises
clocking wait_sw_error_rises @(posedge u_otbn_controller.software_err);

Contributor

This is kind of cool, but I think it's a bit unconventional to define a clocking block like this. Would something like this work instead? (Completely untested!)

  clocking error_cb @(posedge clk_i);
    input logic software_err       = u_otbn_controller.software_err;
    input logic fatal_software_err = u_otbn_controller.fatal_software_err;
  endclocking

In otbn_base_vseq, you've currently got

          @(cfg.controller_vif.wait_sw_error_rises);

I think this could then be:

          wait(cfg.controller_vif.error_cb.software_err);

(assuming you don't care if it's already set)

Contributor Author

See comment above where this clocking block is used. It cannot be clocked on posedge of clk_i.

Contributor

I think I understand what you're describing now (after reading the existing code more carefully).

Imagining that there's a posedge of the clock at each integer time, I think the timing looks something like this with the current code:

  • At time 0 a bad thing happens, causing u_otbn_controller.software_err to go high.
  • The clocking block fires, causing @(cfg.controller_vif.wait_sw_error_rises); to complete.
  • We send an error escalation on the vif (also at time zero)
  • The send_err_escalation task sets the wakeup_iss flag then waits until the next clock edge (time 1)
  • We then wait until time 2.

My suggestion wouldn't work! The problem is that we'd only see u_otbn_controller.software_err going high at time 1, which would be too late to send the escalation (because the model might have been stepped already by a fluke of scheduling).

However, it would work if we switched my proposed clocking block to the negedge. Then I think the vseq can become just this:

    fork begin : isolation_fork
      fork
        begin
          otbn_pkg::err_bits_t err_bits;
          string err_type;

          // Wait for a SW error of some sort (we only expect one to happen at once). The clocking
          // block is on the negedge of the clock.
          wait(cfg.controller_vif.error_cb.software_err ||
               cfg.controller_vif.error_cb.fatal_software_err);

          if (cfg.controller_vif.error_cb.fatal_software_err) begin
            err_bits = '{fatal_software: 1'b1, default: 1'b0};
            err_type = "fatal";
          end else begin
            err_bits = '{illegal_insn: 1'b1, default: 1'b0};
            err_type = "non-fatal";
          end

          // Tell the model about the injected error
          `uvm_info(`gfn, $sformatf("The injection led to a %0s SW error", err_type), UVM_LOW)
          cfg.model_agent_cfg.vif.send_err_escalation(err_bits);
        end
        begin
          // Handle HW based escalation
          // Do nothing yet. We signal the error in the next cycle.
          @(cfg.clk_rst_vif.cb);
        end
      join_any
      disable fork; // Kill the SW handling forks if the HW case completes first.
    end join

Contributor Author

Actually the timing is slightly different (the injection already happens on the negedge), and both SW errors can occur at the same time. The fatal SW error is triggered in addition when the CTRL register defines that any SW error should be treated as a fatal error. The model then requires both notifications.

Let me explain the timing using otbn_rf_base_intg_err_vseq as an example.

  • This test waits on run_until_use() until the RF is read. The test will continue at the negedge of clk in the cycle the RF is used.
  • We then inject the error which will lead to a SW error (i.e., because the modified register value will result in an invalid branch target address).
  • We notify the model with cfg.model_agent_cfg.vif.tolerate_result_mismatch(1); that the next committed result may be wrong. This should be irrelevant to the timing (as long as no delta cycle is advanced).
  • Now we have to wait until the next clk posedge to then send the HW escalation triggered by the violated integrity. At the same time we must send any SW error to the model before the model steps for the current cycle. I did this waiting with my unconventional clocking block.

I tried waiting on the negedge as you suggested, but then the clocking block is too late and does not fire a SW error.

Honestly, I'm now also surprised that my clocking on the actual signal works. I think it only works because there is nothing (no delta cycle) between the error injection and my clocking block? But it seems very susceptible to race conditions, especially as the event happens before we "wait" on it (@cfg.controller_vif.wait_sw_error_rises), i.e., we trigger the event before we wait on it (is "event" the correct wording in SV?). The same also applies in a similar way to your solution, or am I missing something?

Contributor Author

@etterli etterli Dec 9, 2025

So instead of creating a clocking block, we could also wait directly on the actual signal.

@(cfg.controller_vif.software_err);

This requires adding the software_err signal to the otbn_controller_if ports, but that is easy.
Using wait instead of @ would also work.

See the newly pushed version.
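
For concreteness, a hedged sketch of that approach (the interface shape shown here is an assumption; only the hookup to u_otbn_controller.software_err comes from the discussion above):

  // Assumed interface shape, for illustration only. In the real bench the
  // signal would be connected to u_otbn_controller.software_err.
  interface otbn_controller_if;
    logic software_err;
  endinterface

  // In the vseq, either form blocks until the error shows up:
  task automatic wait_sw_err(virtual otbn_controller_if vif);
    @(vif.software_err);       // edge-sensitive: waits for the next change
    // or, level-sensitive (also completes if the signal is already high):
    // wait (vif.software_err);
  endtask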

UVM_LOW)
// This isolation fork is needed to ensure that "disable fork" call won't kill any other
// processes at the same level from the base classes
fork begin : isolation_fork
Contributor

This all looks rather complicated to me, and I'd be surprised if it works. I think the third thread will always complete first, and there's actually a race at the moment between that and the middles of the other two. Assuming that u_otbn_controller.software_err changes on the clock, the earliest point at which line 985 can complete races against line 1002. If you're unlucky, line 1002 will win and we'll never see the error.

I'm very happy to sketch out something robust, but I'm not certain I can figure out what's supposed to be happening. Can you leave a note that explains the timings you're trying to model?

Contributor Author

No, the trick is that u_otbn_controller.software_err changes immediately when a SW error occurs / when we inject an error which results in a SW error (it is combinatorial). What I want to do is reuse the RTL to decide whether the injected error leads to a SW error or not. Otherwise the DV code would have to make this decision itself, which depends heavily on the actual RTL.

In context: This task is called after the error injection and the @cfg.controller_vif.wait_sw_error_rises basically just checks in the next simulator delta cycle whether the injection led to a SW error or not. This check is required because a SW error will still lead to an immediate escalation (not delayed) whereas a fatal error like a predecode error will escalate in the next cycle (so when @(cfg.clk_rst_vif.cb) triggers).

We also need to distinguish between a fatal and a non-fatal SW error, as we either do or don't lock up OTBN at the end of the secure wipe (different error bits are set).

Does this explain what I want to achieve with this task? I agree it is probably not the cleanest way, as I directly probe RTL signals...
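
A compact sketch of the timing being described here, reusing the interface signal from the earlier suggestion (an illustration, not the actual vseq code; err_bits is set as in the snippet further up):

  otbn_pkg::err_bits_t err_bits = '{illegal_insn: 1'b1, default: 1'b0};
  fork begin : isolation_fork
    fork
      begin
        // SW errors escalate immediately: software_err is combinatorial and
        // rises in the same cycle as the injection.
        wait (cfg.controller_vif.software_err);
        cfg.model_agent_cfg.vif.send_err_escalation(err_bits);
      end
      begin
        // Fatal HW errors (e.g. a predecode error) escalate one cycle later.
        @(cfg.clk_rst_vif.cb);
      end
    join_any
    disable fork;
  end join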

Contributor

I think I understand (and thanks for explaining things). I think that my proposal (added to a previous note) with something on the negedge might be a bit simpler?

@etterli etterli force-pushed the local-escalation-cut branch from 1c448fe to e412f9c Compare December 5, 2025 17:00

@etterli etterli left a comment

I incorporated the review from @rswarbrick. A full regression is still ongoing.

@etterli etterli force-pushed the local-escalation-cut branch from e412f9c to d6b2a7a Compare December 5, 2025 17:08
@etterli etterli requested a review from rswarbrick December 8, 2025 07:30

@rswarbrick rswarbrick left a comment

Thanks for all the explanation and changes. I think there are still a couple of small improvements to go, but it's looking really good!

@etterli etterli force-pushed the local-escalation-cut branch 2 times, most recently from c67db10 to a3b656e Compare December 9, 2025 09:15
@etterli etterli requested a review from rswarbrick December 9, 2025 15:40
@vogelpi vogelpi added the CI:Rerun Rerun failed CI jobs label Dec 19, 2025
@github-actions github-actions bot removed the CI:Rerun Rerun failed CI jobs label Dec 19, 2025
@etterli etterli force-pushed the local-escalation-cut branch from a3b656e to aa6959d Compare December 22, 2025 09:49

etterli commented Dec 22, 2025

@vogelpi @rswarbrick

I added an explanation of why it is OK to sample the SW error signals from the RTL to control the behaviour of the DV model. See this comment.


vogelpi commented Dec 23, 2025

CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_core.sv

This PR changes the RTL of OTBN to fix a whole set of critical timing paths. The DV code is updated in sync with these changes. This is fine.


vogelpi commented Dec 23, 2025

> @vogelpi @rswarbrick
>
> I added an explanation of why it is OK to sample the SW error signals from the RTL to control the behaviour of the DV model. See this comment.

Thanks for your message and the additional comment @etterli. After our discussion last week, I thought more about this and I also think the approach is okay. As you explain in your newly added comment, the DV change does not detect whether we're expected to see an error and then instruct the model accordingly; it just aligns the timing between model and RTL.

To make this more obvious, I would suggest the following:

  1. Rename the function to convey that its purpose is just to keep the timing of the model and the RTL aligned. Right now, the function name is handle_delayed_escalation(), which is a lot more generic. Maybe optionally_delay_model() would work better (see the sketch after this list). The name should not suggest that the DV code here changes the model behavior with respect to whether we see an escalation or not. The escalation is always triggered.
  2. Shorten the big comment before the function and move the parts describing the different forks/branches into the function. Before the function, you just need to outline what the higher-level purpose of the function is (i.e. ensure the model remains in sync with RTL depending on the type of error seen).
  3. Create a separate GitHub issue where we can re-discuss whether we should change the model. We discussed a couple of approaches together and concluded that this approach here is okay. But we should also involve DV folks in that discussion. However, @rswarbrick is extremely busy with other critical work right now. My view is that the current approach is good to proceed for now and unblock other RTL work. If we decide to change the DV approach again for this, we can do so in the future. The problem is now well understood and required changes will mostly be in the model and in the now introduced function. This should be doable.
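
A hedged sketch of suggestion 1 (the name is the one floated above, i.e. hypothetical; the body is unchanged from the PR):

  // Keep the model's timing aligned with the RTL: SW errors are signalled to
  // the model immediately, fatal HW errors one cycle later. The escalation
  // itself is always triggered; this task only chooses when to notify.
  task optionally_delay_model();
    // ... fork structure as in the PR ...
  endtask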


vogelpi commented Dec 23, 2025

Also, we would need another CHANGE AUTHORIZED comment for this PR. Maybe @andreaskurth or @rswarbrick could help out please?

@rswarbrick

CHANGE AUTHORIZED: hw/ip/otbn/rtl/otbn_core.sv

I'm convinced that this change doesn't contain a risk to the design (so it should be authorized). I believe that the DV side of things could probably have been done more easily (as described in the reviews), but that shouldn't gate the PR.

Signals contributing to the local controller escalation signal directly factor into the
insn_executing signal inside the otbn_controller.sv. This then drives various commit signals
as well as the DMEM request signal (lsu_load_req_o). By registering these escalation signals,
these paths are shortened which results in better timing/area results. Signals contributing to
the start/stop escalation are also flopped to synchronize the escalation such that
transitioning into LOCKED and the start of wiping activities happen at the same time.

Registered signals:
- urnd_all_zero:    Is high when the local PRNG is all zero and locks up.
                    Potentially timing critical.
- predec_error:     Generated by comparing all the registered, predecoded signals against
                    the decoded signals and combining the results.
                    Potentially timing critical.
- lsu_rdata_err:    DMEM ECC errors of data read from DMEM.
                    Timing critical because source is in memory.
- insn_addr_err:    Is asserted when the prefetched address does not match the requested
                    address. Depends on insn_fetch_req_addr_o and prefetch_ignore_errs_o from
                    the controller, which factor in many combinatorial signals. In particular,
                    prefetch_ignore_errs_o depends on the internal error signal, which in turn
                    depends on many internal error signals.
                    Very likely timing critical.
- rf_base_intg_err: ECC errors triggered within the base register file. The control signals
                    for the base register file are not predecoded.
                    So this is likely timing critical.
- non_controller_reg_intg_violation: Based on alu_bignum_reg_intg_violation_err,
                                     mac_bignum_reg_intg_violation_err and rf_base_intg_err.
                                     The former two are generated by checking the ECC of the
                                     MOD / ACC values. The checks depend on the actual
                                     ALU / MAC datapath outputs.
                                     **Highly** timing critical.

Not registered signals:
- escalate_en_i:           This must NOT be cut. It stems from the system-wide escalation
                           control mechanism and such a global escalation must be as fast as
                           possible.
- start_stop_fatal_error:  The error signal of the start stop FSM. Based on EDN urnd ack
                           signal and internal state checking. This error is not cut to
                           synchronize the escalation of the controller and start/stop
                           control.
- controller_fatal_err:    The error signal of the controller towards the start/stop logic.
                           This error is not cut to synchronize the escalation of the
                           controller and start/stop control.
- rf_base_spurious_we_err: This error is already registered at the source.
- insn_fetch_err:          IMEM ECC errors, not computed on the IMEM output but on a flopped
                           signal.

See lowRISC#28345 for discussion about the impact on security.

Signed-off-by: Pascal Etterli <[email protected]>
If e.g. a predecode error happens during the first cycle of an LSU
instruction, the LSU will still return a result due to the registering of
the error, but its error signal is X. We prevent the propagation of this X
into the error bits by ignoring the LSU error if there is already an
escalation ongoing.

Signed-off-by: Pascal Etterli <[email protected]>
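
A minimal sketch of the masking described in this commit message (both signal names are assumed, for illustration only):

  // Ignore the (possibly X) LSU error while an escalation is already in
  // flight, so the X cannot propagate into the error bits.
  assign lsu_rdata_err_masked = lsu_rdata_err & ~escalation_ongoing_q;
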
With this change one can tell the checker to tolerate a mismatch on the
next RTL-to-ISS trace comparison.
This is useful when FI-related tests lead to escalations which fire one
cycle late.

Signed-off-by: Pascal Etterli <[email protected]>
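
A usage sketch, reusing the call shown earlier in this thread (the injection helper is hypothetical):

  // Tell the checker that the next RTL-to-ISS trace comparison may mismatch,
  // then inject the fault whose escalation fires one cycle late.
  cfg.model_agent_cfg.vif.tolerate_result_mismatch(1);
  inject_fault();  // hypothetical helper, for illustration only
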
Previously the DMEM integrity validity was modelled with an Optional[Int] type
where an int value represented valid and None an invalid state.
The DMEM integrity test then simply invalidated the whole DMEM by setting
all parts to None.

This is incompatible with a delayed escalation, as a failing instruction (e.g.,
an LW) will trigger the escalation but will still retire. Because of this it
must actually load a value, and a None value cannot be written to a register.
To solve this, we model the integrity with a dedicated valid flag and return
the current value as well. The test now only has to specify that a comparison
mismatch is tolerated.

It would also be possible to modify the insn implementation such that if None
is returned, a fixed value is written to the register. However, in this case
the insn code would not exactly represent the HW behaviour. As this insn code
is converted to pseudo-code for the ISA description, this approach is less
favourable. Keeping the value field also allows extending the DMEM model in
the future to fully model the injected error (i.e., extending otbnsim with a
command which allows setting DMEM content to the faulted values). With this,
the "tolerate mismatch" workaround could be fixed properly.

Signed-off-by: Pascal Etterli <[email protected]>
The simulator reset the INSN_CNT value one cycle too late when a fatal
error occurred during a non-fatal secure wipe.

Signed-off-by: Pascal Etterli <[email protected]>
The simulator now delays the escalation of certain errors by one cycle to match
the RTL.

This approach cannot be used for all error bits, as certain error bits can be
set immediately by e.g. SW errors and also by delayed fatal escalations. The
current mechanism cannot distinguish these two cases.

Signed-off-by: Pascal Etterli <[email protected]>
@etterli etterli force-pushed the local-escalation-cut branch from 1453a56 to 90936bd Compare January 7, 2026 00:04
…ring

This adds the possibility to let the model stall for a cycle instead of retiring.
This is a workaround to model the ld_insn injection error from the otbn_ctrl_redun_vseq.sv
test.

When a load instruction is injected in parallel to the actual instruction, OTBN will
not execute the current instruction and will stall (due to how the ld_insn signal factors
into the stall logic). Modeling this behaviour is complicated and would require duplicating
RTL implementation details in the model. Thus, we add this command/flag which UVM can use
to signal that the model should enter the stall state.

Signed-off-by: Pascal Etterli <[email protected]>
This adds a task to handle SW errors when injecting HW errors which should lead
to hardware escalations. This task ensures that the model is notified in case
an injected error results in a SW error in addition to the expected hardware
escalation.

Signed-off-by: Pascal Etterli <[email protected]>
Sequences must be adapted such that the error escalation behaviour is correctly
modelled.

Signed-off-by: Pascal Etterli <[email protected]>
@etterli etterli force-pushed the local-escalation-cut branch from 90936bd to 81fcaaf Compare January 7, 2026 00:12

etterli commented Jan 7, 2026

@vogelpi @rswarbrick

I incorporated the feedback from @vogelpi and created the issue #29026 to track the discussed special RTL probing.

I also improved the regression results by rebasing onto the latest master (which now includes the otbn_sec_cm fix) as well as by implementing some improvements in the DV framework, especially for tests like otbn_ctrl_redun.

Now only a few seeds fail for the otbn_partial_wipe and otbn_stress_all_with_rand_reset tests, seeds which do pass on the current master. However, these failures are mostly due to DV framework limitations (e.g., the simulator does not perfectly model the RTL). I will create issues describing the remaining DV problems later.

As reference for later:
Seeds which do fail with the delayed escalation but pass on the master:

  • otbn_partial_wipe
    • 102652375855186875685574724665002460857039337799073695044744087342108108650176
  • otbn_stress_all_with_rand_reset
    • 77796259482280258487775019503527232235529693308155587808744077118562121958794
