Skip to content

Conversation

@MartinquaXD
Copy link
Contributor

Description

While looking into the degraded time to happy moo SLI it became apparent that ethflow orders have a significantly worse SLI compared to "regular" orders.
Ethflow orders are not harder to solve for than any other orders but they are special in the way they enter the system. Instead of having a REST API call that puts those orders into the DB they get placed by calling the ethflow contract onchain. The autopilot then indexes those events and puts them into the DB.

Since the autopilot run loop is synced to the block chain (start a new auction right after seeing a new block) ethflow orders are comparable to regular orders that ALWAYS get placed at the worst possible time (immediately before cutting the auction).

Due to being overwhelmed with indexing ethflow orders because of a trade inventive we moved ethflow indexing off of the critical path (see here) but that also had the consequence of more ethflow orders not making it into the first possible auction which immediately delays them at least by 12s.

Changes

This PR puts ethflow order indexing back on the critical path while still avoiding the issue that caused us to move it off the critical path in the first place.
Instead of having a system where the autopilot triggers the maintenance to happen before a new auction or after new block appearing (when waiting for submitted solutions) with an additional background task that checks every second for new ethflow orders that need indexing.
This PR moves autopilot maintenance (i.e. block indexing) completely into a background task which triggers ASAP when the system sees a new block. In order to build the auction only after the blocks have been indexed this background tasks feeds a channel of processed blocks. The autopilot then only has to wait for this channel to yield a block with a high enough block number.

So the properties of the new solutions are:

  • event indexing has as little delay as possible
  • indexing runs concurrently so it's as fast as possible (without speeding up the individual code paths)
  • autopilot can wait for data from a given block to be processed fully
  • autopilot stops waiting after a configurable amount of time to keep running auctions even if indexing is slow for whatever reason

How to test

Covered by existing e2e tests

@MartinquaXD MartinquaXD requested a review from a team as a code owner February 2, 2026 13:51
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the autopilot maintenance logic to run in a background task, using a watch channel for synchronization, to address the latency of ethflow order indexing. However, the current implementation introduces reliability issues, specifically a Denial of Service (DoS) vulnerability where a panic in the background task due to an unchecked unwrap() can crash the entire autopilot process. Additionally, unexpected stream termination could lead to panics and cascading failures, and the use of try_join! means a failure in one component could cancel other indexing tasks. Addressing these issues will improve system robustness and availability.

Copy link
Contributor

@squadgazzz squadgazzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change makes sense to me. But why has the SLI degraded? Is it because of the increased amount of orders or something else?

@fafk
Copy link
Contributor

fafk commented Feb 3, 2026

The change makes sense to me. But why has the SLI degraded? Is it because of the increased amount of orders or something else?

The SLI measures fast_orders / total and if ethflow orders are slower it pushes the overall SLI down.

@MartinquaXD
Copy link
Contributor Author

The change makes sense to me. But why has the SLI degraded? Is it because of the increased amount of orders or something else?

While what Jan wrote is correct it does not explain why it suddenly got worse. The change that moved ethflow indexing off of the critical path was a month or so before the ethflow specific metric took a nose dive. It's possible that on average the the rest of the order settlement pipeline was fast enough to compensate for the ethflow indexing degradation but later when another performance regression it it suddenly became a lot more problematic.
The thing with the SLI especially on mainnet is that it's very quantized. If an order misses an auction that's ~12s lost right away. If the settlement gets mined one block later that's another 12s. That's why I believe mainnet is especially prone to huge swings in the SLI when we have been "living on the edge" with the order submission.

On a different note: I did some analysis and according to data in our DB and only ~15% of ethflow orders make it into the first possible block. The rest has to wait at least 1 auction which puts some numbers to the huge difference in the SLI.

@MartinquaXD MartinquaXD enabled auto-merge February 3, 2026 07:58
@MartinquaXD MartinquaXD added this pull request to the merge queue Feb 3, 2026
Any commits made after this event will not be merged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants