Improve autopilot maintenance #4113
Conversation
Code Review
This pull request refactors the autopilot maintenance logic to run in a background task, using a watch channel for synchronization, in order to address the latency of ethflow order indexing. However, the current implementation introduces reliability issues. Most notably, an unchecked unwrap() in the background task creates a Denial of Service (DoS) risk: a panic there can crash the entire autopilot process. Additionally, unexpected stream termination could lead to panics and cascading failures, and the use of try_join! means a failure in one component could cancel the other indexing tasks. Addressing these issues will improve system robustness and availability.
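The unwrap() concern the review raises can be illustrated with a minimal, self-contained sketch (all names here are hypothetical, not the PR's actual code): a maintenance loop that unwraps a fallible indexing step dies on the first transient error, while one that handles the `Result` keeps running.

```rust
// Minimal sketch of the failure mode flagged in the review: an unchecked
// `unwrap()` inside the maintenance loop turns any transient indexing
// error into a panic. Handling the `Result` keeps the loop alive.
// `index_block` and `run_maintenance` are hypothetical stand-ins.
fn index_block(block: u64) -> Result<(), String> {
    // Hypothetical indexing step that can fail transiently.
    if block == 3 {
        Err("rpc timeout".into())
    } else {
        Ok(())
    }
}

fn run_maintenance(blocks: &[u64]) -> Vec<u64> {
    let mut indexed = Vec::new();
    for &block in blocks {
        // Risky version: `index_block(block).unwrap()` — a single transient
        // error would panic and take the whole task (and process) down.
        match index_block(block) {
            Ok(()) => indexed.push(block),
            Err(err) => eprintln!("indexing block {block} failed: {err}; will retry later"),
        }
    }
    indexed
}

fn main() {
    // Block 3 fails, but the loop survives and keeps indexing.
    let indexed = run_maintenance(&[1, 2, 3, 4]);
    println!("indexed blocks: {indexed:?}");
}
```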
squadgazzz
left a comment
The change makes sense to me. But why has the SLI degraded? Is it because of the increased amount of orders or something else?
The SLI measures |
While what Jan wrote is correct, it does not explain why it suddenly got worse. The change that moved ethflow indexing off of the critical path landed a month or so before the ethflow-specific metric took a nosedive. It's possible that, on average, the rest of the order settlement pipeline was fast enough to compensate for the ethflow indexing degradation, but when another performance regression hit later, it suddenly became a lot more problematic. On a different note: I did some analysis, and according to the data in our DB only ~15% of ethflow orders make it into the first possible block. The rest have to wait at least 1 auction, which puts some numbers to the huge difference in the SLI.
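The ~15% figure implies a rough lower bound on the average extra delay for ethflow orders. A back-of-the-envelope sketch, assuming a missed auction costs at least one 12 s block (the numbers, not the code, come from the comment above):

```rust
fn main() {
    // Back-of-the-envelope from the numbers in the comment above.
    // Assumption: every order that misses the first possible block waits
    // at least one auction, i.e. at least one 12 s block.
    let first_block_share = 0.15; // ~15% of ethflow orders land in the first possible block
    let block_time_s = 12.0;
    let min_avg_extra_delay = (1.0 - first_block_share) * block_time_s;
    println!("lower bound on average extra delay: {min_avg_extra_delay:.1} s");
}
```

So on average an ethflow order is delayed by at least ~10 s relative to a regular order that makes the first auction, consistent with the large SLI gap.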
Description
While looking into the degraded time to happy moo SLI it became apparent that ethflow orders have a significantly worse SLI compared to "regular" orders.
Ethflow orders are not harder to solve for than any other orders but they are special in the way they enter the system. Instead of having a REST API call that puts those orders into the DB they get placed by calling the ethflow contract onchain. The autopilot then indexes those events and puts them into the DB.
Since the autopilot run loop is synced to the blockchain (a new auction starts right after a new block is seen), ethflow orders are comparable to regular orders that ALWAYS get placed at the worst possible time (immediately before cutting the auction).
Because we were overwhelmed with indexing ethflow orders during a trade incentive, we moved ethflow indexing off of the critical path (see here), but that also had the consequence that more ethflow orders did not make it into the first possible auction, which immediately delays them by at least 12s.
Changes
This PR puts ethflow order indexing back on the critical path while still avoiding the issue that caused us to move it off the critical path in the first place.
Previously, the autopilot triggered maintenance before building a new auction or after a new block appeared (while waiting for submitted solutions), with an additional background task that checked every second for new ethflow orders that needed indexing.
This PR instead moves autopilot maintenance (i.e. block indexing) completely into a background task that triggers as soon as the system sees a new block. To make sure the auction is only built after the blocks have been indexed, this background task feeds a channel of processed blocks. The autopilot then only has to wait for this channel to yield a high enough block number.
So the properties of the new solution are:
- ethflow order indexing is back on the critical path, so ethflow orders can make it into the first possible auction
- maintenance runs in a background task that triggers as soon as a new block is seen, instead of polling every second
- the auction is only built once the channel of processed blocks yields a high enough block number
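The synchronization described above (a background task publishes processed block numbers; the run loop blocks until a high enough block has been indexed) can be sketched with std-only primitives. The PR itself uses a tokio watch channel; this `Mutex` + `Condvar` version and all its names are a hypothetical simplification showing the same wait-for-block-number pattern:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Shared "latest processed block" state — a std-only stand-in for the
/// watch channel of processed blocks described in the PR (hypothetical).
struct ProcessedBlocks {
    latest: Mutex<u64>,
    changed: Condvar,
}

impl ProcessedBlocks {
    fn new() -> Self {
        Self { latest: Mutex::new(0), changed: Condvar::new() }
    }

    /// The background maintenance task publishes each block it finished indexing.
    fn publish(&self, block: u64) {
        let mut latest = self.latest.lock().unwrap();
        *latest = (*latest).max(block);
        self.changed.notify_all();
    }

    /// Auction building blocks until maintenance has caught up to `block`.
    fn wait_for(&self, block: u64) -> u64 {
        let mut latest = self.latest.lock().unwrap();
        while *latest < block {
            latest = self.changed.wait(latest).unwrap();
        }
        *latest
    }
}

fn main() {
    let state = Arc::new(ProcessedBlocks::new());

    // Background maintenance task: index blocks 1..=5 as they "appear".
    let bg = Arc::clone(&state);
    let handle = thread::spawn(move || {
        for block in 1..=5 {
            thread::sleep(Duration::from_millis(10)); // simulate indexing work
            bg.publish(block);
        }
    });

    // Run loop: only cut the auction once block 5 has been indexed.
    let latest = state.wait_for(5);
    println!("auction built on block {latest}");
    handle.join().unwrap();
}
```

A watch channel gives the same "only the newest value matters" semantics without manual condvar bookkeeping, which is presumably why the PR reaches for it.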
How to test
Covered by existing e2e tests