Separate autopilot API native price estimator #4044
Conversation
I find the way the caching mechanism is set up confusing, maybe better docs can help but at the same time I feel that it is a bit too complex and could probably be made simpler (at the cost of a bigger refactor)
Code Review
This pull request introduces a significant architectural improvement by separating the native price estimator for the Autopilot API and introducing a shared cache with source-aware maintenance. This refactoring effectively reduces memory usage and prevents redundant price fetches. The new NativePriceCache encapsulates caching logic cleanly, simplifying the CachingNativePriceEstimator and improving the overall design. The changes are well-structured and align with the goals outlined in the description. I've found one potential issue that could lead to a panic, which I've detailed in a specific comment.
    updated_at,
    now,
    Default::default(),
    EstimatorSource::default(),
I found it a bit hard to reason about what primary and secondary is supposed to be. Given that this logic is already extremely specific it would probably make sense to just give them specific names?
Yeah, I was also thinking about that. Which names exactly, for example? They are located in the shared crate, which doesn't know anything about autopilot or orderbook, I assume.
After suggesting the primary/secondary names, Ilya better explained what's happening with them. I don't have a great idea for "these prices are for queries from A and these prices are for queries from B"
But if this only serves two services, I'd say it makes sense to name them "service_X_source" and "service_Y_source" or something similar
I'd also note that they could stop being in the shared crate if they're only used by those two services 🧹 😅
    /// originally fetched it. The cache's background maintenance task uses
    /// this information to dispatch updates to the appropriate estimator,
    /// ensuring each token is refreshed using the same source that
    /// originally fetched it.
This seems to relate to the design considerations you mention in the description:
The source-tracking approach addresses a subtle issue with shared cache maintenance. With two estimators (primary using CoinGecko, secondary without), a naive approach of picking one estimator for all maintenance would fail:
If primary maintains everything: tokens initially fetched via secondary would start hitting CoinGecko during maintenance, defeating the purpose
If secondary maintains everything: tokens originally fetched via CoinGecko couldn't be properly refreshed
By tagging each cached entry with its source and dispatching maintenance accordingly, tokens stay with their original estimator throughout their cache lifetime.
However, I don't think this is thought through completely. With the exception of onchain placed orders (😬), all new native prices have to be fetched by an API request originally. Unless I'm missing something, once a user places an order for a completely new token the autopilot will continue to update the cache without coingecko (since that is not part of the estimator that originally fetched the price). I suspect this token could only ever be upgraded to be fetched by coingecko after a restart puts it into the cache and marks it as a primary token.
What I think would be closer to what we try to achieve is that we only keep tokens warm that are actually used in the auction. So what I would imagine is this:
- API request for a completely new native price
- autopilot caches it but it's not marked for maintenance yet
- a. user never places an order => maintenance never refetches the token, eventually gets evicted from the cache
b. user places an order => when autopilot fetches the price for building the auction it marks it as "worthy of maintenance"
In that approach there would be only the main estimator (with coingecko) running the maintenance and only for the tokens that were explicitly marked by the autopilot. That deviates from what we currently do (where the API is configured to run the maintenance as well) but I think since we now have a single cache that's shared by everyone and has the prices for all tokens in the auction kept up to date the API probably can stop updating its cache entirely.
Regarding the issue you raised elsewhere, where the estimator that handled the orderbook native price request is "more powerful" (i.e. would find native prices that the estimator with coingecko is not able to), I'd like to point out that we could perhaps stick to 1 required native price estimator argument and an optional coingecko argument. Only if coingecko is enabled would the need for a second estimator arise, and the autopilot could just add coingecko to the regular estimators.
That way the estimator used for maintaining the cache is always at least as capable as the estimator used to handle API requests.
All that being said I agree with @jmg-duarte that this is very complicated but unfortunately I was also not able to come up with a cleaner idea. 😞
a. user never places an order => maintenance never refetches the token, eventually gets evicted from the cache
b. user places an order => when autopilot fetches the price for building the auction it marks it as "worthy of maintenance"
My worry here is that if something is placed into the cache, it has to be maintained. Otherwise, our orderbook will be sending stale results until the item expires in the cache. That is valid for the cases where no order was placed, but still, this affects the quote competition, I suppose. Am I missing something, or can we ignore this problem for the quote competition?
Alright, after discussing it, we can neglect the stale native prices for the quote competition problem and run the maintenance task only for the auction-related native prices because we don't really report those native prices in the quote competition and just use them to ensure the token is "tradable".
Since we would have caches in both the orderbook and the autopilot, that can lead to keeping the price in the cache twice as long in the worst case scenario, e.g., orderbook and autopilot have 10m cache TTL, an orderbook restarts and requests a price from the autopilot that is about to expire in its cache and then the orderbook caches it for another 10m in its own cache.
This can be addressed by disabling the cache on the orderbook side since we run both services on the same k8s node and the latency should be minimal. The connection can be improved by keeping it open or switching to websockets.
Will update the PR.
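The worst case described above is simple arithmetic and can be checked with a small sketch. It assumes both services use a fixed TTL and that the orderbook copies a value just before it expires in the autopilot cache; the function name `worst_case_age` is hypothetical and not part of the PR.

```rust
use std::time::Duration;

/// Upper bound on the age of a price served by the orderbook when it copied
/// the value from the autopilot cache right before that entry expired.
/// `orderbook_ttl` is `None` when the orderbook-side cache is disabled.
/// Hypothetical helper, for illustration only.
fn worst_case_age(autopilot_ttl: Duration, orderbook_ttl: Option<Duration>) -> Duration {
    autopilot_ttl + orderbook_ttl.unwrap_or(Duration::ZERO)
}
```

With two 10m caches the bound is 20m; with the orderbook cache disabled it falls back to the autopilot TTL of 10m.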
My worry here is that if something is placed into the cache, it has to be maintained. Otherwise, our orderbook will be sending stale results until the item expires in the cache.
But that's part of the premise/trade-off of the cache, even with maintenance you will be sending "stale" values, unless they're the value right after the fetching. For example: you just fetched a value, there's a price jump, maintenance didn't occur yet, you'll be serving a "stale" value that was just fetched
I think Martin's idea might be easier to implement using a two-tier cache, one that is kept warm by maintenance and another that isn't kept warm, and values are only evicted or upgraded, some "trace examples":
quote estimation -> no value (L1, L2) -> fetch coingecko -> place in L2
order estimation -> no value (L1) -> value in L2 -> upgrade to L1 (remove from L2 + add to L1)
quote estimation -> value in L2 -> return value (no upgrade)
quote estimation -> value in L1 -> return value (no upgrade)
order estimation -> value in L2 -> return value (upgrade in background)
maintenance thread:
- every X run
- loop over entries
- batch request or similar to coingecko
However, this has an issue, if I get an estimate and actually do a trade with a crappy token, when will it get evicted from L1? It should have some bound and support LRU that ignores the maintenance
I know this is not trivial to implement but it's the cleanest solution I can conjure up right now.
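The trace examples above can be sketched as a small two-tier cache. This is a minimal illustration, not the PR's implementation: `TwoTierCache`, `Request`, and `get_or_fetch` are hypothetical names, prices are plain `f64`, and TTL/LRU eviction (the open question raised above) is left out. It also assumes that an order request that misses both tiers is stored directly in L1, which the traces do not spell out.

```rust
use std::collections::HashMap;

/// Who is asking for the price (illustrative names).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Request {
    Quote,
    Order,
}

#[derive(Default)]
struct TwoTierCache {
    /// L1: entries kept warm by the maintenance task.
    l1: HashMap<String, f64>,
    /// L2: entries filled by quote requests, never refreshed.
    l2: HashMap<String, f64>,
}

impl TwoTierCache {
    /// Returns the cached price, or calls `fetch` on a miss in both tiers.
    fn get_or_fetch(&mut self, token: &str, request: Request, fetch: impl FnOnce() -> f64) -> f64 {
        if let Some(price) = self.l1.get(token) {
            return *price;
        }
        if let Some(price) = self.l2.get(token).copied() {
            if request == Request::Order {
                // Upgrade: remove from L2 and add to L1.
                self.l2.remove(token);
                self.l1.insert(token.to_owned(), price);
            }
            return price;
        }
        let price = fetch();
        match request {
            Request::Quote => self.l2.insert(token.to_owned(), price),
            // Assumption: a fresh order request goes straight to L1.
            Request::Order => self.l1.insert(token.to_owned(), price),
        };
        price
    }

    /// Tokens the maintenance task would refresh: L1 only.
    fn maintained_tokens(&self) -> Vec<&String> {
        self.l1.keys().collect()
    }
}
```

A quote followed by an order for the same token returns the quoted price without a second fetch and moves the entry into L1, so it becomes visible to maintenance from then on.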
But that's part of the premise/trade-off of the cache, even with maintenance you will be sending "stale" values, unless they're the value right after the fetching. For example: you just fetched a value, there's a price jump, maintenance didn't occur yet, you'll be serving a "stale" value that was just fetched
Staleness of 10m(cache) or 10s(maintenance update interval) has a huge difference, tho 😄
This comment should probably explain why we can disable the maintenance task for the quote competition native prices.
Staleness of 10m(cache) or 10s(maintenance update interval) has a huge difference, tho 😄
Yeah I know, I forgot to add that the maintenance interval needs to be balanced with the TTL
Anyways, I hadn't seen this #4044 (comment) because I started writing and it took a while to gather my thoughts 😅 all good
crates/e2e/src/setup/proxy.rs
Outdated
    if let Some(current) = backends.pop_front() {
        backends.push_back(current);
    }
    tracing::info!(?backends, "rotated backends");
Not related to this PR, but should be useful when investigating failures of the flaky local_node_dual_autopilot_only_leader_produces_auctions e2e test.
Based on #4044 (comment), the approach was reworked. The PR description is also updated.
Code Review
This pull request introduces a significant and well-thought-out refactoring of the native price estimation logic to reduce external API calls. By introducing source-aware caching that distinguishes between actively maintained Auction requests and passively cached Quote requests, you've created a more efficient system. The new NativePriceCache and MaintenanceConfig structs improve modularity and make the caching strategy clear. My review identified a minor opportunity for code simplification and a refactoring opportunity to reduce duplication. A potential bug regarding the handling of approximation_tokens was also noted, which should be addressed in a separate pull request. Overall, this is a high-quality contribution.
    create_missing_entry: Option<EstimatorSource>,
    upgrade_to_source: Option<EstimatorSource>,
These arguments are a bit confusing to me. Wouldn't a single argument like mark_for_maintenance be sufficient? If the entry is missing - create it. If the entry exists but the token is currently not flagged for maintenance - update the flag.
If you drop the information which estimator originally fetched the price and instead simply frame the logic as "does the token need maintenance or not" the PR could probably shrink by a good amount. Unless I'm missing something this would require this PR to only introduce a few core changes:
- spawn 2 different native price estimators (one with and one without coingecko)
- the required config changes (if any - I mean you could use the currently 1 native price estimator setup and only filter out coingecko from the list to create the second estimator)
- have them share a cache
- only spawn a maintenance task for 1 of them
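The single-flag idea from this comment could look roughly like the sketch below. It is an assumption-laden illustration rather than code from the PR: `Cache`, `Entry`, `lookup`, and `tokens_to_maintain` are hypothetical names, and a missing entry is created as a price-less placeholder for the maintenance task to fill in later.

```rust
use std::collections::HashMap;

struct Entry {
    /// `None` for a placeholder that has not been fetched yet.
    price: Option<f64>,
    /// Whether the background task should keep this entry fresh.
    needs_maintenance: bool,
}

#[derive(Default)]
struct Cache {
    entries: HashMap<String, Entry>,
}

impl Cache {
    /// Looks up `token`. With `mark_for_maintenance` set, a missing entry is
    /// created and an existing one is flagged; the flag is never cleared here.
    fn lookup(&mut self, token: &str, mark_for_maintenance: bool) -> Option<f64> {
        if mark_for_maintenance {
            let entry = self
                .entries
                .entry(token.to_owned())
                .or_insert(Entry { price: None, needs_maintenance: true });
            entry.needs_maintenance = true;
            entry.price
        } else {
            self.entries.get(token).and_then(|entry| entry.price)
        }
    }

    /// Tokens the single maintenance task would refresh.
    fn tokens_to_maintain(&self) -> Vec<&str> {
        self.entries
            .iter()
            .filter(|(_, entry)| entry.needs_maintenance)
            .map(|(token, _)| token.as_str())
            .collect()
    }
}
```

Under this framing the quote path would call `lookup(token, false)` and the auction path `lookup(token, true)`; which estimator originally fetched the price is not recorded at all.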
That reverts us to the initial PR structure, which I find much more confusing. And I assume @jmg-duarte was under the same impression.
The arguments are also confusing. I need to think more about it since they serve different purposes, and we need both.
Introduced a CacheLookup enum, which replaces these args.
9bcf4dd (this PR)
/gemini review
Code Review
This pull request introduces a significant and well-designed refactoring of the native price estimation and caching mechanism. By implementing source-aware caching, distinguishing between auction-related requests (actively maintained) and quote/API requests (cached but not maintained), the changes effectively reduce unnecessary API calls, particularly to CoinGecko. The new NativePriceCache and QuoteCompetitionEstimator provide a clear separation of concerns and improve efficiency. The updates to the PriceEstimatorFactory and various test cases demonstrate a thorough implementation of the new architecture. Overall, this is a solid improvement to the system's performance and maintainability.
    // Ensure all the locks are released and follower has time to step up
    tokio::time::sleep(Duration::from_secs(2)).await;
    onchain.mint_block().await;
    // Ensure the follower has stepped up as leader
    tokio::time::sleep(Duration::from_secs(2)).await;
For some reason, on this branch, the test faces a race condition. While I can't reproduce it locally on macOS, this fixes the test flakiness.
Would wait_for_condition be useful here? Even with a shorter timeout
I wonder which condition we should wait for, then? Creating an order?
too bad we don't have access to the autopilot API/interface to use that here
we could do wait for is_leader 🤔
but reviewing the test i don't have a better idea than the current code, or do the thing above with much more effort, so i think it's ok, but we should leave a note that the sleeps are related to a race condition
we could do wait for is_leader 🤔
I assume, there is no existing function and we need to query the metrics endpoint for that?
could also be a solution 💡
I mean, what was your solution?
i didn't have one specifically, was kind of trying to brainstorm a solution that didn't require the sleeps, i thought there was an api to check the leader status
there isn't (at least not directly) so your suggestion makes sense
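The idea of waiting for a condition instead of sleeping for a fixed time can be illustrated with a generic polling helper. This is a blocking, std-only stand-in written for this example; `poll_until` is a hypothetical name, not the e2e crate's `wait_for_condition`, and the real test would need an async variant.

```rust
use std::time::{Duration, Instant};

/// Polls `condition` every `interval` until it returns true or `timeout`
/// elapses. Returns whether the condition was met in time.
/// Hypothetical helper, blocking for simplicity.
fn poll_until(timeout: Duration, interval: Duration, mut condition: impl FnMut() -> bool) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if condition() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        std::thread::sleep(interval);
    }
}
```

The condition closure is where a leader check would go, for example by querying the metrics endpoint as suggested in the thread; how that check is actually performed is left open here.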
    ///
    /// Returns None if the price is not cached, is expired, or is not ready to
    /// use.
    fn get_ready_to_use_cached_price(
Probably best to delete this function as a footgun altogether. I suspect we can spare the overhead of 1 extra allocation.
If get_ready_to_use_cached_price is removed the metrics handling (counting hits and misses) could be moved into get_ready_to_use_cached_price to further reduce duplicated code.
If get_ready_to_use_cached_price is removed the metrics handling (counting hits and misses) could be moved into get_ready_to_use_cached_price to further reduce duplicated code.
😵💫 I'm confused, you mentioned the same function
(This didn't arrive in time, I was still finishing the review, leaving it anyways)
Read more through the PR and I have the opposite thought
I don't think the issue is about allocations but rather about locking and unlocking for each address.
As such I think we should:
- Remove the single version of this since if we have a single value we can just make it a batch of size 1
- Inline the inner version so we don't separate the locking from the fetching (sounds like a neat idea but I feel it's a big footgun if you use the mut ref instead of the guard)
I believe the above has multiple advantages:
- No locking footgun
- Reduced API surface
- Reduced amount of code
I hope I got it right d93e55b (this PR)
I don't think the issue is about allocations but rather about locking and unlocking for each address.
Absolutely. I meant the bulk function has no locking footgun and replacing the single variant by calling the bulk version with input size 1 only comes with the downside of a few extra allocations (which is a negligible downside).
So the new commit actually does the opposite. Instead of replacing the single version with a bulk operation of size 1 it replaces the bulk operation with multiple single operations which each lock the cache individually.
Sorry for causing confusion by using the same name twice.
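The point about locking once per batch rather than once per address can be shown with a minimal sketch. `PriceCache`, `get_many`, and `get_one` are hypothetical names chosen for this example, and a plain `HashMap<String, f64>` behind a `std::sync::Mutex` stands in for the real cache.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

struct PriceCache {
    prices: Mutex<HashMap<String, f64>>,
}

impl PriceCache {
    /// Looks up all `tokens` while taking the lock exactly once. The guard
    /// never leaves this function, so callers cannot misuse it.
    fn get_many(&self, tokens: &[&str]) -> Vec<Option<f64>> {
        let guard = self.prices.lock().unwrap();
        tokens.iter().map(|token| guard.get(*token).copied()).collect()
    }

    /// Single lookup expressed as a batch of size 1; the cost is one small
    /// extra allocation for the result vector.
    fn get_one(&self, token: &str) -> Option<f64> {
        self.get_many(&[token]).pop().flatten()
    }
}
```

Going the other way, implementing the bulk operation as a loop over `get_one`, would lock and unlock the mutex once per token, which is the behaviour the comment above objects to.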
I updated the PR again here f75b3c5 (this PR)
    /// background maintenance.
    async fn estimate_with_cache_update<F, Fut>(
Not sure I understand why we need this extra indirection in this helper function. It looks like it's used in 3 call sites of which 2 pass self.fetch_price() which is aware of the approximation_tokens thingy while the other callsite uses self.inner.estimate_native_price().
I think all call sites should be aware of approximation_tokens to avoid unexpected behavior. That would mean we don't need this intermediate function anymore, right?
I think given how complicated this component was already before this PR we really need to be sceptical of all duplicated code and see if there are ways to avoid that.
Ditto 78fbcf2 (this PR)
jmg-duarte
left a comment
This review is slightly outdated as you pushed changes while I was finishing my review
    /// Maximum number of prices to update per maintenance cycle.
    /// None means unlimited.
    pub update_size: Option<usize>,
Small footgun here (?): 0 will fetch none, which I feel is not what we want here
We could either write Option<NonZeroUsize> or just usize and use 0 as the sentinel for unlimited
This is old logic. But it makes sense to update it.
Converted to usize, where a 0 value means unlimited. Using 'unlimited' as 'None' looks a bit confusing. Wdyt?
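For comparison, the two encodings discussed in this thread can be sketched side by side. The function names are hypothetical and `pending` stands for the number of entries due for an update; neither function is taken from the PR.

```rust
use std::num::NonZeroUsize;

/// Variant 1: `None` means unlimited and a limit of zero is unrepresentable.
fn limit_nonzero(pending: usize, update_size: Option<NonZeroUsize>) -> usize {
    update_size.map_or(pending, |max| pending.min(max.get()))
}

/// Variant 2: plain `usize` where 0 is the sentinel for unlimited.
fn limit_sentinel(pending: usize, update_size: usize) -> usize {
    if update_size == 0 {
        pending
    } else {
        pending.min(update_size)
    }
}
```

Both variants rule out the "0 fetches nothing" case. The first makes that explicit in the type, while the second relies on the sentinel being documented.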
    Metrics::get()
        .native_price_cache_access
        .with_label_values(&[label])
        .inc_by(1);
In the orderbook, we reset some metrics when it gets reloaded, should we do the same for this?
Not sure I am following. Could you give an example?
We reset the number of rejected orders
services/crates/orderbook/src/api.rs
Line 195 in 8dd0fdd
    fn reset_requests_rejected(&self) {
Thanks! Implemented.
    /// Tokens that should be prioritized during maintenance updates.
    high_priority: Mutex<IndexSet<Address>>,
It's not clear how these interact with the maintenance limits
For example: if the size of the high prio set is larger than the batch update size, will only high prio tokens ever get updated? And even then, not all of the high prio will be updated?
This is old logic. Updated the doc.
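The question above can be made concrete with one possible selection policy. This sketch is an assumption for illustration only, not a description of the actual maintenance logic: `select_for_update` is a hypothetical function that takes high-priority tokens first, then the remaining ones, and truncates to `limit` (with 0 meaning unlimited).

```rust
/// Picks the tokens to refresh in one maintenance cycle: high-priority tokens
/// first, then the rest, truncated to `limit` (0 = unlimited).
/// Hypothetical policy, for illustration only.
fn select_for_update<'a>(
    high_priority: &[&'a str],
    others: &[&'a str],
    limit: usize,
) -> Vec<&'a str> {
    let ordered = high_priority.iter().chain(others.iter()).copied();
    if limit == 0 {
        ordered.collect()
    } else {
        ordered.take(limit).collect()
    }
}
```

Under such a policy, a high-priority set larger than the limit would indeed starve all other tokens, and the tail of the high-priority set as well, unless the starting point rotates between cycles.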
    /// The `lookup` parameter controls what modifications to perform:
    /// - `ReadOnly`: No modifications, just check the cache
    /// - `UpgradeOnly`: Upgrade Quote→Auction entries, but don't create missing
    /// - `CreateForMaintenance`: Create missing entries with Auction source and
    ///   upgrade existing Quote→Auction entries
Are these outdated? I didn't find these names anywhere else
Updated the doc.
    /// This can be useful for tokens that are hard to route but are pegged to
    /// the same underlying asset so approximating their native prices is deemed
    /// safe (e.g. csUSDL => Dai).
    /// It's very important that the 2 tokens have the same number of decimals.
Not for this PR
Should we verify this at startup?
Yeah, makes sense.
    // TODO remove when implementing a less hacky solution
    /// Maps a requested token to an approximating token. If the system
    /// wants to get the native price for the requested token the native
    /// price of the approximating token should be fetched and returned instead.
Q: would a less hacky solution be a different cache for approximate tokens?
IMO feels better to have two caches for slightly different purposes than keeping everything under one
(Not suggesting we implement this now)
When we use a single cache, it is much easier to mark existing tokens to start executing the maintenance tasks for them. E.g. a token was added to the cache because it was part of a quote, so it doesn't require any maintenance, but when an order is placed with the same token, the token price needs to be maintained. Separate caches overcomplicate this.
Description
Once we started forwarding native price estimates from the orderbook to autopilot, CoinGecko API usage went up. This happened because the estimator moved to autopilot, which now handles all requests and also relies on CoinGecko.
This PR refactors native price estimation by introducing source-aware caching that distinguishes between auction-related requests (actively maintained) and quote/API requests (cached but not maintained). We can skip updating the latter type of requests since, for the quote competition, they are only used to ensure the token is "tradable" and we don't report those prices to the end user. The maintenance task is only required for the auction competition, where the native price matters.
Changes
Shared source-aware caching
Source-aware maintenance
Design considerations
The source-tracking approach reduces CoinGecko API usage by distinguishing request origins:
How to test
Existing tests.