Add retry logic and CSV directory creation #109
Conversation
Agree that the current behavior isn't optimal. We should not just skip blocks we can't fetch. However, I fear that retrying forever isn't the optimal solution here: if the node doesn't come online again, we'll retry forever. Would it make more sense to have a hard error here, i.e., end the process with a nonzero exit code?
Wouldn't this mean the entire processing of the chain has to start again?
All blocks we can process are written to the db. Currently, we start again from the max(height in db) on subsequent runs (see mainnet-observer/backend/src/lib.rs, lines 173 to 175 at 278f79e).
Ideally, we'd check if we are missing any heights in the db below the current max(height in db) and retry these first.
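As a rough sketch, that gap check could look like this in Rust, assuming the heights already stored in the db fit in memory (the function name and the in-memory approach are illustrative, not from the codebase):

```rust
/// Given the heights already present in the db, return every missing
/// height below the current maximum, so these can be retried first.
fn missing_heights(mut present: Vec<u64>) -> Vec<u64> {
    present.sort_unstable();
    let mut missing = Vec::new();
    let mut expected = 0u64;
    for h in present {
        // every height between `expected` and `h` is a gap
        while expected < h {
            missing.push(expected);
            expected += 1;
        }
        expected = h + 1;
    }
    missing
}
```

For example, `missing_heights(vec![0, 1, 3, 5])` returns `[2, 4]`; an empty result means the db is contiguous up to its max height.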
Yeah, I should have been explicit: if we want to hard error, then we need a solution to handle the missing heights. Otherwise, we have to start from scratch again. Happy to look into this approach!
Ok, having thought about this once more, I think once #113 is merged, this PR makes sense. However, we should give up retrying at some point and hard error. Maybe 3, 5, or 10 attempts?
Converting to draft; will make changes so that it ties in with #113.
…ching
Implement bounded retry (5 attempts) with exponential backoff when fetching blocks from the Bitcoin Core REST API. If a block cannot be fetched after all retry attempts, it is skipped rather than causing a hard error; the stats_version system will pick up missed blocks on the next run.
Automatically create the CSV output directory before writing files, preventing errors when the directory doesn't exist.
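A minimal sketch of that bounded-retry shape, with `fetch_block` and `Block` as hypothetical stand-ins for the real fetch path (the PR's actual code may differ):

```rust
use std::thread::sleep;
use std::time::Duration;

struct Block; // stand-in for the real block type
fn fetch_block(_height: u64) -> Result<Block, String> { Ok(Block) } // stand-in

const MAX_ATTEMPTS: u32 = 5;

// Bounded retry with exponential backoff: wait 1s, 2s, 4s, 8s between
// the five attempts, then give up and return None so the caller can
// skip the block.
fn fetch_block_with_retry(height: u64) -> Option<Block> {
    for attempt in 0..MAX_ATTEMPTS {
        match fetch_block(height) {
            Ok(block) => return Some(block),
            Err(e) => {
                eprintln!("block {height}, attempt {}: {e}", attempt + 1);
                if attempt + 1 < MAX_ATTEMPTS {
                    sleep(Duration::from_secs(1u64 << attempt));
                }
            }
        }
    }
    None // skipped; picked up by the stats_version system on the next run
}
```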
323e303 to 75c02af
Do we hard error after retries are exhausted, or skip the block and move on? I opted for skipping the block and moving on, because a re-run would pick up the block. To my mind, it is better for the process to make a "best effort" in this manner, potentially missing some blocks, rather than erroring out and requiring potentially more frequent user intervention. But I don't hold this opinion strongly! Here's a sample of the output from my testing.
I'm not convinced either is better at the moment. I'll try to challenge your comment a bit; maybe that helps us come to a conclusion:
- Why would an automatic re-run, e.g. 12 hours later, work (without manual user intervention) when the previous five attempts didn't?
- Could Bitcoin Core (for some reason) changing the response format in a newer version break our fetches (e.g. because we can't parse the returned JSON)? This would cause us to try to fetch every block with 5 retries but never succeed on any of them.
- I'm curious: how did you test this?
I guess this depends on the failure mode. The failures I encountered when I implemented the initial version of this fix were intermittent and due to heavy load on the machine running the Bitcoin Core node. Intermittent failures would probably become silent with a "retry and skip" strategy, unless one is specifically paying attention. This could mask an underlying issue, which would not be desirable.
I don't know if Bitcoin Core changing the response format is a valid concern (I'd guess we'd be forewarned of such a change well in advance), but some persistent failure that causes us to try to fetch every block with 5 retries without ever getting one is! This would also be undesirable.
Upon reflection, given the above, I don't think the "retry and skip" strategy has advantages over "retry and fail". Retry and fail forces action, which is the behaviour you want.
Very hackily... via source code changes. I considered mocking but deemed it not worth the effort. I'll modify this to use the "retry and fail" strategy?
Yeah, sounds good to me. While debugging #117, I recently noticed that we currently fail in each rayon thread on its own. So if we have 14 threads, we need to fail in 14 of them to cause the process to exit. Otherwise it will happily continue to run with just a few threads remaining, but will never fetch all blocks. I wasn't aware of that.
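For illustration, one way to get fail-fast behaviour across rayon workers is `try_for_each`, which short-circuits on the first `Err` instead of letting each thread die on its own (a sketch with stand-in types, not the project's actual code):

```rust
use rayon::prelude::*;

struct Block; // stand-in for the real block type
fn fetch_block(_height: u64) -> Result<Block, String> { Ok(Block) } // stand-in

fn main() {
    let heights: Vec<u64> = (0..800_000).collect();
    // try_for_each stops scheduling new work once any worker returns
    // Err, so a persistent failure surfaces as one process-level error
    // rather than silently idling individual threads.
    let result: Result<(), String> = heights
        .par_iter()
        .try_for_each(|&h| fetch_block(h).map(|_| ()));
    if let Err(e) = result {
        eprintln!("block fetch failed: {e}");
        std::process::exit(1); // hard error with a nonzero exit code
    }
}
```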
Marking as draft for now :)
I encountered issues when I re-ran from scratch (the screenshots are from when I first encountered this in October).
For any block that could not be fetched there was no retry, so it didn't make it into the dataset!
This occurred when my development machine, with Bitcoin Core running on localhost, was under moderate load from some other processes.
The solution I opted for was to retry indefinitely with exponential backoff:
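The original snippet isn't reproduced here; a minimal sketch of that shape, with `fetch_block` and `Block` as hypothetical stand-ins, might look like:

```rust
use std::thread::sleep;
use std::time::Duration;

struct Block; // stand-in for the real block type
fn fetch_block(_height: u64) -> Result<Block, String> { Ok(Block) } // stand-in

// Retry forever, doubling the delay up to a cap, so transient
// unavailability of the node is ridden out rather than fatal.
fn fetch_block_retrying(height: u64) -> Block {
    let mut delay = Duration::from_secs(1);
    loop {
        match fetch_block(height) {
            Ok(block) => return block,
            Err(e) => {
                eprintln!("block {height} failed: {e}; retrying in {delay:?}");
                sleep(delay);
                delay = (delay * 2).min(Duration::from_secs(64)); // cap the backoff
            }
        }
    }
}
```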
This prevents the entire process from failing due to temporary network issues or brief Bitcoin Core unavailability. I felt that indefinite retry was most appropriate because we don't want to miss processing any block, and if we get stuck in some unavailability loop, the process can be killed manually?
The fix was confirmed on a similarly loaded machine (again, at the time of the issue).
In addition, write_csv_files() now creates the output directory if it doesn't exist.
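For reference, a sketch of that directory-creation step (names are illustrative, not the actual write_csv_files() signature):

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Create the output directory (and any missing parents) before writing;
// create_dir_all succeeds without error if the directory already exists.
fn write_csv(dir: &Path, name: &str, contents: &str) -> std::io::Result<()> {
    fs::create_dir_all(dir)?;
    let mut file = fs::File::create(dir.join(name))?;
    file.write_all(contents.as_bytes())
}
```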