
Conversation

@deadmanoz
Contributor

I encountered issues when I re-ran from scratch (screenshots are from when I first encountered this in October)

[Screenshot from 2025-10-01, 11:27 am]

There is no retry for blocks that could not be fetched, so they don't make it into the dataset!

This occurred when my development machine, with Bitcoin Core running on localhost, was under moderate load from some other processes.

The solution I opted for was to retry indefinitely with exponential backoff:

  • Initial delay: 100ms
  • Doubles each attempt up to max 30 seconds
  • Logs at WARN level with attempt count and delay

This prevents the entire process from failing due to temporary network issues or brief Bitcoin Core unavailability. I felt that indefinite retry was most appropriate because we don't want to miss any block, and if we ever get stuck in an unavailability loop, the process can always be killed manually?
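
For illustration, a minimal sketch of that retry loop, generic over the actual fetch call (the name retry_forever and the exact signature are placeholders, not the code in this PR):

use std::fmt::Display;
use std::thread::sleep;
use std::time::Duration;

use log::warn;

// Retry a fallible fetch forever with exponential backoff:
// 100ms initial delay, doubling on each failure, capped at 30 seconds.
fn retry_forever<T, E: Display>(height: u64, mut fetch: impl FnMut() -> Result<T, E>) -> T {
    let mut delay = Duration::from_millis(100);
    let mut attempt: u64 = 1;
    loop {
        match fetch() {
            Ok(value) => return value,
            Err(e) => {
                warn!(
                    "Could not get block at height {} (attempt {}): {}. Retrying in {:?}...",
                    height, attempt, e, delay
                );
                sleep(delay);
                delay = (delay * 2).min(Duration::from_secs(30));
                attempt += 1;
            }
        }
    }
}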

Fix confirmed on a similarly loaded machine (again, at the time of the issue)

[Screenshot from 2025-10-01, 11:33 am]

In addition, write_csv_files() now creates the output directory if it doesn't exist.
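
A minimal sketch of what that directory handling could look like, assuming std::fs::create_dir_all or equivalent (the helper name and signature are placeholders, not necessarily the PR's code):

use std::fs;
use std::io;
use std::path::Path;

// Ensure the CSV output directory exists before any files are written.
// create_dir_all is a no-op if the directory is already there and also
// creates missing parent directories.
fn ensure_csv_dir(csv_dir: &Path) -> io::Result<()> {
    fs::create_dir_all(csv_dir)
}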

Changes:

  • Add exponential backoff retry for block fetching (100ms initial, 30s max) to handle transient network issues gracefully
  • Auto-create CSV output directory if it doesn't exist

0xB10C self-requested a review on December 22, 2025, 09:11
@0xB10C
Owner

0xB10C commented Jan 5, 2026

Agree that the current behavior isn't optimal. We should not just skip blocks we can't fetch.

However, I fear that retrying forever isn't the optimal solution here. If the node doesn't come online again, we'll retry forever.

Would it make more sense to have a hard error here, i.e. end the process with a nonzero exit code?

@deadmanoz
Contributor Author

Wouldn't this mean the entire processing of the chain has to start again?

@0xB10C
Owner

0xB10C commented Jan 6, 2026

All blocks we can process are written to the db. Currently, we start again from the max(height in db) on subsequent runs:

let mut heights: Vec<i64> = (std::cmp::max(db_height + 1, 0)
    ..std::cmp::max((rest_height - REORG_SAFETY_MARGIN) as i64, 0))
    .collect();

Ideally, we'd check if we are missing any heights in the db below the current max(height in db) and retry these first..
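
For illustration, a sketch of that gap check, assuming the heights already in the db can be read into a slice (nothing here reflects the actual mainnet-observer schema or queries):

use std::collections::HashSet;

// Return every height below (and including) the current db maximum
// that is not yet present, so these can be fetched before new blocks.
fn missing_heights(db_heights: &[i64]) -> Vec<i64> {
    let Some(&max) = db_heights.iter().max() else {
        return Vec::new(); // empty db: nothing can be missing below max
    };
    let present: HashSet<i64> = db_heights.iter().copied().collect();
    (0..=max).filter(|h| !present.contains(h)).collect()
}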

@deadmanoz
Contributor Author

All blocks we can process are written to the db. Currently, we start again from the max(height in db) on subsequent runs:

let mut heights: Vec<i64> = (std::cmp::max(db_height + 1, 0)
    ..std::cmp::max((rest_height - REORG_SAFETY_MARGIN) as i64, 0))
    .collect();

Ideally, we'd check if we are missing any heights in the db below the current max(height in db) and retry these first..

Yeah, I should have been explicit - if we want to hard error, then we need a solution to handle the missing heights. Otherwise, we have to start from scratch again.

Happy to look into this approach!

@0xB10C
Owner

0xB10C commented Jan 7, 2026

I'm not sure how well these things go together, but conceptually, it might be interesting to design with #74 in mind.

The goal of #74 would be to allow overriding stats in the database from older versions of mainnet-observer.

@0xB10C
Owner

0xB10C commented Jan 8, 2026

Ok, having thought about this once more, I think once #113 is merged, this PR makes sense. However, we should give up retrying at some point and hard error. Maybe 3, 5, or 10 attempts?

@deadmanoz
Contributor Author

Converting to draft, will make changes so that it ties in with #113

…ching

Implement bounded retry (5 attempts) with exponential backoff when
fetching blocks from Bitcoin Core REST API. If a block cannot be
fetched after all retry attempts, it is skipped rather than causing
a hard error - the stats_version system will pick up missed blocks
on the next run.
Automatically create the CSV output directory before writing files,
preventing errors when the directory doesn't exist.
deadmanoz force-pushed the add-retry-logic-and-csv-dir-creation branch from 323e303 to 75c02af on January 10, 2026, 06:28
@deadmanoz
Contributor Author

Do we hard error after retries are exhausted, or skip the block and move on? I opted for skipping the block and moving on, because a re-run would pick up the block.

To my mind, it is better for the process to do "best effort" in this manner, potentially missing some blocks, rather than erroring out and requiring more frequent user intervention? But I don't hold this opinion strongly!
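
For reference, the retry-and-skip behaviour described in the commit looks roughly like the following sketch (names are placeholders; only the attempt count, backoff, and log messages are taken from this PR, and they match the output below):

use std::fmt::Display;
use std::thread::sleep;
use std::time::Duration;

use log::{error, warn};

const MAX_ATTEMPTS: u32 = 5;

// Bounded retry: back off exponentially and give up after MAX_ATTEMPTS,
// returning None so the caller can skip the height and move on.
fn retry_bounded<T, E: Display>(height: u64, mut fetch: impl FnMut() -> Result<T, E>) -> Option<T> {
    let mut delay = Duration::from_millis(100);
    for attempt in 1..=MAX_ATTEMPTS {
        match fetch() {
            Ok(value) => return Some(value),
            Err(e) if attempt < MAX_ATTEMPTS => {
                warn!(
                    "Could not get block at height {} (attempt {}/{}): {}. Retrying in {:?}...",
                    height, attempt, MAX_ATTEMPTS, e, delay
                );
                sleep(delay);
                delay = (delay * 2).min(Duration::from_secs(30));
            }
            Err(e) => {
                error!(
                    "Max retry attempts reached for block at height {}: {}. Skipping height.",
                    height, e
                );
            }
        }
    }
    None
}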

Here's a sample of the output from my testing

[2026-01-10T06:22:51Z INFO  mainnet_observer_backend] Using 1 threads for block fetching & processing
127.0.0.1 - - [10/Jan/2026 14:22:51] "GET /rest/chaininfo.json HTTP/1.1" 200 -
[2026-01-10T06:22:51Z INFO  mainnet_observer_backend] Fetching 4 blocks (heights min=0, max=3)
127.0.0.1 - - [10/Jan/2026 14:22:51] "GET /rest/blockhashbyheight/0.hex HTTP/1.1" 500 -
[2026-01-10T06:22:51Z WARN  mainnet_observer_backend] Could not get block at height 0 (attempt 1/5): HTTP error: 500 Internal Server Error. Retrying in 100ms...
127.0.0.1 - - [10/Jan/2026 14:22:51] "GET /rest/blockhashbyheight/0.hex HTTP/1.1" 500 -
[2026-01-10T06:22:51Z WARN  mainnet_observer_backend] Could not get block at height 0 (attempt 2/5): HTTP error: 500 Internal Server Error. Retrying in 200ms...
127.0.0.1 - - [10/Jan/2026 14:22:52] "GET /rest/blockhashbyheight/0.hex HTTP/1.1" 500 -
[2026-01-10T06:22:52Z WARN  mainnet_observer_backend] Could not get block at height 0 (attempt 3/5): HTTP error: 500 Internal Server Error. Retrying in 400ms...
127.0.0.1 - - [10/Jan/2026 14:22:52] "GET /rest/blockhashbyheight/0.hex HTTP/1.1" 500 -
[2026-01-10T06:22:52Z WARN  mainnet_observer_backend] Could not get block at height 0 (attempt 4/5): HTTP error: 500 Internal Server Error. Retrying in 800ms...
127.0.0.1 - - [10/Jan/2026 14:22:53] "GET /rest/blockhashbyheight/0.hex HTTP/1.1" 500 -
[2026-01-10T06:22:53Z ERROR mainnet_observer_backend] Max retry attempts reached for block at height 0: HTTP error: 500 Internal Server Error. Skipping height.
127.0.0.1 - - [10/Jan/2026 14:22:53] "GET /rest/blockhashbyheight/1.hex HTTP/1.1" 500 -
[2026-01-10T06:22:53Z WARN  mainnet_observer_backend] Could not get block at height 1 (attempt 1/5): HTTP error: 500 Internal Server Error. Retrying in 100ms...
127.0.0.1 - - [10/Jan/2026 14:22:53] "GET /rest/blockhashbyheight/1.hex HTTP/1.1" 500 -
[2026-01-10T06:22:53Z WARN  mainnet_observer_backend] Could not get block at height 1 (attempt 2/5): HTTP error: 500 Internal Server Error. Retrying in 200ms...
127.0.0.1 - - [10/Jan/2026 14:22:53] "GET /rest/blockhashbyheight/1.hex HTTP/1.1" 500 -
[2026-01-10T06:22:53Z WARN  mainnet_observer_backend] Could not get block at height 1 (attempt 3/5): HTTP error: 500 Internal Server Error. Retrying in 400ms...
127.0.0.1 - - [10/Jan/2026 14:22:54] "GET /rest/blockhashbyheight/1.hex HTTP/1.1" 500 -
[2026-01-10T06:22:54Z WARN  mainnet_observer_backend] Could not get block at height 1 (attempt 4/5): HTTP error: 500 Internal Server Error. Retrying in 800ms...
127.0.0.1 - - [10/Jan/2026 14:22:54] "GET /rest/blockhashbyheight/1.hex HTTP/1.1" 500 -
[2026-01-10T06:22:54Z ERROR mainnet_observer_backend] Max retry attempts reached for block at height 1: HTTP error: 500 Internal Server Error. Skipping height.

deadmanoz marked this pull request as ready for review on January 10, 2026, 06:46
@0xB10C
Owner

0xB10C commented Jan 10, 2026

I'm not convinced either is better at the moment. I'll try to challenge your comment a bit - maybe that helps us to come to a conclusion.

Do we hard error after retries are exhausted, or skip the block and move on? I opted for skipping the block and moving on, because a re-run would pick up the block.

Why would an automatic re-run e.g. 12h later work (without manual user intervention) when it didn't work the previous five attempts?

To my mind, it is better for the process to do "best effort" in this manner, potentially missing some blocks, rather than erroring out and requiring more frequent user intervention? But I don't hold this opinion strongly!

Could Bitcoin Core (for some reason) changing the response format in a newer version break our fetches (e.g. if we can't parse the returned JSON)? This would cause us to try to fetch all blocks with 5 retries but never be able to get one?

Here's a sample of the output from my testing

I'm curious! How did you test this?

@deadmanoz
Contributor Author

deadmanoz commented Jan 16, 2026

Why would an automatic re-run e.g. 12h later work (without manual user intervention) when it didn't work the previous five attempts?

I guess this depends on the failure mode. The failures I encountered when I implemented the initial version of this fix were intermittent and due to heavy load on the machine running the Bitcoin Core node.

Intermittent failures would probably become silent with a "retry and skip" strategy, unless one is specifically paying attention. This could mask an underlying issue, which would not be desirable.

Could Bitcoin Core (for some reason) changing the response format in a newer version break our fetches (e.g. if we can't parse the returned JSON)? This would cause us to try to fetch all blocks with 5 retries but never be able to get one?

I don't know if Bitcoin Core changing the response format is a valid concern (I guess we'd be forewarned/aware well in advance), but having some persistent failure that causes us to try to fetch all blocks with 5 retries without ever getting one is! This would also be undesirable.

I'm not convinced either is better at the moment. I'll try to challenge your comment a bit - maybe that helps us to come to a conclusion.

Upon reflection, given the above, I don't think the "retry and skip" strategy has advantages over "retry and fail". Retry and fail forces action, which is the behaviour you want:

  • persistent failure - resolve the cause, re-run
  • intermittent failure - investigate the cause, re-run

I'm curious! How did you test this?

Very hackily... via source code changes. I considered mocking but deemed it not worth the effort.

I will modify it to the "retry and fail" strategy, then?

@0xB10C
Owner

0xB10C commented Jan 16, 2026

I will modify it to the "retry and fail" strategy, then?

Yeah, sounds good to me.

While debugging #117 I recently noticed that we currently fail in each rayon thread on its own. So if we have 14 threads, we need to fail in all 14 of them to cause the process to exit. It will otherwise happily continue to run with just a few threads remaining, but will never fetch all blocks. I wasn't aware of that.
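
Not part of this PR, but for reference: one way to propagate a single failure to the whole pool would be rayon's try_for_each, which returns the first Err and stops further processing as soon as possible, instead of letting each thread die on its own. A sketch with a hypothetical process_height:

use rayon::prelude::*;

// Sketch: process heights in parallel, stopping the whole parallel
// iteration and returning the error as soon as any height fails.
fn process_all(heights: &[i64]) -> Result<(), String> {
    heights.par_iter().try_for_each(|&height| process_height(height))
}

// Hypothetical per-height work (fetch, compute stats, write to db);
// returns Err on a permanent failure.
fn process_height(height: i64) -> Result<(), String> {
    let _ = height;
    Ok(())
}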

0xB10C marked this pull request as draft on January 23, 2026, 13:26
@0xB10C
Owner

0xB10C commented Jan 23, 2026

Marking as draft for now :)
