
Conversation

@deadmanoz
Contributor

I encountered issues when I re-ran from scratch (screenshots are from when I first encountered this in October)

[Screenshot from 2025-10-01, 11:27 am]

There is no retry for blocks that could not be fetched, so they don't make it into the dataset!

This occurred when my development machine, with Bitcoin Core running on localhost, was under moderate load from some other processes.

The solution I opted for was to retry indefinitely with exponential backoff:

  • Initial delay: 100ms
  • Doubles each attempt up to max 30 seconds
  • Logs at WARN level with attempt count and delay

This prevents the entire process from failing due to temporary network issues or brief Bitcoin Core unavailability. I felt that indefinite retry was most appropriate because we don't want to miss any block, and if we ever get stuck in an unavailability loop, the process can always be killed manually?
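
For illustration, a minimal sketch of that retry loop, generic over the actual fetch call (the name retry_forever and the exact signature are placeholders, not the code in this PR):

use std::fmt::Display;
use std::thread::sleep;
use std::time::Duration;

use log::warn;

// Retry a fallible fetch forever with exponential backoff:
// 100ms initial delay, doubling on each failure, capped at 30 seconds.
fn retry_forever<T, E: Display>(height: u64, mut fetch: impl FnMut() -> Result<T, E>) -> T {
    let mut delay = Duration::from_millis(100);
    let mut attempt: u64 = 1;
    loop {
        match fetch() {
            Ok(value) => return value,
            Err(e) => {
                warn!(
                    "Could not get block at height {} (attempt {}): {}. Retrying in {:?}...",
                    height, attempt, e, delay
                );
                sleep(delay);
                delay = (delay * 2).min(Duration::from_secs(30));
                attempt += 1;
            }
        }
    }
}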

Fix confirmed on a similarly loaded machine (again, at the time of the issue)

[Screenshot from 2025-10-01, 11:33 am]

In addition, write_csv_files() now creates the output directory if it doesn't exist.
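
A minimal sketch of what that directory handling could look like, assuming std::fs::create_dir_all or equivalent (the helper name and signature are placeholders, not necessarily the PR's code):

use std::fs;
use std::io;
use std::path::Path;

// Ensure the CSV output directory exists before any files are written.
// create_dir_all is a no-op if the directory is already there and also
// creates missing parent directories.
fn ensure_csv_dir(csv_dir: &Path) -> io::Result<()> {
    fs::create_dir_all(csv_dir)
}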

Changes:

  • Add exponential backoff retry for block fetching (100ms initial, 30s max) to handle transient network issues gracefully
  • Auto-create CSV output directory if it doesn't exist

0xB10C self-requested a review on December 22, 2025, 09:11
@0xB10C
Owner

0xB10C commented Jan 5, 2026

Agree that the current behavior isn't optimal. We should not just skip blocks we can't fetch.

However, I fear that retrying forever isn't the optimal solution here. If the node doesn't come online again, we'll retry forever.

Would it make more sense to have a hard error here, i.e. end the process with a nonzero exit code?

@deadmanoz
Contributor Author

Wouldn't this mean the entire processing of the chain has to start again?

@0xB10C
Owner

0xB10C commented Jan 6, 2026

All blocks we can process are written to the db. Currently, we start again from the max(height in db) on subsequent runs:

let mut heights: Vec<i64> = (std::cmp::max(db_height + 1, 0)
    ..std::cmp::max((rest_height - REORG_SAFETY_MARGIN) as i64, 0))
    .collect();

Ideally, we'd check if we are missing any heights in the db below the current max(height in db) and retry these first..
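
For illustration, a sketch of that gap check, assuming the heights already in the db can be read into a slice (nothing here reflects the actual mainnet-observer schema or queries):

use std::collections::HashSet;

// Return every height below (and including) the current db maximum
// that is not yet present, so these can be fetched before new blocks.
fn missing_heights(db_heights: &[i64]) -> Vec<i64> {
    let Some(&max) = db_heights.iter().max() else {
        return Vec::new(); // empty db: nothing can be missing below max
    };
    let present: HashSet<i64> = db_heights.iter().copied().collect();
    (0..=max).filter(|h| !present.contains(h)).collect()
}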

@deadmanoz
Contributor Author

All blocks we can process are written to the db. Currently, we start again from the max(height in db) on subsequent runs:

let mut heights: Vec<i64> = (std::cmp::max(db_height + 1, 0)
    ..std::cmp::max((rest_height - REORG_SAFETY_MARGIN) as i64, 0))
    .collect();

Ideally, we'd check if we are missing any heights in the db below the current max(height in db) and retry these first..

Yeah, I should have been explicit - if we want to hard error, then we need a solution to handle the missing heights. Otherwise, we have to start from scratch again.

Happy to look into this approach!

@0xB10C
Owner

0xB10C commented Jan 7, 2026

I'm not sure how well these things go together, but conceptually, it might be interesting to design with #74 in mind.

The goal of #74 would be to allow overriding stats in the database from older versions of mainnet-observer.

@0xB10C
Owner

0xB10C commented Jan 8, 2026

Ok, having thought about this once more, I think once #113 is merged, this PR makes sense. However, we should give up retrying at some point and hard error. Maybe 3, 5, or 10 attempts?

@deadmanoz
Contributor Author

Converting to draft, will make changes so that it ties in with #113

…ching

Implement bounded retry (5 attempts) with exponential backoff when
fetching blocks from Bitcoin Core REST API. If a block cannot be
fetched after all retry attempts, it is skipped rather than causing
a hard error - the stats_version system will pick up missed blocks
on the next run.
Automatically create the CSV output directory before writing files,
preventing errors when the directory doesn't exist.
deadmanoz force-pushed the add-retry-logic-and-csv-dir-creation branch from 323e303 to 75c02af on January 10, 2026, 06:28
@deadmanoz
Contributor Author

Do we hard error after retries are exhausted, or skip the block and move on? I opted for skipping the block and moving on, because a re-run would pick up the block.

To my mind, it is better for the process to do "best effort" in this manner, potentially missing some blocks, rather than erroring out and requiring more frequent user intervention? But I don't hold this opinion strongly!
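
For reference, the retry-and-skip behaviour described in the commit looks roughly like the following sketch (names are placeholders; only the attempt count, backoff, and log messages are taken from this PR, and they match the output below):

use std::fmt::Display;
use std::thread::sleep;
use std::time::Duration;

use log::{error, warn};

const MAX_ATTEMPTS: u32 = 5;

// Bounded retry: back off exponentially and give up after MAX_ATTEMPTS,
// returning None so the caller can skip the height and move on.
fn retry_bounded<T, E: Display>(height: u64, mut fetch: impl FnMut() -> Result<T, E>) -> Option<T> {
    let mut delay = Duration::from_millis(100);
    for attempt in 1..=MAX_ATTEMPTS {
        match fetch() {
            Ok(value) => return Some(value),
            Err(e) if attempt < MAX_ATTEMPTS => {
                warn!(
                    "Could not get block at height {} (attempt {}/{}): {}. Retrying in {:?}...",
                    height, attempt, MAX_ATTEMPTS, e, delay
                );
                sleep(delay);
                delay = (delay * 2).min(Duration::from_secs(30));
            }
            Err(e) => {
                error!(
                    "Max retry attempts reached for block at height {}: {}. Skipping height.",
                    height, e
                );
            }
        }
    }
    None
}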

Here's a sample of the output from my testing

[2026-01-10T06:22:51Z INFO  mainnet_observer_backend] Using 1 threads for block fetching & processing
127.0.0.1 - - [10/Jan/2026 14:22:51] "GET /rest/chaininfo.json HTTP/1.1" 200 -
[2026-01-10T06:22:51Z INFO  mainnet_observer_backend] Fetching 4 blocks (heights min=0, max=3)
127.0.0.1 - - [10/Jan/2026 14:22:51] "GET /rest/blockhashbyheight/0.hex HTTP/1.1" 500 -
[2026-01-10T06:22:51Z WARN  mainnet_observer_backend] Could not get block at height 0 (attempt 1/5): HTTP error: 500 Internal Server Error. Retrying in 100ms...
127.0.0.1 - - [10/Jan/2026 14:22:51] "GET /rest/blockhashbyheight/0.hex HTTP/1.1" 500 -
[2026-01-10T06:22:51Z WARN  mainnet_observer_backend] Could not get block at height 0 (attempt 2/5): HTTP error: 500 Internal Server Error. Retrying in 200ms...
127.0.0.1 - - [10/Jan/2026 14:22:52] "GET /rest/blockhashbyheight/0.hex HTTP/1.1" 500 -
[2026-01-10T06:22:52Z WARN  mainnet_observer_backend] Could not get block at height 0 (attempt 3/5): HTTP error: 500 Internal Server Error. Retrying in 400ms...
127.0.0.1 - - [10/Jan/2026 14:22:52] "GET /rest/blockhashbyheight/0.hex HTTP/1.1" 500 -
[2026-01-10T06:22:52Z WARN  mainnet_observer_backend] Could not get block at height 0 (attempt 4/5): HTTP error: 500 Internal Server Error. Retrying in 800ms...
127.0.0.1 - - [10/Jan/2026 14:22:53] "GET /rest/blockhashbyheight/0.hex HTTP/1.1" 500 -
[2026-01-10T06:22:53Z ERROR mainnet_observer_backend] Max retry attempts reached for block at height 0: HTTP error: 500 Internal Server Error. Skipping height.
127.0.0.1 - - [10/Jan/2026 14:22:53] "GET /rest/blockhashbyheight/1.hex HTTP/1.1" 500 -
[2026-01-10T06:22:53Z WARN  mainnet_observer_backend] Could not get block at height 1 (attempt 1/5): HTTP error: 500 Internal Server Error. Retrying in 100ms...
127.0.0.1 - - [10/Jan/2026 14:22:53] "GET /rest/blockhashbyheight/1.hex HTTP/1.1" 500 -
[2026-01-10T06:22:53Z WARN  mainnet_observer_backend] Could not get block at height 1 (attempt 2/5): HTTP error: 500 Internal Server Error. Retrying in 200ms...
127.0.0.1 - - [10/Jan/2026 14:22:53] "GET /rest/blockhashbyheight/1.hex HTTP/1.1" 500 -
[2026-01-10T06:22:53Z WARN  mainnet_observer_backend] Could not get block at height 1 (attempt 3/5): HTTP error: 500 Internal Server Error. Retrying in 400ms...
127.0.0.1 - - [10/Jan/2026 14:22:54] "GET /rest/blockhashbyheight/1.hex HTTP/1.1" 500 -
[2026-01-10T06:22:54Z WARN  mainnet_observer_backend] Could not get block at height 1 (attempt 4/5): HTTP error: 500 Internal Server Error. Retrying in 800ms...
127.0.0.1 - - [10/Jan/2026 14:22:54] "GET /rest/blockhashbyheight/1.hex HTTP/1.1" 500 -
[2026-01-10T06:22:54Z ERROR mainnet_observer_backend] Max retry attempts reached for block at height 1: HTTP error: 500 Internal Server Error. Skipping height.

deadmanoz marked this pull request as ready for review on January 10, 2026, 06:46
@0xB10C
Owner

0xB10C commented Jan 10, 2026

I'm not convinced either is better at the moment. I'll try to challenge your comment a bit - maybe that helps us to come to a conclusion.

Do we hard error after retries are exhausted, or skip the block and move on? I opted for skipping the block and moving on, because a re-run would pick up the block.

Why would an automatic re-run e.g. 12h later work (without manual user intervention) when it didn't work the previous five attempts?

To my mind, it is better for the process to do "best effort" in this manner, potentially missing some blocks, rather than erroring out and requiring more frequent user intervention? But I don't hold this opinion strongly!

Could Bitcoin Core (for some reason) changing the response format in a newer version break our fetches (e.g. if we can't parse the returned JSON)? This would cause us to try to fetch all blocks with 5 retries but never be able to get one?

Here's a sample of the output from my testing

I'm curious! How did you test this?

@deadmanoz
Contributor Author

deadmanoz commented Jan 16, 2026

Why would an automatic re-run e.g. 12h later work (without manual user intervention) when it didn't work the previous five attempts?

I guess this depends on the failure mode. The failures I encountered when I implemented the initial version of this fix were intermittent and due to heavy load on the machine running the Bitcoin Core node.

Intermittent failures would probably become silent with a "retry and skip" strategy, unless one is specifically paying attention. This could mask an underlying issue, which would not be desirable.

Could Bitcoin Core (for some reason) changing the response format in a newer version break our fetches (e.g. if we can't parse the returned JSON)? This would cause us to try to fetch all blocks with 5 retries but never be able to get one?

I don't know if Bitcoin Core changing the response format is a valid concern (I guess we'd be forewarned/aware well in advance), but having some persistent failure that causes us to try to fetch all blocks with 5 retries without ever getting one is! This would also be undesirable.

I'm not convinced either is better at the moment. I'll try to challenge your comment a bit - maybe that helps us to come to a conclusion.

Upon reflection, given the above, I don't think the "retry and skip" strategy has advantages over "retry and fail". Retry and fail forces action, which is the behaviour you want:

  • persistent failure - resolve the cause, re-run
  • intermittent failure - investigate the cause, re-run

I'm curious! How did you test this?

Very hackily... via source code changes. I considered mocking but deemed it not worth the effort.

I will modify it to the "retry and fail" strategy, then?

@0xB10C
Owner

0xB10C commented Jan 16, 2026

I will modify it to the "retry and fail" strategy, then?

Yeah, sounds good to me.

While debugging #117 I recently noticed that we currently fail in each rayon thread on its own. So if we have 14 threads, we need to fail in all 14 of them to cause the process to exit. It will otherwise happily continue to run with just a few threads remaining, but will never fetch all blocks. I wasn't aware of that.
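
Not part of this PR, but for reference: one way to propagate a single failure to the whole pool would be rayon's try_for_each, which returns the first Err and stops further processing as soon as possible, instead of letting each thread die on its own. A sketch with a hypothetical process_height:

use rayon::prelude::*;

// Sketch: process heights in parallel, stopping the whole parallel
// iteration and returning the error as soon as any height fails.
fn process_all(heights: &[i64]) -> Result<(), String> {
    heights.par_iter().try_for_each(|&height| process_height(height))
}

// Hypothetical per-height work (fetch, compute stats, write to db);
// returns Err on a permanent failure.
fn process_height(height: i64) -> Result<(), String> {
    let _ = height;
    Ok(())
}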

0xB10C marked this pull request as draft on January 23, 2026, 13:26
@0xB10C
Owner

0xB10C commented Jan 23, 2026

Marking as draft for now :)
