
Conversation

@CuzImClicks (Contributor)

Description

Uses a Notify per chunk and stops reference counting pending writes. This PR still deadlocks, but it's no longer triggered by flying around; it can be triggered by just standing still for 20 minutes? 😭

Testing

Please follow our Coding Guidelines

@tn-lorenz

Could this be a ThreadPool becoming full because we have a memory leak somewhere in the code? I think this might look the same as a deadlock.

@sprinox (Contributor) commented Jan 6, 2026

If no reference counting is used, then with the current PR implementation, it’s possible for two save tasks to be issued for the same chunk. After the first save task finishes writing, the chunk gets unlocked. At that point, a reader thread may read it immediately without waiting for the second save task to complete, which can result in chunk data being rolled back.
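Roughly, the reference counting guards against this as follows (a minimal sketch with illustrative names, not the PR's actual types):

use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative per-chunk bookkeeping: the chunk may only be handed back to
// readers once *every* queued save task has finished, not just the first.
struct ChunkIoState {
    pending_writes: AtomicUsize,
}

impl ChunkIoState {
    // Called when a save task for this chunk is queued.
    fn write_queued(&self) {
        self.pending_writes.fetch_add(1, Ordering::SeqCst);
    }

    // Called when a save task completes. Returns true only for the last
    // outstanding write, which is the point where readers may proceed.
    fn write_finished(&self) -> bool {
        self.pending_writes.fetch_sub(1, Ordering::SeqCst) == 1
    }
}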

@CuzImClicks (Contributor, Author)

@sprinox

I added back the reference counting, but it's still deadlocking. What I noticed is that it deadlocks while I'm standing still and just coding, not loading/generating chunks. I unbounded all of the channels, but that didn't really fix anything either; this deadlock seems to be really stubborn. But I definitely think using CancellationToken is better than a Condvar, because CancellationToken was made for exactly this purpose: waiting, or completing instantly if a permit has already been deposited.
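As a rough sketch of that wait-or-complete-instantly behaviour (illustrative names only, not the actual types in this PR), using tokio_util's CancellationToken as a per-chunk "write finished" signal:

use tokio_util::sync::CancellationToken;

// Illustrative sketch: one token per pending chunk write.
struct PendingWrite {
    done: CancellationToken,
}

impl PendingWrite {
    fn new() -> Self {
        Self { done: CancellationToken::new() }
    }

    // IO side: mark the write as finished. All current waiters wake up,
    // and any future waiter completes immediately.
    fn complete(&self) {
        self.done.cancel();
    }

    // Reader side: wait for the write, or return right away if it already
    // completed (the "permit already deposited" case).
    async fn wait(&self) {
        self.done.cancelled().await;
    }
}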

I moved the write task onto its own thread, but it would probably be better to move the write tasks for all levels together onto one thread instead of one thread per level. I found that even though it's async IO, it's still better to have the writes on their own thread, and it causes fewer slow ticks (rough sketch of what I mean after the log output below).

2026-01-08 00:09:14 [INFO] (33) Write took 71.524089ms
2026-01-08 00:09:29 [INFO] (33) io write thread receive chunks size 2731
2026-01-08 00:09:29 [INFO] (33) Write took 33.541264ms
2026-01-08 00:09:45 [INFO] (33) io write thread receive chunks size 2731
2026-01-08 00:09:45 [INFO] (33) Write took 107.093064ms
2026-01-08 00:10:00 [INFO] (33) io write thread receive chunks size 2731
2026-01-08 00:10:00 [INFO] (33) Write took 57.036952ms
2026-01-08 00:10:15 [INFO] (33) io write thread receive chunks size 2731
2026-01-08 00:10:15 [INFO] (33) Write took 12.304892ms
2026-01-08 00:10:31 [INFO] (33) io write thread receive chunks size 2731
2026-01-08 00:10:31 [INFO] (33) Write took 14.138839ms
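Rough sketch of the single shared write thread I have in mind (all names illustrative; I'm assuming the payload is a level id plus already-serialized chunk bytes):

use std::sync::mpsc;
use std::thread;
use std::time::Instant;

// Illustrative payload: which level the batch belongs to plus the chunk
// data to flush.
struct WriteBatch {
    level_id: usize,
    chunks: Vec<Vec<u8>>,
}

// One OS thread drains write batches for every level, so slow disk writes
// never run on (or block) the tick threads.
fn spawn_shared_write_thread() -> mpsc::Sender<WriteBatch> {
    let (tx, rx) = mpsc::channel::<WriteBatch>();
    thread::spawn(move || {
        while let Ok(batch) = rx.recv() {
            let start = Instant::now();
            // ... write batch.chunks for batch.level_id to disk here ...
            println!(
                "io write thread received {} chunks for level {}, write took {:?}",
                batch.chunks.len(),
                batch.level_id,
                start.elapsed()
            );
        }
    });
    tx
}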

If you have any more ideas on how this could be improved, I would appreciate your advice!

I also made some improvements to the tick_data code and eliminated a lot of per-tick allocation. I found that, per tick, the random tick calculation actually takes the most time...

// Roll `random_tick_speed` random positions per chunk section and only
// schedule ticks for blocks that actually have random ticks.
for i in 0..chunk.section.sections.len() {
    for _ in 0..random_tick_speed {
        // One random u32 provides the x/y/z offsets (4 bits each).
        let r = rng.random::<u32>();
        let x_offset = (r & 0xF) as i32;
        let y_offset = ((r >> 4) & 0xF) as i32 - 32;
        let z_offset = ((r >> 8) & 0xF) as i32;

        let random_pos = BlockPos::new(
            chunk_x_base + x_offset,
            i as i32 * 16 + y_offset,
            chunk_z_base + z_offset,
        );

        // Look the block up directly in the section data; fall back to air
        // if the lookup fails.
        let block_id = chunk
            .section
            .get_block_absolute_y(x_offset as usize, random_pos.0.y, z_offset as usize)
            .unwrap_or(Block::AIR.default_state.id);

        // Only push a ScheduledTick for blocks that actually receive
        // random ticks.
        if has_random_ticks(block_id) {
            random_ticks.push(ScheduledTick {
                position: random_pos,
                delay: 0,
                priority: TickPriority::Normal,
                value: (),
            });
        }
    }
}

@sprinox (Contributor) commented Jan 8, 2026

I agree that CancellationToken is better, but I can't figure out a safe implementation.

About the deadlock: it seldom occurs on my PC. If anyone uses Linux and can reproduce the deadlock, it would be a great help to enable the task dump, print all current tasks, and find which one is deadlocked.

I commented that code out because I use Windows. By using it, you can get output like
#1086 (comment)

[screenshot of the task dump output]

I suggest calling the dump function if shutdown takes longer than 300 seconds. I removed it earlier because it wouldn’t pass CI, so it needs to be added back locally.
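A minimal sketch of that, assuming a recent tokio built with RUSTFLAGS="--cfg tokio_unstable --cfg tokio_taskdump" (task dumps are Linux-only); the 300-second threshold and the helper name are illustrative:

use std::time::Duration;
use tokio::runtime::Handle;

// Hypothetical helper: if shutdown has not finished within 300 seconds,
// capture a tokio task dump and print a trace for every live task.
async fn dump_if_shutdown_hangs(shutdown: impl std::future::Future<Output = ()>) {
    if tokio::time::timeout(Duration::from_secs(300), shutdown).await.is_ok() {
        return; // shutdown completed normally
    }
    // Capturing a dump can itself block, so bound it with a timeout too.
    let handle = Handle::current();
    if let Ok(dump) = tokio::time::timeout(Duration::from_secs(10), handle.dump()).await {
        for (i, task) in dump.tasks().iter().enumerate() {
            println!("task {i} trace:\n{}\n", task.trace());
        }
    }
}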

@sprinox (Contributor) commented Jan 8, 2026

There are also some resources leaking.

I mentioned this in my previous PR.

This PR does not touch entities or entity chunks. I’m not sure what mechanism is currently used for entities, but it causes them to keep being processed even when the player is far away. As a result, chunks are repeatedly requested, temporarily loaded, and then unloaded by the scheduling thread, which severely impacts performance. So, when testing performance, try freezing the game tick.

@CuzImClicks (Contributor, Author)

(screenshot: 2026-01-08_23 14 00) Bro, I have been sitting here with two accounts for over an hour and this thing just won't deadlock.

@CuzImClicks (Contributor, Author)

Could this be a ThreadPool becoming full because we have a memory leak somewhere in the code? I think this might look the same as a deadlock.

We definitely have a memory leak, which I was able to observe yesterday, but it still didn't deadlock.

@CuzImClicks changed the title from "Fix: Stop Reference Counting Writes to Reduce Deadlocks" to "fix: optimizations to try and reduce deadlocks" on Jan 9, 2026
@CuzImClicks (Contributor, Author)

@sprinox are you getting the deadlock on this PR on Windows? Because I can't get it to deadlock on Linux right now.

@sprinox (Contributor) commented Jan 9, 2026

@sprinox are you getting the deadlock on this PR on Windows? Because I can't get it to deadlock on Linux right now.

I have not tested this PR, but I hit the deadlock on the master branch.

let (send_gen, recv_gen) = crossfire::mpmc::bounded_blocking(gen_thread_count + 5);
let io_lock = Arc::new((Mutex::new(HashMapType::default()), Condvar::new()));
for _ in 0..oi_read_thread_count {
let (send_gen, recv_gen) = crossfire::mpmc::unbounded_blocking();
Contributor (review comment):

Don't use unbounded; it will fill up with too many tasks that can't be cancelled.

Contributor Author (reply):

The sender blocks when the channel is full, though... I made it unbounded to see if it fixes the deadlocks.
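To make the trade-off concrete, here is a tiny stand-alone illustration using std::sync::mpsc::sync_channel as a stand-in for the crossfire channels: a bounded sender blocks once the channel is full, which gives backpressure, but it is also exactly the kind of blocking that can deadlock if the consumer is (indirectly) waiting on the producer.

use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    // Capacity 2, analogous to a bounded blocking channel.
    let (tx, rx) = mpsc::sync_channel::<u32>(2);
    let producer = thread::spawn(move || {
        for i in 0..5 {
            // Blocks on the third send until the consumer starts draining.
            tx.send(i).unwrap();
            println!("sent {i}");
        }
    });
    thread::sleep(Duration::from_millis(100));
    for v in rx.iter() {
        println!("received {v}");
    }
    producer.join().unwrap();
}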


fn work(mut self, level: Arc<Level>) {
async fn work(mut self, level: Arc<Level>) {
@sprinox (Contributor) commented Jan 10, 2026

I still believe that the schedule thread should not be async; it is a CPU-heavy task. You can use blocking_lock to lock a tokio Mutex from a normal thread.
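For example (a minimal sketch with made-up types; only tokio::sync::Mutex::blocking_lock itself is the real API):

use std::sync::Arc;
use std::thread;
use std::time::Duration;
use tokio::sync::Mutex;

// The scheduling work stays on a plain OS thread; async code elsewhere can
// lock the same tokio Mutex with `.lock().await`.
fn spawn_schedule_thread(queue: Arc<Mutex<Vec<u64>>>) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        {
            // `blocking_lock` must not be called from inside the async
            // runtime, but it is fine on a dedicated thread like this one.
            let mut pending = queue.blocking_lock();
            if let Some(chunk) = pending.pop() {
                // ... CPU-heavy scheduling work on `chunk` ...
                let _ = chunk;
            }
        }
        thread::sleep(Duration::from_millis(50));
    })
}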

Also, if it should be changed to async, please change the blocking channels to async channels.

Contributor Author (reply):

It doesn't really matter whether it's async or not, since it's on its own thread, and I made it async so I can see at what point in execution it gets stuck. I might try putting all of the generation threads onto their own multi-threaded tokio runtime at some point, but I haven't started on that yet.

@sprinox (Contributor) commented Jan 10, 2026

In my opinion, sync functions are much easier to debug than async ones. If a sync function deadlocks, we can simply pause in lldb or gdb and see which line of code it is stuck on. That doesn't work for async functions, so I think async is really hard to debug.

Besides, async is much slower at recursion.

CuzImClicks and others added 4 commits January 14, 2026 12:23
@Snowiiii closed this on Jan 25, 2026