fix: optimizations to try and reduce deadlocks #1256
Conversation
Acquire write lock only once
Could this be a …?
If no reference counting is used, then with the current PR implementation, it's possible for two save tasks to be issued for the same chunk. After the first save task finishes writing, the chunk gets unlocked. At that point, a reader thread may read it immediately without waiting for the second save task to complete, which can result in chunk data being rolled back.
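A minimal sketch of the reference-counting idea described above, assuming a hypothetical `PendingWrites` helper built on tokio's `Notify`; the type name and the `(i32, i32)` chunk key are illustrative, not the PR's actual code:

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use tokio::sync::Notify;

/// Hypothetical per-chunk bookkeeping: readers wait until *every* queued
/// save task for the chunk has finished, not just the first one.
#[derive(Default)]
struct PendingWrites {
    inner: Mutex<HashMap<(i32, i32), usize>>, // chunk pos -> outstanding save tasks
    done: Notify,
}

impl PendingWrites {
    /// Called when a save task is queued for a chunk.
    fn begin(&self, pos: (i32, i32)) {
        *self.inner.lock().unwrap().entry(pos).or_insert(0) += 1;
    }

    /// Called when a save task finishes; only the last one wakes readers.
    fn finish(&self, pos: (i32, i32)) {
        let mut map = self.inner.lock().unwrap();
        if let Some(count) = map.get_mut(&pos) {
            *count -= 1;
            if *count == 0 {
                map.remove(&pos);
                self.done.notify_waiters();
            }
        }
    }

    /// Readers wait here until no save task is outstanding for `pos`.
    async fn wait_clear(&self, pos: (i32, i32)) {
        loop {
            let notified = self.done.notified();
            tokio::pin!(notified);
            // Register interest before re-checking so a wakeup between the
            // check and the await is not lost.
            notified.as_mut().enable();
            if !self.inner.lock().unwrap().contains_key(&pos) {
                return;
            }
            notified.await;
        }
    }
}
```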
I added back the reference counting, but it's still deadlocking. What I noticed is that it deadlocks when I'm standing still and just coding, while no chunks are being loaded or generated. I unbounded all of the channels, but that also didn't really fix anything; this deadlock seems to be really stubborn. I definitely think using CancellationToken is better than a condvar, just because CancellationToken was made for this exact purpose of waiting, or completing instantly if a permit has already been deposited. I moved the write task onto its own thread, but it would probably be better to move all write tasks for all levels together onto one thread instead of each getting their own. I found that even though it's async I/O, it's still better to have them on their own thread, and it causes fewer slow ticks. If you have any more ideas on how this could be improved, I would appreciate your advice! I also made some improvements to the tick_data stuff and eliminated a lot of per-tick allocation. I found that the random tick calculation actually takes the most time per tick:

```rust
for i in 0..chunk.section.sections.len() {
    for _ in 0..random_tick_speed {
        // Pack the x/y/z offsets into one random u32 instead of three RNG calls.
        let r = rng.random::<u32>();
        let x_offset = (r & 0xF) as i32;
        let y_offset = ((r >> 4) & 0xF) as i32 - 32;
        let z_offset = (r >> 8 & 0xF) as i32;
        let random_pos = BlockPos::new(
            chunk_x_base + x_offset,
            i as i32 * 16 + y_offset,
            chunk_z_base + z_offset,
        );
        let block_id = chunk
            .section
            .get_block_absolute_y(x_offset as usize, random_pos.0.y, z_offset as usize)
            .unwrap_or(Block::AIR.default_state.id);
        if has_random_ticks(block_id) {
            random_ticks.push(ScheduledTick {
                position: random_pos,
                delay: 0,
                priority: TickPriority::Normal,
                value: (),
            });
        }
    }
}
```
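For reference, a minimal sketch of the CancellationToken approach mentioned above, using `tokio_util::sync::CancellationToken`; `InFlightSaves` and the chunk-key type are illustrative names, not code from this PR. Here "cancelling" the token just signals that the write finished, so waiters either block or complete instantly:

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use tokio_util::sync::CancellationToken;

/// Illustrative map of in-flight chunk saves. Each pending save owns a token;
/// cancelling it means "the write finished".
#[derive(Default)]
struct InFlightSaves {
    tokens: Mutex<HashMap<(i32, i32), CancellationToken>>,
}

impl InFlightSaves {
    /// Writer side: register a token before spawning the save task.
    fn start(&self, pos: (i32, i32)) -> CancellationToken {
        let token = CancellationToken::new();
        self.tokens.lock().unwrap().insert(pos, token.clone());
        token
    }

    /// Writer side: mark the save as complete and drop the entry.
    fn complete(&self, pos: (i32, i32)) {
        if let Some(token) = self.tokens.lock().unwrap().remove(&pos) {
            token.cancel();
        }
    }

    /// Reader side: wait until the pending save (if any) has finished.
    async fn wait_for(&self, pos: (i32, i32)) {
        // Clone the token out so the std mutex guard is dropped before awaiting.
        let token = self.tokens.lock().unwrap().get(&pos).cloned();
        if let Some(token) = token {
            // Completes immediately if `complete()` already ran.
            token.cancelled().await;
        }
    }
}
```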
I agree that CancellationToken is better, but I can't figure out a safe implementation. As for the deadlock, it seldom occurs on my PC. If anyone uses Linux and can reproduce the deadlock, it would be a great help to enable the task dump to print all current tasks and find which one is deadlocked. I commented out the code because I use Windows. By using it, you can get output listing every live task. I suggest calling the dump function if shutdown takes longer than 300 seconds. I removed it earlier because it wouldn't pass CI, so it needs to be added back locally.
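A minimal sketch of that task-dump idea, assuming tokio's unstable task-dump support (built with `RUSTFLAGS="--cfg tokio_unstable --cfg tokio_taskdump"`, supported only on certain Linux targets); the 300-second threshold comes from the comment above, and `shutdown_with_dump` is just an illustrative name:

```rust
use std::time::Duration;

/// If shutdown hangs, dump every live tokio task so the stuck one can be found.
async fn shutdown_with_dump(shutdown: impl std::future::Future<Output = ()>) {
    tokio::select! {
        _ = shutdown => {}
        _ = tokio::time::sleep(Duration::from_secs(300)) => {
            let handle = tokio::runtime::Handle::current();
            // Bound the dump itself so a stuck runtime can't hang this path too.
            if let Ok(dump) = tokio::time::timeout(Duration::from_secs(5), handle.dump()).await {
                for (i, task) in dump.tasks().iter().enumerate() {
                    println!("TASK {i}:\n{}\n", task.trace());
                }
            }
        }
    }
}
```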
There are also some resources leaking. I mentioned it in my previous PR.
We definitely have a memory leak, which I was able to observe yesterday, but it still didn't deadlock.
@sprinox are you hitting the deadlock on this PR on Windows? Because I can't get it to deadlock on Linux right now.
I have not tested this PR, but I have met the deadlock on the master branch.
```diff
-        let (send_gen, recv_gen) = crossfire::mpmc::bounded_blocking(gen_thread_count + 5);
         let io_lock = Arc::new((Mutex::new(HashMapType::default()), Condvar::new()));
         for _ in 0..oi_read_thread_count {
+        let (send_gen, recv_gen) = crossfire::mpmc::unbounded_blocking();
```
Don't use unbounded. It will be filled with too many tasks which can't be cancelled.
The sender blocks when the channel is filled, though... I made it unbounded to see if it fixes the deadlocks.
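As a generic illustration of that trade-off (using `std::sync::mpsc` rather than crossfire): a bounded channel makes the sender block once the buffer is full, which gives backpressure but can stall the producer whenever the consumer is itself waiting on the producer, while an unbounded channel never blocks the sender but lets queued work pile up instead:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    // Bounded channel with capacity 1: the second send blocks until drained.
    let (tx, rx) = mpsc::sync_channel::<u32>(1);

    let worker = thread::spawn(move || {
        // The consumer only starts draining after some other step; if that
        // step waited on the producer, the blocking send would never return.
        thread::sleep(Duration::from_millis(100));
        while let Ok(task) = rx.recv() {
            println!("processed task {task}");
        }
    });

    for task in 0..4 {
        tx.send(task).expect("worker hung up"); // blocks when the buffer is full
        println!("queued task {task}");
    }
    drop(tx);
    worker.join().unwrap();
}
```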
```diff
-    fn work(mut self, level: Arc<Level>) {
+    async fn work(mut self, level: Arc<Level>) {
```
I still believe that the schedule thread should not be async; it is a CPU-heavy task. You can use blocking_lock to lock a tokio mutex from a normal thread.
Also, if it is changed to async, please change the blocking channel to an async channel.
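A minimal sketch of that suggestion, assuming a `tokio::sync::Mutex` shared between async tasks and a plain scheduler thread; the queue type and function name are illustrative:

```rust
use std::sync::Arc;
use std::thread;
use tokio::sync::Mutex;

/// The CPU-heavy scheduler stays synchronous and uses `blocking_lock()` to
/// take a tokio Mutex that async tasks lock with `.lock().await`. Note that
/// `blocking_lock` must not be called from inside the async runtime, which is
/// exactly why it fits a dedicated thread.
fn spawn_schedule_thread(shared: Arc<Mutex<Vec<u64>>>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        loop {
            {
                // Plain (non-async) thread: block until the lock is free.
                let mut queue = shared.blocking_lock();
                if let Some(task) = queue.pop() {
                    // ... do CPU-heavy work with `task` here ...
                    let _ = task;
                }
            } // guard dropped before sleeping so async tasks can take the lock
            thread::sleep(std::time::Duration::from_millis(50));
        }
    })
}
```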
It doesn't really matter whether it's async or not, since it's on its own thread, and I wanted to make it async so I can see at what part of execution it gets stuck. I might try to put all of the generation threads onto their own multi-threaded tokio runtime at some point, but I haven't started on that yet.
In my opinion, a sync function is much easier to debug than an async one. If a sync function deadlocks, we can simply pause it in lldb or gdb and see which line of code it is stuck on. That doesn't work for async functions, so I think async is really hard to debug.
Besides, async is much slower at recursion.
- actually put write tasks onto their own threads (see the sketch below)
- only one entity chunks thread per dimension
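A minimal sketch of that dedicated-write-thread idea, assuming a hypothetical `WriteJob` type and file layout (not this repo's actual region-file code): a plain thread runs a current-thread tokio runtime and drains a channel of write jobs, keeping slow disk writes off the main runtime's workers.

```rust
use std::thread;
use tokio::sync::mpsc;

/// Illustrative write job; in the real server this would be chunk data.
struct WriteJob {
    chunk_pos: (i32, i32),
    bytes: Vec<u8>,
}

/// Spawn one thread that owns all chunk writes; async callers just send jobs.
fn spawn_write_thread() -> mpsc::Sender<WriteJob> {
    let (tx, mut rx) = mpsc::channel::<WriteJob>(256);

    thread::spawn(move || {
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_all()
            .build()
            .expect("failed to build write runtime");

        rt.block_on(async move {
            while let Some(job) = rx.recv().await {
                // Stand-in path; the real server would write region files.
                let path = format!("chunk_{}_{}.bin", job.chunk_pos.0, job.chunk_pos.1);
                if let Err(e) = tokio::fs::write(&path, &job.bytes).await {
                    eprintln!("chunk write failed for {path}: {e}");
                }
            }
        });
    });

    tx
}
```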


Description
Uses `Notify` per chunk and stops reference counting pending writes. This PR still deadlocks, but it's not triggered by flying around; it can be triggered by just standing still for 20 min? 😭
Testing
Please follow our Coding Guidelines