Integrate fiemap support and fixes v3 #17
base: master
Conversation
The hinting currently doesn't treat the backup and restore phases differently, which can be improved. This patch tells the kernel that the data won't be reused in the copy function, so it can discard it from the page cache after it has been read. Next, it tells the kernel that both the accused file and the tmp file will be needed for reading shortly. The first call to WILLNEED should be moved further up the call chain so it also covers reading the allocation map, but that would involve tracking DONTNEED differently, so it is not implemented here. Finally, the page caches are discarded explicitly by telling the kernel that the data is no longer needed. Testing shows that the active page cache is now released after each file operation during shaking, and system responsiveness improves while shake runs.
This patch adds support for FIEMAP and leaves the current FIBMAP approach in place as a fallback. Instead of trying to guess extents from FIBMAP block maps, it uses extents directly as fragments. The logic is similar in that it merges adjacent extents into one fragment. It also detects hole extents and inlined extents and doesn't count those as fragments, similar to the FIBMAP approach. It optimizes handling of filesystems with tail support (e.g., reiserfs) by not counting tailed extents as fragments, as the rewrite process would probably tail the last extent again. Similarly, it tries to optimize for filesystems that support shared extents (xfs, btrfs) by not counting such extents as fragments. This helps prevent unsharing accused files and thus increasing space usage. My assumption is that keeping shared extents is more important than contiguous extents, since the deduplication that created those shared extents in the first place is known to fragment files. However, if the heuristics decide to shake the file, the extents still become unshared. I currently have no idea how to optimize for that, so it is not part of this patch.
unbrice
left a comment
Reviewed 2 of 2 files at r1.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @kakra)
executive.c, line 53 at r1 (raw file):
/* Optimisation (on Linux it double the readahead window) */
posix_fadvise (in_fd, (off_t) 0, (off_t) 0, POSIX_FADV_SEQUENTIAL);
posix_fadvise (in_fd, (off_t) 0, (off_t) 0, POSIX_FADV_NOREUSE);
man7 says POSIX_FADV_NOREUSE is a no-op
executive.c, line 233 at r1 (raw file):
shake_reg_backup_phase (struct accused *a, struct law *l)
{
  posix_fadvise (a->fd, (off_t) 0, (off_t) 0, POSIX_FADV_WILLNEED);
I don't think prefetching just before the actual blocking read would help? As I understand it, this is more likely to help if some time separates the two calls.
linux.c, line 225 at r1 (raw file):
  sizelog[logs_pos] = -1;
  poslog[logs_pos] = -1;
}
As discussed in the initial review, I suggest splitting this to a separate function as it's grown quite large with the change.
That is possible. It's probably better to actually issue a DONTNEED afterwards to free the page cache. That said, the man page doesn't always match what the kernel actually does; I found this while working on optimizing Steam fossilize. The second
It hints the cache to prefetch the whole file in advance. Depending on how fast the process can read the file, it may reduce head movement and prefetch metadata. In this case it's probably not that useful, especially since, under memory pressure with concurrent IO, shake would start dominating the cache - which we probably don't want.
The whole process is a bit racy: on ext4, locking the file and rewriting it is probably the way to go, but there's still a window in which shake can accidentally replace a modified file with the previous version, or even error out and leave a temporary file behind. For other filesystems, e.g. btrfs (and maybe even xfs), there's an IOCTL to replace the inode with a new set of extents atomically. For xfs, I'd look at how xfs_fsr does it. For btrfs, it's probably worth writing a temporary file with the new contents, then using "extent same" to atomically replace the old extents with the new defragmented ones.

Since shake is not safe to use unattended, I've stopped using it. Instead, I'm running on hybrid btrfs now, with dedicated SSDs for metadata and bcache dedicated for data. This works great and provides far better performance than running shake to relocate data blocks. The only downside is that shake was able to reduce the number of extents. I'd look into shake again if it could be redesigned to operate more safely, but the problems are largely inherent in its internal design.

So if you want to apply the patches partially, feel free to do so. It's still interesting software if its race conditions can be resolved, because it can restore locality of files that belong together; e.g., I'd like to use the readahead data from the boot process to restore locality of all involved files (not at a directory level but in the order they are read during boot). If you need help extracting the fiemap code into a separate function, I can help with that. Also, it should keep the part that skips extents marked as shared, so it won't unshare snapshot extents. Before starting to merge patches, the code should be run through clang format or
Thanks for offering. As you noted, the code is dated a bit and could do better on modern filesystems. But as you also noted, the use of spinning disks has declined to the point that nowadays it serves only some specific use cases. These days, people who care about latency use SSDs for that. And it seems both of us are doing just that... So that's the reason why I'm reluctant to make changes myself, and maybe reluctant about changes in general:
Actually the current indent are
The question is: what do they patch? And would it be worth integrating those patches into shake upstream? I think shake still has its use cases and is based on some great ideas. That's probably why distros still ship it even though the mainstream era of spinning disks has passed, and storage that still uses spinning disks will either cache (using SSDs) or parallelize (by adding spindles) - and neither of those is the general target audience of what shake tries to do.

Each filesystem has its own defrag implementation, but none of them is locality-aware the way shake is. If shake is to head anywhere, I think this is its strong selling point. In that case, a follow-up version should look into how xfs or btrfs implement their atomic defraggers and apply the same (probably fs-specific) logic to shake.

I had an idea of adding some sort of boot-defrag to shake by feeding systemd-readahead data into it. But I quickly dropped that idea when I realized that shake could not run unattended, because defrag problems could go unnoticed (destroying the original file and leaving a temp file in place; this can happen if package updates and shake collide on the same files). This is another area that needs fixing.
This supersedes #11. It fixes an issue introduced during splitting out patches and rebasing.