Integrate fiemap support and fixes v3 #17

kakra · 2017-06-28T21:11:42Z

This supersedes #11. It fixes an issue introduced during splitting out patches and rebasing.

This change is

The hinting currently doesn't handle backup and restore phase differently which can be improved. This patch tells the kernel that the data won't be reused in the copy function, thus it can discard it from the page cache after it was read. Next, it will instead tell the kernel that the accused file and the tmp file both will be needed for reading next. The first call to WILLNEED should be put up further in the call chain to also support reading the allocation map but this would invole tracking DONTNEED differently, so it is not implemented here. In the end, the page caches will be discarded explicitly by telling the kernel that the data is no longer needed. Testing shows that the active page cache during shaking now is released after each file operation and system responsiveness improves during shaking.

This patch adds support for FIEMAP and leaves the current FIBMAP approach in place as a fallback. Instead of trying to guess extents from FIBMAP block maps, it directly uses extents as fragments. The logic is similar as it merges adjecent extents as one fragment. It also detects hole extents and inlined extents and doesn't account those as fragments similar to the FIBMAP approach. It optimizes handling filesystems with tail support (e.g., reiserfs) by not accounting tailed extents as fragments, as the rewrite process probably would tail the last extent again. Similarily it tries to optimize for file systems that support shared extents (xfs, btrfs) by not accounting such extents as fragments. This helps preventing unsharing accused files and thus increasing space usage. I pretend that keeping shared extents is more important than continuous extents, as the operation of deduplicating those extents in the first place is known to fragment those files. However, if the heuristics decide to shake the file, extents still become unshared. I have no idea currently how to optimize for that, thus it is not part of this patch.

unbrice

Reviewed 2 of 2 files at r1.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @kakra)

executive.c, line 53 at r1 (raw file):

  /* Optimisation (on Linux it double the readahead window) */
  posix_fadvise (in_fd, (off_t) 0, (off_t) 0, POSIX_FADV_SEQUENTIAL);
  posix_fadvise (in_fd, (off_t) 0, (off_t) 0, POSIX_FADV_NOREUSE);

man7 says POSIX_FADV_NOREUSE is a no-op

executive.c, line 233 at r1 (raw file):

shake_reg_backup_phase (struct accused *a, struct law *l)
{
  posix_fadvise (a->fd, (off_t) 0, (off_t) 0, POSIX_FADV_WILLNEED);

I don't think prefetching just before the actual blocking read would help? As I understand, this is more likely to help if some time separate both calls.

linux.c, line 225 at r1 (raw file):

      sizelog[logs_pos] = -1;
      poslog[logs_pos] = -1;
    }

As discussed in the initial review, I suggest splitting this to a separate function as it's grown quite large with the change.

kakra · 2022-03-12T19:59:22Z

executive.c, line 53 at r1 (raw file):

  /* Optimisation (on Linux it double the readahead window) */
  posix_fadvise (in_fd, (off_t) 0, (off_t) 0, POSIX_FADV_SEQUENTIAL);
  posix_fadvise (in_fd, (off_t) 0, (off_t) 0, POSIX_FADV_NOREUSE);

man7 says POSIX_FADV_NOREUSE is a no-op

That is possible. It's probably better to actually issue a WONT_NEED afterwards to free the page cache. That said, the man page doesn't always match what the kernel actually does. I found this when working on optimizing Steam fossilize.

The second posix_fadvise() call should probably simply be removed. The first line already doubles the readahead window which should be enough.

executive.c, line 233 at r1 (raw file):
shake_reg_backup_phase (struct accused *a, struct law *l)
{
  posix_fadvise (a->fd, (off_t) 0, (off_t) 0, POSIX_FADV_WILLNEED);
I don't think prefetching just before the actual blocking read would help? As I understand, this is more likely to help if some time separate both calls.

It will hint the cache to prefetch the whole file in advance. Depending on how fast the process can read the file, it may reduce head movement and prefetch metadata. In this case, it's probably not that useful, especially since when there is concurrency with other IO under memory pressure, shake would start dominating the cache - which we probably don't want.

linux.c, line 225 at r1 (raw file):
      sizelog[logs_pos] = -1;
      poslog[logs_pos] = -1;
    }
As discussed in the initial review, I suggest splitting this to a separate function as it's grown quite large with the change.

The whole process is a bit racy: While on ext4, locking the file and rewriting it is probably the way to go but there's still a window when shake can accidentally replace a modified file with the previous version, or even error out and leave a temporary file behind.

For other filesystems, e.g. btrfs (and maybe even xfs) there's an IOCTL to replace the inode with a new set of extents atomically. For xfs, I'd look at how xfs_fsr does it. For btrfs, it's probably worth writing a temporary file with the new contents, then use "extent same" to atomically replace the old extents with the new defragmented ones.

Since shake is not safe to be used unattended, I've stopped using it.

Instead, I'm running on hybrid btrfs now with dedicated SSDs for metadata, and bcache dedicated for data. This works great and provides way better performance than running shake trying to relocate data blocks. The only downside is that shake was able to reduce the amount of extents.

I'd look into shake again if it could be redesigned into a more safe way of operating but the problems are much inherent to its internal design. So if you want to apply the patches partially, feel free to do so. It's still an interesting software if its race conditions can be resolved because it can restore locality of files that belong together, e.g. I'd like to use the readahead data from the boot process to restore locality of all involved files (not on a directory level but order of read during boot).

If you need help extracting the fiemap code into a separate function, I can help with that. Also, it should keep the part to skip extent marked as shared so it won't unshare snapshot extents.

Before starting to merge patches, the code should be run through clang format or indent to fix its quite creative indents. It will make patch rebases much more easy. I'd prefer clang format as it is much more sophisticated and be be integrated into the commit process.

unbrice · 2022-03-13T18:17:03Z

If you need help extracting the fiemap code into a separate function, I can help with that. Also, it should keep the part to skip extent marked as shared so it won't unshare snapshot extents.

Thanks for offering. As you noted, the code dates a bit and could do better on modern filesystems. But as you also noted, the use of spinning disks is decreasing to the point that nowadays it's for some specific use cases. These days, people caring about latency use SSDs to serve that. And it seems both of us are doing just that...

So that's the reason why I'm reluctant to making changes myself, and maybe to changes in general:
I have the feeling shake is going to be like a CRT-only software (eg: tempest for eliza). Neat, at its epoch, now a memento.
Maybe the correct course here is to update the documentation, mention this context and officially transition it as a "maintenance only" software.

Before starting to merge patches, the code should be run through clang format or indent to fix its quite creative indents. It will make patch rebases much more easy. I'd prefer clang format as it is much more sophisticated and be be integrated into the commit process.

Actually the current indent are indent's default mode of operation :-P That's how they got here.
I gave clang-format a quick try. It does have a -style=GNU option, but that option doesn't properly follow GNU style (for example SpaceAfterCStyleCast is false when it ought to be true). I don't feel strongly about the current GNU style, but quickly glancing at some distros that carry shake, they patch it and I'd rather not break their build for no specific reason.

kakra · 2022-03-13T22:03:16Z

they patch it

The question is: What do they patch? And would it be rather worth it to integrate those patches into shake upstream?

I think shake still has its use-cases and is based on some great ideas. That's probably why distros still ship it despite the mainstream era of spinning disks has passed, and storage that still uses spinning disks will either cache (using SSDs) or parallelize (by increasing spindles) - and both of those are not the general target audience of what shake tries to do.

Each FS has their own implementation of some defrag but none of them is locality aware as shake is. If shake wants to head anywhere, I think this is its strong selling point. And in that case, a follow-up version should look into how xfs or btrfs implement their atomic defraggers and apply the same (probably fs-specific) logic to shake.

I had an idea of adding some sort of boot-defrag to shake by feeding systemd-readahead data into it. But I quickly dropped that idea when I realized that shake could not run unattended because defrag-problems could go unnoticed (destroying the original file and leaving a temp file in place, it can happen if package updates and shake collide on the same files). This is another area that needs fixing.

kakra added 2 commits June 28, 2017 23:08

kakra mentioned this pull request Jun 28, 2017

Integrate fiemap support and fixes v2 #11

Closed

Stoatwblr mentioned this pull request Dec 12, 2020

Support for ZFS (open zfs) filesystems #15

Open

unbrice requested changes Mar 12, 2022

View reviewed changes

unbrice added the enhancement label Mar 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Integrate fiemap support and fixes v3 #17

Integrate fiemap support and fixes v3 #17

Uh oh!

kakra commented Jun 28, 2017 •

edited by unbrice

Loading

Uh oh!

unbrice left a comment

Uh oh!

kakra commented Mar 12, 2022

Uh oh!

unbrice commented Mar 13, 2022

Uh oh!

kakra commented Mar 13, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Integrate fiemap support and fixes v3 #17

Are you sure you want to change the base?

Integrate fiemap support and fixes v3 #17

Uh oh!

Conversation

kakra commented Jun 28, 2017 • edited by unbrice Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

unbrice left a comment

Choose a reason for hiding this comment

Uh oh!

kakra commented Mar 12, 2022

Uh oh!

unbrice commented Mar 13, 2022

Uh oh!

kakra commented Mar 13, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kakra commented Jun 28, 2017 •

edited by unbrice

Loading