Skip to content

Conversation

@kakra
Copy link
Contributor

@kakra kakra commented Jun 28, 2017

This supersedes #11. It fixes an issue introduced during splitting out patches and rebasing.


This change is Reviewable

kakra added 2 commits June 28, 2017 23:08
The hinting currently doesn't handle backup and restore phase
differently which can be improved.

This patch tells the kernel that the data won't be reused in the copy
function, thus it can discard it from the page cache after it was read.

Next, it will instead tell the kernel that the accused file and the tmp
file both will be needed for reading next. The first call to WILLNEED
should be put up further in the call chain to also support reading the
allocation map but this would invole tracking DONTNEED differently, so
it is not implemented here.

In the end, the page caches will be discarded explicitly by telling the
kernel that the data is no longer needed.

Testing shows that the active page cache during shaking now is released
after each file operation and system responsiveness improves during
shaking.
This patch adds support for FIEMAP and leaves the current FIBMAP
approach in place as a fallback. Instead of trying to guess extents from
FIBMAP block maps, it directly uses extents as fragments.

The logic is similar as it merges adjecent extents as one fragment. It
also detects hole extents and inlined extents and doesn't account those
as fragments similar to the FIBMAP approach.

It optimizes handling filesystems with tail support (e.g., reiserfs) by
not accounting tailed extents as fragments, as the rewrite process
probably would tail the last extent again.

Similarily it tries to optimize for file systems that support shared
extents (xfs, btrfs) by not accounting such extents as fragments. This
helps preventing unsharing accused files and thus increasing space
usage. I pretend that keeping shared extents is more important than
continuous extents, as the operation of deduplicating those extents in
the first place is known to fragment those files.

However, if the heuristics decide to shake the file, extents still
become unshared. I have no idea currently how to optimize for that, thus
it is not part of this patch.
Copy link
Owner

@unbrice unbrice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @kakra)


executive.c, line 53 at r1 (raw file):

  /* Optimisation (on Linux it double the readahead window) */
  posix_fadvise (in_fd, (off_t) 0, (off_t) 0, POSIX_FADV_SEQUENTIAL);
  posix_fadvise (in_fd, (off_t) 0, (off_t) 0, POSIX_FADV_NOREUSE);

man7 says POSIX_FADV_NOREUSE is a no-op


executive.c, line 233 at r1 (raw file):

shake_reg_backup_phase (struct accused *a, struct law *l)
{
  posix_fadvise (a->fd, (off_t) 0, (off_t) 0, POSIX_FADV_WILLNEED);

I don't think prefetching just before the actual blocking read would help? As I understand, this is more likely to help if some time separate both calls.


linux.c, line 225 at r1 (raw file):

      sizelog[logs_pos] = -1;
      poslog[logs_pos] = -1;
    }

As discussed in the initial review, I suggest splitting this to a separate function as it's grown quite large with the change.

@kakra
Copy link
Contributor Author

kakra commented Mar 12, 2022

executive.c, line 53 at r1 (raw file):

  /* Optimisation (on Linux it double the readahead window) */
  posix_fadvise (in_fd, (off_t) 0, (off_t) 0, POSIX_FADV_SEQUENTIAL);
  posix_fadvise (in_fd, (off_t) 0, (off_t) 0, POSIX_FADV_NOREUSE);

man7 says POSIX_FADV_NOREUSE is a no-op

That is possible. It's probably better to actually issue a WONT_NEED afterwards to free the page cache. That said, the man page doesn't always match what the kernel actually does. I found this when working on optimizing Steam fossilize.

The second posix_fadvise() call should probably simply be removed. The first line already doubles the readahead window which should be enough.

executive.c, line 233 at r1 (raw file):

shake_reg_backup_phase (struct accused *a, struct law *l)
{
  posix_fadvise (a->fd, (off_t) 0, (off_t) 0, POSIX_FADV_WILLNEED);

I don't think prefetching just before the actual blocking read would help? As I understand, this is more likely to help if some time separate both calls.

It will hint the cache to prefetch the whole file in advance. Depending on how fast the process can read the file, it may reduce head movement and prefetch metadata. In this case, it's probably not that useful, especially since when there is concurrency with other IO under memory pressure, shake would start dominating the cache - which we probably don't want.

linux.c, line 225 at r1 (raw file):

      sizelog[logs_pos] = -1;
      poslog[logs_pos] = -1;
    }

As discussed in the initial review, I suggest splitting this to a separate function as it's grown quite large with the change.

The whole process is a bit racy: While on ext4, locking the file and rewriting it is probably the way to go but there's still a window when shake can accidentally replace a modified file with the previous version, or even error out and leave a temporary file behind.

For other filesystems, e.g. btrfs (and maybe even xfs) there's an IOCTL to replace the inode with a new set of extents atomically. For xfs, I'd look at how xfs_fsr does it. For btrfs, it's probably worth writing a temporary file with the new contents, then use "extent same" to atomically replace the old extents with the new defragmented ones.

Since shake is not safe to be used unattended, I've stopped using it.

Instead, I'm running on hybrid btrfs now with dedicated SSDs for metadata, and bcache dedicated for data. This works great and provides way better performance than running shake trying to relocate data blocks. The only downside is that shake was able to reduce the amount of extents.

I'd look into shake again if it could be redesigned into a more safe way of operating but the problems are much inherent to its internal design. So if you want to apply the patches partially, feel free to do so. It's still an interesting software if its race conditions can be resolved because it can restore locality of files that belong together, e.g. I'd like to use the readahead data from the boot process to restore locality of all involved files (not on a directory level but order of read during boot).

If you need help extracting the fiemap code into a separate function, I can help with that. Also, it should keep the part to skip extent marked as shared so it won't unshare snapshot extents.

Before starting to merge patches, the code should be run through clang format or indent to fix its quite creative indents. It will make patch rebases much more easy. I'd prefer clang format as it is much more sophisticated and be be integrated into the commit process.

@unbrice
Copy link
Owner

unbrice commented Mar 13, 2022

If you need help extracting the fiemap code into a separate function, I can help with that. Also, it should keep the part to skip extent marked as shared so it won't unshare snapshot extents.

Thanks for offering. As you noted, the code dates a bit and could do better on modern filesystems. But as you also noted, the use of spinning disks is decreasing to the point that nowadays it's for some specific use cases. These days, people caring about latency use SSDs to serve that. And it seems both of us are doing just that...

So that's the reason why I'm reluctant to making changes myself, and maybe to changes in general:
I have the feeling shake is going to be like a CRT-only software (eg: tempest for eliza). Neat, at its epoch, now a memento.
Maybe the correct course here is to update the documentation, mention this context and officially transition it as a "maintenance only" software.

Before starting to merge patches, the code should be run through clang format or indent to fix its quite creative indents. It will make patch rebases much more easy. I'd prefer clang format as it is much more sophisticated and be be integrated into the commit process.

Actually the current indent are indent's default mode of operation :-P That's how they got here.
I gave clang-format a quick try. It does have a -style=GNU option, but that option doesn't properly follow GNU style (for example SpaceAfterCStyleCast is false when it ought to be true). I don't feel strongly about the current GNU style, but quickly glancing at some distros that carry shake, they patch it and I'd rather not break their build for no specific reason.

@kakra
Copy link
Contributor Author

kakra commented Mar 13, 2022

they patch it

The question is: What do they patch? And would it be rather worth it to integrate those patches into shake upstream?

I think shake still has its use-cases and is based on some great ideas. That's probably why distros still ship it despite the mainstream era of spinning disks has passed, and storage that still uses spinning disks will either cache (using SSDs) or parallelize (by increasing spindles) - and both of those are not the general target audience of what shake tries to do.

Each FS has their own implementation of some defrag but none of them is locality aware as shake is. If shake wants to head anywhere, I think this is its strong selling point. And in that case, a follow-up version should look into how xfs or btrfs implement their atomic defraggers and apply the same (probably fs-specific) logic to shake.

I had an idea of adding some sort of boot-defrag to shake by feeding systemd-readahead data into it. But I quickly dropped that idea when I realized that shake could not run unattended because defrag-problems could go unnoticed (destroying the original file and leaving a temp file in place, it can happen if package updates and shake collide on the same files). This is another area that needs fixing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants