Skip to content

Conversation

@JinZhou5042
Copy link
Member

@JinZhou5042 JinZhou5042 commented Dec 10, 2024

Proposed Changes

Fix #3996

  • Update the default values for a set of arguments regarding temp file replication:

    • q->worker_source_max_transfers from 3 to 10
    • max_transfer_procs from 5 to 10
    • I keep q->transfer_replica_per_cycle as is (which is 10), as it limits the number of replications to request per temp file per iteration, and we are not likely to replicate each temp file more than 10 times.
  • Improve the logic in the replication process, which helps the manager replicate more efficiently, add more comments

  • Provide an API for Python to query the replica count for a file

Merge Checklist

The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.

  • make test Run local tests prior to pushing.
  • make format Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python)
  • make lint Run lint on source code prior to pushing.
  • Manual Update: Update the manual to reflect user-visible changes.
  • Type Labels: Select a github label for the type: bugfix, enhancement, etc.
  • Product Labels: Select a github label for the product: TaskVine, Makeflow, etc.
  • PR RTM: Mark your PR as ready to merge.

@JinZhou5042 JinZhou5042 marked this pull request as draft December 10, 2024 15:50
@JinZhou5042 JinZhou5042 self-assigned this Dec 10, 2024
@JinZhou5042
Copy link
Member Author

JinZhou5042 commented Dec 10, 2024

I believe the argument values are definitely not the real issue, the actual issue is probably some logic bugs or a bit tangled if statements in the previous implementation. I've cleaned up the logic, and now the replica count for temp files comes out right at the end of every run.

Before:
image

After:

image

@JinZhou5042 JinZhou5042 marked this pull request as ready for review December 10, 2024 18:02
@JinZhou5042 JinZhou5042 requested a review from dthain December 10, 2024 18:03
Copy link
Member

@dthain dthain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@JinZhou5042
Copy link
Member Author

RTM

@btovar btovar merged commit ec2741e into cooperative-computing-lab:master Dec 10, 2024
10 checks passed
@JinZhou5042 JinZhou5042 deleted the temp_replica_count branch December 10, 2024 18:46
btovar pushed a commit that referenced this pull request Dec 14, 2024
* print rep count

* init

* update

* up

* up

* up

* lint

* up

* limit the max number of rep attempts

* update
dthain pushed a commit to dthain/cctools that referenced this pull request Jan 10, 2025
* print rep count

* init

* update

* up

* up

* up

* lint

* up

* limit the max number of rep attempts

* update
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

vine: most temp files do not reach their replication threshold

3 participants