Skip to content

Conversation

@colinthomas-z80
Copy link
Contributor

@colinthomas-z80 colinthomas-z80 commented Apr 26, 2024

Proposed changes

With task groups enabled, detect groups of sequential, temp-file dependent tasks, and send them in batches to a single worker, avoiding the overhead and complexity of maintaining fixed_location policy.

Post-change actions

Put an 'x' in the boxes that describe post-change actions that you have done.
The more 'x' ticked, the faster your changes are accepted by maintainers.

  • make test Run local tests prior to pushing.
  • make format Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python)
  • make lint Run lint on source code prior to pushing.
  • Manual Update Did you update the manual to reflect your changes, if appropriate? This action should be done after your changes are approved but not merged.
  • Type Labels Select github labels for the type of this change: bug, enhancement, etc.
  • Product Labels Select github labels for the product affected: TaskVine, Makeflow, etc.
  • PR RTM Mark your PR as ready to merge.

Additional comments

This section is dedicated to changes that are ambitious or complex and require substantial discussions. Feel free to start the ball rolling.

@colinthomas-z80
Copy link
Contributor Author

This currently works (from some testing i have done). It does some pretty expensive searches and is a bit invasive to the code. I wanted to open it up at this stage in case anyone has thoughts on the idea and implementation.

@dthain
Copy link
Member

dthain commented Oct 15, 2024

Colin, this has been sitting here for a while but based on our recent discussions I am getting more comfortable with the approach. Do you think this is ready to merge? Does it need more testing? Rebasing?

@dthain
Copy link
Member

dthain commented Oct 15, 2024

Accidentally closed?

@colinthomas-z80
Copy link
Contributor Author

Yes I’m not sure why but I’ll fix it up so it doesn’t have all the outside commits.

Regarding merging in it should not create any issues with the feature turned off by default. While using the task groups there are still a few holes in the policy for error handling.

@dthain
Copy link
Member

dthain commented Oct 15, 2024

Per our discussion today, waiting until this works smoothly with the Montage+Parsl example, then will be more confident in merging...

break;
case VINE_CACHE_STATUS_UNKNOWN:
case VINE_CACHE_STATUS_FAILED:
if (p->task->group_id) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re our discussion today, move this failure-handling logic to task_can_run_eventually.
If a task has an input file that is UNKNOWN, check it see if another known tasks produces it as an output file.
If so, then the task can stay until it becomes runnable.

@btovar
Copy link
Member

btovar commented Nov 18, 2024

@colinthomas-z80 Please see conflicts above...

@dthain
Copy link
Member

dthain commented Dec 13, 2024

@colinthomas-z80 this is looking pretty close, except for the task_can_run_eventually bit.

What do you think -- is there something else missing here before we are ready to merge?

@colinthomas-z80
Copy link
Contributor Author

Yes that part could be a bit more elegant. Before merging I want to clean up the worker debug logging and I am testing a change that fixes an issue with recovery tasks.

@dthain
Copy link
Member

dthain commented Dec 13, 2024

ok, thanks for the update

@colinthomas-z80
Copy link
Contributor Author

RTM

return 0;
}

if (task->group_id && (task->refcount > 1)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it need task->group_id = NULL? inside this if?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so since the task is getting deleted afterwards

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to have the if (task->group_id) { only, set it to NULL inside the if, and do not check for the refcount. The only place where we should be checking for the refcount is when deleting the task, otherwise it gets tricky if we ever call functions in a different context.

char *id = string_format("%d", group_id_counter++);
struct list *l = list_create();

t->group_id = id;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is group_id a string? Let's keep an int if possible so there is less chance of memory leaks/segfaults.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is because the group_id is a key to a hash table. They used to be a full uuid string but that was deemed to be excessive since the number of groups does not get very large.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using struct itable then. It is like struct hash_table but the keys are ints.

@btovar btovar merged commit 49abae3 into cooperative-computing-lab:master Jan 24, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants