
increase timeout via posargs #37989

Open

aIbrahiim wants to merge 3 commits into apache:master from aIbrahiim:fix-34739-precommit-python-job

Conversation

@aIbrahiim (Contributor) commented Mar 28, 2026

related to #34739


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@gemini-code-assist (Contributor)

Note

Gemini is unable to generate a summary for this pull request due to the file types involved not being currently supported.

@github-actions bot added the build label on Mar 28, 2026
@github-actions (Contributor)

Assigning reviewers:

R: @Abacn for label build.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@aIbrahiim force-pushed the fix-34739-precommit-python-job branch from 2bd7b78 to f611cf8 on March 31, 2026 13:15
@aIbrahiim force-pushed the fix-34739-precommit-python-job branch from f611cf8 to 0f85fec on March 31, 2026 13:16
TOX_TESTENV_PASSENV: "DOCKER_*,TESTCONTAINERS_*,TC_*,BEAM_*,GRPC_*,OMP_*,OPENBLAS_*,PYTHONHASHSEED,PYTEST_*"
# Aggressive retry and timeout settings for flaky CI
PYTEST_ADDOPTS: "-v --tb=short --maxfail=5 --durations=30 --reruns=5 --reruns-delay=15 --timeout=600 --disable-warnings"
PYTEST_ADDOPTS: "-v --tb=short --maxfail=5 --durations=30 --reruns=5 --reruns-delay=15 --timeout=900 --disable-warnings"
Contributor

Not sure further increasing the timeout and/or reruns would help. We already see the test suite stuck for >5h, e.g. https://github.com/apache/beam/actions/runs/23779768061

Contributor Author

Ahh, I didn't see this run. Increasing the pytest timeout can mask hangs and extend stuck runs, so I'll revert the timeout bump and keep this PR focused on the gRPC stability env vars only, as a short-term mitigation. @Abacn

Contributor

SGTM

TC_MAX_TRIES: "15"
TC_SLEEP_TIME: "5"
# Additional gRPC stability for flaky environment
GRPC_ARG_KEEPALIVE_TIME_MS: "60000"
Contributor

Wondering if there is some race condition for tests run in parallel cc: @tvalentyn. Nevertheless, we can add these environment variables for the moment.
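For context, the GRPC_ARG_KEEPALIVE_TIME_MS env var in the hunk above corresponds to a per-channel option in gRPC's Python API. A minimal sketch, with a placeholder target and an illustrative timeout value (neither is from this PR):

```python
import grpc

channel = grpc.insecure_channel(
    "localhost:50051",  # placeholder target, not from the PR
    options=[
        # Send a keepalive ping every 60 s so half-open connections on a
        # flaky CI network get detected instead of hanging indefinitely.
        ("grpc.keepalive_time_ms", 60000),
        # Illustrative: fail the ping (and the channel) if no ack in 20 s.
        ("grpc.keepalive_timeout_ms", 20000),
    ],
)
```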

@tvalentyn (Contributor) Mar 31, 2026

if we can repro this hang locally, or SSH into a GHA worker that is stuck, then we could spy on the Python process with tools like pystack, and then examine the stack traces.

Contributor

that could give some clues.
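For context, a minimal sketch of what that inspection could look like, assuming pystack is pip-installed on the runner and the stuck worker's PID is already known (both assumptions, not from this thread):

```python
import subprocess

def dump_stuck_worker(pid: int) -> None:
    # "pystack remote <pid>" attaches to a live process and prints the
    # Python-level stack of every thread without stopping the process.
    subprocess.run(["pystack", "remote", str(pid)], check=True)
```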

Contributor Author

So should I run with SSH access on a stuck self-hosted runner and collect pystack traces there, or add temporary CI instrumentation (faulthandler + periodic pystack dumps uploaded as artifacts) so the next stuck run captures stack traces automatically? @tvalentyn
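The faulthandler half of that instrumentation option could be as small as a conftest.py fragment; a hedged sketch, where the 300 s interval is an assumption:

```python
# conftest.py fragment (illustrative): periodically dump all thread stacks so
# a stuck run leaves traces in the job log that GitHub Actions captures.
import faulthandler
import sys

def pytest_configure(config):
    # Re-arms itself every 300 s (repeat=True) until cancelled, writing the
    # traceback of every thread to stderr.
    faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)

def pytest_unconfigure(config):
    faulthandler.cancel_dump_traceback_later()
```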

Contributor

> so should I run with SSH access on a stuck self-hosted runner

Is it an option? Is it easy?
How often does this issue reproduce?

Overall, I'd pick whichever option is more straightforward.


@tvalentyn (Contributor) Apr 1, 2026

Good findings!
re: 1 -- is there something specific to the YAML suite? do you have some suggestions (e.g. skip copying tmp)?
re: 2 -- Could you try setting this environment variable and seeing if it helps:

GRPC_ENABLE_FORK_SUPPORT: '0'
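In CI this variable would normally live in the workflow's env block; for a local repro it has to be in place before gRPC initializes. A minimal sketch of that ordering constraint (the setdefault pattern is illustrative):

```python
import os

# Must be set before the first "import grpc", since the variable is read
# when the library initializes.
os.environ.setdefault("GRPC_ENABLE_FORK_SUPPORT", "0")

import grpc  # noqa: E402  # imported after the env var is in place
```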

Contributor Author

re: 1 -- The issue ties to the YAML external providers: they clone a venv in yaml_provider via clonevirtualenv, so the source is either a fresh mini venv or, in .dev, the whole tox venv (path from base_python). In CI that tree is live, so paths like tmp/ can disappear mid-copy, which fits what we saw. I suggest ignoring tmp/, __pycache__/, .pytest_cache/ and pip build temps (or replacing clonevirtualenv with a copy that supports ignore / copies only the needed dirs; sketch below), and for .dev/CI preferring _create_venv_from_scratch (or a template venv) instead of cloning the full tox env: slower, but stabler. Optionally a unit test with a fake venv + volatile tmp/ to guard against regressions.
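A hedged sketch of the "copy that supports ignore" idea; the function name and ignore list are illustrative, not Beam's actual yaml_provider code, and note that a plain copy does not rewrite interpreter paths the way clonevirtualenv does:

```python
import shutil

# Volatile directories that can vanish mid-copy while the source venv
# (e.g. the live tox env in CI) is in use.
IGNORE = shutil.ignore_patterns(
    "tmp", "__pycache__", ".pytest_cache", "build", "*.pyc")

def copy_venv_skipping_volatile(src_venv: str, dst_venv: str) -> None:
    # shutil.copytree applies the ignore callable in every directory it
    # visits, so the volatile trees are never opened at all.
    shutil.copytree(src_venv, dst_venv, ignore=IGNORE)
```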

re: 2 -- subprocess_server waits on the gRPC channel becoming ready for expansion_service_main until pytest-timeout fires. faulthandler showed xdist/execnet sometimes, but turning xdist off for one Python version didn't fix the other legs, so parallelism isn't the whole story. I suggest that on child exit or a long wait we surface the exit code + stdout/stderr (and maybe fail faster with a clear error); a sketch follows below.
For ML YAML, preinstall/pin the heavy deps in tox so slow pip installs, imports, or model startup don't look like a hang.
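A minimal sketch of that suggestion; all names here are illustrative, not subprocess_server's real API:

```python
import subprocess
import grpc

def start_and_wait(cmd, target, timeout=60.0):
    """Start a service process and wait for its gRPC endpoint, surfacing early death."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    channel = grpc.insecure_channel(target)
    try:
        # Blocks until the channel is ready or the deadline passes.
        grpc.channel_ready_future(channel).result(timeout=timeout)
    except grpc.FutureTimeoutError:
        if proc.poll() is not None:
            # The child already died: report why instead of a bare timeout
            # that looks like a hang until pytest-timeout fires.
            out, _ = proc.communicate()
            raise RuntimeError(
                "expansion service exited with code "
                f"{proc.returncode}; output:\n{out[-2000:]}")
        raise  # still running but not ready: a genuine hang or slow start
    return proc
```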

Contributor Author

I will add the same env var to the Python PreCommit and see if it changes the expansion / xdist behavior; I'll update you on whether it helps.

Contributor Author

@tvalentyn I added GRPC_ENABLE_FORK_SUPPORT: "0" to the Python PreCommit and kept the YAML-side stabilization in yaml_provider, then reran the workflow and it passed: https://github.com/apache/beam/actions/runs/23892113364

@Abacn (Contributor) left a comment

Thanks!

@Amar3tto (Collaborator) commented Apr 2, 2026

Failed checks are unrelated

