fix: run_tests_with_ollama.sh proceeds silently when Ollama warmup times out #759

@ajbozarth

Summary

The run_tests_with_ollama.sh script has a fast path for when it detects an existing Ollama server (lines 82–84), but it blindly trusts that server without verifying it is actually functional. If the existing server is in a bad state, the entire test run fails with Ollama connectivity errors rather than a clear setup failure.

What goes wrong

When the script detects Ollama already running, it skips starting its own server and proceeds directly to model pulls and warmups. If a warmup times out, the script logs a warning but carries on anyway:

Warning: warmup for granite4:micro timed out (will load on first test)

The subsequent tests then error with "could not create OllamaModelBackend: ollama server not running at None" rather than failing fast. The run still takes the full ~80 minutes working through connection timeouts on every affected test before reporting the failures.

Suggested improvements

  • Treat a warmup timeout as a fatal error rather than a warning: either die() with a clear message or attempt to restart the server
  • When reusing an existing server, verify it is responsive with a lightweight check (e.g. ollama ps) before proceeding to warmups
  • Consider adding a --force-restart-ollama flag for environments where stale servers are common
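
A rough sketch of the first two suggestions, for discussion only. The helper names (responds_within, require_responsive_ollama, warmup_or_die) are hypothetical and not taken from the script, the die() body is an assumed stand-in for the script's own, and the warmup command shown (a one-word prompt through ollama run) is a guess at what the script does. It also assumes GNU coreutils timeout is on PATH, which is not the case by default on macOS.

```shell
# Sketch only: helper names are hypothetical, not from run_tests_with_ollama.sh.
# Assumes GNU coreutils `timeout` is available on PATH.

# The issue mentions die(); this definition is an assumed stand-in.
die() {
  echo "Error: $*" >&2
  exit 1
}

# Return 0 only if the given command exits successfully within SECS seconds.
# `timeout` exits with status 124 when the time limit is hit.
responds_within() {
  local secs="$1"
  shift
  timeout "$secs" "$@" >/dev/null 2>&1
}

# Fail fast if a reused server does not answer a lightweight `ollama ps`.
require_responsive_ollama() {
  responds_within "${1:-10}" ollama ps \
    || die "existing Ollama server did not answer 'ollama ps'; stop it and re-run"
}

# Treat a warmup failure or timeout as fatal instead of logging a warning.
# The warmup command (a short prompt via `ollama run`) is an assumption.
warmup_or_die() {
  local model="$1" secs="${2:-120}"
  responds_within "$secs" ollama run "$model" "hi" \
    || die "warmup for $model failed or timed out after ${secs}s"
}
```

In the fast path that reuses an existing server, the script could call require_responsive_ollama before any model pulls, and replace the current warning branch with something like warmup_or_die granite4:micro 120. The timeout values are placeholders; a slow cold load on a shared GPU node may need a larger limit.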

Context

Encountered during a manual cluster test run on an IBM LSF p-series GPU node (preemptable queue). The node had a stale Ollama server from a previous session that was running but unresponsive. Re-running after confirming no Ollama process was running produced a clean result.
