Summary
The run_tests_with_ollama.sh script has a fast path for when it detects an existing Ollama server (lines 82–84), but it blindly trusts that server without verifying it is actually functional. If the existing server is in a bad state, the entire test run fails with Ollama connectivity errors rather than a clear setup failure.
What goes wrong
When the script detects Ollama already running, it skips starting its own server and proceeds directly to model pulls and warmups. If a warmup times out, the script logs a warning but carries on anyway:
Warning: warmup for granite4:micro timed out (will load on first test)
The subsequent tests then error with "could not create OllamaModelBackend: ollama server not running at None" rather than failing fast. The run still takes the full ~80 minutes working through connection timeouts on every affected test before reporting the failures.
Suggested improvements
- Treat a warmup timeout as a fatal error rather than a warning: either call die() with a clear message or attempt to restart the server
- When reusing an existing server, verify it is responsive with a lightweight check (e.g. ollama ps) before proceeding to warmups
- Consider adding a --force-restart-ollama flag for environments where stale servers are common
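
A rough sketch of the responsiveness check, to make the second suggestion concrete. The function names `responds_within` and `verify_existing_ollama` are hypothetical and not taken from run_tests_with_ollama.sh, and the 10-second default is an arbitrary placeholder. It assumes GNU coreutils `timeout` is available on the node.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the default time limit are assumptions,
# not the script's existing code.

# Run a command under a time limit. Succeeds only if the command exits 0
# before the limit expires. Assumes GNU coreutils `timeout` is installed.
responds_within() {
    local secs="$1"
    shift
    timeout "$secs" "$@" >/dev/null 2>&1
}

# Check a reused Ollama server before trusting it. `ollama ps` is the
# lightweight check suggested in the list above.
verify_existing_ollama() {
    local secs="${1:-10}"
    if ! responds_within "$secs" ollama ps; then
        echo "Error: existing Ollama server did not answer 'ollama ps' within ${secs}s" >&2
        return 1
    fi
}
```

On the reuse path the script could call `verify_existing_ollama 10` and pass a failure on to die(), so a stale server aborts the run within seconds instead of surfacing as per-test connection errors. The same `responds_within` wrapper could guard the warmup step if warmup timeouts are made fatal.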
Context
Encountered during a manual cluster test run on an IBM LSF p-series GPU node (preemptable queue). The node had a stale Ollama server from a previous session that was running but unresponsive. Re-running after confirming no Ollama process was running produced a clean result.