Set up Django project with Celery and gevent #2

snopoke · 2025-11-10T12:01:01Z

Environment to try and reproduce the errors described in https://docs.google.com/document/d/1VMiPFP1qL17TyG2huzQxLXH49IasPCvaQRRfG32SP0I/edit?tab=t.0

@observe

Create a complete Django project that reproduces a bug affecting SSL verification in PostgreSQL connections when using:
- Celery with gevent pool
- Langfuse (>3.0) with OpenTelemetry instrumentation
- psycopg3 with SSL connections

Project includes:
- Django 4.2 with PostgreSQL SSL configuration
- Celery 5.3+ configured with gevent pool
- Langfuse 3.0+ integration with @observe decorators
- Docker Compose setup for PostgreSQL (with SSL) and Redis
- Multiple test scripts to isolate and reproduce the bug
- Comprehensive documentation

Test scripts:
- reproduce_bug.py: Standalone test with 4 scenarios
- trigger_tasks.py: Trigger Celery tasks to test in worker context
- manage.py test_bug: Django management command for testing

The bug manifests when gevent's monkey patching interacts with OpenTelemetry's instrumentation, affecting SSL context handling in psycopg3 database connections.

Replace pip-based workflow with uv for significantly faster dependency installation and environment management.

Changes:
- Add pyproject.toml for modern Python project configuration
- Add .python-version file for consistent Python version (3.11)
- Generate uv.lock for deterministic dependency resolution
- Update setup.sh to install and use uv instead of pip
- Update run_celery_gevent.sh to use 'uv run'
- Update README.md with uv-first instructions and examples
- Update .gitignore to include .venv directory

Benefits:
- 10-100x faster dependency installation with uv
- Deterministic builds with uv.lock
- Automatic virtual environment management
- Better dependency resolution
- Still supports traditional pip workflow via requirements.txt

All scripts now use 'uv run' for command execution. Users can still use the traditional pip/venv workflow if preferred.

@observe

Significantly expand the bug reproduction capabilities to increase the likelihood of triggering the gevent + langfuse + psycopg3 SSL issue.

New Features:

1. RequestLog Model
   - Track HTTP requests made during task execution
   - Log URL, method, status code, response time, and errors
   - Stores in PostgreSQL with SSL connection

2. New Celery Tasks
   - test_internal_observe: Uses @observe internally instead of as decorator
   - test_http_with_db_logging: Makes HTTP requests and logs to DB
   - test_multiple_http_requests: Multiple HTTP calls per task
   - test_mixed_operations: Combines DB queries, HTTP, and ORM operations

3. Enhanced trigger_tasks.py Script
   - Multiple test modes: simple, http, multiple, mixed, stress, all
   - Command-line arguments to control test parameters
   - Stress test mode with configurable concurrency
   - Default: 20 concurrent tasks + 10 HTTP tasks
   - Progress indicators and detailed result summaries

4. HTTP Integration
   - Uses httpbin.org for realistic HTTP requests
   - Adds I/O delays to increase concurrency pressure
   - Combines HTTP and database operations in single tasks

Benefits:
- Higher likelihood of reproducing the SSL verification bug
- More realistic workload scenarios
- Stress testing capabilities
- Multiple test vectors for different bug scenarios
- Better observability with request logging

Usage Examples:

    uv run python trigger_tasks.py                    # All tests
    uv run python trigger_tasks.py --mode stress      # Stress test only
    uv run python trigger_tasks.py --concurrency 50   # High concurrency

Files Modified:
- testapp/models.py: Add RequestLog model
- testapp/tasks.py: Add 4 new comprehensive tasks
- trigger_tasks.py: Complete rewrite with multiple test modes
- pyproject.toml/requirements.txt: Add requests library
- README.md: Document new tasks and test modes
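The command-line surface described above could be sketched roughly as follows. This is not the actual trigger_tasks.py: the mode names and the `--mode` / `--concurrency` flags come from the commit message, while `build_parser`, `modes_to_run`, the defaults, and the assumption that "all" expands to every other mode are illustrative guesses.

```python
import argparse

# Mode names as listed in the commit message.
MODES = ["simple", "http", "multiple", "mixed", "stress", "all"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser resembling the described trigger_tasks.py options (hypothetical)."""
    parser = argparse.ArgumentParser(description="Trigger Celery test tasks")
    parser.add_argument(
        "--mode",
        choices=MODES,
        default="all",  # assumption: running with no arguments runs all tests
        help="which group of test tasks to trigger",
    )
    parser.add_argument(
        "--concurrency",
        type=int,
        default=20,  # assumption: matches the stated default of 20 concurrent tasks
        help="number of concurrent tasks for the stress test",
    )
    return parser


def modes_to_run(args: argparse.Namespace) -> list[str]:
    """Expand 'all' into every concrete mode; otherwise return the single chosen mode."""
    if args.mode == "all":
        return [mode for mode in MODES if mode != "all"]
    return [args.mode]


if __name__ == "__main__":
    parsed = build_parser().parse_args()
    print(f"modes={modes_to_run(parsed)} concurrency={parsed.concurrency}")
```

Keeping the mode expansion in a separate function makes it possible to check the argument handling without a running Celery worker.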

Create specialized testing tools to increase likelihood of reproducing the intermittent SSL verification bug that occurs with gevent + langfuse + psycopg3.

New Tools:

1. test_monkey_patching.py
   - Test 6 different monkey patching strategies
   - Early vs late patching relative to Django/OTEL imports
   - Aggressive vs minimal module patching
   - SSL-only and no-SSL variants
   - Helps identify which patching order triggers the bug

2. test_connection_pool.py
   - Connection cycling: rapidly open/close connections
   - Concurrent connections: many greenlets simultaneously
   - Mixed operations: varying timing patterns
   - Rapid context switches: continuous operations with greenlet switching
   - Exposes race conditions in connection pool and SSL context

3. inspect_ssl_context.py
   - Diagnostic tool showing SSL module state
   - Thread-local storage behavior in greenlets
   - psycopg3 internals and connection details
   - SSL parameters and certificate info
   - Helps understand current SSL context state

4. celery_worker_early_patch.py
   - Worker entry point with monkey patching before ALL imports
   - Tests whether early patching affects SSL initialization
   - Alternative to standard celery worker command

5. run_celery_multiworker.sh
   - Helper script for running multiple worker processes
   - Higher total concurrency to expose race conditions

6. REPRODUCING_THE_BUG.md
   - Comprehensive guide for bug reproduction strategies
   - Explains theory behind the bug
   - Step-by-step reproduction phases
   - What to look for and how to report results

Key Strategies:
- Vary monkey patching order (before/after Django imports)
- Aggressive connection pool stress testing
- High concurrency with many greenlets
- Long-running continuous operations
- Multiple worker processes
- Rapid greenlet context switching
- SSL context inspection and debugging

Theory:

The bug likely involves a race condition where:
1. Gevent's monkey patching affects SSL context initialization
2. OTEL's context propagation interferes with greenlet switching
3. Thread-local storage accessed from wrong greenlet
4. Connection pool state during specific timing windows

These tools provide multiple attack vectors to trigger the bug by stressing different aspects of the system.

Usage Examples:

    uv run python test_monkey_patching.py --strategy early_aggressive
    uv run python test_connection_pool.py --cycles 200 --greenlets 30
    uv run python inspect_ssl_context.py
    uv run python celery_worker_early_patch.py --pool=gevent --concurrency=20

See REPRODUCING_THE_BUG.md for detailed reproduction strategies.

Problem: Monkey patching is persistent within a Python process. Once modules are patched, they cannot be unpatched, causing interference between different patching strategy tests.

Solution: Refactor test_monkey_patching.py to spawn a subprocess for each strategy when running with --strategy=all (default behavior).

Changes:
- Add subprocess-based execution for each strategy
- Add --internal-run flag for subprocess execution
- Each strategy now runs in a completely isolated Python process
- No interference between different patching configurations
- Simplified usage: just run "uv run python test_monkey_patching.py"
- Can still test a single strategy without subprocess overhead

Benefits:
- Reliable, reproducible results for each strategy
- Clean Python environment for each test
- Can test all 6 strategies in one command
- Better isolation reveals true patching order effects

Usage:

    # Test all strategies (automatic subprocesses)
    uv run python test_monkey_patching.py

    # Test single strategy (no subprocess needed)
    uv run python test_monkey_patching.py --strategy early_aggressive

This addresses the concern that testing multiple strategies in the same process would produce unreliable results due to persistent monkey patching state. Updated documentation to reflect the simpler usage pattern.
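The isolation idea can be illustrated with a minimal standard-library sketch. It does not use gevent or the real test_monkey_patching.py; the `run_isolated` helper and the `FAKE_PATCH` attribute are hypothetical stand-ins for "apply a patching strategy" and "observe whether a patch is present".

```python
import subprocess
import sys


def run_isolated(snippet: str) -> str:
    """Run a Python snippet in a fresh interpreter and return its stripped stdout."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        check=True,   # raise if the child exits with a non-zero status
        timeout=60,
    )
    return result.stdout.strip()


# Stand-in for a patching strategy: mutate a module in the child, then report.
PATCH_AND_REPORT = (
    "import ssl; ssl.FAKE_PATCH = True; print(hasattr(ssl, 'FAKE_PATCH'))"
)
# Stand-in for the next strategy: only report whether the earlier patch is visible.
REPORT_ONLY = "import ssl; print(hasattr(ssl, 'FAKE_PATCH'))"

first_child = run_isolated(PATCH_AND_REPORT)
second_child = run_isolated(REPORT_ONLY)

print(f"first child sees patch:  {first_child}")
print(f"second child sees patch: {second_child}")
```

Because each child is a new interpreter, the module mutation made in the first run is not visible in the second, which is the property the per-strategy subprocesses rely on.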

Updated inspect_ssl_context.py and test_connection_pool.py to query the pg_stat_ssl system view instead of calling the PostgreSQL SSL functions (ssl_is_used(), ssl_version(), ssl_cipher()), which are not available in all PostgreSQL Docker images.

Changes:
- Replace direct SSL function calls with subqueries from pg_stat_ssl
- Query format: SELECT ssl, version, cipher FROM pg_stat_ssl WHERE pid = pg_backend_pid()
- Add null handling for SSL information (N/A when not available)
- Maintain backward compatibility with existing variable names
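The null handling described above might look something like the sketch below. The query string is the one quoted in the commit message; the `describe_ssl` helper and its dictionary keys are hypothetical and are not taken from the actual scripts.

```python
from typing import Any, Optional, Tuple

# Query quoted in the commit message.
SSL_INFO_QUERY = (
    "SELECT ssl, version, cipher FROM pg_stat_ssl WHERE pid = pg_backend_pid()"
)

NOT_AVAILABLE = "N/A"


def describe_ssl(row: Optional[Tuple[Any, Any, Any]]) -> dict:
    """Turn a pg_stat_ssl row (or None) into a dict, using 'N/A' for missing values."""
    if row is None:
        # No row for this backend: report everything as unavailable.
        return {
            "ssl_used": NOT_AVAILABLE,
            "ssl_version": NOT_AVAILABLE,
            "ssl_cipher": NOT_AVAILABLE,
        }
    ssl_used, version, cipher = row
    return {
        "ssl_used": NOT_AVAILABLE if ssl_used is None else ssl_used,
        "ssl_version": NOT_AVAILABLE if version is None else version,
        "ssl_cipher": NOT_AVAILABLE if cipher is None else cipher,
    }


# Example with a hand-written row standing in for cursor.fetchone().
print(describe_ssl((False, None, None)))
```

In a Django context the row would typically come from executing `SSL_INFO_QUERY` on a cursor obtained from `django.db.connection` and passing `cursor.fetchone()` to the helper.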

Enhanced bug reproduction strategy to test thread interference:

1. Updated test_monkey_patching.py:
   - Now spawns concurrent greenlets instead of sequential operations
   - Default increased to 50 concurrent greenlets per strategy
   - Forces context switches and connection cycling during tests
   - Much more likely to expose race conditions

2. Added langgraph to dependencies:
   - Langgraph uses real OS threads internally
   - Critical for testing thread interference with gevent

3. Created test_langgraph_gevent.py:
   - Tests interaction between langgraph threads and gevent greenlets
   - Key insight: langgraph's threads + gevent's monkey patching + SSL context in thread-local storage can cause SSL verification failures
   - Two test modes: basic (multiple rounds) and concurrent (maximum stress)
   - Includes HTTP requests to add I/O delays and timing variations

4. Updated REPRODUCING_THE_BUG.md:
   - Added langgraph test as PRIORITY test strategy
   - Explained why thread interference is critical to reproduce the bug
   - Updated recommended reproduction strategy

The langgraph test is most likely to reproduce the bug because it creates the exact conditions that cause SSL context to be accessed from the wrong thread-local storage during greenlet context switches.
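As background for the thread-local reasoning above, the following standard-library sketch shows that a value stored in `threading.local()` by one OS thread is not visible from another. It deliberately involves neither gevent nor langgraph, so it only illustrates the baseline behavior that the theory builds on; the names `state`, `marker`, and `read_marker` are illustrative.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Each OS thread sees its own independent set of attributes on this object.
state = threading.local()


def read_marker():
    """Return the marker visible to the calling thread, or None if it has none."""
    return getattr(state, "marker", None)


# Set the marker in the main thread only.
state.marker = "set-in-main-thread"

# A worker thread gets a fresh, empty view of the same thread-local object.
with ThreadPoolExecutor(max_workers=1) as pool:
    seen_in_worker = pool.submit(read_marker).result()

seen_in_main = read_marker()

print(f"main thread sees:   {seen_in_main}")
print(f"worker thread sees: {seen_in_worker}")
```

Whether and how this scoping changes once gevent's monkey patching is applied is exactly what the test scripts in this PR probe; this sketch makes no claim about that case.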

claude and others added 14 commits November 10, 2025 11:42

use different ports

8be1555

update dependencies

a8181d5

fix imports

70a4188

count request timeouts separately

305bbef

add custom langfuse tracing setup

32f472c

use multiple langfuse accounts

f92bd46

add long running test

031cca8
