@snopoke snopoke commented Nov 10, 2025

claude and others added 14 commits November 10, 2025 11:42
Create a complete Django project that reproduces a bug affecting SSL
verification in PostgreSQL connections when using:
- Celery with gevent pool
- Langfuse (>=3.0) with OpenTelemetry instrumentation
- psycopg3 with SSL connections

Project includes:
- Django 4.2 with PostgreSQL SSL configuration
- Celery 5.3+ configured with gevent pool
- Langfuse 3.0+ integration with @observe decorators
- Docker Compose setup for PostgreSQL (with SSL) and Redis
- Multiple test scripts to isolate and reproduce the bug
- Comprehensive documentation

Test scripts:
- reproduce_bug.py: Standalone test with 4 scenarios
- trigger_tasks.py: Trigger Celery tasks to test in worker context
- manage.py test_bug: Django management command for testing

The bug manifests when gevent's monkey patching interacts with
OpenTelemetry's instrumentation, affecting SSL context handling
in psycopg3 database connections.

Replace pip-based workflow with uv for significantly faster
dependency installation and environment management.

Changes:
- Add pyproject.toml for modern Python project configuration
- Add .python-version file for consistent Python version (3.11)
- Generate uv.lock for deterministic dependency resolution
- Update setup.sh to install and use uv instead of pip
- Update run_celery_gevent.sh to use 'uv run'
- Update README.md with uv-first instructions and examples
- Update .gitignore to ignore the .venv directory

Benefits:
- Much faster dependency installation (uv advertises 10-100x over pip)
- Deterministic builds with uv.lock
- Automatic virtual environment management
- Better dependency resolution
- Still supports traditional pip workflow via requirements.txt

All scripts now use 'uv run' for command execution. Users can still
use traditional pip/venv workflow if preferred.

Significantly expand the bug reproduction capabilities to increase
the likelihood of triggering the gevent + langfuse + psycopg3 SSL issue.

New Features:

1. RequestLog Model
   - Track HTTP requests made during task execution
   - Log URL, method, status code, response time, and errors
   - Stored in PostgreSQL over the SSL connection

2. New Celery Tasks
   - test_internal_observe: Applies @observe inside the task body rather than decorating the task itself
   - test_http_with_db_logging: Makes HTTP requests and logs to DB
   - test_multiple_http_requests: Multiple HTTP calls per task
   - test_mixed_operations: Combines DB queries, HTTP, and ORM operations

3. Enhanced trigger_tasks.py Script
   - Multiple test modes: simple, http, multiple, mixed, stress, all
   - Command-line arguments to control test parameters
   - Stress test mode with configurable concurrency
   - Default: 20 concurrent tasks + 10 HTTP tasks
   - Progress indicators and detailed result summaries

4. HTTP Integration
   - Uses httpbin.org for realistic HTTP requests
   - Adds I/O delays to increase concurrency pressure
   - Combines HTTP and database operations in single tasks

Benefits:
- Higher likelihood of reproducing the SSL verification bug
- More realistic workload scenarios
- Stress testing capabilities
- Multiple test vectors for different bug scenarios
- Better observability with request logging

Usage Examples:
  uv run python trigger_tasks.py                    # All tests
  uv run python trigger_tasks.py --mode stress      # Stress test only
  uv run python trigger_tasks.py --concurrency 50   # High concurrency

Files Modified:
- testapp/models.py: Add RequestLog model
- testapp/tasks.py: Add 4 new comprehensive tasks
- trigger_tasks.py: Complete rewrite with multiple test modes
- pyproject.toml/requirements.txt: Add requests library
- README.md: Document new tasks and test modes

Create specialized testing tools to increase likelihood of reproducing
the intermittent SSL verification bug that occurs with gevent + langfuse
+ psycopg3.

New Tools:

1. test_monkey_patching.py
   - Test 6 different monkey patching strategies
   - Early vs late patching relative to Django/OTEL imports
   - Aggressive vs minimal module patching
   - SSL-only and no-SSL variants
   - Helps identify which patching order triggers the bug

2. test_connection_pool.py
   - Connection cycling: rapidly open/close connections
   - Concurrent connections: many greenlets simultaneously
   - Mixed operations: varying timing patterns
   - Rapid context switches: continuous operations with greenlet switching
   - Exposes race conditions in connection pool and SSL context

3. inspect_ssl_context.py
   - Diagnostic tool showing SSL module state
   - Thread-local storage behavior in greenlets
   - psycopg3 internals and connection details
   - SSL parameters and certificate info
   - Helps understand current SSL context state

4. celery_worker_early_patch.py
   - Worker entry point with monkey patching before ALL imports
   - Tests whether early patching affects SSL initialization
   - Alternative to standard celery worker command

5. run_celery_multiworker.sh
   - Helper script for running multiple worker processes
   - Higher total concurrency to expose race conditions

6. REPRODUCING_THE_BUG.md
   - Comprehensive guide for bug reproduction strategies
   - Explains theory behind the bug
   - Step-by-step reproduction phases
   - What to look for and how to report results

Key Strategies:

- Vary monkey patching order (before/after Django imports)
- Aggressive connection pool stress testing
- High concurrency with many greenlets
- Long-running continuous operations
- Multiple worker processes
- Rapid greenlet context switching
- SSL context inspection and debugging

Theory:
The bug likely involves a race condition in which:
1. Gevent's monkey patching affects SSL context initialization
2. OTEL's context propagation interferes with greenlet switching
3. Thread-local storage is accessed from the wrong greenlet
4. Connection pool state changes during specific timing windows

These tools provide multiple attack vectors to trigger the bug
by stressing different aspects of the system.
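
The thread-local part of the theory rests on a simple isolation property, which the sketch below illustrates with plain OS threads. It does not reproduce the bug; the marker name is hypothetical, and no gevent is involved. With gevent's thread patching applied, `threading.local` is replaced by a greenlet-local implementation, so the same isolation would apply per greenlet instead of per thread.

```python
import threading

# Module-level thread-local storage, standing in for per-thread SSL state.
state = threading.local()


def read_marker(results: dict, key: str) -> None:
    """Record what the current thread sees in thread-local storage."""
    results[key] = getattr(state, "ssl_context_marker", None)


def demo() -> dict:
    """Set a marker in the calling thread, then read it from two threads."""
    state.ssl_context_marker = "set-in-main"
    results: dict = {}
    read_marker(results, "main")
    worker = threading.Thread(target=read_marker, args=(results, "worker"))
    worker.start()
    worker.join()
    return results
```

The worker thread never sees the marker set by the calling thread. Code that initializes something once and then assumes it is visible everywhere can therefore behave differently depending on which thread or greenlet performs the lookup.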

Usage Examples:
  uv run python test_monkey_patching.py --strategy early_aggressive
  uv run python test_connection_pool.py --cycles 200 --greenlets 30
  uv run python inspect_ssl_context.py
  uv run python celery_worker_early_patch.py --pool=gevent --concurrency=20

See REPRODUCING_THE_BUG.md for detailed reproduction strategies.

Problem: Monkey patching is persistent within a Python process. Once
modules are patched, they cannot be unpatched, causing interference
between different patching strategy tests.

Solution: Refactor test_monkey_patching.py to spawn a subprocess for
each strategy when running with --strategy=all (default behavior).

Changes:
- Add subprocess-based execution for each strategy
- Add --internal-run flag for subprocess execution
- Each strategy now runs in completely isolated Python process
- No interference between different patching configurations
- Simplified usage: just run "uv run python test_monkey_patching.py"
- Can still test single strategy without subprocess overhead

Benefits:
- Reliable, reproducible results for each strategy
- Clean Python environment for each test
- Can test all 6 strategies in one command
- Better isolation reveals true patching order effects

Usage:
  # Test all strategies (automatic subprocesses)
  uv run python test_monkey_patching.py

  # Test single strategy (no subprocess needed)
  uv run python test_monkey_patching.py --strategy early_aggressive

This addresses the concern that testing multiple strategies in the
same process would produce unreliable results due to persistent
monkey patching state.
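
The isolation approach can be sketched as below. This is a simplified stand-in, not the actual test_monkey_patching.py: it launches the child with `python -c` rather than re-invoking the script with --internal-run, the second strategy name is hypothetical, and a module attribute assignment stands in for the real gevent patching.

```python
import subprocess
import sys

# Only "early_aggressive" appears in the commit message; the other name
# is a placeholder.
STRATEGIES = ["early_aggressive", "late_minimal"]

# Child program: mutates a module global, as monkey patching would.
CHILD_CODE = """
import socket, sys
strategy = sys.argv[1]
socket.PATCHED_BY = strategy  # stand-in for gevent.monkey.patch_all()
print(socket.PATCHED_BY)
"""


def run_isolated(strategy: str) -> tuple[int, str]:
    """Run one strategy in a fresh interpreter and capture its output."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, strategy],
        capture_output=True, text=True, timeout=60,
    )
    return proc.returncode, proc.stdout.strip()


def run_all() -> dict:
    """Run every strategy, each in its own process."""
    return {name: run_isolated(name) for name in STRATEGIES}
```

Because each child exits after its run, the mutation made for one strategy never reaches the parent or the next child, which is the property the refactor relies on.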

Updated documentation to reflect simpler usage pattern.

Updated inspect_ssl_context.py and test_connection_pool.py to use
the pg_stat_ssl system view instead of the PostgreSQL SSL functions
(ssl_is_used(), ssl_version(), ssl_cipher()), which are provided by
the sslinfo extension and so are not available in all PostgreSQL
Docker images.

Changes:
- Replace direct SSL function calls with subqueries from pg_stat_ssl
- Query format: SELECT ssl, version, cipher FROM pg_stat_ssl WHERE pid = pg_backend_pid()
- Add null handling for SSL information (N/A when not available)
- Maintain backward compatibility with existing variable names
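
A minimal sketch of the query plus null handling described above, assuming a DB-API style cursor (as exposed by psycopg3 and Django). The function name and dictionary keys are illustrative and are not the variable names used in the actual scripts.

```python
# Query format taken from the commit message.
SSL_INFO_QUERY = (
    "SELECT ssl, version, cipher FROM pg_stat_ssl "
    "WHERE pid = pg_backend_pid()"
)


def fetch_ssl_info(cursor) -> dict:
    """Return SSL details for the current backend, using N/A for gaps."""
    cursor.execute(SSL_INFO_QUERY)
    row = cursor.fetchone()
    if row is None:
        # No row for this backend: report SSL as unused.
        return {"ssl_used": False, "ssl_version": "N/A", "ssl_cipher": "N/A"}
    ssl_used, version, cipher = row
    return {
        "ssl_used": bool(ssl_used),
        # version and cipher are NULL when the connection is not encrypted.
        "ssl_version": version or "N/A",
        "ssl_cipher": cipher or "N/A",
    }
```

With Django this could be called as `fetch_ssl_info(connection.cursor())`, since only `execute` and `fetchone` are used.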

Enhanced bug reproduction strategy to test thread interference:

1. Updated test_monkey_patching.py:
   - Now spawns concurrent greenlets instead of sequential operations
   - Default increased to 50 concurrent greenlets per strategy
   - Forces context switches and connection cycling during tests
   - Much more likely to expose race conditions

2. Added langgraph to dependencies:
   - LangGraph uses real OS threads internally
   - Critical for testing thread interference with gevent

3. Created test_langgraph_gevent.py:
   - Tests interaction between langgraph threads and gevent greenlets
   - Key insight: langgraph's threads + gevent's monkey patching + SSL
     context in thread-local storage can cause SSL verification failures
   - Two test modes: basic (multiple rounds) and concurrent (maximum stress)
   - Includes HTTP requests to add I/O delays and timing variations

4. Updated REPRODUCING_THE_BUG.md:
   - Added langgraph test as PRIORITY test strategy
   - Explained why thread interference is critical to reproduce bug
   - Updated recommended reproduction strategy

The langgraph test is the most likely to reproduce the bug because it
creates the conditions suspected of causing the SSL context to be read
from the wrong thread-local storage during greenlet context switches.