Skip to content

fix: Resilient Redis cache reads for rate limiting#3455

Open
gabrielshanahan wants to merge 2 commits intomainfrom
gabi.shanahan/rate-limit-cache-migration
Open

fix: Resilient Redis cache reads for rate limiting#3455
gabrielshanahan wants to merge 2 commits intomainfrom
gabi.shanahan/rate-limit-cache-migration

Conversation

@gabrielshanahan
Copy link
Contributor

@gabrielshanahan gabrielshanahan commented Feb 5, 2026

Summary

  • Exclude actuator endpoints from rate limit filters — health probes (/actuator/health) no longer go through GlobalIpRateLimitFilter/GlobalUserRateLimitFilter, preventing crash loops when Redis cache entries are corrupted
  • Add ResilientCacheAccessor component — catches RedisException on direct cache.get() calls (not covered by TolgeeCacheErrorHandler which only handles @Cacheable), evicts bad entries, returns null (cache miss)
  • Integrate into RateLimitService — corrupted Bucket entries are treated as cache misses instead of 500 errors

Root cause (v3.154.0 crash loop)

Bucket class gained strikeCount/lastStrikeAt fields → old Kryo-serialized entries in Redis couldn't deserialize → KryoBufferUnderflowException → health checks hit rate limit filter → HTTP 500 → K8s restarts pod → crash loop.

Defense in depth

Layer What it does Protects against
Actuator bypass Skips rate limiting for /actuator/** Health probe failures from any cache issue
ResilientCacheAccessor Catches RedisException, evicts, returns null 500 errors on API requests from corrupted cache

Test plan

  • ResilientCacheAccessorTest — 5 tests (normal get, null return, RedisException caught + eviction, non-Redis propagation)
  • RateLimitServiceTest — 15 tests including new corrupted cache entry is treated as cache miss
  • Full backend test suite (server-app:runStandardTests)
  • Manual: start with Redis, corrupt a rateLimits entry, verify graceful handling

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Improved cache resilience: corrupted cache entries are evicted and treated as misses to prevent runtime errors.
    • Rate limiting now skips actuator endpoints so health and metrics remain accessible (including when an application context path is set).
  • Tests

    • Added tests validating cache resilience, eviction on corruption, null-handling, and exception propagation.
    • Expanded rate limiting tests to verify actuator endpoint exemptions and context-path handling.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 5, 2026

📝 Walkthrough

Walkthrough

Adds a new Spring component ResilientCacheAccessor to safely wrap Cache.get calls (catches RedisException, logs, evicts key, returns null). Integrates it into RateLimitService. Global rate-limit filters now skip requests whose path equals or starts with /actuator. Tests added/updated.

Changes

Cohort / File(s) Summary
Cache Resilience Component
backend/data/src/main/kotlin/io/tolgee/component/ResilientCacheAccessor.kt, backend/data/src/test/kotlin/io/tolgee/unit/ResilientCacheAccessorTest.kt
New Spring component ResilientCacheAccessor with fun <T> get(cache: Cache, key: Any, type: Class<T>): T? that catches RedisException, logs warnings, evicts via cache.evictIfPresent(key), and returns null. Unit tests cover success, miss, RedisException (eviction), and non-Redis exception propagation.
Rate Limit Service Integration
backend/data/src/main/kotlin/io/tolgee/security/ratelimit/RateLimitService.kt, backend/data/src/test/kotlin/io/tolgee/security/ratelimit/RateLimitServiceTest.kt, backend/security/src/test/kotlin/io/tolgee/security/ratelimit/RateLimitInterceptorTest.kt
RateLimitService constructor now accepts ResilientCacheAccessor; replaces direct cache.get(...) with resilientCacheAccessor.get(...) when loading buckets. Tests updated to inject/mock the accessor and assert behavior when null is returned.
Rate Limit Filters & Tests
backend/security/src/main/kotlin/io/tolgee/security/ratelimit/GlobalIpRateLimitFilter.kt, backend/security/src/main/kotlin/io/tolgee/security/ratelimit/GlobalUserRateLimitFilter.kt, backend/security/src/test/kotlin/.../GlobalIpRateLimitFilterTest.kt, backend/security/src/test/kotlin/.../GlobalUserRateLimitFilterTest.kt
shouldNotFilter logic extended to skip URIs equal to /actuator or starting with /actuator/. Added tests asserting actuator endpoints bypass rate limiting (no exception propagation).

Sequence Diagram

sequenceDiagram
    participant Client
    participant RateLimitFilter
    participant RateLimitService
    participant ResilientCacheAccessor
    participant Cache
    participant Redis

    Client->>RateLimitFilter: HTTP request
    RateLimitFilter->>RateLimitService: consume/check bucket
    RateLimitService->>ResilientCacheAccessor: get(cache, key, Bucket.class)
    ResilientCacheAccessor->>Cache: cache.get(key, type)
    Cache->>Redis: fetch entry
    alt Redis deserialization succeeds
        Redis-->>Cache: value
        Cache-->>ResilientCacheAccessor: Bucket
        ResilientCacheAccessor-->>RateLimitService: Bucket
    else RedisException occurs
        Redis-->>Cache: throws RedisException
        ResilientCacheAccessor->>ResilientCacheAccessor: log warnings
        ResilientCacheAccessor->>Cache: cache.evictIfPresent(key)
        ResilientCacheAccessor-->>RateLimitService: null
    end
    RateLimitService->>RateLimitService: create fresh bucket if null
    RateLimitService-->>RateLimitFilter: allow/reject
    RateLimitFilter-->>Client: response
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Suggested reviewers

  • JanCizmar
  • bdshadow

Poem

🐇 I nibble through caches in moonlit code,

When Redis hiccups and spills the load,
I log a whisper, sweep the broken key,
Return a fresh bucket for services to be,
Hopping onward, resilient and bold.

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 8.70% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely describes the main objective of the PR: implementing resilient Redis cache reads for the rate limiting system. It directly reflects the core changes (ResilientCacheAccessor wrapper, integration with RateLimitService, and actuator endpoint exclusions).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch gabi.shanahan/rate-limit-cache-migration

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gabrielshanahan gabrielshanahan force-pushed the gabi.shanahan/rate-limit-cache-migration branch from 26d5b9f to af106c8 Compare February 5, 2026 21:20
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@backend/security/src/main/kotlin/io/tolgee/security/ratelimit/GlobalUserRateLimitFilter.kt`:
- Around line 53-55: In GlobalUserRateLimitFilter.shouldNotFilter, the current
check request.requestURI.startsWith("/actuator/") misses the bare "/actuator"
path; update the condition to cover both forms (e.g. replace
startsWith("/actuator/") with request.requestURI == "/actuator" ||
request.requestURI.startsWith("/actuator/") or simply startsWith("/actuator"))
so the filter skips both "/actuator" and any subpaths.

@gabrielshanahan gabrielshanahan force-pushed the gabi.shanahan/rate-limit-cache-migration branch from af106c8 to 065aa6c Compare February 5, 2026 21:30
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@backend/security/src/main/kotlin/io/tolgee/security/ratelimit/GlobalIpRateLimitFilter.kt`:
- Around line 55-57: In shouldNotFilter, strip the servlet context path from
request.requestURI before performing the actuator checks so endpoints under a
context path still match; use request.contextPath (or request.getContextPath())
to remove the prefix, then compare the resulting path for equals "/actuator",
startsWith "/actuator/", or method == "OPTIONS" to decide skipping; update the
logic in GlobalIpRateLimitFilter.shouldNotFilter to operate on the trimmed URI.
🧹 Nitpick comments (3)
backend/security/src/test/kotlin/io/tolgee/security/ratelimit/GlobalUserRateLimitFilterTest.kt (1)

94-120: Also assert the rate-limit call is skipped.

These tests only assert “no exception”; adding a never() verify ensures actuator requests truly bypass the rate-limiter.

🧪 Suggested assertion
     assertDoesNotThrow { rateLimitFilter.doFilter(req, res, chain) }
+    Mockito.verify(rateLimitService, Mockito.never())
+      .consumeGlobalUserRateLimitPolicy(any(), any())
backend/security/src/test/kotlin/io/tolgee/security/ratelimit/GlobalIpRateLimitFilterTest.kt (1)

89-115: Also assert the rate-limit call is skipped.

These tests pass even if the rate-limit call still occurs; consider verifying it’s never invoked to lock in the bypass behavior.

🧪 Suggested assertion
     assertDoesNotThrow { rateLimitFilter.doFilter(req, res, chain) }
+    Mockito.verify(rateLimitService, Mockito.never())
+      .consumeGlobalIpRateLimitPolicy(any())
backend/data/src/test/kotlin/io/tolgee/security/ratelimit/RateLimitServiceTest.kt (1)

340-364: Strengthen the corrupted-cache test by asserting the accessor call.

Right now it only checks that no exception is thrown; verifying the accessor interaction ensures the intended path is exercised.

🧪 Suggested assertion
     serviceWithMockAccessor.consumeBucket(testPolicy)
+    Mockito.verify(mockAccessor)
+      .get(any<Cache>(), eq("corrupted_test"), eq(Bucket::class.java))

@gabrielshanahan gabrielshanahan force-pushed the gabi.shanahan/rate-limit-cache-migration branch from 065aa6c to 0752976 Compare February 5, 2026 21:57
JanCizmar
JanCizmar previously approved these changes Feb 6, 2026
@JanCizmar
Copy link
Contributor

Can we remove the CacheSchemaCleaner now?

Gabriel Shanahan and others added 2 commits February 6, 2026 10:37
Health checks (actuator endpoints) go through Spring Security's FilterChainProxy
on both port 8080 and 8081. When cached Bucket objects can't be deserialized
(e.g. after schema changes), rate limit filters throw RedisException, causing
health probes to fail and Kubernetes to restart pods in a crash loop.

Two defense layers:
1. Exclude /actuator/ from rate limit filters — prevents crash loops even
   with corrupted cache entries
2. ResilientCacheAccessor — catches RedisException on direct cache.get()
   calls (not covered by TolgeeCacheErrorHandler which only handles
   @Cacheable), evicts bad entries, and returns null (cache miss)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants