
ADR 002: Caching and Rate Limiting Strategy

Status

Accepted

Context

Folionaut needs caching and rate limiting for:

  • Rate limiting state (request counts per IP/key)
  • LLM response caching (expensive API calls)
  • Content caching (static content that rarely changes)
  • Protection against abuse and DoS attacks

Requirements:

  • Sub-millisecond latency for rate limit checks
  • Graceful degradation if cache is unavailable
  • Smooth traffic shaping with burst support
  • Clear feedback to clients

Decision

Caching: Layered Redis with Memory Fallback

Implement a layered caching strategy with Redis as the primary cache and in-memory LRU fallback.

| Layer | Technology | Purpose |
|-------|------------|---------|
| L1 | lru-cache (in-memory) | Fast reads, sub-ms latency |
| L2 | Redis | Distributed state, persistence |

Cache TTLs:

  • Rate limit: 60s
  • LLM response: 1 hour
  • Content: 5 minutes
  • Session: 30 minutes
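
These TTLs can be collected into one config object so every cache call draws from the same source of truth. The names below (`CACHE_TTL` and its keys) are illustrative, not Folionaut's actual configuration:

```typescript
// Cache TTLs from the list above, expressed in seconds.
// Names are illustrative; adjust keys to match the real cache call sites.
const CACHE_TTL = {
  rateLimit: 60,      // rate limit state: 60s
  llmResponse: 3600,  // LLM responses: 1 hour
  content: 300,       // static content: 5 minutes
  session: 1800,      // sessions: 30 minutes
} as const;
```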

Fallback behavior: If Redis is unavailable, the application continues with memory-only (L1) caching.
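
The read-through path and the Redis fallback can be sketched as below. This is a minimal sketch, not Folionaut's implementation: `RedisLike` stands in for a real client such as ioredis, and the tiny Map-based L1 approximates what lru-cache provides.

```typescript
// Stand-in for a Redis client; a real deployment would wrap e.g. ioredis.
interface RedisLike {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

class LayeredCache {
  // L1: in-memory entries with an expiry timestamp (approximate LRU).
  private l1 = new Map<string, { value: string; expires: number }>();
  constructor(private l2: RedisLike | null, private maxL1 = 1000) {}

  async get(key: string): Promise<string | null> {
    // L1 first: the sub-ms fast path.
    const hit = this.l1.get(key);
    if (hit && hit.expires > Date.now()) return hit.value;
    this.l1.delete(key);

    // L2 next: any Redis error degrades to memory-only caching.
    if (this.l2) {
      try {
        const value = await this.l2.get(key);
        if (value !== null) this.setL1(key, value, 60); // short L1 TTL
        return value;
      } catch {
        /* Redis down: fall through and continue memory-only */
      }
    }
    return null;
  }

  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    this.setL1(key, value, Math.min(ttlSeconds, 60)); // cap L1 staleness
    if (this.l2) {
      try {
        await this.l2.set(key, value, ttlSeconds);
      } catch {
        /* degrade silently; health checks surface the outage */
      }
    }
  }

  private setL1(key: string, value: string, ttlSeconds: number): void {
    if (this.l1.size >= this.maxL1) {
      // Map preserves insertion order; evict the oldest entry.
      const oldest = this.l1.keys().next().value;
      if (oldest !== undefined) this.l1.delete(oldest);
    }
    this.l1.set(key, { value, expires: Date.now() + ttlSeconds * 1000 });
  }
}
```

Note that every L2 call is wrapped in try/catch: a Redis outage turns cache misses into slower reads, never into request failures.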

Rate Limiting: Token Bucket Algorithm

Use the token bucket algorithm for rate limiting, with per-endpoint capacity and refill rates.

| Endpoint | Capacity | Refill Rate | Notes |
|----------|----------|-------------|-------|
| Chat | 5 | 0.1/s | ~6/min sustained |
| Content | 100 | 10/s | High burst for reads |
| Admin | 50 | 5/s | Generous for admin |
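
The algorithm itself is small: a bucket starts full (allowing a burst up to capacity), refills continuously at the configured rate, and each request spends one token or is rejected. The sketch below keeps state in a local object for clarity; in the deployed system the per-client state lives in Redis, per the caching decision above.

```typescript
// Token bucket with lazy refill: tokens accrue based on elapsed time,
// capped at capacity, and are recomputed only when a request arrives.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,   // max tokens = max burst size
    private refillRate: number, // tokens added per second
    now = Date.now(),
  ) {
    this.tokens = capacity; // buckets start full
    this.lastRefill = now;
  }

  tryConsume(now = Date.now()): boolean {
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillRate);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

With the chat settings from the table, `new TokenBucket(5, 0.1)` allows an initial burst of 5 requests, then roughly one request every 10 seconds (~6/min sustained).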

Response headers:

  • X-RateLimit-Limit: Bucket capacity
  • X-RateLimit-Remaining: Current tokens
  • X-RateLimit-Reset: Seconds until full
  • Retry-After: Seconds to wait (when limited)
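
Given bucket state, the headers above are simple arithmetic over capacity, remaining tokens, and refill rate. The helper below is a hypothetical illustration (only the header names come from this ADR):

```typescript
// Derive rate-limit response headers from token bucket state.
// `limited` indicates the current request was rejected.
function rateLimitHeaders(
  capacity: number,
  tokens: number,
  refillRate: number,
  limited: boolean,
): Record<string, string> {
  const headers: Record<string, string> = {
    "X-RateLimit-Limit": String(capacity),
    "X-RateLimit-Remaining": String(Math.floor(tokens)),
    // Seconds until the bucket refills to capacity.
    "X-RateLimit-Reset": String(Math.ceil((capacity - tokens) / refillRate)),
  };
  if (limited) {
    // Seconds until at least one token is available.
    headers["Retry-After"] = String(Math.ceil((1 - tokens) / refillRate));
  }
  return headers;
}
```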

Alternatives Considered

Caching

| Option | Pros | Cons |
|--------|------|------|
| Redis only | Simple | Single point of failure |
| Memory only | Fastest | Lost on restart, not distributed |
| Layered | Fast reads, fault tolerant | More complex |

Rate Limiting

| Algorithm | Pros | Cons |
|-----------|------|------|
| Fixed Window | Simple | Boundary spike problem |
| Token Bucket | Smooth, allows bursts | More state per client |
| Leaky Bucket | Constant rate | No burst allowance |

Consequences

Positive

  • Fast path: most reads hit memory cache (sub-ms)
  • Fault tolerant: application continues without Redis
  • Burst tolerance: legitimate bursts not rejected
  • Cost control: expensive LLM endpoints can be given stricter buckets (low capacity and refill rate) to cap API spend

Negative

  • Cache coherence issues in multi-instance deployment
  • State per client for rate limiting
  • Two cache layers to reason about

Mitigations

  • Short L1 TTLs (30-60s) to limit staleness
  • LRU eviction for memory pressure
  • Health checks and fallback event logging
