* fix(middleware): offload memory injection off event loop to prevent tiktoken blocking (#3402)
DynamicContextMiddleware.abefore_agent() called _inject() synchronously
on the asyncio event loop. The first time memory is injected (second
request), _inject() → format_memory_for_injection() → _count_tokens()
→ tiktoken.get_encoding("cl100k_base") needs to download the BPE data
from openaipublic.blob.core.windows.net. In network-restricted
environments this download blocks until the OS TCP timeout (~26 min),
starving ALL concurrent handlers including /api/v1/auth/me.
Fix:
- abefore_agent now uses asyncio.to_thread(self._inject, state) so
file I/O and tiktoken never block the event loop.
- Extract _get_tiktoken_encoding() with a module-level cache so
tiktoken.get_encoding() is called at most once per encoding name.
- Add warm_tiktoken_cache() startup helper; gateway lifespan pre-warms
the cache via asyncio.to_thread so the first request never triggers a
cold download.
- _count_tokens falls back to len(text) // 4 on any encoding failure.
Tests:
- tests/test_tiktoken_cache_and_count_tokens.py (12 tests): cache
hit/miss, fallback paths, warm-up helper.
- tests/blocking_io/test_dynamic_context_middleware.py (2 tests):
Blockbuster gate verifies abefore_agent does not block the event
loop; async/sync parity check.
Fixes#3402
* Apply suggestions from code review
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix the lint error
* fix(memory): use future annotations to avoid NameError when tiktoken is absent
Add `from __future__ import annotations` to prompt.py so that
tiktoken.Encoding type hints are never evaluated at runtime. Without
this, environments where tiktoken is not installed could raise NameError
on the module-level cache and function return annotations.
Addresses Copilot review comment on PR #3411.
* fix(middleware): bound abefore_agent injection with timeout to prevent hung requests
Wrap the asyncio.to_thread(self._inject) offload in asyncio.wait_for()
with a 5-second cap. If the startup warm-up failed silently (e.g.
network blip during deploy), a cold tiktoken BPE download on the first
request can block until the OS TCP timeout (~26 min). The bounded
timeout ensures the request degrades gracefully (no memory/date context
for that turn) rather than hanging.
Adds test_abefore_agent_returns_none_on_timeout to the blocking-IO
regression anchors.
Addresses review feedback from xg-gh-25 on PR #3411.
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>