docs(config-refactor): rewrite design and plan to reflect shipped architecture

Bring the design and plan docs in line with what actually shipped in #2271. The original plan specified a simple "single ContextVar + ConfigNotInitializedError" lifecycle in a new context.py module; the shipped lifecycle is a 3-tier fallback (process-global + ContextVar override + auto-load warning) attached to AppConfig itself, because ContextVar alone could not propagate config updates across Gateway async request boundaries. Design doc: replace the lifecycle section with the shipped 3-tier model, document DeerFlowContext + resolve_context() in deer_flow_context.py, add an access-pattern matrix (typed middleware vs dict-legacy vs non-agent), and a post-mortem section explaining the three material divergences from the plan with commit references (7a11e925, 4df595b0, a934a822). Plan doc: collapse the historical TDD step-by-step into a task log with checkboxes marked complete, add a top-of-file post-mortem table, and update the file-structure tables to match shipped call-site migrations. All 16 tasks retained; commit SHAs added for the divergent ones.
2026-04-25 11:18:22 +00:00 · 2026-04-16 14:20:17 +08:00 · 2026-04-16 14:20:17 +08:00 · 7656d6399d
commit 7656d6399d
parent a934a822df
2 changed files with 322 additions and 1114 deletions
--- a/docs/plans/2026-04-12-config-refactor-design.md
+++ b/docs/plans/2026-04-12-config-refactor-design.md
@@ -1,14 +1,16 @@
 # Design: Eliminate Global Mutable State in Configuration System

-> Implements [#1811](https://github.com/bytedance/deer-flow/issues/1811) · Tracked in [#2151](https://github.com/bytedance/deer-flow/issues/2151)
+> Implements [#1811](https://github.com/bytedance/deer-flow/issues/1811) · Tracked in [#2151](https://github.com/bytedance/deer-flow/issues/2151) · Shipped in [PR #2271](https://github.com/bytedance/deer-flow/pull/2271)
+>
+> **Status:** Shipped. This document reflects the architecture that was merged. For the divergence from the original plan and the reasoning, see §7.

 ## Problem

-`deerflow/config/` has three structural issues:
+`deerflow/config/` had three structural issues:

-1. **Dual source of truth** — each sub-config exists both as an `AppConfig` field and a module-level global (e.g. `_memory_config`). Consumers don't know which to trust.
-2. **Side-effect coupling** — `AppConfig.from_file()` silently mutates 8 sub-module globals via `load_*_from_dict()` calls.
-3. **Incomplete isolation** — `ContextVar` only scopes `AppConfig`, not the 8 sub-config globals.
+1. **Dual source of truth** — each sub-config existed both as an `AppConfig` field and a module-level global (e.g. `_memory_config`). Consumers didn't know which to trust.
+2. **Side-effect coupling** — `AppConfig.from_file()` silently mutated 8 sub-module globals via `load_*_from_dict()` calls.
+3. **Incomplete isolation** — `ContextVar` only scoped `AppConfig`, not the 8 sub-config globals.

 ## Design Principle

@@ -18,14 +20,14 @@

 ### 1. Frozen AppConfig (full tree)

-All config models set `frozen=True`. No mutation after construction.
+All config models set `frozen=True`, including `DatabaseConfig` and `RunEventsConfig` (added late in review). No mutation after construction.

 ```python
 class MemoryConfig(BaseModel):
    model_config = ConfigDict(frozen=True)

 class AppConfig(BaseModel):
-    model_config = ConfigDict(frozen=True)
+    model_config = ConfigDict(extra="allow", frozen=True)
    memory: MemoryConfig
    title: TitleConfig
    ...
@@ -35,31 +37,84 @@ Changes use copy-on-write: `config.model_copy(update={...})`.

 ### 2. Pure `from_file()`

-`AppConfig.from_file()` becomes a pure function — returns a frozen object, no side effects. All `load_*_from_dict()` calls removed.
+`AppConfig.from_file()` is a pure function — returns a frozen object, no side effects. All 8 `load_*_from_dict()` calls and their imports were removed.
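The frozen tree, the copy-on-write update, and the pure loader described above can be sketched together. This is a minimal illustration only: it swaps in standard-library frozen dataclasses and `dataclasses.replace` as a stand-in for Pydantic's `frozen=True` and `model_copy(update=...)`, and the names `Settings`, `MemorySettings`, and `from_mapping` are hypothetical, not part of the shipped code.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Any, Mapping


@dataclass(frozen=True)
class MemorySettings:
    """Hypothetical stand-in for a frozen sub-config."""
    enabled: bool = True


@dataclass(frozen=True)
class Settings:
    """Hypothetical stand-in for the frozen root config."""
    memory: MemorySettings

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "Settings":
        # Pure: builds and returns a new object and touches no module-level state.
        return cls(memory=MemorySettings(**raw.get("memory", {})))


base = Settings.from_mapping({"memory": {"enabled": True}})

# Copy-on-write: derive a new tree instead of mutating the existing one.
updated = replace(base, memory=replace(base.memory, enabled=False))

# In-place mutation of a frozen instance is rejected.
try:
    base.memory.enabled = False  # type: ignore[misc]
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True
```

Because `base` is never modified, any caller that still holds it keeps a consistent view while new callers pick up `updated`.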

-### 3. Delete sub-module globals
+### 3. Deleted sub-module globals

-Every sub-config module's global state is deleted:
+Every sub-config module's global state was deleted:

-| Delete | Files |
-|--------|-------|
+| Deleted | Files |
+|---------|-------|
 | `_memory_config`, `get_memory_config()`, `set_memory_config()`, `load_memory_config_from_dict()` | `memory_config.py` |
 | `_title_config`, `get_title_config()`, `set_title_config()`, `load_title_config_from_dict()` | `title_config.py` |
 | Same pattern | `summarization_config.py`, `subagents_config.py`, `guardrails_config.py`, `tool_search_config.py`, `checkpointer_config.py`, `stream_bridge_config.py`, `acp_config.py` |
 | `_extensions_config`, `reload_extensions_config()`, `reset_extensions_config()`, `set_extensions_config()` | `extensions_config.py` |
 | `reload_app_config()`, `reset_app_config()`, `set_app_config()`, mtime detection, `push/pop_current_app_config()` | `app_config.py` |

-Consumers migrate from `get_memory_config()` → `get_app_config().memory`.
+Consumers migrated from `get_memory_config()` → `AppConfig.current().memory` (~100 call-sites).

-### 4. Propagation
+### 4. Lifecycle: 3-tier `AppConfig.current()`

-#### Agent path: `Runtime[DeerFlowContext]`
+The original plan called for a single `ContextVar` with hard-fail on uninitialized access. The shipped lifecycle is a **3-tier fallback** attached to `AppConfig` itself (no separate `context.py` module). The divergence is explained in §7.

-LangGraph's official DI mechanism. Context is injected per-invocation, type-safe.
+```python
+# app_config.py
+class AppConfig(BaseModel):
+    ...
+
+    # Process-global singleton. Atomic pointer swap under the GIL,
+    # so no lock is needed for current read/write patterns.
+    _global: ClassVar[AppConfig | None] = None
+
+    # Per-context override (tests, multi-client scenarios).
+    _override: ClassVar[ContextVar[AppConfig]] = ContextVar("deerflow_app_config_override")
+
+    @classmethod
+    def init(cls, config: AppConfig) -> None:
+        """Set the process-global. Visible to all subsequent async tasks."""
+        cls._global = config
+
+    @classmethod
+    def set_override(cls, config: AppConfig) -> Token[AppConfig]:
+        """Per-context override. Returns Token for reset_override()."""
+        return cls._override.set(config)
+
+    @classmethod
+    def reset_override(cls, token: Token[AppConfig]) -> None:
+        cls._override.reset(token)
+
+    @classmethod
+    def current(cls) -> AppConfig:
+        """Priority: per-context override > process-global > auto-load from file."""
+        try:
+            return cls._override.get()
+        except LookupError:
+            pass
+        if cls._global is not None:
+            return cls._global
+        logger.warning(
+            "AppConfig.current() called before init(); auto-loading from file. "
+            "Call AppConfig.init() at process startup to surface config errors early."
+        )
+        config = cls.from_file()
+        cls._global = config
+        return config
+```
+
+**Why three tiers and not one:**
+
+- **Process-global** is required because `ContextVar` doesn't propagate config updates across async request boundaries. Gateway receives a `PUT /mcp/config` on one request, reloads config, and the next request — in a fresh async context — must see the new value. A plain class variable (`_global`) does this; a `ContextVar` does not.
+- **Per-context override** is retained for test isolation and multi-client scenarios. A test can scope its config without mutating the process singleton. `reset_override()` restores the previous state deterministically via `Token`.
+- **Auto-load fallback** is a backward-compatibility escape hatch with a warning. Call sites that skipped explicit `init()` (legacy or test) still work, but the warning surfaces the miss.
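The first bullet's claim about async boundaries can be reproduced with the standard library alone. The sketch below is an assumption-laden illustration, not Gateway code: `update_request` and `read_request` are hypothetical handlers, and a plain dict stands in for the `_global` class attribute. Each `asyncio` task runs in a copy of the context it was created from, so a `ContextVar.set()` inside one task is not visible to a sibling task, while process-wide state is.

```python
import asyncio
from contextvars import ContextVar

# Hypothetical stand-ins for the two storage tiers.
_override: ContextVar[str] = ContextVar("cfg_override")
_global: dict[str, str] = {"cfg": "v1"}


async def update_request() -> None:
    # Simulates a handler that reloads config inside its own task.
    _override.set("v2")    # lands in this task's copy of the context only
    _global["cfg"] = "v2"  # visible process-wide


async def read_request() -> tuple[str, str]:
    # Simulates the next request, running in a fresh task.
    return _override.get("unset"), _global["cfg"]


async def main() -> tuple[str, str]:
    await asyncio.create_task(update_request())
    return await asyncio.create_task(read_request())


result = asyncio.run(main())
```

Here `result` is `("unset", "v2")`: the second task never sees the `ContextVar` write, but it does see the process-wide update.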
+
+### 5. Per-invocation context: `DeerFlowContext`
+
+Lives in `deerflow/config/deer_flow_context.py` (not `context.py` as originally planned; the more specific name avoids implying a lifecycle module).

 ```python
 @dataclass(frozen=True)
 class DeerFlowContext:
+    """Typed, immutable, per-invocation context injected via LangGraph Runtime."""
    app_config: AppConfig
    thread_id: str
    agent_name: str | None = None
@@ -69,80 +124,99 @@ class DeerFlowContext:

 | Field | Type | Source | Mutability |
 |-------|------|--------|-----------|
-| `app_config` | `AppConfig` | ContextVar (`get_app_config()`) | Immutable per-run |
+| `app_config` | `AppConfig` | `AppConfig.current()` at run start | Immutable per-run |
 | `thread_id` | `str` | Caller-provided | Immutable per-run |
 | `agent_name` | `str \| None` | Caller-provided (bootstrap only) | Immutable per-run |

-**Not in context:** `sandbox_id` is mutable runtime state (lazy-acquired mid-execution). It flows through `ThreadState.sandbox` (state channel), not context. The 3 existing `runtime.context["sandbox_id"] = ...` writes in `sandbox/tools.py` are removed; `SandboxMiddleware.after_agent` reads from `state["sandbox"]` only.
+**Not in context:** `sandbox_id` is mutable runtime state (lazy-acquired mid-execution). It flows through `ThreadState.sandbox` (state channel), not context. All 3 `runtime.context["sandbox_id"] = ...` writes in `sandbox/tools.py` were removed; `SandboxMiddleware.after_agent` reads from `state["sandbox"]` only.

-**Construction per entry point (Gateway is primary):**
+**Construction per entry point:**

 ```python
 # Gateway runtime (worker.py) — primary path
-context = DeerFlowContext(app_config=get_app_config(), thread_id=thread_id)
-agent.astream(input, config=config, context=context)
+deer_flow_context = DeerFlowContext(
+    app_config=AppConfig.current(),
+    thread_id=thread_id,
+)
+agent.astream(input, config=config, context=deer_flow_context)

 # DeerFlowClient (client.py)
-context = DeerFlowContext(app_config=self._app_config, thread_id=thread_id)
+AppConfig.init(AppConfig.from_file(config_path))
+context = DeerFlowContext(app_config=AppConfig.current(), thread_id=thread_id)
 agent.stream(input, config=config, context=context)

-# LangGraph Server — legacy path, context=None, fallback via resolve_context()
+# LangGraph Server — legacy path, context=None or dict, fallback via resolve_context()
 ```

-**Access in middleware/tools:**
+### 6. Access pattern by caller type
+
+The shipped code stratifies callers by the type of `runtime.context` they see; middleware access was tightened over time:
+
+| Caller type | Access pattern | Examples |
+|-------------|---------------|----------|
+| Typed middleware (declares `Runtime[DeerFlowContext]`) | `runtime.context.app_config.xxx` — direct field access, no wrapper | `memory_middleware`, `title_middleware`, `thread_data_middleware`, `uploads_middleware`, `loop_detection_middleware` |
+| Tools that may see legacy dict context | `resolve_context(runtime).xxx` | `sandbox/tools.py` (bash-guard gate, sandbox config), `task_tool.py` (bash subagent gate) |
+| Tools with typed runtime | `runtime.context.xxx` directly | `present_file_tool.py`, `setup_agent_tool.py`, `skill_manage_tool.py` |
+| Non-agent paths (Gateway routers, CLI, factories) | `AppConfig.current().xxx` | `app/gateway/routers/*`, `reset_admin.py`, `models/factory.py` |
+
+**Middleware hardening** (late commit `a934a822`): the original plan had middlewares call `resolve_context(runtime)` everywhere. In practice, once the middleware signature was typed as `Runtime[DeerFlowContext]`, the wrapper became defensive noise. The commit removed:
+- `try/except` wrappers around `resolve_context(...)` in middlewares and sandbox tools
+- Optional `title_config=None` fallback on every `_build_title_prompt` / `_format_for_title_model` helper; they now take `TitleConfig` as a **required parameter**
+- Ad-hoc `get_config()` fallback chains in `memory_middleware`
+
+Dropping the swallowed-exception layer means config-resolution bugs surface as errors instead of silently degrading — aligning with let-it-crash.
+
+`resolve_context()` itself still exists and handles three cases:

 ```python
-from deerflow.config.deer_flow_context import DeerFlowContext, resolve_context
-
-# Middleware
-def before_model(self, state, runtime: Runtime[DeerFlowContext]):
-    ctx = resolve_context(runtime)
-    ctx.app_config.title     # typed
-    ctx.thread_id             # typed
-
-# Tool
-@tool
-def my_tool(runtime: ToolRuntime[DeerFlowContext]) -> str:
-    ctx = resolve_context(runtime)
-    ctx.app_config.memory    # typed
+def resolve_context(runtime: Any) -> DeerFlowContext:
+    ctx = getattr(runtime, "context", None)
+    if isinstance(ctx, DeerFlowContext):
+        return ctx                        # typed path (Gateway, Client)
+    if isinstance(ctx, dict):
+        return DeerFlowContext(           # legacy dict path (with warning if empty thread_id)
+            app_config=AppConfig.current(),
+            thread_id=ctx.get("thread_id", ""),
+            agent_name=ctx.get("agent_name"),
+        )
+    # Final fallback: LangGraph configurable (e.g. LangGraph Server)
+    cfg = get_config().get("configurable", {})
+    return DeerFlowContext(
+        app_config=AppConfig.current(),
+        thread_id=cfg.get("thread_id", ""),
+        agent_name=cfg.get("agent_name"),
+    )
 ```

-`resolve_context()` returns `runtime.context` directly when it's already a `DeerFlowContext` (Gateway/Client paths). For legacy LangGraph Server path (context is None), it falls back to constructing from ContextVar + `configurable`.
+### 7. Divergence from original plan

-Why `Runtime` over `RunnableConfig.configurable`:
- `Runtime` is LangGraph's official DI, not a private dict hack
- Generic type parameter (`Runtime[DeerFlowContext]`) gives type safety
- `RunnableConfig` is for framework internals (tags, callbacks), not user dependencies
+Three material divergences from the original design, all driven by implementation feedback:

-#### Non-agent path: ContextVar
+**7.1 Lifecycle: `ContextVar` → process-global + `ContextVar` override**

-Gateway API routers use `get_app_config()` backed by a single ContextVar. This is appropriate — Gateway doesn't run through the LangGraph execution graph.
+*Original:* single `ContextVar` in a new `context.py` module. `get_app_config()` raises `ConfigNotInitializedError` if unset.

-### 5. No reload
+*Shipped:* process-global `AppConfig._global` (primary) + `ContextVar` override (scoped) + auto-load with warning (fallback).

-Config lifecycle is simple:
+*Why:* a `ContextVar` set by Gateway startup is not visible to subsequent requests that spawn fresh async contexts. `PUT /mcp/config` must update config such that the next incoming request sees the new value in *its* async task — this requires process-wide state. ContextVar is retained for test isolation (`reset_override()` works cleanly per test via `Token`) and for per-client scoping if ever needed.

-```
-Process start → from_file() → set ContextVar → run
-                                                 ↓
-                               Gateway API changed file?
-                                                 ↓
-                               from_file() → new frozen config
-                               → set ContextVar → rebuild agent
-```
+The `ConfigNotInitializedError` was replaced by a warning plus auto-load. The hard error caught more legitimate bugs, but it also broke call sites that historically worked without explicit init (internal scripts, test fixtures that run at import time). The warning preserves the signal without breaking backward compatibility; `backend/tests/conftest.py` now has an autouse fixture that sets `_global` to a minimal `AppConfig` so tests never hit auto-load.

- Edit `config.yaml` → restart process
- Gateway updates MCP/Skills → construct new config + rebuild agent
- No mtime detection, no `reload_*()`, no auto-refresh
+**7.2 Module name: `context.py` → lifecycle on `AppConfig`, `deer_flow_context.py` for the invocation context**

-### 6. Structure vs runtime config
+*Original:* lifecycle and `DeerFlowContext` both in `deerflow/config/context.py`.

-| Type | Example | Reload behavior |
-|------|---------|----------------|
-| Structural (agent composition) | model, tools, middleware chain | Requires agent rebuild |
-| Runtime (execution behavior) | `memory.enabled`, `title.max_words` | Next invocation picks up new config automatically via `Runtime` |
+*Shipped:* lifecycle is classmethods on `AppConfig` itself (`init`, `current`, `set_override`, `reset_override`). `DeerFlowContext` and `resolve_context()` live in `deerflow/config/deer_flow_context.py`.

-Middleware reads config from `Runtime` at execution time (not `__init__` capture), so runtime config changes take effect without agent rebuild.
+*Why:* the lifecycle operates on `AppConfig` directly — putting it on the class removes one level of module coupling. The per-invocation context is conceptually separate (it's agent-execution plumbing, not config lifecycle) so it got its own file with a distinguishing name.
+
+**7.3 Client lifecycle: `init() + set_override()` → `init()` only**
+
+*Original (never finalized):* `DeerFlowClient.__init__` called both `init()` (process-global) and `set_override()` so two clients with different configs wouldn't clobber each other.
+
+*Shipped:* `init()` only.
+
+*Why (commit `a934a822`):* `set_override()` leaked overrides across test boundaries because the `ContextVar` wasn't reset between client instances. Single-client is the common case, and tests use the autouse fixture for isolation. Multi-client scoping can be added back with explicit `set_override()` if the need arises.
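If multi-client scoping is ever reintroduced, one way to avoid the leak described above is to pair every override with its `Token` reset in a context manager. The sketch below is a hypothetical illustration: `MiniConfig` is a stripped-down stand-in for `AppConfig`, and `scoped_override` is an assumed helper that does not exist in the shipped code.

```python
from contextlib import contextmanager
from contextvars import ContextVar, Token
from typing import ClassVar, Iterator, Optional


class MiniConfig:
    """Hypothetical, minimal stand-in for AppConfig's lifecycle methods."""

    _global: ClassVar[Optional["MiniConfig"]] = None
    _override: ClassVar[ContextVar["MiniConfig"]] = ContextVar("mini_config_override")

    def __init__(self, name: str) -> None:
        self.name = name

    @classmethod
    def init(cls, config: "MiniConfig") -> None:
        cls._global = config

    @classmethod
    def current(cls) -> "MiniConfig":
        # Priority: per-context override, then process-global.
        try:
            return cls._override.get()
        except LookupError:
            pass
        if cls._global is None:
            raise RuntimeError("MiniConfig.init() was never called")
        return cls._global


@contextmanager
def scoped_override(config: MiniConfig) -> Iterator[MiniConfig]:
    # The Token restores whatever was set before, even if the body raises.
    token: Token = MiniConfig._override.set(config)
    try:
        yield config
    finally:
        MiniConfig._override.reset(token)


MiniConfig.init(MiniConfig("global"))
with scoped_override(MiniConfig("test")):
    inside = MiniConfig.current().name
outside = MiniConfig.current().name
```

After the `with` block exits, `current()` falls back to the process-global again, so an override cannot outlive the scope that set it.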

 ## What doesn't change

@@ -150,11 +224,12 @@ Middleware reads config from `Runtime` at execution time (not `__init__` capture
 - `extensions_config.json` loading
 - External API behavior (Gateway, DeerFlowClient)

-## Migration scope
+## Migration scope (actual)

- 50+ call sites: `get_*_config()` → `get_app_config().xxx`
- Middleware: `__init__` capture → `Runtime[DeerFlowContext]` read
- Tools: global getters → `ToolRuntime[DeerFlowContext]`
- Tests: `reset_*_config()` → construct frozen config directly
- Gateway update flow: reload → construct new config + rebuild agent
- Dependency: upgrade langgraph >= 1.1.5 for `Runtime` support
+- ~100 call-sites: `get_*_config()` → `AppConfig.current().xxx`
+- 6 runtime-path migrations: middlewares + sandbox tools read from `runtime.context` or `resolve_context()`
+- 3 deleted sandbox_id writes in `sandbox/tools.py`
+- ~100 test locations updated; `conftest.py` autouse fixture added
+- New tests: `test_config_frozen.py`, `test_deer_flow_context.py`, `test_app_config_reload.py`
+- Gateway update flow: `reload_*` → `AppConfig.init(AppConfig.from_file())`
+- Dependency: langgraph `Runtime` / `ToolRuntime` (already available at target version)
--- a/docs/plans/2026-04-12-config-refactor-plan.md
+++ b/docs/plans/2026-04-12-config-refactor-plan.md