* async
* add test
* test(threads): assert aput preserves endpoint-assigned checkpoint id
Confirm the update_thread_state fix is real, not a no-op: all supported
savers (InMemorySaver, AsyncSqliteSaver, AsyncPostgresSaver) persist and
echo checkpoint["id"] verbatim rather than minting their own. Add
assertions that each POST /state response's checkpoint_id round-tripped
into persisted history and kept its uuid6 time-ordering through aput,
and document the verified contract in the router.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(spec): MiniMax integration for generation skills + new music skill
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(plan): MiniMax generation providers implementation plan
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(skills): add importlib loader + FakeResp for skill tests
* test(skills): register loaded module in sys.modules; raise requests.HTTPError in FakeResp
* feat(image-generation): add MiniMax provider with env auto-detect
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(image-generation): guard unknown provider, derive ref MIME, strengthen tests
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(video-generation): add MiniMax provider with async poll/download
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(video-generation): surface base_resp errors while polling; add timeout test
* feat(podcast-generation): add MiniMax t2a_v2 provider with env auto-detect
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(podcast-generation): restore TTS credential guard; add volcengine + voice tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(music-generation): new MiniMax music skill via skill-creator
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(music-generation): treat empty lyrics as absent; test no-audio-data path
* refactor(skills): add request timeouts to MiniMax network calls
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Potential fix for pull request finding 'Explicit returns mixed with implicit (fall through) returns'
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
* fix(models): strip inconsistent user-message names for MiniMax chat
DeerFlow middlewares tag user messages with provenance names (user-input, summary, loop_warning); langchain serializes them into the OpenAI-compatible payload and MiniMax rejects mismatched user-message names with "user name must be consistent (2013)". PatchedChatMiniMax now drops the per-message name from user-role messages. Point the config.example MiniMax models at PatchedChatMiniMax so they also get reasoning_content mapping.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(image-generation): MiniMax sends JSON prompt field, guard 1500-char limit
MiniMax image-01 takes one text string capped at 1500 chars, but the skill was sending the whole structured JSON. The MiniMax provider now extracts the JSON `prompt` field (relying on prompt_optimizer to expand it) and fails fast with a clear error before calling the API when that field exceeds 1500 chars. Authoring stays provider-agnostic; Gemini still receives the full JSON.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(podcast-generation): per-provider TTS concurrency and retry/backoff
Each TTS provider owns its concurrency internally — MiniMax runs single-threaded to reduce rate-limit failures, Volcengine keeps 4 workers — with automatic retry and backoff on transient HTTP and base_resp errors. No caller-facing concurrency knob.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(skills): address Copilot review comments on generation skills
- video: add raise_for_status + timeout to the Gemini download/POST/poll calls so non-2xx responses surface as clear HTTP errors instead of JSON/KeyError or hangs
- video: check the task Fail status before the generic base_resp check so the failure keeps its task_id context
- video/image: create the output file parent directory before writing (matching music-generation) so nested output paths do not raise FileNotFoundError
- music: require a non-empty prompt and fail fast with ValueError instead of sending an empty prompt to the API
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(scripts): reclaim dev ports across worktrees in make stop/dev
All deer-flow worktrees (main checkout + linked worktrees) hardcode the same dev ports (8001/3000/2026), so a service started from any worktree must be reclaimable from another. stop_all now resolves the set of worktree roots (DEERFLOW_ROOTS) and treats a process as deer-flow-owned when its open files live under any of them. It also force-kills survivors on 2026 alongside 8001/3000, fixing `make dev` aborting on the nginx port preflight when a prior nginx lingered on 2026.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(view-image): hide the injected image-context message from the UI
ViewImageMiddleware injects a HumanMessage (text + base64 images) so the vision model can see viewed images, but it was the only internal injector that set neither hide_from_ui nor a hidden name, so it leaked into the chat UI (and IM channels) as a user bubble reading "Here are the images you've viewed:". Mark it with additional_kwargs={"hide_from_ui": True}, matching todo/dynamic_context injections, which the frontend isHiddenFromUIMessage and the channel sender already honor. The model still receives the full content.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(minimax): mark M2.7 models as text-only (no vision)
MiniMax M2.7 / M2.7-highspeed do not support vision; only M3 does. The
provider config asserted vision support for M2.7 in four places.
- config.example.yaml: 4 M2.7 entries -> supports_vision: false
- backend/docs/CONFIGURATION.md: M2.7 + highspeed -> supports_vision: false
- wizard: add LLMProvider.model_vision_overrides + extra_config_for() so
selecting an M2.7 model writes supports_vision: false while M3 (default)
keeps vision; wire it through setup_wizard.py
- tests: M2.7-highspeed fixture -> supports_vision=False; add
test_minimax_vision_is_per_model
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
* fix(replay-e2e): match by conversation, not the living system prompt
The model-replay match key hashed the full input including the lead-agent
system prompt. That prompt is edited frequently (e.g. #3195 added a "File
Editing Workflow" section), so the committed fixture went stale the moment
the prompt changed on main — turning the Layer-2 render gate RED on every
unrelated PR (#3430, #3432, ...). This was a self-inflicted false positive.
Root-cause fix:
- replay_provider._canonical_messages now EXCLUDES the system message from
the hash. The conversation (human/ai/tool) is the stable contract that
identifies a recorded turn; the system prompt is an internal detail not
part of the front-back contract under test. (Mirrors how open-design keys
its mock picker on the user prompt, not the system internals.) Proven
robust: injecting a prompt edit no longer causes a replay miss.
- Layer-1 golden was BLIND to replay misses: the gateway swallows a miss
into an assistant error message, so the shape-only golden stayed green on
a stale fixture. It now inspects replay_provider.replay_misses() and fails
loud. (Layer-2 already fails on a miss.)
- Re-recorded write_read_file.ultra fixture + regenerated golden under the
new conversation-only hash.
- Layer-2 render spec: assert the in-graph auto-title (deterministic); the
follow-up suggestion is fired async and depends on a clean JSON model
output, so assert it only when the fixture captured one — never gate on
its absence (recording flakiness must not block CI).
- docs: REPLAY_E2E.md updated.
Verified: Layer-1 golden green (no miss), Layer-2 both specs green,
CI=true make test 4033 passed / 0 failed, frontend pnpm check clean.
* test(replay-e2e): restore suggestions coverage with a reliable capture
Addresses review feedback (the suggestion path was dropped from Layer-2):
- record spec now waits for the `/suggestions` response before checking
capture stability, so the recorded fixture reliably includes the
frontend-fired suggestions turn (previously the stability window could
return before suggestions fired, yielding a fixture without it).
- Re-recorded write_read_file.ultra: 5 turns (write_file, auto-title,
read_file, answer, suggestions). Golden unchanged — suggestions is a
separate /suggestions call, not part of the /runs/stream SSE sequence.
- Layer-2 spec: restore the hard `EXPECTED_SUGGESTION` assertion. With the
record spec now waiting for /suggestions, a fixture missing the suggestion
turn means a broken recording and must fail loud, not pass silently.
Verified: Layer-1 golden green (no miss), Layer-2 both specs green
(auto-title + suggestion render), frontend pnpm check clean.
* ci: re-trigger (flaky Docker Hub image pull in sandbox e2e, unrelated)
backend-unit-tests failed only in test_sandbox_orphan_reconciliation_e2e.py
with 'docker pull busybox:latest ... context deadline exceeded' — a CI-runner
network flake reaching Docker Hub, not related to this docs/tests-only change.
Empty commit to re-run CI.
---------
Co-authored-by: DanielWalnut <45447813+hetaoBackend@users.noreply.github.com>
Reasoning models such as MiniMax-M3 inline their chain-of-thought into the
message content as <think>...</think> (reasoning_split defaults to false)
instead of a separate reasoning_content field. The follow-up-suggestions
endpoint extracted the JSON array via find('[') / rfind(']'), which silently
broke whenever the reasoning text contained '[' or ']' — or when long thinking
hit max_tokens and truncated before the array was emitted — returning empty
suggestions.
- Add _strip_think_blocks() and apply it before JSON extraction; it removes
complete <think>...</think> blocks (case-insensitive) and drops an unclosed
<think> left by max_tokens truncation.
- Document the MiniMax thinking toggle in config.example.yaml
(when_thinking_enabled: adaptive / when_thinking_disabled: disabled) so
thinking_enabled=False actually disables reasoning on M3; note that M2.x
models always think and rely on the defensive strip above.
- Tests cover complete/unclosed think blocks, brackets-inside-think, think +
code-fence, and an end-to-end suggestions case reproducing the empty-result
bug.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(e2e): record/replay front-back contract verification
Guards the front-back contract with a deterministic, key-free record/replay
harness (mirrors open-design's golden-trace approach):
- ReplayChatModel (tests/replay_provider.py): replays recorded LLM turns by a
normalized hash of the model input. Strips <system-reminder>/date/uuid/tmp-path
so one fixture replays across days and from both the browser and direct-POST
paths; a miss raises loudly (no silent divergence).
- Recording is record-through-browser (scripts/record_gateway.py +
build_fixture_from_jsonl.py + frontend/tests/e2e-record): a real run is driven
through the real frontend so captured inputs match exactly what the browser
sends; fixtures contain no API key.
- Layer 1 — backend golden (tests/test_replay_golden.py): replay through the real
gateway, assert the SSE event sequence == committed golden.
- Layer 2 — full-stack render (frontend/tests/e2e-real-backend): real Next.js +
real gateway (replay model) + Chromium; assert the replayed auto-title and
follow-up suggestions render. DOM assertions are the gate; visual regression is
a local dev gate (CI uploads the render as an artifact).
- CI (.github/workflows/replay-e2e.yml): both layers, triggered on EITHER side of
the contract (frontend/** or backend gateway/harness/fixtures).
* test(e2e): multi-run render-order cross-stack scenario (#3352)
Guards the dangerous front-back class where a backend ordering change
silently breaks a frontend assumption while both sides' unit tests stay
green. Reproduces issue #3352: backend list_by_thread returns runs
newest-first (#2932) and the frontend prepended per-run pages, inverting
chronological order once the checkpoint no longer held the older messages.
- tests/seed_runs_router.py: test-only seeder, mounted on the replay
gateway only when DEERFLOW_ENABLE_TEST_SEED=1 (never in the production
app). Seeds a thread with >=2 runs + per-run message events and no
checkpoint -- the #3352 precondition -- so the frontend per-run reload
path is the sole source of truth and the prepend inversion is observable.
- frontend/tests/e2e-real-backend/multi-run-order.spec.ts: drives the real
frontend against the real gateway, asserts the first run renders above
the second. Reverting the #3354 fix turns it red.
- replay-e2e.yml: trigger on the new replay test-infra paths.
- docs: REPLAY_E2E.md cross-stack scenario section.
* test(e2e): address Copilot review on the replay harness
- Fix stale recorder references (scripts/record_traces.py ->
scripts/record_gateway.py + scripts/build_fixture_from_jsonl.py) in
replay_provider.py, test_replay_golden.py, _replay_fixture.py.
- MODE_CONTEXT['ultra']: thinking_enabled False -> True, mirroring the
frontend's `context.mode !== 'flash'` (hooks.ts). It did not affect the
hashed input (Layer 1 golden still green), but the table now matches the
real frontend context it claims to mirror.
- replay_provider.py docstring: stop claiming memory is recorded-enabled;
the replay config disables memory/summarization for determinism (title
stays, as an in-graph deterministic call).
- record_gateway.py / run_replay_gateway.py: override DEER_FLOW_HOME instead
of setdefault, so an outer value can't leak into the hermetic harness.
- record_gateway.py: clear error when DEERFLOW_RECORD_OUT is unset (was a
bare KeyError).
- playwright.record.config.ts: forward OPENAI_*/DEERFLOW_RECORD_OUT only when
set, so the gateway raises a clear 'missing env' error instead of getting ''.
* test(e2e): address Copilot review round 2
- seed_runs_router.py: constrain SeedMessage.role to Literal['human','ai']
so a bad value is a clean 422 at the boundary instead of a 500
(KeyError on _EVENT_TYPE).
- record-write-read-file.spec.ts: waitForCaptureStable now throws on
timeout instead of returning the last count, so a truncated/partial
recording can't pass silently.
- real-backend-render.spec.ts: guard the suggestions JSON.parse; a
bracket-prefixed non-JSON turn falls back to '' so the existing
not.toBe('') assertion fails clearly instead of a generic parse throw.
* fix(middleware): externalize oversized tool output into sandbox for non-mounted sandboxes
ToolOutputBudgetMiddleware persisted oversized tool results to the host
filesystem and returned a /mnt/user-data/outputs virtual path. For sandboxes
that do not use thread-data mounts (e.g. remote AIO sandbox), that virtual
path does not exist inside the sandbox, so the model's read_file tool could
not read it back and reported 'file not found'.
Branch on SandboxProvider.uses_thread_data_mounts:
- Mounted sandboxes (local Docker, AIO + LocalContainerBackend) keep the
original host-disk path; the host outputs dir is bind-mounted to the same
virtual path inside the sandbox, so behavior is unchanged.
- Non-mounted (remote) sandboxes externalize into the sandbox itself via
execute_command('mkdir -p ...') + write_file + 'test -s' validation. The
validation step is required because AIO sandbox execute_command returns
'Error: ...' as a string on failure instead of raising, so a silent mkdir
failure would otherwise leak through.
Any failure (rejected subdir, mkdir/write/validate error) falls back to the
existing inline head+tail truncation, so an unreadable path is never returned
to the model.
The sandbox resolver reads the sandbox_id that SandboxMiddleware already
writes into runtime.state['sandbox']; it never calls provider.acquire(),
keeping the tool-call hot path free of blocking I/O. Tools that do not use a
sandbox (web_search, MCP, ...) resolve to None and fall through to inline
truncation, which is the safe behavior for them.
Fixes#3416
* fix(middleware): address Copilot review feedback on sandbox externalization
- Make get_sandbox_provider() lookup best-effort in _budget_content: only
query when outputs_path or sandbox is available, and fall back to inline
truncation if provider initialization raises rather than propagating
the error. A resolved sandbox instance is sufficient on its own to take
the non-mounted externalization branch.
- Strict-match the sandbox post-write validation echo
(check.strip() == 'OK') to avoid false positives if execute_command
ever surfaces unrelated stdout/stderr containing 'OK' as a substring.
Refs: #3417
* test: fix flaky tests relying on /nonexistent/... path under container root
Two tests in this module (test_returns_none_on_invalid_path and
test_fallback_when_disk_write_fails) used paths like
'/nonexistent/impossible/path' to trigger _externalize's OSError
fallback. These paths are creatable when the test process runs as root
inside the CI container: os.makedirs(..., exist_ok=True) successfully
creates the entire chain under /, so the OSError branch is never hit
and the tests fail. Reproducible on main independently of this PR.
Switch to '/dev/null/cannot-mkdir-here'. /dev/null is a character
device on both Linux and macOS, so os.makedirs always fails with
NotADirectoryError regardless of privileges, reliably exercising the
OSError fallback.
* fix(tool-output-budget): only consult sandbox provider when a sandbox is resolved
The previous revision called get_sandbox_provider() whenever externalization
was triggered, including on the legacy host-disk path. Environments without
a configured sandbox -- in particular CI runners without a config.yaml --
would raise FileNotFoundError there, get caught, and silently fall back to
inline truncation. That defeated the host-disk externalization path that
predates this PR and was the root cause of the regressing legacy tests.
Restructure the branching so the provider is only consulted when a sandbox
has actually been resolved for the current tool call:
- sandbox resolved + provider.uses_thread_data_mounts: host-disk write
(bind-mounted into the sandbox, equivalent to a sandbox-side write).
- sandbox resolved + non-mounted provider: sandbox write (#3416).
- no sandbox + outputs_path: host-disk write
(legacy / non-sandbox tools, no provider call at all).
- otherwise: inline fallback.
No test changes; the legacy externalization tests are provider-agnostic by
construction and now pass without monkeypatching.
Refs: #3416
* test(tool-output-budget): assert legacy path does not call sandbox provider
Lock in the contract introduced by d6e2d25b: when no sandbox is resolved
for a tool call, _budget_content must externalize to the host outputs
directory without consulting get_sandbox_provider(). Regressing this would
re-break legacy / non-sandbox tools in environments without a configured
sandbox (e.g. CI without config.yaml), which is the failure mode #3416's
fix avoids.
The test injects a get_sandbox_provider that raises on call, so any
future refactor that moves the provider lookup out of the sandbox-only
branch will fail loudly.
Refs: #3416
* fix(middleware): offload memory injection off event loop to prevent tiktoken blocking (#3402)
DynamicContextMiddleware.abefore_agent() called _inject() synchronously
on the asyncio event loop. The first time memory is injected (second
request), _inject() → format_memory_for_injection() → _count_tokens()
→ tiktoken.get_encoding("cl100k_base") needs to download the BPE data
from openaipublic.blob.core.windows.net. In network-restricted
environments this download blocks until the OS TCP timeout (~26 min),
starving ALL concurrent handlers including /api/v1/auth/me.
Fix:
- abefore_agent now uses asyncio.to_thread(self._inject, state) so
file I/O and tiktoken never block the event loop.
- Extract _get_tiktoken_encoding() with a module-level cache so
tiktoken.get_encoding() is called at most once per encoding name.
- Add warm_tiktoken_cache() startup helper; gateway lifespan pre-warms
the cache via asyncio.to_thread so the first request never triggers a
cold download.
- _count_tokens falls back to len(text) // 4 on any encoding failure.
Tests:
- tests/test_tiktoken_cache_and_count_tokens.py (12 tests): cache
hit/miss, fallback paths, warm-up helper.
- tests/blocking_io/test_dynamic_context_middleware.py (2 tests):
Blockbuster gate verifies abefore_agent does not block the event
loop; async/sync parity check.
Fixes#3402
* Apply suggestions from code review
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix the lint error
* fix(memory): use future annotations to avoid NameError when tiktoken is absent
Add `from __future__ import annotations` to prompt.py so that
tiktoken.Encoding type hints are never evaluated at runtime. Without
this, environments where tiktoken is not installed could raise NameError
on the module-level cache and function return annotations.
Addresses Copilot review comment on PR #3411.
* fix(middleware): bound abefore_agent injection with timeout to prevent hung requests
Wrap the asyncio.to_thread(self._inject) offload in asyncio.wait_for()
with a 5-second cap. If the startup warm-up failed silently (e.g.
network blip during deploy), a cold tiktoken BPE download on the first
request can block until the OS TCP timeout (~26 min). The bounded
timeout ensures the request degrades gracefully (no memory/date context
for that turn) rather than hanging.
Adds test_abefore_agent_returns_none_on_timeout to the blocking-IO
regression anchors.
Addresses review feedback from xg-gh-25 on PR #3411.
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix(subagent): structured subagent_status field over text parsing
Closes#3146.
## Why
The frontend used to derive subtask card state by string-matching the
leading text of the `task` tool's result. That contract surface was
fragile — `#3107` BUG-007 and the `#3131` review both surfaced cases
where new backend wording (`Task cancelled by user.`,
`Task polling timed out after N minutes`, `ToolErrorHandlingMiddleware`
exception wrappers) silently broke the card lifecycle. The frontend
fallback kept growing more prefixes; any future rewording would break
it again.
## Design
1. **Backend → frontend contract**: `ToolMessage.additional_kwargs`
carries `subagent_status` (one of `completed | failed | cancelled |
timed_out | polling_timed_out`) and an optional `subagent_error`
blob. The frontend prefers it over parsing `content`.
2. **Centralised stamping, not 8 sprinkled stamps**: rather than have
each of `task_tool.py`'s 5 normal-return + 3 pre-execution `Error:`
paths remember to set `additional_kwargs`, `ToolErrorHandlingMiddleware`
stamps the field after every task-tool call. Adding a new return
path in `task_tool.py` cannot now skip the stamp.
3. **Cross-language contract fixture**: the prefix→status mapping is
the one piece both sides must agree on. The shared fixture at
`contracts/subagent_status_contract.json` lists every backend return
string, the expected status, and what the error substring should
contain. Backend test (`backend/tests/test_subagent_status_contract.py`)
and frontend test (`frontend/tests/unit/core/tasks/subtask-result.test.ts`)
both load that fixture and assert the same cases. A wording drift on
either side fails the matching language's test.
4. **Round-trip serialisation pinned**: the round-trip test asserts
`ToolMessage.model_dump_json()` → `model_validate_json()` preserves
`additional_kwargs.subagent_status`. Catches the case where a future
LangChain or Pydantic upgrade silently strips unknown kwargs.
5. **Frontend status collapse documented**: the backend has five status
values, the frontend card has three (`completed | failed |
in_progress`). `cancelled` / `timed_out` / `polling_timed_out` all
collapse to `failed` with the original status preserved in `error`.
`parseSubtaskResult` returns `in_progress` for unknown values so a
backend that ships a new enum variant before the frontend upgrades
degrades to the legacy prefix fallback instead of getting pinned.
## Changes
Backend:
- `deerflow.subagents.status_contract` — new module exporting
`SUBAGENT_STATUS_KEY`, `SUBAGENT_ERROR_KEY`,
`SUBAGENT_STATUS_VALUES`, `extract_subagent_status(content)`, and
`make_subagent_additional_kwargs(status, error)`.
- `ToolErrorHandlingMiddleware`: new `_stamp_task_subagent_status`
helper centralises the stamp; `wrap_tool_call` / `awrap_tool_call`
stamp on the success path; `_build_error_message` stamps on the
wrapper path (carrying `ExcClass: detail` into `subagent_error`).
Non-task tools are untouched.
- New tests: `test_subagent_status_contract.py` (19 cases from the
shared fixture + status-enum / blank-error / unknown-status
rejection) and `test_tool_error_handling_subagent_stamp.py`
(middleware integration: terminal-content stamps, non-terminal
doesn't, non-task tools untouched, async path mirrors sync,
existing additional_kwargs survive, JSON round-trip preserved).
Frontend:
- `parseSubtaskResult(text, additionalKwargs?)` — prefers the
structured stamp; falls back to the legacy prefix matcher for
historical threads / unknown future status values.
- `STRUCTURED_STATUS_TO_SUBTASK` documents the five→three collapse.
- `message-list.tsx` passes `message.additional_kwargs` through.
- `subtask-result.test.ts` adds a structured-status block + a
fixture-driven contract block; legacy prefix tests stay green for
the fallback path.
Contract:
- `contracts/subagent_status_contract.json` — single source of truth
both languages load. Whitespace variants, varied N for polling
timeouts, the 3 pre-execution `Error:` returns task_tool produces,
and the middleware wrapper shape are all in there.
## Test plan
- `make lint` clean (backend + frontend).
- `pytest tests/test_subagent_status_contract.py
tests/test_tool_error_handling_subagent_stamp.py` → 37 passed.
- `pnpm test --run` → 103 passed (was 76, +27 new).
## Migration / fallback retirement
The text-prefix fallback stays in place until backend telemetry shows
the frontend never hits it for newly produced messages. At that point
a follow-up PR can drop the prefix branches and keep only the
structured-status branch.
Refs: bytedance/deer-flow#3138 (split summary), #3107 (origin), #3131
(prior prefix-only fix), #3146 (this issue).
* fix(subtask): back-fill result/error from text when structured status present
Three follow-ups on the PR #3154 review:
1. `readStructuredStatus` no longer short-circuits the prefix parse.
The backend currently stamps only the `subagent_status` enum value;
the human-facing `result` body and wrapped-error message still live
in `ToolMessage.content`. Dropping the text parse meant successful
tasks rendered empty completed pills and wrapped failures lost their
diagnostic. Now both shapes get composed: structured status wins,
`result`/`error` come from text when both sides agree, and a lying
success body under a `failed` stamp is dropped instead of leaking.
2. Replace the ESM-incompatible `__dirname` fixture lookup in
subtask-result.test.ts with `fileURLToPath(new URL(..., import.meta.url))`.
The frontend package is `"type": "module"`, so the previous path
would have thrown at runtime if anything ever changed under the
contract directory.
3. Drop the `$schema` reference from contracts/subagent_status_contract.json
pointing at a file that doesn't exist in the tree.
Three new tests cover the structured + text composition: completed
back-fills the success body, failed back-fills the wrapper text, and
unrecognised content under a `failed` stamp stays empty rather than
echoing noise.
* fix(mcp): close stdio sessions on their owning loop to avoid cross-task cancel-scope error (#3379)
Adopt an owner-task lifecycle for pooled MCP ClientSessions so each
session is entered, initialized, and exited within a single asyncio task
on its owning event loop. This eliminates the anyio "Attempted to exit
cancel scope in a different task than it was entered in" RuntimeError
that surfaced when stdio MCP tools were used via the sync tool wrapper
(which spins up and tears down event loops across tasks).
Also harden the pool lifecycle:
- track in-flight session creation per (server, scope) to dedupe
concurrent get_session() calls for the same key
- make close_scope/close_server/close_all/close_all_sync cover both
established entries and in-flight creations so sessions cannot be
resurrected or leaked after close
- handle cross-loop preemption of an in-flight creation by cancelling
the stale owner task instead of only signalling it
- define close_all_sync() semantics for a running loop on the current
thread (signal-only, async completion) and route reset_mcp_tools_cache
through a deterministic async close in that case
* fix(mcp): avoid reset deadlock on running loop cache reset
* fix(mcp): address session pool review feedback
* fix(config): make the reload boundary discoverable from code, not just docs
Closes#3144.
The hot-reload contract — per-run fields are resolved through
`get_app_config()` on every request, infrastructure fields snapshot at
gateway startup — landed in `backend/CLAUDE.md` as part of #3131. A
maintainer reading `get_config()` or an `AppConfig` field still had to
context-switch to that document to know which fields require a process
restart, and there was no enforcement that the prose list stayed in
sync with the code.
This commit moves the boundary to a machine-readable single source of
truth and surfaces it where the code lives:
- New `deerflow.config.reload_boundary` module owns the registry of
restart-required fields (`STARTUP_ONLY_FIELDS`) and a tiny helper
API (`is_startup_only_field`, `iter_startup_only_field_paths`,
`format_field_description`). The standardised `"startup-only:"`
prefix is exported as `STARTUP_ONLY_PREFIX` so future scanners /
lint hooks / doc generators can pivot off it without re-parsing
prose.
- `AppConfig`'s `database`, `checkpointer`, `run_events`,
`stream_bridge`, `sandbox`, and `log_level` fields now build their
`Field(description=...)` from `format_field_description(...)`. The
same text shows up in IDE hover (Pydantic v2 exposes `description`
via `model_fields[...]`).
- `channels` is restart-required too but lives outside the AppConfig
Pydantic schema (the config section is consumed directly by
`start_channel_service`). The registry owns it so the boundary is
not split between two places.
- `get_config()` docstring points to the registry instead of leaving
the reader to find `CLAUDE.md`. The `CLAUDE.md` table collapses to
a one-liner pointing back at `reload_boundary.py` so the boundary
has one canonical location, not two.
Drift coverage in `tests/test_reload_boundary.py`:
- Every registered field has a non-trivial reason.
- Iterator / membership helpers stay in sync with the dict.
- Every registry entry that maps to an `AppConfig` field also carries
the `"startup-only:"` prefix in the schema (catches "forgot to
update the schema").
- Reverse drift: any AppConfig field whose description starts with
the prefix must be registered (catches "marked restart-required in
the schema but forgot the registry").
- The runtime introspection that IDE hover depends on
(`AppConfig.model_fields["database"].description`) is pinned, so a
future Pydantic upgrade or schema swap that breaks the hover surface
shows up as a test failure rather than a silent regression.
Refs: bytedance/deer-flow#3138 (split summary), #3107 (origin), #3131
(prior boundary fix in prose form).
* fix(config): preserve field doc and correct log_level reload reason
Two follow-ups on the PR #3153 review:
1. The `log_level` STARTUP_ONLY_FIELDS reason previously claimed
`apply_logging_level()` mutates the root logger level. It does not:
only the `deerflow` / `app` logger levels are set, and root handler
thresholds are conditionally lowered so messages from those loggers
can propagate. Reword to match the actual behavior so operators
reading IDE hover get accurate restart guidance.
2. `format_field_description(field_path)` was the sole `Field(description=)`
for every restart-required field, which silently overwrote the
original human-facing documentation — most visibly the `log_level`
field that used to list debug/info/warning/error and clarify that
third-party libraries are not affected. Extend the helper with a
keyword-only `field_doc` parameter that composes the startup-only
marker with the original prose so IDE hover documents both *why*
the field is restart-required and *what* it actually accepts.
Updated all six restart-required AppConfig fields (`log_level`,
`database`, `sandbox`, `run_events`, `checkpointer`, `stream_bridge`)
to pass their original descriptions through the helper.
Tests: two new cases in `test_reload_boundary.py` pin (a) the helper
composition and (b) every AppConfig restart-required field still
surfaces a recognisable substring of its original documentation.
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix(summarization): tag summary LLM calls nostream to stop phantom stream messages (#2503)
The SummarizationMiddleware runs its summary LLM call inside a before_model
hook. Without a nostream tag the summary tokens were captured by LangGraph's
messages-tuple stream callback and broadcast to the frontend as a phantom AI
message.
Generate a dedicated summary model copy tagged with "nostream" (merged on top
of any existing tags such as "middleware:summarize" so RunJournal attribution
is preserved) and override _create_summary / _acreate_summary to invoke it
directly. This avoids temporarily swapping the shared self.model, which would
otherwise leak the RunnableBinding across concurrent runs and break parent
logic that inspects the raw model (profile / _get_ls_params).
Add regression tests covering nostream tagging, concurrent-run isolation, raw
model preservation, and existing-tag merge.
* fix(summarization): address nostream review feedback
* fix(#3189): prevent write_file streaming timeout on long reports
Adds a layered defense against StreamChunkTimeoutError caused by oversized
single-shot write_file tool calls:
- factory: default stream_chunk_timeout to 240s for OpenAI-compatible
clients (overridable via ModelConfig.stream_chunk_timeout in config.yaml)
- sandbox/tools: server-side 80 KB length guard on non-append write_file
calls (configurable via DEERFLOW_WRITE_FILE_MAX_BYTES env var, 0 disables);
rejects oversized payloads with a structured error pointing the model at
str_replace or append=True
- middleware: classify StreamChunkTimeoutError as transient but cap retries
at 1 via per-exception _RETRY_BUDGET_OVERRIDES (same-payload retry on a
chunk-gap timeout buffers the same way upstream; full 3-attempt loop
would stack 6-12 min of dead air)
- middleware: surface an actionable user-facing message for stream-drop
exceptions instead of leaking the raw langchain stack
- prompts: add a routing-style File Editing Workflow hint to both lead_agent
and general_purpose subagent prompts, pointing the model at str_replace
for incremental edits (mirrors Claude Code's Edit / Codex's apply_patch)
- tests: behavioural coverage for size guard, retry budget override,
stream-drop user message, factory default injection
Refs #3189
* fix(#3189): drop stream_chunk_timeout for non-OpenAI providers
Address CR feedback on PR #3195:
- factory: pop `stream_chunk_timeout` from kwargs for any model_use_path other than `langchain_openai:ChatOpenAI` instead of returning early. `ModelConfig.stream_chunk_timeout` is part of the shared schema, so a user-supplied value on a non-OpenAI provider would otherwise be forwarded to its constructor and raise `TypeError: unexpected keyword argument`.
- factory: rewrite docstring to describe the actual `exclude_none=True` behaviour (explicit null is excluded and falls back to the default) instead of the misleading "None falling out via exclude_none=True keeps its value".
- tests: add regression coverage asserting the kwarg is stripped before reaching a non-OpenAI provider's constructor.
Refs: bytedance#3189
* fix(#3189): restrict stream-drop user copy to StreamChunkTimeoutError only
Per CR on #3195: narrow _STREAM_DROP_EXCEPTIONS to StreamChunkTimeoutError. Generic httpx RemoteProtocolError / ReadError fall back to the standard 'temporarily unavailable' copy, since they routinely fire on transient network blips where the 'split the output' guidance is misleading. Retry/backoff classification is unchanged — both remain transient/retriable. Tests updated to reflect new copy, plus a symmetric regression test for ReadError.
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix(gateway): drain in-flight runs before closing checkpointer on shutdown
Chat runs execute in fire-and-forget background asyncio tasks that write
checkpoints through a shared checkpointer. On shutdown, langgraph_runtime's
AsyncExitStack tore down the checkpointer's postgres connection pool while
those run tasks were still mid-graph. langgraph's
AsyncPregelLoop._checkpointer_put_after_previous then ran its
`finally: await checkpointer.aput(...)` against the closed pool, raising
psycopg_pool.PoolClosed. Because that put runs in a langgraph-internal task
(not on run_agent's call stack), run_agent's try/except cannot catch it and it
surfaces as "unhandled exception during asyncio.run() shutdown".
Add RunManager.shutdown() to cancel and bounded-await all in-flight runs, and
call it from langgraph_runtime BEFORE the AsyncExitStack closes the
checkpointer, so the final checkpoint write lands while the pool is still open.
The drain is bounded by a timeout so a stuck run cannot hang worker shutdown,
and is shielded so a second shutdown signal cannot abandon it mid-drain and
reopen the race.
Closes#3373
* fix(gateway): address review — preserve completed-run status, bound drain persistence
Addresses Copilot review on #3381:
- RunManager.shutdown(): decide run status AFTER the drain. Under the lock it
now only requests cancellation; after asyncio.wait it marks/persists
`interrupted` only for runs still pending or ended cancelled. A run that
completes (e.g. `success`) during the drain window keeps its real terminal
status instead of being unconditionally overwritten.
- Bound the trailing status persistence within the timeout budget
(deadline = loop.time()+timeout; gather wrapped in asyncio.wait_for) so a slow
store backing off under DB pressure cannot push shutdown past the deadline.
- deps: use asyncio.create_task instead of asyncio.ensure_future.
- tests: wait deterministically for the run to be in-flight (poll the first
checkpoint) instead of a fixed sleep; init shutdown_calls explicitly in the
recovery test double; add regression test asserting a run completing during
the drain keeps its status (in memory and in the store).
* fix(gateway): address maintainer review — surface failed drain persists, clarify timeout constant
Addresses @WillemJiang review on #3381:
- shutdown(): inspect the gather result of the trailing interrupted-status
persistence. _persist_status is best-effort (it catches + logs its own
failure with exc_info and returns False, so it never raises out of the
gather), but the aggregate result was never checked — a partial failure had
no shutdown-level visibility. Now any escaped Exception is logged, and any
False (a persist that did not confirm) is logged with the run_id. Added
regression test test_shutdown_surfaces_failed_interrupted_persist.
- deps: clarify the _RUN_DRAIN_TIMEOUT_SECONDS comment — state the actual value
of _SHUTDOWN_HOOK_TIMEOUT_SECONDS (5.0s) and that both count toward the
lifespan shutdown window. Kept as two separate constants (independent teardown
steps that may diverge) rather than one shared "must match" value.
- Verified no other test fake needs the shutdown stub: _FakeRunManager in
test_worker_langfuse_metadata.py is a run_agent() argument (worker path),
never injected into langgraph_runtime, so it never receives shutdown().
Follow-up to #3342 (deferred MCP tool loading). Maintainability cleanup plus
hardening of malformed/empty tool_search queries; no change to the deferral
mechanism or search ranking.
- Add deerflow/tools/mcp_metadata.py as the single source of truth for the
"deerflow_mcp" tag (MCP_TOOL_METADATA_KEY + tag_mcp_tool + public
is_mcp_tool). Removes the duplicated magic string and the private,
cross-module _is_mcp_tool import.
- tool_search.search: never raise on model-generated input. Extract
_compile_catalog_regex (shared compile-with-literal-fallback); return empty
for empty/whitespace queries and a bare "+" instead of matching everything
or raising IndexError.
- DeferredToolSetup: document the empty-vs-populated invariant.
- build_deferred_tool_setup: comment the two distinct empty-return branches.
- _assemble_deferred: add return type, rename local to deferred_setup, build
the final list with an explicit append.
- Tests: use tag_mcp_tool instead of per-file tag helpers; cover empty and
bare-"+" queries.
* chore: remove stale langgraph server runtime remnants
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix(skills): surface offending line and quoting hint on SKILL.md YAML errors
When a SKILL.md front-matter fails to parse, the existing log only
echoes PyYAML's raw message, leaving authors to grep the file for the
offending line. This is especially painful for the very common
LLM-authored mistake of an unquoted scalar containing ': '
(e.g. 'description: foo: bar'), which fails with
'mapping values are not allowed here' and silently drops the skill.
Enrich the error log with:
- the source line PyYAML pointed at via problem_mark
- a targeted, copy-pasteable quoting hint when (and only when) the
error is the well-known 'mapping values are not allowed' scanner
error on an unquoted value
The skill is still rejected (no semantics are guessed or rewritten);
only the diagnostic is improved.
Fixes#3333
* improve(skills): address CR feedback on SKILL.md YAML error diagnostics
Per review on #3335:
- Log the file line number (mark.line + 2) instead of the
front-matter-internal line number, so authors land on the right
row in their editor.
- Use exc.problem == "mapping values are not allowed here" for a
tighter match than substring-scanning str(exc).
- Preserve the offending key's leading whitespace in the quoting
hint so nested mappings stay nested when authors paste the fix
back.
- Rewrite the regression test to actually exercise the new
behaviour: PyYAML's own message already echoes the offending
line (and truncates it with "..."), so the old assertion
passed on main. New assertions pin (a) the file-line number,
(b) the full untruncated line, and (c) the copy-pasteable hint.
- Add a guard test for nested-key indentation so the
partition()/strip() shape cannot regress silently.
Refs #3333, #3335
* fix(skills): escape backslashes in YAML quoting hint
The hint emitted by _format_yaml_error previously escaped only double
quotes, so values containing backslashes (e.g. Windows paths like
C:\Temp or regex escapes like \d) produced a suggested scalar that
was either invalid YAML or silently re-interpreted by PyYAML's
double-quoted escape rules when pasted back. Escape order matters:
backslashes first, then double quotes.
Adds two regression tests covering Windows-path and regex-style
backslashes.
Address Copilot CR feedback on PR #3335.
The official MCP configuration schema uses `transport` to specify the
transport mechanism (stdio/sse/http), but `McpServerConfig` only honored
`type` and defaulted to `stdio`. Remote MCP servers configured with just
`transport: sse` were therefore misidentified as stdio and failed with
"with stdio transport requires 'command' field".
Add a model validator that promotes `transport` to `type` when only
`transport` is provided, while keeping `type` authoritative when both
are set. This matches the MCP-spec field name without breaking existing
configurations.
Fixes#3238
- Add MiniMax-M3 to model list and set as default
- Keep MiniMax-M2.7 and MiniMax-M2.7-highspeed
- Remove older models (M2.5)
- Update related tests
Co-authored-by: octo-patch <octo-patch@github.com>
* fix(sandbox): close AioSandbox HTTP client during provider teardown (#2872)
AioSandbox allocates a host-side agent_sandbox client (wrapping an
httpx.Client) in __init__, but AioSandboxProvider.release/destroy/shutdown
only popped provider state and tore down the backend container — the
client/transport owned by each cached AioSandbox was never explicitly
closed, accumulating unreclaimed sockets in long-running services.
- Add AioSandbox.close(): best-effort, idempotent close of the wrapped
httpx_client (falls back to top-level client.close()); errors are
logged but never raised so backend cleanup is never blocked.
- AioSandboxProvider.release()/destroy() now close the cached AioSandbox
before dropping it; shutdown() inherits this via destroy().
* fix(sandbox): close the real httpx.Client owned by AioSandbox (#2872)
The previous close() only walked one level (wrapper.httpx_client), which resolves to the Fern-generated HttpClient wrapper that has no close(). The real socket-owning httpx.Client lives one level deeper at _client_wrapper.httpx_client.httpx_client, so the close path never fired and host-side sockets still leaked.
Resolve the real httpx.Client with graceful degradation; clear self._client under the lock for use-after-close and concurrent double-close safety; mark provider release()/destroy() try/except as defense-in-depth; rewrite TestClose against the real nested structure to lock down the original no-op bug.
* feat(tool-search): add hash-scoped promoted state to ThreadState
* feat(tool-search): add immutable DeferredToolCatalog with stable hash
* feat(tool-search): add build_deferred_tool_setup + Command-writing tool_search
* refactor(tool-search): replace deferred-tool ContextVar with closures + graph state (#3272)
Build the deferred catalog + tool_search tool per agent from the policy-filtered
tool list (after skill allowed-tools), pass deferred_names + catalog_hash
explicitly to DeferredToolFilterMiddleware and the prompt, and record promotions
in ThreadState.promoted (scoped by catalog_hash) via a Command-returning
tool_search. Removes DeferredToolRegistry and the _registry_var ContextVar so
deferral no longer depends on build/execute sharing an async context. MCP tools
are tagged with metadata[deerflow_mcp]; client.py assembles deferral the same way.
Catalog is built AFTER tool-policy filtering (no policy-excluded tool can leak via
tool_search) and assembly is fail-closed. Migrate tests off the deleted registry
APIs; delete the obsolete ContextVar-based #2884 regression (re-covered by
state-based tests in a follow-up).
* test(tool-search): lock tool_search promotion into next model turn via graph state
* test(tool-search): cross-context, policy-leak, fail-closed, #2884 isolation regressions
* test(tool-search): align real-LLM e2e with closure-based deferred setup
* docs: update DeferredToolFilterMiddleware description for closure+state design
* style(tests): drop unused import in test_deferred_setup (ruff)
* test(tool-search): harden merge_promoted + replace tautological catalog test
From independent code review:
- merge_promoted: use existing.get("catalog_hash") so a forward-incompatible
or externally-injected persisted promoted dict triggers a replace instead of
a KeyError crash; add regression test for the malformed-existing case.
- test_deferred_catalog: replace the `== [] or True` tautology (a test that
could never fail) with a deterministic invalid-regex->literal-fallback check
(positive match on calc + negative empty match).
- DeferredToolCatalog: comment why frozen-without-slots is required for the
cached_property hash/names fields (adding slots=True would break them).
* fix(tool-search): read tool_search.enabled from self._app_config in client
DeerFlowClient._ensure_agent called get_app_config() directly to read
tool_search.enabled, but the client already resolves and stores its config as
self._app_config at construction (and uses it everywhere else). The bare call
re-resolves config from disk at agent-build time, which raises FileNotFoundError
in environments without a config.yaml (CI) — test_client.py's fixture only
patches get_app_config during __init__, so the later call hit the real loader.
Use self._app_config, matching the rest of the client.
* test(tool-search): lock tool_search post-policy append ordering
tool_search is appended after skill-allowlist filtering, so the allowlist
can no longer deny it by name. Lock the intended contract: it only appears
when allowed MCP tools survive the filter, and its catalog (derived from the
already policy-filtered list) can never expose a denied tool. Addresses the
ordering observation from the Copilot review on #3342.
* fix(checkpointer): use AsyncConnectionPool for postgres to prevent stale connection errors (#3223)
Replace AsyncPostgresSaver.from_conn_string() with an explicit
AsyncConnectionPool that has check_connection enabled, so dead idle
connections are detected and replaced on checkout instead of raising
OperationalError.
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Fixed the unit test error and lint error
* fix(checkpointer): add TCP keepalive to postgres connection pool (#3254)
Enable TCP keepalive probes on the AsyncConnectionPool to prevent
idle postgres connections from being dropped by the server or network
middleware. Combined with the existing check_connection callback, this
provides defense-in-depth against stale connection errors.
Fixes#3254
* Changed the code as review suggestion
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
UploadsMiddleware defines only the sync `before_agent` hook. LangChain wires a
sync-only hook as `RunnableCallable(before_agent, None)`, and LangGraph's
`ainvoke` runs it directly on the event loop when `afunc is None` — so the
per-message uploads-directory scan (`exists`/`iterdir`/`stat` plus reading
sibling `.md` outlines) blocks the asyncio event loop on every message that has
an uploads directory.
Add `abefore_agent` that offloads the scan to a worker thread via
`run_in_executor`; it copies the current context, preserving the `user_id`
contextvar read by `get_effective_user_id()`.
Add a runtime anchor under `tests/blocking_io/` that drives the real
`create_agent` graph via `ainvoke` under the strict Blockbuster gate, so a
regression back onto the event loop fails CI. Update blocking-IO docs.
* Share assistant payload replay matching
* fix(provider): recover assistant field when ordinal AI index is taken
The mismatch-length fallback in `_match_ai_message` only tried the exact
`fallback_ordinal` AI index. When serialization drops or reorders an
assistant message, a unique signature match can consume a non-ordinal
index, leaving a later ambiguous payload's ordinal already used — so its
provider field (e.g. `reasoning_content`) was silently dropped.
Scan forward from the ordinal for the next unused `AIMessage` (wrapping to
earlier indices) to preserve the positional bias while still recovering
the field. Forward scanning avoids a naive min-unused pick that could
restore the wrong field after a leading message is dropped.
Add a regression test for the dropped-leading-message case.
* fix(provider): avoid earlier assistant fallback replay
* test(runtime): add Blockbuster runtime anchor for JsonlRunEventStore async IO
#3084 offloaded `JsonlRunEventStore`'s file IO via `asyncio.to_thread` and added
a mock-based offload assertion (`tests/test_jsonl_event_store_async_io.py`) that
covers `put()` only. That guard is not part of the Blockbuster runtime gate
(`tests/blocking_io/`) run by `backend-blocking-io-tests.yml`.
Add a runtime anchor that drives the full async surface (`put`, `put_batch`,
`list_messages`, `list_events`, `list_messages_by_run`, `count_messages`,
`delete_by_run`, `delete_by_thread`) under the strict Blockbuster gate, so any
blocking IO reintroduced on the event loop in any of these methods fails CI —
not only removal of a specific `to_thread` call. Verified each offloaded method
goes red when its offload is reverted. Test-only; no production change.
* test(runtime): exercise list_events event_types filter branch
Per review feedback: the anchor called list_events without event_types,
so the filter branch never ran after _read_run_events' filesystem IO.
Add a second list_events call with event_types=["message"] so the full
read path -- including the filter branch -- executes under the gate.
* feat(agent): add ToolOutputBudgetMiddleware for oversized tool output protection
Closes#3289. Adds a unified middleware that enforces per-result budgets
on ALL tool outputs (MCP, sandbox, community, custom), preventing
oversized external tool results from blowing the model context window.
Design informed by claude-code (persistToolResult), hermes-agent
(tool_result_storage), and pi (OutputAccumulator) — the three most
mature implementations in production coding-agent frameworks.
Key features:
- Disk externalization: oversized outputs written to thread-local
.tool-results/ directory, replaced with compact preview + file
reference. Model can read full output via read_file with offset/limit.
- Fallback truncation: head+tail truncation when disk is unavailable
(no thread_data, write failure), ensuring the context is always
protected.
- read_file exemption: prevents persist-read-persist infinite loops
(independently discovered by claude-code, hermes-agent, and pi).
- Per-tool threshold overrides via config.
- Line-boundary-aware truncation (no partial lines in previews).
- Multimodal content passthrough (images/structured blocks skip budget).
- Historical ToolMessage patching in wrap_model_call for checkpoint
recovery scenarios.
Related: #3222 (design RFC), #1844 (comprehensive context management),
#3137 (write_file args compaction), #1677 (sandbox tool truncation).
* test: add MCP content_and_artifact format coverage
Add 5 tests for MCP tool output format (list of content blocks):
- text content blocks are extracted and budgeted
- multiple text blocks are joined and budgeted
- image content blocks are skipped (multimodal passthrough)
- mixed text+image blocks are skipped
- small text blocks pass through unchanged
Total test count: 59 (was 54).
* fix(agent): address Codex review findings for ToolOutputBudgetMiddleware
Three issues identified by Codex code review, all fixed:
1. `enabled` config field was unused — middleware now checks
`config.enabled` and skips all processing when disabled.
2. `_build_fallback` could exceed `fallback_max_chars` — the marker
text itself (~139 chars) was not deducted from the budget. Now
pre-computes marker overhead and falls back to hard slice when
max_chars is smaller than the marker.
3. Sync file I/O in async path — `awrap_tool_call` now delegates
`_patch_result` to `asyncio.to_thread` to avoid blocking the
event loop during disk writes.
Tests updated to use realistic fallback_max_chars values (500+)
that can accommodate the marker overhead, plus two new tests:
- `test_result_never_exceeds_max_chars` (parametric across sizes)
- `test_very_small_max_chars_does_not_crash`
* fix(agent): address Copilot review — path traversal, async perf, shared config
1. Path traversal defense: sanitize tool_name via _sanitize_tool_name()
(strips separators, .., absolute paths), validate storage_subdir is
relative, and verify resolved filepath stays inside storage_dir.
2. Async hot-path optimization: add _needs_budget() cheap check before
asyncio.to_thread offload — small outputs (99% of calls) skip the
thread overhead entirely.
3. Replace shared module-level _DEFAULT_CONFIG with _default_config()
factory to prevent cross-instance mutation of mutable fields.
12 new tests: TestSanitizeToolName (5), TestExternalizePathTraversal (3),
TestNeedsBudget (4).
* fix(agent): correct preview hint to match read_file actual API
read_file uses start_line/end_line (1-indexed line numbers), not
offset/limit. The previous wording was copied from hermes-agent
which has a different read_file interface.
* perf(agent): hoist hot-path imports, add model-call pre-scan (review #3303)
Address maintainer review feedback:
1. Hoist inline imports to module level — `import asyncio` (was in
awrap_tool_call hot path) and `from dataclasses import replace`
(was in _patch_result) now live at module top.
2. Add a cheap pre-scan to _patch_model_messages so the historical
message list is not rebuilt on every model call when nothing is
oversized (the common case once results are budgeted at tool-call
time). Also adds the same _needs_budget gate to the sync
wrap_tool_call for symmetry with awrap_tool_call.
The pre-scan is refactored into per-tool-aware helpers
(_effective_trigger / _tool_message_over_budget) that mirror the exact
trigger conditions in _budget_content — including tool_overrides — so
the fast-path can never produce a false negative (silently skipping
budgeting for a tool with a low per-tool threshold).
7 new regression tests lock the per-tool-override-through-pre-scan path
and the model-call early return.
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix(runtime): harden JSONL async I/O and DB put_batch thread validation (#2816)
- JsonlRunEventStore: offload all file I/O to asyncio.to_thread() so the
event loop is never blocked; add per-thread asyncio.Lock to serialise
concurrent puts and prevent interleaved JSONL lines
- Split _ensure_seq_loaded into a sync _compute_max_seq (runs in thread)
and an async wrapper; seq counter is recovered from disk on fresh store init
- DbRunEventStore.put_batch: raise ValueError when events span multiple
thread_ids (previously silently assumed same thread)
- Add test_jsonl_event_store_async_io.py: 12 tests covering lock reuse,
concurrent seq monotonicity, disk recovery, and mixed-thread batch rejection
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: address Copilot review comments
- delete_by_thread: pop _write_locks after releasing the lock to prevent
unbounded growth when threads are repeatedly created and deleted
- tests: add regression guard asserting asyncio.to_thread is called for
_write_record in put(); assert _write_locks entry removed on delete
* fix(lint): move patch import to local scope to fix ruff I001
* fix(lint): apply ruff check+format fixes to test file
* fix(runtime): address review feedback for JSONL async I/O hardening (#2816)
Use setdefault for atomic lock init in _get_write_lock; pop _write_locks
inside the held lock scope in delete_by_thread; update test docstring
and assert lock entry also cleared on delete.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: rayhpeng <rayhpeng@gmail.com>
* fix(gateway): split stream_existing_run into per-method routes for unique OpenAPI operationIds
`@router.api_route("/.../stream", methods=["GET", "POST"])` registers a
single FastAPI route that holds both methods. FastAPI's auto-generated
`operationId` is computed once per route from a single method picked out
of `route.methods`, so when OpenAPI generation iterates over every method
on that route both end up sharing the same `operationId`. That triggers
`UserWarning: Duplicate Operation ID stream_existing_run_..._stream_(get|post) for function stream_existing_run`
during `app.openapi()` and produces an invalid OpenAPI spec for SDK /
codegen consumers.
Register GET and POST as two separate routes on the same handler so each
method gets a distinct auto-generated `operationId` ("..._stream_get" and
"..._stream_post"). Behavior is otherwise unchanged: same handler, same
`require_permission` decoration, same response.
Add `tests/test_openapi_operation_ids.py` to lock in the invariant:
no duplicate-operationId warnings during spec generation, globally unique
operationIds across the spec, and distinct GET / POST operationIds on the
stream endpoint specifically. Reverted the source change locally and
confirmed all three tests fail before the fix.
* test(runtime): widen CancelledError catch in _ScriptedAgent to fix cancel-race flake
`_ScriptedAgent.astream()` previously only caught `asyncio.CancelledError`
inside the inner `if self.block_after_first_chunk:` while-loop. Cancellation
arriving during any earlier `await` in the same body
(`self.model.ainvoke`, `_write_checkpoint`, the `yield`) would propagate
without setting `controller.cancelled`, so callers waiting on
`controller.cancelled.wait(5)` after `POST /cancel` returned 204 could race
and time out.
`test_cancel_interrupt_stops_running_background_run` waits only for the
`started` event (set on the first line of `astream`) before issuing cancel,
so its race window spans all three pre-loop `await`s. On a clean `main`
checkout, stress-running the test 20× reproduces the failure 6/20
(~30%). `test_cancel_rollback_restores_pre_run_checkpoint`, which waits
for the later `checkpoint_written` event, passes 20/20 — confirming the
race lives entirely in the gap between `started.set()` and the
cancellation-aware block.
Widen the try/except to cover the entire `astream` body so any
`CancelledError` sets the controller event; the non-cancel path is
unchanged (no exception means no event set). After this change the
previously flaky test passes 50/50, the rollback test still passes 30/30,
and the full backend suite remains at 3649 passed / 19 skipped.
Test-only change — `backend/tests/test_runtime_lifecycle_e2e.py` is the
only file touched; the production cancel pipeline is unaffected.
* fix(gateway): honour on_disconnect on /wait endpoints (#3265)
The non-streaming /threads/{tid}/runs/wait and /runs/wait handlers used
to await record.task directly with no disconnect handling and silently
swallow CancelledError. When a long tool call (e.g. pip install inside
a custom skill) kept the connection idle long enough for an
intermediate HTTP layer to time out, the handler would still read the
in-progress checkpoint and return it as if the run had completed
normally -- masking a half-finished run as a successful response.
Add wait_for_run_completion in app.gateway.services that mirrors
sse_consumer's bridge-consumption pattern: subscribe to the stream
bridge until END_SENTINEL, poll request.is_disconnected on every
wake-up, and on real client disconnect cancel the background run when
record.on_disconnect is "cancel". Wire it into both wait endpoints.
The streaming path was unaffected because sse_consumer already has
this loop; this just brings /wait to parity.
* fix(gateway): skip checkpoint serialization on /wait disconnect
Copilot review on #3267 caught a follow-on of the same #3265 bug: when
the client disconnects, wait_for_run_completion breaks out of the bridge
loop and cancels the run, but the /wait endpoint then continues to read
the checkpointer and serializes whatever partial checkpoint exists as a
normal 200 response.
Have the helper return a bool — True only when END_SENTINEL was observed
— and skip the checkpoint serialization path on False. Also reorder the
inner check so END_SENTINEL is honoured even when is_disconnected() flips
true in the same iteration; the run truly finished so the real final
checkpoint is still valid.
* fix(mcp): skip session pooling for HTTP/SSE transports to avoid anyio RuntimeError (#3203)
HTTP/SSE transports use anyio.TaskGroup internally for streamable
connections. These task groups have cancel scopes bound to the async task
that created them, so closing a pooled session from a different task
raises RuntimeError. Restrict session pooling to stdio transports only.
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* docs: clarify MCP pooling applies only to stdio tools
Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/2dd9881d-54c6-45fd-90bc-154a09e29841
Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>