31 Commits

Author SHA1 Message Date
Xinmin Zeng
799bef6d9d
fix(replay-e2e): match by conversation, not the living system prompt (#3436)
* fix(replay-e2e): match by conversation, not the living system prompt

The model-replay match key hashed the full input including the lead-agent
system prompt. That prompt is edited frequently (e.g. #3195 added a "File
Editing Workflow" section), so the committed fixture went stale the moment
the prompt changed on main — turning the Layer-2 render gate RED on every
unrelated PR (#3430, #3432, ...). This was a self-inflicted false positive.

Root-cause fix:
- replay_provider._canonical_messages now EXCLUDES the system message from
  the hash. The conversation (human/ai/tool) is the stable contract that
  identifies a recorded turn; the system prompt is an internal detail not
  part of the front-back contract under test. (Mirrors how open-design keys
  its mock picker on the user prompt, not the system internals.) Proven
  robust: injecting a prompt edit no longer causes a replay miss.
- Layer-1 golden was BLIND to replay misses: the gateway swallows a miss
  into an assistant error message, so the shape-only golden stayed green on
  a stale fixture. It now inspects replay_provider.replay_misses() and fails
  loud. (Layer-2 already fails on a miss.)
- Re-recorded write_read_file.ultra fixture + regenerated golden under the
  new conversation-only hash.
- Layer-2 render spec: assert the in-graph auto-title (deterministic); the
  follow-up suggestion is fired async and depends on a clean JSON model
  output, so assert it only when the fixture captured one — never gate on
  its absence (recording flakiness must not block CI).
- docs: REPLAY_E2E.md updated.

Verified: Layer-1 golden green (no miss), Layer-2 both specs green,
CI=true make test 4033 passed / 0 failed, frontend pnpm check clean.

* test(replay-e2e): restore suggestions coverage with a reliable capture

Addresses review feedback (the suggestion path was dropped from Layer-2):

- record spec now waits for the `/suggestions` response before checking
  capture stability, so the recorded fixture reliably includes the
  frontend-fired suggestions turn (previously the stability window could
  return before suggestions fired, yielding a fixture without it).
- Re-recorded write_read_file.ultra: 5 turns (write_file, auto-title,
  read_file, answer, suggestions). Golden unchanged — suggestions is a
  separate /suggestions call, not part of the /runs/stream SSE sequence.
- Layer-2 spec: restore the hard `EXPECTED_SUGGESTION` assertion. With the
  record spec now waiting for /suggestions, a fixture missing the suggestion
  turn means a broken recording and must fail loud, not pass silently.

Verified: Layer-1 golden green (no miss), Layer-2 both specs green
(auto-title + suggestion render), frontend pnpm check clean.

* ci: re-trigger (flaky Docker Hub image pull in sandbox e2e, unrelated)

backend-unit-tests failed only in test_sandbox_orphan_reconciliation_e2e.py
with 'docker pull busybox:latest ... context deadline exceeded' — a CI-runner
network flake reaching Docker Hub, not related to this docs/tests-only change.
Empty commit to re-run CI.

---------

Co-authored-by: DanielWalnut <45447813+hetaoBackend@users.noreply.github.com>
2026-06-08 17:32:41 +08:00
Xinmin Zeng
88759015e4
test(e2e): deterministic record/replay front-back contract verification (#3365)
* test(e2e): record/replay front-back contract verification

Guards the front-back contract with a deterministic, key-free record/replay
harness (mirrors open-design's golden-trace approach):

- ReplayChatModel (tests/replay_provider.py): replays recorded LLM turns by a
  normalized hash of the model input. Strips <system-reminder>/date/uuid/tmp-path
  so one fixture replays across days and from both the browser and direct-POST
  paths; a miss raises loudly (no silent divergence).
- Recording is record-through-browser (scripts/record_gateway.py +
  build_fixture_from_jsonl.py + frontend/tests/e2e-record): a real run is driven
  through the real frontend so captured inputs match exactly what the browser
  sends; fixtures contain no API key.
- Layer 1 — backend golden (tests/test_replay_golden.py): replay through the real
  gateway, assert the SSE event sequence == committed golden.
- Layer 2 — full-stack render (frontend/tests/e2e-real-backend): real Next.js +
  real gateway (replay model) + Chromium; assert the replayed auto-title and
  follow-up suggestions render. DOM assertions are the gate; visual regression is
  a local dev gate (CI uploads the render as an artifact).
- CI (.github/workflows/replay-e2e.yml): both layers, triggered on EITHER side of
  the contract (frontend/** or backend gateway/harness/fixtures).

* test(e2e): multi-run render-order cross-stack scenario (#3352)

Guards the dangerous front-back class where a backend ordering change
silently breaks a frontend assumption while both sides' unit tests stay
green. Reproduces issue #3352: backend list_by_thread returns runs
newest-first (#2932) and the frontend prepended per-run pages, inverting
chronological order once the checkpoint no longer held the older messages.

- tests/seed_runs_router.py: test-only seeder, mounted on the replay
  gateway only when DEERFLOW_ENABLE_TEST_SEED=1 (never in the production
  app). Seeds a thread with >=2 runs + per-run message events and no
  checkpoint -- the #3352 precondition -- so the frontend per-run reload
  path is the sole source of truth and the prepend inversion is observable.
- frontend/tests/e2e-real-backend/multi-run-order.spec.ts: drives the real
  frontend against the real gateway, asserts the first run renders above
  the second. Reverting the #3354 fix turns it red.
- replay-e2e.yml: trigger on the new replay test-infra paths.
- docs: REPLAY_E2E.md cross-stack scenario section.

* test(e2e): address Copilot review on the replay harness

- Fix stale recorder references (scripts/record_traces.py ->
  scripts/record_gateway.py + scripts/build_fixture_from_jsonl.py) in
  replay_provider.py, test_replay_golden.py, _replay_fixture.py.
- MODE_CONTEXT['ultra']: thinking_enabled False -> True, mirroring the
  frontend's `context.mode !== 'flash'` (hooks.ts). It did not affect the
  hashed input (Layer 1 golden still green), but the table now matches the
  real frontend context it claims to mirror.
- replay_provider.py docstring: stop claiming memory is recorded-enabled;
  the replay config disables memory/summarization for determinism (title
  stays, as an in-graph deterministic call).
- record_gateway.py / run_replay_gateway.py: override DEER_FLOW_HOME instead
  of setdefault, so an outer value can't leak into the hermetic harness.
- record_gateway.py: clear error when DEERFLOW_RECORD_OUT is unset (was a
  bare KeyError).
- playwright.record.config.ts: forward OPENAI_*/DEERFLOW_RECORD_OUT only when
  set, so the gateway raises a clear 'missing env' error instead of getting ''.

* test(e2e): address Copilot review round 2

- seed_runs_router.py: constrain SeedMessage.role to Literal['human','ai']
  so a bad value is a clean 422 at the boundary instead of a 500
  (KeyError on _EVENT_TYPE).
- record-write-read-file.spec.ts: waitForCaptureStable now throws on
  timeout instead of returning the last count, so a truncated/partial
  recording can't pass silently.
- real-backend-render.spec.ts: guard the suggestions JSON.parse; a
  bracket-prefixed non-JSON turn falls back to '' so the existing
  not.toBe('') assertion fails clearly instead of a generic parse throw.
2026-06-08 12:35:03 +08:00
Xinmin Zeng
8d2e55a05f
fix(subagent): structured subagent_status field over text parsing (#3146) (#3154)
* fix(subagent): structured subagent_status field over text parsing

Closes #3146.

## Why

The frontend used to derive subtask card state by string-matching the
leading text of the `task` tool's result. That contract surface was
fragile — `#3107` BUG-007 and the `#3131` review both surfaced cases
where new backend wording (`Task cancelled by user.`,
`Task polling timed out after N minutes`, `ToolErrorHandlingMiddleware`
exception wrappers) silently broke the card lifecycle. The frontend
fallback kept growing more prefixes; any future rewording would break
it again.

## Design

1. **Backend → frontend contract**: `ToolMessage.additional_kwargs`
   carries `subagent_status` (one of `completed | failed | cancelled |
   timed_out | polling_timed_out`) and an optional `subagent_error`
   blob. The frontend prefers it over parsing `content`.

2. **Centralised stamping, not 8 sprinkled stamps**: rather than have
   each of `task_tool.py`'s 5 normal-return + 3 pre-execution `Error:`
   paths remember to set `additional_kwargs`, `ToolErrorHandlingMiddleware`
   stamps the field after every task-tool call. Adding a new return
   path in `task_tool.py` cannot now skip the stamp.

3. **Cross-language contract fixture**: the prefix→status mapping is
   the one piece both sides must agree on. The shared fixture at
   `contracts/subagent_status_contract.json` lists every backend return
   string, the expected status, and what the error substring should
   contain. Backend test (`backend/tests/test_subagent_status_contract.py`)
   and frontend test (`frontend/tests/unit/core/tasks/subtask-result.test.ts`)
   both load that fixture and assert the same cases. A wording drift on
   either side fails the matching language's test.

4. **Round-trip serialisation pinned**: the round-trip test asserts
   `ToolMessage.model_dump_json()` → `model_validate_json()` preserves
   `additional_kwargs.subagent_status`. Catches the case where a future
   LangChain or Pydantic upgrade silently strips unknown kwargs.

5. **Frontend status collapse documented**: the backend has five status
   values, the frontend card has three (`completed | failed |
   in_progress`). `cancelled` / `timed_out` / `polling_timed_out` all
   collapse to `failed` with the original status preserved in `error`.
   `parseSubtaskResult` returns `in_progress` for unknown values so a
   backend that ships a new enum variant before the frontend upgrades
   degrades to the legacy prefix fallback instead of getting pinned.

## Changes

Backend:
- `deerflow.subagents.status_contract` — new module exporting
  `SUBAGENT_STATUS_KEY`, `SUBAGENT_ERROR_KEY`,
  `SUBAGENT_STATUS_VALUES`, `extract_subagent_status(content)`, and
  `make_subagent_additional_kwargs(status, error)`.
- `ToolErrorHandlingMiddleware`: new `_stamp_task_subagent_status`
  helper centralises the stamp; `wrap_tool_call` / `awrap_tool_call`
  stamp on the success path; `_build_error_message` stamps on the
  wrapper path (carrying `ExcClass: detail` into `subagent_error`).
  Non-task tools are untouched.
- New tests: `test_subagent_status_contract.py` (19 cases from the
  shared fixture + status-enum / blank-error / unknown-status
  rejection) and `test_tool_error_handling_subagent_stamp.py`
  (middleware integration: terminal-content stamps, non-terminal
  doesn't, non-task tools untouched, async path mirrors sync,
  existing additional_kwargs survive, JSON round-trip preserved).

Frontend:
- `parseSubtaskResult(text, additionalKwargs?)` — prefers the
  structured stamp; falls back to the legacy prefix matcher for
  historical threads / unknown future status values.
- `STRUCTURED_STATUS_TO_SUBTASK` documents the five→three collapse.
- `message-list.tsx` passes `message.additional_kwargs` through.
- `subtask-result.test.ts` adds a structured-status block + a
  fixture-driven contract block; legacy prefix tests stay green for
  the fallback path.

Contract:
- `contracts/subagent_status_contract.json` — single source of truth
  both languages load. Whitespace variants, varied N for polling
  timeouts, the 3 pre-execution `Error:` returns task_tool produces,
  and the middleware wrapper shape are all in there.

## Test plan
- `make lint` clean (backend + frontend).
- `pytest tests/test_subagent_status_contract.py
   tests/test_tool_error_handling_subagent_stamp.py` → 37 passed.
- `pnpm test --run` → 103 passed (was 76, +27 new).

## Migration / fallback retirement

The text-prefix fallback stays in place until backend telemetry shows
the frontend never hits it for newly produced messages. At that point
a follow-up PR can drop the prefix branches and keep only the
structured-status branch.

Refs: bytedance/deer-flow#3138 (split summary), #3107 (origin), #3131
(prior prefix-only fix), #3146 (this issue).

* fix(subtask): back-fill result/error from text when structured status present

Three follow-ups on the PR #3154 review:

1. `readStructuredStatus` no longer short-circuits the prefix parse.
   The backend currently stamps only the `subagent_status` enum value;
   the human-facing `result` body and wrapped-error message still live
   in `ToolMessage.content`. Dropping the text parse meant successful
   tasks rendered empty completed pills and wrapped failures lost their
   diagnostic. Now both shapes get composed: structured status wins,
   `result`/`error` come from text when both sides agree, and a lying
   success body under a `failed` stamp is dropped instead of leaking.

2. Replace the ESM-incompatible `__dirname` fixture lookup in
   subtask-result.test.ts with `fileURLToPath(new URL(..., import.meta.url))`.
   The frontend package is `"type": "module"`, so the previous path
   would have thrown at runtime if anything ever changed under the
   contract directory.

3. Drop the `$schema` reference from contracts/subagent_status_contract.json
   pointing at a file that doesn't exist in the tree.

Three new tests cover the structured + text composition: completed
back-fills the success body, failed back-fills the wrapper text, and
unrecognised content under a `failed` stamp stays empty rather than
echoing noise.
2026-06-07 22:49:55 +08:00
Huixin615
9a53f9dfbb
fix(frontend): preserve chronological order of thread history after context compression (#3354)
* fix(frontend): preserve chronological order of thread history after context compression

Iterate runs from newest to match backend `list_by_thread` (newest-first) and the prepend semantics of the history loader, so refreshed history renders in A→B→C→D→E→F order.

Fixes #3352

* fix(frontend): auto-continue loading runs with no visible messages after context compression
2026-06-03 21:51:48 +08:00
Eilen Shin
019bd16a06
fix: load paginated run history messages (#3305) 2026-06-01 15:50:39 +08:00
Nan Gao
d46a5779bc
fix(chat): preserve messages after summarization (#3280)
* fix(chat): preserve messages after summarization

* make format

* fix(chat): address summarization review comments
2026-05-29 08:24:47 +08:00
Xinmin Zeng
2ace78d1e5
fix(frontend): surface backend detail when agent name check fails (#3048)
* fix(frontend): surface backend detail when agent name check fails

The new-agent page caught AgentNameCheckError but only branched on
reason === "backend_unreachable". Everything else (notably the 422
"Invalid agent name '...'. Must match ^[A-Za-z0-9-]+$" response from
GET /api/agents/check when the user submits a name with disallowed
characters — trailing space, dot, Chinese, invisible whitespace from
copy-paste) fell through to the generic fallback "Could not verify
name availability — please try again", swallowing the detail that
already told the user exactly what to fix.

Add a request_failed branch that surfaces err.message (which
checkAgentName already populates from the backend's detail at
core/agents/api.ts). The disabled / backend_unreachable / unknown-
error paths are unchanged.

Pin the contract with unit tests covering: 200 success, fetch
rejection, 502/503/504 network errors, agents_api disabled detail,
422 validation detail carried verbatim, statusText fallback when
detail is absent, and a regression guard against misclassifying a
422 as agents_api disabled.

Closes #3041

* fix(frontend): localise the error prefix when surfacing backend detail

The previous commit surfaced the backend's raw `err.message` on the
new-agent page when the name check failed. The detail itself is
English (backend's `_validate_agent_name` text, any 5xx business
message, etc.) and dropping it bare into a zh-CN page produced a
jarring English-among-Chinese line that didn't match neighbouring
strings like "已存在同名智能体" / "无法验证名称可用性".

Add `nameStepCheckErrorWithDetail` as a templated string ("Name
check failed: {detail}" / "名称校验失败:{detail}"), mirroring the
existing `nameStepBootstrapMessage` `{name}` template pattern. The
page wraps `err.message` in it when present and falls back to the
plain `nameStepCheckError` when the detail is empty.

Rendered output (verified locally with a Console fetch mock that
returns 500 + detail):

  zh-CN: 名称校验失败:Database connection lost: SQLAlchemy connection
         pool exhausted (max 5 connections, all in use)
  en-US: Name check failed: Database connection lost: SQLAlchemy
         connection pool exhausted (max 5 connections, all in use)

The localised prefix tells the user *what operation* failed; the
raw detail tells them *why*. Translating the detail itself would
be lossy (any unbounded backend string would need a translation
table) and would break the debuggability the previous commit
delivered.

Refs #3041

* fix(frontend): distinguish backend detail from generated fallback in AgentNameCheckError

Addresses Copilot's review on #3048: the previous commits keyed off
`err.message`, but `checkAgentName` substitutes a generated fallback
string ("Failed to check agent name: ${statusText}") when the backend
sent no detail. That guaranteed `err.message` was always truthy, made
the `nameStepCheckError` fallback branch unreachable in practice, and
could surface awkward strings like "名称校验失败:Failed to check
agent name: Bad Gateway" in the UI.

Add an explicit `detail: string | null` field to AgentNameCheckError.
`checkAgentName` populates it only when the backend response actually
carried a string `detail` (defensive guard against the dict-shaped
detail that other deer-flow endpoints use for typed error codes).
The new-agent page now selects on `err.detail` instead of `err.message`
so the localised fallback wins when no real detail exists.

Also fix the prettier formatting that broke lint-frontend CI on the
previous push.

Test changes:
- The 422 carry-through test now asserts both `detail` and `message`
  hold the backend string verbatim.
- A new "falls back to statusText in message but leaves detail null"
  test pins the contract that no real detail ⇒ no UI surface leak.
- A new "treats non-string detail as null" test guards against future
  backend schema drift toward dict-shaped detail.

Refs #3041 #3048
2026-05-28 18:38:45 +08:00
Admire
2fdfff0db3
fix(frontend): fix Mermaid preview failure in historical messages (#3196)
* fix(frontend): render historical mermaid diagrams

* fix(frontend): address mermaid review feedback

* Stabilize cancel lifecycle test

* fix(frontend): handle mermaid fence variants

* fix(frontend): normalize mermaid arrow spacing

* fix(frontend): handle mermaid CRLF fences

* chore: keep mermaid fix frontend-scoped

---------

Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
2026-05-28 18:20:02 +08:00
zgenu
737abc0e45
fix: ignore stale run reconnect conflicts (#3284)
* fix: ignore stale run reconnect conflicts

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* fix: ignore stale run reconnect conflicts

---------

Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-05-28 17:29:30 +08:00
Admire
f68bcb771c
fix(frontend): guard message copy clipboard access (#3211)
* fix(frontend): guard message copy clipboard access

* fix(frontend): reuse clipboard guard across copy actions
2026-05-26 09:37:51 +08:00
AochenShen99
11dd5b0683
fix(frontend): strip unclosed <think> tags from streaming AI content (#3218)
* fix(frontend): strip unclosed <think> tags from streaming AI content

During streaming, an opening <think> tag may arrive in one chunk
while the matching </think> arrives in a later chunk. The existing
splitInlineReasoning regex only matched fully closed pairs, so the
mid-flight reasoning was left in message.content and rendered into
the chat bubble via the markdown pipeline's rehypeRaw plugin until
the closing tag landed.

Extend splitInlineReasoning with a second pass: after stripping every
closed <think>...</think> pair, route any remaining content from a
lone opener to the reasoning slot and leave only the preceding
preamble in content. Closed-tag behavior is unchanged.

Covers every provider whose stream emits reasoning inline as <think>
tags (MiniMax streaming path, MindIE, PatchedChatOpenAI, and any
gateway-served DeepSeek/OpenAI-compatible model).

* style(frontend): apply prettier formatting to streaming reasoning tests

* fix(frontend): skip <think> split for literal think tags in inline code

Treats a `<think>` opener immediately preceded by a backtick as part of
markdown inline code rather than a streaming reasoning marker. Prevents
permanent content truncation when an AI message documents the `<think>`
tag literally (e.g. ``Use `<think>` markers``), where the streaming-safe
fallback would otherwise route the rest of the answer into the reasoning
panel because no `</think>` ever arrives.

Adds regression tests for both the post-stream and mid-stream cases.
2026-05-26 09:35:07 +08:00
Admire
e7967a7fc3
fix(frontend): hide copy for streaming assistant turn (#3176) 2026-05-23 23:29:16 +08:00
Admire
d0fa37e71d
fix(frontend): avoid duplicate optimistic user message (#3002) 2026-05-23 17:02:23 +08:00
AochenShen99
604fcbb9d2
Stabilize write artifact previews (#3172) 2026-05-23 16:56:14 +08:00
JeffJiang
b103d1a7f5
feat(frontend): support static website demo mode (#3170)
* feat(frontend): support static website demo mode

* fix(frontend): render html artifact previews from blob content

* chore(frontend): apply pre-commit formatting

* fix(frontend): address static demo PR review comments

* Update the release information of DeerFlow

---------

Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
2026-05-23 00:10:56 +08:00
Xinmin Zeng
e93f658472
fix(stability): resolve P0 blockers from v2.0-m1-rc1 stability audit (#3107) (#3131)
* fix(task-tool): unwrap callback manager when locating usage recorder

`config["callbacks"]` may arrive as a `BaseCallbackManager` (e.g. the
`AsyncCallbackManager` LangChain hands to async tool runs), not just a plain
list. The previous `for cb in callbacks` loop raised
`TypeError: 'AsyncCallbackManager' object is not iterable`, which
`ToolErrorHandlingMiddleware` then converted into a failed `task` ToolMessage
even though the subagent had completed internally — Ultra mode lost subagent
results and the lead agent fell back to redoing the work.

Unwrap `BaseCallbackManager.handlers` before searching for the recorder.

Refs: bytedance/deer-flow#3107 (BUG-002)

* fix(frontend): treat any task tool error as a terminal subtask failure

The subtask card status machine matched only three English prefixes (`Task
Succeeded. Result:`, `Task failed.`, `Task timed out`). Anything else fell
through to `in_progress`, so a `task` tool error wrapped by
`ToolErrorHandlingMiddleware` (`Error: Tool 'task' failed ...`) left the card
spinning forever even after the run had ended.

Extract the prefix logic into `parseSubtaskResult` and recognise any leading
`Error:` token as a terminal failure. The extracted function is unit-tested
against the legacy prefixes plus the `AsyncCallbackManager` regression
captured in the upstream issue.

Refs: bytedance/deer-flow#3107 (BUG-007)

* fix(frontend): exclude hidden, reasoning, and tool payloads from chat export

`formatThreadAsMarkdown` / `formatThreadAsJSON` iterated raw messages without
running the UI-level `isHiddenFromUIMessage` filter. Exported transcripts
therefore included `hide_from_ui` system reminders, memory injections,
provider `reasoning_content`, tool calls, and tool result messages — content
that is intentionally hidden in the chat view.

Filter the export to the user-visible transcript by default and gate
reasoning / tool calls / tool messages / hidden messages behind explicit
`ExportOptions` flags so a future debug export can opt back in without
forking the formatter.

Refs: bytedance/deer-flow#3107 (BUG-006)

* fix(gateway): route get_config through get_app_config for mtime hot reload

`get_config(request)` returned the `app.state.config` snapshot captured at
startup. The worker / lead-agent path then threaded that frozen `AppConfig`
through `RunContext` and `agent_factory`, so per-run fields edited in
`config.yaml` (notably `max_tokens`) were ignored until the gateway process
was restarted — even though `get_app_config()` already does mtime-based
reload at the bottom layer.

Route the request dependency through `get_app_config()` directly. Runtime
`ContextVar` overrides (`push_current_app_config`) and test-injected
singletons (`set_app_config`) keep working; `app.state.config` is now only
read at startup for one-shot bootstrap (logging level, IM channels,
`langgraph_runtime` engines).

`tests/test_gateway_deps_config.py` encoded the old snapshot contract and is
removed; `tests/test_gateway_config_freshness.py` replaces it with mtime,
ContextVar, and `set_app_config` coverage. `test_skills_custom_router.py` and
`test_uploads_router.py` now inject test configs via FastAPI
`dependency_overrides[get_config]` instead of mutating `app.state.config`.

Document the hot-reload boundary in `backend/CLAUDE.md` so reviewers know
which fields are picked up on the next request vs. which still require a
restart (`database`, `checkpointer`, `run_events`, `stream_bridge`,
`sandbox.use`, `log_level`, `channels.*`).

Refs: bytedance/deer-flow#3107 (BUG-001)

* fix(gateway): broaden get_config 503 to any config-load failure

Address review feedback on the previous commit:

1. Narrow exception catch removed. The old contract returned 503 whenever
   `app.state.config is None`. The first cut only mapped
   `FileNotFoundError`, leaving `PermissionError`, YAML parse errors, and
   pydantic `ValidationError` to bubble up as 500. At the request boundary
   we treat any inability to materialise the config as "configuration not
   available" (503) and log the original exception so the operator still
   has the stack.

2. Removed the unused `request: Request` parameter and the matching
   `# noqa: ARG001`. FastAPI's `Depends()` does not require the dependency
   to accept `Request`; the only call site uses the no-arg form.

3. `backend/CLAUDE.md` boundary now lists the *reason* each field is
   restart-required (engine binding, singleton caching, one-shot
   `apply_logging_level`, etc.), not just the field name, so reviewers do
   not have to reverse-engineer the boundary themselves.

Tests parametrise four exception classes (`FileNotFoundError`,
`PermissionError`, `ValueError`, `RuntimeError`) and assert 503 for each.

Refs: bytedance/deer-flow#3107 (BUG-001)

* fix(task-tool): defend _find_usage_recorder against non-list callbacks

Address review feedback. The previous commit handled the two common shapes
LangChain hands to async tool runs — a plain `list[BaseCallbackHandler]` and
a `BaseCallbackManager` subclass — but iterated any other shape directly,
which would still raise `TypeError` if e.g. a single handler instance leaked
through without a list wrapper.

Treat any non-list, non-manager `config["callbacks"]` value as "no recorder"
rather than crash. Docstring now lists all four shapes explicitly. New tests
cover the single-handler-object case, `runtime is None`, `callbacks is None`,
and `runtime.config` being a non-dict — all required to be silent no-ops.

Refs: bytedance/deer-flow#3107 (BUG-002)

* fix(frontend): drop dead identity ternary and add opt-in export tests

Address review feedback on the previous export commit:

1. Removed the no-op `typeof msg.content === "string" ? msg.content : msg.content`
   expression in `formatThreadAsJSON`. Both branches returned the same value;
   the message content now flows through unchanged whether it is a string or
   the rich `MessageContent[]` shape (LangChain JSON-serialises the array
   structure correctly already).

2. Expanded the JSDoc on `ExportOptions` to make it clearer that the four
   flags are not currently wired to any UI control — callers wanting a debug
   export must build the options object explicitly. The default behaviour
   continues to match the explicit prescription in
   bytedance/deer-flow#3107 BUG-006.

3. Added opt-in coverage. The previous tests only exercised the
   `options = {}` default path; the new cases verify each flag flips the
   corresponding payload back into the export so a future debug-export
   surface does not silently break the contract.

Refs: bytedance/deer-flow#3107 (BUG-006)

* fix(frontend): export subtask prefix constants and document fallback intent

Address review feedback on the previous BUG-007 commit:

1. `SUCCESS_PREFIX`, `FAILURE_PREFIX`, `TIMEOUT_PREFIX`, and the
   `ERROR_WRAPPER_PATTERN` regex are now exported. The JSDoc explicitly
   pins them as part of the backend↔frontend contract defined in
   `task_tool.py` and `tool_error_handling_middleware.py`, so any future
   structured-status migration (e.g. backend writing
   `additional_kwargs.subagent_status` instead of leading text) can
   reference these from one canonical place rather than redefine them.

2. The `in_progress` fallback now carries a docstring explaining the
   deliberate choice — LangChain only ever emits a `ToolMessage` once the
   tool itself has returned, so unrecognised content means the contract
   has drifted and "still running" is the right operator signal (eagerly
   marking it terminal-failed would mask the drift).

No behaviour change; this is documentation and an API export.

Refs: bytedance/deer-flow#3107 (BUG-007)

* fix(gateway): drop app.state.config snapshot and freeze run_events_config

Address @ShenAC-SAC's BUG-001 review on #3131. The previous cut still
stored an ``AppConfig`` snapshot on ``app.state.config`` for startup
bootstrap. Two follow-on hazards from that:

1. Future code touching the gateway lifespan could accidentally start
   reading ``app.state.config`` again, silently regressing the request
   hot path back to a stale snapshot.
2. ``get_run_context()`` paired a freshly-reloaded ``AppConfig`` with the
   startup-bound ``event_store`` and a *live* ``run_events_config``
   field — so an operator who edited ``run_events.backend`` mid-flight
   would have produced a run context whose ``event_store`` and
   ``run_events_config`` referred to different backends.

Clean approach (aligned with the direction in PR #3128):

- ``lifespan()`` keeps a local ``startup_config`` variable and passes it
  explicitly into ``langgraph_runtime(app, startup_config)`` and into
  ``start_channel_service``. No ``app.state.config`` attribute is set at
  any point.
- ``langgraph_runtime`` now accepts ``startup_config`` as a required
  parameter, removing the ``getattr(app.state, "config", None)`` lookup
  and the "config not initialised" runtime error.
- The matching ``run_events_config`` is frozen onto ``app.state`` next
  to ``run_event_store`` so ``get_run_context`` reads the two from the
  same startup-time source. ``app_config`` continues to be resolved
  live via ``get_app_config()``.
- ``backend/CLAUDE.md`` boundary explanation updated to spell out the
  ``startup_config`` / ``get_app_config()`` split.

New regression test ``test_run_context_app_config_reflects_yaml_edit``
exercises the worker-feeding path: it asserts that ``ctx.app_config``
follows a mid-flight ``config.yaml`` edit while
``ctx.run_events_config`` stays frozen to the startup snapshot the
event store was built from.

Refs: bytedance/deer-flow#3107 (BUG-001), bytedance/deer-flow#3131 review

* fix(frontend): parse Task cancelled and polling timed out as terminal

Address @ShenAC-SAC's BUG-007 review on #3131. `task_tool.py` actually
emits five terminal strings:

- `Task Succeeded. Result: …`
- `Task failed. …`
- `Task timed out. …`
- `Task cancelled by user.`               ← previously matched none
- `Task polling timed out after N minutes …` ← previously matched none

The previous cut handled three; the last two fell through to the
"unknown content" branch and pushed the subtask card back to
`in_progress` even though the backend had already reached a terminal
state. Add explicit matches plus regression tests for both. The
`in_progress` fallback is now reserved for genuinely unrecognised
output (i.e. contract drift), as documented.

Refs: bytedance/deer-flow#3107 (BUG-007), bytedance/deer-flow#3131 review

* fix(frontend): sanitize JSON export content via the Markdown content path

Address @ShenAC-SAC's BUG-006 review and the Copilot inline comment on
#3131. The previous cut filtered hidden/tool messages out of the JSON
export but still serialised `msg.content` verbatim, so:

- inline `<think>…</think>` wrappers stayed in the exported `content`
  even with `includeReasoning: false`,
- content-array thinking blocks leaked the `thinking` field,
- `<uploaded_files>…</uploaded_files>` markers leaked the workspace
  paths a user uploaded files to.

JSON now goes through the same sanitiser the Markdown path uses
(`extractContentFromMessage` + `stripUploadedFilesTag`). Reasoning and
tool_calls remain gated behind their `ExportOptions` flags. AI / human
rows that sanitise to empty content with no opted-in reasoning or tool
calls are dropped so the JSON matches the Markdown path's `continue`
on empty assistant fragments.

New regression tests cover the three leak shapes the reviewer called
out plus the empty-content-drop case.

Refs: bytedance/deer-flow#3107 (BUG-006), bytedance/deer-flow#3131 review

* test(gateway): align lifespan stub with langgraph_runtime two-arg signature

Codex round-3 review of c0bc7a06 flagged this: changing
`langgraph_runtime` to require `startup_config` as a second positional
argument broke the one-arg stub `_noop_langgraph_runtime(_app)` in
`test_gateway_lifespan_shutdown.py`, which is patched into
`app.gateway.app.langgraph_runtime` by the lifespan shutdown bounded-timeout
regression. Lifespan would then call the stub with two args and raise
`TypeError` before the bounded-shutdown assertion ran.

Update the stub to match the new signature. The shutdown test itself is
unaffected — it only cares about the channel `stop_channel_service` hang
path.

Refs: bytedance/deer-flow#3107 (BUG-001), bytedance/deer-flow#3131 review

* fix(frontend): strip every known backend marker in export, not just uploads

Codex round-3 review of 258ca800 and the matching maintainer feedback on
PR #3131 made the same point: the JSON export now ran the
Markdown-side sanitiser, but that sanitiser only stripped
`<uploaded_files>`. The full set of payloads middleware embeds inside
message `content` is larger:

- `<uploaded_files>` — `UploadsMiddleware`
- `<system-reminder>` — `DynamicContextMiddleware`
- `<memory>` — `DynamicContextMiddleware` (nested inside system-reminder)
- `<current_date>` — `DynamicContextMiddleware`

The primary protection is still `isHiddenFromUIMessage`: the
`<system-reminder>` HumanMessage is marked `hide_from_ui: true` and never
reaches the formatter. This commit adds the second line of defence so a
regression that drops the `hide_from_ui` flag — or any future middleware
that injects the same tag vocabulary into a visible HumanMessage —
cannot leak the payload into the export file.

Concrete changes:

- New `INTERNAL_MARKER_TAGS` constant + `stripInternalMarkers(content)`
  helper in `core/messages/utils.ts`. The constant doubles as
  documentation for the backend↔frontend contract.
- `formatMessageContent` in `export.ts` now calls `stripInternalMarkers`
  instead of `stripUploadedFilesTag`. UI render paths
  (`message-list-item.tsx`) keep using the narrower function so a user
  legitimately typing `<memory>` in a meta-discussion is preserved.
- The "drop empty rows" guard in `buildJSONMessage` switched from
  `=== undefined` to truthy `!` checks. Codex spotted the asymmetry: when
  `extractReasoningContentFromMessage` returned the empty string (which it
  legitimately can), the JSON path emitted `{reasoning: ""}` while the
  Markdown path's `!reasoning` `continue` correctly dropped the row.

New regression tests cover the defence-in-depth strip with a
`<system-reminder><memory><current_date>` payload deliberately *not*
marked `hide_from_ui`; tool-message sanitization under
`includeToolMessages: true`; the mixed-content-array case
(`thinking + text + image_url`); and the opted-in empty-reasoning drop.

Live verification on a real Ultra-mode thread that uploaded a PDF
(`曾鑫民-薪资交易流水.pdf`): backend state's first HumanMessage carries the
`<uploaded_files>` block (with `/mnt/user-data/uploads/...` paths) as part
of a content-array. The Markdown and JSON export blobs both come back
free of `<uploaded_files>`, `<system-reminder>`, `<current_date>`,
`tool_calls`, and reasoning — while preserving the user's `这是什么 ?`
prompt and the assistant's visible answer.

Refs: bytedance/deer-flow#3107 (BUG-006), bytedance/deer-flow#3131 review

* test(frontend): cover trim, varied N, and pre-execution Error: prefixes

Codex round-3 review of 50e2c257 flagged three coverage gaps in the
subtask-status parser:

1. `Task cancelled by user.` and `Task polling timed out` previously had
   no whitespace-trim coverage — the original trim test only exercised
   the success prefix. Streaming chunks can arrive with leading/trailing
   newlines; the regex needed an explicit assertion.
2. The polling-timeout case was tested only at one `N` (15 minutes). The
   backend interpolates the live `timeout_seconds // 60` value, so the
   matcher must hold for any positive integer. Now we run the case for
   1, 5, and 60 minutes.
3. `task_tool.py` also emits three `Error:` strings for pre-execution
   failures — unknown subagent type, host-bash disabled, and "task
   disappeared from background tasks". They are intentionally handled by
   `ERROR_WRAPPER_PATTERN` rather than dedicated prefixes (the wrapper
   already produces the right terminal-failed shape) but had no test
   coverage proving that wiring. Codex was right that a refactor splitting
   one of them off into its own prefix would silently break things.

The JSDoc on the constants block now spells the three pre-execution
errors out so the relationship between `task_tool.py` returns and the
prefix vocabulary is explicit.

No production code change beyond the docstring — this commit is pure
coverage hardening for the contract that already exists.

Refs: bytedance/deer-flow#3107 (BUG-007), bytedance/deer-flow#3131 review
2026-05-21 21:18:10 +08:00
Nan Gao
6d3cffb4f0
fix(frontend): deduplicate restored thread messages (#2958)
* fix(frontend): fix duplicate messages when reopening agent sessions (#2957)

* make format

* fix(frontend): retry pending thread history loads
2026-05-16 08:48:19 +08:00
Admire
7c42ab3e16
fix(frontend): wait for async chat submit before clearing (#2940)
* fix(frontend): wait for async chat submit before clearing

* test(frontend): cover pending attachment uploads

* fix(frontend): preserve sync submit semantics
2026-05-15 22:27:10 +08:00
Nan Gao
0c37509b38
fix(middleware): Prevent todo completion reminder IMMessage leak (#2907)
* fix(middleware): Prevent todo completion reminder IMMessage leak (#2892)

* make format

* fix(middleware): Clear stale todo reminder counts (#2892)

* add size guard for _completion_reminder_counts and add a integration test
2026-05-15 22:12:37 +08:00
YuJitang
9892a7d468
fix: bucket subagent token usage into parent run totals (#2838)
* fix: bucket subagent token usage into RunRow.subagent_tokens

Add caller-bucketed token tracking to RunJournal so subagent and
middleware LLM calls are written to the correct RunRow columns instead
of all falling into lead_agent_tokens (default 0).

- RunJournal: accumulate _lead_agent_tokens / _subagent_tokens /
  _middleware_tokens in on_llm_end, deduped by langchain run_id.
  Add record_external_llm_usage_records() for external sources
  (respects track_token_usage flag). Return caller buckets from
  get_completion_data().
- SubagentTokenCollector: new lightweight callback handler that
  collects LLM usage within subagent execution.
- SubagentExecutor: wire collector into subagent run_config and sync
  records to SubagentResult on every chunk (timeout/cancel safe).
- SubagentResult: add token_usage_records and usage_reported fields.
- task_tool: report subagent usage to parent RunJournal on every
  terminal status (COMPLETED/FAILED/CANCELLED/TIMED_OUT), including
  the CancelledError path, guarded against double-reporting.

No DB migration needed — RunRow columns already exist.

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* fix: address token usage review feedback

* Address review follow-ups

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-05-10 22:47:30 +08:00
YuJitang
5127f08e1a
enable token usage by default (#2841) 2026-05-10 22:00:57 +08:00
YuJitang
417416087b
fix: use backend thread token usage for header total (#2800)
* fix: use backend thread token usage for header total

* Refactor thread token usage fetch
2026-05-09 19:40:32 +08:00
YuJitang
530bda7107
fix: dedupe token usage aggregation by message id (#2770) 2026-05-08 09:54:20 +08:00
Xinmin Zeng
27559f3675
fix(frontend): defer thread id to onStart to avoid 404 on new chat (#2749)
* fix(frontend): defer thread id to onStart to avoid 404 on new chat

The LangGraph SDK's useStream eagerly fetches /threads/{id}/history the
moment it receives a thread id, and the local useThreadRuns issues
GET /threads/{id}/runs for the same reason. The chats page used to flip
isNewThread=false (and forward the client-generated thread id) inside
the synchronous onSend callback, before thread.submit had created the
thread on the backend. The two queries therefore raced ahead of
POST /runs/stream and returned 404 on the very first send.

Drop the onSend handler so isNewThread stays true until onStart fires
from useStream's onCreated — by then the backend has the thread, and
the SDK's submittingRef guard naturally suppresses the redundant
history fetch. The agent chat page already uses this pattern, so this
also unifies the two flows.

Adds an E2E regression that records request ordering and asserts
GET /history and GET /runs are never issued before POST /runs/stream
on the first send from /chats/new.

Closes #2746

* fix(frontend): split welcome layout from backend thread state

Removing onSend kept GET /history and GET /runs from racing ahead of
POST /runs/stream, but it also coupled the welcome layout (centered
input, hero, quick actions) to backend thread creation.  Until onCreated
returned, the user's optimistic message and the welcome hero rendered on
top of each other.

Introduce a dedicated `isWelcomeMode` UI flag, separate from
`isNewThread`:
- `isNewThread` still tracks "backend has no thread yet" and gates the
  thread id forwarded to useStream.
- `isWelcomeMode` drives the visual layout (header background, input
  box position, max width, hero, quick actions, autoFocus) and flips to
  false inside onSend so the layout animates immediately.

`isWelcomeMode` is kept in sync with `isNewThread` via an effect so
sidebar navigation and "new chat" still behave correctly.  All 15 E2E
tests pass, including the ordering regression added in the previous
commit.

* test(e2e): use monotonic sequence for thread-init ordering check

Date.now() is millisecond-resolution, so two requests emitted within
the same tick would share a timestamp and slip past the strict `<`
ordering assertions. Replace the timestamp with a monotonic counter
that increments on every observed request/requestfinished event so the
ordering check is robust regardless of scheduling.

Per PR #2749 review feedback from copilot-pull-request-reviewer.

* refactor(input-box): rename isNewThread prop to isWelcomeMode

Inside InputBox, the prop named `isNewThread` is only ever consulted
for visual layout decisions — gating follow-up suggestions, the bottom
background strip, and the welcome-mode quick-action SuggestionList. It
never reflects "the backend has created the thread", which after #2746
is tracked separately via `isNewThread` in the chat pages themselves.

Rename the prop to `isWelcomeMode` and update both call sites
(workspace chats page and agent chats page) so the prop name matches
its actual semantics. No behavior change.

Per PR #2749 review feedback from @WillemJiang.
2026-05-07 16:11:44 +08:00
Xinmin Zeng
aded753de3
fix(frontend): restore localhost fallback for getGatewayConfig in prod mode (#2705) (#2718)
* fix(frontend): unify gateway-config localhost fallback for prod (#2705)

`getGatewayConfig()` only fell back to localhost defaults when
`NODE_ENV === "development"`, while `next.config.js` always falls back
to `127.0.0.1:8001`. Running `make start` (which sets NODE_ENV=production
via `next start`) without `DEER_FLOW_INTERNAL_GATEWAY_BASE_URL` /
`DEER_FLOW_TRUSTED_ORIGINS` therefore caused zod to throw inside SSR
layouts and surfaced as a 500.

Drop the NODE_ENV gating and use localhost defaults everywhere — the
"force explicit config in prod" intent should be enforced by deployment
templates (docker-compose already sets both vars), not by request-time
crashes. Document the two vars in both .env.example files and add unit
coverage for the dev/prod env-unset paths.

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Update internalGatewayUrl in gateway config tests

---------

Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-05-05 16:27:29 +08:00
YuJitang
d02f762ab0
feat: refine token usage display modes (#2329)
* feat: refine token usage display modes

* docs: clarify token usage accounting semantics

* fix: avoid duplicate subtask debug keys

* style: format token usage tests

* chore: address token attribution review feedback

* Update test_token_usage_middleware.py

* Update test_token_usage_middleware.py

* chore: simplify token attribution fallback

* fix token usage metadata follow-up handling

---------

Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
2026-05-04 09:56:16 +08:00
yangzheli
748429ef0d
fix(frontend): add missing mock routes for runs-list, models, and suggestions (#2578)
* fix(frontend): add missing mock routes for runs-list, models, and suggestions

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-04-26 23:29:59 +08:00
Copilot
ef04174194
Fix invalid HTML nesting in reasoning trigger during complex task rendering (#2382)
* Initial plan

* fix(frontend): avoid invalid paragraph nesting in reasoning trigger

Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/4c9eb0c2-ff29-4629-a61c-4e33d736d918

Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>

* test(frontend): strengthen reasoning trigger DOM nesting assertion

Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/4c9eb0c2-ff29-4629-a61c-4e33d736d918

Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>
2026-04-21 09:41:28 +08:00
Xun
7c87dc5bca
fix(reasoning): prevent LLM-hallucinated HTML tags from rendering as DOM elements (#2321)
* fix

* add test

* fix
2026-04-19 19:27:34 +08:00
yangzheli
c6b0423558
feat(frontend): add Playwright E2E tests with CI workflow (#2279)
* feat(frontend): add Playwright E2E tests with CI workflow

Add end-to-end testing infrastructure using Playwright (Chromium only).
14 tests across 5 spec files cover landing page, chat workspace,
thread history, sidebar navigation, and agent chat — all with mocked
LangGraph/Backend APIs via network interception (zero backend dependency).

New files:
- playwright.config.ts — Chromium, 30s timeout, auto-start Next.js
- tests/e2e/utils/mock-api.ts — shared API mocks & SSE stream helpers
- tests/e2e/{landing,chat,thread-history,sidebar,agent-chat}.spec.ts
- .github/workflows/e2e-tests.yml — push main + PR trigger, paths filter

Updated: package.json, Makefile, .gitignore, CONTRIBUTING.md,
frontend/CLAUDE.md, frontend/AGENTS.md, frontend/README.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: apply Copilot suggestions

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
2026-04-18 08:21:08 +08:00
yangzheli
4efc8d404f
feat(frontend): set up Vitest frontend testing infrastructure with CI workflow (#2147)
* feat: set up Vitest frontend testing infrastructure with CI workflow

Migrate existing 4 frontend test files from Node.js native test runner
(node:test + node:assert/strict) to Vitest, reorganize test directory
structure under tests/unit/ mirroring src/ layout, and add a dedicated
CI workflow for frontend unit tests.

- Add vitest as devDependency, remove tsx
- Create vitest.config.ts with @/ path alias
- Migrate tests to Vitest API (test/expect/vi)
- Rename .mjs test files to .ts
- Move tests from src/ to tests/unit/ (mirrors src/ layout)
- Add frontend/Makefile `test` target
- Add .github/workflows/frontend-unit-tests.yml (parallel to backend)
- Update CONTRIBUTING.md, README.md, AGENTS.md, CLAUDE.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: fix the lint error

* style: fix the lint error

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
2026-04-12 18:00:43 +08:00