deer-flow

mirror of https://github.com/bytedance/deer-flow.git synced 2026-06-09 17:12:01 +00:00

Author	SHA1	Message	Date
Xinmin Zeng	799bef6d9d	fix(replay-e2e): match by conversation, not the living system prompt (#3436 ) * fix(replay-e2e): match by conversation, not the living system prompt The model-replay match key hashed the full input including the lead-agent system prompt. That prompt is edited frequently (e.g. #3195 added a "File Editing Workflow" section), so the committed fixture went stale the moment the prompt changed on main — turning the Layer-2 render gate RED on every unrelated PR (#3430, #3432, ...). This was a self-inflicted false positive. Root-cause fix: - replay_provider._canonical_messages now EXCLUDES the system message from the hash. The conversation (human/ai/tool) is the stable contract that identifies a recorded turn; the system prompt is an internal detail not part of the front-back contract under test. (Mirrors how open-design keys its mock picker on the user prompt, not the system internals.) Proven robust: injecting a prompt edit no longer causes a replay miss. - Layer-1 golden was BLIND to replay misses: the gateway swallows a miss into an assistant error message, so the shape-only golden stayed green on a stale fixture. It now inspects replay_provider.replay_misses() and fails loud. (Layer-2 already fails on a miss.) - Re-recorded write_read_file.ultra fixture + regenerated golden under the new conversation-only hash. - Layer-2 render spec: assert the in-graph auto-title (deterministic); the follow-up suggestion is fired async and depends on a clean JSON model output, so assert it only when the fixture captured one — never gate on its absence (recording flakiness must not block CI). - docs: REPLAY_E2E.md updated. Verified: Layer-1 golden green (no miss), Layer-2 both specs green, CI=true make test 4033 passed / 0 failed, frontend pnpm check clean. * test(replay-e2e): restore suggestions coverage with a reliable capture Addresses review feedback (the suggestion path was dropped from Layer-2): - record spec now waits for the `/suggestions` response before checking capture stability, so the recorded fixture reliably includes the frontend-fired suggestions turn (previously the stability window could return before suggestions fired, yielding a fixture without it). - Re-recorded write_read_file.ultra: 5 turns (write_file, auto-title, read_file, answer, suggestions). Golden unchanged — suggestions is a separate /suggestions call, not part of the /runs/stream SSE sequence. - Layer-2 spec: restore the hard `EXPECTED_SUGGESTION` assertion. With the record spec now waiting for /suggestions, a fixture missing the suggestion turn means a broken recording and must fail loud, not pass silently. Verified: Layer-1 golden green (no miss), Layer-2 both specs green (auto-title + suggestion render), frontend pnpm check clean. * ci: re-trigger (flaky Docker Hub image pull in sandbox e2e, unrelated) backend-unit-tests failed only in test_sandbox_orphan_reconciliation_e2e.py with 'docker pull busybox:latest ... context deadline exceeded' — a CI-runner network flake reaching Docker Hub, not related to this docs/tests-only change. Empty commit to re-run CI. --------- Co-authored-by: DanielWalnut <45447813+hetaoBackend@users.noreply.github.com>	2026-06-08 17:32:41 +08:00
Xinmin Zeng	88759015e4	test(e2e): deterministic record/replay front-back contract verification (#3365 ) * test(e2e): record/replay front-back contract verification Guards the front-back contract with a deterministic, key-free record/replay harness (mirrors open-design's golden-trace approach): - ReplayChatModel (tests/replay_provider.py): replays recorded LLM turns by a normalized hash of the model input. Strips <system-reminder>/date/uuid/tmp-path so one fixture replays across days and from both the browser and direct-POST paths; a miss raises loudly (no silent divergence). - Recording is record-through-browser (scripts/record_gateway.py + build_fixture_from_jsonl.py + frontend/tests/e2e-record): a real run is driven through the real frontend so captured inputs match exactly what the browser sends; fixtures contain no API key. - Layer 1 — backend golden (tests/test_replay_golden.py): replay through the real gateway, assert the SSE event sequence == committed golden. - Layer 2 — full-stack render (frontend/tests/e2e-real-backend): real Next.js + real gateway (replay model) + Chromium; assert the replayed auto-title and follow-up suggestions render. DOM assertions are the gate; visual regression is a local dev gate (CI uploads the render as an artifact). - CI (.github/workflows/replay-e2e.yml): both layers, triggered on EITHER side of the contract (frontend/** or backend gateway/harness/fixtures). * test(e2e): multi-run render-order cross-stack scenario (#3352) Guards the dangerous front-back class where a backend ordering change silently breaks a frontend assumption while both sides' unit tests stay green. Reproduces issue #3352: backend list_by_thread returns runs newest-first (#2932) and the frontend prepended per-run pages, inverting chronological order once the checkpoint no longer held the older messages. - tests/seed_runs_router.py: test-only seeder, mounted on the replay gateway only when DEERFLOW_ENABLE_TEST_SEED=1 (never in the production app). Seeds a thread with >=2 runs + per-run message events and no checkpoint -- the #3352 precondition -- so the frontend per-run reload path is the sole source of truth and the prepend inversion is observable. - frontend/tests/e2e-real-backend/multi-run-order.spec.ts: drives the real frontend against the real gateway, asserts the first run renders above the second. Reverting the #3354 fix turns it red. - replay-e2e.yml: trigger on the new replay test-infra paths. - docs: REPLAY_E2E.md cross-stack scenario section. * test(e2e): address Copilot review on the replay harness - Fix stale recorder references (scripts/record_traces.py -> scripts/record_gateway.py + scripts/build_fixture_from_jsonl.py) in replay_provider.py, test_replay_golden.py, _replay_fixture.py. - MODE_CONTEXT['ultra']: thinking_enabled False -> True, mirroring the frontend's `context.mode !== 'flash'` (hooks.ts). It did not affect the hashed input (Layer 1 golden still green), but the table now matches the real frontend context it claims to mirror. - replay_provider.py docstring: stop claiming memory is recorded-enabled; the replay config disables memory/summarization for determinism (title stays, as an in-graph deterministic call). - record_gateway.py / run_replay_gateway.py: override DEER_FLOW_HOME instead of setdefault, so an outer value can't leak into the hermetic harness. - record_gateway.py: clear error when DEERFLOW_RECORD_OUT is unset (was a bare KeyError). - playwright.record.config.ts: forward OPENAI_/DEERFLOW_RECORD_OUT only when set, so the gateway raises a clear 'missing env' error instead of getting ''. test(e2e): address Copilot review round 2 - seed_runs_router.py: constrain SeedMessage.role to Literal['human','ai'] so a bad value is a clean 422 at the boundary instead of a 500 (KeyError on _EVENT_TYPE). - record-write-read-file.spec.ts: waitForCaptureStable now throws on timeout instead of returning the last count, so a truncated/partial recording can't pass silently. - real-backend-render.spec.ts: guard the suggestions JSON.parse; a bracket-prefixed non-JSON turn falls back to '' so the existing not.toBe('') assertion fails clearly instead of a generic parse throw.	2026-06-08 12:35:03 +08:00
Xinmin Zeng	8d2e55a05f	fix(subagent): structured subagent_status field over text parsing (#3146 ) (#3154 ) * fix(subagent): structured subagent_status field over text parsing Closes #3146. ## Why The frontend used to derive subtask card state by string-matching the leading text of the `task` tool's result. That contract surface was fragile — `#3107` BUG-007 and the `#3131` review both surfaced cases where new backend wording (`Task cancelled by user.`, `Task polling timed out after N minutes`, `ToolErrorHandlingMiddleware` exception wrappers) silently broke the card lifecycle. The frontend fallback kept growing more prefixes; any future rewording would break it again. ## Design 1. Backend → frontend contract: `ToolMessage.additional_kwargs` carries `subagent_status` (one of `completed \| failed \| cancelled \| timed_out \| polling_timed_out`) and an optional `subagent_error` blob. The frontend prefers it over parsing `content`. 2. Centralised stamping, not 8 sprinkled stamps: rather than have each of `task_tool.py`'s 5 normal-return + 3 pre-execution `Error:` paths remember to set `additional_kwargs`, `ToolErrorHandlingMiddleware` stamps the field after every task-tool call. Adding a new return path in `task_tool.py` cannot now skip the stamp. 3. Cross-language contract fixture: the prefix→status mapping is the one piece both sides must agree on. The shared fixture at `contracts/subagent_status_contract.json` lists every backend return string, the expected status, and what the error substring should contain. Backend test (`backend/tests/test_subagent_status_contract.py`) and frontend test (`frontend/tests/unit/core/tasks/subtask-result.test.ts`) both load that fixture and assert the same cases. A wording drift on either side fails the matching language's test. 4. Round-trip serialisation pinned: the round-trip test asserts `ToolMessage.model_dump_json()` → `model_validate_json()` preserves `additional_kwargs.subagent_status`. Catches the case where a future LangChain or Pydantic upgrade silently strips unknown kwargs. 5. Frontend status collapse documented: the backend has five status values, the frontend card has three (`completed \| failed \| in_progress`). `cancelled` / `timed_out` / `polling_timed_out` all collapse to `failed` with the original status preserved in `error`. `parseSubtaskResult` returns `in_progress` for unknown values so a backend that ships a new enum variant before the frontend upgrades degrades to the legacy prefix fallback instead of getting pinned. ## Changes Backend: - `deerflow.subagents.status_contract` — new module exporting `SUBAGENT_STATUS_KEY`, `SUBAGENT_ERROR_KEY`, `SUBAGENT_STATUS_VALUES`, `extract_subagent_status(content)`, and `make_subagent_additional_kwargs(status, error)`. - `ToolErrorHandlingMiddleware`: new `_stamp_task_subagent_status` helper centralises the stamp; `wrap_tool_call` / `awrap_tool_call` stamp on the success path; `_build_error_message` stamps on the wrapper path (carrying `ExcClass: detail` into `subagent_error`). Non-task tools are untouched. - New tests: `test_subagent_status_contract.py` (19 cases from the shared fixture + status-enum / blank-error / unknown-status rejection) and `test_tool_error_handling_subagent_stamp.py` (middleware integration: terminal-content stamps, non-terminal doesn't, non-task tools untouched, async path mirrors sync, existing additional_kwargs survive, JSON round-trip preserved). Frontend: - `parseSubtaskResult(text, additionalKwargs?)` — prefers the structured stamp; falls back to the legacy prefix matcher for historical threads / unknown future status values. - `STRUCTURED_STATUS_TO_SUBTASK` documents the five→three collapse. - `message-list.tsx` passes `message.additional_kwargs` through. - `subtask-result.test.ts` adds a structured-status block + a fixture-driven contract block; legacy prefix tests stay green for the fallback path. Contract: - `contracts/subagent_status_contract.json` — single source of truth both languages load. Whitespace variants, varied N for polling timeouts, the 3 pre-execution `Error:` returns task_tool produces, and the middleware wrapper shape are all in there. ## Test plan - `make lint` clean (backend + frontend). - `pytest tests/test_subagent_status_contract.py tests/test_tool_error_handling_subagent_stamp.py` → 37 passed. - `pnpm test --run` → 103 passed (was 76, +27 new). ## Migration / fallback retirement The text-prefix fallback stays in place until backend telemetry shows the frontend never hits it for newly produced messages. At that point a follow-up PR can drop the prefix branches and keep only the structured-status branch. Refs: bytedance/deer-flow#3138 (split summary), #3107 (origin), #3131 (prior prefix-only fix), #3146 (this issue). * fix(subtask): back-fill result/error from text when structured status present Three follow-ups on the PR #3154 review: 1. `readStructuredStatus` no longer short-circuits the prefix parse. The backend currently stamps only the `subagent_status` enum value; the human-facing `result` body and wrapped-error message still live in `ToolMessage.content`. Dropping the text parse meant successful tasks rendered empty completed pills and wrapped failures lost their diagnostic. Now both shapes get composed: structured status wins, `result`/`error` come from text when both sides agree, and a lying success body under a `failed` stamp is dropped instead of leaking. 2. Replace the ESM-incompatible `__dirname` fixture lookup in subtask-result.test.ts with `fileURLToPath(new URL(..., import.meta.url))`. The frontend package is `"type": "module"`, so the previous path would have thrown at runtime if anything ever changed under the contract directory. 3. Drop the `$schema` reference from contracts/subagent_status_contract.json pointing at a file that doesn't exist in the tree. Three new tests cover the structured + text composition: completed back-fills the success body, failed back-fills the wrapper text, and unrecognised content under a `failed` stamp stays empty rather than echoing noise.	2026-06-07 22:49:55 +08:00
Huixin615	9a53f9dfbb	fix(frontend): preserve chronological order of thread history after context compression (#3354 ) * fix(frontend): preserve chronological order of thread history after context compression Iterate runs from newest to match backend `list_by_thread` (newest-first) and the prepend semantics of the history loader, so refreshed history renders in A→B→C→D→E→F order. Fixes #3352 * fix(frontend): auto-continue loading runs with no visible messages after context compression	2026-06-03 21:51:48 +08:00
Eilen Shin	019bd16a06	fix: load paginated run history messages (#3305 )	2026-06-01 15:50:39 +08:00
Nan Gao	d46a5779bc	fix(chat): preserve messages after summarization (#3280 ) * fix(chat): preserve messages after summarization * make format * fix(chat): address summarization review comments	2026-05-29 08:24:47 +08:00
Xinmin Zeng	2ace78d1e5	fix(frontend): surface backend detail when agent name check fails (#3048 ) * fix(frontend): surface backend detail when agent name check fails The new-agent page caught AgentNameCheckError but only branched on reason === "backend_unreachable". Everything else (notably the 422 "Invalid agent name '...'. Must match ^[A-Za-z0-9-]+$" response from GET /api/agents/check when the user submits a name with disallowed characters — trailing space, dot, Chinese, invisible whitespace from copy-paste) fell through to the generic fallback "Could not verify name availability — please try again", swallowing the detail that already told the user exactly what to fix. Add a request_failed branch that surfaces err.message (which checkAgentName already populates from the backend's detail at core/agents/api.ts). The disabled / backend_unreachable / unknown- error paths are unchanged. Pin the contract with unit tests covering: 200 success, fetch rejection, 502/503/504 network errors, agents_api disabled detail, 422 validation detail carried verbatim, statusText fallback when detail is absent, and a regression guard against misclassifying a 422 as agents_api disabled. Closes #3041 * fix(frontend): localise the error prefix when surfacing backend detail The previous commit surfaced the backend's raw `err.message` on the new-agent page when the name check failed. The detail itself is English (backend's `_validate_agent_name` text, any 5xx business message, etc.) and dropping it bare into a zh-CN page produced a jarring English-among-Chinese line that didn't match neighbouring strings like "已存在同名智能体" / "无法验证名称可用性". Add `nameStepCheckErrorWithDetail` as a templated string ("Name check failed: {detail}" / "名称校验失败：{detail}"), mirroring the existing `nameStepBootstrapMessage` `{name}` template pattern. The page wraps `err.message` in it when present and falls back to the plain `nameStepCheckError` when the detail is empty. Rendered output (verified locally with a Console fetch mock that returns 500 + detail): zh-CN: 名称校验失败：Database connection lost: SQLAlchemy connection pool exhausted (max 5 connections, all in use) en-US: Name check failed: Database connection lost: SQLAlchemy connection pool exhausted (max 5 connections, all in use) The localised prefix tells the user what operation failed; the raw detail tells them why. Translating the detail itself would be lossy (any unbounded backend string would need a translation table) and would break the debuggability the previous commit delivered. Refs #3041 * fix(frontend): distinguish backend detail from generated fallback in AgentNameCheckError Addresses Copilot's review on #3048: the previous commits keyed off `err.message`, but `checkAgentName` substitutes a generated fallback string ("Failed to check agent name: ${statusText}") when the backend sent no detail. That guaranteed `err.message` was always truthy, made the `nameStepCheckError` fallback branch unreachable in practice, and could surface awkward strings like "名称校验失败：Failed to check agent name: Bad Gateway" in the UI. Add an explicit `detail: string \| null` field to AgentNameCheckError. `checkAgentName` populates it only when the backend response actually carried a string `detail` (defensive guard against the dict-shaped detail that other deer-flow endpoints use for typed error codes). The new-agent page now selects on `err.detail` instead of `err.message` so the localised fallback wins when no real detail exists. Also fix the prettier formatting that broke lint-frontend CI on the previous push. Test changes: - The 422 carry-through test now asserts both `detail` and `message` hold the backend string verbatim. - A new "falls back to statusText in message but leaves detail null" test pins the contract that no real detail ⇒ no UI surface leak. - A new "treats non-string detail as null" test guards against future backend schema drift toward dict-shaped detail. Refs #3041 #3048	2026-05-28 18:38:45 +08:00
Admire	2fdfff0db3	fix(frontend): fix Mermaid preview failure in historical messages (#3196 ) * fix(frontend): render historical mermaid diagrams * fix(frontend): address mermaid review feedback * Stabilize cancel lifecycle test * fix(frontend): handle mermaid fence variants * fix(frontend): normalize mermaid arrow spacing * fix(frontend): handle mermaid CRLF fences * chore: keep mermaid fix frontend-scoped --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-05-28 18:20:02 +08:00
zgenu	737abc0e45	fix: ignore stale run reconnect conflicts (#3284 ) * fix: ignore stale run reconnect conflicts * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * fix: ignore stale run reconnect conflicts --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-05-28 17:29:30 +08:00
Admire	f68bcb771c	fix(frontend): guard message copy clipboard access (#3211 ) * fix(frontend): guard message copy clipboard access * fix(frontend): reuse clipboard guard across copy actions	2026-05-26 09:37:51 +08:00
AochenShen99	11dd5b0683	fix(frontend): strip unclosed <think> tags from streaming AI content (#3218 ) * fix(frontend): strip unclosed <think> tags from streaming AI content During streaming, an opening <think> tag may arrive in one chunk while the matching </think> arrives in a later chunk. The existing splitInlineReasoning regex only matched fully closed pairs, so the mid-flight reasoning was left in message.content and rendered into the chat bubble via the markdown pipeline's rehypeRaw plugin until the closing tag landed. Extend splitInlineReasoning with a second pass: after stripping every closed <think>...</think> pair, route any remaining content from a lone opener to the reasoning slot and leave only the preceding preamble in content. Closed-tag behavior is unchanged. Covers every provider whose stream emits reasoning inline as <think> tags (MiniMax streaming path, MindIE, PatchedChatOpenAI, and any gateway-served DeepSeek/OpenAI-compatible model). * style(frontend): apply prettier formatting to streaming reasoning tests * fix(frontend): skip <think> split for literal think tags in inline code Treats a `<think>` opener immediately preceded by a backtick as part of markdown inline code rather than a streaming reasoning marker. Prevents permanent content truncation when an AI message documents the `<think>` tag literally (e.g. ``Use `<think>` markers``), where the streaming-safe fallback would otherwise route the rest of the answer into the reasoning panel because no `</think>` ever arrives. Adds regression tests for both the post-stream and mid-stream cases.	2026-05-26 09:35:07 +08:00
Admire	e7967a7fc3	fix(frontend): hide copy for streaming assistant turn (#3176 )	2026-05-23 23:29:16 +08:00
Admire	d0fa37e71d	fix(frontend): avoid duplicate optimistic user message (#3002 )	2026-05-23 17:02:23 +08:00
AochenShen99	604fcbb9d2	Stabilize write artifact previews (#3172 )	2026-05-23 16:56:14 +08:00
JeffJiang	b103d1a7f5	feat(frontend): support static website demo mode (#3170 ) * feat(frontend): support static website demo mode * fix(frontend): render html artifact previews from blob content * chore(frontend): apply pre-commit formatting * fix(frontend): address static demo PR review comments * Update the release information of DeerFlow --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-05-23 00:10:56 +08:00
Xinmin Zeng	e93f658472	fix(stability): resolve P0 blockers from v2.0-m1-rc1 stability audit (#3107 ) (#3131 ) * fix(task-tool): unwrap callback manager when locating usage recorder `config["callbacks"]` may arrive as a `BaseCallbackManager` (e.g. the `AsyncCallbackManager` LangChain hands to async tool runs), not just a plain list. The previous `for cb in callbacks` loop raised `TypeError: 'AsyncCallbackManager' object is not iterable`, which `ToolErrorHandlingMiddleware` then converted into a failed `task` ToolMessage even though the subagent had completed internally — Ultra mode lost subagent results and the lead agent fell back to redoing the work. Unwrap `BaseCallbackManager.handlers` before searching for the recorder. Refs: bytedance/deer-flow#3107 (BUG-002) * fix(frontend): treat any task tool error as a terminal subtask failure The subtask card status machine matched only three English prefixes (`Task Succeeded. Result:`, `Task failed.`, `Task timed out`). Anything else fell through to `in_progress`, so a `task` tool error wrapped by `ToolErrorHandlingMiddleware` (`Error: Tool 'task' failed ...`) left the card spinning forever even after the run had ended. Extract the prefix logic into `parseSubtaskResult` and recognise any leading `Error:` token as a terminal failure. The extracted function is unit-tested against the legacy prefixes plus the `AsyncCallbackManager` regression captured in the upstream issue. Refs: bytedance/deer-flow#3107 (BUG-007) * fix(frontend): exclude hidden, reasoning, and tool payloads from chat export `formatThreadAsMarkdown` / `formatThreadAsJSON` iterated raw messages without running the UI-level `isHiddenFromUIMessage` filter. Exported transcripts therefore included `hide_from_ui` system reminders, memory injections, provider `reasoning_content`, tool calls, and tool result messages — content that is intentionally hidden in the chat view. Filter the export to the user-visible transcript by default and gate reasoning / tool calls / tool messages / hidden messages behind explicit `ExportOptions` flags so a future debug export can opt back in without forking the formatter. Refs: bytedance/deer-flow#3107 (BUG-006) * fix(gateway): route get_config through get_app_config for mtime hot reload `get_config(request)` returned the `app.state.config` snapshot captured at startup. The worker / lead-agent path then threaded that frozen `AppConfig` through `RunContext` and `agent_factory`, so per-run fields edited in `config.yaml` (notably `max_tokens`) were ignored until the gateway process was restarted — even though `get_app_config()` already does mtime-based reload at the bottom layer. Route the request dependency through `get_app_config()` directly. Runtime `ContextVar` overrides (`push_current_app_config`) and test-injected singletons (`set_app_config`) keep working; `app.state.config` is now only read at startup for one-shot bootstrap (logging level, IM channels, `langgraph_runtime` engines). `tests/test_gateway_deps_config.py` encoded the old snapshot contract and is removed; `tests/test_gateway_config_freshness.py` replaces it with mtime, ContextVar, and `set_app_config` coverage. `test_skills_custom_router.py` and `test_uploads_router.py` now inject test configs via FastAPI `dependency_overrides[get_config]` instead of mutating `app.state.config`. Document the hot-reload boundary in `backend/CLAUDE.md` so reviewers know which fields are picked up on the next request vs. which still require a restart (`database`, `checkpointer`, `run_events`, `stream_bridge`, `sandbox.use`, `log_level`, `channels.`). Refs: bytedance/deer-flow#3107 (BUG-001) fix(gateway): broaden get_config 503 to any config-load failure Address review feedback on the previous commit: 1. Narrow exception catch removed. The old contract returned 503 whenever `app.state.config is None`. The first cut only mapped `FileNotFoundError`, leaving `PermissionError`, YAML parse errors, and pydantic `ValidationError` to bubble up as 500. At the request boundary we treat any inability to materialise the config as "configuration not available" (503) and log the original exception so the operator still has the stack. 2. Removed the unused `request: Request` parameter and the matching `# noqa: ARG001`. FastAPI's `Depends()` does not require the dependency to accept `Request`; the only call site uses the no-arg form. 3. `backend/CLAUDE.md` boundary now lists the reason each field is restart-required (engine binding, singleton caching, one-shot `apply_logging_level`, etc.), not just the field name, so reviewers do not have to reverse-engineer the boundary themselves. Tests parametrise four exception classes (`FileNotFoundError`, `PermissionError`, `ValueError`, `RuntimeError`) and assert 503 for each. Refs: bytedance/deer-flow#3107 (BUG-001) * fix(task-tool): defend _find_usage_recorder against non-list callbacks Address review feedback. The previous commit handled the two common shapes LangChain hands to async tool runs — a plain `list[BaseCallbackHandler]` and a `BaseCallbackManager` subclass — but iterated any other shape directly, which would still raise `TypeError` if e.g. a single handler instance leaked through without a list wrapper. Treat any non-list, non-manager `config["callbacks"]` value as "no recorder" rather than crash. Docstring now lists all four shapes explicitly. New tests cover the single-handler-object case, `runtime is None`, `callbacks is None`, and `runtime.config` being a non-dict — all required to be silent no-ops. Refs: bytedance/deer-flow#3107 (BUG-002) * fix(frontend): drop dead identity ternary and add opt-in export tests Address review feedback on the previous export commit: 1. Removed the no-op `typeof msg.content === "string" ? msg.content : msg.content` expression in `formatThreadAsJSON`. Both branches returned the same value; the message content now flows through unchanged whether it is a string or the rich `MessageContent[]` shape (LangChain JSON-serialises the array structure correctly already). 2. Expanded the JSDoc on `ExportOptions` to make it clearer that the four flags are not currently wired to any UI control — callers wanting a debug export must build the options object explicitly. The default behaviour continues to match the explicit prescription in bytedance/deer-flow#3107 BUG-006. 3. Added opt-in coverage. The previous tests only exercised the `options = {}` default path; the new cases verify each flag flips the corresponding payload back into the export so a future debug-export surface does not silently break the contract. Refs: bytedance/deer-flow#3107 (BUG-006) * fix(frontend): export subtask prefix constants and document fallback intent Address review feedback on the previous BUG-007 commit: 1. `SUCCESS_PREFIX`, `FAILURE_PREFIX`, `TIMEOUT_PREFIX`, and the `ERROR_WRAPPER_PATTERN` regex are now exported. The JSDoc explicitly pins them as part of the backend↔frontend contract defined in `task_tool.py` and `tool_error_handling_middleware.py`, so any future structured-status migration (e.g. backend writing `additional_kwargs.subagent_status` instead of leading text) can reference these from one canonical place rather than redefine them. 2. The `in_progress` fallback now carries a docstring explaining the deliberate choice — LangChain only ever emits a `ToolMessage` once the tool itself has returned, so unrecognised content means the contract has drifted and "still running" is the right operator signal (eagerly marking it terminal-failed would mask the drift). No behaviour change; this is documentation and an API export. Refs: bytedance/deer-flow#3107 (BUG-007) * fix(gateway): drop app.state.config snapshot and freeze run_events_config Address @ShenAC-SAC's BUG-001 review on #3131. The previous cut still stored an ``AppConfig`` snapshot on ``app.state.config`` for startup bootstrap. Two follow-on hazards from that: 1. Future code touching the gateway lifespan could accidentally start reading ``app.state.config`` again, silently regressing the request hot path back to a stale snapshot. 2. ``get_run_context()`` paired a freshly-reloaded ``AppConfig`` with the startup-bound ``event_store`` and a live ``run_events_config`` field — so an operator who edited ``run_events.backend`` mid-flight would have produced a run context whose ``event_store`` and ``run_events_config`` referred to different backends. Clean approach (aligned with the direction in PR #3128): - ``lifespan()`` keeps a local ``startup_config`` variable and passes it explicitly into ``langgraph_runtime(app, startup_config)`` and into ``start_channel_service``. No ``app.state.config`` attribute is set at any point. - ``langgraph_runtime`` now accepts ``startup_config`` as a required parameter, removing the ``getattr(app.state, "config", None)`` lookup and the "config not initialised" runtime error. - The matching ``run_events_config`` is frozen onto ``app.state`` next to ``run_event_store`` so ``get_run_context`` reads the two from the same startup-time source. ``app_config`` continues to be resolved live via ``get_app_config()``. - ``backend/CLAUDE.md`` boundary explanation updated to spell out the ``startup_config`` / ``get_app_config()`` split. New regression test ``test_run_context_app_config_reflects_yaml_edit`` exercises the worker-feeding path: it asserts that ``ctx.app_config`` follows a mid-flight ``config.yaml`` edit while ``ctx.run_events_config`` stays frozen to the startup snapshot the event store was built from. Refs: bytedance/deer-flow#3107 (BUG-001), bytedance/deer-flow#3131 review * fix(frontend): parse Task cancelled and polling timed out as terminal Address @ShenAC-SAC's BUG-007 review on #3131. `task_tool.py` actually emits five terminal strings: - `Task Succeeded. Result: …` - `Task failed. …` - `Task timed out. …` - `Task cancelled by user.` ← previously matched none - `Task polling timed out after N minutes …` ← previously matched none The previous cut handled three; the last two fell through to the "unknown content" branch and pushed the subtask card back to `in_progress` even though the backend had already reached a terminal state. Add explicit matches plus regression tests for both. The `in_progress` fallback is now reserved for genuinely unrecognised output (i.e. contract drift), as documented. Refs: bytedance/deer-flow#3107 (BUG-007), bytedance/deer-flow#3131 review * fix(frontend): sanitize JSON export content via the Markdown content path Address @ShenAC-SAC's BUG-006 review and the Copilot inline comment on #3131. The previous cut filtered hidden/tool messages out of the JSON export but still serialised `msg.content` verbatim, so: - inline `<think>…</think>` wrappers stayed in the exported `content` even with `includeReasoning: false`, - content-array thinking blocks leaked the `thinking` field, - `<uploaded_files>…</uploaded_files>` markers leaked the workspace paths a user uploaded files to. JSON now goes through the same sanitiser the Markdown path uses (`extractContentFromMessage` + `stripUploadedFilesTag`). Reasoning and tool_calls remain gated behind their `ExportOptions` flags. AI / human rows that sanitise to empty content with no opted-in reasoning or tool calls are dropped so the JSON matches the Markdown path's `continue` on empty assistant fragments. New regression tests cover the three leak shapes the reviewer called out plus the empty-content-drop case. Refs: bytedance/deer-flow#3107 (BUG-006), bytedance/deer-flow#3131 review * test(gateway): align lifespan stub with langgraph_runtime two-arg signature Codex round-3 review of c0bc7a06 flagged this: changing `langgraph_runtime` to require `startup_config` as a second positional argument broke the one-arg stub `_noop_langgraph_runtime(_app)` in `test_gateway_lifespan_shutdown.py`, which is patched into `app.gateway.app.langgraph_runtime` by the lifespan shutdown bounded-timeout regression. Lifespan would then call the stub with two args and raise `TypeError` before the bounded-shutdown assertion ran. Update the stub to match the new signature. The shutdown test itself is unaffected — it only cares about the channel `stop_channel_service` hang path. Refs: bytedance/deer-flow#3107 (BUG-001), bytedance/deer-flow#3131 review * fix(frontend): strip every known backend marker in export, not just uploads Codex round-3 review of 258ca800 and the matching maintainer feedback on PR #3131 made the same point: the JSON export now ran the Markdown-side sanitiser, but that sanitiser only stripped `<uploaded_files>`. The full set of payloads middleware embeds inside message `content` is larger: - `<uploaded_files>` — `UploadsMiddleware` - `<system-reminder>` — `DynamicContextMiddleware` - `<memory>` — `DynamicContextMiddleware` (nested inside system-reminder) - `<current_date>` — `DynamicContextMiddleware` The primary protection is still `isHiddenFromUIMessage`: the `<system-reminder>` HumanMessage is marked `hide_from_ui: true` and never reaches the formatter. This commit adds the second line of defence so a regression that drops the `hide_from_ui` flag — or any future middleware that injects the same tag vocabulary into a visible HumanMessage — cannot leak the payload into the export file. Concrete changes: - New `INTERNAL_MARKER_TAGS` constant + `stripInternalMarkers(content)` helper in `core/messages/utils.ts`. The constant doubles as documentation for the backend↔frontend contract. - `formatMessageContent` in `export.ts` now calls `stripInternalMarkers` instead of `stripUploadedFilesTag`. UI render paths (`message-list-item.tsx`) keep using the narrower function so a user legitimately typing `<memory>` in a meta-discussion is preserved. - The "drop empty rows" guard in `buildJSONMessage` switched from `=== undefined` to truthy `!` checks. Codex spotted the asymmetry: when `extractReasoningContentFromMessage` returned the empty string (which it legitimately can), the JSON path emitted `{reasoning: ""}` while the Markdown path's `!reasoning` `continue` correctly dropped the row. New regression tests cover the defence-in-depth strip with a `<system-reminder><memory><current_date>` payload deliberately not marked `hide_from_ui`; tool-message sanitization under `includeToolMessages: true`; the mixed-content-array case (`thinking + text + image_url`); and the opted-in empty-reasoning drop. Live verification on a real Ultra-mode thread that uploaded a PDF (`曾鑫民-薪资交易流水.pdf`): backend state's first HumanMessage carries the `<uploaded_files>` block (with `/mnt/user-data/uploads/...` paths) as part of a content-array. The Markdown and JSON export blobs both come back free of `<uploaded_files>`, `<system-reminder>`, `<current_date>`, `tool_calls`, and reasoning — while preserving the user's `这是什么？` prompt and the assistant's visible answer. Refs: bytedance/deer-flow#3107 (BUG-006), bytedance/deer-flow#3131 review * test(frontend): cover trim, varied N, and pre-execution Error: prefixes Codex round-3 review of 50e2c257 flagged three coverage gaps in the subtask-status parser: 1. `Task cancelled by user.` and `Task polling timed out` previously had no whitespace-trim coverage — the original trim test only exercised the success prefix. Streaming chunks can arrive with leading/trailing newlines; the regex needed an explicit assertion. 2. The polling-timeout case was tested only at one `N` (15 minutes). The backend interpolates the live `timeout_seconds // 60` value, so the matcher must hold for any positive integer. Now we run the case for 1, 5, and 60 minutes. 3. `task_tool.py` also emits three `Error:` strings for pre-execution failures — unknown subagent type, host-bash disabled, and "task disappeared from background tasks". They are intentionally handled by `ERROR_WRAPPER_PATTERN` rather than dedicated prefixes (the wrapper already produces the right terminal-failed shape) but had no test coverage proving that wiring. Codex was right that a refactor splitting one of them off into its own prefix would silently break things. The JSDoc on the constants block now spells the three pre-execution errors out so the relationship between `task_tool.py` returns and the prefix vocabulary is explicit. No production code change beyond the docstring — this commit is pure coverage hardening for the contract that already exists. Refs: bytedance/deer-flow#3107 (BUG-007), bytedance/deer-flow#3131 review	2026-05-21 21:18:10 +08:00
Nan Gao	6d3cffb4f0	fix(frontend): deduplicate restored thread messages (#2958 ) * fix(frontend): fix duplicate messages when reopening agent sessions (#2957) * make format * fix(frontend): retry pending thread history loads	2026-05-16 08:48:19 +08:00
Admire	7c42ab3e16	fix(frontend): wait for async chat submit before clearing (#2940 ) * fix(frontend): wait for async chat submit before clearing * test(frontend): cover pending attachment uploads * fix(frontend): preserve sync submit semantics	2026-05-15 22:27:10 +08:00
Nan Gao	0c37509b38	fix(middleware): Prevent todo completion reminder IMMessage leak (#2907 ) * fix(middleware): Prevent todo completion reminder IMMessage leak (#2892) * make format * fix(middleware): Clear stale todo reminder counts (#2892) * add size guard for _completion_reminder_counts and add a integration test	2026-05-15 22:12:37 +08:00
YuJitang	9892a7d468	fix: bucket subagent token usage into parent run totals (#2838 ) * fix: bucket subagent token usage into RunRow.subagent_tokens Add caller-bucketed token tracking to RunJournal so subagent and middleware LLM calls are written to the correct RunRow columns instead of all falling into lead_agent_tokens (default 0). - RunJournal: accumulate _lead_agent_tokens / _subagent_tokens / _middleware_tokens in on_llm_end, deduped by langchain run_id. Add record_external_llm_usage_records() for external sources (respects track_token_usage flag). Return caller buckets from get_completion_data(). - SubagentTokenCollector: new lightweight callback handler that collects LLM usage within subagent execution. - SubagentExecutor: wire collector into subagent run_config and sync records to SubagentResult on every chunk (timeout/cancel safe). - SubagentResult: add token_usage_records and usage_reported fields. - task_tool: report subagent usage to parent RunJournal on every terminal status (COMPLETED/FAILED/CANCELLED/TIMED_OUT), including the CancelledError path, guarded against double-reporting. No DB migration needed — RunRow columns already exist. * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * fix: address token usage review feedback * Address review follow-ups --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-05-10 22:47:30 +08:00
YuJitang	5127f08e1a	enable token usage by default (#2841 )	2026-05-10 22:00:57 +08:00
YuJitang	417416087b	fix: use backend thread token usage for header total (#2800 ) * fix: use backend thread token usage for header total * Refactor thread token usage fetch	2026-05-09 19:40:32 +08:00
YuJitang	530bda7107	fix: dedupe token usage aggregation by message id (#2770 )	2026-05-08 09:54:20 +08:00
Xinmin Zeng	27559f3675	fix(frontend): defer thread id to onStart to avoid 404 on new chat (#2749 ) * fix(frontend): defer thread id to onStart to avoid 404 on new chat The LangGraph SDK's useStream eagerly fetches /threads/{id}/history the moment it receives a thread id, and the local useThreadRuns issues GET /threads/{id}/runs for the same reason. The chats page used to flip isNewThread=false (and forward the client-generated thread id) inside the synchronous onSend callback, before thread.submit had created the thread on the backend. The two queries therefore raced ahead of POST /runs/stream and returned 404 on the very first send. Drop the onSend handler so isNewThread stays true until onStart fires from useStream's onCreated — by then the backend has the thread, and the SDK's submittingRef guard naturally suppresses the redundant history fetch. The agent chat page already uses this pattern, so this also unifies the two flows. Adds an E2E regression that records request ordering and asserts GET /history and GET /runs are never issued before POST /runs/stream on the first send from /chats/new. Closes #2746 * fix(frontend): split welcome layout from backend thread state Removing onSend kept GET /history and GET /runs from racing ahead of POST /runs/stream, but it also coupled the welcome layout (centered input, hero, quick actions) to backend thread creation. Until onCreated returned, the user's optimistic message and the welcome hero rendered on top of each other. Introduce a dedicated `isWelcomeMode` UI flag, separate from `isNewThread`: - `isNewThread` still tracks "backend has no thread yet" and gates the thread id forwarded to useStream. - `isWelcomeMode` drives the visual layout (header background, input box position, max width, hero, quick actions, autoFocus) and flips to false inside onSend so the layout animates immediately. `isWelcomeMode` is kept in sync with `isNewThread` via an effect so sidebar navigation and "new chat" still behave correctly. All 15 E2E tests pass, including the ordering regression added in the previous commit. * test(e2e): use monotonic sequence for thread-init ordering check Date.now() is millisecond-resolution, so two requests emitted within the same tick would share a timestamp and slip past the strict `<` ordering assertions. Replace the timestamp with a monotonic counter that increments on every observed request/requestfinished event so the ordering check is robust regardless of scheduling. Per PR #2749 review feedback from copilot-pull-request-reviewer. * refactor(input-box): rename isNewThread prop to isWelcomeMode Inside InputBox, the prop named `isNewThread` is only ever consulted for visual layout decisions — gating follow-up suggestions, the bottom background strip, and the welcome-mode quick-action SuggestionList. It never reflects "the backend has created the thread", which after #2746 is tracked separately via `isNewThread` in the chat pages themselves. Rename the prop to `isWelcomeMode` and update both call sites (workspace chats page and agent chats page) so the prop name matches its actual semantics. No behavior change. Per PR #2749 review feedback from @WillemJiang.	2026-05-07 16:11:44 +08:00
Xinmin Zeng	aded753de3	fix(frontend): restore localhost fallback for getGatewayConfig in prod mode (#2705 ) (#2718 ) * fix(frontend): unify gateway-config localhost fallback for prod (#2705) `getGatewayConfig()` only fell back to localhost defaults when `NODE_ENV === "development"`, while `next.config.js` always falls back to `127.0.0.1:8001`. Running `make start` (which sets NODE_ENV=production via `next start`) without `DEER_FLOW_INTERNAL_GATEWAY_BASE_URL` / `DEER_FLOW_TRUSTED_ORIGINS` therefore caused zod to throw inside SSR layouts and surfaced as a 500. Drop the NODE_ENV gating and use localhost defaults everywhere — the "force explicit config in prod" intent should be enforced by deployment templates (docker-compose already sets both vars), not by request-time crashes. Document the two vars in both .env.example files and add unit coverage for the dev/prod env-unset paths. * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * Update internalGatewayUrl in gateway config tests --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-05-05 16:27:29 +08:00
YuJitang	d02f762ab0	feat: refine token usage display modes (#2329 ) * feat: refine token usage display modes * docs: clarify token usage accounting semantics * fix: avoid duplicate subtask debug keys * style: format token usage tests * chore: address token attribution review feedback * Update test_token_usage_middleware.py * Update test_token_usage_middleware.py * chore: simplify token attribution fallback * fix token usage metadata follow-up handling --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-05-04 09:56:16 +08:00
yangzheli	748429ef0d	fix(frontend): add missing mock routes for runs-list, models, and suggestions (#2578 ) * fix(frontend): add missing mock routes for runs-list, models, and suggestions * Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-04-26 23:29:59 +08:00
Copilot	ef04174194	Fix invalid HTML nesting in reasoning trigger during complex task rendering (#2382 ) * Initial plan * fix(frontend): avoid invalid paragraph nesting in reasoning trigger Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/4c9eb0c2-ff29-4629-a61c-4e33d736d918 Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com> * test(frontend): strengthen reasoning trigger DOM nesting assertion Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/4c9eb0c2-ff29-4629-a61c-4e33d736d918 Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>	2026-04-21 09:41:28 +08:00
Xun	7c87dc5bca	fix(reasoning): prevent LLM-hallucinated HTML tags from rendering as DOM elements (#2321 ) * fix * add test * fix	2026-04-19 19:27:34 +08:00
yangzheli	c6b0423558	feat(frontend): add Playwright E2E tests with CI workflow (#2279 ) * feat(frontend): add Playwright E2E tests with CI workflow Add end-to-end testing infrastructure using Playwright (Chromium only). 14 tests across 5 spec files cover landing page, chat workspace, thread history, sidebar navigation, and agent chat — all with mocked LangGraph/Backend APIs via network interception (zero backend dependency). New files: - playwright.config.ts — Chromium, 30s timeout, auto-start Next.js - tests/e2e/utils/mock-api.ts — shared API mocks & SSE stream helpers - tests/e2e/{landing,chat,thread-history,sidebar,agent-chat}.spec.ts - .github/workflows/e2e-tests.yml — push main + PR trigger, paths filter Updated: package.json, Makefile, .gitignore, CONTRIBUTING.md, frontend/CLAUDE.md, frontend/AGENTS.md, frontend/README.md Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: apply Copilot suggestions --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-04-18 08:21:08 +08:00
yangzheli	4efc8d404f	feat(frontend): set up Vitest frontend testing infrastructure with CI workflow (#2147 ) * feat: set up Vitest frontend testing infrastructure with CI workflow Migrate existing 4 frontend test files from Node.js native test runner (node:test + node:assert/strict) to Vitest, reorganize test directory structure under tests/unit/ mirroring src/ layout, and add a dedicated CI workflow for frontend unit tests. - Add vitest as devDependency, remove tsx - Create vitest.config.ts with @/ path alias - Migrate tests to Vitest API (test/expect/vi) - Rename .mjs test files to .ts - Move tests from src/ to tests/unit/ (mirrors src/ layout) - Add frontend/Makefile `test` target - Add .github/workflows/frontend-unit-tests.yml (parallel to backend) - Update CONTRIBUTING.md, README.md, AGENTS.md, CLAUDE.md Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * style: fix the lint error * style: fix the lint error --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-04-12 18:00:43 +08:00

31 Commits