* feat: real-time subagent token usage display in header and per-turn
Backend:
- Persist subagent token usage to AIMessage.usage_metadata via
TokenUsageMiddleware, so accumulateUsage() naturally includes
subagent tokens without frontend state management
- Cache subagent usage by tool_call_id in task_tool, write back
to the dispatching AIMessage on next model response
- Emit subagent token usage on all terminal task events
(task_completed, task_failed, task_cancelled, task_timed_out)
- Report subagent usage to parent RunJournal for API totals
- Search backward from ToolMessage to find dispatching AIMessage
for correct multi-tool-call attribution
Frontend:
- Remove subagentUsage state, custom event handling, and prop
threading — subagent tokens are now embedded in message metadata
- Simplify selectHeaderTokenUsage (no subagentUsage parameter)
- Per-turn inline badges show turn-specific usage via message
accumulation
- Remove isLoading guard from MessageTokenUsageList for dynamic
updates during streaming
* fix: prevent header token double counting from baseline reset race
onFinish, onError, and thread-switch useEffect all reset
pendingUsageBaselineMessageIdsRef to an empty Set. If
thread.isLoading is still true on the next render, all messages
pass the getMessagesAfterBaseline filter and their tokens are
added to backendUsage (which already includes them), causing
the header to display up to 2× the actual token count.
Capture current message IDs instead of using an empty Set so
that getMessagesAfterBaseline correctly returns no pending
messages even if thread.isLoading lags behind the stream end.
* fix: write back subagent tokens for all concurrent task tool calls
TokenUsageMiddleware only processed messages[-2], so when a
single model response dispatched multiple task tool calls only
the last ToolMessage had its cached subagent usage written back
to the dispatch AIMessage.usage_metadata. Earlier tasks' usage
stayed in _subagent_usage_cache indefinitely (leak) and never
appeared in the per-turn inline token display.
Walk backward through all consecutive ToolMessages before the
new AIMessage, and accumulate updates targeting the same
dispatch message into one state update so overlapping writes
don't clobber each other.
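The backward walk can be sketched as follows — a stdlib-only sketch where plain dicts stand in for the real langchain ToolMessage/AIMessage objects, and the function name is illustrative:

```python
def collect_subagent_writebacks(
    messages: list[dict], cache: dict[str, dict]
) -> dict[int, dict]:
    """Walk backward over the consecutive ToolMessages that precede the new
    AIMessage (messages[-1]) and accumulate cached subagent usage per
    dispatching-AIMessage index, so concurrent tool calls from one model
    response merge into a single state update instead of clobbering."""
    updates: dict[int, dict[str, int]] = {}
    i = len(messages) - 2          # skip the new AIMessage at the end
    while i >= 0 and messages[i]["type"] == "tool":
        usage = cache.pop(messages[i]["tool_call_id"], None)
        if usage is not None:
            # Search backward for the AIMessage that dispatched this call.
            j = i - 1
            while j >= 0 and messages[j]["type"] != "ai":
                j -= 1
            if j >= 0:
                bucket = updates.setdefault(
                    j, {"input_tokens": 0, "output_tokens": 0}
                )
                for k in bucket:
                    bucket[k] += usage.get(k, 0)
        i -= 1
    return updates
```

With two cached tool calls dispatched by one AIMessage, both end up summed into a single update for that message, and both cache entries are drained.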
* fix: clean up subagent usage cache entry on task cancellation
When a task_tool invocation is cancelled via CancelledError, any
cached subagent usage entry leaked because the TokenUsageMiddleware
writeback path never fires after cancellation. Pop the cache entry
before re-raising to prevent unbounded growth of the module-level
_subagent_usage_cache dict.
* fix: address token usage review feedback
* fix: handle missing config for subagent usage cache
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix(tools): preserve tool_search promotions across re-entrant get_available_tools
Closes #2884.
``get_available_tools`` used to unconditionally call
``reset_deferred_registry()`` and rebuild a fresh ``DeferredToolRegistry``
on every invocation. That works for the first call of a request (the
ContextVar starts at its default of ``None``), but any re-entrant call
during the same async context — e.g. ``task_tool`` building a subagent's
toolset, or a custom middleware that rebuilds tools mid-run — wiped any
``tool_search`` promotions the parent agent had already made. The
``DeferredToolFilterMiddleware`` would then re-hide those tools from the
next model call, leaving the agent able to see a tool's name (via the
prior ``tool_search`` result that's still in conversation history) but
unable to invoke it.
Fix: when the ContextVar already holds a registry, reuse it instead of
rebuilding. Fresh requests still get a fresh registry because each new
graph run starts in a new asyncio task with the ContextVar at ``None``.
## Verification
- Unit-level reproduction (``test_get_available_tools_resets_registry_wiping_promotion``):
promote a tool in the registry, call ``get_available_tools`` again, assert
the promotion is preserved. Fails on main, passes on this branch.
- Graph-execution reproduction (two tests): drive a real
``langchain.agents.create_agent`` graph with the real
``DeferredToolFilterMiddleware`` through two model turns, including one
that issues a re-entrant ``get_available_tools`` call to simulate the
task_tool subagent path.
- Real-LLM end-to-end (``test_deferred_tool_promotion_real_llm.py``,
opt-in via ``ONEAPI_E2E=1``): drives the same flow against a real
OpenAI-compatible model (verified on GPT-5.4-mini through the one-api
gateway), watches the model call the promoted ``fake_calculator``
through the deferred-filter middleware, and asserts the right arithmetic
result. Passes against the fixed branch.
- Companion update to ``test_tool_deduplication.py``: dropped the
``@patch("deerflow.tools.tools.reset_deferred_registry")`` decorators
because the symbol is no longer imported there.
- Test fixtures in the new files patch ``deerflow.tools.tools.get_app_config``
with a minimal ``model_construct``-ed ``AppConfig`` instead of calling
the real loader, so they never trigger ``_apply_singleton_configs`` and
never leak ``_memory_config``/``_title_config``/… mutations into the
rest of the suite.
Full backend suite: 3208 passed / 14 skipped / 0 failed. ruff check + format clean.
* fix(tools): address Copilot review on #2885
- tools.py: rewrite the reuse-path comment to spell out (a) why we don't
reconcile the registry against the current ``mcp_tools`` snapshot — the
MCP cache doesn't refresh mid-graph-run, the lead agent's ``ToolNode``
is already bound to the previous tool set anyway, and ``promote()``
drops the entry so a naive re-sync misclassifies promotions as new
tools — and (b) why the log uses ``max(0, …)`` to avoid negative
counts when the cache shrinks between snapshots.
- Replace direct ``ts_mod._registry_var.set(None)`` in test fixtures with
the public ``reset_deferred_registry()`` helper so tests don't couple
to module internals.
- Correct the docstring path in ``test_deferred_tool_registry_promotion.py``
to match the actual monkeypatch target (``deerflow.mcp.cache.get_cached_mcp_tools``).
- Rename
``test_get_available_tools_resets_registry_wiping_promotion`` to
``test_get_available_tools_preserves_promotions_across_reentrant_calls``
so the test name describes the contract being asserted, not the bug it
originally reproduced.
Full backend suite: 3208 passed / 14 skipped. Real-LLM e2e: 1 passed.
* perf(harness): push thread metadata filters into SQL
Replace Python-side metadata filtering (5x overfetch + in-memory match)
with database-side json_extract predicates so LIMIT/OFFSET pagination
is exact regardless of match density.
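The push-down can be illustrated with stdlib sqlite3 — the schema and column names here are illustrative, not the project's actual tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE threads (thread_id TEXT, metadata_json TEXT)")
con.executemany(
    "INSERT INTO threads VALUES (?, ?)",
    [
        ("t1", '{"status": "open"}'),
        ("t2", '{"status": "closed"}'),
        ("t3", '{"status": "open"}'),
    ],
)
# Filtering in SQL makes LIMIT/OFFSET exact: no 5x overfetch and no
# Python-side re-match regardless of how many rows actually match.
rows = con.execute(
    "SELECT thread_id FROM threads"
    " WHERE json_extract(metadata_json, '$.status') = ?"
    " ORDER BY thread_id LIMIT ? OFFSET ?",
    ("open", 10, 0),
).fetchall()
```

The database sees only matching rows, so page boundaries line up with match positions instead of raw row positions.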
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix(harness): add dialect-aware JsonMatch compiler for type-safe metadata SQL filters
Replace SQLAlchemy JSON index/comparator APIs with a custom JsonMatch
ColumnElement that compiles to json_type/json_extract on SQLite and
jsonb_typeof/->>/-> on PostgreSQL. Tighten key validation regex to
single-segment identifiers, handle None/bool/numeric value types with
json_type-based discrimination, and strengthen test coverage for edge
cases and discriminability.
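A stdlib-only sketch of the dialect split — the real code compiles a custom SQLAlchemy ColumnElement via per-dialect ``@compiles`` hooks; this function only shows the SQL shape each dialect gets:

```python
def compile_json_match(dialect: str, col: str, key: str, value: object) -> str:
    """Render the dialect-specific predicate for a metadata filter."""
    if dialect == "sqlite":
        if value is None:
            # Discriminate JSON null via type, not value comparison.
            return f"json_type({col}, '$.{key}') = 'null'"
        if isinstance(value, bool):
            lit = "true" if value else "false"
            return f"json_type({col}, '$.{key}') = '{lit}'"
        # Strings and numbers compare directly against the extracted value.
        return f"json_extract({col}, '$.{key}') = ?"
    if dialect == "postgresql":
        if value is None:
            return f"json_typeof({col} -> '{key}') = 'null'"
        if isinstance(value, bool):
            lit = "true" if value else "false"
            return (
                f"json_typeof({col} -> '{key}') = 'boolean'"
                f" AND ({col} ->> '{key}') = '{lit}'"
            )
        return f"({col} ->> '{key}') = %(v)s"
    raise NotImplementedError(dialect)
```

SQLite gets ``json_type``/``json_extract``, PostgreSQL gets ``json_typeof``/``->>``/``->``, and None/bool values are discriminated by JSON type rather than coerced to strings.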
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix(harness): address Copilot review comments on JSON metadata filters
- Use json_typeof instead of jsonb_typeof in PostgreSQL compiler; the
metadata_json column is JSON not JSONB so jsonb_typeof would error at
runtime on any PostgreSQL backend
- Align _is_safe_json_key with json_match's _KEY_CHARSET_RE so keys
containing hyphens or leading digits are not silently skipped
- Add thread_id as secondary ORDER BY in search() to make pagination
deterministic when updated_at values collide; remove asyncio.sleep
from the pagination regression test
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* fix(harness): address remaining review comments on metadata SQL filters
- Remove _is_safe_json_key() and reuse json_match ValueError to avoid
validator drift (Copilot #3217603895, #3217411616)
- Raise ValueError when all metadata keys are rejected so callers never
get silent unfiltered results (WillemJiang)
- Fix integer precision: split int/float branches, bind int as Integer()
with INTEGER/BIGINT CAST instead of float() coercion (Copilot #3217603972)
- Fix jsonb_typeof -> json_typeof on JSON column (Copilot #3217411579)
- Replace manual _cleanup() calls with async yield fixture so teardown
always runs (Copilot #3217604019)
- Remove asyncio.sleep(0.01) pagination ordering; use thread_id secondary
sort instead (Copilot #3217411636)
- Add type annotations to _bind/_build_clause/_compile_* and remove EOL
comments from _Dialect fields (coding.mdc)
- Expand test coverage: boolean/null/mixed-type/large-int precision,
partial unsafe-key skip with caplog assertion
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(harness): address third-round Copilot review comments on JsonMatch
- Reject unsupported value types (list, dict, ...) in JsonMatch.__init__
with TypeError so inherit_cache=True never receives an unhashable value
and callers get an explicit error instead of silent str() coercion
(Copilot #3217933201)
- Upgrade int bindparam from Integer() to BigInteger() to align with
BIGINT CAST and avoid overflow on large integers (Copilot #3217933252)
- Catch TypeError alongside ValueError in search() so non-string metadata
keys are warned and skipped rather than raising unexpectedly
(Copilot #3217933300)
- Add three tests: json_match rejects unsupported value types, search()
warns and raises on non-string key, search() warns and raises on
unsupported value type
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(harness): address fourth-round Copilot review comments on JsonMatch
- Add CASE WHEN guard for PostgreSQL integer matching: json_typeof returns
'number' for both ints and floats; wrap CAST in CASE with regex guard
'^-?[0-9]+$' so float rows never trigger CAST error (Copilot #3218413860)
- Validate isinstance(key, str) before regex match in JsonMatch.__init__
so non-string keys raise ValueError consistently instead of TypeError
from re.match (Copilot #3218413900)
- Include exception message in metadata filter skip warning so callers
can distinguish invalid key from unsupported value type (Copilot #3218413924)
- Update tests: assert CASE WHEN guard in PG int compilation, cover
non-string key ValueError in test_json_match_rejects_unsafe_key
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(harness): align ThreadMetaStore.search() signature with sql.py implementation
Use `dict[str, Any]` for `metadata` and `list[dict[str, Any]]` as return
type in base class and MemoryThreadMetaStore to resolve an LSP signature
mismatch; also correct a test docstring that cited the wrong exception type.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(harness): surface InvalidMetadataFilterError as HTTP 400 in search endpoint
Replace bare ValueError with a domain-specific InvalidMetadataFilterError
(subclass of ValueError) so the Gateway handler can catch it and return
HTTP 400 instead of letting it bubble up as a 500.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix(harness): sanitize metadata keys in log output to prevent log injection
Use ascii() instead of %r to escape control characters in client-supplied
metadata keys before logging, preventing multiline/forged log entries.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix(harness): validate metadata filters at API boundary and dedupe key/value rules
- Add Pydantic ``field_validator`` on ``ThreadSearchRequest.metadata`` so
unsafe keys / unsupported value types are rejected with HTTP 422 from
both SQL and memory backends (closes Copilot review 3218830849).
- Export ``validate_metadata_filter_key`` / ``validate_metadata_filter_value``
(and ``ALLOWED_FILTER_VALUE_TYPES``) from ``json_compat`` and have
``JsonMatch.__init__`` reuse them — the Gateway-side validator and the
SQL-side ``JsonMatch`` constructor now share one admission rule and
cannot drift.
- Format ``InvalidMetadataFilterError`` rejected-keys list as a
comma-separated plain string instead of a Python list repr so the
surfaced HTTP 400 detail is readable (closes Copilot review 3218830899).
- Update router tests to cover both 422 boundary paths plus the 400
defense-in-depth path when a backend still raises the error.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(harness): harden JsonMatch compile-time key validation against __init__ bypass
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* fix: address review feedback on metadata filter SQL push-down
- Add signed 64-bit range check to validate_metadata_filter_value; give
out-of-range ints a distinct TypeError message.
- Replace assert guards in _compile_sqlite/_compile_pg with explicit
if/raise so they survive python -O optimisation.
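The shared admission rule with the 64-bit range check can be sketched as follows — names mirror the commit, signatures are illustrative:

```python
_INT64_MIN, _INT64_MAX = -(2 ** 63), 2 ** 63 - 1
ALLOWED_FILTER_VALUE_TYPES = (str, int, float, bool, type(None))

def validate_metadata_filter_value(value: object) -> object:
    if not isinstance(value, ALLOWED_FILTER_VALUE_TYPES):
        raise TypeError(
            f"unsupported filter value type: {type(value).__name__}"
        )
    # bool is a subclass of int, so exclude it; the range rule applies to
    # true integers that must survive a BIGINT CAST without overflow.
    if isinstance(value, int) and not isinstance(value, bool):
        if not (_INT64_MIN <= value <= _INT64_MAX):
            raise TypeError(
                "integer filter value outside signed 64-bit range"
            )
    return value
```

Because both the Gateway validator and the JsonMatch constructor call the same function, an out-of-range int is rejected with the same distinct message on either path.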
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(agents): make update_agent honor runtime.context user_id like setup_agent
PR #2784 hardened setup_agent to prefer runtime.context["user_id"] (set by
inject_authenticated_user_context from the auth-validated request) over the
contextvar, so an agent created during the bootstrap flow always lands under
users/<auth_uid>/agents/<name>. update_agent was left calling
get_effective_user_id() unconditionally — the same class of bug that produced
issues #2782 / #2862 still applies whenever the contextvar is not available
on the executing task (background work, future cross-process drivers,
checkpoint resume on a different task). In that regime update_agent silently
routes writes to users/default/agents/<name>, corrupting the shared default
bucket and losing the user's edit.
Extract the resolution policy into a shared resolve_runtime_user_id helper
on deerflow.runtime.user_context and route both setup_agent and update_agent
through it so the two halves of the lifecycle stay in lockstep.
Add load-bearing end-to-end tests that drive a real langchain.agents
create_agent graph with a fake LLM, exercising the full pipeline:
HTTP wire format
-> app.gateway.services.start_run config-assembly
-> deerflow.runtime.runs.worker._build_runtime_context
-> langchain.agents create_agent graph
-> ToolNode dispatch (sync + async + sub-graph + ContextThreadPoolExecutor)
-> setup_agent / update_agent
The negative-control tests intentionally land in users/default/ to prove the
positive tests are actually load-bearing rather than vacuously passing.
The new test_update_agent_e2e_user_isolation suite included a test that
failed against main and now passes after this fix.
* style: ruff format on new e2e tests
* test(e2e): real-server HTTP test driving setup_agent through the full ASGI stack
Adds tests/test_setup_agent_http_e2e_real_server.py — a single load-bearing
test that drives the entire FastAPI gateway through starlette.testclient.
TestClient with no mocks above the LLM:
- lifespan boots (config, sqlite engine, LangGraph runtime, channels)
- POST /api/v1/auth/register (real password hash, real sqlite write,
issues access_token + csrf_token cookies)
- POST /api/threads (real thread_meta + checkpoint creation)
- POST /api/threads/{id}/runs/stream with the exact wire shape the React
frontend sends (assistant_id + input + config + context with
agent_name/is_bootstrap)
- AuthMiddleware -> CSRFMiddleware -> require_permission ->
start_run -> inject_authenticated_user_context ->
asyncio.create_task(run_agent) -> worker._build_runtime_context ->
Runtime injection -> ToolNode dispatch -> real setup_agent
- Asserts SOUL.md is under users/<authenticated_uid>/agents/<name>/
and NOT under users/default/agents/<name>/.
DEER_FLOW_HOME and the sqlite path are redirected into tmp_path so the test
never touches the real .deer-flow directory or developer database. The only
patch above the LLM boundary is replacing create_chat_model with a fake that
emits a single setup_agent tool_call.
This is the "real verification" (真实验证) answer: it reproduces what curl-against-uvicorn would
do, minus the network socket layer.
* test: address Copilot review on user-isolation e2e tests
- Drop "currently expected to FAIL" wording from update_agent e2e docstring
and header (Copilot review): the fix is in this PR, the test pins the
corrected behaviour rather than driving a future change.
- Rephrase the assertion failure messages from "BUG:" to "REGRESSION:" to
match the test's role on the fixed branch.
- Bound _drain_stream with a wall-clock timeout, a max-bytes cap, and an
early break on the "event: end" SSE frame (Copilot review). Stops the
test from hanging on a stuck run or runaway heartbeat loop.
- Replace the misleading "patch both module aliases" comment with an
explanation of why patching lead_agent.agent.create_chat_model is the
only correct target (Copilot review): lead_agent rebinds the symbol
into its own namespace at import time, so patching deerflow.models is
too late.
* test(refactor): address WillemJiang review on user-isolation e2e tests
- Extract the duplicated FakeToolCallingModel (and a
build_single_tool_call_model helper) into tests/_agent_e2e_helpers.py.
All three e2e files now import from the shared module instead of
redefining the shim locally.
- Convert the manual p.start() / p.stop() try/finally blocks in
test_update_agent_e2e_user_isolation.py to contextlib.ExitStack so
patch lifecycle is Pythonic and exception-safe.
- Lift the isolated_app fixture's private-attribute resets into a
named _reset_process_singletons helper with a comment block
explaining why each singleton has to be invalidated for true e2e
isolation, and why raising=False is intentional. Makes the
fragility visible and the intent self-documenting rather than
leaving the resets inline as opaque monkeypatch calls.
Net change: -59 lines (143 -> 84) across the three test files, with
every assertion intact. Full suite remains 69 passed / lint clean.
* test(e2e): make real-server test self-supply its config
CI's actions/checkout only ships config.example.yaml (the real config.yaml
is gitignored), so the production config-discovery search
(./config.yaml -> ../config.yaml -> $DEER_FLOW_CONFIG_PATH) finds nothing
and the test fails at lifespan boot with FileNotFoundError. The dev-machine
run passed only because a local config.yaml happened to exist.
Write a minimal AppConfig-valid yaml into tmp_path and pin
DEER_FLOW_CONFIG_PATH to it. The yaml carries just what the schema requires
(a single fake-test-model entry, LocalSandboxProvider, sqlite database).
The LLM never gets instantiated because the test patches create_chat_model
on the lead agent module, so the api_key/base_url stay placeholders.
Verified by hiding the local config.yaml to mirror the CI checkout — the
test now passes in both environments.
* docs: document auth design and user isolation
* docs: align auth docs with current storage and reset behavior
---------
Co-authored-by: greatmengqi <chenmengqi.0376@bytedance.com>
* feat(run): propagate model_name from gateway request context to persistence layer
Pass model_name through the full run creation pipeline — from
RunCreateRequest.context in the gateway, through RunManager, to the
RunStore interface and SQL persistence. This enables client-specified
model selection to be recorded per-run in the database.
* feat(run): add model allowlist validation and effective model name capture
- Validate model_name against allowlist in gateway services.py using
get_app_config().get_model_config()
- Truncate model_name to 128 chars to match DB column constraint
- In worker.py, capture effective model name from agent.metadata after
agent creation and persist if resolved differently than requested
* feat(run): add defense-in-depth model_name normalization and round-trip persistence tests
- Add _normalize_model_name() to RunRepository for whitespace stripping
and 128-char truncation before DB writes.
- Add round-trip unit tests for model_name creation and default None
in test_run_manager.py.
* fix(run): coerce non-string model_name values before strip/truncate in _normalize_model_name
* fix(gateway): add runtime type guard for model_name coercion in gateway services
Add isinstance check and str() coercion before calling .strip() to prevent
AttributeError when non-string types (int, None, etc.) flow through the
gateway. Paired with SQL integration test for end-to-end model_name
persistence across gateway → langgraph → persistence layer.
* fix(run): drop Alembic migration for model_name (no-op) and expose public update method on RunManager
- Drop a1b2c3d4e5f6 migration: model_name already exists in RunRow schema
and is auto-created via Base.metadata.create_all() at startup
- Add update_model_name() public method to RunManager to replace the private
_persist_to_store call in worker.py, preserving internal locking/persistence
* fix(nginx): defer cors to gateway allowlist
Remove proxy-level wildcard CORS handling so browser origins are controlled by the Gateway allowlist and stay aligned with CSRF origin checks.
* docs: document gateway cors allowlist
Clarify that same-origin nginx access needs no CORS headers while split-origin or port-forwarded browser clients must opt in with GATEWAY_CORS_ORIGINS.
* docs(gateway): record cors source of truth
Document that Gateway CORSMiddleware and CSRFMiddleware share GATEWAY_CORS_ORIGINS as the split-origin source of truth.
* fix(gateway): align cors origin normalization
* docs: clarify gateway langgraph routing
* docs(gateway): update runtime routing note
* fix(subagents): consolidate system_prompt and skills into single SystemMessage
Some LLM APIs (vLLM, Xinference, Chinese LLM providers) reject multiple
system messages with "System message must be at the beginning." The
subagent executor was sending separate SystemMessages for the configured
system_prompt and each loaded skill, which caused failures when calling
task tool with sub-agents.
Merge system_prompt and all skill content into one SystemMessage in the
initial state, and pass system_prompt=None to create_agent() so the
factory doesn't prepend a second one.
Fixes #2693
* fix(subagents): update SubagentConfig.system_prompt to str | None and add astream regression test
Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/2ee03a26-e19b-4106-abc5-c76a2906383b
Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>
* fix the lint error
* fix the lint error in the backend
* fix the unit test error in test_subagent_executor
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
* Fix local sandbox singleton reset on provider lifecycle
* Fix local sandbox singleton reset on provider reset
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix: make tool argument behavior discoverable
The write_file tool already defaulted to append=false (overwrite) and supported append=true for end-of-file writes, but the parsed docstring did not describe append in the model-facing schema. This change records the overwrite default and the append path in the tool description, adds resilient schema regression coverage, and keeps the backend sandbox docs aligned.
The regression test now also checks that every public parameter in the existing tool schema test matrix has a description. Enabling docstring parsing on setup_agent and update_agent fills the two existing gaps from their existing Args docs instead of duplicating descriptions elsewhere.
Constraint: Issue #2831 asks for a small docstring/schema discoverability fix without changing runtime file-writing behavior
Rejected: Changing write_file defaults | would alter existing overwrite semantics and broaden the fix beyond schema discoverability
Rejected: Exact phrase assertions | too brittle for future docstring rewording while testing the same behavior
Confidence: high
Scope-risk: narrow
Directive: Keep model-facing tool parameters documented through parsed docstrings or equivalent schema descriptions
Tested: cd backend && uv run pytest tests/test_setup_agent_tool.py tests/test_update_agent_tool.py tests/test_tool_args_schema_no_pydantic_warning.py tests/test_sandbox_tools_security.py::test_str_replace_and_append_on_same_path_should_preserve_both_updates -q
Tested: cd backend && uv run ruff check packages/harness/deerflow/sandbox/tools.py packages/harness/deerflow/tools/builtins/setup_agent_tool.py packages/harness/deerflow/tools/builtins/update_agent_tool.py tests/test_tool_args_schema_no_pydantic_warning.py
Not-tested: Full backend test suite
Co-authored-by: OmX <omx@oh-my-codex.dev>
* Fix the lint error
---------
Co-authored-by: OmX <omx@oh-my-codex.dev>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix: bucket subagent token usage into RunRow.subagent_tokens
Add caller-bucketed token tracking to RunJournal so subagent and
middleware LLM calls are written to the correct RunRow columns instead
of all falling into lead_agent_tokens (default 0).
- RunJournal: accumulate _lead_agent_tokens / _subagent_tokens /
_middleware_tokens in on_llm_end, deduped by langchain run_id.
Add record_external_llm_usage_records() for external sources
(respects track_token_usage flag). Return caller buckets from
get_completion_data().
- SubagentTokenCollector: new lightweight callback handler that
collects LLM usage within subagent execution.
- SubagentExecutor: wire collector into subagent run_config and sync
records to SubagentResult on every chunk (timeout/cancel safe).
- SubagentResult: add token_usage_records and usage_reported fields.
- task_tool: report subagent usage to parent RunJournal on every
terminal status (COMPLETED/FAILED/CANCELLED/TIMED_OUT), including
the CancelledError path, guarded against double-reporting.
No DB migration needed — RunRow columns already exist.
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix: address token usage review feedback
* Address review follow-ups
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
`make dev` ran `uv sync` unconditionally on every restart, wiping any
optional extras the user had installed manually with
`uv sync --all-packages --extra postgres`. The Docker image-build path
already solved this via the `UV_EXTRAS` build-arg in backend/Dockerfile;
the local serve.sh path and the docker-compose-dev startup command
were the remaining outliers.
`scripts/serve.sh` now resolves extras before `uv sync`:
1. honors `UV_EXTRAS` (parity with backend/Dockerfile and
docker/docker-compose.yaml — no new convention introduced);
2. falls back to parsing config.yaml — `database.backend: postgres`
or legacy `checkpointer.type: postgres` auto-pins
`--extra postgres`, so the common case needs zero extra config;
3. detector stderr is no longer suppressed, so whitelist warnings or
crashes surface to the dev terminal (review feedback).
Detection lives in `scripts/detect_uv_extras.py` (stdlib-only — has to
run before the venv exists). Extra names are validated against
`^[A-Za-z][A-Za-z0-9_-]*$` so a stray shell metacharacter in `.env`
cannot reach `uv sync` downstream (defense in depth).
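The whitelist step can be sketched as follows — the regex is the one quoted above; the function name and abort-on-bad-token behavior are illustrative:

```python
import re

_EXTRA_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_-]*$")

def parse_uv_extras(raw: str) -> list[str]:
    """Split comma/whitespace-separated extras and reject anything outside
    the whitelist, so a stray shell metacharacter in .env can never reach
    the downstream `uv sync` invocation."""
    tokens = [t for t in re.split(r"[,\s]+", raw) if t]
    bad = [t for t in tokens if not _EXTRA_RE.match(t)]
    if bad:
        raise ValueError(f"invalid extras: {bad}")
    return tokens
```

``UV_EXTRAS=postgres,ollama`` yields two ``--extra`` flags; anything containing ``;``, spaces inside a token, or other metacharacters aborts before a command line is built.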
`docker/docker-compose-dev.yaml`'s startup command is now extracted to
`docker/dev-entrypoint.sh` (review feedback — the inline command had
grown to a ~350-char one-liner). The script:
- parses comma/whitespace-separated UV_EXTRAS, applying the same
`^[A-Za-z][A-Za-z0-9_-]*$` whitelist as the local detector;
- emits one `--extra X` flag per token, so `UV_EXTRAS=postgres,ollama`
works in Docker dev too (harmonized with local — review feedback);
- calls `uv sync --all-packages` (PR #2584) so workspace member
extras (deerflow-harness's postgres extra) are installed;
- keeps the existing self-heal `(uv sync || (recreate venv && retry))`
branch;
- exposes `--print-extras` for dry-run testing.
The compose file mounts the script read-only at runtime, so script
edits take effect on `make docker-restart` without an image rebuild.
The `--no-sync` alternative (a separate suggestion in the issue thread)
was considered but rejected for dev paths because it would drop the
self-heal branch and the auto-pickup of new pyproject deps. `--no-sync`
is already in use for the production CMD (`backend/Dockerfile:101`)
where it's appropriate.
Updates the asyncpg-missing error message to include the
`--all-packages` flag (matching #2584) plus the persistent install flow,
and expands `config.example.yaml` so all three install paths
(local / docker dev / docker image build) are documented with their
multi-extra capabilities.
Tests:
- `tests/test_detect_uv_extras.py` (21 tests) — local-path env parsing,
YAML edge cases, env-vs-config precedence, whitelist rejection of
shell metacharacters.
- `tests/test_dev_entrypoint.py` (15 tests) — docker-path validation
via `--print-extras`, multi-extra parsing, metacharacter abort.
- `tests/test_persistence_scaffold.py` (22 tests, unchanged) — passes
with the merged `--all-packages --extra postgres` error message.
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Surface artifacts produced via the present_files tool in the CLI debug
REPL so headless clients without a frontend (VS Code launch configs,
etc.) can locate output files. Each turn prints newly added artifacts
plus their resolved host path. Works for any source that goes through
present_files — ACP agents, subagents, or sandbox writes.
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
* feat(middleware): inject dynamic context via DynamicContextMiddleware
Move memory and current date out of the system prompt and into a
dedicated <system-reminder> HumanMessage injected once per session
(frozen-snapshot pattern) via a new DynamicContextMiddleware.
This keeps the system prompt byte-exact across all users and sessions,
enabling maximum Anthropic/Bedrock prefix-cache reuse.
Key design decisions:
- ID-swap technique: reminder takes the first HumanMessage's ID
(replacing it in-place via add_messages), original content gets a
derived `{id}__user` ID (appended after). Preserves correct ordering.
- hide_from_ui: True on reminder messages so frontend filters them out.
- Midnight crossing: date-update reminder injected before the current
turn's HumanMessage when the conversation spans midnight.
- INFO-level logging for production diagnostics.
Also adds prompt-caching breakpoint budget enforcement tests and
updates ClaudeChatModel docs to reference the new pattern.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(token-usage): log input/output token detail breakdown in middleware
Extend the LLM token usage log line to include input_token_details and
output_token_details (cache_creation, cache_read, reasoning, audio, etc.)
when present. Adds tests covering Anthropic cache detail logging from
both usage_metadata and response_metadata.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: nginx
* fix(middleware): always inject date; gate memory on injection_enabled
Date injection is now unconditional — it is part of the static system
prompt replacement and should always be present. Memory injection
remains gated by `memory.injection_enabled` in the app config.
Previously the entire DynamicContextMiddleware was skipped when
injection_enabled was False, which also suppressed the date.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(lint): format files and correct test assertions for token usage middleware
- ruff format dynamic_context_middleware.py and test_claude_provider_prompt_caching.py
- Remove unused pytest import from test_dynamic_context_middleware.py
- Fix two tests that asserted response_metadata fallback logic that
doesn't exist: replace with tests that match actual middleware behavior
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(middleware): address Copilot review comments on DynamicContextMiddleware
- Use additional_kwargs flag for reminder detection instead of content
substring matching, so user messages containing '<system-reminder>'
are not mistakenly treated as injected reminders
- Generate stable UUID when original HumanMessage.id is None to prevent
ambiguous 'None__user' derived IDs and message collisions
- Downgrade per-turn no-op log to DEBUG; keep actual injection events at INFO
- Add two new tests: missing-id UUID fallback and user-text false-positive
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(task): remove max_turns parameter from task tool interface
Subagents should always use their configured max_turns value. Exposing
this parameter allowed callers to override the admin-configured limit,
which is undesirable. The value is now exclusively driven by subagent
config (per-agent overrides and global defaults in config.yaml).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix(tools): introduce Runtime type alias to eliminate Pydantic serialization warning
Add deerflow/tools/types.py with:
Runtime = ToolRuntime[dict[str, Any], ThreadState]
Replace every runtime: ToolRuntime[ContextT, ThreadState] and
runtime: ToolRuntime[dict[str, Any], ThreadState] annotation in
sandbox/tools.py, present_file_tool.py, task_tool.py, view_image_tool.py,
and skill_manage_tool.py with the new Runtime alias.
The unbound ContextT TypeVar (default None) caused
PydanticSerializationUnexpectedValue warnings on every tool call because
LangChain's BaseTool._parse_input calls model_dump() on the auto-generated
args_schema while DeerFlow passes a dict as runtime context.
Binding the context to dict[str, Any] aligns Pydantic's serialization
expectations with reality and removes the noise from all run modes.
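A minimal sketch of the alias pattern, using local stand-ins for `ToolRuntime` and `ThreadState` (the real classes live in LangChain and DeerFlow; the `view_image` signature here is illustrative only):

```python
from typing import Any, Generic, TypeVar, get_args

ContextT = TypeVar("ContextT")
StateT = TypeVar("StateT")

class ToolRuntime(Generic[ContextT, StateT]):  # stand-in for the real ToolRuntime
    pass

class ThreadState:  # stand-in for DeerFlow's thread state type
    pass

# deerflow/tools/types.py equivalent: bind ContextT once, so every tool
# annotation agrees with the dict actually passed as runtime context.
Runtime = ToolRuntime[dict[str, Any], ThreadState]

def view_image(path: str, runtime: Runtime) -> str:  # hypothetical tool signature
    return path
```

Binding the parameter at one alias definition, instead of repeating `ToolRuntime[dict[str, Any], ThreadState]` at every tool, also gives a single place to change the context type later.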
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(tools): extend Runtime alias to setup_agent and update_agent tools
Replace bare ToolRuntime annotations in setup_agent_tool.py and
update_agent_tool.py with the shared Runtime alias introduced in the
previous commit, and add both tools to the Pydantic serialization
warning regression test (13 cases total).
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(tools): loosen Pydantic warning filter to avoid version-specific format
Replace the brittle "field_name='context'" substring check with a looser
"context" match so the assertion stays valid if Pydantic changes its
internal warning format across versions.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(tools): simplify warning filter and clean up docstring
Remove the "context" substring condition from the Pydantic warning
filter — asserting that no PydanticSerializationUnexpectedValue fires
at all is both simpler and more comprehensive, since the test payload
contains only the tool's own args plus runtime.
Also update the module docstring to remove the version-specific warning
format example that was inconsistent with the looser filter.
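The loosened check can be illustrated with the stdlib `warnings` machinery; `invoke_tool` is a hypothetical well-behaved call, and the category is matched by message substring because the real warning is emitted as a plain `UserWarning`:

```python
import warnings

def invoke_tool() -> str:
    # A well-behaved tool call: emits no serialization warning.
    return "ok"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    invoke_tool()

# Loose filter: assert no such warning fired at all, instead of matching a
# version-specific message fragment like "field_name='context'".
unexpected = [w for w in caught
              if "PydanticSerializationUnexpectedValue" in str(w.message)]
assert not unexpected
```

Because the filter no longer anchors on Pydantic's internal message format, the regression test survives upstream wording changes.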
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(chat): prevent first user message from being swallowed in new conversations
The optimistic message clearing effect cleared too eagerly — any stream
message (including AI messages from messages-tuple events) triggered the
clear before the server's human message had arrived via values events.
For new threads this caused the user's first prompt to disappear permanently.
Only clear optimistic messages once the server's human message has been
confirmed to arrive in thread.messages, not just when any message arrives.
Fixes #2730
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* Make loop detection configurable
Expose LoopDetectionMiddleware thresholds through config.yaml while preserving existing defaults and allowing the middleware to be disabled.
Refs bytedance/deer-flow#2517
* feat(loop-detection): add per-tool tool_freq_overrides to Phase 1
Adds ToolFreqOverride model and tool_freq_overrides field to
LoopDetectionConfig, wires it through LoopDetectionMiddleware, and
documents the option in config.example.yaml.
Resolves the gap flagged in the #2586 review: without per-tool overrides,
users hit by #2510/#2511 (RNA-seq workflows exceeding the bash hard limit)
had no way to raise thresholds for one tool without loosening the global
limit for every tool.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* docs(loop-detection): document tool_freq_overrides in LoopDetectionMiddleware docstring
Add the missing Args entry for tool_freq_overrides, explaining the
(warn, hard_limit) tuple structure and how per-tool thresholds supersede
the global tool_freq_warn / tool_freq_hard_limit for named tools.
Also run ruff format on the three files flagged by the lint check.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(loop-detection): validate LoopDetectionMiddleware __init__ params eagerly
Raise clear ValueError at construction time instead of crashing at
unpack-time inside _track_and_check when bad values are passed:
- tool_freq_overrides: must be 2-tuples of positive ints with hard_limit >= warn
- scalar thresholds: warn_threshold, hard_limit, tool_freq_warn,
tool_freq_hard_limit must be >= 1, and each hard limit must be >= its warn counterpart
- window_size, max_tracked_threads must be >= 1
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(test): isolate credential loader directory-path test from real ~/.claude
The test didn't monkeypatch HOME, so on any machine with real Claude Code
credentials at ~/.claude/.credentials.json the function fell through to
those credentials and the assertion failed. Adding a HOME redirect ensures
the default credential path doesn't exist during the test.
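The isolation idea, sketched without pytest (`default_credentials_path` is a hypothetical stand-in for the loader's default lookup; in the real test `monkeypatch.setenv("HOME", ...)` does the redirect, and this relies on POSIX `expanduser` reading HOME):

```python
import os
import tempfile
from pathlib import Path

def default_credentials_path() -> Path:
    # Hypothetical stand-in for the loader's default lookup.
    return Path(os.path.expanduser("~")) / ".claude" / ".credentials.json"

with tempfile.TemporaryDirectory() as tmp:
    saved = os.environ.get("HOME")
    os.environ["HOME"] = tmp  # redirect HOME (POSIX; Windows uses USERPROFILE)
    try:
        # The default path now resolves under the empty temp dir, so the
        # fallback branch can never find real credentials on the dev machine.
        assert not default_credentials_path().exists()
    finally:
        if saved is None:
            del os.environ["HOME"]
        else:
            os.environ["HOME"] = saved
```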
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* style(test): add blank lines after import pytest in TestInitValidation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(loop-detection): collapse dual validation to LoopDetectionConfig
Modifications
- LoopDetectionMiddleware.__init__: stripped of all ValueError raises;
becomes a plain field-assignment constructor.
- LoopDetectionMiddleware.from_config: classmethod that builds the
middleware from a Pydantic-validated LoopDetectionConfig and handles
the ToolFreqOverride -> tuple[int, int] conversion.
- agents/factory.py: SDK construction routed through
LoopDetectionMiddleware.from_config(LoopDetectionConfig()) so the
defaults path is Pydantic-validated too.
- agents/lead_agent/agent.py: uses from_config instead of unpacking
config fields by hand.
- tests/test_loop_detection_middleware.py: deleted TestInitValidation
(16 methods exercising the removed __init__ checks); added
TestFromConfig (4 tests: scalar field mapping, override tuple
conversion, empty overrides, behavioral smoke test).
Result: one validation layer (Pydantic), zero duplication, no __new__
hacks. Both production construction sites flow through LoopDetectionConfig.
Test results
make test -> 2977 passed, 18 skipped, 0 failed (137s)
make format -> All checks passed; 411 files left unchanged
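The single-validation-layer shape can be sketched as follows. A frozen dataclass stands in for the Pydantic `LoopDetectionConfig`, and the field names follow the commit; method names other than `from_config` are illustrative:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class LoopDetectionConfig:  # dataclass stand-in for the Pydantic model
    tool_freq_warn: int = 3
    tool_freq_hard_limit: int = 5
    tool_freq_overrides: dict[str, tuple[int, int]] = field(default_factory=dict)

    def __post_init__(self):
        # Single validation layer: every construction path goes through here.
        if not 1 <= self.tool_freq_warn <= self.tool_freq_hard_limit:
            raise ValueError("need 1 <= tool_freq_warn <= tool_freq_hard_limit")
        for name, (warn, hard) in self.tool_freq_overrides.items():
            if not 1 <= warn <= hard:
                raise ValueError(f"{name}: need 1 <= warn <= hard_limit")

class LoopDetectionMiddleware:
    def __init__(self, warn, hard, overrides):
        # Plain field assignment; no duplicated checks.
        self.warn, self.hard, self.overrides = warn, hard, overrides

    @classmethod
    def from_config(cls, cfg: LoopDetectionConfig) -> "LoopDetectionMiddleware":
        return cls(cfg.tool_freq_warn, cfg.tool_freq_hard_limit,
                   dict(cfg.tool_freq_overrides))

    def thresholds_for(self, tool: str) -> tuple[int, int]:
        # Per-tool overrides supersede the global (warn, hard_limit) pair.
        return self.overrides.get(tool, (self.warn, self.hard))
```

A caller raising the bash limit without loosening the global pair would pass `LoopDetectionConfig(tool_freq_overrides={"bash": (10, 20)})` to `from_config`; invalid values fail at config construction, never inside the tracking loop.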
* feat(agents): make loop_detection configurable in create_deerflow_agent
Adds a `loop_detection: bool | AgentMiddleware = True` field to
RuntimeFeatures, mirroring the existing pattern used by `sandbox`,
`memory`, and `vision`. SDK users can now disable LoopDetectionMiddleware
or replace it with a custom instance built from their own
LoopDetectionConfig — e.g.
`LoopDetectionMiddleware.from_config(my_cfg)` — instead of being stuck
with the hardcoded defaults previously installed by the SDK factory.
The lead-agent path (which already reads AppConfig.loop_detection) is
unchanged, and the default `True` preserves prior always-on behavior for
all existing callers.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: knight0940 <631532668@qq.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Amorend <142649913+knight0940@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
* fix(frontend): defer thread id to onStart to avoid 404 on new chat
The LangGraph SDK's useStream eagerly fetches /threads/{id}/history the
moment it receives a thread id, and the local useThreadRuns issues
GET /threads/{id}/runs for the same reason. The chats page used to flip
isNewThread=false (and forward the client-generated thread id) inside
the synchronous onSend callback, before thread.submit had created the
thread on the backend. The two queries therefore raced ahead of
POST /runs/stream and returned 404 on the very first send.
Drop the onSend handler so isNewThread stays true until onStart fires
from useStream's onCreated — by then the backend has the thread, and
the SDK's submittingRef guard naturally suppresses the redundant
history fetch. The agent chat page already uses this pattern, so this
also unifies the two flows.
Adds an E2E regression that records request ordering and asserts
GET /history and GET /runs are never issued before POST /runs/stream
on the first send from /chats/new.
Closes #2746
* fix(frontend): split welcome layout from backend thread state
Removing onSend kept GET /history and GET /runs from racing ahead of
POST /runs/stream, but it also coupled the welcome layout (centered
input, hero, quick actions) to backend thread creation. Until onCreated
returned, the user's optimistic message and the welcome hero rendered on
top of each other.
Introduce a dedicated `isWelcomeMode` UI flag, separate from
`isNewThread`:
- `isNewThread` still tracks "backend has no thread yet" and gates the
thread id forwarded to useStream.
- `isWelcomeMode` drives the visual layout (header background, input
box position, max width, hero, quick actions, autoFocus) and flips to
false inside onSend so the layout animates immediately.
`isWelcomeMode` is kept in sync with `isNewThread` via an effect so
sidebar navigation and "new chat" still behave correctly. All 15 E2E
tests pass, including the ordering regression added in the previous
commit.
* test(e2e): use monotonic sequence for thread-init ordering check
Date.now() is millisecond-resolution, so two requests emitted within
the same tick would share a timestamp and slip past the strict `<`
ordering assertions. Replace the timestamp with a monotonic counter
that increments on every observed request/requestfinished event so the
ordering check is robust regardless of scheduling.
Per PR #2749 review feedback from copilot-pull-request-reviewer.
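The actual test is a Playwright E2E in TypeScript; a language-agnostic sketch of the counter idea, with hypothetical request names:

```python
import itertools

seq = itertools.count()          # increments on every observed event
events: list[tuple[int, str]] = []

def on_request(name: str) -> None:
    # Two requests in the same millisecond still get distinct, strictly
    # ordered ids, unlike Date.now()-style millisecond timestamps.
    events.append((next(seq), name))

on_request("POST /runs/stream")
on_request("GET /history")       # may fire in the same tick as the POST

order = [name for _, name in sorted(events)]
assert order.index("POST /runs/stream") < order.index("GET /history")
```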
* refactor(input-box): rename isNewThread prop to isWelcomeMode
Inside InputBox, the prop named `isNewThread` is only ever consulted
for visual layout decisions — gating follow-up suggestions, the bottom
background strip, and the welcome-mode quick-action SuggestionList. It
never reflects "the backend has created the thread", which after #2746
is tracked separately via `isNewThread` in the chat pages themselves.
Rename the prop to `isWelcomeMode` and update both call sites
(workspace chats page and agent chats page) so the prop name matches
its actual semantics. No behavior change.
Per PR #2749 review feedback from @WillemJiang.