deer-flow

mirror of https://github.com/bytedance/deer-flow.git synced 2026-07-25 23:48:00 +00:00

Author	SHA1	Message	Date
Aari	37c343fe30	fix(summarization): summarize with the run model, fall back on summary-provider failure (#4361 ) * fix(summarization): own the run model for compaction; bound failure With summarization.model_name: null the summary model resolved to config.models[0] while the executing model is selected per run; when they differ and models[0]'s provider is broken (expired key, quota, outage) compaction silently failed every triggered turn and context grew unbounded until the main provider 400s the run (#3103's shape), even though the run's own model was healthy. Model ownership is now sourced from the builders, not re-derived at runtime: - The lead, subagent, and manual /compact builders each pass the resolved run model into create_summarization_middleware(run_model_name=...). The middleware no longer reads runtime.context / get_config(), which do not carry a custom agent's or a subagent's resolved model, so a custom-agent lead run and a distinct-model subagent now summarize with their own model, not models[0] / the parent's. Runtime re-resolution and the per-name model cache are removed. - model_name: null summarizes with the run's own model; an explicitly configured summary model generates and falls back to the run model on failure. The fallback is built lazily after the primary fails and its construction is guarded, so a broken fallback cannot skip a healthy primary or escape the automatic failure boundary. Failure is bounded and side-effect-safe: - An empty or whitespace-only response is treated as a generation failure, not a valid summary, so compaction never removes all history for an empty replacement. - compact_state/acompact_state take raise_on_failure independent of force: the manual /compact path always surfaces a generation failure (even force=false) and routes it to the existing ContextCompactionFailed path (HTTP 500 -> frontend error toast) instead of an unconsumed response reason. The automatic path leaves compaction state unchanged. 
- before_summarization hooks fire only after a replacement summary exists. SummarizationConfig.model_name, config.example.yaml, and docs/summarization.md document the final lead/subagent/manual ownership rules. Part of RFC #4346 (section A). Evaluating fraction/triggers against the run model's profile (profile ownership) is a separate follow-up. * fix(summarization): manual /compact model ownership + fail-open construct/parse Manual /compact carried only agent_name, so it derived the run model from the custom-agent model or config.models[0] and missed the request-selected model the run path uses (request -> custom-agent -> default). Carry model_name through ThreadCompactRequest and the frontend compact call, resolve with the same precedence, and move the custom-agent config read off the event loop (asyncio .to_thread) with user_id so the strict blocking-IO gate is not bypassed by the broad except. Make one summary attempt own its full lifecycle so the fail-open boundary covers construction and response parsing, not just invocation: build each candidate model lazily and guarded (a raising constructor falls through to the healthy run model instead of breaking agent construction), build the model_name:null primary from the run model rather than config.models[0], and run response text extraction inside the invocation try so a failing .text accessor falls back instead of escaping compaction. Adds factory-level constructor-failure, response-extraction-failure (sync/async), and route-path model-ownership tests.	2026-07-26 07:39:39 +08:00
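The fallback behaviour described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the project's middleware: `generate_summary`, `compact`, `CompactionFailed`, and the builder callables are hypothetical stand-ins, and a "model" is reduced to a plain callable that returns text.

```python
from typing import Callable, List, Optional

# Hypothetical types: a "model" is just a callable that returns summary text.
Model = Callable[[List[str]], str]
ModelBuilder = Callable[[], Model]


class CompactionFailed(Exception):
    """Stand-in for the ContextCompactionFailed error named in the commit."""


def generate_summary(
    messages: List[str],
    build_primary: ModelBuilder,
    build_fallback: Optional[ModelBuilder] = None,
) -> Optional[str]:
    """Return a non-empty summary, or None if every candidate fails."""
    for build in (build_primary, build_fallback):
        if build is None:
            continue
        try:
            model = build()         # lazy, guarded construction
            text = model(messages)  # invocation and text extraction
        except Exception:
            continue                # fall through to the next candidate
        if isinstance(text, str) and text.strip():
            return text             # blank output counts as a failure
    return None


def compact(
    messages: List[str],
    build_primary: ModelBuilder,
    build_fallback: Optional[ModelBuilder] = None,
    raise_on_failure: bool = False,
) -> List[str]:
    summary = generate_summary(messages, build_primary, build_fallback)
    if summary is None:
        if raise_on_failure:
            # Manual path: surface the failure to the caller.
            raise CompactionFailed("no candidate produced a summary")
        return messages             # automatic path: state left unchanged
    return [summary]
```

Because the fallback builder is only called after the primary has failed, a broken fallback cannot prevent a healthy primary from being used.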
Willem Jiang	b41a8d44df	fix(lint): fix the lint error on the main branch (#4461)	2026-07-26 07:06:26 +08:00
MiaoRuidx	735f67a5b2	fix: guard pending run startup cancellation (#4450) * fix: guard pending run startup cancellation * fix(run): address startup review feedback * fix(run): narrow start_run store contract --------- Co-authored-by: MiaoRuidx <12540796+MiaoRuidx@users.noreply.github.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-25 23:50:21 +08:00
Huixin615	8af760fc30	fix(runtime): make orphan reconciliation lease-aware (#4427)	2026-07-25 23:26:17 +08:00
Vanzeren	3c8b82c594	fix(runtime): serialize checkpoint writes with active runs (#4437) * fix(runtime): serialize checkpoint writes with active runs * fix(runtime): address checkpoint reservation reviews * fix(runtime): address reservation race reviews * fix(runtime): refine reservation conflict semantics	2026-07-25 23:18:34 +08:00
March-77	a65eb531ae	fix(telegram): receive inbound attachments (#4392) * fix(telegram): receive inbound attachments * refactor(telegram): tighten inbound attachment handoff --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-25 21:55:31 +08:00
VectorPeak	07d8b98864	fix(mcp): ignore malformed path-like text (#4456) Co-authored-by: chatgpt-codex-connector[bot] <199175422+chatgpt-codex-connector[bot]@users.noreply.github.com>	2026-07-25 21:43:33 +08:00
Vanzeren	8c19a2eb36	perf(checkpoint): linearize message write merging (#4421) * perf(checkpoint): linearize message write merging * test(checkpoint): address message reducer review	2026-07-25 21:19:24 +08:00
luo jiyin	3b77a7401b	fix(sandbox): enforce E2B replica capacity limits (#4391 ) * fix(sandbox): enforce E2B replica capacity limits (in-process) Add SandboxCapacityExceededError with diagnostic fields. Add overflow_policy (wait/reject/burst), acquire_timeout, and burst_limit config options. Implement atomic capacity reservation with a four-slot model: reserved / active / warm / transitioning. Transitioning slots close the window where active-to-warm or warm-to-active transitions appear to have zero occupied slots, which would let concurrent acquires exceed the configured replica ceiling. Re-route release, reclaim, and evict through transitioning counters. Add shutdown guard: reject waiters, kill VMs created during shutdown. Add 14 tests: policy enforcement, release+acquire race, warm-reclaim race, shutdown-waiter interaction, shutdown-during-create, and concurrent different-thread capacity assertion. Related: #4339 * fix: harden e2b sandbox capacity lifecycle * fix: retain e2b capacity during uncertain eviction * fix: serialize e2b tombstone eviction * fix: retain capacity after uncertain e2b cleanup * fix: track e2b remote operations during shutdown * fix(sandbox): validate E2B capacity config * fix(sandbox): classify capacity errors * fix(sandbox): harden E2B capacity lifecycle * test(sandbox): cover E2B review findings * docs(changelog): note E2B capacity behavior * docs(readme): explain E2B overflow handling * docs(backend): record E2B lifecycle rules * docs(sandbox): clarify destructive E2B reset * fix(sandbox): close E2B capacity race gaps --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-25 10:54:14 +08:00
ShitK	0f0955bf7b	fix(client): preserve ToolMessage artifacts (#4422) Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-25 09:47:58 +08:00
Nan Gao	58befaf248	fix(thread-history): keep completed subtask cards stable after reload (#4432) * fix(thread-history): hide subagent AI responses * refactor(thread-runs): remove unused _is_middleware_message_row helper --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-25 09:29:11 +08:00
Daoyuan Li	d2b5f884e3	fix(channels): buffer GitHub follow-ups during busy runs (#4133) Queue comments received while a run is active, then submit one deduplicated follow-up after it finishes. Failed drains are requeued and watcher tasks stop cleanly with the channel manager.	2026-07-24 22:41:07 +08:00
ly-wang19	25d9ac0a43	fix(skills): offload blocking filesystem IO in get_custom_skill_history (#3563 ) * fix(skills): offload blocking filesystem IO in get_custom_skill_history The GET /api/skills/custom/{name}/history handler ran its storage probes and the per-skill .history read directly on the asyncio event loop: get_or_new_skill_storage(), custom_skill_exists(), get_skill_history_file().exists() and read_history() are all blocking filesystem IO. make detect-blocking-io flagged the existence probe (routers/skills.py:224) as DIRECT_ASYNC. Move the whole read into a nested sync function run via asyncio.to_thread; a None return signals 404 (distinct from an empty history list). Behavior is unchanged. Per the blocking-io-guard SOP: - Candidate: get_custom_skill_history (FILE_METADATA, DIRECT_ASYNC) -> FIX+ANCHOR. - Re-scan: the finding no longer appears for this handler. - Anchor: tests/blocking_io/test_skills_router.py drives the real handler against a real on-disk skill + history; teeth verified red (pre-fix) -> green (post-fix) under make test-blocking-io. Scoped to this self-contained read handler. rollback_custom_skill and update_skill also touch blocking IO but interleave it with awaits (security scan / cache refresh) and do a read-modify-write, so offloading them needs the asyncio.Lock serialization treatment (cf. #3552) and is left as a separate fix unit. * test: trim dead skills history setup * fix(skills): use the user-scoped storage accessor in the offloaded history read The merge with main left the offloaded reader calling get_or_new_skill_storage, which is not defined in this module (ruff F821), so lint failed and the handler would raise NameError at runtime. Use _get_user_skill_storage(config) — the same accessor every other handler in this router uses. 
Also update the regression test for the current route signature: the handler is now admin-only and takes a Request, so the test supplies request.state.user (mirroring tests/blocking_io/test_channel_runtime_config_store.py) and seeds the history through the same user-scoped accessor. --------- Co-authored-by: ly-wang19 <ly-wang19@users.noreply.github.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-24 22:33:16 +08:00
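The offloading pattern described in the commit above (blocking filesystem reads moved into a sync function run via asyncio.to_thread, with None signalling "not found" as distinct from an empty history) might look roughly like this. The function names, the JSON-lines file layout, and the (status, body) return shape are assumptions made for illustration; the real handler's storage API is not reproduced here.

```python
import asyncio
import json
from pathlib import Path
from typing import List, Optional, Tuple


def _read_history_sync(history_file: Path) -> Optional[List[dict]]:
    """Blocking read. Returns None when the file is absent, [] when it is empty."""
    if not history_file.exists():
        return None
    lines = history_file.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


async def get_history(history_file: Path) -> Tuple[int, List[dict]]:
    """Async handler sketch: all filesystem IO runs off the event loop."""
    history = await asyncio.to_thread(_read_history_sync, history_file)
    if history is None:
        return 404, []   # missing skill/history, distinct from an empty list
    return 200, history
```

Keeping the existence probe and the read inside the same sync function means neither call touches the event loop directly.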
黄云龙	126fc9ea81	fix(subagents): clamp subagent limit consistently with MIN_SUBAGENT_LIMIT (#4081 ) * fix(subagents): align prompt and middleware subagent limit; allow min of 1 SubagentLimitMiddleware clamped max_concurrent to [2, 4] internally, but agent.py and client.py fed the raw config value into the system prompt, so a user-configured 1 (or 5) produced a prompt that disagreed with the enforced middleware limit. Lower MIN_SUBAGENT_LIMIT to 1 and clamp the raw config value with _clamp_subagent_limit() at both the agent factory and the embedded client so the prompt and middleware see the same value. * fix: remove unused imports MAX_CONCURRENT_SUBAGENT_CALLS, MIN_CONCURRENT_SUBAGENT_CALLS, clamp_subagent_concurrency * fix: harmonize clamp range [1,4] across middleware, config, and prompt path; fix lint - Changed MIN_CONCURRENT_SUBAGENT_CALLS from 2 to 1 so prompt.py's clamp_subagent_concurrency and the middleware's _clamp_subagent_limit both clamp to [1,4] — eliminating the divergence where the prompt told the model 'max 2 task calls' but the middleware enforced 1. - Applied _clamp_subagent_limit at build_middlewares (agent.py:360) so all 3 construction sites (agent.py:360, agent.py:450, client.py:259) consistently clamp the config-resolved limit. - Derived MIN_SUBAGENT_LIMIT / MAX_SUBAGENT_LIMIT from MIN_CONCURRENT_SUBAGENT_CALLS / MAX_CONCURRENT_SUBAGENT_CALLS so the two module-level definitions stay in sync. - Added TestConfigParity.test_prompt_path_and_middleware_clamp_agree regression test. - Fixed lint. * fix(lint): add missing imports for MIN_CONCURRENT_SUBAGENT_CALLS and MAX_CONCURRENT_SUBAGENT_CALLS * docs+test: update AGENTS.md clamp range to 1-4; add prompt/middleware parity regression test - backend/AGENTS.md still documented the old [2,4] clamp in two places; updated to [1,4] to match MIN_CONCURRENT_SUBAGENT_CALLS = 1. 
- Added test_apply_prompt_template_single_subagent_limit_matches_middleware: renders the real system prompt with max_concurrent_subagents=1 and asserts the advertised HARD LIMITS value equals SubagentLimitMiddleware's enforced max_concurrent — the end-to-end check that would have caught the [1,4] vs [2,4] prompt-path divergence flagged in review. * refactor: simplify per review — restore clamp delegation, drop redundant call-site clamps Per willem-bd's review, reduce the PR to the one behavioral change plus docs/tests: - _clamp_subagent_limit delegates to clamp_subagent_concurrency again instead of inlining a byte-identical copy; with a single source of truth the TestConfigParity sync-check class is unnecessary — dropped. - Revert the call-site clamps in agent.py (build_middlewares, _make_lead_agent) and client.py (_ensure_agent) to main: both downstream consumers (SubagentLimitMiddleware.__init__ and the prompt path) already clamp internally, and the cross-module private import of _clamp_subagent_limit goes away with them. - Keep MIN_CONCURRENT_SUBAGENT_CALLS = 1 (the fix), the [1, 4] docstring updates, the AGENTS.md range corrections, and the end-to-end prompt/middleware parity test for single-subagent mode (docstring reworded: on main a configured 1 was bumped to 2 by both paths — there was no divergence to fix, just a silently raised floor). * test: fix stale comment referencing reverted agent.py/client.py call-site clamps --------- Co-authored-by: nankingjing <nankingjing@users.noreply.github.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-24 21:56:11 +08:00
Daoyuan Li	ca3e510b7d	fix(scheduler): close duplicate dispatch race (#4105) Enforce one queued or running scheduled-task run per task with a partial unique index. The migration resolves legacy duplicates before creating the index, and losing inserts use the existing conflict or skip outcomes.	2026-07-24 21:41:09 +08:00
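A partial unique index of the kind described in the commit above can be demonstrated with SQLite from the standard library. The table name, column names, and status values are hypothetical; the project's actual schema and migration are not shown in the log.

```python
import sqlite3


def open_scheduler_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_runs ("
        " id INTEGER PRIMARY KEY,"
        " task_id TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )
    # At most one queued-or-running row per task; finished rows are unconstrained.
    conn.execute(
        "CREATE UNIQUE INDEX uq_active_run_per_task"
        " ON task_runs (task_id)"
        " WHERE status IN ('queued', 'running')"
    )
    return conn


def try_dispatch(conn: sqlite3.Connection, task_id: str) -> bool:
    """Insert a queued run; return False if an active run already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO task_runs (task_id, status) VALUES (?, 'queued')",
                (task_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the losing insert maps to a conflict/skip outcome
```

The database rejects the duplicate insert, so two dispatchers racing on the same task cannot both succeed.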
Daoyuan Li	159b774944	fix(skills): handle non-string frontmatter keys (#4167) Normalize YAML frontmatter keys in the shared parser so validation and review report malformed fields instead of failing while sorting mixed key types.	2026-07-24 21:25:53 +08:00
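The failure mode in the commit above comes from Python refusing to order values of different types: a YAML mapping can yield integer or null keys next to string keys, and sorting them raises TypeError. A minimal sketch of key normalization, assuming the frontmatter has already been parsed into a dict (`normalize_frontmatter_keys` is a hypothetical name):

```python
from typing import Any, Dict


def normalize_frontmatter_keys(frontmatter: Dict[Any, Any]) -> Dict[str, Any]:
    """Coerce every key to str so later sorting and validation cannot crash."""
    return {str(key): value for key, value in frontmatter.items()}


def sorted_field_names(frontmatter: Dict[Any, Any]) -> list:
    """Return field names in a stable order for reporting."""
    return sorted(normalize_frontmatter_keys(frontmatter))
```

After normalization, a validator can report the malformed field names instead of failing before it gets to inspect them.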
H Haidong	c7538cfb35	fix(runs): terminate orphaned streams after lease recovery (#4420) * fix(runs): terminate orphaned streams after lease recovery * fix(runs): include recovered ids in callback warnings * fix(runs): harden orphan recovery lifecycle	2026-07-24 19:34:20 +08:00
ShitK	a4ede80deb	fix(runtime): reject unsupported run options and stream modes (#4430) * fix(runtime): reject unsupported run options * fix(runtime): align SDK run compatibility * fix(frontend): avoid unsupported events stream mode --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-24 19:24:24 +08:00
Ryker_Feng	cd9432bcc1	feat(tools): support GIF images in view_image (#4438) Add GIF to the view_image allowlist: map the .gif extension to image/gif and detect the GIF87a/GIF89a magic bytes so the existing extension/content cross-check accepts GIFs instead of rejecting them as an unsupported format. Covered by a new success test.	2026-07-24 13:12:43 +08:00
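An extension/content cross-check of the kind the commit above extends could be sketched as below. The mapping and function names are illustrative only; the magic-byte prefixes for PNG, JPEG, and GIF are the standard file signatures.

```python
import os
from typing import Optional

# Hypothetical allowlist: extension -> expected MIME type.
EXTENSION_TO_MIME = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
}


def sniff_mime(data: bytes) -> Optional[str]:
    """Detect the image type from its leading magic bytes."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "image/gif"
    return None


def check_image(filename: str, data: bytes) -> str:
    """Return the MIME type if the extension and the content agree."""
    extension = os.path.splitext(filename)[1].lower()
    expected = EXTENSION_TO_MIME.get(extension)
    if expected is None:
        raise ValueError("unsupported extension: %r" % extension)
    if sniff_mime(data) != expected:
        raise ValueError("content does not match extension %r" % extension)
    return expected
```

Both halves need the GIF entry: without the extension mapping the file is rejected up front, and without the magic-byte check the cross-check fails.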
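MiaoRuidx	80c06414f8	fix: make orphan reconciliation lease-aware (#4434) Make startup/orphan run recovery atomically re-check the lease through claim_for_takeover before the final write, so that a run whose owner successfully renews its lease after the scan is not wrongly marked as error. Add a regression test for renewal after the scan, and migrate the reconciliation write-failure test to the takeover claim path. Co-authored-by: MiaoRuidx <12540796+MiaoRuidx@users.noreply.github.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-24 09:48:48 +08:00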
Huixin615	fbc1463809	fix(gateway): preserve regenerate state in branched threads (#4358) * fix(gateway): preserve regenerate state in branched threads * test(gateway): isolate branch regenerate regression config * fix(gateway): preserve branching for legacy histories * fix(gateway): harden branch regenerate lineage * docs(gateway): clarify branch checkpoint behavior --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-24 08:57:48 +08:00
Willem Jiang	959f052750	fix(ci): lock the ruff version from the backend Makefile (#4435)	2026-07-24 08:46:55 +08:00
Aari	5f0108f56c	fix(runtime): stop subgraph stream frames impersonating root frames (#4407) * fix(runtime): stop subgraph stream frames impersonating root frames The web frontend always requested stream_subgraphs, and since delegated subagent graphs inherit the parent checkpoint namespace (#4215), their values snapshots and token chunks ride the parent stream. The worker's _unpack_stream_item dropped the namespace and published every subgraph frame under a bare event name, so a subagent's values snapshot replaced the whole thread view in SDK clients (#4399), its token chunks flooded the parent message stream, and a subagent's LLM error fallback could be mistaken for the parent run's. Publish subgraph frames under namespace-qualified SSE event names (mode|ns1|ns2, LangGraph Platform style) and keep root-only consumers (file-tool chunk batcher, subagent event persistence, error-fallback detection) on root frames only. Drop streamSubgraphs from the frontend submit paths: subtask progress arrives via root-namespace task_* custom events, so the flag only exposed the leak. * test(runtime): add production-shaped subgraph stream regression tests Address review: the namespace tests validated the publishing helpers with hand-fed namespaces, while the #4399 regression lived in the integration between LangGraph's delegation routing and the worker's stream loop. Add TestWorkerSubgraphStreamIntegration: a real parent graph delegates through the real SubagentExecutor and streams through run_agent into a real MemoryStreamBridge, locking both stream_subgraphs modes -- delegated frames arrive namespaced (never bare), a delegated error fallback cannot mark the parent run as errored, and without the flag delegated frames stay out while task_* custom events remain.	2026-07-23 23:32:06 +08:00
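The namespace-qualified event naming described in the commit above (stream mode followed by the namespace segments, joined with "|") can be sketched with two small helpers. The helper names are hypothetical and are not the worker's actual functions.

```python
from typing import Tuple


def sse_event_name(mode: str, namespace: Tuple[str, ...] = ()) -> str:
    """Root frames keep the bare mode; subgraph frames get mode|ns1|ns2."""
    return "|".join((mode, *namespace))


def is_root_event(event_name: str) -> bool:
    """Root-only consumers can filter on the absence of a namespace suffix."""
    return "|" not in event_name
```

A subgraph frame published as "values|ns1|ns2" can no longer be mistaken for the root "values" snapshot by a client that matches on the exact event name.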
Huixin615	4a2ecd430e	fix(streaming): expose custom events to astream_events (#4403) * fix(streaming): expose custom events to astream_events * test(streaming): validate real custom event emitters --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-23 22:56:12 +08:00
hataa	7857fa0cce	feat(authz): enforce tool authorization at assembly and runtime (#4370) * feat(authz): enforce tool authorization at assembly and runtime * fix(middleware): guard deferred tool setup lookup (#4370) --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-23 22:51:35 +08:00
MiaoRuidx	f1632cc351	fix(run): add run event stream contract (#4342) * docs: document run event stream contract * fix(run): address event stream review feedback --------- Co-authored-by: MiaoRuidx <12540796+MiaoRuidx@users.noreply.github.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-23 21:33:57 +08:00
Vanzeren	62dd8d2b67	bench(checkpoint): add channel mode benchmark (#4395) * bench(checkpoint): add channel mode benchmark * bench(checkpoint): harden benchmark reporting	2026-07-23 17:54:29 +08:00
Aari	b7933d18e4	fix(safety): backfill empty content-filter responses so they don't poison the thread (#4394 ) An empty assistant message from a provider safety filter (content_filter with no content, no tool calls) was persisted into thread history and replayed to strict OpenAI-compatible providers, which reject it with HTTP 400 ("message ... with role 'assistant' must not be empty") — breaking every later turn until a new chat is started. SafetyFinishReasonMiddleware only handled the tool-call case (#3028) and TerminalResponseMiddleware only the post-tool case (#4027), so a plain empty content-filter response fell through both. Extend the safety middleware to backfill a user-facing explanation when a safety-terminated message is otherwise blank, so the persisted turn is non-empty (and the user sees why it was blocked). Fixes #4393	2026-07-23 16:59:34 +08:00
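The backfill rule from the commit above can be approximated on plain dictionaries. This is a sketch under stated assumptions: messages are modelled as dicts with `content`, `tool_calls`, and `finish_reason` keys, and the notice text is invented for the example; the real middleware operates on the framework's message objects.

```python
from typing import Any, Dict

SAFETY_FINISH_REASONS = {"content_filter"}
# Hypothetical wording; the project's user-facing text is not shown in the log.
BLOCKED_NOTICE = "The provider's safety filter blocked this response."


def backfill_blank_safety_message(message: Dict[str, Any]) -> Dict[str, Any]:
    """Give a safety-terminated assistant message non-empty content."""
    content = message.get("content") or ""
    is_blank = (
        isinstance(content, str)
        and not content.strip()
        and not message.get("tool_calls")
    )
    if message.get("finish_reason") in SAFETY_FINISH_REASONS and is_blank:
        return {**message, "content": BLOCKED_NOTICE}
    return message  # anything else is passed through untouched
```

Because the persisted turn is never empty, replaying the history to a strict OpenAI-compatible provider no longer triggers the empty-assistant-message rejection.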
Daoyuan Li	04659cc8dd	fix(gateway): stop implying 200 webhook deliveries are unrecoverable (#4307 ) PR #4289 corrected the false claim that GitHub auto-retries 5xx webhook deliveries, but its replacement wording overcorrected: it described a mistaken 200 response as dropping the webhook "forever with no way to recover it" / "permanently" - implying manual recovery is impossible, not just unprompted. fancyboi999 flagged this in a CHANGES_REQUESTED review on #4289 (submitted 23:21:23Z, referencing github_webhooks.py:190-197,325-335 and test_github_webhooks.py:548-559) that went unaddressed before the PR was approved and merged roughly 40 minutes later. Verified directly against GitHub's documentation before changing anything: the manual "Redeliver" button and the REST/App redelivery endpoints place no failed-status precondition on the delivery id - any past delivery, success or failure, can be redelivered within GitHub's ~3-day window (https://docs.github.com/en/webhooks/testing-and-troubleshooting-webhooks/redelivering-webhooks, https://docs.github.com/en/rest/repos/webhooks#redeliver-a-delivery-for-a-repository-webhook). The real problem with swallowing a transient failure into 200 is discoverability, not recoverability: the delivery never shows up as failed in Recent Deliveries, and GitHub's own recommended scripted-recovery pattern filters on non-OK status by convention, not because the platform blocks redelivering a success. A 200'd delivery can still be redelivered by hand if an operator happens to look - they just get no signal telling them to, unlike the 503 path, which stays correctly flagged as failed and so is actually found. - github_webhooks.py: reworded the route docstring and the inline fan-out comment to describe the 200-vs-503 difference as discoverability, not raw recoverability, and added the redelivery docs link alongside the existing failed-deliveries link. 
- test_github_webhooks.py: reworded test_dispatch_failure_returns_503_not_200's docstring the same way. No assertions changed. All 44 tests in test_github_webhooks.py pass, plus test_github_dispatcher.py / test_github_channel.py / test_github_registry.py / test_channels.py (322 total). ruff check and ruff format --check are clean on both touched files.	2026-07-23 14:32:49 +08:00
Aari	70fb91654d	fix(gateway): seed branch run-events so inherited history survives forking (#4385 ) * fix(gateway): seed branch run-events so inherited history survives (#4380) The thread feed (GET /messages, /messages/page) reads the run-event store, but branch creation only wrote checkpoint state - a fresh branch had no message rows, so the parent history vanished from the UI as soon as the branch's first run refreshed the feed. Seed the branch's run_events from the same checkpoint snapshot the branch was created from, mirroring RunJournal's message-event contract (event types, hidden-message rules, original-user-text restoration). Best-effort: a seeding failure degrades to the old behavior and is reported as history_seed_mode=failed. * docs(gateway): correct branch-seed docstring on RunJournal divergences The "consumers cannot tell a seeded row from a journaled one" claim was overstated for AI rows: seeded rows omit run-scoped enrichment (usage / latency_ms / llm_call_index) and stamp caller=lead_agent rather than the message's original caller, neither recoverable from a checkpoint message. Rewrite the docstring to state these divergences explicitly and note they are display-invisible today (no consumer indexes those keys; per-message caller drives no attribution). Also add a code comment marking the hide_from_ui filter as intentionally stricter than the live paths. * fix(gateway): seed dict-shaped checkpoint messages + persist hidden AI/tool rows Two review-driven fixes to build_branch_history_seed_events: 1. Checkpoint messages can arrive as model_dump()-shaped dicts (the branch-matching helpers in threads.py already handle both BaseMessage and dict). The seed only handled BaseMessage, so a dict-backed checkpoint seeded nothing and the branch reported skipped_empty while history existed. Coerce dicts back to BaseMessage via messages_from_dict (faithful: tool_calls / tool_call_id / additional_kwargs survive); unparseable dicts are dropped best-effort. 2. 
RunJournal.on_llm_end and _persist_tool_result_message persist hide_from_ui AI/tool rows unconditionally (the frontend hides them client-side); the hide check only gates the reconciliation pass. The seed dropped them, so a hidden turn vanished from a forked feed and seeded rows diverged from journaled ones. Match RunJournal and write them, restoring true row-level parity. Adds tests for dict deserialization, the unparseable-dict drop, and the hidden AI/tool persistence contract.	2026-07-23 13:57:32 +08:00
Admire	a38b1daec3	fix(streaming): keep large file generation responsive (#4354) * fix(streaming): keep large file generation responsive * fix(streaming): address follow-up review feedback * fix(streaming): address final review feedback	2026-07-23 08:51:14 +08:00
Aari	7b330101d2	fix(tools): exclude injected runtime from list_uploaded_files schema (#4375) (#4376) Declaring the injected runtime arg as `Annotated[Runtime, InjectedToolArg] | None` made the top-level annotation a Union, so LangChain no longer treated it as injected. It leaked into the model-facing schema and pydantic raised PydanticInvalidForJsonSchema on the ToolRuntime dataclass the moment the tool was bound to a model. The tool is bound by default for the lead agent, so any default run on an OpenAI-compatible provider failed at tool-bind time. Declare runtime as a bare Runtime first param, matching every other built-in tool (present_files, view_image, task, ...), which LangChain auto-injects and auto-excludes from the schema. Add a schema regression test that binds the tool.	2026-07-23 08:22:15 +08:00
Aari	0d4d0cb17d	feat(agents): database-backed storage for custom agent definitions (#4359 ) * feat(agents): database-backed storage for custom agent definitions Add an agent_storage.backend switch (default file, behaviour-unchanged) with a db backend that stores each custom agent as a row in the shared SQL persistence layer, so a multi-instance deployment sees the same agents on every node (#4331, #4357). Introduces an AgentStore interface routing all read/write surfaces, an agents table + migration 0006, startup validation, and a file->db importer. Follows the thread_meta store / run_events backend-switch / 0003_scheduled_tasks migration patterns; no new dependency. * fix(agents): make db storage path production-ready (review round 1) Addresses review feedback on the db/sync agent-storage path: - sql.py: mirror the async engine's per-connection SQLite PRAGMAs on the sync engine (busy_timeout=30000, synchronous=NORMAL, foreign_keys=ON, WAL) so both engines behave identically against the shared DB; guard the engine cache with a lock (double-checked) so concurrent first-touch cannot build duplicate engines or register the connect listener twice. - routers/agents.py + routers/assistants_compat.py: offload the sync-store reads that ran on the event loop (list/get/check, update's pre-read + legacy guard + refresh, and assistants_compat's four list routes) via asyncio.to_thread — on db+postgres each was a network round trip stalling the loop. Writes were already offloaded. - file.py: translate the create() mkdir(exist_ok=False) race FileExistsError into AgentExistsError (router 409, matching SqlAgentStore's IntegrityError path); correct the _write docstring — per-file atomic replace, two commits sequential not transactional. Tests: sync-engine PRAGMA + engine-cache reuse assertions; file create-race -> AgentExistsError; strict Blockbuster anchor over the read endpoints so a regression back onto the loop fails CI. 
* fix(agents): address round-2 review on the db store path - update_agent tool: align the docstring/inline comment with FileAgentStore._write. Cross-field write atomicity is db-only; the file backend commits config then soul via two sequential os.replace (a crash between them can leave a fresh config.yaml beside a stale SOUL.md). The dropped partial-write reporting is an intentional tradeoff — the stage-then-replace safety is preserved (test_update_agent_soul_failure_does_not_replace_config still holds). - SqlAgentStore.update(): true upsert. Catch IntegrityError on the insert-on-missing branch, re-fetch and apply, so two concurrent first-time writes (e.g. two setup_agent handshakes) converge instead of surfacing a raw UNIQUE(user_id, name) violation as a 500. Symmetric with create(). - get_agent_store(): document the graph-subprocess config-resolution invariant (the except->file fallback is a genuine no-config path, not a mask for a misconfigured graph process) and pin it with two tests driving the real get_app_config() file resolution: db resolves from an on-disk config.yaml, file fallback when config is unresolvable. * test(agents): cover SqlAgentStore.update() write-race upsert recovery Mandatory-TDD test for the round-2 fix in 0680340a: two concurrent first-time update()s where the loser's insert hits UNIQUE(user_id, name). Deterministically forces the IntegrityError recovery path by making the first _row probe miss the committed winner, and asserts last-writer-wins instead of a surfaced 500.	2026-07-23 08:03:21 +08:00
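The write-race recovery described for SqlAgentStore.update() above can be illustrated with SQLite from the standard library. The schema, the function name, and the `stale_probe` flag (which simulates a probe that missed a concurrently committed row) are all assumptions for the example, not the project's store.

```python
import sqlite3


def open_agent_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE agents ("
        " user_id TEXT NOT NULL,"
        " name TEXT NOT NULL,"
        " config TEXT NOT NULL,"
        " UNIQUE (user_id, name))"
    )
    return conn


def upsert_agent(conn, user_id, name, config, stale_probe=False):
    """Insert if missing, otherwise update; recover if the insert loses a race."""
    row = None
    if not stale_probe:  # stale_probe simulates a probe that missed the winner
        row = conn.execute(
            "SELECT 1 FROM agents WHERE user_id = ? AND name = ?",
            (user_id, name),
        ).fetchone()
    if row is None:
        try:
            with conn:
                conn.execute(
                    "INSERT INTO agents (user_id, name, config) VALUES (?, ?, ?)",
                    (user_id, name, config),
                )
            return "inserted"
        except sqlite3.IntegrityError:
            pass  # another writer committed first; fall through and apply
    with conn:
        conn.execute(
            "UPDATE agents SET config = ? WHERE user_id = ? AND name = ?",
            (config, user_id, name),
        )
    return "updated"
```

The loser of the race converges on an update (last writer wins) instead of surfacing the UNIQUE violation to the caller.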
March7	4dd7cafef1	fix(sandbox): serialize E2B release transitions (#4355)	2026-07-23 07:42:43 +08:00
Daoyuan Li	44990ff194	fix(mcp): use threading.Lock for OAuth token refresh to avoid cross-thread deadlock (#4240 ) * fix(mcp): use threading.Lock for OAuth token refresh to avoid cross-thread deadlock OAuthTokenManager created one asyncio.Lock per server for the process lifetime. The embedded/TUI sync tool-call path (DeerFlowClient.stream() -> LangGraph's ToolNode._func -> a ThreadPoolExecutor -> make_sync_tool_wrapper's per-call asyncio.run()) invokes get_authorization_header from a fresh event loop on a fresh OS thread for every concurrent tool call. asyncio.Lock binds to whichever loop first contends on it; when a caller on a different loop later releases or wakes a waiter, it does so without call_soon_threadsafe, so the waiting loop's selector is never woken and that caller hangs forever with no exception. A third concurrent caller instead raises a synchronous RuntimeError ("bound to a different event loop"). Either way, two concurrent OAuth-protected tool calls (including the very first cold-start token fetch) can freeze the entire agent turn. Gateway's async path (ToolNode._afunc) is unaffected. Replace the asyncio.Lock with a plain threading.Lock, acquired via asyncio.to_thread so the blocking wait never blocks the event loop, and released synchronously in a finally block. This keeps the single-fetch de-duplication the lock provided while making it safe across however many event loops/threads call into the same server's lock. Adds a regression test that runs three threads, each with its own event loop, calling get_authorization_header concurrently for the same server, and asserts (with a bounded join timeout so a regression fails fast instead of hanging the suite) that none hang or raise, and that only one real token fetch happens. 
* fix(mcp): make OAuth lock acquisition cancellation-safe get_authorization_header acquired the per-server threading.Lock via a bare `await asyncio.to_thread(lock.acquire)`, with the try/finally that guarantees release only starting after that await returned. Once the executor thread had actually started running lock.acquire(), cancelling the awaiting caller only stopped the caller -- Python cannot interrupt a running OS thread. CancelledError was still delivered to the caller immediately, but the thread kept blocking until the current holder released, then silently acquired the lock with nobody left to call release() for it. The lock stayed locked forever and every later OAuth token refresh for that server blocked permanently at the same line -- the exact cross-thread deadlock this lock was introduced to prevent, reintroduced via a different path under cancellation (e.g. a caller wrapped in asyncio.wait_for/asyncio.timeout, or task-group cancellation). Run the acquisition as an explicit asyncio.create_task, awaited via asyncio.shield() so cancelling the caller no longer cancels the underlying acquisition task. If the caller is cancelled, keep (re-)waiting on the still-shielded acquisition task -- tolerating further cancellation during this cleanup by simply retrying -- until it actually finishes, release the lock immediately, and only then re-raise. This guarantees the lock is released regardless of when or how many times the caller is cancelled: before the acquisition is even scheduled, while queued, or after it has already been silently granted. Adds a regression test that holds the per-server lock, starts a second caller that has to wait for it, cancels that caller while it is genuinely blocked in its executor thread, releases the original holder, and asserts a third caller completes within a bounded asyncio.wait_for and still performs exactly one token fetch. 
Every potentially-hanging await is bounded so a regression fails the test quickly instead of hanging the suite.	2026-07-22 19:58:43 +08:00
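The locking approach described in 44990ff194 above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `acquire_cancellation_safe` and `get_token` are hypothetical names, and the token fetch is simulated with a list append.

```python
import asyncio
import threading


async def acquire_cancellation_safe(lock: threading.Lock) -> None:
    """Acquire a threading.Lock from any event loop without leaking it on cancel."""
    # The blocking acquire runs in a worker thread, so the event loop stays free.
    task = asyncio.create_task(asyncio.to_thread(lock.acquire))
    try:
        # shield(): cancelling the caller does not cancel the acquisition task.
        await asyncio.shield(task)
    except asyncio.CancelledError:
        # The worker thread cannot be interrupted, so wait until it finishes,
        # tolerating repeated cancellation, then hand the lock back.
        while not task.done():
            try:
                await asyncio.shield(task)
            except asyncio.CancelledError:
                continue
        if not task.cancelled() and task.exception() is None and task.result():
            lock.release()
        raise


async def get_token(lock: threading.Lock, cache: dict, fetch_log: list) -> str:
    """Hypothetical single-fetch de-duplication guarded by the lock."""
    await acquire_cancellation_safe(lock)
    try:
        if "token" not in cache:
            fetch_log.append("fetch")  # stands in for the real token request
            cache["token"] = "Bearer example"
        return cache["token"]
    finally:
        lock.release()
```

Because the lock is a plain `threading.Lock`, it is not bound to any event loop, so callers running `asyncio.run()` on separate threads can share it.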
March7	8c78d1f41f	fix(subagents): load user-scoped skills (#4356 )	2026-07-22 14:59:33 +08:00
Daoyuan Li	314f84bc8d	fix(feishu): check response.success() on card/reaction SDK calls (#4234 ) * fix(feishu): check response.success() on card/reaction SDK calls _reply_card, _create_card, _update_card, and _add_reaction call the lark-oapi SDK and only used the response on the happy path, never checking response.success(). lark-oapi signals a business-level failure (invalid/expired card, permission error, etc.) by returning a response with success()=False rather than raising, so these calls looked identical to callers whether Feishu accepted them or not. This file's own _upload_image/_upload_file/_receive_single_file already guard against exactly this by checking response.success() before trusting the response; the card/reaction helpers just didn't follow that established pattern. The gap is most exposed on _update_card: Feishu supports streaming, so a single conversation issues many _update_card patches, each one a chance to silently drop an update. _send_card_message already has a try/except around _update_card that retries (via _send_with_retry) on non-final failures and falls back to a brand-new card on final ones - but that logic was unreachable because _update_card could never raise on a business failure. Adds response.success() checks to all four methods, raising for _reply_card/_create_card/_update_card (mirroring the upload helpers, and making the existing retry/fallback logic in _send_card_message reachable) and logging a warning for _add_reaction (mirroring _receive_single_file, since a failed reaction is fire-and-forget and must not trigger a redundant resend of the whole card). Adds regression coverage in TestFeishuCardSuccessChecks: a business-failure mock response for each of the four methods, plus two tests driving _send_card_message end to end to confirm the retry and fallback-to-new-card paths actually engage now. 
* fix(feishu): include log_id in card SDK failure errors + cover create_card retry path willem-bd's review on this PR suggested two non-blocking follow-ups: - _reply_card/_create_card/_update_card's RuntimeError on a business-level failure omitted the Feishu log_id, unlike _add_reaction and _receive_single_file in this same file, which already include it in their warning logs. Adding it gives a Feishu support-traceable id once retries exhaust and the error reaches the caller. - _create_card's failure on the no-thread_ts path (the tail of _send_card_message) only had direct unit coverage (test_create_card_raises_on_business_failure_response), unlike _update_card's failure path, which also has an end-to-end test through send() confirming _send_with_retry engages (test_send_retries_after_update_card_business_failure_then_succeeds). Adds the mirrored end-to-end test for the _create_card path.	2026-07-22 14:52:42 +08:00
Daoyuan Li	09d9cf53d2	fix(harness): add timeout to invoke_acp_agent to prevent indefinite hangs (#4238 ) invoke_acp_agent had no timeout anywhere in its call path, and ACPAgentConfig had no timeout field. If the ACP agent subprocess answers initialize/new_session correctly but then hangs inside prompt(), the tool call - and therefore the whole agent turn - blocks indefinitely, with the child process left running. MCP stdio servers already guard against this class of hang via tool_call_timeout; ACP agent invocations had no equivalent. Add ACPAgentConfig.timeout_seconds (default 1800, ge=1), mirroring the shape/default of subagents.timeout_seconds, and wrap the conn.prompt() call in asyncio.wait_for(). On TimeoutError, return a clear error instead of hanging; exiting the spawn_agent_process context block triggers the ACP library's own graceful-then-forceful subprocess cleanup, so the hung process is actually terminated.	2026-07-22 14:47:08 +08:00
lllyfff	01a89f2379	[feat] memory: pluggable MemoryManager interface for backend onboarding (#4326 ) * refactor(memory): pluggable MemoryManager interface for backend onboarding Optimize the MemoryManager interface layer so new backends (mem0/openviking) onboard with less code and the contract stays stable as capabilities are added. A minimal backend now implements only from_config + add + get_context (verified by test_memory_manager_interface.py::_MinimalBackend onboarding via the factory); the factory no longer knows a backend's private hooks. - MemoryManager: ABC -> pydantic BaseModel; three-tier methods (tier-1 add/get_context abstract; tier-2 management defaults; tier-3 optional hooks warm/reload/fact + on_pre_compress/on_turn_start). Dropped 3 self-serving hooks. 6 hasattr probe sites -> direct call + try/except NotImplementedError. - from_config classmethod: factory thins to resolve + inject storage_path + collect host hooks + call from_config; DeerMem-specific hook consumption moved from factory to DeerMem.from_config. - Invariants: @model_validator (mode='tool' requires search via supports_search ClassVar); DeerMemConfig storage_path-is-file check moved here from factory. - Async: aadd/aget_context/asearch default to the sync path (speculative). - Callbacks: MemoryCallbacks + LangfuseMemoryCallbacks; on_memory_llm_call subsumes tracing_callback (same signature/timing/mutation); deleted the tracing_callback field. DeerMem decoupled from langfuse (portability). - noop keeps read-op empty overrides (avoids router 500s on the disable-memory-via-noop path); only delete/export inherit the base raise. Behavior preserved: 661 passed / 13 skipped. Docs: backends/README.md rewritten (three-tier + from_config + callbacks); samples README updated; removed stale private doc paths. Co-Authored-By: Claude <noreply@anthropic.com> * fix(memory): 501 on unsupported read/manage endpoints + accurate warm log Review follow-up on the three-tier MemoryManager refactor. 
- Read/manage endpoints (GET /memory, /memory/export, /memory/status, DELETE /memory, POST /memory/import) and the /memory/reload fallback now catch NotImplementedError -> 501, matching the fact-CRUD endpoints. The hasattr->try/except migration had skipped these: they were @abstractmethod before (every backend implemented them, so they never raised), so once they became tier-2 default-raise a minimal backend (only add + get_context) hit a raw 500 -- there is no global NotImplementedError handler. get_memory is shared via _get_memory_or_501 (covers /memory, export, status, reload fallback). noop is unchanged: its read-op empty overrides never raise. - warm() base default returns None (tri-state: True=warmed, False=failed, None=nothing to warm) so the Gateway lifespan logs "skipping" for a non-DeerMem backend (e.g. noop) instead of the inaccurate "warmed successfully" it never earned. DeerMem.warm keeps True/False. - Tests: 6 router 501 tests (read/manage + reload fallback) + 2 lifespan warm-log tests (None->skipping, False->warning); conformance/pluggable assert warm() is None. 705 passed / 13 skipped; lint clean. Co-Authored-By: Claude <noreply@anthropic.com> * fix(memory): review follow-ups - search-flag consistency, client reload, backend_config purity Address review feedback on the three-tier MemoryManager refactor: - [Medium] supports_search/search drift: the invariant now requires the supports_search ClassVar flag to MATCH whether search() is actually overridden (type(self).search is not MemoryManager.search), so the flag can't drift from the impl. Catches both directions at instantiation: a backend that overrides search() but forgets supports_search=True (was a misleading tool-mode rejection), and one that sets the flag without overriding (was a runtime NotImplementedError on the first memory_search). noop sets supports_search=True to match its search() override. Conformance adds drift + consistent-backend tests. 
- [Low] client.reload_memory fallback: wrap the get_memory fallback so a minimal backend (only add + get_context) surfaces a clean NotImplementedError ("implements neither reload_memory nor get_memory") instead of an uncaught propagation -- mirrors the router's 501. Test added. - [Low] backend_config purity: DeerMem.from_config restores backend_config to the pure data the host passed after model_post_init parses the injected hooks into DeerMemConfig (self._config, PrivateAttr); the field stays serializable (no callables/LLM) and matches the README ("host hooks NOT in backend_config"). Test asserts purity + hooks wired. - [Low] CHANGELOG: breaking-change note that mode='tool' + non-search backend now fails fast at startup (was silently empty) so operators recognize it on upgrade. - [Nit] .gitignore: drop the env-specific .tmp-pytest/ entry (--basetemp is local-only, not make test/CI). 709 passed / 13 skipped; lint clean. Co-Authored-By: Claude <noreply@anthropic.com> * docs(changelog): correct memory tool-mode fail-fast note The CHANGELOG entry said mode='tool' + a non-search backend "(e.g. noop)" fails fast at startup, but noop overrides search() (returns []) and sets supports_search=True (required by the consistency invariant), so noop IS search-capable and noop+tool does NOT fail fast. The fail-fast only affects a custom backend that onboards without overriding search(). Reworded to drop the misleading noop example and state both shipping backends implement search(). Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>	2026-07-22 14:40:57 +08:00
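The supports_search consistency invariant from 01a89f2379 above can be illustrated with a plain-Python sketch. The real MemoryManager is described as a pydantic BaseModel with a model validator; the class below is a hypothetical simplification that only shows the override check.

```python
class MemoryManagerSketch:
    """Hypothetical, pydantic-free stand-in for the base class."""

    supports_search = False  # class-level capability flag

    def __init__(self) -> None:
        # The flag must match whether search() is actually overridden.
        overrides = type(self).search is not MemoryManagerSketch.search
        if overrides != type(self).supports_search:
            raise ValueError(
                f"{type(self).__name__}: supports_search={type(self).supports_search} "
                f"but search() overridden={overrides}"
            )

    def search(self, query: str) -> list:
        raise NotImplementedError


class MinimalBackend(MemoryManagerSketch):
    pass  # no search, flag left False: consistent


class SearchBackend(MemoryManagerSketch):
    supports_search = True

    def search(self, query: str) -> list:
        return []


class DriftedBackend(MemoryManagerSketch):
    def search(self, query: str) -> list:  # overrides but forgets the flag
        return []
```

Instantiating `DriftedBackend()` raises at construction time, which is the fail-fast behaviour the commit message describes for both directions of drift.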
Aari	05e4f4f6d8	fix(sandbox): bound E2B output synchronization resources (#4364 ) * fix(sandbox): bound E2B output synchronization resources E2B release-time output sync pulled every changed file back from the remote VM with only a per-file size cap and no aggregate bound, so a pathological outputs tree (thousands of files, or many sub-cap files summing to gigabytes, or a slow VM) could make release download unboundedly on a hot path that runs at every agent turn end. Add three aggregate ceilings on top of the per-file cap — total bytes, file count, and a wall-clock deadline — enforced in the sync loop. When a ceiling is hit the pass stops early, logs what it dropped, and defers the rest to the next release. A truncated pass skips stale-manifest pruning so files it never reached are reconciled next time instead of being forgotten and re-downloaded. Closes #4340 * test(sandbox): pin multi-pass convergence of bounded output sync The four truncation tests each exercise a single capped pass. Add a two-pass test that locks in the invariant the design relies on for correctness: already-synced files are skipped before the budget check, so they never consume the cap and the deferred tail drains over successive releases instead of the leading files being re-downloaded every turn. A refactor that let a skipped file consume the cap would pass the single-pass tests but fail this one.	2026-07-22 14:32:54 +08:00
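A minimal sketch of the bounded sync loop described in 05e4f4f6d8 above, under stated assumptions: files are modelled as (name, size) pairs, downloads are simulated, and `bounded_sync`, `SyncResult`, and all parameter names are hypothetical rather than the sandbox's real API.

```python
import time
from dataclasses import dataclass, field


@dataclass
class SyncResult:
    synced: list = field(default_factory=list)
    truncated: bool = False  # True when a ceiling stopped the pass early


def bounded_sync(files, already_synced, *, max_file_bytes, max_total_bytes,
                 max_files, deadline_seconds, clock=time.monotonic):
    """Simulate one release-time pass over (name, size) pairs."""
    start = clock()
    total = 0
    result = SyncResult()
    for name, size in files:
        if name in already_synced:
            continue  # skipped before the budget check, so it costs nothing
        if size > max_file_bytes:
            continue  # existing per-file cap
        over_budget = (
            len(result.synced) >= max_files
            or total + size > max_total_bytes
            or clock() - start > deadline_seconds
        )
        if over_budget:
            result.truncated = True  # defer the rest to the next release
            break
        total += size
        result.synced.append(name)
    return result
```

A caller would skip stale-manifest pruning whenever `truncated` is true, so files the pass never reached are reconciled on a later release.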
Lee minjing	e225ad57d7	feat(uploads): lazy-load historical files via list_uploaded_files tool (#4174) * feat(uploads): lazy-load historical files via list_uploaded_files tool Replace per-turn injection of all historical upload metadata with on-demand discovery via a new `list_uploaded_files` built-in tool, following the same deferred-discovery pattern used by skills. - Rename <uploaded_files> block to <current_uploads> (current-run files only) - Add list_uploaded_files tool with include_outline: bool | list[str] - Extract outline helpers to shared deerflow/utils/file_outline.py - Update system prompt to reflect lazy-loading behaviour - Historical file scan removed from UploadsMiddleware.before_agent() Co-Authored-By: Claude <noreply@anthropic.com> * fix(uploads): clear uploaded_files state when no new files in current turn When before_agent() returns None on empty turns, the LastValue uploaded_files field retains the previous turn's filenames. list_uploaded_files then incorrectly excludes those files as "current-run" files, making them invisible until the next upload. Fix: return {"uploaded_files": []} instead of None to explicitly clear state. Add two-turn regression test covering the exact scenario from review feedback. 
Co-Authored-By: Claude <noreply@anthropic.com> * fix: resolve CI lint errors and stale test assertion from merge - Split long prompt line to fit 240-char limit - Add missing `Any` import in list_uploaded_files_tool - Remove unused `re` import in file_conversion (outline code moved) - Remove unused `os` import in middleware test - Fix test assertion: <uploaded_files> → <current_uploads> after main merge Co-Authored-By: Claude <noreply@anthropic.com> * fix: resolve CI lint errors and stale test assertion from merge - Split long prompt line to fit 240-char limit - Add missing `Any` import in list_uploaded_files_tool - Remove unused `re` import in file_conversion (outline code moved) - Remove unused `os` import in middleware test - Fix test assertion: <uploaded_files> → <current_uploads> after main merge Co-Authored-By: Claude <noreply@anthropic.com> * fix: add current_uploads to input sanitization exempt tags The lazy-loading PR renamed <uploaded_files> to <current_uploads>. The anti-drift guard scans all framework XML blocks and requires each to be either blocked or explicitly exempted. current_uploads wraps trusted server-generated file metadata, not user input, so it belongs in the exempt set. Co-Authored-By: Claude <noreply@anthropic.com> * test: regenerate replay golden after uploaded_files state change before_agent now returns {"uploaded_files": []} instead of None, adding uploaded_files to SSE values events. Regenerated via DEERFLOW_WRITE_GOLDEN=1. 
Co-Authored-By: Claude <noreply@anthropic.com> * fix: review feedback: memory pipeline, stale tags, state clearing, nits - Match both tags in memory stripping pipeline (uploaded_files|current_uploads) - Remove stale uploaded_files from _BLOCKED_TAG_NAMES - Clear uploaded_files on all before_agent early-return paths - Fix ponytail: stray word in file_conversion re-export comment - Remove dead total_omitted branch in _format_omitted_summary - ruff format fixes Co-Authored-By: Claude <noreply@anthropic.com> * fix: block current_uploads, sanitize only original user content Per review feedback: instead of exempting <current_uploads> (which allows user forgery), move it to _BLOCKED_TAG_NAMES and change InputSanitizationMiddleware._process_request to scan only the original user content (ORIGINAL_USER_CONTENT_KEY) when available. Server-injected trusted blocks are no longer checked against the blocked-tag denylist. Co-Authored-By: Claude <noreply@anthropic.com> * docs: clarify fallback reason in input sanitization comment Co-Authored-By: Claude <noreply@anthropic.com> * fix: third-round review feedback: state visibility, sanitization, regex, nits - list_uploaded_files_tool: logger.warning instead of silent try/except on runtime.state read failure (High) - input_sanitization_middleware: _extract_text_from_content skips empty text blocks to match message_content_to_text behaviour; rfind fallback path logs warning for observability (Medium) - memory pipeline regexes: backreference (?P<tag>)(?P=tag) in message_processing.py and prompt.py (Low) - file_conversion.py: re-export moved to top of file (Low) - Tests: middleware→tool state bridge test; integrated forged-tag + multimodal sanitization tests PR #4174: Follow-up issues: #4212, #4213, #4214 Co-Authored-By: Claude <noreply@anthropic.com> * fix: 4th-round review: denylist, sanitization, scandir, nits - Add "uploaded_files" back to _BLOCKED_TAG_NAMES (old tag still processed by deermem; user forgery must be 
escaped) (consistency) - Fix inaccurate rfind-fallback comment: UploadsMiddleware keeps string as string, fallback is unreachable for strings (doc fix) - Distinguish "empty string key" (upload without text) from "non-string key" (caller forgery) so empty-text uploads never escape the server block (edge) - Merge dual os.scandir(uploads_dir) calls into one list re-use (minor) - Add comment on .md sibling skip known limitation: user-uploaded .md files whose stem collides with a converted doc are hidden (boundary, no code change) Co-Authored-By: Claude <noreply@anthropic.com> * fix: tighten rfind-failure fallback: distinguish server blocks from user blocks When _extract_text_from_content and message_content_to_text disagree on multimodal list content and rfind fails, use content[0] (server-injected <current_uploads> block) vs content[1:] (user blocks) to sanitize only user blocks. Raw strings and non-standard dict blocks that _extract_text_from_content misses are now also sanitized. Non-distinguishable paths (< 2 text blocks, non-list content) still degrade to full sanitization (safe: server block may be escaped but user forgery never leaks). All fallback paths log via logger.warning. 
Decision 18 / willem-bd 4th-round comment #3 Co-Authored-By: Claude <noreply@anthropic.com> * fix: correct comments referencing text_blocks → content in rfind fallback Co-Authored-By: Claude <noreply@anthropic.com> * fix: 5th-round review: dead code, subagent gating, integration test, perf, consistency - Delete unreachable ORIGINAL_USER_CONTENT_KEY guard in rfind fallback branch (original_user_content guaranteed non-empty str at that point) - Remove list_uploaded_files from BUILTIN_TOOLS; add include_upload_tool param to get_available_tools(), default True; task_tool.py passes False so subagents no longer receive a tool whose state exclusion is broken - Add integration test exercising real create_agent graph (not mocked runtime.state) to verify LangGraph propagates before_agent state writes into ToolRuntime.state during same-turn tool calls - Cache DirEntry.stat() st_size in candidates tuple to avoid second per-file syscall in the rendering loop - Make the upload-tag pre-check case-insensitive (content_str.lower()) to match _UPLOAD_BLOCK_RE re.IGNORECASE PR #4174: willem-bd 5th-round review items #1-#5 Co-Authored-By: Claude <noreply@anthropic.com> * fix(channels): pass files metadata through _human_input_message() for IM uploads _human_input_message() was not passing additional_kwargs.files to the downstream message. UploadsMiddleware read no files, wrote uploaded_files=[], and list_uploaded_files reported same-run IM attachments as historical files (fancyboi999 repro). Fix: add files parameter to _human_input_message(), call site passes files=uploaded. Regression test locks the contract. Co-Authored-By: Claude <noreply@anthropic.com> * fix(channels): remove legacy <uploaded_files> manual prepend to fix double-injection regression Commit 8d86dbf6 added files= pass-through to UploadsMiddleware but left the manual _format_uploaded_files_block() prepend in place. 
Every IM attachment reached the model twice — once via the legacy <uploaded_files> block and once via <current_uploads>. This commit removes the manual prepend and the now-dead _format_uploaded_files_block() function. UploadsMiddleware is the sole upload-context producer for both IM and web paths. Reported-by: fancyboi999 (PR review) Co-Authored-By: Claude <noreply@anthropic.com> * docs: update #4212 issue body to reflect completed fixes and narrowed remaining scope * chore: remove temporary scratch file * fix(middleware): neutralize user-derived values inside <current_uploads> block Upload-derived filenames, paths, outline titles, and preview text are interpolated verbatim inside the trusted <current_uploads> wrapper, which InputSanitizationMiddleware exempts from sanitization. A crafted filename or document heading containing blocked authority tags would bypass the guardrail and enter model context as trusted framework data. Fix: call neutralize_untrusted_tags() on all four user-derived values inside _format_file_entry(), preserving the outer <current_uploads> wrapper untouched. Reported-by: fancyboi999 (P1 security review) Co-Authored-By: Claude <noreply@anthropic.com> * fix(middleware): neutralize extension labels in omitted-file summary Files exceeding the 10-item context cap bypass _format_file_entry(). Their extensions, derived from user-controlled filenames via _extension_label(), were interpolated verbatim into the trusted <current_uploads> wrapper — another path for blocked authority tags to escape the guardrail. Fix: neutralize extension values inside _extension_label(), the single extraction point for all extension labels. 
Reported-by: fancyboi999 (P1 security review) Co-Authored-By: Claude <noreply@anthropic.com> * fix(tools): neutralize user-derived values in list_uploaded_files tool result Apply neutralize_untrusted_tags() to every model-visible user-derived value returned by list_uploaded_files: filename, virtual path, extension, outline titles, outline preview lines, and omitted-file extension summary. This closes the last remaining injection bypass in the upload lazy-loading path - the <current_uploads> block and its omitted summary were already neutralized (previous commits), but the list_uploaded_files tool produced a second exit for the same attacker-controlled metadata that ToolResultSanitizationMiddleware did not cover. Co-Authored-By: Claude <noreply@anthropic.com> * fix(tests): add missing include_upload_tool=False to task_tool mock assertions PR #4174 added include_upload_tool parameter to get_available_tools(). task_tool.py correctly passes include_upload_tool=False for subagents but 5 existing tests' assert_called_once_with expectations were not updated, causing CI failures. Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>	2026-07-22 14:02:56 +08:00
Daoyuan Li	40c4ec32f4	fix(blocking-io): trace self/cls attribute chains and local aliases in the call graph (#4200 ) * fix(blocking-io): trace self/cls attribute chains and local aliases in the call graph _record_call_ref only recorded a call-graph edge for bare-name calls and literal self./cls. single-hop attribute calls (self.flush()). Any other receiver shape fell through the "." not in call_name fallback and was silently dropped from the graph -- including a deeper self./cls. attribute chain (self.store.flush()), a local variable holding a self./cls. attribute (store = self.store; store.flush()), or a parameter used directly as a receiver. A real blocking call reachable from async code only through one of those shapes never surfaced as a finding, the opposite (and more dangerous) failure mode from the duplicate-helper-name over-report this detector already documents. Trace those shapes back to a self./cls. attribute or a parameter, within the same function only, and resolve them through the same bare-method-name fallback already used for receivers that cannot be resolved to a name at all -- no new false-positive risk beyond what that existing fallback already accepts. * fix(blocking-io): narrow alias tracking to fix three scope-creep bugs The alias/receiver tracking this detector added reused dotted_name(), which intentionally unwraps ast.Call/ast.Subscript for blocking-call pattern matching elsewhere in this module. Reusing it for alias extraction let a Call or Subscript result inherit its base's alias-worthiness, so factory().flush(), client = factory(); client.flush(), and client = clients[0]; client.flush() were all incorrectly treated as calls on a traced receiver. Add _simple_receiver_name(), a restricted Name/Attribute-only extractor, and use it wherever a receiver/alias is extracted instead of dotted_name(). 
Alias state also only ever grew: _record_local_receiver_alias_targets never removed a name once traced, so a later reassignment to a non-traceable value (client = NonBlockingClient()) left the name aliased forever, still exposing unrelated same-named methods. Reassigning now resolves traceability from scratch and kills the name when the new value isn't traceable. Separately, if/else branches had no isolation: with no visit_If override, body and orelse shared one mutable alias set, so an alias added in one leaked into the other and the result depended on which branch was textually first. Add a visit_If override that snapshots aliases before the branch, resets between body and orelse, and unions their exit states afterwards -- a conservative, order-independent may-alias join. Scoped to ast.If only; ast.Try/ast.Match keep the previous unisolated traversal (different, more complex control-flow semantics, out of scope here). Finally, _visit_function pushed the new function's context before visiting decorator_list/args/returns, but those expressions run at definition time in the enclosing scope, not the function body. A default value referencing an outer name that happens to match one of the function's own parameter names (receiver = Store(); async def route(receiver=receiver.flush())) was misattributed to route itself. Visit decorators, parameter defaults/annotations, the return annotation, and PEP 695 type-parameter bounds before pushing the new function's context so they resolve against the enclosing scope. Real-scanner output against the actual backend tree is unchanged (41/41 findings, byte-identical JSON) -- these were latent false-positive/negative risks in shapes the current codebase doesn't happen to contain, not active miscounts. 
* fix(blocking-io): fix three more alias-tracking and definition-time bugs _record_local_receiver_alias_targets ran before the assignment's own value was visited, so an assignment's RHS was analyzed against the alias state after the target had already been updated/killed for this same statement. Python evaluates the RHS before binding the target: with `client = self.store` followed by `client = client.flush()`, the second statement's target update killed `client`'s alias before its own RHS (`client.flush()`) was visited, so that call silently disappeared from the graph. visit_Assign and visit_AnnAssign now visit the RHS first and only update the target's alias afterward, matching Python's own evaluate-then-bind order. _simple_receiver_name still returned the trailing attribute name whenever its recursive parent lookup came back unsupported (a Call or Subscript), instead of refusing the whole chain -- so `factory().client` and `clients[0].client` both collapsed to plain "client", which, when "client" was also a traced parameter or local alias, incorrectly linked `factory().client.flush()` to an unrelated same-file `Store.flush`. Return None instead of falling back to `node.attr`, so an unsupported node anywhere in the chain makes the whole receiver unresolved rather than a truncated suffix of it. Finally, _visit_function's enclosing-scope walk of decorators, defaults, annotations, and type_params recursed into every subexpression uniformly, including ones that don't actually execute at definition time: a lambda's body, a bare generator expression's element/later-for clauses, annotations postponed by `from __future__ import annotations`, and PEP 695 type-parameter bounds (always evaluated lazily, in their own hidden function, only if something like T.__bound__ is ever accessed). 
Add visit_Lambda/visit_GeneratorExp overrides that stop at exactly the eager subset (a lambda's own parameter defaults; a generator's outermost iterable), skip parameter/ return annotations entirely once a `postponed_annotations` flag is set by the future import, and drop the type_params walk instead of moving it to the enclosing scope. Real-scanner output against the actual backend tree is unchanged (41/41 findings, byte-identical JSON) -- these were latent risks in shapes the current codebase doesn't happen to contain, not active miscounts. * fix(blocking-io): preserve eager traversal for immediately invoked lambdas and consumed generators visit_Lambda/visit_GeneratorExp (added last round to stop treating merely created lambda/generator objects as executing at definition time) were unconditional, so they also suppressed bodies that genuinely execute right away: an immediately invoked lambda ((lambda: ...)()) and a generator expression passed directly to an eager-consuming builtin (list/set/tuple/ frozenset/dict/sorted). visit_Call now marks a Lambda used as its own func, or a GeneratorExp passed as the sole argument to one of those builtins, by node identity before generic_visit runs. visit_Lambda/visit_GeneratorExp check that marker and, on a match, visit the node fully instead of applying the lazy walk. A lambda/generator that is merely created, stored, passed as a callback, or invoked later through a variable is unaffected and stays lazy. * fix(blocking-io): scope lambda/generator laziness to definition-time expressions only visit_Lambda/visit_GeneratorExp were unconditional overrides, so they suppressed lambda bodies and generator elements everywhere the visitor reached one, not only inside another function's definition-time expressions (decorators, parameter defaults/annotations, return annotation) where that suppression is actually needed. 
In ordinary function-body code this caused real false negatives: a lambda stored in a local and called through that name (callback = lambda: os.listdir("."); callback()), a generator reduced by sum/any/all/min/max, a bare lambda/generator that is merely created, and a generator wrapped in another lazy iterator like map(...) all went unscanned, even though none of them are definition-time expressions at all. The previous fix for this (an id()-keyed marker set covering exactly two eager shapes -- an immediately invoked lambda, and a generator passed directly to a fixed list of eager-consuming builtins) narrowed the suppression back down, but only for those two shapes, and the underlying eager-consumer builtin set itself excluded true reducers (sum/any/all/ min/max) that consume their generator argument just as eagerly as list/ set/etc. Both are instances of the same problem: enumerating every shape in which a lambda or generator happens to be invoked/consumed piecemeal inside an AST visitor, which is unbounded in the general case. Replace both mechanisms with a single boolean, _in_definition_time_expression, set only while _visit_function walks another function's own decorators/defaults/annotations/return annotation. visit_Lambda/visit_GeneratorExp apply their lazy (defaults-only/outermost-iterable-only) walk only while it is set; everywhere else they fall through to a full generic_visit, scanning lambda bodies and generator elements unconditionally -- the same conservative, over-report-rather-than-infer stance this file already takes for reachability elsewhere. This removes EAGER_ITERABLE_CONSUMER_NAMES and the two identity-marker sets entirely rather than growing them further. 
The one shape this newly gives up on -- an immediately invoked lambda or eagerly consumed generator used as another function's decorator/default/annotation value -- is now an explicit, narrow, documented limitation (see backend/AGENTS.md): definition-time expressions never scan a nested lambda body or generator element, full stop, regardless of whether it happens to be invoked right there. Targeted suite (test_detect_blocking_io_static.py + test_scan_changed_blocking_io.py + test_detector_repo_root.py + blocking_io/test_gate_smoke.py): 65/66 pass, the one failure a pre-existing Windows path-separator comparison unrelated to this file. Full backend suite: identical 64 pre-existing failures on both the pre-fix and post-fix commit, confirmed by diffing the two failure lists directly -- zero regressions. Real scanner against the actual backend tree: 41/41, byte-identical JSON before and after -- these were latent risks in shapes the current codebase doesn't happen to contain, not active miscounts.	2026-07-22 13:55:40 +08:00
Ryker_Feng	20debf9cc7	feat(agents): per-agent model and generation settings (#4347 ) * feat(agents): per-agent model and generation settings Let each custom agent choose its own model and sampling settings (temperature, max_tokens) plus thinking / reasoning_effort defaults, so agents sharing a model profile are no longer stuck with one shared temperature and output length (#4336). AgentConfig gains optional model_settings / thinking_enabled / reasoning_effort (None = inherit). create_chat_model applies per-caller model_overrides on top of the profile before the thinking/Codex transforms; the lead agent resolves each knob with precedence request > agent config > profile/default. The /api/agents create/update routes persist the fields and reject an unknown model. The default lead agent path is unchanged (no agent config -> overrides None). The agent chat composer also stops force-overriding an agent's configured default model with models[0]. * fix(agents): tri-state thinking control and default-model capability gating The model-settings dialog seeded the thinking switch to false, so opening it to tweak temperature and saving silently disabled thinking (the runtime default is on) with no way back to inherit. It also hid the thinking / reasoning controls whenever the agent inherited the global default model, since `__default__` never resolved through `models.find`. Give thinking an explicit Inherit / On / Off tri-state so an untouched save is a no-op, and resolve `__default__` to the effective default (models[0]) for the capability check. Logic lives in the tested helpers module.	2026-07-22 13:44:55 +08:00
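The precedence rule stated in 20debf9cc7 above (request > agent config > profile/default, with None meaning "inherit") can be sketched as a first-non-None lookup. `resolve_setting` is a hypothetical helper name used only for illustration.

```python
def resolve_setting(request_value, agent_value, profile_default):
    """Return the first value that is not None; None means 'inherit'."""
    for value in (request_value, agent_value):
        if value is not None:
            return value
    return profile_default
```

Comparing against None rather than truthiness matters for the tri-state thinking control: an explicit `False` (Off) must win over an inherited `True`.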
Andrew Chen	ae510cb2e8	fix(sandbox): make an empty old_str a no-op in str_replace on any file (#4256 ) str_replace guards the replacement with `if old_str not in content`, which cannot reject an empty old_str -- `"" in content` is always true. So an empty old_str reached `str.replace("", new_str)`, which inserts new_str at every character boundary, and the tool rewrote the file while still returning "OK": old_str='', new_str='# H\n' -> OK, file silently prepended; old_str='', new_str='X', replace_all -> OK, 'XdXeXfX XmXaXiXnX(X)X:X\nX...'. The empty-file branch above it already handles this case (`if not content: if not old_str: return "OK"`), and the existing test states the intent directly: "An empty old_str is a no-op edit and remains a benign OK". That contract just never held once the file had content. The tool is registered by default (config.example.yaml) and its schema declares old_str as a plain string with no minLength, so a model can emit "" legitimately; read-before-write only compares a hash and lets it past. Check old_str first so the no-op holds whatever the file contains. The empty-file case folds into the same not-found branch, which keeps its message and behaviour. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-07-22 09:20:54 +08:00
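The Python behaviour behind this fix can be reproduced in a few lines. The str_replace function below is a hypothetical stand-in with an assumed signature and return shape, not the sandbox tool itself; it only shows the "check old_str first" ordering described above.

```python
def str_replace(content: str, old_str: str, new_str: str,
                replace_all: bool = False) -> tuple[str, str]:
    """Hypothetical stand-in for the tool: returns (new_content, status)."""
    # Guard first: an empty old_str is a no-op whatever the file contains.
    if not old_str:
        return content, "OK"
    if old_str not in content:
        return content, "not found"
    count = -1 if replace_all else 1  # -1 means "replace every occurrence"
    return content.replace(old_str, new_str, count), "OK"


content = "def main():\n"

# Why the membership guard alone cannot help: "" is in every string.
empty_is_member = "" in content

# Without the guard, replace() inserts at every character boundary ...
everywhere = content.replace("", "X")
# ... or, limited to one replacement, silently prepends.
prepended = content.replace("", "# H\n", 1)

# With the guard, the content comes back unchanged.
guarded, status = str_replace(content, "", "X", replace_all=True)
```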
Daoyuan Li	495e90832c	fix(sandbox): scope e2b grep() glob filter to its directory prefix (#4168 ) E2BSandbox.grep()'s glob handling reduced a directory-scoped pattern like "src/*.js" down to just "*.js" before passing it to `grep --include=`, dropping the directory-scoping prefix entirely. GNU grep's `--include` matches by basename only, at any depth, so the search silently broadened to every matching-extension file in the sandbox tree instead of just the directory the caller asked for. Keep the basename portion as a coarse `--include=` pre-filter (a superset of the true match set) and post-filter grep's raw hits through path_matches(), the same helper glob() already uses to enforce directory scoping correctly, so grep and glob agree on what a directory-scoped pattern means.	2026-07-22 09:11:45 +08:00
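The two-stage filtering described above (a coarse basename pre-filter, then a directory-aware post-filter) can be sketched as follows. This is an illustration only: scoped_match is a simplified, hypothetical stand-in for the repository's path_matches() helper, whose real signature and semantics are not shown here, and it does not handle "**". The pattern "src/*.js" is an assumed example of a directory-scoped glob.

```python
import posixpath
from fnmatch import fnmatchcase


def basename_glob(pattern: str) -> str:
    """Basename portion of a glob, e.g. 'src/*.js' -> '*.js'."""
    return pattern.rsplit("/", 1)[-1]


def coarse_include(path: str, pattern: str) -> bool:
    """Mimics a basename-only include filter: matches at any depth."""
    return fnmatchcase(posixpath.basename(path), basename_glob(pattern))


def scoped_match(path: str, pattern: str) -> bool:
    """Hypothetical directory-aware matcher: compare segment by segment,
    so '*' never crosses a '/' boundary. No '**' support in this sketch."""
    path_parts = path.split("/")
    glob_parts = pattern.split("/")
    return len(path_parts) == len(glob_parts) and all(
        fnmatchcase(p, g) for p, g in zip(path_parts, glob_parts)
    )


tree = ["src/a.js", "lib/b.js", "src/deep/c.js", "src/d.py"]
pattern = "src/*.js"

# Stage 1: what a basename-only include would let through (a superset).
raw_hits = [p for p in tree if coarse_include(p, pattern)]
# Stage 2: post-filter the raw hits down to the requested directory.
scoped_hits = [p for p in raw_hits if scoped_match(p, pattern)]
```

The pre-filter stays useful because it cuts down the files grep has to read; correctness comes from the post-filter alone.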
March7	ce4a6d4c3d	fix(backend): remove transient image context after model calls (#4267 ) * fix(backend): discard transient image context * fix(backend): protect client image context ids * docs(backend): clarify image checkpoint lifecycle	2026-07-22 08:41:50 +08:00
Vanzeren	42baed8c8c	feat(checkpoint): dual-mode checkpoint storage with LangGraph DeltaChannel (#4292 ) * feat(checkpoint): dual-mode checkpoint storage with LangGraph DeltaChannel Add a restart-required database.checkpoint_channel_mode ("full" default, "delta") that stores the messages channel via LangGraph 1.2 DeltaChannel, cutting checkpoint storage from O(n^2) to O(n) for append-only history. Existing full checkpoints seed delta state transparently; no data migration. - config: mode schema + freeze-on-first-use with CheckpointModeReconfigurationError; mode marker persisted in checkpoint metadata; unsafe delta->full downgrade rejected fail-closed with CheckpointModeMismatchError (run-level error, failed state read) - state: delta message state schema; CheckpointStateAccessor centralizes materialized reads for all consumers (threads API, branches, regeneration, compaction, state updates, memory, goal workers) - runtime: raw writers (run durations, interrupted title, thread goal) parent their checkpoints to the checkpoint they derive from, preserving delta ancestry; rollback forks the pre-run lineage through a state mutation graph with Overwrite restores; InMemorySaver delta-history override delegates to the base walk (fixes dropped first write after migration, also present upstream) - tests: conformance suite over {memory, sqlite, postgres} covering migration replay, stable message IDs, storage shape and writer preservation; conftest fixture isolates the frozen mode between tests; stale config fakes refreshed - ci: backend unit tests gain a postgres service * fix(checkpoint): close materialization gaps in goal flow, guard public factory - Route goal-continuation message reads through CheckpointStateAccessor: raw channel_values reads see the delta sentinel in delta mode, which disabled goal continuation (stand_down=no_durable_end_of_turn) after durable assistant turns. Raw tuples remain for tuple-only metadata (checkpoint id, pending_writes). 
- Reject checkpoint_channel_mode='delta' + checkpointer in create_deerflow_agent at construction: factory-built persisted graphs bypass mode-marker injection and the fail-closed gate, reproducing silent mixed-mode state loss. Delta without persistence stays allowed. - Import the postgres saver lazily (pytest.importorskip in the fixture) so the documented default install collects the suite; add a CI job running pytest --collect-only on uv sync --group dev without extras. - Fix test_checkpointer fallback test to patch get_app_config at its use site (provider module), making it deterministic when a local config.yaml selects a persistent backend. * fix(gateway): preserve extension-owned channels in state mutations, bump config version - build_state_mutation_graph / build_checkpoint_state_mutation_accessor accept an explicit state_schema; branch and POST /state now compile the mutation graph from the thread's effective schema (graph_state_schema on the assistant graph). The base-ThreadState fallback silently discarded channels contributed by custom AgentMiddleware.state_schema on branch (data loss) and returned a false-success 200 on POST /state. - POST /state validates values keys against the mutation graph's channels and rejects unknown fields with 422 instead of ignoring them; reducer detection covers extension channels (BinaryOperatorAggregate or DeltaChannel) so Overwrite replace semantics work for middleware reducers in both modes. - Endpoint regression: custom AgentMiddleware.state_schema value survives branch, updates through POST /state, and an unknown field receives 422. - config_version 26 -> 27 for the new database.checkpoint_channel_mode (example, Helm chart values + README, support-bundle fixture), so existing installs get the outdated-config warning and make config-upgrade merges the field; covered by a test driving the real example file and the real config-upgrade script. 
* fix(gateway): resolve assistant schema via one boundary, copy branch reducer values with Overwrite GET /threads/{id}/state now resolves the thread's assistant_id through a single reusable boundary (thread metadata -> assistant_id -> effective graph), so channels contributed by a custom AgentMiddleware.state_schema are materialized instead of dropped by the default lead schema. POST /state uses the same boundary instead of resolving the schema ad hoc. Branch writes wrap every copied reducer channel in Overwrite (derived from the effective mutation graph: BinaryOperatorAggregate + DeltaChannel), not just messages, so already-aggregated values are never re-merged. Regression tests use a real AgentMiddleware.state_schema with a non-identity reducer in both full and delta modes: GET /state returns the extension value, POST /state replaces it, branch preserves it byte-for-byte; the unknown-field 422 is a separate assertion. * refactor(checkpoint): collapse read-path round-trips and ship dual-mode parity tests Address review round 4 on PR #4292: - Push ahistory/history limit through Pregel into checkpointer.alist (SQL LIMIT) instead of materializing all rows and breaking in Python - Fold the read-side mode-compat gate onto the returned snapshot's metadata; only writes keep the pre-write tuple fetch (fail-closed) - Cache factory-built accessor graphs per (assistant_id, mode) with factory-identity revalidation; state reads no longer build a lead agent per request - get_thread: one snapshot fetch + one raw pending_writes fetch on the resolved checkpoint (post-checkpoint __error__ writes never surface in snapshot.tasks; verified empirically) - DeerFlowClient.get_thread: single checkpointer.list walk collects pending_writes per checkpoint instead of N get_tuple calls - InMemorySaver delta-history patch: stand-down when the upstream override disappears, try/except guard, validated-version warning, guard tests - make_lead_agent mode precedence: first freeze is owned by 
app_config (client-supplied configurable key ignored); once frozen, injected key/app_config must match or fail closed - Rollback: lock in non-message channel restoration via fork inheritance with a dedicated reducer-channel test - Add tests/test_threads_checkpoint_mode.py and tests/test_gateway_checkpoint_mode.py referenced by AGENTS.md and the PR validation section: lifecycle parity (memory + sqlite), per-step blob-count storage guard, gateway endpoint parity Counted-saver tests pin checkpoint round-trips for aget/ahistory so these regressions cannot silently return. * fix(checkpoint): precise mode-mismatch HTTP mapping, gate E2E, and accessor resilience - threads router: map CheckpointModeMismatchError to 409 (with cause and thread id) and CheckpointModeReconfigurationError to 503 across all state endpoints instead of swallowing both into a generic 500 - gate coverage: seed a real delta checkpoint into AsyncSqliteSaver and assert aget/aupdate/ahistory fail closed; assert 409 at the HTTP boundary through the real route stack - rollback: compile the restore mutation graph with the thread's effective state schema per the build_state_mutation_graph contract - inheritance contract locks: rollback and manual compaction preserve middleware-contributed channels via checkpoint fork cloning - services: revalidate the accessor-graph cache against app_config identity so config.yaml hot-reloads never serve a stale compiled graph - services: degrade full-mode state reads to raw checkpointer reads when the agent factory is unavailable (delta gate still applies; delta mode has no fallback) - deps: override websockets==16.0 (langgraph-sdk 0.4.2's <16 pin silently downgraded 16.0 -> 15.0.1; pin is not grounded in any API incompatibility) and bump the langchain lower bound to what the lockfile actually resolves * fix(checkpoint): include anchor checkpoint in degraded history walk + cover get_thread - _RawCheckpointReadAccessor.ahistory: alist(before=...) 
is exclusive while pregel's get_state_history treats config.checkpoint_id as the inclusive start; fetch the anchor explicitly so both read paths paginate identically - extend the degraded-path gateway test: GET /thread returns raw values, and POST /history with before starts at the anchor checkpoint * fix(gateway): preserve degraded checkpoint timestamps * fix(gateway): harden degraded checkpoint access * fix(gateway): resolve assistants for checkpoint reads --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-22 08:33:29 +08:00
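The O(n^2) to O(n) storage claim for append-only history can be checked with a small counting model. This is a deliberate simplification, not LangGraph's storage format: it assumes one message is appended and one checkpoint is written per step, and both function names are hypothetical.

```python
def full_mode_stored(n_steps: int) -> int:
    """Messages stored when every checkpoint holds the whole history.

    The checkpoint at step k stores k messages, so the total is
    1 + 2 + ... + n = n(n + 1) / 2.
    """
    return sum(range(1, n_steps + 1))


def delta_mode_stored(n_steps: int) -> int:
    """Messages stored when every checkpoint holds only what was appended."""
    return n_steps


steps = 1000
full_total = full_mode_stored(steps)    # quadratic growth
delta_total = delta_mode_stored(steps)  # linear growth
```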
Ryker_Feng	8dafb667dd	fix(tui): derive /help text from the command registry (#4327 ) The /help string was hardcoded and had drifted from BUILTIN_COMMANDS, omitting six real commands (help, switch, resume, uploads, artifacts, details) and ordering the rest differently from the picker. Generate the command line from the registry so /help can never fall out of sync again, and add parity tests that guard against future drift.	2026-07-22 07:53:50 +08:00
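Generating help text from a single registry, plus a parity check, might look like the sketch below. The Command type, the COMMANDS tuple, render_help, and every summary string are hypothetical placeholders; only the command names (help, switch, resume) come from the commit message, and the real BUILTIN_COMMANDS structure is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    name: str
    summary: str  # placeholder text, not the project's wording


# Hypothetical registry: the single source of truth for picker and help.
COMMANDS = (
    Command("help", "show available commands"),
    Command("switch", "switch to another thread"),
    Command("resume", "resume a previous thread"),
)


def render_help(commands: tuple[Command, ...] = COMMANDS) -> str:
    """Build the help text from the registry, preserving registry order."""
    return "\n".join(f"/{c.name}  {c.summary}" for c in commands)


def help_command_names(help_text: str) -> list[str]:
    """Parse command names back out of the help text for a parity check."""
    return [line.split()[0].lstrip("/") for line in help_text.splitlines()]


help_text = render_help()
```

A parity test then only has to compare help_command_names(help_text) with the registry's names, so adding a command without it appearing in help fails immediately.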
RongfuShuiping	4bf028d048	feat(memory): add incremental agent-scoped Markdown fact storage (#4279 ) * feat(memory): add Markdown fact storage repository * docs(memory): explain storage rewrite for beginners * docs(memory): fix plan markdown formatting * refactor(memory): separate global summaries from agent facts * fix(memory): make Markdown fact updates incremental and safe * Update STORAGE_REWRITE_CHANGES.md * Delete docs/plans/STORAGE_REWRITE_PLAN.md * Delete docs/plans/STORAGE_REWRITE_CHANGES.md * fix(memory): address Markdown storage review feedback * fix(memory): complete review follow-ups * fix(memory): resolve storage review findings * feat(memory): add proactive Markdown migration CLI * fix(memory): harden Markdown storage concurrency * fix(memory): harden markdown storage migration * fix(memory): close migration review gaps --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com> Co-authored-by: qin-chenghan <qinchenghan@Huawei.com>	2026-07-22 07:46:03 +08:00
Aari	3c0a45ad77	fix(skills): inject Langfuse metadata into the standalone skill scan (#4321 )	2026-07-21 23:41:07 +08:00

1057 Commits