deer-flow

mirror of https://github.com/bytedance/deer-flow.git synced 2026-07-27 08:28:00 +00:00

Author	SHA1	Message	Date
March-77	b22f85c686	fix(sandbox): reconcile E2B sandboxes safely (#4443 ) * fix(sandbox): reconcile E2B sandboxes safely * fix(sandbox): clear failed E2B adoption intent	2026-07-27 14:10:24 +08:00
阿泽	1baa8ad696	feat(clarification): structured form fields for human-input cards (#4400 Phase 1) (#4406 ) * feat(clarification): structured form fields for human-input cards Add a request-side v2 `form` mode to the ask_clarification protocol so business flows (e.g. expense reimbursement) can collect several values in one card instead of sequential free-text questions: - `ask_clarification` gains a restricted `fields` parameter (text / textarea / number / select / multi_select / checkbox / date) - ClarificationMiddleware validates and normalizes fields explicitly (whitelisted types, unknown -> text, select-likes without options -> text, duplicate/invalid entries dropped, all-invalid falls back to the legacy modes) since the middleware short-circuits before tool execution; the plain-text fallback lists fields for IM channels - Form payloads carry `version: 2` so older frontends degrade to the text fallback; replies stay on the v1 response protocol — the card submits a readable summary as `response_kind: "text"`, so journal persistence and answered-card recovery are unchanged - Frontend renders typed field controls with required-field validation and compact multi-select chips Part of #4400 (scope narrowed per maintainer feedback: request-side only, no new response kinds, no top-level multi_choice). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(clarification): harden form protocol per review feedback Address the five review points on #4406: - Reject field names colliding with JS Object.prototype members on both sides; frontend reads form values via own-property access only, so `constructor`/`toString`-style names can no longer leak inherited members into required validation or the submitted summary - Close open requests answered through the legacy text fallback: a visible plain human reply (no response metadata) now marks every previously-opened request as answered, so upgrading to a v2-aware frontend cannot leave the composer locked on an already-answered card - Give checkbox fields deterministic boolean semantics: values are seeded to an explicit false ("no" in the summary) and `required` means must-agree/consent; documented in the tool schema - Make middleware field validation atomic: structurally broken entries (bad/duplicate/reserved names, over-cap field/option counts or text lengths) degrade the whole form instead of silently dropping fields; options are trimmed/deduped with blanks removed so the backend never emits payloads the frontend parser rejects - Associate form labels/controls (htmlFor/id), aria-required, aria-invalid, and error descriptions for accessibility Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * refactor(clarification): type the fields item schema via TypedDict Replace `fields: list[dict[str, Any]]` with `list[ClarificationFormField]` (a TypedDict with `name` required and the type whitelist as a Literal) so the provider-facing tool schema documents the item shape instead of an opaque object relying on the docstring. Runtime validation is unchanged and stays in ClarificationMiddleware, which intercepts the call before tool execution. Addresses the non-blocking review suggestion on #4406. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(frontend): drop unsupported aria-invalid from multi-select group jsx-a11y: role=group does not support aria-invalid; the error linkage stays via aria-describedby. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(clarification): coerce numeric required flags and normalize fields once - `_normalize_bool` now coerces 1/0 (some providers serialize booleans as integers), so `required: 1` no longer silently flips to optional - `_handle_clarification` normalizes `fields` once and passes the result to both the text fallback and the payload builder Addresses the non-blocking review nits on #4406. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(clarification): harden form protocol per contract review round 2 Backend: - Guard unhashable JSON in the intercept path: `type: []`/`{}` degrades the field to text and `clarification_type: []` coerces to str instead of raising TypeError (which, with return_direct, ended the turn with an error and no card or fallback) - Add a total budget over the serialized normalized fields (16KB UTF-8 bytes): per-item caps alone admitted forms whose IM text fallback exceeded channel delivery limits (Slack 40k chars, Feishu ~30KB card), silently truncating trailing fields; a boundary test proves any accepted form's fallback stays deliverable Frontend: - Submission value now appends a JSON block keyed by stable field names (readable summary alone is delimiter-ambiguous), with a collision regression test - Parser boundary tightened to match backend constraints: empty option values (Radix SelectItem crash), duplicate option ids/values, duplicate field names, and the form<->version-2 binding are rejected - Keep the error node mounted while any field is still invalid so aria-describedby never points at a removed element (happy-dom interaction test) - Required semantics are now accessible: native checkbox control (no HTML required attribute — it would intercept the custom submit path), visually-hidden localized "required" markers next to the aria-hidden asterisks - Legacy-fallback closure narrowed to the latest unanswered request: nothing guarantees a single outstanding clarification across runs, and closing all would silently swallow older decisions; an older request left open becomes the active card again Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(frontend): keep clarification selects controlled --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>	2026-07-27 14:05:31 +08:00
Vanzeren	e01173d8b2	bench(checkpoint): production-shaped full/delta benchmark with configurable snapshot frequency (#4467 ) * feat(checkpoint): production-shaped full/delta benchmark with configurable snapshot frequency - Group benchmark scripts into per-family folders (checkpoint/, sandbox/) - Extract shared benchmark infrastructure into checkpoint_bench_common.py - Add checkpoint_delta_snapshot_frequency config (default 1000, process-frozen); freeze it in make_lead_agent and DeerFlowClient; key the state-schema adaptation cache by resolved frequency - New bench_production.py: per-case child processes run N ainvoke turns through the real lead-agent graph (scripted deterministic model, real AsyncSqliteSaver), then measure GET /state + POST /history through the real Gateway route stack in one event loop (httpx ASGITransport), cold/warm accessor-cache split, cross-mode digest gates - New summarize_production.py: delta/full ratios plus decision metrics (snapshot_write_spike, cache_effect_ms, checkpoint_write_share, auto-discovered history per-limit ratios) * fix(checkpoint): address production benchmark review	2026-07-27 11:47:49 +08:00
March-77	2e5c8da257	fix(sandbox): bypass proxies for local AIO traffic (#4444 ) * fix(sandbox): bypass proxies for local AIO traffic * fix(sandbox): classify public IPv6 proxy targets	2026-07-27 07:47:39 +08:00
Huixin615	090e80c1dd	fix(runtime): fail-stop runs when lease ownership cannot be confirmed (#4431 ) * fix(runtime): fail-stop runs after lease expiry * test(runtime): cover late successful lease renewal --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-27 07:25:34 +08:00
Huixin615	1cd5dea336	fix(streaming): signal replay history gaps (#4426 ) * fix(streaming): signal replay history gaps * fix(streaming): guard initial Redis replay window * fix(frontend): align inactive gap recovery --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-27 07:13:06 +08:00
Aari	244ce7739f	fix(runtime): linearize delta-mode checkpoint resume (#4460 ) * fix(runtime): linearize delta-mode checkpoint resume Resuming a run from an older checkpoint forks the lineage, and in delta mode that fork's state cannot be materialized correctly: the delta history walk collects every pending_writes entry stored on each on-path ancestor, but a shared parent also carries the writes of the sibling child that was abandoned. Those writes replay into the fork, so the run starts from a message list that still contains the answer it was meant to replace — regenerating in a branched thread surfaced this as the superseded assistant message reappearing beside the new one after a reload. All three saver implementations are affected, so write-to-child ownership is a gap in the upstream delta contract rather than one saver's slip. Rather than reimplement that walk, express the fork as what it means: materialize the requested checkpoint's state, write it as an Overwrite on the current head (which has no siblings), and run linearly. The abandoned turn stays in history as the rewritten head's ancestry. This runs after the rollback point is captured, so cancel-with-rollback still restores the real pre-run head, and fails closed — an unreadable resume checkpoint raises instead of falling back to the corrupt fork. Full mode keeps forking: its checkpoints carry complete channel_values and need no replay. * fix(runtime): restore complete delta resume state * fix(runtime): linearize delta rollback restoration * Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * fix(runtime): serialize delta resume preparation --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-07-26 21:59:19 +08:00
DanielWalnut	bb9f67aaf1	fix(runtime): close cancelled replacement admission (#4472 )	2026-07-26 21:57:39 +08:00
Vanzeren	1c7531242c	feat(runtime): record terminal artifact delivery receipts (slice 1 of #4272 ) (#4365 ) * feat(runtime): record terminal artifact delivery receipts (#4272) * fix(runtime): persist delivery receipts across recovery * test(runtime): cover delivery receipt invariants * fix(runtime): preserve terminal status on receipt outages	2026-07-26 21:45:47 +08:00
lllyfff	8145d66a33	feat(memory): memory message processing (#4447 ) * feat(memory): signals-based update pipeline + always-on watermark/trivial filter Refactor the DeerMem memory update pipeline (message_processing -> queue -> updater) around a signals frozenset seam, replacing the (filtered, correction_detected, reinforcement_detected) 3-tuple with (filtered, signals: frozenset[str]) end to end. message_processing: - Externalize signal-detection patterns to YAML (message_patterns/.yaml). - Extend signals from correction/reinforcement to a 6-class set (correction/reinforcement/preference/identity/goal/decision); detect_signals returns a frozenset aligned with the fact category enum. - Pure-acknowledgment turns ("ok"/"好的"/...) are always filtered out before enqueue (whole-message fullmatch), saving an extraction LLM call. queue (core/queue.py): - In-memory list + debounce timer, with flush_sync (graceful-shutdown drain that joins an in-flight worker under a hard timeout) and queue_max_depth backpressure (signal-bearing updates always admitted; QueueFull otherwise). - Same-key updates coalesce with a signal union; per-batch success/fail summary. updater (core/updater.py): - head500+tail500 message truncation (replaces the 1000-char head chop). - Always-on per-thread watermark: feed only messages added since the last extraction. The watermark is in-memory and is not advanced on failure, so a failed/lost update is re-fed on the next conversation turn. - [MANUAL] prompt marker for user-authored facts (source.type="manual"). - Post-invoke extraction_callback (host-injected) emitting facts_extracted / facts_accepted / rejected_low_confidence; the host default logs metrics and flags >60% rejection. Confidence filtering remains in _apply_updates (the existing fact_confidence_threshold check); there is no separate write gate. Consolidation stays opt-in (lossy). The ABC add/add_nowait signature is unchanged, so the summarization flush hook and host are unaffected. Tests: add test_message_processing_signals, test_updater_truncation, test_updater_watermark; update queue/updater/consolidation/staleness/pluggable tests for the signals seam. Co-Authored-By: Claude <noreply@anthropic.com> fix(memory): harden update pipeline per PR review - Catch QueueFull in DeerMem.add/add_nowait so backpressure degrades to 'update skipped' instead of propagating into after_agent / summarization_hook and breaking the agent run (peer middlewares self-guard; MemoryMiddleware was the lone exception). Emergency (add_nowait) always admits under backpressure -- its data cannot be re-fed next turn. - Rewrite the watermark from index-based to content/identity-based (_message_identity + _feed_after_watermark) so it stays correct when summarization removes the conversation front -- an index watermark pointed at the wrong message and silently skipped un-extracted tail turns. The emergency flush bypasses the watermark (bypass_watermark on ConversationContext, threaded through update_memory) and coexists with (does not replace) a pending normal update, so a flush cannot drop a pending update's un-extracted tail. - Populate facts_accepted / rejected_low_confidence inside _apply_updates at the real confidence-filter site (passed_threshold) instead of re-deriving the threshold in _finalize_update -- eliminates metric drift. - Emit extraction metrics in a finally with an 'attempted' flag so exception failures (parse error, apply_changes raise after retry) are observable, not only the happy path. - Re-detect signals on the post-watermark feed for the extraction hint so it no longer references turns the LLM cannot see; admission-time signals still drive backpressure. - Move the post-batch reschedule inside the queue lock to close a non-atomic self._timer race with a concurrent add. Co-Authored-By: Claude <noreply@anthropic.com> * fix(memory): address follow-up review nits (LRU, metric name, docstring) - Bound the in-memory watermark cache with a configurable LRU (watermark_max_keys, default 4096, 0=unbounded). A dropped key re-extracts one batch on that thread's next turn (the documented restart behavior), so eviction is safe and preserves the content-identity watermark's front-removal guarantee. Adds _watermark_get/_watermark_set helpers and a bounded-LRU regression test. - Rename the extraction metric facts_accepted -> facts_passed_confidence so the name matches what the >60% rejection-rate warning assumes (a confidence-gate signal, not a persisted-fact count); drop the stale "historical semantics" justification. Brand-new callback, one consumer. - Fix the stale test_message_processing_signals module docstring: the signals seam is already swapped to frozenset, and a stale stage-numbering prefix is removed. Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>	2026-07-26 21:16:36 +08:00
Ryker_Feng	68c0ffdac8	feat(frontend): pin recent chats (#4442 ) * feat(frontend): pin recent chats * fix(threads): address pin-chat review feedback - Stop bumping updated_at on metadata-only PATCH (pin/unpin) via a new update_metadata(touch=False) path so unpinning no longer jumps a chat to the top of the updated_at-sorted recent list. - Narrow patchThreadMetadata to a ThreadMetadataPatchResponse matching the Gateway's actual response (no values/context). - Namespace the pinned metadata key as deerflow_pinned for consistency with deerflow_sidecar / deerflow_branch. - Cover touch/touch=False behavior in repo + router tests; document the e2e mock's updated_at preservation now mirrors production. * style(frontend): format thread utils test * fix(threads): make pinned ordering server-side * test(frontend): keep infinite-scroll fixture order stable * test(frontend): stabilize lark reconnect e2e * docs: clarify thread pin metadata contract	2026-07-26 20:47:58 +08:00
Zhengcy05	2f60bee388	fix: surface length-capped model responses (#4309 ) * fix: surface length-capped model responses * fix: avoid the influence of the mid-turn * fix: correcting semantic annotations * fix: add ModelLengthTerminationDetector to compatible providers * fix:delete redundancy code * fix:supplementing log information improves observability * fix: align the document and complete the assertions. * fix: unit test * fix: revert AGENTS.md * fix: unit test * fix: add annotation and skip AIMessage has empty content	2026-07-26 14:43:08 +08:00
Aari	d1aeea2c3e	fix(checkpoint): unwrap Overwrite first writes into empty channels (#4383 ) * fix(gateway): stop persisting Overwrite wrappers into empty reducer channels on branch Thread branching (and POST /state on a never-written channel) wraps copied reducer values in Overwrite. Upstream BinaryOperatorAggregate.update seeds an empty (MISSING) channel with values[0] verbatim without unwrapping, so Union-typed channels (sandbox/goal/todos/promoted) stored the wrapper literally and the next consumer crashed with TypeError: 'Overwrite' object is not subscriptable (#4380). Patch the channel to unwrap the first write (mirroring DeltaChannel semantics), and stop copying thread-scoped channels (sandbox, thread_data) into branches: the parent's sandbox_id would bind the branch to the parent's workspace and release lifecycle. * refactor(checkpoint): drop private _get_overwrite import for a local Overwrite check Importing langgraph's underscored _get_overwrite at module top level meant an upstream refactor that drops it - plausibly the same release that fixes the bug - would fail this module's import and crash startup before the probe can stand the patch down. Replace it with a local helper on the public Overwrite type, and fix two test docstring nits. * refactor(checkpoint): write patch flags via their constants to avoid drift Both saver patches read their "already patched" idempotence flag through a module constant (_PATCH_FLAG / _BINOP_PATCH_FLAG) but wrote it as a hard-coded attribute literal, so renaming the constant would silently break the guard and double-apply the patch. Write via the same constant (setattr), dropping the now-unneeded attr-defined ignores. --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-26 10:44:50 +08:00
Daoyuan Li	5d073991c2	fix(sandbox): widen boxlite/aio_sandbox tenant hash and verify identity on reclaim (#4171 ) * fix(sandbox): prevent truncated tenant ID reuse * fix(sandbox): handle late same-tenant box registration	2026-07-26 09:47:57 +08:00
Yufeng He	6e6c078595	fix(sandbox): unwrap Overwrite-wrapped sandbox state in after_agent (#4381 ) * fix(sandbox): unwrap Overwrite-wrapped sandbox state in after_agent Fork-restored checkpoints can deliver the sandbox channel still wrapped in langgraph.types.Overwrite: the rollback restore applies replace-style writes through a state-mutation graph in delta checkpoint mode, and after_agent/aafter_agent then crash subscripting the wrapper ("TypeError: 'Overwrite' object is not subscriptable") on the next sandbox tool run in the forked conversation. Unwrap before reading the sandbox id, and pin both hooks against an Overwrite-wrapped state. Refs #4380 (bug 1 of 2; the history-loss half is a separate display path) Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> * fix(sandbox): don't release fork-restored parent sandboxes The Overwrite-wrapped value replays the parent thread's sandbox state, so releasing it from the forked run would evict the parent's warm sandbox. _unwrap_sandbox now reports the wrapped form, and both after_agent hooks skip the release for it while keeping the normal path unchanged. Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> --------- Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-26 09:02:40 +08:00
Ryker_Feng	7aa314b4c1	feat: add Lark CLI integration (#3971 ) * feat: add lark cli integration * fix: polish lark integration actions * feat: support lark incremental permissions * fix: detect lark authorization completion * fix: harden lark integration install * feat: expand lark auth scopes and reuse host auth in sandbox Default lark auth to least-privilege (recommend=false, base sign-in only) and expose the full set of lark-cli --domain business domains as native --domain grants instead of a 4-domain read-only mapping. Resolve the skill pack from the latest larksuite/cli GitHub release at install time with content-hash integrity, and surface version/runtime drift in status. Share the per-user lark-cli config/data profile between the Gateway Settings auth flow and agent conversations by mounting the integration dirs into the AIO sandbox and injecting the matching env for lark-cli commands, with an allowlisted extra_mounts path in the provisioner/K8s backend and traversal guards on integration paths. * style: fix lint issues from ruff and prettier Sort imports in the provisioner PVC test and re-wrap two long i18n description strings to satisfy backend ruff and frontend prettier CI. * fix(lark): address managed integration review feedback * fix(frontend): stabilize integrations settings e2e * test(sandbox): isolate remote backend legacy visibility check * test: fix backend unit failures after merge * Harden Lark integration review fixes * Format Lark integration E2E test * fix(lark): harden sandbox credential exposure and status disclosure Address willem_bd's security review on PR #3971: - Mount the per-user lark-cli config dir (long-lived appSecret) read-only into the AIO sandbox; only the refreshable-token data dir stays writable. - Redact host filesystem paths (install_path, cli.path) from GET /lark/status and the config/auth complete responses for non-admin callers, fail-closed on any auth error. - Document the npm postinstall trade-off (--ignore-scripts is not viable because @larksuite/cli fetches its platform binary in postinstall). - Document the sandbox credential trust boundary in AGENTS.md and README, pointing at the sidecar-broker follow-up (#4338). --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-26 08:09:17 +08:00
Aari	68797c5759	fix(gateway): scope branch history seed run ids per inherited turn (#4459 ) Branch creation seeds the new thread's run-event feed from its checkpoint so inherited history survives the first run (#4380). Every seeded row carried one shared run id, but run_id is a turn identity to the feed's consumers, not a provenance tag: regenerating the inherited answer resolves that row's run id as the superseded source, and GET /messages/page then drops every row carrying it. One shared id for the whole seed therefore deleted the complete inherited history on a branch's first regenerate, leaving only the regenerated turn. Group seeded rows into one synthetic run per inherited turn (branch-seed-{thread_id}-{n}), a new turn opening at each persisted human message — the same boundary a real run has, including the allowlisted hidden ask_clarification reply, which resumes as its own run. Supersession is then confined to the turn actually regenerated. Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-26 08:00:48 +08:00
Aari	37c343fe30	fix(summarization): summarize with the run model, fall back on summary-provider failure (#4361 ) * fix(summarization): own the run model for compaction; bound failure With summarization.model_name: null the summary model resolved to config.models[0] while the executing model is selected per run; when they differ and models[0]'s provider is broken (expired key, quota, outage) compaction silently failed every triggered turn and context grew unbounded until the main provider 400s the run (#3103's shape), even though the run's own model was healthy. Model ownership is now sourced from the builders, not re-derived at runtime: - The lead, subagent, and manual /compact builders each pass the resolved run model into create_summarization_middleware(run_model_name=...). The middleware no longer reads runtime.context / get_config(), which do not carry a custom agent's or a subagent's resolved model, so a custom-agent lead run and a distinct-model subagent now summarize with their own model, not models[0] / the parent's. Runtime re-resolution and the per-name model cache are removed. - model_name: null summarizes with the run's own model; an explicitly configured summary model generates and falls back to the run model on failure. The fallback is built lazily after the primary fails and its construction is guarded, so a broken fallback cannot skip a healthy primary or escape the automatic failure boundary. Failure is bounded and side-effect-safe: - An empty or whitespace-only response is treated as a generation failure, not a valid summary, so compaction never removes all history for an empty replacement. - compact_state/acompact_state take raise_on_failure independent of force: the manual /compact path always surfaces a generation failure (even force=false) and routes it to the existing ContextCompactionFailed path (HTTP 500 -> frontend error toast) instead of an unconsumed response reason. The automatic path leaves compaction state unchanged. - before_summarization hooks fire only after a replacement summary exists. SummarizationConfig.model_name, config.example.yaml, and docs/summarization.md document the final lead/subagent/manual ownership rules. Part of RFC #4346 (section A). Evaluating fraction/triggers against the run model's profile (profile ownership) is a separate follow-up. * fix(summarization): manual /compact model ownership + fail-open construct/parse Manual /compact carried only agent_name, so it derived the run model from the custom-agent model or config.models[0] and missed the request-selected model the run path uses (request -> custom-agent -> default). Carry model_name through ThreadCompactRequest and the frontend compact call, resolve with the same precedence, and move the custom-agent config read off the event loop (asyncio .to_thread) with user_id so the strict blocking-IO gate is not bypassed by the broad except. Make one summary attempt own its full lifecycle so the fail-open boundary covers construction and response parsing, not just invocation: build each candidate model lazily and guarded (a raising constructor falls through to the healthy run model instead of breaking agent construction), build the model_name:null primary from the run model rather than config.models[0], and run response text extraction inside the invocation try so a failing .text accessor falls back instead of escaping compaction. Adds factory-level constructor-failure, response-extraction-failure (sync/async), and route-path model-ownership tests.	2026-07-26 07:39:39 +08:00
MiaoRuidx	735f67a5b2	fix: guard pending run startup cancellation (#4450 ) * fix: guard pending run startup cancellation * fix(run): address startup review feedback * fix(run): narrow start_run store contract --------- Co-authored-by: MiaoRuidx <12540796+MiaoRuidx@users.noreply.github.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-25 23:50:21 +08:00
Huixin615	8af760fc30	fix(runtime): make orphan reconciliation lease-aware (#4427 )	2026-07-25 23:26:17 +08:00
Vanzeren	3c8b82c594	fix(runtime): serialize checkpoint writes with active runs (#4437 ) * fix(runtime): serialize checkpoint writes with active runs * fix(runtime): address checkpoint reservation reviews * fix(runtime): address reservation race reviews * fix(runtime): refine reservation conflict semantics	2026-07-25 23:18:34 +08:00
VectorPeak	07d8b98864	fix(mcp): ignore malformed path-like text (#4456 ) Co-authored-by: chatgpt-codex-connector[bot] <199175422+chatgpt-codex-connector[bot]@users.noreply.github.com>	2026-07-25 21:43:33 +08:00
Vanzeren	8c19a2eb36	perf(checkpoint): linearize message write merging (#4421 ) * perf(checkpoint): linearize message write merging * test(checkpoint): address message reducer review	2026-07-25 21:19:24 +08:00
luo jiyin	3b77a7401b	fix(sandbox): enforce E2B replica capacity limits (#4391 ) * fix(sandbox): enforce E2B replica capacity limits (in-process) Add SandboxCapacityExceededError with diagnostic fields. Add overflow_policy (wait/reject/burst), acquire_timeout, and burst_limit config options. Implement atomic capacity reservation with a four-slot model: reserved / active / warm / transitioning. Transitioning slots close the window where active-to-warm or warm-to-active transitions appear to have zero occupied slots, which would let concurrent acquires exceed the configured replica ceiling. Re-route release, reclaim, and evict through transitioning counters. Add shutdown guard: reject waiters, kill VMs created during shutdown. Add 14 tests: policy enforcement, release+acquire race, warm-reclaim race, shutdown-waiter interaction, shutdown-during-create, and concurrent different-thread capacity assertion. Related: #4339 * fix: harden e2b sandbox capacity lifecycle * fix: retain e2b capacity during uncertain eviction * fix: serialize e2b tombstone eviction * fix: retain capacity after uncertain e2b cleanup * fix: track e2b remote operations during shutdown * fix(sandbox): validate E2B capacity config * fix(sandbox): classify capacity errors * fix(sandbox): harden E2B capacity lifecycle * test(sandbox): cover E2B review findings * docs(changelog): note E2B capacity behavior * docs(readme): explain E2B overflow handling * docs(backend): record E2B lifecycle rules * docs(sandbox): clarify destructive E2B reset * fix(sandbox): close E2B capacity race gaps --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-25 10:54:14 +08:00
ShitK	0f0955bf7b	fix(client): preserve ToolMessage artifacts (#4422 ) Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-25 09:47:58 +08:00
黄云龙	126fc9ea81	fix(subagents): clamp subagent limit consistently with MIN_SUBAGENT_LIMIT (#4081 ) * fix(subagents): align prompt and middleware subagent limit; allow min of 1 SubagentLimitMiddleware clamped max_concurrent to [2, 4] internally, but agent.py and client.py fed the raw config value into the system prompt, so a user-configured 1 (or 5) produced a prompt that disagreed with the enforced middleware limit. Lower MIN_SUBAGENT_LIMIT to 1 and clamp the raw config value with _clamp_subagent_limit() at both the agent factory and the embedded client so the prompt and middleware see the same value. * fix: remove unused imports MAX_CONCURRENT_SUBAGENT_CALLS, MIN_CONCURRENT_SUBAGENT_CALLS, clamp_subagent_concurrency * fix: harmonize clamp range [1,4] across middleware, config, and prompt path; fix lint - Changed MIN_CONCURRENT_SUBAGENT_CALLS from 2 to 1 so prompt.py's clamp_subagent_concurrency and the middleware's _clamp_subagent_limit both clamp to [1,4] — eliminating the divergence where the prompt told the model 'max 2 task calls' but the middleware enforced 1. - Applied _clamp_subagent_limit at build_middlewares (agent.py:360) so all 3 construction sites (agent.py:360, agent.py:450, client.py:259) consistently clamp the config-resolved limit. - Derived MIN_SUBAGENT_LIMIT / MAX_SUBAGENT_LIMIT from MIN_CONCURRENT_SUBAGENT_CALLS / MAX_CONCURRENT_SUBAGENT_CALLS so the two module-level definitions stay in sync. - Added TestConfigParity.test_prompt_path_and_middleware_clamp_agree regression test. - Fixed lint. * fix(lint): add missing imports for MIN_CONCURRENT_SUBAGENT_CALLS and MAX_CONCURRENT_SUBAGENT_CALLS * docs+test: update AGENTS.md clamp range to 1-4; add prompt/middleware parity regression test - backend/AGENTS.md still documented the old [2,4] clamp in two places; updated to [1,4] to match MIN_CONCURRENT_SUBAGENT_CALLS = 1. - Added test_apply_prompt_template_single_subagent_limit_matches_middleware: renders the real system prompt with max_concurrent_subagents=1 and asserts the advertised HARD LIMITS value equals SubagentLimitMiddleware's enforced max_concurrent — the end-to-end check that would have caught the [1,4] vs [2,4] prompt-path divergence flagged in review. * refactor: simplify per review — restore clamp delegation, drop redundant call-site clamps Per willem-bd's review, reduce the PR to the one behavioral change plus docs/tests: - _clamp_subagent_limit delegates to clamp_subagent_concurrency again instead of inlining a byte-identical copy; with a single source of truth the TestConfigParity sync-check class is unnecessary — dropped. - Revert the call-site clamps in agent.py (build_middlewares, _make_lead_agent) and client.py (_ensure_agent) to main: both downstream consumers (SubagentLimitMiddleware.__init__ and the prompt path) already clamp internally, and the cross-module private import of _clamp_subagent_limit goes away with them. - Keep MIN_CONCURRENT_SUBAGENT_CALLS = 1 (the fix), the [1, 4] docstring updates, the AGENTS.md range corrections, and the end-to-end prompt/middleware parity test for single-subagent mode (docstring reworded: on main a configured 1 was bumped to 2 by both paths — there was no divergence to fix, just a silently raised floor). * test: fix stale comment referencing reverted agent.py/client.py call-site clamps --------- Co-authored-by: nankingjing <nankingjing@users.noreply.github.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-24 21:56:11 +08:00
Daoyuan Li	ca3e510b7d	fix(scheduler): close duplicate dispatch race (#4105 ) Enforce one queued or running scheduled-task run per task with a partial unique index. The migration resolves legacy duplicates before creating the index, and losing inserts use the existing conflict or skip outcomes.	2026-07-24 21:41:09 +08:00
Daoyuan Li	159b774944	fix(skills): handle non-string frontmatter keys (#4167 ) Normalize YAML frontmatter keys in the shared parser so validation and review report malformed fields instead of failing while sorting mixed key types.	2026-07-24 21:25:53 +08:00
H Haidong	c7538cfb35	fix(runs): terminate orphaned streams after lease recovery (#4420 ) * fix(runs): terminate orphaned streams after lease recovery * fix(runs): include recovered ids in callback warnings * fix(runs): harden orphan recovery lifecycle	2026-07-24 19:34:20 +08:00
ShitK	a4ede80deb	fix(runtime): reject unsupported run options and stream modes (#4430 ) * fix(runtime): reject unsupported run options * fix(runtime): align SDK run compatibility * fix(frontend): avoid unsupported events stream mode --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-24 19:24:24 +08:00
Ryker_Feng	cd9432bcc1	feat(tools): support GIF images in view_image (#4438 ) Add GIF to the view_image allowlist: map the .gif extension to image/gif and detect the GIF87a/GIF89a magic bytes so the existing extension/content cross-check accepts GIFs instead of rejecting them as an unsupported format. Covered by a new success test.	2026-07-24 13:12:43 +08:00
MiaoRuidx	80c06414f8	fix: make orphan reconciliation lease-aware (#4434 ) 让启动/孤儿 run 恢复在最终写入前通过 claim_for_takeover 原子重查 lease，避免 owner 在扫描后续约成功仍被误标为 error。补充扫描后续约的回归测试，并把 reconciliation 写失败测试迁移到 takeover claim 路径。 Co-authored-by: MiaoRuidx <12540796+MiaoRuidx@users.noreply.github.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-24 09:48:48 +08:00
Aari	5f0108f56c	fix(runtime): stop subgraph stream frames impersonating root frames (#4407 ) * fix(runtime): stop subgraph stream frames impersonating root frames The web frontend always requested stream_subgraphs, and since delegated subagent graphs inherit the parent checkpoint namespace (#4215), their values snapshots and token chunks ride the parent stream. The worker's _unpack_stream_item dropped the namespace and published every subgraph frame under a bare event name, so a subagent's values snapshot replaced the whole thread view in SDK clients (#4399), its token chunks flooded the parent message stream, and a subagent's LLM error fallback could be mistaken for the parent run's. Publish subgraph frames under namespace-qualified SSE event names (mode\|ns1\|ns2, LangGraph Platform style) and keep root-only consumers (file-tool chunk batcher, subagent event persistence, error-fallback detection) on root frames only. Drop streamSubgraphs from the frontend submit paths: subtask progress arrives via root-namespace task_* custom events, so the flag only exposed the leak. * test(runtime): add production-shaped subgraph stream regression tests Address review: the namespace tests validated the publishing helpers with hand-fed namespaces, while the #4399 regression lived in the integration between LangGraph's delegation routing and the worker's stream loop. Add TestWorkerSubgraphStreamIntegration: a real parent graph delegates through the real SubagentExecutor and streams through run_agent into a real MemoryStreamBridge, locking both stream_subgraphs modes -- delegated frames arrive namespaced (never bare), a delegated error fallback cannot mark the parent run as errored, and without the flag delegated frames stay out while task_* custom events remain.	2026-07-23 23:32:06 +08:00
Huixin615	4a2ecd430e	fix(streaming): expose custom events to astream_events (#4403 ) * fix(streaming): expose custom events to astream_events * test(streaming): validate real custom event emitters --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-23 22:56:12 +08:00
hataa	7857fa0cce	feat(authz): enforce tool authorization at assembly and runtime (#4370 ) * feat(authz): enforce tool authorization at assembly and runtime * fix(middleware): guard deferred tool setup lookup (#4370) --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-23 22:51:35 +08:00
MiaoRuidx	f1632cc351	fix(run): add run event stream contract (#4342 ) * docs: document run event stream contract * fix(run): address event stream review feedback --------- Co-authored-by: MiaoRuidx <12540796+MiaoRuidx@users.noreply.github.com> Co-authored-by: Willem Jiang <willem.jiang@gmail.com>	2026-07-23 21:33:57 +08:00
Aari	b7933d18e4	fix(safety): backfill empty content-filter responses so they don't poison the thread (#4394 ) An empty assistant message from a provider safety filter (content_filter with no content, no tool calls) was persisted into thread history and replayed to strict OpenAI-compatible providers, which reject it with HTTP 400 ("message ... with role 'assistant' must not be empty") — breaking every later turn until a new chat is started. SafetyFinishReasonMiddleware only handled the tool-call case (#3028) and TerminalResponseMiddleware only the post-tool case (#4027), so a plain empty content-filter response fell through both. Extend the safety middleware to backfill a user-facing explanation when a safety-terminated message is otherwise blank, so the persisted turn is non-empty (and the user sees why it was blocked). Fixes #4393	2026-07-23 16:59:34 +08:00
Aari	70fb91654d	fix(gateway): seed branch run-events so inherited history survives forking (#4385 ) * fix(gateway): seed branch run-events so inherited history survives (#4380) The thread feed (GET /messages, /messages/page) reads the run-event store, but branch creation only wrote checkpoint state - a fresh branch had no message rows, so the parent history vanished from the UI as soon as the branch's first run refreshed the feed. Seed the branch's run_events from the same checkpoint snapshot the branch was created from, mirroring RunJournal's message-event contract (event types, hidden-message rules, original-user-text restoration). Best-effort: a seeding failure degrades to the old behavior and is reported as history_seed_mode=failed. * docs(gateway): correct branch-seed docstring on RunJournal divergences The "consumers cannot tell a seeded row from a journaled one" claim was overstated for AI rows: seeded rows omit run-scoped enrichment (usage / latency_ms / llm_call_index) and stamp caller=lead_agent rather than the message's original caller, neither recoverable from a checkpoint message. Rewrite the docstring to state these divergences explicitly and note they are display-invisible today (no consumer indexes those keys; per-message caller drives no attribution). Also add a code comment marking the hide_from_ui filter as intentionally stricter than the live paths. * fix(gateway): seed dict-shaped checkpoint messages + persist hidden AI/tool rows Two review-driven fixes to build_branch_history_seed_events: 1. Checkpoint messages can arrive as model_dump()-shaped dicts (the branch-matching helpers in threads.py already handle both BaseMessage and dict). The seed only handled BaseMessage, so a dict-backed checkpoint seeded nothing and the branch reported skipped_empty while history existed. Coerce dicts back to BaseMessage via messages_from_dict (faithful: tool_calls / tool_call_id / additional_kwargs survive); unparseable dicts are dropped best-effort. 2. RunJournal.on_llm_end and _persist_tool_result_message persist hide_from_ui AI/tool rows unconditionally (the frontend hides them client-side); the hide check only gates the reconciliation pass. The seed dropped them, so a hidden turn vanished from a forked feed and seeded rows diverged from journaled ones. Match RunJournal and write them, restoring true row-level parity. Adds tests for dict deserialization, the unparseable-dict drop, and the hidden AI/tool persistence contract.	2026-07-23 13:57:32 +08:00
Admire	a38b1daec3	fix(streaming): keep large file generation responsive (#4354 ) * fix(streaming): keep large file generation responsive * fix(streaming): address follow-up review feedback * fix(streaming): address final review feedback	2026-07-23 08:51:14 +08:00
Aari	7b330101d2	fix(tools): exclude injected runtime from list_uploaded_files schema (#4375 ) (#4376 ) Declaring the injected runtime arg as `Annotated[Runtime, InjectedToolArg] \| None` made the top-level annotation a Union, so LangChain no longer treated it as injected. It leaked into the model-facing schema and pydantic raised PydanticInvalidForJsonSchema on the ToolRuntime dataclass the moment the tool was bound to a model. The tool is bound by default for the lead agent, so any default run on an OpenAI-compatible provider failed at tool-bind time. Declare runtime as a bare Runtime first param, matching every other built-in tool (present_files, view_image, task, ...), which LangChain auto-injects and auto-excludes from the schema. Add a schema regression test that binds the tool.	2026-07-23 08:22:15 +08:00
Aari	0d4d0cb17d	feat(agents): database-backed storage for custom agent definitions (#4359 ) * feat(agents): database-backed storage for custom agent definitions Add an agent_storage.backend switch (default file, behaviour-unchanged) with a db backend that stores each custom agent as a row in the shared SQL persistence layer, so a multi-instance deployment sees the same agents on every node (#4331, #4357). Introduces an AgentStore interface routing all read/write surfaces, an agents table + migration 0006, startup validation, and a file->db importer. Follows the thread_meta store / run_events backend-switch / 0003_scheduled_tasks migration patterns; no new dependency. * fix(agents): make db storage path production-ready (review round 1) Addresses review feedback on the db/sync agent-storage path: - sql.py: mirror the async engine's per-connection SQLite PRAGMAs on the sync engine (busy_timeout=30000, synchronous=NORMAL, foreign_keys=ON, WAL) so both engines behave identically against the shared DB; guard the engine cache with a lock (double-checked) so concurrent first-touch cannot build duplicate engines or register the connect listener twice. - routers/agents.py + routers/assistants_compat.py: offload the sync-store reads that ran on the event loop (list/get/check, update's pre-read + legacy guard + refresh, and assistants_compat's four list routes) via asyncio.to_thread — on db+postgres each was a network round trip stalling the loop. Writes were already offloaded. - file.py: translate the create() mkdir(exist_ok=False) race FileExistsError into AgentExistsError (router 409, matching SqlAgentStore's IntegrityError path); correct the _write docstring — per-file atomic replace, two commits sequential not transactional. Tests: sync-engine PRAGMA + engine-cache reuse assertions; file create-race -> AgentExistsError; strict Blockbuster anchor over the read endpoints so a regression back onto the loop fails CI. * fix(agents): address round-2 review on the db store path - update_agent tool: align the docstring/inline comment with FileAgentStore._write. Cross-field write atomicity is db-only; the file backend commits config then soul via two sequential os.replace (a crash between them can leave a fresh config.yaml beside a stale SOUL.md). The dropped partial-write reporting is an intentional tradeoff — the stage-then-replace safety is preserved (test_update_agent_soul_failure_does_not_replace_config still holds). - SqlAgentStore.update(): true upsert. Catch IntegrityError on the insert-on-missing branch, re-fetch and apply, so two concurrent first-time writes (e.g. two setup_agent handshakes) converge instead of surfacing a raw UNIQUE(user_id, name) violation as a 500. Symmetric with create(). - get_agent_store(): document the graph-subprocess config-resolution invariant (the except->file fallback is a genuine no-config path, not a mask for a misconfigured graph process) and pin it with two tests driving the real get_app_config() file resolution: db resolves from an on-disk config.yaml, file fallback when config is unresolvable. * test(agents): cover SqlAgentStore.update() write-race upsert recovery Mandatory-TDD test for the round-2 fix in 0680340a: two concurrent first-time update()s where the loser's insert hits UNIQUE(user_id, name). Deterministically forces the IntegrityError recovery path by making the first _row probe miss the committed winner, and asserts last-writer-wins instead of a surfaced 500.	2026-07-23 08:03:21 +08:00
March7	4dd7cafef1	fix(sandbox): serialize E2B release transitions (#4355 )	2026-07-23 07:42:43 +08:00
Daoyuan Li	44990ff194	fix(mcp): use threading.Lock for OAuth token refresh to avoid cross-thread deadlock (#4240 ) * fix(mcp): use threading.Lock for OAuth token refresh to avoid cross-thread deadlock OAuthTokenManager created one asyncio.Lock per server for the process lifetime. The embedded/TUI sync tool-call path (DeerFlowClient.stream() -> LangGraph's ToolNode._func -> a ThreadPoolExecutor -> make_sync_tool_wrapper's per-call asyncio.run()) invokes get_authorization_header from a fresh event loop on a fresh OS thread for every concurrent tool call. asyncio.Lock binds to whichever loop first contends on it; when a caller on a different loop later releases or wakes a waiter, it does so without call_soon_threadsafe, so the waiting loop's selector is never woken and that caller hangs forever with no exception. A third concurrent caller instead raises a synchronous RuntimeError ("bound to a different event loop"). Either way, two concurrent OAuth-protected tool calls (including the very first cold-start token fetch) can freeze the entire agent turn. Gateway's async path (ToolNode._afunc) is unaffected. Replace the asyncio.Lock with a plain threading.Lock, acquired via asyncio.to_thread so the blocking wait never blocks the event loop, and released synchronously in a finally block. This keeps the single-fetch de-duplication the lock provided while making it safe across however many event loops/threads call into the same server's lock. Adds a regression test that runs three threads, each with its own event loop, calling get_authorization_header concurrently for the same server, and asserts (with a bounded join timeout so a regression fails fast instead of hanging the suite) that none hang or raise, and that only one real token fetch happens. * fix(mcp): make OAuth lock acquisition cancellation-safe get_authorization_header acquired the per-server threading.Lock via a bare `await asyncio.to_thread(lock.acquire)`, with the try/finally that guarantees release only starting after that await returned. Once the executor thread had actually started running lock.acquire(), cancelling the awaiting caller only stopped the caller -- Python cannot interrupt a running OS thread. CancelledError was still delivered to the caller immediately, but the thread kept blocking until the current holder released, then silently acquired the lock with nobody left to call release() for it. The lock stayed locked forever and every later OAuth token refresh for that server blocked permanently at the same line -- the exact cross-thread deadlock this lock was introduced to prevent, reintroduced via a different path under cancellation (e.g. a caller wrapped in asyncio.wait_for/asyncio.timeout, or task-group cancellation). Run the acquisition as an explicit asyncio.create_task, awaited via asyncio.shield() so cancelling the caller no longer cancels the underlying acquisition task. If the caller is cancelled, keep (re-)waiting on the still-shielded acquisition task -- tolerating further cancellation during this cleanup by simply retrying -- until it actually finishes, release the lock immediately, and only then re-raise. This guarantees the lock is released regardless of when or how many times the caller is cancelled: before the acquisition is even scheduled, while queued, or after it has already been silently granted. Adds a regression test that holds the per-server lock, starts a second caller that has to wait for it, cancels that caller while it is genuinely blocked in its executor thread, releases the original holder, and asserts a third caller completes within a bounded asyncio.wait_for and still performs exactly one token fetch. Every potentially-hanging await is bounded so a regression fails the test quickly instead of hanging the suite.	2026-07-22 19:58:43 +08:00
March7	8c78d1f41f	fix(subagents): load user-scoped skills (#4356 )	2026-07-22 14:59:33 +08:00
Daoyuan Li	09d9cf53d2	fix(harness): add timeout to invoke_acp_agent to prevent indefinite hangs (#4238 ) invoke_acp_agent had no timeout anywhere in its call path, and ACPAgentConfig had no timeout field. If the ACP agent subprocess answers initialize/new_session correctly but then hangs inside prompt(), the tool call - and therefore the whole agent turn - blocks indefinitely, with the child process left running. MCP stdio servers already guard against this class of hang via tool_call_timeout; ACP agent invocations had no equivalent. Add ACPAgentConfig.timeout_seconds (default 1800, ge=1), mirroring the shape/default of subagents.timeout_seconds, and wrap the conn.prompt() call in asyncio.wait_for(). On TimeoutError, return a clear error instead of hanging; exiting the spawn_agent_process context block triggers the ACP library's own graceful-then-forceful subprocess cleanup, so the hung process is actually terminated.	2026-07-22 14:47:08 +08:00
lllyfff	01a89f2379	[feat] memory: pluggable MemoryManager interface for backend onboarding (#4326 ) * refactor(memory): pluggable MemoryManager interface for backend onboarding Optimize the MemoryManager interface layer so new backends (mem0/openviking) onboard with less code and the contract stays stable as capabilities are added. A minimal backend now implements only from_config + add + get_context (verified by test_memory_manager_interface.py::_MinimalBackend onboarding via the factory); the factory no longer knows a backend's private hooks. - MemoryManager: ABC -> pydantic BaseModel; three-tier methods (tier-1 add/get_context abstract; tier-2 management defaults; tier-3 optional hooks warm/reload/fact + on_pre_compress/on_turn_start). Dropped 3 self-serving hooks. 6 hasattr probe sites -> direct call + try/except NotImplementedError. - from_config classmethod: factory thins to resolve + inject storage_path + collect host hooks + call from_config; DeerMem-specific hook consumption moved from factory to DeerMem.from_config. - Invariants: @model_validator (mode='tool' requires search via supports_search ClassVar); DeerMemConfig storage_path-is-file check moved here from factory. - Async: aadd/aget_context/asearch default to the sync path (speculative). - Callbacks: MemoryCallbacks + LangfuseMemoryCallbacks; on_memory_llm_call subsumes tracing_callback (same signature/timing/mutation); deleted the tracing_callback field. DeerMem decoupled from langfuse (portability). - noop keeps read-op empty overrides (avoids router 500s on the disable-memory-via-noop path); only delete/export inherit the base raise. Behavior preserved: 661 passed / 13 skipped. Docs: backends/README.md rewritten (three-tier + from_config + callbacks); samples README updated; removed stale private doc paths. Co-Authored-By: Claude <noreply@anthropic.com> * fix(memory): 501 on unsupported read/manage endpoints + accurate warm log Review follow-up on the three-tier MemoryManager refactor. - Read/manage endpoints (GET /memory, /memory/export, /memory/status, DELETE /memory, POST /memory/import) and the /memory/reload fallback now catch NotImplementedError -> 501, matching the fact-CRUD endpoints. The hasattr->try/except migration had skipped these: they were @abstractmethod before (every backend implemented them, so they never raised), so once they became tier-2 default-raise a minimal backend (only add + get_context) hit a raw 500 -- there is no global NotImplementedError handler. get_memory is shared via _get_memory_or_501 (covers /memory, export, status, reload fallback). noop is unchanged: its read-op empty overrides never raise. - warm() base default returns None (tri-state: True=warmed, False=failed, None=nothing to warm) so the Gateway lifespan logs "skipping" for a non-DeerMem backend (e.g. noop) instead of the inaccurate "warmed successfully" it never earned. DeerMem.warm keeps True/False. - Tests: 6 router 501 tests (read/manage + reload fallback) + 2 lifespan warm-log tests (None->skipping, False->warning); conformance/pluggable assert warm() is None. 705 passed / 13 skipped; lint clean. Co-Authored-By: Claude <noreply@anthropic.com> * fix(memory): review follow-ups - search-flag consistency, client reload, backend_config purity Address review feedback on the three-tier MemoryManager refactor: - [Medium] supports_search/search drift: the invariant now requires the supports_search ClassVar flag to MATCH whether search() is actually overridden (type(self).search is not MemoryManager.search), so the flag can't drift from the impl. Catches both directions at instantiation: a backend that overrides search() but forgets supports_search=True (was a misleading tool-mode rejection), and one that sets the flag without overriding (was a runtime NotImplementedError on the first memory_search). noop sets supports_search=True to match its search() override. Conformance adds drift + consistent-backend tests. - [Low] client.reload_memory fallback: wrap the get_memory fallback so a minimal backend (only add + get_context) surfaces a clean NotImplementedError ("implements neither reload_memory nor get_memory") instead of an uncaught propagation -- mirrors the router's 501. Test added. - [Low] backend_config purity: DeerMem.from_config restores backend_config to the pure data the host passed after model_post_init parses the injected hooks into DeerMemConfig (self._config, PrivateAttr); the field stays serializable (no callables/LLM) and matches the README ("host hooks NOT in backend_config"). Test asserts purity + hooks wired. - [Low] CHANGELOG: breaking-change note that mode='tool' + non-search backend now fails fast at startup (was silently empty) so operators recognize it on upgrade. - [Nit] .gitignore: drop the env-specific .tmp-pytest/ entry (--basetemp is local-only, not make test/CI). 709 passed / 13 skipped; lint clean. Co-Authored-By: Claude <noreply@anthropic.com> * docs(changelog): correct memory tool-mode fail-fast note The CHANGELOG entry said mode='tool' + a non-search backend "(e.g. noop)" fails fast at startup, but noop overrides search() (returns []) and sets supports_search=True (required by the consistency invariant), so noop IS search-capable and noop+tool does NOT fail fast. The fail-fast only affects a custom backend that onboards without overriding search(). Reworded to drop the misleading noop example and state both shipping backends implement search(). Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>	2026-07-22 14:40:57 +08:00
Aari	05e4f4f6d8	fix(sandbox): bound E2B output synchronization resources (#4364 ) * fix(sandbox): bound E2B output synchronization resources E2B release-time output sync pulled every changed file back from the remote VM with only a per-file size cap and no aggregate bound, so a pathological outputs tree (thousands of files, or many sub-cap files summing to gigabytes, or a slow VM) could make release download unboundedly on a hot path that runs at every agent turn end. Add three aggregate ceilings on top of the per-file cap — total bytes, file count, and a wall-clock deadline — enforced in the sync loop. When a ceiling is hit the pass stops early, logs what it dropped, and defers the rest to the next release. A truncated pass skips stale-manifest pruning so files it never reached are reconciled next time instead of being forgotten and re-downloaded. Closes #4340 * test(sandbox): pin multi-pass convergence of bounded output sync The four truncation tests each exercise a single capped pass. Add a two-pass test that locks in the invariant the design relies on for correctness: already-synced files are skipped before the budget check, so they never consume the cap and the deferred tail drains over successive releases instead of the leading files being re-downloaded every turn. A refactor that let a skipped file consume the cap would pass the single-pass tests but fail this one.	2026-07-22 14:32:54 +08:00
Lee minjing	e225ad57d7	feat(uploads): lazy-load historical files via list_uploaded_files tool (#4174 ) * feat(uploads): lazy-load historical files via list_uploaded_files tool Replace per-turn injection of all historical upload metadata with on-demand discovery via a new `list_uploaded_files` built-in tool, following the same deferred-discovery pattern used by skills. - Rename <uploaded_files> block to <current_uploads> (current-run files only) - Add list_uploaded_files tool with include_outline: bool\|list[str] - Extract outline helpers to shared deerflow/utils/file_outline.py - Update system prompt to reflect lazy-loading behaviour - Historical file scan removed from UploadsMiddleware.before_agent() Co-Authored-By: Claude <noreply@anthropic.com> * fix(uploads): clear uploaded_files state when no new files in current turn When before_agent() returns None on empty turns, the LastValue uploaded_files field retains the previous turn's filenames. list_uploaded_files then incorrectly excludes those files as "current-run" files, making them invisible until the next upload. Fix: return {"uploaded_files": []} instead of None to explicitly clear state. Add two-turn regression test covering the exact scenario from review feedback. Co-Authored-By: Claude <noreply@anthropic.com> * fix: resolve CI lint errors and stale test assertion from merge - Split long prompt line to fit 240-char limit - Add missing `Any` import in list_uploaded_files_tool - Remove unused `re` import in file_conversion (outline code moved) - Remove unused `os` import in middleware test - Fix test assertion: <uploaded_files> → <current_uploads> after main merge Co-Authored-By: Claude <noreply@anthropic.com> * fix: resolve CI lint errors and stale test assertion from merge - Split long prompt line to fit 240-char limit - Add missing `Any` import in list_uploaded_files_tool - Remove unused `re` import in file_conversion (outline code moved) - Remove unused `os` import in middleware test - Fix test assertion: <uploaded_files> → <current_uploads> after main merge Co-Authored-By: Claude <noreply@anthropic.com> * fix: add current_uploads to input sanitization exempt tags The lazy-loading PR renamed <uploaded_files> to <current_uploads>. The anti-drift guard scans all framework XML blocks and requires each to be either blocked or explicitly exempted. current_uploads wraps trusted server-generated file metadata, not user input, so it belongs in the exempt set. Co-Authored-By: Claude <noreply@anthropic.com> * test: regenerate replay golden after uploaded_files state change before_agent now returns {"uploaded_files": []} instead of None, adding uploaded_files to SSE values events. Regenerated via DEERFLOW_WRITE_GOLDEN=1. Co-Authored-By: Claude <noreply@anthropic.com> * fix: review feedback — memory pipeline, stale tags, state clearing, nits - Match both tags in memory stripping pipeline (uploaded_files\|current_uploads) - Remove stale uploaded_files from _BLOCKED_TAG_NAMES - Clear uploaded_files on all before_agent early-return paths - Fix ponytail: stray word in file_conversion re-export comment - Remove dead total_omitted branch in _format_omitted_summary - ruff format fixes Co-Authored-By: Claude <noreply@anthropic.com> * fix: block current_uploads, sanitize only original user content Per review feedback: instead of exempting <current_uploads> (which allows user forgery), move it to _BLOCKED_TAG_NAMES and change InputSanitizationMiddleware._process_request to scan only the original user content (ORIGINAL_USER_CONTENT_KEY) when available. Server-injected trusted blocks are no longer checked against the blocked-tag denylist. Co-Authored-By: Claude <noreply@anthropic.com> * docs: clarify fallback reason in input sanitization comment Co-Authored-By: Claude <noreply@anthropic.com> * @ fix: third-round review feedback — state visibility, sanitization, regex, nits - list_uploaded_files_tool: logger.warning instead of silent try/except on runtime.state read failure (High) - input_sanitization_middleware: _extract_text_from_content skips empty text blocks to match message_content_to_text behaviour; rfind fallback path logs warning for observability (Medium) - memory pipeline regexes: backreference (?P<tag>)(?P=tag) in message_processing.py and prompt.py (Low) - file_conversion.py: re-export moved to top of file (Low) - Tests: middleware→tool state bridge test; integrated forged-tag + multimodal sanitization tests PR #4174 — Follow-up issues: #4212, #4213, #4214 Co-Authored-By: Claude <noreply@anthropic.com> @ * @ fix: 4th-round review — denylist, sanitization, scandir, nits - Add "uploaded_files" back to _BLOCKED_TAG_NAMES (old tag still processed by deermem; user forgery must be escaped) (consistency) - Fix inaccurate rfind-fallback comment: UploadsMiddleware keeps string as string, fallback is unreachable for strings (doc fix) - Distinguish "empty string key" (upload without text) from "non-string key" (caller forgery) so empty-text uploads never escape the server block (edge) - Merge dual os.scandir(uploads_dir) calls into one list re-use (minor) - Add comment on .md sibling skip known limitation: user-uploaded .md files whose stem collides with a converted doc are hidden (boundary, no code change) Co-Authored-By: Claude <noreply@anthropic.com> @ * @ fix: tighten rfind-failure fallback — distinguish server blocks from user blocks When _extract_text_from_content and message_content_to_text disagree on multimodal list content and rfind fails, use content[0] (server-injected <current_uploads> block) vs content[1:] (user blocks) to sanitize only user blocks. Raw strings and non-standard dict blocks that _extract_text_from_content misses are now also sanitized. Non-distinguishable paths (< 2 text blocks, non-list content) still degrade to full sanitization (safe — server block may be escaped but user forgery never leaks). All fallback paths log via logger.warning. Decision 18 / willem-bd 4th-round comment #3 Co-Authored-By: Claude <noreply@anthropic.com> @ * @ fix: correct comments referencing text_blocks → content in rfind fallback Co-Authored-By: Claude <noreply@anthropic.com> @ * fix: 5th-round review — dead code, subagent gating, integration test, perf, consistency - Delete unreachable ORIGINAL_USER_CONTENT_KEY guard in rfind fallback branch (original_user_content guaranteed non-empty str at that point) - Remove list_uploaded_files from BUILTIN_TOOLS; add include_upload_tool param to get_available_tools(), default True; task_tool.py passes False so subagents no longer receive a tool whose state exclusion is broken - Add integration test exercising real create_agent graph (not mocked runtime.state) to verify LangGraph propagates before_agent state writes into ToolRuntime.state during same-turn tool calls - Cache DirEntry.stat() st_size in candidates tuple to avoid second per-file syscall in the rendering loop - Make the upload-tag pre-check case-insensitive (content_str.lower()) to match _UPLOAD_BLOCK_RE re.IGNORECASE PR #4174 — willem-bd 5th-round review items #1-#5 Co-Authored-By: Claude <noreply@anthropic.com> * fix(channels): pass files metadata through _human_input_message() for IM uploads _human_input_message() was not passing additional_kwargs.files to the downstream message. UploadsMiddleware read no files, wrote uploaded_files=[], and list_uploaded_files reported same-run IM attachments as historical files (fancyboi999 repro). Fix: add files parameter to _human_input_message(), call site passes files=uploaded. Regression test locks the contract. Co-Authored-By: Claude <noreply@anthropic.com> * fix(channels): remove legacy <uploaded_files> manual prepend to fix double-injection regression Commit 8d86dbf6 added files= pass-through to UploadsMiddleware but left the manual _format_uploaded_files_block() prepend in place. Every IM attachment reached the model twice — once via the legacy <uploaded_files> block and once via <current_uploads>. This commit removes the manual prepend and the now-dead _format_uploaded_files_block() function. UploadsMiddleware is the sole upload-context producer for both IM and web paths. Reported-by: fancyboi999 (PR review) Co-Authored-By: Claude <noreply@anthropic.com> * docs: update #4212 issue body to reflect completed fixes and narrowed remaining scope * chore: remove temporary scratch file * fix(middleware): neutralize user-derived values inside <current_uploads> block Upload-derived filenames, paths, outline titles, and preview text are interpolated verbatim inside the trusted <current_uploads> wrapper, which InputSanitizationMiddleware exempts from sanitization. A crafted filename or document heading containing blocked authority tags would bypass the guardrail and enter model context as trusted framework data. Fix: call neutralize_untrusted_tags() on all four user-derived values inside _format_file_entry(), preserving the outer <current_uploads> wrapper untouched. Reported-by: fancyboi999 (P1 security review) Co-Authored-By: Claude <noreply@anthropic.com> * fix(middleware): neutralize extension labels in omitted-file summary Files exceeding the 10-item context cap bypass _format_file_entry(). Their extensions, derived from user-controlled filenames via _extension_label(), were interpolated verbatim into the trusted <current_uploads> wrapper — another path for blocked authority tags to escape the guardrail. Fix: neutralize extension values inside _extension_label(), the single extraction point for all extension labels. Reported-by: fancyboi999 (P1 security review) Co-Authored-By: Claude <noreply@anthropic.com> * fix(tools): neutralize user-derived values in list_uploaded_files tool result Apply neutralize_untrusted_tags() to every model-visible user-derived value returned by list_uploaded_files: filename, virtual path, extension, outline titles, outline preview lines, and omitted-file extension summary. This closes the last remaining injection bypass in the upload lazy-loading path - the <current_uploads> block and its omitted summary were already neutralized (previous commits), but the list_uploaded_files tool produced a second exit for the same attacker-controlled metadata that ToolResultSanitizationMiddleware did not cover. Co-Authored-By: Claude <noreply@anthropic.com> * fix(tests): add missing include_upload_tool=False to task_tool mock assertions PR #4174 added include_upload_tool parameter to get_available_tools(). task_tool.py correctly passes include_upload_tool=False for subagents but 5 existing tests' assert_called_once_with expectations were not updated, causing CI failures. Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>	2026-07-22 14:02:56 +08:00
Ryker_Feng	20debf9cc7	feat(agents): per-agent model and generation settings (#4347 ) * feat(agents): per-agent model and generation settings Let each custom agent choose its own model and sampling settings (temperature, max_tokens) plus thinking / reasoning_effort defaults, so agents sharing a model profile are no longer stuck with one shared temperature and output length (#4336). AgentConfig gains optional model_settings / thinking_enabled / reasoning_effort (None = inherit). create_chat_model applies per-caller model_overrides on top of the profile before the thinking/Codex transforms; the lead agent resolves each knob with precedence request > agent config > profile/default. The /api/agents create/update routes persist the fields and reject an unknown model. The default lead agent path is unchanged (no agent config -> overrides None). The agent chat composer also stops force-overriding an agent's configured default model with models[0]. * fix(agents): tri-state thinking control and default-model capability gating The model-settings dialog seeded the thinking switch to false, so opening it to tweak temperature and saving silently disabled thinking (the runtime default is on) with no way back to inherit. It also hid the thinking / reasoning controls whenever the agent inherited the global default model, since `__default__` never resolved through `models.find`. Give thinking an explicit Inherit / On / Off tri-state so an untouched save is a no-op, and resolve `__default__` to the effective default (models[0]) for the capability check. Logic lives in the tested helpers module.	2026-07-22 13:44:55 +08:00
Andrew Chen	ae510cb2e8	fix(sandbox): make an empty old_str a no-op in str_replace on any file (#4256 ) str_replace guards the replacement with `if old_str not in content`, which cannot reject an empty old_str -- `"" in content` is always true. So an empty old_str reached `str.replace("", new_str)`, which inserts new_str at every character boundary, and the tool rewrote the file while still returning "OK": old_str='', new_str='# H\n' -> OK, file silently prepended old_str='', new_str='X', replace_all -> OK, 'XdXeXfX XmXaXiXnX(X)X:X\nX...' The empty-file branch above it already handles this case (`if not content: if not old_str: return "OK"`), and the existing test states the intent directly: "An empty old_str is a no-op edit and remains a benign OK". That contract just never held once the file had content. The tool is registered by default (config.example.yaml) and its schema declares old_str as a plain string with no minLength, so a model can emit "" legitimately; read-before-write only compares a hash and lets it past. Check old_str first so the no-op holds whatever the file contains. The empty-file case folds into the same not-found branch, which keeps its message and behaviour. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-07-22 09:20:54 +08:00

1 2 3 4 5 ...

632 Commits