fix(backend): stream DeerFlowClient AI text as token deltas (#1969) (#1974)

* fix(backend): stream DeerFlowClient AI text as token deltas (#1969)

DeerFlowClient.stream() subscribed to LangGraph stream_mode=["values",
"custom"] which only delivers full-state snapshots at graph-node
boundaries, so AI replies were dumped as a single messages-tuple event
per node instead of streaming token-by-token. `client.stream("hello")`
looked identical to `client.chat("hello")` — the bug reported in #1969.

Subscribe to "messages" mode as well, forward AIMessageChunk deltas as
messages-tuple events with delta semantics (consumers accumulate by id),
and dedup the values-snapshot path so it does not re-synthesize AI
text that was already streamed. Introduce a per-id usage_metadata
counter so the final AIMessage in the values snapshot and the final
"messages" chunk — which carry the same cumulative usage — are not
double-counted.

chat() now accumulates per-id deltas and returns the last message's
full accumulated text. Non-streaming mock sources (single event per id)
are a degenerate case of the same logic, keeping existing callers and
tests backward compatible.

Verified end-to-end against a real LLM: a 15-number count emits 35
messages-tuple events with BPE subword boundaries clearly visible
("eleven" -> "ele" / "ven", "twelve" -> "tw" / "elve"), 476ms across
the window, end-event usage matches the values-snapshot usage exactly
(not doubled). tests/test_client_live.py::TestLiveStreaming passes.

New unit tests:
- test_messages_mode_emits_token_deltas: 3 AIMessageChunks produce 3
  delta events with correct content/id/usage, values-snapshot does not
  duplicate, usage counted once.
- test_chat_accumulates_streamed_deltas: chat() rebuilds full text
  from deltas.
- test_messages_mode_tool_message: ToolMessage delivered via messages
  mode is not duplicated by the values-snapshot synthesis path.

The stream() docstring now documents why this client does not reuse
Gateway's run_agent() / StreamBridge pipeline (sync vs async, raw
LangChain objects vs serialized dicts, single caller vs HTTP fan-out).

Fixes #1969

* refactor(backend): simplify DeerFlowClient streaming helpers (#1969)

Post-review cleanup for the token-level streaming fix. No behavior
change for correct inputs; one efficiency regression fixed.

Fix: chat() O(n²) accumulator
-----------------------------
`chat()` accumulated per-id text via `buffers[id] = buffers.get(id,"") + delta`,
which is O(n) per concat → O(n²) total over a streamed response. At
~2 KB cumulative text this becomes user-visible; at 50 KB / 5000 chunks
it costs roughly 100-300 ms of pure copying. Switched to
`dict[str, list[str]]` + `"".join()` once at return.

Cleanup
-------
- Extract `_serialize_tool_calls`, `_ai_text_event`, `_ai_tool_calls_event`,
  and `_tool_message_event` static helpers. The messages-mode and
  values-mode branches previously repeated four inline dict literals each;
  they now call the same builders.
- `StreamEvent.type` is now typed as `Literal["values", "messages-tuple",
  "custom", "end"]` via a `StreamEventType` alias. Makes the closed set
  explicit and catches typos at type-check time.
- Direct attribute access on `AIMessage`/`AIMessageChunk`: `.usage_metadata`,
  `.tool_calls`, `.id` all have default values on the base class, so the
  `getattr(..., None)` fallbacks were dead code. Removed from the hot
  path.
- `_account_usage` parameter type loosened to `Any` so that LangChain's
  `UsageMetadata` TypedDict is accepted under strict type checking.
- Trimmed narrating comments on `seen_ids` / `streamed_ids` / the
  values-synthesis skip block; kept the non-obvious ones that document
  the cross-mode dedup invariant.

Net diff: -15 lines. All 132 unit tests + harness boundary test still
pass; ruff check and ruff format pass.

* docs(backend): add STREAMING.md design note (#1969)

Dedicated design document for the token-level streaming architecture,
prompted by the bug investigation in #1969.

Contents:
- Why two parallel streaming paths exist (Gateway HTTP/async vs
  DeerFlowClient sync/in-process) and why they cannot be merged.
- LangGraph's three-layer mode naming (Graph "messages" vs Platform
  SDK "messages-tuple" vs HTTP SSE) and why a shared string constant
  would be harmful.
- Gateway path: run_agent + StreamBridge + sse_consumer with a
  sequence diagram.
- DeerFlowClient path: sync generator + direct yield, delta semantics,
  chat() accumulator.
- Why the three id sets (seen_ids / streamed_ids / counted_usage_ids)
  each carry an independent invariant and cannot be collapsed.
- End-to-end sequence for a real conversation turn.
- Lessons from #1969: why mock-based tests missed the bug, why
  BPE subword boundaries in live output are the strongest
  correctness signal, and the regression test that locks it in.
- Source code location index.

Also:
- Link from backend/CLAUDE.md Embedded Client section.
- Link from backend/docs/README.md under Feature Documentation.

* test(backend): add refactor regression guards for stream() (#1969)

Three new tests in TestStream that lock the contract introduced by
PR #1974 so any future refactor (sync->async migration, sharing a
core with Gateway's run_agent, dedup strategy change) cannot
silently change behavior.

- test_dedup_requires_messages_before_values_invariant: canary that
  documents the order-dependence of cross-mode dedup. streamed_ids
  is populated only by the messages branch, so values-before-messages
  for the same id produces duplicate AI text events. Real LangGraph
  never inverts this order, but a refactor that does (or that makes
  dedup idempotent) must update this test deliberately.

- test_messages_mode_golden_event_sequence: locks the *exact* event
  sequence (4 events: 2 messages-tuple deltas, 1 values snapshot, 1
  end) for a canonical streaming turn. List equality gives a clear
  diff on any drift in order, type, or payload shape.

- test_chat_accumulates_in_linear_time: perf canary for the O(n^2)
  fix in commit 1f11ba10. 10,000 single-char chunks must accumulate
  in under 1s; the threshold is wide enough to pass on slow CI but
  tight enough to fail if buffer = buffer + delta is restored.

All three tests pass alongside the existing 12 TestStream tests
(15/15). ruff check + ruff format clean.

* docs(backend): clarify stream() docstring on JSON serialization (#1969)

Replace the misleading "raw LangChain objects (AIMessage,
usage_metadata as dataclasses), not dicts" claim in the
"Why not reuse Gateway's run_agent?" section. The implementation
already yields plain Python dicts (StreamEvent.data is dict, and
usage_metadata is a TypedDict), so the original wording suggested
a richer return type than the API actually delivers.

The corrected wording focuses on what is actually true and
relevant: this client skips the JSON/SSE serialization layer that
Gateway adds for HTTP wire transmission, and yields stream event
payloads directly as Python data structures.

Addresses Copilot review feedback on PR #1974.

* test(backend): document none-id messages dedup limitation (#1969)

Add test_none_id_chunks_produce_duplicates_known_limitation to
TestStream that explicitly documents and asserts the current
behavior when an LLM provider emits AIMessageChunk with id=None
(vLLM, certain custom backends).

The cross-mode dedup machinery cannot record a None id in
streamed_ids (guarded by ``if msg_id:``), so the values snapshot's
reassembled AIMessage with a real id falls through and synthesizes
a duplicate AI text event. The test asserts len == 2 and locks
this as a known limitation rather than silently letting future
contributors hit it without context.

Why this is documented rather than fixed:
* Falling back to ``metadata.get("id")`` does not help — LangGraph's
  messages-mode metadata never carries the message id.
* Synthesizing ``f"_synth_{id(msg_chunk)}"`` only helps if the
  values snapshot uses the same fallback, which it does not.
* A real fix requires provider cooperation (always emit chunk ids)
  or content-based dedup (false-positive risk), neither of which
  belongs in this PR.

If a real fix lands, replace this test with a positive assertion
that dedup works for None-id chunks.

Addresses Copilot review feedback on PR #1974 (client.py:515).

* fix(frontend): UI polish - fix CSS typo, dark mode border, and hardcoded colors (#1942)

- Fix `font-norma` typo to `font-normal` in message-list subtask count
- Fix dark mode `--border` using reddish hue (22.216) instead of neutral
- Replace hardcoded `rgb(184,184,192)` in hero with `text-muted-foreground`
- Replace hardcoded `bg-[#a3a1a1]` in streaming indicator with `bg-muted-foreground`
- Add missing `font-sans` to welcome description `<pre>` for consistency
- Make case-study-section padding responsive (`px-4 md:px-20`)

Closes #1940

* docs: clarify deployment sizing guidance (#1963)

* fix(frontend): prevent stale 'new' thread ID from triggering 422 history requests (#1960)

After history.replaceState updates the URL from /chats/new to
/chats/{UUID}, Next.js useParams does not update because replaceState
bypasses the router. The useEffect in useThreadChat would then set
threadIdFromPath ('new') as the threadId, causing the LangGraph SDK
to call POST /threads/new/history which returns HTTP 422 (Invalid
thread ID: must be a UUID).

This fix adds a guard to skip the threadId update when
threadIdFromPath is the literal string 'new', preserving the
already-correct UUID that was set when the thread was created.

* fix(frontend): avoid using route new as thread id (#1967)

Co-authored-by: luoxiao6645 <luoxiao6645@gmail.com>

* Fix(subagent): Event loop conflict in SubagentExecutor.execute() (#1965)

* Fix event loop conflict in SubagentExecutor.execute()

When SubagentExecutor.execute() is called from within an already-running
event loop (e.g., when the parent agent uses async/await), calling
asyncio.run() creates a new event loop that conflicts with asyncio
primitives (like httpx.AsyncClient) that were created in and bound to
the parent loop.

This fix detects if we're already in a running event loop, and if so,
runs the subagent in a separate thread with its own isolated event loop
to avoid conflicts.

Fixes: sub-task cards not appearing in Ultra mode when using async parent agents

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(subagent): harden isolated event loop execution

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(backend): remove dead getattr in _tool_message_event

---------

Co-authored-by: greatmengqi <chenmengqi.0376@bytedance.com>
Co-authored-by: Xinmin Zeng <135568692+fancyboi999@users.noreply.github.com>
Co-authored-by: 13ernkastel <LennonCMJ@live.com>
Co-authored-by: siwuai <458372151@qq.com>
Co-authored-by: 肖 <168966994+luoxiao6645@users.noreply.github.com>
Co-authored-by: luoxiao6645 <luoxiao6645@gmail.com>
Co-authored-by: Saber <11769524+hawkli-1994@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
This commit is contained in:
greatmengqi 2026-04-10 18:16:38 +08:00 committed by GitHub
parent 654354c624
commit b1aabe88b8
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
5 changed files with 917 additions and 56 deletions

View File

@ -395,14 +395,16 @@ Both can be modified at runtime via Gateway API endpoints or `DeerFlowClient` me
**Architecture**: Imports the same `deerflow` modules that LangGraph Server and Gateway API use. Shares the same config files and data directories. No FastAPI dependency.
**Agent Conversation** (replaces LangGraph Server):
- `chat(message, thread_id)` — synchronous, returns final text
- `stream(message, thread_id)` — yields `StreamEvent` aligned with LangGraph SSE protocol:
- `"values"` — full state snapshot (title, messages, artifacts)
- `"messages-tuple"` — per-message update (AI text, tool calls, tool results)
- `"end"` — stream finished
- `chat(message, thread_id)` — synchronous, accumulates streaming deltas per message-id and returns the final AI text
- `stream(message, thread_id)` — subscribes to LangGraph `stream_mode=["values", "messages", "custom"]` and yields `StreamEvent`:
- `"values"` — full state snapshot (title, messages, artifacts); AI text already delivered via `messages` mode is **not** re-synthesized here to avoid duplicate deliveries
- `"messages-tuple"` — per-chunk update: for AI text this is a **delta** (concat per `id` to rebuild the full message); tool calls and tool results are emitted once each
- `"custom"` — forwarded from `StreamWriter`
- `"end"` — stream finished (carries cumulative `usage` counted once per message id)
- Agent created lazily via `create_agent()` + `_build_middlewares()`, same as `make_lead_agent`
- Supports `checkpointer` parameter for state persistence across turns
- `reset_agent()` forces agent recreation (e.g. after memory or skill changes)
- See [docs/STREAMING.md](docs/STREAMING.md) for the full design: why Gateway and DeerFlowClient are parallel paths, LangGraph's `stream_mode` semantics, the per-id dedup invariants, and regression testing strategy
**Gateway Equivalent Methods** (replaces Gateway API):

View File

@ -15,6 +15,7 @@ This directory contains detailed documentation for the DeerFlow backend.
| Document | Description |
|----------|-------------|
| [STREAMING.md](STREAMING.md) | Token-level streaming design: Gateway vs DeerFlowClient paths, `stream_mode` semantics, per-id dedup |
| [FILE_UPLOAD.md](FILE_UPLOAD.md) | File upload functionality |
| [PATH_EXAMPLES.md](PATH_EXAMPLES.md) | Path types and usage examples |
| [summarization.md](summarization.md) | Context summarization feature |
@ -47,6 +48,7 @@ docs/
├── PATH_EXAMPLES.md # Path usage examples
├── summarization.md # Summarization feature
├── plan_mode_usage.md # Plan mode feature
├── STREAMING.md # Token-level streaming design
├── AUTO_TITLE_GENERATION.md # Title generation
├── TITLE_GENERATION_IMPLEMENTATION.md # Title implementation details
└── TODO.md # Roadmap and issues

351
backend/docs/STREAMING.md Normal file
View File

@ -0,0 +1,351 @@
# DeerFlow 流式输出设计
本文档解释 DeerFlow 是如何把 LangGraph agent 的事件流端到端送到两类消费者HTTP 客户端、嵌入式 Python 调用方)的:两条路径为什么**必须**并存、它们各自的契约是什么、以及设计里那些 non-obvious 的不变式。
---
## TL;DR
- DeerFlow 有**两条并行**的流式路径:**Gateway 路径**async / HTTP SSE / JSON 序列化)服务浏览器和 IM 渠道;**DeerFlowClient 路径**sync / in-process / 原生 LangChain 对象)服务 Jupyter、脚本、测试。它们**无法合并**——消费者模型不同。
- 两条路径都从 `create_agent()` 工厂出发,核心都是订阅 LangGraph 的 `stream_mode=["values", "messages", "custom"]``values` 是节点级 state 快照,`messages` 是 LLM token 级 delta`custom` 是显式 `StreamWriter` 事件。**这三种模式不是详细程度的梯度,是三个独立的事件源**,要 token 流就必须显式订阅 `messages`
- 嵌入式 client 为每个 `stream()` 调用维护三个 `set[str]``seen_ids` / `streamed_ids` / `counted_usage_ids`。三者看起来相似但管理**三个独立的不变式**,不能合并。
---
## 为什么有两条流式路径
两条路径服务的消费者模型根本不同:
| 维度 | Gateway 路径 | DeerFlowClient 路径 |
|---|---|---|
| 入口 | FastAPI `/runs/stream` endpoint | `DeerFlowClient.stream(message)` |
| 触发层 | `runtime/runs/worker.py::run_agent` | `packages/harness/deerflow/client.py::DeerFlowClient.stream` |
| 执行模型 | `async def` + `agent.astream()` | sync generator + `agent.stream()` |
| 事件传输 | `StreamBridge`asyncio Queue+ `sse_consumer` | 直接 `yield` |
| 序列化 | `serialize(chunk)` → 纯 JSON dict匹配 LangGraph Platform wire 格式 | `StreamEvent.data`,携带原生 LangChain 对象 |
| 消费者 | 前端 `useStream` React hook、飞书/Slack/Telegram channel、LangGraph SDK 客户端 | Jupyter notebook、集成测试、内部 Python 脚本 |
| 生命周期管理 | `RunManager`run_id 跟踪、disconnect 语义、multitask 策略、heartbeat | 无;函数返回即结束 |
| 断连恢复 | `Last-Event-ID` SSE 重连 | 无需要 |
**两条路径的存在是 DRY 的刻意妥协**Gateway 的全部基础设施async + Queue + JSON + RunManager**都是为了跨网络边界把事件送给 HTTP 消费者**。当生产者agent和消费者Python 调用栈)在同一个进程时,这整套东西都是纯开销。
### 为什么不能让 DeerFlowClient 复用 Gateway
曾经考虑过三种复用方案,都被否决:
1. **让 `client.stream()` 变成 `async def client.astream()`**
breaking change。用户用不上的 `async for` / `asyncio.run()` 要硬塞进 Jupyter notebook 和同步脚本。DeerFlowClient 的一大卖点("把 agent 当普通函数调用")直接消失。
2. **在 `client.stream()` 内部起一个独立事件循环线程,用 `StreamBridge` 在 sync/async 之间做桥接**
引入线程池、队列、信号量。为了"消除重复",把**复杂度**代替代码行数引进来。是典型的"wrong abstraction"——开销高于复用收益。
3. **让 `run_agent` 自己兼容 sync mode**
给 Gateway 加一条用不到的死分支,污染 worker.py 的焦点。
所以两条路径的事件处理逻辑会**相似但不共享**。这是刻意设计,不是疏忽。
---
## LangGraph `stream_mode` 三层语义
LangGraph 的 `agent.stream(stream_mode=[...])` 是**多路复用**接口:一次订阅多个 mode每个 mode 是一个独立的事件源。三种核心 mode
```mermaid
flowchart LR
classDef values fill:#B8C5D1,stroke:#5A6B7A,color:#2C3E50
classDef messages fill:#C9B8A8,stroke:#7A6B5A,color:#2C3E50
classDef custom fill:#B5C4B1,stroke:#5A7A5A,color:#2C3E50
subgraph LG["LangGraph agent graph"]
direction TB
Node1["node: LLM call"]
Node2["node: tool call"]
Node3["node: reducer"]
end
LG -->|"每个节点完成后"| V["values: 完整 state 快照"]
Node1 -->|"LLM 每产生一个 token"| M["messages: (AIMessageChunk, meta)"]
Node1 -->|"StreamWriter.write()"| C["custom: 任意 dict"]
class V values
class M messages
class C custom
```
| Mode | 发射时机 | Payload | 粒度 |
|---|---|---|---|
| `values` | 每个 graph 节点完成后 | 完整 state dicttitle、messages、artifacts| 节点级 |
| `messages` | LLM 每次 yield 一个 chunktool 节点完成时 | `(AIMessageChunk \| ToolMessage, metadata_dict)` | token 级 |
| `custom` | 用户代码显式调用 `StreamWriter.write()` | 任意 dict | 应用定义 |
### 两套命名的由来
同一件事在**三个协议层**有三个名字:
```
Application HTTP / SSE LangGraph Graph
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ frontend │ │ LangGraph │ │ agent.astream│
│ useStream │──"messages- │ Platform SDK │──"messages"──│ graph.astream│
│ Feishu IM │ tuple"──────│ HTTP wire │ │ │
└──────────────┘ └──────────────┘ └──────────────┘
```
- **Graph 层**`agent.stream` / `agent.astream`LangGraph Python 直接 APImode 叫 **`"messages"`**。
- **Platform SDK 层**`langgraph-sdk` HTTP client跨进程 HTTP 契约mode 叫 **`"messages-tuple"`**。
- **Gateway worker** 显式做翻译:`if m == "messages-tuple": lg_modes.append("messages")``runtime/runs/worker.py:117-121`)。
**后果**`DeerFlowClient.stream()` 直接调 `agent.stream()`Graph 层),所以必须传 `"messages"``app/channels/manager.py` 通过 `langgraph-sdk` 走 HTTP SDK所以传 `"messages-tuple"`。**这两个字符串不能互相替代**,也不能抽成"一个共享常量"——它们是不同协议层的 type alias共享只会让某一层说不是它母语的话。
---
## Gateway 路径async + HTTP SSE
```mermaid
sequenceDiagram
participant Client as HTTP Client
participant API as FastAPI<br/>thread_runs.py
participant Svc as services.py<br/>start_run
participant Worker as worker.py<br/>run_agent (async)
participant Bridge as StreamBridge<br/>(asyncio.Queue)
participant Agent as LangGraph<br/>agent.astream
participant SSE as sse_consumer
Client->>API: POST /runs/stream
API->>Svc: start_run(body)
Svc->>Bridge: create bridge
Svc->>Worker: asyncio.create_task(run_agent(...))
Svc-->>API: StreamingResponse(sse_consumer)
API-->>Client: event-stream opens
par worker (producer)
Worker->>Agent: astream(stream_mode=lg_modes)
loop 每个 chunk
Agent-->>Worker: (mode, chunk)
Worker->>Bridge: publish(run_id, event, serialize(chunk))
end
Worker->>Bridge: publish_end(run_id)
and sse_consumer (consumer)
SSE->>Bridge: subscribe(run_id)
loop 每个 event
Bridge-->>SSE: StreamEvent
SSE-->>Client: "event: <name>\ndata: <json>\n\n"
end
end
```
关键组件:
- `runtime/runs/worker.py::run_agent` — 在 `asyncio.Task` 里跑 `agent.astream()`,把每个 chunk 通过 `serialize(chunk, mode=mode)` 转成 JSON`bridge.publish()`
- `runtime/stream_bridge` — 抽象 Queue。`publish/subscribe` 解耦生产者和消费者,支持 `Last-Event-ID` 重连、心跳、多订阅者 fan-out。
- `app/gateway/services.py::sse_consumer` — 从 bridge 订阅,格式化为 SSE wire 帧。
- `runtime/serialization.py::serialize` — mode-aware 序列化;`messages` mode 下 `serialize_messages_tuple``(chunk, metadata)` 转成 `[chunk.model_dump(), metadata]`
**`StreamBridge` 的存在价值**:当生产者(`run_agent` 任务和消费者HTTP 连接)在不同的 asyncio task 里运行时,需要一个可以跨 task 传递事件的中介。Queue 同时还承担断连重连的 buffer 和多订阅者的 fan-out。
---
## DeerFlowClient 路径sync + in-process
```mermaid
sequenceDiagram
participant User as Python caller
participant Client as DeerFlowClient.stream
participant Agent as LangGraph<br/>agent.stream (sync)
User->>Client: for event in client.stream("hi"):
Client->>Agent: stream(stream_mode=["values","messages","custom"])
loop 每个 chunk
Agent-->>Client: (mode, chunk)
Client->>Client: 分发 mode<br/>构建 StreamEvent
Client-->>User: yield StreamEvent
end
Client-->>User: yield StreamEvent(type="end")
```
对比之下sync 路径的每个环节都是显著更少的移动部件:
- 没有 `RunManager` —— 一次 `stream()` 调用对应一次生命周期,无需 run_id。
- 没有 `StreamBridge` —— 直接 `yield`,生产和消费在同一个 Python 调用栈,不需要跨 task 中介。
- 没有 JSON 序列化 —— `StreamEvent.data` 直接装原生 LangChain 对象(`AIMessage.content``usage_metadata``UsageMetadata` TypedDict。Jupyter 用户拿到的是真正的类型,不是匿名 dict。
- 没有 asyncio —— 调用者可以直接 `for event in ...`,不必写 `async for`
---
## 消费语义delta vs cumulative
LangGraph `messages` mode 给出的是 **delta**:每个 `AIMessageChunk.content` 只包含这一次新 yield 的 token**不是**从头的累计文本。
这个语义和 LangChain 的 `fs2 Stream` 风格一致:**上游发增量,下游负责累加**。Gateway 路径里前端 `useStream` React hook 自己维护累加器DeerFlowClient 路径里 `chat()` 方法替调用者做累加。
### `DeerFlowClient.chat()` 的 O(n) 累加器
```python
chunks: dict[str, list[str]] = {}
last_id: str = ""
for event in self.stream(message, thread_id=thread_id, **kwargs):
if event.type == "messages-tuple" and event.data.get("type") == "ai":
msg_id = event.data.get("id") or ""
delta = event.data.get("content", "")
if delta:
chunks.setdefault(msg_id, []).append(delta)
last_id = msg_id
return "".join(chunks.get(last_id, ()))
```
**为什么不是 `buffers[id] = buffers.get(id,"") + delta`**CPython 的字符串 in-place concat 优化仅在 refcount=1 且 LHS 是 local name 时生效;这里字符串存在 dict 里被 reassign优化失效每次都是 O(n) 拷贝 → 总体 O(n²)。实测 50 KB / 5000 chunk 的回复要 100-300ms 纯拷贝开销。用 `list` + `"".join()` 是 O(n)。
---
## 三个 id set 为什么不能合并
`DeerFlowClient.stream()` 在一次调用生命周期内维护三个 `set[str]`
```python
seen_ids: set[str] = set() # values 路径内部 dedup
streamed_ids: set[str] = set() # messages → values 跨模式 dedup
counted_usage_ids: set[str] = set() # usage_metadata 幂等计数
```
乍看像是"三份几乎一样的东西",实际每个管**不同的不变式**。
| Set | 负责的不变式 | 被谁填充 | 被谁查询 |
|---|---|---|---|
| `seen_ids` | 连续两个 `values` 快照里同一条 message 只生成一个 `messages-tuple` 事件 | values 分支每处理一条消息就加入 | values 分支处理下一条消息前检查 |
| `streamed_ids` | 如果一条消息已经通过 `messages` 模式 token 级流过values 快照到达时**不要**再合成一次完整 `messages-tuple` | messages 分支每发一个 AI/tool 事件就加入 | values 分支看到消息时检查 |
| `counted_usage_ids` | 同一个 `usage_metadata` 在 messages 末尾 chunk 和 values 快照的 final AIMessage 里各带一份,**累计总量只算一次** | `_account_usage()` 每次接受 usage 就加入 | `_account_usage()` 每次调用时检查 |
### 为什么不能只用一个 set
关键观察:**同一个 message id 在这三个 set 里的加入时机不同**。
```mermaid
sequenceDiagram
participant M as messages mode
participant V as values mode
participant SS as streamed_ids
participant SU as counted_usage_ids
participant SE as seen_ids
Note over M: 第一个 AI text chunk 到达
M->>SS: add(msg_id)
Note over M: 最后一个 chunk 带 usage
M->>SU: add(msg_id)
Note over V: snapshot 到达,包含同一条 AI message
V->>SE: add(msg_id)
V->>SS: 查询 → 已存在,跳过文本合成
V->>SU: 查询 → 已存在,不重复计数
```
- `seen_ids` **永远在 values 快照到达时**加入,所以它是 "values 已处理" 的标记。一条只出现在 messages 流里的消息(罕见但可能),`seen_ids` 里永远没有它。
- `streamed_ids` **在 messages 流的第一个有效事件时**加入。一条只通过 values 快照到达的非 AI 消息HumanMessage、被 truncate 的 tool 消息),`streamed_ids` 里永远没有它。
- `counted_usage_ids` **只在看到非空 `usage_metadata` 时**加入。一条完全没有 usage 的消息tool message、错误消息永远不会进去。
**集合包含关系**`counted_usage_ids ⊆ (streamed_ids seen_ids)` 大致成立,但**不是严格子集**,因为一条消息可以在 messages 模式流完 text 但**在最后那个带 usage 的 chunk 之前**就被 values snapshot 赶上——此时它已经在 `streamed_ids` 里,但还不在 `counted_usage_ids` 里。把它们合并成一个 dict-of-flags 会让这个微妙的时序依赖**从类型系统里消失**,变成注释里的一句话。三个独立的 set 把不变式显式化了:每个 set 名对应一个可以口头回答的问题。
---
## 端到端:一次真实对话的事件时序
假设调用 `client.stream("Count from 1 to 15")`LLM 给出 "one\ntwo\n...\nfifteen"88 字符tokenizer 把它拆成 ~35 个 BPE chunk。下面是事件到达序列的精简版
```mermaid
sequenceDiagram
participant U as User
participant C as DeerFlowClient
participant A as LangGraph<br/>agent.stream
U->>C: stream("Count ... 15")
C->>A: stream(mode=["values","messages","custom"])
A-->>C: ("values", {messages: [HumanMessage]})
C-->>U: StreamEvent(type="values", ...)
Note over A,C: LLM 开始 yield token
loop 35 次,约 476ms
A-->>C: ("messages", (AIMessageChunk(content="ele"), meta))
C->>C: streamed_ids.add(ai-1)
C-->>U: StreamEvent(type="messages-tuple",<br/>data={type:ai, content:"ele", id:ai-1})
end
Note over A: LLM finish_reason=stop最后一个 chunk 带 usage
A-->>C: ("messages", (AIMessageChunk(content="", usage_metadata={...}), meta))
C->>C: counted_usage_ids.add(ai-1)<br/>(无文本,不 yield)
A-->>C: ("values", {messages: [..., AIMessage(complete)]})
C->>C: ai-1 in streamed_ids → 跳过合成
C->>C: 捕获 usage (已在 counted_usage_idsno-op)
C-->>U: StreamEvent(type="values", ...)
C-->>U: StreamEvent(type="end", data={usage:{...}})
```
关键观察:
1. 用户看到 **35 个 messages-tuple 事件**,跨越约 476ms每个事件带一个 token delta 和同一个 `id=ai-1`
2. 最后一个 `values` 快照里的 `AIMessage` **不会**再触发一个完整的 `messages-tuple` 事件——因为 `ai-1 in streamed_ids` 跳过了合成。
3. `end` 事件里的 `usage` 正好等于那一份 cumulative usage**不是它的两倍**——`counted_usage_ids` 在 messages 末尾 chunk 上已经吸收了values 分支的重复访问是 no-op。
4. 消费者拿到的 `content` 是**增量**"ele" 只包含 3 个字符,不是 "one\ntwo\n...ele"。想要完整文本要按 `id` 累加,`chat()` 已经帮你做了。
---
## 为什么这个设计容易出 bug以及测试策略
本文档的直接起因是 bytedance/deer-flow#1969`DeerFlowClient.stream()` 原本只订阅 `["values", "custom"]`**漏了 `"messages"`**。结果 `client.stream("hello")` 等价于一次性返回,视觉上和 `chat()` 没区别。
这类 bug 有三个结构性原因:
1. **多协议层命名**`messages` / `messages-tuple` / HTTP SSE `messages` 是同一概念的三个名字。在其中一层出错不会在另外两层报错。
2. **多消费者模型**Gateway 和 DeerFlowClient 是两套独立实现,**没有单一的"订阅哪些 mode"的 single source of truth**。前者订阅对了不代表后者也订阅对了。
3. **mock 测试绕开了真实路径**:老测试用 `agent.stream.return_value = iter([dict_chunk, ...])` 喂 values 形状的 dict 模拟 state 快照。这样构造的输入**永远不会进入 `messages` mode 分支**,所以即使 `stream_mode` 里少一个元素CI 依然全绿。
### 防御手段
真正的防线是**显式断言 "messages" mode 被订阅 + 用真实 chunk shape mock**
```python
# tests/test_client.py::test_messages_mode_emits_token_deltas
agent.stream.return_value = iter([
("messages", (AIMessageChunk(content="Hel", id="ai-1"), {})),
("messages", (AIMessageChunk(content="lo ", id="ai-1"), {})),
("messages", (AIMessageChunk(content="world!", id="ai-1"), {})),
("values", {"messages": [HumanMessage(...), AIMessage(content="Hello world!", id="ai-1")]}),
])
# ...
assert [e.data["content"] for e in ai_text_events] == ["Hel", "lo ", "world!"]
assert len(ai_text_events) == 3 # values snapshot must NOT re-synthesize
assert "messages" in agent.stream.call_args.kwargs["stream_mode"]
```
**为什么这比"抽一个共享常量"更有效**:共享常量只能保证"用它的人写对字符串",但新增消费者的人可能根本不知道常量在哪。行为断言强制任何改动都要穿过**实际执行路径**,改回 `["values", "custom"]` 会立刻让 `assert "messages" in ...` 失败。
### 活体信号BPE 子词边界
回归的最终验证是让真实 LLM 数 1-15然后看是否能在输出里看到 tokenizer 的子词切分:
```
[5.460s] 'ele' / 'ven' eleven 被拆成两个 token
[5.508s] 'tw' / 'elve' twelve 拆两个
[5.568s] 'th' / 'irteen' thirteen 拆两个
[5.623s] 'four'/ 'teen' fourteen 拆两个
[5.677s] 'f' / 'if' / 'teen' fifteen 拆三个
```
子词切分是 tokenizer 的外部事实,**无法伪造**。能看到它就说明数据流**逐 chunk** 地穿过了整条管道,没有被任何中间层缓冲成整段。这种"活体信号"在流式系统里是比单元测试更高置信度的证据。
---
## 相关源码定位
| 关心什么 | 看这里 |
|---|---|
| DeerFlowClient 嵌入式流 | `packages/harness/deerflow/client.py::DeerFlowClient.stream` |
| `chat()` 的 delta 累加器 | `packages/harness/deerflow/client.py::DeerFlowClient.chat` |
| Gateway async 流 | `packages/harness/deerflow/runtime/runs/worker.py::run_agent` |
| HTTP SSE 帧输出 | `app/gateway/services.py::sse_consumer` / `format_sse` |
| 序列化到 wire 格式 | `packages/harness/deerflow/runtime/serialization.py` |
| LangGraph mode 命名翻译 | `packages/harness/deerflow/runtime/runs/worker.py:117-121` |
| 飞书渠道的增量卡片更新 | `app/channels/manager.py::_handle_streaming_chat` |
| Channels 自带的 delta/cumulative 防御性累加 | `app/channels/manager.py::_merge_stream_text` |
| Frontend useStream 支持的 mode 集合 | `frontend/src/core/api/stream-mode.ts` |
| 核心回归测试 | `backend/tests/test_client.py::TestStream::test_messages_mode_emits_token_deltas` |

View File

@ -25,7 +25,7 @@ import uuid
from collections.abc import Generator, Sequence
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
from typing import Any, Literal
from langchain.agents import create_agent
from langchain.agents.middleware import AgentMiddleware
@ -55,6 +55,9 @@ from deerflow.uploads.manager import (
logger = logging.getLogger(__name__)
StreamEventType = Literal["values", "messages-tuple", "custom", "end"]
@dataclass
class StreamEvent:
"""A single event from the streaming agent response.
@ -69,7 +72,7 @@ class StreamEvent:
data: Event payload. Contents vary by type.
"""
type: str
type: StreamEventType
data: dict[str, Any] = field(default_factory=dict)
@ -254,13 +257,53 @@ class DeerFlowClient:
return get_available_tools(model_name=model_name, subagent_enabled=subagent_enabled)
@staticmethod
def _serialize_tool_calls(tool_calls) -> list[dict]:
"""Reshape LangChain tool_calls into the wire format used in events."""
return [{"name": tc["name"], "args": tc["args"], "id": tc.get("id")} for tc in tool_calls]
@staticmethod
def _ai_text_event(msg_id: str | None, text: str, usage: dict | None) -> "StreamEvent":
"""Build a ``messages-tuple`` AI text event, attaching usage when present."""
data: dict[str, Any] = {"type": "ai", "content": text, "id": msg_id}
if usage:
data["usage_metadata"] = usage
return StreamEvent(type="messages-tuple", data=data)
@staticmethod
def _ai_tool_calls_event(msg_id: str | None, tool_calls) -> "StreamEvent":
"""Build a ``messages-tuple`` AI tool-calls event."""
return StreamEvent(
type="messages-tuple",
data={
"type": "ai",
"content": "",
"id": msg_id,
"tool_calls": DeerFlowClient._serialize_tool_calls(tool_calls),
},
)
@staticmethod
def _tool_message_event(msg: ToolMessage) -> "StreamEvent":
"""Build a ``messages-tuple`` tool-result event from a ToolMessage."""
return StreamEvent(
type="messages-tuple",
data={
"type": "tool",
"content": DeerFlowClient._extract_text(msg.content),
"name": msg.name,
"tool_call_id": msg.tool_call_id,
"id": msg.id,
},
)
@staticmethod
def _serialize_message(msg) -> dict:
"""Serialize a LangChain message to a plain dict for values events."""
if isinstance(msg, AIMessage):
d: dict[str, Any] = {"type": "ai", "content": msg.content, "id": getattr(msg, "id", None)}
if msg.tool_calls:
d["tool_calls"] = [{"name": tc["name"], "args": tc["args"], "id": tc.get("id")} for tc in msg.tool_calls]
d["tool_calls"] = DeerFlowClient._serialize_tool_calls(msg.tool_calls)
if getattr(msg, "usage_metadata", None):
d["usage_metadata"] = msg.usage_metadata
return d
@ -438,6 +481,53 @@ class DeerFlowClient:
consumers can switch between HTTP streaming and embedded mode
without changing their event-handling logic.
Token-level streaming
~~~~~~~~~~~~~~~~~~~~~
This method subscribes to LangGraph's ``messages`` stream mode, so
``messages-tuple`` events for AI text are emitted as **deltas** as
the model generates tokens, not as one cumulative dump at node
completion. Each delta carries a stable ``id`` consumers that
want the full text must accumulate ``content`` per ``id``.
``chat()`` already does this for you.
Tool calls and tool results are still emitted once per logical
message. ``values`` events continue to carry full state snapshots
after each graph node finishes; AI text already delivered via the
``messages`` stream is **not** re-synthesized from the snapshot to
avoid duplicate deliveries.
Why not reuse Gateway's ``run_agent``?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gateway (``runtime/runs/worker.py``) has a complete streaming
pipeline: ``run_agent`` ``StreamBridge`` ``sse_consumer``. It
looks like this client duplicates that work, but the two paths
serve different audiences and **cannot** share execution:
* ``run_agent`` is ``async def`` and uses ``agent.astream()``;
this method is a sync generator using ``agent.stream()`` so
callers can write ``for event in client.stream(...)`` without
touching asyncio. Bridging the two would require spinning up
an event loop + thread per call.
* Gateway events are JSON-serialized by ``serialize()`` for SSE
wire transmission. This client yields in-process stream event
payloads directly as Python data structures (``StreamEvent``
with ``data`` as a plain ``dict``), without the extra
JSON/SSE serialization layer used for HTTP delivery.
* ``StreamBridge`` is an asyncio-queue decoupling producers from
consumers across an HTTP boundary (``Last-Event-ID`` replay,
heartbeats, multi-subscriber fan-out). A single in-process
caller with a direct iterator needs none of that.
So ``DeerFlowClient.stream()`` is a parallel, sync, in-process
consumer of the same ``create_agent()`` factory not a wrapper
around Gateway. The two paths **should** stay in sync on which
LangGraph stream modes they subscribe to; that invariant is
enforced by ``tests/test_client.py::test_messages_mode_emits_token_deltas``
rather than by a shared constant, because the three layers
(Graph, Platform SDK, HTTP) each use their own naming
(``messages`` vs ``messages-tuple``) and cannot literally share
a string.
Args:
message: User message text.
thread_id: Thread ID for conversation context. Auto-generated if None.
@ -448,8 +538,8 @@ class DeerFlowClient:
StreamEvent with one of:
- type="values" data={"title": str|None, "messages": [...], "artifacts": [...]}
- type="custom" data={...}
- type="messages-tuple" data={"type": "ai", "content": str, "id": str}
- type="messages-tuple" data={"type": "ai", "content": str, "id": str, "usage_metadata": {...}}
- type="messages-tuple" data={"type": "ai", "content": <delta>, "id": str}
- type="messages-tuple" data={"type": "ai", "content": <delta>, "id": str, "usage_metadata": {...}}
- type="messages-tuple" data={"type": "ai", "content": "", "id": str, "tool_calls": [...]}
- type="messages-tuple" data={"type": "tool", "content": str, "name": str, "tool_call_id": str, "id": str}
- type="end" data={"usage": {"input_tokens": int, "output_tokens": int, "total_tokens": int}}
@ -466,13 +556,47 @@ class DeerFlowClient:
context["agent_name"] = self._agent_name
seen_ids: set[str] = set()
# Cross-mode handoff: ids already streamed via LangGraph ``messages``
# mode so the ``values`` path skips re-synthesis of the same message.
streamed_ids: set[str] = set()
# The same message id carries identical cumulative ``usage_metadata``
# in both the final ``messages`` chunk and the values snapshot —
# count it only on whichever arrives first.
counted_usage_ids: set[str] = set()
cumulative_usage: dict[str, int] = {"input_tokens": 0, "output_tokens": 0, "total_tokens": 0}
def _account_usage(msg_id: str | None, usage: Any) -> dict | None:
"""Add *usage* to cumulative totals if this id has not been counted.
``usage`` is a ``langchain_core.messages.UsageMetadata`` TypedDict
or ``None``; typed as ``Any`` because TypedDicts are not
structurally assignable to plain ``dict`` under strict type
checking. Returns the normalized usage dict (for attaching
to an event) when we accepted it, otherwise ``None``.
"""
if not usage:
return None
if msg_id and msg_id in counted_usage_ids:
return None
if msg_id:
counted_usage_ids.add(msg_id)
input_tokens = usage.get("input_tokens", 0) or 0
output_tokens = usage.get("output_tokens", 0) or 0
total_tokens = usage.get("total_tokens", 0) or 0
cumulative_usage["input_tokens"] += input_tokens
cumulative_usage["output_tokens"] += output_tokens
cumulative_usage["total_tokens"] += total_tokens
return {
"input_tokens": input_tokens,
"output_tokens": output_tokens,
"total_tokens": total_tokens,
}
for item in self._agent.stream(
state,
config=config,
context=context,
stream_mode=["values", "custom"],
stream_mode=["values", "messages", "custom"],
):
if isinstance(item, tuple) and len(item) == 2:
mode, chunk = item
@ -484,6 +608,36 @@ class DeerFlowClient:
yield StreamEvent(type="custom", data=chunk)
continue
if mode == "messages":
# LangGraph ``messages`` mode emits ``(message_chunk, metadata)``.
if isinstance(chunk, tuple) and len(chunk) == 2:
msg_chunk, _metadata = chunk
else:
msg_chunk = chunk
msg_id = getattr(msg_chunk, "id", None)
if isinstance(msg_chunk, AIMessage):
text = self._extract_text(msg_chunk.content)
counted_usage = _account_usage(msg_id, msg_chunk.usage_metadata)
if text:
if msg_id:
streamed_ids.add(msg_id)
yield self._ai_text_event(msg_id, text, counted_usage)
if msg_chunk.tool_calls:
if msg_id:
streamed_ids.add(msg_id)
yield self._ai_tool_calls_event(msg_id, msg_chunk.tool_calls)
elif isinstance(msg_chunk, ToolMessage):
if msg_id:
streamed_ids.add(msg_id)
yield self._tool_message_event(msg_chunk)
continue
# mode == "values"
messages = chunk.get("messages", [])
for msg in messages:
@ -493,47 +647,25 @@ class DeerFlowClient:
if msg_id:
seen_ids.add(msg_id)
# Already streamed via ``messages`` mode; only (defensively)
# capture usage here and skip re-synthesizing the event.
if msg_id and msg_id in streamed_ids:
if isinstance(msg, AIMessage):
_account_usage(msg_id, getattr(msg, "usage_metadata", None))
continue
if isinstance(msg, AIMessage):
# Track token usage from AI messages
usage = getattr(msg, "usage_metadata", None)
if usage:
cumulative_usage["input_tokens"] += usage.get("input_tokens", 0) or 0
cumulative_usage["output_tokens"] += usage.get("output_tokens", 0) or 0
cumulative_usage["total_tokens"] += usage.get("total_tokens", 0) or 0
counted_usage = _account_usage(msg_id, msg.usage_metadata)
if msg.tool_calls:
yield StreamEvent(
type="messages-tuple",
data={
"type": "ai",
"content": "",
"id": msg_id,
"tool_calls": [{"name": tc["name"], "args": tc["args"], "id": tc.get("id")} for tc in msg.tool_calls],
},
)
yield self._ai_tool_calls_event(msg_id, msg.tool_calls)
text = self._extract_text(msg.content)
if text:
event_data: dict[str, Any] = {"type": "ai", "content": text, "id": msg_id}
if usage:
event_data["usage_metadata"] = {
"input_tokens": usage.get("input_tokens", 0) or 0,
"output_tokens": usage.get("output_tokens", 0) or 0,
"total_tokens": usage.get("total_tokens", 0) or 0,
}
yield StreamEvent(type="messages-tuple", data=event_data)
yield self._ai_text_event(msg_id, text, counted_usage)
elif isinstance(msg, ToolMessage):
yield StreamEvent(
type="messages-tuple",
data={
"type": "tool",
"content": self._extract_text(msg.content),
"name": getattr(msg, "name", None),
"tool_call_id": getattr(msg, "tool_call_id", None),
"id": msg_id,
},
)
yield self._tool_message_event(msg)
# Emit a values event for each state snapshot
yield StreamEvent(
@ -550,10 +682,12 @@ class DeerFlowClient:
def chat(self, message: str, *, thread_id: str | None = None, **kwargs) -> str:
"""Send a message and return the final text response.
Convenience wrapper around :meth:`stream` that returns only the
**last** AI text from ``messages-tuple`` events. If the agent emits
multiple text segments in one turn, intermediate segments are
discarded. Use :meth:`stream` directly to capture all events.
Convenience wrapper around :meth:`stream` that accumulates delta
``messages-tuple`` events per ``id`` and returns the text of the
**last** AI message to complete. Intermediate AI messages (e.g.
planner drafts) are discarded only the final id's accumulated
text is returned. Use :meth:`stream` directly if you need every
delta as it arrives.
Args:
message: User message text.
@ -561,15 +695,21 @@ class DeerFlowClient:
**kwargs: Override client defaults (same as stream()).
Returns:
The last AI message text, or empty string if no response.
The accumulated text of the last AI message, or empty string
if no AI text was produced.
"""
last_text = ""
# Per-id delta lists joined once at the end — avoids the O(n²) cost
# of repeated ``str + str`` on a growing buffer for long responses.
chunks: dict[str, list[str]] = {}
last_id: str = ""
for event in self.stream(message, thread_id=thread_id, **kwargs):
if event.type == "messages-tuple" and event.data.get("type") == "ai":
content = event.data.get("content", "")
if content:
last_text = content
return last_text
msg_id = event.data.get("id") or ""
delta = event.data.get("content", "")
if delta:
chunks.setdefault(msg_id, []).append(delta)
last_id = msg_id
return "".join(chunks.get(last_id, ()))
# ------------------------------------------------------------------
# Public API — configuration queries

View File

@ -10,7 +10,7 @@ from pathlib import Path
from unittest.mock import MagicMock, patch
import pytest
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage, ToolMessage # noqa: F401
from langchain_core.messages import AIMessage, AIMessageChunk, HumanMessage, SystemMessage, ToolMessage # noqa: F401
from app.gateway.routers.mcp import McpConfigResponse
from app.gateway.routers.memory import MemoryConfigResponse, MemoryStatusResponse
@ -225,7 +225,9 @@ class TestStream:
agent.stream.assert_called_once()
call_kwargs = agent.stream.call_args.kwargs
assert call_kwargs["stream_mode"] == ["values", "custom"]
# ``messages`` enables token-level streaming of AI text deltas;
# see DeerFlowClient.stream() docstring and GitHub issue #1969.
assert call_kwargs["stream_mode"] == ["values", "messages", "custom"]
assert events[0].type == "custom"
assert events[0].data == {"type": "task_started", "task_id": "task-1"}
@ -351,6 +353,123 @@ class TestStream:
# Should not raise; end event proves it completed
assert events[-1].type == "end"
def test_messages_mode_emits_token_deltas(self, client):
"""stream() forwards LangGraph ``messages`` mode chunks as delta events.
Regression for bytedance/deer-flow#1969 — before the fix the client
only subscribed to ``values`` mode, so LLM output was delivered as
a single cumulative dump after each graph node finished instead of
token-by-token deltas as the model generated them.
"""
# Three AI chunks sharing the same id, followed by a terminal
# values snapshot with the fully assembled message — this matches
# the shape LangGraph emits when ``stream_mode`` includes both
# ``messages`` and ``values``.
assembled = AIMessage(content="Hel lo world!", id="ai-1", usage_metadata={"input_tokens": 3, "output_tokens": 4, "total_tokens": 7})
agent = MagicMock()
agent.stream.return_value = iter(
[
("messages", (AIMessageChunk(content="Hel", id="ai-1"), {})),
("messages", (AIMessageChunk(content=" lo ", id="ai-1"), {})),
(
"messages",
(
AIMessageChunk(
content="world!",
id="ai-1",
usage_metadata={"input_tokens": 3, "output_tokens": 4, "total_tokens": 7},
),
{},
),
),
("values", {"messages": [HumanMessage(content="hi", id="h-1"), assembled]}),
]
)
with (
patch.object(client, "_ensure_agent"),
patch.object(client, "_agent", agent),
):
events = list(client.stream("hi", thread_id="t-stream"))
# Three delta messages-tuple events, all with the same id, each
# carrying only its own delta (not cumulative).
ai_text_events = [e for e in events if e.type == "messages-tuple" and e.data.get("type") == "ai" and e.data.get("content")]
assert [e.data["content"] for e in ai_text_events] == ["Hel", " lo ", "world!"]
assert all(e.data["id"] == "ai-1" for e in ai_text_events)
# The values snapshot MUST NOT re-synthesize an AI text event for
# the already-streamed id (otherwise consumers see duplicated text).
assert len(ai_text_events) == 3
# Usage metadata attached only to the chunk that actually carried
# it, and counted into cumulative usage exactly once (the values
# snapshot's duplicate usage on the assembled AIMessage must not
# be double-counted).
events_with_usage = [e for e in ai_text_events if "usage_metadata" in e.data]
assert len(events_with_usage) == 1
assert events_with_usage[0].data["usage_metadata"] == {"input_tokens": 3, "output_tokens": 4, "total_tokens": 7}
end_event = events[-1]
assert end_event.type == "end"
assert end_event.data["usage"] == {"input_tokens": 3, "output_tokens": 4, "total_tokens": 7}
# The values snapshot itself is still emitted.
assert any(e.type == "values" for e in events)
# stream_mode includes ``messages`` — the whole point of this fix.
call_kwargs = agent.stream.call_args.kwargs
assert "messages" in call_kwargs["stream_mode"]
def test_chat_accumulates_streamed_deltas(self, client):
"""chat() concatenates per-id deltas from messages mode."""
agent = MagicMock()
agent.stream.return_value = iter(
[
("messages", (AIMessageChunk(content="Hel", id="ai-1"), {})),
("messages", (AIMessageChunk(content="lo ", id="ai-1"), {})),
("messages", (AIMessageChunk(content="world!", id="ai-1"), {})),
("values", {"messages": [HumanMessage(content="hi", id="h-1"), AIMessage(content="Hello world!", id="ai-1")]}),
]
)
with (
patch.object(client, "_ensure_agent"),
patch.object(client, "_agent", agent),
):
result = client.chat("hi", thread_id="t-chat-stream")
assert result == "Hello world!"
def test_messages_mode_tool_message(self, client):
"""stream() forwards ToolMessage chunks from messages mode."""
agent = MagicMock()
agent.stream.return_value = iter(
[
(
"messages",
(
ToolMessage(content="file.txt", id="tm-1", tool_call_id="tc-1", name="bash"),
{},
),
),
("values", {"messages": [HumanMessage(content="ls", id="h-1"), ToolMessage(content="file.txt", id="tm-1", tool_call_id="tc-1", name="bash")]}),
]
)
with (
patch.object(client, "_ensure_agent"),
patch.object(client, "_agent", agent),
):
events = list(client.stream("ls", thread_id="t-tool-stream"))
tool_events = [e for e in events if e.type == "messages-tuple" and e.data.get("type") == "tool"]
# The tool result must be delivered exactly once (from messages
# mode), not duplicated by the values-snapshot synthesis path.
assert len(tool_events) == 1
assert tool_events[0].data["content"] == "file.txt"
assert tool_events[0].data["name"] == "bash"
assert tool_events[0].data["tool_call_id"] == "tc-1"
def test_list_content_blocks(self, client):
"""stream() handles AIMessage with list-of-blocks content."""
ai = AIMessage(
@ -373,6 +492,253 @@ class TestStream:
assert len(msg_events) == 1
assert msg_events[0].data["content"] == "result"
# ------------------------------------------------------------------
# Refactor regression guards (PR #1974 follow-up safety)
#
# The three tests below are not bug-fix tests — they exist to lock
# the *exact* contract of stream() so a future refactor (e.g. moving
# to ``agent.astream()``, sharing a core with Gateway's run_agent,
# changing the dedup strategy) cannot silently change behavior.
# ------------------------------------------------------------------
def test_dedup_requires_messages_before_values_invariant(self, client):
"""Canary: locks the order-dependence of cross-mode dedup.
``streamed_ids`` is populated only by the ``messages`` branch.
If a ``values`` snapshot arrives BEFORE its corresponding
``messages`` chunks for the same id, the values path falls
through and synthesizes its own AI text event, then the
messages chunk emits another delta consumers see the same
id twice.
Under normal LangGraph operation this never happens (messages
chunks are emitted during LLM streaming, the values snapshot
after the node completes), so the implicit invariant is safe
in production. This test exists as a tripwire for refactors
that switch to ``agent.astream()`` or share a core with
Gateway: if the ordering ever changes, this test fails and
forces the refactor to either (a) preserve the ordering or
(b) deliberately re-baseline to a stronger order-independent
dedup contract and document the new contract here.
"""
agent = MagicMock()
agent.stream.return_value = iter(
[
# values arrives FIRST — streamed_ids still empty.
("values", {"messages": [HumanMessage(content="hi", id="h-1"), AIMessage(content="Hello", id="ai-1")]}),
# messages chunk for the same id arrives SECOND.
("messages", (AIMessageChunk(content="Hello", id="ai-1"), {})),
]
)
with (
patch.object(client, "_ensure_agent"),
patch.object(client, "_agent", agent),
):
events = list(client.stream("hi", thread_id="t-order-canary"))
ai_text_events = [e for e in events if e.type == "messages-tuple" and e.data.get("type") == "ai" and e.data.get("content")]
# Current behavior: 2 events (values synthesis + messages delta).
# If a refactor makes dedup order-independent, this becomes 1 —
# update the assertion AND the docstring above to record the
# new contract, do not silently fix this number.
assert len(ai_text_events) == 2
assert all(e.data["id"] == "ai-1" for e in ai_text_events)
assert [e.data["content"] for e in ai_text_events] == ["Hello", "Hello"]
def test_messages_mode_golden_event_sequence(self, client):
"""Locks the **exact** event sequence for a canonical streaming turn.
This is a strong regression guard: any future refactor that
changes the order, type, or shape of emitted events fails this
test with a clear list-equality diff, forcing either a
preserved sequence or a deliberate re-baseline.
Input shape:
messages chunk 1 text "Hel", no usage
messages chunk 2 text "lo", with cumulative usage
values snapshot assembled AIMessage with same usage
Locked behavior:
* Two messages-tuple AI text events (one per chunk), each
carrying ONLY its own delta not cumulative.
* ``usage_metadata`` attached only to the chunk that
delivered it (not the first chunk).
* The values event is still emitted, but its embedded
``messages`` list is the *serialized* form no
synthesized messages-tuple events for the already-
streamed id.
* ``end`` event carries cumulative usage counted exactly
once across both modes.
"""
# Inline the usage literal at construction sites so Pyright can
# narrow ``dict[str, int]`` to ``UsageMetadata`` (TypedDict
# narrowing only works on literals, not on bound variables).
# The local ``usage`` is reused only for assertion comparisons
# below, where structural dict equality is sufficient.
usage = {"input_tokens": 3, "output_tokens": 2, "total_tokens": 5}
agent = MagicMock()
agent.stream.return_value = iter(
[
("messages", (AIMessageChunk(content="Hel", id="ai-1"), {})),
("messages", (AIMessageChunk(content="lo", id="ai-1", usage_metadata={"input_tokens": 3, "output_tokens": 2, "total_tokens": 5}), {})),
(
"values",
{
"messages": [
HumanMessage(content="hi", id="h-1"),
AIMessage(content="Hello", id="ai-1", usage_metadata={"input_tokens": 3, "output_tokens": 2, "total_tokens": 5}),
]
},
),
]
)
with (
patch.object(client, "_ensure_agent"),
patch.object(client, "_agent", agent),
):
events = list(client.stream("hi", thread_id="t-golden"))
actual = [(e.type, e.data) for e in events]
expected = [
("messages-tuple", {"type": "ai", "content": "Hel", "id": "ai-1"}),
("messages-tuple", {"type": "ai", "content": "lo", "id": "ai-1", "usage_metadata": usage}),
(
"values",
{
"title": None,
"messages": [
{"type": "human", "content": "hi", "id": "h-1"},
{"type": "ai", "content": "Hello", "id": "ai-1", "usage_metadata": usage},
],
"artifacts": [],
},
),
("end", {"usage": usage}),
]
assert actual == expected
def test_chat_accumulates_in_linear_time(self, client):
"""``chat()`` must use a non-quadratic accumulation strategy.
PR #1974 commit 2 replaced ``buffer = buffer + delta`` with
``list[str].append`` + ``"".join`` to fix an O() regression
introduced in commit 1. This test guards against a future
refactor accidentally restoring the quadratic path.
Threshold rationale (10,000 single-char chunks, 1 second):
* Current O(n) implementation: ~50-200 ms total, including
all mock + event yield overhead.
* O() regression at n=10,000: chat accumulation alone
becomes ~500 ms-2 s (50 M character copies), reliably
over the bound on any reasonable CI.
If this test ever flakes on slow CI, do NOT raise the threshold
blindly first confirm the implementation still uses
``"".join``, then consider whether the test should move to a
benchmark suite that excludes mock overhead.
"""
import time
n = 10_000
chunks: list = [("messages", (AIMessageChunk(content="x", id="ai-1"), {})) for _ in range(n)]
chunks.append(
(
"values",
{
"messages": [
HumanMessage(content="go", id="h-1"),
AIMessage(content="x" * n, id="ai-1"),
]
},
)
)
agent = MagicMock()
agent.stream.return_value = iter(chunks)
with (
patch.object(client, "_ensure_agent"),
patch.object(client, "_agent", agent),
):
start = time.monotonic()
result = client.chat("go", thread_id="t-perf")
elapsed = time.monotonic() - start
assert result == "x" * n
assert elapsed < 1.0, f"chat() took {elapsed:.3f}s for {n} chunks — possible O(n^2) regression (see PR #1974 commit 2 for the original fix)"
def test_none_id_chunks_produce_duplicates_known_limitation(self, client):
"""Documents a known dedup limitation: ``messages`` chunks with ``id=None``.
Some LLM providers (vLLM, certain custom backends) emit
``AIMessageChunk`` instances without an ``id``. In that case
the cross-mode dedup machinery cannot record the chunk in
``streamed_ids`` (the implementation guards on ``if msg_id``
before adding), and a subsequent ``values`` snapshot whose
reassembled ``AIMessage`` carries a real id will fall through
the dedup check and synthesize a second AI text event for the
same logical message consumers see duplicated text.
Why this is documented rather than fixed
----------------------------------------
Falling back to ``metadata.get("id")`` does **not** help:
LangGraph's messages-mode metadata never carries the message
id (it carries ``langgraph_node`` / ``langgraph_step`` /
``checkpoint_ns`` / ``tags`` etc.). Synthesizing a fallback
like ``f"_synth_{id(msg_chunk)}"`` only helps if the values
snapshot uses the same fallback, which it does not. A real
fix requires either provider cooperation (always emit chunk
ids out of scope for this PR) or content-based dedup (risks
false positives for two distinct short messages with identical
text).
This test makes the limitation **explicit and discoverable**
so a future contributor debugging "duplicate text in vLLM
streaming" finds the answer immediately. If a real fix lands,
replace this test with a positive assertion that dedup works
for the None-id case.
See PR #1974 Copilot review comment on ``client.py:515``.
"""
agent = MagicMock()
agent.stream.return_value = iter(
[
# Realistic shape: chunk has no id (provider didn't set one),
# values snapshot's reassembled AIMessage has a fresh id
# assigned somewhere downstream (langgraph or middleware).
("messages", (AIMessageChunk(content="Hello", id=None), {})),
(
"values",
{
"messages": [
HumanMessage(content="hi", id="h-1"),
AIMessage(content="Hello", id="ai-1"),
]
},
),
]
)
with (
patch.object(client, "_ensure_agent"),
patch.object(client, "_agent", agent),
):
events = list(client.stream("hi", thread_id="t-none-id-limitation"))
ai_text_events = [e for e in events if e.type == "messages-tuple" and e.data.get("type") == "ai" and e.data.get("content")]
# KNOWN LIMITATION: 2 events for the same logical message.
# 1) from messages chunk (id=None, NOT added to streamed_ids
# because of ``if msg_id:`` guard at client.py line ~522)
# 2) from values-snapshot synthesis (ai-1 not in streamed_ids,
# so the skip-branch at line ~549 doesn't trigger)
# If this becomes 1, someone fixed the limitation — update this
# test to a positive assertion and document the fix.
assert len(ai_text_events) == 2
assert ai_text_events[0].data["id"] is None
assert ai_text_events[1].data["id"] == "ai-1"
assert all(e.data["content"] == "Hello" for e in ai_text_events)
class TestChat:
def test_returns_last_message(self, client):