mirror of
https://github.com/bytedance/deer-flow.git
synced 2026-05-23 17:13:44 +00:00
6 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
c810e9f809
|
fix(harness)!: hydrate runs from RunStore and persist interrupted status (#2932)
* fix(harness): hydrate run history from RunStore and persist cancellation status
fix:
- Make RunManager.get() async and hydrate from RunStore when in-memory record is missing
- Merge store rows into list_by_thread() with in-memory precedence for active runs
- Persist interrupted status to RunStore in cancel() and create_or_reject(interrupt|rollback)
- Extract _persist_status() to reuse the best-effort store update pattern
- Await run_mgr.get() in all gateway endpoints
- Return 409 with distinct message for store-only runs not active on current worker
Closes #2812, Closes #2813
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(harness): consistent sort and guarded hydration in RunManager
fix:
- list_by_thread() now sorts by created_at desc (newest first) even when no RunStore is configured, matching the store-backed code path
- guard _record_from_store() call sites in get() and list_by_thread() with best-effort error handling so a single malformed store row cannot turn read paths into 500s
test:
- update test_list_by_thread assertion to expect newest-first order
- seed MemoryRunStore via public put() API instead of writing to _runs
* fix(harness): guard store-only runs from streaming and fix get() TOCTOU
Add RunRecord.store_only flag set by _record_from_store so callers can
distinguish hydrated history from live in-memory runs. join_run and
stream_existing_run (action=None) now return 409 instead of hanging
forever on an empty MemoryStreamBridge channel.
Re-check _runs under lock after the store await in RunManager.get() so a
concurrent create() that lands between the two checks returns the
authoritative in-memory record rather than a stale store-hydrated copy.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* fix(harness): reorder bridge fetch in join_run and make list_by_thread limit explicit
Move get_stream_bridge() after the store_only guard in join_run so a
missing bridge cannot produce 503 for historical runs before the 409
guard fires.
Add limit parameter to RunManager.list_by_thread (default 100, matching
the store's page size) and pass it explicitly to the store call. Update
docstring to document the limit instead of claiming all runs are returned.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* fix(harness): cap list_by_thread result to limit after merge
Apply [:limit] to all return paths in list_by_thread so the method
consistently returns at most limit records regardless of how many
in-memory runs exist, making the limit parameter a true upper bound on
the response size rather than just a store-query hint.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* fix `list_by_thread` docstring
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix(runtime): add update_model_name to RunStore to prevent SQL integrity errors
RunManager.update_model_name() was calling _persist_to_store() which uses
RunStore.put(), but RunRepository.put() is insert-only. This caused
integrity errors when updating model_name for existing runs in SQL-backed
stores.
fix:
- Add abstract update_model_name method to RunStore base class
- Implement update_model_name in MemoryRunStore
- Implement update_model_name in RunRepository with proper normalization
- Add _persist_model_name helper in RunManager
- Update RunManager.update_model_name to use the new method
test:
- Add tests for update_model_name functionality
- Add integration tests for RunManager with SQL-backed store
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(runtime): handle NULL status/on_disconnect in _record_from_store
`dict.get(key, default)` only uses the default when the key is absent, so
a SQL row with an explicit NULL status would pass `None` to
`RunStatus(None)` and raise, breaking hydration for otherwise valid rows.
Switch to `row.get(...) or fallback` so both missing and NULL values get
a safe default.
Add tests for get() and list_by_thread() with a NULL status row to
prevent regression.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
* fix(runs): address PR review feedback on store consistency changes
- Fix list_by_thread limit semantics: pass store_limit = max(0, limit - len(memory_records)) to store so newer store records are not crowded out by in-memory records
- Remove dead code: cancelled guard after raise is always True, simplify to if wait and record.task
- Document _record_from_store NULL fallback policy (status→pending, on_disconnect→cancel) in docstring
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
|
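The NULL-handling fix described in this commit can be illustrated with a small standalone sketch. It is not the project's `_record_from_store`; the `RunStatus` members and the two helper functions below are hypothetical stand-ins, and only the `pending` fallback is taken from the commit message.

```python
from enum import Enum


class RunStatus(str, Enum):
    # Hypothetical members; the real enum lives in the deer-flow runtime.
    pending = "pending"
    running = "running"
    interrupted = "interrupted"


def status_with_get_default(row: dict) -> RunStatus:
    # dict.get only applies the default when the key is ABSENT.
    # A row holding an explicit None (SQL NULL) still yields None here,
    # and RunStatus(None) raises ValueError.
    return RunStatus(row.get("status", "pending"))


def status_with_or_fallback(row: dict) -> RunStatus:
    # `or` replaces any falsy value (missing key, None, empty string)
    # with the fallback before the enum lookup happens.
    return RunStatus(row.get("status") or "pending")


null_row = {"status": None}   # what a SQL row with NULL status looks like
missing_row = {}              # key not present at all

safe_null = status_with_or_fallback(null_row)
safe_missing = status_with_or_fallback(missing_row)
```

One side effect worth keeping in mind: `or` also swallows other falsy values such as an empty string, which is usually acceptable for a status column but would be wrong for a field where `0` or `""` is meaningful.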
||
|
|
39f901d3a5
|
fix(runs): restore historical runs from persistent store after gateway restart (#2989)
* fix(runs): restore historical runs from persistent store after gateway restart
RunManager.list_by_thread() and get() only queried the in-memory _runs
dict, returning empty results after a restart even when PostgreSQL had
the records. Add store fallback to both read paths and a new async
aget() for the API endpoint, keeping sync get() for internal callers
that need live task/abort_event state.
Fixes #2984
* Apply suggestions from code review
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix(runs): scope run store fallback reads by user id
Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/e73daada-1215-4bc1-ab7d-7117826c5013
Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>
* test(runs): clarify ordering expectation and mock store filters
Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/e73daada-1215-4bc1-ab7d-7117826c5013
Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>
* test(runs): make user filter fallback assertions explicit
Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/e73daada-1215-4bc1-ab7d-7117826c5013
Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>
* test(runs): verify user-isolated fallback behavior with memory store
Agent-Logs-Url: https://github.com/bytedance/deer-flow/sessions/e73daada-1215-4bc1-ab7d-7117826c5013
Co-authored-by: WillemJiang <219644+WillemJiang@users.noreply.github.com>
* update the code with feedback from issue-2984
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
|
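A rough sketch of the store fallback described in this commit, assuming a much simpler data model than the real RunManager. `Run`, `FakeRunStore`, and `list_runs` are hypothetical names, and the newest-first ordering and in-memory precedence are assumptions borrowed from the related commit messages rather than from this change's code.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Run:
    run_id: str
    thread_id: str
    user_id: str
    created_at: float


class FakeRunStore:
    """Hypothetical stand-in for a persistent run store."""

    def __init__(self, rows):
        self._rows = list(rows)

    async def list_by_thread(self, thread_id, user_id):
        # Scope the read by user id so one user's fallback query
        # never returns another user's runs.
        return [
            r for r in self._rows
            if r.thread_id == thread_id and r.user_id == user_id
        ]


async def list_runs(memory, store, thread_id, user_id):
    # Start from persisted rows, then let live in-memory records win.
    merged = {r.run_id: r for r in await store.list_by_thread(thread_id, user_id)}
    for r in memory.values():
        if r.thread_id == thread_id and r.user_id == user_id:
            merged[r.run_id] = r
    return sorted(merged.values(), key=lambda r: r.created_at, reverse=True)


store = FakeRunStore([
    Run("r1", "t1", "alice", 1.0),
    Run("r2", "t1", "alice", 2.0),
    Run("r3", "t1", "bob", 3.0),
])
# After a restart the in-memory dict is empty, so results come from the store.
after_restart = asyncio.run(list_runs({}, store, "t1", "alice"))
```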
||
|
|
de253e4a0a
|
feat(run): Propagates model_name from the gateway request through the runtime and persistence stack to the SQLite database. (#2775)
* feat(run): propagate model_name from gateway request context to persistence layer
Pass model_name through the full run creation pipeline: from
RunCreateRequest.context in the gateway, through RunManager, to the
RunStore interface and SQL persistence. This enables client-specified
model selection to be recorded per-run in the database.
* feat(run): add model allowlist validation and effective model name capture
- Validate model_name against allowlist in gateway services.py using get_app_config().get_model_config()
- Truncate model_name to 128 chars to match DB column constraint
- In worker.py, capture effective model name from agent.metadata after agent creation and persist if resolved differently than requested
* feat(run): add defense-in-depth model_name normalization and round-trip persistence tests
- Add _normalize_model_name() to RunRepository for whitespace stripping and 128-char truncation before DB writes.
- Add round-trip unit tests for model_name creation and default None in test_run_manager.py.
* fix(run): coerce non-string model_name values before strip/truncate in _normalize_model_name
* fix(gateway): add runtime type guard for model_name coercion in gateway services
Add isinstance check and str() coercion before calling .strip() to
prevent AttributeError when non-string types (int, None, etc.) flow
through the gateway. Paired with SQL integration test for end-to-end
model_name persistence across gateway → langgraph → persistence layer.
* fix(run): drop Alembic migration for model_name (no-op) and expose public update method on RunManager
- Drop a1b2c3d4e5f6 migration: model_name already exists in RunRow schema and is auto-created via Base.metadata.create_all() at startup
- Add update_model_name() public method to RunManager to replace the private _persist_to_store call in worker.py, preserving internal locking/persistence
|
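The normalization steps named in this commit (coerce non-strings, strip whitespace, truncate to 128 characters) might look roughly like the sketch below. It is not the repository's `_normalize_model_name`; the function name, the constant, and the choice to map `None` and blank strings to `None` are assumptions made for illustration.

```python
from typing import Any, Optional

# 128 matches the DB column constraint mentioned in the commit message.
MAX_MODEL_NAME_LEN = 128


def normalize_model_name(value: Any) -> Optional[str]:
    """Hypothetical helper: return a clean model name or None."""
    if value is None:
        return None
    if not isinstance(value, str):
        # Coerce before .strip() so ints and other types cannot raise
        # AttributeError.
        value = str(value)
    value = value.strip()
    if not value:
        # Assumption: a blank name is treated the same as "not provided".
        return None
    return value[:MAX_MODEL_NAME_LEN]


cleaned = normalize_model_name("  my-model  ")
coerced = normalize_model_name(42)
truncated = normalize_model_name("x" * 200)
```

Stripping before truncating means leading whitespace never consumes part of the 128-character budget.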
||
|
|
db5ad86381 |
feat: enhance chat history loading with new hooks and UI components (#2338)
* Refactor API fetch calls to use a unified fetch function; enhance chat history loading with new hooks and UI components
- Replaced `fetchWithAuth` with a generic `fetch` function across various API modules for consistency.
- Updated `useThreadStream` and `useThreadHistory` hooks to manage chat history loading, including loading states and pagination.
- Introduced `LoadMoreHistoryIndicator` component for better user experience when loading more chat history.
- Enhanced message handling in `MessageList` to accommodate new loading states and history management.
- Added support for run messages in the thread context, improving the overall message handling logic.
- Updated translations for loading indicators in English and Chinese.
* Fix test assertions for run ordering in RunManager tests
- Updated assertions in `test_list_by_thread` to reflect correct ordering of runs.
- Modified `test_list_by_thread_is_stable_when_timestamps_tie` to ensure stable ordering when timestamps are tied.
|
||
|
|
aae59a8ba8
|
fix: surface configured sandbox mounts to agents (#1638)
* fix: surface configured sandbox mounts to agents
* fix: address PR review feedback
---------
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
|
||
|
|
34e835bc33
|
feat(gateway): implement LangGraph Platform API in Gateway, replace langgraph-cli (#1403)
* feat(gateway): implement LangGraph Platform API in Gateway, replace langgraph-cli
Implement all core LangGraph Platform API endpoints in the Gateway,
allowing it to fully replace the langgraph-cli dev server for local
development. This eliminates a heavyweight dependency and simplifies
the development stack.
Changes:
- Add runs lifecycle endpoints (create, stream, wait, cancel, join)
- Add threads CRUD and search endpoints
- Add assistants compatibility endpoints (search, get, graph, schemas)
- Add StreamBridge (in-memory pub/sub for SSE) and async provider
- Add RunManager with atomic create_or_reject (eliminates TOCTOU race)
- Add worker with interrupt/rollback cancel actions and runtime context injection
- Route /api/langgraph/* to Gateway in nginx config
- Skip langgraph-cli startup by default (SKIP_LANGGRAPH_SERVER=0 to restore)
- Add unit tests for RunManager, SSE format, and StreamBridge
* fix: drain bridge queue on client disconnect to prevent backpressure
When on_disconnect=continue, keep consuming events from the bridge
without yielding, so the worker is not blocked by a full queue.
Only on_disconnect=cancel breaks out immediately.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fix: remove pytest import
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fix: Fix default stream_mode to ["values", "messages-tuple"]
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fix: Remove unused if_exists field from ThreadCreateRequest
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fix: address review comments on gateway LangGraph API
- Mount runs.py router in app.py (missing include_router)
- Normalize interrupt_before/after "*" to node list before run_agent()
- Use entry.id for SSE event ID instead of counter
- Drain bridge queue on disconnect when on_disconnect=continue
- Reuse serialization helper in wait_run() for consistent wire format
- Reject unsupported multitask_strategy with 400
- Remove SKIP_LANGGRAPH_SERVER fallback, always use Gateway
* feat: extract app.state access into deps.py
Encapsulate read/write operations for singleton objects (RunManager,
StreamBridge, checkpointer) held in app.state into a shared utility,
reducing repeated access patterns across router modules.
* feat: extract deerflow.runtime.serialization module with tests
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: replace duplicated serialization with deerflow.runtime.serialization
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: extract app/gateway/services.py with run lifecycle logic
Create a service layer that centralizes SSE formatting, input/config
normalization, and run lifecycle management. Router modules will delegate
to these functions instead of using private cross-imported helpers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: wire routers to use services layer, remove cross-module private imports
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: apply ruff formatting to refactored files
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(runtime): support LangGraph dev server and add compat route
- Enable official LangGraph dev server for local development workflow
- Decouple runtime components from agents package for better separation
- Provide gateway-backed fallback route when dev server is skipped
- Simplify lifecycle management using context manager in gateway
* feat(runtime): add Store providers with auto-backend selection
- Add async_provider.py and provider.py under deerflow/runtime/store/
- Support memory, sqlite, postgres backends matching checkpointer config
- Integrate into FastAPI lifespan via AsyncExitStack in deps.py
- Replace hardcoded InMemoryStore with config-driven factory
* refactor(gateway): migrate thread management from checkpointer to Store and resolve multiple endpoint failures
- Add Store-backed CRUD helpers (_store_get, _store_put, _store_upsert)
- Replace checkpoint-scanning search with two-phase strategy:
phase 1 reads Store (O(threads)), phase 2 backfills from checkpointer
for legacy/LangGraph Server threads with lazy migration
- Extend Store record schema with values field for title persistence
- Sync thread title from checkpoint to Store after run completion
- Fix /threads/{id}/runs/{run_id}/stream 405 by accepting both
GET and POST methods; POST handles interrupt/rollback actions
- Fix /threads/{id}/state 500 by separating read_config and
write_config, adding checkpoint_ns to configurable, and
shallow-copying checkpoint/metadata before mutation
- Sync title to Store on state update for immediate search reflection
- Move _upsert_thread_in_store into services.py, remove duplicate logic
- Add _sync_thread_title_after_run: await run task, read final
checkpoint title, write back to Store record
- Spawn title sync as background task from start_run when Store exists
* refactor(runtime): deduplicate store and checkpointer provider logic
Extract _ensure_sqlite_parent_dir() helper into checkpointer/provider.py
and use it in all three places that previously inlined the same mkdir logic.
Consolidate duplicate error constants in store/async_provider.py by importing
from store/provider.py instead of redefining them.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(runtime): move SQLite helpers to runtime/store, checkpointer imports from store
_resolve_sqlite_conn_str and _ensure_sqlite_parent_dir now live in
runtime/store/provider.py. agents/checkpointer/provider and
agents/checkpointer/async_provider import from there, reversing the
previous dependency direction (store → checkpointer becomes
checkpointer → store).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(runtime): extract SQLite helpers into runtime/store/_sqlite_utils.py
Move resolve_sqlite_conn_str and ensure_sqlite_parent_dir out of
checkpointer/provider.py into a dedicated _sqlite_utils module.
Functions are now public (no underscore prefix), making cross-module
imports semantically correct. All four provider files import from
the single shared location.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(gateway): use adelete_thread to fully remove thread checkpoints on delete
AsyncSqliteSaver has no adelete method — the previous hasattr check
always evaluated to False, silently leaving all checkpoint rows in the
database. Switch to adelete_thread(thread_id) which deletes every
checkpoint and pending-write row for the thread across all namespaces
(including sub-graph checkpoints).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(gateway): remove dead bridge_cm/ckpt_cm code and fix StrEnum lint
app.py had unreachable code after the async-with lifespan refactor:
bridge_cm and ckpt_cm were referenced but never defined (F821), and
the channel service startup/shutdown was outside the langgraph_runtime
block so it never ran. Move channel service lifecycle inside the
async-with block where it belongs.
Replace str+Enum inheritance in RunStatus and DisconnectMode with
StrEnum as suggested by UP042.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* style: format with ruff
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: JeffJiang <for-eleven@hotmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
|
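The "drain bridge queue on client disconnect" fix in the commit above can be sketched with a bounded `asyncio.Queue`. This is an illustration of the idea only, not the Gateway's StreamBridge: `consume`, `demo`, the `END` sentinel, and the queue size are all hypothetical, and the sketch's `cancel` branch simply stops reading, whereas the real implementation presumably also cancels the run.

```python
import asyncio

END = object()  # hypothetical end-of-stream sentinel


async def consume(queue, is_connected, on_disconnect, sent):
    while True:
        item = await queue.get()
        if item is END:
            return
        if is_connected():
            sent.append(item)          # stands in for yielding an SSE event
        elif on_disconnect == "cancel":
            return                     # stop reading immediately
        # on_disconnect == "continue": drop the event but keep draining,
        # so the producer is never blocked by a full queue.


async def demo(on_disconnect):
    queue = asyncio.Queue(maxsize=2)
    sent = []

    async def worker():
        for i in range(5):
            await queue.put(i)         # blocks while the queue is full
        await queue.put(END)

    producer = asyncio.create_task(worker())
    # The client stays connected for exactly one event, then disconnects.
    await consume(queue, lambda: len(sent) < 1, on_disconnect, sent)
    try:
        await asyncio.wait_for(producer, timeout=0.1)
        finished = True
    except asyncio.TimeoutError:
        finished = False               # worker was stuck on a full queue
    return sent, finished


continue_result = asyncio.run(demo("continue"))
cancel_result = asyncio.run(demo("cancel"))
```

With `continue`, the worker runs to completion even though only the first event reached the client. With `cancel`, the reader stops and the worker stays blocked on the full queue until it is cancelled by the timeout.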