deerflow.runtime Design Overview

This document describes the current implementation of backend/packages/harness/deerflow/runtime, including its overall design, boundary model, the collaboration between runs and stream_bridge, how it interacts with external infrastructure and the app layer, and how actor_context is dynamically injected to provide user isolation.

1. Overall Role

deerflow.runtime is the runtime kernel layer of DeerFlow.

It sits below agents / tools / middlewares and above app / gateway / infra. Its purpose is to define runtime semantics and boundary contracts, without directly owning web endpoints, ORM models, or concrete infrastructure implementations.

Its public surface is re-exported from __init__.py and currently exposes four main capability areas:

  1. runs
    • Run domain types, execution facade, lifecycle observers, and store protocols
  2. stream_bridge
    • Stream event bridge contract and public stream types
  3. actor_context
    • Request/task-scoped actor context and user-isolation bridge
  4. serialization
    • Runtime serialization helpers for LangChain / LangGraph data and outward-facing events

Structurally, the current package looks like:

runtime
  ├─ runs
  │   ├─ facade / types / observer / store
  │   ├─ internal/*
  │   └─ callbacks/*
  ├─ stream_bridge
  │   ├─ contract
  │   └─ exceptions
  ├─ actor_context
  └─ serialization / converters

2. Overall Design and Constraint Model

2.1 Design Goal

The core goal of runtime is to decouple runtime control-plane semantics from infrastructure implementations.

It only cares about:

  1. What a run is and how run state changes over time
  2. What lifecycle events and stream events are produced during execution
  3. Which capabilities must be injected from the outside, such as checkpointer, event store, stream bridge, and durable stores
  4. Who the current actor is, and how lower layers can use that for isolation

It deliberately does not care about:

  1. Whether events are stored in memory, Redis, or another transport
  2. How run / thread / feedback data is persisted
  3. HTTP / SSE / FastAPI details
  4. How the auth plugin resolves the request user

2.2 Boundary Rules

The current package has a fairly clear boundary model:

  1. runs owns execution orchestration, not ORM or SQL writes
  2. stream_bridge defines stream semantics, not app-level bridge construction
  3. actor_context defines runtime context, not auth-plugin behavior
  4. Durable data enters only through boundary protocols:
    • RunCreateStore
    • RunQueryStore
    • RunDeleteStore
    • RunEventStore
  5. Lifecycle side effects enter only through RunObserver
  6. User isolation is not implemented ad hoc in each module; it is propagated through actor context

In one sentence:

runtime defines semantics and contracts; app.infra provides implementations.

3. runs Subsystem Design

3.1 Purpose

runtime/runs is the run orchestration domain. It is responsible for:

  1. Defining run domain objects and status transitions
  2. Organizing create / stream / wait / join / cancel / delete behavior
  3. Maintaining the in-process runtime control plane
  4. Emitting stream events and lifecycle events during execution
  5. Collecting trace, token, title, and message data through callbacks

3.2 Core Objects

See runs/types.py.

The most important types are:

  1. RunSpec
    • Built by the app-side input layer
    • The real execution input
  2. RunRecord
    • The runtime record managed by RunRegistry
  3. RunStatus
    • pending, starting, running, success, error, interrupted, timeout
  4. RunScope
    • Distinguishes stateful vs stateless execution and temporary thread behavior
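
A minimal sketch of how these types relate, assuming dataclass-style definitions; any field not named above is illustrative:

  from dataclasses import dataclass
  from enum import Enum
  from typing import Any


  class RunStatus(str, Enum):
      PENDING = "pending"
      STARTING = "starting"
      RUNNING = "running"
      SUCCESS = "success"
      ERROR = "error"
      INTERRUPTED = "interrupted"
      TIMEOUT = "timeout"


  @dataclass
  class RunSpec:
      # Built by the app-side input layer; the real execution input.
      thread_id: str                      # illustrative field
      input: dict[str, Any]               # illustrative field
      multitask_strategy: str = "reject"  # only reject / interrupt on the main path


  @dataclass
  class RunRecord:
      # Runtime record managed by RunRegistry (in-process, not durable).
      run_id: str
      spec: RunSpec
      status: RunStatus = RunStatus.PENDING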

3.3 Current Constraints

The current implementation explicitly limits some parts of the problem space:

  1. multitask_strategy currently supports only reject and interrupt on the main path
  2. enqueue, after_seconds, and batch execution are not on the current primary path
  3. RunRegistry is an in-process state source, not a durable source of truth
  4. External queries may use durable stores, but the live control plane still centers on the in-memory registry

3.4 Facade and Internal Components

RunsFacade in runs/facade.py provides the unified API:

  1. create_background
  2. create_and_stream
  3. create_and_wait
  4. join_stream
  5. join_wait
  6. cancel
  7. get_run
  8. list_runs
  9. delete_run

Internally it composes:

  1. RunRegistry
  2. ExecutionPlanner
  3. RunSupervisor
  4. RunStreamService
  5. RunWaitService
  6. RunCreateStore / RunQueryStore / RunDeleteStore
  7. RunObserver

So RunsFacade is the public entry point, while execution and state transitions are distributed across smaller components.
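
A hypothetical call-site sketch of that entry point; the method names come from the list above, while parameter and return shapes are assumptions:

  async def drive_run(facade, spec):
      """Hypothetical call site; spec/run/event shapes are assumptions."""
      run = await facade.create_background(spec)           # start without blocking
      async for event in facade.join_stream(run.run_id):   # attach to the live stream
          print(event)
      return await facade.get_run(run.run_id)              # query path, see section 6.6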

4. stream_bridge Design and Implementation

4.1 Why stream_bridge Is a Separate Abstraction

StreamBridge is defined in stream_bridge/contract.py.

It exists because run execution needs an event channel that is:

  1. Subscribable
  2. Replayable
  3. Terminal-state aware
  4. Resume-capable

That behavior must not be hard-coupled to HTTP SSE, in-memory queues, or Redis-specific details.

So:

  1. harness defines stream semantics
  2. the app layer owns backend selection and implementation

4.2 Contract Contents

The abstract StreamBridge currently exposes:

  1. publish(run_id, event, data)
  2. publish_end(run_id)
  3. publish_terminal(run_id, kind, data)
  4. subscribe(run_id, last_event_id, heartbeat_interval)
  5. cleanup(run_id, delay=0)
  6. cancel(run_id)
  7. mark_awaiting_input(run_id)
  8. start()
  9. close()

Public types include:

  1. StreamEvent
  2. StreamStatus
  3. ResumeResult
  4. HEARTBEAT_SENTINEL
  5. END_SENTINEL
  6. CANCELLED_SENTINEL
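
A condensed Protocol-style sketch of that surface; whether each method is async, and the exact parameter types, are assumptions, and the authoritative definition remains stream_bridge/contract.py:

  from typing import Any, AsyncIterator, Protocol


  class StreamBridgeLike(Protocol):
      # Condensed sketch; see stream_bridge/contract.py for the real contract.
      async def publish(self, run_id: str, event: str, data: Any) -> None: ...
      async def publish_end(self, run_id: str) -> None: ...
      async def publish_terminal(self, run_id: str, kind: str, data: Any) -> None: ...
      def subscribe(
          self,
          run_id: str,
          last_event_id: str | None = None,
          heartbeat_interval: float | None = None,
      ) -> AsyncIterator[Any]: ...
      async def cleanup(self, run_id: str, delay: float = 0) -> None: ...
      async def cancel(self, run_id: str) -> None: ...
      async def mark_awaiting_input(self, run_id: str) -> None: ...
      async def start(self) -> None: ...
      async def close(self) -> None: ...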

4.3 Semantic Boundary

The contract explicitly distinguishes:

  1. end / cancel / error
    • Real business-level terminal events for a run
  2. close()
    • Bridge-level shutdown
    • Not equivalent to run cancellation

4.4 Current Implementation Style

The concrete implementation currently used is the app-layer MemoryStreamBridge.

Its design is effectively “one in-memory event log per run”:

  1. _RunStream stores the event list, offset mapping, status, subscriber count, and awaiting-input state
  2. publish() generates increasing event IDs and appends to the per-run log
  3. subscribe() supports replay, heartbeat, resume, and terminal exit
  4. cleanup_loop() handles:
    • old streams
    • active streams with no publish activity
    • orphan terminal streams
    • TTL expiration
  5. mark_awaiting_input() extends timeout behavior for HITL flows

The Redis implementation is still only a placeholder in RedisStreamBridge.
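
The per-run log pattern can be illustrated with a minimal, non-production sketch; it keeps only increasing event IDs, replay, and terminal exit, and omits status tracking, subscriber counts, heartbeats, TTLs, and the cleanup loop:

  import asyncio
  from dataclasses import dataclass, field
  from typing import Any


  @dataclass
  class _RunStream:
      # Simplified "one in-memory event log per run".
      events: list[tuple[int, str, Any]] = field(default_factory=list)
      next_id: int = 1
      terminal: bool = False
      new_event: asyncio.Event = field(default_factory=asyncio.Event)


  class MiniMemoryBridge:
      """Illustrative only; not the real MemoryStreamBridge."""

      def __init__(self) -> None:
          self._streams: dict[str, _RunStream] = {}

      def _stream(self, run_id: str) -> _RunStream:
          return self._streams.setdefault(run_id, _RunStream())

      async def publish(self, run_id: str, event: str, data: Any) -> None:
          s = self._stream(run_id)
          s.events.append((s.next_id, event, data))  # increasing event IDs
          s.next_id += 1
          s.new_event.set()

      async def publish_terminal(self, run_id: str, kind: str, data: Any) -> None:
          await self.publish(run_id, kind, data)
          self._stream(run_id).terminal = True

      async def subscribe(self, run_id: str, last_event_id: int = 0):
          s = self._stream(run_id)
          cursor = last_event_id
          while True:
              for event_id, event, data in s.events:
                  if event_id > cursor:        # replay anything the caller has not seen
                      cursor = event_id
                      yield event_id, event, data
              if s.terminal:
                  return                       # terminal exit
              s.new_event.clear()
              await s.new_event.wait()         # wait for the next publish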

4.5 Call Chain

The stream bridge participates in the execution chain like this:

RunsFacade
  -> RunStreamService
  -> StreamBridge
  -> app route converts events to SSE

More concretely:

  1. _RunExecution._start() publishes metadata
  2. _RunExecution._stream() converts agent astream() output into bridge events
  3. _RunExecution._finish_success() / _finish_failed() / _finish_aborted() publish terminal events
  4. RunWaitService waits by subscribing for values, error, or terminal events
  5. The app route layer converts those events into outward-facing SSE
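
At the end of that chain, the app route typically wraps a bridge subscription in an SSE response. A hedged sketch, reusing the minimal bridge shape from section 4.4 (the real route code lives in the app layer, not in runtime):

  import json

  from fastapi.responses import StreamingResponse


  def sse_response(bridge, run_id: str, last_event_id: int = 0) -> StreamingResponse:
      # Illustrative conversion of bridge events into SSE frames.
      async def event_source():
          async for event_id, event, data in bridge.subscribe(run_id, last_event_id):
              yield f"id: {event_id}\nevent: {event}\ndata: {json.dumps(data)}\n\n"

      return StreamingResponse(event_source(), media_type="text/event-stream")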

4.6 Future Extensions

Likely future directions include:

  1. A real Redis bridge for cross-process / multi-instance streaming
  2. Stronger Last-Event-ID gap recovery behavior
  3. Richer HITL state handling
  4. Cross-node run coordination and more explicit dead-letter strategies

5. External Communication and Store Read/Write Boundaries

5.1 Two Main Outward Boundaries

runtime does not send HTTP requests directly and does not write ORM models directly, but it communicates outward through two main boundaries:

  1. StreamBridge
    • For outward-facing stream events
  2. store / observer
    • For durable data and lifecycle side effects

5.2 Store Boundary Protocols

Under runs/store, the harness layer defines:

  1. RunCreateStore
  2. RunQueryStore
  3. RunDeleteStore
  4. RunEventStore

These are not harness-internal persistence implementations. They are app-facing contracts declared by the runtime.
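
A hedged Protocol-style sketch of what these contracts look like from the runtime's side; the method names and signatures shown here are assumptions, and the real definitions live under runs/store:

  from typing import Any, Protocol, Sequence


  class RunCreateStoreLike(Protocol):
      async def create_run(self, record: Any) -> None: ...


  class RunQueryStoreLike(Protocol):
      async def get_run(self, run_id: str) -> Any | None: ...
      async def list_runs(self, thread_id: str) -> Sequence[Any]: ...


  class RunDeleteStoreLike(Protocol):
      async def delete_run(self, run_id: str) -> None: ...


  class RunEventStoreLike(Protocol):
      async def append_events(self, run_id: str, events: Sequence[Any]) -> None: ...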

5.3 How the app Layer Supplies Store Implementations

The app layer currently provides:

  1. AppRunCreateStore
  2. AppRunQueryStore
  3. AppRunDeleteStore
  4. AppRunEventStore
  5. JsonlRunEventStore

The shared pattern is:

  1. harness depends only on protocols
  2. the app layer owns session lifecycle, commit behavior, access control, and backend choice
  3. durable data eventually lands in store.repositories.* or JSONL files

5.4 How Run Lifecycle Data Leaves the Runtime

The single-run executor _RunExecution does not write to the database directly.

It exports data through three paths:

  1. bridge events
    • Streamed outward to subscribers
  2. callback -> RunEventStore
    • Execution trace / message / tool / custom events are persisted in batches
  3. lifecycle event -> RunObserver
    • Run started, completed, failed, cancelled, and thread-status updates are emitted for app observers

5.5 RunEventStore Backends

The app-side factory app/infra/run_events/factory.py currently selects:

  1. run_events.backend == "db"
    • AppRunEventStore
  2. run_events.backend == "jsonl"
    • JsonlRunEventStore

So the runtime does not care whether events end up in a database or in files. It only requires the event-store protocol.
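
The selection can be sketched as a small factory, assuming a settings object with a run_events.backend field; the constructor arguments here are assumptions, and the real factory is app/infra/run_events/factory.py:

  def build_run_event_store(settings):
      # Illustrative backend selection; constructor arguments are assumptions.
      backend = settings.run_events.backend
      if backend == "db":
          return AppRunEventStore(session_factory=settings.db_session_factory)
      if backend == "jsonl":
          return JsonlRunEventStore(base_dir=settings.run_events.base_dir)
      raise ValueError(f"Unsupported run_events backend: {backend!r}")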

6. Run Lifecycle Data, Callbacks, Write-Back, and Query Flow

6.1 Main Single-Run Flow

The main _RunExecution.run() flow is:

  1. _start()
  2. _prepare()
  3. _stream()
  4. _finish_after_stream()
  5. finally
    • _emit_final_thread_status()
    • callbacks.flush()
    • bridge.cleanup(run_id)
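
As a hedged skeleton (the surrounding class, method bodies, and error handling are omitted; whether flush() and cleanup() are awaited is an assumption):

  async def run(self) -> None:
      # Illustrative skeleton of _RunExecution.run(); details simplified.
      try:
          await self._start()                # status -> running, RUN_STARTED, metadata event
          await self._prepare()
          await self._stream()               # convert agent astream() output to bridge events
          await self._finish_after_stream()  # success / failed / aborted terminal handling
      finally:
          await self._emit_final_thread_status()
          await self.callbacks.flush()           # persist buffered run events
          await self.bridge.cleanup(self.run_id)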

6.2 What the Start Phase Records

_start():

  1. sets run status to running
  2. emits RUN_STARTED
  3. extracts the first human message and emits HUMAN_MESSAGE
  4. captures the pre-run checkpoint ID
  5. publishes a metadata stream event

6.3 What the Callbacks Collect

Callbacks live under runs/callbacks.

The main ones are:

  1. RunEventCallback
    • Records run_start, run_end, llm_request, llm_response, tool_start, tool_end, tool_result, custom_event, and more
    • Flushes batches into RunEventStore
  2. RunTokenCallback
    • Aggregates token usage, LLM call counts, lead/subagent/middleware token split, message counts, first human message, and last AI message
  3. RunTitleCallback
    • Extracts thread title from title middleware output or custom events
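
The buffering pattern behind RunEventCallback can be sketched as follows; the event names come from the list above, while the hook shape, batch threshold, and method names are assumptions:

  from typing import Any


  class MiniRunEventCallback:
      """Illustrative buffer-and-flush pattern, not the real callback API."""

      def __init__(self, event_store, run_id: str, batch_size: int = 50) -> None:
          self._event_store = event_store        # a RunEventStore-style object
          self._run_id = run_id
          self._batch_size = batch_size
          self._buffer: list[dict[str, Any]] = []

      async def record(self, event_type: str, payload: dict[str, Any]) -> None:
          # event_type: run_start, llm_request, tool_end, custom_event, ...
          self._buffer.append({"type": event_type, **payload})
          if len(self._buffer) >= self._batch_size:
              await self.flush()

      async def flush(self) -> None:
          if not self._buffer:
              return
          batch, self._buffer = self._buffer, []
          await self._event_store.append_events(self._run_id, batch)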

6.4 How completion_data Is Produced

RunTokenCallback.completion_data() yields RunCompletionData, including:

  1. total_input_tokens
  2. total_output_tokens
  3. total_tokens
  4. llm_call_count
  5. lead_agent_tokens
  6. subagent_tokens
  7. middleware_tokens
  8. message_count
  9. last_ai_message
  10. first_human_message

The executor includes this data in lifecycle payloads on success, failure, and cancellation.
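
As a dataclass-style sketch (the field names follow the list above; the types are assumptions):

  from dataclasses import dataclass


  @dataclass
  class RunCompletionData:
      total_input_tokens: int
      total_output_tokens: int
      total_tokens: int
      llm_call_count: int
      lead_agent_tokens: int
      subagent_tokens: int
      middleware_tokens: int
      message_count: int
      last_ai_message: str | None
      first_human_message: str | None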

6.5 How the app Layer Writes Lifecycle Results Back

The executor emits RunLifecycleEvent objects through RunEventEmitter.

The app-layer StorageRunObserver then persists durable state:

  1. RUN_STARTED
    • Marks the run as running
  2. RUN_COMPLETED
    • Writes completion data
    • Syncs thread title if present
  3. RUN_FAILED
    • Writes error and completion data
  4. RUN_CANCELLED
    • Writes interrupted state and completion data
  5. THREAD_STATUS_UPDATED
    • Syncs thread status

6.6 Query Paths

RunsFacade.get_run() and list_runs() have two paths:

  1. If a RunQueryStore is injected, durable state is used first
  2. Otherwise, the facade falls back to RunRegistry

So:

  1. the in-memory registry is the control plane
  2. the durable store is the preferred query surface
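
That precedence can be sketched in a few lines (the real logic lives in RunsFacade; attribute names are assumptions):

  async def get_run(self, run_id: str):
      # Durable state first when a RunQueryStore is injected; otherwise the registry.
      if self._query_store is not None:
          return await self._query_store.get_run(run_id)
      return self._registry.get(run_id)  # in-process control-plane view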

7. How actor_context Is Dynamically Injected for User Isolation

7.1 Design Goal

actor_context is defined in actor_context.py.

Its purpose is to let the runtime and lower-level infrastructure modules depend on a stable notion of “who the current actor is” without importing the auth plugin, FastAPI request objects, or a specific user model.

7.2 Current Implementation

The current implementation is a request/task-scoped context built on top of ContextVar:

  1. ActorContext
    • Currently carries only user_id
  2. _current_actor
    • A ContextVar[ActorContext | None]
  3. bind_actor_context(actor)
    • Binds the current actor
  4. reset_actor_context(token)
    • Restores the previous context
  5. get_actor_context()
    • Returns the current actor
  6. get_effective_user_id()
    • Returns the current user ID or DEFAULT_USER_ID
  7. resolve_user_id(value=AUTO | explicit | None)
    • Resolves repository/storage-facing user IDs consistently
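
A condensed sketch of that mechanism; the authoritative version is actor_context.py, and DEFAULT_USER_ID plus the AUTO sentinel are shown here as simple placeholders:

  from contextvars import ContextVar, Token
  from dataclasses import dataclass

  DEFAULT_USER_ID = "default"  # placeholder value
  AUTO = object()              # sentinel: "resolve from the bound actor context"


  @dataclass(frozen=True)
  class ActorContext:
      user_id: str  # currently the only field; see 7.7 for planned extensions


  _current_actor: ContextVar[ActorContext | None] = ContextVar("actor", default=None)


  def bind_actor_context(actor: ActorContext) -> Token:
      return _current_actor.set(actor)


  def reset_actor_context(token: Token) -> None:
      _current_actor.reset(token)


  def get_actor_context() -> ActorContext | None:
      return _current_actor.get()


  def get_effective_user_id() -> str:
      actor = _current_actor.get()
      return actor.user_id if actor else DEFAULT_USER_ID


  def resolve_user_id(value=AUTO) -> str | None:
      if value is AUTO:
          return get_effective_user_id()
      return value  # explicit user_id, or None when no isolation input is required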

7.3 How the app Layer Injects It Dynamically

Dynamic injection currently happens at the app/auth boundary.

For HTTP request flows:

  1. app.plugins.auth.security.middleware
    • Builds ActorContext(user_id=...) from the authenticated request user
    • Binds and resets runtime actor context around request handling
  2. app.plugins.auth.security.actor_context
    • Provides bind_request_actor_context(request) and bind_user_actor_context(user_id)
    • Allows routes and non-HTTP entry points to bind runtime actor context explicitly

For non-HTTP / external channel flows:

  1. app/channels/manager.py
  2. app/channels/feishu.py

Those entry points also wrap execution with bind_user_actor_context(user_id) before they enter runtime-facing code. This matters because:

  1. the runtime does not need to distinguish HTTP from Feishu or other channels
  2. any entry point that can resolve a user ID can inject the same isolation semantics
  3. the same runtime/store/path/memory code can stay protocol-agnostic

So the runtime itself does not know what a request is, and it does not know the auth plugin's user model. It only knows whether an ActorContext is currently bound in the ContextVar.

7.4 Propagation Semantics After Injection

In practice, “dynamic injection” here does not mean manually threading user_id through every function signature. The app boundary binds the actor into a ContextVar, and runtime-facing code reads it only where isolation is actually needed.

The current semantics are:

  1. an entry boundary calls bind_actor_context(...)
  2. the async call chain created inside that context sees the same actor view
  3. the boundary restores the previous value with reset_actor_context(token) when the request/task exits

That gives two practical outcomes:

  1. most runtime interfaces do not need to carry user_id as an explicit parameter through every layer
  2. boundaries that do need durable isolation or path isolation can still read explicitly via resolve_user_id() or get_effective_user_id()
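
The binding pattern at an entry boundary, together with a boundary-specific read, can be sketched like this; the actor_scope helper is illustrative, not an existing name:

  from contextlib import contextmanager


  @contextmanager
  def actor_scope(user_id: str):
      # Bind the actor for the duration of a request/task, then restore it.
      token = bind_actor_context(ActorContext(user_id=user_id))
      try:
          yield
      finally:
          reset_actor_context(token)


  # Entry boundary (HTTP middleware, Feishu channel worker, internal job, ...):
  def handle_channel_message(user_id: str, payload: dict) -> None:
      with actor_scope(user_id):
          ...  # runtime-facing calls see the bound actor via get_actor_context()


  # A boundary that needs isolation input reads the context explicitly:
  def user_upload_dir(base_dir: str) -> str:
      return f"{base_dir}/{get_effective_user_id()}"  # per-user directory scope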

7.5 How User Isolation Actually Works

User isolation is implemented through “dynamic injection + boundary-specific reads”.

The main paths are:

  1. path / uploads / sandbox / memory
    • Use get_effective_user_id() to derive per-user directories and resource scopes
  2. app storage adapters
    • Use resolve_user_id(AUTO) in RunStoreAdapter, ThreadMetaStorage, and related boundaries
  3. run event store
    • AppRunEventStore reads get_actor_context() and decides whether the current actor may see a thread

So user isolation is not centralized in a single middleware and then forgotten. Instead:

  1. the app boundary dynamically binds the actor into runtime context
  2. runtime and lower layers read that context when they need isolation input
  3. each boundary applies the user ID according to its own responsibility

7.6 Why This Approach Works Well

The current design has several practical strengths:

  1. The runtime does not depend on a specific auth implementation
  2. HTTP and non-HTTP entry points can reuse the same isolation mechanism
  3. The same user ID propagates naturally into paths, memory, store access, and event visibility
  4. Where stronger enforcement is needed, AUTO + resolve_user_id() can require a bound actor context

7.7 Future Extensions

ActorContext already contains explicit future-extension hints. The current pattern can be extended without changing the architecture:

  1. tenant_id
    • For multi-tenant isolation
  2. subject_id
    • For a more stable identity key
  3. scopes
    • For finer-grained authorization
  4. auth_source
    • To track the source channel or auth mechanism

The recommended extension model is to preserve the current shape:

  1. The app/auth boundary binds a richer ActorContext
  2. The runtime depends only on abstract context fields, never on request/user objects
  3. Lower layers read only the fields they actually need
  4. Store / path / sandbox / stream / memory boundaries can gradually become tenant-aware or scope-aware
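
For example, the richer context can keep the same dataclass shape; the field names follow the list in 7.7, and everything here is a forward-looking sketch rather than current code:

  from dataclasses import dataclass


  @dataclass(frozen=True)
  class ActorContext:
      user_id: str
      tenant_id: str | None = None          # multi-tenant isolation
      subject_id: str | None = None         # more stable identity key
      scopes: frozenset[str] = frozenset()  # finer-grained authorization
      auth_source: str | None = None        # source channel or auth mechanism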

More concretely, stronger isolation can be added incrementally at the boundaries:

  1. store boundaries
    • add tenant_id filtering in RunStoreAdapter, ThreadMetaStorage, and feedback/event stores
  2. path and sandbox boundaries
    • shard directories by tenant_id/user_id instead of user_id alone
  3. event-visibility boundaries
    • layer scopes or subject_id checks into run-event and thread queries
  4. external-channel boundaries
    • populate auth_source so API, channel, and internal-job traffic can be distinguished

That keeps the runtime dependent on the abstract “current actor context” concept, not on FastAPI request objects or a specific auth implementation.

8. Interaction with the app Layer

8.1 How the app Layer Wires the Runtime

The app composition root for runs is app/gateway/services/runs/facade_factory.py.

It assembles:

  1. RunRegistry
  2. ExecutionPlanner
  3. RunSupervisor
  4. RunStreamService
  5. RunWaitService
  6. RunsRuntime
    • bridge
    • checkpointer
    • store
    • event_store
    • agent_factory_resolver
  7. StorageRunObserver
  8. AppRunCreateStore
  9. AppRunQueryStore
  10. AppRunDeleteStore
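
A hedged sketch of that wiring; the constructor parameters are assumptions, and the real composition root is app/gateway/services/runs/facade_factory.py:

  def build_runs_facade(state):
      # 'state' stands in for app.state (section 8.2); parameter names are assumptions.
      registry = RunRegistry()
      runtime = RunsRuntime(
          bridge=state.stream_bridge,
          checkpointer=state.checkpointer,
          store=state.run_store,
          event_store=state.run_event_store,
          agent_factory_resolver=state.agent_factory_resolver,
      )
      return RunsFacade(
          registry=registry,
          planner=ExecutionPlanner(),
          supervisor=RunSupervisor(registry=registry),
          stream_service=RunStreamService(runtime=runtime),
          wait_service=RunWaitService(runtime=runtime),
          create_store=AppRunCreateStore(),
          query_store=AppRunQueryStore(),
          delete_store=AppRunDeleteStore(),
          observer=StorageRunObserver(),
      )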

8.2 How app.state Provides Infrastructure

In app/gateway/registrar.py:

  1. init_persistence() creates:
    • persistence
    • checkpointer
    • run_store
    • thread_meta_storage
    • run_event_store
  2. init_runtime() creates:
    • stream_bridge

Those objects are then attached to app.state for dependency injection and facade construction.

8.3 The app Boundary for stream_bridge

Concrete stream bridge construction now belongs entirely to the app layer:

  1. harness exports only the StreamBridge contract
  2. app.infra.stream_bridge.build_stream_bridge constructs the actual implementation

That is a very explicit boundary:

  1. harness defines runtime semantics and interfaces
  2. app selects and constructs infrastructure

9. Summary

The most accurate one-line summary of deerflow.runtime today is:

It is a runtime kernel built around run orchestration, a stream bridge as the streaming boundary, actor context as the dynamic isolation bridge, and store / observer protocols as the durable and side-effect boundaries.

More concretely:

  1. runs owns orchestration and lifecycle progression
  2. stream_bridge owns stream semantics
  3. actor_context owns runtime-scoped user context and isolation bridging
  4. serialization / converters own outward event and message formatting
  5. the app layer owns real persistence, stream infrastructure, and auth-driven context injection

The main strengths of this structure are:

  1. Runtime semantics are decoupled from infrastructure implementations
  2. Request identity is decoupled from runtime logic
  3. HTTP, CLI, and channel-worker entry points can reuse the same runtime boundaries
  4. The system can grow toward multi-tenancy, cross-process stream bridges, and richer durable backends without changing the core model

The current limitations are also clear:

  1. RunRegistry is still an in-process control plane
  2. The Redis bridge is not implemented yet
  3. Some multitask strategies and batch capabilities are still outside the main path
  4. ActorContext currently carries only user_id, not richer fields such as tenant, scopes, or auth source

So the best way to understand the current code is not as a final platform, but as a runtime kernel with clear semantics and extension boundaries.