deerflow-storage Design Overview
This document explains the current responsibilities of backend/packages/storage, its overall design, how database integration works, how persistence models are defined, what database access interfaces it exposes, and how the app layer consumes it through infra.
1. Package Role
deerflow-storage is DeerFlow's unified persistence foundation package. Its purpose is to pull database integration and business data persistence out of the app layer and provide them as a reusable storage layer.
At the moment, it mainly provides two kinds of capabilities:
- A checkpointer for the LangGraph runtime.
- ORM models, repository contracts, and database implementations for DeerFlow application data.
This package does not expose HTTP endpoints directly, does not depend on FastAPI routes directly, and does not own business orchestration. It acts as a storage kernel.
2. Overall Layering
The current code is roughly split into the following layers:
```
config
└─ Reads configuration, resolves environment variables, and determines database parameters
persistence
└─ Creates AsyncEngine / SessionFactory / LangGraph checkpointer
repositories/contracts
└─ Defines domain objects and repository protocols (Pydantic + Protocol)
repositories/models
└─ Defines SQLAlchemy ORM table models
repositories/db
└─ Implements database access on top of AsyncSession
app.infra.storage
└─ Adapts storage repositories into app-facing interfaces
gateway / runtime
└─ Uses infra through dependency injection, facades, observers, and event stores
```
The core idea is:
- The `storage` package only decides how data is stored and what is stored.
- `app.infra` translates low-level repositories into application-facing semantics.
- `gateway` and `runtime` depend only on interfaces exposed by `infra`, not on ORM models or SQL directly.
3. How Database Integration Works
3.1 Configuration Entry
Database configuration is defined in store/config/storage_config.py, while the outer application configuration is loaded by store/config/app_config.py.
The configuration flow has several notable traits:
- It loads from `backend/config.yaml` or the repository root `config.yaml` by default.
- It supports overriding the path via `DEER_FLOW_CONFIG_PATH`.
- It supports `$ENV_VAR` syntax in config values.
- Timezone configuration also affects how timestamp fields are handled in the storage layer.
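The `$ENV_VAR` substitution described above can be sketched as a small resolver. This is an illustrative stand-in, not the actual code in `store/config`; the function name and the exact token grammar are assumptions.

```python
import os
import re

# Matches $ENV_VAR-style tokens: a dollar sign followed by an identifier.
_ENV_TOKEN = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")

def resolve_env_vars(value: str) -> str:
    """Expand $ENV_VAR tokens in a config value using the process environment.

    Unset variables are left as-is in this sketch; the real loader may
    raise or substitute an empty string instead.
    """
    return _ENV_TOKEN.sub(
        lambda m: os.environ.get(m.group(1), m.group(0)), value
    )
```

A value such as `mysql://$DB_HOST/deerflow` would then resolve `DB_HOST` from the environment before the URL is handed to the persistence layer.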
3.2 Persistence Entry Point
The unified entry point is create_persistence() in store/persistence/factory.py.
It performs three main tasks:
- Builds a SQLAlchemy URL from `StorageConfig`.
- Selects the SQLite / MySQL / PostgreSQL builder based on the configured driver.
- Returns `AppPersistence`, which contains `checkpointer`, `engine`, `session_factory`, `setup`, and `aclose`.
So the application startup does not just get a bare database connection. It gets a full runtime persistence bundle.
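The shape of that bundle can be sketched as a dataclass. This is an illustrative stand-in built from the field names listed above; the real `AppPersistence` class may carry different types and extra members.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class AppPersistence:
    """Illustrative stand-in for the bundle returned by create_persistence().

    Field names follow the document; types are loose on purpose since the
    real engine, checkpointer, and session factory are SQLAlchemy /
    LangGraph objects.
    """
    checkpointer: Any            # LangGraph async checkpointer
    engine: Any                  # SQLAlchemy AsyncEngine
    session_factory: Callable[[], Any]  # async_sessionmaker-like callable
    setup: Any                   # async callable: initialize tables
    aclose: Any                  # async callable: tear everything down
```

Startup code would call `setup()` once after construction and `aclose()` on shutdown, rather than managing the engine and checkpointer separately.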
3.3 Driver Integration Pattern
Driver implementations live in:
- `store/persistence/drivers/sqlite.py`
- `store/persistence/drivers/mysql.py`
- `store/persistence/drivers/postgres.py`
All three follow the same pattern:
- Create an `AsyncEngine`.
- Create an `async_sessionmaker`.
- Create the LangGraph async checkpointer for that backend.
- In `setup()`, initialize the checkpointer first, then run `MappedBase.metadata.create_all`.
- In `aclose()`, close the engine and checkpointer in order.
This means the current initialization strategy is:
- Checkpointer tables and business tables are initialized together at runtime startup.
- Business tables currently rely on SQLAlchemy `create_all()`.
- There is no separate migration orchestration path inside this package as the main workflow.
3.4 Current SQLite Behavior
SQLite uses StorageConfig.sqlite_storage_path to generate the database file path, which defaults to .deer-flow/data/deerflow.db.
For SQLite, the model primary key type falls back to Integer PRIMARY KEY, because SQLite auto-increment behavior works more reliably with that than with BIGINT.
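Turning `sqlite_storage_path` into an async SQLAlchemy URL can be sketched as below. The `aiosqlite` driver name is an assumption (it is the usual async SQLite driver for SQLAlchemy); the helper name is hypothetical.

```python
from pathlib import Path

# Default path noted in the text; the real default comes from StorageConfig.
DEFAULT_SQLITE_PATH = ".deer-flow/data/deerflow.db"

def build_sqlite_url(storage_path: str = DEFAULT_SQLITE_PATH) -> str:
    """Sketch: build an async SQLAlchemy URL from a SQLite file path.

    Assumes the aiosqlite driver; the real builder in
    store/persistence/drivers/sqlite.py may differ.
    """
    return f"sqlite+aiosqlite:///{Path(storage_path).as_posix()}"
```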
4. How Persistence Models Are Defined
4.1 Base Model Conventions
Base definitions are in store/persistence/base_model.py.
That file standardizes several things:
- `MappedBase` as the declarative base for all ORM models.
- `DataClassBase` to support native dataclass-style models.
- `Base` to include `created_time` and `updated_time`.
- `id_key` as the unified primary key definition.
- `UniversalText` as a cross-dialect long-text type.
- `TimeZone` as a timezone-aware datetime type wrapper.
As a result, new models in this package usually follow this pattern:
- Inherit from `Base` if they need `created_time` and `updated_time`.
- Inherit from `DataClassBase` if they only need dataclass-style mapping without `updated_time`.
- Use `id: Mapped[id_key]` for the primary key.
4.2 Current Business Models
Models are under store/repositories/models:
- `Run`
  - Table: `runs`
  - Stores run metadata, status, token statistics, message summaries, and error details.
- `ThreadMeta`
  - Table: `thread_meta`
  - Stores thread-level metadata, status, title, and ownership data.
- `RunEvent`
  - Table: `run_events`
  - Stores events and messages emitted by a run.
  - Uses a `(thread_id, seq)` unique constraint to maintain per-thread ordering.
- `Feedback`
  - Table: `feedback`
  - Stores feedback records associated with runs.
4.3 Model Field Design Traits
There are a few common conventions in the current models:
- Business identifiers are string fields such as `run_id`, `thread_id`, and `feedback_id`; the auto-increment `id` is only an internal primary key.
- Structured extension data is usually stored in a JSON `metadata` column, while the ORM attribute name is often `meta`.
- Long text fields use `UniversalText`.
- Timestamp fields go through `TimeZone` to keep timezone handling consistent.
RunEvent.content has one additional rule:
- If `content` is a `dict`, it is serialized to JSON before being written.
- A `content_is_dict=True` marker is added into `metadata`.
- On reads, the value is deserialized again based on that marker.
This lets run_events support both plain text messages and structured event payloads.
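The encode/decode rule described above can be sketched as a pair of pure functions. The function names are hypothetical; the `content_is_dict` marker and the JSON round-trip follow the document.

```python
import json
from typing import Any

def encode_event_content(content, meta: dict):
    """Write side: dict payloads are JSON-serialized and flagged in metadata.

    Illustrative sketch of the rule described in the text, not the real
    model code.
    """
    if isinstance(content, dict):
        return json.dumps(content), {**meta, "content_is_dict": True}
    return content, meta

def decode_event_content(stored, meta: dict) -> Any:
    """Read side: deserialize only when the marker is present."""
    if meta.get("content_is_dict"):
        return json.loads(stored)
    return stored
```

Plain-text messages pass through untouched, so the same column serves both message text and structured event payloads.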
5. How Database Access Interfaces Are Defined
5.1 contracts: Repository Contract Layer
Contracts are defined in store/repositories/contracts.
This layer does two things:
- Uses Pydantic models to define input and output objects such as `RunCreate`, `Run`, `ThreadMetaCreate`, and `ThreadMeta`.
- Uses `Protocol` to define repository interfaces such as `RunRepositoryProtocol` and `ThreadMetaRepositoryProtocol`.
That means upper layers depend on contracts and protocols rather than on a specific SQLAlchemy implementation.
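The contract pattern can be sketched with `typing.Protocol`. This is a simplified stand-in: the real contracts use Pydantic models with more fields (a dataclass is substituted here to keep the sketch self-contained), and the protocol surely defines more methods than the one shown.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable

@dataclass
class Run:
    """Illustrative output model; the real one is a Pydantic model."""
    run_id: str
    thread_id: str
    status: str

@runtime_checkable
class RunRepositoryProtocol(Protocol):
    """Structural interface: upper layers depend on this, not on a class."""
    async def get(self, run_id: str) -> Optional[Run]: ...

class InMemoryRunRepository:
    """A toy implementation that satisfies the protocol structurally,
    showing that no inheritance from the protocol is needed."""
    def __init__(self):
        self._runs = {}
    async def get(self, run_id: str) -> Optional[Run]:
        return self._runs.get(run_id)
```

Because the protocol is structural, a test double like `InMemoryRunRepository` can replace `DbRunRepository` without touching the upper layers.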
5.2 db: Database Implementation Layer
Implementations live in store/repositories/db.
Each repository implementation follows the same pattern:
- The constructor receives an `AsyncSession`.
- It uses `select` / `update` / `delete` for database operations.
- It converts ORM models into the Pydantic objects defined by the contracts layer.
For example:
- `DbRunRepository` handles CRUD and completion-stat updates for the `runs` table.
- `DbThreadMetaRepository` handles thread metadata retrieval, updates, and search.
- `DbRunEventRepository` handles batched event append, message pagination, and deletion by thread or run.
- `DbFeedbackRepository` handles feedback creation and retrieval.
5.3 factory: Repository Construction Entry
store/repositories/factory.py provides unified factory functions:
- `build_run_repository(session)`
- `build_thread_meta_repository(session)`
- `build_feedback_repository(session)`
- `build_run_event_repository(session)`
So upper layers only need an AsyncSession and do not need to depend directly on concrete repository class names.
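The decoupling that the factory provides can be shown in a few lines. Both classes here are stand-ins; the point is only that callers touch the builder function, never the concrete class name.

```python
class DbRunRepository:
    """Stand-in for the concrete implementation in store/repositories/db."""
    def __init__(self, session):
        self.session = session

def build_run_repository(session):
    """Sketch of the factory: callers pass an AsyncSession and receive
    something satisfying RunRepositoryProtocol, without importing the
    implementation class."""
    return DbRunRepository(session)
```

If the implementation class is later renamed or swapped, only the factory body changes.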
6. What the Package Exposes
If you look only at the storage package itself, it exposes two categories of interfaces.
6.1 Runtime Persistence Entry
Exported from store/persistence/__init__.py:
- `create_persistence()`
- `AppPersistence`
- The ORM base classes and shared persistence types
This is the entry point used by the application to initialize database access and the checkpointer.
6.2 Repository Contracts and Builders
Exported from store/repositories/__init__.py:
- Contract-layer input and output models
- Repository `Protocol`s
- Repository builder factory functions
This is how the application integrates business persistence by repository contract.
In other words, the storage package does not provide an HTTP SDK to the app layer. It provides:
- Initialization capabilities
- A session factory
- Repository protocols and repository builders
7. How the app Layer Uses It Through infra
The app layer does not operate on store.repositories.db.* directly. It goes through app.infra.storage.
Relevant code lives in:
- `backend/app/infra/storage/runs.py`
- `backend/app/infra/storage/thread_meta.py`
- `backend/app/infra/storage/run_events.py`
7.1 Why the infra Layer Exists
The app layer does not want raw repository interfaces. It needs persistence services aligned with application semantics:
- Automatic session lifecycle management
- Automatic commit / rollback behavior
- Actor / user visibility checks
- Conversion from lower-level Pydantic models into app-facing dict structures
- Alignment with the expectations of facades, observers, and routers
7.2 Run Integration
RunStoreAdapter wraps build_run_repository(session) with a session_factory and exposes:
- `get`
- `list_by_thread`
- `create`
- `update_status`
- `set_error`
- `update_run_completion`
- `delete`
Important details:
- Each call creates its own `AsyncSession`.
- Read and write flows manage transactions separately.
- Visibility filtering is applied through `actor_context` and `user_id`.
- Run creation serializes metadata and kwargs before persisting them.
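The per-call session pattern can be sketched as follows. The fake session class and the method body are illustrative; the real adapter calls the repository between opening the session and committing, and its error handling may be richer.

```python
import asyncio

class FakeAsyncSession:
    """Minimal stand-in for SQLAlchemy's AsyncSession, for illustration."""
    def __init__(self):
        self.committed = False
        self.rolled_back = False
        self.closed = False
    async def commit(self):
        self.committed = True
    async def rollback(self):
        self.rolled_back = True
    async def close(self):
        self.closed = True

class RunStoreAdapter:
    """Sketch of the pattern described above: every public method opens its
    own session, commits on success, rolls back on failure, always closes."""
    def __init__(self, session_factory):
        self._session_factory = session_factory

    async def update_status(self, run_id: str, status: str) -> None:
        session = self._session_factory()
        try:
            # ... the real adapter would call the repository here ...
            await session.commit()
        except Exception:
            await session.rollback()
            raise
        finally:
            await session.close()
```

Because no session outlives a single call, adapters stay safe to share across concurrent requests.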
7.3 Thread Integration
Thread metadata is split into two layers:
- `ThreadMetaStoreAdapter` - a session-managed wrapper around the repository.
- `ThreadMetaStorage` - a higher-level app-facing interface.
ThreadMetaStorage adds application-oriented methods such as:
- `ensure_thread`
- `ensure_thread_running`
- `sync_thread_title`
- `sync_thread_assistant_id`
- `sync_thread_status`
- `sync_thread_metadata`
- `search_threads`
So the app layer typically depends on ThreadMetaStorage, not directly on the low-level repository protocol.
7.4 RunEvent Integration
AppRunEventStore is a runtime-oriented event storage adapter. It is not just a CRUD wrapper. It is shaped around the runtime event-store protocol:
- `put_batch`
- `list_messages`
- `list_events`
- `list_messages_by_run`
- `count_messages`
- `delete_by_thread`
- `delete_by_run`
It also performs thread visibility checks. If the current actor has a user_id, it first loads the thread owner and then decides whether the actor can read or write events for that thread.
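That check can be sketched as a single predicate. The exact policy is an assumption here (in particular, how unowned threads and actors without a `user_id` are treated); only the general shape follows the text.

```python
from typing import Optional

def can_access_thread(actor_user_id: Optional[str],
                      thread_owner_id: Optional[str]) -> bool:
    """Sketch of the visibility rule: actors without a user_id are treated
    as unrestricted in this sketch, otherwise the actor must own the thread.
    The real policy in AppRunEventStore may differ."""
    if actor_user_id is None:
        return True
    return actor_user_id == thread_owner_id
```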
8. How storage Is Wired at app Startup
Application startup wiring happens in backend/app/gateway/registrar.py.
The init_persistence() flow is:
- Call `create_persistence()`.
- Run `app_persistence.setup()`.
- Use `session_factory` to build `RunStoreAdapter`, `ThreadMetaStoreAdapter`, `FeedbackStoreAdapter`, and `AppRunEventStore`.
- Build `ThreadMetaStorage` on top of that.
- Inject all of them into `app.state`.
So from the application's point of view, storage is not wired as a single global repository object. Instead:
- Lower layers share a single `session_factory`.
- Upper layers create sessions per call through adapters.
- The final objects are attached to FastAPI `app.state` for routers and services.
9. How gateway and service Layers Use These Capabilities
9.1 Dependency Injection
backend/app/gateway/dependencies/repositories.py reads the following objects from request.app.state:
- `run_store`
- `thread_meta_repo`
- `thread_meta_storage`
- `feedback_repo`
These are then exposed as FastAPI dependencies to route handlers.
9.2 Usage in Thread Routes
In backend/app/gateway/routers/langgraph/threads.py:
- Thread creation calls `ThreadMetaStorage.ensure_thread()`.
- Thread search calls `ThreadMetaStorage.search_threads()`.
- Thread deletion calls `ThreadMetaStorage.delete_thread()`.
So the thread API does not touch ORM tables directly. It goes through the infra layer.
9.3 Usage in the Runs Facade
backend/app/gateway/services/runs/facade_factory.py injects storage-related objects into RunsFacade:
- `run_read_repo`
- `run_write_repo`
- `run_delete_repo`
- `thread_meta_storage`
- `run_event_store`
These are then consumed by app-layer components such as:
- `AppRunCreateStore`
- `AppRunQueryStore`
- `AppRunDeleteStore`
- `StorageRunObserver`
9.4 How Run Lifecycle State Is Written Back
StorageRunObserver is a key integration path.
It listens to runtime lifecycle events and writes their results back into persistence:
- `RUN_STARTED` -> updates run status to `running`
- `RUN_COMPLETED` -> updates completion stats and, when needed, syncs the thread title
- `RUN_FAILED` -> updates error state and error details
- `RUN_CANCELLED` -> updates run status to `interrupted`
- `THREAD_STATUS_UPDATED` -> syncs thread status
This means the storage package itself does not listen to runtime events directly, but app.infra.storage already plugs it into the runtime observer system.
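The event-to-status portion of that mapping can be sketched as a dispatch table. The event names and resulting statuses come from the list above; the observer's actual interface, and the handling of `RUN_COMPLETED` / `RUN_FAILED` (which write more than a status), are simplified here.

```python
# Status transitions taken from the document; only the two pure
# status-change events are modeled in this sketch.
EVENT_TO_RUN_STATUS = {
    "RUN_STARTED": "running",
    "RUN_CANCELLED": "interrupted",
}

class StorageRunObserver:
    """Sketch: translate runtime lifecycle events into persistence calls."""
    def __init__(self, run_store):
        self._run_store = run_store

    def on_event(self, event_type: str, run_id: str) -> None:
        status = EVENT_TO_RUN_STATUS.get(event_type)
        if status is not None:
            self._run_store.update_status(run_id, status)

class RecordingStore:
    """Test double that records update calls instead of touching a DB."""
    def __init__(self):
        self.calls = []
    def update_status(self, run_id, status):
        self.calls.append((run_id, status))
```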
10. How It Communicates with External Systems
The phrase "external communication" splits into two cases here.
10.1 Communication with Databases
The storage package communicates with SQLite / MySQL / PostgreSQL through SQLAlchemy async engines.
The main entry points are:
- `create_async_engine`
- `AsyncSession`
- Repository-level `select` / `update` / `delete`
The LangGraph checkpointer also communicates with the database through backend-specific async savers, but that logic is centralized in persistence/drivers.
10.2 Communication with External Application Interfaces
The storage package does not expose external APIs by itself. External communication is handled by the app layer.
The typical path is:
- An HTTP request enters a FastAPI route
- The route gets an `infra` adapter through dependency injection.
- `infra` calls a `storage` repository.
- The route converts the result into an API response.
There is also a runtime-event path:
- The runtime emits a run lifecycle event or a run event
- An observer or event store calls `infra`.
- `infra` calls `storage`.
- The data is persisted into the database.
So more precisely, storage does not communicate outward on its own. It acts as the database boundary inside the application and is consumed by both the HTTP layer and the runtime layer.
11. Current Design Philosophy
The current code reflects a fairly clear design philosophy:
- Unify checkpointer storage and application data storage under one entry point.
- Use repository contracts to isolate upper layers from ORM details.
- Use an `infra` adapter layer to isolate app semantics from storage semantics.
- Prefer async SQLAlchemy to fit the modern async application stack.
- Keep database dialect differences contained in shared base types and driver builders.
- Keep actor / user visibility rules in app infra rather than hard-coding them into ORM models.
That means this is not meant to be a full business data layer. It is a composable low-level persistence package.
12. Scope and Boundaries
What the current storage package is responsible for:
- Database connection parameters and initialization
- LangGraph checkpointer integration
- ORM base model conventions
- Core DeerFlow persistence models
- Repository contracts and database implementations
What it is not responsible for:
- FastAPI route protocols
- Authentication and authorization
- Business workflow orchestration
- Actor context binding
- SSE / stream-bridge network communication
- Higher-level facade semantics
Those responsibilities live in app.gateway, app.plugins.auth, deerflow.runtime, and app.infra.
13. Example End-to-End Call Chains
For "create a run and then update its state when execution finishes", the chain looks like this:
```
HTTP POST /api/threads/{thread_id}/runs
  -> gateway router
  -> RunsFacade
  -> AppRunCreateStore
  -> RunStoreAdapter
  -> build_run_repository(session)
  -> DbRunRepository
  -> runs table

execution completes
  -> runtime emits lifecycle event
  -> StorageRunObserver
  -> RunStoreAdapter / ThreadMetaStorage
  -> DbRunRepository / DbThreadMetaRepository
  -> runs / thread_meta tables
```
For "query the messages of a run", the chain looks like this:
```
HTTP GET /api/threads/{thread_id}/runs/{run_id}/messages
  -> gateway router
  -> get run_event_store from app.state
  -> AppRunEventStore
  -> build_run_event_repository(session)
  -> DbRunEventRepository
  -> run_events table
```
14. Summary
In one sentence, the role of backend/packages/storage in the current system is:
It is DeerFlow's database and persistence foundation, unifying database integration, ORM models, repository contracts, database implementations, and LangGraph checkpointer integration; the app layer then turns those low-level capabilities into thread, run, event, and feedback semantics through infra, and exposes them through HTTP routes and the runtime event system.