SHIYAO ZHANG ddfc988bef
feat(uploads): add pymupdf4llm PDF converter with auto-fallback and async offload (#1727)
* feat(uploads): add pymupdf4llm PDF converter with auto-fallback and async offload

- Introduce pymupdf4llm as an optional PDF converter with better heading
  detection and table preservation than MarkItDown
- Auto mode: prefer pymupdf4llm when installed; fall back to MarkItDown
  when output is suspiciously sparse (image-based / scanned PDFs)
- Sparsity check uses chars-per-page (< 50 chars/page) rather than an
  absolute threshold, correctly handling both short and long documents
- Large files (> 1 MB) are offloaded to asyncio.to_thread() to avoid
  blocking the event loop (related: #1569)
- Add UploadsConfig with pdf_converter field (auto/pymupdf4llm/markitdown)
- Add pymupdf4llm as optional dependency: pip install deerflow-harness[pymupdf]
- Add 14 unit tests covering sparsity heuristic, routing logic, and async path

* fix(uploads): address Copilot review comments on PDF converter

- Fix docstring: MIN_CHARS_PYMUPDF -> _MIN_CHARS_PER_PAGE (typo)
- Fix file handle leak: wrap pymupdf.open in try/finally to ensure doc.close()
- Fix silent fallback gap: _convert_pdf_with_pymupdf4llm now catches all
  conversion exceptions (not just ImportError), so encrypted/corrupt PDFs
  fall back to MarkItDown instead of propagating
- Tighten type: pdf_converter field changed from str to Literal[auto|pymupdf4llm|markitdown]
- Normalize config value: _get_pdf_converter() strips and lowercases the raw
  config string, warns and falls back to 'auto' on unknown values
2026-04-03 21:59:45 +08:00

46 lines
1.1 KiB
TOML

[project]
name = "deerflow-harness"
version = "0.1.0"
description = "DeerFlow agent harness framework"
requires-python = ">=3.12"
dependencies = [
"agent-client-protocol>=0.4.0",
"agent-sandbox>=0.0.19",
"dotenv>=0.9.9",
"httpx>=0.28.0",
"kubernetes>=30.0.0",
"langchain>=1.2.3",
"langchain-anthropic>=1.3.4",
"langchain-deepseek>=1.0.1",
"langchain-mcp-adapters>=0.1.0",
"langchain-openai>=1.1.7",
"langfuse>=3.4.1",
"langgraph>=1.0.6,<1.0.10",
"langgraph-api>=0.7.0,<0.8.0",
"langgraph-cli>=0.4.14",
"langgraph-runtime-inmem>=0.22.1",
"markdownify>=1.2.2",
"markitdown[all,xlsx]>=0.0.1a2",
"pydantic>=2.12.5",
"pyyaml>=6.0.3",
"readabilipy>=0.3.0",
"tavily-python>=0.7.17",
"firecrawl-py>=1.15.0",
"tiktoken>=0.8.0",
"ddgs>=9.10.0",
"duckdb>=1.4.4",
"langchain-google-genai>=4.2.1",
"langgraph-checkpoint-sqlite>=3.0.3",
"langgraph-sdk>=0.1.51",
]
[project.optional-dependencies]
pymupdf = ["pymupdf4llm>=0.0.17"]
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[tool.hatch.build.targets.wheel]
packages = ["deerflow"]