3 Commits

Author SHA1 Message Date
SHIYAO ZHANG
163121d327
fix(uploads): handle split-bold headings and ** ** artefacts in extract_outline (#1838)
* feat(uploads): guide agent to use grep/glob/read_file for uploaded documents

Add workflow guidance to the <uploaded_files> context block so the agent
knows to use grep and glob (added in #1784) alongside read_file when
working with uploaded documents, rather than falling back to web search.

This is the final piece of the three-PR PDF agentic search pipeline:
- PR1 (#1727): pymupdf4llm converter produces structured Markdown with headings
- PR2 (#1738): document outline injected into agent context with line numbers
- PR3 (this):  agent guided to use outline + grep + read_file workflow

* feat(uploads): add file-first priority and fallback guidance to uploaded_files context

* fix(uploads): handle split-bold headings and ** ** artefacts in extract_outline

- Add _clean_bold_title() to merge adjacent bold spans (** **) produced
  by pymupdf4llm when bold text crosses span boundaries
- Add _SPLIT_BOLD_HEADING_RE (Style 3) to recognise **<num>** **<title>**
  headings common in academic papers; excludes pure-number table headers
  and rows with more than 4 bold blocks
- When outline is empty, read first 5 non-empty lines of the .md as a
  content preview and surface a grep hint in the agent context
- Update _format_file_entry to render the preview + grep hint instead of
  silently omitting the outline section
- Add 3 new extract_outline tests and 2 new middleware tests (65 total)

* fix(uploads): address Copilot review comments on extract_outline regex

- Replace ASCII [A-Za-z] guard with negative lookahead to support non-ASCII
  titles (e.g. **1** **概述**); pure-numeric/punctuation blocks still excluded
- Replace .+ with [^*]+ and cap repetition at {0,2} (four blocks total) to
  keep _SPLIT_BOLD_HEADING_RE linear and avoid ReDoS on malformed input
- Remove now-redundant len(blocks) <= 4 code-level check (enforced by regex)
- Log debug message with exc_info when preview extraction fails
2026-04-04 14:25:08 +08:00
SHIYAO ZHANG
ddfc988bef
feat(uploads): add pymupdf4llm PDF converter with auto-fallback and async offload (#1727)
* feat(uploads): add pymupdf4llm PDF converter with auto-fallback and async offload

- Introduce pymupdf4llm as an optional PDF converter with better heading
  detection and table preservation than MarkItDown
- Auto mode: prefer pymupdf4llm when installed; fall back to MarkItDown
  when output is suspiciously sparse (image-based / scanned PDFs)
- Sparsity check uses chars-per-page (< 50 chars/page) rather than an
  absolute threshold, correctly handling both short and long documents
- Large files (> 1 MB) are offloaded to asyncio.to_thread() to avoid
  blocking the event loop (related: #1569)
- Add UploadsConfig with pdf_converter field (auto/pymupdf4llm/markitdown)
- Add pymupdf4llm as optional dependency: pip install deerflow-harness[pymupdf]
- Add 14 unit tests covering sparsity heuristic, routing logic, and async path

* fix(uploads): address Copilot review comments on PDF converter

- Fix docstring: MIN_CHARS_PYMUPDF -> _MIN_CHARS_PER_PAGE (typo)
- Fix file handle leak: wrap pymupdf.open in try/finally to ensure doc.close()
- Fix silent fallback gap: _convert_pdf_with_pymupdf4llm now catches all
  conversion exceptions (not just ImportError), so encrypted/corrupt PDFs
  fall back to MarkItDown instead of propagating
- Tighten type: pdf_converter field changed from str to Literal[auto|pymupdf4llm|markitdown]
- Normalize config value: _get_pdf_converter() strips and lowercases the raw
  config string, warns and falls back to 'auto' on unknown values
2026-04-03 21:59:45 +08:00
SHIYAO ZHANG
5ff230eafd
feat(uploads): inject document outline into agent context for converted files (#1738)
* feat(uploads): inject document outline into agent context for converted files

Extract headings from converted .md files and inject them into the
<uploaded_files> context block so the agent can navigate large documents
by line number before reading.

- Add `extract_outline()` to `file_conversion.py`: recognises standard
  Markdown headings (#/##/###) and SEC-style bold structural headings
  (**ITEM N. BUSINESS**, **PART II**); caps at 50 entries; excludes
  cover-page boilerplate (WASHINGTON DC, CURRENT REPORT, SIGNATURES)
- Add `_extract_outline_for_file()` helper in `uploads_middleware.py`:
  looks for a sibling `.md` file produced by the conversion pipeline
- Update `UploadsMiddleware._create_files_message()` to render the outline
  under each file entry with `L{line}: {title}` format and a `read_file`
  prompt for range-based reading
- Tests: 10 new tests for `extract_outline()`, 4 new tests for outline
  injection in `UploadsMiddleware`; existing test updated for new `outline`
  field in `uploaded_files` state

Partially addresses #1647 (agent ignores uploaded files).

* fix(uploads): stream outline file reads and strip inline bold from heading titles

- Switch extract_outline() from read_text().splitlines() to open()+line iteration
  so large converted documents are not loaded into memory on every agent turn;
  exits as soon as MAX_OUTLINE_ENTRIES is reached (Copilot suggestion)
- Strip **...** wrapper from standard Markdown heading titles before appending
  to outline so agent context stays clean (e.g. "## **Overview**" → "Overview")
  (Copilot suggestion)
- Remove unused pathlib.Path import and fix import sort order in test_file_conversion.py
  to satisfy ruff CI lint

* fix(uploads): show truncation hint when outline exceeds MAX_OUTLINE_ENTRIES

When extract_outline() hits the cap it now appends a sentinel entry
{"truncated": True} instead of silently dropping the rest of the headings.
UploadsMiddleware reads the sentinel and renders a hint line:

  ... (showing first 50 headings; use `read_file` to explore further)

Without this the agent had no way to know the outline was incomplete and
would treat the first 50 headings as the full document structure.

* fix(uploads): fall back to configurable.thread_id when runtime.context lacks thread_id

runtime.context does not always carry thread_id (depends on LangGraph
invocation path). ThreadDataMiddleware already falls back to
get_config().configurable.thread_id — apply the same pattern so
UploadsMiddleware can resolve the uploads directory and attach outlines
in all invocation paths.

* style: apply ruff format

---------

Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
2026-04-03 20:52:47 +08:00