He Wang e9deb6c2f2
perf(harness): push thread metadata filters into SQL (#2865)
* perf(harness): push thread metadata filters into SQL

Replace Python-side metadata filtering (5x overfetch + in-memory match)
with database-side json_extract predicates so LIMIT/OFFSET pagination
is exact regardless of match density.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* fix(harness): add dialect-aware JsonMatch compiler for type-safe metadata SQL filters

Replace SQLAlchemy JSON index/comparator APIs with a custom JsonMatch
ColumnElement that compiles to json_type/json_extract on SQLite and
jsonb_typeof/->>/-> on PostgreSQL. Tighten key validation regex to
single-segment identifiers, handle None/bool/numeric value types with
json_type-based discrimination, and strengthen test coverage for edge
cases and discriminability.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* fix(harness): address Copilot review comments on JSON metadata filters

- Use json_typeof instead of jsonb_typeof in PostgreSQL compiler; the
  metadata_json column is JSON not JSONB so jsonb_typeof would error at
  runtime on any PostgreSQL backend
- Align _is_safe_json_key with json_match's _KEY_CHARSET_RE so keys
  containing hyphens or leading digits are not silently skipped
- Add thread_id as secondary ORDER BY in search() to make pagination
  deterministic when updated_at values collide; remove asyncio.sleep
  from the pagination regression test

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* fix(harness): address remaining review comments on metadata SQL filters

- Remove _is_safe_json_key() and reuse json_match ValueError to avoid
  validator drift (Copilot #3217603895, #3217411616)
- Raise ValueError when all metadata keys are rejected so callers never
  get silent unfiltered results (WillemJiang)
- Fix integer precision: split int/float branches, bind int as Integer()
  with INTEGER/BIGINT CAST instead of float() coercion (Copilot #3217603972)
- Fix jsonb_typeof -> json_typeof on JSON column (Copilot #3217411579)
- Replace manual _cleanup() calls with async yield fixture so teardown
  always runs (Copilot #3217604019)
- Remove asyncio.sleep(0.01) pagination ordering; use thread_id secondary
  sort instead (Copilot #3217411636)
- Add type annotations to _bind/_build_clause/_compile_* and remove EOL
  comments from _Dialect fields (coding.mdc)
- Expand test coverage: boolean/null/mixed-type/large-int precision,
  partial unsafe-key skip with caplog assertion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(harness): address third-round Copilot review comments on JsonMatch

- Reject unsupported value types (list, dict, ...) in JsonMatch.__init__
  with TypeError so inherit_cache=True never receives an unhashable value
  and callers get an explicit error instead of silent str() coercion
  (Copilot #3217933201)
- Upgrade int bindparam from Integer() to BigInteger() to align with
  BIGINT CAST and avoid overflow on large integers (Copilot #3217933252)
- Catch TypeError alongside ValueError in search() so non-string metadata
  keys are warned and skipped rather than raising unexpectedly
  (Copilot #3217933300)
- Add three tests: json_match rejects unsupported value types, search()
  warns and raises on non-string key, search() warns and raises on
  unsupported value type

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(harness): address fourth-round Copilot review comments on JsonMatch

- Add CASE WHEN guard for PostgreSQL integer matching: json_typeof returns
  'number' for both ints and floats; wrap CAST in CASE with regex guard
  '^-?[0-9]+$' so float rows never trigger CAST error (Copilot #3218413860)
- Validate isinstance(key, str) before regex match in JsonMatch.__init__
  so non-string keys raise ValueError consistently instead of TypeError
  from re.match (Copilot #3218413900)
- Include exception message in metadata filter skip warning so callers
  can distinguish invalid key from unsupported value type (Copilot #3218413924)
- Update tests: assert CASE WHEN guard in PG int compilation, cover
  non-string key ValueError in test_json_match_rejects_unsafe_key

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(harness): align ThreadMetaStore.search() signature with sql.py implementation

Use `dict[str, Any]` for `metadata` and `list[dict[str, Any]]` as return
type in base class and MemoryThreadMetaStore to resolve an LSP signature
mismatch; also correct a test docstring that cited the wrong exception type.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(harness): surface InvalidMetadataFilterError as HTTP 400 in search endpoint

Replace bare ValueError with a domain-specific InvalidMetadataFilterError
(subclass of ValueError) so the Gateway handler can catch it and return
HTTP 400 instead of letting it bubble up as a 500.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* fix(harness): sanitize metadata keys in log output to prevent log injection

Use ascii() instead of %r to escape control characters in client-supplied
metadata keys before logging, preventing multiline/forged log entries.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* fix(harness): validate metadata filters at API boundary and dedupe key/value rules

- Add Pydantic ``field_validator`` on ``ThreadSearchRequest.metadata`` so
  unsafe keys / unsupported value types are rejected with HTTP 422 from
  both SQL and memory backends (closes Copilot review 3218830849).
- Export ``validate_metadata_filter_key`` / ``validate_metadata_filter_value``
  (and ``ALLOWED_FILTER_VALUE_TYPES``) from ``json_compat`` and have
  ``JsonMatch.__init__`` reuse them — the Gateway-side validator and the
  SQL-side ``JsonMatch`` constructor now share one admission rule and
  cannot drift.
- Format ``InvalidMetadataFilterError`` rejected-keys list as a
  comma-separated plain string instead of a Python list repr so the
  surfaced HTTP 400 detail is readable (closes Copilot review 3218830899).
- Update router tests to cover both 422 boundary paths plus the 400
  defense-in-depth path when a backend still raises the error.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(harness): harden JsonMatch compile-time key validation against __init__ bypass

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

* fix: address review feedback on metadata filter SQL push-down

- Add signed 64-bit range check to validate_metadata_filter_value; give
  out-of-range ints a distinct TypeError message.

- Replace assert guards in _compile_sqlite/_compile_pg with explicit
  if/raise so they survive python -O optimisation.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-12 23:21:22 +08:00
..