mirror of
https://github.com/bytedance/deer-flow.git
synced 2026-04-25 11:18:22 +00:00
* test(skills): add trigger eval set for systematic-literature-review skill

  20 eval queries (10 should-trigger, 10 should-not-trigger) for use with
  skill-creator's run_eval.py. Includes real-world SLR queries contributed
  by @VANDRANKI (issue #1862 author) and edge cases for routing
  disambiguation with academic-paper-review.

* test(skills): add grader expectations for SLR skill evaluation

  5 eval cases with 39 expectations covering:
  - Standard SLR flow (APA/BibTeX/IEEE format selection)
  - Keyword extraction and search behavior
  - Subagent dispatch for metadata extraction
  - Report structure (themes, convergences, gaps, per-paper annotations)
  - Negative case: single-paper routing to academic-paper-review
  - Edge case: implicit SLR without explicit keywords

* refactor(skills): shorten SLR description for better trigger rate

  Reduce the description from 833 to 344 chars. Key changes:
  - Lead with "systematic literature review" as the primary trigger phrase
  - Strengthen the single-paper exclusion: "Not for single-paper tasks"
  - Remove verbose example patterns that didn't improve routing

  Tested with run_eval.py (10 runs/query):
  - False positive "best paper on RL": 67% → 20% (improved)
  - True positive explicit SLR query: ~30% (unchanged)

  Low recall is a routing-layer limitation, not a description issue; see
  the PR description for the full analysis.

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
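The trigger-rate protocol above (run each eval query 10 times, count how often the skill is selected) can be sketched as follows. This is a hypothetical illustration, not the repo's actual run_eval.py: `measure_trigger_rates`, `fake_dispatch`, and the sample queries are all made up for this sketch, and the real harness dispatches to a model rather than a keyword stub.

```python
# Hypothetical sketch of the "10 runs/query" trigger-rate measurement.
# The dispatch function is a deterministic stand-in for the routing layer;
# the actual run_eval.py in the repository is not reproduced here.

RUNS_PER_QUERY = 10  # matches the "10 runs/query" protocol in the PR

def measure_trigger_rates(eval_set, dispatch, runs=RUNS_PER_QUERY):
    """For each (query, should_trigger) pair, return the fraction of runs
    in which dispatch() routed the query to the SLR skill, plus the label,
    so false positives and false negatives can be read off directly."""
    results = {}
    for query, should_trigger in eval_set:
        hits = sum(
            1 for _ in range(runs)
            if dispatch(query) == "systematic-literature-review"
        )
        results[query] = {"rate": hits / runs, "should_trigger": should_trigger}
    return results

def fake_dispatch(query):
    """Keyword stub standing in for the model's skill-routing decision."""
    if "literature review" in query.lower():
        return "systematic-literature-review"
    return "academic-paper-review"

# Two illustrative queries in the spirit of the eval set described above.
eval_set = [
    ("Do a systematic literature review on RLHF", True),
    ("What is the best paper on RL?", False),
]
rates = measure_trigger_rates(eval_set, fake_dispatch)
```

With a nondeterministic real router, the per-query rate is what the PR reports (e.g. a 20% trigger rate on a should-not-trigger query is a 20% false-positive rate); with the deterministic stub here, each rate is simply 0.0 or 1.0.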