From sci-brain
Adds arXiv IDs or DOIs to a knowledge base by fetching metadata, downloading PDFs (with SciHub fallback), rendering to markdown, regenerating INDEX.md, and appending to references.bib.
Slash command
/sci-brain:download-ref
- A discussion / draft surfaces a paper not yet in the project KB, and you want it indexed for future search.
references.bib.
Do NOT use:
The renderer uses pymupdf4llm for highest-fidelity output (preserves figures). The fallbacks (markitdown, then pdftotext) are text-only, so figures are silently dropped. Verify before fetching:
python3 -c "import pymupdf4llm; print('ok', pymupdf4llm.__version__)"
If that errors, install for the same python3 the helpers will use:
# macOS / Homebrew Python
/opt/homebrew/bin/python3 -m pip install --user --break-system-packages pymupdf4llm
# Linux / system Python
python3 -m pip install --user pymupdf4llm
The Sci-Hub fallback (Step 4b) drives a real browser to clear the mirrors' DDoS-Guard challenge, so it needs Playwright with a Chromium. Only required if you expect to hit paywalled DOIs:
python3 -m pip install --user playwright && python3 -m playwright install chromium
- arXiv IDs (e.g. 1806.08734, 2006.10739): strip the vN suffix.
- DOIs (e.g. 10.1103/PhysRevLett.130.036401): lowercase preferred; renderer normalizes.

download-ref writes:
- $KB/.raw/{arxiv,doi}/<id>.{json,pdf}
- $KB/.figures/{arxiv__<id>,doi__<safe>}/...
- $KB/<id>_<slug>.md (rendered paper, one per ref)
- $KB/INDEX.md (regenerated each run)
- $KB/references.bib

download-ref never touches:
- $KB/NOTES.md: owned by survey / researchstyle / humans (sub-themes, open problems, bottlenecks).

The canonical bib is $KB/references.bib. It lives inside the KB, beside INDEX.md and NOTES.md. (Older notes may say $(dirname $KB)/ref.bib; that project-root path is retired.)
If the caller passes --kb <abs-path>, use that. Otherwise:
KB=$(python3 skills/download-ref/helpers/resolve_kb.py)
if [ -z "$KB" ]; then
  # resolve_kb printed "unresolvable from ..." to stderr and exited 2.
  # Ask the user via AskUserQuestion where the KB should live.
  exit 1
fi
For advisor flows (/incarnate, /brainstorm-ideas with a selected advisor), resolve the advisor KB instead: KB=$(python3 skills/download-ref/helpers/resolve_kb.py --advisor <slug>). This honors $SCIBRAIN_KB_DIRNAME the same way the project-KB form does.
for id in 1806.08734 2006.10739; do
  [ -f "$KB/.raw/arxiv/$id.json" ] && echo "$id present" || echo "$id missing"
done
for doi in 10.1103/PhysRevLett.130.036401; do
  safe=$(echo "$doi" | tr '/' '-')
  [ -f "$KB/.raw/doi/$safe.json" ] && echo "$doi present" || echo "$doi missing"
done
Helpers are idempotent — this check is for human-readable status, not gating.
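The status check above can also be sketched in Python, which makes the ID normalization explicit. This is a minimal illustration, not the helpers' actual code: the function names are hypothetical, and only the rules stated in this document are used (strip the arXiv prefix and vN suffix; replace / with - in DOIs).

```python
import re
from pathlib import Path


def normalize_arxiv_id(raw):
    """Strip an 'arXiv:' prefix and a trailing version suffix such as 'v2'."""
    bare = re.sub(r"^arxiv:", "", raw.strip(), flags=re.IGNORECASE)
    return re.sub(r"v\d+$", "", bare)


def doi_to_safe(doi):
    """Mirror the shell check: replace '/' with '-' to get a file-safe name."""
    return doi.strip().replace("/", "-")


def raw_status(kb, arxiv_ids=(), dois=()):
    """Map each identifier to True/False: does its metadata JSON exist?"""
    status = {}
    for raw in arxiv_ids:
        aid = normalize_arxiv_id(raw)
        status[aid] = (kb / ".raw" / "arxiv" / f"{aid}.json").is_file()
    for doi in dois:
        safe = doi_to_safe(doi)
        status[doi] = (kb / ".raw" / "doi" / f"{safe}.json").is_file()
    return status
```

Like the shell loop, this only reports status. It does not gate anything, since the helpers skip work that is already done.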
3a. Direct input (single-shot mode):
TMP=/tmp/download-ref-manifest.json
cat > "$TMP" <<'EOF'
{"arxiv": ["1806.08734", "2006.10739"], "doi": []}
EOF
3b. From an existing references.bib (bulk mode, --from-bib):
TMP=/tmp/download-ref-manifest.json
python3 skills/download-ref/helpers/bibtex_to_manifest.py "$KB/references.bib" > "$TMP"
When in bulk mode, optionally ask the user:
"I see 59 refs in the manifest. Render all, topic-filtered, or specific IDs?"
- (a) All: proceed with the full manifest
- (b) Topic-filtered: name a heading from NOTES.md (skill greps for cite keys under it)
- (c) Specific IDs: paste arXiv IDs / DOIs
For (b) and (c), edit $TMP accordingly before continuing.
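Editing the manifest for options (b) and (c) amounts to filtering its two lists. A minimal sketch, assuming only the manifest shape shown in Step 3a ({"arxiv": [...], "doi": [...]}); the function names are hypothetical and not part of the skill's helpers.

```python
import json
from pathlib import Path


def filter_manifest(manifest, keep_ids):
    """Keep only the selected arXiv IDs and DOIs, preserving their order."""
    return {
        "arxiv": [i for i in manifest.get("arxiv", []) if i in keep_ids],
        "doi": [d for d in manifest.get("doi", []) if d in keep_ids],
    }


def rewrite_manifest(path, keep_ids):
    """Load the manifest at `path`, filter it, and write it back in place."""
    manifest = json.loads(Path(path).read_text())
    filtered = filter_manifest(manifest, set(keep_ids))
    Path(path).write_text(json.dumps(filtered, indent=2) + "\n")
    return filtered
```

For example, rewrite_manifest("/tmp/download-ref-manifest.json", {"1806.08734"}) would leave a manifest with that single arXiv ID and an empty DOI list.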
python3 skills/download-ref/helpers/fetch_metadata.py \
--kb "$KB" \
--manifest "$TMP" \
--download-arxiv-pdfs
Populates $KB/.raw/{arxiv,doi}/<id>.{json,pdf} idempotently. PDFs are downloaded sequentially with 2s sleep between requests to avoid arXiv rate limits. Each PDF is verified for a %%EOF trailer; truncated downloads are discarded and retried. For DOIs whose publisher gates the PDF (APS / Nature / IOP / AAAS / ACS), the helper falls back to the arXiv preprint via externalIds.ArXiv when present. If even that fails, you'll see a miss line — go to Step 4b.
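The trailer check and retry described above might look roughly like the following. This is a sketch under stated assumptions, not fetch_metadata.py itself: the download is abstracted as a caller-supplied `fetch` callable, and the retry count and the 1024-byte tail window are illustrative choices.

```python
import time
from pathlib import Path


def looks_complete(pdf_bytes, tail=1024):
    """A complete PDF carries a %%EOF marker near the end of the file."""
    return b"%%EOF" in pdf_bytes[-tail:]


def download_verified(fetch, dest, retries=3, pause=2.0):
    """Call fetch() until it returns a complete PDF, then write it to dest.

    Returns True if dest exists afterwards. Truncated downloads are
    discarded (never written) and retried after `pause` seconds.
    """
    dest = Path(dest)
    if dest.is_file():  # idempotent: keep an existing file
        return True
    for attempt in range(retries):
        data = fetch()
        if looks_complete(data):
            dest.write_bytes(data)
            return True
        if attempt < retries - 1:
            time.sleep(pause)  # be polite between requests
    return False
```

Writing the file only after the check passes means a truncated download never lands in .raw/, so a later run cannot mistake it for a finished one.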
Tip: Set SEMANTIC_SCHOLAR_API_KEY in your environment to raise the Semantic Scholar rate limit from ~1 req/s to 100 req/s. Get a free key at https://www.semanticscholar.org/product/api#api-key-form.
If Step 4 reports miss for any DOI (no open-access PDF and no arXiv preprint),
run the browser-based Sci-Hub helper. Pass the missed DOIs:
python3 skills/download-ref/helpers/scihub_download.py --kb "$KB" \
--doi 10.1111/j.1467-9280.2006.01693.x \
--doi 10.3102/0034654316689306
It tries each mirror in helpers/scihub_domains.toml (in order) until one
serves the PDF, solving the mirrors' DDoS-Guard JavaScript challenge with a
headless browser, and saves to $KB/.raw/doi/<safe>.pdf (<safe> = DOI with
/ → -) — the same place Step 4 writes, so Step 5 (render) picks it up. It
prints one OK / MISS / SKIP line per DOI.
MISS, the domain list is likely stale: web-search "working sci-hub mirror domains" and edit helpers/scihub_domains.toml (see its header), then re-run.
--headed.

Skip this step if all PDFs were fetched in Step 4.
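The try-each-mirror-in-order behaviour of Step 4b can be sketched as below. The browser work is deliberately left out: `fetch` is a hypothetical callable that takes a mirror and a DOI and returns PDF bytes or None. Treating SKIP as "PDF already on disk" is also an assumption, since the document names the three statuses without defining them.

```python
from pathlib import Path


def try_mirrors(doi, mirrors, fetch):
    """Return the first PDF any mirror serves, trying mirrors in list order."""
    for mirror in mirrors:
        try:
            pdf = fetch(mirror, doi)
        except Exception:
            continue  # mirror unreachable or challenge failed: try the next
        if pdf:
            return pdf
    return None


def fetch_missing(dois, mirrors, fetch, dest_dir):
    """Return {doi: 'OK' | 'MISS' | 'SKIP'} and save PDFs as <safe>.pdf."""
    results = {}
    for doi in dois:
        dest = Path(dest_dir) / (doi.replace("/", "-") + ".pdf")
        if dest.is_file():
            results[doi] = "SKIP"  # assumed meaning: already downloaded
            continue
        pdf = try_mirrors(doi, mirrors, fetch)
        if pdf is None:
            results[doi] = "MISS"
        else:
            dest.write_bytes(pdf)
            results[doi] = "OK"
    return results
```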
python3 skills/download-ref/helpers/render.py --kb "$KB"
Add --only-missing to skip papers that already have a rendered .md file (>500 bytes). This is much faster when adding a few papers to a large KB:
python3 skills/download-ref/helpers/render.py --kb "$KB" --only-missing
No manifest needed — renderer auto-discovers .raw/{arxiv,doi}/*.json. Renders new entries; overwrites existing.
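The auto-discovery and the --only-missing rule described above can be approximated as follows. This is a sketch, not render.py: it assumes the rendered filename starts with the JSON file's stem followed by an underscore, which matches the <id>_<slug>.md pattern for arXiv IDs but is a guess for DOIs.

```python
import glob
from pathlib import Path

MIN_BYTES = 500  # a rendered file must be larger than this to count


def needs_render(kb, ref_id):
    """True unless some <ref_id>_*.md larger than MIN_BYTES already exists."""
    pattern = glob.escape(ref_id) + "_*.md"
    for md in Path(kb).glob(pattern):
        if md.stat().st_size > MIN_BYTES:
            return False
    return True


def pending_refs(kb):
    """Discover .raw/{arxiv,doi}/*.json and list the refs still to render."""
    pending = []
    for kind in ("arxiv", "doi"):
        for meta in sorted((Path(kb) / ".raw" / kind).glob("*.json")):
            if needs_render(kb, meta.stem):
                pending.append((kind, meta.stem))
    return pending
```

The size threshold means a stub or failed render (an almost empty .md) is treated as missing and rendered again.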
PDF backend priority:
1. pymupdf4llm: markdown + extracted images into $KB/.figures/.
2. markitdown: text-only fallback.
3. pdftotext -layout: last-resort fallback.

.raw/ and .figures/ should stay out of git. Append to .gitignore if missing.
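The backend priority above boils down to "use the first tool that is available". A minimal sketch of that selection, assuming availability is detected by import lookup for the two Python packages and by a PATH lookup for pdftotext; how render.py actually detects them is not stated here.

```python
import importlib.util
import shutil


def _has_module(name):
    """True if the Python package can be imported in this interpreter."""
    return importlib.util.find_spec(name) is not None


def _has_binary(name):
    """True if the executable is on PATH."""
    return shutil.which(name) is not None


def pick_backend(has_module=_has_module, has_binary=_has_binary):
    """Return the highest-priority available backend name, or None."""
    if has_module("pymupdf4llm"):
        return "pymupdf4llm"  # markdown plus extracted figures
    if has_module("markitdown"):
        return "markitdown"   # text-only fallback
    if has_binary("pdftotext"):
        return "pdftotext"    # last resort, run with -layout
    return None
```

Calling pick_backend() before a large render is a quick way to notice that figures will be missing because only a text-only backend is installed.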
In single-shot mode (Step 3a), ask the user to confirm each new cite key. In bulk mode (Step 3b), the keys come from references.bib directly — skip this step.
python3 skills/download-ref/helpers/append_bibtex.py propose \
--kb "$KB" --id 1806.08734 --type arxiv
Output JSON has proposed_key (form lastname_year_firstkeyword), title, authors, year, bibtex_with_proposed_key. Show the user via AskUserQuestion:
Once confirmed:
python3 skills/download-ref/helpers/append_bibtex.py append \
--kb "$KB" --id 1806.08734 --type arxiv \
--key rahaman_2018_spectral \
--bib "$KB/references.bib"
The helper rewrites the BibTeX cite key, refuses duplicates, appends with one blank-line separator.
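The proposed_key form (lastname_year_firstkeyword) and the duplicate refusal can be illustrated with a small sketch. The stopword list, the ASCII folding, and the regex used to read existing keys are assumptions made for illustration; append_bibtex.py may apply different rules.

```python
import re
import unicodedata

# Assumed stopword list: words skipped when picking the first title keyword.
STOPWORDS = {"a", "an", "the", "on", "of", "in", "for", "and", "to",
             "with", "via", "from", "by", "at", "is", "are"}


def _ascii_lower(text):
    """Fold accents to ASCII and lowercase (e.g. 'Müller' -> 'muller')."""
    norm = unicodedata.normalize("NFKD", text)
    return norm.encode("ascii", "ignore").decode("ascii").lower()


def propose_key(lastname, year, title):
    """Build a key of the form lastname_year_firstkeyword."""
    last = re.sub(r"[^a-z]", "", _ascii_lower(lastname))
    words = re.findall(r"[a-z0-9]+", _ascii_lower(title))
    keyword = next((w for w in words if w not in STOPWORDS), "untitled")
    return f"{last}_{year}_{keyword}"


def existing_keys(bib_text):
    """Collect the cite keys of all entries in a BibTeX string."""
    return set(re.findall(r"@\w+\s*\{\s*([^,\s]+)\s*,", bib_text))


def is_duplicate(key, bib_text):
    """True if the key is already present, so an append should be refused."""
    return key in existing_keys(bib_text)
```

With these rules, propose_key("Rahaman", 2018, "On the Spectral Bias of Neural Networks") yields rahaman_2018_spectral, the key used in the append command above.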
python3 skills/download-ref/helpers/index.py \
--kb "$KB" \
--title "<project-or-advisor-slug> — references" \
--source-note "Reading list and full-text harness."
Replace <project-or-advisor-slug> with this KB's name. Once chosen, keep --title and --source-note byte-identical across runs — INDEX.md is regenerated wholesale every time; drift causes noisy diffs.
# New md files appear at top level
ls -t "$KB"/*.md | head
# Frontmatter present
for f in "$KB"/*.md; do
  case "$(basename "$f")" in INDEX.md|NOTES.md) continue ;; esac
  head -1 "$f" | grep -q '^---$' || echo "MISSING FRONTMATTER: $f"
done
# Raw blobs gitignored
KB_NAME=$(basename "$KB")
git -C "$(dirname "$KB")" check-ignore "$KB_NAME/.raw/" 2>/dev/null \
|| echo "WARN: $KB_NAME/.raw/ not gitignored"
# INDEX picked up the new ids
for id in 1806.08734 2006.10739; do
grep -q "$id" "$KB/INDEX.md" || echo "WARN: $id missing from INDEX.md"
done
Tell the user: new cite key(s), rendered file path(s), full_text yes/no per ref.
After the done checklist passes, offer the pipeline's final stage:
"Papers downloaded and rendered. Write the review?"
- (a) Write a review: invokes survey-writer to produce a technology assessment from the rendered KB. This is the final stage of the survey → download-ref → survey-writer pipeline.
- (b) Done: stop here.
- /survey (upstream): writes/extends $KB/NOTES.md, appends to $KB/references.bib, regenerates $KB/INDEX.md, then hands off to /download-ref to fetch PDFs and render full text. The survey's transition checkpoint offers this directly.
- /survey-writer (downstream): consumes the rendered KB (full-text .md files + $KB/references.bib) to produce a structured technology assessment report.
- /survey / /researchstyle: write their own .raw/ JSON via batched fetches and call append_bibtex.py directly (skipping the per-ref confirmation in Step 6). They invoke index.py at the end of their run.
- /brainstorm-ideas end-of-session: surfaces candidate IDs/DOIs from the conversation; for the user's selections, invokes /download-ref in single-shot mode.
- /incarnate: invokes /download-ref (or /researchstyle) targeting the advisor KB resolved by python3 skills/download-ref/helpers/resolve_kb.py --advisor <slug>.

| Mistake | Fix |
|---|---|
| Passing a relative --kb | Always absolute. Helpers don't cd; figures depend on absolute paths. |
| Forgetting --download-arxiv-pdfs in Step 4 | Without it, full_text: no and Step 5 has nothing to render. |
| Using arXiv:XXXX with prefix or vN suffix | Strip both; the manifest takes bare ids: 1806.08734. |
| Editing the rendered .md and losing it on re-render | Renderer overwrites without warning. Edit .raw/ source or renderer logic. |
| Cite-key collision with different content | Helper skips silently; investigate, then re-run propose with a different key. |
| Drifting --title / --source-note between runs | INDEX.md regenerates wholesale; first-run values are canonical. Copy verbatim from existing INDEX.md. |
- .raw/{arxiv,doi}/<id>.json exists for every requested id
- .raw/{arxiv,doi}/<id>.pdf exists where the source allows (else recorded as miss)
- <id>_<slug>.md per ref at $KB/ root, with frontmatter
- $KB/INDEX.md regenerated, lists each new entry
- $KB/references.bib has the new cite key (no duplicate)
- full_text yes/no per ref

npx claudepluginhub quantumbfs/sci-brain --plugin sci-brain