From pdf-to-text
Extract clean text and markdown from any PDF. Fixes broken Unicode mappings that make Chrome's copy-paste produce gibberish. Returns plain text, basic markdown, or structured markdown with TOC, headings, page markers, and token counts. Works locally via WASM — your PDFs never leave your machine.
How this skill is triggered — by the user, by Claude, or both
Slash command
/pdf-to-text:extract-pdfThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
```bash
_PLUGIN_DIR="${CLAUDE_PLUGIN_ROOT:-$(cd "$(dirname "$0")/.." 2>/dev/null && pwd || echo "$HOME/.config/pdf-to-text")}"
_UPD=$("$_PLUGIN_DIR/bin/update-check" 2>/dev/null || true)
[ -n "$_UPD" ] && echo "$_UPD" || true
If UPGRADE_AVAILABLE <old> <new> is output: tell the user a new version is available and ask if they want to upgrade. If yes, run $_PLUGIN_DIR/hooks/install-engine.sh. If JUST_UPGRADED <old> <new>: tell the user "PDF to Text engine updated to v{new}!" and continue.
A local PDF-to-markdown extraction engine with a 7-level
fallback cascade that recovers text from PDFs whose embedded Unicode
mappings are broken or missing. It outputs three markdown formats:
plain, basic (one # Page N per page), and structured
(YAML frontmatter + TOC + font-detected headings + fenced code blocks).
Use this skill when you need clean text from a PDF, not when you need the visual layout. For tables, figures, and images, look elsewhere.
Use glyph-api when:
.pdf and wants the contentcat/pdftotext produced garbage on a
particular PDFDo NOT use glyph-api for:
camelot, tabula-py, pdfplumber).tesseract or a cloud OCR service first, then feed the resulting
text-layered PDF to glyph-api.Four invocation paths, ordered by agent-friendliness.
This plugin registers MCP tools automatically. Use them directly:
extract_pdf — pass url or path, get plain/basic/structured markdown + statsrender_markdown — fetch and parse any .md URL, get sections + token countlist_recent — see recently extracted PDFs from the local cacheUse the extract_pdf tool with path: "/path/to/document.pdf" and format: "structured"
If you're driving a browser (Claude in Chrome, Playwright, Puppeteer, CDP):
// Navigate to any .pdf URL. The extension's DNR rule intercepts and
// redirects to the viewer. The viewer runs the WASM extractor and
// publishes the result to a page-world global.
await navigate("https://example.com/document.pdf");
// Wait for extraction to finish.
await waitFor(() => window.__glyph?.status === "ready");
// Read the three markdown formats (all pre-computed).
const plain = window.__glyph.markdown.plain;
const basic = window.__glyph.markdown.basic;
const structured = window.__glyph.markdown.structured;
The viewer also fires a glyph:status CustomEvent and sets
document.body.dataset.glyphStatus to loading / ready / error, so
agents that prefer selector-based waits can use:
await waitForSelector('body[data-glyph-status="ready"]');
Iframe-embedding pages receive a postMessage with {type: "glyph:status", status, markdown} once ready.
Requirements: the Glyph Chrome extension must be installed (unpacked
or from the Chrome Web Store). The extension ID is stable for a given
store listing; print it from any viewer tab via chrome.runtime.id in
the DevTools console.
cd /Users/jordan/Code/glyph-api
./target/release/glyph-api path/to/document.pdf
Outputs plain text to stdout. One form-feed (\f) between pages. No
frontmatter, no headings, no markdown syntax — just the extracted text.
To build the binary from source (~5s):
cd /Users/jordan/Code/glyph-api
cargo build --release --bin glyph-api
import initWasm, { extract_chars_with_positions } from "./pkg/glyph_api.js";
import { readFileSync } from "node:fs";
await initWasm({ module_or_path: readFileSync("./pkg/glyph_api_bg.wasm") });
const pdfBytes = readFileSync("document.pdf");
const json = extract_chars_with_positions(pdfBytes);
const parsed = JSON.parse(json);
// parsed.pages[i].text — per-page extracted text
// parsed.pages[i].chars — per-char position metadata (for spatial queries)
// parsed.total_chars, total_resolved, total_unresolved
To also get the three markdown formats, import the markdown module:
import {
toPlain,
toBasicMarkdown,
toStructuredMarkdown,
} from "./extension/dist/markdown.js";
const structured = toStructuredMarkdown(parsed, {
srcUrl: "https://example.com/document.pdf",
});
<page 1 text>
<page 2 text>
...
No markdown. Blank line between pages. Use for embeddings and keyword search.
# Page 1
<escaped page 1 text>
---
# Page 2
<escaped page 2 text>
Page headers, minimal escaping. Use when you want page-boundary awareness without heading detection.
---
source: https://example.com/document.pdf
pages: 9
chars: 21146
tokens: ~5.3k
headings: 13
extracted_at: 2026-04-11T17:00:00.000Z
extractor: glyph-api
---
## Table of Contents
- Bitcoin: A Peer-to-Peer Electronic Cash System _(p.1)_
- 1. Introduction _(p.1)_
- 2. Transactions _(p.2)_
...
<!-- page 1 -->
## Bitcoin: A Peer-to-Peer Electronic Cash System
Satoshi Nakamoto
[email protected]
www.bitcoin.org
## Abstract
A purely peer-to-peer version of electronic cash...
## 1. Introduction
Commerce on the Internet has come to rely...
Notable features:
<!-- page N --> markers — invisible when rendered, queryable for
per-page chunking in RAG pipelines^\d+\.\s+[A-Z])#include), Python (def), math formulas
(∑ ⋅ ≤), and diagram labels — keeps GitHub's markdown renderer from
choking on themdense_20p_70l_r2.pdf
contains the same sentence repeated 1400 times by design. That's the
input, not an extraction bug.fi ligature bug — papers using ligatures (fi, fl,
ffi, ffl) may emit raw glyph codes like 002gures instead of
figures. Future work: extend the L3/L4 cascade to recognize ligature
glyph names.Tj operator produce Form10 4 0 2025 instead of Form 1040 (2025). Future work: detect tight-spacing runs and apply word grouping.110TH CONGRESS set with positive
letter-tracking extracts as 110 TH C ONGRESS. Future work: detect
heading-level letter-spacing and collapse.\``textfor legibility but won't render via MathJax. Future work:$$...$$` wrapping
for paragraphs where symbol density is high enough.If you're unsure whether glyph-api improved on Chrome's native extraction
for a given PDF, look at the frontmatter's headings: count and the
structured output's TOC — both should be non-empty for any real
document. For the famous "Attention Is All You Need" paper (arxiv
1706.03762), the structured output has:
docmap --type code --lang c on the
appendixIf the output has headings: 0 and no TOC, the PDF likely has unusual
content structure — try feeding it back and reporting the file shape.
If you have docmap (v0.4.0+) installed, you can verify the output's
structural integrity:
docmap extracted.structured.md # section tree
docmap extracted.structured.md --type code --lang c # find C blocks
docmap extracted.structured.md --type math # find math blocks
docmap extracted.structured.md --json | jq '.documents[0].sections'
This gives independent confirmation that the extracted markdown is semantically navigable — agents can jump to sections, count constructs, and reason about document structure.
npx claudepluginhub jordancoin/pdf-to-text --plugin pdf-to-textCreates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.