Multi-format document parsing tools for PDF, DOCX, HTML, and Markdown with support for LlamaParse, Unstructured.io, PyPDF2, PDFPlumber, and python-docx. Use when parsing documents, extracting text from PDFs, processing Word documents, converting HTML to text, extracting tables from documents, building RAG pipelines, chunking documents, or when user mentions document parsing, PDF extraction, DOCX processing, table extraction, OCR, LlamaParse, Unstructured.io, or document ingestion.
Limited to specific tools

This skill is limited to using the following tools:

Additional assets for this skill:

- examples/parse-legal-document.py
- examples/parse-research-paper.py
- scripts/parse-docx.py
- scripts/parse-html.py
- scripts/parse-pdf.py
- scripts/setup-llamaparse.sh
- scripts/setup-unstructured.sh
- templates/multi-format-parser.py
- templates/table-extraction.py

Purpose: Autonomously parse and extract content from multiple document formats (PDF, DOCX, HTML, Markdown) using industry-standard libraries and AI-powered parsing tools.
Activation Triggers:
Key Resources:
- scripts/setup-llamaparse.sh - Install and configure LlamaParse (AI-powered parsing)
- scripts/setup-unstructured.sh - Install Unstructured.io library
- scripts/parse-pdf.py - Functional PDF parser with multiple backend options
- scripts/parse-docx.py - DOCX document parser
- scripts/parse-html.py - HTML to structured text parser
- templates/multi-format-parser.py - Universal document parser template
- templates/table-extraction.py - Specialized table extraction template
- examples/parse-research-paper.py - Research paper parsing with citations
- examples/parse-legal-document.py - Legal document parsing with sections

LlamaParse

Best For:
Pros:
Cons:
Documentation: https://docs.cloud.llamaindex.ai/llamaparse
Setup:
./scripts/setup-llamaparse.sh
Usage Pattern:
from llama_parse import LlamaParse
parser = LlamaParse(
    api_key="llx-...",
    result_type="markdown",  # or "text"
    language="en",
    verbose=True
)

documents = parser.load_data("document.pdf")
for doc in documents:
    print(doc.text)
Unstructured.io

Best For:
Pros:
Cons:
Documentation: https://unstructured-io.github.io/unstructured/
Setup:
./scripts/setup-unstructured.sh
Usage Pattern:
from unstructured.partition.auto import partition
elements = partition("document.pdf")
for element in elements:
    print(f"{element.category}: {element.text}")
PyPDF2

Best For:
Pros:
Cons:
Documentation: https://github.com/py-pdf/pypdf2
Setup:
pip install pypdf2
Usage Pattern:
from PyPDF2 import PdfReader
reader = PdfReader("document.pdf")
for page in reader.pages:
    print(page.extract_text())
PDFPlumber

Best For:
Pros:
Cons:
Documentation: https://github.com/jsvine/pdfplumber
Setup:
pip install pdfplumber
Usage Pattern:
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        text = page.extract_text()
python-docx

Best For:
Pros:
Cons:
Documentation: https://github.com/python-openxml/python-docx
Setup:
pip install python-docx
Usage Pattern:
from docx import Document
doc = Document("document.docx")
for para in doc.paragraphs:
    print(para.text)

for table in doc.tables:
    for row in table.rows:
        print([cell.text for cell in row.cells])
| Use Case | Recommended Parser | Alternative |
|---|---|---|
| Simple PDF text extraction | PyPDF2 | Unstructured |
| Complex PDFs with tables | LlamaParse | PDFPlumber |
| Scanned documents (OCR) | LlamaParse | Unstructured + Tesseract |
| Word documents (.docx) | python-docx | Unstructured |
| HTML to text | parse-html.py | Unstructured |
| Multi-format batch processing | Unstructured | Multi-format-parser |
| Table extraction | PDFPlumber | LlamaParse |
| Research papers | LlamaParse | Unstructured |
| Legal documents | LlamaParse | PDFPlumber |
| Production RAG pipeline | Unstructured | LlamaParse |
PDF Parser (scripts/parse-pdf.py)

Command-line PDF parser supporting multiple backends:
# Using PyPDF2 (default)
python scripts/parse-pdf.py document.pdf
# Using PDFPlumber (better for tables)
python scripts/parse-pdf.py document.pdf --backend pdfplumber
# Using LlamaParse (AI-powered)
python scripts/parse-pdf.py document.pdf --backend llamaparse --api-key llx-...
# Output to file
python scripts/parse-pdf.py document.pdf --output output.txt
# Extract tables as JSON
python scripts/parse-pdf.py document.pdf --backend pdfplumber --tables-only --output tables.json
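The script's internals are not reproduced here, but a minimal sketch of how a backend-dispatching CLI like this might be structured looks as follows. The flag names mirror the examples above; the function names and defaults are illustrative assumptions, not the actual scripts/parse-pdf.py.

# Hypothetical sketch of a multi-backend PDF parsing CLI (not the real script).
import argparse
import json


def parse_with_pypdf2(path):
    from PyPDF2 import PdfReader
    reader = PdfReader(path)
    # extract_text() may return None for image-only pages, so substitute ""
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def parse_with_pdfplumber(path, tables_only=False):
    import pdfplumber
    texts, tables = [], []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            texts.append(page.extract_text() or "")
            tables.extend(page.extract_tables())
    return tables if tables_only else "\n".join(texts)


def main():
    ap = argparse.ArgumentParser(description="Parse a PDF with a chosen backend")
    ap.add_argument("pdf")
    ap.add_argument("--backend", choices=["pypdf2", "pdfplumber"], default="pypdf2")
    ap.add_argument("--tables-only", action="store_true")
    ap.add_argument("--output")
    args = ap.parse_args()

    if args.backend == "pdfplumber":
        result = parse_with_pdfplumber(args.pdf, tables_only=args.tables_only)
    else:
        result = parse_with_pypdf2(args.pdf)

    rendered = json.dumps(result) if args.tables_only else result
    if args.output:
        with open(args.output, "w", encoding="utf-8") as f:
            f.write(rendered)
    else:
        print(rendered)


if __name__ == "__main__":
    main()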
Features:
DOCX Parser (scripts/parse-docx.py)

Word document parser with structure preservation:
# Basic extraction
python scripts/parse-docx.py document.docx
# Extract with structure
python scripts/parse-docx.py document.docx --preserve-structure
# Extract tables only
python scripts/parse-docx.py document.docx --tables-only
# Output as JSON
python scripts/parse-docx.py document.docx --output output.json --format json
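The basic paragraph and table loops shown earlier lose the original ordering of content. A structure-preserving pass typically walks the document body in order; this sketch uses a well-known python-docx recipe and is an assumption about how such a mode works, not the script's actual code.

# Sketch: walk a DOCX body in document order, yielding paragraphs and tables interleaved.
from docx import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import Table
from docx.text.paragraph import Paragraph


def iter_block_items(doc):
    # Yield Paragraph and Table objects in the order they appear in the body
    for child in doc.element.body.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, doc)
        elif isinstance(child, CT_Tbl):
            yield Table(child, doc)


doc = Document("document.docx")
for block in iter_block_items(doc):
    if isinstance(block, Paragraph):
        print(f"[{block.style.name}] {block.text}")
    else:
        for row in block.rows:
            print([cell.text for cell in row.cells])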
Features:
HTML Parser (scripts/parse-html.py)

HTML to clean text converter:
# Basic HTML parsing
python scripts/parse-html.py document.html
# From URL
python scripts/parse-html.py https://example.com/article
# Preserve links
python scripts/parse-html.py document.html --preserve-links
# Extract specific selector
python scripts/parse-html.py document.html --selector "article.content"
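A converter like this is usually a thin wrapper around BeautifulSoup (beautifulsoup4 and lxml appear in the dependencies below). A sketch of the core logic, with the selector and link options assumed to behave like the flags above:

# Sketch of HTML-to-text conversion with BeautifulSoup (illustrative, not the actual script).
from typing import Optional
from bs4 import BeautifulSoup


def html_to_text(html: str, selector: Optional[str] = None, preserve_links: bool = False) -> str:
    soup = BeautifulSoup(html, "lxml")
    # Drop non-content elements before extracting text
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    root = soup.select_one(selector) if selector else soup
    if root is None:
        return ""
    if preserve_links:
        # Rewrite anchors as "text (url)" so link targets survive the conversion
        for a in root.find_all("a", href=True):
            a.replace_with(f"{a.get_text(strip=True)} ({a['href']})")
    return root.get_text(separator="\n", strip=True)


with open("document.html", encoding="utf-8") as f:
    print(html_to_text(f.read(), selector="article.content"))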
Features:
Multi-Format Parser (templates/multi-format-parser.py)

Universal parser handling multiple formats with automatic format detection:
from multi_format_parser import MultiFormatParser
parser = MultiFormatParser(
    llamaparse_api_key="llx-...",  # Optional
    use_ocr=True,
    chunk_size=1000
)
# Automatic format detection
result = parser.parse_file("document.pdf")
print(result.text)
print(result.metadata)
print(result.tables)
# Batch processing
results = parser.parse_directory("./documents/")
for filename, result in results.items():
    print(f"{filename}: {len(result.text)} characters")
Supports:
Table Extraction (templates/table-extraction.py)

Specialized table extraction with multiple strategies:
from table_extraction import TableExtractor
extractor = TableExtractor(
    prefer_llamaparse=True,
    fallback_to_pdfplumber=True
)
# Extract all tables from document
tables = extractor.extract_tables("financial_report.pdf")
for i, table in enumerate(tables):
    print(f"Table {i + 1}:")
    print(table.to_markdown())  # or .to_csv(), .to_json()
    print(f"Confidence: {table.confidence}")
Features:
Research Paper Parsing (examples/parse-research-paper.py)

Complete example for parsing academic papers:
# Extracts title, abstract, sections, citations, tables, figures
python examples/parse-research-paper.py paper.pdf --output paper.json
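As a rough idea of what the section and citation pass can look like, a heuristic sketch; the regexes are illustrative and will not cover every layout or citation style:

# Sketch: heuristic section splitting and bracketed-citation extraction from parsed paper text.
import re

SECTION_HEADINGS = re.compile(
    r"^(abstract|introduction|related work|methods?|results|discussion|conclusion|references)\s*$",
    re.IGNORECASE | re.MULTILINE,
)
BRACKET_CITATION = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")  # e.g. [3] or [1, 4, 7]


def split_sections(text: str) -> dict:
    sections, matches = {}, list(SECTION_HEADINGS.finditer(text))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1).lower()] = text[m.end():end].strip()
    return sections


def find_citations(text: str) -> list:
    # Collect and deduplicate numeric citation keys found in square brackets
    return sorted({int(n) for m in BRACKET_CITATION.finditer(text) for n in m.group(1).split(",")})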
Extracts:
Legal Document Parsing (examples/parse-legal-document.py)

Specialized parser for legal documents:
# Extracts clauses, sections, definitions, parties
python examples/parse-legal-document.py contract.pdf --output contract.json
Extracts:
Chunking documents for a RAG pipeline:

from multi_format_parser import MultiFormatParser
parser = MultiFormatParser(chunk_size=512, chunk_overlap=50)
result = parser.parse_file("document.pdf")
# Chunks ready for embedding
for chunk in result.chunks:
    print(f"Chunk {chunk.id}: {chunk.text[:100]}...")
    print(f"Metadata: {chunk.metadata}")
    # Send to embedding model
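If the template is unavailable, the overlap chunking it performs is easy to sketch directly (character-based here; token-aware splitting is a common refinement):

# Sketch: fixed-size chunking with overlap, the shape of chunks a RAG pipeline expects.
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list:
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = []
    for chunk_id, start in enumerate(range(0, max(len(text), 1), step)):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append({"id": chunk_id, "text": piece, "metadata": {"start_char": start}})
    return chunks


for chunk in chunk_text(open("document.txt", encoding="utf-8").read()):
    print(f"Chunk {chunk['id']}: {chunk['text'][:100]}...")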
Batch-processing a directory into a vector store:

import glob
from multi_format_parser import MultiFormatParser
parser = MultiFormatParser()
# Process all documents in directory
for filepath in glob.glob("./documents/**/*", recursive=True):
try:
result = parser.parse_file(filepath)
# Store in vector database
store_embeddings(result.chunks)
print(f"✓ Processed {filepath}")
except Exception as e:
print(f"✗ Failed {filepath}: {e}")
Parser Selection:
Performance:
Accuracy:
RAG Optimization:
PyPDF2 returns garbled text:
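A hedged remedy, assuming the cause is image-only pages or unusual text encodings: check how much text PyPDF2 actually returns and fall back to pdfplumber (or an OCR-capable parser) when it comes back nearly empty.

# Sketch: detect near-empty PyPDF2 output and fall back to pdfplumber.
from PyPDF2 import PdfReader
import pdfplumber


def extract_with_fallback(path: str, min_chars_per_page: int = 20) -> str:
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    if len(text) >= min_chars_per_page * max(len(reader.pages), 1):
        return text
    # Too little text: likely scanned or oddly encoded; try pdfplumber (or OCR) instead
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)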
Unstructured installation fails:
sudo apt-get install poppler-utils tesseract-ocr   # Linux
brew install poppler tesseract                     # macOS

LlamaParse API errors:
Table extraction misses columns:
DOCX parsing fails:
Core:
pip install pypdf2 pdfplumber python-docx beautifulsoup4 lxml markdown
Optional (Unstructured):
pip install unstructured[local-inference]
sudo apt-get install poppler-utils tesseract-ocr # Linux
brew install poppler tesseract # macOS
Optional (LlamaParse):
pip install llama-parse
# Requires API key from https://cloud.llamaindex.ai
Supported Formats: PDF, DOCX, HTML, Markdown, TXT
Parsers: LlamaParse, Unstructured.io, PyPDF2, PDFPlumber, python-docx
Version: 1.0.0