From zyte-web-data
Extracts structured data from locally saved HTML files, with optional schema-guided extraction. Use it to analyze a downloaded page by giving its file path and, optionally, an output path and a schema.
Slash command
/zyte-web-data:scrape-analyze-page
You are extracting structured data from a page. Given saved HTML, identify all available fields and extract their values.
Read ${CLAUDE_SKILL_DIR}/../scrape/references/python-environments.md.
This is the user prompt: $ARGUMENTS. You need to extract the following information from it:
The HTML file to analyze, e.g. product1.html. Don't proceed if it's not provided.
output_path, e.g. product1.json. When provided, this is where you will save the structured analysis.
Only process this one page. Do not read or compare with other pages' analysis files.
Clean the HTML and extract metadata, saving outputs to the work directory. Use {page_id}.{html_variant} as the filename base to avoid collisions:
mkdir -p .scrape/.work/analysis
uv run ${CLAUDE_SKILL_DIR}/scripts/clean_html.py PAGE.html -l1 -o .scrape/.work/analysis/{page_id}.{html_variant}.cleaned.html
uv run ${CLAUDE_SKILL_DIR}/scripts/extract_metadata.py PAGE.html -u PAGE_URL -o .scrape/.work/analysis/{page_id}.{html_variant}.metadata.json
Read only the cleaned HTML (never the original) and the metadata JSON. The metadata may be empty {} if the page has no structured data.
IMPORTANT: Never read the original HTML file (PAGE.html), even partially, even via tools such as Bash. Only use the cleaned HTML produced by the cleaning step above as your HTML source.
Use both the cleaned HTML and the metadata as data sources. Metadata (especially JSON-LD) often has cleaner, more complete values than what's visible in the HTML — e.g., structured price/priceCurrency vs rendered "$29.99", aggregateRating with review count, brand as a structured object. Some fields may only exist in metadata (e.g., sku, gtin, @type).
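As a rough illustration of preferring structured metadata over rendered text, the sketch below merges values found in the cleaned HTML with values from a JSON-LD Product object. The `json-ld` key and the overall layout of the metadata file are assumptions made for this example, not the documented output of extract_metadata.py, and `merge_sources` is a hypothetical helper name.

```python
def merge_sources(html_fields: dict, metadata: dict) -> dict:
    """Start from HTML-derived values, then overlay structured JSON-LD values."""
    merged = dict(html_fields)
    # Assumption: metadata stores JSON-LD objects as a list under "json-ld".
    for item in metadata.get("json-ld", []):
        if item.get("@type") != "Product":
            continue
        # Fields that often exist only in metadata.
        for key in ("sku", "gtin"):
            if key in item:
                merged[key] = item[key]
        # "offers" may be a single object or a list of objects.
        offers = item.get("offers") or {}
        if isinstance(offers, list):
            offers = offers[0] if offers else {}
        if "price" in offers:
            merged["price"] = str(offers["price"])
        if "priceCurrency" in offers:
            merged["priceCurrency"] = offers["priceCurrency"]
        # "brand" may be a plain string or a structured object with a name.
        brand = item.get("brand")
        if isinstance(brand, dict) and "name" in brand:
            merged["brand"] = brand["name"]
        elif isinstance(brand, str):
            merged["brand"] = brand
    return merged
```

With an empty metadata file ({}), the function simply returns the HTML-derived values unchanged.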
Examine both sources and extract all meaningful data fields. For each field, determine:
Three modes depending on arguments:
examples for formatting. Also extract additional fields not in the schema; they may reveal data the user didn't know about.
For fields with large values (long text, HTML content, nested structures):
If output_path is provided, save the complete extraction to output_path; otherwise skip this step.
If the user has asked for a specific format and structure, use that.
Otherwise write a JSON file with the following structure:
{
  "fields": {
    "name": {"type": "str", "value": "Widget X"},
    "price": {"type": "str", "value": "$29.99"},
    "description": {"type": "str", "value": "Full long description..."}
  }
}
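One way to produce a file in this default structure is sketched below. `save_extraction` is a hypothetical helper, and deriving the "type" string from the Python type name is an assumption chosen to match the "str" and "float" labels used in the examples.

```python
import json
from pathlib import Path


def save_extraction(fields: dict, output_path) -> bool:
    """Write {"fields": {name: {"type", "value"}}} to output_path, if one was given."""
    if not output_path:
        return False  # no output_path provided: skip the save step
    payload = {
        "fields": {
            name: {"type": type(value).__name__, "value": value}
            for name, value in fields.items()
        }
    }
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
    return True
```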
Return a concise summary with field names, types, and values. Truncate large values if the full output was saved to a file. Example:
These fields were discovered:
name (str): "Widget X"
price (str): "$29.99"
description (str): "Premium widget with advanced..." (2340 chars)
rating (float): 4.5
The full report is saved to $output_path
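The summary format above can be sketched as a small formatter. `summarize` and its `max_len` cutoff of 30 characters are illustrative assumptions; the skill itself does not specify a truncation length.

```python
def summarize(fields: dict, output_path=None, max_len: int = 30) -> str:
    """Build the concise summary; truncate long strings only if a full report was saved."""
    lines = ["These fields were discovered:"]
    for name, value in fields.items():
        type_name = type(value).__name__
        if isinstance(value, str):
            if output_path and len(value) > max_len:
                # Show a prefix plus the full length, since the file holds the rest.
                shown = f'"{value[:max_len]}..." ({len(value)} chars)'
            else:
                shown = f'"{value}"'
        else:
            shown = repr(value)
        lines.append(f"{name} ({type_name}): {shown}")
    if output_path:
        lines.append(f"The full report is saved to {output_path}")
    return "\n".join(lines)
```

When no output path is given, values are printed in full, because there is no saved file for the reader to consult.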
npx claudepluginhub zytedata/claude-skills --plugin zyte-web-data