From zyte-web-data
Generates web-poet page object code from per-page extraction analyses, synthesizing multiple analysis files into a single domain-wide page object class. Automates code generation for web scraping projects.
Slash command
/zyte-web-data:scrape-codegen-generate [work-path] [output-path] [spec-path]
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are generating web-poet page object code. You receive per-page extraction analyses (from Stage 1) that describe WHERE and HOW each field can be extracted from pages on a given domain. Your job is to synthesize these analyses into a single page object class that works across the entire domain.
The raw argument string is $ARGUMENTS. Split it into 3 whitespace-separated positional arguments:
work_path: .scrape/.work/spec
output_path: .scrape/spec/page_object.py
spec_path: .scrape/spec/spec.json
Plus, taken from the surrounding prompt text (not from the argument string):
@field methods for those fields. When not set, generate all fields found in the analyses.
Read web-poet.md and docs-access.md from ${CLAUDE_SKILL_DIR}/../scrape/references/.
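The positional-argument split described above could be sketched as follows. This is a minimal illustration, not the skill's own implementation: the names `parse_arguments`, `Paths`, and `FALLBACKS` are hypothetical, and treating the three listed paths as fallbacks for omitted arguments is an assumption.

```python
from typing import NamedTuple

# Assumption: the three paths listed in the skill text act as fallbacks
# when the corresponding positional argument is omitted.
FALLBACKS = (
    ".scrape/.work/spec",
    ".scrape/spec/page_object.py",
    ".scrape/spec/spec.json",
)


class Paths(NamedTuple):
    work_path: str
    output_path: str
    spec_path: str


def parse_arguments(raw: str) -> Paths:
    """Split the raw argument string on whitespace into three positional paths."""
    given = raw.split()
    if len(given) > len(FALLBACKS):
        raise ValueError(f"expected at most 3 arguments, got {len(given)}")
    # Fill any missing trailing positions from the fallbacks.
    merged = given + list(FALLBACKS[len(given):])
    return Paths(*merged)
```

Plain `str.split()` is used here because the text asks for whitespace-separated arguments; paths containing spaces would need a different convention.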
Read the schema from {spec_path} — use the properties object inside schema.
Read all Stage 1 analysis files from {work_path}/codegen-analyze/:
{work_path}/codegen-analyze/detail-1.json
{work_path}/codegen-analyze/detail-2.json
...
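Loading these inputs might look like the sketch below. The function name `load_inputs` is hypothetical, the `detail-*.json` glob pattern is inferred from the file names listed above, and each analysis is treated as opaque JSON because its internal shape is not specified here.

```python
import json
from pathlib import Path


def load_inputs(work_path: str, spec_path: str) -> tuple[dict, list[dict]]:
    """Return (schema properties, list of per-page analyses)."""
    spec = json.loads(Path(spec_path).read_text(encoding="utf-8"))
    # Only the `properties` object inside `schema` is used.
    properties = spec["schema"]["properties"]

    analysis_dir = Path(work_path) / "codegen-analyze"
    # sorted() keeps the order stable between runs; the order is lexicographic,
    # so detail-10.json would sort before detail-2.json.
    files = sorted(analysis_dir.glob("detail-*.json"))
    analyses = [json.loads(f.read_text(encoding="utf-8")) for f in files]
    return properties, analyses
```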
For each field in the schema, review all per-page analyses together:
No content filtering — ever. Even if the user's prompt asks to filter, exclude, or limit results by value, do NOT implement that logic in the page object. Page objects extract; spiders filter. Mention in your summary that filtering belongs at the spider level.
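Returning to the per-field review across analyses: one way to see how much agreement exists is to count how many pages support each extraction strategy. The sketch below assumes a purely hypothetical analysis shape, a dict mapping each field name to a strategy string such as `"css:h1.product-title::text"`; the real Stage 1 files may be structured differently, and `rank_strategies` is an illustrative name.

```python
from collections import Counter


def rank_strategies(field_name: str, analyses: list[dict]) -> list[tuple[str, int]]:
    """Count how many pages support each extraction strategy for one field.

    Assumes (hypothetically) that each analysis maps field names to a
    strategy string; pages that lack the field are skipped.
    """
    counts = Counter(
        analysis[field_name]
        for analysis in analyses
        if analysis.get(field_name) is not None
    )
    # most_common() orders strategies by page count, highest first.
    return counts.most_common()
```

A strategy supported by every page is a natural primary choice, while strategies seen on only some pages are candidates for fallbacks or for a note about fragile extraction in the summary.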
Generate a complete, self-contained Python module following the web-poet reference. The code must:
Return None when a field is not present (never empty string or []).
Use extruct for JSON-LD/microdata, price_parser for prices, jmespath for JSON queries.
Structure:
from web_poet import WebPage, field
# ... other imports as needed

class PageObject(WebPage[dict]):
    # shared helpers as @cached_property if multiple fields need them

    @field
    def field_name(self) -> type | None:
        # extraction logic
        ...
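To illustrate the rule that an absent field yields None rather than an empty string or list, combined with a primary source and a fallback, a small helper could be written as below. `first_present` is a hypothetical name and is not part of web-poet; it uses only the standard library.

```python
from typing import Any, Callable, Optional


def first_present(*extractors: Callable[[], Any]) -> Optional[Any]:
    """Try each extractor in order; return the first non-empty result, else None."""
    for extract in extractors:
        try:
            value = extract()
        except (KeyError, IndexError, TypeError, AttributeError):
            # A missing key or node means this source is absent on the page.
            continue
        if isinstance(value, str):
            value = value.strip()
        # Treat None, "", [] and {} as "not present"; keep 0 and False.
        if value is None or value == "" or value == [] or value == {}:
            continue
        return value
    return None
```

Inside an @field method, each extractor would typically be a lambda wrapping one source, for example a JSON-LD lookup first and a CSS selector second, so the method body stays a single return statement.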
Save the generated code to {output_path}.
Return a summary of what was generated:
Generated page object with N fields:
name: CSS h1.product-title::text
price: JSON-LD offers.price, fallback to CSS span.price
description: CSS div.product-description (text join)
rating: JSON-LD aggregateRating.ratingValue
image_url: CSS img.product-image::attr(src) + urljoin
Include notes on any fields where consensus was difficult or where extraction may be fragile.
npx claudepluginhub zytedata/claude-skills --plugin zyte-web-data