From zyte-web-data
Generates web-poet page objects from extraction specs produced by /scrape-spec, including item classes, page objects, and test fixtures.
Slash command
/zyte-web-data:scrape-codegen [spec-path] [project-dir] [fields]
You are generating a web-poet page object from an extraction spec. The spec contains a schema, saved HTML pages, and expected values. It may describe any data type — product details, navigation links, article content, etc. Codegen doesn't need to know the data type; it generates a PO that extracts according to the schema.
The spec was produced by /scrape-spec and the project by /scrape-ensure-project.
Read python-environments.md and docs-access.md from ${CLAUDE_SKILL_DIR}/../scrape/references.
The raw argument string is $ARGUMENTS. Split it into up to 3 whitespace-separated positional arguments:
spec_path, project_dir, and fields. Specs live under a site directory such as .scrape/books-toscrape.
Read {spec_path}/spec.json to get:
schema.properties: the field definitions
html_variant: which HTML to use (raw or rendered)
url: the starting URL (used for domain name)
data_type: what's being extracted (used for class naming); always singular (e.g. product, book)
Derive names from data_type using these conventions (never pluralize):
ClassName = PascalCase + Page → product → ProductPage
ItemClass = PascalCase + Item → product → ProductItem
module_name = snake_case of data_type → product → product
If fields is provided, filter schema.properties to only include those fields.
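The argument splitting, naming conventions, and field filter described above can be sketched in plain Python. This is only an illustration, not the skill's own implementation: the helper names (`split_arguments`, `derive_names`, `filter_properties`), the comma-separated form of `fields`, and the treatment of multi-word data types are assumptions.

```python
import re
from typing import Optional


def split_arguments(raw: str) -> tuple:
    """Split the raw argument string into (spec_path, project_dir, fields)."""
    parts = raw.split(maxsplit=2)
    parts += [None] * (3 - len(parts))  # pad missing positionals with None
    spec_path, project_dir, fields = parts
    # Assumption: fields is a comma-separated list such as "title,price".
    field_list = None
    if fields:
        field_list = [f.strip() for f in fields.split(",") if f.strip()]
    return spec_path, project_dir, field_list


def derive_names(data_type: str) -> dict:
    """Derive ClassName, ItemClass, and module_name without pluralizing."""
    # Assumption: multi-word data types use spaces, hyphens, or underscores.
    words = [w for w in re.split(r"[\s_\-]+", data_type.strip()) if w]
    pascal = "".join(w.capitalize() for w in words)
    return {
        "ClassName": pascal + "Page",
        "ItemClass": pascal + "Item",
        "module_name": "_".join(w.lower() for w in words),
    }


def filter_properties(properties: dict, fields: Optional[list]) -> dict:
    """Keep only the requested fields; keep everything when fields is empty."""
    if not fields:
        return dict(properties)
    return {name: spec for name, spec in properties.items() if name in fields}
```

For instance, `derive_names("product")` yields `ProductPage`, `ProductItem`, and `product`, matching the conventions listed above.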
List page directories in {spec_path}/pages/ that have corresponding values in
{spec_path}/values/. Read expected values from each.
Derive site_name from the spec_path (parent directory name, e.g. books-toscrape from .scrape/books-toscrape/products).
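A minimal sketch of the page listing and the site_name derivation, using only `pathlib`. The function names are hypothetical; the `values/{page_id}.json` layout is taken from the analyze command used later in this skill.

```python
from pathlib import Path


def site_name_from(spec_path: str) -> str:
    """Return the parent directory name of the spec path."""
    # ".scrape/books-toscrape/products" -> "books-toscrape"
    return Path(spec_path).parent.name


def pages_with_values(spec_path: str) -> list:
    """List page ids under pages/ that have a matching values/{page_id}.json."""
    root = Path(spec_path)
    pages_dir = root / "pages"
    if not pages_dir.is_dir():
        return []
    return sorted(
        page.name
        for page in pages_dir.iterdir()
        if page.is_dir() and (root / "values" / f"{page.name}.json").is_file()
    )
```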
Detect the project name from {project_dir}.
Check {project_name}/items.py for an existing item class matching data_type.
If none exists, write one based on the schema (all fields optional, | None = None).
Add a page object stub:
/scrape-add-page-object {project_dir}/{project_name}/pages/{module_name}.py \
{ClassName} {domain} web_poet.WebPage {project_name}.items.{ItemClass}
Use web_poet.BrowserPage if html_variant is rendered.
Find the fixture class path from the project structure (e.g.,
{project_name}.pages.{module_name}.{ClassName}).
uv run ${CLAUDE_SKILL_DIR}/scripts/convert_fixtures.py \
{spec_path} {project_dir} {fixture_class_path}
mkdir -p .scrape/.work/{site_name}/codegen-analyze
Launch one Agent per page with values, all in a single message for parallel
execution. Each agent runs /scrape-codegen-analyze with all 4 arguments:
/scrape-codegen-analyze {spec_path}/pages/{page_id}/{html_variant}.html .scrape/.work/{site_name} {spec_path}/spec.json {spec_path}/values/{page_id}.json
Skip pages whose HTML file doesn't exist.
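The per-page invocations can be assembled as in the following sketch, which also applies the skip rule for missing HTML files. `analyze_commands` is a hypothetical helper; it only builds the command strings and does not launch any agents.

```python
from pathlib import Path


def analyze_commands(spec_path: str, site_name: str, html_variant: str, page_ids: list) -> list:
    """Build one /scrape-codegen-analyze invocation per page that has HTML."""
    spec = Path(spec_path)
    work_dir = Path(".scrape/.work") / site_name
    commands = []
    for page_id in page_ids:
        html = spec / "pages" / page_id / f"{html_variant}.html"
        if not html.is_file():
            continue  # skip pages whose HTML file doesn't exist
        values = spec / "values" / f"{page_id}.json"
        commands.append(
            f"/scrape-codegen-analyze {html} {work_dir} {spec / 'spec.json'} {values}"
        )
    return commands
```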
After all analysis agents complete, launch a single Agent running
/scrape-codegen-generate with all 3 arguments:
/scrape-codegen-generate .scrape/.work/{site_name} {project_dir}/{project_name}/pages/{module_name}.py {spec_path}/spec.json
cd {project_dir} && uv run pytest fixtures/ -x -v
Report results. If tests fail, read errors and consider re-generating failed fields.
Generated page object at {project_dir}/{project_name}/pages/{module_name}.py:
Class: {ClassName} (N fields)
Fixtures: N test cases
Tests: N/N passing
Follow the web-poet reference at ${CLAUDE_SKILL_DIR}/../scrape/references/web-poet.md, plus:
Use None for missing data: never an empty string, False, or [].
Check for None before attribute access.
Never catch bare Exception; catch only specific exceptions.
extruct: the metadata format matches extract_metadata.py output from earlier stages, so the same access patterns work in the page object.
Use BrowserPage as the base class when html_variant is rendered.
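The rules on missing data, None checks, and exception handling can be illustrated with a small standard-library helper. `parse_int` is a hypothetical function, not part of web-poet; in a real page object the input text would come from a selector result.

```python
from typing import Optional


def parse_int(text: Optional[str]) -> Optional[int]:
    """Parse an integer such as "1,234"; return None when data is missing."""
    if text is None:  # check for None before calling string methods
        return None
    cleaned = text.strip().replace(",", "")
    if not cleaned:  # missing data becomes None, not "" or 0
        return None
    try:
        return int(cleaned)
    except ValueError:  # catch only the specific exception
        return None
```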