scrape-codegen

Generates web-poet page objects from extraction specs produced by /scrape-spec, including item classes, page objects, and test fixtures.

Python

automation

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/zyte-web-data:scrape-codegen [spec-path] [project-dir] [fields]

User invocable

Model invocable

Inline context

Default effort

Argument hint: [spec-path] [project-dir] [fields]

Tool Access

This skill is limited to the following tools:

Skill, Agent, Bash, Read, Write

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are generating a web-poet page object from an extraction spec. The spec contains

Supporting Files

scripts/convert_fixtures.py

SKILL.md

126 lines · ~1.2k tokens

Stats

Language: Python

Stars: 18

Forks: 3

Maintenance: Excellent

Last Commit: Jun 24, 2026

Input

The raw argument string is $ARGUMENTS. Split it into up to 3 whitespace-separated positional arguments:

spec_path: path to spec folder, e.g. .scrape/books-toscrape
project_dir: path to the Scrapy project
fields: optional, comma-separated field names to generate (empty = all fields)
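
The splitting rule above can be sketched in Python as follows. This is a minimal illustration, not the skill's actual parser: the function name parse_arguments is hypothetical, and treating spec_path and project_dir as required is an assumption (the text only marks fields as optional).

```python
def parse_arguments(raw: str) -> tuple[str, str, list[str]]:
    """Split the raw argument string into spec_path, project_dir, and fields."""
    # At most 3 whitespace-separated parts; the third part keeps any inner spaces.
    parts = raw.split(maxsplit=2)
    if len(parts) < 2:
        # Assumption: the first two arguments are required.
        raise ValueError("expected at least spec_path and project_dir")
    spec_path, project_dir = parts[0], parts[1]
    fields_arg = parts[2] if len(parts) > 2 else ""
    # An empty list means "generate all fields".
    fields = [name.strip() for name in fields_arg.split(",") if name.strip()]
    return spec_path, project_dir, fields
```

Calling parse_arguments(".scrape/books-toscrape myproject title,price") would yield the two paths plus the list ["title", "price"]; omitting the third argument yields an empty list.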

Process

Step 1: Read the spec

Read {spec_path}/spec.json to get:

schema.properties — the field definitions
html_variant — which HTML to use (raw or rendered)
url — the starting URL (used for domain name)
data_type — what's being extracted (used for class naming); always singular (e.g. product, book)

Derive names from data_type using these conventions (never pluralize):

ClassName = PascalCase + Page → product → ProductPage
ItemClass = PascalCase + Item → product → ProductItem
module_name = snake_case of data_type → product → product
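
As a rough sketch, the naming conventions above could be implemented like this. The helper derive_names is hypothetical, and it assumes data_type arrives as lowercase words separated by spaces, hyphens, or underscores; camelCase input is not handled.

```python
import re


def derive_names(data_type: str) -> dict[str, str]:
    """Derive class and module names from a singular data_type; never pluralizes."""
    # Split on any run of non-alphanumeric characters (space, hyphen, underscore).
    words = [w.lower() for w in re.split(r"[^0-9A-Za-z]+", data_type) if w]
    pascal = "".join(w.capitalize() for w in words)
    return {
        "ClassName": pascal + "Page",
        "ItemClass": pascal + "Item",
        "module_name": "_".join(words),
    }
```

For data_type product this gives ProductPage, ProductItem, and product, matching the examples in the list.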

If fields is provided, filter schema.properties to only include those fields.
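
A minimal sketch of that filtering step, assuming schema.properties has been loaded as a dict keyed by field name. The function name filter_properties is hypothetical, and raising on unknown field names is an assumption; the spec does not say how to treat them.

```python
def filter_properties(properties: dict, fields: list[str]) -> dict:
    """Keep only the requested fields, preserving the schema's own order."""
    if not fields:
        # Empty selection means all fields.
        return dict(properties)
    unknown = [name for name in fields if name not in properties]
    if unknown:
        # Assumption: fail loudly rather than silently ignore a misspelled field.
        raise KeyError(f"fields not in schema.properties: {unknown}")
    wanted = set(fields)  # used for membership only; output order comes from the dict
    return {name: spec for name, spec in properties.items() if name in wanted}
```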

List page directories in {spec_path}/pages/ that have corresponding values in {spec_path}/values/. Read expected values from each.
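
One way to pair page directories with their expected values is sketched below. It assumes the layout implied by the commands in Step 4, namely {spec_path}/pages/{page_id}/ and {spec_path}/values/{page_id}.json; the function name load_expected_values is hypothetical.

```python
import json
from pathlib import Path


def load_expected_values(spec_path: str) -> dict[str, dict]:
    """Map page_id to expected values for every page directory that has a values file."""
    spec_dir = Path(spec_path)
    pages_dir = spec_dir / "pages"
    expected: dict[str, dict] = {}
    if not pages_dir.is_dir():
        return expected
    # Sorted iteration keeps the result order deterministic across runs.
    for page_dir in sorted(pages_dir.iterdir()):
        values_file = spec_dir / "values" / f"{page_dir.name}.json"
        if page_dir.is_dir() and values_file.is_file():
            expected[page_dir.name] = json.loads(values_file.read_text(encoding="utf-8"))
    return expected
```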

Derive site_name from the spec_path (parent directory name, e.g. books-toscrape from .scrape/books-toscrape/products). Detect the project name from {project_dir}.

Step 2: Add item and page object stub

Check {project_name}/items.py for an existing item class matching data_type. If none exists, write one based on the schema, with every field optional (typed | None, defaulting to None).

Add a page object stub:

/scrape-add-page-object {project_dir}/{project_name}/pages/{module_name}.py \
    {ClassName} {domain} web_poet.WebPage {project_name}.items.{ItemClass}

Use web_poet.BrowserPage if html_variant is rendered.

Step 3: Convert fixtures

Find the fixture class path from the project structure (e.g., {project_name}.pages.{module_name}.{ClassName}).

uv run ${CLAUDE_SKILL_DIR}/scripts/convert_fixtures.py \
    {spec_path} {project_dir} {fixture_class_path}

Step 4: Analyze pages (parallel)

mkdir -p .scrape/.work/{site_name}/codegen-analyze

Launch one Agent for each page that has values, all in a single message so they run in parallel. Each agent runs /scrape-codegen-analyze with all 4 arguments:

/scrape-codegen-analyze {spec_path}/pages/{page_id}/{html_variant}.html .scrape/.work/{site_name} {spec_path}/spec.json {spec_path}/values/{page_id}.json

Skip pages whose HTML file doesn't exist.
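
The per-page command construction, including the skip rule, can be sketched as follows. The function name build_analyze_commands is hypothetical; the argument order follows the command template above.

```python
from pathlib import Path


def build_analyze_commands(
    spec_path: str, site_name: str, html_variant: str, page_ids: list[str]
) -> list[str]:
    """Build one /scrape-codegen-analyze command per page whose HTML file exists."""
    spec_dir = Path(spec_path)
    spec_file = (spec_dir / "spec.json").as_posix()
    commands = []
    for page_id in page_ids:
        html_file = spec_dir / "pages" / page_id / f"{html_variant}.html"
        if not html_file.is_file():
            # Skip pages that have no HTML for the chosen variant.
            continue
        values_file = (spec_dir / "values" / f"{page_id}.json").as_posix()
        commands.append(
            f"/scrape-codegen-analyze {html_file.as_posix()} "
            f".scrape/.work/{site_name} {spec_file} {values_file}"
        )
    return commands
```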

Step 5: Generate page object code

After all analysis agents complete, launch a single Agent running /scrape-codegen-generate with all 3 arguments:

/scrape-codegen-generate .scrape/.work/{site_name} {project_dir}/{project_name}/pages/{module_name}.py {spec_path}/spec.json

Step 6: Test

cd {project_dir} && uv run pytest fixtures/ -x -v

Report the results. If tests fail, read the error output and consider regenerating the fields that failed.

Step 7: Report

Generated page object at {project_dir}/{project_name}/pages/{module_name}.py:
  Class: {ClassName} (N fields)
  Fixtures: N test cases
  Tests: N/N passing

Codegen rules

Follow the web-poet reference at ${CLAUDE_SKILL_DIR}/../scrape/references/web-poet.md, plus:

Keep code simple and domain-general — not overfitted to example pages
Return None for missing data — never empty string, False, or []
Use guard clauses: check for None before attribute access
Don't add docstrings to field methods
Don't catch generic Exception — only specific exceptions
Prefer deterministic output — avoid sets (use list + dedup if needed)
If analysis shows a field comes from structured data (JSON-LD, microdata), use extruct — the metadata format matches extract_metadata.py output from earlier stages, so the same access patterns work in the page object
If a browser response is needed, use BrowserPage as the base class
