Production-proven Playwright web scraping patterns with selector-first approach and robust error handling. Use when users need to build web scrapers, extract data from websites, automate browser interactions, or ask about Playwright selectors, text extraction (innerText vs textContent), regex patterns for HTML, fallback hierarchies, or scraping best practices.
Production-proven web scraping patterns using Playwright with selector-first approach and robust error handling.
Always prefer semantic locators over CSS selectors:
// ✅ BEST: Semantic locators (accessible, maintainable)
await page.getByRole('button', { name: 'Submit' })
await page.getByText('Welcome')
await page.getByLabel('Email')
// ⚠️ ACCEPTABLE: Text patterns for dynamic content
await page.locator('text=/\\$\\d+\\.\\d{2}/')
// ❌ AVOID: Brittle CSS selectors
await page.locator('.btn-primary')
await page.locator('#submit-button')
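Semantic locators also compose well when scraping repeated items. A minimal sketch (the roles and filter text below are illustrative, not taken from a real page):
// Filter list items by visible text, then read each item's heading
const rows = page.getByRole('listitem').filter({ hasText: 'In stock' });
const count = await rows.count();
for (let i = 0; i < count; i++) {
  const title = await rows.nth(i).getByRole('heading').innerText();
  console.log(title);
}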
Critical difference between textContent and innerText:
// ❌ WRONG: Returns ALL text in the DOM, including hidden elements and script/style contents
const pageText = await page.textContent("body");
// ✅ CORRECT: Returns only VISIBLE text (what users see)
const pageText = await page.innerText("body");
Use case for each:
- innerText("body") - Extract visible content for regex matching
- textContent(selector) - Get text from specific elements

Handle newlines and whitespace in HTML:
// ❌ FAILS: [^$]* doesn't match across newlines
const match = pageText.match(/ADULT[^$]*(\$\d+\.\d{2})/);
// ✅ WORKS: [\s\S]{0,10} matches any character including newlines
const match = pageText.match(/ADULT[\s\S]{0,10}(\$\d+\.\d{2})/);
Common patterns:
// Price extraction
/\$(\d+\.\d{2})/
// Date/time
/(\d{1,2}\s+[A-Za-z]{3}\s+\d{4},\s+\d{1,2}:\d{2}[ap]m)/i
// Screen number
/Screen\s+(\d+)/i
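For example, the patterns above can be applied to visible page text (the captured values shown in comments are illustrative):
const pageText = await page.innerText("body");
const price = pageText.match(/\$(\d+\.\d{2})/)?.[1] ?? null;   // e.g. "24.50"
const screen = pageText.match(/Screen\s+(\d+)/i)?.[1] ?? null; // e.g. "3"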
Implement 4-tier fallback for robustness:
async function extractField(page: Page, fieldName: string): Promise<string | null> {
// Tier 1: Primary semantic selector
try {
const value = await page.getByLabel(fieldName).textContent();
if (value) return value.trim();
} catch {}
// Tier 2: Alternative selectors
try {
const value = await page.locator(`[aria-label="${fieldName}"]`).textContent();
if (value) return value.trim();
} catch {}
// Tier 3: Text pattern matching
const pageText = await page.innerText("body");
const pattern = new RegExp(`${fieldName}[\\s\\S]{0,20}([A-Z0-9].+)`, 'i');
const match = pageText.match(pattern);
if (match?.[1]) return match[1].trim();
// Tier 4: Return null (caller handles missing data)
return null;
}
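Usage sketch (the field names are illustrative): each field degrades gracefully instead of throwing.
const title = await extractField(page, 'Title');
const venue = await extractField(page, 'Venue');
if (!title) {
  console.warn('Title missing after all fallback tiers');
}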
// ✅ GOOD: Try-catch with specific actions
try {
await page.goto(url, { waitUntil: 'domcontentloaded' });
} catch (error) {
throw new Error(`Failed to navigate to ${url}: ${error.message}`);
}
// ✅ GOOD: Bounded wait; a missing optional element is not fatal
try {
await page.waitForSelector('text="Loading complete"', { timeout: 5000 });
} catch {
// Continue anyway - loading indicator is optional
}
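The optional-wait pattern can be wrapped in a small helper. A sketch, using a hypothetical waitForOptional name:
import type { Page } from 'playwright';

// Waits for a selector, but treats a timeout as "not present" rather than an error
async function waitForOptional(page: Page, selector: string, timeout = 5000): Promise<boolean> {
  try {
    await page.waitForSelector(selector, { timeout });
    return true;
  } catch {
    return false;
  }
}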
// ❌ WRONG: Grabs first matching image (could be from carousel/ads)
const poster = await page.locator('img[src*="movies"]').first();
// ✅ CORRECT: Target specific hero/header image
const poster = await page.locator('img[src*="movies/headers"]').first();
// ✅ BETTER: Use semantic structure
const poster = await page.locator('header img, [role="banner"] img').first();
Each scraper method should have a single responsibility:
// ✅ GOOD: Each method scrapes ONE resource type
interface ScraperClient {
scrapeMovies(): Promise<{ movies: Movie[] }>;
scrapeSession(sessionId: string): Promise<SessionData>;
scrapePricing(sessionId: string): Promise<PricingData>;
}
// ❌ BAD: Session method returns movie data (violates SRP)
interface ScraperClient {
scrapeSession(sessionId: string): Promise<{
session: SessionData;
movieTitle: string; // ❌ Cross-concern
moviePoster: string; // ❌ Cross-concern
}>;
}
Composition over mixing concerns:
// ✅ Compose data from multiple focused scrapes
const { movies } = await client.scrapeMovies();
const movie = movies.find(m => m.sessionTimes.includes(sessionId));
const session = await client.scrapeSession(sessionId);
const pricing = await client.scrapePricing(sessionId);
// Build composite response
const ticket = {
movieTitle: movie.title, // From movies scrape
moviePoster: movie.thumbnail, // From movies scrape
sessionDateTime: session.dateTime, // From session scrape
pricing: pricing, // From pricing scrape
};
When building a scraper, follow this sequence:
1. Install Playwright (bun add playwright)
2. Navigate and wait for content (domcontentloaded or networkidle)
3. Use innerText("body") for visible page text
4. Record which selectors worked (selectorsUsed)

import { chromium, type Browser, type Page } from 'playwright';
async function createBrowser(): Promise<Browser> {
return await chromium.launch({
headless: true, // Set false for debugging
});
}
async function createPage(browser: Browser): Promise<Page> {
const page = await browser.newPage({
viewport: { width: 1280, height: 720 },
userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...',
});
return page;
}
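Putting the two helpers together (a sketch; the URL is a placeholder):
const browser = await createBrowser();
const page = await createPage(browser);
try {
  await page.goto('https://example.com', { waitUntil: 'domcontentloaded' });
  console.log(await page.innerText("body"));
} finally {
  await browser.close();
}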
export async function createScraperClient() {
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();
return {
async scrapeData(url: string) {
await page.goto(url, { waitUntil: 'domcontentloaded' });
const pageText = await page.innerText("body");
const selectorsUsed: Record<string, string> = {};
// Extract fields with fallbacks
let field1 = null;
try {
field1 = await page.getByRole('heading').textContent();
selectorsUsed.field1 = "getByRole";
} catch {
const match = pageText.match(/Title:\s*(.+)/i);
if (match) {
field1 = match[1];
selectorsUsed.field1 = "regex";
}
}
return { field1, selectorsUsed };
},
async close() {
await browser.close();
},
};
}
#!/usr/bin/env bun
import { createScraperClient } from './scraper-client.ts';
async function main() {
const args = process.argv.slice(2);
const url = args[0];
if (!url) {
console.error('Usage: bun run cli.ts <url>');
process.exit(1);
}
const client = await createScraperClient();
try {
const result = await client.scrapeData(url);
console.log(JSON.stringify(result, null, 2));
} catch (error) {
console.error(`Scraping failed: ${error.message}`);
process.exit(1);
} finally {
await client.close();
}
}
main();
Use the Chrome DevTools MCP server to inspect actual page structure:
// In your conversation with Claude:
// "Use Chrome DevTools to inspect the pricing page"
// Claude will use: take_snapshot, evaluate_script, etc.
Always track which selectors worked:
const selectorsUsed: Record<string, string> = {};
// After each extraction
selectorsUsed.fieldName = "getByRole"; // or "regex", "fallback-1", etc.
// Return in response for debugging
return { data, selectorsUsed };
// Take screenshot at key points
await page.screenshot({ path: 'debug-step-1.png' });
// Highlight element before extraction
await page.locator(selector).highlight();
// DON'T assume data attributes exist
await page.locator('[data-price]'); // Might not exist!
// DON'T use implementation-specific classes
await page.locator('.MuiButton-root-xyz'); // Will break when CSS changes
// DON'T use textContent for regex extraction
const text = await page.textContent("body"); // Includes hidden elements and script text!
// DON'T assume data exists
const price = await page.locator('.price').textContent(); // Might throw!
// DO catch failures and return null instead
const price = await page.locator('.price').textContent().catch(() => null);
Before deploying a scraper: