Web scraping templates, scripts, and patterns for documentation and content collection using Playwright, BeautifulSoup, and Scrapy. Includes rate limiting, error handling, and extraction patterns. Use when scraping documentation, collecting web content, extracting structured data, building RAG knowledge bases, harvesting articles, crawling websites, or when user mentions web scraping, documentation collection, content extraction, Playwright scraping, BeautifulSoup parsing, or Scrapy spiders.
Limited to specific tools
Additional assets for this skill
This skill is limited to using the following tools:
examples/scrape-blog-posts.py
examples/scrape-github-docs.py
scripts/scrape-articles.py
scripts/scrape-documentation.py
scripts/setup-playwright-scraper.sh
templates/playwright-scraper-template.py
templates/rate-limited-scraper.py

This skill provides production-ready web scraping tools, templates, and patterns for documentation collection and content extraction. It includes functional scripts with rate limiting, error handling, and multiple scraping frameworks (Playwright, BeautifulSoup, Scrapy).
Decision Matrix:
| Use Case | Tool | Reason |
|---|---|---|
| JavaScript-heavy sites | Playwright | Full browser rendering |
| Static HTML parsing | BeautifulSoup | Lightweight, fast parsing |
| Large-scale crawling | Scrapy | Built-in concurrency, pipelines |
| Authentication required | Playwright | Cookie/session handling |
| Simple data extraction | BeautifulSoup | Minimal dependencies |
| Complex crawling rules | Scrapy | Spider middleware, rules |
Install required dependencies:
# Setup Playwright scraper (includes browser installation)
bash ./skills/web-scraping-tools/scripts/setup-playwright-scraper.sh
# Install for BeautifulSoup
pip install beautifulsoup4 requests lxml
# Install for Scrapy
pip install scrapy
What the setup script does:
Scrape Documentation Sites:
# Scrape technical documentation with automatic rate limiting
python ./skills/web-scraping-tools/scripts/scrape-documentation.py \
--url "https://docs.example.com" \
--output-dir "./scraped-docs" \
--max-depth 3 \
--rate-limit 2
# Parameters:
# --url: Starting URL to scrape
# --output-dir: Where to save scraped content
# --max-depth: How many levels deep to crawl
# --rate-limit: Seconds to wait between requests
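Conceptually, the script's crawl reduces to a depth-limited, rate-limited breadth-first loop. The sketch below is a simplified illustration, not the actual implementation; `fetch_html`, `save_page`, and `extract_links` are placeholders.

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(start_url, max_depth=3, rate_limit=2):
    """Breadth-first crawl limited to the start domain."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    domain = urlparse(start_url).netloc
    while queue:
        url, depth = queue.popleft()
        html = fetch_html(url)            # placeholder: requests or Playwright fetch
        save_page(url, html)              # placeholder: write into --output-dir
        time.sleep(rate_limit)            # --rate-limit seconds between requests
        if depth < max_depth:
            for link in extract_links(html):          # placeholder link extraction
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc == domain and absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))
```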
Features:
Scrape Articles and Blog Posts:
# Extract article content with readability parsing
python ./skills/web-scraping-tools/scripts/scrape-articles.py \
--urls urls.txt \
--output-dir "./articles" \
--format markdown
# Input file format (urls.txt):
# https://blog.example.com/post-1
# https://blog.example.com/post-2
Features:
Playwright Scraper Template:
# Copy template for customization
cp ./skills/web-scraping-tools/templates/playwright-scraper-template.py my-scraper.py
# Edit for your needs:
# - Update selectors for target site
# - Customize extraction logic
# - Add specific data transformations
# - Configure authentication if needed
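For orientation, the core of such a Playwright scraper usually looks roughly like the following. This is a minimal sketch, not the shipped template; the `article` selector and URL are placeholders to replace with your target site's markup.

```python
import asyncio
from playwright.async_api import async_playwright

async def scrape(url: str) -> dict:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        data = {
            "url": url,
            "title": await page.title(),
            # Placeholder selector: update for the target site
            "content": await page.locator("article").inner_text(),
        }
        await browser.close()
        return data

if __name__ == "__main__":
    print(asyncio.run(scrape("https://example.com")))
```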
Template Features:
Rate-Limited Scraper Template:
# Copy rate-limited scraper for respectful crawling
cp ./skills/web-scraping-tools/templates/rate-limited-scraper.py my-crawler.py
Template Features:
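At its core, a rate-limited scraper enforces a minimum delay between consecutive requests. Below is a minimal sketch of that pattern; the class name and interval are illustrative, not the template's actual API.

```python
import asyncio
import time

class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self._last = 0.0

    async def wait(self):
        # Sleep just long enough to keep requests min_interval seconds apart
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            await asyncio.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Call `await limiter.wait()` immediately before each fetch so every request passes through the same delay gate.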
Example 1: Scrape GitHub Documentation
# Run GitHub docs scraper
python ./skills/web-scraping-tools/examples/scrape-github-docs.py \
--repo "microsoft/playwright" \
--output "./github-docs"
What it does:
Example 2: Scrape Blog Posts
# Run blog post scraper
python ./skills/web-scraping-tools/examples/scrape-blog-posts.py \
--blog-url "https://blog.example.com" \
--category "tutorials" \
--max-posts 50
What it does:
Always Implement:
Never Do:
Respect Website Resources:
# Good: Rate limited with retries
await asyncio.sleep(2) # Wait between requests
if response.status == 429: # Too Many Requests
await asyncio.sleep(60) # Back off
retry()
# Bad: Rapid fire requests
for url in urls: # No delays
scrape(url) # Hammers server
CSS Selectors (BeautifulSoup/Playwright):
# Article content
content = soup.select_one('article.main-content')
# All links in navigation
nav_links = soup.select('nav a[href]')
# Metadata
title = soup.select_one('meta[property="og:title"]')['content']
description = soup.select_one('meta[name="description"]')['content']
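Meta tags are not guaranteed to be present, so a guarded lookup avoids a TypeError on pages that lack them (illustrative):

```python
# Fall back gracefully when a meta tag is missing
title_tag = soup.select_one('meta[property="og:title"]')
title = title_tag['content'] if title_tag else None
```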
XPath Selectors (Scrapy/Playwright):
# Article with specific class
article = page.locator('xpath=//article[@class="post"]')
# All headings
headings = response.xpath('//h1/text() | //h2/text() | //h3/text()').getall()
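In Scrapy, the `response` object above is handed to a spider's `parse` callback. A minimal polite spider might look like the sketch below; the spider name, domain, and selectors are placeholders.

```python
import scrapy

class DocsSpider(scrapy.Spider):
    name = "docs"
    allowed_domains = ["docs.example.com"]
    start_urls = ["https://docs.example.com/"]
    custom_settings = {
        "DOWNLOAD_DELAY": 2,                  # seconds between requests
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per domain
        "ROBOTSTXT_OBEY": True,               # respect robots.txt
    }

    def parse(self, response):
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
            "headings": response.xpath('//h1/text() | //h2/text()').getall(),
        }
        # Follow in-site links; offsite links are filtered via allowed_domains
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Run it with `scrapy runspider my_spider.py -o pages.json` to write the yielded items to a JSON file.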
Pagination Handling:
# Follow "Next" button
# locator() objects are always truthy, so test count() instead
while await page.locator('a.next').count() > 0:
    await page.locator('a.next').first.click()
    await page.wait_for_load_state('networkidle')
    extract_data(page)
Markdown (for RAG/Documentation):
output = {
'title': title,
'url': url,
'content': markdown_content,
'metadata': {
'author': author,
'date': date,
'tags': tags
}
}
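The `markdown_content` field can be produced from the extracted HTML with markdownify (already in the dependency list). A minimal sketch, where `article_html` is whatever element you selected:

```python
from markdownify import markdownify as md

# Convert the selected article HTML into Markdown for the output record
markdown_content = md(str(article_html), heading_style="ATX")
```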
JSON (for structured data):
{
"url": "https://example.com/page",
"title": "Page Title",
"content": "Full text content...",
"links": ["url1", "url2"],
"images": ["img1.jpg", "img2.png"],
"scraped_at": "2025-10-31T21:43:00Z"
}
SQLite (for large datasets):
import sqlite3
from datetime import datetime

# Store pages in a database for efficient querying
conn = sqlite3.connect('scraped_data.db')
cursor = conn.cursor()
cursor.execute('CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT, content TEXT, scraped_at TEXT)')
cursor.execute('''
    INSERT INTO pages (url, title, content, scraped_at)
    VALUES (?, ?, ?, ?)
''', (url, title, content, datetime.now().isoformat()))
conn.commit()
Python Dependencies:
playwright>=1.40.0
beautifulsoup4>=4.12.0
requests>=2.31.0
lxml>=4.9.0
scrapy>=2.11.0
aiohttp>=3.9.0
markdownify>=0.11.0
System Requirements:
Installation:
pip install playwright beautifulsoup4 requests lxml scrapy aiohttp markdownify
playwright install chromium
Use Case: Build a RAG knowledge base from Python documentation
python ./skills/web-scraping-tools/examples/scrape-github-docs.py \
--repo "python/cpython" \
--docs-path "Doc" \
--output "./python-docs" \
--format markdown
Result:
Use Case: Collect programming tutorials for training data
python ./skills/web-scraping-tools/examples/scrape-blog-posts.py \
--blog-url "https://realpython.com" \
--category "tutorials" \
--max-posts 100 \
--output "./tutorials"
Result:
Use Case: Track new posts on competitor blogs
# Setup scheduled scraper
python ./skills/web-scraping-tools/scripts/scrape-articles.py \
--urls competitor-urls.txt \
--output "./competitor-content" \
--since-date "2025-10-01" \
--notify-new
Result:
Playwright browser installation fails:
# Manual browser installation
playwright install chromium
# Or install browsers into a custom directory (PLAYWRIGHT_BROWSERS_PATH sets the install location; the path below is just an example)
PLAYWRIGHT_BROWSERS_PATH="$HOME/pw-browsers" playwright install
Rate limiting / 429 errors:
Increase the delay between requests (e.g. --rate-limit 5) and reduce concurrency (--max-concurrent 1).
JavaScript content not loading:
Wait for dynamic content to render with page.wait_for_load_state('networkidle') before extracting.
Memory issues with large scrapes:
Blocked by anti-bot protection:
Concurrent Scraping:
# Use asyncio for parallel requests
import asyncio
async def scrape_all(urls):
tasks = [scrape_page(url) for url in urls]
return await asyncio.gather(*tasks)
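An unbounded gather can hammer a server with every URL at once; wrapping each task in a semaphore keeps parallelism polite. In this sketch the limit of 5 and `scrape_page` are illustrative.

```python
import asyncio

async def scrape_all_bounded(urls, limit=5):
    semaphore = asyncio.Semaphore(limit)       # at most `limit` requests in flight

    async def bounded(url):
        async with semaphore:
            return await scrape_page(url)      # placeholder scraping coroutine

    return await asyncio.gather(*(bounded(u) for u in urls))
```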
Caching:
# Cache responses to avoid re-scraping
import diskcache
cache = diskcache.Cache('./scrape-cache')
if url in cache:
return cache[url]
else:
content = scrape(url)
cache[url] = content
return content
Incremental Updates:
# Track what's already scraped
scraped_urls = load_scraped_urls()
new_urls = [url for url in all_urls if url not in scraped_urls]
scrape_batch(new_urls)
Plugin: rag-pipeline
Version: 1.0.0
Category: Data Collection
Skill Type: Web Scraping & Content Extraction