Expert in web scraping and data extraction with Python tools
Web Scraping

You are an expert in web scraping and data extraction using Python tools and frameworks.

Core Tools

Static Sites
- Use requests for HTTP requests
- Use BeautifulSoup for HTML parsing
- Use lxml for fast XML/HTML processing

Dynamic Content
- Use Selenium for JavaScript-rendered pages
- Use Playwright for modern web automation
- Use Puppeteer (via pyppeteer) for headless browsing

Large-Scale Extraction
- Use Scrapy for structured crawling
- Use jina for AI-powered extraction
- Use firecrawl for large-scale scraping

Complex Workflows
- Use agentQL for structured queries
- Use multion for complex automation

Best Practices
- Implement rate limiting and delays
- Respect robots.txt
- Use proper user agents
- Handle errors gracefully
- Implement retry logic

Error Handling
- Handle network timeouts
- Deal with blocked requests
- Manage session cookies
- Handle pagination properly

Ethical Considerations
- Follow website terms of service
- Don't overload servers
- Cache results when possible
- Be transparent about scraping

Data Processing
- Clean and validate extracted data
- Handle encoding issues
- Store data efficiently
- Implement deduplication
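As one illustration of the "Respect robots.txt" practice, the sketch below checks a URL against robots.txt rules using only the standard library. The helper names `is_allowed` and `robots_url` and the sample rules are hypothetical, chosen for this example; a real scraper would first download the file located by `robots_url` before checking.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def robots_url(url: str) -> str:
    """Build the robots.txt location for the site that hosts `url`."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, used only to exercise the helper.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(robots_url("https://example.com/blog/post"))
    print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/report"))
    print(is_allowed(SAMPLE_ROBOTS, "https://example.com/blog/post"))
```

Keeping the parsing separate from the download makes the check easy to cache per domain, which also helps with rate limiting.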
extract content from websites reliably by cascading through multiple strategies (trafilatura, rotating user agents, Playwright stealth mode, async Playwright) with automatic fallback when one fails. use this skill when you need article text, metadata, or structured data from third-party sites, especially those with anti-bot protections, JavaScript rendering, or rate limiting. includes poison pill detection to catch paywalls, CAPTCHAs, Cloudflare blocks, and login walls before wasting time on unanswered requests.
required:
optional:
external connections:
- pip install playwright && playwright install chromium. stealth mode plugin (pip install playwright-stealth) required for sites blocking headless detection.
- trafilatura (pip install trafilatura): fast, lightweight article extraction. no external API needed.
- requests (pip install requests): HTTP client for baseline scraping.
- BeautifulSoup (pip install beautifulsoup4): HTML parsing.
- yt-dlp (pip install yt-dlp): YouTube video/metadata extraction.
- instaloader (pip install instaloader): Instagram scraping (requires valid Instagram session or guest mode, see rate limits).
- youtube-transcript-api (pip install youtube-transcript-api): transcript extraction without yt-dlp.

environment variables (optional, for anti-bot bypass):
- PROXY_LIST: newline-separated proxy URLs (http://ip:port or socks5://ip:port)
- PLAYWRIGHT_TIMEOUT: milliseconds before Playwright times out (default 30000)

1. validate target URL and check robots.txt
   - {domain}/robots.txt, parse Disallow rules

2. run poison pill detection
   - {"paywall": false, "rate_limit": true, "cloudflare": false})

3. attempt strategy 1: trafilatura fast extraction
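   - trafilatura.fetch_url(url) with a 10s download timeout (set through trafilatura's DOWNLOAD_TIMEOUT config option, since fetch_url takes no timeout argument), then trafilatura.extract(downloaded, include_comments=False)

4. attempt strategy 2: requests with rotating User-Agent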
   - PROXY_LIST env var
   - Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36, Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7), Mozilla/5.0 (X11; Linux x86_64)
   - <article>, <main>, or <div class="content"> tags

5. attempt strategy 3: Playwright sync with stealth mode
   - PLAYWRIGHT_TIMEOUT env var (default 30000ms)
   - sync_playwright(), apply stealth_sync() plugin

6. attempt strategy 4: async Playwright for Jupyter (optional fallback)
   - async with async_playwright() with stealth mode

7. handle special targets (YouTube, Instagram, TikTok)
   - yt-dlp --dump-json url to extract title, duration, description, video_id
   - instaloader.Post.from_shortcode() (requires valid session env var INSTA_SESSION or guest mode with rate limiting)
   - TikTokApi (third-party, requires careful rate limiting)
   - youtube_transcript_api.YouTubeTranscriptApi.get_transcript(video_id) if available

8. fallback: return extraction failure and suggest manual alternatives
- --force-robots-bypass to proceed.
- PLAYWRIGHT_TIMEOUT ms: mark as timeout failure and fall back to step 7 or 8. do not retry Playwright (expensive).
- max_retries times with exponential backoff (1s, 2s, 4s). if all retries fail, report network error and stop.

success returns a dictionary with the following schema:
{
"status": "success",
"url": "https://example.com/article",
"strategy_used": "trafilatura|requests|playwright_sync|playwright_async|yt-dlp|instaloader|fallback",
"content": {
"text": "extracted article text...",
"title": "article title",
"author": "author name or null",
"date_published": "2024-01-15 or null",
"word_count": 1250
},
"metadata": {
"source_domain": "example.com",
"http_status": 200,
"charset": "utf-8",
"poisons_detected": {"paywall": false, "rate_limit": false, "cloudflare": false, "captcha": false}
},
"extraction_time_ms": 2345
}
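a minimal sketch of how a strategy cascade could produce a result in this shape. it assumes each strategy is a plain callable that takes a URL and returns text (or None); the function name `extract_with_fallback`, the stub strategies, and the "all strategies failed" error string are hypothetical, and only a subset of the schema fields is filled in.

```python
import time
from typing import Callable, Optional
from urllib.parse import urlparse

# Assumption: a strategy takes a URL and returns extracted text, or None.
Strategy = Callable[[str], Optional[str]]


def extract_with_fallback(url: str, strategies: list[tuple[str, Strategy]]) -> dict:
    """Try each named strategy in order and return the first non-empty result."""
    started = time.monotonic()
    attempted = []
    for name, strategy in strategies:
        attempted.append(name)
        try:
            text = strategy(url)
        except Exception:
            continue  # any error counts as a failed strategy; try the next one
        if text and text.strip():
            return {
                "status": "success",
                "url": url,
                "strategy_used": name,
                "content": {"text": text, "word_count": len(text.split())},
                "metadata": {"source_domain": urlparse(url).netloc},
                "extraction_time_ms": int((time.monotonic() - started) * 1000),
            }
    return {
        "status": "failure",
        "url": url,
        "error": "all strategies failed",  # hypothetical message
        "strategies_attempted": attempted,
        "extraction_time_ms": int((time.monotonic() - started) * 1000),
    }


# Stub strategies standing in for trafilatura / requests / Playwright.
def blocked_strategy(url: str) -> Optional[str]:
    raise RuntimeError("blocked")


def working_strategy(url: str) -> Optional[str]:
    return "hello scraped world"


if __name__ == "__main__":
    result = extract_with_fallback(
        "https://example.com/article",
        [("trafilatura", blocked_strategy), ("requests", working_strategy)],
    )
    print(result["status"], result["strategy_used"])
```

passing strategies in as callables keeps the ordering explicit and lets each one be swapped or tested without network access.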
failure returns:
{
"status": "failure",
"url": "https://example.com/article",
"error": "cloudflare block detected after Playwright stealth mode",
"poisons_detected": {"cloudflare": true, "rate_limit": false},
"strategies_attempted": ["trafilatura", "requests", "playwright_sync"],
"recommendations": ["use undetected-chromedriver", "request data via official API"],
"extraction_time_ms": 5600
}
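a rough sketch of how the `poisons_detected` flags might be derived from a response. the marker strings and status-code rules below are illustrative assumptions, not the skill's actual detection logic; real sites vary, and the lists would need tuning.

```python
# Assumed marker strings; purely illustrative heuristics.
PAYWALL_MARKERS = ("subscribe to continue", "subscribe to read")
CLOUDFLARE_MARKERS = ("cloudflare", "cf-ray", "just a moment")
CAPTCHA_MARKERS = ("captcha", "g-recaptcha")


def detect_poisons(http_status: int, html: str) -> dict:
    """Return poison flags for a response, keyed like the schema above."""
    body = html.lower()
    return {
        "paywall": any(marker in body for marker in PAYWALL_MARKERS),
        # 429 is the standard "Too Many Requests" status.
        "rate_limit": http_status == 429,
        # Assumption: a block page pairs a 403/503 status with a Cloudflare marker.
        "cloudflare": http_status in (403, 503)
        and any(marker in body for marker in CLOUDFLARE_MARKERS),
        "captcha": any(marker in body for marker in CAPTCHA_MARKERS),
    }


if __name__ == "__main__":
    print(detect_poisons(429, ""))
    print(detect_poisons(403, "<title>Just a moment...</title>"))
```

running a check like this before the heavier strategies lets the cascade stop early when no strategy is likely to succeed.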
PLAYWRIGHT_TIMEOUT and was aborted. status "timeout" returned within 1s of timeout threshold.
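the retry rule mentioned earlier (up to max_retries retries with exponential backoff of 1s, 2s, 4s) could look roughly like the sketch below. it assumes one initial attempt followed by `max_retries` retries; the function name `retry_with_backoff`, the `sleep` parameter, and the choice of `ConnectionError` and `TimeoutError` as retryable errors are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying network-style errors with doubling delays."""
    last_error = None
    for attempt in range(max_retries + 1):  # one initial try plus the retries
        try:
            return operation()
        except (ConnectionError, TimeoutError) as error:
            last_error = error
            if attempt < max_retries:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s by default
    raise RuntimeError(f"network error after {max_retries} retries") from last_error


if __name__ == "__main__":
    delays = []
    outcomes = iter([ConnectionError("reset"), ConnectionError("reset"), "page html"])

    def flaky_fetch() -> str:
        outcome = next(outcomes)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    # Record delays instead of sleeping so the demo finishes immediately.
    print(retry_with_backoff(flaky_fetch, sleep=delays.append), delays)
```

injecting `sleep` keeps the function testable; in normal use the default `time.sleep` applies the real delays.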