intent

extract content from websites reliably by cascading through multiple strategies (trafilatura, requests with rotating user agents, Playwright stealth mode, async Playwright), falling back automatically when one fails. use this skill when you need article text, metadata, or structured data from third-party sites, especially those with anti-bot protections, JavaScript rendering, or rate limiting. includes poison pill detection to catch paywalls, CAPTCHAs, Cloudflare blocks, and login walls before wasting time on requests that will never be answered.

inputs

required:

target URL (string)
extraction goal (article text, metadata, structured data, video, or transcript)

optional:

custom headers dict (overrides rotating user agents)
proxy list (for rate limit bypass, one proxy per request)
timeout (default 10 seconds)
max retries (default 3)

external connections:

Playwright (for JavaScript-heavy sites): install via pip install playwright && playwright install chromium. stealth mode plugin (pip install playwright-stealth) required for sites that detect and block headless browsers.
trafilatura (pip install trafilatura): fast, lightweight article extraction. no external API needed.
requests library (pip install requests): HTTP client for baseline scraping.
BeautifulSoup4 (pip install beautifulsoup4): HTML parsing.
yt-dlp (pip install yt-dlp): YouTube video/metadata extraction.
instaloader (pip install instaloader): Instagram scraping (requires valid Instagram session or guest mode, see rate limits).
youtube-transcript-api (pip install youtube-transcript-api): transcript extraction without yt-dlp.

environment variables (optional, for anti-bot bypass):

PROXY_LIST: newline-separated proxy URLs (http://ip:port or socks5://ip:port)
PLAYWRIGHT_TIMEOUT: milliseconds before Playwright times out (default 30000)
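
the two environment variables above could be read roughly as follows. this is a minimal sketch: ScraperSettings and load_settings are hypothetical names, and dropping lines that do not start with http:// or socks5:// is an assumption about how malformed entries should be handled.

```python
import os
from dataclasses import dataclass, field


@dataclass
class ScraperSettings:
    """Hypothetical container for the optional environment settings."""
    proxies: list = field(default_factory=list)
    playwright_timeout_ms: int = 30000  # default stated in this document


def load_settings(environ=os.environ) -> ScraperSettings:
    # PROXY_LIST is newline-separated; blank and malformed lines are skipped
    # (skipping rather than raising is an assumption).
    raw = environ.get("PROXY_LIST", "")
    proxies = [
        line.strip()
        for line in raw.splitlines()
        if line.strip().startswith(("http://", "socks5://"))
    ]
    timeout = int(environ.get("PLAYWRIGHT_TIMEOUT", "30000"))
    return ScraperSettings(proxies=proxies, playwright_timeout_ms=timeout)
```

passing a plain dict instead of os.environ keeps the function easy to exercise without touching the real environment.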

procedure

1. validate target URL and check robots.txt
- input: target URL
- fetch {domain}/robots.txt, parse Disallow rules
- output: boolean (scrape allowed), disallowed path list (paths to avoid)
- if robots.txt forbids the path, skip to step 8 (manual alternatives) unless user explicitly overrides
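
the robots.txt check can be sketched with the standard library's urllib.robotparser. robots_url_for and check_robots are hypothetical helper names, and the robots.txt body is passed in as text so that the fetch stays separate. the disallowed list here collects Disallow paths from every user-agent group, which is a simplification.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def robots_url_for(target_url: str) -> str:
    """Build the robots.txt URL for the target's scheme and host."""
    parts = urlsplit(target_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def check_robots(robots_txt: str, target_url: str, user_agent: str = "*"):
    """Return (allowed, disallowed_paths) for an already-fetched robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, target_url)
    # Simplification: list Disallow paths from all groups, not only ours.
    disallowed = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "disallow" and value.strip():
            disallowed.append(value.strip())
    return allowed, disallowed
```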
2. run poison pill detection
- input: URL
- make HEAD request with generic User-Agent, capture status code and headers
- because a HEAD response carries no body, follow with a GET limited to the first 5KB to obtain a body snippet
- check for patterns: 403/401 (auth wall), 429/503 (rate limit), "Cloudflare" in headers, "captcha" in the body snippet
- output: dictionary with detected poison pills (e.g., {"paywall": false, "rate_limit": true, "cloudflare": false})
- if critical poison pill detected (e.g., rate limit, Cloudflare), jump to step 5 (Playwright stealth)
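
the pattern checks in this step might look like the sketch below, which classifies a response that has already been fetched. classify_poison_pills and should_jump_to_browser are hypothetical names. folding 401/403 into the "paywall" key is an assumption made to match the keys in the output contract, and treating all four pills as critical by default is likewise an assumption.

```python
def classify_poison_pills(status_code, headers, body_snippet=""):
    """Map a status code, headers, and body snippet to poison pill flags."""
    header_blob = " ".join(f"{k}: {v}" for k, v in headers.items()).lower()
    return {
        # Assumption: 401/403 (auth wall) is reported under "paywall".
        "paywall": status_code in (401, 403),
        "rate_limit": status_code in (429, 503),
        "cloudflare": "cloudflare" in header_blob,
        # Only the first 5KB of the body is inspected.
        "captcha": "captcha" in body_snippet[:5120].lower(),
    }


def should_jump_to_browser(pills, critical=("paywall", "rate_limit", "cloudflare", "captcha")):
    """True when any pill considered critical was detected."""
    return any(pills.get(name, False) for name in critical)
```

narrowing the critical tuple (for example to rate_limit and cloudflare only) changes which detections skip the lightweight strategies.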
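3. attempt strategy 1: trafilatura fast extraction
- input: URL
- call trafilatura.fetch_url(url) then trafilatura.extract(downloaded, include_comments=False); fetch_url does not take a timeout argument, so set the download timeout through trafilatura's configuration
- output: extracted article text, plus title, author, date from trafilatura's metadata extraction (or null if extraction fails)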
- if extraction succeeds and content length > 100 chars, return result to outcome signal (skip remaining strategies)
- if fails or content too short, continue to step 4
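
a minimal sketch of this strategy, assuming trafilatura is installed. is_usable and extract_with_trafilatura are hypothetical names; the import sits inside the function so the length check can be used without the dependency, and the download timeout is left at the library default.

```python
MIN_CHARS = 100  # threshold stated in this procedure


def is_usable(text, min_chars=MIN_CHARS):
    """True when text exists and is longer than the minimum length."""
    return bool(text) and len(text.strip()) > min_chars


def extract_with_trafilatura(url):
    """Return article text, or None so the caller can try the next strategy."""
    import trafilatura  # third-party: pip install trafilatura

    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:  # download failed
        return None
    text = trafilatura.extract(downloaded, include_comments=False)
    return text if is_usable(text) else None
```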
4. attempt strategy 2: requests with rotating User-Agent
- input: URL, optional proxy from PROXY_LIST env var
- make GET request with rotated User-Agent from list: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36, Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7), Mozilla/5.0 (X11; Linux x86_64)
- if proxy available, use it; else direct connection
- output: HTTP response body (HTML)
- parse with BeautifulSoup, extract text from <article>, <main>, or <div class="content"> tags
- if content extracted and > 100 chars, return result
- if fails or empty, continue to step 5
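
the rotation and tag lookup could be sketched as below, assuming requests and beautifulsoup4 are installed. build_headers and fetch_main_text are hypothetical names, and the User-Agent strings are copied from the list in this step.

```python
import itertools

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]
_ua_cycle = itertools.cycle(USER_AGENTS)


def build_headers(custom_headers=None):
    """Custom headers override rotation; otherwise use the next User-Agent."""
    if custom_headers:
        return dict(custom_headers)
    return {"User-Agent": next(_ua_cycle)}


def fetch_main_text(url, proxy=None, timeout=10):
    """GET the page and pull text from <article>, <main>, or div.content."""
    import requests  # third-party
    from bs4 import BeautifulSoup  # third-party

    proxies = {"http": proxy, "https": proxy} if proxy else None
    response = requests.get(url, headers=build_headers(), proxies=proxies, timeout=timeout)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    node = soup.find("article") or soup.find("main") or soup.find("div", class_="content")
    if node is None:
        return None
    text = node.get_text(" ", strip=True)
    return text if len(text) > 100 else None
```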
5. attempt strategy 3: Playwright sync with stealth mode
- input: URL, timeout from PLAYWRIGHT_TIMEOUT env var (default 30000ms)
- launch Playwright with sync_playwright(), apply stealth_sync() plugin
- navigate to URL, wait for network idle (page.wait_for_load_state("networkidle") in the Python API)
- output: page HTML after JS execution
- parse with BeautifulSoup, extract visible text
- if content > 100 chars, return result
- if fails or Cloudflare challenge remains, continue to step 6
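
one way to sketch this strategy, assuming playwright and a playwright-stealth release that exposes stealth_sync (the 1.x series). render_with_stealth and challenge_remains are hypothetical names, and matching the "Just a moment..." interstitial title is only a heuristic for spotting a remaining Cloudflare challenge.

```python
# Heuristic marker (assumption): title text of the Cloudflare interstitial.
CHALLENGE_MARKERS = ("just a moment...",)


def challenge_remains(html):
    """True when the rendered page still looks like a challenge screen."""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def render_with_stealth(url, timeout_ms=30000):
    """Return rendered HTML, or None if a challenge page is still showing."""
    from playwright.sync_api import sync_playwright  # third-party
    from playwright_stealth import stealth_sync  # third-party, 1.x API

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            stealth_sync(page)  # patch common headless fingerprints
            page.goto(url, timeout=timeout_ms)
            page.wait_for_load_state("networkidle", timeout=timeout_ms)
            html = page.content()
        finally:
            browser.close()
    return None if challenge_remains(html) else html
```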
6. attempt strategy 4: async Playwright for Jupyter (optional fallback)
- input: URL, event loop context
- use async with async_playwright() with stealth mode
- navigate, wait for networkidle
- output: page HTML
- extract text, return if successful
- if all strategies fail, proceed to step 7
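
the async variant matters in Jupyter because an event loop is already running there, which the sync Playwright API does not support. a sketch under the same assumptions as the sync strategy (playwright-stealth 1.x exposing stealth_async); in_running_loop and render_async are hypothetical names.

```python
import asyncio


def in_running_loop() -> bool:
    """True when called from inside a running event loop (e.g. Jupyter)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


async def render_async(url, timeout_ms=30000):
    """Async counterpart of the sync stealth render; returns page HTML."""
    from playwright.async_api import async_playwright  # third-party
    from playwright_stealth import stealth_async  # third-party, 1.x API

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await stealth_async(page)
            await page.goto(url, timeout=timeout_ms)
            await page.wait_for_load_state("networkidle", timeout=timeout_ms)
            return await page.content()
        finally:
            await browser.close()
```

in a notebook cell the coroutine would be awaited directly (html = await render_async(url)); in a plain script, asyncio.run(render_async(url)) does the same job.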
7. handle special targets (YouTube, Instagram, TikTok)
- input: URL domain
- if domain is youtube.com or youtu.be: call yt-dlp --dump-json url to extract title, duration, description, video_id
- if domain is instagram.com: call instaloader.Post.from_shortcode() (requires valid session env var INSTA_SESSION or guest mode with rate limiting)
- if domain is tiktok.com: use TikTokApi (third-party, requires careful rate limiting)
- output: metadata dict (title, date, author, duration, transcript if available)
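- for transcripts, call youtube_transcript_api.YouTubeTranscriptApi.get_transcript(video_id) if available (newer releases of the library replace this static method with an instance-level fetch call, so check the installed version)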
- if extraction succeeds, return result
- if fails, proceed to step 8
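
the domain routing can be sketched as below. pick_extractor and youtube_metadata are hypothetical names; the yt-dlp call shells out to the CLI with --dump-json as described in this step, and --no-playlist is added on the assumption that a single video is wanted.

```python
import json
import subprocess
from urllib.parse import urlsplit

SPECIAL_EXTRACTORS = {
    "youtube.com": "yt-dlp",
    "youtu.be": "yt-dlp",
    "instagram.com": "instaloader",
    "tiktok.com": "TikTokApi",
}


def pick_extractor(url):
    """Return the dedicated extractor name for a URL, or None if generic."""
    host = (urlsplit(url).hostname or "").lower()
    for domain, name in SPECIAL_EXTRACTORS.items():
        # Match the bare domain and its subdomains (www., m., ...).
        if host == domain or host.endswith("." + domain):
            return name
    return None


def youtube_metadata(url):
    """Run the yt-dlp CLI and keep the fields named in this step."""
    proc = subprocess.run(
        ["yt-dlp", "--dump-json", "--no-playlist", url],
        capture_output=True, text=True, check=True,
    )
    info = json.loads(proc.stdout)
    return {
        "title": info.get("title"),
        "duration": info.get("duration"),
        "description": info.get("description"),
        "video_id": info.get("id"),
    }
```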
8. fallback: return extraction failure and suggest manual alternatives
- input: none (previous strategies exhausted)
- output: error report with poison pills detected, URL, and recommended next steps
- suggest: (a) inspect site in browser dev tools to find undocumented API endpoint, (b) use Selenium with undetected-chromedriver for more aggressive bypass, (c) request data via official API if available, (d) contact site owner for data access agreement
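
taken together, the procedure above is a cascade that stops at the first strategy returning enough text. a minimal sketch, assuming each strategy is a callable that takes a URL and returns text or None; run_cascade is a hypothetical name, and swallowing every exception from a strategy is a simplification.

```python
import time

RECOMMENDATIONS = [
    "inspect site in browser dev tools to find undocumented API endpoint",
    "use Selenium with undetected-chromedriver",
    "request data via official API",
    "contact site owner for data access agreement",
]


def run_cascade(url, strategies, min_chars=100):
    """Try (name, callable) pairs in order; return the first usable result."""
    attempted = []
    started = time.monotonic()
    for name, strategy in strategies:
        attempted.append(name)
        try:
            text = strategy(url)
        except Exception:  # simplification: any error means "try the next one"
            text = None
        if text and len(text) > min_chars:
            return {
                "status": "success",
                "url": url,
                "strategy_used": name,
                "content": {"text": text},
                "extraction_time_ms": int((time.monotonic() - started) * 1000),
            }
    return {
        "status": "failure",
        "url": url,
        "strategies_attempted": attempted,
        "recommendations": RECOMMENDATIONS,
        "extraction_time_ms": int((time.monotonic() - started) * 1000),
    }
```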

decision points

if robots.txt forbids scraping and user has not explicitly overridden: stop and return error. user must add override flag --force-robots-bypass to proceed.
if poison pill detected (paywall, CAPTCHA, rate limit, Cloudflare): skip trafilatura and rotating user agent, jump directly to Playwright stealth mode (step 5). if Playwright still blocked, skip to special targets or fallback.
if trafilatura or requests succeeds with content > 100 chars: return immediately. do not continue to Playwright (saves time and resources).
if target is YouTube, Instagram, or TikTok: go straight to the dedicated extractors (yt-dlp, instaloader) in step 7 rather than running the generic strategies first. generic crawling of these sites is inefficient and may violate their ToS.
if Playwright times out after PLAYWRIGHT_TIMEOUT milliseconds: mark as timeout failure and fall back to step 7 or 8. do not retry Playwright (expensive).
if request raises ConnectionError, Timeout, or SSLError: retry up to max_retries times with exponential backoff (1s, 2s, 4s). if all retries fail, report network error and stop.
if a proxy returns a 429 response (exhausted): rotate to the next proxy in the list. if all proxies are exhausted, fall back to direct connection.
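
the retry and proxy rules can be sketched as follows. retry_with_backoff and proxy_candidates are hypothetical names. the built-in ConnectionError, TimeoutError, and ssl.SSLError stand in for the requests exception classes, which is an assumption made to keep the sketch dependency-free.

```python
import ssl
import time

# Stand-ins (assumption) for requests' ConnectionError, Timeout, SSLError.
RETRYABLE = (ConnectionError, TimeoutError, ssl.SSLError)


def retry_with_backoff(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn; on a retryable error wait 1s, 2s, 4s, ... then try again."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RETRYABLE:
            if attempt == max_retries:
                raise  # retries exhausted: surface the network error
            sleep(base_delay * 2 ** attempt)


def proxy_candidates(proxies):
    """Yield each proxy once, then None to signal a direct connection."""
    yield from proxies
    yield None
```

injecting sleep makes the delay schedule observable without actually waiting.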

output contract

success returns a dictionary with the following schema:

{
  "status": "success",
  "url": "https://example.com/article",
  "strategy_used": "trafilatura|requests|playwright_sync|playwright_async|yt-dlp|instaloader|fallback",
  "content": {
    "text": "extracted article text...",
    "title": "article title",
    "author": "author name or null",
    "date_published": "2024-01-15 or null",
    "word_count": 1250
  },
  "metadata": {
    "source_domain": "example.com",
    "http_status": 200,
    "charset": "utf-8",
    "poisons_detected": {"paywall": false, "rate_limit": false, "cloudflare": false, "captcha": false}
  },
  "extraction_time_ms": 2345
}

failure returns:

{
  "status": "failure",
  "url": "https://example.com/article",
  "error": "cloudflare block detected after Playwright stealth mode",
  "poisons_detected": {"cloudflare": true, "rate_limit": false},
  "strategies_attempted": ["trafilatura", "requests", "playwright_sync"],
  "recommendations": ["use undetected-chromedriver", "request data via official API"],
  "extraction_time_ms": 5600
}
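
the two payloads above could be assembled by small helpers like these. build_success and build_failure are hypothetical names, and counting words by whitespace split is an assumption about how word_count is derived.

```python
from urllib.parse import urlsplit


def build_success(url, strategy, text, title=None, author=None,
                  date_published=None, http_status=200, charset="utf-8",
                  poisons=None, elapsed_ms=0):
    """Assemble a success payload following the schema in this section."""
    return {
        "status": "success",
        "url": url,
        "strategy_used": strategy,
        "content": {
            "text": text,
            "title": title,
            "author": author,
            "date_published": date_published,
            "word_count": len(text.split()),  # assumption: whitespace split
        },
        "metadata": {
            "source_domain": urlsplit(url).hostname,
            "http_status": http_status,
            "charset": charset,
            "poisons_detected": poisons or {},
        },
        "extraction_time_ms": elapsed_ms,
    }


def build_failure(url, error, attempted, poisons=None,
                  recommendations=None, elapsed_ms=0):
    """Assemble a failure payload following the schema in this section."""
    return {
        "status": "failure",
        "url": url,
        "error": error,
        "poisons_detected": poisons or {},
        "strategies_attempted": list(attempted),
        "recommendations": list(recommendations or []),
        "extraction_time_ms": elapsed_ms,
    }
```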

outcome signal

success: skill returns dict with status "success" and text content > 100 chars. user sees extracted article/metadata in terminal or logs.
soft failure: one or more poison pills detected but extraction partially succeeded (e.g., title + author but no body text). status "partial_success" with available fields populated.
hard failure: all strategies exhausted, no content extracted. status "failure" with error message and fallback recommendations. user can inspect error dict to decide next steps (manual extraction, official API, retry with different settings).
timeout: extraction took longer than PLAYWRIGHT_TIMEOUT and was aborted. status "timeout" returned within 1s of timeout threshold.
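
the four outcomes can be told apart with a small classifier. classify_status is a hypothetical name, and this sketch decides partial success purely from which fields are populated, without checking whether a poison pill was detected.

```python
def classify_status(content, timed_out=False, min_chars=100):
    """Map extracted fields to one of the four status strings."""
    if timed_out:
        return "timeout"
    text = (content.get("text") or "").strip()
    if len(text) > min_chars:
        return "success"
    # Some metadata but no usable body text counts as a soft failure.
    if any(content.get(key) for key in ("title", "author", "date_published")):
        return "partial_success"
    return "failure"
```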