---
name: web_markdown_scraper
description: >
  Fetch one or more public webpages with Scrapling, extract the main content,
  and convert HTML into Markdown using html2text.
  Supports static HTTP, concurrent async, stealth anti-bot (Camoufox/Firefox),
  and dynamic Playwright Chromium fetching modes with production-grade automatch.
metadata: {"openclaw":{"emoji":"🕸️","requires":{"bins":["python3"]}}}
---
# Web Markdown Scraper
Use this skill when the user wants to:
- Scrape one or more public webpages (static or JavaScript-rendered)
- Convert HTML pages into clean Markdown
- Extract article/body text for summarization, analysis, or indexing
- Bypass anti-bot protections (Cloudflare, Datadome, etc.) via stealth mode
- Scrape many URLs concurrently (async mode)
- Track page elements reliably across website redesigns (automatch)
- Save the extracted results as `.md` files
## Fetcher Mode Selection Guide
| Mode | Fetcher Class | Best For |
|------|--------------|----------|
| `http` (default) | `Fetcher` | Fast static pages, RSS, APIs |
| `async` | `AsyncFetcher` | Batch of 5+ static URLs in parallel |
| `stealth` | `StealthyFetcher` | Anti-bot sites, Cloudflare, fingerprint checks |
| `dynamic` | `PlayWrightFetcher` | Heavy SPAs, React/Vue/Angular apps |
**Decision rule**: Start with `http`. If you get a 403 / CAPTCHA / empty body, switch
to `stealth`. If the content is rendered client-side (empty on first load), use `dynamic`.
Use `async` when scraping many static URLs at once to save time.
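
The decision rule above could be sketched as a small helper. This is only an illustration: `pick_mode` and its parameters are hypothetical names, not part of `scrape_to_markdown.py`, and the sketch assumes the caller has already tried `http` once and knows the status code, the body, and whether a CAPTCHA or client-side rendering was observed.

```python
from typing import Optional


def pick_mode(status: Optional[int], body: str, captcha_seen: bool = False,
              client_rendered: bool = False, batch_size: int = 1) -> str:
    """Suggest a --mode value from a first `http` attempt (hypothetical helper)."""
    # Content that only appears after JavaScript runs needs a real browser.
    if client_rendered:
        return "dynamic"
    # A 403, a CAPTCHA page, or an empty body points at anti-bot blocking.
    if status == 403 or captcha_seen or not body.strip():
        return "stealth"
    # Five or more static URLs: fetch them concurrently.
    if batch_size >= 5:
        return "async"
    return "http"
```

For example, `pick_mode(403, "")` suggests `stealth`, while a clean `200` response for a batch of eight static URLs suggests `async`.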
## Inputs
### URL sources
- `--url URL`: one target URL (repeat the flag for multiple: `--url A --url B`)
- `--url-file FILE`: plain text file with one URL per line
### Fetcher
- `--mode http|async|stealth|dynamic`: fetcher backend (default: `http`)
### Content extraction
- `--selector CSS`: CSS selector for the main content area (omit = full page)
- `--preserve-links`: keep hyperlinks in the Markdown output
- `--output-dir DIR`: save per-page `.md` files and a master `index.json` here
### AutoMatch: production resilience
- `--auto-save`: fingerprint and persist the selected elements to the local DB on first run
- `--auto-match`: on subsequent runs, find elements by fingerprint even if the site
  layout has changed (you do not need to update the CSS selector)
### Browser options (stealth / dynamic only)
- `--headless true|false|virtual`: headless mode; `virtual` uses Xvfb (default: `true`)
- `--network-idle`: wait until there has been no network activity for ≥500 ms before capturing
- `--block-images`: block image loading (saves bandwidth and proxy quota)
- `--disable-resources`: drop fonts/images/media/stylesheets for ~25% faster loads
- `--wait-selector CSS`: pause until this element appears in the DOM
- `--wait-selector-state attached|visible|detached|hidden`: element state (default: `attached`)
- `--timeout MS`: global timeout in ms (default: 30000)
- `--wait MS`: extra idle wait after page load in ms
### StealthyFetcher extras (stealth mode only)
- `--humanize SECONDS`: simulate human-like cursor movement (max duration in seconds)
- `--geoip`: spoof browser timezone, locale, language, and WebRTC IP from proxy geolocation
- `--block-webrtc`: prevent real-IP leaks via WebRTC
- `--disable-ads`: install uBlock Origin in the browser session
- `--proxy URL`: HTTP/SOCKS proxy as a URL string, or JSON:
  `'{"server":"host:port","username":"u","password":"p"}'`
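
The two accepted `--proxy` forms can be told apart with a few lines of Python. The sketch below is an assumption about how such a value might be handled, not the script's actual parsing code; `parse_proxy` is a hypothetical name, and requiring a `server` key is inferred from the JSON example above.

```python
import json
from typing import Dict, Union


def parse_proxy(value: str) -> Union[str, Dict[str, str]]:
    """Return a proxy URL string unchanged, or decode the JSON object form."""
    text = value.strip()
    if text.startswith("{"):
        proxy = json.loads(text)
        # Assumption: the JSON form must at least name the proxy server.
        if not isinstance(proxy, dict) or "server" not in proxy:
            raise ValueError("JSON proxy must be an object with a 'server' key")
        return proxy
    # Anything else is treated as a plain proxy URL.
    return text
```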
### Reliability
- `--retry N`: retry failed requests up to N times with exponential backoff (max 30 s)
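
To give a feel for what capped exponential backoff means, here is a minimal sketch of a delay schedule. Only the 30 s cap comes from the description above; the 1 s base delay, the doubling factor, and the name `backoff_delays` are assumptions made for illustration, and the script's real schedule may differ.

```python
from typing import List


def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Delay in seconds before each retry: base * 2**attempt, capped at `cap`."""
    # Assumed schedule: 1 s, 2 s, 4 s, ... never exceeding the 30 s cap.
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```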
## Rules
1. Only process public `http://` or `https://` pages.
2. Never bypass login walls, CAPTCHAs, paywalls, or access controls.
3. Prefer the main article or body content; avoid polluting the output with navigation,
   headers, footers, or cookie banners. Use `--selector` to target the content area.
4. When `--auto-save` is used, always also pass `--selector` so Scrapling knows which
element fingerprint to record.
5. On subsequent runs for layout-changed pages, use `--auto-match` instead of `--auto-save`.
Do not use both flags at once.
6. Use `--mode async` for batch jobs with 5+ static URLs for parallel execution.
7. Combine `--disable-resources` with `--block-images` in stealth/dynamic mode when
you only need text content β this can cut load times by up to 40%.
8. Always inspect the top-level `ok` field and per-result `ok` fields before using content.
9. If `ok` is `false`, report the exact `error` string β do not invent or guess content.
10. When `--network-idle` is insufficient, use `--wait-selector` for a specific DOM element
to guarantee the content has loaded before capture.
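
Rules 1, 4, and 5 are mechanical enough to check before invoking the script. The following is a sketch under assumptions: `check_request` is a hypothetical pre-flight helper, not something the script provides, and a scheme check cannot prove a page is actually public.

```python
from typing import List, Optional
from urllib.parse import urlparse


def check_request(urls: List[str], selector: Optional[str] = None,
                  auto_save: bool = False, auto_match: bool = False) -> List[str]:
    """Return a list of rule violations; an empty list means no problem was found."""
    problems = []
    for url in urls:
        parsed = urlparse(url)
        # Rule 1: only http:// or https:// URLs with a host.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"not an http(s) URL: {url}")
    # Rule 4: --auto-save needs --selector to know what to fingerprint.
    if auto_save and not selector:
        problems.append("--auto-save requires --selector")
    # Rule 5: the two automatch flags are mutually exclusive.
    if auto_save and auto_match:
        problems.append("--auto-save and --auto-match cannot be combined")
    return problems
```

Calling `check_request(["file:///etc/passwd"])` returns one violation, while a plain `https://` URL with no automatch flags returns an empty list.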
## Command Patterns
### Basic static page
```bash
python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>"
```
### Static page: target specific content area
```bash
python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>" --selector "article.main-content"
```
### Stealth mode: bypass anti-bot protection
```bash
python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>" --mode stealth --network-idle
```
### Stealth + proxy + human fingerprint (maximum stealth)
```bash
python3 "{baseDir}/scrape_to_markdown.py" \
--url "<URL>" \
--mode stealth \
--proxy "http://user:pass@host:port" \
--humanize 2.0 \
--geoip \
--block-webrtc \
--network-idle
```
### Dynamic SPA page (Playwright Chromium)
```bash
python3 "{baseDir}/scrape_to_markdown.py" \
--url "<URL>" \
--mode dynamic \
--wait-selector ".product-list" \
--network-idle \
--disable-resources
```
### Async concurrent batch (multiple URLs)
```bash
python3 "{baseDir}/scrape_to_markdown.py" \
--mode async \
--url "<URL1>" --url "<URL2>" --url "<URL3>"
```
### Batch from file + stealth + save to disk
```bash
python3 "{baseDir}/scrape_to_markdown.py" \
--url-file urls.txt \
--mode stealth \
--disable-resources \
--output-dir outputs
```
### First-run automatch setup (save fingerprint)
```bash
python3 "{baseDir}/scrape_to_markdown.py" \
--url "<URL>" \
--selector ".article-body" \
--auto-save \
--output-dir outputs
```
### Subsequent run after site layout change (adaptive match)
```bash
python3 "{baseDir}/scrape_to_markdown.py" \
--url "<URL>" \
--selector ".article-body" \
--auto-match \
--output-dir outputs
```
### Full production scrape
```bash
python3 "{baseDir}/scrape_to_markdown.py" \
--url "<URL>" \
--mode stealth \
--selector "main article" \
--auto-match \
--preserve-links \
--network-idle \
--disable-resources \
--timeout 60000 \
--retry 3 \
--output-dir outputs
```
## Output Handling
JSON is printed to stdout. Always check `ok` before using content.
**Top-level fields:**
- `ok`: `true` only if every URL succeeded
- `total` / `succeeded` / `failed`: count summary
- `results`: array of per-URL result objects
- `output_index_file`: path to the saved `index.json` (if `--output-dir` is used)
**Per-URL result fields (when `ok: true`):**
- `url`: the requested URL
- `status`: HTTP status code (e.g. `200`)
- `title`: page `<title>` text
- `markdown`: extracted content as Markdown; **use this as the main content**
- `markdown_length`: character count (useful for quality checks)
- `output_markdown_file`: path to the saved `.md` file (if `--output-dir` is used)
**On failure (`ok: false` in a result):**
- `error`: exact error message; report this verbatim, do not invent content
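
A consumer of this JSON might separate usable pages from failures as in the sketch below. It relies only on the fields listed above; `split_results` is a hypothetical helper, and because `url` is documented only for successful results, the sketch falls back to a placeholder when a failed result lacks it.

```python
import json
from typing import Dict, List, Tuple


def split_results(stdout_text: str) -> Tuple[Dict[str, str], List[str]]:
    """Parse the script's stdout into {url: markdown} plus a list of error reports."""
    payload = json.loads(stdout_text)
    pages: Dict[str, str] = {}
    errors: List[str] = []
    for result in payload.get("results", []):
        if result.get("ok"):
            # Only successful results carry trustworthy `markdown`.
            pages[result["url"]] = result["markdown"]
        else:
            # Pass the error string through verbatim; never substitute guessed content.
            errors.append(f"{result.get('url', '<unknown url>')}: {result.get('error')}")
    return pages, errors
```

When the top-level `ok` is `false`, the `errors` list is what should be reported back to the user, alongside whatever pages did succeed.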