Auto Scraping to CSV

Item: Auto Scraping to CSV
Rating: 7.2
Author: Implexa

SkillRank score: 7.2/10

evaluated by implexa, claude-haiku-4-5 · 2026-05-26

web scraping skill using text-based dom manipulation and agent-driven clarification to extract structured data to csv. handles pagination, infinite scroll, popups, and lazy loading without external llm calls.

structure: 8.0
trigger phrases: 8.0
procedure: 7.0
edge cases: 7.0
documentation: 7.0

---
name: auto-scraping-to-csv
description: |
  Scrape any webpage using text-based DOM manipulation and export structured data to CSV.
  The agent handles complex page nuances — infinite scroll, pagination, popups, lazy loading —
  and asks clarifying questions when data is ambiguous. No external LLM needed.
version: 1.0.0
metadata:
  openclaw:
    requires:
      bins:
        - node
    install:
      - kind: node
        package: playwright
        bins: [playwright]
    homepage: https://github.com/Science-Prof-Robot/autoclick
    emoji: "🧹"
---

# Auto Scraping to CSV — Agent-Driven Web Scraping

Scrape any webpage using text-based DOM manipulation and export structured data to CSV. The agent handles complex page nuances — infinite scroll, pagination, popups, lazy loading — and asks clarifying questions when data is ambiguous. No external LLM required.

## Philosophy: Let the Agent Figure It Out

Traditional scraping requires you to inspect HTML, write CSS selectors, handle edge cases, and debug when the site changes. This skill flips that:

**You say what you want. The agent handles the how.**

```
You: "Scrape product catalog"
Agent: "I see 50 products across 5 pages with infinite scroll. 
        I also see 'Price', 'Sale Price', and 'Member Price' columns.
        Which price field should I extract?"
You: "Sale Price"
Agent: [handles scrolling, pagination, extracts 50 rows] → products.csv
```

The agent will:
1. **Explore** the page structure via text-based DOM
2. **Detect complexity** — scroll, pagination, tabs, filters
3. **Ask questions** when ambiguous (multiple price fields, missing data, format choices)
4. **Handle edge cases** — dismiss popups, wait for lazy loading, retry on errors
5. **Export CSV** — clean, structured, ready to use

## When to Use

- **Product catalogs**: "Scrape all laptops with prices and ratings"
- **News/articles**: "Get latest blog posts with titles, dates, authors"
- **Directory listings**: "Extract all company names, emails, and websites"
- **Table data**: "Get the pricing table from this SaaS page"
- **Real estate**: "Scrape listings with price, beds, baths, square footage"
- **Job boards**: "Get job titles, companies, locations, and salary ranges"
- **Social feeds**: "Extract posts with engagement counts (likes, comments, shares)"
- **Research data**: "Get citation counts, authors, publication dates from this index"

## How It Works

```
Claude (Host Model)
    ↕  HTTP
Bridge Server (Node.js + Playwright)
    ↕  page.evaluate()
Browser (Chromium) ← Page-Agent injected
```

1. **Bridge** launches a local Chromium browser via Playwright
2. **Page-Agent** is injected into the target page as an IIFE script loaded from a CDN
3. **Page-Agent** indexes the DOM and generates a simplified text representation:
   ```
   [5]<a >Widget Pro Laptop /></a>
   [12]<div >$1,299.99 /></div>
   [18]<div >4.5 stars (128 reviews) /></div>
   ```
4. **Claude** receives the text state, interprets the page structure, and decides on the next action
5. **Agent asks questions** when data is ambiguous or complex
6. **Loop** continues until all data is extracted or the user says stop
7. **CSV export** converts structured JSON to CSV
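
The indexed text format from step 3 can be parsed with a few lines of code. This is a minimal sketch, assuming every element line has the `[index]<tag >text /></tag>` shape of the three sample lines above; `parse_dom_line` is a hypothetical helper, not part of Page-Agent.

```python
import re

# Matches lines like: [12]<div >$1,299.99 /></div>
# Assumption: the shape is inferred only from the sample lines in this document.
LINE_RE = re.compile(r'^\s*\[(\d+)\]<(\w+)[^>]*>(.*?)\s*/></\2>\s*$')

def parse_dom_line(line):
    """Return (index, tag, text) for one element line, or None if it does not match."""
    m = LINE_RE.match(line)
    if not m:
        return None
    return int(m.group(1)), m.group(2), m.group(3).strip()

sample = """\
   [5]<a >Widget Pro Laptop /></a>
   [12]<div >$1,299.99 /></div>
   [18]<div >4.5 stars (128 reviews) /></div>
"""
# Keep only the lines that parsed successfully.
elements = [e for e in map(parse_dom_line, sample.splitlines()) if e]
```

The element index is what an action such as a click would refer to, so keeping it as an integer next to the text makes it easy to map extracted values back to page elements.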

---

## First-Time Setup

### 1. Install Playwright

```bash
npm install -D playwright
npx playwright install chromium
```

### 2. Place the Bridge Script

```bash
# Copy from this skill folder to your agents directory (created if missing)
mkdir -p .claude/agents
cp .claude/skills/auto-scraping-to-csv/page-agent-bridge.mjs .claude/agents/
```

### 3. Start the Bridge

```bash
node .claude/agents/page-agent-bridge.mjs
```

Default port: **9876**.

### 4. Verify Health

```bash
curl http://localhost:9876/health
```
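
The same check can be scripted so that a caller waits for the bridge instead of failing on the first refused connection. A minimal sketch, assuming the `/health` endpoint answers with HTTP 200 once the bridge is ready; `http_ok` and `wait_for_bridge` are hypothetical helper names, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request

def http_ok(url, timeout=2.0):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False

def wait_for_bridge(url="http://localhost:9876/health", attempts=10, delay=0.5, probe=http_ok):
    """Poll the health endpoint until it answers or the attempts run out."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False
```

Passing the probe as a parameter keeps the polling logic testable without a running bridge.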

---

## Agent-Driven Scraping Workflow

### Step 1: Request What You Want

```
/scrape-to-csv https://example.com/products
  "Extract all products with name, price, rating, and number of reviews"
```

### Step 2: Agent Explores and Asks Questions

The agent will fetch the DOM state and may ask:

```
Agent: "I found 24 products on the current page. The site has infinite scroll 
        and I can see more products loading as I scroll. 
        How many products do you want total? (type 'all' for everything)"
You: "50"

Agent: "I see three price fields:
        - Original Price: $1,299
        - Sale Price: $999
        - Member Price: $899
        Which one should I extract?"
You: "Sale Price"

Agent: "Some products are out of stock and show 'N/A' for price.
        Should I:
        A) Include them with 'N/A'
        B) Skip those rows
        C) Mark as 'Out of Stock' in a separate column"
You: "C"
```

### Step 3: Agent Handles Complexity

The agent automatically:
- Scrolls to load more content
- Clicks pagination buttons
- Waits for lazy-loaded images/data
- Dismisses cookie banners and popups
- Retries if elements are slow to appear
- Stops gracefully if the site blocks scraping

### Step 4: Preview and Confirm

```
Agent: "Extracted 50 products. Here's a preview:

| name | price | rating | reviews | stock_status |
|------|-------|--------|---------|-------------|
| Widget Pro | $999 | 4.5 | 128 | In Stock |
| Gadget Mini | $499 | 4.2 | 89 | In Stock |
| ... | ... | ... | ... | ... |

Save as 'products.csv'? (yes/no/change columns)"
You: "yes"
```
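
A preview like the one above can be produced from the collected rows with a small formatter. This is only a sketch: `preview_table` is a hypothetical helper, and it assumes every row is a dict with the same keys as the first row and that values contain no `|` characters.

```python
def preview_table(rows, limit=3):
    """Render the first `limit` rows as a markdown table, adding a '...' row if truncated."""
    if not rows:
        return "(no rows)"
    columns = list(rows[0].keys())
    lines = [
        "| " + " | ".join(columns) + " |",
        "|" + "|".join("---" for _ in columns) + "|",
    ]
    for row in rows[:limit]:
        lines.append("| " + " | ".join(str(row.get(c, "")) for c in columns) + " |")
    if len(rows) > limit:
        # Signal that more rows exist than are shown.
        lines.append("| " + " | ".join("..." for _ in columns) + " |")
    return "\n".join(lines)
```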

### Step 5: CSV Delivered

```
Agent: "Saved to ./products.csv — 50 rows, 5 columns"
```

---

## Handling Complex Scenarios

### Infinite Scroll

```
You: "Scrape all posts from this feed"
Agent: "The page loads 10 posts at a time via infinite scroll.
        I've scrolled 3 times and found 30 posts so far.
        Should I continue scrolling? (yes/no/scroll N more times)"
You: "scroll 5 more times"
Agent: [scrolls] → "Found 80 total posts. Saved to feed.csv"
```
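
The stopping logic behind this exchange can be sketched as a loop that ends when a target count is reached or when several scrolls in a row load nothing new. Everything here is illustrative: `scroll_once` is a hypothetical callable standing in for a bridge scroll action plus an item count, and the idle threshold of 3 and the cap of 200 scrolls are assumed defaults.

```python
def scroll_until_done(scroll_once, target=None, max_idle=3, max_scrolls=200):
    """Scroll until `target` items are loaded or nothing new appears.

    scroll_once: hypothetical callable that scrolls the page once and returns
                 the total number of items currently loaded.
    target:      desired item count, or None for "all".
    max_idle:    consecutive scrolls without new items before giving up.
    """
    count, idle = 0, 0
    for _ in range(max_scrolls):
        if target is not None and count >= target:
            break
        new_count = scroll_once()
        # Reset the idle counter whenever a scroll loads something new.
        idle = idle + 1 if new_count <= count else 0
        count = max(count, new_count)
        if idle >= max_idle:
            break
    return count if target is None else min(count, target)
```

The hard cap on scrolls guards against feeds that never stop producing content.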

### Pagination

```
You: "Get all products from this catalog"
Agent: "I see pagination: Page 1 of 12. 
        I can scrape all 12 pages (may take 2-3 minutes) or a subset.
        How many pages?"
You: "all"
Agent: [clicks through 12 pages] → "360 products saved to catalog.csv"
```
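
Pagination follows the same pattern with a page limit instead of an item count. A minimal sketch, where `extract_rows` and `go_next` are hypothetical callables standing in for the extraction and "click next" actions:

```python
def scrape_pages(extract_rows, go_next, max_pages=None):
    """Collect rows page by page.

    extract_rows: hypothetical callable returning the rows on the current page.
    go_next:      hypothetical callable that clicks "next"; returns False on the last page.
    max_pages:    page limit chosen by the user, or None for all pages.
    """
    rows, page = [], 0
    while True:
        rows.extend(extract_rows())
        page += 1
        if max_pages is not None and page >= max_pages:
            break
        if not go_next():
            break
    return rows
```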

### Login Required

```
You: "Scrape my order history"
Agent: "The page redirects to a login screen. 
        I need credentials to proceed.
        Please provide:
        - Username/email
        - Password
        Or type 'skip' to abort."
You: "skip"
Agent: "Aborted. No data extracted."
```

### Multiple Data Formats

```
You: "Get prices from this page"
Agent: "I found prices in multiple formats:
        - '$19.99' (US format)
        - '€17.99' (EU format)
        - '¥2,980' (JP format)
        Which format should I standardize to?"
You: "Keep as-is, add a 'currency' column"
Agent: [extracts with currency detection] → prices.csv
```
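
Keeping the price text as-is while adding a `currency` column could look like the sketch below. The symbol-to-code map is an assumption covering only the three formats in the example; `$` and `¥` are ambiguous in practice (several countries use them), so a real run would need confirmation from the page or the user.

```python
# Assumption: only these three symbols appear; extend the map for other sites.
SYMBOLS = {"$": "USD", "€": "EUR", "¥": "JPY"}

def add_currency_column(rows, field="price"):
    """Return copies of rows with a 'currency' column; the price text is left unchanged."""
    out = []
    for row in rows:
        text = str(row.get(field, ""))
        # First matching symbol wins; unknown formats get an empty currency.
        code = next((c for sym, c in SYMBOLS.items() if sym in text), "")
        out.append({**row, "currency": code})
    return out
```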

### Missing / Inconsistent Data

```
You: "Extract employee data from this directory"
Agent: "I see 100 employee cards. Some are missing:
        - 12 missing phone numbers
        - 8 missing job titles
        - 3 missing photos
        How should I handle missing fields?
        A) Leave blank
        B) Fill with 'N/A'
        C) Skip those rows entirely"
You: "B"
Agent: [extracts 100 rows with 'N/A' for missing fields] → employees.csv
```
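
The three choices map onto a small post-processing step over the extracted rows. A sketch under the assumption that rows are dicts and that a field counts as missing when it is absent, `None`, or blank; `apply_missing_policy` is a hypothetical helper.

```python
def apply_missing_policy(rows, columns, policy="B"):
    """Apply choice A (leave blank), B (fill with 'N/A'), or C (skip incomplete rows)."""
    if policy not in ("A", "B", "C"):
        raise ValueError(f"unknown policy: {policy!r}")

    def is_missing(value):
        return value is None or str(value).strip() == ""

    filler = "N/A" if policy == "B" else ""
    result = []
    for row in rows:
        missing = [c for c in columns if is_missing(row.get(c))]
        if missing and policy == "C":
            continue  # drop rows with any missing field
        result.append({c: (filler if c in missing else row[c]) for c in columns})
    return result
```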

---

## Natural Language Commands

### `/scrape-to-csv <url> <description>`

General scraping with CSV export.

```
/scrape-to-csv https://news.ycombinator.com
  "Get top 30 stories with title, URL, points, and comment counts"

/scrape-to-csv https://www.anthropic.com/news
  "Latest blog posts: title, date, category, URL"

/scrape-to-csv https://example.com/realestate
  "Listings: address, price, beds, baths, sqft, listing agent"
```

### `/scrape-table <url> <selector_or_description>`

Extract a specific table.

```
/scrape-table https://example.com/pricing
  "The comparison table with Basic/Pro/Enterprise columns"

/scrape-table https://example.com/sales
  "Q4 2024 revenue breakdown table"
```

### `/scrape-news <url>`

Optimized for news/blog scraping.

```
/scrape-news https://techcrunch.com
  "Latest 20 articles: title, author, date, excerpt, URL"

/scrape-news https://blog.openai.com
  "All posts from 2024: title, date, tags, URL"
```

### `/scrape-products <url>`

Optimized for e-commerce.

```
/scrape-products https://amazon.com/s?k=laptops
  "Laptops: name, brand, price, rating, prime eligible, URL"

/scrape-products https://shopify-store.com/collections/all
  "All products: name, price, compare-at price, availability"
```

---

## Output Format

The agent produces a structured markdown report:

```markdown
## Scraping Report — example.com/products
**Session:** a1b2c3d4 | **Duration:** 2m 14s | **Rows:** 50

### Task
Extract all products with name, price, rating, and number of reviews

### Agent Decisions
- **Pagination**: Detected infinite scroll, scrolled 5 times
- **Price field**: Chose "Sale Price" per user request
- **Missing data**: Marked out-of-stock items in a separate `stock_status` column per user request
- **Columns**: name, sale_price, rating, review_count, stock_status

### Sample Data

| name | sale_price | rating | review_count | stock_status |
|------|-----------|--------|-------------|-------------|
| Widget Pro | $999 | 4.5 | 128 | In Stock |
| Gadget Mini | $499 | 4.2 | 89 | In Stock |
| Super Gizmo | $1,299 | 4.8 | 256 | Out of Stock |

### File
`./products.csv` — 50 rows, 5 columns
```

---

## CSV Conversion Options

### Option A — Python (recommended)

```python
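import csv
import json
import re

# Bridge returns: "✅ Executed JavaScript. Result: [{...}, {...}]"
msg = """PASTE_BRIDGE_RESPONSE_HERE"""

# re.DOTALL lets the match span pretty-printed (multi-line) JSON
match = re.search(r'Result: (\[.*\])', msg, re.DOTALL)
data = json.loads(match.group(1)) if match else []
if data:
    # utf-8 keeps non-ASCII text (€, ¥, accented names) intact on every platform
    with open('output.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(data[0].keys()))
        writer.writeheader()
        writer.writerows(data)
    print(f"Wrote {len(data)} rows to output.csv")
else:
    print("No rows found in the bridge response")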
```

### Option B — Node.js

```javascript
const fs = require('fs');
const data = JSON.parse(fs.readFileSync('data.json', 'utf8'));
const headers = Object.keys(data[0]);
// String() handles numbers and booleans; ?? keeps 0 and false instead of blanking them
const quote = (value) => `"${String(value ?? '').replace(/"/g, '""')}"`;
const csv = [
  headers.map(quote).join(','),
  ...data.map(row => headers.map(h => quote(row[h])).join(','))
].join('\n');
fs.writeFileSync('output.csv', csv);
```

---

## Troubleshooting

### Bridge won't start

```
Error: Cannot find module 'playwright'
```
**Fix:** `npm install -D playwright && npx playwright install chromium`

### Site blocks scraping

**Agent detects:** "The site returned 403 Forbidden. This may be bot protection."
**Options:**
- Try `headless: false` (looks more like a real user)
- Add delays between requests
- Use a different user agent

### Page loads but no data found

**Agent detects:** "The page loaded but I see mostly navigation elements. Content may be behind a login or loaded dynamically."
**Agent asks:** "Should I wait longer, scroll down, or do you have login credentials?"

### Data looks wrong

**Agent detects:** "Prices show as 'NaN' or empty. The site may use JavaScript to render prices."
**Agent asks:** "Should I try executing JavaScript to extract the real values, or skip this field?"

---

## Comparison with Other Tools

| Tool | Setup | Selectors | Complex Pages | Agent Questions | Best For |
|------|-------|-----------|---------------|-----------------|----------|
| **Auto Scraping to CSV** | npm install | None needed | Handles automatically | Yes, clarifies ambiguity | One-off extraction, exploratory scraping |
| BeautifulSoup | pip install | Required | Manual handling | No | Known structure, repeated scraping |
| Scrapy | Project setup | Required | Middleware needed | No | Large-scale crawling, pipelines |
| Playwright E2E | npm install | Required | Manual handling | No | Testing, automation |
| Browser-Use | API key | None | Partial | Limited | Multi-page research |

**Use this skill when:**
- You want to scrape without writing selectors
- The page structure is complex or unknown
- You need the agent to handle edge cases (scroll, popups, pagination)
- You want clarifying questions when data is ambiguous
- You need quick one-off extraction to CSV

---

## Bridge API Reference

### `POST /sessions`
Launch a new browser session.

**Body:**
```json
{ "url": "https://example.com", "headless": false }
```

**Response:** `{ "id": "abc123", "url": "https://example.com" }`

### `GET /sessions/:id/state`
Get text-based DOM state.

**Response:** `{ url, title, header, content, footer }`

### `POST /sessions/:id/act`
Execute an action.

**Body:**
```json
{ "action": "executeJavascript", "params": { "script": "return document.title;" } }
```

### `DELETE /sessions/:id`
Close session.

### `POST /shutdown`
Stop bridge.
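
The endpoints above can be driven from any HTTP client. The following is a minimal sketch using only the Python standard library. It assumes the bridge accepts JSON bodies with a `Content-Type: application/json` header and answers with JSON; the shape of the `/act` reply is not documented here, so it is returned as-is. `build_request`, `call`, and `scrape_title` are hypothetical helper names.

```python
import json
import urllib.request

BRIDGE = "http://localhost:9876"  # default port from the setup section

def build_request(method, path, body=None, base=BRIDGE):
    """Build (but do not send) a JSON request for the bridge."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(base + path, data=data, headers=headers, method=method)

def call(request, timeout=30):
    """Send a request and decode the JSON reply (assumes the bridge answers with JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

def scrape_title(url):
    """Open a session, run one script, and always close the session afterwards."""
    session = call(build_request("POST", "/sessions", {"url": url, "headless": True}))
    sid = session["id"]
    try:
        return call(build_request(
            "POST", f"/sessions/{sid}/act",
            {"action": "executeJavascript", "params": {"script": "return document.title;"}},
        ))
    finally:
        # Assumption: the DELETE reply body is not needed, so it is discarded.
        urllib.request.urlopen(build_request("DELETE", f"/sessions/{sid}"), timeout=30).close()
```

Closing the session in a `finally` block avoids leaving a Chromium instance running when an action fails.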

---

*Skill: auto-scraping-to-csv v1.0.0 | Bridge: page-agent-bridge.mjs | Powered by Alibaba Page-Agent + Playwright*

Auto Scraping to CSV

intent

scrape any webpage without writing selectors. the agent explores the DOM as text, detects pagination/scroll/popups/lazy loading, asks clarifying questions when data is ambiguous (which price field, how many rows, how to handle missing data), then exports clean structured CSV. use this when you need one-off extraction from sites with unknown or complex structure, and you want the agent to figure out the how instead of you debugging HTML and CSS selectors.

inputs

url (required): the webpage to scrape. http or https.
extraction task (required): natural language description of what to extract. example: "get all products with name, price, rating, number of reviews".
bridge server (required): local node.js process running page-agent-bridge.mjs on port 9876. see setup section below.
- env var: PAGE_AGENT_BRIDGE_URL (default: http://localhost:9876)
- if bridge is down, the skill fails immediately with connection error
playwright (required): npm package must be installed and chromium binary cached locally
- setup: npm install -D playwright && npx playwright install chromium
- if missing, bridge startup fails
browser launch options (optional): headless mode (default true), user agent override, viewport size, timeout in ms
- headless false looks more human, slower, useful if site has bot detection
- timeouts default to 30000ms per page action; increase if site is slow
clarification responses (optional): when agent asks questions, user provides strings like "50", "sale price", "C", "all", "skip", etc.

procedure

1. setup bridge (one-time)
   - input: playwright installed, page-agent-bridge.mjs file present in .claude/agents/
   - run: node .claude/agents/page-agent-bridge.mjs
   - output: bridge listens on http://localhost:9876, logs "✅ bridge ready"
   - wait for health check: curl http://localhost:9876/health returns 200
2. initiate scraping request
   - input: url, extraction task description
   - call: POST /sessions with url and headless option
   - output: session id (e.g. "abc123"), browser window opens (or runs headless)
   - if bridge is unreachable, fail with "bridge connection refused"
   - if url is malformed, bridge returns 400 bad request
3. fetch initial DOM state
   - input: session id from step 2
   - call: GET /sessions/:id/state
   - output: text-based DOM representation with indexed elements, page title, visible text blocks
   - if page takes >30s to load, timeout triggers and agent must decide to retry or abort
   - if page shows login screen, state contains redirect url or login form
4. agent analyzes state and asks clarifying question (if needed)
   - input: DOM state, extraction task, page structure detected (pagination, scroll, tabs, missing data)
   - agent logic: does the page have multiple valid interpretations (e.g. three price fields, infinite vs paginated scroll, inconsistent data format)?
   - output: if ambiguous, agent sends question to user ("which price?", "how many rows total?", "skip out-of-stock or mark N/A?")
   - if unambiguous (e.g. single table with clear headers), agent skips to step 6
   - if data is behind login, agent asks for credentials or offers abort
5. user responds to clarification (if asked)
   - input: user typed string or selection (e.g. "sale price", "50", "C", "all", "skip")
   - agent records response
   - output: confirmation that agent understood response
6. agent executes scraping actions in loop until done
   - input: DOM state, clarifications from user
   - for each iteration:
     - a. decide action: scroll down, click next button, wait for lazy images, dismiss popup, or extract data
     - b. call POST /sessions/:id/act with action type and params
     - c. output: result of action (element clicked, pixels scrolled, javascript returned value, data extracted)
     - d. fetch updated DOM state with GET /sessions/:id/state
     - e. check if scraping complete (reached end of list, clicked all pagination buttons, user said stop, or 5 consecutive "no new data" states)
     - f. if not done and action failed (timeout, element not found), retry up to 2 times with longer wait, then skip that action
   - loop continues until termination condition met
7. preview extracted data
   - input: collected rows as json array of objects
   - format as markdown table with 3-5 sample rows
   - show column names, data types, row count, missing fields summary
   - output: preview text sent to user
8. user confirms save or requests changes
   - input: user says "yes", "no", or "change columns"
   - if no: back to step 4 (ask for more clarification)
   - if change columns: agent shows available columns, user selects subset or reorders
   - if yes: proceed to step 9
9. convert json to CSV and write file
   - input: json array of objects, filename (default: infer from domain or task, e.g. "products.csv", "articles.csv")
   - csv format: headers from object keys, quoted values with escaped quotes, one row per line
   - output: file written to ./filename.csv
   - if file already exists, prompt user to overwrite or rename
   - if write fails (permission denied), error message shown
10. close session and report
    - input: session id
    - call: DELETE /sessions/:id
    - output: browser window closed, report generated with row count, column count, duration, agent decisions logged
    - agent confirms success: "saved to ./products.csv - 50 rows, 5 columns"

decision points

if bridge is down or unreachable: fail immediately with message "bridge server not found at http://localhost:9876. start it with: node .claude/agents/page-agent-bridge.mjs". do not retry.
if page requires login: agent detects redirect to login or login form visible. ask user: "this page requires login. do you have credentials? (provide username/password or type 'skip')". if skip, abort and close session. if credentials provided, agent attempts login via form fill and submit, then waits 3s for page to load.
if page blocks scraping (403, 429, rate limit detected): agent detects http error or "too many requests" message. ask user: "site may have bot protection. try again with longer delays or headless:false? (yes/no/abort)". if yes, re-launch session with headless false and 2s delay between actions. if no or abort, close session.
if data structure is ambiguous (multiple price fields, multiple tables, inconsistent row format): agent asks user which field to extract or how to handle inconsistency. wait for user response before proceeding.
if page uses infinite scroll: agent detects no pagination buttons but content loads on scroll. ask user: "infinite scroll detected. how many rows total? (number/'all'/'stop when no new items load'/cancel)". if "all", keep scrolling until no new items load for 3 consecutive scrolls. if number, scroll until count reached or end of content.
if page uses pagination: agent detects next button or page selector. ask user: "pagination detected, N pages total. scrape all pages? (yes/no/first N pages)". if yes, click next repeatedly until last page. if no or number, stop at user limit.
if data contains missing fields (N/A, blank, null across rows): agent detects inconsistency. ask user: "some rows missing X field. (a) leave blank, (b) fill with 'N/A', (c) skip those rows". apply choice to all rows.
if data format is inconsistent (e.g. prices as "$19.99" and "€17.99" and "¥2,980"): agent detects multiple currencies or formats. ask user: "prices in multiple formats. (a) keep as-is and add currency column, (b) convert all to USD, (c) extract number only". apply choice.
if action times out (wait >30s with no response): agent retries up to 2 times with +10s timeout each. if still timeout after 2 retries, skip that action and log warning, continue with next action. if critical action (page load) times out 3 times, ask user "page slow or unreachable, continue waiting or abort?" if abort, close session.
if extracted data is empty (0 rows found): agent detected page loaded but no matching elements found. ask user: "no data matched the extraction criteria. is the site structure different than expected? (a) scroll down and try again, (b) show me what you found (preview), (c) abort". if (b), show text dump of visible page content.
if file write fails: ask user "failed to write to ./filename.csv (permission denied / disk full). (a) try different filename, (b) abort". if (a), prompt for new filename and retry.
if user requests changes to columns after preview: show list of available columns from extracted data. user selects subset (e.g. "name, price, rating") and agent re-formats csv and writes new file.
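
the timeout rule above (30s base, up to 2 retries, +10s each, then skip with a warning) can be sketched as a small wrapper. `action` is a hypothetical callable that takes a timeout in seconds and raises TimeoutError when it expires; how the real bridge reports timeouts is an assumption.

```python
def run_with_retries(action, base_timeout=30, retries=2, step=10):
    """Call action(timeout), retrying on TimeoutError with a longer timeout each time.

    Returns (result, None) on success, or (None, warning) once the retries are
    exhausted, so the caller can log the warning and continue with the next action.
    """
    timeout = base_timeout
    for _ in range(retries + 1):
        try:
            return action(timeout), None
        except TimeoutError:
            timeout += step  # wait longer on the next attempt
    return None, f"skipped after {retries + 1} attempts (last timeout {timeout - step}s)"
```

returning a warning instead of raising matches the "skip that action and log warning" behavior; a critical action such as page load would check the warning and ask the user instead.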

output contract

file format: ./output_filename.csv, standard csv per RFC 4180 (comma-delimited, quoted fields, escaped quotes as "", utf-8 encoding)
filename: inferred from domain/task (e.g. "products.csv", "articles.csv", "employees.csv") or user-specified
content: headers from object keys, one row per extracted item, no null values (filled with blank string or "N/A" per user choice)
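
a writer that satisfies this contract can be sketched with python's csv module. `to_csv_text` and `write_csv` are hypothetical names; the sketch assumes all rows share the keys of the first row and that CRLF line endings (as in RFC 4180) are wanted.

```python
import csv
import io

def to_csv_text(rows, missing=""):
    """Serialize rows to CSV text: every field quoted, embedded quotes doubled."""
    if not rows:
        return ""
    columns = list(rows[0].keys())
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\r\n")
    writer.writerow(columns)
    for row in rows:
        # Replace nulls with the user's choice: blank string or "N/A".
        writer.writerow([missing if row.get(c) is None else row[c] for c in columns])
    return buffer.getvalue()

def write_csv(rows, path, missing=""):
    # newline="" stops Python from translating the CRLF line endings a second time
    with open(path, "w", newline="", encoding="utf-8") as f:
        f.write(to_csv_text(rows, missing))
```

for example, `write_csv(rows, "products.csv", missing="N/A")` would apply the "fill with 'N/A'" choice at write time.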