web-scraper

Item: web-scraper
Rating: 7.8
Author: Implexa

evaluated by implexa, claude-haiku-4-5 · 2026-05-26

datalens-web-scraper extracts structured data from webpages via chrome extension and mcp tools. supports table detection, column analysis, nested content expansion, job control, and multiple export formats. requires npm install and chrome setup.

structure: 9.0
trigger phrases: 8.0
procedure: 9.0
edge cases: 6.0
documentation: 8.0

---
name: datalens
description: Use DataLens MCP tools to scrape structured data from any website open in Chrome. Triggers when the user wants to extract lists, tables, comments, products, reviews, or any repeating data from a webpage, or wants to manage active scraping jobs.
---

# DataLens Scraping Skill

## How Tool Calls Work

Every DataLens tool is invoked by running a terminal command. No MCP client configuration is required.

The `datalens-mcp-call` binary handles the MCP stdio handshake and returns the tool result as YAML/JSON to stdout.

```
run_in_terminal: datalens-mcp-call <tool_name> '<args_json>'
```

If `datalens-mcp-call` is not on PATH (e.g. not globally installed), use npx:

```
run_in_terminal: npx datalens-mcp-call <tool_name> '<args_json>'
```
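
The PATH check described above can be folded into a small shell function so that later commands do not need to care which form is available. This is a minimal sketch and not part of DataLens: the function name `resolve_datalens_cmd` is hypothetical, and it assumes `npx` can resolve `datalens-mcp-call` as stated above.

```bash
# Hypothetical helper (not shipped with DataLens): print the command prefix to use.
resolve_datalens_cmd() {
  if command -v datalens-mcp-call >/dev/null 2>&1; then
    # The binary is on PATH, e.g. after a global install.
    echo "datalens-mcp-call"
  else
    # Otherwise fall back to the npx form.
    echo "npx datalens-mcp-call"
  fi
}

# Usage (commented out so that sourcing this snippet has no side effects):
# $(resolve_datalens_cmd) browser_list_tabs
```

The function only prints a prefix rather than running the tool, which keeps it easy to check in isolation before wiring it into a workflow.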

## Prerequisites

1. `datalens-mcp-server` npm package installed: `npm install -g datalens-mcp-server` (or use `npx`).
2. DataLens Chrome extension installed and active in Chrome.
3. Chrome open with the target page loaded (or provide `url` in the tool args — the extension will open it).
4. Node.js ≥ 18 available in the terminal.

---

## How This Works

`datalens-mcp-call` spawns the DataLens MCP proxy as a child process, performs the MCP initialization handshake over stdio, calls the requested tool, and prints the result.

```
AI Agent
  ↓ run_in_terminal
datalens-mcp-call <tool> <args>
  ↓ stdio JSON-RPC
DataLens MCP Proxy (datalens-mcp-proxy)
  ↓ WebSocket (localhost:17373)
Chrome Extension
  ↓
Browser Tab
```

---

## Standard Scraping Workflow

Follow these steps in order. Steps marked optional may be skipped, but never call `scrape_start` before `scrape_analyze_columns` completes.

### Step 1 — Detect tables

```bash
datalens-mcp-call scrape_detect_tables '{"url":"https://example.com","prompt":"article list"}'
```

Returns a list of detected table structures, each with `rootSelector`, `itemSelector`, and `documentInfoPath`. Pick the best-matching table and copy those three values for subsequent steps.

If the page requires login, ask the user to log in in Chrome first, then re-run this command.

### Step 2 (optional) — Inspect tree for expand buttons

```bash
datalens-mcp-call scrape_get_table_tree '{"rootSelector":"<from step 1>","itemSelector":"<from step 1>","documentInfoPath":"<from step 1>"}'
```

Use when the data has nested replies, collapsed rows, or "load more" buttons. Inspect the `_uid`-annotated tree in the output to identify expand button UIDs.

### Step 2b (optional) — Expand and re-detect

```bash
datalens-mcp-call scrape_click_expand_and_redetect '{"rootSelector":"...","itemSelector":"...","documentInfoPath":"...","expandButtonUids":[{"type":"reply","uids":["uid1","uid2"]}]}'
```

The extension clicks the buttons, waits for new content, then re-detects. Use the updated `rootSelector`/`itemSelector`/`documentInfoPath` from this output in Step 3.

### Step 3 — Analyze columns

```bash
datalens-mcp-call scrape_analyze_columns '{"rootSelector":"...","itemSelector":"...","documentInfoPath":"...","url":"https://example.com","prompt":"article list"}'
```

Calls the backend AI to identify fields, data types, and pagination. Returns a `scraperConfig` and `jobDraft`. Confirm the field list looks correct before proceeding.

### Step 4 — Start scraping

```bash
# Pass the jobDraft object returned by scrape_analyze_columns
datalens-mcp-call scrape_start '{"jobDraft":<paste jobDraft here>,"maxRecords":10}'
```

Returns a `jobId`. Use `maxRecords: 10` for a preview run first.

### Step 5 — Poll for status

```bash
datalens-mcp-call scrape_status '{"jobId":"<jobId>","waitMs":3000}'
```

Re-run until `status` is `COMPLETED`, `FAILED`, or `STOPPED`.

Key status fields:

- `status`: `QUEUED` → `PREPARING` → `RUNNING` → `COMPLETED` / `FAILED` / `STOPPED`
- `scrapedCount`: rows collected so far
- `error`: present only on failure
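
The re-run loop for Step 5 could be scripted roughly as follows. This is a sketch under stated assumptions, not a DataLens feature: `poll_job`, `extract_status`, `DATALENS_CMD`, and `POLL_SLEEP` are hypothetical names, and the `sed` pattern assumes the tool prints JSON containing a `"status":"..."` pair (the skill says output is YAML/JSON, so adjust the parsing if your output differs).

```bash
# Hypothetical polling helper; assumes scrape_status prints JSON with a "status" string field.
DATALENS_CMD="${DATALENS_CMD:-datalens-mcp-call}"   # override, e.g. "npx datalens-mcp-call"
POLL_SLEEP="${POLL_SLEEP:-3}"                       # seconds to pause between polls

extract_status() {
  # Print the value of the first "status" field found in the JSON text on stdin.
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([A-Z_]*\)".*/\1/p' | head -n 1
}

poll_job() {
  local job_id="$1"
  local max_polls="${2:-100}"
  local i=0
  local status=""
  while [ "$i" -lt "$max_polls" ]; do
    status="$($DATALENS_CMD scrape_status "{\"jobId\":\"$job_id\",\"waitMs\":3000}" | extract_status)"
    case "$status" in
      COMPLETED) echo "$status"; return 0 ;;        # terminal: success
      FAILED|STOPPED) echo "$status"; return 1 ;;   # terminal: no more rows will arrive
    esac
    i=$((i + 1))
    sleep "$POLL_SLEEP"
  done
  echo "GAVE_UP"   # local marker only, not a DataLens status value
  return 2
}

# Usage:
# poll_job "<jobId>" 60
```

Capping the number of polls keeps an agent from looping forever when a job never leaves `QUEUED`.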

### Step 6 — Retrieve results

**Save to file (recommended for large results):**

```bash
datalens-mcp-call scrape_export_to_file '{"jobId":"<jobId>","outputDir":"/tmp/datalens","format":"json"}'
```

Returns the saved file path.

**Inline preview (small result sets):**

```bash
datalens-mcp-call scrape_result '{"jobId":"<jobId>","limit":50}'
```

Use the `cursor` field from each response to fetch the next page.

**In-memory export:**

```bash
datalens-mcp-call scrape_export '{"jobId":"<jobId>","format":"csv"}'
```

Returns base64-encoded file content.
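
To turn that base64 payload back into a file, something like the sketch below may help. It is hedged on two assumptions: that the response is JSON and that the payload sits in a `content` string field (as the output contract later on this page describes). `decode_export` is a hypothetical helper name, not a DataLens command.

```bash
# Hypothetical helper: read a scrape_export response on stdin, write decoded bytes to stdout.
# Assumes JSON with a "content":"<base64>" string field.
decode_export() {
  sed -n 's/.*"content"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' \
    | head -n 1 \
    | base64 -d   # some older macOS versions use -D instead of -d
}

# Usage:
# datalens-mcp-call scrape_export '{"jobId":"<jobId>","format":"csv"}' | decode_export > result.csv
```

For large result sets, `scrape_export_to_file` avoids the encode and decode round trip entirely.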

---

## Job Control

```bash
datalens-mcp-call scrape_pause  '{"jobId":"<jobId>"}'
datalens-mcp-call scrape_resume '{"jobId":"<jobId>"}'
datalens-mcp-call scrape_stop   '{"jobId":"<jobId>"}'
```

## Browser Tab Management

```bash
datalens-mcp-call browser_list_tabs
datalens-mcp-call browser_open_tab  '{"url":"https://example.com"}'
datalens-mcp-call browser_use_tab   '{"tabId":123}'
datalens-mcp-call browser_close_tab '{"tabId":123}'
```

Tab management is usually not needed — `scrape_detect_tables` with a `url` arg handles tab opening automatically.

---

## Agent Decision Rules

- **Never call `scrape_start` without a `jobDraft` or `scraperConfig`** from a prior `scrape_analyze_columns` response. Fabricating a scraperConfig will produce wrong results.
- **Never skip `scrape_analyze_columns`** and jump straight to `scrape_start`. The analyze step is required to build the config.
- If `scrape_detect_tables` returns an empty list, the page may need login or may be dynamically loaded. Ask the user to open the target URL in Chrome and scroll to load content, then retry.
- If `scrape_status` stays at `QUEUED` for more than 30 seconds, check that the Chrome extension is active and that a tab for the target URL is open.
- Use `maxRecords: 10` for a preview scrape to confirm the config is correct before running a full job.
- Default export format is JSON. Use CSV or XLSX when the user asks for spreadsheet output.

---

## End-to-End Example: Scrape Toutiao Headlines

```bash
# 1. Detect tables on the homepage
datalens-mcp-call scrape_detect_tables '{"url":"https://www.toutiao.com/?is_new_connect=0&is_new_user=0","prompt":"article list"}'

# 2. Analyze columns (fill in selectors from step 1 output)
datalens-mcp-call scrape_analyze_columns '{"rootSelector":"<from step 1>","itemSelector":"<from step 1>","documentInfoPath":"<from step 1>","url":"https://www.toutiao.com/?is_new_connect=0&is_new_user=0","prompt":"article list"}'

# 3. Preview run — first 10 rows (paste the full jobDraft JSON object from step 2)
datalens-mcp-call scrape_start '{"jobDraft":<paste jobDraft>,"maxRecords":10}'

# 4. Poll until status is COMPLETED
datalens-mcp-call scrape_status '{"jobId":"<jobId>","waitMs":3000}'

# 5. Save results to file
datalens-mcp-call scrape_export_to_file '{"jobId":"<jobId>","outputDir":"/tmp/datalens","format":"json"}'
```

Set `DATALENS_TIMEOUT=180000` before running if a tool call takes longer than the default 120 s:

```bash
DATALENS_TIMEOUT=180000 datalens-mcp-call scrape_analyze_columns '...'
```

---

## Debug Tools

These are for troubleshooting only. Do not use in normal scraping workflows.

```bash
datalens-mcp-call debug_get_logs '{"levels":["error"]}'
datalens-mcp-call debug_clear_logs '{}'
datalens-mcp-call debug_export_logs_to_file '{"outputDir":"/tmp/datalens"}'
```


restructured original procedural guide into implexa's six-component format, added explicit decision points for common failure modes and branching logic, documented all external connections and environment variables, clarified input/output contracts with data types and field names, and added outcome signals for each major step.

DataLens Web Scraper

intent

extract structured data (lists, tables, comments, products, reviews, or repeating items) from any website using the DataLens Chrome extension and MCP tools. use this when you need to pull repeating data from a live webpage, expand nested content like replies or load-more buttons, analyze the structure, then run a full or preview scrape with polling and result export. this skill handles the entire workflow from detection through job management.

inputs

external connections:

DataLens Chrome extension: must be installed and active in your Chrome browser. no API key or OAuth needed.
Node.js ≥ 18: required to run datalens-mcp-call or npx.
datalens-mcp-server npm package: install via npm install -g datalens-mcp-server or use npx datalens-mcp-call (no global install required).
Chrome browser: target webpage must be open in a tab, or you can pass a URL and the extension will open it.

parameters per tool:

url (optional): full URL of the target webpage. if omitted, the extension uses the currently active Chrome tab.
prompt (string): natural language description of what you're scraping (e.g. "article list", "product reviews"). used by the AI backend to identify relevant fields.
rootSelector (string): CSS selector for the container holding all repeating items. obtained from scrape_detect_tables output.
itemSelector (string): CSS selector for each individual item within the root. obtained from scrape_detect_tables output.
documentInfoPath (string): JSONPath to the data object inside each item. obtained from scrape_detect_tables output.
jobId (string): unique identifier returned by scrape_start. used to poll status, pause, resume, stop, or export results.
jobDraft (object): complete job configuration returned by scrape_analyze_columns. must not be fabricated.
maxRecords (integer): maximum rows to scrape in one job. use 10 for preview runs, higher for full export.
outputDir (string): filesystem path where exported files will be saved (e.g. "/tmp/datalens").
format (string): output format, one of "json", "csv", "xlsx". default is "json".
waitMs (integer): milliseconds to wait between status polls (e.g. 3000 for 3 seconds).
expandButtonUids (array): list of expand button identifiers found in scrape_get_table_tree output. used to click nested content before re-detecting.
limit (integer): max rows to return in inline preview. use 50 for quick inspection.
cursor (string): pagination token from prior result fetch. omit on first call.

environment variables:

DATALENS_TIMEOUT: optional. set to milliseconds (e.g. 180000 for 3 minutes) if tools take longer than the default 120 seconds. useful for slow pages or large scrapes.

procedure

1. detect table structure: run datalens-mcp-call scrape_detect_tables '{"url":"<target_url>","prompt":"<what_you_want>"}' to find repeating data on the page. the tool returns a list of detected tables, each with rootSelector, itemSelector, and documentInfoPath. copy the selectors that best match your target data. input: url (optional if tab is open), prompt (required). output: list of table candidates with selectors.
2. (optional) inspect nested content: if the page has collapsible sections, nested replies, or "load more" buttons, run datalens-mcp-call scrape_get_table_tree '{"rootSelector":"<from_step_1>","itemSelector":"<from_step_1>","documentInfoPath":"<from_step_1>"}' to visualize the DOM tree with _uid-annotated expand buttons. input: the three selectors from step 1. output: annotated tree structure showing which buttons can be clicked.
3. (optional) expand and re-detect: if you found expand buttons in step 2, run datalens-mcp-call scrape_click_expand_and_redetect '{"rootSelector":"<from_step_1>","itemSelector":"<from_step_1>","documentInfoPath":"<from_step_1>","expandButtonUids":[{"type":"reply","uids":["<uid_1>","<uid_2>"]}]}' to click those buttons, wait for new content to load, and re-detect the updated structure. input: selectors from step 1 plus array of expand button uids. output: updated selectors to use in subsequent steps. skip this step if there are no nested items.
4. analyze columns and build config: run datalens-mcp-call scrape_analyze_columns '{"rootSelector":"<from_step_1_or_3>","itemSelector":"<from_step_1_or_3>","documentInfoPath":"<from_step_1_or_3>","url":"<target_url>","prompt":"<what_you_want>"}' to call the backend AI, which identifies field names, data types, and pagination strategy. this returns a scraperConfig object and a jobDraft. always confirm the field list matches what you expect before proceeding. input: three selectors (updated if step 3 was used), url, prompt. output: jobDraft (required for step 5) and scraperConfig.
5. start scrape job: run datalens-mcp-call scrape_start '{"jobDraft":<paste_jobDraft_json>,"maxRecords":<number>}' to launch the scrape. use maxRecords: 10 for a preview run first. do not fabricate or modify the jobDraft. input: jobDraft from step 4, maxRecords. output: jobId (string), used for all subsequent operations.
6. poll job status: repeatedly run datalens-mcp-call scrape_status '{"jobId":"<jobId>","waitMs":3000}' until status field is COMPLETED, FAILED, or STOPPED. check the scrapedCount field to monitor progress and error field for failure reasons. input: jobId from step 5, waitMs. output: status (enum), scrapedCount (integer), error (string, if failed).
7. export results: choose one export method per job:
   - to file (recommended for large sets): run datalens-mcp-call scrape_export_to_file '{"jobId":"<jobId>","outputDir":"<path>","format":"<json|csv|xlsx>"}'. input: jobId, outputDir (must exist or be creatable), format. output: file path where results were saved.
   - inline preview (small sets only): run datalens-mcp-call scrape_result '{"jobId":"<jobId>","limit":50}'. input: jobId, limit. output: array of rows plus optional cursor for pagination. re-run with the cursor value to fetch next page.
   - in-memory base64 (for passing to other tools): run datalens-mcp-call scrape_export '{"jobId":"<jobId>","format":"<json|csv>"}'. input: jobId, format. output: base64-encoded file content (string).
8. (optional) manage running jobs: pause with datalens-mcp-call scrape_pause '{"jobId":"<jobId>"}', resume with datalens-mcp-call scrape_resume '{"jobId":"<jobId>"}', or stop with datalens-mcp-call scrape_stop '{"jobId":"<jobId>"}'. input: jobId. output: updated job status.
9. (optional) manage browser tabs: list open tabs with datalens-mcp-call browser_list_tabs, open a new tab with datalens-mcp-call browser_open_tab '{"url":"<url>"}', switch to a tab with datalens-mcp-call browser_use_tab '{"tabId":<id>}', or close a tab with datalens-mcp-call browser_close_tab '{"tabId":<id>}'. input: url (for open) or tabId (for use/close). output: tab list or confirmation. note: scrape_detect_tables with a url arg handles tab opening automatically, so this step is usually not needed.

decision points

if scrape_detect_tables returns an empty list: the page may require login, be dynamically loaded (JavaScript-heavy), or have no repeating structure. ask the user to log in in Chrome if needed, scroll down to trigger lazy-loading, then retry the detect command.
if the page has nested/collapsible content: run step 2 (inspect tree) and step 3 (expand and re-detect) before analyzing columns. skip if the page is flat (no nested replies, no load-more buttons).
if scrape_status stays at QUEUED for over 30 seconds: the Chrome extension may be inactive, the target URL tab may be closed, or the browser may be hung. verify the extension icon is visible in Chrome, verify a tab for the target URL is open, then check debug logs with datalens-mcp-call debug_get_logs '{"levels":["error"]}'.
if a tool call times out or takes longer than 120 seconds: set DATALENS_TIMEOUT=180000 (or higher) in your environment before running the command. this is common for large pages or slow internet.
if you want a quick preview before a full scrape: set maxRecords: 10 in scrape_start, then inspect the first few rows with scrape_export_to_file or scrape_result. verify the field mapping is correct, then increase maxRecords for the full run.
if the user asks for spreadsheet output: use format: "csv" or format: "xlsx" in scrape_export_to_file. default is JSON.
if you need to interrupt or slow down a scrape: use scrape_pause to temporarily halt, scrape_resume to continue, or scrape_stop to kill the job. paused jobs can be resumed; stopped jobs cannot.
never call scrape_start without a jobDraft from scrape_analyze_columns. fabricating a config will produce incorrect results or errors.
never skip scrape_analyze_columns and jump straight to scrape_start. the analyze step is mandatory to build the config.

output contract

scrape_detect_tables: returns JSON array of table candidates. each candidate has rootSelector (string), itemSelector (string), documentInfoPath (string), and optional metadata. use these three strings in all downstream calls.
scrape_analyze_columns: returns JSON object with scraperConfig (object) and jobDraft (object). the jobDraft is a complete, serializable job descriptor that must be passed verbatim to scrape_start. do not edit or fabricate jobDraft values.
scrape_start: returns JSON object with jobId (string UUID), status (enum: "QUEUED"), and optional createdAt (timestamp). store the jobId for all subsequent operations.
scrape_status: returns JSON object with jobId (string), status (enum: "QUEUED", "PREPARING", "RUNNING", "COMPLETED", "FAILED", "STOPPED"), scrapedCount (integer, rows collected so far), totalCount (integer, estimated or null), progress (float 0.0-1.0, optional), error (string, present only on failure), and completedAt (timestamp, present only when status is terminal).
scrape_export_to_file: returns JSON object with filePath (string, absolute path to saved file), format (string, the format used), recordCount (integer, rows in the export), and fileSize (integer, bytes). the file is created at the specified outputDir with a timestamped name.
scrape_result: returns JSON object with jobId (string), records (array of row objects), count (integer, rows in this response), cursor (string or null, pagination token for next fetch), and hasMore (boolean, true if more rows exist).
scrape_export: returns JSON object with format (string), content (string, base64-encoded file data), recordCount (integer), and encodedSize (integer, bytes after base64 encoding).
all error responses: return JSON object with error (string, human-readable message) and optional code (string, machine-readable error code e.g. "EXTENSION_NOT_ACTIVE", "INVALID_SELECTOR", "TIMEOUT").

outcome signal

scrape_detect_tables succeeds: you receive a list of table candidates with valid selectors. pick the one that visually matches your target data in the webpage.
scrape_analyze_columns succeeds: you see a jobDraft and a list of detected fields (e.g. "title", "author", "date") that match what you expect to scrape. if fields are missing or misnamed, re-run with a more specific prompt or check the page layout.
scrape_start succeeds: you receive a jobId. no error or timeout means the job is queued and the extension is listening.
scrape_status shows COMPLETED: the scrapedCount field matches or exceeds your expected row count (or is close to totalCount). no error field present.
scrape_export_to_file succeeds: the file exists at the returned filePath. you can open it in your editor, spreadsheet app, or text viewer and verify rows are present and formatted correctly.
scrape_result returns data with records array: the array is non-empty and each record has the expected fields from step 4. if hasMore is true and cursor is non-null, call scrape_result again with that cursor to fetch the next page.
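scrape_pause, scrape_resume, scrape_stop succeed: the returned status field changes to "PAUSED" (for pause), "RUNNING" (for resume), or "STOPPED" (for stop), with no error field. note that "PAUSED" is not among the status values listed for scrape_status in the output contract, so confirm the exact value against real output.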
debug_get_logs returns entries: check for "error" and "warn" level entries to diagnose extension connectivity, selector validity, or timeout issues.
user can see and use the data: final exported file is readable, row count is non-zero, field names and values are intact and sensible, and the user can load it into a spreadsheet, database, or downstream tool without errors.

credits: original skill authored by clawhub. enriched to follow implexa standards with explicit decision points, edge case handling, and detailed output contracts.