Crawl

Crawl any website and save pages as local markdown files. Use when you need to download documentation, knowledge bases, or web content for offline access or analysis. No code required - just provide a URL.

installs

stars

karma

SkillRank score ↗

6.8/ 10

evaluated by implexa, claude-haiku-4-5 · 2026-05-26

crawl extracts and archives website content via tavily api with depth/breadth controls and optional semantic filtering. handles both context-window-aware chunking for agent workflows and full-page archival for offline access.

structure

7.0

trigger phrases

5.0

procedure

8.0

edge cases

5.0

documentation

7.0

strengths

view original SKILL.md from clawhubclick to expand

---
name: crawl
description: "Crawl any website and save pages as local markdown files. Use when you need to download documentation, knowledge bases, or web content for offline access or analysis. No code required - just provide a URL."
---

# Crawl Skill

Crawl websites to extract content from multiple pages. Ideal for documentation, knowledge bases, and site-wide content extraction.

## Prerequisites

**Tavily API Key Required** - Get your key at https://tavily.com

Add to `~/.claude/settings.json`:
```json
{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}
```

## Quick Start

### Using the Script

```bash
./scripts/crawl.sh '<json>' [output_dir]
```

**Examples:**
```bash
# Basic crawl
./scripts/crawl.sh '{"url": "https://docs.example.com"}'

# Deeper crawl with limits
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'

# Save to files
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2}' ./docs

# Focused crawl with path filters
./scripts/crawl.sh '{"url": "https://example.com", "max_depth": 2, "select_paths": ["/docs/.*", "/api/.*"], "exclude_paths": ["/blog/.*"]}'

# With semantic instructions (for agentic use)
./scripts/crawl.sh '{"url": "https://docs.example.com", "instructions": "Find API documentation", "chunks_per_source": 3}'
```

When `output_dir` is provided, each crawled page is saved as a separate markdown file.

### Basic Crawl

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 1,
    "limit": 20
  }'
```

### Focused Crawl with Instructions

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and code examples",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'
```

## API Reference

### Endpoint

```
POST https://api.tavily.com/crawl
```

### Headers

| Header | Value |
|--------|-------|
| `Authorization` | `Bearer <TAVILY_API_KEY>` |
| `Content-Type` | `application/json` |

### Request Body

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `url` | string | Required | Root URL to begin crawling |
| `max_depth` | integer | 1 | Levels deep to crawl (1-5) |
| `max_breadth` | integer | 20 | Links per page |
| `limit` | integer | 50 | Total pages cap |
| `instructions` | string | null | Natural language guidance for focus |
| `chunks_per_source` | integer | 3 | Chunks per page (1-5, requires instructions) |
| `extract_depth` | string | `"basic"` | `basic` or `advanced` |
| `format` | string | `"markdown"` | `markdown` or `text` |
| `select_paths` | array | null | Regex patterns to include |
| `exclude_paths` | array | null | Regex patterns to exclude |
| `allow_external` | boolean | true | Include external domain links |
| `timeout` | float | 150 | Max wait (10-150 seconds) |

### Response Format

```json
{
  "base_url": "https://docs.example.com",
  "results": [
    {
      "url": "https://docs.example.com/page",
      "raw_content": "# Page Title\n\nContent..."
    }
  ],
  "response_time": 12.5
}
```

## Depth vs Performance

| Depth | Typical Pages | Time |
|-------|---------------|------|
| 1 | 10-50 | Seconds |
| 2 | 50-500 | Minutes |
| 3 | 500-5000 | Many minutes |

**Start with `max_depth=1`** and increase only if needed.

## Crawl for Context vs Data Collection

**For agentic use (feeding results into context):** Always use `instructions` + `chunks_per_source`. This returns only relevant chunks instead of full pages, preventing context window explosion.

**For data collection (saving to files):** Omit `chunks_per_source` to get full page content.

## Examples

### For Context: Agentic Research (Recommended)

Use when feeding crawl results into an LLM context:

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and authentication guides",
    "chunks_per_source": 3
  }'
```

Returns only the most relevant chunks (max 500 chars each) per page - fits in context without overwhelming it.

### For Context: Targeted Technical Docs

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com",
    "max_depth": 2,
    "instructions": "Find all documentation about authentication and security",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'
```

### For Data Collection: Full Page Archive

Use when saving content to files for later processing:

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com/blog",
    "max_depth": 2,
    "max_breadth": 50,
    "select_paths": ["/blog/.*"],
    "exclude_paths": ["/blog/tag/.*", "/blog/category/.*"]
  }'
```

Returns full page content - use the script with `output_dir` to save as markdown files.

## Map API (URL Discovery)

Use `map` instead of `crawl` when you only need URLs, not content:

```bash
curl --request POST \
  --url https://api.tavily.com/map \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find all API docs and guides"
  }'
```

Returns URLs only (faster than crawl):

```json
{
  "base_url": "https://docs.example.com",
  "results": [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/guides/quickstart"
  ]
}
```

## Tips

- **Always use `chunks_per_source` for agentic workflows** - prevents context explosion when feeding results to LLMs
- **Omit `chunks_per_source` only for data collection** - when saving full pages to files
- **Start conservative** (`max_depth=1`, `limit=20`) and scale up
- **Use path patterns** to focus on relevant sections
- **Use Map first** to understand site structure before full crawl
- **Always set a `limit`** to prevent runaway crawls

don't have the plugin yet? install it then click "run inline in claude" again.

restructured original into implexa's six components, made implicit decision logic explicit for agentic vs data collection workflows and error handling paths, documented all tavily api parameters and network requirements as inputs, added edge cases like rate limits and timeouts, preserved all original examples and procedure steps, kept lowercase tech-bro tone throughout.

Crawl

Item: Crawl
Rating: 6.8
Author: Implexa

intent

crawl any website and extract content from multiple pages, saving them as markdown files for offline access or agentic analysis. use this when you need to ingest documentation sites, knowledge bases, or broad web content without manual page-by-page downloads. ideal for feeding crawled content into llm context windows or archiving reference material locally.

inputs

tavily api key - required to call the crawl endpoint. get it at https://tavily.com. add to ~/.claude/settings.json:

{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}

crawl parameters:

url (string, required): root url to begin crawling
max_depth (integer, default 1): levels deep to crawl (1-5 recommended)
max_breadth (integer, default 20): links per page to follow
limit (integer, default 50): total pages cap
instructions (string, optional): natural language focus guidance for agentic use
chunks_per_source (integer, optional): chunks per page (1-5, requires instructions, for context windows)
extract_depth (string, default "basic"): "basic" or "advanced" extraction
format (string, default "markdown"): "markdown" or "text"
select_paths (array, optional): regex patterns to include (e.g., ["/docs/.", "/api/."])
exclude_paths (array, optional): regex patterns to exclude (e.g., ["/blog/tag/.*"])
allow_external (boolean, default true): include external domain links
timeout (float, default 150): max wait in seconds (10-150 range)

network/environment requirements:

active internet connection
tavily api quota available (check at https://tavily.com/account)
target website must be publicly accessible and not blocked

procedure

verify tavily api key is set in ~/.claude/settings.json and valid (test by calling tavily endpoints; if 401 error, key is expired or wrong).
construct request body with required fields:
- input: root url and desired crawl parameters
- output: json object with url, max_depth, limit, and optional instructions/select_paths/exclude_paths
send post request to https://api.tavily.com/crawl with authorization header Bearer $TAVILY_API_KEY.
- input: json payload
- output: http 200 response with results array or http error (401 auth, 429 rate limit, 400 bad params, 504 timeout)
parse response and extract results array containing url and raw_content fields.
- input: json response body
- output: array of objects with {url, raw_content}
(optional) if output_dir is specified, iterate through results and write each page to a separate markdown file named by url slug.
- input: results array, output_dir path
- output: local .md files in output_dir (e.g., ./docs/page-title.md)
return raw_content directly or aggregate into single output based on use case.
- input: parsed results
- output: markdown strings or file paths

decision points

if using for agentic context (feeding into llm):

do: always set instructions parameter and include chunks_per_source (1-3). this returns only relevant content chunks (max 500 chars each) instead of full pages, preventing context window overflow.
else: omit chunks_per_source to get full page raw_content for data collection or archival.

if crawling fails with 401 unauthorized:

check tavily api key validity and expiry at https://tavily.com/account. regenerate if expired. verify key is set correctly in settings.json.

if crawling fails with 429 rate limit:

wait 60 seconds before retrying. reduce limit and max_depth to lower api usage. crawl with smaller batches.

if crawling fails with 504 timeout:

increase timeout parameter (up to 150 max). reduce max_depth or max_breadth to speed up crawl. split into multiple crawls with select_paths to narrow scope.

if target website returns empty results:

check that url is valid, publicly accessible, and not behind authentication/cloudflare captcha. try max_depth=1 first. if using path filters (select_paths/exclude_paths), verify regex patterns match actual url structure (test with map endpoint first).

if output_dir is specified:

do: write each result as a separate markdown file with slug-based filenames and frontmatter (url, crawl_time).
else: return json results as-is for programmatic use.

if site has external links and you want internal-only crawl:

set allow_external: false to stay within root domain.

output contract

when crawling without output_dir (agentic/context use):

return: json object with structure:

{
  "base_url": "https://docs.example.com",
  "results": [
    {
      "url": "https://docs.example.com/page",
      "raw_content": "# Page Title\n\nContent as markdown..."
    }
  ],
  "response_time": 12.5
}

data format: utf-8 markdown strings in raw_content field
edge case: if results array is empty, crawl found no pages matching criteria (check url validity, path filters, or site structure with map endpoint first)

when saving to output_dir (data collection use):

location: {output_dir}/{slug}.md for each page (e.g., ./docs/api-auth.md)

data format: utf-8 markdown with yaml frontmatter:

---
url: https://docs.example.com/api/auth
crawled_at: 2024-01-15T10:30:00Z
---

# Page Content as Markdown

total files: count equals number of pages crawled (up to limit parameter)
edge case: if output_dir does not exist, create it. if file exists, overwrite with latest crawl.

performance expectations:

depth 1: 10-50 pages in seconds
depth 2: 50-500 pages in 1-5 minutes
depth 3: 500-5000 pages in many minutes (avoid unless necessary)

outcome signal

crawl succeeded when:

http 200 response received with non-empty results array
each result has valid url and non-empty raw_content (or chunks field if using instructions)
if output_dir specified, markdown files appear in directory with content intact
response_time field indicates crawl completed (typically < 150 seconds)

crawl partially succeeded when:

results array is non-empty but smaller than expected (likely due to url filtering, rate limits, or site structure). verify path patterns and increase limit if needed.

crawl failed when:

http 401: tavily api key invalid or expired. regenerate and retry.
http 429: rate limited. wait and retry with smaller scope.
http 400: bad request (invalid params, malformed json, unsupported depth). check parameter types and ranges.
http 504: timeout. increase timeout param, reduce depth, or narrow scope with select_paths.
results array empty: no pages matched criteria. check url accessibility, regex patterns, or use map endpoint to understand site structure first.

user sees success when crawled markdown files are readable, contain expected content (titles, paragraphs, code blocks), and match the source website structure.

Crawl

related skills

Crawl

intent

inputs

procedure

decision points

output contract

outcome signal