Crawl any website and save pages as local markdown files. Use when you need to download documentation, knowledge bases, or web content for offline access or analysis. No code required - just provide a URL.
---
name: crawl
description: "Crawl any website and save pages as local markdown files. Use when you need to download documentation, knowledge bases, or web content for offline access or analysis. No code required - just provide a URL."
---
# Crawl Skill
Crawl websites to extract content from multiple pages. Ideal for documentation, knowledge bases, and site-wide content extraction.
## Prerequisites
**Tavily API Key Required** - Get your key at https://tavily.com
Add to `~/.claude/settings.json`:
```json
{
"env": {
"TAVILY_API_KEY": "tvly-your-api-key-here"
}
}
```
## Quick Start
### Using the Script
```bash
./scripts/crawl.sh '<json>' [output_dir]
```
**Examples:**
```bash
# Basic crawl
./scripts/crawl.sh '{"url": "https://docs.example.com"}'
# Deeper crawl with limits
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'
# Save to files
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2}' ./docs
# Focused crawl with path filters
./scripts/crawl.sh '{"url": "https://example.com", "max_depth": 2, "select_paths": ["/docs/.*", "/api/.*"], "exclude_paths": ["/blog/.*"]}'
# With semantic instructions (for agentic use)
./scripts/crawl.sh '{"url": "https://docs.example.com", "instructions": "Find API documentation", "chunks_per_source": 3}'
```
When `output_dir` is provided, each crawled page is saved as a separate markdown file.
### Basic Crawl
```bash
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://docs.example.com",
"max_depth": 1,
"limit": 20
}'
```
### Focused Crawl with Instructions
```bash
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://docs.example.com",
"max_depth": 2,
"instructions": "Find API documentation and code examples",
"chunks_per_source": 3,
"select_paths": ["/docs/.*", "/api/.*"]
}'
```
## API Reference
### Endpoint
```
POST https://api.tavily.com/crawl
```
### Headers
| Header | Value |
|--------|-------|
| `Authorization` | `Bearer <TAVILY_API_KEY>` |
| `Content-Type` | `application/json` |
### Request Body
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `url` | string | Required | Root URL to begin crawling |
| `max_depth` | integer | 1 | Levels deep to crawl (1-5) |
| `max_breadth` | integer | 20 | Links per page |
| `limit` | integer | 50 | Total pages cap |
| `instructions` | string | null | Natural language guidance for focus |
| `chunks_per_source` | integer | 3 | Chunks per page (1-5, requires instructions) |
| `extract_depth` | string | `"basic"` | `basic` or `advanced` |
| `format` | string | `"markdown"` | `markdown` or `text` |
| `select_paths` | array | null | Regex patterns to include |
| `exclude_paths` | array | null | Regex patterns to exclude |
| `allow_external` | boolean | true | Include external domain links |
| `timeout` | float | 150 | Max wait (10-150 seconds) |
### Response Format
```json
{
"base_url": "https://docs.example.com",
"results": [
{
"url": "https://docs.example.com/page",
"raw_content": "# Page Title\n\nContent..."
}
],
"response_time": 12.5
}
```
## Depth vs Performance
| Depth | Typical Pages | Time |
|-------|---------------|------|
| 1 | 10-50 | Seconds |
| 2 | 50-500 | Minutes |
| 3 | 500-5000 | Many minutes |
**Start with `max_depth=1`** and increase only if needed.
## Crawl for Context vs Data Collection
**For agentic use (feeding results into context):** Always use `instructions` + `chunks_per_source`. This returns only relevant chunks instead of full pages, preventing context window explosion.
**For data collection (saving to files):** Omit `chunks_per_source` to get full page content.
## Examples
### For Context: Agentic Research (Recommended)
Use when feeding crawl results into an LLM context:
```bash
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://docs.example.com",
"max_depth": 2,
"instructions": "Find API documentation and authentication guides",
"chunks_per_source": 3
}'
```
Returns only the most relevant chunks (max 500 chars each) per page - fits in context without overwhelming it.
### For Context: Targeted Technical Docs
```bash
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://example.com",
"max_depth": 2,
"instructions": "Find all documentation about authentication and security",
"chunks_per_source": 3,
"select_paths": ["/docs/.*", "/api/.*"]
}'
```
### For Data Collection: Full Page Archive
Use when saving content to files for later processing:
```bash
curl --request POST \
--url https://api.tavily.com/crawl \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://example.com/blog",
"max_depth": 2,
"max_breadth": 50,
"select_paths": ["/blog/.*"],
"exclude_paths": ["/blog/tag/.*", "/blog/category/.*"]
}'
```
Returns full page content - use the script with `output_dir` to save as markdown files.
## Map API (URL Discovery)
Use `map` instead of `crawl` when you only need URLs, not content:
```bash
curl --request POST \
--url https://api.tavily.com/map \
--header "Authorization: Bearer $TAVILY_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://docs.example.com",
"max_depth": 2,
"instructions": "Find all API docs and guides"
}'
```
Returns URLs only (faster than crawl):
```json
{
"base_url": "https://docs.example.com",
"results": [
"https://docs.example.com/api/auth",
"https://docs.example.com/guides/quickstart"
]
}
```
## Tips
- **Always use `chunks_per_source` for agentic workflows** - prevents context explosion when feeding results to LLMs
- **Omit `chunks_per_source` only for data collection** - when saving full pages to files
- **Start conservative** (`max_depth=1`, `limit=20`) and scale up
- **Use path patterns** to focus on relevant sections
- **Use Map first** to understand site structure before full crawl
- **Always set a `limit`** to prevent runaway crawls
don't have the plugin yet? install it then click "run inline in claude" again.
restructured original into implexa's six components, made implicit decision logic explicit for agentic vs data collection workflows and error handling paths, documented all tavily api parameters and network requirements as inputs, added edge cases like rate limits and timeouts, preserved all original examples and procedure steps, kept lowercase tech-bro tone throughout.
crawl any website and extract content from multiple pages, saving them as markdown files for offline access or agentic analysis. use this when you need to ingest documentation sites, knowledge bases, or broad web content without manual page-by-page downloads. ideal for feeding crawled content into llm context windows or archiving reference material locally.
tavily api key - required to call the crawl endpoint. get it at https://tavily.com. add to ~/.claude/settings.json:
{
"env": {
"TAVILY_API_KEY": "tvly-your-api-key-here"
}
}
crawl parameters:
url (string, required): root url to begin crawlingmax_depth (integer, default 1): levels deep to crawl (1-5 recommended)max_breadth (integer, default 20): links per page to followlimit (integer, default 50): total pages capinstructions (string, optional): natural language focus guidance for agentic usechunks_per_source (integer, optional): chunks per page (1-5, requires instructions, for context windows)extract_depth (string, default "basic"): "basic" or "advanced" extractionformat (string, default "markdown"): "markdown" or "text"select_paths (array, optional): regex patterns to include (e.g., ["/docs/.", "/api/."])exclude_paths (array, optional): regex patterns to exclude (e.g., ["/blog/tag/.*"])allow_external (boolean, default true): include external domain linkstimeout (float, default 150): max wait in seconds (10-150 range)network/environment requirements:
verify tavily api key is set in ~/.claude/settings.json and valid (test by calling tavily endpoints; if 401 error, key is expired or wrong).
construct request body with required fields:
send post request to https://api.tavily.com/crawl with authorization header Bearer $TAVILY_API_KEY.
parse response and extract results array containing url and raw_content fields.
(optional) if output_dir is specified, iterate through results and write each page to a separate markdown file named by url slug.
./docs/page-title.md)return raw_content directly or aggregate into single output based on use case.
if using for agentic context (feeding into llm):
instructions parameter and include chunks_per_source (1-3). this returns only relevant content chunks (max 500 chars each) instead of full pages, preventing context window overflow.chunks_per_source to get full page raw_content for data collection or archival.if crawling fails with 401 unauthorized:
if crawling fails with 429 rate limit:
limit and max_depth to lower api usage. crawl with smaller batches.if crawling fails with 504 timeout:
timeout parameter (up to 150 max). reduce max_depth or max_breadth to speed up crawl. split into multiple crawls with select_paths to narrow scope.if target website returns empty results:
max_depth=1 first. if using path filters (select_paths/exclude_paths), verify regex patterns match actual url structure (test with map endpoint first).if output_dir is specified:
if site has external links and you want internal-only crawl:
allow_external: false to stay within root domain.when crawling without output_dir (agentic/context use):
{
"base_url": "https://docs.example.com",
"results": [
{
"url": "https://docs.example.com/page",
"raw_content": "# Page Title\n\nContent as markdown..."
}
],
"response_time": 12.5
}
when saving to output_dir (data collection use):
{output_dir}/{slug}.md for each page (e.g., ./docs/api-auth.md)---
url: https://docs.example.com/api/auth
crawled_at: 2024-01-15T10:30:00Z
---
# Page Content as Markdown
performance expectations:
crawl succeeded when:
crawl partially succeeded when:
crawl failed when:
user sees success when crawled markdown files are readable, contain expected content (titles, paragraphs, code blocks), and match the source website structure.