Playwright-based web scraping OpenClaw Skill with anti-bot protection. Successfully tested on complex sites like Discuss.com.hk.
---
name: playwright-scraper-skill
description: Playwright-based web scraping OpenClaw Skill with anti-bot protection. Successfully tested on complex sites like Discuss.com.hk.
version: 1.2.0
author: Simon Chan
---
# Playwright Scraper Skill
A Playwright-based web scraping OpenClaw Skill with anti-bot protection. Choose the best approach based on the target website's anti-bot level.
---
## π― Use Case Matrix
| Target Website | Anti-Bot Level | Recommended Method | Script |
|---------------|----------------|-------------------|--------|
| **Regular Sites** | Low | web_fetch tool | N/A (built-in) |
| **Dynamic Sites** | Medium | Playwright Simple | `scripts/playwright-simple.js` |
| **Cloudflare Protected** | High | **Playwright Stealth** β | `scripts/playwright-stealth.js` |
| **YouTube** | Special | deep-scraper | Install separately |
| **Reddit** | Special | reddit-scraper | Install separately |
---
## π¦ Installation
```bash
cd playwright-scraper-skill
npm install
npx playwright install chromium
```
---
## π Quick Start
### 1οΈβ£ Simple Sites (No Anti-Bot)
Use OpenClaw's built-in `web_fetch` tool:
```bash
# Invoke directly in OpenClaw
Hey, fetch me the content from https://example.com
```
---
### 2οΈβ£ Dynamic Sites (Requires JavaScript)
Use **Playwright Simple**:
```bash
node scripts/playwright-simple.js "https://example.com"
```
**Example output:**
```json
{
"url": "https://example.com",
"title": "Example Domain",
"content": "...",
"elapsedSeconds": "3.45"
}
```
---
### 3οΈβ£ Anti-Bot Protected Sites (Cloudflare etc.)
Use **Playwright Stealth**:
```bash
node scripts/playwright-stealth.js "https://m.discuss.com.hk/#hot"
```
**Features:**
- Hide automation markers (`navigator.webdriver = false`)
- Realistic User-Agent (iPhone, Android)
- Random delays to mimic human behavior
- Screenshot and HTML saving support
---
### 4οΈβ£ YouTube Video Transcripts
Use **deep-scraper** (install separately):
```bash
# Install deep-scraper skill
npx clawhub install deep-scraper
# Use it
cd skills/deep-scraper
node assets/youtube_handler.js "https://www.youtube.com/watch?v=VIDEO_ID"
```
---
## π Script Descriptions
### `scripts/playwright-simple.js`
- **Use Case:** Regular dynamic websites
- **Speed:** Fast (3-5 seconds)
- **Anti-Bot:** None
- **Output:** JSON (title, content, URL)
### `scripts/playwright-stealth.js` β
- **Use Case:** Sites with Cloudflare or anti-bot protection
- **Speed:** Medium (5-20 seconds)
- **Anti-Bot:** Medium-High (hides automation, realistic UA)
- **Output:** JSON + Screenshot + HTML file
- **Verified:** 100% success on Discuss.com.hk
---
## π Best Practices
### 1. Try web_fetch First
If the site doesn't have dynamic loading, use OpenClaw's `web_fetch` toolβit's fastest.
### 2. Need JavaScript? Use Playwright Simple
If you need to wait for JavaScript rendering, use `playwright-simple.js`.
### 3. Getting Blocked? Use Stealth
If you encounter 403 or Cloudflare challenges, use `playwright-stealth.js`.
### 4. Special Sites Need Specialized Skills
- YouTube β deep-scraper
- Reddit β reddit-scraper
- Twitter β bird skill
---
## π§ Customization
All scripts support environment variables:
```bash
# Set screenshot path
SCREENSHOT_PATH=/path/to/screenshot.png node scripts/playwright-stealth.js URL
# Set wait time (milliseconds)
WAIT_TIME=10000 node scripts/playwright-simple.js URL
# Enable headful mode (show browser)
HEADLESS=false node scripts/playwright-stealth.js URL
# Save HTML
SAVE_HTML=true node scripts/playwright-stealth.js URL
# Custom User-Agent
USER_AGENT="Mozilla/5.0 ..." node scripts/playwright-stealth.js URL
```
---
## π Performance Comparison
| Method | Speed | Anti-Bot | Success Rate (Discuss.com.hk) |
|--------|-------|----------|-------------------------------|
| web_fetch | β‘ Fastest | β None | 0% |
| Playwright Simple | π Fast | β οΈ Low | 20% |
| **Playwright Stealth** | β±οΈ Medium | β
Medium | **100%** β
|
| Puppeteer Stealth | β±οΈ Medium | β
Medium-High | ~80% |
| Crawlee (deep-scraper) | π’ Slow | β Detected | 0% |
| Chaser (Rust) | β±οΈ Medium | β Detected | 0% |
---
## π‘οΈ Anti-Bot Techniques Summary
Lessons learned from our testing:
### β
Effective Anti-Bot Measures
1. **Hide `navigator.webdriver`** β Essential
2. **Realistic User-Agent** β Use real devices (iPhone, Android)
3. **Mimic Human Behavior** β Random delays, scrolling
4. **Avoid Framework Signatures** β Crawlee, Selenium are easily detected
5. **Use `addInitScript` (Playwright)** β Inject before page load
### β Ineffective Anti-Bot Measures
1. **Only changing User-Agent** β Not enough
2. **Using high-level frameworks (Crawlee)** β More easily detected
3. **Docker isolation** β Doesn't help with Cloudflare
---
## π Troubleshooting
### Issue: 403 Forbidden
**Solution:** Use `playwright-stealth.js`
### Issue: Cloudflare Challenge Page
**Solution:**
1. Increase wait time (10-15 seconds)
2. Try `headless: false` (headful mode sometimes has higher success rate)
3. Consider using proxy IPs
### Issue: Blank Page
**Solution:**
1. Increase `waitForTimeout`
2. Use `waitUntil: 'networkidle'` or `'domcontentloaded'`
3. Check if login is required
---
## π Memory & Experience
### 2026-02-07 Discuss.com.hk Test Conclusions
- β
**Pure Playwright + Stealth** succeeded (5s, 200 OK)
- β Crawlee (deep-scraper) failed (403)
- β Chaser (Rust) failed (Cloudflare)
- β Puppeteer standard failed (403)
**Best Solution:** Pure Playwright + anti-bot techniques (framework-independent)
---
## π§ Future Improvements
- [ ] Add proxy IP rotation
- [ ] Implement cookie management (maintain login state)
- [ ] Add CAPTCHA handling (2captcha / Anti-Captcha)
- [ ] Batch scraping (parallel URLs)
- [ ] Integration with OpenClaw's `browser` tool
---
## π References
- [Playwright Official Docs](https://playwright.dev/)
- [puppeteer-extra-plugin-stealth](https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth)
- [deep-scraper skill](https://clawhub.com/opsun/deep-scraper)
don't have the plugin yet? install it then click "run inline in claude" again.