---
name: social-media-data-collector
description: |
  Multi-platform social media data collection and aggregation for content performance tracking.
  Use when: (1) collecting engagement metrics (views/likes/comments/shares) across multiple platforms, (2) filling a bitable/spreadsheet with social media performance data, (3) tracking content distribution results across 10+ platforms, (4) scraping platforms that have no API.
  Covers: Douyin, Weibo, Kuaishou, Bilibili, Toutiao, Xiaohongshu, WeChat Video (视频号), Autohome (汽车之家), Yiche (易车), Baijiahao (百家号), Douyu (斗鱼), Pipixia (皮皮虾), Dongchedi (懂车帝), TikTok, YouTube.
  NOT for: posting content, account management, or social listening/monitoring.
---

# Social Media Data Collector

## Overview

Collect engagement metrics from 13+ platforms and aggregate them into a structured format (Feishu Bitable (飞书多维表格) or CSV).

Three-tier approach: API first → browser scrape fallback → manual flag.

## Execution Flow

1. **Classify platforms** by data access method (see `references/platform-guide.md`)
2. **API tier**: call APIs for platforms with programmatic access
3. **Browser tier**: Playwright render + text extraction for the remaining platforms
4. **Aggregate**: normalize data, write to the target (bitable/CSV)
5. **Cleanup**: remove screenshots, temp files, browser cache

## Platform Tiers

| Tier | Platforms | Method |
|------|-----------|--------|
| API-first | Douyin (抖音), Weibo (微博), Kuaishou (快手), Bilibili (B站), Toutiao (今日头条), Xiaohongshu (小红书) | TikHub API / BlueAI Crawler |
| Browser-scrape | Baijiahao (百家号), Autohome (汽车之家), Yiche (易车), WeChat Video (视频号), Douyu (斗鱼), Pipixia (皮皮虾) | Playwright headless |
| API+scrape | Dongchedi (懂车帝) | TikHub (limited) + scrape |

## Model Strategy (Token Optimization)

### Problem

Using opus/sonnet for the entire pipeline wastes tokens on mechanical tasks.

### Recommended Model Split

| Phase | Model | Why |
|-------|-------|-----|
| Planning & classification | opus/sonnet | Needs reasoning |
| API calls & JSON parsing | haiku/flash | Mechanical, no reasoning needed |
| Browser text extraction | Code (no LLM) | Pure Python, no model call |
| Data normalization | haiku/flash | Simple mapping |
| Report/summary | sonnet | Needs synthesis |

### Implementation

- Use `scripts/collect_api.py` for the API tier: **zero LLM tokens** (pure code)
- Use `scripts/collect_browser.py` for the browser tier: **zero LLM tokens** (pure code)
- Only invoke an LLM for planning which platforms to hit, handling errors, and writing summaries

### Token Budget Estimate (per 13-platform run)

- With the current approach (all-opus): ~80k tokens
- With the optimized approach (code scripts + haiku routing): ~5k tokens
- **Savings: ~94%**

## Key Commands

```bash
# Full collection run
python3 scripts/collect_api.py --config /tmp/sm-collect/config.json

# Browser scrape specific platforms
python3 scripts/collect_browser.py --platforms "百家号,汽车之家,视频号"

# Write to bitable
python3 scripts/write_bitable.py --app-token XXX --table-id YYY --data /tmp/sm-collect/results.json

# Cleanup
rm -rf /tmp/sm-collect/ /tmp/screenshots/
```

## Bitable Field Mapping

| Bitable field | Type | Notes |
|---------------|------|-------|
| 播放量 | text | Views; text with a "万" suffix |
| 点赞 | number | Likes; plain number |
| 评论 | number | Comments; plain number |
| 分享 | number | Shares; plain number |
| 收藏 | number | Favorites; plain number |
| 互动量合计 | text | Total interactions; text with a "万" suffix |
| 数据统计日期 | text | Statistics date; format "2026.5.15" |

⚠️ Note: `播放量` and `互动量合计` are text fields, not number fields. Passing a number fails with TextFieldConvFail.

## Cleanup Protocol

After each collection run, delete:

- `/tmp/sm-collect/` (intermediate JSON)
- `/tmp/screenshots/` (browser screenshots)
- `/tmp/subagent-out/` (if sub-agents were spawned)
- Any `.json` temp files in the workspace

## Error Handling

- API 403/401 → token expired; refresh and retry once
- Browser timeout → increase to 25s, retry with `wait_until="domcontentloaded"`
- Platform redirects → check that the URL is correct (Yiche (易车): `hao` vs `sv` domain!)
- Empty data → flag for manual check, don't guess

## Platform-Specific Notes

See `references/platform-guide.md` for detailed per-platform experience, including:

- Authentication requirements
- URL patterns and gotchas
- Data extraction selectors
- Known limitations
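
The type rules in the Bitable Field Mapping section are easy to get wrong, so here is a minimal sketch of how one record might be shaped before it is written. The helper names `format_wan` and `build_record` are hypothetical and are not part of the skill's scripts. Two assumptions are made because the source does not specify them: counts of 10,000 or more are rendered with one decimal and a "万" suffix while smaller counts stay as plain digit text, and `互动量合计` is the sum of likes, comments, shares and favorites.

```python
from datetime import date


def format_wan(value: int) -> str:
    """Render a count as text, using the "万" (10,000) suffix for large values.

    Assumed rule (not stated in the source): values below 10,000 are plain
    digits, larger values keep at most one decimal place.
    """
    if value < 10_000:
        return str(value)
    text = f"{value / 10_000:.1f}"
    # Drop a trailing ".0" so 20000 becomes "2万" rather than "2.0万".
    text = text.rstrip("0").rstrip(".")
    return f"{text}万"


def build_record(views: int, likes: int, comments: int, shares: int,
                 favorites: int, stat_date: date) -> dict:
    """Build one row using the field names and types from the mapping table."""
    # Assumption: total interactions = likes + comments + shares + favorites.
    interactions = likes + comments + shares + favorites
    return {
        "播放量": format_wan(views),            # text field, never a number
        "点赞": likes,                          # number field
        "评论": comments,                       # number field
        "分享": shares,                         # number field
        "收藏": favorites,                      # number field
        "互动量合计": format_wan(interactions),  # text field, never a number
        # Text date without zero padding, matching the "2026.5.15" format.
        "数据统计日期": f"{stat_date.year}.{stat_date.month}.{stat_date.day}",
    }


if __name__ == "__main__":
    print(build_record(123456, 3200, 150, 80, 70, date(2026, 5, 15)))
```

Keeping the two text fields as strings at this stage means the write step never has to guess a field's type, which is where TextFieldConvFail would otherwise surface.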