Cleans and deduplicates multi-format data with AI field detection, format standardization, and multi-source merging, then outputs Excel, CSV, or Feishu Bitable.
---
name: data-cleaner-ai
label: Data Cleaner AI
version: 1.0.0
language: Python
runtime: subprocess (scripts/main.py)
trigger_words:
- data cleaning
- deduplication
- spreadsheet cleanup
- data merge
- format standardization
- CRM data cleanup
- Excel cleaning
- clean data
- remove duplicates
- merge data
---
# Data Cleaner AI
Upload messy data and get clean, structured output back. Supports multi-format parsing, AI field identification, intelligent deduplication, filling, and formatting, multi-source joins, and Feishu-native output (Bitable plus a quality report doc).
**Use cases:** E-commerce order cleanup, CRM customer data cleansing, bank statement reconciliation, roster cleanup, multi-system data merge.
---
## Capabilities
### F1 · Multi-Format Parsing
- Excel (.xlsx / .xls)
- CSV / TSV
- JSON (semi-structured)
- Clipboard paste text
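
As a rough illustration of how text input in these formats could be normalized into rows, here is a minimal sketch using only the standard library. The function name `parse_text` and its `fmt` argument are hypothetical and are not taken from `scripts/parser.py`; Excel input is left out because it needs a third-party reader.

```python
import csv
import io
import json


def parse_text(text: str, fmt: str) -> list[dict]:
    """Parse raw text into a list of row dicts (hypothetical helper).

    `fmt` is one of 'csv', 'tsv', or 'json'.
    """
    if fmt in ("csv", "tsv"):
        delimiter = "," if fmt == "csv" else "\t"
        reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
        return [dict(row) for row in reader]
    if fmt == "json":
        data = json.loads(text)
        # Accept either a list of records or a single record object.
        records = data if isinstance(data, list) else [data]
        return [dict(record) for record in records]
    raise ValueError(f"unsupported format: {fmt}")


rows = parse_text("name,phone\nJohn,13800138000", "csv")
print(rows)
```

Returning plain lists of dicts keeps the later stages (field identification, cleaning, merging) independent of the input format.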
### F2 · Smart Field Identification
- AI auto-detects: name, phone, email, address, amount, date, SKU, order ID, ID number, gender, etc.
- Supports user-defined field mapping override
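
The skill relies on an AI model for detection, so the following is only a rule-based stand-in that shows how detection and a user-defined override might combine. The regex patterns, the 80% match cutoff, and the names `guess_field_type` and `identify_fields` are all assumptions made for illustration.

```python
import re

# Order matters: an 11-digit phone number would also match the amount pattern.
PATTERNS = [
    ("phone", re.compile(r"^1\d{10}$")),
    ("email", re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    ("date", re.compile(r"^\d{4}[-/]\d{1,2}[-/]\d{1,2}$")),
    ("amount", re.compile(r"^-?\d+(\.\d+)?$")),
]


def guess_field_type(values, min_ratio=0.8):
    """Return the first type whose pattern matches at least `min_ratio` of samples."""
    samples = [v.strip() for v in values if v and v.strip()]
    if not samples:
        return "unknown"
    for name, pattern in PATTERNS:
        hits = sum(1 for v in samples if pattern.match(v))
        if hits / len(samples) >= min_ratio:
            return name
    return "unknown"


def identify_fields(columns, overrides=None):
    """Detect a type per column, then let user-supplied mappings win."""
    detected = {col: guess_field_type(vals) for col, vals in columns.items()}
    detected.update(overrides or {})
    return detected


columns = {
    "tel": ["13800138000", "13912345678"],
    "mail": ["a@b.com"],
    "note": ["hello"],
}
print(identify_fields(columns, overrides={"note": "address"}))
```

Applying the override last means a user mapping always takes precedence over whatever the detector guessed.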
### F3 · Data Cleaning
- **Deduplication**: Exact match + fuzzy dedup (FuzzyWuzzy, threshold 88%)
- **Missing value fill**: Mean / mode / semantic inference / leave blank
- **Format standardization**:
  - Phone → `1xx-xxxx-xxxx`
  - Date → `YYYY-MM-DD`
  - Amount → 2 decimal places
  - Address → Province/City/District/Street standardization
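
The cleaning rules above can be sketched as follows. This is not the code in `scripts/cleaner.py`: the helper names are hypothetical, the accepted date formats are an assumption, and `difflib.SequenceMatcher` from the standard library is swapped in for FuzzyWuzzy, so its similarity scores will not match FuzzyWuzzy's exactly even with the same threshold of 88.

```python
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP
from difflib import SequenceMatcher


def format_phone(raw):
    """Format an 11-digit mobile number as 1xx-xxxx-xxxx; leave other input as-is."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        return f"{digits[:3]}-{digits[3:7]}-{digits[7:]}"
    return raw


def format_date(raw, formats=("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d")):
    """Try each assumed input format and emit YYYY-MM-DD; leave unparseable input as-is."""
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw


def format_amount(raw):
    """Strip currency symbols and separators, then round to 2 decimal places."""
    cleaned = re.sub(r"[^\d.\-]", "", raw)
    try:
        return str(Decimal(cleaned).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
    except InvalidOperation:
        return raw


def fuzzy_dedup(rows, key, threshold=88):
    """Keep the first row of each group whose `key` values are at least `threshold`% similar."""
    kept = []
    for row in rows:
        value = row[key].strip().lower()
        is_duplicate = any(
            SequenceMatcher(None, value, k[key].strip().lower()).ratio() * 100 >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(row)
    return kept


print(format_phone("13800138000"))   # 138-0013-8000
print(format_date("2024/03/05"))     # 2024-03-05
print(format_amount("1,234.5"))      # 1234.50
```

The pairwise comparison in `fuzzy_dedup` is quadratic in the number of kept rows, which is fine for a sketch but would need blocking or indexing on large files.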
### F4 · Data Classification / Tagging (PRO)
- 8 built-in business rules (high-value customer, dormant user, VIP, enterprise, etc.)
- Supports custom JSON rules
- AI auto-tagging (requires PRO + AI API Key)
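
The custom JSON rule format is not specified here, so the sketch below assumes a hypothetical schema (`tag`, `field`, `op`, `value`) purely to show how such rules could be evaluated against a row. The tag names and thresholds in the example are invented, not the built-in rules.

```python
import json
import operator

# Comparison operators allowed in the assumed rule schema.
OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
}


def apply_rules(row, rules):
    """Return the tags of every rule whose condition holds for `row`."""
    tags = []
    for rule in rules:
        value = row.get(rule["field"])
        if value is not None and OPS[rule["op"]](value, rule["value"]):
            tags.append(rule["tag"])
    return tags


# Hypothetical rules; thresholds are illustrative only.
rules = json.loads("""
[
  {"tag": "high_value", "field": "total_spend", "op": ">=", "value": 10000},
  {"tag": "dormant", "field": "days_since_last_order", "op": ">", "value": 180}
]
""")

print(apply_rules({"total_spend": 12000, "days_since_last_order": 30}, rules))
```

Rows missing a rule's field are simply not tagged by that rule, which avoids failing the whole run on sparse data.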
### F5 · Multi-Source Join / Merge (PRO)
- Cross-file relational join on key fields
- Fuzzy join when exact key not available (FuzzyWuzzy)
- Conflicting-field resolution: prioritize by source order or by latest timestamp
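
To make the two conflict-resolution strategies concrete, here is a minimal sketch of an exact-key merge. It is not the logic in `scripts/merger.py`: the function name, the `strategy` values, and the `updated_at` timestamp field are assumptions, and timestamps are assumed to be ISO 8601 strings so that string comparison orders them correctly.

```python
def merge_records(sources, key, strategy="source_order", ts_field="updated_at"):
    """Merge lists of row dicts on `key`.

    strategy="source_order": earlier sources win on conflicting fields.
    strategy="latest": the record with the newer `ts_field` wins.
    Empty fields are always filled from any source that has a value.
    """
    merged = {}
    for records in sources:
        for rec in records:
            k = rec[key]
            if k not in merged:
                merged[k] = dict(rec)
                continue
            current = merged[k]
            newer = strategy == "latest" and rec.get(ts_field, "") > current.get(ts_field, "")
            for field, value in rec.items():
                if current.get(field) in ("", None):
                    current[field] = value      # fill a gap
                elif newer and value not in ("", None):
                    current[field] = value      # newer record overrides
    return list(merged.values())


customers = [{"phone": "138", "name": "John", "city": "", "updated_at": "2024-01-01"}]
orders = [{"phone": "138", "name": "Jon", "city": "Shanghai", "updated_at": "2024-06-01"}]

print(merge_records([customers, orders], "phone"))
print(merge_records([customers, orders], "phone", strategy="latest"))
```

With source-order priority the first file's `name` is kept and only the empty `city` is filled; with the latest-timestamp strategy the newer record's values replace the older ones.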
### F6 · Feishu Native Output
- Excel / CSV export
- Feishu Bitable (multi-dimensional table) write-back
- Data quality report auto-generated as Feishu Doc (Markdown)
---
## Tier Feature Matrix
| Feature | FREE | PRO |
|---------|:----:|:---:|
| Multi-format parsing | ✅ | ✅ |
| Basic dedup | ✅ | ✅ |
| Smart fill | ❌ | ✅ |
| Format standardization | ❌ | ✅ |
| Fuzzy dedup | ❌ | ✅ |
| Multi-source merge | ❌ | ✅ |
| AI classification | ❌ | ✅ |
| Data quality report | ❌ | ✅ |
| Feishu Bitable output | ❌ | ✅ |
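
The matrix can be read as a simple lookup. The sketch below encodes it as two sets and splits a request into enabled and skipped features; the feature keys and the function name are hypothetical and are not taken from `scripts/tier_limits.py`.

```python
# Feature keys are illustrative names for the rows of the matrix above.
FREE_FEATURES = {"multi_format_parsing", "basic_dedup"}
PRO_ONLY_FEATURES = {
    "smart_fill",
    "format_standardization",
    "fuzzy_dedup",
    "multi_source_merge",
    "ai_classification",
    "quality_report",
    "bitable_output",
}


def split_by_tier(requested, tier):
    """Return (enabled, skipped) feature lists for the given tier."""
    allowed = FREE_FEATURES | (PRO_ONLY_FEATURES if tier == "PRO" else set())
    enabled = [f for f in requested if f in allowed]
    skipped = [f for f in requested if f not in allowed]
    return enabled, skipped


enabled, skipped = split_by_tier(["basic_dedup", "fuzzy_dedup"], "FREE")
print(enabled, skipped)
```

Returning the skipped list, rather than raising, lets a pipeline continue with the features that are available and report what was left out.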
---
## Pricing
**Per-call billing (no monthly fee):**
| Tier | Price per Call |
|------|---------------|
| FREE | $0.00 USDT |
| PRO | $0.01 USDT |
Each cleaning pipeline execution (clean or merge) = one billable call.
---
## Usage
### Feishu Trigger
```
data cleaning
deduplication
spreadsheet cleanup
CRM data cleanup
Excel cleaning
```
### CLI
```bash
python scripts/main.py clean -i data.xlsx -o cleaned.xlsx
python scripts/main.py clean -t "name,phone\nJohn,13800138000" -f csv -o cleaned.csv
python scripts/main.py merge --sources customers.xlsx orders.csv --on phone -o merged.xlsx
```
### Python API
```python
from main import run_clean_pipeline
result = run_clean_pipeline(
    sources=["orders.xlsx"],
    output_format="xlsx",
    output_path="/tmp/cleaned.xlsx",
    dedup_strategy="auto",
    fill_strategy="auto",
    classify=True,
    ai_model="deepseek",
    generate_report=True,
)
```
---
## Directory Structure
```
data-cleaner-ai/
├── SKILL.md
└── scripts/
    ├── main.py              # Entry: run_clean_pipeline / run_merge_pipeline
    ├── parser.py            # F1: Multi-format parsing
    ├── field_identifier.py  # F2: AI field identification
    ├── cleaner.py           # F3: Cleaning engine
    ├── classifier.py        # F4: Classification / tagging
    ├── merger.py            # F5: Multi-source join
    ├── reporter.py          # F6: Quality report generation
    ├── output.py            # F6: Output (Excel/CSV/Bitable/Feishu Doc)
    ├── tier_limits.py       # Tier access control
    └── billing.py           # SkillPay billing integration
```
---
## Billing
This skill uses **SkillPay** (skillpay.me) for per-call billing.
**Fee:** $0.01 USDT per execution on the paid (PRO) tier
**External API:** `https://skillpay.me/api/v1/billing`
**Data transmitted:** User identifier (`FEISHU_USER_ID` environment variable)
Billing occurs at the start of each cleaning or merge execution. If balance is insufficient, the tool returns a `payment_url` where the user can recharge.
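
The charge-first flow can be sketched without touching the network. The SkillPay request and response formats are not documented here, so the code below takes the charge step as an injected callable and assumes a hypothetical response dict with `success` and `payment_url` keys; `bill_then_run` and `dev_mode` are illustrative names only.

```python
def bill_then_run(charge, run, dev_mode=False):
    """Charge first, then run the pipeline.

    `charge` is a callable returning a dict like
    {"success": bool, "payment_url": str} (assumed shape).
    `run` is a callable that executes the cleaning or merge pipeline.
    """
    try:
        result = charge()
    except ConnectionError:
        if dev_mode:
            # Billing unreachable in dev mode: let the call through, no charge.
            return {"status": "ok", "charged": False, "data": run()}
        raise
    if not result.get("success"):
        # Insufficient balance: hand back the recharge link instead of running.
        return {"status": "payment_required", "payment_url": result.get("payment_url")}
    return {"status": "ok", "charged": True, "data": run()}


def pipeline():
    return "cleaned"


print(bill_then_run(lambda: {"success": True}, pipeline))
```

Injecting `charge` keeps the decision logic testable and leaves the actual HTTP call to whatever `scripts/billing.py` implements.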
---
## Required Environment Variables
| Variable | Description |
|----------|-------------|
| `FEISHU_USER_ID` | Feishu user open_id for billing identification |
| `OPENAI_API_KEY` | AI model API key (OpenAI, MiniMax, or OpenAI-compatible endpoint) |
| `OPENAI_API_BASE` | Base URL for AI API (optional, defaults to MiniMax endpoint) |
| `SKILL_BILLING_API_KEY` | Builder API Key from skillpay.me (required for paid calls) |
| `SKILL_BILLING_SKILL_ID` | Skill slug on SkillPay (defaults to `data-cleaner-ai`) |
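
One way to read these variables at startup is sketched below. The function names and the returned key names are hypothetical. The only default filled in is the documented `data-cleaner-ai` slug, and the default AI endpoint is deliberately left as `None` because its URL is not given here.

```python
import os


def load_config(env=None):
    """Collect the skill's settings from environment variables."""
    env = os.environ if env is None else env
    return {
        "feishu_user_id": env.get("FEISHU_USER_ID"),
        "api_key": env.get("OPENAI_API_KEY"),
        # None means "fall back to the skill's built-in default endpoint".
        "api_base": env.get("OPENAI_API_BASE"),
        "billing_api_key": env.get("SKILL_BILLING_API_KEY"),
        "billing_skill_id": env.get("SKILL_BILLING_SKILL_ID", "data-cleaner-ai"),
    }


def missing_for_paid_call(config):
    """List the settings a paid call would need but that are unset."""
    needed = ("feishu_user_id", "billing_api_key")
    return [name for name in needed if not config[name]]


config = load_config({"FEISHU_USER_ID": "ou_example"})
print(config["billing_skill_id"], missing_for_paid_call(config))
```

Passing `env` explicitly makes the loader easy to exercise in tests without mutating the real process environment.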
---
## Error Handling
| Error | Handling |
|-------|----------|
| Balance insufficient | Return `payment_url` for recharge |
| Network error on billing | Allow call through in dev mode (no charge) |
| Tier feature not available | Skip feature gracefully, continue with available features |
| No data source provided | Raise error requesting input |
---
## License
MIT