---
name: ClawText Ingest
description: Multi-source memory ingestion with Discord support, automatic deduplication, and agent-ready patterns
keywords: discord, memory, ingestion, rag, agents, deduplication, cli
---
# ClawText Ingest: Production-Ready Memory Ingestion
**Version:** 1.3.0 | **License:** MIT | **Status:** Production ✅
**Author:** ragesaq | **Category:** Memory & Knowledge Management
**GitHub:** https://github.com/ragesaq/clawtext-ingest
---
## 🎯 What It Does
ClawText Ingest transforms external data (Discord forums, files, URLs, JSON, text) into structured, deduplicated memories for AI agents.
### The Problem It Solves
- ❌ **Manual ingestion**: Tedious, error-prone, no metadata
- ❌ **Duplicate memories**: Same data ingested multiple times
- ❌ **Unstructured data**: No hierarchy, no context preservation
- ❌ **One-time imports**: No recurring/scheduled ingestion
- ❌ **Discord-specific gaps**: Can't preserve forum post→reply structure
### The Solution
- ✅ **One command** imports from Discord, files, URLs, or JSON
- ✅ **100% idempotent**: Run 1000x, zero duplicates
- ✅ **Automatic metadata**: YAML frontmatter with date, project, type, entities
- ✅ **6 agent patterns**: Autonomous workflows documented and ready
- ✅ **Discord-native**: Forum hierarchy preserved, progress bars, auto-batch mode
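
As a rough illustration of the automatic metadata, a memory with YAML frontmatter might be assembled along these lines. The `toMemory` helper and the exact layout are assumptions made for this sketch, not the library's implementation; only the field names (date, project, type, entities) come from the list above.

```javascript
// Hypothetical helper for illustration; not part of the clawtext-ingest API.
function toMemory(content, { date, project, type, entities = [] }) {
  const frontmatter = [
    '---',
    `date: ${date}`,
    `project: ${project}`,
    `type: ${type}`,
    `entities: [${entities.join(', ')}]`,
    '---',
  ].join('\n');
  // Frontmatter first, then the memory body.
  return `${frontmatter}\n${content}\n`;
}

// Sample values are made up for the example.
console.log(toMemory('Finding: X is better than Y', {
  date: '2024-01-15',
  project: 'findings',
  type: 'fact',
  entities: ['X', 'Y'],
}));
```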
---
## ✨ Key Features
### 🎯 Discord Integration (New in v1.3.0)
- **Forum + Channel + Thread** support
- **Hierarchy preservation**: Post→reply structure in metadata
- **Real-time progress**: Live feedback for large ingestions
- **Auto-batch mode**: Full mode under 500 posts, streaming at 500 or more
- **One-command setup**: 5-minute bot creation
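
The auto-batch threshold and the hierarchy metadata can be pictured with a small sketch. Everything here (`chooseMode`, `flattenPost`, the `parentId` and `depth` fields, the shape of a post object) is hypothetical and only illustrates the idea; the real metadata layout may differ.

```javascript
// Hypothetical sketch; names and shapes are illustrative, not the library's API.
const STREAMING_THRESHOLD = 500; // from the feature list: under 500 full, otherwise streaming

function chooseMode(postCount) {
  return postCount < STREAMING_THRESHOLD ? 'full' : 'streaming';
}

// Flatten a forum post and its replies, recording the parent post in each
// reply's metadata so the post -> reply relationship survives ingestion.
function flattenPost(post) {
  const root = {
    content: post.content,
    metadata: { postId: post.id, parentId: null, depth: 0 },
  };
  const replies = post.replies.map((reply) => ({
    content: reply.content,
    metadata: { postId: reply.id, parentId: post.id, depth: 1 },
  }));
  return [root, ...replies];
}
```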
### Multi-Source Ingestion
- **Files**: Glob patterns (Markdown, text, etc.)
- **URLs**: Single or bulk URL ingestion
- **JSON**: Chat exports, API responses
- **Raw text**: Quick knowledge capture
- **Batch operations**: Unified ingestion from multiple sources
### Deduplication & Safety
- **SHA1-based**: Cryptographic hash matching
- **100% idempotent**: Safe for repeated runs
- **Configurable**: `checkDedupe: true/false` per operation
- **Zero data loss**: Failed items tracked, fallback per-item ingestion
- **Hash persistence**: `.ingest_hashes.json` for cross-session tracking
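
To make the deduplication idea concrete, here is a minimal sketch of SHA1-based filtering with a persisted hash set. It is not the library's implementation: the helper names are hypothetical, and the on-disk layout of `.ingest_hashes.json` is assumed here to be a plain JSON array of hex digests.

```javascript
const { createHash } = require('node:crypto');
const fs = require('node:fs');

// Assumed layout: a JSON array of hex digests (the real file format may differ).
function loadHashes(path) {
  return new Set(fs.existsSync(path) ? JSON.parse(fs.readFileSync(path, 'utf8')) : []);
}

function saveHashes(path, hashes) {
  fs.writeFileSync(path, JSON.stringify([...hashes]));
}

function sha1(text) {
  return createHash('sha1').update(text, 'utf8').digest('hex');
}

// Keep only items whose content hash is unseen, and record each new hash
// so a repeated run skips the same content.
function filterNew(items, hashes) {
  return items.filter((item) => {
    const digest = sha1(item);
    if (hashes.has(digest)) return false;
    hashes.add(digest);
    return true;
  });
}

const seen = new Set(); // in practice: loadHashes('.ingest_hashes.json')
console.log(filterNew(['alpha', 'beta', 'alpha'], seen)); // [ 'alpha', 'beta' ]
console.log(filterNew(['alpha', 'beta'], seen));          // []
```

Running the filter twice over the same input yields nothing new the second time, which is the property that makes repeated ingestion safe.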
### 🤖 Agent-Ready
- **6 documented patterns**: Direct API, Discord Agent, CLI, Cron, Batch, Thread
- **Working code examples**: Copy-paste ready
- **Real-world patterns**: GitHub sync, Discord monitoring, team decisions
- **Error handling**: Comprehensive error recovery
- **Progress callbacks**: Track ingestion in real-time
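
Error recovery and progress reporting could look roughly like the sketch below: try the whole batch, fall back to per-item ingestion on failure, and report progress as items complete. `ingestBatch` and `ingestOne` are placeholders for whatever ingestion calls you use, and `ingestWithFallback` itself is a hypothetical helper, not a clawtext-ingest API.

```javascript
// Hypothetical sketch; ingestBatch/ingestOne are caller-supplied stand-ins.
async function ingestWithFallback(items, { ingestBatch, ingestOne, onProgress = () => {} }) {
  try {
    await ingestBatch(items);
    onProgress({ percent: 100 });
    return { ingested: items.length, failed: [] };
  } catch {
    // Batch failed: retry item by item so one bad item cannot sink the rest.
    const failed = [];
    let done = 0;
    for (const item of items) {
      try {
        await ingestOne(item);
      } catch (error) {
        failed.push({ item, error: error.message }); // track the failure, keep going
      }
      done += 1;
      onProgress({ percent: Math.round((done / items.length) * 100) });
    }
    return { ingested: items.length - failed.length, failed };
  }
}
```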
### 🛠️ Developer-Friendly
- **CLI tool**: `clawtext-ingest` + `clawtext-ingest-discord` commands
- **Node.js API**: Simple imports for programmatic use
- **TypeScript-ready**: Clear method signatures
- **Extensible**: Custom transforms, field mapping
- **Well-documented**: 11 guides, 20+ examples
### ClawText Integration
- **Automatic cluster indexing**: New memories indexed after rebuild
- **RAG injection**: Relevant context injected into agent prompts
- **Project routing**: Organize memories by project/source
- **Entity linking**: Auto-extract and link related entities
---
## Quick Start
### Installation
```bash
# Via npm
npm install clawtext-ingest
# Via OpenClaw
openclaw install clawtext-ingest
```
### Discord Ingestion (5 minutes)
```bash
# 1. Set up Discord bot (see DISCORD_BOT_SETUP.md)
# 2. Get bot token, set DISCORD_TOKEN env var
# 3. Inspect forum
clawtext-ingest-discord describe-forum --forum-id FORUM_ID --verbose
# 4. Ingest with progress
DISCORD_TOKEN=xxx clawtext-ingest-discord fetch-discord --forum-id FORUM_ID
# 5. Rebuild ClawText clusters
clawtext-ingest rebuild
```
### File Ingestion
```bash
clawtext-ingest ingest-files --input="docs/*.md" --project="docs"
```
### Node.js API
```javascript
import { ClawTextIngest } from 'clawtext-ingest';
const ingest = new ClawTextIngest();
// Ingest files
await ingest.fromFiles(['docs/**/*.md'], { project: 'docs', type: 'fact' });
// Ingest JSON
await ingest.fromJSON(chatArray, { project: 'team' }, {
  keyMap: { contentKey: 'message', dateKey: 'timestamp', authorKey: 'user' }
});
// Rebuild clusters for RAG injection
await ingest.rebuildClusters();
```
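
The `keyMap` option tells the ingester which fields of each JSON record hold the content, date, and author. A minimal sketch of that mapping, assuming records are flat objects, might look like this; `applyKeyMap` and the normalized field names are hypothetical, and the sample record is made up.

```javascript
// Hypothetical sketch of what a keyMap does; not the library's implementation.
function applyKeyMap(record, { contentKey, dateKey, authorKey }) {
  return {
    content: record[contentKey],
    date: record[dateKey],
    author: record[authorKey],
  };
}

// Sample chat export (illustrative data only).
const chatArray = [
  { message: 'Ship v1.3.0 on Friday', timestamp: '2024-01-15T10:00:00Z', user: 'alice' },
];
const keyMap = { contentKey: 'message', dateKey: 'timestamp', authorKey: 'user' };

console.log(chatArray.map((record) => applyKeyMap(record, keyMap)));
```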
---
## 🤖 Agent Integration (6 Patterns)
### Pattern 1: Direct API
**For:** In-agent code
**Use when:** Agents need to ingest as part of workflow
```javascript
import { ClawTextIngest } from 'clawtext-ingest';
const ingest = new ClawTextIngest();
await ingest.fromFiles(['docs/**/*.md'], { project: 'docs' });
```
### Pattern 2: Discord Agent
**For:** Autonomous Discord ingestion
**Use when:** Agents need to fetch Discord forums
```javascript
// `ingest` is a ClawTextIngest instance; `forumId` is the ID of the forum to fetch
const runner = new DiscordIngestionRunner(ingest);
await runner.ingestForumAutonomous({
  forumId, mode: 'batch', token: process.env.DISCORD_TOKEN
});
```
### Pattern 3: CLI Subprocess
**For:** Agents executing commands
**Use when:** Simpler CLI-based execution needed
```javascript
// execAsync = promisify(exec), with exec from 'node:child_process' and promisify from 'node:util'
await execAsync('clawtext-ingest-discord fetch-discord --forum-id ID');
```
### Pattern 4: Cron/Scheduled
**For:** Recurring tasks
**Use when:** Daily/hourly ingestion needed
```javascript
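// Assumes a scheduler such as node-cron (`import cron from 'node-cron'`); agentIngest is your own ingestion function
cron.schedule('0 * * * *', () => agentIngest()); // runs at the top of every hour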
```
### Pattern 5: Batch Multi-Source
**For:** Unified ingestion
**Use when:** Multiple sources in one operation
```javascript
await ingest.ingestAll([
  { type: 'files', data: ['docs/**/*.md'], metadata: { project: 'docs' } },
  { type: 'json', data: chatExport, metadata: { project: 'team' } }
]);
```
### Pattern 6: Discord Thread
**For:** Thread-specific ingestion
**Use when:** Single thread fetch needed
```javascript
await runner.ingestThread(threadId);
```
**→ See [AGENT_GUIDE.md](https://github.com/ragesaq/clawtext-ingest/blob/main/AGENT_GUIDE.md) for complete examples**
---
## Real-World Examples
### Example 1: Daily Documentation Sync
```javascript
async function syncDocsDaily() {
  const ingest = new ClawTextIngest();
  const result = await ingest.ingestAll([
    { type: 'files', data: ['docs/**/*.md'], metadata: { project: 'docs' } },
    { type: 'urls', data: ['https://docs.example.com/api'], metadata: { project: 'api-docs' } }
  ]);
  await ingest.rebuildClusters();
  return result;
}
```
### Example 2: Discord Forum Monitoring
```javascript
async function monitorDiscordForum(forumId) {
  const ingest = new ClawTextIngest();
  const runner = new DiscordIngestionRunner(ingest);
  const result = await runner.ingestForumAutonomous({
    forumId,
    mode: 'batch',
    token: process.env.DISCORD_TOKEN,
    onProgress: (p) => console.log(`${p.percent}% complete...`)
  });
  return result;
}
```
### Example 3: Team Decisions Ingestion
```javascript
async function ingestTeamDecisions() {
  const ingest = new ClawTextIngest();
  const result = await ingest.ingestAll([
    { type: 'files', data: ['decisions/adr/**/*.md'], metadata: { type: 'adr' } },
    { type: 'json', data: slackThread, metadata: { type: 'decision', source: 'slack' } }
  ]);
  await ingest.rebuildClusters();
  return result;
}
```
---
## CLI Commands
### `clawtext-ingest`: File/URL/JSON/Text Ingestion
```bash
clawtext-ingest ingest-files --input="docs/*.md" --project="docs" --verbose
clawtext-ingest ingest-urls --input="https://example.com" --project="research"
clawtext-ingest ingest-json --input=messages.json --source="slack"
clawtext-ingest ingest-text --input="Finding: X is better than Y" --project="findings"
clawtext-ingest batch --config=sources.json
clawtext-ingest rebuild
clawtext-ingest status
```
### `clawtext-ingest-discord`: Discord Integration
```bash
# Inspect forum
clawtext-ingest-discord describe-forum --forum-id FORUM_ID --verbose
# Fetch & ingest
DISCORD_TOKEN=xxx clawtext-ingest-discord fetch-discord \
  --forum-id FORUM_ID \
  --mode batch \
  --batch-size 100 \
  --verbose
```
---
## Documentation
| Document | Purpose | Read Time |
|----------|---------|-----------|
| **[README.md](https://github.com/ragesaq/clawtext-ingest#readme)** | Overview + quick start | 5 min |
| **[QUICKSTART.md](https://github.com/ragesaq/clawtext-ingest/blob/main/QUICKSTART.md)** | 5-minute setup | 5 min |
| **[AGENT_GUIDE.md](https://github.com/ragesaq/clawtext-ingest/blob/main/AGENT_GUIDE.md)** | 6 autonomous patterns | 10 min |
| **[API_REFERENCE.md](https://github.com/ragesaq/clawtext-ingest/blob/main/API_REFERENCE.md)** | Complete API docs | 15 min |
| **[PHASE2_CLI_GUIDE.md](https://github.com/ragesaq/clawtext-ingest/blob/main/PHASE2_CLI_GUIDE.md)** | CLI commands | 10 min |
| **[DISCORD_BOT_SETUP.md](https://github.com/ragesaq/clawtext-ingest/blob/main/DISCORD_BOT_SETUP.md)** | Bot creation | 5 min |
| **[CLAYHUB_GUIDE.md](https://github.com/ragesaq/clawtext-ingest/blob/main/CLAYHUB_GUIDE.md)** | Publication | 5 min |
| **[INDEX.md](https://github.com/ragesaq/clawtext-ingest/blob/main/INDEX.md)** | Documentation index | 2 min |
---
## 🎯 Who Should Use This
- ✅ **AI/Agent developers**: Building knowledge-aware agents
- ✅ **RAG engineers**: Populating memory for context injection
- ✅ **Teams using Discord**: Leveraging Discord as knowledge base
- ✅ **DevOps/MLOps**: Automated knowledge ingestion pipelines
- ✅ **Researchers**: Structuring unstructured data sources
---
## ⚡ Performance
| Operation | Approx. time | Notes |
|-----------|-------|-------|
| Ingest 100 files | ~5 sec | With SHA1 dedup check |
| Ingest 1000 JSON items | ~15 sec | Batch processing |
| Small forum (<100 msgs) | ~10 sec | Full mode |
| Large forum (1000+ msgs) | ~2 min | Auto-batch, streaming |
| Rebuild clusters | ~5-30 sec | Depends on total memories |
---
## ✅ Quality Metrics
| Metric | Value |
|--------|-------|
| **Tests** | 22/22 passing ✅ |
| **Code** | 1,254 production lines |
| **Documentation** | 92 KB across 11 guides |
| **Examples** | 20+ working examples |
| **Coverage** | 100% critical paths |
---
## Integration with ClawText
1. **Ingest** data → Creates memories with YAML metadata
2. **Rebuild** clusters → ClawText indexes new memories
3. **RAG layer** → Relevant context injected on next prompt
4. **Agent response** → Enhanced with contextual information
```bash
# Complete workflow
clawtext-ingest-discord fetch-discord --forum-id ID # Step 1
clawtext-ingest rebuild # Step 2
# Step 3-4 automatic (ClawText + Agent)
```
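
The same two-step workflow can be driven from Node. The sketch below is an assumption-laden illustration: `workflowCommands` and `runWorkflow` are hypothetical helpers, and only the CLI names and flags shown above are taken from this page. The runner is injectable so the sequence can be dry-run without the CLIs installed.

```javascript
const { execFile } = require('node:child_process');
const { promisify } = require('node:util');
const execFileAsync = promisify(execFile);

// Build the two CLI invocations from the workflow above, in order.
function workflowCommands(forumId) {
  return [
    ['clawtext-ingest-discord', ['fetch-discord', '--forum-id', forumId]], // Step 1
    ['clawtext-ingest', ['rebuild']],                                      // Step 2
  ];
}

// Run the steps sequentially; a failing step rejects and stops the workflow.
// DISCORD_TOKEN is expected to be present in the inherited environment.
async function runWorkflow(forumId, run = execFileAsync) {
  for (const [command, args] of workflowCommands(forumId)) {
    await run(command, args, { env: process.env });
  }
}
```

Using `execFile` with an argument array, rather than a shell string, avoids quoting problems if the forum ID ever comes from user input.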
---
## Support
- **Documentation:** See [INDEX.md](https://github.com/ragesaq/clawtext-ingest/blob/main/INDEX.md) for navigation
- **Issues:** https://github.com/ragesaq/clawtext-ingest/issues
- **Examples:** 20+ examples in documentation
- **Troubleshooting:** Built into each guide
---
## 📦 Installation & Requirements
**Requirements:**
- Node.js ≥ 18.0.0
- OpenClaw (for agent patterns)
- ClawText ≥ 1.2.0 (for RAG integration)
**Installation:**
```bash
npm install clawtext-ingest
# or
openclaw install clawtext-ingest
```
**Binaries:**
- `clawtext-ingest`: File/URL/JSON ingestion
- `clawtext-ingest-discord`: Discord integration
---
## Why This Over Alternatives
| Feature | ClawText-Ingest | Manual | Generic Importer | API Tool |
|---------|---|---|---|---|
| Discord native | ✅ | ❌ | ❌ | ❌ |
| Deduplication | ✅ | ❌ | Partial | ❌ |
| Agent patterns | ✅ | ❌ | ❌ | ❌ |
| Metadata auto | ✅ | ❌ | Partial | ❌ |
| ClawText integration | ✅ | ❌ | ❌ | ❌ |
| Idempotent | ✅ | ❌ | ❌ | Partial |
---
## License
MIT: use freely, open source, community supported
---
## Contributing
Contributions welcome! See GitHub issues for current priorities.
---
**Ready to ingest? Start with [QUICKSTART.md](https://github.com/ragesaq/clawtext-ingest/blob/main/QUICKSTART.md) (5 min) or [AGENT_GUIDE.md](https://github.com/ragesaq/clawtext-ingest/blob/main/AGENT_GUIDE.md) if you're building agents.**