Implements document chunking, embedding generation, vector storage, and retrieval pipelines for Retrieval-Augmented Generation systems. Use when building RAG…
RAG Implementation
Build Retrieval-Augmented Generation systems that extend AI capabilities with external knowledge sources.
Overview
This skill covers: document processing, embedding generation, vector storage, retrieval configuration, and RAG pipeline implementation.
When to Use
Building Q&A systems over proprietary documents
Creating chatbots with factual information from knowledge bases
Implementing semantic search with natural language queries
Reducing hallucinations with grounded, sourced responses
Building documentation assistants and research tools
Enabling AI systems to access domain-specific knowledge
Instructions
Step 1: Choose Vector Database
Select based on your requirements:
Requirement | Recommended
Production scalability | Pinecone, Milvus
Open-source | Weaviate, Qdrant
Local development | Chroma, FAISS
Hybrid search | Weaviate with BM25
Step 2: Select Embedding Model
Use Case | Model
General purpose | text-embedding-ada-002
Fast and lightweight | all-MiniLM-L6-v2
Multilingual | multilingual-e5-large (e5-large-v2 is English-only)
Higher retrieval accuracy (English) | bge-large-en-v1.5
Step 3: Implement Document Processing Pipeline
Load documents from source (file system, database, API)
Clean and preprocess (remove formatting, normalize text)
Split documents into chunks with appropriate strategy
Generate embeddings for each chunk
Store embeddings in vector database with metadata
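The cleaning step above can be sketched in plain Java. This is a minimal illustration, not a complete preprocessor: the class name `TextCleaner` is hypothetical, and the tag-stripping regex assumes simple, well-formed angle-bracket markup.

```java
import java.text.Normalizer;

final class TextCleaner {
    private TextCleaner() {}

    /** Normalizes Unicode, strips simple markup tags, and collapses whitespace. */
    static String clean(String raw) {
        // NFKC folds compatibility characters (non-breaking spaces, ligatures) into plain forms.
        String text = Normalizer.normalize(raw, Normalizer.Form.NFKC);
        // Naive tag removal; an assumption that suits simple HTML-like input only.
        text = text.replaceAll("<[^>]+>", " ");
        // Collapse runs of whitespace, including newlines, into single spaces.
        return text.replaceAll("\\s+", " ").trim();
    }
}
```

Real documents usually need a format-aware parser (for HTML or PDF) before a step like this.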
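Validation: Verify that every segment produced an embedding of the expected dimension:
    // In LangChain4j, embedAll returns a Response wrapper; content() unwraps the list.
    List<Embedding> embeddings = embeddingModel.embedAll(segments).content();
    if (embeddings.isEmpty() || embeddings.size() != segments.size()
            || embeddings.stream().anyMatch(e -> e.dimension() != expectedDim)) {
        throw new IllegalStateException("Embedding generation failed");
    }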
Step 4: Configure Retrieval Strategy
Choose the appropriate strategy:
Dense Retrieval: Semantic similarity via embeddings (default for most cases)
Hybrid Search: Dense + sparse retrieval for better coverage
Metadata Filtering: Filter by document attributes
Reranking: Cross-encoder reranking for high-precision requirements
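Dense retrieval comes down to ranking stored vectors by similarity to the query vector. The sketch below assumes cosine similarity over an in-memory list; `Entry`, `Hit`, and `topK` are illustrative names, not part of any library. A vector database performs the same ranking, typically with an approximate index instead of a full scan.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class DenseRetrievalSketch {
    /** A stored chunk: its id and embedding vector. */
    record Entry(String id, double[] vector) {}

    /** A retrieval hit: chunk id and cosine similarity to the query. */
    record Hit(String id, double score) {}

    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("Dimension mismatch");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // A zero vector has no direction; treat its similarity as 0.
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns up to k entries scoring at least minScore, best first. */
    static List<Hit> topK(double[] query, List<Entry> index, int k, double minScore) {
        List<Hit> hits = new ArrayList<>();
        for (Entry e : index) {
            double score = cosine(query, e.vector());
            if (score >= minScore) hits.add(new Hit(e.id(), score));
        }
        hits.sort(Comparator.comparingDouble(Hit::score).reversed());
        return hits.subList(0, Math.min(k, hits.size()));
    }
}
```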
Step 5: Build RAG Pipeline
Create content retriever with your embedding store
Configure AI service with retriever and chat memory
Implement prompt template with context injection
Add response validation and grounding checks
Validation: Test with known queries to verify context injection works correctly.
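Context injection can be checked without calling a model by building the prompt and asserting that the retrieved chunks appear in it. The sketch below is one possible template; the class name, wording, and numbered-source format are assumptions, not a LangChain4j API.

```java
import java.util.List;

final class PromptTemplateSketch {
    private PromptTemplateSketch() {}

    /** Builds a grounded prompt with numbered context chunks followed by the question. */
    static String buildPrompt(String question, List<String> chunks) {
        StringBuilder context = new StringBuilder();
        for (int i = 0; i < chunks.size(); i++) {
            // Number each chunk so the model can cite its sources.
            context.append('[').append(i + 1).append("] ").append(chunks.get(i)).append('\n');
        }
        return """
                Answer the question using only the context below.
                If the context does not contain the answer, say that you do not know.
                Cite the bracketed source numbers you relied on.

                Context:
                %s
                Question: %s
                """.formatted(context, question);
    }
}
```

A known-query test then passes chunks you expect the retriever to return and checks that each one is present in the resulting prompt.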
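Error Handling: For batch ingestion, wrap each embedding call in retry logic:
    for (Document doc : documents) {
        TextSegment segment = doc.toTextSegment();
        int attempts = 0;
        while (attempts < 3) {
            try {
                // embed() takes a TextSegment (or String) and returns Response<Embedding>.
                store.add(embeddingModel.embed(segment).content(), segment);
                break;
            } catch (RuntimeException e) { // narrow this to your provider's transient errors
                attempts++;
                if (attempts == 3) throw new RuntimeException("Failed after 3 attempts", e);
            }
        }
    }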
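Retrying immediately can hit the same rate limit or outage again. One common refinement, sketched here under assumptions, is exponential backoff between attempts. `RetrySketch.withRetry` is a hypothetical helper; the delay values are placeholders to tune for your provider.

```java
import java.util.function.Supplier;

final class RetrySketch {
    private RetrySketch() {}

    /** Runs the task, retrying on RuntimeException and doubling the wait after each failure. */
    static <T> T withRetry(Supplier<T> task, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                // Out of attempts: surface the last failure to the caller.
                if (attempt >= maxAttempts) throw e;
            }
            try {
                Thread.sleep(delay);
            } catch (InterruptedException ie) {
                // Preserve the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting to retry", ie);
            }
            delay *= 2;
        }
    }
}
```

A call such as `withRetry(() -> embeddingModel.embed(segment).content(), 3, 500)` would wait 500 ms and then 1000 ms before giving up on the third failure.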
Step 6: Evaluate and Optimize
Measure retrieval metrics: precision@k, recall@k, MRR
Evaluate answer quality: faithfulness, relevance
Monitor performance and user feedback
Iterate on chunking, retrieval, and prompt parameters
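The retrieval metrics named above can be computed from a ranked list of chunk ids and a set of ids judged relevant. The following is a minimal sketch using the standard definitions; the class and method names are illustrative, and it assumes each ranked list contains no duplicate ids.

```java
import java.util.List;
import java.util.Set;

final class RetrievalMetrics {
    private RetrievalMetrics() {}

    private static int hitsInTopK(List<String> ranked, Set<String> relevant, int k) {
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) hits++;
        }
        return hits;
    }

    /** Fraction of the top k results that are relevant. */
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        if (k <= 0) return 0.0;
        return (double) hitsInTopK(ranked, relevant, k) / k;
    }

    /** Fraction of all relevant items that appear in the top k results. */
    static double recallAtK(List<String> ranked, Set<String> relevant, int k) {
        if (relevant.isEmpty()) return 0.0;
        return (double) hitsInTopK(ranked, relevant, k) / relevant.size();
    }

    /** 1 / rank of the first relevant result, or 0 if none is retrieved. */
    static double reciprocalRank(List<String> ranked, Set<String> relevant) {
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) return 1.0 / (i + 1);
        }
        return 0.0;
    }

    /** Mean of the reciprocal ranks over a set of queries. */
    static double meanReciprocalRank(List<List<String>> rankings, List<Set<String>> relevantSets) {
        if (rankings.size() != relevantSets.size()) {
            throw new IllegalArgumentException("One relevant set is required per ranking");
        }
        if (rankings.isEmpty()) return 0.0;
        double sum = 0;
        for (int q = 0; q < rankings.size(); q++) {
            sum += reciprocalRank(rankings.get(q), relevantSets.get(q));
        }
        return sum / rankings.size();
    }
}
```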
Examples
Example 1: Basic Document Q&A
    List<Document> documents = FileSystemDocumentLoader.loadDocuments("/docs");
    InMemoryEmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();
    EmbeddingStoreIngestor.ingest(documents, store);
    DocumentAssistant assistant = AiServices.builder(DocumentAssistant.class)
            .chatModel(chatModel)
            .contentRetriever(EmbeddingStoreContentRetriever.from(store))
            .build();
    String answer = assistant.answer("What is the company policy on remote work?");
Example 2: Metadata-Filtered Retrieval
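    // Requires: import static dev.langchain4j.store.embedding.filter.MetadataFilterBuilder.metadataKey;
    EmbeddingStoreContentRetriever retriever = EmbeddingStoreContentRetriever.builder()
            .embeddingStore(store)
            .embeddingModel(embeddingModel)
            .maxResults(5)   // return at most 5 segments
            .minScore(0.7)   // drop segments scoring below 0.7
            .filter(metadataKey("category").isEqualTo("technical"))
            .build();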
Example 3: Multi-Source RAG Pipeline
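    ContentRetriever webRetriever = EmbeddingStoreContentRetriever.from(webStore);
    ContentRetriever docRetriever = EmbeddingStoreContentRetriever.from(docStore);
    // query is a dev.langchain4j.rag.query.Query, e.g. Query.from("...")
    List<Content> results = new ArrayList<>();
    results.addAll(webRetriever.retrieve(query));
    results.addAll(docRetriever.retrieve(query));
    // reranker stands for an application-defined helper; it is not a LangChain4j class
    List<Content> ranked = reranker.reorder(query, results);
    // Guard against fewer than 5 results; subList(0, 5) would throw otherwise
    List<Content> topResults = ranked.subList(0, Math.min(5, ranked.size()));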
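When no cross-encoder reranker is available, ranked lists from several retrievers can be merged with reciprocal rank fusion (RRF), which is a different technique from the reranking used in the example above. The sketch below works on plain id lists; `RankFusionSketch` is a hypothetical name, and the constant 60 used in the test is a commonly cited default rather than a tuned value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RankFusionSketch {
    private RankFusionSketch() {}

    /**
     * Fuses ranked id lists: each id scores the sum of 1 / (rrfK + rank) over the
     * lists it appears in, with rank starting at 1. Returns the best `limit` ids.
     */
    static List<String> fuse(List<List<String>> rankedLists, int rrfK, int limit) {
        // LinkedHashMap keeps first-seen order, so ties resolve deterministically.
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> ranked : rankedLists) {
            for (int rank = 0; rank < ranked.size(); rank++) {
                scores.merge(ranked.get(rank), 1.0 / (rrfK + rank + 1), Double::sum);
            }
        }
        List<String> fused = new ArrayList<>(scores.keySet());
        // List.sort is stable, so equal scores keep their first-seen order.
        fused.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return fused.subList(0, Math.min(limit, fused.size()));
    }
}
```

Items that appear in both sources rise to the top because their per-list scores add up, which rewards agreement between retrievers without comparing raw similarity scores across stores.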
Example 4: RAG with Chat Memory
    Assistant assistant = AiServices.builder(Assistant.class)
            .chatModel(chatModel)
            .chatMemory(MessageWindowChatMemory.withMaxMessages(10))
            .contentRetriever(retriever)
            .build();
    assistant.chat("Tell me about the product features");
    assistant.chat("What about pricing for those features?"); // Maintains context
Best Practices
Document Preparation
Clean documents before ingestion; remove irrelevant content and formatting
Add relevant metadata for filtering and context
Chunking Strategy
Start with 500-1000 tokens per chunk; smaller chunks retrieve more precisely, larger chunks keep more surrounding context
Include 10-20% overlap to preserve context at boundaries
Test different sizes for your specific use case
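A sliding window is one simple way to apply a chunk size with overlap. The sketch below approximates tokens with whitespace-separated words, which is an assumption: a real pipeline would count tokens with the embedding model's tokenizer. `OverlapChunker` is an illustrative name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OverlapChunker {
    private OverlapChunker() {}

    /** Splits text into chunks of up to chunkSize words, each sharing `overlap` words with the previous one. */
    static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("Require chunkSize > 0 and 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) return chunks;

        String[] words = trimmed.split("\\s+");
        int step = chunkSize - overlap; // how far the window advances each time
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + chunkSize, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) break; // last window reached the end of the text
        }
        return chunks;
    }
}
```

For example, `chunk(text, 800, 120)` gives 800-word chunks with 15% overlap, inside the ranges suggested above.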
Retrieval Optimization
Start with high k values (10-20), then filter/rerank
Use metadata filtering to improve relevance
Monitor retrieval quality and iterate based on user feedback
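The "retrieve many, then narrow" pattern can be sketched as a filter over an over-fetched candidate list. `Candidate` and `select` are hypothetical names; the sketch assumes each candidate carries a similarity score and string metadata from the initial retrieval.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class FilterThenTruncate {
    /** A retrieved chunk with its similarity score and metadata. */
    record Candidate(String id, double score, Map<String, String> metadata) {}

    /** Keeps candidates whose metadata[key] equals value, then returns the finalK best by score. */
    static List<Candidate> select(List<Candidate> candidates, String key, String value, int finalK) {
        return candidates.stream()
                .filter(c -> value.equals(c.metadata().get(key)))
                .sorted(Comparator.comparingDouble(Candidate::score).reversed())
                .limit(finalK)
                .toList();
    }
}
```

Fetching 10-20 candidates first leaves enough material after filtering to still fill the final result set.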
Performance
Cache embeddings for frequently accessed content
Use batch processing for document ingestion
Optimize vector store indexing for your scale
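Caching embeddings can be as simple as a bounded least-recently-used map in front of the embedding call. The sketch below is an assumption-laden illustration: `EmbeddingCache` is a hypothetical class, the embedder is passed in as a plain function, and the cache key is the exact input text.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class EmbeddingCache {
    private final Function<String, float[]> embedder;
    private final Map<String, float[]> cache;

    EmbeddingCache(Function<String, float[]> embedder, int maxEntries) {
        this.embedder = embedder;
        // accessOrder=true orders entries from least to most recently used.
        this.cache = new LinkedHashMap<String, float[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, float[]> eldest) {
                return size() > maxEntries; // evict the least recently used entry
            }
        };
    }

    /** Returns the cached vector for text, computing and storing it on a miss. */
    synchronized float[] embed(String text) {
        float[] cached = cache.get(text);
        if (cached != null) return cached;
        float[] vector = embedder.apply(text);
        cache.put(text, vector);
        return vector;
    }
}
```

Repeated queries then skip the embedding call entirely; for large or shared deployments an external cache would replace the in-process map.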
Constraints and Warnings
System Constraints
Embedding models cap the number of tokens per input; split longer documents into chunks before embedding
Vector databases require proper indexing for performance
Chunk boundaries may lose context for complex documents
Hybrid search requires additional infrastructure
Quality Warnings
Retrieval quality depends heavily on chunking strategy
Embedding models may not capture domain-specific semantics
Metadata filtering requires proper document annotation
Reranking adds latency to query responses
Security Warnings
Never hardcode credentials: Use environment variables for API keys and passwords
Validate external content: Documents from file systems, APIs, or web sources may contain malicious content (prompt injection)
Apply content filtering on retrieved documents before passing to LLM
Restrict allowed data source URLs and file paths using allowlists
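An allowlist check can run before any loader touches a URL or file path. The sketch below is a starting point under stated assumptions: `SourceAllowlist` is a hypothetical class, only HTTPS hosts are accepted, and the path check normalizes `..` segments but does not resolve symbolic links.

```java
import java.net.URI;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Set;

final class SourceAllowlist {
    private final Set<String> allowedHosts;
    private final Path allowedRoot;

    SourceAllowlist(Set<String> allowedHosts, Path allowedRoot) {
        this.allowedHosts = Set.copyOf(allowedHosts);
        this.allowedRoot = allowedRoot.toAbsolutePath().normalize();
    }

    /** Accepts only https URLs whose host is on the allowlist. */
    boolean isAllowedUrl(String url) {
        try {
            URI uri = URI.create(url);
            String host = uri.getHost();
            return "https".equalsIgnoreCase(uri.getScheme())
                    && host != null
                    && allowedHosts.contains(host.toLowerCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return false; // malformed URL
        }
    }

    /** Accepts only paths that stay inside the allowed root after normalization. */
    boolean isAllowedPath(String path) {
        try {
            Path resolved = allowedRoot.resolve(path).normalize();
            return resolved.startsWith(allowedRoot);
        } catch (InvalidPathException e) {
            return false; // path contains characters the file system rejects
        }
    }
}
```

Rejecting a source here keeps untrusted content out of the index; it complements, rather than replaces, filtering of retrieved text before it reaches the LLM.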
Resources
Reference Documentation
Vector Database Comparison
Embedding Models Guide
Retrieval Strategies
Document Chunking
LangChain4j RAG Guide
by @mindrally