Unlocking Data With Generative AI And RAG PDF Page
Start with a recursive character text splitter (LangChain). For technical PDFs, use semantic chunking.

3.3 Embedding Models

| Model | Dim | Best for |
|-------|-----|-----------|
| text-embedding-3-small (OpenAI) | 1536 | General, cost-effective |
| all-MiniLM-L6-v2 (sentence-transformers) | 384 | Local, fast, lower accuracy |
| BAAI/bge-large-en-v1.5 | 1024 | High retrieval quality |
| voyage-2 | 1024 | Long documents, legal/financial PDFs |
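The recursive character splitting recommended above can be illustrated without any library. The following is a minimal sketch of the idea only, not LangChain's implementation: the function name `recursive_split`, the default separators, and the 200-character limit are assumptions chosen for illustration, and chunk overlap is left out for brevity.

```python
from typing import List, Tuple


def recursive_split(
    text: str,
    chunk_size: int = 200,
    separators: Tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Coarse separators (paragraphs) are tried first; finer ones
    (lines, sentences, words) are used only for pieces still too long.
    """
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left to try: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    chunks: List[str] = []
    current = ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring pieces while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(part, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph, somewhat longer than the first."
    for chunk in recursive_split(sample, chunk_size=40):
        print(repr(chunk))
```

Because paragraph breaks are tried before line breaks, sentences, and spaces, chunks tend to end at natural boundaries. A production splitter would normally also add overlap between neighbouring chunks.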
Final_score = α * vector_similarity + (1-α) * BM25_keyword_score

Set α = 0.7 for semantic-heavy queries and α = 0.3 for exact-match queries (e.g., invoice numbers).

After initial retrieval (top 20 chunks), use a cross-encoder such as BAAI/bge-reranker-v2-m3 to rerank the candidates and keep the 5 most relevant chunks. Passing only the best-matched chunks to the model can significantly reduce hallucinations.

3.7 Generation Prompt Template

You are a helpful assistant for company PDF documents. Answer based ONLY on the following retrieved chunks.

Context: {chunks}

Question: {query}
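The weighted score and the prompt template can be tied together in a few lines. This is a sketch under stated assumptions, not a definitive implementation: the helper names (`min_max`, `hybrid_rank`, `build_prompt`) and the `{chunks}` / `{query}` placeholder names are hypothetical, and the min-max normalization is an added step (not in the formula above) so that unbounded BM25 scores and similarity scores share a 0 to 1 scale. The cross-encoder reranking step is omitted because it requires a model download; it would sit between `hybrid_rank` and `build_prompt`.

```python
from typing import Dict, List

PROMPT_TEMPLATE = (
    "You are a helpful assistant for company PDF documents.\n"
    "Answer based ONLY on the following retrieved chunks.\n\n"
    "Context: {chunks}\n\n"
    "Question: {query}"
)


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] (assumption: makes both signals comparable)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_rank(
    vector_scores: Dict[str, float],
    bm25_scores: Dict[str, float],
    alpha: float = 0.7,
    top_k: int = 20,
) -> List[str]:
    """Rank chunk ids by alpha * vector + (1 - alpha) * BM25, best first."""
    vec, kw = min_max(vector_scores), min_max(bm25_scores)
    fused = {
        chunk_id: alpha * vec.get(chunk_id, 0.0) + (1 - alpha) * kw.get(chunk_id, 0.0)
        for chunk_id in set(vec) | set(kw)
    }
    # Sort by descending score; break ties by id for deterministic output.
    return sorted(fused, key=lambda chunk_id: (-fused[chunk_id], chunk_id))[:top_k]


def build_prompt(query: str, chunks: List[str]) -> str:
    """Fill the generation template with the retrieved chunks and the question."""
    return PROMPT_TEMPLATE.format(chunks="\n---\n".join(chunks), query=query)


if __name__ == "__main__":
    # Toy scores for three chunks; real values would come from a vector
    # index and a BM25 index.
    vector_scores = {"c1": 0.90, "c2": 0.60, "c3": 0.30}
    bm25_scores = {"c1": 1.0, "c2": 2.0, "c3": 9.0}
    texts = {"c1": "Refund policy text", "c2": "Shipping text", "c3": "Invoice INV-001 text"}

    ranked = hybrid_rank(vector_scores, bm25_scores, alpha=0.3, top_k=2)
    print(build_prompt("What does invoice INV-001 say?", [texts[i] for i in ranked]))
```

With these toy scores, α = 0.7 ranks the semantically closest chunk `c1` first, while α = 0.3 promotes `c3`, the chunk with the strongest keyword match. That mirrors the guidance on exact-match queries such as invoice numbers.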





