    AI Engineering

    RAG in 2025: What Actually Works in Production

    Steinn Labs · 9 min read

    Key Takeaways

    • Hybrid retrieval combining BM25 + vector similarity outperforms pure vector search
    • Data quality problems cause more RAG failures than technology choices
    • Reranking can improve answer quality by 15-25% and should not be skipped
    • Evaluate retrieval quality and generation quality separately for meaningful metrics

    RAG Has Grown Up

    Retrieval-Augmented Generation (RAG) was the buzzword of 2024. In 2025, the hype has settled, and clear patterns have emerged for what works in production versus what works only in demos.

    The biggest shift is the move from naive vector similarity search to hybrid retrieval systems that combine dense embeddings, sparse keyword matching, and knowledge graph traversal.

    The Production RAG Stack

    After deploying RAG systems for multiple clients, here is the stack that consistently delivers:

    • Embedding model: OpenAI text-embedding-3-large for English, or Cohere embed-v3 for multilingual content
    • Vector database: pgvector for simplicity, Pinecone for scale
    • Chunking strategy: Semantic chunking with 512-token chunks and a 50-token overlap
    • Retrieval: Hybrid search combining BM25 and vector similarity, merged with Reciprocal Rank Fusion
    • Reranking: Cohere Rerank v3 or a cross-encoder model to reorder and filter the top-k results
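
    To make the retrieval step concrete, here is a minimal sketch of Reciprocal Rank Fusion in Python. It assumes each retriever (BM25 and vector search) has already returned a ranked list of document IDs. The function name, the example IDs, and the smoothing constant k=60 (a commonly used default) are illustrative assumptions, not part of any specific library.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank is
    1-based); the scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two retrievers.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

    Because RRF uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales. The fused list would then be passed to the reranker.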

    Common Mistakes We See

    The most common failure mode is not a technology problem. It is a data quality problem. Teams spend weeks optimizing retrieval algorithms when the real issue is that their source documents are poorly structured, contain contradictory information, or have not been updated in months.

    Other frequent mistakes:

    1. Using chunks that are too large (1000+ tokens), which dilutes relevance
    2. Applying no metadata filtering, which forces the model to sort through irrelevant results
    3. Skipping reranking, which can improve answer quality by 15-25%
    4. Not evaluating retrieval quality separately from generation quality
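
    As an illustration of the metadata filtering point, the sketch below narrows the candidate set before any similarity ranking happens. The Chunk class, the metadata keys (region, doc_type), and the sample texts are hypothetical. In a real deployment this filter would usually be pushed down into the vector database query (for example, a WHERE clause alongside a pgvector search) rather than applied in application code.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(chunks, **required):
    """Keep only chunks whose metadata matches every required key/value."""
    return [
        chunk
        for chunk in chunks
        if all(chunk.metadata.get(key) == value for key, value in required.items())
    ]


# Hypothetical corpus with simple metadata attached at ingestion time.
chunks = [
    Chunk("Refund policy for EU customers", {"region": "eu", "doc_type": "policy"}),
    Chunk("Refund policy for US customers", {"region": "us", "doc_type": "policy"}),
    Chunk("EU onboarding guide", {"region": "eu", "doc_type": "guide"}),
]

eu_policies = filter_by_metadata(chunks, region="eu", doc_type="policy")
print([chunk.text for chunk in eu_policies])
```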

    Evaluation Framework

    We use a three-layer evaluation approach: retrieval precision and recall are measured independently, generation faithfulness is checked against the retrieved context, and end-to-end answer quality is scored by human evaluators on a sample of responses.
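
    The first layer, retrieval quality, can be measured without calling the generator at all. Here is a minimal sketch of precision and recall at k, assuming you have a labeled set of relevant document IDs for each test query. The function name and the example IDs are illustrative.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Compute precision@k and recall@k for one query.

    retrieved: ranked list of document IDs returned by the retriever.
    relevant:  set of document IDs labeled as relevant for the query.
    """
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical retriever output and ground-truth labels for one query.
retrieved = ["d1", "d4", "d2", "d9", "d3"]
relevant = {"d1", "d2", "d3"}

precision, recall = precision_recall_at_k(retrieved, relevant, k=3)
print(f"precision@3={precision:.2f} recall@3={recall:.2f}")
```

    Averaging these numbers over a test set of queries gives a retrieval score that does not move when you change the prompt or the generation model, which is the point of measuring the layers separately.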

    Frequently Asked Questions

    What is the best embedding model for RAG in 2025?

    OpenAI text-embedding-3-large is the top choice for English, while Cohere embed-v3 excels for multilingual applications. The choice depends on your language requirements and budget.

    What chunking strategy works best for RAG?

    Semantic chunking with 512-token chunks and a 50-token overlap has consistently delivered the best results in our deployments. Chunks larger than 1000 tokens tend to dilute relevance.
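
    To show how the chunk size and overlap interact, here is a simplified sketch of a sliding-window splitter. Note the assumptions: this is fixed-size windowing, not semantic chunking (which would also split on topic or structure boundaries), and the input is a pre-tokenized list. In practice you would produce the tokens with the tokenizer that matches your embedding model.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Stand-in tokens; a real pipeline would use model tokenizer output.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])
```

    Each window starts 462 tokens after the previous one (512 minus 50), so the last 50 tokens of one chunk are repeated at the start of the next. That repetition keeps a sentence that straddles a boundary retrievable from at least one chunk.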

    How do you evaluate a RAG system?

    Use a three-layer approach: measure retrieval precision and recall independently, check generation faithfulness against the retrieved context, and score end-to-end answer quality with human evaluators on a sample of responses.
