Recipe

ChromaDB RAG Pipeline Setup with LangChain

A complete guide to building a production-ready Retrieval-Augmented Generation pipeline using ChromaDB, LangChain, and hybrid search.

Building a reliable RAG pipeline is one of the most common challenges when developing AI applications. ChromaDB has emerged as a lightweight, developer-friendly vector database that pairs excellently with LangChain's abstractions. In this guide, you'll learn how to wire every component together — from loading raw documents to querying them with hybrid search — and end up with a pipeline you can drop into any LangChain or FastAPI project.

1. Install Dependencies

Start by installing the core packages. You'll need ChromaDB, LangChain, OpenAI SDK, and sentence-transformers if you plan to run embeddings locally.

pip install chromadb langchain-openai langchain-community
pip install sentence-transformers

For a Next.js or React frontend that talks to a FastAPI backend, you typically run the vector store server-side. The frontend only needs to send natural language queries — the backend handles retrieval and LLM generation internally.

2. Set Up the ChromaDB Client

ChromaDB can run in-memory for quick experiments or persist to disk for production. Persistent mode stores embeddings and metadata in a local SQLite file, making the index easy to back up or copy between environments.

import chromadb

# In-memory (development only)
client = chromadb.Client()

# Persistent (production)
client = chromadb.PersistentClient(path="./chroma_data")

LangChain provides a convenient Chroma wrapper that accepts the client and an embedding function, abstracting away the collection management overhead.

3. Load Documents with LangChain Loaders

LangChain offers dozens of document loaders covering PDFs, Markdown files, Notion exports, Google Docs, and more. The most common starting point for internal knowledge bases is the DirectoryLoader paired with PyPDFLoader or TextLoader.

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

loader = DirectoryLoader(
    "./docs",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader
)
documents = loader.load()

Each loaded document is a Document object with page_content (the raw text) and metadata (source file, page number, etc.). Metadata is critical later for tracing which chunk answered which question.

4. Split Text with RecursiveCharacterTextSplitter

Raw documents are usually too large to embed as a single vector: most embedding models accept only a few hundred to a few thousand tokens of input. RecursiveCharacterTextSplitter is the recommended splitter in LangChain because it tries to break text at semantic boundaries (paragraphs, then lines, then words) before resorting to mid-word splits.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)

chunks = splitter.split_documents(documents)

The chunk_overlap=200 ensures context isn't lost at chunk boundaries — the last 200 characters of one chunk repeat at the start of the next. Adjust chunk_size based on your embedding model's context window and your use case: smaller chunks for precise retrieval, larger chunks for richer context.

5. Generate Embeddings

Embeddings convert text into fixed-dimensional floating-point vectors where semantically similar texts cluster together in high-dimensional space. ChromaDB supports both OpenAI's hosted embeddings and local models.

OpenAI Embeddings

import os

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002",
    openai_api_key=os.environ["OPENAI_API_KEY"]
)

Local Embeddings with Sentence-Transformers

from langchain_community.embeddings import SentenceTransformerEmbeddings

embeddings = SentenceTransformerEmbeddings(
    model_name="all-MiniLM-L6-v2"
)

Local embeddings like all-MiniLM-L6-v2 produce 384-dimensional vectors and run entirely on CPU. They're significantly cheaper for high-volume workloads and keep your data off third-party servers — essential for enterprise compliance.

6. Configure the Vector Store

With documents chunked and an embedding function ready, you create a Chroma vector store and populate it in one step. LangChain's Chroma.from_documents handles embedding generation, collection creation, and data ingestion automatically.

from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_data"
)

# Chroma 0.4+ persists writes automatically; on older versions,
# call vectorstore.persist() after making changes.

The persist_directory points to the same directory used when initializing the PersistentClient. If you're reloading an existing index, use Chroma(persist_directory=..., embedding_function=...) to open it without re-embedding everything.

7. Build Query Chains with RetrievalQA

The core RAG pattern is straightforward: retrieve relevant chunks, stuff them into a prompt, and ask an LLM to answer based on that context. LangChain's RetrievalQA chain wraps this pattern into a single callable.

import os

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(
    model="gpt-4o",
    openai_api_key=os.environ["OPENAI_API_KEY"],
    temperature=0
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4})
)

The search_kwargs={"k": 4} setting tells the retriever to fetch the four most similar chunks for each query. The chain_type="stuff" strategy concatenates all retrieved documents into a single prompt, which is simple and effective for most cases. If the retrieved context would overflow the prompt, consider chain_type="map_reduce", which processes each chunk independently and then combines the intermediate answers.

# Query the RAG pipeline
result = qa_chain.invoke({
    "query": "What are the key steps in the deployment process?"
})

print(result["result"])

8. Implement Hybrid Search

Pure vector search excels at semantic similarity but can miss documents that use different vocabulary than the query. Hybrid search blends vector similarity with traditional BM25 keyword matching, giving you the best of both worlds.

LangChain provides EnsembleRetriever which combines multiple retrievers and merges their results using Reciprocal Rank Fusion (RRF). You'll need to set up a BM25 retriever alongside your Chroma retriever.

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# BM25 keyword retriever (from_documents keeps each chunk's metadata)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 2

# Vector retriever
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

# Combine with Reciprocal Rank Fusion
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # favor vector search slightly
)

The weights=[0.4, 0.6] parameter gives slightly more weight to vector similarity, which tends to be more semantically meaningful. Tune these values based on your domain — technical documentation often benefits from higher BM25 weight since exact terminology matters.
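To build intuition for how weighted RRF merges the two rankings, here is a pure-Python sketch. It assumes the common RRF formula score = weight / (rank + c) with the conventional constant c = 60; EnsembleRetriever's exact scoring is an internal detail that may differ:

```python
def weighted_rrf(rankings, weights, c=60):
    """Merge ranked lists with weighted Reciprocal Rank Fusion."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking):
            # Earlier ranks contribute more; weights scale each retriever.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_a", "doc_b", "doc_c"]    # keyword ranking
vector_hits = ["doc_c", "doc_a", "doc_d"]  # semantic ranking

merged = weighted_rrf([bm25_hits, vector_hits], weights=[0.4, 0.6])
print(merged)  # ['doc_a', 'doc_c', 'doc_d', 'doc_b']
```

Documents that appear high in both lists (doc_a, doc_c) dominate the merged ranking, which is exactly the behavior that makes hybrid search robust to vocabulary mismatch.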

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=ensemble_retriever
)

Summary: Your ChromaDB RAG Pipeline

Building a ChromaDB RAG pipeline with LangChain is a matter of connecting the right abstractions. Load documents with a LangChain loader, split them with RecursiveCharacterTextSplitter, embed them using OpenAI or a local sentence-transformer model, store them in Chroma, and wire everything to an LLM via RetrievalQA. Adding hybrid search via EnsembleRetriever improves recall when queries use domain-specific terminology that pure semantic search might miss.

This stack — ChromaDB + LangChain + OpenAI (or a local LLM) — forms the backbone of most production RAG systems. It runs equally well in a Next.js + FastAPI architecture where the React frontend handles the UI and streaming, while the Python backend manages the vector store and LLM calls securely.

Ready to go further? Add conversation memory with ConversationalRetrievalChain, implement metadata filtering to scope searches by source or date, or swap ChromaDB for a cloud vector database like Pinecone or Weaviate when you need horizontal scaling.