Ship an MCP-native RAG Backend in a Weekend
Build an MCP server that exposes your RAG pipeline as tools — so Claude Desktop, Cursor, and any MCP-compatible AI client can query your documents directly. No bespoke integrations. No vendor lock-in.
Before you start
This guide assumes you've shipped at least one LangChain or FastAPI project. You know what a vector store is. You've got Python 3.11+ and an OPENAI_API_KEY ready. If you're starting from zero, the code still makes sense — but you'll want to skim the LangChain docs in parallel.

Vector embeddings in motion — the neural backbone of your MCP RAG server
Here's the problem with most RAG tutorials: they stop at "here's how to query your documents in Python." Cool. But now you have a Python script that only you can run. The moment you want Claude Desktop to answer questions about your codebase, or Cursor to pull context from your internal docs, you're back to building custom integrations for each AI tool.
MCP (Model Context Protocol) fixes this. It's a standardised interface that any AI client — Claude Desktop, Cursor, Windsurf, anything built on the MCP spec — can consume without you rewriting anything. You build one MCP server. Every MCP-compatible client can use it.
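Concretely, every MCP client speaks the same JSON-RPC shape on the wire. A tool invocation for the server you're about to build would look roughly like this (illustrative message; method and field names per the MCP spec):

```json
{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "tools/call",
  "params": {
    "name": "query_rag",
    "arguments": { "query": "How does the auth setup work?", "top_k": 4 }
  }
}
```

The client never imports your code; it just sends messages like this and renders what comes back.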
In this guide, you're going to build exactly that: an MCP server that wraps a production-grade RAG pipeline (LangChain + ChromaDB + FastAPI) and exposes it as typed, documented tools. By Sunday night, Claude Desktop will be answering questions about your documents with full citations.
1. What You'll Ship by Sunday
No vague promises. Here's exactly what's running by the time you finish:
- ✓ A FastAPI server acting as an MCP server on localhost:8000
- ✓ A RAG retrieval tool registered as an MCP tool: query_rag(query: str, top_k: int)
- ✓ Claude Desktop connected to your MCP server via mcp.json config
- ✓ Claude answering questions grounded in your documents with source citations
- ✓ A Cursor workspace where /rag-query calls your pipeline directly
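Here's the directory layout those pieces add up to — three Python files plus your data:

```
rag-mcp/
├── .venv/              # virtual environment
├── docs/               # your source documents (.txt)
├── chroma_data/        # persisted ChromaDB index
├── rag_pipeline.py     # load, chunk, embed, retrieve
├── mcp_server.py       # MCP server (stdio transport)
└── server_http.py      # optional FastAPI/HTTP wrapper
```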
2. Install Dependencies
You're going to need the MCP Python SDK, LangChain, ChromaDB, and FastAPI. Create a fresh virtual environment — don't pollute a global install.
mkdir rag-mcp && cd rag-mcp
python -m venv .venv && source .venv/bin/activate
pip install fastapi uvicorn
pip install "mcp[server]" langchain langchain-openai langchain-community
pip install chromadb sentence-transformers
pip install python-dotenv pydantic
The mcp package is the official Python implementation of the MCP spec; the [server] extra pulls in the server-side dependencies. sentence-transformers is optional — swap it for OpenAI embeddings if you prefer hosted, but local embeddings keep everything private and free at scale.
3. The RAG Pipeline
Before you expose anything over MCP, you need a RAG pipeline that actually works. We're going to build a minimal but real one: load documents, chunk them, embed them, store them in ChromaDB, and retrieve with a query.
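To build intuition for the chunk_size / chunk_overlap knobs before reaching for LangChain, here's a toy sliding-window chunker — a deliberately naive sketch, not LangChain's actual algorithm (which prefers splitting at paragraph and sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Fixed-window chunking with overlap. Each chunk starts
    (chunk_size - chunk_overlap) characters after the previous one,
    so consecutive chunks share chunk_overlap characters of context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

# A 2,500-character document yields 3 overlapping chunks:
# [0:1000], [800:1800], [1600:2500]
print(len(chunk_text("x" * 2500)))  # 3
```

The overlap is what keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.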
3a. The Vector Store Setup
ChromaDB runs as an embedded local store for development — no separate database process, just a directory on disk. In production you'd use pgvector on Railway, but for a weekend project this is fine.
# rag_pipeline.py
import os

from langchain_community.document_loaders import TextLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import SentenceTransformerEmbeddings

# Choose one:
# embeddings = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"])
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

VECTORSTORE_DIR = "./chroma_data"

def build_vectorstore(docs_path: str = "./docs") -> Chroma:
    """Load docs, chunk, embed, and persist to ChromaDB."""
    loader = DirectoryLoader(docs_path, glob="**/*.txt", loader_cls=TextLoader)
    documents = loader.load()
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", " ", ""]
    )
    chunks = splitter.split_documents(documents)
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=VECTORSTORE_DIR
    )
    vectorstore.persist()  # no-op on Chroma >= 0.4, which auto-persists
    return vectorstore

def load_vectorstore() -> Chroma:
    """Reload an existing index without re-embedding."""
    return Chroma(
        persist_directory=VECTORSTORE_DIR,
        embedding_function=embeddings
    )

3b. The Retrieval Function
Your retrieval function takes a query string and returns the top-k relevant chunks with metadata. This is the function you'll expose as an MCP tool.
from langchain_core.documents import Document

def retrieve_context(query: str, top_k: int = 4) -> list[Document]:
    """Retrieve top-k relevant chunks from the vector store."""
    vectorstore = load_vectorstore()
    results = vectorstore.similarity_search_with_score(query, k=top_k)
    # Chroma returns distance scores, so lower = more similar.
    # Keep anything under 0.85 to drop clearly irrelevant chunks.
    filtered = [doc for doc, score in results if score < 0.85]
    return filtered[:top_k]

def format_context(docs: list[Document]) -> str:
    """Render retrieved docs as a citation block for the LLM."""
    blocks = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "n/a")
        blocks.append(
            f"[{i}] (Source: {source}, Page {page})\n{doc.page_content}"
        )
    return "\n\n".join(blocks)

3c. Seed Some Documents
Create a ./docs directory and drop a few .txt files in there — internal documentation, notes, a README, anything. Then run:
from rag_pipeline import build_vectorstore

build_vectorstore("./docs")

Once it finishes without errors, check that a ./chroma_data directory exists — that's your persisted vector store, and it will survive server restarts.
4. Build the MCP Server
The MCP Python SDK gives you a decorator-based API. You register handlers with decorators like @app.list_tools() and @app.call_tool(), and the SDK handles serialisation, type checking, and the JSON-RPC transport under the hood. (The SDK's higher-level FastMCP class offers a one-decorator @mcp.tool() shortcut; we use the lower-level Server class here so the tool schema is explicit.)
# mcp_server.py
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
from pydantic import BaseModel, Field

from rag_pipeline import retrieve_context, format_context

# The server instance. The name must match the key in your mcp.json.
app = Server("rag-mcp-server")

class QueryRAGInput(BaseModel):
    """Mirrors the JSON schema declared in list_tools(); used to
    validate incoming arguments in call_tool()."""
    query: str = Field(description="Natural language question to answer from your documents.")
    top_k: int = Field(default=4, ge=1, le=20, description="Number of context chunks to retrieve.")
@app.list_tools()
async def list_tools() -> list[Tool]:
    """Declare the tools this server exposes."""
    return [
        Tool(
            name="query_rag",
            description=(
                "Retrieves relevant context from the internal document knowledge base. "
                "Use this when the user's question requires specific information from "
                "internal docs, READMEs, or technical specifications."
            ),
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Natural language question to answer from your documents."
                    },
                    "top_k": {
                        "type": "integer",
                        "default": 4,
                        "description": "Number of context chunks to retrieve (1-20)."
                    }
                },
                "required": ["query"]
            }
        )
    ]
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    """Handle incoming tool calls from MCP clients."""
    if name == "query_rag":
        args = QueryRAGInput(**arguments)  # validates types and the 1-20 range
        docs = retrieve_context(args.query, top_k=args.top_k)
        if not docs:
            return [TextContent(
                type="text",
                text="No relevant documents found for that query. Try rephrasing."
            )]
        context_block = format_context(docs)
        return [TextContent(
            type="text",
            text=(
                f"Retrieved {len(docs)} relevant chunk(s):\n\n"
                f"{context_block}\n\n"
                "Use the context above to answer the user's question. "
                "Always cite the source and page number."
            )
        )]
    raise ValueError(f"Unknown tool: {name}")
async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            app.create_initialization_options()
        )

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

Run python mcp_server.py. If nothing errors out, your MCP server is live on stdio. The server reads JSON-RPC messages from stdin and writes responses to stdout — FastAPI isn't even involved yet. You'll add HTTP in the next section.
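Before wiring up any client, you can exercise the server interactively with the official MCP Inspector (requires Node; package name per the MCP docs):

```shell
# Launches a local web UI that connects to your stdio server,
# lists its tools, and lets you invoke query_rag by hand.
npx @modelcontextprotocol/inspector python mcp_server.py
```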
5. Add FastAPI for HTTP (Optional but Recommended)
stdio mode works great for Claude Desktop. But if you want to expose the same tools over HTTP — for web apps, remote agents, or production deploys — mount the MCP server's SSE transport inside FastAPI. The SDK ships an SSE transport class for exactly this; the module paths below match recent SDK releases, but check the docs for yours.
# server_http.py
from fastapi import FastAPI, Request
from mcp.server.sse import SseServerTransport

from mcp_server import app as mcp_app

app = FastAPI(title="RAG MCP Server")

# SSE transport: clients open a stream on /sse and POST messages to /messages/
sse = SseServerTransport("/messages/")

@app.get("/sse")
async def handle_sse(request: Request):
    async with sse.connect_sse(request.scope, request.receive, request._send) as (read_stream, write_stream):
        await mcp_app.run(read_stream, write_stream, mcp_app.create_initialization_options())

# handle_post_message is an ASGI app; FastAPI can mount it directly
app.mount("/messages/", app=sse.handle_post_message)

@app.get("/health")
def health():
    return {"status": "ok", "server": "rag-mcp"}

# Run: uvicorn server_http:app --reload --port 8000

Now you have both: stdio mode for Claude Desktop (the local dev workflow) and HTTP mode for every other client. Same server, two transports.
6. Register with Claude Desktop
This is the part that makes it click. Open Claude Desktop's config:
# macOS
~/Library/Application Support/Claude/claude_desktop_config.json
# Linux
~/.config/Claude/claude_desktop_config.json
Add your server to the mcpServers dict:
{
  "mcpServers": {
    "rag-mcp": {
      "command": "python",
      "args": ["/absolute/path/to/rag-mcp/mcp_server.py"],
      "env": {
        "OPENAI_API_KEY": "your-key-here"
      }
    }
  }
}

Restart Claude Desktop. Open the menu → Settings → MCP Servers — you should see rag-mcp listed as connected. That single registration gives every Claude conversation access to your query_rag tool.
7. Test It
In Claude Desktop, try asking:
"What does the README say about the authentication setup?"
Claude will call your query_rag tool, retrieve context from your documents, and answer with citations. The first call takes 2-3 seconds — a cold start while the local embedding model loads (or the OpenAI client initialises). Subsequent calls reuse the loaded model.
If Claude says "I don't have access to that information," check the MCP Server status in settings — the server likely didn't start. Run python mcp_server.py manually to see the error output.
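On macOS, Claude Desktop also writes per-server MCP logs, which usually contain the actual traceback (log location per the MCP debugging guide):

```shell
# Tail Claude Desktop's MCP logs, including one file per configured server
tail -f ~/Library/Logs/Claude/mcp*.log
```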
8. Connect Cursor the Same Way
Cursor reads the same config format from ~/.cursor/mcp.json (global) or .cursor/mcp.json inside a project. Point it at the same server:
{
  "mcpServers": {
    "rag-mcp": {
      "command": "python",
      "args": ["/absolute/path/to/rag-mcp/mcp_server.py"],
      "env": {
        "OPENAI_API_KEY": "your-key-here"
      }
    }
  }
}

Cursor will surface query_rag in its agent tool palette. Run a Cursor agent with /rag-query: What's our deployment pipeline? and watch it pull your internal docs mid-conversation.
What You've Got by Sunday Night
A fully functional MCP server with a RAG retrieval tool. Every MCP client — Claude Desktop, Cursor, Windsurf, any agent framework built on the spec — can now query your documents without you writing a single custom integration.
The gap between this weekend project and production is three things:
- Persistence: Move from local ChromaDB to pgvector on Railway. LangChain's PGVector store exposes the same retriever interface, so the swap is a few lines, not a rewrite.
- Auth: Add middleware to the HTTP transport so only your team can call the server.
- Evaluation: Wire in Ragas or DeepEval to track retrieval quality over time.
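For the auth item, the core is just an API-key gate in front of the HTTP transport. A minimal sketch you'd call from FastAPI middleware — the RAG_MCP_API_KEY / X-API-Key names are my own, not part of MCP:

```python
import hmac

def is_authorized(headers: dict[str, str], api_key: str) -> bool:
    """Constant-time comparison of a client-supplied X-API-Key header
    against the server's secret. Returns False when no key is configured,
    so the server fails closed rather than open."""
    supplied = headers.get("x-api-key", "")
    return bool(api_key) and hmac.compare_digest(supplied, api_key)

# In middleware: read RAG_MCP_API_KEY from the environment once at startup,
# then return a 401 whenever is_authorized(request.headers, API_KEY) is False.
```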
If you want all three done by someone who's already built this exact stack — the Supertute Done-For-You RAG Backend ships with pgvector + Redis + auth + Railway deployment in a single weekend build. $500 flat.
Done-For-You RAG Backend
If you'd rather have a production-grade MCP-native RAG backend built for you — with pgvector, Redis caching, JWT auth, and Railway deployment — Supertute ships it in one weekend. $500 flat. No lock-in.
See the Details → /product

Summary
MCP turns your RAG pipeline into a universal AI tool. You write the server once; Claude Desktop, Cursor, and every other MCP-compatible client becomes a consumer. The stack is simple — LangChain for the RAG logic, ChromaDB for the vector store, the MCP Python SDK to expose it as tools, and optional FastAPI for HTTP transport.
The code above is a complete, working implementation. Drop it into a rag-mcp/ directory, seed your docs, and you're querying them from Claude Desktop by dinner.