Ship an MCP-native RAG Backend in a Weekend
Build an MCP server that exposes your RAG pipeline as tools — so Claude Desktop, Cursor, and any MCP-compatible AI client can query your documents directly. No bespoke integrations. No vendor lock-in.
Before you start
This guide assumes you've shipped at least one LangChain or FastAPI project. You know what a vector store is. You've got Python 3.11+ and an OPENAI_API_KEY ready. If you're starting from zero, the code still makes sense — but you'll want to skim the LangChain docs in parallel.

Vector embeddings in motion — the neural backbone of your MCP RAG server
Here's the problem with most RAG tutorials: they stop at "here's how to query your documents in Python." Cool. But now you have a Python script that only you can run. The moment you want Claude Desktop to answer questions about your codebase, or Cursor to pull context from your internal docs, you're back to building custom integrations for each AI tool.
MCP (Model Context Protocol) fixes this. It's a standardised interface that any AI client — Claude Desktop, Cursor, Windsurf, anything built on the MCP spec — can consume without you rewriting anything. You build one MCP server. Every MCP-compatible client can use it.
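Concretely, every MCP client speaks the same JSON-RPC shape on the wire. A tool invocation for the server you're about to build would look roughly like this (illustrative message; method and field names per the MCP spec):

```json
{
  "jsonrpc": "2.0",
  "id": 7,
  "method": "tools/call",
  "params": {
    "name": "query_rag",
    "arguments": { "query": "How does the auth setup work?", "top_k": 4 }
  }
}
```

The client never imports your code; it just sends messages like this and renders what comes back.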
In this guide, you're going to build exactly that: an MCP server that wraps a production-grade RAG pipeline (LangChain + ChromaDB + FastAPI) and exposes it as typed, documented tools. By Sunday night, Claude Desktop will be answering questions about your documents with full citations.
1. What You'll Ship by Sunday
No vague promises. Here's exactly what's running by the time you finish:
- ✓ A FastAPI server acting as an MCP server on localhost:8000
- ✓ A RAG retrieval tool registered as an MCP tool: query_rag(query: str, top_k: int)
- ✓ Claude Desktop connected to your MCP server via mcp.json config
- ✓ Claude answering questions grounded in your documents with source citations
- ✓ A Cursor workspace where /rag-query calls your pipeline directly
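Here's the directory layout those pieces add up to — three Python files plus your data:

```
rag-mcp/
├── .venv/              # virtual environment
├── docs/               # your source documents (.txt)
├── chroma_data/        # persisted ChromaDB index
├── rag_pipeline.py     # load, chunk, embed, retrieve
├── mcp_server.py       # MCP server (stdio transport)
└── server_http.py      # optional FastAPI/HTTP wrapper
```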
2. Install Dependencies
You're going to need the MCP Python SDK, LangChain, ChromaDB, and FastAPI. Create a fresh virtual environment — don't pollute a global install.
mkdir rag-mcp && cd rag-mcp
python -m venv .venv && source .venv/bin/activate
pip install fastapi uvicorn
pip install "mcp[server]" langchain langchain-openai langchain-community
pip install chromadb sentence-transformers
pip install python-dotenv pydantic
The mcp package is the official Python implementation of the MCP spec; the [server] extra pulls in the server-side dependencies. sentence-transformers is optional — swap it for OpenAI embeddings if you prefer hosted, but local embeddings keep everything private and free at scale.
3. The RAG Pipeline
Before you expose anything over MCP, you need a RAG pipeline that actually works. We're going to build a minimal but real one: load documents, chunk them, embed them, store them in ChromaDB, and retrieve with a query.
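To build intuition for the chunk_size / chunk_overlap knobs before reaching for LangChain, here's a toy sliding-window chunker — a deliberately naive sketch, not LangChain's actual algorithm (which prefers splitting at paragraph and sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Fixed-window chunking with overlap. Each chunk starts
    (chunk_size - chunk_overlap) characters after the previous one,
    so consecutive chunks share chunk_overlap characters of context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

# A 2,500-character document yields 3 overlapping chunks:
# [0:1000], [800:1800], [1600:2500]
print(len(chunk_text("x" * 2500)))  # 3
```

The overlap is what keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.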
3a. The Vector Store Setup
ChromaDB runs as an embedded local store for development — no separate database process, just a directory on disk. In production you'd use pgvector on Railway, but for a weekend project this is fine.
# rag_pipeline.py
import os

from langchain_community.document_loaders import TextLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import SentenceTransformerEmbeddings

# Choose one:
# embeddings = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"])
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

VECTORSTORE_DIR = "./chroma_data"

def build_vectorstore(docs_path: str = "./docs") -> Chroma:
    """Load docs, chunk, embed, and persist to ChromaDB."""
    loader = DirectoryLoader(docs_path, glob="**/*.txt", loader_cls=TextLoader)
    documents = loader.load()
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", " ", ""]
    )
    chunks = splitter.split_documents(documents)
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=VECTORSTORE_DIR
    )
    vectorstore.persist()  # no-op on Chroma >= 0.4, which auto-persists
    return vectorstore

def load_vectorstore() -> Chroma:
    """Reload an existing index without re-embedding."""
    return Chroma(
        persist_directory=VECTORSTORE_DIR,
        embedding_function=embeddings
    )

3b. The Retrieval Function
Your retrieval function takes a query string and returns the top-k relevant chunks with metadata. This is the function you'll expose as an MCP tool.
from langchain_core.documents import Document

def retrieve_context(query: str, top_k: int = 4) -> list[Document]:
    """Retrieve top-k relevant chunks from the vector store."""
    vectorstore = load_vectorstore()
    results = vectorstore.similarity_search_with_score(query, k=top_k)
    # Chroma returns distance scores, so lower = more similar.
    # Keep anything under 0.85 to drop clearly irrelevant chunks.
    filtered = [doc for doc, score in results if score < 0.85]
    return filtered[:top_k]

def format_context(docs: list[Document]) -> str:
    """Render retrieved docs as a citation block for the LLM."""
    blocks = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "n/a")
        blocks.append(
            f"[{i}] (Source: {source}, Page {page})\n{doc.page_content}"
        )
    return "\n\n".join(blocks)

3c. Seed Some Documents
Create a ./docs directory and drop a few .txt files in there — internal documentation, notes, a README, anything. Then run:
from rag_pipeline import build_vectorstore

build_vectorstore("./docs")

Once it finishes without errors, check that a ./chroma_data directory exists — that's your persisted vector store, and it will survive server restarts.
4. Build the MCP Server
The MCP Python SDK gives you a decorator-based API. You register handlers with decorators like @app.list_tools() and @app.call_tool(), and the SDK handles serialisation, type checking, and the JSON-RPC transport under the hood. (The SDK's higher-level FastMCP class offers a one-decorator @mcp.tool() shortcut; we use the lower-level Server class here so the tool schema is explicit.)
# mcp_server.py
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
from pydantic import BaseModel, Field

from rag_pipeline import retrieve_context, format_context

# The server instance. The name must match the key in your mcp.json.
app = Server("rag-mcp-server")

class QueryRAGInput(BaseModel):
    """Mirrors the JSON schema declared in list_tools(); used to
    validate incoming arguments in call_tool()."""
    query: str = Field(description="Natural language question to answer from your documents.")
    top_k: int = Field(default=4, ge=1, le=20, description="Number of context chunks to retrieve.")
@app.list_tools()
async def list_tools() -> list[Tool]:
    """Declare the tools this server exposes."""
    return [
        Tool(
            name="query_rag",
            description=(
                "Retrieves relevant context from the internal document knowledge base. "
                "Use this when the user's question requires specific information from "
                "internal docs, READMEs, or technical specifications."
            ),
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Natural language question to answer from your documents."
                    },
                    "top_k": {
                        "type": "integer",
                        "default": 4,
                        "description": "Number of context chunks to retrieve (1-20)."
                    }
                },
                "required": ["query"]
            }
        )
    ]
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    """Handle incoming tool calls from MCP clients."""
    if name == "query_rag":
        args = QueryRAGInput(**arguments)  # validates types and the 1-20 range
        docs = retrieve_context(args.query, top_k=args.top_k)
        if not docs:
            return [TextContent(
                type="text",
                text="No relevant documents found for that query. Try rephrasing."
            )]
        context_block = format_context(docs)
        return [TextContent(
            type="text",
            text=(
                f"Retrieved {len(docs)} relevant chunk(s):\n\n"
                f"{context_block}\n\n"
                "Use the context above to answer the user's question. "
                "Always cite the source and page number."
            )
        )]
    raise ValueError(f"Unknown tool: {name}")
async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            app.create_initialization_options()
        )

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

Run python mcp_server.py. If nothing errors out, your MCP server is live on stdio. The server reads JSON-RPC messages from stdin and writes responses to stdout — FastAPI isn't even involved yet. You'll add HTTP in the next section.
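Before wiring up any client, you can exercise the server interactively with the official MCP Inspector (requires Node; package name per the MCP docs):

```shell
# Launches a local web UI that connects to your stdio server,
# lists its tools, and lets you invoke query_rag by hand.
npx @modelcontextprotocol/inspector python mcp_server.py
```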
5. Add FastAPI for HTTP (Optional but Recommended)
stdio mode works great for Claude Desktop. But if you want to expose the same tools over HTTP — for web apps, remote agents, or production deploys — mount the MCP server's SSE transport inside FastAPI. The SDK ships an SSE transport class for exactly this; the module paths below match recent SDK releases, but check the docs for yours.
# server_http.py
from fastapi import FastAPI, Request
from mcp.server.sse import SseServerTransport

from mcp_server import app as mcp_app

app = FastAPI(title="RAG MCP Server")

# SSE transport: clients open a stream on /sse and POST messages to /messages/
sse = SseServerTransport("/messages/")

@app.get("/sse")
async def handle_sse(request: Request):
    async with sse.connect_sse(request.scope, request.receive, request._send) as (read_stream, write_stream):
        await mcp_app.run(read_stream, write_stream, mcp_app.create_initialization_options())

# handle_post_message is an ASGI app; FastAPI can mount it directly
app.mount("/messages/", app=sse.handle_post_message)

@app.get("/health")
def health():
    return {"status": "ok", "server": "rag-mcp"}

# Run: uvicorn server_http:app --reload --port 8000

Now you have both: stdio mode for Claude Desktop (the local dev workflow) and HTTP mode for every other client. Same server, two transports.
6. Register with Claude Desktop
This is the part that makes it click. Open Claude Desktop's config:
# macOS
~/Library/Application Support/Claude/claude_desktop_config.json
# Linux
~/.config/Claude/claude_desktop_config.json
Add your server to the mcpServers dict:
{
  "mcpServers": {
    "rag-mcp": {
      "command": "python",
      "args": ["/absolute/path/to/rag-mcp/mcp_server.py"],
      "env": {
        "OPENAI_API_KEY": "your-key-here"
      }
    }
  }
}

Restart Claude Desktop. Open the menu → Settings → MCP Servers — you should see rag-mcp listed as connected. That single registration gives every Claude conversation access to your query_rag tool.
7. Test It
In Claude Desktop, try asking:
"What does the README say about the authentication setup?"
Claude will call your query_rag tool, retrieve context from your documents, and answer with citations. The first call takes 2-3 seconds — a cold start while the local embedding model loads (or the OpenAI client initialises). Subsequent calls reuse the loaded model.
If Claude says "I don't have access to that information," check the MCP Server status in settings — the server likely didn't start. Run python mcp_server.py manually to see the error output.
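On macOS, Claude Desktop also writes per-server MCP logs, which usually contain the actual traceback (log location per the MCP debugging guide):

```shell
# Tail Claude Desktop's MCP logs, including one file per configured server
tail -f ~/Library/Logs/Claude/mcp*.log
```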
8. Connect Cursor the Same Way
Cursor reads the same config format from ~/.cursor/mcp.json (global) or .cursor/mcp.json inside a project. Point it at the same server:
{
  "mcpServers": {
    "rag-mcp": {
      "command": "python",
      "args": ["/absolute/path/to/rag-mcp/mcp_server.py"],
      "env": {
        "OPENAI_API_KEY": "your-key-here"
      }
    }
  }
}

Cursor will surface query_rag in its agent tool palette. Run a Cursor agent with /rag-query: What's our deployment pipeline? and watch it pull your internal docs mid-conversation.
What You've Got by Sunday Night
A fully functional MCP server with a RAG retrieval tool. Every MCP client — Claude Desktop, Cursor, Windsurf, any agent framework built on the spec — can now query your documents without you writing a single custom integration.
The gap between this weekend project and production is three things:
- Persistence: Move from local ChromaDB to pgvector on Railway. LangChain's PGVector store exposes the same retriever interface, so the swap is a few lines, not a rewrite.
- Auth: Add middleware to the HTTP transport so only your team can call the server.
- Evaluation: Wire in Ragas or DeepEval to track retrieval quality over time.
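For the auth item, the core is just an API-key gate in front of the HTTP transport. A minimal sketch you'd call from FastAPI middleware — the RAG_MCP_API_KEY / X-API-Key names are my own, not part of MCP:

```python
import hmac

def is_authorized(headers: dict[str, str], api_key: str) -> bool:
    """Constant-time comparison of a client-supplied X-API-Key header
    against the server's secret. Returns False when no key is configured,
    so the server fails closed rather than open."""
    supplied = headers.get("x-api-key", "")
    return bool(api_key) and hmac.compare_digest(supplied, api_key)

# In middleware: read RAG_MCP_API_KEY from the environment once at startup,
# then return a 401 whenever is_authorized(request.headers, API_KEY) is False.
```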
If you want all three done by someone who's already built this exact stack — the Supertute Done-For-You RAG Backend ships with pgvector + Redis + auth + Railway deployment in a single weekend build. $500 flat.
Done-For-You RAG Backend
If you'd rather have a production-grade MCP-native RAG backend built for you — with pgvector, Redis caching, JWT auth, and Railway deployment — Supertute ships it in one weekend. $500 flat. No lock-in.
See the Details → /product

Summary
MCP turns your RAG pipeline into a universal AI tool. You write the server once; Claude Desktop, Cursor, and every other MCP-compatible client becomes a consumer. The stack is simple — LangChain for the RAG logic, ChromaDB for the vector store, the MCP Python SDK to expose it as tools, and optional FastAPI for HTTP transport.
The code above is a complete, working implementation. Drop it into a rag-mcp/ directory, seed your docs, and you're querying them from Claude Desktop by dinner.