Using MCP with a local LLM offers cost and confidentiality advantages, and installation and use are very simple.
With MCP, it is now possible to connect an LLM to most of the digital tools on the market in a few minutes. However, using MCP is expensive and can raise confidentiality problems. Using MCP with an LLM running locally addresses both issues. Explanations and demonstration.
Why use MCP with a local LLM?
Using MCP with a local LLM is the most logical way to avoid overspending. MCP requires many round trips between the tools and the LLM, which makes token consumption explode in daily use. With prices ranging from $15 per million output tokens for Claude 4 Sonnet to $75 for Claude 4 Opus, for example, costs can quickly reach several hundred euros per month. Even with a subscription and Claude Desktop, usage limits are reached quickly. By comparison, a local LLM only consumes your machine's compute, i.e. a negligible electricity cost compared to cloud model pricing.
Beyond the savings, a local LLM with MCP also addresses confidentiality issues. The on-device setup avoids sending sensitive data to providers or model publishers, which is crucial for critical use cases. Although it does not fully protect against prompt injection, you keep control of your data. The local MCP + LLM combination can even operate in an air-gapped environment for maximum security requirements.
Hugging Face Tiny Agents with Ollama
Currently, the simplest way to run MCP locally remains the combination of Hugging Face Tiny Agents (a local MCP client) and Ollama (a local inference tool). Ollama lets you download an LLM optimized for on-device inference and serve it on your machine as a localhost API. The Hugging Face MCP client is the CLI interface that handles the exchanges between the MCP servers and the LLM hosted on Ollama. Before moving on to installation, make sure you have the latest stable version of Python on your machine (3.13.3 as of June 2025). Note: use the official Python release, not the version downloaded from the Microsoft Store. To check, open PowerShell by typing PowerShell in the Windows search bar, then type:
python --version
and update it if necessary.
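If an update is needed, one option is to install the official release via winget (winget ships with recent Windows versions and is used again below for Node.js; the Python.Python.3.13 package ID is an assumption to verify, and downloading the installer from python.org works just as well):
winget install -e --id Python.Python.3.13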
Installation
Tiny Agents is integrated into the Hugging Face SDK (huggingface_hub). Installing it is straightforward: open PowerShell and run the following command:
python -m pip install -U "huggingface_hub[mcp]>=0.32.4" --break-system-packages
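To check that the package was installed with the mcp extra, you can display its version and, assuming the install went through, the tiny-agents command should now respond:
python -m pip show huggingface_hub
tiny-agents --help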
Then let’s install Ollama from the official website. Once the installer is downloaded, launch it. Click on “Install” and let it set up.
Once Ollama is installed, you will need to download a model and serve it. The choice of model is crucial: you must pick a model that supports function calling, with a size adjusted to your computer's hardware resources. We will come back to this later. We advise using a model from Alibaba's Qwen3 family: these models all reason before answering (via chain-of-thought) and handle tool calling fairly well (vital for MCP). For a typical Windows PC without a GPU, the 1.7B version is, for example, optimal in terms of resources.
Installing and serving the model with Ollama is very simple: run the following command in the PowerShell prompt:
pip install ollama
Then close your terminal, open a new one (again by typing PowerShell in the Windows search bar) and run:
ollama pull qwen3:1.7b
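If your machine has more resources (a dedicated GPU or plenty of RAM), a larger variant of the same family can be pulled instead, for example (assuming the tag is available in the Ollama library):
ollama pull qwen3:8b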
The Ollama server now runs locally (by default at http://localhost:11434) with the model available.
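To quickly verify that the server and the model are in place, you can list the downloaded models and query the local API (in PowerShell, curl resolves to Invoke-WebRequest or curl.exe, either of which works for this simple GET request):
ollama list
curl http://localhost:11434/api/tags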
Let’s finish by installing Node.js (a JavaScript runtime required to run the MCP servers):
winget install -e --id OpenJS.NodeJS.LTS
The installation is also possible from the official Node.js website (download the executable and install it).
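To confirm that Node.js and npx (which the agent uses to launch the MCP servers) are correctly installed, open a new PowerShell window and run:
node --version
npx --version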
Configuration
Your agent will be launched from a folder on your computer. Create a folder and give it a name. Inside that folder, create two text files: agent.json, which will contain your agent's parameters, and PROMPT.md, which will contain the agent's system prompt (its basic instructions).
Edit agent.json and add the following configuration to connect the agent to your Ollama server with qwen3:1.7b. In this example we also add a first MCP server: Playwright, which allows the LLM to open a browser and navigate the web autonomously.
{
  "model": "qwen3:1.7b",
  "endpointUrl": "http://localhost:11434/v1",
  "servers": [
    {
      "type": "stdio",
      "config": {
        "command": "npx",
        "args": ["@playwright/mcp@latest"]
      }
    }
  ]
}
For the system prompt, you can give your agent its usage context and the rules to follow (in particular for tool calls). To do so, paste the following into PROMPT.md:
You are a tool-using assistant. When invoking tools, you must return valid JSON objects without any comment, newline, or extra data.
Wrap tool arguments as a single-line JSON string, e.g. {"query": "Benjamin Polge"}
Do not add explanations or text outside the JSON object.
Running the agent and adding more MCP servers
That's it for the configuration. To run your agent, simply use the following command in a new PowerShell window:
tiny-agents run PathToYourAgentFolder
taking care to replace the placeholder with the path to your agent folder. Expect a little latency.
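For example, assuming the agent folder was created under Documents and named my-agent (a hypothetical path to adapt to your own setup):
tiny-agents run C:\Users\UserName\Documents\my-agent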
The agent is then launched and you can query it in natural language like a chatbot. The AI will use the tools at its disposal to answer you. Depending on your computer's hardware resources, generating the response can take anywhere from a few seconds to several minutes.
You can, for example, ask the agent to open the browser and find a person's LinkedIn profile. Example prompt: "Who is Mr. XX? Look for the URL of his LinkedIn profile with the browser and display the URL."
The agent then runs and will look for the name using the browser via MCP.
In the agent.json configuration, we only used one MCP server (Playwright). But you can quickly modify the configuration to add a slew of MCP servers: just add the desired MCP server configuration to the file. For example, here is the same configuration with an added MCP server that gives access to local files (server-filesystem):
{
  "model": "qwen3:1.7b",
  "endpointUrl": "http://localhost:11434/v1",
  "servers": [
    {
      "type": "stdio",
      "config": {
        "command": "npx",
        "args": ["@playwright/mcp@latest"]
      }
    },
    {
      "type": "stdio",
      "config": {
        "command": "npx",
        "args": [
          "@modelcontextprotocol/server-filesystem@latest",
          "C:\\Users\\UserName\\Documents"
        ]
      }
    }
  ]
}
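If the agent fails to start, a quick sanity check is to launch an MCP server manually outside the agent. The filesystem server, for instance, takes the authorized directories as arguments and then waits for stdio input (stop it with Ctrl+C):
npx -y @modelcontextprotocol/server-filesystem@latest C:\Users\UserName\Documents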
The limits of using a local LLM for MCP
The performance and reliability of an agent using MCP depend directly on the underlying language model (LLM). Models with fewer than 7 billion parameters have significant limitations: they frequently generate JSON formatting errors that block tool calls, and their contextual understanding remains limited. Unfortunately, the more capable models above 7 billion parameters pose another major challenge: they are extremely demanding in computational resources. On most current personal computers, these models either require very long generation times or simply cannot run due to insufficient hardware resources. However, this problem should resolve itself over time, as open source model publishers keep releasing more efficient and compact models.
For users equipped with Macs running the latest Apple Silicon chips (M1, M2, M4, etc.), the situation is already much more favorable. These chips are optimized for on-device artificial intelligence: they natively include an NPU (a unit dedicated to AI) and use a unified memory architecture (RAM + VRAM), a design that makes it possible to run much larger models with remarkable efficiency. The MCP experience is therefore all the more responsive and fluid.