RESEARCH
Hard to believe! A Large Language Model running locally on a Raspberry Pi
Recent advances in distillation and quantization, paired with sophisticated training techniques, have democratized LLM technology in an incredible way. With about 200 euros of investment we can self-host state-of-the-art models.
It feels like a long time ago, but only two and a half years ago we were all swept away by the arrival of ChatGPT, the first chatbot based on a Large Language Model to be made freely available on a planetary scale.
For the first time in history, the term Artificial Intelligence was unequivocally associated with something artificial that exhibited human-like behaviour.
It was clear from the very start that behind this new wonder lay enormous investments and that running this extraordinarily powerful software required hangar-sized facilities to host the servers.
Time has gone by quickly and generative AI has become part of our habits. Products like ChatGPT, Claude and Gemini are today our companions for work, study and leisure, just as social networks, search engines and e-commerce sites are.
In February 2023 Meta, formerly Facebook — after the substantial flop of the Metaverse, an anticipation of a future that never happened and that nobody talks about any more — launched its own LLM, called Llama, shaking up the market scenario by choosing to make the model open source.
Less than two years ago, we fantasized about future scenarios in which open-source models like Llama would be commonly installed on our clients’ on-premise environments or on private clouds, without having to send data to third-party providers like OpenAI, Google or Anthropic.
Truth is, Meta had just launched another major revolution that would, in a very short time, turn the whole scenario upside down — in a market context that was by then mature, and without the noise of media spotlight.
As we write this article, self-hosting an LLM is not only possible, it no longer even requires a demanding investment.
As we’ll show shortly, the virtuous convergence of a series of technological trends has made it possible to develop agentic applications — i.e. applications whose behaviour is governed by AI — using conventional infrastructure, lightly enhanced by modern low-power GPUs.
If our area of interest is limited to research and development, the necessary hardware can be genuinely cheap.
How cheap?
If we really want to push it, we can try to challenge the capabilities of the most recent LLMs, which exploit advanced and sophisticated techniques of distillation, quantization and reasoning to stay truly lightweight.
Let’s start with the investment, then.
Here’s our shopping list, complete with the receipt of the order we placed on the website of our friends at Melopero, for a total of €222.19 including VAT.
If you want to replicate our experiment, here’s what you’ll need:
- a Raspberry Pi 5 board, ideally the new 16 GB RAM model, because LLMs live in RAM and they’re not called LARGE for nothing;
- a fast and capacious mass-storage unit, like a 512 GB SSD with the relevant connection kit;
- the official Raspberry Pi 5 cooler, because LLMs are in the habit of doing additions and subtractions at full tilt, heating the CPU heavily;
- a good USB-C power supply, like the official 45 W one, because if you’re going to experiment with AI then the suspicion that you aren’t a Greta fan is legitimate;
- an HDMI monitor, which you’ll certainly already have;
- a PC keyboard, which you’ll certainly already have;
- a PC mouse, which you’ll certainly already have.
Once the hardware arrives, we can start assembling the kit.
The first thing to install is the cooler. The procedure is fairly intuitive even without resorting to the official instructions on the Raspberry Pi site. The only tricky part is connecting the power and control cable.
Once the cooler is mounted, we move on to assembling the Pi SSD kit, fitting the GPIO pin extender and the spacers.
In this case the official instructions on the Raspberry Pi site are recommended, partly because they explain how to handle the delicate PCI connector.
The Raspberry Pi 5 can install its operating system over the network, so there’s no need for an SD card pre-loaded with the image, as we used to do back in the Raspberry Pi 1 days. It’s enough to connect the board to a LAN port via an Ethernet cable, plug in the monitor, mouse and keyboard, and then power the board up through the USB-C plug.
The board’s firmware will detect the absence of a boot device and download the official Raspberry Pi Imager, which in turn downloads the OS image and writes it to the SSD, automatically creating the boot partition.
Once the Imager has been downloaded and launched, it’s a good idea to configure the operating system right away by entering the time zone, the keyboard layout and the Wi-Fi credentials, so as to spare ourselves further configuration after the first boot.
In our case we also enabled the SSH protocol so we could connect from a PC, and set the credentials to the classic defaults, user pi with password raspberry: given that we’ll be using the board as a toy, the probability of forgetting the password worried us more than the threat of unlikely attacks from Russian hackers.
Once the OS installation is complete — we’ll obviously have chosen the official 64-bit version — we can log in (from the desktop or from a remote terminal) and proceed with the installation of Ollama.
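If, like us, you enabled SSH, remote login from a PC on the same LAN is the classic one (replace raspberry-ip with your board’s actual address):

ssh pi@raspberry-ip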
What is Ollama?
Ollama is open-source software that lets you run openly licensed LLMs locally.
Indeed, even though models can be downloaded from repositories like Hugging Face, the road from the download to actually invoking their inference functions, perhaps via a REST API, is a long one.
In theory we’d have to write a fair amount of software to configure the structure of the neural network, initialize it with the parameters of the downloaded model, and orchestrate the inference functionality.
Ollama already does all of this: once installed, we can download LLMs from its official repository and run them with extreme ease, with ready-made OpenAI-style APIs exposed out of the box.
To install this little marvel, all you need to do is open a terminal and type:
curl -fsSL https://ollama.com/install.sh | sh
The installation procedure will create a systemd service that will start the server every time the Raspberry Pi reboots.
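Since it’s an ordinary systemd service (the installer names it ollama), the usual commands let you check on it or restart it:

sudo systemctl status ollama
sudo systemctl restart ollama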
Ollama is driven from the command line via the ollama client.
The command ollama run model-name runs a model locally; if the model isn’t already present, it’s downloaded automatically before being run.
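For instance, to chat interactively with the small Qwen3 model we’ll come back to later (the tag comes from the Ollama catalog and may evolve over time):

ollama run qwen3:1.7b

and ollama list shows which models are already on disk.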
On the Ollama website, of course, all the instructions are there — and most importantly there’s the catalog page of ready-made models.
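By default the Ollama server also listens on port 11434, so once a model has been pulled we can exercise the OpenAI-style API from any terminal. A minimal smoke test with curl, assuming the qwen3:1.7b model from before:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3:1.7b", "messages": [{"role": "user", "content": "Why is the sky blue?"}]}'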
At this point we can start experimenting.
Our suggestion is to try the most recent reasoning-type models, with no more than 8B (8 billion) parameters and with 4-bit or 8-bit quantization.
The surprise effect is guaranteed!
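To give a concrete idea, at the time of writing the Ollama catalog offers reasoning models that fit this description, for example (the default tags are typically 4-bit quantized builds):

ollama run deepseek-r1:8b
ollama run qwen3:8b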
Working at the command line is rewarding in itself, but by now we’re all used to web chatbot interfaces like ChatGPT.
Fortunately, in the open-source LLM world there’s an open-source version of this famous interface too — so we continued our experiment by installing Open WebUI.
To do so you’ll need to create and activate a Python virtual environment with commands such as:
python3 -m venv openwebui-env
source openwebui-env/bin/activate
and then install it with the command:
pip install open-webui
To launch the Open WebUI server, the command is:
open-webui serve
By default the server will be accessible at http://raspberry-ip:8080 from the browser of any PC connected to the same LAN as the Raspberry board.
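Unlike Ollama’s installer, pip registers no service, so if you want Open WebUI to come back up after a reboot you can write a small systemd unit yourself. A minimal sketch, assuming the virtual environment was created in /home/pi/openwebui-env (adjust user and paths to taste), saved as /etc/systemd/system/open-webui.service:

[Unit]
Description=Open WebUI
After=network.target ollama.service

[Service]
User=pi
ExecStart=/home/pi/openwebui-env/bin/open-webui serve
Restart=on-failure

[Install]
WantedBy=multi-user.target

followed by sudo systemctl enable --now open-webui to start it.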
At this point the experience will be ChatGPT-style: from the on-screen menu you can pick the model out of a list of all the models Ollama has downloaded locally.
In the screenshot we showed above, you can see a snippet of one of our conversations with the very recent Qwen3 model released by Alibaba, in the 1.7-billion-parameter version, quantized to 4 bits.
Despite its small — so to speak — size, it brilliantly solves problems of logic, geometry and algebra, with performance approaching that of the more famous large state-of-the-art models.
At this point an interesting discussion would open up on the opportunities offered by this new technological scenario, but the topic is broad and we’ll leave it for a follow-up article.
For now, let’s just observe the following:
- we spent €222.19, which net of VAT amounts to an “investment” of about €182;
- the official Raspberry Pi 5 cooler runs at full speed and, standing close to the board, you can feel warm air coming out; but ultimately we’re drawing no more than 45 W, which makes us a little less guilty in front of Greta;
- we can run inference with genuinely usable results, smoothly, at speeds of several dozen tokens per second (measurable firsthand, as shown right after this list);
- we aren’t sending our precious and highly confidential prompts anywhere — we could even unplug the Internet connection to convince ourselves, should it ever be necessary, that everything happens in the privacy of our own LAN.
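For the curious: the token rate can be measured directly, since ollama run prints its timing statistics at the end of each answer, including the eval rate in tokens per second, when launched with the --verbose flag:

ollama run qwen3:1.7b --verbose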
Welcome to the wonderful world of self-hosting.