How to Run an LLM Locally to Interact with Your Documents

Zoe Isabel Senón — Sat, 10 Jan 2026 00:38:09 +0000

Most AI tools require you to send your prompts and files to third-party servers. That’s a non-starter if your data includes private journals, research notes, or sensitive business documents (contracts, board decks, HR files, financials). The good news: you can run capable LLMs locally (on a laptop or your own server) and query your documents without sending a single byte to the cloud.

In this tutorial, you’ll learn how to run an LLM locally and privately, so you can search and chat with sensitive journals and business docs on your own machine. We’ll install Ollama and OpenWebUI, pick a model that fits your hardware, enable private document search with nomic-embed-text, and create a local knowledge base so everything stays on-disk.

Prerequisites
Installation
Settings for Documents
How to Upload Your Documents
- (Optional) Adding a system prompt
How to Run Your LLM Locally
Conclusion

Prerequisites

You’ll need a terminal (all systems—Windows, Mac, Linux—include one, and you can find yours with a quick search), and either Python and pip or Docker, depending on your preferred installation method for OpenWebUI.

Installation

You’ll need Ollama and OpenWebUI. Ollama runs the models, while OpenWebUI gives you a browser interface to interact with your local LLM, like you would with ChatGPT.

Step 1: Install Ollama

Download and install Ollama from its official site. Installers are available for macOS, Linux, and Windows. Once installed, verify it’s running by opening a terminal and executing:

ollama list

If Ollama is running, this will return a list of active models (or an empty list).

Step 2: Install OpenWebUI

You can install OpenWebUI either with Python (pip) or with Docker. Here, we will show how to do it with pip, but you can find instructions for Docker on the official openwebui docs.

Install OpenWebUI with the following command:

pip install open-webui

This works on macOS, Linux, and Windows, as long as you have Python ≥ 3.9 installed.

Next, start the server:

open-webui serve

Then open your browser and go to:

http://localhost:8080

Step 3: Install a Model

Choose a model from the Ollama model list and pull it locally by copying the command provided.

For example:

ollama pull gemma3:4b

If you’re unsure which model your machine can handle, ask an AI to recommend one based on your hardware. Smaller models (1B–4B) are safer on laptops.

I would recommend Gemma3 as a starter (you can download multiple models and easily switch between them). Pick the parameter number at the end (“:4b”, “:1b”, and so on) based on this guide:

Tier 1 (small laptops or weak computers): RAM ≤8 GB or no GPU → 1B–2B.
Tier 2: RAM 16 GB, weak GPU → 2B–4B.
Tier 3: RAM ≥16 GB, 6–8 GB VRAM → 4B–9B.
Tier 4: RAM ≥32 GB, 12 GB+ VRAM → 12B+.

Once you have installed Ollama and your desired model, confirm that they are active by running ollama list in the terminal:

Run WebOpenUI to launch the browser interface with:

open-webui serve

Then head over to http://localhost:8080/. Now you are ready to start using your LLM locally!

Note: it will ask you for login credentials, but these don’t really matter if you only intend to use it locally.

Settings for Documents

Now we are going to set up everything we need to interact with our local documents. First of all, we need to install the “nomic-embed-text” model to process our documents. Install it with:

ollama pull nomic-embed-text

Note: If you are wondering why we need another model (nomic-embed-text) besides our main one:

The embedding model (nomic-embed-text) maps each text chunk from your documents to a numerical vector so OpenWebUI can quickly find semantically similar chunks when you ask a question.
The chat model (for example gemma3:1b) receives your question plus those retrieved chunks as context and generates the natural-language response.

Next, you should enable the “memory” feature if you want the LLM to remember the context of your past conversations in your future ones.

Download the adaptive memory function here. Functions are like plug-ins.

Now we will update our settings to enable these features. Click on your name in the bottom-left corner, then “Settings”.

Click on the first one, then go to “Personalization” and enable “Memory”.

Now we are going to access the other settings panel (“Admin Panel”). Click again on your name in the bottom-left corner and go to Admin panel → Settings → Documents.

In this section (Admin Panel → Settings → Documents), find the “Embedding” section, go to “Embedding Model Engine” and choose Ollama (find the selectable to the right). Leave the API Key blank.

Now, under “Embedding Model” write nomic-embed-text. Then go to “Retrieval” → enable “Full Context Mode”.

Chunking settings

You should also set the chunk size and overlap. OpenWebUI splits documents into smaller chunks before indexing them, since models can’t embed or retrieve very long texts in one piece.

A good default is 128–512 tokens per chunk, with 10–20% overlap. Larger chunks preserve more context but are slower and more memory-intensive, while smaller chunks are faster but can lose higher-level meaning. Overlap helps prevent important context from being cut off when text is split.

Here’s a guiding table, but I recommend obtaining the recommended values for your specific use case and setup by sharing them (including GPU or laptop model, storage, RAM, and so on) with an LLM like ChatGPT or Claude, as changing the chunking/overlap values later on requires reuploading the documents.

Suggested chunk/overlap by tier

Tier / scenario	Typical hardware	Chunk size (tokens)	Overlap (%)	Notes
Tier 1 – constrained	≤8 GB RAM, no/weak GPU	128–256	10–15	Prioritizes speed and low memory use.
Tier 2 – mid	16 GB RAM, modest GPU or strong CPU	256–384	15–20	Balanced context vs. performance.
Tier 3 – comfortable	≥16 GB RAM, 6–8 GB VRAM	384–512	15–20	More semantics per chunk, still practical.
Dense technical PDFs / legal docs	Any, but especially Tier 2–3	384–512	15–20	Keeps paragraphs and arguments intact.
Short notes, tickets, emails	Any	128–256	10–15	Items are small, large chunks not needed.
Very long queries, need many retrieved chunks	Any with larger context window	256–384	10–15	Smaller chunks fit more pieces into context.

How to Upload Your Documents

Now, the final step: uploading your documents! Go to “Workspace” in the side panel, then “Knowledge”, and create a new collection (database). You can start uploading files here.

⚠

Make sure to check for any errors during the upload. Unfortunately, they only show as temporary pop-ups. Some errors might be due to the format of your files, so make sure to check the console for further error logs.

Then, within “Workspace”, switch to the “Models” tab and create a new custom model. Creating a custom model and attaching your knowledge base tells OpenWebUI to automatically search your document collection and include the most relevant chunks as context whenever you ask a question.

Here, make sure to select your model (in my case “gemma3:1b”) and attach your knowledge base.

(Optional) Adding a system prompt

When creating your custom model in Workspace → Models, you can define a system prompt that the model will use for context throughout all your conversations.

Here are some examples of information you might want to add:

context about yourself (“I am a 20-year-old student in bioengineering interested in…”)
your preferred communication style (“no fluff", “be direct”, “be analytical”…)
context about how your data is structured

Example system prompt:

You are a thoughtful, analytical assistant helping me explore patterns and insights in my personal journals. Be direct, avoid speculation, and clearly distinguish between facts from the documents and interpretation.

This prompt will automatically apply to every chat using this custom model, helping keep responses consistent and aligned with your goals.

How to Run Your LLM Locally

Now open a new chat and make sure to select your custom model:

Now you are ready to chat with your own docs in a private local environment!

⚠

Note: By default, the frontend/browser will stop streaming the response after five minutes, even though it will keep processing your query in the background. This means that if your query takes more than five minutes to process, it will not be displayed on the browser. You can reload the page and click “continue response” to get the latest output.

💡

I recommend installing the Enhanced Context Tracker function (plugin) to get more visibility into the progress of your query.

Conclusion

You now have a private LLM stack (Ollama for models, OpenWebUI for the UI, and nomic-embed-text for embeddings) wired to your on-disk knowledge base. Your journals and business docs stay local; nothing is sent to third parties. The main dials are simple: pick a model that fits your hardware, enable memory and full-context retrieval, use sensible chunk/overlap, and check the console when runs stall.

If you need more headroom, deploy the same setup on your own server and keep the privacy guarantees. From here, iterate on model choice, chunking, and prompts, and add the optional functions if you need deeper visibility during long jobs.

Zoe Isabel Senón - freeCodeCamp.org