Most AI tools require you to send your prompts and files to third-party servers. That’s a non-starter if your data includes private journals, research notes, or sensitive business documents (contracts, board decks, HR files, financials). The good news: you can run capable LLMs locally (on a laptop or your own server) and query your documents without sending a single byte to the cloud.
In this tutorial, you’ll learn how to run an LLM locally and privately, so you can search and chat with sensitive journals and business docs on your own machine. We’ll install Ollama and OpenWebUI, pick a model that fits your hardware, enable private document search with nomic-embed-text, and create a local knowledge base so everything stays on-disk.
Prerequisites
You’ll need a terminal (Windows, macOS, and Linux all include one) and either Python with pip or Docker, depending on which installation method you prefer for OpenWebUI.
Installation
You’ll need Ollama and OpenWebUI. Ollama runs the models, while OpenWebUI gives you a browser interface to interact with your local LLM, like you would with ChatGPT.
Step 1: Install Ollama
Download and install Ollama from its official site. Installers are available for macOS, Linux, and Windows. Once installed, verify it’s running by opening a terminal and executing:
ollama list
If Ollama is running, this will print the models you have installed (the list will be empty for now).
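If you’d rather verify this programmatically, here is a minimal sketch that does the same thing as ollama list by calling Ollama’s local REST API (it assumes Ollama’s default port, 11434, and that the requests package is installed via pip install requests):

```python
# Quick programmatic check that the local Ollama server is reachable
# (assumes Ollama's default port, 11434).
import requests

resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
models = resp.json().get("models", [])
print(f"Ollama is up; {len(models)} model(s) installed")
for m in models:
    print(" -", m["name"])
```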
Step 2: Install OpenWebUI
You can install OpenWebUI either with Python (pip) or with Docker. Here we’ll use pip; Docker instructions are available in the official OpenWebUI docs.
Install OpenWebUI with the following command:
pip install open-webui
This works on macOS, Linux, and Windows, as long as you have Python ≥ 3.9 installed.
Next, start the server:
open-webui serve
Then open your browser and go to:
http://localhost:8080
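If you want to confirm the server is up before switching to the browser, a minimal check (assuming the default port 8080) looks like this:

```python
# Optional check that OpenWebUI is serving on the default port (8080)
# before you open it in the browser.
import requests

resp = requests.get("http://localhost:8080", timeout=5)
print("OpenWebUI reachable, HTTP status:", resp.status_code)
```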
Step 3: Install a Model
Choose a model from the Ollama model list and pull it locally by copying the command provided.

For example:
ollama pull gemma3:4b
If you’re unsure which model your machine can handle, ask an AI to recommend one based on your hardware. Smaller models (1B–4B) are safer on laptops.
I would recommend Gemma 3 as a starter (you can download multiple models and switch between them easily). Pick the parameter count at the end (“:4b”, “:1b”, and so on) based on this guide:
Tier 1 (small laptops or weak computers): RAM ≤8 GB or no GPU → 1B–2B.
Tier 2: RAM 16 GB, weak GPU → 2B–4B.
Tier 3: RAM ≥16 GB, 6–8 GB VRAM → 4B–9B.
Tier 4: RAM ≥32 GB, 12 GB+ VRAM → 12B+.
Once you have installed Ollama and your desired model, confirm the model appears by running ollama list in the terminal:

Launch the OpenWebUI browser interface with:
open-webui serve
Then head over to http://localhost:8080/. Now you are ready to start using your LLM locally!
Note: it will ask you to create login credentials. These stay on your machine and don’t matter much if you’re the only local user (the first account you create becomes the admin account).
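Before moving on, you can optionally confirm that the model itself responds by querying Ollama’s local API directly. A minimal sketch, assuming you pulled gemma3:4b (swap in your own tag):

```python
# Minimal end-to-end test of the pulled model via Ollama's local API.
# Assumes gemma3:4b was pulled; swap in whatever tag you installed.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:4b",
        "prompt": "In one short sentence, confirm you are running locally.",
        "stream": False,  # return the whole answer as a single JSON object
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```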

Settings for Documents
Now we are going to set up everything we need to interact with our local documents. First of all, we need to install the “nomic-embed-text” model to process our documents. Install it with:
ollama pull nomic-embed-text
Note: If you are wondering why we need another model (nomic-embed-text) besides our main one:
The embedding model (nomic-embed-text) maps each text chunk from your documents to a numerical vector so OpenWebUI can quickly find semantically similar chunks when you ask a question.
The chat model (for example gemma3:1b) receives your question plus those retrieved chunks as context and generates the natural-language response.
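To make that division of labour concrete, here is a small illustrative sketch of the embedding step on its own, using Ollama’s local /api/embeddings endpoint (the example sentences are made up): the question’s vector lands much closer to the related chunk than to an unrelated note, which is exactly how OpenWebUI decides what to retrieve.

```python
# Illustration of what the embedding model does: text goes in, a numerical
# vector comes out, and semantically related texts end up with similar vectors.
# Uses Ollama's local /api/embeddings endpoint; assumes nomic-embed-text is
# pulled. The example sentences are invented.
import requests

def embed(text):
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

chunk = embed("Q3 revenue grew 12% year over year, driven by enterprise contracts.")
question = embed("How did revenue change last quarter?")
unrelated = embed("The hike up the ridge took about four hours in light rain.")

print("question vs. revenue chunk:  ", round(cosine(question, chunk), 3))      # higher
print("question vs. unrelated note: ", round(cosine(question, unrelated), 3))  # lower
```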
Next, you should enable the “memory” feature if you want the LLM to remember the context of your past conversations in your future ones.
Download the adaptive memory function here. Functions are like plug-ins.

Now we will update our settings to enable these features. Click on your name in the bottom-left corner, then “Settings”.

In the Settings panel, go to “Personalization” and enable “Memory”.

Now we are going to access the other settings panel (“Admin Panel”). Click again on your name in the bottom-left corner and go to Admin Panel → Settings → Documents.

In this section (Admin Panel → Settings → Documents), find “Embedding”, set “Embedding Model Engine” to Ollama using the selector on the right, and leave the API Key blank.
Then, under “Embedding Model”, enter nomic-embed-text. Finally, go to “Retrieval” and enable “Full Context Mode”.
Chunking settings
You should also set the chunk size and overlap. OpenWebUI splits documents into smaller chunks before indexing them, since models can’t embed or retrieve very long texts in one piece.
A good default is 128–512 tokens per chunk, with 10–20% overlap. Larger chunks preserve more context but are slower and more memory-intensive, while smaller chunks are faster but can lose higher-level meaning. Overlap helps prevent important context from being cut off when text is split.
Here’s a guiding table, but I recommend asking an LLM like ChatGPT or Claude for values tailored to your specific use case and setup (share your GPU or laptop model, storage, RAM, and so on), since changing the chunking/overlap values later requires re-uploading your documents.
Suggested chunk/overlap by tier
| Tier / scenario | Typical hardware | Chunk size (tokens) | Overlap (%) | Notes |
| --- | --- | --- | --- | --- |
| Tier 1 – constrained | ≤8 GB RAM, no/weak GPU | 128–256 | 10–15 | Prioritizes speed and low memory use. |
| Tier 2 – mid | 16 GB RAM, modest GPU or strong CPU | 256–384 | 15–20 | Balanced context vs. performance. |
| Tier 3 – comfortable | ≥16 GB RAM, 6–8 GB VRAM | 384–512 | 15–20 | More semantics per chunk, still practical. |
| Dense technical PDFs / legal docs | Any, but especially Tier 2–3 | 384–512 | 15–20 | Keeps paragraphs and arguments intact. |
| Short notes, tickets, emails | Any | 128–256 | 10–15 | Items are small, large chunks not needed. |
| Very long queries, need many retrieved chunks | Any with larger context window | 256–384 | 10–15 | Smaller chunks fit more pieces into context. |
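If you want a feel for what these two dials actually do, here is a minimal word-based sketch of a splitter. OpenWebUI handles chunking for you at upload time and counts tokens rather than words, so treat this purely as an illustration; journal.txt is a placeholder for any local text file.

```python
# Minimal, word-based sketch of chunking with overlap. OpenWebUI does this
# for you at upload time (and its splitter counts tokens, not words), so this
# only illustrates the chunk size / overlap dials. "journal.txt" is a placeholder.
def chunk_text(text, chunk_size=256, overlap_pct=0.15):
    words = text.split()
    step = max(1, int(chunk_size * (1 - overlap_pct)))  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):  # last window already reaches the end
            break
    return chunks

text = open("journal.txt", encoding="utf-8").read()
chunks = chunk_text(text, chunk_size=256, overlap_pct=0.15)
print(f"{len(chunks)} chunks; neighbouring chunks share roughly 15% of their words")
```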
How to Upload Your Documents
Now, the final step: uploading your documents! Go to “Workspace” in the side panel, then “Knowledge”, and create a new collection (database). You can start uploading files here.

Then, within “Workspace”, switch to the “Models” tab and create a new custom model. Creating a custom model and attaching your knowledge base tells OpenWebUI to automatically search your document collection and include the most relevant chunks as context whenever you ask a question.

Here, make sure to select your model (in my case “gemma3:1b”) and attach your knowledge base.
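Under the hood, this is a retrieval-augmented generation loop. The sketch below is not OpenWebUI’s actual code, just a minimal illustration of the idea against Ollama’s local API (the chunks and question are invented, and it assumes nomic-embed-text and gemma3:1b are pulled):

```python
# Conceptual sketch of what happens when you chat with a custom model that has
# a knowledge base attached: embed the question, retrieve the most similar
# stored chunk, and pass it to the chat model as context. Not OpenWebUI's
# actual code; chunks and question are made up.
import requests

OLLAMA = "http://localhost:11434"

def embed(text):
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text}, timeout=60)
    return r.json()["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

chunks = [
    "2024-03-02: Signed the office lease; rent is 2,400/month for two years.",
    "2024-03-09: Long run along the river; finally slept well again.",
]
index = [(c, embed(c)) for c in chunks]  # embed every chunk once, up front

question = "How long is the office lease?"
q_vec = embed(question)
best_chunk = max(index, key=lambda item: cosine(q_vec, item[1]))[0]  # retrieval step

resp = requests.post(f"{OLLAMA}/api/chat", json={
    "model": "gemma3:1b",
    "stream": False,
    "messages": [
        {"role": "system", "content": f"Answer using only this context:\n{best_chunk}"},
        {"role": "user", "content": question},
    ],
}, timeout=300)
print(resp.json()["message"]["content"])
```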


(Optional) Adding a system prompt
When creating your custom model in Workspace → Models, you can define a system prompt that the model will use for context throughout all your conversations.
Here are some examples of information you might want to add:
context about yourself (“I am a 20-year-old student in bioengineering interested in…”)
your preferred communication style (“no fluff”, “be direct”, “be analytical”…)
context about how your data is structured
Example system prompt:
You are a thoughtful, analytical assistant helping me explore patterns and insights in my personal journals. Be direct, avoid speculation, and clearly distinguish between facts from the documents and interpretation.
This prompt will automatically apply to every chat using this custom model, helping keep responses consistent and aligned with your goals.
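For the curious: the system prompt is simply sent as a “system” message ahead of your question on every turn. OpenWebUI does this for you once you set it on the custom model; the sketch below only illustrates the mechanism via Ollama’s local API (assuming gemma3:1b is pulled):

```python
# How a system prompt works under the hood: it is sent as a "system" message
# ahead of your question on every turn. OpenWebUI does this automatically for
# a custom model; this sketch only illustrates the mechanism.
import requests

system_prompt = (
    "You are a thoughtful, analytical assistant helping me explore patterns "
    "and insights in my personal journals. Be direct, avoid speculation, and "
    "clearly distinguish between facts from the documents and interpretation."
)

resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "gemma3:1b",
    "stream": False,
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Summarize the main themes from this week's entries."},
    ],
}, timeout=300)
print(resp.json()["message"]["content"])
```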
How to Run Your LLM Locally
Now open a new chat and make sure to select your custom model:

Now you are ready to chat with your own docs in a private local environment!
Conclusion
You now have a private LLM stack (Ollama for models, OpenWebUI for the UI, and nomic-embed-text for embeddings) wired to your on-disk knowledge base. Your journals and business docs stay local; nothing is sent to third parties. The main dials are simple: pick a model that fits your hardware, enable memory and full-context retrieval, use sensible chunk/overlap settings, and check the terminal running open-webui serve if something stalls.
If you need more headroom, deploy the same setup on your own server and keep the privacy guarantees. From here, iterate on model choice, chunking, and system prompts, and add optional functions (like adaptive memory) as your needs grow.