This guide walks you through every command and concept needed to build, from scratch and in plain English, a fully offline AI assistant using the Mistral 7B model: one that is production-ready, runs in a sandbox, and is tailored entirely to your company’s internal knowledge.
You’ll:
- install Python,
- create an isolated environment,
- set up the Ollama runtime,
- download the Mistral model under its Apache 2.0 license,
- ingest your documents into text chunks,
- generate embeddings with a sentence-transformer,
- store them in a vector database (FAISS, Chroma, or Qdrant),
- wire everything together in a Retrieval‑Augmented Generation (RAG) pipeline using LangChain,
- build a simple Streamlit web interface,
- containerize the whole app with Docker, and
- deploy it securely behind your firewall.
1. Install Python
First, you need the Python programming language on your computer:
- Download Python: go to the official downloads page at python.org and grab the installer for your operating system (Windows, macOS, or Linux).
- Run the installer:
  - Windows/macOS: launch the downloaded installer and follow the prompts.
  - Linux: install Python from your distribution’s package manager or compile it from source.
- Verify the installation: open a terminal (Command Prompt on Windows, Terminal on macOS/Linux) and check the Python version (see the command below). You should see something like `Python 3.13.3`, the current stable release.
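A quick way to run that check from the terminal (the `python3` name applies to macOS/Linux; on Windows the command is usually `python` or `py`):

```bash
# Print the installed Python version; expect output like "Python 3.13.3"
python3 --version
```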
2. Create & Activate a Virtual Environment
Keeping dependencies isolated prevents conflicts with other software:
- Create a project folder.
- Make a virtual environment using Python’s built-in venv module. This creates a new folder `venv/` containing its own Python interpreter and libraries.
- Activate the environment (the command differs between Windows and macOS/Linux; see the examples below). After activation, your prompt shows `(venv)` to indicate that you’re working inside this isolated environment.
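A minimal sketch of those steps, assuming a project folder named `offline-assistant` (pick any name you like):

```bash
# Create and enter the project folder
mkdir offline-assistant && cd offline-assistant

# Create the virtual environment in ./venv
python3 -m venv venv

# Activate it
source venv/bin/activate        # macOS/Linux
# venv\Scripts\activate         # Windows
```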
3. Install & Run Ollama (Sandboxed LLM Runtime)
Ollama provides a local CLI to host and interact with open‑source models offline:
- Install Ollama by running their installer script. The script detects your operating system and architecture, then installs the correct Ollama binary.
- Start the Ollama service (it runs in the background). This launches a local server that can load and run models without internet access.
- Check your installation. You should see the Ollama version printed, confirming it’s ready. (Example commands below.)
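On Linux or macOS, those three steps typically look like this (Windows users install via the desktop installer from ollama.com instead):

```bash
# Download and run Ollama's official install script
curl -fsSL https://ollama.com/install.sh | sh

# Start the local Ollama server (leave this running, or let the installed service handle it)
ollama serve

# Confirm the install by printing the version
ollama --version
```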
4. Pull & Test the Mistral Model
Mistral 7B is an Apache 2.0‑licensed model—no restrictions on military or commercial use:
- Download Mistral 7B via Ollama. Mistral 7B is a 7.3 billion-parameter model released under Apache 2.0, freely usable without restrictions.
- Run a quick test. You should see the model generate a completion for your prompt, confirming it works locally. (Commands below.)
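For example (the prompt text is just a placeholder; the first pull downloads several gigabytes, so do it while you still have internet access):

```bash
# Fetch the Mistral 7B weights into Ollama's local model store
ollama pull mistral

# Run a one-off prompt against the local model
ollama run mistral "Explain retrieval-augmented generation in two sentences."
```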
5. Ingest Documents & Generate Embeddings
To teach the AI your private data, you’ll convert documents into searchable vectors:
- Install the required Python libraries (sentence-transformers for embeddings, plus whatever loaders you need for PDFs or other formats).
- Load the embedding model in a Python script. The `all-MiniLM-L6-v2` model maps text to 384-dimensional vectors for semantic search.
- Chunk your documents (e.g., split PDFs or text files into roughly 500-token pieces) and encode each chunk. This produces one vector per chunk, ready for indexing. (See the sketch below.)
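A minimal sketch (install with `pip install sentence-transformers`; the chunk strings here are placeholders for the pieces produced by your own document splitter):

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 maps each text to a 384-dimensional vector
model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder chunks; in practice these come from splitting your PDFs/text files
chunks = [
    "Remote-work requests must be approved by your line manager.",
    "The VPN client is available from the internal software portal.",
]

# One vector per chunk, shape (len(chunks), 384)
embeddings = model.encode(chunks, show_progress_bar=True)
print(embeddings.shape)
```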
6. Set Up a Local Vector Database
Store your embeddings so you can quickly find relevant text at query time:
- FAISS (Facebook AI Similarity Search): install the CPU-only package. FAISS can handle up to billions of vectors efficiently on a single machine.
- Chroma (Apache 2.0 licensed): a lightweight embedding database with a simple Python client. Chroma makes it easy to spin up an embedding store in minutes.
- Qdrant (Rust-based, Docker-friendly): pull and run the Docker container. Qdrant offers filtering and payload storage alongside vector search.
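Typical install commands for each option (pick one; the next step uses FAISS in its example):

```bash
# FAISS, CPU-only build
pip install faiss-cpu

# Chroma
pip install chromadb

# Qdrant, served locally via Docker on port 6333
docker pull qdrant/qdrant
docker run -d -p 6333:6333 qdrant/qdrant
```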
7. Build the Retrieval‑Augmented Generation (RAG) Pipeline
Combine embeddings search with the Mistral model to answer queries:
- Install LangChain. LangChain provides abstractions for embeddings, vector stores, and LLM chaining.
- Wire it together (example with FAISS below). The function retrieves your top-k chunks, feeds them to Mistral as context, and returns its answer.
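A sketch of that wiring. LangChain’s import paths shift between releases; this assumes the split packages (`pip install langchain langchain-community langchain-huggingface langchain-ollama faiss-cpu`) and reuses placeholder chunks like those from step 5:

```python
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama import OllamaLLM

# Embedding model (same as step 5) and the local Mistral model served by Ollama
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
llm = OllamaLLM(model="mistral")

# Build the FAISS index from your document chunks (placeholder strings here)
chunks = [
    "Remote-work requests must be approved by your line manager.",
    "The VPN client is available from the internal software portal.",
]
vectorstore = FAISS.from_texts(chunks, embeddings)

def answer(question: str, k: int = 4) -> str:
    # Retrieve the k most relevant chunks and hand them to Mistral as context
    docs = vectorstore.similarity_search(question, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm.invoke(prompt)

print(answer("Who approves remote-work requests?"))
```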
8. Create a Simple Streamlit Web Interface
Let non-technical users ask questions through a browser page:
- Install Streamlit. Streamlit turns Python scripts into interactive web apps with minimal effort.
- Write `app.py` (a sketch follows below).
- Launch the app. Your browser will open at http://localhost:8501, showing the chat interface.
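A minimal `app.py` sketch. It assumes the `answer()` function from step 7 lives in a module named `rag_pipeline.py`; adjust the import to match your project layout:

```python
import streamlit as st

from rag_pipeline import answer  # hypothetical module wrapping the RAG pipeline from step 7

st.title("Internal Knowledge Assistant")

question = st.text_input("Ask a question about our internal documents:")
if question:
    with st.spinner("Searching documents and generating an answer..."):
        st.write(answer(question))
```

Install and launch with `pip install streamlit` followed by `streamlit run app.py`.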
9. Containerize & Deploy with Docker
Package your entire setup so it runs reliably anywhere:
- Install Docker on Ubuntu (example commands below).
- Enable non-root Docker usage. After you log out and back in (or reboot), you can run Docker commands without `sudo`.
- Create a `Dockerfile` in your project.
- Build and run the image. Your app is now reachable at http://<server-ip>:8501 in any browser.
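One way those steps can look on Ubuntu. The Dockerfile is a sketch for the Streamlit front end only; it assumes Ollama keeps running on the host (or in its own container) and that your Python dependencies are listed in `requirements.txt`:

```bash
# Install Docker from Ubuntu's repositories and start it
sudo apt update && sudo apt install -y docker.io
sudo systemctl enable --now docker

# Let your user run Docker without sudo (takes effect after re-login)
sudo usermod -aG docker $USER
```

```dockerfile
# Dockerfile (sketch)
FROM python:3.13-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.address=0.0.0.0", "--server.port=8501"]
```

Then build the image and run it, exposing the Streamlit port:

```bash
docker build -t offline-assistant .
docker run -d -p 8501:8501 offline-assistant
```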
10. Secure & Maintain Your Deployment
- Run everything air-gapped behind your corporate VPN or firewall.
- Implement role-based access control (RBAC) or basic auth in front of Streamlit.
- Log queries and responses for auditing and to improve your data ingestion pipeline.
- Automate updates: schedule re-ingestion of new documents and re-indexing of embeddings (for example, with a nightly cron job; see below).
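For the last point, a cron entry is often enough. A sketch, assuming a hypothetical `ingest.py` script that wraps the chunking, embedding, and indexing from steps 5 through 7:

```bash
# crontab -e: rebuild the document index every night at 02:00
0 2 * * * /app/venv/bin/python /app/ingest.py >> /var/log/assistant-ingest.log 2>&1
```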