Surya Pratap Singh
AI Engineer & Founder
How to Install Ollama on Windows, Linux, and Mac (Step-by-Step 2026)
Ollama has become the standard tool for running local LLMs. It wraps llama.cpp, handles model downloads, quantization, GPU acceleration, and exposes a clean REST API—all in a single binary. Whether you're a beginner who wants to chat with Llama 3 or a developer building AI-powered apps, Ollama is the fastest way to get started.
In this guide, I'll walk through installing Ollama on Windows, macOS, and Linux, then show you how to pull models, use the API, create custom Modelfiles, and troubleshoot common issues.
What is Ollama?
Ollama is a lightweight, open-source tool that lets you run large language models locally. It handles:
- Model management: Download, list, and remove models with simple commands
- GPU acceleration: Automatically detects NVIDIA CUDA, AMD ROCm, and Apple Metal
- REST API: Serves a native HTTP API at localhost:11434, with OpenAI-compatible endpoints under /v1
- Modelfiles: Customize model parameters, system prompts, and templates
- Multi-model serving: Run multiple models simultaneously
Under the hood, Ollama uses llama.cpp for inference, which means it supports GGUF-quantized models and runs efficiently on CPU as well as GPU.
Prerequisites
Before installing, check that your system meets these requirements:
- Minimum RAM: 8GB (16GB recommended for 7B models)
- Storage: 5-50GB free (models range from 800MB to 40GB)
- OS: Windows 10+, macOS 12+, or a modern Linux distribution
- Internet: Required for downloading models
GPU is optional but recommended. Ollama works on CPU-only machines, just slower.
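To relate these RAM figures to model size, a common rule of thumb is that the weights of a quantized model take about parameters × bits-per-weight / 8 bytes. The sketch below assumes roughly 4.5 bits per weight for a typical 4-bit GGUF quantization, and it ignores the context cache and runtime overhead, so treat the result as a lower bound rather than a measurement. The function name is illustrative, not part of Ollama.

```python
def weights_size_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes).

    bits_per_weight=4.5 is an assumed average for a 4-bit GGUF quantization;
    the context cache and runtime overhead are NOT included.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for size in (3, 7, 70):
    print(f"{size}B parameters -> ~{weights_size_gb(size):.1f} GB of weights")
```

Leave a few gigabytes of headroom on top of the estimate for the operating system and the context window.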
Installing on Windows
Method 1: Direct Installer (Simplest)
- Go to ollama.com
- Click Download for Windows
- Run the installer (OllamaSetup.exe)
- Ollama runs as a background service; you'll see the llama icon in your system tray
To verify installation:
    ollama --version
    # Should output something like: ollama version is 0.5.7
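If you want to run the same check from a script, for example in a setup step for your own project, the sketch below looks for the binary on PATH and extracts the version number. It assumes only that the output contains an x.y.z version string; both helper names are illustrative.

```python
import re
import shutil
import subprocess


def parse_version(text: str):
    """Return the first x.y.z version found in text as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def installed_version():
    """Run `ollama --version` if the binary is on PATH; return None otherwise."""
    if shutil.which("ollama") is None:
        return None
    result = subprocess.run(
        ["ollama", "--version"], capture_output=True, text=True, check=False
    )
    return parse_version(result.stdout + result.stderr)
```

Calling installed_version() returns None when Ollama is not installed, so a script can print an install hint instead of failing later.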
Method 2: Using WSL2 (For Development)
If you're a developer who wants Linux-native performance and better file system integration, install via WSL2:
    # Step 1: Install WSL2
    wsl --install -d Ubuntu

    # Step 2: Restart your PC, then open Ubuntu from the Start Menu

    # Step 3: Inside WSL, install Ollama
    curl -fsSL https://ollama.com/install.sh | sh
Running Ollama in WSL2 gives you near-native Linux performance. If you have an NVIDIA card, GPU passthrough also works, provided a recent NVIDIA driver is installed on the Windows side.
Installing on macOS
Ollama supports both Intel and Apple Silicon (M1/M2/M3/M4) Macs.
Method 1: Homebrew (Recommended)
brew install ollama
Then start the service:
brew services start ollama
Method 2: Direct Download
Download the macOS app from ollama.com and drag it to your Applications folder.
Apple Silicon Optimization
Ollama automatically uses Apple's Metal framework on M-series chips, so you don't need to configure anything. 7B models can run at around 30-50 tokens per second on an M1 Pro, which is competitive with many discrete NVIDIA GPUs, though actual speed depends on the quantization and context length.
Installing on Linux
Ubuntu / Debian
curl -fsSL https://ollama.com/install.sh | sh
This is the official install script. It detects your architecture, downloads the Ollama binary, creates an ollama system user, and registers a systemd service.
Fedora / RHEL
The install script is not Debian-specific, so the same command works on Fedora and RHEL:
    curl -fsSL https://ollama.com/install.sh | sh
Arch Linux
    yay -S ollama
    # or
    paru -S ollama
Start the Service
    sudo systemctl start ollama
    sudo systemctl enable ollama   # Start on boot
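Once the service is started, a quick way to confirm it is listening is to request the server root, which answers with a short status message. Here is a minimal sketch using only the Python standard library. It assumes the default address, and the function name is illustrative.

```python
import urllib.error
import urllib.request


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET on the server root answers with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
```

A False result usually means the service has not started yet or is bound to a different host or port.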
First Steps: Pull and Run a Model
Once Ollama is installed, let's pull and run your first model:
    # Pull Llama 3.2 3B (runs on 8GB RAM)
    ollama pull llama3.2:3b

    # Run it interactively
    ollama run llama3.2:3b
You'll see a prompt where you can start chatting:
>>> What is the capital of France?
The capital of France is Paris.
>>> Send a message (/? for help)
To exit, type /bye or press Ctrl+D.
Popular Models to Try
    # Small & fast (1-3GB RAM)
    ollama pull tinyllama
    ollama pull phi3:3.8b
    ollama pull qwen2.5:1.5b

    # Medium (6-8GB RAM)
    ollama pull llama3.2:3b
    ollama pull gemma2:9b
    ollama pull qwen2.5:7b

    # Large (16GB+ RAM)
    ollama pull llama3.1:8b
    ollama pull mistral:7b
    ollama pull mixtral:8x7b   # ~26GB download; needs well over 16GB of memory
Using the REST API
Ollama exposes an HTTP API at http://localhost:11434. This is how you integrate local AI into your applications.
Generate a Completion
    curl http://localhost:11434/api/generate -d '{
      "model": "llama3.2:3b",
      "prompt": "Why is the sky blue?",
      "stream": false
    }'
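With "stream": false the server returns a single JSON object. If you leave streaming on, /api/generate instead sends one JSON object per line, each carrying a text fragment in its response field, followed by a final object with done set to true. The sketch below shows one way to reassemble such a stream. The sample chunks are hand-written for illustration, and the function name is hypothetical.

```python
import json


def join_stream(lines):
    """Concatenate the `response` fragments from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk marks the end of the stream
    return "".join(parts)


# Hand-written sample chunks in the shape the streaming endpoint emits.
sample = [
    '{"model": "llama3.2:3b", "response": "The sky ", "done": false}',
    '{"model": "llama3.2:3b", "response": "is blue.", "done": false}',
    '{"model": "llama3.2:3b", "response": "", "done": true}',
]
print(join_stream(sample))
```

Against a live server, the object returned by urllib.request.urlopen can be passed in directly, since iterating over it yields one line at a time.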
Chat Completion (OpenAI-compatible)
    curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama3.2:3b",
        "messages": [
          {"role": "user", "content": "Hello, who are you?"}
        ],
        "stream": false
      }'
This means you can use the OpenAI JavaScript/Python SDK with Ollama by simply changing the base URL:
    import OpenAI from 'openai';

    const client = new OpenAI({
      baseURL: 'http://localhost:11434/v1',
      apiKey: 'ollama', // Can be any string
    });

    const response = await client.chat.completions.create({
      model: 'llama3.2:3b',
      messages: [{ role: 'user', content: 'Hello!' }],
    });

    console.log(response.choices[0].message.content);
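If you would rather not add an SDK dependency, the same OpenAI-style endpoint can be called with the Python standard library. This is a minimal sketch, assuming the default local port. The helper names are illustrative, and nothing is sent until chat() is called.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # assumes the default local port


def build_chat_request(model: str, user_message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def chat(model: str, user_message: str) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(model, user_message)) as response:
        return extract_reply(json.load(response))
```

With the server running and the model pulled, chat("llama3.2:3b", "Hello!") returns the assistant's reply as a string.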
List Models
curl http://localhost:11434/api/tags
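The response is a JSON object with a models array. The sketch below turns it into a short name-and-size listing. It assumes each entry has a name and a size in bytes, and the sample data is hand-written for illustration.

```python
def summarize_models(tags_response: dict):
    """Return (name, size in GB) pairs from an /api/tags response body."""
    return [
        (model["name"], round(model["size"] / 1e9, 1))
        for model in tags_response.get("models", [])
    ]


# Hand-written sample in the shape of an /api/tags response (values are illustrative).
sample = {
    "models": [
        {"name": "llama3.2:3b", "size": 2019393189},
        {"name": "tinyllama:latest", "size": 637700138},
    ]
}
for name, size_gb in summarize_models(sample):
    print(f"{name}: {size_gb} GB")
```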
Creating Custom Modelfiles
Modelfiles let you customize a model's behavior, system prompt, and parameters.
    # Modelfile
    FROM llama3.2:3b

    # Set the system prompt
    SYSTEM """You are an expert Python developer.
    You write clean, well-documented code.
    You explain your reasoning briefly before providing code."""

    # Adjust parameters
    PARAMETER temperature 0.3
    PARAMETER top_p 0.9
    PARAMETER stop "</s>"
Build and run it:
    ollama create python-assistant -f ./Modelfile
    ollama run python-assistant
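If you maintain several assistants that differ only in their prompt or parameters, it can be convenient to generate the Modelfile text from code. The helper below is a hypothetical sketch that covers only the FROM, SYSTEM, and PARAMETER instructions and does not escape triple quotes inside the system prompt.

```python
def render_modelfile(base: str, system: str, parameters: dict) -> str:
    """Render Modelfile text with FROM, SYSTEM, and PARAMETER instructions."""
    lines = [f"FROM {base}", f'SYSTEM """{system}"""']
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


text = render_modelfile(
    base="llama3.2:3b",
    system="You are an expert Python developer.",
    parameters={"temperature": 0.3, "top_p": 0.9},
)
print(text)
```

Write the returned text to a file named Modelfile and build it with ollama create as usual.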
Troubleshooting Common Issues
Out of Memory (OOM)
If Ollama crashes or you get an error about memory:
    # Check how much RAM is available
    free -h                            # Linux
    wmic OS get FreePhysicalMemory     # Windows (value in KB)

    # Try a smaller model
    ollama pull tinyllama              # 1.1B, ~800MB RAM

    # Or restrict the context length from inside an interactive session
    ollama run llama3.2:3b
    >>> /set parameter num_ctx 1024
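When you call Ollama through the API rather than the interactive prompt, the context length can be set per request in the options field of the request body. Here is a minimal sketch of building such a body. The function name is illustrative, and 1024 is just an example value.

```python
import json


def generate_payload(model: str, prompt: str, num_ctx: int = 1024) -> bytes:
    """Build a JSON body for /api/generate with a reduced context window."""
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # smaller context -> less memory for the cache
    }
    return json.dumps(body).encode("utf-8")


payload = generate_payload("llama3.2:3b", "Why is the sky blue?")
print(payload.decode("utf-8"))
```

Send the bytes as the body of a POST request to http://localhost:11434/api/generate.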
Slow Inference
Slow generation is usually due to CPU-only inference:
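    # Check whether the loaded model is on the GPU (see the PROCESSOR column)
    ollama ps

    # On Linux, check GPU memory
    nvidia-smi   # NVIDIA
    rocm-smi     # AMD

    # Force CPU-only if the GPU is causing issues (NVIDIA):
    # stop the background service, then hide the GPUs from the server
    CUDA_VISIBLE_DEVICES=-1 ollama serve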
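The same information is available programmatically from GET /api/ps, which lists the loaded models. The sketch below assumes each entry reports its total size and the part held in GPU memory as size and size_vram. The sample data is hand-written for illustration, and the function name is hypothetical.

```python
def gpu_share(running_model: dict) -> float:
    """Fraction of a loaded model's memory that sits in GPU memory (0.0 to 1.0)."""
    total = running_model.get("size", 0)
    if total <= 0:
        return 0.0
    return running_model.get("size_vram", 0) / total


# Hand-written sample in the shape of an /api/ps response (values are illustrative).
sample = {
    "models": [
        {"name": "llama3.2:3b", "size": 4000000000, "size_vram": 4000000000},
        {"name": "mixtral:8x7b", "size": 28000000000, "size_vram": 7000000000},
    ]
}
for model in sample["models"]:
    print(f"{model['name']}: {gpu_share(model):.0%} on GPU")
```

A share well below 100% means part of the model is running on the CPU, which is a common cause of slow generation.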
Port Already in Use
If port 11434 is occupied:
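    # Start the server on a different port
    # (use 0.0.0.0 instead of 127.0.0.1 only if other machines must reach it)
    OLLAMA_HOST=127.0.0.1:11435 ollama serve

    # Point the CLI at the new port
    OLLAMA_HOST=127.0.0.1:11435 ollama run llama3.2:3b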
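Before changing the port, it is worth checking whether something is really listening on 11434, since the most common cause is an Ollama instance that is already running. Here is a minimal sketch using a plain TCP connection attempt. The function name is illustrative.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        return sock.connect_ex((host, port)) == 0


print("11434 in use:", port_in_use(11434))
```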
Windows-Specific Issues
- Antivirus blocking: Add Ollama to your antivirus exclusions
- WSL2 not detected: Run wsl --set-default-version 2 before installing
- GPU not detected: Install or update the NVIDIA driver, or rely on CPU mode
Ollama vs LM Studio vs GPT4All
You might wonder which local AI tool to use. Here's a quick comparison:
| Feature | Ollama | LM Studio | GPT4All |
|---|---|---|---|
| Setup | CLI | GUI | GUI |
| REST API | Built-in | Built-in | Limited |
| GPU support | CUDA, Metal, ROCm | CUDA, Metal, ROCm | CPU only (mostly) |
| Modelfiles | Yes | No | No |
| Model library | 100+ | 100+ | 50+ |
| Best for | Developers | General users | Privacy enthusiasts |
Choose Ollama if: You're a developer who needs a CLI-friendly tool with a powerful API and custom model configurations.
Choose LM Studio if: You prefer a graphical interface and want to download/configure models with point-and-click simplicity.
Choose GPT4All if: You want the simplest possible local chat experience and don't need an API.
Conclusion
Ollama transforms local AI from a complicated setup process into a single command. Whether you're on Windows, macOS, or Linux, you can be running a state-of-the-art language model within minutes.
The ecosystem keeps expanding—new models are published as GGUF files daily, and Ollama's Modelfile system means you can customize any model without recompiling. Install it today, pull a model, and start building the next generation of privacy-first, offline AI applications.