Fine-Tuning Llama 3 on Custom Data: A Beginner's Guide

Fine-tuning is one of the most powerful techniques in the LLM toolbox. It allows you to take a base model like Llama 3 and specialize it for your domain, tone, or task. But for beginners, the process can feel overwhelming.

This guide walks you through everything: from understanding when to fine-tune versus using RAG, to preparing your dataset, running LoRA with Unsloth, converting to GGUF, and deploying in Ollama.

1. Fine-Tuning vs RAG vs Prompt Engineering

Before writing any code, you need to pick the right strategy.

Approach	Best For	Effort	Quality
Prompt Engineering	Simple tasks, general knowledge	None	Baseline
RAG (Retrieval Augmented Generation)	Answering questions over private documents	Medium	High for factual recall
Fine-Tuning	Changing model behavior, tone, or style	High	Best for specialized tasks
Full Pre-training	Creating a new model from scratch	Extremely High	Rarely needed

When to fine-tune:

You want the model to consistently output a specific format (JSON, markdown, custom schema)
You need the model to adopt a specific persona or tone
You have labeled data pairs (instruction → response) and want to improve accuracy on a task

When NOT to fine-tune:

You just need the model to know facts from your documents — use RAG instead
You have fewer than 100 high-quality examples — prompt engineering is sufficient
You cannot verify the quality of your training data — garbage in, garbage out

2. Prerequisites: Hardware and Software

Minimum Hardware

Method	RAM	VRAM	Storage
LoRA (7B model)	16 GB	12 GB	20 GB
QLoRA (7B model)	12 GB	6 GB	15 GB
QLoRA (3B model)	8 GB	4 GB	10 GB
Full fine-tune (7B)	64 GB	48 GB	30 GB

Recommendation: If you have a consumer GPU (RTX 3060 12GB, RTX 4060, etc.), use QLoRA on a 3B or 7B model. If you have an RTX 4090 (24GB+), LoRA on 7B or 13B is possible.

Software Stack

# Python 3.10+ recommended
conda create -n finetune python=3.10
conda activate finetune

# Core dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers datasets accelerate peft bitsandbytes
pip install unsloth
pip install wandb  # optional, for experiment tracking

3. Dataset Preparation

Fine-tuning requires a dataset in instruction-following format. The most common format is a JSONL file where each line is a conversation.

Format: ShareGPT / ChatML

{"messages": [{"role": "system", "content": "You are a helpful AI assistant specialized in Python."}, {"role": "user", "content": "Write a function to reverse a linked list."}, {"role": "assistant", "content": "Here is a Python function to reverse a singly linked list:\n\n\\`\\`\\`python\nclass ListNode:\n    def __init__(self, val=0, next=None):\n        self.val = val\n        self.next = next\n\ndef reverseList(head: ListNode) -> ListNode:\n    prev = None\n    curr = head\n    while curr:\n        next_temp = curr.next\n        curr.next = prev\n        prev = curr\n        curr = next_temp\n    return prev\n\\`\\`\\`\n\nThis runs in O(n) time and O(1) space."}]}

Dataset Size Guidelines

Dataset Size	Quality	Overfitting Risk
50–200 examples	Small, good for style transfer	High
500–2,000 examples	Medium, good for task adaptation	Moderate
5,000+ examples	Large, general improvement	Low

Pro tip: Quality over quantity. 500 manually verified examples beat 10,000 scraped ones every time.

Loading your dataset

from datasets import load_dataset

dataset = load_dataset("json", data_files="my_data.jsonl", split="train")
dataset = dataset.train_test_split(test_size=0.1)

4. Using Unsloth for Fast Training

Unsloth is a library that dramatically speeds up fine-tuning while reducing memory usage. It works with Llama 3, Mistral, Qwen, and many other models.

Loading the Model

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    dtype=None,  # Auto-detect
    load_in_4bit=True,  # QLoRA
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

Key parameters explained:

r (rank): 8–32 is the sweet spot. Higher = more expressiveness, more memory.
lora_alpha: Scaling factor. Usually set to 2x r (e.g., r=16, alpha=32).
load_in_4bit: Enables QLoRA. Cuts memory by ~4x versus 16-bit.
target_modules: Which attention layers to apply LoRA to. The list above covers all linear layers in Llama 3 architecture.

5. LoRA vs QLoRA vs Full Fine-Tuning

Method	Trainable Params	VRAM (7B)	Speed	Quality
Full Fine-Tune	100% (7B)	48 GB+	Slow	Best
LoRA	~1% (70M)	14 GB	Fast	Near-best
QLoRA	~1% (70M)	6 GB	Medium	~95% of LoRA
DoRA	~1% (70M)	7 GB	Medium	~98% of LoRA

Recommendation: Start with QLoRA. It is the most accessible. If you have headroom, switch to LoRA for slightly better quality. Full fine-tuning is rarely worth the cost.

6. Training Configuration

from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./llama3-finetuned",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=10,
    save_strategy="epoch",
    optim="adamw_8bit",
    seed=42,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=training_args,
    dataset_text_field="text",
    max_seq_length=2048,
)

trainer.train()

Important Hyperparameters

Parameter	Recommended	Notes
learning_rate	1e-4 to 3e-4	Lower for full fine-tune (~1e-5)
num_train_epochs	2–5	Monitor eval loss for overfitting
batch_size	1–4 per GPU	Use gradient_accumulation to compensate
warmup_steps	5–10% of total	Helps stabilize training
optimizer	AdamW 8-bit	Saves VRAM over full AdamW

7. Running Training

python train.py

If using QLoRA on an RTX 3060 12GB with Llama 3.2 3B, expect:

Training time: ~30–60 minutes for 500 examples
Peak VRAM: ~5–6 GB
Loss curve: Should steadily decrease; eval loss should not increase (sign of overfitting)

Monitoring with W&B

pip install wandb
wandb login

Uncomment the report_to="wandb" in TrainingArguments to see live loss curves.

8. Evaluating Your Model

After training, test your model on held-out examples:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./llama3-finetuned",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [{"role": "user", "content": "Explain what fine-tuning is in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=128,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0]))

Checklist for evaluation:

Does the output match the desired format?
Is the tone correct?
Does it hallucinate less than the base model?
Run on 20–50 test examples and score manually.

9. Converting to GGUF

To use your fine-tuned model in Ollama or LM Studio, convert it to GGUF format.

Merge LoRA weights first

model = model.merge_and_unload()
model.save_pretrained("./llama3-merged")
tokenizer.save_pretrained("./llama3-merged")

Convert using llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt

python convert_hf_to_gguf.py ../llama3-merged --outfile llama3-finetuned.gguf --outtype q4_k_m

The q4_k_m quantization offers a great quality-to-size ratio (roughly 4GB for a 7B model).

10. Running in Ollama / LM Studio

Ollama

Create a Modelfile:

FROM ./llama3-finetuned.gguf

TEMPLATE """{{ .Prompt }}"""

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|eot_id|>"

Then:

ollama create my-finetuned-model -f Modelfile
ollama run my-finetuned-model

LM Studio

Open LM Studio, go to "My Models", click "Add Model", and select your GGUF file. It will be available in the chat interface and the API server.

11. Common Issues and Fixes

Out of Memory (OOM)

Symptom	Fix
CUDA OOM during training	Reduce batch size to 1, use gradient_accumulation
CUDA OOM during loading	Use `load_in_4bit=True`
System RAM OOM	Use `load_in_4bit=True` with `device_map="auto"`

Overfitting

Symptom	Fix
Eval loss increases	Reduce epochs, add dropout, reduce r
Model repeats training data	Epochs too high, dataset too small
Model loses general knowledge	Add general knowledge examples to dataset

Poor Quality

Symptom	Fix
Outputs are garbled	Check tokenizer chat template
Model ignores instructions	Dataset format is wrong; use ShareGPT format
Model is too verbose	Add system prompt examples in training data

12. Advanced Tips

Mix datasets: Combine your custom data with a small percentage of general instruction data (e.g., OpenHermes 2.5) to prevent catastrophic forgetting.
Learning rate scheduling: Use cosine or linear decay. Constant LR works but is suboptimal.
DeepSpeed / FSDP: For multi-GPU setups, use DeepSpeed ZeRO-3 or FSDP. Unsloth supports both.
Dataset augmentation: Use an LLM (like GPT-4 or Claude) to generate additional high-quality examples in your domain.

Conclusion

Fine-tuning Llama 3 is accessible to anyone with a modern GPU and a bit of Python knowledge. With QLoRA and Unsloth, you can train a specialized model in under an hour on consumer hardware.

The workflow is straightforward:

Prepare your dataset (JSONL, ChatML format)
Load model with QLoRA via Unsloth
Train with SFTTrainer
Merge and convert to GGUF
Run in Ollama or LM Studio

Your fine-tuned model will be faster, cheaper, and more private than any cloud API. In 2026, that is a superpower.

Fine-Tuning Llama 3 on Custom Data: A Beginner's Guide

Fine-Tuning Llama 3 on Custom Data: A Beginner's Guide

1. Fine-Tuning vs RAG vs Prompt Engineering

2. Prerequisites: Hardware and Software

Minimum Hardware

Software Stack

3. Dataset Preparation

Format: ShareGPT / ChatML

Dataset Size Guidelines

Loading your dataset

4. Using Unsloth for Fast Training

Loading the Model

5. LoRA vs QLoRA vs Full Fine-Tuning

6. Training Configuration

Important Hyperparameters

7. Running Training

Monitoring with W&B

8. Evaluating Your Model

9. Converting to GGUF

Merge LoRA weights first

Convert using llama.cpp

10. Running in Ollama / LM Studio

Ollama

LM Studio

11. Common Issues and Fixes

Out of Memory (OOM)

Overfitting

Poor Quality

12. Advanced Tips

Conclusion

ON THIS PAGE

Continue Reading

How to Build an Offline AI Assistant Using LM Studio

Running AI Completely Offline in 2026

Phi-3 vs Llama 3 for Local AI: Developer Benchmark 2026