Surya Pratap Singh

Surya Pratap Singh

AI Engineer & Founder

May 22, 2026
15 min read
Fine-Tuning Llama 3 on Custom Data: A Beginner's Guide
Artificial Intelligence

Fine-Tuning Llama 3 on Custom Data: A Beginner's Guide

Fine-Tuning Llama 3 on Custom Data: A Beginner's Guide

Fine-tuning is one of the most powerful techniques in the LLM toolbox. It allows you to take a base model like Llama 3 and specialize it for your domain, tone, or task. But for beginners, the process can feel overwhelming.

This guide walks you through everything: from understanding when to fine-tune versus using RAG, to preparing your dataset, running LoRA with Unsloth, converting to GGUF, and deploying in Ollama.


1. Fine-Tuning vs RAG vs Prompt Engineering

Before writing any code, you need to pick the right strategy.

ApproachBest ForEffortQuality
Prompt EngineeringSimple tasks, general knowledgeNoneBaseline
RAG (Retrieval Augmented Generation)Answering questions over private documentsMediumHigh for factual recall
Fine-TuningChanging model behavior, tone, or styleHighBest for specialized tasks
Full Pre-trainingCreating a new model from scratchExtremely HighRarely needed

When to fine-tune:

  • You want the model to consistently output a specific format (JSON, markdown, custom schema)
  • You need the model to adopt a specific persona or tone
  • You have labeled data pairs (instruction → response) and want to improve accuracy on a task

When NOT to fine-tune:

  • You just need the model to know facts from your documents — use RAG instead
  • You have fewer than 100 high-quality examples — prompt engineering is sufficient
  • You cannot verify the quality of your training data — garbage in, garbage out

2. Prerequisites: Hardware and Software

Minimum Hardware

MethodRAMVRAMStorage
LoRA (7B model)16 GB12 GB20 GB
QLoRA (7B model)12 GB6 GB15 GB
QLoRA (3B model)8 GB4 GB10 GB
Full fine-tune (7B)64 GB48 GB30 GB

Recommendation: If you have a consumer GPU (RTX 3060 12GB, RTX 4060, etc.), use QLoRA on a 3B or 7B model. If you have an RTX 4090 (24GB+), LoRA on 7B or 13B is possible.

Software Stack

# Python 3.10+ recommended conda create -n finetune python=3.10 conda activate finetune # Core dependencies pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 pip install transformers datasets accelerate peft bitsandbytes pip install unsloth pip install wandb # optional, for experiment tracking

3. Dataset Preparation

Fine-tuning requires a dataset in instruction-following format. The most common format is a JSONL file where each line is a conversation.

Format: ShareGPT / ChatML

{"messages": [{"role": "system", "content": "You are a helpful AI assistant specialized in Python."}, {"role": "user", "content": "Write a function to reverse a linked list."}, {"role": "assistant", "content": "Here is a Python function to reverse a singly linked list:\n\n\\`\\`\\`python\nclass ListNode:\n def __init__(self, val=0, next=None):\n self.val = val\n self.next = next\n\ndef reverseList(head: ListNode) -> ListNode:\n prev = None\n curr = head\n while curr:\n next_temp = curr.next\n curr.next = prev\n prev = curr\n curr = next_temp\n return prev\n\\`\\`\\`\n\nThis runs in O(n) time and O(1) space."}]}

Dataset Size Guidelines

Dataset SizeQualityOverfitting Risk
50–200 examplesSmall, good for style transferHigh
500–2,000 examplesMedium, good for task adaptationModerate
5,000+ examplesLarge, general improvementLow

Pro tip: Quality over quantity. 500 manually verified examples beat 10,000 scraped ones every time.

Loading your dataset

from datasets import load_dataset dataset = load_dataset("json", data_files="my_data.jsonl", split="train") dataset = dataset.train_test_split(test_size=0.1)

4. Using Unsloth for Fast Training

Unsloth is a library that dramatically speeds up fine-tuning while reducing memory usage. It works with Llama 3, Mistral, Qwen, and many other models.

Loading the Model

from unsloth import FastLanguageModel import torch model, tokenizer = FastLanguageModel.from_pretrained( model_name="unsloth/Llama-3.2-3B-Instruct", max_seq_length=2048, dtype=None, # Auto-detect load_in_4bit=True, # QLoRA ) model = FastLanguageModel.get_peft_model( model, r=16, target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], lora_alpha=16, lora_dropout=0, bias="none", use_gradient_checkpointing="unsloth", random_state=42, )

Key parameters explained:

  • r (rank): 8–32 is the sweet spot. Higher = more expressiveness, more memory.
  • lora_alpha: Scaling factor. Usually set to 2x r (e.g., r=16, alpha=32).
  • load_in_4bit: Enables QLoRA. Cuts memory by ~4x versus 16-bit.
  • target_modules: Which attention layers to apply LoRA to. The list above covers all linear layers in Llama 3 architecture.

5. LoRA vs QLoRA vs Full Fine-Tuning

MethodTrainable ParamsVRAM (7B)SpeedQuality
Full Fine-Tune100% (7B)48 GB+SlowBest
LoRA~1% (70M)14 GBFastNear-best
QLoRA~1% (70M)6 GBMedium~95% of LoRA
DoRA~1% (70M)7 GBMedium~98% of LoRA

Recommendation: Start with QLoRA. It is the most accessible. If you have headroom, switch to LoRA for slightly better quality. Full fine-tuning is rarely worth the cost.


6. Training Configuration

from transformers import TrainingArguments from trl import SFTTrainer training_args = TrainingArguments( output_dir="./llama3-finetuned", per_device_train_batch_size=2, gradient_accumulation_steps=4, warmup_steps=5, num_train_epochs=3, learning_rate=2e-4, fp16=not torch.cuda.is_bf16_supported(), bf16=torch.cuda.is_bf16_supported(), logging_steps=10, save_strategy="epoch", optim="adamw_8bit", seed=42, ) trainer = SFTTrainer( model=model, tokenizer=tokenizer, train_dataset=dataset["train"], eval_dataset=dataset["test"], args=training_args, dataset_text_field="text", max_seq_length=2048, ) trainer.train()

Important Hyperparameters

ParameterRecommendedNotes
learning_rate1e-4 to 3e-4Lower for full fine-tune (~1e-5)
num_train_epochs2–5Monitor eval loss for overfitting
batch_size1–4 per GPUUse gradient_accumulation to compensate
warmup_steps5–10% of totalHelps stabilize training
optimizerAdamW 8-bitSaves VRAM over full AdamW

7. Running Training

python train.py

If using QLoRA on an RTX 3060 12GB with Llama 3.2 3B, expect:

  • Training time: ~30–60 minutes for 500 examples
  • Peak VRAM: ~5–6 GB
  • Loss curve: Should steadily decrease; eval loss should not increase (sign of overfitting)

Monitoring with W&B

pip install wandb wandb login

Uncomment the report_to="wandb" in TrainingArguments to see live loss curves.


8. Evaluating Your Model

After training, test your model on held-out examples:

from unsloth import FastLanguageModel model, tokenizer = FastLanguageModel.from_pretrained( model_name="./llama3-finetuned", max_seq_length=2048, load_in_4bit=True, ) FastLanguageModel.for_inference(model) messages = [{"role": "user", "content": "Explain what fine-tuning is in one sentence."}] inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda") outputs = model.generate( input_ids=inputs, max_new_tokens=128, temperature=0.7, top_p=0.9, ) print(tokenizer.decode(outputs[0]))

Checklist for evaluation:

  • Does the output match the desired format?
  • Is the tone correct?
  • Does it hallucinate less than the base model?
  • Run on 20–50 test examples and score manually.

9. Converting to GGUF

To use your fine-tuned model in Ollama or LM Studio, convert it to GGUF format.

Merge LoRA weights first

model = model.merge_and_unload() model.save_pretrained("./llama3-merged") tokenizer.save_pretrained("./llama3-merged")

Convert using llama.cpp

git clone https://github.com/ggerganov/llama.cpp cd llama.cpp pip install -r requirements.txt python convert_hf_to_gguf.py ../llama3-merged --outfile llama3-finetuned.gguf --outtype q4_k_m

The q4_k_m quantization offers a great quality-to-size ratio (roughly 4GB for a 7B model).


10. Running in Ollama / LM Studio

Ollama

Create a Modelfile:

FROM ./llama3-finetuned.gguf TEMPLATE """{{ .Prompt }}""" PARAMETER temperature 0.7 PARAMETER top_p 0.9 PARAMETER stop "<|eot_id|>"

Then:

ollama create my-finetuned-model -f Modelfile ollama run my-finetuned-model

LM Studio

Open LM Studio, go to "My Models", click "Add Model", and select your GGUF file. It will be available in the chat interface and the API server.


11. Common Issues and Fixes

Out of Memory (OOM)

SymptomFix
CUDA OOM during trainingReduce batch size to 1, use gradient_accumulation
CUDA OOM during loadingUse load_in_4bit=True
System RAM OOMUse load_in_4bit=True with device_map="auto"

Overfitting

SymptomFix
Eval loss increasesReduce epochs, add dropout, reduce r
Model repeats training dataEpochs too high, dataset too small
Model loses general knowledgeAdd general knowledge examples to dataset

Poor Quality

SymptomFix
Outputs are garbledCheck tokenizer chat template
Model ignores instructionsDataset format is wrong; use ShareGPT format
Model is too verboseAdd system prompt examples in training data

12. Advanced Tips

  • Mix datasets: Combine your custom data with a small percentage of general instruction data (e.g., OpenHermes 2.5) to prevent catastrophic forgetting.
  • Learning rate scheduling: Use cosine or linear decay. Constant LR works but is suboptimal.
  • DeepSpeed / FSDP: For multi-GPU setups, use DeepSpeed ZeRO-3 or FSDP. Unsloth supports both.
  • Dataset augmentation: Use an LLM (like GPT-4 or Claude) to generate additional high-quality examples in your domain.

Conclusion

Fine-tuning Llama 3 is accessible to anyone with a modern GPU and a bit of Python knowledge. With QLoRA and Unsloth, you can train a specialized model in under an hour on consumer hardware.

The workflow is straightforward:

  1. Prepare your dataset (JSONL, ChatML format)
  2. Load model with QLoRA via Unsloth
  3. Train with SFTTrainer
  4. Merge and convert to GGUF
  5. Run in Ollama or LM Studio

Your fine-tuned model will be faster, cheaper, and more private than any cloud API. In 2026, that is a superpower.