Fine-Tuning LLMs in 2026: LoRA, QLoRA for Llama 4 and Mistral — Complete Guide

Fine-tuning has changed dramatically. In 2026, parameter-efficient techniques such as LoRA, QLoRA, and DoRA let you fine-tune models with tens of billions of parameters on a single GPU by training only a small set of added weights. This guide covers what you need to ship a custom fine-tuned model.

Why Fine-Tune?

Approach	Quality	Cost	Data Needed
Prompt engineering	Baseline	Free	0
RAG	Good + grounded	Low	Documents
Fine-tuning	Excellent	Medium	500-5000 examples
Full training	Best	Very high	Millions

Fine-tuning excels at:

Domain adaptation — medical, legal, financial language
Format control — always output JSON, follow specific templates
Behavior shaping — tone, style, safety guardrails
Performance boost — +5-15% on domain-specific tasks

Parameter-Efficient Fine-Tuning (PEFT) Methods

LoRA (Low-Rank Adaptation)

LoRA freezes the pretrained weights and adds a pair of trainable low-rank matrices to selected layers, usually the attention projections. Typically only 0.1-1% of the parameters are trained.

# Original: W (dense) = 4096x4096 = 16.8M params
# LoRA: A (4096x16) + B (16x4096) = 131K params
# That's 0.78% of original!
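
The arithmetic above can be reproduced for any layer shape and rank. The following is a minimal sketch; the helper name `lora_param_counts` is made up for illustration and is not part of any library.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int, float]:
    """Return (dense params, LoRA params, LoRA share of dense) for one weight matrix."""
    dense = d_in * d_out
    lora = d_in * rank + rank * d_out  # one d_in x rank matrix plus one rank x d_out matrix
    return dense, lora, lora / dense


# Compare a few ranks on the 4096x4096 layer used in the example above
for rank in (8, 16, 64):
    dense, lora, share = lora_param_counts(4096, 4096, rank)
    print(f"rank={rank:>2}: {lora:,} of {dense:,} params ({share:.2%})")
```

The adapter size grows linearly with the rank, so doubling the rank doubles the trainable parameters for that layer.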

QLoRA (Quantized LoRA)

QLoRA combines 4-bit quantization with LoRA:

from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

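Memory savings: the 4-bit weights of a 70B model take about 35 GB, so QLoRA fits in roughly 48 GB once adapters, optimizer state, and activations are included. Full fine-tuning needs 280 GB for the fp32 weights alone, before gradients and optimizer state.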
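
These memory figures come from bytes per parameter. As a rough sketch, the helper below (a hypothetical `weight_memory_gb`, not a library function) estimates the memory for the weights alone. It assumes 0.5 bytes per parameter for 4-bit storage and ignores the small overhead of quantization constants, as well as gradients, optimizer state, and activations.

```python
# Approximate storage cost per parameter for common weight formats
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "nf4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"70B in {dtype}: {weight_memory_gb(70e9, dtype):.0f} GB")
```

For a 70B model this gives 280 GB in fp32 and about 35 GB in 4-bit, which is why a 48 GB budget leaves only modest room for everything else.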

DoRA (Weight-Decomposed Low-Rank Adaptation)

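DoRA, introduced in 2024, decomposes each pretrained weight matrix into a magnitude and a direction. The direction receives a LoRA-style low-rank update and the magnitude is trained directly. The table compares the methods on a 70B model; treat the memory and quality figures as rough guides:

Method	Trainable Params	Memory (70B model)	Performance vs Full FT
Full FT	70B	280 GB+ (fp32 weights alone)	100%
LoRA	0.1-1%	140 GB (bf16 base)	90-95%
QLoRA	0.1-1%	48 GB (4-bit base)	88-93%
DoRA	0.1-1%	48 GB (4-bit base)	94-98%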
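
To make the magnitude and direction split concrete, here is a small pure-Python sketch. It assumes the column-wise formulation, where each column of the updated matrix is normalized and then rescaled by a trainable magnitude. The function names are illustrative only.

```python
import math


def column_norms(matrix):
    """L2 norm of each column of a row-major list-of-lists matrix."""
    return [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(len(matrix[0]))]


def dora_weight(w0, delta, magnitude):
    """Recompose W' = magnitude * (W0 + delta) / ||W0 + delta||, column by column."""
    directed = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(w0, delta)]
    norms = column_norms(directed)
    return [[magnitude[j] * row[j] / norms[j] for j in range(len(row))] for row in directed]


w0 = [[3.0, 0.0], [4.0, 2.0]]
magnitude = column_norms(w0)            # magnitude starts at the column norms of W0
zero_delta = [[0.0, 0.0], [0.0, 0.0]]   # the low-rank update starts at zero
print(dora_weight(w0, zero_delta, magnitude))  # equals w0 before any training
```

With the magnitude initialized this way, the recomposed weight equals the pretrained weight before training, so fine-tuning starts from the base model's behavior. In PEFT, this variant is enabled with `use_dora=True` in `LoraConfig`.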

Step-by-Step: Fine-Tune Llama 4 with QLoRA

Step 1: Setup

pip install torch transformers accelerate peft trl bitsandbytes unsloth

Step 2: Load Model with QLoRA

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Hub repo id for the instruction-tuned Scout checkpoint (gated; request access first)
model_name = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# 4-bit NF4 quantization with double quantization and bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The tokenizer is needed later by the trainer
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

model = prepare_model_for_kbit_training(model)

Step 3: Configure LoRA

lora_config = LoraConfig(
    r=16,           # Rank of LoRA matrices
    lora_alpha=32,  # Scaling factor
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints the trainable share, typically well under 1% of total parameters
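
The `r` and `lora_alpha` settings interact: the low-rank update is multiplied by `lora_alpha / r` before it is added to the frozen layer's output. The following pure-Python sketch shows that forward pass on a tiny example. The helper names are hypothetical, and the shapes follow the convention where A maps the input down to rank `r` and B maps it back up.

```python
def matvec(matrix, vector):
    """Multiply a row-major list-of-lists matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, r, lora_alpha):
    """y = W x + (lora_alpha / r) * B (A x), with W frozen and A, B trainable."""
    scale = lora_alpha / r
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [y + scale * u for y, u in zip(base, update)]


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight
a = [[1.0, 1.0]]               # A: r x d_in with r = 1
b = [[0.5], [0.0]]             # B: d_out x r
x = [2.0, 3.0]
print(lora_forward(w, a, b, x, r=1, lora_alpha=2))  # [7.0, 3.0]
```

With `r=16` and `lora_alpha=32`, as in the config above, the scale is 2.0. Raising the rank without raising `lora_alpha` shrinks the effective update.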

Step 4: Prepare Dataset

from datasets import Dataset

# Format: instruction + input + output
training_data = [
    {
        "instruction": "Translate this code from Python to TypeScript",
        "input": "def add(a, b): return a + b",
        "output": "const add = (a: number, b: number): number => a + b;"
    },
    # ... more examples
]

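def format_chat(example):
    # Llama 4 chat markers (<|header_start|>, <|header_end|>, <|eot|>) differ from the
    # Llama 3 ones; tokenizer.apply_chat_template is the safer way to build this string.
    user_turn = f"{example['instruction']}\n\n{example['input']}"
    return {
        "text": (
            "<|begin_of_text|>"
            f"<|header_start|>user<|header_end|>\n\n{user_turn}<|eot|>"
            f"<|header_start|>assistant<|header_end|>\n\n{example['output']}<|eot|>"
        )
    }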

dataset = Dataset.from_list(training_data)
dataset = dataset.map(format_chat)

Step 5: Train

from trl import SFTConfig, SFTTrainer

# Argument names follow recent TRL releases; older releases take
# tokenizer= and max_seq_length= on SFTTrainer instead.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./llama4-scout-coder-lora",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        bf16=True,  # matches bnb_4bit_compute_dtype=torch.bfloat16
        logging_steps=10,
        save_steps=100,
        optim="paged_adamw_8bit",
        dataset_text_field="text",  # column produced by format_chat
        max_length=2048,
    ),
    processing_class=tokenizer,
)

trainer.train()
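
It helps to know how many optimizer steps a run will take before launching it. The sketch below estimates that from the batch settings used above. The function name is made up for illustration, and the trainer may round the last partial batch differently, so treat the result as approximate.

```python
import math


def training_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# 500 examples with batch size 4, 4 accumulation steps, 3 epochs, one GPU
effective_batch, total_steps = training_steps(500, 4, 4, 3)
print(effective_batch, total_steps)  # 16 96
```

With 500 examples the run is only about 96 steps, so a checkpoint interval of 100 steps would never trigger during training. For small datasets, lower `save_steps` or rely on the explicit save after training.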

Step 6: Save and Merge

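# Save LoRA adapter weights
model.save_pretrained("./llama4-scout-coder-lora")

# Merge into an unquantized copy of the base model for inference.
# Loading in bf16 rather than fp32 halves the memory needed, but the
# unquantized base still needs far more memory than the 4-bit training setup.
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged_model = PeftModel.from_pretrained(base_model, "./llama4-scout-coder-lora")
merged_model = merged_model.merge_and_unload()
merged_model.save_pretrained("./llama4-scout-coder-merged")
# Save the tokenizer next to the merged weights so the folder loads on its own
AutoTokenizer.from_pretrained(model_name).save_pretrained("./llama4-scout-coder-merged")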
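
Merging folds the scaled low-rank update into the base weight, so the merged model needs no adapter code at inference time. The following sketch shows the idea on a tiny matrix in pure Python. It assumes the standard LoRA scaling of `lora_alpha / r`, and the helper names are illustrative rather than PEFT internals.

```python
def matmul(a, b):
    """Multiply two row-major list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(w, a, b, r, lora_alpha):
    """Return W + (lora_alpha / r) * B @ A as a plain dense matrix."""
    scale = lora_alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
a = [[1.0, 1.0]]               # A: r x d_in with r = 1
b = [[0.5], [0.0]]             # B: d_out x r
merged = merge_lora(w, a, b, r=1, lora_alpha=2)
print(merged)  # [[2.0, 1.0], [0.0, 1.0]]
```

The merged matrix has the same shape as the original weight, which is why the merged checkpoint is as large as the unquantized base model.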

Fine-Tuning Mistral Large 2

Mistral Large 2 is a dense 123B-parameter model rather than a mixture-of-experts model like Llama 4 Scout. A reasonable starting config:

mistral_lora_config = LoraConfig(
    r=32,  # Higher rank than the Llama example
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

# Use left padding for batched generation with decoder-only models;
# right padding is the usual choice during training
tokenizer.padding_side = "left"
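
The padding side matters because a decoder-only model continues from the last position of each sequence. The short sketch below pads a batch of token ids on either side, using a hypothetical `pad_batch` helper and made-up token ids.

```python
def pad_batch(sequences, pad_id, side):
    """Pad every sequence to the longest length on the given side."""
    width = max(len(s) for s in sequences)
    if side == "left":
        return [[pad_id] * (width - len(s)) + s for s in sequences]
    if side == "right":
        return [s + [pad_id] * (width - len(s)) for s in sequences]
    raise ValueError(f"unknown padding side: {side}")


batch = [[11, 12, 13], [21]]
print(pad_batch(batch, 0, "left"))   # [[11, 12, 13], [0, 0, 21]]
print(pad_batch(batch, 0, "right"))  # [[11, 12, 13], [21, 0, 0]]
```

With left padding, every row ends on a real token, so generation continues from the prompt rather than from padding.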

Dataset Best Practices

Quality over Quantity

Examples	Quality	Expected Lift
100	Low	3-5%
500	Medium	8-12%
2000	High	12-18%
10000	Very High	15-25%

Data Quality Checklist

Each example has clear instruction + expected output
Examples cover edge cases (not just happy path)
No conflicting examples (same input, different output)
Balanced distribution across categories
Validation set (10-20% of total)
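
Two items on this checklist are easy to automate: finding conflicting examples and holding out a validation set. The sketch below assumes the instruction/input/output record format from Step 4. The function names and the fixed seed are illustrative choices.

```python
import random
from collections import defaultdict


def find_conflicts(examples):
    """Return prompts that map to more than one distinct output."""
    outputs = defaultdict(set)
    for ex in examples:
        outputs[(ex["instruction"], ex["input"])].add(ex["output"])
    return {key: outs for key, outs in outputs.items() if len(outs) > 1}


def train_val_split(examples, val_fraction=0.1, seed=0):
    """Shuffle deterministically, then hold out a fraction for validation."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


examples = [
    {"instruction": "Translate", "input": "x = 1", "output": "const x = 1;"},
    {"instruction": "Translate", "input": "x = 1", "output": "let x = 1;"},
    {"instruction": "Translate", "input": "y = 2", "output": "const y = 2;"},
]
print(find_conflicts(examples))
```

Resolve or drop any prompts the check reports before training, then split the cleaned data so the validation set never overlaps the training set.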

Hardware Requirements

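Model	Method	Min VRAM (approx.)	Example GPU
Llama 4 Scout (109B total)	QLoRA	80 GB	1x A100/H100 80 GB
Llama 4 Maverick (400B total)	QLoRA	200 GB+	4x A100/H100 80 GB
Mistral Large 2 (123B)	QLoRA	80 GB	1x A100/H100 80 GB
Phi-4 (14B)	QLoRA	12 GB	RTX 3060 (12 GB)
Gemma 3-27B	QLoRA	24 GB	RTX 4090

These are rough estimates based on weight size. 4-bit weights take about 0.5 bytes per parameter, so Scout's weights alone need about 55 GB and Mistral Large 2's about 62 GB, before adapters, optimizer state, and activations.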

Evaluation

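from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",  # Hugging Face backend; loads the merged checkpoint from disk
    model_args="pretrained=./llama4-scout-coder-merged",
    # Add "humaneval" for code; it executes generated code, so recent
    # lm-eval releases require an explicit opt-in before running it.
    tasks=["mmlu", "gsm8k"],
    batch_size=4,
)

# Metric key names vary by lm-eval version, so print each task's full metric dict
for task in ("mmlu", "gsm8k"):
    print(task, results["results"][task])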

The Bottom Line

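Fine-tuning in 2026 is accessible to anyone with a gaming GPU for models up to roughly 30B parameters, since QLoRA fits their 4-bit weights and adapters in 24 GB. Llama 4 Scout, at 109B total parameters, needs about 55 GB for its 4-bit weights alone, so plan on an 80 GB card rather than an RTX 4090; tools such as Unsloth can shorten training time. The key is high-quality data and the right hyperparameters. Start small (500 examples, rank 16), evaluate thoroughly, and iterate. DoRA is worth trying as well: it aims to close the gap to full fine-tuning at LoRA-level parameter counts.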
