Surya Pratap Singh

Surya Pratap Singh

AI Engineer & Founder

May 27, 2026
14 min read
Fine-Tuning LLMs in 2026: LoRA, QLoRA for Llama 4 and Mistral — Complete Guide
Tutorial

Fine-Tuning LLMs in 2026: LoRA, QLoRA for Llama 4 and Mistral — Complete Guide

Fine-Tuning LLMs in 2026: LoRA, QLoRA for Llama 4 and Mistral — Complete Guide

Fine-tuning has evolved dramatically. In 2026, you can fine-tune a 100B+ parameter model on a single consumer GPU using parameter-efficient techniques like LoRA, QLoRA, and the new DoRA. This guide covers everything you need to ship a custom fine-tuned model.


Why Fine-Tune?

ApproachQualityCostData Needed
Prompt engineeringBaselineFree0
RAGGood + groundedLowDocuments
Fine-tuningExcellentMedium500-5000 examples
Full trainingBestVery highMillions

Fine-tuning excels at:

  • Domain adaptation — medical, legal, financial language
  • Format control — always output JSON, follow specific templates
  • Behavior shaping — tone, style, safety guardrails
  • Performance boost — +5-15% on domain-specific tasks

Parameter-Efficient Fine-Tuning (PEFT) Methods

LoRA (Low-Rank Adaptation)

LoRA adds trainable low-rank matrices to attention layers. Only 0.1-1% of parameters are trained.

# Original: W (dense) = 4096x4096 = 16.8M params # LoRA: A (4096x16) + B (16x4096) = 131K params # That's 0.78% of original!

QLoRA (Quantized LoRA)

QLoRA combines 4-bit quantization with LoRA:

from transformers import BitsAndBytesConfig import torch bnb_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16 )

Memory savings: QLoRA fits a 70B model in 48GB vs 280GB for full fine-tuning.

DoRA (Weight-Decomposed Low-Rank Adaptation)

DoRA (new in 2026) decomposes weights into magnitude and direction:

MethodTrainable ParamsMemory (70B model)Performance vs Full FT
Full FT70B280 GB100%
LoRA0.1-1%140 GB90-95%
QLoRA0.1-1%48 GB88-93%
DoRA0.1-1%48 GB94-98%

Step-by-Step: Fine-Tune Llama 4 with QLoRA

Step 1: Setup

pip install torch transformers accelerate peft trl bitsandbytes unsloth

Step 2: Load Model with QLoRA

import torch from transformers import AutoModelForCausalLM, AutoTokenizer from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training from transformers import BitsAndBytesConfig model_name = "meta-llama/Llama-4-Scout" bnb_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16, ) model = AutoModelForCausalLM.from_pretrained( model_name, quantization_config=bnb_config, device_map="auto", trust_remote_code=True, ) model = prepare_model_for_kbit_training(model)

Step 3: Configure LoRA

lora_config = LoraConfig( r=16, # Rank of LoRA matrices lora_alpha=32, # Scaling factor target_modules=[ "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj" ], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM", ) model = get_peft_model(model, lora_config) model.print_trainable_parameters() # Only 0.12% of total parameters are trainable!

Step 4: Prepare Dataset

from datasets import Dataset # Format: instruction + input + output training_data = [ { "instruction": "Translate this code from Python to TypeScript", "input": "def add(a, b): return a + b", "output": "const add = (a: number, b: number): number => a + b;" }, # ... more examples ] def format_chat(example): return { "text": f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|> {example['instruction']} {example['input']}<|eot_id|> <|start_header_id|>assistant<|end_header_id|> {example['output']}<|eot_id|>""" } dataset = Dataset.from_list(training_data) dataset = dataset.map(format_chat)

Step 5: Train

from trl import SFTTrainer from transformers import TrainingArguments trainer = SFTTrainer( model=model, train_dataset=dataset, args=TrainingArguments( output_dir="./llama4-scout-coder-lora", per_device_train_batch_size=4, gradient_accumulation_steps=4, num_train_epochs=3, learning_rate=2e-4, fp16=True, logging_steps=10, save_steps=100, optim="paged_adamw_8bit", ), tokenizer=tokenizer, max_seq_length=2048, ) trainer.train()

Step 6: Save and Merge

# Save LoRA weights model.save_pretrained("./llama4-scout-coder-lora") # Merge with base model for inference from peft import PeftModel base_model = AutoModelForCausalLM.from_pretrained(model_name) merged_model = PeftModel.from_pretrained(base_model, "./llama4-scout-coder-lora") merged_model = merged_model.merge_and_unload() merged_model.save_pretrained("./llama4-scout-coder-merged")

Fine-Tuning Mistral Large 2

Mistral Large 2 uses a different architecture. Here is the optimized config:

mistral_lora_config = LoraConfig( r=32, # Higher rank for Mistral lora_alpha=64, target_modules=[ "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj" ], lora_dropout=0.1, ) # Mistral requires padding_side="left" for generation tokenizer.padding_side = "left"

Dataset Best Practices

Quality over Quantity

ExamplesQualityExpected Lift
100Low3-5%
500Medium8-12%
2000High12-18%
10000Very High15-25%

Data Quality Checklist

  • Each example has clear instruction + expected output
  • Examples cover edge cases (not just happy path)
  • No conflicting examples (same input, different output)
  • Balanced distribution across categories
  • Validation set (10-20% of total)

Hardware Requirements

ModelMethodMin VRAMRecommended GPU
Llama 4 Scout (109B)QLoRA24 GBRTX 4090
Llama 4 Maverick (200B)QLoRA48 GB2x RTX 4090
Mistral Large 2 (123B)QLoRA48 GB2x RTX 4090
Phi-4 (14B)LoRA12 GBRTX 3060
Gemma 3-27BQLoRA24 GBRTX 4090

Evaluation

from lm_eval import evaluator results = evaluator.simple_evaluate( model="local", model_args=f"pretrained=./llama4-scout-coder-merged", tasks=["mmlu", "human_eval", "gsm8k"], batch_size=4, ) print(f"MMLU: {results['results']['mmlu']['acc']}") print(f"HumanEval: {results['results']['human_eval']['pass@1']}")

The Bottom Line

Fine-tuning in 2026 is accessible to anyone with a gaming GPU. Using QLoRA + Unsloth, you can fine-tune Llama 4 Scout on an RTX 4090 in 3-6 hours. The key is high-quality data and the right hyperparameters. Start small (500 examples, rank 16), evaluate thoroughly, and iterate. DoRA is the emerging best practice — it matches full fine-tuning quality at QLoRA-level memory costs.

Continue Reading