6 Compression Techniques for Language Models

Updated 2025-12-06 · 7 min read

The artificial intelligence landscape has witnessed an explosion in model sizes over recent years. Yet, companies like MistralAI have demonstrated that bigger isn't always better; what truly counts is efficiency relative to performance. As edge computing gains momentum, the industry increasingly demands compact, high-performing models that can operate effectively in resource-constrained environments. Model compression techniques offer the solution. This comprehensive guide explores six fundamental compression strategies, complete with practical code examples.

Understanding Model Compression

Model compression refers to techniques that minimize the footprint of machine learning models while preserving their capabilities. Many deep neural networks suffer from over-parameterization, containing excessive and redundant components that can be eliminated or simplified. Through compression, we reduce parameter counts and memory requirements, leading to faster inference times and improved storage efficiency, critical factors when deploying AI on devices with limited computational resources.
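
To make the footprint argument concrete, here is a rough back-of-the-envelope sketch (using GPT-2 as the reference checkpoint) that estimates how much memory the same weights would occupy at different bit widths:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
num_params = sum(p.numel() for p in model.parameters())

# Approximate in-memory size of the weights alone at different precisions
for bits in (32, 16, 8, 4):
    size_mb = num_params * bits / 8 / 1024**2
    print(f"{bits:>2}-bit weights: ~{size_mb:,.0f} MB for {num_params:,} parameters")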

Six Core Compression Strategies:

  1. Quantization: Lowers numerical precision of weights and activations

  2. Pruning: Eliminates redundant weights or neurons from the network

  3. Knowledge Distillation: Trains compact models to replicate larger models' behavior

  4. Weight Sharing: Enables multiple layers to use common weight sets

  5. Low-Rank Factorization: Decomposes weight matrices into smaller components

  6. Mixed Precision Training: Combines different numerical precisions during training

1. Quantization

Quantization compresses models by reducing the numerical precision used to represent weights and activations. Instead of 32-bit or 16-bit floating-point representations, we can use 8-bit or even 4-bit integers, dramatically reducing memory consumption.

Key Approaches:

  • Weight Quantization: Converts weight precision (e.g., FP32 to INT8), reducing storage requirements

  • Activation Quantization: Compresses activation values, lowering inference memory needs

  • Quantization-Aware Training (QAT): Incorporates quantization during training for better accuracy

  • Post-Training Quantization (PTQ): Applies quantization after training completion
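
For a concrete feel for PTQ before the GPT-2 example below, here is a minimal sketch using PyTorch's built-in dynamic quantization on a toy feed-forward block (this helper operates on nn.Linear modules, so it is not a drop-in replacement for GPT-2's Conv1D projections):

import torch
import torch.nn as nn

# Toy feed-forward block standing in for a model's dense layers
fp32_model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)

# Post-training dynamic quantization: weights become INT8,
# activations are quantized on the fly at inference time
int8_model = torch.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
print(int8_model(x).shape)  # torch.Size([1, 768])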

Implementation Example - 8-bit Quantization with GPT-2:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
 
model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
 
# Load model with 8-bit quantization (requires the bitsandbytes package)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
 
prompt = "Quantization dramatically reduces model size while maintaining performance."
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(quantized_model.device)
 
with torch.no_grad():
    generated = quantized_model.generate(inputs, max_length=50)
 
result = tokenizer.decode(generated[0], skip_special_tokens=True)
print(result)

2. Pruning

Pruning systematically removes unnecessary components from neural networks: individual weights, entire neurons, or complete layers. This technique reduces model complexity while retaining most of the original performance. Pruning can be unstructured (targeting individual weights) or structured (removing entire structural components).

For transformer architectures like GPT-2, attention head pruning is particularly effective, eliminating less critical attention mechanisms.
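
As a quick structured-pruning sketch, Hugging Face models expose a prune_heads helper that physically removes whole attention heads; the layer and head indices below are arbitrary picks for illustration:

from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
print(f"Parameters before: {sum(p.numel() for p in model.parameters()):,}")

# Remove heads 0 and 1 from layer 0 and head 3 from layer 5 (arbitrary choices);
# the corresponding rows and columns are dropped from the projection matrices
model.prune_heads({0: [0, 1], 5: [3]})

print(f"Parameters after:  {sum(p.numel() for p in model.parameters()):,}")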

Implementation Example - Pruning 30% of GPT-2 Weights:

import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.pytorch_utils import Conv1D
 
model_id = "gpt2"
base_model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
 
def apply_pruning(layer, pruning_ratio=0.3):
    """Apply L1 unstructured pruning to the layer's projection weights.

    GPT-2 stores its attention and MLP projections as Conv1D modules rather
    than nn.Linear, so both module types are handled here.
    """
    for component_name, module in layer.named_modules():
        if isinstance(module, (torch.nn.Linear, Conv1D)):
            prune.l1_unstructured(module, name="weight", amount=pruning_ratio)
            # Make the pruning permanent so the zeros appear in the weight tensor
            prune.remove(module, "weight")
            print(f"Applied {pruning_ratio * 100:.0f}% pruning to {component_name}")
 
# Prune all transformer layers
for transformer_layer in base_model.transformer.h:
    apply_pruning(transformer_layer, pruning_ratio=0.3)
 
# Calculate sparsity
total_params = sum(p.numel() for p in base_model.parameters())
zero_params = sum((p.data == 0).sum().item() for p in base_model.parameters())
 
print(f"Parameters: {total_params:,}")
print(f"Zero parameters: {zero_params:,}")
print(f"Sparsity achieved: {zero_params / total_params:.2%}")

3. Knowledge Distillation

Knowledge distillation creates compact models by training them to emulate larger, more complex models. The large model (teacher) guides the training of a smaller model (student), which learns to reproduce the teacher's temperature-softened output distribution alongside the standard next-token objective. The result is a compressed model with performance comparable to its larger counterpart.

Implementation Example - Distilling GPT-2:

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
 
teacher_id = "gpt2"
student_id = "distilgpt2"
 
# Initialize models (GPT-2 and DistilGPT-2 share the same tokenizer and vocabulary)
teacher = AutoModelForCausalLM.from_pretrained(teacher_id).to("cuda")
student = AutoModelForCausalLM.from_pretrained(student_id).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(teacher_id)
 
teacher.eval()
student.train()
 
# Load training data
train_data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
 
# Training configuration
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)
temp = 2.0   # Temperature for softening distributions
alpha = 0.5  # Balance between distillation and language-modeling loss
 
for epoch in range(3):
    for idx, sample in enumerate(train_data):
        text = sample["text"]
        if not text.strip():
            continue
        
        inputs = tokenizer(
            text, return_tensors="pt", truncation=True, max_length=512
        ).to("cuda")
        
        # Get teacher predictions (no gradients needed)
        with torch.no_grad():
            teacher_logits = teacher(**inputs).logits
            soft_targets = F.softmax(teacher_logits / temp, dim=-1)
        
        # Student forward pass; passing labels makes the model compute the
        # shifted causal-LM cross-entropy loss internally
        student_out = student(**inputs, labels=inputs["input_ids"])
        student_logits = student_out.logits
        ce_loss = student_out.loss
        
        # Distillation loss (KL divergence between softened distributions)
        distill_loss = F.kl_div(
            F.log_softmax(student_logits / temp, dim=-1),
            soft_targets,
            reduction="batchmean"
        ) * (temp ** 2)
        
        # Combined loss
        total_loss = alpha * distill_loss + (1 - alpha) * ce_loss
        
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
        
        if idx % 100 == 0:
            print(f"Epoch {epoch + 1}/3, Step {idx}, Loss: {total_loss.item():.4f}")

4. Weight Sharing

Weight sharing compresses models by allowing multiple network components to utilize identical weight sets. By grouping similar weights through clustering algorithms, we significantly reduce the unique values that need to be stored, resulting in a more memory-efficient model.

Implementation Example - Clustering Weights in GPT-2:

import torch
from sklearn.cluster import MiniBatchKMeans
from transformers import GPT2LMHeadModel
 
def compress_via_weight_sharing(model, clusters=16):
    """Replace each weight matrix's values with its nearest cluster center.

    Only 2-D weight matrices are clustered; biases and LayerNorm parameters are
    left as-is. MiniBatchKMeans is used because running full KMeans over
    GPT-2-sized tensors is prohibitively slow.
    """
    for param_name, parameter in model.named_parameters():
        if parameter.requires_grad and parameter.dim() > 1:
            # Flatten weights into a column vector for clustering
            weight_array = parameter.data.cpu().numpy().reshape(-1, 1)
            
            # Cluster the weight values
            clustering = MiniBatchKMeans(n_clusters=clusters, random_state=42)
            labels = clustering.fit_predict(weight_array)
            
            # Replace every weight with its cluster center (vectorized lookup)
            compressed = clustering.cluster_centers_[labels].reshape(
                parameter.data.shape
            )
            
            parameter.data = torch.tensor(
                compressed,
                dtype=parameter.data.dtype,
                device=parameter.data.device,
            )
            print(f"Clustered {param_name} into {clusters} shared values")
    
    return model
 
# Apply weight sharing
model = GPT2LMHeadModel.from_pretrained("gpt2")
compressed_model = compress_via_weight_sharing(model, clusters=16)
print("Weight sharing compression completed!")

5. Low-Rank Factorization

Low-rank factorization decomposes large weight matrices into smaller, low-rank components. By approximating a matrix as the product of two smaller matrices, we reduce the number of parameters while maintaining similar representational capacity. This technique is particularly effective for the dense layers in transformer models.
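
The saving is easy to quantify: an m-by-n weight matrix stores m*n parameters, while a rank-r factorization stores r*(m + n). A quick check for a GPT-2-sized feed-forward projection (3072 x 768) at rank 64:

m, n, r = 3072, 768, 64

full_params = m * n
low_rank_params = r * (m + n)

print(f"Full matrix:     {full_params:,} parameters")    # 2,359,296
print(f"Rank-{r} factors: {low_rank_params:,} parameters")  # 245,760
print(f"Compression:     {full_params / low_rank_params:.1f}x")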

Implementation Example - Singular Value Decomposition (SVD) Factorization:

import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel
 
class LowRankLinear(nn.Module):
    """Replace a linear layer with a rank-r factorization of its weight."""
    def __init__(self, original_layer, rank):
        super().__init__()
        weight = original_layer.weight.data  # shape: (out_features, in_features)
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        
        # Keep only the top-r singular values, splitting sqrt(S) between factors
        self.U = nn.Parameter(U[:, :rank] @ torch.diag(torch.sqrt(S[:rank])))
        self.V = nn.Parameter(torch.diag(torch.sqrt(S[:rank])) @ Vh[:rank, :])
        
        if original_layer.bias is not None:
            self.bias = nn.Parameter(original_layer.bias.data)
        else:
            self.register_parameter('bias', None)
    
    def forward(self, x):
        # x @ (U V)^T == x @ V^T @ U^T
        out = x @ self.V.t() @ self.U.t()
        if self.bias is not None:
            out = out + self.bias
        return out
 
def apply_low_rank_factorization(model, rank=64):
    """Apply low-rank decomposition to nn.Linear layers.

    Note: GPT-2 implements its attention and MLP projections as Conv1D modules,
    so in this model only the lm_head is an nn.Linear; extending the idea to
    Conv1D layers requires transposing the factors.
    """
    # Collect targets first so we don't mutate the module tree while iterating
    linear_layers = [
        (name, module) for name, module in model.named_modules()
        if isinstance(module, nn.Linear)
    ]
    for name, module in linear_layers:
        # Get parent module and attribute name
        *parent_path, attr = name.split('.')
        parent = model
        for p in parent_path:
            parent = getattr(parent, p)
        
        # Replace with low-rank version
        setattr(parent, attr, LowRankLinear(module, rank))
        print(f"Factorized layer: {name}")
    
    return model
 
# Apply factorization
model = GPT2LMHeadModel.from_pretrained("gpt2")
factorized_model = apply_low_rank_factorization(model, rank=64)
print("Low-rank factorization applied!")

6. Mixed Precision Training

Mixed precision training optimizes both training efficiency and model size by using different numerical precisions for different operations. Typically, this involves using 16-bit floating-point (FP16) for most computations while maintaining 32-bit precision (FP32) for critical operations. This approach accelerates training and reduces memory usage without sacrificing model quality.
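
A quick illustration of why the loss scaling used below matters: FP16 cannot represent very small gradient values, so they silently flush to zero unless the loss (and therefore every gradient) is scaled up before the backward pass:

import torch

tiny_grad = torch.tensor(1e-8)

print(tiny_grad.half())            # tensor(0., dtype=torch.float16) -- underflows
print((tiny_grad * 2**16).half())  # representable once scaled by 65536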

Implementation Example - Training with Automatic Mixed Precision:

import torch
from transformers import (
    GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset
 
# Model and tokenizer setup
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
 
# Prepare dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1000]")
 
def tokenize_function(examples):
    return tokenizer(
        examples["text"], 
        truncation=True, 
        padding="max_length", 
        max_length=128
    )
 
tokenized_dataset = dataset.map(tokenize_function, batched=True)
 
# Collator turns input_ids into labels for causal language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
 
# Training arguments with mixed precision
training_args = TrainingArguments(
    output_dir="./mixed_precision_model",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    fp16=True,  # Enable mixed precision training
    logging_steps=100,
    save_steps=500,
)
 
# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)
 
# Train with mixed precision
trainer.train()
print("Mixed precision training completed!")
 
# Alternative: Manual mixed precision with torch.cuda.amp
from torch.cuda.amp import autocast, GradScaler
 
model = GPT2LMHeadModel.from_pretrained("gpt2").to("cuda")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scaler = GradScaler()
 
for epoch in range(1):
    for sample in dataset:
        text = sample["text"]
        if not text.strip():
            continue  # skip empty WikiText lines
        
        inputs = tokenizer(
            text, return_tensors="pt", truncation=True, max_length=128
        ).to("cuda")
        
        # Run the forward pass in mixed precision
        with autocast():
            outputs = model(**inputs, labels=inputs["input_ids"])
            loss = outputs.loss
        
        # Scale the loss to avoid FP16 gradient underflow, then unscale and step
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
 
print("Manual mixed precision training completed!")

Conclusion

This article has covered six essential techniques for compressing large language models: quantization, pruning, knowledge distillation, weight sharing, low-rank factorization, and mixed precision training. While not exhaustive, these methods provide a robust toolkit for deploying efficient AI systems, particularly in edge computing and other resource-limited scenarios.

By combining multiple techniques, practitioners can achieve significant compression ratios while maintaining acceptable performance levels. With the right GPU infrastructure from providers like Spheron AI, you can experiment with these techniques efficiently and deploy advanced language models across a wider range of environments, from cloud servers to edge devices.

The future of AI deployment lies not just in building larger models, but in making powerful models accessible and efficient for real-world applications. Model compression is the key to unlocking that future.
