Understanding FLOPs, MFU, and Computational Efficiency in LLM Training: From Dense Transformers to MoE Architectures
As Large Language Models (LLMs) continue to scale exponentially, understanding their computational requirements becomes crucial for researchers and practitioners alike. Whether you’re planning a training run, optimizing infrastructure costs, or comparing different architectures, FLOPs (Floating Point Operations) and MFU (Model FLOPs Utilization) are essential metrics that provide hardware-independent ways to measure and optimize computational efficiency.
This post builds upon two excellent foundational resources: Adam Casson’s comprehensive guide to Transformer FLOPs and Pratish Raj’s practical guide to FLOPs in LLM training. While these posts cover dense Transformer architectures, this post extends their insights to modern Mixture-of-Experts (MoE) architectures, with a particular focus on OpenAI’s recently released GPT-OSS models.
In this comprehensive analysis, we’ll explore FLOP counting methodologies, dive deep into MFU calculations, and examine how these concepts apply to the latest generation of sparse models that are reshaping the efficiency landscape of large language models.
🛠️ Practical Tools: Want to calculate FLOPs for your own models? Check out the FLOP Calculator - an interactive tool that implements all the methodologies discussed in this post for both dense and MoE architectures.
🔢 What Are FLOPs and Why Do They Matter?
Defining FLOPs
A FLOP (Floating Point Operation) represents a single computational operation, such as an addition (3.8 + 4.1) or a multiplication (1.2 × 4.6). When we talk about FLOPs (plural), we’re counting the total number of these atomic operations required for a specific task.
For large-scale computations, we use:
- GFLOPs = 1 billion FLOPs (10⁹)
- TFLOPs = 1 trillion FLOPs (10¹²)
- PFLOPs = 1 quadrillion FLOPs (10¹⁵)
Why FLOPs Matter
FLOPs provide several crucial advantages:
- Hardware Independence: Unlike wall-clock time, FLOPs offer consistent measurements across different hardware configurations
- Reproducibility: Enable precise comparisons between different models and training setups
- Cost Estimation: Help predict computational costs and resource requirements
- Efficiency Analysis: Allow measurement of how well we utilize available hardware
📐 Counting FLOPs in Dense Transformers
The OpenAI Method
The seminal approach from OpenAI’s scaling laws paper [1] provides a clean approximation:
FLOPs per token ≈ 6N
Where N is the number of non-embedding parameters. This factor of 6 accounts for:
- 2× for the forward pass (multiply-accumulate operations)
- 2× for the backward pass (gradients with respect to inputs)
- 2× for the backward pass (gradients with respect to parameters)
This approximation, introduced by Kaplan et al. in their foundational work on neural language model scaling laws, provides a hardware-independent way to estimate computational requirements [1]. The elegance of this formula lies in its simplicity while capturing the essential computational structure of Transformer training.
Let’s break down the forward pass components:
Operation | Parameters (per layer) | Forward FLOPs per Token (all L layers) |
---|---|---|
Attention QKV | 3 × d × d_model | 6 × L × d_model × d |
Attention Scores | - | 4 × L × seq_len × d |
Attention Project | d × d_model | 2 × L × d × d_model |
Feedforward | 8 × d_model² | 16 × L × d_model² |
Total (approx) | - | ≈ 2 × N |
Where:
- L = number of layers
- seq_len = sequence length
- d_model = hidden dimension
- d = attention dimension (equal to d_model in most implementations)
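As a quick sanity check, here is a minimal sketch of the per-token forward-pass accounting from the table above, assuming d = d_model and a 4× feedforward expansion (the GPT-3-sized configuration used below is illustrative):

def forward_flops_per_token(n_layers, d_model, seq_len):
    """Per-token forward FLOPs following the table above (d = d_model, 4x FFN)."""
    attn_qkv = 6 * n_layers * d_model * d_model
    attn_scores = 4 * n_layers * seq_len * d_model
    attn_project = 2 * n_layers * d_model * d_model
    feedforward = 16 * n_layers * d_model**2
    return attn_qkv + attn_scores + attn_project + feedforward

# Illustrative GPT-3-sized configuration (96 layers, d_model 12288, 2048 context)
flops = forward_flops_per_token(n_layers=96, d_model=12288, seq_len=2048)
print(f"{flops:.2e} forward FLOPs per token")   # ≈ 3.6e11, close to 2N ≈ 3.5e11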
The DeepMind Method
DeepMind’s Chinchilla paper provides a more detailed accounting that includes embeddings, logits, and attention mechanics:
def deepmind_flops_per_sequence(n_layers, n_heads, d_model, seq_len, vocab_size, ff_ratio=4):
    """DeepMind method for forward pass FLOPs counting"""
    d_attn = d_model // n_heads
    d_ff = d_model * ff_ratio

    # Components
    embeddings = 2 * seq_len * vocab_size * d_model

    attn_qkv = 2 * seq_len * 3 * d_model * (d_attn * n_heads)
    attn_logits = 2 * seq_len * seq_len * (d_attn * n_heads)
    attn_softmax = 3 * n_heads * seq_len * seq_len
    attn_reduce = 2 * seq_len * seq_len * (d_attn * n_heads)
    attn_project = 2 * seq_len * (d_attn * n_heads) * d_model
    total_attn = attn_qkv + attn_logits + attn_softmax + attn_reduce + attn_project

    feedforward = 2 * seq_len * (d_model * d_ff + d_model * d_ff)
    logits = 2 * seq_len * d_model * vocab_size

    return embeddings + n_layers * (total_attn + feedforward) + logits
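For instance, applying this to a GPT-2-small-sized configuration (the 12-layer, 768-dimensional, 50,257-token-vocabulary setup below is used purely as an illustration):

# GPT-2-small-like configuration (illustrative)
seq_flops = deepmind_flops_per_sequence(
    n_layers=12, n_heads=12, d_model=768, seq_len=1024, vocab_size=50257
)
print(f"{seq_flops:.2e} forward FLOPs per 1024-token sequence")   # ≈ 3.7e11
print(f"{seq_flops / 1024:.2e} forward FLOPs per token")          # ≈ 3.6e8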
Practical Example: GPT-3 Scale Model
Let’s calculate FLOPs for a GPT-3 scale model:
- Parameters: 175B non-embedding
- Sequence length: 2048
- Batch size: 32
# Using the OpenAI approximation (the 6N factor already includes forward + backward)
flops_per_token = 6 * 175e9                          # 1.05 × 10^12 FLOPs
tokens_per_step = 32 * 2048                          # 65,536 tokens
flops_per_step = flops_per_token * tokens_per_step   # ≈ 6.9 × 10^16 FLOPs ≈ 69 PFLOPs per step
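Scaling this up to a full run puts the numbers in perspective. A rough sketch with assumed numbers (GPT-3 was reported to train on about 300B tokens; the cluster size and 40%-of-peak utilization below are illustrative assumptions):

# Full-run estimate: 300B training tokens, 1,024 A100s sustaining 40% of peak bf16
flops_per_token = 6 * 175e9                                # 1.05 × 10^12 FLOPs
training_tokens = 300e9
total_training_flops = flops_per_token * training_tokens   # ≈ 3.15 × 10^23 FLOPs

sustained_flops = 1024 * 312e12 * 0.40                     # ≈ 1.28 × 10^17 FLOP/s
days = total_training_flops / sustained_flops / 86400
print(f"{total_training_flops:.2e} FLOPs ≈ {days:.0f} days on 1,024 A100s at 40% of peak")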
⚡ Model FLOPs Utilization (MFU): The Gold Standard
Understanding MFU
Model FLOPs Utilization (MFU), introduced in Google’s PaLM paper [3], measures how efficiently we execute the theoretically necessary FLOPs for training:
MFU = (Model FLOPs × Throughput) / Peak Hardware FLOPs
Where:
- Model FLOPs: Theoretical FLOPs per token (using OpenAI’s 6N approximation)
- Throughput: Observed tokens processed per second
- Peak Hardware FLOPs: Theoretical maximum FLOP/s of your hardware
MFU vs. Hardware FLOPs Utilization (HFU)
Metric | MFU | HFU |
---|---|---|
Includes | Only necessary model computations | All computations (including overheads) |
Use Case | Fair comparison across setups | Implementation efficiency |
Affected by | Model architecture, batch size | Memory management, communication |
Calculating MFU: A Practical Example
Consider training a 7B parameter model on 8×A100 GPUs:
# Model specifications
parameters = 7e9
flops_per_token = 6 * parameters # 4.2 × 10^10
# Hardware specifications
a100_peak_flops = 312e12 # 312 TFLOPs for bf16
total_peak_flops = 8 * a100_peak_flops # 2.496 × 10^15
# Measured throughput
tokens_per_second = 8000
# Calculate MFU
sustained_flops = flops_per_token * tokens_per_second # 3.36 × 10^14
mfu = sustained_flops / total_peak_flops # 0.135 = 13.5%
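To see how HFU differs, suppose the same run used full activation recomputation. The hardware would then re-execute roughly one extra forward pass per token (≈ 8N instead of 6N, a commonly used approximation), which HFU counts but MFU does not. A minimal sketch reusing the numbers above:

# Same 7B / 8×A100 run as above, now assuming full activation recomputation
model_flops_per_token = 6 * parameters       # what MFU credits (necessary FLOPs)
hardware_flops_per_token = 8 * parameters    # forward pass recomputed during backward

mfu = model_flops_per_token * tokens_per_second / total_peak_flops      # ≈ 13.5%
hfu = hardware_flops_per_token * tokens_per_second / total_peak_flops   # ≈ 17.9%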
Typical MFU Ranges
Real-world MFU values vary significantly:
- Small models (< 1B): 10-30%
- Medium models (1B-10B): 30-50%
- Large models (10B-100B): 45-65%
- Very large models (100B+): 50-70%
Higher MFU in larger models is often due to better compute-to-communication ratios and improved memory bandwidth utilization.
🧠 Extending to Mixture-of-Experts (MoE) Architectures
MoE Fundamentals
Mixture-of-Experts architectures replace dense feedforward networks with a router and multiple expert networks. Only a subset of experts (typically 1-2 out of 8-64) are activated per token, dramatically reducing computational costs while maintaining or improving model quality.
MoE FLOP Counting Challenges
Traditional FLOP counting becomes more nuanced with MoE:
- Routing Overhead: Additional FLOPs for expert selection
- Variable Computation: Different tokens may use different experts
- Load Balancing: Uneven expert utilization affects total FLOPs
- Communication Costs: Expert routing across devices
MoE FLOP Formula
For a MoE layer replacing a dense feedforward network:
def moe_flops_per_token(d_model, num_experts, experts_per_token, d_ff, load_balance_factor=1.1):
    """Calculate forward FLOPs per token for a MoE feedforward layer"""
    # Router FLOPs (token → expert probabilities)
    router_flops = 2 * d_model * num_experts
    # Expert FLOPs (only the active experts are computed)
    expert_flops = experts_per_token * (2 * d_model * d_ff + 2 * d_ff * d_model)
    # Load balancing overhead
    effective_expert_flops = expert_flops * load_balance_factor
    return router_flops + effective_expert_flops

# Example: 8 experts, top-2 routing, each expert with a 4× expansion
d_model = 4096
# Dense baseline with the same *total* parameter count as all 8 experts
dense_equivalent_flops = 8 * 16 * d_model**2
moe_ff_flops = moe_flops_per_token(d_model, 8, 2, 4 * d_model)
efficiency_gain = dense_equivalent_flops / moe_ff_flops  # ≈ 3.6× here; typically 2-4×
🚀 Case Study: OpenAI GPT-OSS Models
OpenAI’s GPT-OSS models provide an excellent real-world example of modern MoE architectures with several innovative features:
GPT-OSS Architecture Overview
GPT-OSS-120B:
- Total Parameters: 117B (5.1B active per token)
- Architecture: MoE with native MXFP4 quantization
- Activation: ~4.4% of total parameters per token
- Memory: Fits on single H100 (80GB)
GPT-OSS-20B:
- Total Parameters: 21B (3.6B active per token)
- Activation: ~17% of total parameters per token
- Memory: Runs in 16GB
FLOP Counting for GPT-OSS
Let’s calculate FLOPs for GPT-OSS-120B:
def gpt_oss_flops_calculation():
    # GPT-OSS-120B specifications (estimated)
    total_params = 117e9
    active_params_per_token = 5.1e9

    # Traditional approach (if every parameter were active, i.e. a dense model)
    dense_flops_per_token = 6 * total_params  # 702 GFLOPs

    # MoE approach (actual): only the active parameters contribute compute
    # Attention layers remain dense and are part of the active parameter count
    attention_params = active_params_per_token * 0.3  # Estimated ~30% of active params
    attention_flops = 6 * attention_params

    # Active MoE expert parameters (the remainder of the active parameters)
    moe_params_per_token = active_params_per_token - attention_params
    moe_flops = 6 * moe_params_per_token

    # Router overhead (small relative to expert compute)
    router_flops = 6 * active_params_per_token * 0.001  # Estimated ~0.1%

    total_flops_per_token = attention_flops + moe_flops + router_flops

    return {
        'dense_equivalent': dense_flops_per_token,
        'actual_moe': total_flops_per_token,
        'efficiency_gain': dense_flops_per_token / total_flops_per_token
    }
results = gpt_oss_flops_calculation()
print(f"Dense equivalent: {results['dense_equivalent']:.1e} FLOPs/token")
print(f"MoE actual: {results['actual_moe']:.1e} FLOPs/token")
print(f"Efficiency gain: {results['efficiency_gain']:.1f}x")
MXFP4 Quantization Impact
GPT-OSS uses native MXFP4 quantization for MoE layers:
- Memory: 4-bit storage vs 16-bit (4× reduction)
- Compute: Specialized kernels maintain computational efficiency
- Accuracy: Minimal degradation with proper scaling
This affects FLOP counting considerations:
def quantized_moe_flops(base_flops, quantization_overhead=1.05):
    """
    MXFP4 might add a slight computational overhead
    but brings significant memory-bandwidth benefits.
    """
    return base_flops * quantization_overhead
Performance Analysis
Based on the GPT-OSS specifications, we can estimate performance characteristics:
# Estimated GPT-OSS-120B on a single H100
flops_per_token = 6 * 5.1e9      # 30.6 GFLOPs per training token (6N on active parameters)
h100_peak = 1979e12              # ~1.98 PFLOP/s dense FP8 tensor-core peak
target_throughput = 100          # tokens/second (estimated)

required_flops_rate = flops_per_token * target_throughput   # 3.06 TFLOP/s
mfu = required_flops_rate / h100_peak                       # ≈ 0.15%

print(f"Estimated MFU for GPT-OSS-120B: {mfu*100:.2f}%")
Note: These are rough estimates. Actual performance depends on implementation details, memory bandwidth, and other factors.
🔧 Practical Considerations and Optimizations
Memory vs. Compute Trade-offs
Modern LLM training involves several techniques that affect FLOP calculations:
Activation Checkpointing:
def checkpointing_flop_overhead(train_flops, recompute_fraction=1/3):
    """
    Activation checkpointing re-runs (part of) the forward pass during the
    backward pass. Full recomputation adds roughly one extra forward pass,
    i.e. ~1/3 of the 6N training FLOPs (≈ 8N instead of 6N per token).
    recompute_fraction: share of training FLOPs that must be recomputed
    """
    recompute_flops = train_flops * recompute_fraction
    return train_flops + recompute_flops  # up to ≈ 1.33× FLOPs with full recomputation
Gradient Accumulation:
- Doesn’t affect per-token FLOPs
- May affect MFU due to different memory access patterns (a short sketch follows below)
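A minimal sketch with assumed numbers: accumulation changes how many micro-batches feed one optimizer step, not what each token costs, so MFU moves only if the measured token throughput moves.

# Hypothetical numbers: 7B model on 8×A100, with and without gradient accumulation
params = 7e9
flops_per_token = 6 * params                    # unchanged by accumulation

micro_batch, accum_steps, seq_len = 4, 8, 2048
tokens_per_optimizer_step = micro_batch * accum_steps * seq_len   # 65,536

peak_flops = 8 * 312e12
for tokens_per_second in (8000, 7800):          # e.g. without vs. with accumulation
    mfu = flops_per_token * tokens_per_second / peak_flops
    print(f"{tokens_per_second} tok/s -> MFU = {mfu:.1%}")        # 13.5% vs 13.1%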
Scaling Laws and FLOP-Optimal Training
The relationship between model size, dataset size, and computational budget follows predictable scaling laws:
def compute_optimal_scaling(compute_budget_flops, tokens_per_param=20):
    """
    Based on Chinchilla scaling laws: C ≈ 6 * N * D with D ≈ 20 * N,
    so parameters ∝ compute^0.5 and tokens ∝ compute^0.5.
    """
    optimal_params = (compute_budget_flops / (6 * tokens_per_param)) ** 0.5
    optimal_tokens = tokens_per_param * optimal_params
    return optimal_params, optimal_tokens

# Example: 10^23 FLOPs budget
params, tokens = compute_optimal_scaling(1e23)
print(f"Optimal: {params/1e9:.0f}B parameters, {tokens/1e9:.0f}B tokens")
# ≈ 29B parameters trained on ≈ 577B tokens
📊 Benchmarking and Measurement Tools
Measuring MFU in Practice
import time
import torch
def measure_mfu(model, batch_size, seq_len, num_steps=100):
    """Measure MFU during actual training (forward + backward only)"""
    device = next(model.parameters()).device
    # Create a dummy batch; a causal LM needs labels to return a loss
    input_ids = torch.randint(0, 50000, (batch_size, seq_len), device=device)

    # Warm up
    for _ in range(10):
        loss = model(input_ids, labels=input_ids).loss
        loss.backward()
        model.zero_grad(set_to_none=True)

    torch.cuda.synchronize()
    start_time = time.time()
    for _ in range(num_steps):
        loss = model(input_ids, labels=input_ids).loss
        loss.backward()
        model.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    end_time = time.time()

    # Calculate throughput
    total_tokens = num_steps * batch_size * seq_len
    throughput = total_tokens / (end_time - start_time)

    # Calculate MFU with the 6N approximation
    flops_per_token = 6 * model.num_parameters()
    peak_flops = get_device_peak_flops(device)  # per-device peak; see the sketch below
    mfu = (flops_per_token * throughput) / peak_flops
    return mfu, throughput
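Note that get_device_peak_flops is not a standard PyTorch API; a minimal sketch might simply look up the advertised dense tensor-core peak for the detected GPU (the values below are the commonly quoted datasheet numbers, bf16 unless noted):

def get_device_peak_flops(device):
    """Rough lookup of peak dense tensor-core FLOP/s by GPU name."""
    name = torch.cuda.get_device_name(device)
    peak_flops_by_gpu = {
        "H100": 989e12,   # H100 SXM, dense bf16
        "A100": 312e12,   # A100, dense bf16
        "V100": 125e12,   # V100, fp16 tensor cores (no bf16)
    }
    for key, flops in peak_flops_by_gpu.items():
        if key in name:
            return flops
    raise ValueError(f"Unknown GPU '{name}'; add its peak FLOP/s to the table")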
Interactive FLOP Calculator
For hands-on experimentation with these concepts, check out the FLOP Calculator repository. This interactive tool implements:
- Dense Transformer FLOP calculations (OpenAI and DeepMind methods)
- MoE architecture FLOP counting with customizable parameters
- MFU measurement and comparison across different hardware
- Visualization of computational efficiency trade-offs
- Support for custom model configurations
The calculator makes it easy to experiment with different architectural choices and understand their computational implications before committing to expensive training runs.
Profiling Tools
- PyTorch Profiler: Built-in FLOP counting (see the sketch below)
- DeepSpeed: MFU reporting in training logs
- Weights & Biases: Integration with hardware metrics
- Custom counters: Framework-specific implementations
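As an example of the first item, PyTorch's profiler can attach approximate FLOP counts to operators it recognizes (mainly matmuls and convolutions) via with_flops=True. A minimal sketch, assuming a model and input batch are already on the GPU:

import torch
from torch.profiler import profile, ProfilerActivity

def profile_forward_flops(model, input_ids):
    """Print per-operator FLOP estimates for a single forward pass."""
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        with_flops=True,
    ) as prof:
        with torch.no_grad():
            model(input_ids)
    # The FLOPs column is only populated for ops the profiler can attribute
    print(prof.key_averages().table(sort_by="flops", row_limit=10))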
🔮 Future Directions and Emerging Architectures
Beyond Traditional FLOPs
As architectures evolve, FLOP counting methodologies must adapt:
Sparse Attention Patterns (a rough FLOP comparison follows below):
- Linear attention: O(n) instead of O(n²)
- Local attention: Reduced sequence length dependencies
- Mixture of attention heads: Different patterns per head
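A rough sketch of why the attention pattern matters, using the score/reduce term from the dense table earlier (≈ 4 × seq_len × context × d per layer); the sequence length, width, and window size below are illustrative assumptions:

def attention_score_flops(seq_len, d, window=None):
    """Per-layer FLOPs for QK^T plus the weighted value sum (projections and softmax ignored)."""
    context = seq_len if window is None else min(window, seq_len)
    return 4 * seq_len * context * d

full_attn = attention_score_flops(seq_len=8192, d=4096)               # quadratic in seq_len
local_attn = attention_score_flops(seq_len=8192, d=4096, window=256)  # linear in seq_len
print(f"Full vs. local attention: {full_attn / local_attn:.0f}x more score FLOPs")   # 32x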
Conditional Computation:
- Early exit mechanisms
- Adaptive depth networks
- Token-wise routing
Hardware-Aware Metrics:
- Memory bandwidth utilization
- Integer operation efficiency
- Specialized accelerator metrics
The Road Ahead
The future of computational efficiency in LLMs likely involves:
- Multi-modal MoE: Extending sparse computation to different modalities
- Dynamic architectures: Runtime adaptation based on input complexity
- Hardware co-design: Models designed for specific accelerators
- Hybrid precision: Strategic use of different numerical formats
🏁 Conclusion
Understanding FLOPs and MFU is crucial for anyone working with large-scale language models. As we’ve seen:
- FLOPs provide hardware-independent measurement of computational requirements
- MFU enables fair comparison of training efficiency across different setups
- MoE architectures complicate but don’t invalidate these fundamental concepts
- Modern models like GPT-OSS showcase how innovative architectures and quantization can dramatically improve efficiency
Whether you’re planning a training run, optimizing an existing system, or designing new architectures, these metrics provide essential insights into the computational reality of modern AI systems.
As models continue to scale and new architectures emerge, the principles of FLOP counting and efficiency measurement will remain foundational tools for understanding and optimizing the computational landscape of artificial intelligence.
🙏 Acknowledgments
This blog post draws heavily from two excellent resources that provided the foundation for understanding FLOPs in LLM training. I want to give full credit to these authors for their pioneering work:
Primary Inspirations:
- Adam Casson’s “Transformer FLOPs” - An exceptionally clear and comprehensive guide that masterfully explains FLOP counting methodologies, MFU calculations, and scaling behaviors. Much of the mathematical foundation and practical examples in this post are built upon Adam’s excellent work.
- Pratish Raj’s “FLOPs in LLM Training: The Ultimate Guide” - A practical and accessible guide that bridges the gap between theory and implementation, providing clear examples and real-world context for FLOP calculations.
Both posts were instrumental in shaping my understanding of computational efficiency in LLMs. This work extends their insights to MoE architectures and provides additional analysis of modern models like GPT-OSS, but the core concepts and methodologies owe much to these foundational resources.
🛠️ Tools and Resources
Interactive Calculator:
- FLOP Calculator: Open-source tool for calculating FLOPs and MFU for both dense and MoE architectures
📚 References and Further Reading
Foundational Papers:
- [1] OpenAI Scaling Laws: Kaplan, J., McCandlish, S., Henighan, T., et al. “Scaling Laws for Neural Language Models” (2020) - Original FLOP counting methodology and the 6N approximation
- [2] DeepMind Chinchilla: Hoffmann, J., Borgeaud, S., Mensch, A., et al. “Training Compute-Optimal Large Language Models” (2022) - Detailed FLOP accounting and compute-optimal training
- [3] Google PaLM: Chowdhery, A., Narang, S., Devlin, J., et al. “PaLM: Scaling Language Modeling with Pathways” (2022) - MFU methodology introduction
Key Blog Posts and Resources:
- Adam Casson’s Transformer FLOPs: Comprehensive FLOP analysis with interactive calculator
- Pratish Raj’s FLOPs Guide: Practical guide to FLOPs in LLM training
- GPT-OSS Repository: OpenAI’s open-source MoE implementation
Additional Technical Resources:
- Transformer Inference Arithmetic: Kipp Bradford’s analysis of inference costs
- Making Deep Learning Go Brrrr: Horace He’s guide to optimization fundamentals
Citation:
@article{paul2025_flops_mfu_moe,
  title={Understanding FLOPs, MFU, and Computational Efficiency in LLM Training: From Dense Transformers to MoE Architectures},
  author={Debjit Paul},
  year={2025},
  url={https://debjitpaul.github.io/blog/}
}
This analysis provides a comprehensive overview of computational efficiency in modern LLM training. For specific implementation details or model-specific calculations, always refer to the original papers and codebases.