Understanding FLOPs, MFU, and Computational Efficiency in LLM Training: From Dense Transformers to MoE Architectures
As Large Language Models (LLMs) continue to scale exponentially, understanding their computational requirements becomes crucial for researchers and practitioners alike. Whether you’re planning a training run, optimizing infrastructure costs, or comparing different architectures, FLOPs (Floating Point Operations) and MFU (Model FLOPs Utilization) are essential metrics that provide hardware-independent ways to measure and optimize computational efficiency.
This post builds upon two excellent foundational resources: Adam Casson’s comprehensive guide to Transformer FLOPs and Pratish Raj’s practical guide to FLOPs in LLM training. While these posts cover dense Transformer architectures, this post extends their insights to modern Mixture-of-Experts (MoE) architectures, with a particular focus on OpenAI’s recently released GPT-OSS models.
In this comprehensive analysis, we’ll explore FLOP counting methodologies, dive deep into MFU calculations, and examine how these concepts apply to the latest generation of sparse models that are reshaping the efficiency landscape of large language models.
🛠️ Practical Tools: Want to calculate FLOPs for your own models? Check out the FLOP Calculator - an interactive tool that implements all the methodologies discussed in this post for both dense and MoE architectures.
🔢 What Are FLOPs and Why Do They Matter?
Defining FLOPs
A FLOP (Floating Point Operation) represents a single computational operation, such as an addition (3.8 + 4.1) or a multiplication (1.2 × 4.6). When we talk about FLOPs (plural), we’re counting the total number of these atomic operations required for a specific task.
For large-scale computations, we use:
- GFLOPs = 1 billion FLOPs (10⁹)
- TFLOPs = 1 trillion FLOPs (10¹²)
- PFLOPs = 1 quadrillion FLOPs (10¹⁵)
Why FLOPs Matter
FLOPs provide several crucial advantages:
- Hardware Independence: Unlike wall-clock time, FLOPs offer consistent measurements across different hardware configurations
- Reproducibility: Enable precise comparisons between different models and training setups
- Cost Estimation: Help predict computational costs and resource requirements
- Efficiency Analysis: Allow measurement of how well we utilize available hardware
📐 Counting FLOPs in Dense Transformers
The OpenAI Method
The seminal approach from OpenAI’s scaling laws paper [1] provides a clean approximation:
FLOPs per token ≈ 6N
Where N is the number of non-embedding parameters. This factor of 6 accounts for:
- 2× for the forward pass (multiply-accumulate operations)
- 2× for the backward pass (gradients with respect to inputs)
- 2× for the backward pass (gradients with respect to parameters)
This approximation, introduced by Kaplan et al. in their foundational work on neural language model scaling laws, provides a hardware-independent way to estimate computational requirements [1]. The elegance of this formula lies in its simplicity while capturing the essential computational structure of Transformer training.
Let’s break down the forward pass components:
Operation | Parameters (per layer) | Forward FLOPs per Token (all L layers) |
---|---|---|
Attention QKV | 3 × d × d_model | 6 × L × d_model × d |
Attention Scores | - | 4 × L × seq_len × d |
Attention Project | d × d_model | 2 × L × d × d_model |
Feedforward | 8 × d_model² | 16 × L × d_model² |
Total (approx) | - | ≈ 2 × N |
Where:
- L = number of layers
- seq_len = sequence length
- d_model = hidden dimension
- d = attention dimension (equal to d_model in most implementations)
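As a quick sanity check, here is a minimal sketch of the per-token forward-pass accounting from the table above, assuming d = d_model and a 4× feedforward expansion (the GPT-3-sized configuration used below is illustrative):

def forward_flops_per_token(n_layers, d_model, seq_len):
    """Per-token forward FLOPs following the table above (d = d_model, 4x FFN)."""
    attn_qkv = 6 * n_layers * d_model * d_model
    attn_scores = 4 * n_layers * seq_len * d_model
    attn_project = 2 * n_layers * d_model * d_model
    feedforward = 16 * n_layers * d_model**2
    return attn_qkv + attn_scores + attn_project + feedforward

# Illustrative GPT-3-sized configuration (96 layers, d_model 12288, 2048 context)
flops = forward_flops_per_token(n_layers=96, d_model=12288, seq_len=2048)
print(f"{flops:.2e} forward FLOPs per token")   # ≈ 3.6e11, close to 2N ≈ 3.5e11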
The DeepMind Method
DeepMind’s Chinchilla paper provides a more detailed accounting that includes embeddings, logits, and attention mechanics:
def deepmind_flops_per_sequence(n_layers, n_heads, d_model, seq_len, vocab_size, ff_ratio=4):
    """DeepMind method for forward pass FLOPs counting"""
    d_attn = d_model // n_heads
    d_ff = d_model * ff_ratio

    # Components
    embeddings = 2 * seq_len * vocab_size * d_model

    attn_qkv = 2 * seq_len * 3 * d_model * (d_attn * n_heads)
    attn_logits = 2 * seq_len * seq_len * (d_attn * n_heads)
    attn_softmax = 3 * n_heads * seq_len * seq_len
    attn_reduce = 2 * seq_len * seq_len * (d_attn * n_heads)
    attn_project = 2 * seq_len * (d_attn * n_heads) * d_model
    total_attn = attn_qkv + attn_logits + attn_softmax + attn_reduce + attn_project

    feedforward = 2 * seq_len * (d_model * d_ff + d_model * d_ff)
    logits = 2 * seq_len * d_model * vocab_size

    return embeddings + n_layers * (total_attn + feedforward) + logits
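For instance, applying this to a GPT-2-small-sized configuration (the 12-layer, 768-dimensional, 50,257-token-vocabulary setup below is used purely as an illustration):

# GPT-2-small-like configuration (illustrative)
seq_flops = deepmind_flops_per_sequence(
    n_layers=12, n_heads=12, d_model=768, seq_len=1024, vocab_size=50257
)
print(f"{seq_flops:.2e} forward FLOPs per 1024-token sequence")   # ≈ 3.7e11
print(f"{seq_flops / 1024:.2e} forward FLOPs per token")          # ≈ 3.6e8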
Practical Example: GPT-3 Scale Model
Let’s calculate FLOPs for a GPT-3 scale model:
- Parameters: 175B non-embedding
- Sequence length: 2048
- Batch size: 32
# Using the OpenAI approximation (the 6N factor already includes forward + backward)
flops_per_token = 6 * 175e9                          # 1.05 × 10^12 FLOPs
tokens_per_step = 32 * 2048                          # 65,536 tokens
flops_per_step = flops_per_token * tokens_per_step   # ≈ 6.9 × 10^16 FLOPs ≈ 69 PFLOPs per step
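Scaling this up to a full run puts the numbers in perspective. A rough sketch with assumed numbers (GPT-3 was reported to train on about 300B tokens; the cluster size and 40%-of-peak utilization below are illustrative assumptions):

# Full-run estimate: 300B training tokens, 1,024 A100s sustaining 40% of peak bf16
flops_per_token = 6 * 175e9                                # 1.05 × 10^12 FLOPs
training_tokens = 300e9
total_training_flops = flops_per_token * training_tokens   # ≈ 3.15 × 10^23 FLOPs

sustained_flops = 1024 * 312e12 * 0.40                     # ≈ 1.28 × 10^17 FLOP/s
days = total_training_flops / sustained_flops / 86400
print(f"{total_training_flops:.2e} FLOPs ≈ {days:.0f} days on 1,024 A100s at 40% of peak")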
⚡ Model FLOPs Utilization (MFU): The Gold Standard
Understanding MFU
Model FLOPs Utilization (MFU), introduced in Google’s PaLM paper [3], measures how efficiently we execute the theoretically necessary FLOPs for training:
MFU = (Model FLOPs × Throughput) / Peak Hardware FLOPs
Where:
- Model FLOPs: Theoretical FLOPs per token (using OpenAI’s 6N approximation)
- Throughput: Observed tokens processed per second
- Peak Hardware FLOPs: Theoretical maximum FLOP/s of your hardware
MFU vs. Hardware FLOPs Utilization (HFU)
Metric | MFU | HFU |
---|---|---|
Includes | Only necessary model computations | All computations (including overheads) |
Use Case | Fair comparison across setups | Implementation efficiency |
Affected by | Model architecture, batch size | Memory management, communication |
Calculating MFU: A Practical Example
Consider training a 7B parameter model on 8×A100 GPUs:
# Model specifications
parameters = 7e9
flops_per_token = 6 * parameters # 4.2 × 10^10
# Hardware specifications
a100_peak_flops = 312e12 # 312 TFLOPs for bf16
total_peak_flops = 8 * a100_peak_flops # 2.496 × 10^15
# Measured throughput
tokens_per_second = 8000
# Calculate MFU
sustained_flops = flops_per_token * tokens_per_second # 3.36 × 10^14
mfu = sustained_flops / total_peak_flops # 0.135 = 13.5%
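To see how HFU differs, suppose the same run used full activation recomputation. The hardware would then re-execute roughly one extra forward pass per token (≈ 8N instead of 6N, a commonly used approximation), which HFU counts but MFU does not. A minimal sketch reusing the numbers above:

# Same 7B / 8×A100 run as above, now assuming full activation recomputation
model_flops_per_token = 6 * parameters       # what MFU credits (necessary FLOPs)
hardware_flops_per_token = 8 * parameters    # forward pass recomputed during backward

mfu = model_flops_per_token * tokens_per_second / total_peak_flops      # ≈ 13.5%
hfu = hardware_flops_per_token * tokens_per_second / total_peak_flops   # ≈ 17.9%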
Typical MFU Ranges
Real-world MFU values vary significantly:
- Small models (< 1B): 10-30%
- Medium models (1B-10B): 30-50%
- Large models (10B-100B): 45-65%
- Very large models (100B+): 50-70%
Higher MFU in larger models is often due to better compute-to-communication ratios and improved memory bandwidth utilization.
🧠 Extending to Mixture-of-Experts (MoE) Architectures
MoE Fundamentals
Mixture-of-Experts architectures replace dense feedforward networks with a router and multiple expert networks. Only a subset of experts (typically 1-2 out of 8-64) are activated per token, dramatically reducing computational costs while maintaining or improving model quality.
MoE FLOP Counting Challenges
Traditional FLOP counting becomes more nuanced with MoE:
- Routing Overhead: Additional FLOPs for expert selection
- Variable Computation: Different tokens may use different experts
- Load Balancing: Uneven expert utilization affects total FLOPs
- Communication Costs: Expert routing across devices
MoE FLOP Formula
For a MoE layer replacing a dense feedforward network:
def moe_flops_per_token(d_model, num_experts, experts_per_token, d_ff, load_balance_factor=1.1):
    """Calculate forward FLOPs per token for a MoE feedforward layer"""
    # Router FLOPs (token → expert probabilities)
    router_flops = 2 * d_model * num_experts
    # Expert FLOPs (only the active experts are computed)
    expert_flops = experts_per_token * (2 * d_model * d_ff + 2 * d_ff * d_model)
    # Load balancing overhead
    effective_expert_flops = expert_flops * load_balance_factor
    return router_flops + effective_expert_flops

# Example: 8 experts, top-2 routing, each expert with a 4× expansion
d_model = 4096
# Dense baseline with the same *total* parameter count as all 8 experts
dense_equivalent_flops = 8 * 16 * d_model**2
moe_ff_flops = moe_flops_per_token(d_model, 8, 2, 4 * d_model)
efficiency_gain = dense_equivalent_flops / moe_ff_flops  # ≈ 3.6× here; typically 2-4×
🚀 Case Study: OpenAI GPT-OSS Models
OpenAI’s GPT-OSS models provide an excellent real-world example of modern MoE architectures with several innovative features:
GPT-OSS Architecture Overview
GPT-OSS-120B:
- Total Parameters: 117B (5.1B active per token)
- Architecture: MoE with native MXFP4 quantization
- Activation: ~4.4% of total parameters per token
- Memory: Fits on single H100 (80GB)
GPT-OSS-20B:
- Total Parameters: 21B (3.6B active per token)
- Activation: ~17% of total parameters per token
- Memory: Runs in 16GB
FLOP Counting for GPT-OSS
Let’s calculate FLOPs for GPT-OSS-120B:
def gpt_oss_flops_calculation():
    # GPT-OSS-120B specifications (estimated)
    total_params = 117e9
    active_params_per_token = 5.1e9

    # Traditional approach (if every parameter were active, i.e. a dense model)
    dense_flops_per_token = 6 * total_params  # 702 GFLOPs

    # MoE approach (actual): only the active parameters contribute compute
    # Attention layers remain dense and are part of the active parameter count
    attention_params = active_params_per_token * 0.3  # Estimated ~30% of active params
    attention_flops = 6 * attention_params

    # Active MoE expert parameters (the remainder of the active parameters)
    moe_params_per_token = active_params_per_token - attention_params
    moe_flops = 6 * moe_params_per_token

    # Router overhead (small relative to expert compute)
    router_flops = 6 * active_params_per_token * 0.001  # Estimated ~0.1%

    total_flops_per_token = attention_flops + moe_flops + router_flops

    return {
        'dense_equivalent': dense_flops_per_token,
        'actual_moe': total_flops_per_token,
        'efficiency_gain': dense_flops_per_token / total_flops_per_token
    }
results = gpt_oss_flops_calculation()
print(f"Dense equivalent: {results['dense_equivalent']:.1e} FLOPs/token")
print(f"MoE actual: {results['actual_moe']:.1e} FLOPs/token")
print(f"Efficiency gain: {results['efficiency_gain']:.1f}x")
MXFP4 Quantization Impact
GPT-OSS uses native MXFP4 quantization for MoE layers:
- Memory: 4-bit storage vs 16-bit (4× reduction)
- Compute: Specialized kernels maintain computational efficiency
- Accuracy: Minimal degradation with proper scaling
This affects FLOP counting considerations:
def quantized_moe_flops(base_flops, quantization_overhead=1.05):
    """
    MXFP4 might add a slight computational overhead
    but brings significant memory-bandwidth benefits.
    """
    return base_flops * quantization_overhead
Performance Analysis
Based on the GPT-OSS specifications, we can estimate performance characteristics:
# Estimated GPT-OSS-120B on a single H100
flops_per_token = 6 * 5.1e9      # 30.6 GFLOPs per training token (6N on active parameters)
h100_peak = 1979e12              # ~1.98 PFLOP/s dense FP8 tensor-core peak
target_throughput = 100          # tokens/second (estimated)

required_flops_rate = flops_per_token * target_throughput   # 3.06 TFLOP/s
mfu = required_flops_rate / h100_peak                       # ≈ 0.15%

print(f"Estimated MFU for GPT-OSS-120B: {mfu*100:.2f}%")
Note: These are rough estimates. Actual performance depends on implementation details, memory bandwidth, and other factors.
🔧 Practical Considerations and Optimizations
Memory vs. Compute Trade-offs
Modern LLM training involves several techniques that affect FLOP calculations:
Activation Checkpointing:
def checkpointing_flop_overhead(train_flops, recompute_fraction=1/3):
    """
    Activation checkpointing re-runs (part of) the forward pass during the
    backward pass. Full recomputation adds roughly one extra forward pass,
    i.e. ~1/3 of the 6N training FLOPs (≈ 8N instead of 6N per token).
    recompute_fraction: share of training FLOPs that must be recomputed
    """
    recompute_flops = train_flops * recompute_fraction
    return train_flops + recompute_flops  # up to ≈ 1.33× FLOPs with full recomputation
Gradient Accumulation:
- Doesn’t affect per-token FLOPs
- May affect MFU due to different memory access patterns (a short sketch follows below)
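A minimal sketch with assumed numbers: accumulation changes how many micro-batches feed one optimizer step, not what each token costs, so MFU moves only if the measured token throughput moves.

# Hypothetical numbers: 7B model on 8×A100, with and without gradient accumulation
params = 7e9
flops_per_token = 6 * params                    # unchanged by accumulation

micro_batch, accum_steps, seq_len = 4, 8, 2048
tokens_per_optimizer_step = micro_batch * accum_steps * seq_len   # 65,536

peak_flops = 8 * 312e12
for tokens_per_second in (8000, 7800):          # e.g. without vs. with accumulation
    mfu = flops_per_token * tokens_per_second / peak_flops
    print(f"{tokens_per_second} tok/s -> MFU = {mfu:.1%}")        # 13.5% vs 13.1%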
Scaling Laws and FLOP-Optimal Training
The relationship between model size, dataset size, and computational budget follows predictable scaling laws:
def compute_optimal_scaling(compute_budget_flops, tokens_per_param=20):
    """
    Based on Chinchilla scaling laws: C ≈ 6 * N * D with D ≈ 20 * N,
    so parameters ∝ compute^0.5 and tokens ∝ compute^0.5.
    """
    optimal_params = (compute_budget_flops / (6 * tokens_per_param)) ** 0.5
    optimal_tokens = tokens_per_param * optimal_params
    return optimal_params, optimal_tokens

# Example: 10^23 FLOPs budget
params, tokens = compute_optimal_scaling(1e23)
print(f"Optimal: {params/1e9:.0f}B parameters, {tokens/1e9:.0f}B tokens")
# ≈ 29B parameters trained on ≈ 577B tokens
📊 Benchmarking and Measurement Tools
Measuring MFU in Practice
import time
import torch
def measure_mfu(model, batch_size, seq_len, num_steps=100):
    """Measure MFU during actual training (forward + backward only)"""
    device = next(model.parameters()).device
    # Create a dummy batch; a causal LM needs labels to return a loss
    input_ids = torch.randint(0, 50000, (batch_size, seq_len), device=device)

    # Warm up
    for _ in range(10):
        loss = model(input_ids, labels=input_ids).loss
        loss.backward()
        model.zero_grad(set_to_none=True)

    torch.cuda.synchronize()
    start_time = time.time()
    for _ in range(num_steps):
        loss = model(input_ids, labels=input_ids).loss
        loss.backward()
        model.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    end_time = time.time()

    # Calculate throughput
    total_tokens = num_steps * batch_size * seq_len
    throughput = total_tokens / (end_time - start_time)

    # Calculate MFU with the 6N approximation
    flops_per_token = 6 * model.num_parameters()
    peak_flops = get_device_peak_flops(device)  # per-device peak; see the sketch below
    mfu = (flops_per_token * throughput) / peak_flops
    return mfu, throughput
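Note that get_device_peak_flops is not a standard PyTorch API; a minimal sketch might simply look up the advertised dense tensor-core peak for the detected GPU (the values below are the commonly quoted datasheet numbers, bf16 unless noted):

def get_device_peak_flops(device):
    """Rough lookup of peak dense tensor-core FLOP/s by GPU name."""
    name = torch.cuda.get_device_name(device)
    peak_flops_by_gpu = {
        "H100": 989e12,   # H100 SXM, dense bf16
        "A100": 312e12,   # A100, dense bf16
        "V100": 125e12,   # V100, fp16 tensor cores (no bf16)
    }
    for key, flops in peak_flops_by_gpu.items():
        if key in name:
            return flops
    raise ValueError(f"Unknown GPU '{name}'; add its peak FLOP/s to the table")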
Interactive FLOP Calculator
For hands-on experimentation with these concepts, check out the FLOP Calculator repository. This interactive tool implements:
- Dense Transformer FLOP calculations (OpenAI and DeepMind methods)
- MoE architecture FLOP counting with customizable parameters
- MFU measurement and comparison across different hardware
- Visualization of computational efficiency trade-offs
- Support for custom model configurations
The calculator makes it easy to experiment with different architectural choices and understand their computational implications before committing to expensive training runs.
Profiling Tools
- PyTorch Profiler: Built-in FLOP counting (see the sketch below)
- DeepSpeed: MFU reporting in training logs
- Weights & Biases: Integration with hardware metrics
- Custom counters: Framework-specific implementations
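As an example of the first item, PyTorch's profiler can attach approximate FLOP counts to operators it recognizes (mainly matmuls and convolutions) via with_flops=True. A minimal sketch, assuming a model and input batch are already on the GPU:

import torch
from torch.profiler import profile, ProfilerActivity

def profile_forward_flops(model, input_ids):
    """Print per-operator FLOP estimates for a single forward pass."""
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        with_flops=True,
    ) as prof:
        with torch.no_grad():
            model(input_ids)
    # The FLOPs column is only populated for ops the profiler can attribute
    print(prof.key_averages().table(sort_by="flops", row_limit=10))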
🔮 Future Directions and Emerging Architectures
Beyond Traditional FLOPs
As architectures evolve, FLOP counting methodologies must adapt:
Sparse Attention Patterns (a rough FLOP comparison follows below):
- Linear attention: O(n) instead of O(n²)
- Local attention: Reduced sequence length dependencies
- Mixture of attention heads: Different patterns per head
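A rough sketch of why the attention pattern matters, using the score/reduce term from the dense table earlier (≈ 4 × seq_len × context × d per layer); the sequence length, width, and window size below are illustrative assumptions:

def attention_score_flops(seq_len, d, window=None):
    """Per-layer FLOPs for QK^T plus the weighted value sum (projections and softmax ignored)."""
    context = seq_len if window is None else min(window, seq_len)
    return 4 * seq_len * context * d

full_attn = attention_score_flops(seq_len=8192, d=4096)               # quadratic in seq_len
local_attn = attention_score_flops(seq_len=8192, d=4096, window=256)  # linear in seq_len
print(f"Full vs. local attention: {full_attn / local_attn:.0f}x more score FLOPs")   # 32x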
Conditional Computation:
- Early exit mechanisms
- Adaptive depth networks
- Token-wise routing
Hardware-Aware Metrics:
- Memory bandwidth utilization
- Integer operation efficiency
- Specialized accelerator metrics
The Road Ahead
The future of computational efficiency in LLMs likely involves:
- Multi-modal MoE: Extending sparse computation to different modalities
- Dynamic architectures: Runtime adaptation based on input complexity
- Hardware co-design: Models designed for specific accelerators
- Hybrid precision: Strategic use of different numerical formats
🏁 Conclusion
Understanding FLOPs and MFU is crucial for anyone working with large-scale language models. As we’ve seen:
- FLOPs provide hardware-independent measurement of computational requirements
- MFU enables fair comparison of training efficiency across different setups
- MoE architectures complicate but don’t invalidate these fundamental concepts
- Modern models like GPT-OSS showcase how innovative architectures and quantization can dramatically improve efficiency
Whether you’re planning a training run, optimizing an existing system, or designing new architectures, these metrics provide essential insights into the computational reality of modern AI systems.
As models continue to scale and new architectures emerge, the principles of FLOP counting and efficiency measurement will remain foundational tools for understanding and optimizing the computational landscape of artificial intelligence.
🙏 Acknowledgments
This blog post draws heavily from two excellent resources that provided the foundation for understanding FLOPs in LLM training. I want to give full credit to these authors for their pioneering work:
Primary Inspirations:
- Adam Casson’s “Transformer FLOPs” - An exceptionally clear and comprehensive guide that masterfully explains FLOP counting methodologies, MFU calculations, and scaling behaviors. Much of the mathematical foundation and practical examples in this post are built upon Adam’s excellent work.
- Pratish Raj’s “FLOPs in LLM Training: The Ultimate Guide” - A practical and accessible guide that bridges the gap between theory and implementation, providing clear examples and real-world context for FLOP calculations.
Both posts were instrumental in shaping my understanding of computational efficiency in LLMs. This work extends their insights to MoE architectures and provides additional analysis of modern models like GPT-OSS, but the core concepts and methodologies owe much to these foundational resources.
🛠️ Tools and Resources
Interactive Calculator:
- FLOP Calculator: Open-source tool for calculating FLOPs and MFU for both dense and MoE architectures
📚 References and Further Reading
Foundational Papers:
- [1] OpenAI Scaling Laws: Kaplan, J., McCandlish, S., Henighan, T., et al. “Scaling Laws for Neural Language Models” (2020) - Original FLOP counting methodology and the 6N approximation
- [2] DeepMind Chinchilla: Hoffmann, J., Borgeaud, S., Mensch, A., et al. “Training Compute-Optimal Large Language Models” (2022) - Detailed FLOP accounting and compute-optimal training
- [3] Google PaLM: Chowdhery, A., Narang, S., Devlin, J., et al. “PaLM: Scaling Language Modeling with Pathways” (2022) - MFU methodology introduction
Key Blog Posts and Resources:
- Adam Casson’s Transformer FLOPs: Comprehensive FLOP analysis with interactive calculator
- Pratish Raj’s FLOPs Guide: Practical guide to FLOPs in LLM training
- GPT-OSS Repository: OpenAI’s open-source MoE implementation
Additional Technical Resources:
- Transformer Inference Arithmetic: Kipp Bradford’s analysis of inference costs
- Making Deep Learning Go Brrrr: Horace He’s guide to optimization fundamentals
Citation:
@article{paul2025_flops_mfu_moe,
  title={Understanding FLOPs, MFU, and Computational Efficiency in LLM Training: From Dense Transformers to MoE Architectures},
  author={Debjit Paul},
  year={2025},
  url={https://debjitpaul.github.io/blog/}
}
This analysis provides a comprehensive overview of computational efficiency in modern LLM training. For specific implementation details or model-specific calculations, always refer to the original papers and codebases.