You're planning to train a 70B parameter model. You have access to 8x H100 GPUs. Will it fit? You could spin up the cluster, load the model, watch it OOM, tweak configs, repeat—burning hours and cloud credits on trial and error. Or you could just ask someone who's done this before. Most teams don't have that someone.
NeoSignal Memory Calculator showing GPU memory estimation
NeoSignal Memory Calculator gives you that expert in a tool. Select your model—GPT-O5S 20B is shown here with 1151.7 GB peak memory required. Configure your training setup: batch size, sequence length, precision, parallelism strategy. The calculator instantly shows memory breakdown across parameters, gradients, optimizer states, and activations. It visualizes memory before and after optimizations are applied, recommends the right GPU (B200 192GB in this case), and surfaces specific recommendations like "Enable Activation Checkpointing" when your configuration exceeds capacity.
The benefit: you validate infrastructure decisions before spending a dollar. The chat panel on the right shows NeoSignal AI providing immediate optimization guidance based on your exact configuration—no generic advice, just answers grounded in your numbers.
Detailed Walkthrough
The Memory Problem in LLM Training
Training large language models is fundamentally a memory management problem. A 70 billion parameter model doesn't just need 70 billion floats of memory—it needs memory for gradients during backpropagation, optimizer states (Adam stores two states per parameter), and activations at every layer. The true memory footprint often exceeds the parameter count by 10-20x.
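For a rough sense of scale, assume BF16 mixed-precision training with AdamW and no sharding: a 70B-parameter model needs about 140 GB for the BF16 weights, another 140 GB for gradients, roughly 280 GB for FP32 master weights, and about 560 GB for the two FP32 Adam states, which adds up to roughly 1.1 TB before storing a single activation.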
This creates a planning challenge. Before you provision cloud GPUs, sign capacity agreements, or commit to a hardware strategy, you need to know: will this configuration actually work? NeoSignal Memory Calculator answers that question with formulas based on the HuggingFace Ultrascaling Playbook—the same methodology used by teams training frontier models.
How the NeoSignal Calculator Works
The NeoSignal Memory Calculator operates on three interconnected systems: input configuration, calculation engine, and optimization recommendations.
Input Configuration captures everything affecting memory: model selection from NeoSignal's component database (open-weights models with known architectures), batch size, sequence length, gradient accumulation steps, precision (FP32, BF16, FP16, FP8), parallelism settings (tensor, pipeline, data), ZeRO stage (0-3), activation checkpointing strategy, and optimizer choice.
Calculation Engine implements transformer memory formulas. For a model with hidden dimension h, vocabulary size v, L layers, and n attention heads (a code sketch of these formulas follows the list):
- Parameter Memory: Embedding layer (v × h), attention projections (Q, K, V, O across all layers), feedforward networks (SwiGLU with 3 projections per layer), layer norms. For GQA models, K and V projections scale with the number of key-value heads rather than attention heads.
- Gradient Memory: Same size as parameters in the working precision. During backpropagation, every parameter needs a corresponding gradient.
- Optimizer Memory: Adam/AdamW store two states per parameter (momentum and variance), both in FP32 regardless of training precision. SGD stores one state. Adafactor approximates second moments with less memory.
- Activation Memory: Dominated by attention scores (batch × heads × sequence² per layer) and intermediate FFN activations. This is where sequence length has quadratic impact.
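The per-component accounting above maps almost directly onto code. The sketch below is a minimal illustration of those formulas with made-up field names and several simplifications (it ignores the output projection and the FP32 master weights that NeoSignal adds under mixed precision); it is not NeoSignal's implementation.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    hidden: int     # h
    vocab: int      # v
    layers: int     # L
    heads: int      # n attention heads
    kv_heads: int   # key/value heads (GQA); equals heads for standard MHA
    ffn_dim: int    # intermediate FFN dimension

def parameter_count(m: ModelConfig) -> int:
    head_dim = m.hidden // m.heads
    embed = m.vocab * m.hidden                           # token embeddings
    qo = 2 * m.hidden * m.hidden                         # Q and O projections
    kv = 2 * m.hidden * (m.kv_heads * head_dim)          # K and V scale with kv_heads
    ffn = 3 * m.hidden * m.ffn_dim                       # SwiGLU: gate, up, down
    norms = 2 * m.hidden                                 # two norms per block
    return embed + m.layers * (qo + kv + ffn + norms)

def static_memory_gb(params: int, bytes_per_param: int = 2,
                     optimizer_states: int = 2) -> dict:
    gb = 1024 ** 3
    return {
        "parameters": params * bytes_per_param / gb,     # weights in working precision
        "gradients": params * bytes_per_param / gb,      # one gradient per parameter
        "optimizer": params * optimizer_states * 4 / gb, # Adam m and v kept in FP32
    }

def activation_memory_gb(m: ModelConfig, batch: int, seq: int,
                         bytes_per_act: int = 2) -> float:
    gb = 1024 ** 3
    # Dominant terms only: attention scores (batch x heads x seq^2) plus FFN intermediates.
    attn = batch * m.heads * seq * seq
    ffn = batch * seq * m.ffn_dim
    return m.layers * (attn + ffn) * bytes_per_act / gb
```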
Optimization Applications reduce effective memory by distributing and offloading (the per-GPU arithmetic is sketched after this list):
- Tensor parallelism shards model parameters across TP GPUs
- Pipeline parallelism distributes layers across PP GPUs
- ZeRO Stage 1 shards optimizer states across data parallel GPUs
- ZeRO Stage 2 adds gradient sharding
- ZeRO Stage 3 adds parameter sharding
- Activation checkpointing trades compute for memory—full checkpointing reduces activation memory by ~90% at ~30% compute overhead
- Offloading moves optimizer states or parameters to CPU memory
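Assuming these sharding rules compose multiplicatively, a minimal sketch of the per-GPU arithmetic might look like the following; the exact way NeoSignal combines ZeRO with tensor and pipeline parallelism, offloading, and sequence-level sharding may differ.

```python
def per_gpu_memory_gb(parameters: float, gradients: float, optimizer: float,
                      activations: float, tp: int = 1, pp: int = 1, dp: int = 1,
                      zero_stage: int = 0, full_checkpointing: bool = False) -> float:
    # Tensor and pipeline parallelism split the model's own state.
    model_shards = tp * pp
    params = parameters / model_shards
    grads = gradients / model_shards
    opt = optimizer / model_shards

    # ZeRO shards additional state across the data-parallel group.
    if zero_stage >= 1:
        opt /= dp        # Stage 1: optimizer states
    if zero_stage >= 2:
        grads /= dp      # Stage 2: plus gradients
    if zero_stage >= 3:
        params /= dp     # Stage 3: plus parameters

    # Activations are per replica; sharding them would require sequence
    # parallelism, which this sketch does not model.
    acts = activations
    if full_checkpointing:
        acts *= 0.1      # ~90% reduction at ~30% recompute overhead

    return params + grads + opt + acts
```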
The Memory Breakdown Visualization
NeoSignal displays memory as a stacked bar with four color-coded segments:
- Purple (Parameters): Model weights in working precision, plus FP32 master weights if using mixed precision
- Blue (Gradients): Gradients in working precision
- Cyan (Optimizer): Optimizer states in FP32
- Amber (Activations): Intermediate values stored for backpropagation
The bar shows effective memory after optimizations. Below it, a before/after comparison reveals optimization impact. A configuration might show 21,992.9 GB before optimizations and 1,047.8 GB after—a 20x reduction from ZeRO-3 sharding across 8 data parallel GPUs plus activation checkpointing.
GPU Recommendation Engine
NeoSignal maintains a database of GPU configurations with memory capacity, bandwidth, and compute specs:
| GPU | Memory | Bandwidth | Compute |
|---|---|---|---|
| B200 | 192 GB | 8000 GB/s | 2250 TFLOPS |
| MI300X | 192 GB | 5300 GB/s | 1307 TFLOPS |
| H200 | 141 GB | 4800 GB/s | 989 TFLOPS |
| H100 NVL | 94 GB | 3958 GB/s | 989 TFLOPS |
| H100 SXM | 80 GB | 3350 GB/s | 989 TFLOPS |
| A100 80GB | 80 GB | 2039 GB/s | 312 TFLOPS |
| L40S | 48 GB | 864 GB/s | 362 TFLOPS |
| RTX 4090 | 24 GB | 1008 GB/s | 165 TFLOPS |
The calculator recommends the smallest GPU that fits your configuration with 10% headroom for CUDA overhead and safe operation. If your peak memory is 85 GB, it recommends the H100 NVL (94 GB) rather than the H100 SXM (80 GB); the NVL's extra 14 GB of capacity leaves margin for memory fragmentation and framework overhead.
When no single GPU fits, the result shows "Exceeds capacity" with a red indicator, and recommendations suggest how to reduce memory through additional parallelism or optimization strategies.
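A minimal sketch of that selection rule, using the capacities from the table above (the function name and signature are illustrative, not NeoSignal's code):

```python
# GPU capacities from the table above.
GPUS_GB = {
    "RTX 4090": 24, "L40S": 48, "A100 80GB": 80, "H100 SXM": 80,
    "H100 NVL": 94, "H200": 141, "MI300X": 192, "B200": 192,
}

def recommend_gpu(peak_memory_gb: float, headroom: float = 0.10) -> str | None:
    required = peak_memory_gb * (1 + headroom)
    # Smallest card whose capacity covers peak memory plus headroom.
    for name, capacity in sorted(GPUS_GB.items(), key=lambda kv: kv[1]):
        if capacity >= required:
            return name
    return None  # "Exceeds capacity": reduce memory via parallelism or optimization

print(recommend_gpu(85.0))  # -> "H100 NVL" (93.5 GB required fits in 94 GB)
```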
Efficiency Metrics
Beyond memory, NeoSignal calculates efficiency metrics that affect training throughput:
Compute Utilization estimates what fraction of GPU compute actually performs useful work. Tensor parallelism adds all-reduce communication after each layer. Pipeline parallelism creates pipeline bubbles during warmup and cooldown. Data parallelism adds gradient all-reduce overhead (though this overlaps with backward pass). A configuration with TP=8, PP=2, DP=4 might show 65% compute utilization—35% lost to communication.
Memory Utilization shows peak memory as a fraction of GPU capacity. Too low (under 50%) means you're underutilizing expensive hardware. Too high (over 90%) risks OOM from memory fragmentation.
Total GPUs displays the full cluster size: TP × PP × DP. A configuration with TP=4, PP=2, DP=8 requires 64 GPUs total.
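The first two metrics are simple ratios; compute utilization is the modeled one. As a rough illustration, the sketch below uses the standard pipeline-bubble estimate as one example overhead; NeoSignal's exact utilization formula is not documented here, so treat these as assumptions.

```python
def cluster_size(tp: int, pp: int, dp: int) -> int:
    return tp * pp * dp                  # e.g. TP=4, PP=2, DP=8 -> 64 GPUs

def memory_utilization(peak_gb: float, capacity_gb: float) -> float:
    return peak_gb / capacity_gb         # target roughly 0.5 to 0.9

def pipeline_bubble_fraction(pp: int, microbatches: int) -> float:
    # Standard 1F1B bubble estimate; one of several overheads that feed
    # into a compute-utilization figure alongside TP and DP communication.
    return (pp - 1) / (microbatches + pp - 1)
```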
Smart Recommendations
The calculator generates contextual recommendations based on your configuration and results:
When memory exceeds GPU capacity:
- "Enable ZeRO Stage 3 (currently: 0) to shard parameters across 8 GPUs"
- "Enable activation checkpointing to reduce activation memory by up to 90%"
- "Enable optimizer offloading to move optimizer states to CPU"
- "Increase data parallelism to enable ZeRO sharding across more GPUs"
When memory is manageable but could be optimized:
- "Consider ZeRO Stage 2 (currently: 1) to shard gradients"
- "Consider selective activation checkpointing for ~70% memory reduction"
For efficiency concerns:
- "TP > 8 has high communication overhead. Consider using Pipeline Parallelism instead."
- "Small batch size detected. Consider gradient accumulation for better GPU utilization."
- "Using FP32 precision. Consider BF16 mixed precision to reduce memory by ~50%."
Chat Integration
The NeoSignal Memory Calculator integrates with NeoSignal AI chat. When you have a calculation active, the chat panel gains context about your configuration and results. Ask "How can I reduce memory usage to fit on a single GPU?" and the response considers your specific model size, current parallelism settings, and which optimizations you haven't yet enabled.
The chat understands the calculation state. "Why is B200 recommended?" gets an answer that references your 1151.7 GB peak memory and explains why 192 GB capacity with ZeRO-3 sharding is the minimum viable option. "What happens if I enable full activation checkpointing?" triggers re-analysis based on the activation memory component of your breakdown.
Saving and Sharing Configurations
NeoSignal lets you save calculator configurations as artifacts. Click "Save" and your model selection, training config, parallelism settings, and results persist to your account. Share the artifact URL with teammates for infrastructure planning discussions. Load a saved artifact to restore exact settings—useful for iterating on configurations over multiple sessions.
Artifacts capture the full input state, not just results. When you load a saved memory calculation, all sliders, dropdowns, and toggles restore to their saved positions. The calculation re-runs with current model data, so if NeoSignal's architecture information has updated, you get fresh results.
Real-World Planning Scenarios
Pre-procurement validation: You're evaluating a 70B model training project. Before negotiating cloud capacity, run the Memory Calculator with your target configuration. Discover you need 64x H100s for ZeRO-3 with TP=4 and DP=16. Now you have concrete numbers for capacity planning and budget discussions.
Hardware comparison: Same model, but should you use H100 SXM or H200? Run both configurations. H200's 141 GB allows larger batch sizes or reduced parallelism, potentially improving training efficiency. The calculator quantifies the tradeoff.
Optimization strategy: You have fixed hardware (8x A100-80GB). What combination of ZeRO, checkpointing, and offloading lets you train the largest model? Iterate through configurations until peak memory fits with acceptable efficiency metrics.
Debugging OOM: Production training is hitting out-of-memory errors. Enter your exact configuration into the calculator. Compare reported peak memory against actual GPU memory. Often the gap reveals the issue—maybe activation memory is higher than expected due to sequence length, or optimizer states weren't being sharded as intended.
Technical Foundation
NeoSignal's memory formulas are based on the HuggingFace Ultrascaling Playbook, which documents memory requirements for distributed transformer training. The implementation:
- Calculates parameter counts from architecture (hidden dimension, vocab size, layers, heads, FFN dimension, GQA configuration)
- Applies precision-specific byte counts (FP32: 4 bytes, BF16/FP16: 2 bytes, FP8: 1 byte)
- Models optimizer state memory (Adam: 2 states × 4 bytes per parameter)
- Estimates activation memory from attention score tensors and intermediate activations
- Applies parallelism and ZeRO sharding factors to get per-GPU memory
- Adds CUDA overhead multiplier (10%) for framework and fragmentation headroom
The calculation runs entirely client-side for instant feedback. Change any input and results update in under 100ms. No server round-trips, no API rate limits, no waiting.
From Calculator to Confidence
GPU memory estimation has traditionally required tribal knowledge—knowing the formulas, understanding the optimizations, having intuition for what configurations work. NeoSignal Memory Calculator codifies that knowledge into a tool anyone can use.
The goal isn't to replace deep expertise in distributed training. It's to give everyone a reliable starting point. Validate your infrastructure assumptions. Identify optimization opportunities. Quantify tradeoffs between hardware options. Then go build with confidence that your configuration will actually work.
That's the NeoSignal approach to AI infrastructure tooling: take expert knowledge, encode it in precise calculations, present it through an interface that makes complex decisions approachable. The Memory Calculator is one tool in the suite. Serving Engine Advisor, Spot Instance Advisor, and TCO Calculator apply the same philosophy to inference optimization, cost management, and build-vs-buy decisions.