You have 64 GPUs, a 70B model, and a training deadline. The question isn't whether to use parallelism—it's which combination. Tensor parallelism shards weights across GPUs but adds communication overhead. Pipeline parallelism distributes layers but creates bubbles. Data parallelism scales throughput but doesn't help memory. The math for finding the optimal TP × PP × DP configuration? It involves memory estimation, efficiency modeling, and communication overhead calculations that most teams don't have time to derive.
NeoSignal Parallelism Strategy Advisor showing recommended TP, PP, DP configuration
NeoSignal Parallelism Strategy Advisor does the math. Select your model and GPU configuration, specify your GPU count and memory target, choose full fine-tuning or LoRA mode. The advisor enumerates all valid parallelism combinations, scores them by compute efficiency and memory utilization, and recommends the optimal strategy. You see the recommended TP/PP/DP values, memory per GPU, compute utilization percentage, and alternative strategies ranked by efficiency. Below that: ready-to-use DeepSpeed and FSDP configuration snippets you can copy directly into your training scripts.
The benefit: you configure distributed training correctly the first time. No more trial-and-error with parallelism settings, no more OOM errors after hours of job queueing, no more underutilized clusters. The advisor encodes the expertise that would otherwise require a distributed systems engineer.
Detailed Walkthrough
The Parallelism Problem
Distributed training requires splitting work across GPUs. Three dimensions of parallelism exist, each with different tradeoffs:
Tensor Parallelism (TP) shards model weights horizontally across GPUs. Each GPU holds a slice of every layer. This reduces per-GPU memory but requires all-reduce communication after every linear operation. TP works best within a single node where NVLink provides high-bandwidth interconnect.
Pipeline Parallelism (PP) distributes model layers vertically across GPUs. GPU 0 holds layers 1-10, GPU 1 holds layers 11-20, and so on. This reduces per-GPU memory but creates "pipeline bubbles"—idle time during warmup and cooldown phases. PP works across nodes since communication happens only at layer boundaries.
Data Parallelism (DP) replicates the model across GPU groups, with each group processing different batches. This scales throughput linearly but doesn't reduce per-GPU memory. DP requires gradient synchronization, but this overlaps with backward pass computation.
The constraint: TP × PP × DP must equal your total GPU count (equivalently, TP × PP must divide the GPU count evenly, with DP taking up the remainder). A 64-GPU cluster could run TP=4, PP=2, DP=8, or TP=8, PP=4, DP=2, or dozens of other combinations. Each has different memory footprints and efficiency characteristics.
Memory Estimation
The advisor estimates memory requirements based on model parameters and training mode:
Full Fine-tuning Memory = Parameters × 18 bytes
- 2 bytes: BF16 weights
- 4 bytes: FP32 master weights
- 12 bytes: AdamW optimizer states (momentum + variance)
LoRA Memory = Base model (2 bytes per param) + Adapter parameters (18 bytes per param)
- Base model is frozen, only needs inference weights
- Adapters are typically 1% of base model size
The advisor applies a 15% CUDA overhead factor for memory fragmentation and framework buffers.
For a 70B model in full fine-tuning mode:
- Base memory: 70B × 18 bytes = 1,260 GB
- With overhead: 1,260 × 1.15 = 1,449 GB
This needs to be distributed across GPUs using parallelism.
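As a rough sketch of this arithmetic (the function name and defaults here are illustrative, not the advisor's actual API):

```python
def estimate_memory_gb(params_billions: float, mode: str = "full",
                       adapter_fraction: float = 0.01,
                       cuda_overhead: float = 0.15) -> float:
    """Rough training-memory estimate in GB (1 GB = 1e9 bytes)."""
    params = params_billions * 1e9
    if mode == "full":
        # 2 (BF16 weights) + 4 (FP32 master) + 12 (AdamW states) = 18 bytes/param
        raw_bytes = params * 18
    else:
        # LoRA: frozen BF16 base plus trainable adapters at full training cost
        raw_bytes = params * 2 + params * adapter_fraction * 18
    return raw_bytes * (1 + cuda_overhead) / 1e9

print(round(estimate_memory_gb(70, "full")))  # ~1449 GB for a 70B model
```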
Efficiency Calculation
Each parallelism dimension adds communication overhead:
Tensor Parallelism Overhead: roughly 3% per additional TP degree, i.e. 3% × (TP − 1), capped at 20%
- TP=2: 3% overhead
- TP=4: 9% overhead
- TP=8: 20% overhead (cap)
Pipeline Parallelism Bubbles: (PP - 1) / (microbatches + PP - 1)
- PP=4 with 16 microbatches: 3/19 = 16% bubble ratio
- More microbatches reduce bubbles but increase memory
Data Parallelism Overhead: 1% per DP degree, capped at 10%
- Well-overlapped with backward pass
- Minimal impact on training time
The advisor calculates compute utilization by subtracting these overheads:
Compute Utilization = 1 - (TP overhead + bubble overhead + DP overhead)
A configuration with TP=4, PP=2, DP=8 might achieve 72% compute utilization—28% lost to communication and pipeline bubbles.
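A minimal sketch of that utilization model, using the per-degree rates described above (the exact percentages depend on the overhead model, so treat the output as indicative):

```python
def compute_utilization(tp: int, pp: int, dp: int, microbatches: int = 16) -> float:
    """Estimated fraction of step time spent on useful compute."""
    tp_overhead = min(0.03 * (tp - 1), 0.20)     # 3% per additional TP degree, capped at 20%
    bubble = (pp - 1) / (microbatches + pp - 1)  # pipeline-bubble ratio
    dp_overhead = min(0.01 * dp, 0.10)           # 1% per DP degree, capped at 10%
    return max(0.0, 1.0 - (tp_overhead + bubble + dp_overhead))

# TP=4, PP=2, DP=8 with 16 microbatches; compare with the ~72% figure quoted above
print(f"{compute_utilization(4, 2, 8):.0%}")
```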
Strategy Enumeration
The advisor generates all valid parallelism combinations:
- Enumerate powers of 2 for TP (1, 2, 4, 8)
- Enumerate powers of 2 for PP (1, 2, 4, 8, 16)
- Calculate DP = GPU count / (TP × PP)
- Filter to combinations where DP is a positive integer
For each valid combination:
- Calculate memory per GPU: total memory / (TP × PP)
- Check if it fits within target utilization (default 80%)
- Calculate efficiency metrics
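Sketched in code, under the assumption that per-GPU memory is the total estimate divided by TP × PP (ZeRO sharding across DP would lower it further); the function name is illustrative:

```python
from itertools import product

def enumerate_strategies(gpu_count: int, total_mem_gb: float,
                         gpu_mem_gb: float = 80.0, target_util: float = 0.80):
    """Yield (tp, pp, dp, mem_per_gpu_gb) for every combination that fits."""
    for tp, pp in product([1, 2, 4, 8], [1, 2, 4, 8, 16]):
        if gpu_count % (tp * pp) != 0:
            continue                              # DP must be a positive integer
        dp = gpu_count // (tp * pp)
        mem_per_gpu = total_mem_gb / (tp * pp)    # DP replicas hold the same shard
        if mem_per_gpu <= gpu_mem_gb * target_util:
            yield tp, pp, dp, mem_per_gpu

for tp, pp, dp, mem in enumerate_strategies(64, 1449):
    print(f"TP={tp} PP={pp} DP={dp} -> {mem:.0f} GB/GPU")
```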
Scoring and Ranking
The advisor scores strategies using a weighted formula:
| Factor | Weight | Description |
|---|---|---|
| Compute Utilization | 60% | Higher is better—more time on compute, less on communication |
| Memory Utilization | 30% | Higher is better—use what you have |
| DP Bonus | 10% | Higher DP enables better throughput scaling |
Strategies that don't fit in memory are filtered out. The highest-scoring strategy becomes the recommendation.
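A sketch of that weighting (the DP-bonus normalization here is an assumption; the advisor may normalize differently):

```python
def strategy_score(compute_util: float, mem_util: float, dp: int, max_dp: int) -> float:
    """60% compute utilization, 30% memory utilization, 10% DP bonus."""
    dp_bonus = dp / max_dp if max_dp > 0 else 0.0
    return 0.6 * compute_util + 0.3 * mem_util + 0.1 * dp_bonus
```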
Result Outputs
The advisor provides comprehensive outputs:
Recommended Strategy
- TP, PP, DP values
- Memory per GPU (GB)
- Memory utilization percentage
- Whether it fits in available GPU memory
Efficiency Metrics
- Compute utilization
- Communication overhead
- Pipeline bubble ratio
- Overall efficiency score
Alternative Strategies
- Next 3 best configurations
- Useful for exploring tradeoffs
Configuration Snippets
- DeepSpeed ZeRO JSON configuration
- PyTorch FSDP Python code
DeepSpeed Configuration
The advisor generates a production-ready DeepSpeed config:
```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "tensor_parallel": {
    "tp_size": 4
  },
  "pipeline_parallel_size": 2
}
```
ZeRO stage is automatically selected based on DP degree and memory pressure:
- DP=1: ZeRO-0 (no sharding possible)
- Memory utilization > 70%: ZeRO-3 (full sharding)
- Otherwise: ZeRO-2 (gradient + optimizer sharding)
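That rule of thumb is simple enough to sketch directly (illustrative only, not the advisor's internal code):

```python
def select_zero_stage(dp: int, mem_utilization: float) -> int:
    """Pick a ZeRO stage from the DP degree and memory pressure."""
    if dp == 1:
        return 0   # nothing to shard across data-parallel ranks
    if mem_utilization > 0.70:
        return 3   # shard parameters, gradients, and optimizer states
    return 2       # shard gradients and optimizer states
```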
FSDP Configuration
For PyTorch FSDP users:
```python
import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
    MixedPrecision,
    BackwardPrefetch,
)

# Keep parameters and gradient reductions in BF16.
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
)

fsdp_config = {
    "sharding_strategy": ShardingStrategy.FULL_SHARD,
    "mixed_precision": mp_policy,
    "backward_prefetch": BackwardPrefetch.BACKWARD_PRE,
    "use_orig_params": True,
}
# Applied as: model = FSDP(model, **fsdp_config)
```
Sharding strategy maps to ZeRO stages:
- FULL_SHARD: ZeRO-3
- SHARD_GRAD_OP: ZeRO-2
- NO_SHARD: ZeRO-0
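Expressed as a lookup under the same mapping:

```python
from torch.distributed.fsdp import ShardingStrategy

ZERO_TO_FSDP = {
    3: ShardingStrategy.FULL_SHARD,     # parameters + gradients + optimizer states
    2: ShardingStrategy.SHARD_GRAD_OP,  # gradients + optimizer states
    0: ShardingStrategy.NO_SHARD,       # plain replication, DDP-style
}
```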
Smart Recommendations
The advisor generates contextual recommendations:
When configuration doesn't fit:
- "Model requires X GB per GPU but GPU only has Y GB. Consider more GPUs or different hardware."
- "Increase Tensor Parallelism to reduce per-GPU memory."
- "For 70B+ models, consider H100 80GB or H200 141GB GPUs."
For high TP configurations:
- "TP=8 has significant communication overhead (20%). Consider reducing TP and increasing PP or DP."
For high memory utilization:
- "Memory utilization is 92%. Consider enabling gradient checkpointing or ZeRO-3 offloading for stability."
For underutilized resources:
- "Memory utilization is only 45%. You can increase batch size or reduce parallelism for better efficiency."
- "DP=1 limits throughput scaling. Consider configurations with higher DP."
Training Mode Selection
The advisor supports two training modes:
Full Fine-tuning
- Trains all model parameters
- Memory: 18 bytes per parameter
- Highest quality, highest memory
- Required for pre-training or significant domain adaptation
LoRA/QLoRA
- Trains only adapter layers (~1% of parameters)
- Memory: Base model inference + adapter training
- 10-100x less memory than full fine-tuning
- Sufficient for most fine-tuning tasks
For a 70B model:
- Full fine-tuning: ~1,449 GB distributed across GPUs
- LoRA: ~145 GB (base model) + ~15 GB (adapters) = ~160 GB
Model Size Thresholds
The advisor uses model size to guide recommendations:
| Model Size | Recommended Parallelism |
|---|---|
| < 7B | DP only, single GPU possible |
| 7B - 70B | TP=2-4 beneficial, DP for scaling |
| 70B - 175B | TP=4-8 + PP=2-4 recommended |
| > 400B | Full 3D parallelism (TP + PP + DP) required |
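As a quick lookup of the same thresholds (purely illustrative; the table leaves the 175B-400B range unstated, so anything above 175B is treated here as needing full 3D parallelism):

```python
def suggested_parallelism(params_billions: float) -> str:
    """Map model size (billions of parameters) to the guidance in the table above."""
    if params_billions < 7:
        return "DP only; a single GPU may suffice"
    if params_billions < 70:
        return "TP=2-4 beneficial, DP for scaling"
    if params_billions < 175:
        return "TP=4-8 plus PP=2-4"
    return "Full 3D parallelism (TP + PP + DP)"
```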
Chat Integration
The Parallelism Strategy Advisor integrates with NeoSignal AI chat:
Context Awareness: Your configuration and results are available to the chat. Ask "Why is TP=4 recommended over TP=8?" and the response references your specific GPU count and memory constraints.
Strategy Clarification: "What's the pipeline bubble problem?" triggers an explanation using your actual PP setting and microbatch count.
Alternative Exploration: "What if I use H200s instead?" prompts analysis of how the recommendation would change with different hardware.
Real-World Usage Patterns
Pre-training Planning: You're planning a 70B model pre-training run on 256 H100s. Enter the configuration. The advisor recommends TP=4, PP=4, DP=16 with 68% compute utilization. Copy the DeepSpeed config, validate with a short test run, then proceed with confidence.
Fine-tuning Optimization: You have 8 A100-80GB GPUs for fine-tuning a 70B model. Full fine-tuning won't fit. Switch to LoRA mode—the advisor shows TP=2, PP=1, DP=4 with 78% memory utilization. Now it fits.
Cluster Sizing: You're deciding between 64 and 128 GPUs. Run both configurations. See how efficiency changes—maybe 128 GPUs gives diminishing returns due to communication overhead. Make informed procurement decisions.
Debug Training Issues: Training is slower than expected. Enter your current configuration. The advisor shows your compute utilization is only 52%—PP=8 is creating large pipeline bubbles. Reduce to PP=4 and increase DP.
From Advisor to Training
NeoSignal Parallelism Strategy Advisor compresses distributed training expertise into automated recommendations. The formulas are based on established best practices—the same principles used by teams training frontier models. You get the strategy without needing to derive the math.
The output isn't just "use TP=4, PP=2, DP=8." It's a complete configuration package: efficiency metrics, alternative strategies, DeepSpeed configs, FSDP code. Copy, paste, train. When requirements change—different model size, different GPU count—return to the advisor and get updated recommendations in seconds.
That's the NeoSignal approach: expert knowledge encoded in precise calculations, delivered through interfaces that make complex decisions actionable.