NeoSignal Spot Instance Advisor: Save 50-70% on GPU Compute

NeoSignal Team
December 30, 2025
9 min read

GPU compute is expensive. H100s run $32/hour on-demand. For a two-week 70B-model training run on 8x H100s, you're looking at a cloud bill approaching $90,000. Spot instances offer 50-70% savings, but the complexity stops most teams: interruption risk, checkpointing requirements, multi-region failover, provider-specific APIs. You know you should use spot, but the risk feels too high.

NeoSignal Spot Instance Advisor showing pricing and recommendations

NeoSignal Spot Instance Advisor makes spot accessible. Select your workload type (training, inference, batch), GPU requirements, and acceptable interruption tolerance. The advisor scans AWS, GCP, and Azure for spot availability, showing pricing by region with interruption rates. The screenshot shows H100 spot at $16.24/hour versus $32.77 on-demand in us-east-1—a 50% savings. But it goes further: checkpointing strategy recommendations based on your workload, fallback plans when spot gets interrupted, risk assessment with expected interruptions over your workload duration, and total savings projections.

The benefit: you capture spot savings with eyes open to the risks. No more avoiding spot because it seems too complex. The advisor gives you the strategy and the numbers.


Detailed Walkthrough

The Spot Economics

Cloud spot instances represent spare capacity that providers sell at steep discounts. When demand exceeds supply, spot instances get interrupted—your workload terminates with minimal notice (typically 30 seconds to 2 minutes). The tradeoff: dramatic cost savings in exchange for reliability risk.

For GPU workloads, the economics are compelling:

GPU          On-Demand    Spot         Savings
H100 80GB    $32.77/hr    $16.24/hr    50%
H200 141GB   $45.00/hr    $22.50/hr    50%
A100 80GB    $8.50/hr     $4.10/hr     52%
A100 40GB    $5.67/hr     $2.80/hr     51%
L40S         $3.72/hr     $1.85/hr     50%

For a two-week training run on 8x H100s, spot saves approximately $44,000 compared to on-demand. The question isn't whether to use spot—it's how to use it safely.


Input Configuration

The advisor collects information needed for accurate recommendations:

Workload Type: Training, inference, or batch processing. Each has different interruption tolerance profiles:

  • Training: Can checkpoint and resume; tolerates interruptions if checkpointing is robust
  • Inference: Requests fail on interruption; needs rapid failover or hybrid strategy
  • Batch: Jobs can restart from last checkpoint; most spot-friendly

Interruption Tolerance: Slider from 0-100 indicating how much interruption risk you can accept. Low tolerance (0-30) favors on-demand fallback strategies. High tolerance (70-100) maximizes spot usage with aggressive checkpointing.

Cloud Providers: Select AWS, GCP, Azure, or any combination. The advisor compares pricing and availability across providers.

Regions: Choose regions to consider. More regions increase availability options but may add data transfer complexity.

GPU Configuration: Select from NeoSignal's accelerator database. Spot availability and pricing vary dramatically by GPU type—H100s are scarcer than A100s.

Duration: Expected workload duration in hours. Affects total savings calculations and interruption risk assessment.

GPU Count: Number of GPUs needed. Multi-GPU workloads have compounded interruption risk—any single GPU interruption affects the entire job.
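The compounding effect is simple probability. A minimal sketch (assuming independent per-instance interruption rates, which real capacity events often violate):

```python
# Sketch: how multi-GPU jobs compound interruption risk.
# Assumes each instance has an independent hourly interruption
# probability p; any single interruption stops the whole job.

def job_hourly_interruption_prob(p_instance: float, num_instances: int) -> float:
    """P(at least one of n instances is interrupted in a given hour)."""
    return 1 - (1 - p_instance) ** num_instances

# A 1% per-instance hourly rate becomes ~7.7% for an 8-GPU job.
print(round(job_hourly_interruption_prob(0.01, 8), 4))  # → 0.0773
```

This is why the advisor treats GPU count as a first-class risk input rather than just a cost multiplier.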

Availability Analysis

The advisor scans configured providers and regions for spot availability:

Spot Price: Current spot price per GPU-hour. Prices fluctuate based on demand; the advisor shows representative pricing.

On-Demand Price: Comparison price for cost savings calculation.

Savings Percentage: Spot discount expressed as percentage. Typically 50-70% for GPUs.

Interruption Rate: Historical percentage chance of interruption per hour. Lower is better. US-East regions often have higher interruption rates due to higher demand.

Availability Rating: Qualitative assessment—Excellent (low interruption, stable pricing), Good (moderate interruption, reasonable pricing), Limited (high interruption or constrained supply).

Combined Score: A weighted score combining savings and reliability. A region with 60% savings but 15% interruption rate may score lower than a region with 50% savings and 5% interruption rate.
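One way such a weighting can work, with illustrative weights that are not NeoSignal's actual formula:

```python
# Hypothetical combined score: reward savings, heavily penalize
# interruption rate. Weights and the 5x penalty are illustrative.

def combined_score(savings_pct: float, interruption_pct: float,
                   w_savings: float = 0.5, w_reliability: float = 0.5) -> float:
    reliability = 100 - interruption_pct * 5  # interruptions cost more than savings gain
    return w_savings * savings_pct + w_reliability * reliability

# 60% savings at 15%/hr interruption vs 50% savings at 5%/hr:
print(combined_score(60, 15))  # → 42.5
print(combined_score(50, 5))   # → 62.5
```

Under this scheme the steadier region wins despite the smaller discount, matching the intuition described above.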

Provider Comparison

NeoSignal tracks spot characteristics across major cloud providers:

AWS (EC2 Spot):

  • Regions: us-east-1, us-west-2, eu-west-1, ap-northeast-1
  • Interruption notice: 2 minutes
  • Spot Fleet for multi-instance management
  • Capacity-optimized allocation strategy available

Google Cloud (Preemptible VMs):

  • Regions: us-central1, us-west1, europe-west4, asia-east1
  • Interruption notice: 30 seconds
  • 24-hour maximum runtime for legacy Preemptible VMs (then automatic termination); Spot VMs have no runtime limit
  • Spot VMs (newer) vs Preemptible (legacy)

Azure (Spot VMs):

  • Regions: eastus, westus2, westeurope, southeastasia
  • Interruption notice: 30 seconds
  • Eviction type: capacity or price
  • Max price setting available

Regional Recommendations

Based on your inputs, the advisor recommends optimal regions:

Top Recommendation: The region with the best combination of savings and reliability for your workload. For training with moderate interruption tolerance, this might be GCP europe-west4 (5% interruption rate, 50% savings).

Alternative Regions: Ranked alternatives for geographic diversity or failover. Having options across providers and regions enables multi-region strategies.

Avoid Regions: Regions with high interruption rates or insufficient GPU availability for your requirements.

Savings Breakdown

The advisor calculates comprehensive savings projections:

Hourly Savings: Spot savings per GPU-hour. $16.53/hour for H100 in us-east-1.

Total Spot Cost: Projected cost for your workload duration on spot instances. 8x H100s for 336 hours (2 weeks) = $43,588.

Total On-Demand Cost: Comparison cost at on-demand pricing. Same configuration = $87,977.

Net Savings: Dollar savings from using spot. $44,389 in this example.

Savings Percentage: Overall percentage saved. 50% in this example.

Adjusted for Interruptions: For training workloads, the advisor factors in expected restart overhead from interruptions, giving a more realistic savings estimate.
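These figures can be sanity-checked by hand. Recomputing from the rounded table prices lands within a few hundred dollars of the advisor's totals (which presumably use unrounded live prices):

```python
# Recomputing the savings breakdown from the rounded table prices.
spot_price, od_price = 16.24, 32.77  # $/GPU-hour, H100 in us-east-1
gpus, hours = 8, 336                 # 8x H100 for two weeks

total_spot = spot_price * gpus * hours   # ≈ $43,653
total_od = od_price * gpus * hours       # ≈ $88,086
net = total_od - total_spot              # ≈ $44,433

print(f"spot ${total_spot:,.0f} vs on-demand ${total_od:,.0f}: save ${net:,.0f}")
```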

Checkpointing Strategy

Effective spot usage requires robust checkpointing. The advisor recommends:

Checkpoint Frequency: How often to save model state based on your workload type and interruption rate:

  • Training: Every 15-30 minutes for high-interruption regions, every hour for low-interruption
  • Batch: At natural job boundaries or every N iterations
  • Inference: N/A (stateless)

Checkpoint Interval: Specific interval in minutes. A 15-minute interval means maximum 15 minutes of lost work on interruption.
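A classic way to pick this interval is the Young–Daly approximation. The advisor's own heuristic may differ, but the rule of thumb is:

```python
import math

# Young–Daly rule of thumb for checkpoint interval:
# T_opt ≈ sqrt(2 * C * MTBF), where C is the time to write one
# checkpoint and MTBF is the mean time between interruptions.

def optimal_checkpoint_interval_min(checkpoint_write_min: float,
                                    mtbf_hours: float) -> float:
    return math.sqrt(2 * checkpoint_write_min * mtbf_hours * 60)

# 2-minute checkpoint writes, one interruption every 20 hours:
print(round(optimal_checkpoint_interval_min(2, 20)))  # → 69 minutes
```

Intuitively: cheap checkpoints or frequent interruptions push the interval down; expensive checkpoints or rare interruptions push it up.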

Storage Type: Recommended storage for checkpoints:

  • Training: Cloud object storage (S3, GCS, Azure Blob) for durability
  • Batch: Local SSD for speed, replicated to object storage

Resumption Time: Expected time to restore from checkpoint and resume. Affects your "recovery overhead" calculation.

Storage Overhead: Additional storage costs for maintaining checkpoints. Typically 5-15% overhead depending on model size and checkpoint frequency.

Implementation Steps: Specific guidance for your framework:

  1. "Enable PyTorch FSDP checkpointing with state_dict_type=FULL_STATE_DICT"
  2. "Configure checkpoint storage to s3://your-bucket/checkpoints/"
  3. "Set checkpoint interval to 15 minutes"
  4. "Implement signal handler for SIGTERM (interruption notice)"

Fallback Strategy

When spot instances get interrupted, you need a plan:

On-Demand Fallback: Automatically switch to on-demand when spot is unavailable. Highest reliability, though savings disappear during fallback periods.

Mixed Strategy: Run percentage on spot (e.g., 80%), remainder on on-demand for baseline capacity. Balances savings with reliability.

Multi-Region: Spread workload across regions. When one region's spot gets interrupted, others may remain available. Requires workload that can distribute.

Capacity Reservation: Pre-purchase reserved capacity for critical workloads. Highest reliability, lowest flexibility.

The advisor recommends strategies based on your interruption tolerance:

  • Low tolerance (0-30): Mixed strategy with 50% on-demand baseline
  • Medium tolerance (30-70): Multi-region spot with on-demand fallback
  • High tolerance (70-100): Full spot with aggressive checkpointing
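This mapping is easy to express in code. A sketch with illustrative thresholds (the tolerance bands above share their boundary values, so half-open intervals are assumed here; the advisor's actual logic lives in calculate.ts):

```python
# Hypothetical tolerance-to-strategy mapping mirroring the bands above.

def recommend_strategy(tolerance: int) -> str:
    if tolerance < 30:
        return "mixed: 50% on-demand baseline"
    elif tolerance < 70:
        return "multi-region spot with on-demand fallback"
    return "full spot with aggressive checkpointing"

print(recommend_strategy(20))  # → mixed: 50% on-demand baseline
print(recommend_strategy(85))  # → full spot with aggressive checkpointing
```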

Risk Assessment

The advisor quantifies interruption risk:

Risk Level: Overall assessment—Low, Medium, or High based on workload type, duration, and region characteristics.

Expected Interruptions: Statistical estimate of how many interruptions to expect over your workload duration. 8x H100s in us-east-1 for two weeks with a 0.5% per-instance hourly rate ≈ 13 expected interruptions.

Estimated Downtime: Total expected downtime from interruptions and recovery. 13 interruptions × 10 minutes recovery = ~2 hours downtime.
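The arithmetic behind these estimates can be reproduced with a back-of-envelope model, assuming a per-instance hourly interruption rate of 0.5% (the rate implied by ~13 interruptions over two weeks on 8 GPUs):

```python
# Back-of-envelope interruption math: expected interruptions scale with
# rate * instances * hours; downtime adds a recovery cost per event.

def expected_interruptions(hourly_rate: float, instances: int, hours: float) -> float:
    return hourly_rate * instances * hours

exp_int = expected_interruptions(0.005, 8, 336)  # 0.5%/hr per instance, 2 weeks
downtime_hours = exp_int * 10 / 60               # 10-minute recovery per event
print(f"{exp_int:.1f} interruptions, {downtime_hours:.1f} h downtime")
# → 13.4 interruptions, 2.2 h downtime
```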

Data Loss Risk: Risk of losing work—Low if checkpointing is robust, High if checkpointing is inadequate or infrequent.

Cost Overrun Risk: Risk of exceeding budget due to interruptions requiring on-demand fallback. Medium if fallback strategy is configured.

Risk Factors: Specific concerns for your configuration:

  • "High interruption rate in us-west-2 (8%/hour)"
  • "Multi-GPU workload compounds interruption probability"
  • "Long duration increases cumulative interruption likelihood"

Mitigations: Recommended actions to reduce risk:

  • "Reduce checkpoint interval to 15 minutes"
  • "Add us-east-1 as fallback region"
  • "Configure on-demand capacity reservation for 25% of GPUs"

Configuration Snippets

The advisor generates ready-to-use configuration:

AWS Spot Fleet JSON:

{
  "SpotFleetRequestConfig": {
    "IamFleetRole": "arn:aws:iam::account:role/spot-fleet",
    "AllocationStrategy": "capacityOptimized",
    "TargetCapacity": 8,
    "LaunchSpecifications": [{
      "InstanceType": "p5.48xlarge",
      "SpotPrice": "130.00"
    }]
  }
}

GCP Instance Template:

gcloud compute instances create training-vm \
  --provisioning-model=SPOT \
  --accelerator=type=nvidia-h100-80gb,count=8 \
  --maintenance-policy=TERMINATE

Terraform Modules: Infrastructure-as-code snippets for reproducible deployments.

Chat Integration

The Spot Instance Advisor integrates with NeoSignal AI Chat:

Context Awareness: Your spot configuration and results are available to the chat. Ask "What happens if I switch to GCP?" and the response compares pricing across your configured regions.

Strategy Questions: "How do I implement the checkpointing strategy?" triggers detailed framework-specific guidance.

Risk Clarification: "Is the interruption risk acceptable for production inference?" gets analysis based on your workload type and tolerance settings.

Real-World Usage Patterns

Training Cost Optimization: You're starting a two-week fine-tuning run on 8x H100s. Enter the configuration into the advisor. See that us-east-1 offers the best combination of price ($16.24/hour) and availability (5% interruption). Get a checkpointing strategy (every 20 minutes to S3) and expected savings ($44K). Configure your training script with the provided SIGTERM handler.

Inference Cost Reduction: You're running inference but want to reduce costs. Inference can't tolerate interruptions the same way training can. The advisor recommends a mixed strategy: 70% spot with 30% on-demand baseline. Expected savings: 35% overall with no service disruption.

Multi-Cloud Arbitrage: GPU pricing varies across providers. Enter all three providers and compare. GCP europe-west4 shows better H100 pricing than AWS us-east-1 this month. The advisor surfaces these opportunities automatically.

Risk Assessment: Your team is nervous about spot. Run the advisor with your exact configuration. Show them the numbers: 13 expected interruptions over two weeks, 2 hours total downtime, $44K savings. Quantified risk is easier to accept than vague concerns.

Technical Foundation

The Spot Instance Advisor is built on:

Pricing Database: Representative spot and on-demand pricing in src/lib/tools/spot/availability.ts covering AWS, GCP, and Azure across major regions and GPU types.

Risk Models: Interruption probability calculations based on historical rates and workload characteristics.

Strategy Engine: Recommendation logic in src/lib/tools/spot/calculate.ts that matches workload type and tolerance to appropriate strategies.

Checkpoint Templates: Framework-specific checkpointing guidance for PyTorch, JAX, and other training frameworks.

From Advisor to Savings

NeoSignal Spot Instance Advisor makes spot accessible to teams who've avoided it due to complexity or risk concerns. The advisor doesn't hide the risks—it quantifies them. You see expected interruptions, estimated downtime, data loss risk. Armed with this information, you can make informed decisions about spot usage.

The output isn't just "use spot in us-east-1." It's a complete strategy: which regions, what checkpointing frequency, what fallback plan, what configuration snippets. The advisor encodes operational knowledge—best practices from teams who've run production GPU workloads on spot—into automated recommendations.

For GPU-heavy workloads, the savings are substantial. A 50% reduction on a $100K cloud bill is $50K back in your budget. The Spot Instance Advisor helps you capture those savings with eyes open to the tradeoffs. That's the NeoSignal approach: complex decisions made accessible through expert knowledge encoded in tools.
