You're running inference at scale. Should you call an API or self-host on your own GPUs? The answer seems simple until you start running the numbers. API pricing looks cheap at $0.001 per request, until you're making 10 million requests per month. Self-hosting looks expensive at $48,800/month for an H100, until you realize that cost is flat no matter how many tokens you push through the card (up to its capacity). The real question: where's your break-even point?
*NeoSignal TCO Calculator comparing API providers and self-hosted costs*
NeoSignal TCO Calculator answers that question with real pricing data. Configure your workload: 100K requests per month, 1000 input tokens, 500 output tokens, 70B model size. The calculator instantly compares Together AI ($135/mo), OpenAI ($750/mo), Anthropic ($1.1K/mo), and self-hosted on H100 ($48.8K/mo). It shows cost per request, cost per 1K tokens, latency ratings, and scalability scores. The recommendation is clear: Together AI offers the best balance of cost and convenience at this volume.
The benefit: you make infrastructure decisions with real numbers, not gut feel. The chat panel provides immediate context—ask "At what scale should I reconsider self-hosting?" and get a data-driven answer based on your exact configuration.
Detailed Walkthrough
The Build vs Buy Decision
Every AI team faces the same question: call an API or run your own inference infrastructure? APIs offer simplicity—no GPUs to manage, no engineers to hire, just HTTP requests. Self-hosting offers control—dedicated capacity, data privacy, predictable costs at scale. The right choice depends entirely on your numbers.
Most teams get this wrong because they compare the wrong metrics. They look at API pricing per token without modeling their actual usage patterns. They estimate self-hosting costs without accounting for ops overhead, GPU utilization, and networking. NeoSignal TCO Calculator gives you the complete picture.
How the Calculator Works
The TCO Calculator operates on three interconnected systems: workload configuration, provider pricing, and cost comparison engine.
Workload Configuration captures your inference demand:
- Monthly request volume (slider from 1K to 10M)
- Average input tokens per request (typical: 500-2000)
- Average output tokens per request (typical: 100-1000)
- Model size class (7B, 13B, 34B, 70B, 405B)
These inputs determine your total token throughput. A workload of 100K requests/month with 1000 input + 500 output tokens generates 150M tokens/month. That's your baseline for comparison.
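As a quick sanity check, the throughput math in Python (a minimal sketch; the function name is illustrative, not the calculator's internals):

```python
def monthly_tokens(requests: int, input_tokens: int, output_tokens: int) -> int:
    """Total tokens processed per month for a given workload."""
    return requests * (input_tokens + output_tokens)

# 100K requests/month at 1000 input + 500 output tokens per request
print(monthly_tokens(100_000, 1000, 500))  # 150000000 -> 150M tokens/month
```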
Provider Pricing draws from NeoSignal's continuously updated pricing database:
| Provider | Type | Per 1K Input Tokens | Per 1K Output Tokens |
|---|---|---|---|
| Together AI | API | $0.0009 | $0.0009 |
| OpenAI | API | $0.0025 | $0.0100 |
| Anthropic | API | $0.0030 | $0.0150 |
| Groq | API | $0.0005 | $0.0010 |
For self-hosted, the calculator models cloud GPU costs:
| Configuration | Monthly Cost | Max Throughput |
|---|---|---|
| H100 SXM (AWS) | $48,800 | ~50M tokens/hour |
| A100 80GB (AWS) | $31,200 | ~20M tokens/hour |
| L40S (Lambda) | $7,200 | ~10M tokens/hour |
Cost Comparison Engine calculates total cost and unit economics for each option:
API Cost = (input_tokens × input_price) + (output_tokens × output_price)
Self-Hosted Cost = GPU_cost + ops_overhead + network_cost + storage_cost
The ops overhead factor accounts for the engineering time to maintain self-hosted infrastructure. Most teams underestimate this—MLOps, monitoring, scaling, failover, security updates. The calculator includes a 15% overhead by default, adjustable in advanced settings.
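Both formulas are easy to reproduce. A minimal sketch in Python, using the pricing constants from the tables above (the function names and the flat ops line item are illustrative assumptions, not NeoSignal's actual implementation):

```python
def api_cost(requests: int, in_tok: int, out_tok: int,
             price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Monthly API spend: token volume times per-1K-token pricing."""
    return requests * (in_tok / 1000 * price_in_per_1k
                       + out_tok / 1000 * price_out_per_1k)

def self_hosted_cost(gpu_cost: float, ops_cost: float,
                     network_cost: float = 0.0, storage_cost: float = 0.0) -> float:
    """Monthly self-hosted spend: GPU rental plus overhead line items."""
    return gpu_cost + ops_cost + network_cost + storage_cost

# Together AI at 100K requests, 1000 input / 500 output tokens -> $135/mo
print(api_cost(100_000, 1000, 500, 0.0009, 0.0009))  # 135.0
# H100 on AWS, using the cost breakdown shown below -> $48,800/mo
print(self_hosted_cost(47_189, 1_500, 0, 111))       # 48800
```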
The Comparison View
NeoSignal displays providers as cards with key metrics:
- Monthly Cost: Total spend for your configured workload
- Per Request: Cost divided by request count
- Per 1K Tokens: Normalized token pricing
- Latency Rating: Dot indicators from low to high
- Scalability Rating: Dot indicators for burst capacity
Cards are sorted by total monthly cost. The recommended option gets a green badge. At 100K requests/month with a 70B model, Together AI shows $135/mo versus OpenAI at $750/mo—a 5.6x difference for comparable quality.
The self-hosted card shows cost breakdown: compute ($47,189), operations ($1,500), network ($0), storage ($111). This reveals where money actually goes—compute dominates at 96% of total cost.
Latency and Scalability Tradeoffs
Cost isn't everything. The calculator shows qualitative ratings for two critical dimensions:
Latency measures time-to-first-token and tokens-per-second:
- APIs: Variable, depends on provider load and queuing
- Self-hosted: Consistent, you control the queue
Together AI shows 7/10 latency dots. OpenAI shows 8/10. Self-hosted shows 10/10—you're the only user on your GPU.
Scalability measures ability to handle traffic spikes:
- APIs: Excellent, providers have massive capacity pools
- Self-hosted: Limited by your provisioned GPUs
OpenAI shows 10/10 scalability. Self-hosted shows 3/10—scaling requires provisioning more GPUs, which takes hours to days.
The Self-Hosted Configuration Panel
Click "Self-Hosted Configuration" to customize infrastructure assumptions:
Cloud Provider: AWS, GCP, Azure, Lambda Labs, CoreWeave. Pricing varies significantly—CoreWeave H100s run 40% cheaper than AWS but with less global availability.
GPU Type: H100 SXM, H100 NVL, A100 80GB, A100 40GB, L40S. The calculator adjusts both cost and throughput capacity.
Expected Utilization: Default 70%. High utilization (90%+) reduces cost-per-token but risks latency spikes. Low utilization (50%) provides headroom but wastes capacity.
Advanced Settings expose additional cost factors:
- Ops overhead percentage (default: 15%)
- Network egress cost per GB
- Storage cost for model weights and logs
- Reserved instance discount (1-year, 3-year)
With reserved instances, self-hosted costs drop dramatically. A 3-year reserved H100 on AWS costs ~$25K/mo versus $48K/mo on-demand—nearly 50% savings for committed capacity.
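To see how these two settings interact, here is a minimal sketch of effective cost per million tokens, assuming roughly 730 hours in a month and the throughput figures from the table above (the function shape and the ~49% discount are illustrative, not the calculator's internal model):

```python
def cost_per_million_tokens(monthly_cost: float, tokens_per_hour: float,
                            utilization: float = 0.70,
                            reserved_discount: float = 0.0) -> float:
    """Effective $/1M tokens for a self-hosted GPU at a given utilization."""
    effective_cost = monthly_cost * (1 - reserved_discount)
    monthly_capacity = tokens_per_hour * utilization * 730  # ~hours per month
    return effective_cost / monthly_capacity * 1_000_000

# On-demand H100 (~50M tokens/hour) at the default 70% utilization
print(cost_per_million_tokens(48_800, 50_000_000))              # ~$1.91
# 3-year reserved (~49% discount) pushed to 90% utilization
print(cost_per_million_tokens(48_800, 50_000_000, 0.90, 0.49))  # ~$0.76
```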
Break-Even Analysis
The most valuable insight is break-even volume. At what scale does self-hosting become cheaper?
For a 70B model comparing Together AI vs H100 self-hosted:
- Together AI: $0.00135 per request
- Self-hosted: $48,800/mo flat, so cost per request falls as volume grows (up to GPU capacity)
At Together AI pricing, break-even occurs around 36M requests/month. Below that, use the API. Above that, self-host.
But this varies dramatically by provider and model:
- OpenAI 70B equivalent: Break-even at ~6.5M requests/month
- Anthropic Claude: Break-even at ~4.6M requests/month
The calculator visualizes this with cost curves. You see exactly where the lines cross for your specific configuration.
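The arithmetic behind those curves is simple enough to check by hand. A minimal sketch, assuming the per-request prices derived above (the function name is illustrative):

```python
def break_even_requests(self_hosted_monthly: float, api_cost_per_request: float) -> float:
    """Monthly request volume where self-hosting matches the API bill."""
    return self_hosted_monthly / api_cost_per_request

H100_MONTHLY = 48_800
per_request = {"Together AI": 0.00135, "OpenAI": 0.0075, "Anthropic": 0.0105}
for provider, price in per_request.items():
    volume = break_even_requests(H100_MONTHLY, price)
    print(f"{provider}: ~{volume / 1e6:.1f}M requests/month")
# Together AI: ~36.1M, OpenAI: ~6.5M, Anthropic: ~4.6M
```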
Real-World Decision Scenarios
Startup with uncertain demand: You're launching a new product with unpredictable traffic. Start with APIs (Groq or Together AI for cost efficiency). The calculator shows $135/mo at 100K requests versus $48,800/mo self-hosted. Don't commit to GPUs until you understand your traffic patterns.
Scale-up with predictable load: You've hit 5M requests/month and traffic is stable. Run the calculator: at this volume, self-hosted H100 costs $48,800/mo while Together AI costs $6,750/mo. Still well short of break-even against the cheapest API, but self-hosting already beats OpenAI-class pricing above ~6.5M requests/month; against Together AI, the lines don't cross until roughly 36M.
Enterprise with data privacy requirements: Regardless of cost, you can't send customer data to third-party APIs. The calculator helps you optimize self-hosted configuration—which GPU, which cloud, what utilization target. Use it to minimize cost within the self-hosted constraint.
Hybrid architecture: Some requests need low latency (self-hosted), others can tolerate queue times (API overflow). The calculator helps model both legs. Configure your baseline self-hosted capacity, then add API costs for burst traffic above your GPU throughput.
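Here is a minimal sketch of that hybrid split, assuming a hard monthly token capacity for the GPU and a single blended API rate (both simplifications; NeoSignal's actual blending logic may differ):

```python
def hybrid_monthly_cost(requests: int, tokens_per_request: int,
                        gpu_monthly: float, gpu_capacity_tokens: float,
                        api_price_per_1k: float) -> float:
    """Self-hosted baseline plus API overflow above GPU capacity."""
    total_tokens = requests * tokens_per_request
    overflow = max(0.0, total_tokens - gpu_capacity_tokens)
    return gpu_monthly + overflow / 1000 * api_price_per_1k

# 60M requests/mo at 1500 tokens each; one H100 at 70% utilization
# covers ~25.5B tokens/mo, and Together AI absorbs the overflow
print(hybrid_monthly_cost(60_000_000, 1500, 48_800, 25.5e9, 0.0009))  # ~$106,850
```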
Provider Characteristics
Beyond cost, each provider has distinct characteristics:
Together AI: Best price-performance for open-weight models. Strong Llama and Mixtral support. Good for cost-sensitive production workloads.
OpenAI: Highest capability models (GPT-4o, o1). Premium pricing reflects quality. Best for complex reasoning tasks where model capability matters more than cost.
Anthropic: Strong safety and instruction-following. Claude models excel at nuanced tasks. Pricing between Together AI and OpenAI.
Groq: Fastest inference—custom hardware delivers sub-100ms latency. Limited model selection but unbeatable speed.
Self-Hosted: Maximum control and data privacy. Requires MLOps expertise. Best for high-volume workloads with predictable traffic.
Chat Integration
The TCO Calculator integrates with NeoSignal AI chat. With a calculation active, ask:
- "At what scale should I reconsider self-hosting?" — Gets analysis based on your current configuration
- "What hidden costs should I factor in?" — Surfaces ops overhead, network egress, monitoring costs
- "How would reserved instances change this analysis?" — Re-runs calculation with commitment discounts
- "Which provider has the best latency for my use case?" — Compares latency characteristics beyond just cost
The chat understands your configuration context. Answers reference your specific request volume, token counts, and model size rather than generic advice.
Saving and Sharing Configurations
Save your TCO calculation as an artifact. Share the URL with your team for budget discussions. Load saved artifacts to continue analysis—useful when iterating on infrastructure decisions over time.
Artifacts capture all inputs: workload configuration, provider selections, advanced settings. When loaded, the calculation re-runs with current pricing data. If provider costs have changed since you saved, you get updated numbers.
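As a hypothetical illustration of what such an artifact might capture (this schema is invented for the example; it is not NeoSignal's actual format):

```python
import json

saved_artifact = {
    "workload": {"requests_per_month": 100_000, "input_tokens": 1000,
                 "output_tokens": 500, "model_size": "70B"},
    "providers": ["together_ai", "openai", "anthropic", "self_hosted"],
    "self_hosted": {"cloud": "aws", "gpu": "h100_sxm", "utilization": 0.70},
    "advanced": {"ops_overhead": 0.15, "reserved_term_years": None},
}
# Prices are not stored; they re-resolve against current data at load time.
print(json.dumps(saved_artifact, indent=2))
```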
From Calculation to Decision
The build-vs-buy decision for AI inference is fundamentally a math problem—but one with many variables that most teams struggle to model accurately. NeoSignal TCO Calculator gives you the complete picture: real provider pricing, realistic self-hosted costs including ops overhead, break-even analysis, and qualitative factors like latency and scalability.
Use it before committing to infrastructure. Run scenarios for different growth trajectories. Understand exactly where your break-even point lies. Then make the decision with confidence that you've done the analysis correctly.
That's the NeoSignal approach: take complex infrastructure decisions and make them approachable through precise calculations and real data. TCO Calculator joins Memory Calculator, Serving Engine Advisor, and Spot Instance Advisor in the NeoSignal tools suite—each tackling a different facet of AI infrastructure planning.