Guide AI Cost Management

The Bedrock Bill Trap: Capping GenAI Costs on AWS in 2026

GenAI spend is forecast to grow 80% in 2026, and most of it lands on the AWS bill before anyone notices. Here is the FinOps playbook for Amazon Bedrock: prompt caching, model laddering, inference profiles, and chargeback that actually works.

May 19, 2026 10 min read By Easy Entropy

Bedrock input cost reduction via caching

Up to 90%

Batch inference discount

50%

Projected 2026 GenAI spend growth

80.8%

SEO Focus Topics

Amazon BedrockGenAI Cost OptimizationFinOps for AIPrompt CachingAWS AI PricingCost Allocation

Key Takeaways

• Prompt caching is the single highest-leverage Bedrock optimization. Most teams have not enabled it.
• Bedrock cost allocation is solvable, but only if you use Application Inference Profiles. Tags alone are not enough.
• Model selection is a cost decision, not just a quality decision. A Haiku-first ladder cuts spend without visibly degrading output.

Why your Bedrock bill is doubling every quarter

GenAI moved from experiment to production line item faster than any service in cloud history. Gartner projects 80.8% growth in GenAI model spending in 2026. For most teams on AWS, that growth lands on the Bedrock line in Cost Explorer, and it lands without the cost-governance scaffolding that took years to build around EC2 and S3.

The pattern shows up in every audit. A team ships a feature backed by Claude or Nova. Usage scales faster than expected. Three months later, finance flags a six-figure annual run rate on a single Bedrock invocation pattern that nobody knew existed. The instinct is to throttle the feature. The right move is to rebuild the cost surface.

This guide is the playbook for that rebuild. It assumes you already have Bedrock in production and the bill is climbing. We will cover the highest-leverage controls first: prompt caching, batch inference, and model laddering. Then the cost allocation work that turns aggregated spend into team-level accountability. Then the waste patterns that show up once you start looking.

Bedrock cost is structurally different from EC2 cost. Commitment programs like Savings Plans and Reserved Instances do not apply to model invocations. This is a workload optimization problem, not a commitment optimization problem.

Prompt caching: the 90% lever most teams have not pulled

Prompt caching is the most underused cost feature in Bedrock. When you mark a portion of a prompt as cacheable, Bedrock saves the model's internal state at that checkpoint. Subsequent requests that share the same prefix skip the recomputation entirely. You pay roughly 10% of the standard input token rate for cached content. That is a 90% reduction on whatever portion of your prompt you can stabilize.

In January 2026, AWS extended the cache TTL from 5 minutes to 1 hour for Claude Sonnet 4.5, Haiku 4.5, and Opus 4.5. This sounds like a small change. It is not. A 5-minute TTL only helped tight conversational loops. A 1-hour TTL turns prompt caching into a default for any RAG system, agent loop, or document QA workload where the same context gets reused multiple times per hour.

A worked example. An enterprise document QA system handles 1,000 queries per day against 10 documents averaging 20,000 tokens each. Each document gets roughly 100 queries. Without caching, you pay full input rate for 20,000 tokens on every query. With prompt caching enabled at the document boundary, the first query writes to cache and the next 99 queries read at the cached rate. At standard Sonnet 4.5 input pricing, that single configuration change saves roughly $20,000 per year on that workload alone.

✓ Caching candidates: system prompts, RAG-retrieved context, few-shot examples, tool definitions, agent instructions, large document contexts.
✓ Caching anti-candidates: the user's message at the end of the prompt, anything that changes per request. Place dynamic content after the cache checkpoint, never before.
✓ Minimum prefix size matters: around 1,024 tokens for Claude 3.5 and 3.7, and 4,096 tokens for the Claude 4.5 generation (Haiku 4.5, Sonnet 4.5, Opus 4.5). Below the threshold the request still succeeds, but the prefix is not cached.
✓ Cache writes cost slightly more than uncached input tokens. Design the prompt so the cacheable prefix is stable; minor changes invalidate the cache.
✓ Use the Converse API for multi-turn conversations. It exposes cache checkpoint placement at the message level.

If you have a Bedrock workload doing more than 1,000 invocations per day and you have not enabled prompt caching, that is almost certainly your first move. Estimated effort: under a day of engineering. Estimated payback: weeks, not quarters.

The five pricing modes, and which to combine

Bedrock offers five distinct pricing modes. Most teams default to on-demand because it is the entry point in the console. The teams getting cost right combine modes based on workload shape.

✓ On-demand: pay per token, no commitment, no setup. Right for unpredictable traffic and prototyping. Wrong for steady workloads at scale.
✓ Batch inference: 50% discount versus on-demand. Right for any workload where latency is not user-facing, such as nightly summarization, document classification, embeddings generation, and evaluation runs. Underused by a wide margin.
✓ Prompt caching: up to 90% reduction on cached input tokens. Right whenever you have a stable prefix and repeated invocations within the TTL window. Combine with any other mode except batch.
✓ Provisioned throughput: reserved model capacity charged hourly. Right for high-volume, predictable production workloads where you need guaranteed latency. Wrong for everything else, including most teams who think they need it.
✓ Model customization: fine-tuning with ongoing storage and per-invocation costs. Right when a smaller fine-tuned model can replace a larger general model on a narrow task. Evaluate against simpler approaches first.

The order of evaluation matters: caching first, batch second, model selection third, provisioned throughput last. Most teams reverse this and shop for provisioned throughput before checking whether caching would have solved the problem.

Model laddering: Haiku, Sonnet, Opus

Model choice is the second-largest cost lever after caching. The current Claude 4.x lineup spans a 5x difference in per-token cost between Haiku 4.5 ($1/$5 per million input/output tokens) and Opus 4.5+ ($5/$25), with Sonnet 4.x sitting in the middle at $3/$15. Most production workloads run on the most expensive model the team had access to during prototyping, not the model that actually fits the task.

A laddering strategy routes each invocation to the cheapest model that meets quality requirements. Classification, extraction, summarization, and most routing tasks run on Haiku without measurable quality loss. Multi-step reasoning, code generation, and nuanced writing benefit from Sonnet. Opus is reserved for genuinely hard problems where the output quality justifies the price gap.

The implementation pattern is simple. Build a router layer in front of Bedrock invocations that selects the model based on task type and complexity heuristics. For RAG systems, route retrieval and reranking through Haiku, generation through Sonnet. For agent loops, use Haiku for tool selection and Sonnet for synthesis. Measure quality on a held-out evaluation set before and after; the cost shift is usually significant and the quality delta is usually invisible.

✓ Audit your current model mix in Cost Explorer by filtering Bedrock by model ID. If more than 20% of invocations hit Opus, there is probably a re-ranking opportunity.
✓ Build a small eval suite of 50 to 100 representative cases before you change anything. You need a quality baseline to defend the cost decision.
✓ Treat model selection as a tunable parameter, not a permanent decision. New Bedrock model versions ship quarterly; the cheapest-acceptable model changes with them.

Application Inference Profiles: solving the attribution problem

The most common Bedrock cost allocation failure mode looks like this: finance sees a single line item for Bedrock InvokeModel totaling $40K per month and has no way to attribute it back to the three product teams sharing the account. Cost Explorer tags do not help because IAM users and roles do not appear at the model invocation level.

Application Inference Profiles solve this. An inference profile is a wrapper around a model that you can tag with team, project, environment, or feature. Invocations made through the profile inherit the tags, which then surface in Cost and Usage Reports and FOCUS exports. Without profiles, you are flying blind. With them, you can build a chargeback model.

AWS Data Exports now expose Bedrock at the operation level. Cost and Usage Reports and FOCUS 1.0 exports include granular operation names like InvokeModelInference and InvokeModelStreamingInference. This means you can attribute model spend not just to a team via profiles, but to specific call patterns within that team. Streaming versus non-streaming invocations show up as separate cost lines.

✓ Create one Application Inference Profile per team, product, or feature. Tag with at least cost-center, team, and environment.
✓ Mandate that all production Bedrock invocations go through a profile. Block direct model invocations via IAM policy if needed.
✓ Wire up Data Exports for FOCUS 1.0 to land Bedrock spend in your data warehouse alongside the rest of your AWS costs. You now have a unified schema for chargeback.
✓ Enable invocation logging to S3 or CloudWatch. Logs give you token-level attribution per request, which closes the gap between profile-level aggregates and per-team chargeback at the unit-economics level.

If you are using Bedrock without Application Inference Profiles, you are reporting a single number to finance and hoping nobody asks where it came from. That works until it does not.

The Bedrock waste patterns nobody talks about

Once you have caching, batching, and laddering in place, the next layer of cost is structural waste. The same patterns appear in nearly every Bedrock account once usage gets above a few thousand dollars per month.

✓ Idle Knowledge Bases. Teams provision Bedrock Knowledge Bases for a proof of concept, the project moves on, the KB keeps charging for vector storage and OpenSearch Serverless capacity. Audit your KBs against actual query traffic. Delete anything with zero invocations in the last 30 days.
✓ Oversized Knowledge Base configurations. Default KB setups provision more OpenSearch Serverless capacity than most workloads need. Right-size based on actual document corpus size and query volume, not the default.
✓ Vector storage on OpenSearch when S3 Vectors would do. S3 Vectors launched at roughly 90% lower cost than OpenSearch Serverless for vector storage and query workloads that tolerate higher query latency. For RAG over historical document corpora, the latency difference is rarely user-visible. Evaluate the swap.
✓ Provisioned throughput sitting at low utilization. Provisioned throughput bills hourly whether you invoke or not. Audit utilization. If you are below 60% on the committed capacity, you are paying for headroom you do not use.
✓ Streaming everywhere. Streaming responses are necessary for chat UIs, unnecessary for backend pipelines. Streaming invocations bill the same per token but consume more downstream compute on the consuming side. Default to non-streaming for non-interactive workloads.
✓ Re-prompting on every page load. Frontend code that fires a fresh Bedrock invocation each time a page renders, instead of caching the result client-side or server-side. The mistake feels small per request and compounds at scale.

A 30-day Bedrock cost-reduction sprint

Most teams can compress this whole agenda into a focused four-week sprint. The order matters: visibility first, then high-leverage controls, then the structural cleanup. Skipping visibility leads to optimization without measurement, which leads to optimization without proof.

✓ Week 1, visibility. Enable Bedrock invocation logging. Stand up Application Inference Profiles for every active workload. Configure AWS Data Exports for FOCUS 1.0 to land Bedrock spend in your warehouse.
✓ Week 1, baseline. Pull 30 days of invocation logs. Group by model, by operation, by profile. Identify the top three workloads by cost. These are your optimization targets.
✓ Week 2, caching. Enable prompt caching on the top three workloads. Move dynamic content after the cache checkpoint. Verify cache hit rates in the response metadata. Measure cost delta after seven days.
✓ Week 2, batching. Identify any workload that runs on user-facing endpoints but does not need user-facing latency. Move it to batch inference. Embedding generation, nightly summaries, classification pipelines, and evaluation runs are typical candidates.
✓ Week 3, laddering. Build a 50-case eval set for your top workload. Test it against Haiku, Sonnet, and Opus. Pick the cheapest model that meets quality. Roll out the change behind a feature flag.
✓ Week 4, cleanup. Audit Knowledge Bases against query traffic. Delete idle KBs. Right-size active ones. Audit provisioned throughput utilization. Release any capacity below 60% sustained utilization. Disable streaming on non-interactive workloads.
✓ End of sprint. Compare baseline cost from week 1 against current cost at end of week 4. Document the delta, the levers used, and the workloads not yet addressed. Schedule a quarterly review.

A sprint like this typically recovers 30–50% of Bedrock spend on the workloads it touches. That is workload-scoped, not bill-scoped: it does not change the rest of your AWS bill. But for teams where Bedrock is the fastest-growing line item, it is the single highest-ROI sprint they will run this year.

Where this fits in the larger FinOps picture

Bedrock cost optimization sits alongside the broader FinOps work, not inside it. Commitment programs, right-sizing, and Reserved Instances optimize the steady-state compute and storage layers. None of those tools reach into model invocations. That gap is why GenAI cost is showing up as a board-level concern in 2026: the existing FinOps muscle does not flex against it.

The teams getting this right treat Bedrock as a distinct cost domain with distinct controls. Prompt caching, model selection, and inference profiles are the levers. Application Inference Profiles and FOCUS exports are the reporting layer. Invocation logging is the audit trail. None of those exist in the EC2 toolbox.

If your AWS bill is growing on the Bedrock line and the existing cost controls are not catching it, that is normal. The cost surface for GenAI is new, and most teams are still building the muscle to manage it. The 30-day sprint above is the starting point. The harder work is making the controls stick: profiles enforced by IAM, evals run on every model change, KB cleanup as a quarterly ritual.

Upload your most recent AWS bill and we will tell you exactly how much of it is Bedrock, where the caching wins are hiding, and which workloads are still on the wrong model. No commitment, no infrastructure access required.

Free Assessment

Want this outcome in your AWS bill?

Get a free cloud cost analysis and a prioritized optimization roadmap.

Request Free Analysis →

Case Study

Case Study: The $198K/Year AWS Bill Hiding Inside "EC2-Other"

A rising "EC2-Other" line turned out to be database backups taking the most expensive possible route to S3, then paying premium storage rates on arrival. Two config changes, zero code, zero downtime: $16,500/month recovered.

Cost Optimization

How We Cut AWS Bills by 35% in 30 Days (Step-by-Step)

A practical, implementation-first guide to lower AWS spend quickly without introducing operational risk.