Module 5

Scaling & Batch Constraints

The Messages API is synchronous and built for the agentic loop: the model takes a turn, you respond, repeat. The Message Batches API is the opposite: asynchronous, single-shot, and 50% cheaper, but stripped of the multi-turn tool-call machinery agents depend on. This module covers the trade-off and the failure-handling pattern that keeps a 100,000-request batch recoverable.

Answer key: Module5_Complete.ipynb

1. Synchronous Loops vs. Asynchronous Batches

Batches are not just "the same API, cheaper." They are a fundamentally different execution model, and using them well means knowing when not to reach for them.

Feature | Synchronous Messages (agentic loop) | Message Batches (async)
Latency | Real-time, per turn | Up to 24h, no guarantees on order
Pricing | Standard | 50% discount on input + output
Multi-turn tool use | Yes, the loop is the point | No, single request only
ZDR-eligible | Yes | No (results stored 29 days)
Use it for | Agents, chats, anything iterative | Fan-out generation, classification, extraction over fixed inputs

The decision rule: if the work needs the model to call a tool, see the result, and reason again, you need synchronous Messages (or the Agent SDK). If you can express the work as N independent prompts whose results you only need eventually, batches are the right tool and you'll pay half.
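The decision rule can be written down as a tiny routing helper. This is only an illustrative sketch: the function name `choose_api` and its two boolean flags are hypothetical and not part of any SDK.

```python
def choose_api(needs_tool_loop: bool, needs_result_now: bool) -> str:
    """Pick an API style for a workload, following the decision rule above."""
    if needs_tool_loop or needs_result_now:
        # Iterative or latency-sensitive work: synchronous Messages (or the Agent SDK).
        return "messages"
    # N independent prompts whose results are only needed eventually.
    return "batches"


print(choose_api(needs_tool_loop=True, needs_result_now=False))   # messages
print(choose_api(needs_tool_loop=False, needs_result_now=False))  # batches
```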

2. The Economics of Scale

  • 50% Discount: All usage in a batch is charged at half the standard API price for both input and output tokens.
  • Throughput: Batches allow significantly higher concurrency than synchronous requests.
  • Latency Trade-off: Most batches complete in under 1 hour; the API guarantees completion within 24 hours.
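To make the 50% discount concrete, here is a rough cost estimator. The per-million-token prices in the usage example are placeholder values chosen for easy arithmetic, not actual list prices, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(
    n_requests: int,
    input_tokens_each: int,
    output_tokens_each: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    batch: bool = False,
) -> float:
    """Estimate total cost in dollars; batch=True applies the 50% batch discount."""
    total_input = n_requests * input_tokens_each
    total_output = n_requests * output_tokens_each
    cost = (
        total_input / 1_000_000 * input_price_per_mtok
        + total_output / 1_000_000 * output_price_per_mtok
    )
    return cost * 0.5 if batch else cost


# Placeholder prices: $3 per million input tokens, $15 per million output tokens.
standard = estimate_cost(100_000, 2_000, 500, 3.0, 15.0)
batched = estimate_cost(100_000, 2_000, 500, 3.0, 15.0, batch=True)
print(standard, batched)  # 1350.0 675.0
```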

3. Batch Constraint: No Multi-Turn Tool Calling

Each request inside a batch is a single, one-shot inference. If Claude emits a tool_use block inside a batched request, there is no second turn: no tool_result can be sent back, and no end_turn follow-up will happen. The result line in the .jsonl simply contains the unanswered tool_use, and the response is effectively unusable.

  • What's safe in a batch: direct generation (drafts, summaries, classifications), extraction with structured outputs, anything that finishes in one model turn.
  • What is not safe in a batch: server-side built-ins that require multiple turns to come back with results (e.g. web_search followed by reasoning), client-side custom tools, the Advisor pattern, hub-and-spoke delegation.

Practical implication: do your tool-driven research with synchronous Messages or the Agent SDK first, then use a batch to fan that research out into 1,000 personalized emails. Don't try to do both inside the batch.
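One way to catch this failure mode after the fact is to scan the parsed result lines for responses that stopped on a tool call. The sketch below assumes each line has already been parsed from the .jsonl into a dict with `custom_id` and `result` keys; the helper name `unanswered_tool_calls` is hypothetical.

```python
def unanswered_tool_calls(result_lines: list[dict]) -> list[str]:
    """Return custom_ids whose response ended on a tool_use nobody can answer."""
    flagged = []
    for line in result_lines:
        result = line["result"]
        if result["type"] != "succeeded":
            continue  # errored / canceled / expired lines carry no message
        if result["message"].get("stop_reason") == "tool_use":
            flagged.append(line["custom_id"])
    return flagged


sample = [
    {"custom_id": "a", "result": {"type": "succeeded",
                                  "message": {"stop_reason": "end_turn", "content": []}}},
    {"custom_id": "b", "result": {"type": "succeeded",
                                  "message": {"stop_reason": "tool_use", "content": []}}},
    {"custom_id": "c", "result": {"type": "expired"}},
]
print(unanswered_tool_calls(sample))  # ['b']
```

Anything this check flags belongs in a synchronous loop rather than in a batch.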

4. Extended Output: 300k Tokens

To generate "book-length" assets such as complete technical guides or comprehensive market reports, use the extended output beta header.

  • Beta Header: Include output-300k-2026-03-24 in your request headers.
  • Capacity: Raises max_tokens to 300,000 tokens per turn on Claude Opus 4.7 and Sonnet 4.6.

ZDR Warning

The Message Batches API is not ZDR-eligible. Inputs and outputs are stored server-side until the batch completes, with results available for download for 29 days. Do not use batch processing for requests containing client PHI if ZDR is required.

5. Batch Lifecycle & custom_id

Results are returned asynchronously and not in submission order. The custom_id is the only way to map outputs back to inputs.

  • Creation: Submit up to 100,000 requests. Each must have a unique custom_id.
  • Tracking: Poll processing_status until it reaches "ended".
  • Retrieval: Results are in .jsonl format, one line per request (succeeded / errored / canceled / expired).
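The creation step can be sketched with plain dictionaries before any SDK call is made. The helpers `build_requests` and `split_into_batches` are hypothetical names; the sketch assumes your source records are keyed by a stable ID that can serve as the custom_id.

```python
MAX_REQUESTS_PER_BATCH = 100_000  # per-batch request limit stated above


def build_requests(prompts_by_id: dict[str, str],
                   model: str = "claude-sonnet-4-6",
                   max_tokens: int = 1024) -> list[dict]:
    """Turn {source_id: prompt} into batch request dicts keyed by custom_id."""
    return [
        {
            "custom_id": source_id,
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for source_id, prompt in prompts_by_id.items()
    ]


def split_into_batches(requests: list[dict],
                       limit: int = MAX_REQUESTS_PER_BATCH) -> list[list[dict]]:
    """Reject duplicate custom_ids, then slice into batches of at most `limit`."""
    ids = [r["custom_id"] for r in requests]
    if len(ids) != len(set(ids)):
        raise ValueError("custom_id values must be unique")
    return [requests[i:i + limit] for i in range(0, len(requests), limit)]


requests = build_requests({"doc-001": "Summarize A.", "doc-002": "Summarize B."})
print([len(b) for b in split_into_batches(requests, limit=1)])  # [1, 1]
```

Each inner list can then be passed as the requests argument of a separate batch creation call.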

6. Implementation Task

Python
import anthropic
from dotenv import load_dotenv
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

load_dotenv()  # reads ANTHROPIC_API_KEY from your .env file

client = anthropic.Anthropic()

# The beta header is an HTTP header, so it is passed on the create call
# (via the SDK's per-call extra_headers option), not inside a request's params.
message_batch = client.messages.batches.create(
    requests=[
        Request(
            custom_id="prospect-fintech-001",  # unique ID maps results back to input
            params=MessageCreateParamsNonStreaming(
                model="claude-sonnet-4-6",
                max_tokens=300000,  # requires the extended output beta header below
                messages=[{"role": "user", "content": "Write a 200-page AI consulting guide for Fintech CTOs."}]
            )
        ),
        Request(
            custom_id="prospect-healthcare-002",
            params=MessageCreateParamsNonStreaming(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=[{"role": "user", "content": "Write a personalized outreach email for a Healthcare CEO."}]
            )
        )
    ],
    extra_headers={"anthropic-beta": "output-300k-2026-03-24"},
)

print(message_batch.id)  # keep this ID: you need it to poll and to fetch results

7. Task: Retrieving and Mapping Batch Results

Because batches are processed asynchronously, your application must poll for completion before results can be accessed. Once complete, results are streamed to handle up to 100,000 responses without memory overflow.

Polling for Completion

A batch's processing_status starts as "in_progress". Poll the retrieve endpoint until it reaches "ended", indicating all requests have finished (succeeded, errored, canceled, or expired).

Python
import time

# Use the ID captured from your creation call
BATCH_ID = message_batch.id

while True:
    status_update = client.messages.batches.retrieve(BATCH_ID)

    if status_update.processing_status == "ended":
        print("Batch processing complete!")
        break

    counts = status_update.request_counts
    print(f"Still processing... (Succeeded: {counts.succeeded}, Errored: {counts.errored})")
    time.sleep(60)  # poll every 60 seconds

Streaming and Mapping Results

Use .results() to stream responses. Results arrive in .jsonl format and are not in submission order, so always use custom_id to map output back to input.

Python
for result in client.messages.batches.results(BATCH_ID):
    request_id = result.custom_id

    if result.result.type == "succeeded":
        content = result.result.message.content[0].text
        print(f"Success for {request_id}: {content[:50]}...")

    elif result.result.type == "errored":
        error_type = result.result.error.error.type
        print(f"Error for {request_id}: {error_type}")

    elif result.result.type == "expired":
        print(f"Request {request_id} timed out (24-hour limit reached).")

    elif result.result.type == "canceled":
        print(f"Request {request_id} was canceled before completion.")

Architect Rules for the Exam
  • Result Retention: Batch results are available for download for 29 days after creation. After that, metadata remains but result files are deleted.
  • Billing Logic: You are only billed for requests that succeed. Errored, expired, and canceled requests are not charged.
  • ID Format: All valid Batch IDs begin with the msgbatch_ prefix.
  • Authentication: The results_url is a protected endpoint; provide your x-api-key even when downloading directly via curl.
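To see the billing rule in practice, you can tally a downloaded results file by outcome type. This is a minimal sketch that assumes the .jsonl text is already in memory; `summarize_results` is a hypothetical helper, and the two-line sample is made up for illustration.

```python
import json
from collections import Counter


def summarize_results(jsonl_text: str) -> Counter:
    """Count result lines by outcome type (succeeded / errored / canceled / expired)."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        counts[json.loads(line)["result"]["type"]] += 1
    return counts


sample_jsonl = "\n".join([
    json.dumps({"custom_id": "a", "result": {"type": "succeeded"}}),
    json.dumps({"custom_id": "b", "result": {"type": "expired"}}),
])
counts = summarize_results(sample_jsonl)
print(f"billable requests: {counts['succeeded']} of {sum(counts.values())}")
```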

8. Handling Failures: Targeted Resubmission by custom_id

In a 100,000-request batch, some failures are inevitable: a few inputs will be longer than the model's context window, a transient error will hit a handful, and others will time out. The correct response is not to resubmit the entire batch, which doubles your bill and re-runs the 99,000 requests that already succeeded. Instead, walk the result stream, collect the failures by their custom_id, fix the underlying issue (chunk the oversized inputs, narrow a query), and resubmit only the failures as a new, much smaller batch.

Python (build a resubmit batch from failures)
# Assumed to exist in your application (not part of the SDK):
#   original_prompts_by_id: dict mapping each custom_id to the prompt you submitted.

def chunk_into_pieces(text: str, max_chars: int) -> list[str]:
    """Naive fixed-width splitter; swap in a token- or section-aware splitter as needed."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# 1. Walk results, separating succeeded from recoverable failures.
to_resubmit = []   # list of (custom_id, original_prompt, error_type)
for result in client.messages.batches.results(BATCH_ID):
    if result.result.type == "succeeded":
        continue
    original_prompt = original_prompts_by_id[result.custom_id]   # your local lookup
    # Errored results carry an API error type; expired/canceled only have the result type.
    err = result.result.error.error.type if result.result.type == "errored" else result.result.type
    to_resubmit.append((result.custom_id, original_prompt, err))

# 2. Repair each failure based on its error type. Most common: context overflow.
def repair(prompt: str, err: str) -> list[str]:
    if err == "invalid_request_error":          # often: input too long
        return chunk_into_pieces(prompt, max_chars=80_000)
    return [prompt]   # transient errors: just resubmit as-is

# 3. Build the resubmit batch. Reuse custom_id (suffix -part-N for chunks)
#    so you can stitch outputs back to the original record.
resubmit_requests = []
for cid, prompt, err in to_resubmit:
    pieces = repair(prompt, err)   # compute once per failure, not once per chunk
    for i, piece in enumerate(pieces):
        resubmit_requests.append(
            Request(
                custom_id=cid if len(pieces) == 1 else f"{cid}-part-{i}",
                params=MessageCreateParamsNonStreaming(
                    model="claude-sonnet-4-6",
                    max_tokens=4096,
                    messages=[{"role": "user", "content": piece}],
                ),
            )
        )

# 4. Submit only the repaired failures as a new, smaller batch.
if resubmit_requests:
    repair_batch = client.messages.batches.create(requests=resubmit_requests)
    print(f"Resubmitted {len(resubmit_requests)} requests as {repair_batch.id}")

Lab checkpoint: download or stream the .jsonl results, identify every failure by custom_id, fix the underlying cause before resubmission, and create a new batch containing only those repaired IDs. For context-limit failures, demonstrate chunking the oversized source document and suffixing the original custom_id with -part-N.

Architect Tip for the Exam

Two things to internalize: (1) custom_id is your only handle on a failed request. Never randomize it; derive it from your source-of-truth ID so you can look the original input back up. (2) Suffix the custom_id when you split one request into chunks (doc-001-part-0, doc-001-part-1) so the merge step downstream stays unambiguous.
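The merge step mentioned in (2) might look like the sketch below. It assumes you have already collected output text per custom_id into a dict and that chunked IDs follow the -part-N suffix convention; `stitch_parts` is a hypothetical helper, and joining chunks with a newline is just one possible choice.

```python
import re
from collections import defaultdict

# Matches IDs like "doc-001-part-3" and captures the base ID and the chunk index.
PART_SUFFIX = re.compile(r"^(?P<base>.+)-part-(?P<index>\d+)$")


def stitch_parts(outputs_by_custom_id: dict[str, str]) -> dict[str, str]:
    """Group chunk outputs under their original ID, ordered by numeric part index."""
    grouped = defaultdict(list)
    for custom_id, text in outputs_by_custom_id.items():
        match = PART_SUFFIX.match(custom_id)
        if match:
            grouped[match["base"]].append((int(match["index"]), text))
        else:
            grouped[custom_id].append((0, text))  # unsplit request
    return {
        base: "\n".join(text for _, text in sorted(parts, key=lambda p: p[0]))
        for base, parts in grouped.items()
    }


merged = stitch_parts({"doc-001-part-1": "B", "doc-001-part-0": "A", "doc-002": "C"})
print(merged)  # {'doc-001': 'A\nB', 'doc-002': 'C'}
```

Sorting on the integer index rather than the raw string keeps part-10 after part-2.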

9. Stacking Discounts with Prompt Caching

Add cache_control blocks to identical prefix content (system prompt, shared context) across all requests in the batch. Cache hits are provided on a best-effort basis, so include the same breakpoints in every request to maximize hit rates (typically 30% to 98%).
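As a rough illustration, the shared prefix can be built once and attached to every request so that the breakpoint is identical across the batch. The helper name `build_cached_requests` is hypothetical; the request dicts below are plain data that would be passed as the requests of a batch creation call.

```python
def build_cached_requests(shared_system_prompt: str,
                          prompts_by_id: dict[str, str],
                          model: str = "claude-sonnet-4-6",
                          max_tokens: int = 1024) -> list[dict]:
    """Build batch requests that all share one cache-marked system prefix."""
    # Same text and same cache_control breakpoint in every request.
    system = [{
        "type": "text",
        "text": shared_system_prompt,
        "cache_control": {"type": "ephemeral"},
    }]
    return [
        {
            "custom_id": custom_id,
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "system": system,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for custom_id, prompt in prompts_by_id.items()
    ]


cached = build_cached_requests(
    "You are an analyst. <long shared context here>",
    {"doc-001": "Classify record 1.", "doc-002": "Classify record 2."},
)
print(len(cached))  # 2
```

Only the per-request user message varies; everything before the breakpoint stays byte-identical, which is what gives the cache a chance to hit.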

Lab Exercise: Scaled Extraction & Batch Recovery

Self-driven lab: Module5_Self_Driven_Lab.ipynb

Objective: manage 50% cost-saving asynchronous workloads and implement targeted failure recovery.

  1. Batch submission: prepare a .jsonl file with 10 requests, each including a unique custom_id. Submit the batch and poll for the "ended" status.
  2. Single-turn constraint: intentionally include a tool-calling loop in a batch request. Observe why the resulting output is unusable for agentic work.
  3. Extended output: use the output-300k-2026-03-24 header to generate a long-form intelligence report from a batch.
  4. Targeted recovery: simulate a batch where 2 of the 10 requests fail, such as from context overflow. Walk the result stream, identify failures by custom_id, and generate a new, smaller batch for only those failed items.