Module 5

Scaling & Batch Constraints

The Messages API is synchronous and built for the agentic loop: the model takes a turn, you respond, repeat. The Message Batches API is the opposite: asynchronous, single-shot, and 50% cheaper, but stripped of the multi-turn tool-call machinery agents depend on. This module covers the trade-off and the failure-handling pattern that keeps a 100,000-request batch recoverable.

Answer key Module5_Complete.ipynb

1. Synchronous Loops vs. Asynchronous Batches

Batches are not just "the same API, cheaper." They are a fundamentally different execution model, and using them well means knowing when not to reach for them.

	Synchronous Messages (agentic loop)	Message Batches (async)
Latency	Real-time, per turn	Up to 24h, no guarantees on order
Pricing	Standard	50% discount on input + output
Multi-turn tool use	Yes, the loop is the point	No, single request only
Use it for	Agents, chats, anything iterative	Fan-out generation, classification, extraction over fixed inputs

The decision rule: if the work needs the model to call a tool, see the result, and reason again, you need synchronous Messages (or the Agent SDK). If you can express the work as N independent prompts whose results you only need eventually, batches are the right tool and you'll pay half.
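
As a minimal sketch, the decision rule can be written as a two-question check. The function name and its two boolean parameters are hypothetical, chosen only for illustration; they are not part of any SDK.

```python
def choose_api(needs_tool_loop: bool, someone_is_waiting: bool) -> str:
    """Return "messages" (synchronous) or "batches" (asynchronous)."""
    # A tool loop needs a second turn, which a batched request cannot have.
    if needs_tool_loop:
        return "messages"
    # Blocking work cannot tolerate a result that may take up to 24 hours.
    if someone_is_waiting:
        return "messages"
    # N independent prompts whose results are only needed eventually.
    return "batches"


print(choose_api(needs_tool_loop=True, someone_is_waiting=False))   # messages
print(choose_api(needs_tool_loop=False, someone_is_waiting=False))  # batches
```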

2. The Economics of Scale

50% Discount: All usage in a batch is charged at half the standard API price for both input and output tokens.
Throughput: Batches allow significantly higher concurrency than synchronous requests.
Latency Trade-off: Most batches complete in under 1 hour, but there is no latency SLA. A batch is guaranteed to end within 24 hours, not to complete: any requests still unfinished at the 24-hour mark expire unprocessed (see the expired result type in section 6). Never batch anything a user or pipeline is actively waiting on.

The workload-matching rule: use the synchronous API for blocking workflows, meaning work someone is waiting on right now, such as a pre-merge code check a developer needs before they can ship. Use batches for latency-tolerant workflows, where results are only needed eventually: overnight reports, weekly audits, nightly test generation.
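
To make the 50% discount concrete, here is a rough cost comparison. The per-million-token prices and the token counts below are placeholder numbers picked for easy arithmetic, not actual pricing; substitute the current rates for your model.

```python
def compare_costs(n_requests, input_tokens, output_tokens,
                  input_price_per_mtok, output_price_per_mtok):
    """Return (standard_cost, batch_cost) in dollars for a uniform workload."""
    per_request = (input_tokens * input_price_per_mtok
                   + output_tokens * output_price_per_mtok) / 1_000_000
    standard = n_requests * per_request
    batch = standard * 0.5  # batches bill input and output at half price
    return standard, batch


# Placeholder workload: 100,000 requests, 2,000 input + 500 output tokens each,
# at hypothetical prices of $3 (input) and $15 (output) per million tokens.
standard, batch = compare_costs(100_000, 2_000, 500, 3.0, 15.0)
print(f"standard: ${standard:,.2f}")  # standard: $1,350.00
print(f"batch:    ${batch:,.2f}")     # batch:    $675.00
```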

3. Batch Constraint: No Multi-Turn Tool Calling

Each request inside a batch is a single, one-shot inference. If Claude emits a tool_use block inside a batched request, there is no second turn: no tool_result can be sent back and no end_turn follow-up will happen. The result line in the .jsonl simply contains the unanswered tool_use, and the response is effectively unusable.

What's safe in a batch: direct generation (drafts, summaries, classifications), extraction with structured outputs, anything that finishes in one model turn.
What is not safe in a batch: anything that needs a second turn, such as an agent that must execute a client-side tool mid-request and feed the result back for further reasoning, or a hub-and-spoke coordinator that has to wait on subagent output before it can continue.

Practical implication: do your tool-driven research with synchronous Messages or the Agent SDK first, then use a batch to fan that research out into 1,000 personalized emails. Don't try to do both inside the batch.
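
One way to catch this constraint in practice is to flag any batched response that stopped because it wanted a tool. The sketch below assumes the response object exposes a stop_reason attribute whose value is "tool_use" in that case; the helper name is hypothetical, and the SimpleNamespace objects are stand-ins for real responses.

```python
from types import SimpleNamespace


def is_dangling_tool_call(message) -> bool:
    """True when a response ended by requesting a tool it can never receive."""
    return getattr(message, "stop_reason", None) == "tool_use"


# Stand-in objects; in real code these would be result.result.message.
finished = SimpleNamespace(stop_reason="end_turn")
dangling = SimpleNamespace(stop_reason="tool_use")

print(is_dangling_tool_call(finished))  # False
print(is_dangling_tool_call(dangling))  # True
```

A request flagged this way belongs in the synchronous loop, not in a resubmitted batch.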

4. Batch Lifecycle & `custom_id`

Results are returned asynchronously and not in submission order. The custom_id is the only way to map outputs back to inputs.

Creation: Submit up to 100,000 requests or 256 MB per batch, whichever limit is reached first. Each request must have a unique custom_id.
Tracking: Poll processing_status until it reaches "ended".
Retrieval: Results are in .jsonl format, one line per request (succeeded / errored / canceled / expired).
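
Because custom_id is the only join key, it helps to derive it from your own record IDs and to verify uniqueness before submitting. The sketch below assumes each source record is a dict with "id" and "prompt" keys and uses a "prospect-" prefix; all of these names are hypothetical.

```python
from collections import Counter


def build_prompt_index(records, prefix="prospect"):
    """Map custom_id -> prompt, refusing duplicate IDs before submission."""
    ids = [f"{prefix}-{record['id']}" for record in records]
    duplicates = sorted(cid for cid, n in Counter(ids).items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate custom_id values: {duplicates}")
    return {cid: record["prompt"] for cid, record in zip(ids, records)}


records = [
    {"id": "fintech-001", "prompt": "Write a one-pager for Fintech CTOs."},
    {"id": "healthcare-002", "prompt": "Write an email for a Healthcare CEO."},
]
index = build_prompt_index(records)
print(list(index))  # ['prospect-fintech-001', 'prospect-healthcare-002']
```

Keeping this index around also gives you the lookup you need later to find the original input for any failed request.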

5. Implementation Task

Python

import anthropic
from dotenv import load_dotenv
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

load_dotenv()  # reads ANTHROPIC_API_KEY from your .env file

client = anthropic.Anthropic()

message_batch = client.messages.batches.create(
    requests=[
        Request(
            custom_id="prospect-fintech-001",  # unique ID maps results back to input
            params=MessageCreateParamsNonStreaming(
                model="claude-sonnet-4-6",
                max_tokens=4096,
                messages=[{"role": "user", "content": "Write an AI consulting one-pager for Fintech CTOs."}]
            )
        ),
        Request(
            custom_id="prospect-healthcare-002",
            params=MessageCreateParamsNonStreaming(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=[{"role": "user", "content": "Write a personalized outreach email for a Healthcare CEO."}]
            )
        )
    ]
)
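
For more than a couple of requests you would normally generate the list rather than write it by hand. The sketch below builds requests as plain dictionaries with the same shape as the typed Request objects above (an assumption: that the SDK accepts plain dicts of that shape) and slices them so no batch exceeds 100,000 requests. Both helper names are hypothetical.

```python
def build_requests(prompts_by_id, model="claude-sonnet-4-6", max_tokens=1024):
    """Turn {custom_id: prompt} into plain-dict batch requests."""
    return [
        {
            "custom_id": cid,
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for cid, prompt in prompts_by_id.items()
    ]


def split_into_batches(requests, max_per_batch=100_000):
    """Slice a request list so no single batch exceeds the request cap."""
    return [
        requests[i : i + max_per_batch]
        for i in range(0, len(requests), max_per_batch)
    ]


requests = build_requests({"prospect-fintech-001": "Write a one-pager."})
print(requests[0]["custom_id"])  # prospect-fintech-001
```

Each slice returned by split_into_batches would then be passed to its own client.messages.batches.create call.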

6. Task: Retrieving and Mapping Batch Results

Because batches are processed asynchronously, your application must poll for completion before results can be accessed. Once complete, results are streamed to handle up to 100,000 responses without memory overflow.

Polling for Completion

A batch's processing_status starts as "in_progress". Poll the retrieve endpoint until it reaches "ended", which indicates that every request has reached a final state (succeeded, errored, canceled, or expired).

Python

import time

# Use the ID captured from your creation call
BATCH_ID = message_batch.id

while True:
    status_update = client.messages.batches.retrieve(BATCH_ID)

    if status_update.processing_status == "ended":
        print("Batch processing complete!")
        break

    counts = status_update.request_counts
    print(f"Still processing... (Succeeded: {counts.succeeded}, Errored: {counts.errored})")
    time.sleep(60)  # poll every 60 seconds
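
The loop above runs until the batch ends. A variation worth considering is a polling helper with an overall timeout, so a stuck job cannot block your pipeline forever. This is a sketch, not an SDK feature: the function name, its parameters, and the default timeout (24 hours plus a 10-minute margin) are assumptions. The retrieve callable is injected so the helper can be exercised without network access.

```python
import time


def wait_until_ended(retrieve, batch_id, poll_seconds=60,
                     timeout_seconds=24 * 60 * 60 + 600,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll retrieve(batch_id) until processing_status is "ended".

    Raises TimeoutError if the batch has not ended within timeout_seconds.
    """
    deadline = clock() + timeout_seconds
    while True:
        batch = retrieve(batch_id)
        if batch.processing_status == "ended":
            return batch
        if clock() >= deadline:
            raise TimeoutError(f"batch {batch_id} did not end in time")
        sleep(poll_seconds)


# Intended usage with the client from the creation step:
#   final = wait_until_ended(client.messages.batches.retrieve, message_batch.id)
```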

Streaming and Mapping Results

Use .results() to stream responses. Results arrive in .jsonl format and are not in submission order, so always use custom_id to map each output back to its input.

Python

for result in client.messages.batches.results(BATCH_ID):
    request_id = result.custom_id

    if result.result.type == "succeeded":
        content = result.result.message.content[0].text
        print(f"Success for {request_id}: {content[:50]}...")

    elif result.result.type == "errored":
        error_type = result.result.error.error.type
        print(f"Error for {request_id}: {error_type}")

    elif result.result.type == "expired":
        print(f"Request {request_id} timed out (24-hour limit reached).")

    elif result.result.type == "canceled":
        print(f"Request {request_id} was canceled before completion.")
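
If downstream code expects outputs in the original submission order, you can rebuild that order from custom_id. The sketch below assumes each streamed entry has the attributes used in the loop above (custom_id, result.type, result.message.content[0].text); the helper name is hypothetical and the SimpleNamespace objects stand in for real results.

```python
from types import SimpleNamespace


def texts_in_input_order(results, input_ids):
    """Return [(custom_id, text_or_None)] following the order of input_ids."""
    by_id = {}
    for entry in results:
        if entry.result.type == "succeeded":
            by_id[entry.custom_id] = entry.result.message.content[0].text
    # Requests that did not succeed map to None so gaps stay visible.
    return [(cid, by_id.get(cid)) for cid in input_ids]


def _fake_success(custom_id, text):
    message = SimpleNamespace(content=[SimpleNamespace(text=text)])
    result = SimpleNamespace(type="succeeded", message=message)
    return SimpleNamespace(custom_id=custom_id, result=result)


# Results arrive out of order; one request expired.
stream = [
    _fake_success("doc-002", "second"),
    SimpleNamespace(custom_id="doc-003", result=SimpleNamespace(type="expired")),
    _fake_success("doc-001", "first"),
]
print(texts_in_input_order(stream, ["doc-001", "doc-002", "doc-003"]))
# [('doc-001', 'first'), ('doc-002', 'second'), ('doc-003', None)]
```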

Architect Rules for the Exam

50% Discount: Batch usage is charged at half the standard API price for both input and output tokens.
24-Hour Window: A batch always ends within 24 hours, but individual requests are not guaranteed to complete: unfinished requests expire.
Correlation: Results are returned unordered; the custom_id is your only way to map an output back to its input.
No Multi-Turn Tool Calling: Each batched request is a single model turn; there is no way to return a tool_result.
Recovery: Resubmit only the failed custom_ids as a new batch, never the whole batch.

7. Handling Failures: Targeted Resubmission by `custom_id`

In a 100,000-request batch, some failures are inevitable: a few inputs will be longer than the model's context window, a transient error will hit a handful, and others will time out. The correct response is not to resubmit the entire batch: that doubles your bill and re-runs the 99,000 requests that already succeeded. Instead, walk the result stream, collect the failures by their custom_id, fix the underlying issue (chunk the oversized inputs, narrow a query), and resubmit only the failures as a new, much smaller batch.

Python (build a resubmit batch from failures)

# 1. Walk results, separating succeeded from recoverable failures.
to_resubmit = []   # list of (custom_id, original_prompt, error_type)
for result in client.messages.batches.results(BATCH_ID):
    if result.result.type == "succeeded":
        continue
    original_prompt = original_prompts_by_id[result.custom_id]   # your local lookup
    err = result.result.error.error.type if result.result.type == "errored" else result.result.type
    to_resubmit.append((result.custom_id, original_prompt, err))

# 2. Repair each failure based on its error type. Most common: context overflow.
#    chunk_into_pieces is a helper you supply; it is not part of the SDK.
def repair(prompt: str, err: str) -> list[str]:
    if err == "invalid_request_error":          # often: input too long
        return chunk_into_pieces(prompt, max_chars=80_000)
    return [prompt]   # transient errors, expired, canceled: resubmit as-is

# 3. Build the resubmit batch. Reuse custom_id (suffix -part-N for chunks)
#    so you can stitch outputs back to the original record.
resubmit_requests = []
for cid, prompt, err in to_resubmit:
    pieces = repair(prompt, err)   # repair once per failure, not once per piece
    for i, piece in enumerate(pieces):
        resubmit_requests.append(
            Request(
                custom_id=cid if len(pieces) == 1 else f"{cid}-part-{i}",
                params=MessageCreateParamsNonStreaming(
                    model="claude-sonnet-4-6",
                    max_tokens=4096,
                    messages=[{"role": "user", "content": piece}],
                ),
            )
        )

if resubmit_requests:
    repair_batch = client.messages.batches.create(requests=resubmit_requests)
    print(f"Resubmitted {len(resubmit_requests)} requests as {repair_batch.id}")
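
The code above calls chunk_into_pieces without defining it. One possible implementation, offered only as a sketch, splits greedily on blank lines and hard-splits any single paragraph that is still too long. Character count is used as a rough proxy for tokens, which is an assumption; a token-based limit would be more precise.

```python
def chunk_into_pieces(text: str, max_chars: int = 80_000) -> list[str]:
    """Split text on blank lines so that no piece exceeds max_chars."""
    pieces, current = [], ""
    for paragraph in text.split("\n\n"):
        # A paragraph longer than the limit is cut at fixed offsets.
        while len(paragraph) > max_chars:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate      # still fits: keep accumulating
        else:
            pieces.append(current)   # flush and start a new piece
            current = paragraph
    if current:
        pieces.append(current)
    return pieces


print(chunk_into_pieces("aaaa\n\nbbbb\n\ncccc", max_chars=10))
# ['aaaa\n\nbbbb', 'cccc']
```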

Lab checkpoint: download or stream the .jsonl results, identify every failure by custom_id, fix the underlying cause before resubmission, and create a new batch containing only those repaired IDs. For context-limit failures, demonstrate chunking the oversized source document and suffixing the original custom_id with -part-N.

Architect Tip for the Exam

Two things to internalize: (1) custom_id is your only handle on a failed request. Never randomize it; derive it from your source-of-truth ID so you can look the original input back up. (2) Suffix the custom_id when you split one request into chunks (doc-001-part-0, doc-001-part-1) so the merge step downstream stays unambiguous.
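
The merge step for suffixed IDs might look like the following sketch. It assumes you have already collected outputs into a dict keyed by custom_id and that chunks use the -part-N suffix; the function name and the newline join are illustrative choices.

```python
import re
from collections import defaultdict

PART_SUFFIX = re.compile(r"^(?P<base>.+)-part-(?P<index>\d+)$")


def merge_parts(outputs: dict[str, str]) -> dict[str, str]:
    """Group '<id>-part-N' outputs under '<id>', joined in numeric part order."""
    grouped = defaultdict(list)
    for custom_id, text in outputs.items():
        match = PART_SUFFIX.match(custom_id)
        if match:
            grouped[match["base"]].append((int(match["index"]), text))
        else:
            grouped[custom_id].append((0, text))  # unsplit request
    return {
        base: "\n".join(text for _, text in sorted(parts, key=lambda p: p[0]))
        for base, parts in grouped.items()
    }


print(merge_parts({"doc-001-part-1": "B", "doc-001-part-0": "A", "doc-002": "C"}))
# {'doc-001': 'A\nB', 'doc-002': 'C'}
```

Sorting on the integer index keeps part-10 after part-2, which a plain string sort would not.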

8. Cost Hygiene: Caching and Pre-Batch Refinement

Prompt caching exists and can further reduce input costs when many requests in a batch share the same repeated context, but its mechanics are beyond the scope of this course.

Refine Before You Batch

The higher-leverage skill is refining the prompt before you scale it. Test your prompt on a small sample of representative inputs using the synchronous API, inspect the outputs, and fix instruction or schema issues while iteration is cheap and fast. Only then submit the full batch. This maximizes first-pass success and avoids costly resubmission cycles: discovering a prompt bug after 100,000 requests means paying for 100,000 bad outputs and waiting up to another 24 hours for the rerun.
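
A small harness can make the pilot step repeatable. The sketch below picks a reproducible random sample of inputs and applies a pass-rate gate to the pilot outputs. The function names, the sample size of 20, and the 95% threshold are all hypothetical; the validity check is whatever test fits your task, such as "parses as JSON".

```python
import random


def pilot_sample(custom_ids, k=20, seed=0):
    """Pick a reproducible random subset of inputs for a synchronous dry run."""
    ids = list(custom_ids)
    return random.Random(seed).sample(ids, min(k, len(ids)))


def ready_to_batch(outputs, is_valid, min_pass_rate=0.95):
    """True when enough pilot outputs pass your own validity check."""
    if not outputs:
        return False
    passed = sum(1 for text in outputs if is_valid(text))
    return passed / len(outputs) >= min_pass_rate


ids = [f"doc-{i:03d}" for i in range(1000)]
sample = pilot_sample(ids, k=20, seed=42)
print(len(sample))  # 20
```

You would run the sampled inputs through the synchronous API, pass the outputs to ready_to_batch, and submit the full batch only once the gate returns True.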

Lab Exercise: Scaled Extraction & Batch Recovery

Self-driven lab Module5_Self_Driven_Lab.ipynb

Objective: manage 50% cost-saving asynchronous workloads and implement targeted failure recovery.

Batch submission: prepare a .jsonl file with 10 requests, each including a unique custom_id. Submit the batch and poll for the "ended" status.
Single-turn constraint: intentionally include a tool-calling loop in a batch request. Observe why the resulting output is unusable for agentic work.
Latency matching: classify three workloads (a pre-merge check, an overnight tech-debt report, and a customer-facing chat) as batch-appropriate or synchronous-only, and justify each in one sentence.
Targeted recovery: simulate a batch where 2 of the 10 requests fail, such as from context overflow. Walk the result stream, identify failures by custom_id, and generate a new, smaller batch for only those failed items.
