What image size works best for Claude and GPT vision?

Claude performs best when the long edge stays around 1,500 px — larger images get downscaled before analysis, so extra pixels cost upload time without adding detail. OpenAI's high-detail mode tiles the image into 512 px squares, so very tall full-page captures multiply token cost. SnapshotFlow's image_width parameter resizes the output server-side so you send exactly what the model can use.

Is it better to send images as base64 or as a URL?

Both the OpenAI and Anthropic APIs accept either. URLs keep your request payload small — SnapshotFlow's response_type=json returns a signed storage URL you can pass straight into the model request. Base64 (response_type=base64) avoids a second fetch and works when the model provider can't reach your storage. For multi-turn agent conversations, prefer URLs: base64 image bytes are re-sent with the full history on every turn.

Can my AI agent take screenshots on its own, without me writing HTTP calls?

Yes. SnapshotFlow exposes a remote MCP endpoint at https://api.snapshotflow.com/mcp. Connect it to any MCP-compatible client (Claude, IDE agents, custom orchestration) and the agent gets a screenshot tool it can decide to call by itself — no glue code between the capture and the analysis.

Keyword: AI vision screenshot · By SnapshotFlow team · Updated June 11, 2026

Sending Screenshots to Claude and GPT: An AI Vision Pipeline for Web Page Analysis

Q: Should I send a screenshot or raw HTML to a vision model?

Send a screenshot when the question involves layout, design, rendering, or anything the user actually sees — HTML doesn't tell the model what's above the fold, what's overlapping, or what color the CTA is. Send extracted text (SnapshotFlow's extract_content=true returns markdown) when the question is purely about content. For grounded analysis, send both in the same request.

Vision models (Claude, OpenAI's GPT family, and their peers) can read a web page the way a human does: layout, hierarchy, color, what's above the fold. The catch is that they can't browse. Someone has to turn a live URL into pixels first, and that rendering step is exactly what SnapshotFlow does in one HTTP call. This guide builds the full pipeline: capture settings tuned for vision models, working code for both the OpenAI and Anthropic APIs, and the MCP path where your agent decides to take the screenshot by itself.

Why Vision Models Need Pixels, Not HTML

The instinctive shortcut is to fetch the page's HTML and paste it into the prompt. For some questions that works. For most real-world web analysis it quietly fails, for three reasons.

First, HTML isn't what the user sees. Modern pages assemble themselves in the browser: JavaScript renders the content, CSS decides what's visible, A/B frameworks swap entire sections at runtime. The DOM of a React app fetched with curl is often an empty <div id="root">. A screenshot taken by a real headless browser — after scripts ran, fonts loaded, and the layout settled — is ground truth.

Second, layout is information. "Is the pricing table visible without scrolling?", "Does the cookie banner cover the CTA on mobile?", "Which competitor leads with social proof?" — none of these questions are answerable from markup. They're visual questions, and vision models are remarkably good at them when given an actual rendering.

Third, token economics. A content-heavy page can be hundreds of kilobytes of markup — tens of thousands of tokens of mostly div soup. A well-sized screenshot of the same page costs a predictable, usually smaller number of image tokens, and carries the visual signal the markup never had. (When you need the text too, there's a better way than raw HTML — see the vision + text section below.)
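To make the token comparison concrete, here is a back-of-the-envelope sketch. Both formulas are assumptions, not guarantees from SnapshotFlow or the model providers: roughly four characters per token is a common rule of thumb for text and markup, and width × height / 750 is the approximation Anthropic publishes for Claude image tokens. Real tokenizers and resizing rules vary by model, so treat the results as order-of-magnitude only.

```python
import math

CHARS_PER_TOKEN = 4            # assumption: rule of thumb for text/markup
PIXELS_PER_IMAGE_TOKEN = 750   # assumption: approximation for Claude image tokens


def estimate_html_tokens(html: str) -> int:
    """Rough token count for raw markup pasted into a prompt."""
    return math.ceil(len(html) / CHARS_PER_TOKEN)


def estimate_image_tokens(width_px: int, height_px: int) -> int:
    """Rough token count for an image of the given pixel size."""
    return math.ceil(width_px * height_px / PIXELS_PER_IMAGE_TOKEN)


html = "<div>" * 60_000  # stand-in for ~300 KB of markup
print("markup tokens:", estimate_html_tokens(html))
print("image tokens: ", estimate_image_tokens(1092, 1092))
```

Under these assumptions the markup costs tens of thousands of tokens while the image stays in the low thousands, which is the gap the paragraph above describes.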

Capture Settings That Matter for AI Vision

A screenshot destined for a model is not the same as a screenshot destined for a human. Models have input-size sweet spots, get distracted by ads, and bill you per pixel-tile. These SnapshotFlow /screenshot parameters are the ones that move the needle:

Parameter	Recommended for vision	Why
`device_scale_factor=2`	For viewport shots	Retina-density pixels make small UI text legible to the model — fewer hallucinated labels.
`image_width=1568`	For Claude	Claude downscales images whose long edge exceeds ~1,568 px before analysis. Resizing server-side means you don't pay upload time for pixels the model will throw away.
`full_page=true`	Use deliberately	Full-page captures of long pages get very tall. OpenAI's high-detail mode tiles images into 512 px squares — a 10,000 px-tall capture multiplies token cost. Often a viewport shot plus `extract_content=true` answers the question cheaper.
`block_ads=true&block_cookie_banners=true`	Almost always	Consent banners and ads are visual noise that models will dutifully describe instead of your content.
`wait_for_selector=.loaded`	For SPAs	Guarantees the app actually rendered before capture — no more analyzing spinners.
`format=jpeg&quality=80`	For base64 delivery	At quality 80, JPEG compression artifacts rarely affect page analysis, and the file is much smaller than a PNG in the request payload.
`hide_selectors=.chat-widget`	As needed	Hide chat bubbles, promos, or anything you don't want the model commenting on.
`dark_mode=true`	For theme QA	Capture both themes and ask the model to compare contrast and legibility.

One capture call that bakes in most of this:

curl "https://api.snapshotflow.com/screenshot?url=https://example.com\
&device_scale_factor=2&image_width=1568\
&block_ads=true&block_cookie_banners=true\
&format=jpeg&quality=80&response_type=json" \
  -H "X-Api-Key: $SNAPSHOTFLOW_KEY"

With response_type=json the response contains a storagePath — a signed URL for the stored image, valid for one hour — which you can hand directly to a vision model. Prefer raw bytes inline instead? Use response_type=base64 and you get a ready-made data:image/jpeg;base64,... string.
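The one-hour lifetime matters once captures sit in a queue before the model call. Below is a minimal sketch of a freshness check, assuming you record the capture time yourself. The helper name and the five-minute safety margin are illustrative choices, not part of SnapshotFlow's API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SIGNED_URL_TTL = timedelta(hours=1)   # from the text: storagePath is valid for one hour
SAFETY_MARGIN = timedelta(minutes=5)  # assumption: leave time for the provider's fetch


def signed_url_is_fresh(captured_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if a signed URL captured at `captured_at` is still safe to send."""
    now = now or datetime.now(timezone.utc)
    return now - captured_at < SIGNED_URL_TTL - SAFETY_MARGIN
```

If the check fails, re-capture the page or switch that job to response_type=base64.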

Pipeline 1: SnapshotFlow → OpenAI GPT

OpenAI's API accepts images as part of the input content — either a public URL or a base64 data URL. The cleanest pattern with SnapshotFlow is two calls: capture with response_type=json, then pass storagePath straight through. No bytes ever touch your server.

import os, requests
from openai import OpenAI

SNAP_KEY = os.environ["SNAPSHOTFLOW_KEY"]
client = OpenAI()

# 1. Render the page — one HTTP call
shot = requests.get(
    "https://api.snapshotflow.com/screenshot",
    params={
        "url": "https://competitor.com/pricing",
        "device_scale_factor": 2,
        "image_width": 1568,
        "block_ads": "true",
        "block_cookie_banners": "true",
        "format": "jpeg",
        "quality": 80,
        "response_type": "json",
    },
    headers={"X-Api-Key": SNAP_KEY},
).json()

# 2. Ask the model what it sees
response = client.responses.create(
    model=os.environ["OPENAI_MODEL"],  # any current vision-capable GPT model
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text",
             "text": "Analyze this pricing page. List the plans, prices, "
                     "and which plan is visually emphasized as 'recommended'."},
            {"type": "input_image",
             "image_url": shot["storagePath"],
             "detail": "high"},
        ],
    }],
)
print(response.output_text)

Notes that save debugging time: the detail parameter controls how OpenAI processes the image — "high" tiles it for fine-grained reading (use for dense UI and small text), "low" is a single low-res pass (fine for "what kind of page is this?" classification, much cheaper). The signed storagePath URL is valid for one hour; if your pipeline queues work for longer, extend the TTL via GET /screenshots/:id?ttl=86400 or fall back to base64.
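To see why tall captures get expensive in "high" mode, a naive tile count helps. This sketch only counts how many 512 px squares cover an image of a given size. It assumes no resizing before tiling and says nothing about per-tile token prices; both depend on the model, so check OpenAI's current documentation before budgeting.

```python
import math

TILE_PX = 512  # tile edge quoted in the text for OpenAI's high-detail mode


def tile_count(width_px: int, height_px: int, tile_px: int = TILE_PX) -> int:
    """Number of tile_px x tile_px squares needed to cover the image."""
    return math.ceil(width_px / tile_px) * math.ceil(height_px / tile_px)


viewport = tile_count(1568, 900)      # a typical viewport shot
full_page = tile_count(1568, 10_000)  # a very tall full-page capture
print(viewport, full_page)  # prints "8 80"
```

Under these assumptions the full-page capture covers ten times as many tiles as the viewport shot.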

Pipeline 2: SnapshotFlow → Claude

Anthropic's Messages API takes images as content blocks with a source that is either a url or base64. Same two-step shape:

import os, requests
import anthropic

SNAP_KEY = os.environ["SNAPSHOTFLOW_KEY"]
client = anthropic.Anthropic()

shot = requests.get(
    "https://api.snapshotflow.com/screenshot",
    params={
        "url": "https://example.com",
        "image_width": 1568,          # Claude's sweet spot — no wasted pixels
        "block_ads": "true",
        "block_cookie_banners": "true",
        "format": "jpeg",
        "quality": 80,
        "response_type": "json",
    },
    headers={"X-Api-Key": SNAP_KEY},
).json()

message = client.messages.create(
    model=os.environ["ANTHROPIC_MODEL"],  # any current vision-capable Claude model
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "url", "url": shot["storagePath"]}},
            {"type": "text",
             "text": "Review this landing page like a conversion-rate expert. "
                     "What's above the fold, what's the primary CTA, and what "
                     "would you test first?"},
        ],
    }],
)
print(message.content[0].text)

If you'd rather inline the bytes, capture with response_type=base64, strip the data:image/jpeg;base64, prefix, and use "source": {"type": "base64", "media_type": "image/jpeg", "data": ...}. Anthropic supports JPEG, PNG, GIF, and WebP. One sizing fact worth repeating: Claude resizes images whose long edge exceeds roughly 1,568 px, so capturing at 4K and shipping it base64 just makes the request slower — let SnapshotFlow's image_width do the resize server-side.
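The prefix-stripping step can be sketched as a small helper. It assumes response_type=base64 returns the data URL as a plain string, as described above; the function name is hypothetical. Reading the media type from the prefix instead of hard-coding image/jpeg keeps it correct if you change the capture format.

```python
def data_url_to_anthropic_image(data_url: str) -> dict:
    """Turn 'data:<media type>;base64,<payload>' into an Anthropic image content block."""
    header, sep, payload = data_url.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URL")
    # Everything between 'data:' and ';base64' is the media type, e.g. image/jpeg
    media_type = header[len("data:"):-len(";base64")]
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": payload},
    }
```

The returned dict drops into the content list in place of the url-sourced image block from the example above.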

For multi-turn conversations — an agent iterating on the same page — prefer URL sources. Base64 image bytes get re-sent with the entire conversation history on every turn; a URL stays a URL.
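A quick calculation shows how fast the re-sent bytes add up. The sketch assumes a stateless API where your client resends the full history, including the inline image, on every turn. The 300 KB image is a stand-in, not a measured SnapshotFlow output.

```python
import base64


def cumulative_image_upload(image_bytes: bytes, turns: int) -> int:
    """Total base64 characters uploaded if the image rides along on every turn."""
    encoded_len = len(base64.b64encode(image_bytes))
    return encoded_len * turns


fake_jpeg = bytes(300_000)  # stand-in for a ~300 KB screenshot
print(cumulative_image_upload(fake_jpeg, 1))   # prints 400000
print(cumulative_image_upload(fake_jpeg, 10))  # prints 4000000
```

Base64 inflates the payload by about a third, and a ten-turn conversation uploads that inflated image ten times. A URL source costs a few dozen characters per turn instead.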

Grounding Vision with Extracted Text

Vision models occasionally misread small or stylized text. The fix costs nothing extra: SnapshotFlow can extract the page's text content in the same capture call with extract_content=true&content_format=markdown. You get the screenshot and a clean markdown rendition of the content — send both, and tell the model the markdown is authoritative for wording while the image is authoritative for layout.

import os, requests

SNAP_KEY = os.environ["SNAPSHOTFLOW_KEY"]

# One capture call returns both the screenshot and the page text
shot = requests.get(
    "https://api.snapshotflow.com/screenshot",
    params={
        "url": "https://example.com/changelog",
        "extract_content": "true",
        "content_format": "markdown",
        "response_type": "json",
        "image_width": 1568,
    },
    headers={"X-Api-Key": SNAP_KEY},
).json()

# shot["storagePath"] → the image
# shot["content"]     → the page as markdown
# Prompt: "The markdown below is the exact page text. Use the image
#          for layout/visual questions only."

This combo is the cheapest reliable setup we know for web-page QA and summarization: the markdown kills text hallucinations, the screenshot keeps the layout signal, and it's still a single quota unit per capture.
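One way to assemble the combined request, sketched for Anthropic's Messages API. It assumes the capture response exposes the image URL as storagePath and the markdown as content, as in the snippet above; the helper name is hypothetical.

```python
def build_grounded_content(image_url: str, markdown: str, question: str) -> list[dict]:
    """Content blocks for one user message: image, authoritative page text, question."""
    return [
        {"type": "image", "source": {"type": "url", "url": image_url}},
        {"type": "text",
         "text": "The markdown below is the exact page text. Use the image "
                 "for layout/visual questions only.\n\n" + markdown},
        {"type": "text", "text": question},
    ]
```

Pass the result as the content of a user message, for example `messages=[{"role": "user", "content": build_grounded_content(shot["storagePath"], shot["content"], "Summarize this release.")}]`.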

Three Ways to Deliver Pixels to a Model

Approach	How	Best for	Watch out for
Signed URL	`response_type=json` → pass `storagePath` as the image URL	Multi-turn agents, queued pipelines, small payloads	URL valid 1 hour by default — extend with `GET /screenshots/:id?ttl=86400`
Base64 inline	`response_type=base64` → embed the data URL in the model request	One-shot calls, air-gapped flows, no second fetch	Payload bloat; re-sent every turn in conversations
MCP tool call	Agent calls the `screenshot` tool on `https://api.snapshotflow.com/mcp`	Agents that decide on their own when a capture is needed	Requires an MCP-compatible client

The Agent Path: Let the Model Take Its Own Screenshots

Everything above assumes you write the glue code: call SnapshotFlow, collect the image, build the model request. There's a second architecture where that glue disappears. SnapshotFlow exposes a remote MCP server at https://api.snapshotflow.com/mcp. Point any MCP-compatible client at it — Claude, an IDE agent, your own orchestration loop — and the agent gains tools like screenshot, batch_screenshot, and visual_diff that it can invoke whenever it decides pixels are needed.

The flow inverts: instead of "developer captures image, sends to model," it becomes "user asks 'has our competitor changed their pricing page this week?' — agent calls screenshot, looks at the result with its own vision, and answers." No HTTP client in your codebase, no base64 handling, no prompt assembly. The capture is just a tool call inside the model's reasoning loop, and the resulting image lands directly in its context.

This matters more than it first appears. A scripted pipeline captures what you predicted you'd need. An agent with a screenshot tool captures what the conversation turns out to need — a mobile viewport after the user mentions phones, a visual_diff of two URLs when the question becomes "what changed?". We covered the architecture in depth in our WebMCP explainer, and the agent-driven diff workflow in the visual regression testing guide.
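For a sense of what the agent's tool call looks like on the wire: MCP is built on JSON-RPC 2.0, and a tool invocation is a request with the method tools/call. Your MCP client builds this message for you, so the sketch below is purely illustrative. The argument names are assumptions that mirror the HTTP API's parameters; the real input schema is whatever the server advertises through tools/list.

```python
import json


def mcp_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


# Hypothetical arguments: assumed to match the HTTP parameter names.
print(mcp_tool_call(1, "screenshot", {"url": "https://example.com", "image_width": 1568}))
```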

What Teams Actually Build with This

Competitor monitoring. Nightly captures of competitor pricing and landing pages (use async=true with a webhook_url for batches), then a vision model summarizes what changed and why it matters. Pair with /diff to only invoke the LLM when pixels actually changed — the diff is two quota units, the skipped model calls are free.

Automated UI review. Capture every preview deploy and ask the model a fixed rubric: is the layout broken, is any text clipped, does the page match the design system? It won't replace a designer, but it catches the embarrassing breakage before users do.

Accessibility and content audits. Vision models are surprisingly good at flagging low-contrast text, missing visual hierarchy, and walls of unscannable copy — things automated DOM checkers score as "technically passing."

Web data extraction with layout context. When scraping needs to know where something appeared (hero vs footer, primary vs secondary nav), the screenshot + markdown combo from the grounding section beats either input alone.

FAQ

Should I send a screenshot or raw HTML to a vision model?

A screenshot when the question involves layout, rendering, or anything the user sees; extracted markdown (extract_content=true) when it's purely about wording. For serious analysis, send both in one capture call and tell the model which source is authoritative for what.

What image size works best?

Around 1,500 px on the long edge for Claude (it downscales anything larger), and be deliberate with very tall full-page shots in OpenAI's high-detail mode, which bills per 512 px tile. image_width=1568 on the capture call handles this server-side.

Base64 or URL?

URL for multi-turn agent work (base64 gets re-sent with history every turn), base64 for one-shot calls or when the provider can't fetch your storage. SnapshotFlow returns either — response_type=json for a signed URL, response_type=base64 for a data URL.

Can my agent take screenshots without me writing any HTTP code?

Yes — connect your MCP-compatible client to https://api.snapshotflow.com/mcp and the agent gets screenshot, batch_screenshot, and visual_diff as native tools it calls on its own judgment.

Render the Page. Let the Model Do the Reading.

One HTTP call to turn any URL into model-ready pixels — sized, ad-free, with extracted markdown on the side. 200 free screenshots for the lifetime of the account, no credit card required.
