The Problem
When building PourGuide, my AI-powered wine discovery app, I needed to extract structured information from wine bottle images. Users snap a photo, and the app identifies the wine name, producer, vintage, region, and variety. Simple enough, but which Claude model should handle this?
Claude offers multiple vision-capable models at different price points. Haiku is fast and cheap; Sonnet is more capable but costs more. The question: when does the cheaper model suffice, and when do you need to pay for more intelligence?
Test Methodology
I created a benchmark suite that sends identical images and prompts to both Claude 3.5 Haiku and Claude 3.5 Sonnet. Each model returns structured JSON with the extracted wine details and a confidence score (1-10) for each item detected.
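For context, here's roughly what each benchmark call looks like using the `anthropic` Python SDK. The prompt wording, model aliases, and response schema below are simplified stand-ins for what's in the repo:

```python
import base64
import json
import time

import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = (
    "Identify every wine in this image. Respond with JSON of the form "
    '{"wines": [{"name": ..., "producer": ..., "vintage": ..., '
    '"region": ..., "variety": ..., "confidence": 1-10}]}'
)

def benchmark(model: str, image_path: str) -> tuple[dict, float]:
    """Send one image plus the extraction prompt; return (parsed JSON, elapsed ms)."""
    with open(image_path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode()

    start = time.perf_counter()
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/jpeg",
                            "data": image_b64}},
                {"type": "text", "text": PROMPT},
            ],
        }],
    )
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Assumes the model returned bare JSON; the real harness strips stray text first.
    return json.loads(response.content[0].text), elapsed_ms

# Same image, same prompt, both models
for model in ("claude-3-5-haiku-latest", "claude-3-5-sonnet-latest"):
    result, ms = benchmark(model, "single_bottle.jpg")
    print(f"{model}: {len(result.get('wines', []))} items in {ms:.0f}ms")
```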
The test images ranged from simple to complex:
Simple: Single Bottle
A clear shot of one wine bottle with a readable label: the most common use case for PourGuide users.
Medium: Multiple Bottles
Four wine bottles in frame, requiring the model to identify and extract details from each.
Complex: Grocery Shelf
A grocery store wine shelf with 20+ bottles: this is where things get interesting.
Results
Here's how the models performed across different complexity levels:
| Test Case | Winner | Haiku Time | Sonnet Time | Notes |
|---|---|---|---|---|
| 1-3 items (one partially hidden) | Haiku | 3,733ms | 5,546ms | Sonnet detected a partially hidden bottle Haiku missed |
| 4 items | Haiku | 5,676ms | 8,255ms | Both performed well |
| 4 items (handwritten menu) | Haiku | 5,823ms | 8,019ms | Both handled handwriting; Sonnet had higher confidence |
| Grocery shelf - close (20 items) | Sonnet | 7,128ms | 13,860ms | Both missed items, but Sonnet detected significantly more |
| Grocery shelf - medium (45 items) | Neither | 5,966ms | 18,367ms | Both models struggled; too many items |
| Grocery shelf - far (150+ items) | Neither | 5,187ms | 9,618ms | Both models struggled; image too dense |
Confidence Scores
Both models provide confidence levels for their extractions. For simple cases (1-4 items), both reported high confidence (levels 7-10). The interesting divergence happened on complex images:
- Haiku on 20-item shelf: Mixed confidence (levels 6-9), detected 8 items
- Sonnet on 20-item shelf: Higher confidence (levels 7-9), detected 12 items
- Both on 45+ items: Low confidence or empty results
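Those scores are directly actionable. Here's a minimal sketch, assuming the `{"wines": [...]}` shape from the harness above:

```python
def usable_wines(result: dict, min_confidence: int = 7) -> list[dict]:
    """Keep only the extractions the model is reasonably sure about."""
    wines = result.get("wines", [])
    return [w for w in wines if w.get("confidence", 0) >= min_confidence]

# A busy image that yields few high-confidence items is itself a signal:
# in PourGuide, that's the cue to escalate or warn the user.
```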
Cost Analysis
The cost difference is substantial:
- Haiku: ~$0.25 per 1M input tokens, ~$1.25 per 1M output tokens
- Sonnet: ~$3 per 1M input tokens, ~$15 per 1M output tokens
For a typical wine label OCR request, Haiku costs roughly 90% less than Sonnet; at the rates above, both input and output are 12x cheaper. At scale, this adds up quickly.
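Here's the back-of-the-envelope math. The token counts are illustrative assumptions (a label photo runs on the order of 1,500 image tokens, plus a few hundred tokens of JSON out):

```python
# Per-1M-token rates quoted above (USD)
HAIKU = {"in": 0.25, "out": 1.25}
SONNET = {"in": 3.00, "out": 15.00}

def request_cost(rates: dict, input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * rates["in"] + output_tokens * rates["out"]) / 1_000_000

# Assumed single-bottle scan: ~1,600 tokens in (image + prompt), ~300 out
haiku_cost = request_cost(HAIKU, 1_600, 300)    # roughly $0.0008
sonnet_cost = request_cost(SONNET, 1_600, 300)  # roughly $0.0093
print(f"Haiku is {1 - haiku_cost / sonnet_cost:.0%} cheaper")  # ~92%
```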
Practical Recommendations
Based on these results, here's the routing policy I implemented in PourGuide (sketched as code after this list):
- Single bottle scans: Use Haiku. It's fast, cheap, and accurate enough for the most common use case.
- Menu or multi-bottle photos (2-4 items): Still use Haiku. The accuracy tradeoff isn't worth the cost increase.
- Complex scenes (5+ items): Route to Sonnet. Users uploading these images expect more, and Sonnet delivers meaningfully better results.
- Very complex scenes (20+ items): Consider warning users or implementing a different approach entirely. Even Sonnet struggles here.
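Here's that policy as code. This is a minimal sketch that reuses the `benchmark` helper from the methodology section; the thresholds mirror the list above, using Haiku's own detection count and confidence as a cheap first-pass complexity estimate:

```python
HAIKU = "claude-3-5-haiku-latest"
SONNET = "claude-3-5-sonnet-latest"

def extract_wines(image_path: str) -> list[dict]:
    """Try cheap-and-fast first; escalate only when the scene looks complex."""
    result, _ = benchmark(HAIKU, image_path)
    wines = result.get("wines", [])

    # 1-4 detections with solid confidence: Haiku's answer ships as-is.
    if 0 < len(wines) <= 4 and all(w.get("confidence", 0) >= 7 for w in wines):
        return wines

    # 5+ items, shaky confidence, or an empty result: pay for Sonnet, which
    # recalled meaningfully more bottles on busy scenes in the tests above.
    result, _ = benchmark(SONNET, image_path)
    return result.get("wines", [])

# Very dense scenes (20+ bottles) are better handled upstream in the UI
# with a warning, per the last bullet: even Sonnet hits a ceiling there.
```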
Next Steps
The grocery shelf tests revealed a ceiling for both models. For my next experiment, I plan to test Claude Opus on these complex cases to see if the flagship model can handle dense, multi-item OCR where Sonnet falls short.
I'm also considering image preprocessing strategies: cropping regions of interest, enhancing contrast, or breaking complex images into multiple API calls.
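The tiling idea would look something like this with Pillow. The grid size and overlap here are guesses I haven't benchmarked yet; the overlap exists so a bottle straddling a seam still appears whole in at least one tile:

```python
from PIL import Image  # pip install Pillow

def tile_image(path: str, cols: int = 3, rows: int = 2,
               overlap: float = 0.1) -> list[Image.Image]:
    """Split a dense shelf photo into overlapping tiles for separate API calls."""
    img = Image.open(path)
    w, h = img.size
    tile_w, tile_h = w // cols, h // rows
    pad_w, pad_h = int(tile_w * overlap), int(tile_h * overlap)

    tiles = []
    for r in range(rows):
        for c in range(cols):
            left = max(c * tile_w - pad_w, 0)
            top = max(r * tile_h - pad_h, 0)
            right = min((c + 1) * tile_w + pad_w, w)
            bottom = min((r + 1) * tile_h + pad_h, h)
            tiles.append(img.crop((left, top, right, bottom)))
    return tiles

# Each tile becomes its own request; results then need de-duplication,
# since the overlap means one bottle can show up in two tiles.
```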
Try It Yourself
The full benchmark code is available on GitHub. It includes sample images, automatic image resizing (to stay under Claude's 5MB per-image API limit), and structured output parsing. Feel free to fork it and run your own tests.
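If you just want the gist of the resizing step, it boils down to something like this (a sketch, not the repo's exact code):

```python
import io

from PIL import Image

MAX_BYTES = 5 * 1024 * 1024  # Claude's per-image API limit

def shrink_to_limit(path: str, quality: int = 85) -> bytes:
    """Downscale an image until its JPEG encoding fits under the API limit."""
    img = Image.open(path).convert("RGB")
    while True:
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality)
        data = buf.getvalue()
        if len(data) <= MAX_BYTES:
            return data
        # Scale each side by 0.7, roughly halving the pixel count, and retry.
        img = img.resize((int(img.width * 0.7), int(img.height * 0.7)))
```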