
Benchmarking Claude Vision Models for OCR

Testing Haiku and Sonnet on wine label recognition: from single bottles to crowded grocery shelves.

The Problem

When building PourGuide, my AI-powered wine discovery app, I needed to extract structured information from wine bottle images. Users snap a photo, and the app identifies the wine name, producer, vintage, region, and variety. Simple enough, but which Claude model should handle this?

Claude offers multiple vision-capable models at different price points. Haiku is fast and cheap; Sonnet is more capable but costs more. The question: when does the cheaper model suffice, and when do you need to pay for more intelligence?

Test Methodology

I created a benchmark suite that sends identical images and prompts to both Claude 3.5 Haiku and Claude 3.5 Sonnet. Each model returns structured JSON with the extracted wine details and a confidence score (1-10) for each item detected.
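To make that concrete, here's a minimal sketch of what such a harness can look like using the Python `anthropic` SDK. The prompt wording and the dated model IDs are illustrative rather than the exact ones in my benchmark repo.

```python
import base64
import time

import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative prompt; the real one asks for the same fields in stricter terms.
PROMPT = (
    "Identify every wine bottle in this image. Return a JSON array of objects "
    "with name, producer, vintage, region, variety, and confidence (1-10)."
)

MODELS = {
    "haiku": "claude-3-5-haiku-20241022",
    "sonnet": "claude-3-5-sonnet-20241022",
}


def run_benchmark(image_path: str) -> dict:
    """Send the same image and prompt to both models and time each call."""
    with open(image_path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

    results = {}
    for label, model_id in MODELS.items():
        start = time.monotonic()
        response = client.messages.create(
            model=model_id,
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image",
                     "source": {"type": "base64",
                                "media_type": "image/jpeg",
                                "data": image_b64}},
                    {"type": "text", "text": PROMPT},
                ],
            }],
        )
        elapsed_ms = round((time.monotonic() - start) * 1000)
        results[label] = {
            "elapsed_ms": elapsed_ms,
            "raw_text": response.content[0].text,
        }
    return results
```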

The test images ranged from simple to complex:

Simple: Single Bottle

A clear shot of one wine bottle with a readable label: the most common use case for PourGuide users.

[Image: single wine bottle with a clear, readable label]

Medium: Multiple Bottles

Four wine bottles in frame, requiring the model to identify and extract details from each.

[Image: four wine bottles in frame, each requiring individual extraction]

Complex: Grocery Shelf

A grocery store wine shelf with 20+ bottles: this is where things get interesting.

[Image: grocery store shelf with 20+ wine bottles]

Results

Here's how the models performed across different complexity levels:

| Test Case | Winner | Haiku Time | Sonnet Time | Notes |
| --- | --- | --- | --- | --- |
| 1-3 items (one partially hidden) | Haiku | 3,733 ms | 5,546 ms | Sonnet detected a partially hidden bottle that Haiku missed |
| 4 items | Haiku | 5,676 ms | 8,255 ms | Both performed well |
| 4 items (handwritten menu) | Haiku | 5,823 ms | 8,019 ms | Both handled the handwriting; Sonnet had higher confidence |
| Grocery shelf - close (20 items) | Sonnet | 7,128 ms | 13,860 ms | Both missed items, but Sonnet detected significantly more |
| Grocery shelf - medium (45 items) | Neither | 5,966 ms | 18,367 ms | Both models struggled; too many items |
| Grocery shelf - far (150+ items) | Neither | 5,187 ms | 9,618 ms | Both models struggled; image too dense |

Key Finding: Haiku handles 1-4 items with accuracy comparable to Sonnet at 75-90% lower cost and roughly 30% faster response times. Above 4 items, Sonnet's additional capability becomes apparent, but even Sonnet hits a wall around 20+ items.

Confidence Scores

Both models provide confidence levels for their extractions. For simple cases (1-4 items), both reported high confidence (7-10 on the scale). The interesting divergence happened on complex images: on the handwritten menu, for example, both models extracted the items, but Sonnet reported higher confidence in its results.
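If you consume those scores programmatically, one simple pattern is to parse the reply and drop low-confidence detections. The JSON shape and the threshold of 7 below are illustrative; they just mirror the 7-10 "high confidence" band mentioned above, not a rule baked into the API.

```python
import json


def parse_bottles(raw_text: str, min_confidence: int = 7) -> list[dict]:
    """Parse the model's JSON reply and keep only high-confidence detections.

    Assumes a reply shaped like:
    [{"name": "...", "producer": "...", "vintage": "2019",
      "region": "...", "variety": "...", "confidence": 9}, ...]
    """
    bottles = json.loads(raw_text)
    return [b for b in bottles if b.get("confidence", 0) >= min_confidence]
```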

Cost Analysis

The cost difference is substantial.

For a typical wine label OCR request, Haiku costs roughly 75-90% less than Sonnet. At scale, this adds up quickly.
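As a back-of-envelope check, per-request cost is just token counts multiplied by per-million-token prices. The prices and token counts below are placeholders I've plugged in for illustration; check Anthropic's pricing page for the current numbers before relying on them.

```python
# Placeholder per-million-token prices in USD -- assumptions for illustration,
# not quoted rates. Check Anthropic's pricing page for current figures.
PRICES = {
    "haiku": {"input": 0.80, "output": 4.00},
    "sonnet": {"input": 3.00, "output": 15.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Rough cost of one OCR request in USD."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


# Example: an image plus prompt worth ~1,700 input tokens, ~300 output tokens
print(f"haiku:  ${request_cost('haiku', 1_700, 300):.4f}")   # ~$0.0026
print(f"sonnet: ${request_cost('sonnet', 1_700, 300):.4f}")  # ~$0.0096
```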

Practical Recommendations

Based on these results, PourGuide now leans on Haiku for the typical one-to-four-bottle shots, where it matches Sonnet's accuracy at a fraction of the cost, and reserves Sonnet for busier scenes. A rough sketch of that routing follows.
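This is a simplified sketch of what that routing could look like, not the exact code in the app; the cutoff of 4 comes straight from the benchmark table above, and the retry idea in the docstring is one option rather than a settled design.

```python
def pick_model(expected_bottles: int) -> str:
    """Route by scene complexity: Haiku for typical shots, Sonnet for busier ones.

    The cutoff of 4 comes from the benchmark results. When the bottle count
    isn't known up front, one option is to start with Haiku and retry with
    Sonnet if the reply comes back sparse or low-confidence.
    """
    if expected_bottles <= 4:
        return "claude-3-5-haiku-20241022"
    return "claude-3-5-sonnet-20241022"
```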

Next Steps

The grocery shelf tests revealed a ceiling for both models. For my next experiment, I plan to test Claude Opus on these complex cases to see if the flagship model can handle dense, multi-item OCR where Sonnet falls short.

I'm also considering image preprocessing strategies: cropping regions of interest, enhancing contrast, or breaking complex images into multiple API calls.
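To show what I mean by breaking a complex image into multiple calls, here's a rough tiling sketch using Pillow: overlapping crops, one API call per crop, with duplicates merged afterwards by name and producer. The grid size and overlap are arbitrary starting points, not tuned values.

```python
from PIL import Image  # pip install pillow


def tile_image(path: str, cols: int = 3, rows: int = 2,
               overlap: float = 0.1) -> list[Image.Image]:
    """Split a dense shelf photo into overlapping tiles for separate API calls.

    The overlap keeps bottles that straddle a tile boundary fully visible in
    at least one crop; de-duplicate the merged results afterwards.
    """
    img = Image.open(path)
    w, h = img.size
    tile_w, tile_h = w // cols, h // rows
    pad_w, pad_h = int(tile_w * overlap), int(tile_h * overlap)

    tiles = []
    for r in range(rows):
        for c in range(cols):
            left = max(c * tile_w - pad_w, 0)
            upper = max(r * tile_h - pad_h, 0)
            right = min((c + 1) * tile_w + pad_w, w)
            lower = min((r + 1) * tile_h + pad_h, h)
            tiles.append(img.crop((left, upper, right, lower)))
    return tiles
```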

Try It Yourself

The full benchmark code is available on GitHub. It includes the sample images, automatic resizing to stay under Claude's 5MB image limit, and structured output parsing. Feel free to fork it and run your own tests.
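For reference, that resizing step can be as simple as re-encoding and progressively downscaling until the image fits under the limit. This is a sketch of the idea, not the repo's exact implementation.

```python
import io

from PIL import Image


def shrink_under_limit(path: str, max_bytes: int = 5 * 1024 * 1024,
                       quality: int = 85) -> bytes:
    """Re-encode (and downscale if needed) until the JPEG fits the 5MB limit."""
    img = Image.open(path).convert("RGB")
    while True:
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality)
        data = buf.getvalue()
        if len(data) <= max_bytes:
            return data
        # Still too large: scale both dimensions down by 20% and try again.
        img = img.resize((int(img.width * 0.8), int(img.height * 0.8)))
```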
